Differentially Private Federated $k$-Means with Server-Side Data
We propose a federated and differentially private k-means clustering algorithm that leverages server-side data for initialization, achieving strong empirical performance with theoretical guarantees on convergence and cluster identification.
Abstract
Reviews and Discussion
This paper considers $k$-means clustering in the federated learning setting, where there is a central server and multiple clients holding users' data. The objective is to design a differentially private algorithm that minimizes both the $k$-means cost and the communication cost. Assuming the central server owns a small amount of public data, the authors propose FedDP-Init to initialize the centers. They then privatize Lloyd's algorithm, resulting in FedDP-Lloyds, and refer to the entire algorithm as FedDP-KMeans.
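A minimal sketch of one privatized Lloyd's iteration conveys the overall idea (assumptions: Gaussian noise added to per-cluster sums and counts, and a single machine standing in for the federated aggregation; this is an illustration, not the paper's exact FedDP-Lloyds):

```python
import numpy as np

def dp_lloyd_step(X, centers, noise_std, rng):
    """One differentially private Lloyd's iteration: assign points to the
    nearest center, then update centers from noisy sums and noisy counts."""
    # Assign each point to its nearest center.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    k, d = centers.shape
    new_centers = centers.astype(float).copy()
    for r in range(k):
        members = X[labels == r]
        # Noisy sufficient statistics (Gaussian mechanism).
        noisy_sum = members.sum(axis=0) + rng.normal(0, noise_std, size=d)
        noisy_count = len(members) + rng.normal(0, noise_std)
        # Clip the denominator away from zero, as is common in DP.
        new_centers[r] = noisy_sum / max(noisy_count, 1.0)
    return new_centers, labels
```

With `noise_std = 0` this reduces to a standard Lloyd's step, which makes the privacy-utility trade-off easy to probe empirically.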
Theoretically, they show that FedDP-KMeans performs well in the Gaussian mixture setting. The authors also conduct experiments on both synthetic and real-world datasets, using different initialization algorithms as baselines.
Strengths
The paper tackles an interesting and practical problem. The algorithm descriptions are clear and easy to understand. The authors address both the theoretical aspects and provide empirical validation through experiments.
Weaknesses
- Typos and notation inconsistencies are present throughout the paper (see Questions below).
- The assumption of public data on the server feels somewhat unrealistic or restrictive, and it would be helpful to explore alternatives or justify this assumption further.
- I am not deeply familiar with Gaussian mixture models and did not verify the proof in detail. However, Definition 1 appears to rely on a strong assumption, and the proof of Theorem 2 seems to build heavily on existing results. As such, I am uncertain about the contribution and value of the theoretical proof.
- The experimental results do not fully convince me. For example, the rank of the subspace is specified as $k$, but this appears to be a critical hyperparameter that could benefit from more discussion, especially in the context of the experiments. Additionally, I would appreciate more discussion on certain design choices and experimental parameters to better understand their impact (see Questions for more details).
Questions
- Typos: (1) Typo in line 120. (2) Inconsistent notation between lines 142 and 145.
- How is the product defined here? Similarly, how is the product computed?
- Why does Algorithm 2 use the basic composition for the privacy budget instead of the advanced composition?
- The bound on Delta is a little strange. Why can we tolerate a larger radius of the domain when epsilon gets smaller?
- Could the authors clarify how the matrix $Q$ is generated in the experiments?
- In Algorithm 1, the variable is not initialized or specified in line 20.
- Could the Johnson–Lindenstrauss lemma be applied as an alternative (more straightforward) approach for constructing the low-rank subspace?
- I would suggest another baseline, such as transferring all client data to a central location and running private k-means clustering.
The assumption of public data on the server feels somewhat unrealistic or restrictive, and it would be helpful to explore alternatives or justify this assumption further.
We respectfully disagree, precisely because we do not assume that the server data is identically distributed to the client data. As discussed in the paper, this is a common scenario in practice; please see the references and justification on lines 096-100. E.g., one of the original papers on FL use cases does pretraining on related, but OOD, public proxy data [1].
For example, the rank of the subspace is specified as $k$, but this appears to be a critical hyperparameter that could benefit from more discussion, especially in the context of the experiments.
Indeed, one could think of the rank of the subspace as a separate hyperparameter, though we are not sure why you believe this to be critical. Setting the rank to $k$ (the number of clusters) is motivated by the theory, and our guarantees stem from this. Moreover, setting it to $k$ in the experiments also leads to good results. Therefore, we prefer to avoid an additional hyperparameter. Note that none of the baselines use the subspace construction, so their performance is not hindered by not varying the rank of the subspace.
Typos: (1) Typo in line 120. (2) Inconsistent notation between and in lines 142 and 145.
Thank you for pointing this out.
and . How is the product defined? Similarly, how is the product computed?
Yes, this is a typo; it is a matrix product.
Why does Algorithm 2 use the basic composition for the privacy budget instead of the advanced composition?
This was only to simplify notation, but we will change it to optimize the privacy budget. In our experiments we use advanced composition.
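For reference, a small helper (our illustration, not from the paper) comparing the two accounting rules, using the standard advanced composition bound $\epsilon' = \epsilon\sqrt{2T\ln(1/\delta')} + T\epsilon(e^\epsilon - 1)$:

```python
import math

def basic_composition(eps, T):
    # Basic composition: per-step budgets simply add up over T steps.
    return T * eps

def advanced_composition(eps, T, delta_prime):
    # Advanced composition: total epsilon after T adaptive
    # (eps, delta)-DP steps, at the cost of an extra delta_prime.
    return eps * math.sqrt(2 * T * math.log(1 / delta_prime)) \
        + T * eps * (math.exp(eps) - 1)
```

For small per-step $\epsilon$ and many rounds, the advanced bound grows like $\sqrt{T}$ instead of $T$, which is why it matters for iterative algorithms like Lloyd's.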
The bound on Delta is a little strange. Why can we tolerate a larger radius of the domain when epsilon gets smaller?
Indeed, this statement may be confusing. It is explained by the requirement of Theorem 2 on the number of samples: when $\epsilon$ gets smaller, we can tolerate a larger radius but need a larger number of samples as well. We will clarify this in the manuscript.
Note that using Lemma 16 we can completely remove the limitation on the diameter.
Could the authors clarify how the matrix $Q$ is generated in the experiments?
This is explained in detail in Appendix G.1. In summary, Q is a mixture of a small amount of data following the client data distribution and additional related but OOD data. We will expand our description in lines 416-418 accordingly.
In Algorithm 1, the variable is not initialized or specified in line 20.
Thank you, we will add this initialization to line 20.
Could the Johnson–Lindenstrauss lemma be applied as an alternative (more straightforward) approach for constructing the low-rank subspace?
In terms of theoretical guarantees, it can be applied but gives a much weaker separation condition: the centers of the Gaussians need to be separated by a larger factor than is required with PCA [3]. In practice, in non-private clustering PCA is always preferred to Johnson-Lindenstrauss. In private clustering it might be an alternative to save on privacy budget, but we are not aware of prior work in this direction.
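To illustrate the two options concretely (a non-private NumPy sketch; `jl_project` and `pca_project` are hypothetical helper names, not functions from the paper):

```python
import numpy as np

def jl_project(X, k, rng):
    # Johnson-Lindenstrauss: project onto a random k-dimensional subspace.
    # Data-oblivious: the projection ignores the structure of X.
    d = X.shape[1]
    R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(d, k))
    return X @ R

def pca_project(X, k):
    # PCA: project onto the top-k right singular vectors of the
    # centered data. Data-adaptive: aligns with directions of variance.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T
```

The data-oblivious nature of JL is exactly what could make it cheap privacy-wise (no budget spent on the projection itself), at the cost of the weaker separation guarantee discussed above.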
I would suggest another baseline, such as transferring all client data to a central location and running private k-means clustering.
Thanks for the suggestion; we can include this. Though note that our method is already able to recover the non-private centralized clustering in many cases (given a large enough privacy budget), so this already provides a sanity check on how performance in the federated setting compares to the central setting (which we could not run in practice anyway).
[1] Hard et al. Federated learning for mobile keyboard prediction, 2019
[2] Bhattacharyya, Kannan, Kumar: How many Clusters? - An algorithmic answer. SODA 2022
[3] Dasgupta: Learning Mixtures of Gaussians. FOCS 1999
Thank you for your response. After reviewing the other comments, I found that they share similar concerns to mine or raise additional reasonable points. As a result, I have decided to maintain my score at this time.
In this paper, the authors propose FedDP-KMeans, a novel algorithm combining federated learning and differential privacy to perform k-means clustering on decentralized, privacy-sensitive data. It introduces FedDP-Init to leverage small, server-side datasets for better initialization and FedDP-Lloyds to refine clusters while maintaining privacy. The method ensures high clustering accuracy with minimal communication rounds, addressing the privacy-utility trade-off by adding controlled noise. Experiments on synthetic and real-world datasets demonstrate its superiority over existing methods, making it a practical solution for applications like cross-device and cross-silo federated learning.
Strengths
- This paper integrates differential privacy with federated learning to address privacy issues in decentralized data environments.
- The proposed FedDP-KMeans algorithm uses out-of-distribution, potentially small, server-side data to address the initialization challenge in DP k-means clustering. This innovation mitigates the impact of privacy noise and reduces the number of required Lloyd's steps, which is beneficial in reducing communication overhead.
- The authors provide a theoretical analysis, proving that FedDP-KMeans converges exponentially to true cluster centers under Gaussian mixture data assumptions. In addition, empirical results built on both synthetic and real-world datasets show that FedDP-KMeans outperforms baseline approaches, especially when the server data is out-of-distribution, thus demonstrating robustness in realistic scenarios.
Weaknesses
- This paper emphasizes the importance of good initialization, but it does not clearly articulate the impact that initialization can have. Specifically, it does not explain how different initializations specifically affect clustering quality in the context of differential privacy.
- The paper emphasizes the utility of leveraging server-side data, but it does not explain why this is beneficial. Additionally, even if using server-side data is useful, this dependency could limit the applicability of FedDP-KMeans in cases where such data is unavailable or does not adequately represent the client data structure. If the data distributions differ significantly, the reliance on server-side data may introduce bias during clustering.
- Additionally, the paper's comparison with related work includes relatively outdated studies, as the baseline methods are from 2007, 2017, and 2021. It is recommended to incorporate comparisons with the latest research findings to provide a more current perspective on the proposed methods.
- The algorithm requires careful hyperparameter tuning for privacy budgets and sensitivity settings. This challenge is acknowledged but not fully addressed within the main content, making practical implementation more complex without further guidance.
Questions
Q1: What is the exact impact of initialization?
Q2: Why use server-side data specifically?
Details of Ethics Concerns
N/A
Additionally, even if using server-side data is useful, this dependency could limit the applicability of FedDP-KMeans in cases where such data is unavailable
As we discuss in lines 096-100, it is a common scenario in FL that the server has access to OOD proxy data; in that section we provide references supporting this claim, e.g. one of the original papers on FL use cases does pretraining on related public proxy data [1]. For the most common data modalities (images and text), there are many public datasets available online. The assumption is valid precisely because we do not require the server data to be identically distributed to the clients'.
If the data distributions differ significantly, the reliance on server-side data may introduce bias during clustering.
In both our theoretical and experimental results the server possesses differing data to the clients and our method obtains strong performance. Could you please point more specifically to the aspects of the theory and experiments that fail to address your concern?
It is recommended to incorporate comparisons with the latest research findings to provide a more current perspective on the proposed methods.
We believe we have included the relevant and most recent baselines. Can you please provide references?
The algorithm requires careful hyperparameter tuning for privacy budgets and sensitivity settings. This challenge is acknowledged but not fully addressed within the main content, making practical implementation more complex without further guidance.
We already provide a guide to setting hyperparameters in Appendix G.4 and we link to this in the main body of text. We can move this guide to the main body, but please note that it is there.
Q1: What is the Exact Impact of Initialization?
If you have a poor initialization, it is likely that you must either run many steps of Lloyd's to obtain a good final clustering, or you end up stuck in a poor local optimum. This is well known in the non-private literature, see e.g. [2]. Additionally, in the private setting, if you have to run many iterations then you have to add a large amount of noise at each iteration in order to guarantee differential privacy. This large noise destroys the quality of the clustering.
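To make this cost concrete: if a total budget $(\epsilon, \delta)$ is split evenly across $T$ Lloyd's iterations, the per-round Gaussian noise grows with $T$. A minimal sketch (our illustration, using basic composition and the classic analytic Gaussian-mechanism calibration, not the paper's exact accounting):

```python
import math

def per_round_noise_std(total_eps, total_delta, T, sensitivity):
    # Split the budget evenly across T rounds (basic composition),
    # then calibrate Gaussian noise with the classic bound
    # sigma = sensitivity * sqrt(2 * ln(1.25 / delta)) / eps.
    eps, delta = total_eps / T, total_delta / T
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / eps
```

With `total_eps = 1`, going from 1 round to 10 rounds multiplies the per-round noise by more than a factor of 10, which is why a good initialization (requiring fewer rounds) helps so much.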
Several of our experimental baselines evaluate using different initializations and they perform substantially worse than our method. So clearly the initialization is an important factor impacting final clustering performance.
Q2: Why Use Server-Side Data Specifically?
OOD server data is often available, can be acted on without incurring a privacy cost, and can be made to look more like the client data through reweighting. We include a baseline that initializes without server data (SpherePacking) and it performs the worst out of all methods.
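As a toy illustration of the reweighting idea (a hypothetical helper, not FedDP-Init itself): suppose clients privately release how many of their points fall nearest to each server point; the server can turn these noisy counts into weights so the weighted server data mimics the client distribution:

```python
import numpy as np

def reweight_server_points(server_X, noisy_client_counts):
    """Turn (privately released) client assignment counts into normalized
    weights on server points. noisy_client_counts[i] is the noisy number
    of client points whose nearest server point is server_X[i]."""
    w = np.maximum(noisy_client_counts, 0.0)  # DP noise can make counts negative
    total = w.sum()
    if total <= 0:
        # Degenerate case: fall back to uniform weights.
        return np.full(len(server_X), 1.0 / len(server_X))
    return w / total
```

The privacy cost is paid once, on the released counts; the server can then cluster its reweighted data without any further budget.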
[1] Hard et al. Federated learning for mobile keyboard prediction, 2019
[2] David Arthur, Sergei Vassilvitskii, k-means++: The Advantages of Careful Seeding
Thank you very much for the author's response. Unfortunately, it still fails to address most of my concerns. Therefore, I will maintain my current rating.
The paper studies the problem of computing a $k$-means clustering of edge devices' data, using the data at the server to compute the weights given to the clusters.
Strengths
The paper gives an algorithm that uses server data to weight the clients' clusters, and provides both theoretical and empirical evaluation of its results.
Weaknesses
The idea of projecting to a low-dimensional subspace and performing clustering there is not new; it has been done in a lot of work. Also, for the idea to work, I think there is an implicit assumption that the server's data has the same distribution as the clients' data. One can use any of the PPCA matrices introduced in Cohen et al. (STOC 2015), and this has been known for a long time in the ML literature. Using Gaussian noise on top of PPCA has also been studied by several works on k-means and k-median clustering. So, I am not sure this is a new idea.
Using the server's data to find cluster centers that clients can use to weight their projected points has also been studied (without privacy constraints) in the literature on randomized numerical linear algebra.
The server's data is considered to be public. However, in the real world, a server's cloud contains data that might be stored by some users or, in a business setup, might consist of private data pertaining to the firm. So, I am unsure whether this assumption is reasonable.
Questions
What are the assumptions made about the client's and server's data in this paper?
What would be a reasonable setting to assume that the server's data is non-private?
Assuming that you are using \widehat n_r, if the value of n_r < log(n/\beta), then with reasonable probability you will have numerical instability. How do you make sure that does not happen?
The main criticism of the paper seems to be that you believe the steps of our algorithm have individually occurred in different contexts. This is true of almost any new algorithm if you break it down into small enough subcomponents. Moreover, even if this is true (which we don't believe to be the case), it doesn't mean that the algorithm is not a valuable contribution. We solve a relevant problem (private $k$-means) in a relevant context (federated learning) that nobody has solved so far.
The idea of using projection to a low-dimensional subspace and perform clustering there is not new. It has been done in a lot of work. Also, for the idea to work, I think there is an implicit assumption that the server's data has the same distribution as the client's data. One can use any PPCA matrices introduced in Cohen et al (STOC 2015) and this has been known for a really long time in the ML literature
It seems there is a misunderstanding: we do not assume that the server’s data has the same distribution as the client’s data and in the theory and experiments this is not the case.
The non-private version of our algorithm is from [1], and we do not claim it to be novel. Making PCA private is not the only place where privacy changes the algorithm: we need to control the noise added, and our way of doing so is new (to the best of our knowledge).
Using Gaussian noise on top of PPCA has been also studied by several works on k-means and k-median clustering
Using PCA is indeed a standard dimensionality reduction for clustering, and using Gaussian noise is the basis of DP. In our opinion, the fact that our algorithm uses standard first steps does not mean there is no contribution. Again, to the best of our knowledge, our way of controlling the Gaussian noise is new.
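For context, the generic noisy-covariance recipe the reviewer alludes to looks as follows (a sketch of the standard approach; the paper's contribution lies in how the noise is controlled, which this sketch does not capture):

```python
import numpy as np

def private_pca_subspace(X, k, noise_std, rng):
    """Top-k subspace from a noisy Gram matrix (Gaussian mechanism).
    Assumes rows of X are clipped to bounded norm, so the sensitivity
    of X^T X with respect to one row is bounded."""
    d = X.shape[1]
    cov = X.T @ X
    # Symmetrized Gaussian noise keeps the perturbed matrix symmetric.
    E = rng.normal(0.0, noise_std, size=(d, d))
    noisy_cov = cov + (E + E.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(noisy_cov)
    # Columns for the k largest eigenvalues span the private subspace.
    return eigvecs[:, np.argsort(eigvals)[-k:]]
```

With `noise_std = 0` this is ordinary (uncentered) PCA; the privacy analysis then reduces to bounding the Gram matrix sensitivity under row clipping.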
Using the server's data to find cluster center that clients can use to weigh its projected point is also studied (without privacy constraints) in the literature of randomized numerical linear algebra.
That would be an interesting connection. Could you provide a reference?
What are the assumptions made about the client's and server's data in this paper?
As stated in Theorem 2, for our theoretical results to hold we assume the server has at least one datapoint from each true Gaussian mixture component. But this is not an essential condition for the algorithm to work in practice. In our experiments we give the server related but OOD data and obtain strong performance e.g. for clustering text embeddings the server receives text embeddings of OOD text data.
What would be a reasonable setting to assume that the server's data is non-private?
The most common example is publicly available proxy data of the same modality, e.g. existing public datasets, or general image/text data from the web. This is commonly used in FL; see our discussion and references in lines 096-100, e.g. one of the original papers on FL use cases does pretraining on related public proxy data [2]. For a less common modality, one could for instance use anonymized data from a subset of consenting users.
Assuming that you are using \widehat n_r, if the value of n_r < log(n/\beta), then with reasonable probability you will have numerical instability. How do you make sure that does not happen?
Could you clarify what $n$ and $\beta$ are? We assume you are asking about the possibility of $\widehat{n}_r$ being close to 0. This can easily be dealt with on the server by dividing by $\max(\widehat{n}_r, 1)$ instead of $\widehat{n}_r$, as is common practice in DP. Note this could only occur when $n_r$ is small, and hence the $r$-th cluster has vanished and has little impact on the $k$-means cost.
[1] Pranjal Awasthi, Or Sheffet, Improved Spectral-Norm Bounds for Clustering, 2012
[2] Hard et al. Federated learning for mobile keyboard prediction, 2019