PaperHub
Overall rating: 5.7 / 10 (Poster; 3 reviewers; min 5, max 7, std 0.9)
Individual ratings: 7, 5, 5
Confidence: 3.0 · Correctness: 3.3 · Contribution: 2.7 · Presentation: 2.7
NeurIPS 2024

Differential Privacy in Scalable General Kernel Learning via $K$-means Nyström Random Features

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-14

Abstract

Keywords
differential privacy · kernel learning · kernel mean embedding · kernel empirical risk minimization

Reviews and Discussion

Review
Rating: 7

The paper proposes differentially private, scalable kernel ERM and KME algorithms that are more general than prior work. The authors provide privacy and convergence proofs for their proposed approaches and experimentally compare them with prior methods.

Strengths

  • Thorough discussion on background and comparison with prior works
  • Detailed theoretical proofs for the proposed approach
  • Experimental evaluation of the proposed approach

Weaknesses

None that I could find, although I'm not very familiar with the topic.

Questions

None

Limitations

I do not feel there are any negative societal impacts of this work.

Author Response

We greatly appreciate the reviewer's positive feedback.

Review
Rating: 5

The paper studies scalable kernel learning algorithms under differential privacy. First, the authors propose an algorithm for DP K-means Nyström approximation that yields an orthonormal basis and a corresponding random feature map. The authors then use this basis and feature map in a kernel ERM algorithm and a kernel mean embedding algorithm. The paper presents both theoretical and empirical evaluations of the proposed methods, demonstrating their performance and superiority over existing approaches.

Strengths

The proposed method is novel and practical. Both theoretical results and experiments are provided.

Weaknesses

  1. The quality of the Nyström approximation depends on the DP K-means method. However, the proposed method relies on an existing library for this step.

  2. This paper claims to provide a scalable solution. However, the experiments are somewhat limited in scope (only the Adult dataset).

Questions

  1. How does the allocation of the privacy budget between different components of the algorithm (e.g., K-means, DP ERM) affect the overall performance? Could the performance be improved by choosing the privacy budget split more carefully?

  2. The proposed methods claim to offer improved scalability compared to existing DP kernel learning algorithms. Can you provide a detailed comparison of the complexities of your algorithms versus existing methods? Specifically, how does the time/sample complexity scale with the size of the dataset and the dimensionality of the data?

Limitations

The limitations are addressed.

Author Response

W1. Contributions

While it is true that our algorithms utilize existing private learning schemes, we emphasize two novel contributions and one practically critical point:

First, while the Nyström-based method is a known approach, applying it under privacy constraints is a novel and significant contribution. Existing scalable private kernel learning algorithms typically combine random Fourier features with private linear ERM algorithms, an approach that is inadequate for private general kernel learning. We found that integrating the Nyström-based scheme with private linear ERM is effective, marking the first application of the Nyström-based scheme in the context of differential privacy.

Second, implementing an effective Nyström-based scheme under privacy constraints is another key contribution. Various Nyström-based schemes have differing quality under privacy constraints. In Figure 1(b) of the paper, the solid lines show the accuracy of private KME estimation using Nyström-based schemes implemented with various landmark selection methods. The standard subsampling-based Nyström method (gray line) performs poorly under privacy constraints. Our discovery that DP $K$-means landmark selection is effective underlines the importance of investigating clustering-based Nyström methods under privacy constraints.
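For readers unfamiliar with the construction, here is a minimal sketch of a standard Nyström feature map built from a fixed landmark set; the function names and the `reg` stabilizer are illustrative, not the paper's implementation. Once the landmarks are privately released, this step is pure post-processing and consumes no extra privacy budget.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """Gram matrix of the Gaussian kernel between row sets A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def nystrom_feature_map(landmarks, kernel=gaussian_kernel, reg=1e-10):
    """Return phi such that phi(x) . phi(y) ~= k(x, y).

    landmarks: (m, d) array, e.g. DP K-means centroids plus extra points
    drawn from Q. Given fixed landmarks, phi is data-independent.
    """
    W = kernel(landmarks, landmarks)              # m x m landmark Gram matrix
    vals, vecs = np.linalg.eigh(W)                # W is symmetric PSD
    vals = np.maximum(vals, reg)                  # clip for numerical stability
    W_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return lambda X: kernel(X, landmarks) @ W_inv_sqrt   # (n, m) features
```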

Finally, our composition framework is a practical strength, enabling adaptive design of private learning algorithms in various settings. Existing private linear ERM algorithms are tailored to specific scenarios, such as convex or non-convex loss, smooth or non-smooth loss, differentiable or non-differentiable regularizer, etc. Since the composition framework only requires the subroutines to be differentially private, it allows one to privately learn various models by substituting private linear ERM algorithms for the corresponding setting.
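As a hedged illustration of this composition pattern (reusing the `nystrom_feature_map` sketch above): `dp_kmeans` and `dp_linear_erm` below are hypothetical placeholders for any DP subroutines, and the 50-50 split is only a default. By basic composition, the total privacy cost is the sum of the subroutine budgets, and the feature map built from the released landmarks is free post-processing.

```python
def private_kernel_erm(X, y, kernel, epsilon, dp_kmeans, dp_linear_erm, split=0.5):
    """Compose two DP subroutines into one epsilon-DP learner (sketch).

    dp_kmeans and dp_linear_erm are placeholders: any pair of DP
    subroutines suited to the loss/regularizer at hand can be plugged in.
    """
    eps_landmarks = split * epsilon               # budget for landmark selection
    eps_erm = (1.0 - split) * epsilon             # budget for the linear solver
    landmarks = dp_kmeans(X, epsilon=eps_landmarks)       # eps_landmarks-DP
    phi = nystrom_feature_map(landmarks, kernel)          # post-processing: free
    theta = dp_linear_erm(phi(X), y, epsilon=eps_erm)     # eps_erm-DP
    return lambda X_new: phi(X_new) @ theta               # private predictor
```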

W2. Scalability

While the scalability of our algorithm is discussed in terms of computational complexity in line 240, we will provide additional experimental results to address the reviewer's concern in the revised manuscript, if the paper is accepted. For instance, Figure 2 in the attached PDF demonstrates the utility of our algorithm for KME estimation across multiple datasets, with the Gaussian and polynomial kernels. We have conducted similar experiments on additional datasets and obtained consistent results. These additional experiments confirm that the superiority of the proposed algorithm is not confined to the Adult dataset alone.

Q1. Privacy budget allocation

We agree that optimizing privacy budget allocations can enhance the effectiveness of private learning. Figure 1 in the attached PDF illustrates that there is a budget combination that outperforms others. One possible explanation is that the optimal allocation may depend on the strength of the clustered structures in the data. For instance, when the number of landmarks $m$ is small relative to the sample size $n$, it is advisable to allocate a larger portion of the privacy budget to the linear ERM rather than the DP $K$-means. This is because a smaller $m$ typically results in larger cluster sizes. Since DP $K$-means algorithms acquire private centroids by averaging the members of each cluster privately, larger clusters tend to lead to more accurate centroids given fixed privacy budgets. Therefore, we can afford to allocate more privacy resources to linear ERM. Further exploration to connect the cluster structure to the quality of the private landmarks is suggested as future work.
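To illustrate why larger clusters help, consider a single Laplace-noised centroid update. This is a simplified fragment (bounded coordinates assumed, the cluster size treated as public, per-coordinate budget accounting omitted), not a complete DP K-means:

```python
import numpy as np

def noisy_centroid(cluster, epsilon, bound, rng=np.random.default_rng(0)):
    """One private centroid from a cluster with coordinates in [-bound, bound].

    The Laplace noise on the cluster sum has a fixed scale, so dividing by
    the cluster size |C| shrinks the error as O(1 / (epsilon * |C|)):
    fewer landmarks -> bigger clusters -> more accurate private centroids.
    """
    size, d = cluster.shape
    sensitivity = 2.0 * bound        # one point moves each coordinate sum by <= 2*bound
    noisy_sum = cluster.sum(axis=0) + rng.laplace(0.0, sensitivity / epsilon, d)
    return noisy_sum / size          # per-coordinate error ~ sensitivity / (epsilon * size)
```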

Q2. Comparison of complexities

We can provide a comparison of the time complexity of our algorithms versus existing methods for kernel ridge regression. Note that the time complexity of kernel ridge regression in the non-private setting is $O(n^3)$, arising from the inversion of the $n\times n$ kernel matrix.

The time complexity of our DP kernel ERM algorithm (Algorithm 1) for kernel ridge regression is $O(nm^2+nmd)$, as given in line 240. The $nmd$ term is the time complexity of the DP $K$-means algorithm, and the $nm^2$ term is the time complexity of the linear regression for $n$ observations of $m$ dimensions with $m<n$. Other candidates, such as the functional noise adding algorithm in Hall et al. (2013) and Algorithm 3 in Jain and Thakurta (2013), have different complexities. The former has a time complexity of $O(n^3)$, since it adds noise for the privacy guarantee after evaluating the non-private model, equating it to the time complexity of non-private kernel ridge regression. The latter has a time complexity of $O(n^d)$.
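A back-of-the-envelope comparison with purely illustrative sizes (not taken from the paper's experiments) makes the gap concrete:

$$
n = 10^6,\quad m = 10^2,\quad d = 10^2:\qquad
nm^2 + nmd \;=\; 10^{10} + 10^{10} \;=\; 2\times 10^{10}
\;\ll\; n^3 = 10^{18}.
$$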

Although we have only compared the time complexity of DP kernel ERM algorithms for kernel ridge regression, we note that the superiority of our algorithm extends to other kernel ERMs as well. Our algorithm reduces the time complexity by transforming an optimization problem over $n$ observations of $n$ dimensions into one over $n$ observations of $m$ dimensions. However, the algorithm in Hall et al. (2013) requires solving the original optimization problem. Thus, our algorithm will be more scalable than Hall et al. (2013) for general ERM. Also, the algorithm in Jain and Thakurta (2013) has a time complexity of $O(n^d)$ for general ERM.

Finally, for the KME estimation (Algorithm 3), the time complexity of our algorithm is $O(nm^2+nmd)$, whereas the existing DP KME estimation algorithm suggested in Balog et al. (2018) has a time complexity of $O(nm^2)$. In this case, our algorithm can be slower than Algorithm 1 in Balog et al. (2018). However, our algorithm demonstrates superior performance compared to Algorithm 1 in Balog et al. (2018), as shown in Figure 1(b) and discussed in lines 318-322.
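As a rough illustration of how a feature-space KME can be privatized, here is a Gaussian-mechanism sketch under the assumption that the Nyström feature vectors have norm at most $R$; the sensitivity bound and noise calibration are the textbook ones (valid for $\epsilon < 1$), not necessarily the exact mechanism in Algorithm 3:

```python
import numpy as np

def private_kme(features, epsilon, delta, R, rng=np.random.default_rng(0)):
    """(eps, delta)-DP estimate of the empirical kernel mean embedding.

    features: (n, m) Nystrom features, each row assumed to have norm <= R.
    Swapping one record moves the empirical mean by at most 2R/n in L2,
    so Gaussian noise with the standard calibration suffices.
    """
    n, m = features.shape
    sensitivity = 2.0 * R / n
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return features.mean(axis=0) + rng.normal(0.0, sigma, m)
```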

Comment

Thank you for your detailed response. My concerns regarding W1, Q1, and Q2 are adequately addressed. However, I still have some reservations about the scalability results. In the provided PDF, the performance of the proposed method does not seem as good as on the Adult dataset, especially in the 4th column (MNIST). Since MNIST is relatively small compared to many modern large-scale datasets used in practical scenarios, these results could raise concerns about the claimed scalability. Thus, my score remains unchanged.

Review
Rating: 5

This paper considers the problem of differentially private kernel learning. The main idea of the proposed algorithm is to approximate the kernel matrix using Nyström kernel embeddings. The landmark points on which the approximation is built are chosen as the centroids given by the k-means algorithm on the data, based on the relationship between the kernel approximation error and the k-means problem. Given the Nyström kernel embeddings, the authors propose algorithms for three tasks: DP-ERM, kernel mean embeddings, and data release.

Strengths

  • The authors make an important observation about the relationship between the kernel matrix approximation error and the k-means problem. Based on this observation, they design an algorithm that releases an approximate feature map under differential privacy.
  • The released DP Nyström approximation is useful as it can be reused for other tasks. The authors demonstrate how the released feature map can be used for tasks including ERM, hypothesis testing, and data release.
  • The paper addresses practical issues in applying DP kernel learning methods: scalability, ability to use general kernels and objective functions, and test data-free approach.

Weaknesses

  • While the proposed Nyström kernel embedding approach is versatile, its utility for a single task might be lower than that of algorithms dedicated entirely to that task. Unfortunately, the empirical evaluations provided in Section 4 do not provide a systematic comparison of utility with existing approaches for each task.
  • The equal privacy budget split between the Nyström approximation and the linear ERM task seems arbitrary. Since Theorems 1 and 4 allow the excess risk of the kernel ERM algorithm (Algorithm 2) to be expressed as the sum of two errors, each of which can be written as a function of $\epsilon$, it might be possible to find a more advanced approach to the privacy budget allocation.
  • When it comes to the privatization method, the proposed algorithm seems incremental, as its privacy relies on the privacy of its subroutines.

Questions

  • While Eq. (2) shows that the approximation error of the kernel matrix is similar to that of k-means, it seems that the set $Z$ here can be viewed as an optimization variable. Is it possible to formulate the problem as a bi-level optimization?
  • Algorithm 1 embeds the landmark points into functions in a reproducing kernel Hilbert space (RKHS) and releases the random feature map along with the bases. Line 5 in Algorithm 1 computes the scale factor $R$. Is $R$ computed independently of the data?
  • In line 3 of Algorithm 1, the algorithm samples the landmark points from a distribution $Q$ when $K < m$. Although the authors mention that this distribution $Q$ can be arbitrarily chosen, this is not intuitive: the landmark points should be representative of the underlying dataset. How does the choice of distribution $Q$ affect the results?

Limitations

Due to the use of $K$-means as a subroutine, the proposed approach inherits the limitations of $K$-means, for example, the difficulty of handling datasets with mixed data types and the curse of dimensionality.

Author Response

W1. The utility of the versatile method

We acknowledge that the utility of learning from privatized data for versatile DP kernel ERM may be inferior compared to a DP kernel ERM algorithm dedicated to a specific task. For instance, when the RKHS of a given kernel has a finite dimension $d$, kernel regression via privately released data (from our method) has an excess empirical risk of $O(\sqrt{d}\,n^{-\frac{1}{2}})$. In contrast, the DP algorithm specifically designed for kernel regression can achieve an excess empirical risk of $O(d^2n^{-2})$ (Chaudhuri et al., 2011), which is faster when $d<n$. However, we point out that, in practice, it is common not to have a specific learning method in mind beforehand, and multiple methods are often experimented with. In such scenarios, our proposed approach can guarantee privacy for various learning problems, unlike DP algorithms tailored for a specific task.
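The crossover claim follows from straightforward algebra:

$$
d^2 n^{-2} \le \sqrt{d}\, n^{-1/2}
\;\iff\; d^{3/2} \le n^{3/2}
\;\iff\; d \le n,
$$

so the dedicated algorithm's $O(d^2n^{-2})$ rate is indeed the faster one whenever $d < n$.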

W2. Privacy budget allocation

We agree that optimizing privacy budget allocations can enhance the effectiveness of private learning. Figure 1 in the attached PDF implies that the 50-50 allocation may not always be the best. A possible heuristic rule is to allocate the budget depending on the strength of the clustered structures in the data. For instance, when the number of landmarks $m$ is small relative to the sample size $n$, it is advisable to allocate a larger portion of the privacy budget to the linear ERM rather than the DP $K$-means. This is because a smaller $m$ typically results in larger cluster sizes. Since DP $K$-means algorithms acquire private centroids by averaging the members of each cluster privately, larger clusters tend to lead to more accurate centroids given fixed privacy budgets. Therefore, we can afford to allocate more privacy resources to linear ERM. A deeper exploration of the connection between the cluster structure and the quality of the private landmarks is suggested as future work.

W3. Contributions

While it is true that our algorithms utilize existing private learning schemes, we emphasize two novel contributions and one practically critical point:

First, while the Nyström-based method is a known approach, applying it under privacy constraints is a novel and significant contribution. Existing scalable private kernel learning algorithms typically combine random Fourier features with private linear ERM algorithms, an approach that is inadequate for private general kernel learning. We found that integrating the Nyström-based scheme with private linear ERM is effective, marking the first application of the Nyström-based scheme in the context of differential privacy.

Second, implementing an effective Nyström-based scheme under privacy constraints is another key contribution. Various Nyström-based schemes have differing quality under privacy constraints. In Figure 1(b) of the paper, the solid lines show the accuracy of private KME estimation using Nyström-based schemes implemented with various landmark selection methods. The standard subsampling-based Nyström method (gray line) performs poorly under privacy constraints. Our discovery that DP $K$-means landmark selection is effective underlines the importance of investigating clustering-based Nyström methods under privacy constraints.

Finally, our composition framework is a practical strength, enabling adaptive design of private learning algorithms in various settings. Existing private linear ERM algorithms are tailored to specific scenarios, such as convex or non-convex loss, smooth or non-smooth loss, differentiable or non-differentiable regularizer, etc. Since the composition framework only requires the subroutines to be differentially private, it allows one to privately learn various models by substituting private linear ERM algorithms for the corresponding setting.

Q1. Bi-level optimization.

It is possible to tune $Z$ through optimization, and similar approaches have been explored in non-private settings. However, there are important issues to deal with. The optimization objective is neither convex nor Lipschitz, and involves many variables (i.e., landmarks). Note that most literature on private optimization assumes a convex or at least Lipschitz objective. Nevertheless, we believe that exploring private landmark selection through the suggested optimization approach could be an interesting direction for future research.

Q2. The scale factor R

The scale factor $R$ is defined in line 5 of Algorithm 1, and this definition is consistent throughout all subsequent discussions. It is a parameter that depends on the kernel function $k$ and the data space $\mathcal{X}$, and is independent of the individual data.

Q3. The effect of $Q$

In what follows, we explain why the landmark points are sampled from $Q$ rather than using centroids or other possible representatives of the underlying dataset. Naturally, it would be preferable to choose landmark points that represent the underlying dataset accurately rather than using random samples. However, when $m$ is too large, the accuracy of the private centroid estimates degrades, since the size of each cluster shrinks, which makes centroid estimates (i.e., sample averages) vulnerable to the noise added for privacy guarantees. Such inaccurate estimates would result in ineffective landmark points. Therefore, we selected $K\,(<m)$ representatives of the underlying dataset by using the $K$ private centroids $z_1,\ldots,z_K$, and chose the remaining $m-K$ landmark points from some distribution $Q$. Thus, $Q$ is intended to enhance the quality of the landmark points by adopting information from the DP $K$-means.

"$Q$ can be arbitrarily chosen" means that the choice of $Q$ does not affect the privacy of the algorithm as long as $Q$ does not directly reference the data. Although we used a truncated normal centered at the private cluster centroids, we observed that the differences in results under other choices of $Q$ (e.g., uniform) are insignificant.
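A minimal sketch of how such a landmark set could be assembled; the truncation-by-clipping and the choice of spreading samples around randomly picked centroids are illustrative assumptions, not the paper's exact recipe. Since $Q$ depends only on the already-private centroids, sampling from it costs no additional privacy budget.

```python
import numpy as np

def build_landmarks(private_centroids, m, scale=1.0, low=-1.0, high=1.0,
                    rng=np.random.default_rng(0)):
    """Assemble m landmarks: K private centroids plus m-K draws from Q.

    Q is modeled as a normal centered at a randomly picked private
    centroid, truncated to the data domain by clipping. Q never touches
    the raw data, so this step is free of privacy cost.
    """
    K, d = private_centroids.shape
    centers = private_centroids[rng.integers(0, K, m - K)]
    extra = np.clip(centers + rng.normal(0.0, scale, (m - K, d)), low, high)
    return np.vstack([private_centroids, extra])
```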

Comment

I thank the authors for their detailed response to my questions and for the explanation of the challenges of applying the Nyström method under differential privacy. I would like to keep my current rating of 5.

Author Response

We would like to thank all reviewers for their constructive feedback. We have addressed all comments to the best of our ability. Detailed point-by-point responses to the reviewers' comments are provided below. Additionally, please see the attached one-page PDF for further experimental results.

Final Decision

All reviewers felt this paper sits above the threshold for acceptance, agreeing that it presents novel and practical algorithms for private kernel methods, backed up by both theoretical and empirical evidence, and that the authors have clearly situated it with respect to prior work. The authors are encouraged to incorporate suggestions from the reviewers in their final version, particularly by adding experiments demonstrating scalability and answers to reviewer questions from the rebuttals, which were helpful.