PaperHub
Overall: 6.8 / 10
Poster · 4 reviewers
Ratings: lowest 3, highest 5, standard deviation 0.8
Individual ratings: 5, 4, 3, 5
Confidence: 3.3
Novelty: 2.8 · Quality: 3.0 · Clarity: 3.0 · Significance: 2.5
NeurIPS 2025

Online robust locally differentially private learning for nonparametric regression

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Keywords

Online learning · nonparametric regression · local differential privacy · Huber loss

Reviews and Discussion

Review
Rating: 5

This paper addresses the problem of nonparametric regression in a streaming data setting, where data arrives sequentially and privacy is a major concern. The authors propose a novel framework and two new algorithms to tackle this challenge. The first algorithm, H-FSGD (Huber Functional Stochastic Gradient Descent), is a one-pass online method that uses the Huber loss to achieve robustness against heavy-tailed errors and outliers. The second algorithm, PH-FSGD, extends H-FSGD to provide Local Differential Privacy (LDP) by injecting calibrated Gaussian process noise at each iteration, thus protecting individual data points at the source without a trusted curator. The paper provides a comprehensive non-asymptotic convergence analysis for both estimators, establishing optimal convergence rates and exploring the trade-offs involving step size and privacy budget. The theoretical findings are supported by experiments on synthetic data, demonstrating the robustness and privacy-preserving capabilities of the proposed methods.

Strengths and Weaknesses

Strengths

  1. Significance and Originality: The paper tackles an important problem at the intersection of streaming data analysis, robust statistics, and privacy. To the best of the authors' knowledge and my own, developing a scalable, statistically sound method for online, robust, and locally private nonparametric regression is an open problem. The proposed framework, which combines functional SGD, Huber loss for robustness, and LDP for privacy, is a novel contribution to this area.

  2. Theoretical Quality: The paper provides a rigorous and comprehensive non-asymptotic convergence analysis. The authors establish finite-sample guarantees for their estimators in a general setting that does not assume the true function lies within the chosen RKHS. The analysis of convergence rates under various smoothness assumptions and step-size schemes is thorough and provides valuable theoretical insights into the behavior of the algorithms. The proofs provided in the appendix appear to be detailed and correct.

  3. Clarity: The paper is well-written. The authors clearly motivate the problem, lay out their contributions, and present their algorithms in a way that is easy to follow. The inclusion of a comparison table and flowchart is also helpful for understanding the overall structure of their contribution.

Weaknesses

  1. Experimental Scope: My main reservation about this paper is the limited scope of the experiments. While the synthetic data simulations do a good job of supporting the theoretical claims about efficiency and privacy, they stop short of showing how the method would perform on a real-world streaming dataset. Given that the work is motivated by practical applications like sensor networks and autonomous vehicles, the absence of a real-data experiment feels like a missed opportunity to demonstrate the framework's utility.

  2. Characterization of Robustness: The paper's notion of robustness is tied to handling heavy-tailed errors via the Huber loss. This is a perfectly reasonable approach, but it would be interesting to see a discussion of how this translates to robustness in an adversarial sense, for instance, in a Huber contamination model where a fraction of the data might be arbitrarily corrupted. The trade-off between the degree of robustness and the privacy guarantee is a subtle but important aspect that isn't fully explored. An example of this in the local privacy model is "On the Privacy-Robustness-Utility Trilemma in Distributed Learning", ICML '23.

  3. Incremental Technical Novelty: While the synthesis of ideas here is original and valuable, the core components (Functional SGD, Huber loss, and the Gaussian mechanism for privacy) are all well-established tools in their respective fields. The contribution is therefore less about creating a new technical tool and more about demonstrating how existing tools can be thoughtfully combined and analyzed to solve a new, complex problem. This is a valid and useful contribution, but it's worth noting.

Questions

I have a few questions for the authors that I believe could help clarify some points and strengthen the paper:

  1. Real-world Performance: I understand that setting up real-world streaming experiments can be challenging. However, could you perhaps comment on how you expect your method to perform on a benchmark streaming dataset? Even a discussion of the potential challenges (e.g., concept drift, non-i.i.d. data) and how your framework might adapt would be very insightful.

  2. Robustness in Contamination Models: Your use of Huber loss is well-motivated for heavy-tailed noise. I'm curious if you have any intuition on how the method would perform under a stronger threat model, like Huber's contamination model, where an adversary could corrupt a small fraction of the data points. Does your analysis provide any clues about the breakdown point of the estimator?

  3. The Robustness-Privacy-Utility Trilemma: There's a known tension in the local model between robustness, privacy, and statistical utility (see e.g. "On the Privacy-Robustness-Utility Trilemma in Distributed Learning", ICML '23). It seems your framework has all the knobs to explore this: the Huber parameter τ, the privacy budget (ϵ,δ), and the step size γ. Could you elaborate on the interplay between these? For example, how should one co-tune τ and ϵ to achieve a good trade-off, especially when outliers are present?

  4. A quick note on notation: In line 185, you define $w_\tau$ in terms of a variable $u$. It's clear from the line above that $u = y - f(x)$, but for maximum clarity, you might consider making that explicit in the definition itself.

Limitations

Yes.

Final Justification

I agree with other reviewers that the paper clarity could be improved, e.g., definitions of differential privacy and outlier robustness. For the interplay between privacy and robustness, I recommend that the authors cite the trilemma work I referenced in my original review and any other relevant ones. The rebuttal and new experiments have addressed my initial concerns, and will introduce important changes to the paper. I maintain my positive assessment of the paper.

Formatting Issues

N/A

Author Response

Thank you for your thoughtful and critical assessment. Many of your comments will help us produce a more readable and self-contained version of the paper. Below, we address each of your specific concerns in turn.

  • Real-world performance.

    • Real data analysis. To demonstrate the practical utility of our method, we evaluated it on the “Health and Fitness” dataset from Kaggle, aiming to assess the impact of endurance on overall fitness. We used 40000 samples for training and 1000 for testing. The table below reports out-of-sample $R^2$ comparisons across methods.

      | Step size \ Method | H-FSGD | L2-FSGD | Offline | (2,0.2)-LDP | (3,0.1)-LDP | lm |
      | --- | --- | --- | --- | --- | --- | --- |
      | Constant | 0.622 | 0.617 | 0.609 | 0.589 | 0.609 | 0.581 |
      | Non-constant | 0.623 | 0.618 | 0.609 | 0.526 | 0.596 | 0.581 |

      H-FSGD consistently outperforms L2-FSGD and Offline methods in streaming settings. Under LDP constraints, PH-FSGD remains competitive with or exceeds linear regression, underscoring its robustness and practical relevance. If accepted, we will include this analysis in the final version.

    • Extension to non-iid data. We agree that addressing concept drift and non-i.i.d. data is both important and challenging. Our current framework, like standard functional SGD theory, assumes i.i.d. data and does not directly extend to non-i.i.d. settings, a limitation shared by many SGD-based methods. Challenges include parameter drift, slower convergence (see, e.g., "Federated learning with non-iid data", arXiv, 2018; "On the convergence of FedAvg on non-IID data", ICLR, 2020), and bias from dependent observations. While one could define the population risk as $\sum_{n=1}^{\infty} p_n L_\tau(Y_n - f_n(X_n))$ with distribution weights $p_n$, this is infeasible in online settings due to unknown distribution shifts and the ill-defined infinite-sum objective. A more practical alternative models the data as coming from $M$ sub-populations with different distributions, minimizing the global risk $\sum_{m=1}^M p_m L_\tau(Y^{(m)} - f(X^{(m)}))$. This leads to a parallel SGD scheme: at each round $t$, the global estimate $\hat{f}_t$ is sent to local devices, which compute updates $\hat{f}^{(m)}_{t+1} = \hat{f}_t - \gamma_t \widehat{\nabla L}_\tau(\hat{f}_t)(X_n^{(m)}, Y_n^{(m)}) + \gamma_t \xi_t$, followed by weighted aggregation $\hat{f}_{t+1} = \sum_{m=1}^M p_m \hat{f}_{t+1}^{(m)}$. This setup accommodates distributional heterogeneity and aligns with recent federated learning approaches for non-i.i.d. data (see, e.g., "EFL: elastic federated learning on non-IID data", Conference on Lifelong Learning Agents, 2022; "Asynchronous online federated learning for edge devices with non-iid data", 2020 IEEE International Conference on Big Data, 2020). We will add this discussion to the revised version.

  • Robustness in contamination models.

    • While our method is designed for heavy-tailed noise via the Huber loss, we also assess its robustness under Huber's $\varepsilon^\star$-contamination model, where an $\varepsilon^\star$-fraction of the data is adversarially corrupted. Using Case 1 (true model) and Case 2 (contaminated model), we simulate performance across varying $\varepsilon^\star$ at $n = 10000$, reporting mean and standard deviation of MSE ($\times 10^{-2}$) over 50 runs:

      | Step size | Method | $\varepsilon^\star = 0$ | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
      | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
      | Constant | H-FSGD | 0.0722 (0.0158) | 0.657 (0.0985) | 2.85 (0.251) | 8.08 (0.497) | 20.4 (1.27) | 56.7 (3.91) | 114 (3.47) | 155 (2.49) | 186 (2.21) | 211 (2.10) |
      | Constant | L2-FSGD | 0.0619 (0.015) | 2.46 (0.263) | 9.49 (0.611) | 21.2 (0.928) | 37.4 (1.16) | 58.4 (1.60) | 84.0 (1.70) | 114 (1.74) | 149 (2.15) | 189 (2.14) |
      | Non-constant | H-FSGD | 0.0939 (0.0326) | 0.683 (0.105) | 2.89 (0.26) | 8.11 (0.517) | 20.2 (1.23) | 56.4 (4.23) | 114 (3.78) | 155 (2.61) | 186 (2.32) | 211 (2.16) |
      | Non-constant | L2-FSGD | 0.152 (0.136) | 2.55 (0.31) | 9.63 (0.683) | 21.3 (1.01) | 37.5 (1.23) | 58.5 (1.64) | 84.2 (1.83) | 114 (1.94) | 149 (2.25) | 189 (2.18) |

      H-FSGD remains stable up to $\varepsilon^\star = 0.3$, with degradation beginning around $\varepsilon^\star = 0.4$, suggesting a breakdown point near 40%. In contrast, L2-FSGD deteriorates much earlier, with substantial error already at $\varepsilon^\star = 0.1$. Across all settings, H-FSGD consistently outperforms L2-FSGD. We will include these results in the revised version.
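The bounded-influence behavior behind these numbers can be illustrated with a minimal sketch (not the authors' implementation; function names are ours): the Huber score caps each residual's contribution at $\tau$, so a single corrupted point cannot move the update arbitrarily far, while the squared-loss score grows without bound.

```python
def psi_huber(u, tau):
    """Derivative of the Huber loss L_tau with respect to the residual u."""
    if abs(u) <= tau:
        return u                      # quadratic regime: matches the squared loss
    return tau if u > 0 else -tau     # linear regime: influence is clipped at tau

def psi_l2(u):
    """Derivative of the squared loss: unbounded in the residual."""
    return u

tau = 1.345
for u in (0.5, 5.0, -50.0):
    print(f"u={u:6.1f}  huber={psi_huber(u, tau):7.3f}  l2={psi_l2(u):7.1f}")
```

An outlier with residual $-50$ contributes a gradient of magnitude at most $\tau$ under the Huber loss, versus $50$ under the squared loss, consistent with the later breakdown point observed for H-FSGD.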

  • Robustness-privacy-utility trilemma.

    • Interplay among Huber parameter, privacy budget, and step size.

      • In the non-private setting, a smaller $\tau$ improves robustness to outliers but increases bias, a classical bias–robustness trade-off (see, e.g., "Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions", JRSSB, 2017). In the private setting, a larger $\tau$ amplifies gradient sensitivity, requiring stronger noise for a given $(\varepsilon, \delta)$-LDP guarantee, which raises gradient variance and necessitates a smaller $\gamma_0$ for stability.
      • A tighter privacy budget (e.g., smaller $\varepsilon$) also demands more noise, further constraining $\gamma_0$.
      • The step size follows a schedule: $\gamma_i = \gamma_0 n^{-\zeta}$ (constant step size) or $\gamma_i = \gamma_0 i^{-\zeta}$ (non-constant). Corollaries 1–4 show that the optimal $\zeta$ depends on the smoothness parameters $r$ and $\alpha$. Under strong privacy noise, a smaller $\gamma_0$ improves stability, but overly small steps slow convergence.
    • Co-tune strategy in practice.

      • Huber parameter: we first compute an L2-FSGD estimate and residuals, then estimate the noise scale via the median absolute deviation (MAD). The Huber parameter is set to $\tau = 1.345 \times \text{MAD}$, achieving 90% efficiency under normality while preserving robustness (Lines 191–193, Algorithm 2), following standard practice (see, e.g., "Robust regression using iteratively reweighted least-squares", Communications in Statistics - Theory and Methods, 1977).

      • Privacy budget: chosen based on application needs. Where possible, a looser budget (larger $\varepsilon$) reduces noise and improves utility.

      • Step size: in the absence of theory-guided tuning, we recommend grid search for $\gamma_0$ and $\zeta$ on a validation set, a practical and widely used approach (see, e.g., "The step decay schedule: a near optimal, geometrically decaying learning rate procedure for least squares", NeurIPS, 2019; "Benefits of learning rate annealing for tuning-robustness in stochastic optimization", arXiv, 2025). To assess sensitivity, we simulate H-FSGD under Case 1 using both step-size schedules, with $\gamma_0 \in [4, 24]$ and $\zeta \in [0.3, 0.8]$, and report average MSE ($\times 10^{-3}$) over 50 repetitions at $n = 10000$ (tables below). These results will be included in the revision.

        • Constant step size setting:

          | $\zeta$ \ $\gamma_0$ | 4 | 8 | 12 | 16 | 20 | 24 |
          | --- | --- | --- | --- | --- | --- | --- |
          | 0.3 | 1.52 | 1.76 | 2.05 | 2.38 | 2.79 | 3.32 |
          | 0.4 | 1.59 | 1.49 | 1.56 | 1.65 | 1.76 | 1.87 |
          | 0.5 | 2.06 | 1.67 | 1.54 | 1.50 | 1.49 | 1.51 |
          | 0.6 | 4.03 | 2.30 | 1.93 | 1.76 | 1.67 | 1.60 |
          | 0.7 | 12.28 | 5.18 | 3.37 | 2.66 | 2.31 | 2.11 |
          | 0.8 | 59.14 | 17.12 | 9.69 | 6.77 | 5.21 | 4.25 |
        • Non-constant step size setting:

          | $\zeta$ \ $\gamma_0$ | 4 | 8 | 12 | 16 | 20 | 24 |
          | --- | --- | --- | --- | --- | --- | --- |
          | 0.3 | 1.63 | 1.99 | 2.44 | 3.04 | 4.13 | 5.12 |
          | 0.4 | 1.51 | 1.62 | 1.77 | 1.96 | 2.13 | 2.33 |
          | 0.5 | 1.69 | 1.59 | 1.63 | 1.66 | 1.80 | 1.85 |
          | 0.6 | 2.27 | 1.94 | 1.92 | 1.84 | 1.87 | 1.93 |
          | 0.7 | 3.59 | 3.14 | 2.95 | 2.85 | 2.68 | 2.63 |
          | 0.8 | 7.74 | 6.27 | 6.50 | 6.46 | 6.58 | 6.39 |

        Our results indicate that the proposed estimators are robust to the choice of $\gamma_0$ and $\zeta$ under both constant and non-constant step-size settings, with wide parameter regions yielding stable and comparable MSEs.
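The two tuning heuristics above can be sketched as follows, assuming `residuals` comes from a pilot L2-FSGD fit (all names here are illustrative, not the authors' code):

```python
import statistics

def huber_tau_from_mad(residuals):
    """Set tau = 1.345 * MAD of the pilot residuals (the rule quoted above)."""
    med = statistics.median(residuals)
    mad = statistics.median(abs(r - med) for r in residuals)
    return 1.345 * mad

def step_size(i, n, gamma0, zeta, constant=True):
    """gamma_i = gamma0 * n^{-zeta} (constant) or gamma0 * i^{-zeta} (decaying)."""
    return gamma0 * (n ** (-zeta) if constant else i ** (-zeta))
```

In a grid search, one would sweep `gamma0` over [4, 24] and `zeta` over [0.3, 0.8] on a validation stream and keep the pair with the lowest validation MSE.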

  • Incremental technical novelty. While our work builds on established tools, its key novelty lies in being the first to adapt them to online, robust, and privacy-preserving nonparametric learning in streaming environments, a setting not addressed in prior work. We simultaneously tackle three challenging constraints: robustness to heavy-tailed noise, rigorous differential privacy, and efficient streaming updates. Our theoretical analysis establishes non-asymptotic guarantees under both constant and decaying step sizes, with or without LDP, and shows that proper step size tuning yields optimal convergence rates, extending prior results from finite- to infinite-dimensional settings. Finally, our experiments demonstrate superior empirical performance.

  • Notation. We agree that making the relationship $u = y - f(x)$ explicit in the definition of $w_\tau(u)$ will enhance readability and prevent any potential ambiguity.

Thank you again for your comments. We hope our responses resolve your concerns. If anything remains unclear, we’d be happy to provide further clarification.

Comment

Thank you for the detailed rebuttal. For the interplay between privacy and robustness, please cite the trilemma work I referenced in my original review and any other relevant ones. I agree with other reviewers that the paper clarity could be improved. I maintain my positive assessment of the paper.

Comment

We sincerely appreciate your constructive feedback and positive assessment of our work. Thank you for your guidance on improving the discussion of the robustness–privacy–utility trilemma. In the revised version, we will incorporate the following enhancement into the Introduction (line 56):

However, privacy protection inevitably introduces tension with two other key objectives: model robustness against adversarial perturbations and statistical utility. This fundamental trade-off, known as the privacy-robustness-utility trilemma, has been extensively studied across different learning paradigms, including distributed learning ("On the Privacy-Robustness-Utility Trilemma in Distributed Learning", ICML, 2023), adversarial learning with certified guarantees ("Scalable differential privacy with certified robustness in adversarial learning", ICML, 2020), and decentralized Byzantine-robust systems ("On the tradeoff between privacy preservation and Byzantine-robustness in decentralized learning", ICASSP, 2024).

We will also incorporate a comparative discussion of our method relative to existing literature in the Introduction (line 87):

In contrast to minimizer-optimal loss alignment approaches for addressing the privacy–robustness–utility trilemma, our method achieves inherent outlier robustness via a Huber loss framework.

Additionally, we will improve the paper’s clarity by including a list of notation, a summary table of key theoretical results, and a reorganized problem formulation. Thank you again for your suggestions, they have been instrumental in enhancing both the rigor and presentation of our work.

Review
Rating: 4

This paper addresses nonparametric regression on streaming data by introducing two algorithms:

H-FSGD: A functional stochastic gradient descent method that employs the Huber loss to mitigate heavy-tailed noise and outliers by truncating large residuals.

PH-FSGD: A locally differentially private variant of H-FSGD that injects carefully calibrated Gaussian process noise at each iteration, ensuring each data point is protected at the source without requiring a trusted curator.

Strengths and Weaknesses

Strengths:

  1. The manuscript is well written and clearly structured.

  2. Incorporating the Huber loss is a sensible choice: it projects estimates into a finite-dimensional RKHS subspace, handles heavy-tailed noise and outliers effectively, and yields a uniformly bounded gradient—ideal for private algorithm design.

  3. Both algorithms operate in a single pass, making them well suited for streaming data scenarios.

Weaknesses:

  1. The notation is dense and could overwhelm readers; a summary table contrasting the theoretical guarantees for the non-private and private settings would improve clarity.

Questions

  1. In Corollary 1 (Line 264), under a constant step size, the bound appears independent of the privacy parameters. Could the authors clarify this?

  2. In Line 144, how is the closure of H measured or defined?

Limitations

See above.

Final Justification

I think my concerns have been addressed, so I increase my score.

Formatting Issues

NA

Author Response

Thank you for your thoughtful and critical assessment. Many of your comments will help us produce a more readable and self-contained version of the paper. Below, we address each of your specific concerns in turn.

  • Clarification of notation and theoretical results.

    • Thank you for the valuable suggestion. We agree that a notation list and a summary table contrasting the theoretical guarantees for the non-private and private settings would enhance clarity and highlight our contributions. If accepted, we will include these in the revised manuscript to improve reader comprehension.

    • The list of notation:

      • $X_n$: the $n$-th copy of the random covariate $X$;
      • $Y_n$: the $n$-th copy of the response $Y$;
      • $e_n$: the $n$-th copy of the error $e$;
      • $\sigma^2$: the variance of the error $e$;
      • $f^\star$: the target nonparametric function, lying in the space $L^2(P_X)$;
      • $K$: the reproducing kernel;
      • $\mathcal{H}$: a reproducing kernel Hilbert space (RKHS);
      • $\langle \cdot, \cdot \rangle_{\mathcal{H}}$: the inner product in $\mathcal{H}$;
      • $\bar{\mathcal{H}}$: the closure of $\mathcal{H}$;
      • $f_{\mathcal{H}}$: the best approximation of $f^\star$ in $\bar{\mathcal{H}}$;
      • $L_\tau$: the Huber loss with parameter $\tau$;
      • $\widehat{\nabla L_\tau}$: the estimator of the Fréchet gradient of the Huber loss;
      • $B^2$: the uniform upper bound of $K(x,x)$ over all $x$;
      • $\gamma_n$: the step size at the $n$-th iteration;
      • $\hat{f}_n$: the current private estimator of $f_{\mathcal{H}}$ at the $n$-th iteration;
      • $\bar{f}_n$: the averaged private estimator of $f_{\mathcal{H}}$ at the $n$-th iteration;
      • $\bar{f}_n^0$: the averaged non-private estimator of $f_{\mathcal{H}}$ at the $n$-th iteration;
      • $\varepsilon_n$, $\delta_n$: the privacy budget at the $n$-th iteration;
      • $\{z_j\}_{j=1}^m$: the function grids;
      • $\Sigma$: the covariance operator associated with the kernel $K$, i.e., $\Sigma = \mathbb{E}(K_X \otimes K_X)$;
      • $\alpha$: the parameter characterizing the decay rate of the eigenvalues of the covariance operator $\Sigma$;
      • $r$: the parameter quantifying the regularity of the target function $f_{\mathcal{H}}$ with respect to the eigenbasis of $\Sigma$.
    • Summary of theoretical results: we will add the table below contrasting the theoretical guarantees under the private and non-private settings across different regimes of $r$, covering both constant and non-constant step-size schemes.

      • Constant step size:

        Private setting: $O\left(\left(\sigma^2 + \frac{8\tau^2 B^2 \log(2/\delta)}{\varepsilon^2}\right)\left(\gamma^{1/\alpha} n^{-1+1/\alpha} + n^{-1}\right) + (1 + q(\gamma, n))\,\gamma^{-2r} n^{-2\min\{r,1\}} \left\|\Sigma^{-r} f_{\mathcal{H}}\right\|_{L^2_{P_X}}^2\right)$;

        Non-private setting: $O\left(\sigma^2\left(\gamma^{1/\alpha} n^{-1+1/\alpha} + n^{-1}\right) + (1 + q(\gamma, n))\,\gamma^{-2r} n^{-2\min\{r,1\}} \left\|\Sigma^{-r} f_{\mathcal{H}}\right\|_{L^2_{P_X}}^2\right)$.

        | $r$ range | Optimal $\zeta$ in $\gamma_i \asymp n^{-\zeta}$ | Private / non-private convergence rate |
        | --- | --- | --- |
        | $(0, (\alpha-1)/(2\alpha)]$ | $0$ | $O(n^{-2r})$ |
        | $((\alpha-1)/(2\alpha), 1]$ | $(2r\alpha+1-\alpha)/(2r\alpha+1)$ | $O(n^{-2r\alpha/(2r\alpha+1)})$ |
        | $(1, (\alpha+2)/2]$ | $(\alpha+1)/(2r\alpha+1)$ | $O(n^{-(2r\alpha-2r+2)/(2r\alpha+1)})$ |
        | $((\alpha+2)/2, \infty)$ | $1/(1+\alpha)$ | $O(n^{-\alpha/(1+\alpha)})$ |
      • Non-constant step size:

        Private setting: for $0 \leq \zeta \leq 1/2$, $O\left(\left(\sigma^2 + \frac{8\tau^2 B^2 \log(2/\delta)}{\varepsilon^2}\right)\gamma_n^{1/\alpha} n^{-1+1/\alpha} + \gamma_n^{-1}(n\gamma_n)^{-2r} \left\|\Sigma^{-r} f_{\mathcal{H}}\right\|_{L^2_{P_X}}^2\right)$;

        for $1/2 < \zeta < 1$, $O\left(\left(\sigma^2 + \frac{8\tau^2 B^2 \log(2/\delta)}{\varepsilon^2}\right)(n\gamma_n)^{-2+1/\alpha} + \gamma_n^{-1}(n\gamma_n)^{-2r} \left\|\Sigma^{-r} f_{\mathcal{H}}\right\|_{L^2_{P_X}}^2\right)$.

        Non-private setting: for $0 \leq \zeta \leq 1/2$, $O\left(\sigma^2 \gamma_n^{1/\alpha} n^{-1+1/\alpha} + \gamma_n^{-1}(n\gamma_n)^{-2r} \left\|\Sigma^{-r} f_{\mathcal{H}}\right\|_{L^2_{P_X}}^2\right)$;

        for $1/2 < \zeta < 1$, $O\left(\sigma^2 (n\gamma_n)^{-2+1/\alpha} + \gamma_n^{-1}(n\gamma_n)^{-2r} \left\|\Sigma^{-r} f_{\mathcal{H}}\right\|_{L^2_{P_X}}^2\right)$.

        | $r$ range | Optimal $\zeta$ in $\gamma_i \asymp n^{-\zeta}$ | Private / non-private convergence rate |
        | --- | --- | --- |
        | $(0, (\alpha-1)/(2\alpha)]$ | $0$ | $O(n^{-2r})$ |
        | $((\alpha-1)/(2\alpha), (1+\alpha)/(2\alpha))$ | $(2r\alpha+1-\alpha)/(2r\alpha+1+\alpha)$ | $O(n^{-(2r\alpha+\alpha-1)/(2r\alpha+1+\alpha)})$ |
        | $[(1+\alpha)/(2\alpha), \infty)$ | $1/(1+\alpha)$ | $O(n^{-\alpha/(1+\alpha)})$ |
      • In all, the privacy term introduces an additional constant factor of $(8\tau^2 B^2 \log(2/\delta))/\varepsilon^2$, which increases the variance constant but does not change the asymptotic rates under optimal step sizes.

  • Role of the privacy parameters in Corollary 1.

    • We appreciate the reviewer's careful reading. Corollary 1 is a direct consequence of Theorem 2, where the dependence on the privacy parameters is made explicit. In Corollary 1, we focus on the asymptotic order in the sample size $n$; all privacy parameters $(\varepsilon, \delta)$ are treated as constants, so their contribution is absorbed into the $O(\cdot)$ notation. To clarify this point, we have now restored the explicit dependence on the privacy parameters in the convergence rates. Specifically, for the different regimes of $r$, the convergence rate becomes:

      • When $0 < r \leq (\alpha-1)/(2\alpha)$, the optimal order is $O\left(n^{-2r}\right)$;
      • When $(\alpha-1)/(2\alpha) < r \leq 1$, the optimal order is $O\left(\left(\sigma^2 + \frac{8\tau^2 B^2 \log(2/\delta)}{\varepsilon^2}\right) n^{-2r\alpha/(2r\alpha+1)}\right)$;
      • When $1 < r \leq (\alpha+2)/2$, the optimal order is $O\left(\left(\sigma^2 + \frac{8\tau^2 B^2 \log(2/\delta)}{\varepsilon^2}\right) n^{-(2r\alpha-2r+2)/(2r\alpha+1)}\right)$;
      • When $r > (\alpha+2)/2$, the optimal order is $O\left(\left(\sigma^2 + \frac{8\tau^2 B^2 \log(2/\delta)}{\varepsilon^2}\right) n^{-\alpha/(1+\alpha)}\right)$.

    We will incorporate this clarification into the revised manuscript to enhance transparency regarding the role of the privacy parameters.

  • Definition of the closure of H\mathcal{H}.

    • We thank the reviewer for pointing this out. As defined in the paper, $\mathcal{X}$ is the domain of the random covariate $X$ with marginal distribution $P_X$, $\mathcal{H}$ is a reproducing kernel Hilbert space on $\mathcal{X}$ with inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ and reproducing kernel $K(\cdot,\cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, and $L^2(P_X)$ is the space of square-integrable functions with respect to $P_X$, i.e., $L^2(P_X) = \{f: \mathcal{X} \to \mathbb{R} \mid \int_{\mathcal{X}} f(x)^2 \, dP_X(x) < \infty\}$. Then the closure of $\mathcal{H}$ in $L^2(P_X)$, denoted by $\bar{\mathcal{H}}$, is defined as the set of all limits of sequences in $\mathcal{H}$. Formally,

      $$\bar{\mathcal{H}} = \left\{ f \in L^2(P_X) : \exists \{f_k\} \subset \mathcal{H} \text{ such that } \|f_k - f\|_{L^2(P_X)} \to 0 \right\}.$$

      Intuitively, $\bar{\mathcal{H}}$ consists of all functions in $L^2(P_X)$ that can be approximated arbitrarily well (in the $L^2(P_X)$ norm) by a sequence of functions in $\mathcal{H}$. This construction is standard in the kernel methods literature (see, e.g., "Nonparametric stochastic approximation with large step-sizes", The Annals of Statistics, 2016), and it allows us to avoid assuming that $f^\star$ belongs to $\mathcal{H}$ or that $\mathcal{H}$ is dense in $L^2(P_X)$. We will clarify this point in the revised version if the paper is accepted.

Thank you again for your comments. We hope our responses resolve your concerns. If anything remains unclear, we’d be happy to provide further clarification.

评论

Thank you to the authors for the detailed responses. My concerns have been addressed, and I will increase my score.

评论

We’re glad our responses addressed your concerns. Thank you for your careful reading and thoughtful feedback, which significantly improved the paper’s clarity and completeness.

Review
Rating: 3

The authors address the problem of online nonparametric regression and propose an algorithm that is both outlier-robust and locally differentially private. The key novelty lies in the use of the Huber loss to enhance robustness against outliers, combined with the addition of Gaussian process noise to ensure local differential privacy. The paper provides non-asymptotic theoretical guarantees for the proposed method and validates the results with synthetic numerical experiments.

Strengths and Weaknesses

The paper addresses a compelling problem that is likely to interest the machine learning community, particularly those focused on streaming data, robustness, and privacy. Building on functional SGD methods for the squared loss [Liu et al., 2023], the authors introduce the Huber loss to enhance robustness and incorporate Gaussian process noise to achieve local differential privacy.

In my opinion, the paper is difficult to read due to heavy notation and a lack of a clear formal problem setup. A more structured presentation would help place the contributions in a better context and improve overall readability.

  1. Could the authors provide formal definitions of outlier robustness and local differential privacy (LDP) in the context of this work? While the theorems address properties related to LDP, the main text does not clearly define it. Including formal definitions would help clarify the contributions and make the theoretical results more accessible.
  2. The paper does not appear to provide any formal guarantees regarding outlier robustness. Interestingly, the choice of $\tau$ also influences the privacy guarantees through the term $8\tau^2 B^2 \log(2/\delta)/\varepsilon^2$. Suppose one has no concern about outliers: can we set $\tau$ to $\infty$ and essentially work with the squared loss? How does this affect the privacy guarantees?
  3. Similarly, if privacy is not a concern, then one does not see any direct dependency on $\tau$ in Corollary 3. How does the choice of $\tau$ affect the guarantees in Corollary 3?
  4. In my opinion, Lemma 1 should be included in the main text, as it appears to play a central role in supporting the main results.
  5. Related to point 1, could the authors clarify how Proposition 1, which (as far as I understand) concerns central differential privacy, is used in the proof of Theorem 1, which provides guarantees for local differential privacy?
  6. The step size depends on hyperparameters $(B, M, \kappa)$, which may not be readily available in practice. Could the authors clarify how the step size should be selected to ensure convergence when these parameters are unknown?

Questions

Please see strengths and weaknesses.

Limitations

Yes.

Final Justification

After reading the rebuttal and subsequent discussion, I have decided to maintain my current score. In my opinion, the paper needs a major restructuring and remains difficult to parse in its current state. The authors have promised to improve the clarity of the paper, but I believe that would require major changes in the current presentation, and should ideally go through another round of fresh reviews for a better assessment.

The authors have also included some additional guarantees for the robustness part. I have not verified these results. In any case, I believe these would be a good addition in a fresh version of the paper.

Formatting Issues

Author Response

Thank you for your thoughtful and critical assessment. Many of your comments will help us produce a more readable and self-contained version of the paper. Below, we address each of your specific concerns in turn.

  • Clarification of notation and problem setup. We agree that the current notation may hinder readability. Due to space constraints, we refer the reviewer to the notation table in our response to Reviewer x3FU, and a concise version will be included in the revision. While the formal setup appears in Section 2, we will reorganize and clarify its structure to better contextualize our contributions and improve clarity.

  • Formal definitions.

    • Outlier robustness. In our work, outlier robustness refers to the estimator’s stability and accuracy under contaminated or heavy-tailed data. Classical examples include Huber's M-estimators, which cap large residuals to balance efficiency and robustness ("Robust estimation of a location parameter", The Annals of Mathematical Statistics, 1964). Recent results (e.g., "Concentration study of M‑estimators using the influence function", Electronic Journal of Statistics, 2022) further establish their non-asymptotic stability under adversarial contamination, though sensitivity to high-leverage points remains due to unbounded structural influence. We will clarify this in the revised version.

    • Local differential privacy (LDP). We agree that including key definitions such as LDP in the main text is important for clarity. Due to space constraints, the formal definition currently appears in Appendix A.2 (Definition 3). If accepted, we will move it to the main text to improve readability. For completeness, we restate the definition below.

      Definition 3 (LDP, "A comprehensive survey on local differential privacy", Security and Communication Networks, 2020). Let $\varepsilon>0$ and $\delta\geq 0$. A randomized algorithm $M: \mathcal{X} \rightarrow \mathcal{Y}$ is said to be $(\varepsilon,\delta)$-LDP if, for any pair of input values $x, x'\in\mathcal{X}$ and any measurable set $E\subset \mathcal{Y}$, $$\mathbb{P}(M(x)\in E)\leq e^{\varepsilon}\,\mathbb{P}(M(x')\in E)+\delta,$$ where the probabilities are computed over the randomness of the mechanism $M$.
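To make the definition concrete, here is a minimal sketch (not from the paper; helper names are illustrative) of the standard Gaussian mechanism, whose noise scale is calibrated so that releasing a single bounded value satisfies $(\varepsilon,\delta)$-LDP:

```python
import math
import random

def gaussian_sigma(epsilon, delta, sensitivity=1.0):
    # Classical calibration: sigma = sensitivity * sqrt(2 log(1.25/delta)) / epsilon
    # suffices for the Gaussian mechanism to be (epsilon, delta)-DP, which in the
    # single-record setting coincides with (epsilon, delta)-LDP.
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def privatize(x, epsilon, delta, sensitivity=1.0, rng=random):
    # Release x plus calibrated Gaussian noise.
    return x + rng.gauss(0.0, gaussian_sigma(epsilon, delta, sensitivity))
```

Note how the noise scale shrinks as the privacy budget $\varepsilon$ grows, matching the intuition that large budgets approach non-private release.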

  • Robustness and privacy guarantee without outlier concern.

    • Our use of the Huber loss is motivated by its well-established robustness, balancing squared loss sensitivity for small residuals with absolute loss stability for large outliers. While we omit formal robustness guarantees here, its strong theoretical backing, from Huber’s original minimax analysis ("Robust estimation of a location parameter", The Annals of Mathematical Statistics, 1964) to modern concentration bounds (e.g., "Concentration study of M‑estimators using the influence function", Electronic Journal of Statistics, 2022), supports its effectiveness. We will elaborate on this rationale in the revised version.

    • When outliers are not a concern ($\tau \to \infty$), the Huber loss reduces to the squared loss. In the DP setting, the privacy guarantee depends on the gradient sensitivity. Without additional assumptions, it is standard practice to apply gradient clipping to ensure bounded sensitivity under the squared loss (see, e.g., "Evading the curse of dimensionality in unconstrained private glms", AISTATS, 2021). If the response variable $Y$ is further assumed to be bounded, then the gradient sensitivity is naturally finite, and the privacy guarantee depends directly on its magnitude. Thus, the squared loss remains privacy-compatible with proper sensitivity control. We will clarify this connection in the revision.

  • Role of $\tau$ on the upper bound in Corollary 3.

    • When privacy is not a concern, the parameter $\tau$ does not appear explicitly in Corollary 3, as it is absorbed into the constant $c$ defined in Lemma 1 and further into the $O(\cdot)$ notation. However, $\tau$ implicitly influences the bound via its role in defining the constant $c$.

    • To make this dependence more transparent, we provide a modified version of Corollary 3 in which $\tau$ is not absorbed into the $O(\cdot)$ notation.

      Corollary 3' (Constant step size, without privacy).

      Suppose Assumptions 1-7 hold. Consider the estimator $\bar{f}_n^0$ without privacy protection defined via the recursion (5). Take any constant step size $\gamma_i=\gamma=\Gamma(n)$ for $1\leq i\leq n$. If $c^{-1}\gamma B^2<1$ with $c$ defined in Lemma 1, i.e., $c:=c(\tau)=2m\kappa\min\left\{1, \frac{\tau}{m+BM+\|f^{\star}\|_{\infty}}\right\}$, then

      $$\mathbb{E}\|\bar{f}_n^0-f_{\mathcal{H}}\|_{L_{P_X}^2}^2\leq O\left(\frac{c^{-2}\sigma^2}{(1-(c^{-1}\gamma B^2)^{1/2})^2}\left(c^{1/\alpha}\gamma^{1/\alpha}n^{-1+1/\alpha}+n^{-1}\right)+(1+q(c,\gamma,n))\,c^{-2r}\gamma^{-2r}n^{-2\min\{r,1\}}\|\Sigma^{-r}f_{\mathcal{H}}\|_{L_{P_X}^2}^2\right),$$

      where $q(c,\gamma,n)=0$ for $r\leq 1/2$, and $q(c,\gamma,n)=\left(\frac{B^2}{2c-\gamma B^2}\right)^{2r-1}c^{(2r-1)/\alpha}\gamma^{(1+\alpha)(2r-1)/\alpha}n^{(2r-1)/\alpha}$ for $r>1/2$.

    • The parameter $\tau$, which is embedded in $c$, governs a trade-off between robustness and bias in the estimation, consistent with the classical literature (see, e.g., "Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions", Journal of the Royal Statistical Society, Series B, 2017). In practice, we first compute an L2-FSGD estimate and its residuals, then estimate the noise scale via the median absolute deviation (MAD). The robustness parameter $\tau$ is set to $1.345\times \text{MAD}$, ensuring 95% efficiency under normal errors while resisting outliers, following standard robust regression practice (Lines 191-193, Algorithm 2). This builds on classical Huber loss methods (see, e.g., "Robust regression using iteratively reweighted least-squares", Communications in Statistics-Theory and Methods, 1977). We will incorporate these clarifications in the revised version.
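The $\tau$-selection rule above can be sketched in a few lines; this is an illustrative reading of the recipe (not the authors' code), and the residuals are assumed to come from a preliminary L2-FSGD fit:

```python
import statistics

def select_tau(residuals, c=1.345):
    # tau = 1.345 * MAD of residuals from a preliminary (non-robust) fit,
    # using the classical constant for high efficiency under Gaussian errors.
    med = statistics.median(residuals)
    mad = statistics.median(abs(r - med) for r in residuals)
    return c * mad
```

Because the rule is median-based, a single extreme residual barely moves the resulting $\tau$.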

  • Placement of Lemma 1. We agree that Lemma 1 plays a central role in supporting our main results. If the paper is accepted, we will include it in the main text to improve clarity and accessibility.

  • Clarification on the use of Proposition 1 in the proof of Theorem 1. You are right that Proposition 1 concerns central differential privacy (CDP). However, CDP reduces to local differential privacy (LDP) in the single-sample setting. In our proof of Theorem 1, we apply Proposition 1 with the single sample $D = \{(X_n, Y_n)\}$, effectively leveraging CDP results to establish LDP guarantees.

  • Selection of step size.

    • Theoretically, the choice of step size is indeed crucial for the convergence rate of the proposed estimators, under both constant and non-constant step size settings. Corollaries 1-4 demonstrate that, depending on the range of $r$, different choices of $\zeta$ are required to attain the optimal convergence rate.

    • In practice, when the hyperparameters involved in the theoretical step size are unavailable, we recommend selecting the step size using a decaying schedule of the form $\gamma_i = \gamma_0 n^{-\zeta}$ (constant step size, with total sample size $n$) or $\gamma_i = \gamma_0 i^{-\zeta}$ (non-constant step size). The parameters $\gamma_0$ and $\zeta$ can be tuned via grid search on a validation set. This approach is widely adopted in practice and has been shown to be effective across various optimization settings (see, e.g., "The step decay schedule: a near optimal, geometrically decaying learning rate procedure for least squares", NeurIPS, 2019; "Benefits of learning rate annealing for tuning-robustness in stochastic optimization", arXiv, 2025).

    • To address practical concerns, we have conducted additional simulation studies to examine the sensitivity of estimation performance to the step size. We examine the H-FSGD performance under Case 1 for both constant and non-constant step size settings, with $\gamma_0 \in [4, 24]$ and $\zeta \in [0.3, 0.8]$. The tables below report the average MSE ($\times 10^{-3}$) over 50 repetitions when $n = 10000$.

      • Constant step size setting:

| $\zeta$ \ $\gamma_0$ | 4 | 8 | 12 | 16 | 20 | 24 |
|---|---|---|---|---|---|---|
| 0.3 | 1.52 | 1.76 | 2.05 | 2.38 | 2.79 | 3.32 |
| 0.4 | 1.59 | 1.49 | 1.56 | 1.65 | 1.76 | 1.87 |
| 0.5 | 2.06 | 1.67 | 1.54 | 1.50 | 1.49 | 1.51 |
| 0.6 | 4.03 | 2.30 | 1.93 | 1.76 | 1.67 | 1.60 |
| 0.7 | 12.28 | 5.18 | 3.37 | 2.66 | 2.31 | 2.11 |
| 0.8 | 59.14 | 17.12 | 9.69 | 6.77 | 5.21 | 4.25 |
      • Non-constant step size setting:

| $\zeta$ \ $\gamma_0$ | 4 | 8 | 12 | 16 | 20 | 24 |
|---|---|---|---|---|---|---|
| 0.3 | 1.63 | 1.99 | 2.44 | 3.04 | 4.13 | 5.12 |
| 0.4 | 1.51 | 1.62 | 1.77 | 1.96 | 2.13 | 2.33 |
| 0.5 | 1.69 | 1.59 | 1.63 | 1.66 | 1.80 | 1.85 |
| 0.6 | 2.27 | 1.94 | 1.92 | 1.84 | 1.87 | 1.93 |
| 0.7 | 3.59 | 3.14 | 2.95 | 2.85 | 2.68 | 2.63 |
| 0.8 | 7.74 | 6.27 | 6.50 | 6.46 | 6.58 | 6.39 |
    • Our results indicate that the proposed estimators exhibit robustness to the choice of $\gamma_0$ and $\zeta$ under both constant and non-constant step size settings, with wide parameter regions yielding stable and comparable MSEs. If the paper is accepted, we will clarify the step size selection procedure and provide a sensitivity analysis.
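The two schedules discussed above can be sketched with a single illustrative helper (names are ours, not the paper's):

```python
def step_size(i, gamma0, zeta, n=None):
    # Constant schedule: gamma = gamma0 * n**(-zeta), fixed across iterations
    # and set from the total sample size n. Non-constant schedule (n is None):
    # gamma_i = gamma0 * i**(-zeta), decaying with the iteration index i.
    base = n if n is not None else i
    return gamma0 * base ** (-zeta)
```

In a grid search, one would loop this helper over candidate $(\gamma_0, \zeta)$ pairs and score each on a validation set.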

Thank you again for your comments. We hope our responses resolve your concerns. If anything remains unclear, we’d be happy to provide further clarification.

Comment

Thank you for a detailed response. However, some of my concerns remain unresolved. Besides, in my opinion, the paper needs a major restructuring and remains difficult to parse in its current state. Therefore, I would maintain my current score.

  1. I still don't understand what the formal guarantees regarding outlier robustness look like in the proposed setup. While I appreciate the authors for pointing to a reference, it would be good if formal guarantees were explicitly spelled out in the paper. This is important as outlier robustness is marketed as one of the key contributions in the paper.
  2. At a high level, it seems $\tau$ is used as a tool for gradient clipping, and the privacy analysis only needs it to make sure that the gradients are bounded.
Comment

We sincerely appreciate the reviewer’s thoughtful comments. We have carefully addressed each concern below and respectfully ask for a reevaluation of the paper.

  • Formal guarantees regarding outlier robustness. In the revised version, we will include the following formal definition of outlier robustness and the proof of the robustness of the Huber-based estimator.

    • Outlier robustness in statistics and machine learning is typically characterized by the influence function [1][2], which we formally define below.

      Definition (Robust statistics): Let $\mathscr{F}$ be a space of probability distributions on $\mathcal{Z}=\mathcal{X}\times\mathbb{R}$, and let $T:\mathscr{F}\rightarrow\mathcal{G}$ be a functional mapping each $F\in\mathscr{F}$ to a function $T(F)=f_F\in\mathcal{G}$. For any contamination point $z=(x',y')\in\mathcal{Z}$, define the contaminated distribution $F_{\epsilon}=(1-\epsilon)F+\epsilon\Delta_{z}$, where $\Delta_{z}$ is the point mass at $z$ and $\epsilon\in(0,1)$. Then the influence function of $T$ at $z$ under $F$ is the function $\text{IF}(z;T,F):\mathcal{X}\rightarrow\mathbb{R}$ given pointwise by

      $$\text{IF}(z;T,F)(x):=\lim_{\epsilon\to0^+}\frac{T(F_{\epsilon})(x)-T(F)(x)}{\epsilon}.$$

      If the influence function is uniformly bounded, then $T(F)$ is considered robust, as no single outlier can have a disproportionate effect on the estimator.

    • We now rigorously show that the Huber loss yields an estimator with a uniformly bounded influence function in the nonparametric setting. Consider the model $Y=f(X)+e$ and the estimator $\hat{f}=\arg\min_{f}\mathbb{E}_F[L_{\tau}(Y-f(X))]$, where the Huber loss is $L_{\tau}(u)=\frac{1}{2}u^2\, I\{|u|\leq\tau\}+(\tau|u|-\frac{1}{2}\tau^2)\, I\{|u|>\tau\}$, with $\psi_{\tau}(u)=\frac{d}{du}L_{\tau}(u)=u\, I\{|u|\leq\tau\}+\tau\,\text{sign}(u)\, I\{|u|>\tau\}$ and $\psi'_{\tau}(u)=I\{|u|<\tau\}$.

      The corresponding influence function [1] is

      $$\text{IF}((x',y');T_{x_0},F)=\frac{\psi_{\tau}(y'-f_F(x_0))\, I_{\{x'=x_0\}}}{\int\psi'_{\tau}(y-f_F(x_0))\, F_{Y\mid X=x_0}(dy)}.$$

      Note that $|\psi_{\tau}(y'-f_F(x_0))|\leq\tau$ and $|I_{\{x'=x_0\}}|\leq 1$, while the denominator $\int\psi'_{\tau}(y-f_F(x_0))\, F_{Y\mid X=x_0}(dy)=\mathbb{P}(|Y-f_F(x_0)|<\tau\mid X=x_0)$ is bounded below by a constant $c>0$ under standard regularity conditions (continuity and positivity of the conditional error density at zero). Hence, for all contamination points $(x',y')$ and all $x_0$,

      $$|\text{IF}((x',y');T_{x_0},F)|\leq\frac{\tau}{c},$$

      which is uniform over all contamination points.
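A quick numerical sketch of the bounded-derivative property used in the argument above (simple helper functions for illustration, not the paper's code):

```python
def huber_loss(u, tau):
    # Quadratic inside [-tau, tau], linear outside.
    return 0.5 * u * u if abs(u) <= tau else tau * abs(u) - 0.5 * tau * tau

def huber_psi(u, tau):
    # Derivative of the Huber loss: identity inside [-tau, tau],
    # clipped at +/- tau outside, hence uniformly bounded by tau.
    return max(-tau, min(tau, u))
```

The uniform bound $|\psi_\tau(u)|\leq\tau$ is exactly what makes both the influence function and the DP gradient sensitivity bounded.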

  • Role of $\tau$. We clarify that the parameter $\tau$ in the Huber loss serves a purpose beyond acting as a gradient clipping mechanism in the DP context. It is primarily motivated by the goal of robust estimation, with its compatibility with DP being a beneficial by-product.

    • The Huber loss and its threshold $\tau$ are designed to improve robustness to outliers, a well-established property in robust statistics. In the non-private setting, an appropriately chosen $\tau$ limits the influence of extreme residuals and enhances estimator stability. This benefit is demonstrated empirically in Example 5.1.

    • In the private setting, the Huber loss naturally induces bounded gradients, which helps control sensitivity without relying on ad hoc gradient clipping—a common practice in the DP literature that can reduce statistical efficiency. This allows us to apply DP mechanisms directly, with both improved robustness and controlled sensitivity.

  • Restructuring. We will revise the main text with a focus on the following to improve the paper's clarity.

    • Notation Table: We will include a list of notations to enhance readability. Due to space constraints, we refer the reviewer to the notation table in our response to Reviewer x3FU.

    • Setup: Formal definitions of outlier robustness, local differential privacy, and a rigorous proof of the robustness of the Huber-based estimator will be included.

    • Problem Formulation: We will reorganize this section to present a clearer and more formal setup. Specifically, consider sequential observations $\{(X_n, Y_n), n = 1, 2, \ldots\}$ generated from the nonparametric regression model $Y_n = f^\star(X_n) + e_n$, where $X_n \in \mathcal{X}$ is the $n$-th copy of a covariate $X$ with marginal distribution $P_X$, and $Y_n \in \mathbb{R}$ is the corresponding response. The noise $e_n$ satisfies $\mathbb{E}(e_n) = 0$, $\text{Var}(e_n) = \sigma^2$, and $\text{Cov}(e_n, X_n) = 0$. The unknown regression function $f^\star$ lies in $L^2(P_X)$ and satisfies $\|f^\star\|_\infty < \infty$. Our goal is to estimate $f^\star(\cdot)$ in an online manner under both robustness and privacy constraints.
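The generative setup just described can be simulated in a few lines; the particular $f^\star$, covariate distribution, and contamination rate below are illustrative assumptions, not the paper's experimental design:

```python
import numpy as np

def stream(n, f_star, rng, contam=0.05):
    # Yield (X_i, Y_i) sequentially from Y = f*(X) + e, where e is Gaussian
    # most of the time and heavy-tailed (Student-t with 3 d.o.f.) with
    # probability `contam`, mimicking occasional outliers in the stream.
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)
        e = rng.standard_t(3) if rng.random() < contam else rng.normal(0.0, 0.5)
        yield x, f_star(x) + e
```

An online estimator would consume this generator one pair at a time, never storing the full sample.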

Reference

[1] Robust Statistics: The Approach Based on Influence Functions, Wiley, 1986.

[2] Privacy-Preserving Parametric Inference: A Case for Robust Statistics, JASA, 2021.

Comment

Dear Reviewer nvq6,

As we approach the end of the discussion phase, we want to check in and see whether our official comments addressed the questions and concerns you raised in your official comment. If anything remains unclear or if additional information would be useful, we would be happy to provide it promptly.

Thank you for your time and for contributing to the review process.

Best regards,

The Authors

Review
5

The title says it all: this paper proposes an online learning method for solving nonparametric regression problems under the constraint of local differential privacy (LDP). The algorithm is stochastic gradient descent (SGD) on a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$. The main contributions are:

  • finite sample analysis (theoretical)
  • optimizing convergence rates over the hyperparameters (theoretical)
  • experiments (empirical)

There are two key aspects of the problem which motivates their solution concept:

  1. They do not assume the true regressor $f^* \in \mathcal{H}$ (i.e. the problem is misspecified).
  2. They want the method to have good empirical performance when the noise is heavy-tailed.

The main idea is to use the Huber loss to get some robustness against the heavier noise. This is a standard approach (as they point out themselves). The benefit of this loss is that the gradient becomes bounded, avoiding the clipping needed for DP-SGD for quadratic losses like in ridge regression.

Strengths and weaknesses

Strengths: I am not able to evaluate the novelty of the theory w.r.t. the literature on nonparametric regression and online learning.

  • The problem is straightforward to state, mathematically.
  • The analysis, although quite complex, is interesting.

Weaknesses:

  • It's not clear why this combination of constraints/properties would be needed in any particular application. In particular, the privacy piece seems almost incidental.
  • The experimental evaluation is quite cursory and doesn't explore the parameter space adequately.

Questions

Main Question 1: Does the convergence analysis really take into account heavy-tailed noise?

  • In Section 2, the error is zero-mean with finite variance $\sigma^2$. While $\sigma^2$ could be large, assuming it is finite rules out whole classes of two-tailed distributions, such as $\alpha$-stable distributions for $\alpha \ne 2$ or $t$-distributions with $\nu \le 2$.

  • In the experiments in the submitted manuscript (the supplement has code only, as far as I could see) the authors use Gaussian noise with only one value of the variance ($0.25$) and the Student-$t$ distribution with $\nu = 3$, so the variance is $\frac{3}{3-2} = 3$. If I am misunderstanding, I am happy to be corrected! However, these are hardly the "extensive experiments" promised in the abstract. Obviously, space is limited, but I would have expected a better exploration of how robust the method is to actual heavy-tailed noise.

  • Considering this, I think the claim that the method is itself robust is misleading: the loss function is chosen for its robustness properties, but neither the analysis nor the experiments properly take real heavy tails into account.

Main Question 2: How is this algorithm supposed to be run? There is a bit of a disconnect between the idea of the algorithm and its implementation. Since empirically everything is computed on a grid of $m$ points, everything turns into an $m$-dimensional vector gradient problem. Of course, the underlying structure comes from the RKHS assumption.

  • Figure 1 suggests that for each individual, they either need privacy or do not need privacy so each gradient can be computed with or without noise. However, the experiments seem to only do private or nonprivate. Is the setting under consideration really reflected in Figure 1?

  • The privacy piece is "turnkey": just add Gaussian noise at each iteration according to the gradient sensitivity. This is the same SGD algorithm which has been studied for 10+ years now. So, the novelty is in the analysis of the functional gradient. The grid determines the covariance of the noise, but how should we choose the grid?

  • Because the analysis/theory is done in the RKHS, the grid size doesn't appear in the finite sample bounds. Commenting on the computational complexity in terms of the sample size alone also erases the impact of the grid: is there a way to incorporate it?

Main Question 3: Is this really about privacy at all? In the end, this is just a noisy gradient algorithm for nonparametric regression. The privacy "story" is an OK motivation, but the experiments are on synthetic data and have no bearing on problems involving actual privacy considerations.

  • Under LDP there is no need to do privacy accounting so much of the "privacy apparatus" that has been developed over the years does not apply here.

  • If minibatching were allowed then perhaps we would see some more relevance for privacy. Is there a way to incorporate minibatching into this framework?

  • Several applications are mentioned in the introduction, but is there a vignette that the authors could provide in which someone has a privacy constraint, is worried about heavy-tailed noise, and has to use a nonparametric regression model (i.e. that's the right model for this kind of data)?

Other comments:

  • There is some notation clash in using $z$ for the pair $(x,y)$ and for a grid point (which lives only in the $\mathcal{X}$ space, right?)

  • Shouldn't the kernel function be an input to the algorithm as well?

Limitations

Yes.

Formatting issues

N/A.

Author response

Thank you for your thoughtful and critical assessment. Many of your comments will help us produce a more readable and self-contained version of the paper. Below, we address each of your specific concerns in turn.

  • Heavy-tailed noise in convergence analysis.

    • Our theoretical analysis requires finite-variance noise for tractable proof techniques, excluding some extreme heavy-tailed cases. However, it still covers common heavy-tailed distributions such as Student-$t$ distributions with degrees of freedom $\nu > 2$, the Laplace distribution, and symmetric Pareto distributions with shape parameter $\alpha > 2$.

    • While our main manuscript focuses on Gaussian and $t(3)$ noise, we have conducted additional experiments with heavier-tailed noise, including Student-$t(2.5)$ and $\text{Cauchy}(0,1)$ noise. The tables below show MSE ($\times 10^{-3}$) means and standard deviations over 200 repetitions for $n = 10000, 20000$, and $40000$, under both constant and non-constant step sizes.

      • $t(2.5)$ noise:

| Method \ $n$ | Constant, 10000 | Constant, 20000 | Constant, 40000 | Non-constant, 10000 | Non-constant, 20000 | Non-constant, 40000 |
|---|---|---|---|---|---|---|
| H-FSGD | 2.15 (0.758) | 1.25 (0.415) | 0.669 (0.214) | 2.26 (1.06) | 1.41 (0.682) | 0.907 (0.467) |
| L2-FSGD | 3.97 (2.25) | 2.04 (1.20) | 1.11 (0.551) | 6.13 (4.40) | 3.38 (2.89) | 2.02 (1.71) |
| Offline | 4.21 (2.17) | 2.17 (1.26) | 1.16 (0.569) | 6.75 (5.49) | 3.65 (3.25) | 2.12 (1.85) |
      • $\text{Cauchy}(0,1)$ noise:

| Method \ $n$ | Constant, 10000 | Constant, 20000 | Constant, 40000 | Non-constant, 10000 | Non-constant, 20000 | Non-constant, 40000 |
|---|---|---|---|---|---|---|
| H-FSGD | 3.64 (1.57) | 1.97 (0.816) | 1.00 (0.435) | 5.22 (3.25) | 2.98 (1.81) | 1.65 (1.03) |
| L2-FSGD | 5.83e+07 (8.2e+08) | 1.89e+07 (2.65e+08) | 5.7e+06 (7.89e+07) | 2.24e+08 (3.17e+09) | 1.01e+08 (1.42e+09) | 4.36e+07 (6.15e+08) |
| Offline | 1.73e+08 (2.45e+09) | 4.86e+07 (6.85e+08) | 1.33e+07 (1.86e+08) | 3.67e+08 (5.19e+09) | 1.38e+08 (1.95e+09) | 5.32e+07 (7.51e+08) |
    • The results show H-FSGD's superior empirical performance over L2-FSGD and offline methods even under extreme heavy-tailed noise. While we lack theoretical guarantees for infinite-variance settings, this robustness motivates future theoretical investigation.
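The heavy-tailed noise settings above can be reproduced with standard samplers; the sketch below (illustrative only, not the paper's code) draws Student-$t$ noise with non-integer degrees of freedom and Cauchy noise, and checks that Huber gradients stay bounded while raw residuals do not:

```python
import math
import random

def student_t(df, rng):
    # t(df) = N(0,1) / sqrt(chi2(df)/df); chi2(df) is Gamma(df/2, scale=2),
    # which is valid for non-integer df such as 2.5.
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)
    return z / math.sqrt(chi2 / df)

def cauchy(rng):
    # Standard Cauchy via the inverse-CDF transform.
    return math.tan(math.pi * (rng.random() - 0.5))

rng = random.Random(0)
tau = 1.345
noise = [student_t(2.5, rng) for _ in range(5000)] + [cauchy(rng) for _ in range(5000)]
huber_grads = [max(-tau, min(tau, e)) for e in noise]  # psi_tau applied to residuals
```

The contrast between the clipped and raw values mirrors the MSE blow-up of L2-FSGD under Cauchy noise in the table above.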

  • Implementation of the algorithm.

    • Mixed privacy. Algorithm 1 (PH-FSGD) supports mixed privacy regimes through iteration-specific privacy budgets. Large $\varepsilon_n$ values effectively yield non-private updates. While our current experiments focus on fully private or fully non-private settings, the algorithm naturally accommodates hybrid cases by adjusting the per-iteration noise. We will clarify this in the revision.

    • Our method employs functional SGD, which differs fundamentally from finite-dimensional SGD by optimizing over an infinite-dimensional function space ("Online learning with kernels", IEEE Transactions on Signal Processing, 2004). This involves Fréchet derivatives, RKHS-based analysis, and more complex convergence proofs (see, e.g., "Nonparametric stochastic approximation with large step-sizes", The Annals of Statistics, 2016). Our key contribution is a theoretically grounded private functional estimator with convergence guarantees for both step size schemes.

    • Choice of grids. We clarify that our method and theory are developed in the infinite-dimensional RKHS, while the grid-based implementation is a discretized approximation for computational feasibility. The current bounds do not reflect the effect of the grid size $m$, as the analysis assumes sufficiently fine discretization to approximate the continuous function. We acknowledge that small $m$ may introduce non-negligible error. In practice, a dense grid is needed to recover the function accurately, as is also done in prior work (e.g., "Scalable statistical inference in non-parametric least squares", arXiv, 2023), which similarly builds theory in function space but uses discrete grids for implementation. Extending the theory to account for a finite grid is left for future work.

    • Computational complexity. Our theoretical analysis operates directly in the RKHS, with finite-sample bounds depending on the kernel eigenvalue decay rate $\alpha$ rather than the grid size $m$. The grid size $m$ affects the computational cost, with $O(m^3+nm^2)$ total runtime and $O(m^2)$ storage. This reflects an accuracy-efficiency tradeoff that we will clarify in the revision.

  • Practical applicability.

    • Our method is grounded in LDP, which avoids the need for trusted data collectors or centralized privacy accounting, unlike CDP-based methods. With disjoint data access at each iteration, our algorithm leverages parallel composition to provide rigorous per-user privacy protection, particularly valuable in streaming or decentralized environments.

    • Minibatching integration. Our framework can easily be extended to support minibatching. Specifically, at iteration $n$, if a mini-batch $\{(X_{nt}, Y_{nt})\}_{t=1}^{B_n}$ of size $B_n$ is available, the update rule becomes $$\hat{f}_n=\hat{f}_{n-1}+\gamma_n \frac{1}{B_n}\sum_{t=1}^{B_n}w_{\tau}\left(Y_{nt}-\langle \hat{f}_{n-1},K_{X_{nt}}\rangle_{\mathcal{H}}\right)\left(Y_{nt}-\langle \hat{f}_{n-1},K_{X_{nt}}\rangle_{\mathcal{H}}\right)K_{X_{nt}}+\gamma_n \xi_n,\qquad \bar{f}_n=\frac{n-1}{n}\bar{f}_{n-1}+\frac{1}{n}\hat{f}_n,$$ where $\xi_n$ is a sample path of a Gaussian process with mean zero and covariance function $\frac{8 \tau^2 B^2\log(2/\delta_n)}{\varepsilon_n^2}K$, with $(\varepsilon_n, \delta_n)$ the privacy budget at the $n$-th iteration. The noise term $\xi_n$ still suffices to ensure differential privacy, as the modified Fréchet gradient has the same sensitivity and the privacy mechanism remains applicable. Moreover, the theoretical analysis can be readily modified by replacing the single-sample gradient with its minibatch-averaged counterpart. Minibatching may in fact improve the privacy-utility trade-off by reducing gradient variance, allowing smaller noise magnitudes under the same privacy budget; however, it also requires a trusted data collector to aggregate and access the mini-batch data. We will discuss the incorporation of minibatching in the revised version.
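On a finite grid, one step of this private update (with mini-batch size one) can be sketched as follows. This is an illustrative discretization with invented helper names, not the authors' implementation, and the noise covariance factor is taken as given:

```python
import numpy as np

def ph_fsgd_step(f_grid, K_row, x_idx, y, gamma, tau, noise_chol, rng):
    # f_grid:     current iterate evaluated on the m grid points
    # K_row:      kernel values K(z_j, X_n) over the grid (length m)
    # x_idx:      grid index closest to the incoming covariate X_n
    # noise_chol: Cholesky factor of the privacy noise covariance
    #             (proportional to the kernel matrix on the grid)
    residual = y - f_grid[x_idx]                              # Y_n - <f_{n-1}, K_{X_n}>
    w = 1.0 if abs(residual) <= tau else tau / abs(residual)  # Huber weight w_tau
    xi = noise_chol @ rng.standard_normal(f_grid.shape[0])    # Gaussian process path
    return f_grid + gamma * (w * residual * K_row + xi)
```

With `noise_chol` set to zero this reduces to a non-private H-FSGD step; the running average $\bar{f}_n$ is then maintained separately as in the displayed recursion.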

    • Applications. Many real-world applications simultaneously require privacy protection, robustness to heavy-tailed noise, and nonparametric modeling. Examples include:

      • Healthcare analytics. Wearable fitness data (e.g., heart rate, sleep patterns) demands LDP since raw physiological signals can re-identify users; exhibits heavy-tailed noise due to irregular activities (e.g., sudden spikes in heart rate during exercise or sensor artifacts); and necessitates nonparametric regression because the relationship between metrics (e.g., sleep duration vs. recovery rate) is often nonlinear and complex.
      • Financial fraud detection. Transaction histories require LDP when shared due to their identifiable nature, legitimate spending patterns contaminated by fraudulent outliers create heavy-tailed distributions, and the adversarial nature of fraud evolution necessitates nonparametric methods to detect novel attack patterns beyond rigid rule-based systems.
      • Consumer behavior analysis. Browsing logs need LDP protection against profiling, purchase amounts exhibit heavy-tailed distributions dominated by rare large orders, and market segmentation reveals irregular price elasticity patterns that nonparametric models can adequately capture.
    • To better validate our method's practical utility, we conducted experiments on the real-world "Health and fitness dataset" from Kaggle. Our goal is to investigate how endurance levels affect overall fitness. We select 40000 samples as the training set and 1000 samples as the test set. We compare our H-FSGD and PH-FSGD methods with baseline methods in terms of out-of-sample $R^2$ performance, as displayed in the table below.

| Step size \ Method | H-FSGD | L2-FSGD | Offline | (2,0.2)-LDP | (3,0.1)-LDP | lm |
|---|---|---|---|---|---|---|
| Constant | 0.622 | 0.617 | 0.609 | 0.589 | 0.609 | 0.581 |
| Non-constant | 0.623 | 0.618 | 0.609 | 0.526 | 0.596 | 0.581 |

      The results show that H-FSGD consistently outperforms L2-FSGD and Offline methods, showing strong robustness. All non-private nonparametric methods outperform linear regression (lm) in capturing complex relationships. Under LDP, PH-FSGD remains competitive, matching or exceeding lm, confirming its effectiveness under privacy constraints. These findings validate our method’s ability to model nonlinear patterns while preserving privacy. We will include additional practical application details and real-world data analysis in the revision.

  • Notation and presentation.

    • We will revise the notation to eliminate conflicts, include the kernel function as an explicit algorithm input, and review the presentation for clarity and consistency in the revision.

Thank you again for your comments. We hope our responses resolve your concerns. If anything remains unclear, we’d be happy to provide further clarification.

Comment

Thanks for the detailed clarifications. I would like to know how you will change the description of your method and results (e.g. in the introduction). Because your theoretical analysis does not hold for infinite variance settings I think it's best to make a clean distinction between the theoretical guarantees and the experimental evaluation. Experiments showing that the method works beyond the theoretical limits can then be appreciated better (assuming you will include them).

This also manifests in the grid discussion: the fact that the theory optimizes directly over the RKHS but the computational method requires this grid is not so clear in the manuscript.

The computational complexity is nontrivial: the grid would involve an exponential blowup in complexity if one tried to do this for multivariate problems, right?

Comment

We thank the reviewer for the constructive suggestions to clarify the distinction between our theoretical and empirical contributions. We address each point below.

  • Theory-experiment boundary clarification. We fully agree with the reviewer's suggestion to clearly delineate the boundaries between our theoretical guarantees and empirical findings. In the revision, we will implement the following clarifications in two key sections:

    • "Problem formulation" section: we will significantly expand the discussion of the finite-variance noise assumption and add the following discussion. Specifically, our theoretical analysis requires a finite-variance noise assumption to establish tractable bounds on gradient variances, yet our method still demonstrates remarkable empirical robustness in infinite-variance scenarios such as Cauchy noise.
    • "Additional experimental results" section: we will add a subsection titled "Performance beyond theoretical assumptions" presenting comprehensive results under infinite-variance noise (e.g., Cauchy) together with the following discussion. Our experimental results demonstrate that H-FSGD maintains superior robustness compared to both L2-FSGD and offline methods, even when handling infinite-variance distributions that fall outside our current theoretical framework. While formal guarantees for these extreme cases remain to be established, the observed empirical performance strongly motivates future theoretical investigation of such challenging scenarios.
  • Theoretical RKHS analysis and practical grid approximation. We thank the reviewer for highlighting the need to clarify the relationship between our theoretical RKHS framework and practical grid-based implementation. In the revised version, we will implement the following clarifications in two key parts:

    • The explanation part of Algorithm 1: while our method is developed in the infinite-dimensional RKHS, the computational implementation employs grid discretization as a finite approximation, a standard approach in functional data analysis that balances computational feasibility with theoretical fidelity. Following established practice (e.g., "Nonparametric stochastic approximation with large step-sizes", The Annals of Statistics, 2016; "Scalable statistical inference in non-parametric least squares", arXiv, 2023), we use dense grids to ensure accurate function recovery, with the approximation error diminishing as grid density increases.
    • The discussion part of Theorem 2: our current theoretical bounds assume sufficiently fine discretization to guarantee that the grid-based solution closely approximates the RKHS optimum. While a small grid size m may introduce non-negligible error, our empirical results confirm that a larger m effectively mitigates this discrepancy. Extending the theory to explicitly account for finite grid effects remains an important direction for future work.
  • Computational complexity in multivariate settings. We thank the reviewer for highlighting this important computational consideration. Our current analysis focuses on univariate settings with tractable complexity: O(m^3) setup time for covariance matrix construction and Cholesky decomposition, followed by O(nm^2) operational cost for n functional SGD iterations with O(m^2) storage. We fully acknowledge the exponential scaling in multivariate cases. Specifically, d-dimensional problems would require O(m^{3d}) setup and O(nm^{2d}) operational costs with O(m^{2d}) storage, making naive grid-based approaches impractical for d ≥ 3. To address this, additive kernel methods ("Minimax-optimal rates for sparse additive models over kernel classes via convex programming", Journal of Machine Learning Research, 2012) offer a viable alternative, reducing complexity to O(dm^3 + ndm^2) and storage to O(dm^2) through dimension-wise decomposition. This preserves theoretical guarantees while maintaining computational feasibility. We will emphasize this crucial point in the discussion of the revision.
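To make the grid-based implementation concrete, the following is a minimal numerical sketch of a Huber-loss functional SGD update on a dense grid. This is an illustration of the general technique, not the authors' code: the Gaussian kernel, its bandwidth, the step-size schedule, and the nearest-grid-point evaluation of f(x_t) are all illustrative assumptions. The key robustness ingredient is that the Huber score psi_tau(r) = clip(r, -tau, tau) bounds the per-sample gradient, so heavy-tailed (even Cauchy) residuals cannot destabilize the iterate.

```python
import numpy as np

def gaussian_kernel(x, grid, bandwidth=0.1):
    """RBF kernel K(x, .) evaluated on the grid (kernel choice is an assumption)."""
    return np.exp(-(x - grid) ** 2 / (2 * bandwidth ** 2))

def huber_fsgd(xs, ys, grid, tau=1.345, step=0.5):
    """One-pass functional SGD with the Huber score.

    The RKHS iterate f is stored by its values on a dense grid (the finite
    approximation discussed above); each update is
        f_t = f_{t-1} - gamma_t * psi_tau(f_{t-1}(x_t) - y_t) * K(x_t, .).
    """
    f = np.zeros_like(grid)
    for t, (x, y) in enumerate(zip(xs, ys), start=1):
        fx = f[np.argmin(np.abs(grid - x))]   # f(x_t) via the nearest grid point
        score = np.clip(fx - y, -tau, tau)    # bounded Huber gradient => robustness
        gamma = step / np.sqrt(t)             # decaying step size (illustrative schedule)
        f -= gamma * score * gaussian_kernel(x, grid)
    return f

# Heavy-tailed demo: sine signal corrupted by Cauchy noise.
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 200)
xs = rng.uniform(0.0, 1.0, 500)
ys = np.sin(2 * np.pi * xs) + 0.1 * rng.standard_cauchy(500)
f_hat = huber_fsgd(xs, ys, grid)
```

Note that the iterate stays finite despite the infinite-variance noise, because each stochastic gradient is clipped to magnitude tau; with a squared-loss score (the raw residual) a single Cauchy outlier could produce an arbitrarily large update.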
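The complexity comparison in the last bullet can be sketched with a small helper that tabulates the operation counts (up to constants) for a full tensor-product grid versus the additive-kernel decomposition; the function name and the dictionary layout are ours, purely for illustration.

```python
def grid_costs(m, d, n):
    """Illustrative operation counts (up to constants) for grid-based functional SGD.

    Full tensor grid in d dimensions has m^d nodes, so the Cholesky setup costs
    (m^d)^3 = m^{3d}, each of the n iterations costs (m^d)^2 = m^{2d}, and the
    covariance factor needs m^{2d} storage. The additive decomposition instead
    keeps d univariate components, each with its own m-point grid.
    """
    full = {"setup": m ** (3 * d), "run": n * m ** (2 * d), "storage": m ** (2 * d)}
    additive = {"setup": d * m ** 3, "run": n * d * m ** 2, "storage": d * m ** 2}
    return full, additive

# Example: m = 100 grid points per dimension, d = 3, n = 10_000 samples.
full, additive = grid_costs(100, 3, 10_000)
```

For these values the full-grid setup cost is 100^9 = 10^18 operations versus 3 * 10^6 for the additive variant, which is the gap that makes the naive approach impractical for d ≥ 3.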

Thank you again for your time and insightful feedback!

Final Decision

This paper proposes methods for online nonparametric regression that are robust to outliers, including one that satisfies local differential privacy. The reviewers generally found the results compelling, believed the problem to be important and fundamental, and thought the analysis was high-quality and interesting. These qualities all argue in favor of acceptance at NeurIPS.

While the authors are encouraged to address all reviewer comments for the final version (and of course to make the revisions promised in their responses), they should especially focus on the clarity of the notation, which was a concern for several reviewers (as acknowledged by the authors in their final remarks).