PaperHub
Average rating: 6.2/10 · Poster · 5 reviewers
Ratings: 6, 6, 7, 6, 6 (min 6, max 7, std. dev. 0.4)
Average confidence: 3.4
Soundness: 3.2 · Contribution: 2.8 · Presentation: 2.8
NeurIPS 2024

Continual Counting with Gradual Privacy Expiration

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06
TL;DR

We consider a variant of the continual counting problem in which privacy loss is allowed to grow over time.

Abstract

Differential privacy with gradual expiration models the setting where data items arrive in a stream and at a given time $t$ the privacy loss guaranteed for a data item seen at time $(t-d)$ is $\epsilon g(d)$, where $g$ is a monotonically non-decreasing function. We study the fundamental *continual (binary) counting* problem where each data item consists of a bit and the algorithm needs to output at each time step the sum of all the bits streamed so far. For a stream of length $T$ and privacy *without* expiration, continual counting is possible with maximum (over all time steps) additive error $O(\log^2(T)/\epsilon)$ and the best known lower bound is $\Omega(\log(T)/\epsilon)$; closing this gap is a challenging open problem. We show that the situation is very different for privacy with gradual expiration by giving upper and lower bounds for a large set of expiration functions $g$. Specifically, our algorithm achieves an additive error of $O(\log(T)/\epsilon)$ for a large set of privacy expiration functions. We also give a lower bound that shows that if $C$ is the additive error of any $\epsilon$-DP algorithm for this problem, then the product of $C$ and the privacy expiration function after $2C$ steps must be $\Omega(\log(T)/\epsilon)$. Our algorithm matches this lower bound as its additive error is $O(\log(T)/\epsilon)$, even when $g(2C) = O(1)$. Our empirical evaluation shows that we achieve a slowly growing privacy loss that has significantly smaller empirical privacy loss for large values of $d$ than a natural baseline algorithm.
Keywords
differential privacy · continual observation · privacy expiration

Reviews and Discussion

Review (Rating: 6)

This paper studies the continual counting problem under a “relaxed” privacy notion, where the privacy budget is a function of time, aka differential privacy with expiration functions. This notion, proposed by Bolot et al. [2], captures settings where data becomes less sensitive with time. The paper proposes an algorithm based on the pan-private tree-based mechanism of Dwork et al. [12]. They show that this algorithm satisfies DP with expiration function $g(d) \sim 1 + \log^\lambda(d)$, which improves on the constant $g(d) = \log(T)$ expiration function obtained by directly using the analysis of Chan et al. [6], or the $g(d) = 2\log(d+1) + 2$ obtained by directly analysing the pan-private tree-based mechanism of Dwork et al. [12].

Strengths

  • The problem is well-motivated and interesting

  • The paper is well-written and easy to follow

  • The analysis is complete (upper/lower bounds, different types of expiration functions), and the main ideas are stated clearly

  • The experiments validate the theoretical findings.

Weaknesses

  • The technical gap is limited compared to Dwork et al. and Chan et al.

  • Algorithm 3 is less intuitive, especially the role of $\lambda$, and its analysis is less clear.

  • There is a need for better/more examples to motivate the setting. The example of tracking visits to a location/website arguably does not capture the definition used. For example, if the website changes its content from offensive to less offensive content, one may argue that visiting the website in the old version is even more revealing and more sensitive than visiting the new version, which is the opposite setting to yours.

Questions

  • How should $\lambda$ in Algorithm 3 be chosen?

  • Can the techniques/ideas proposed to analyse Algorithm 3 be adapted to improve the analysis of the pan-private algorithm of Dwork et al. [12] for the classic setting without expiration?

Minor comments/typos:

  • In Definition 1.1, shouldn’t $S$ be of length $\tau$?

  • In the algorithms’ pseudocode, shouldn’t the noise-drawing steps (Line 2 in Algorithm 1 and Lines 3-4 in Algorithm 2) be inside the “for t” loop? Or do you draw all the noise beforehand, i.e. the whole sequence $(Z_t)_{t=1}^\infty$?

  • In Algorithm 3, shouldn’t Line 8 be $\sum_{I \in \mathcal{I}_t} Z_I$, since $\mathcal{I}_t$ defined in Line 2 only includes steps up to $t-B$?

  • It would be better if the paper included a proper definition of the accuracy of a counting task and the different types of errors.

Limitations

As noted by the authors, the only limitation is that the "gradual expiration" setting may only be relevant for specific applications. As expressed in the weakness section, a better motivating example would give better intuitions.

Author Response

We thank the reviewer for their reading and questions.

Q1: how to set $\lambda$?

$\lambda$ is provided by the user together with the input, and should reflect their privacy goals. The larger we make $\lambda$, the weaker the asymptotic privacy guarantee becomes, to the benefit of the error, as outlined by Theorem 1.2. Ultimately, running the algorithm with some $\lambda$ is more private than a direct non-private computation.

Q2: can our analysis improve Dwork et al.’s analysis?

Our idea of not including the left-most intervals in the dyadic interval set can be used to add $O(\log t)$ noise terms at time $t$, instead of $O(\log T)$. This translates into a constant improvement for their mean squared and $\ell_\infty$ error. We however note that it would be interesting future work to formalize a notion of privacy expiration with pan-privacy.

C1: length of $S$

$S$ is a set of tuples of outputs. The probability of any tuple of length $\neq \tau$ is $0$, so the definition is valid.

C2: drawing of noise

We assume the noise is ‘drawn on demand’ inside the for-loops, not before entering them. Lines 175-177 describe how noise is drawn for our algorithms, but we will try to make this more clear.

C3: Line 8 algorithm 3

Correct, we will fix it.

C4: proper definition of accuracy

We will include a formal definition in the full paper.

Comment

Thank you for the rebuttal.

Review (Rating: 6)

The paper studies differential privacy with gradual expiration, giving upper and lower bounds for a large set of expiration functions $g$. The proposed algorithm achieves an additive error of $O(\log(T)/\epsilon)$, matching the lower bound. Empirical evaluation shows that the algorithm achieves smaller privacy loss.

Strengths

  1. An analysis of upper and lower bounds for privacy with gradual expiration is provided. The proposed algorithm has an error that matches the lower bound, theoretically guaranteeing its performance.

  2. The algorithm can run on unbounded streams, unlike previous methods that require a fixed $T$. This adds more flexibility to its usage.

Weaknesses

  1. I find the description in Section 3.2 very abstract. Providing examples similar to those in [2] could help readers understand the descriptions better.

[2] Bolot, Jean, Nadia Fawaz, Shanmugavelayutham Muthukrishnan, Aleksandar Nikolov, and Nina Taft. "Private decayed predicate sums on streams." In Proceedings of the 16th International Conference on Database Theory, pp. 284-295. 2013.

Questions

  1. Is it possible to use the matrix mechanism (techniques from references [8], [10], [11]) to improve the constant term in the error analysis? Could you discuss the challenges of applying these techniques in differential privacy models with gradual expiration?

Limitations

See Weaknesses and Questions.

Author Response

We thank the reviewer for their reading and question.

W1: Section 3.2 is too abstract

We will include examples to make the related concepts in Section 3.2 more intuitive.

Q1: matrix mechanisms with privacy expiration

It is possible that techniques related to matrix mechanisms can be leveraged to improve the constant term in the error analysis. In the case of our algorithm, it is not clear how it could be expressed as a matrix mechanism, nor is it clear what the structure of the matrices must be to implement a given privacy expiration function. We believe it is an interesting future direction and will include a discussion of it.

Comment

Thank you!

Review (Rating: 7)

Modern DP solutions often operate in a setting where new data arrives daily. However, DP doesn't allow querying the data indefinitely; therefore, these solutions either treat the database as partitioned by time or refresh the budget on some schedule.

Neither approach is ideal. This paper studies an alternative: it assumes that older data is less sensitive, so releasing more information about the past is acceptable. Using this definition, it proposes continual release schemes that give better utility guarantees than solutions in the case where there is no privacy expiration.

Strengths

The setting the paper studies is very important, and this is the first paper to achieve better guarantees in this regime.

Weaknesses

In most settings where the data is sensitive, sensitivity does not diminish with time, or at least does not diminish uniformly. For example, ads information might reveal something about sexual orientation, whose sensitivity is not guaranteed to expire, while it could also reveal that a person is looking for a new job, whose sensitivity does decrease with time (though it is not clear how quickly).

Questions

  • Line 21: Here and further down the text, many references are given as a list; it would be better to either cite an overview or say something about each reference.

  • Line 30: As mentioned before, information about visiting some websites could be less sensitive if the visit happened a week ago, but for other websites the sensitivity does not diminish.

  • Line 35: Such strong statements require more than personal communication.

  • Line 36: Usually it is not users but datasets that are allocated a budget.

  • Line 39: It would be appropriate to give some references on systems that refresh budgets.

  • Line 42: While I agree that refreshing budgets could be problematic, it is worth noting that it is essentially equivalent to the setting with linear expiration.

  • Line 54: The claim about O(1) run-time is not supported by a theorem in the text.

  • Line 111: Why do you need $f$? Wouldn't it be enough to define everything with $g$?

  • Lines 152-157: This section is supposed to give intuition for the proofs, but it is not clear what it says.

  • Line 187: You need to define $q(N)$.

  • Line 235: $\mathcal{I}_t$ is infinite; how can you bound its size?

  • Line 242: It should be $y = x_j - x'_j$.

  • Line 254: It doesn't seem to be true that $z_I' = z_I + y$ for only one $I$.

Limitations

NA

Author Response

We thank the reviewer for their reading and their exact questions/comments. We will incorporate the comments into the final version, and address a subset of them below.

line 42: refreshing budgets means linear expiration

We agree; we will state this explicitly in the introduction.

line 54: no theorem for time/space complexity

This is true but the argument for it is direct, and it is sketched out on lines 293-298.

line 111: why $f$

It is only to say that if $f$ dominates $g$, then by definition the algorithm is also private with expiration function $f$. See lines 105-108.

line 235: how we bound $|\mathcal{I}_t|$

The critical observation is that the dyadic interval set $\mathcal{I}$ excludes every interval containing $0$, and so $t$ intersects intervals up to level $\ell = \lceil \log t \rceil$. A proof sketch is given in Appendix C.2.
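This level bound is easy to check numerically. The sketch below is a hypothetical reconstruction, not the paper's code: the function name and the interval convention $[k \cdot 2^\ell + 1, (k+1) 2^\ell]$ with $k \geq 1$ are our assumptions. Excluding the index $k = 0$ drops every interval touching the origin, and each $t \geq 2$ then lies in exactly $\lceil \log_2 t \rceil$ dyadic intervals:

```python
import math

def intervals_containing(t):
    """Dyadic intervals [k*2^l + 1, (k+1)*2^l] with k >= 1 that contain
    step t; requiring k >= 1 excludes every interval touching the origin."""
    hits = []
    level = 0
    while (1 << level) < t:            # k >= 1 forces 2^level < t
        k = (t - 1) >> level           # index of the level-`level` interval containing t
        lo, hi = k * (1 << level) + 1, (k + 1) * (1 << level)
        hits.append((level, lo, hi))
        level += 1
    return hits

# t intersects one interval per admissible level: ceil(log2 t) in total
for t in range(2, 2049):
    hits = intervals_containing(t)
    assert all(lo <= t <= hi for _, lo, hi in hits)
    assert len(hits) == math.ceil(math.log2(t))
```

Without the $k \geq 1$ restriction, every level would contribute an interval and the count would be unbounded over an infinite stream, which is the point of the observation above.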

line 254: about $z_I'$

The statement is true. For each $I$ in the decomposition $D_{[j,\tau]}$, we shift its corresponding noise $z_I$. By definition of $D_{[j,\tau]}$, it cannot contain overlapping intervals, and so $\mathcal{I}_t$ cannot intersect with more than one interval from $D_{[j,\tau]}$. There is therefore a unique interval $I \in \mathcal{I}_t$ where $z_I' = z_I + y$.

Comment

Thanks for the rebuttal!

Review (Rating: 6)

This paper studies a variant of the continual release model of differential privacy where the privacy parameter degrades (potentially smoothly) with time, i.e., the $\epsilon$ associated with changing a datapoint at time $t$ is $\epsilon g(T-t)$ (the model was introduced by Bolot et al.). This model makes sense when a datapoint becomes less important to hide with time. It also makes sense in the face of two facts about vanilla differential privacy under continual release: a) natural problems are subject to strong lower bounds, b) in practical implementations, it is often assumed that the privacy budget 'renews' each day, and a privacy expiration function could allow for a smoother decay of the privacy of a datapoint.

The main results of the authors are as follows: a) They show that assuming a logarithmic privacy loss function ($g(d) \approx \log d$), a variant of the binary tree mechanism gives $O(\log T)$ error (absolute difference). In contrast, in the vanilla privacy model, the best known error is $O(\log^2 T)$, and a long-open problem is whether $O(\log T)$ is achievable. This paper shows that under the weaker notion of privacy, this error is achievable. They use a variant of the binary tree mechanism inspired by pan-privacy (adding $\mathrm{Lap}(1/\epsilon)$ noise to each interval). Intuitively, a change at timestep $t$ would only affect intervals between timestep $t$ and $T-t$ that include datapoint $t$, which is roughly $\log(T-t)$ intervals, so composing over these intervals would give privacy degradation $g(d) \approx \log d$.

b) They also show a richer tradeoff between privacy degradation functions (which are roughly offset polylogarithmic functions) and error via a more careful variant of the binary tree mechanism.

c) They use a packing argument to argue that these tradeoffs are tight in the worst case.
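For intuition only, the basic binary tree mechanism that the summary above builds on can be sketched in a few lines of Python. This is a hedged illustration, not the authors' Algorithm 3 (there is no expiration schedule or level-dependent noise here, and the function names are ours): each dyadic interval carries one $\mathrm{Lap}(1/\epsilon)$ noise value, drawn on demand and reused across time steps.

```python
import random

def tree_mechanism(stream, eps):
    """Basic binary tree mechanism for continual counting (sketch).

    Every dyadic interval [k*2^l + 1, (k+1)*2^l] carries one Lap(1/eps)
    noise value, drawn on demand and reused; the output at time t is the
    true prefix sum plus the noise of the O(log t) intervals partitioning [1, t].
    """
    noise = {}  # (level, start) -> Laplace noise for that interval

    def lap(scale):
        # Laplace(scale) as the difference of two exponentials with mean `scale`
        return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

    outputs = []
    for t in range(1, len(stream) + 1):
        est, lo, rem = 0.0, 1, t
        while rem > 0:  # greedy dyadic decomposition of [1, t]
            level = rem.bit_length() - 1        # largest 2^level <= rem
            hi = lo + (1 << level) - 1
            if (level, lo) not in noise:
                noise[(level, lo)] = lap(1.0 / eps)
            est += sum(stream[lo - 1:hi]) + noise[(level, lo)]
            lo, rem = hi + 1, rem - (1 << level)
        outputs.append(est)
    return outputs
```

Setting `eps` very large makes the noise vanish, so the outputs track the exact prefix sums; a change at position $j$ perturbs only the intervals containing $j$, which is the composition intuition the review describes.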

Strengths

  1. The privacy model is interesting and merits further work, since it suggests a way to get around lower bounds/gaps in the continual release model of differential privacy while giving meaningful privacy. It's nice that the longstanding logarithmic gap in accuracy for continual counting can be plugged under this relaxation.

  2. The paper is well written and easy to follow.

Weaknesses

  1. The results presented seem relatively unsurprising given the intuition based on composition discussed in the summary

  2. The $\log T$ vs. $\log^2 T$ gap is relatively small; it would be much more interesting if larger (say, polynomial) gaps could be plugged using this model with reasonable privacy expiration functions.

Questions

  1. I wonder if the pan-private inspired counting algorithm is needed? In particular, I wonder if the vanilla binary tree mechanism would give you the same bounds given the intuition in the summary (this might be formalizable using post-processing functions that depend on the pair of neighboring datasets but not on the individual dataset within the pair, and noise couplings similar to what the authors do, but perhaps I am mistaken). If such an argument worked, it might be nice to write some general 'composition' theorems that can be used in follow-up work to prove privacy in this model.

  2. Do the authors have any thought on whether this model could help plug larger gaps between the continual release and batch models? Some discussion on this and an open problems section would be good to include in the final version of the paper.

Limitations

Yes.

Author Response

We thank the reviewer for their reading and questions.

Q1: why not use the vanilla binary mechanism

It can be shown that the vanilla binary mechanism satisfies the same error/privacy expiration as our algorithm for $\lambda = 1$. We however do not believe that the vanilla binary mechanism can be easily adapted (by varying the noise distributions across levels) to match our guarantees on error and privacy expiration for $\lambda \neq 1$. We sketch an argument for why we believe so next.

Our privacy argument relies on the fact that shifting noise values covering the interval $[j, \tau]$ can hide the value of $x_j$ for $t \in [j, \tau]$. It is important that the noise values used in the cover are from level at most $\ell = O(\log(\tau - j))$. For the vanilla binary mechanism, however, to hide the exact value of an input $x_j$, we need to shift all ancestors of $x_j$ that are used when producing outputs for $t \in [j, \tau]$, and for this set of noise values we do not have the same guarantee on what level in the tree they are from. For example, there exists $1 < j = \tau$ where the single node needed to shift will be at level $\ell = \Omega(\log(\tau))$. This makes the privacy proof in Appendix C.3.1 break down when $\lambda \neq 1$.

Q2: privacy expiration for other problems

We believe it is plausible that the model could be useful to get new trade-offs in other continual problems. Specifically, it would be an interesting direction for future research to study problems in this model, where there is known to be a polynomial error gap between the batch model and the continual release model (for example max sum and counting distinct elements). We will include a discussion of this direction as potential future work in the final version.

Comment

Thanks to the authors for the clarification re the vanilla binary tree mechanism.

Review (Rating: 6)

This paper studies the DP counting problem in the continual release setting. The difference from the standard problem is that an earlier change of an element is allowed a more relaxed privacy budget. The high-level idea is to follow the previous pan-private continual counting framework, but with a different noise setup.

Strengths

I believe the problem is fundamental and has many potential applications in the world. The DP continual release model with gradual privacy expiration would be a good tradeoff between accuracy and the potential privacy risk of the classic refresh of privacy budgets.

The algorithm is clean and easy to understand but the theoretical analysis requires solid work.

The authors also show almost matching lower bounds.

Weaknesses

When the function $g$ grows faster than a constant but slower than $\log^{o(1)}$, it is not clear what the tight error bound would be.

Questions

Can the authors provide any discussion of the tight bound when the function $g$ grows as $\log^{o(1)}(\cdot)$?

Limitations

There is no obvious potentially negative societal impact.

Author Response

We thank the reviewer for their reading and question.

Q1: error bound for expiration growing as $\log^{o(1)}(d)$

Our algorithm also supports privacy expiration $g(d) = O(\log\log(d))$ when $\lambda = 0$, which is shown in Appendix C.3.1. The accuracy guarantees for this case are given by setting $\lambda = 0$ in Theorem 1.2. For this case our lower bound is not tight. Achieving tightness for sublogarithmic privacy expiration is an interesting future direction.

Final Decision

This paper considers a continual counting problem in the privacy expiration model of [Bolot et al., ICDT 2013]. The input here is a stream $x_1, \dots$; the goal is, after receiving each $x_t$, to release an estimate of the sum $x_1 + \dots + x_t$. The privacy expiration model states that, for each $x_i$, the privacy loss after step $t$ can increase as $t - i$ increases; namely, it can be $\epsilon \cdot g(t - i)$ for some non-decreasing function $g$. The main result of this paper is that, for a large class of functions $g$, there is an algorithm with ($\ell_\infty$-)error $O(\log T / \epsilon)$, which improves on the standard model (where $g$ is a constant), whose best known error is $O(\log^2 T / \epsilon)$. The algorithm is based on the so-called binary tree mechanism [Dwork et al.; Chan et al.] with two main changes: (i) delaying the release of recent $x_i$ by ignoring them completely as long as $t \leq i + B$ for some $B$, and (ii) adjusting the privacy budget allocation of each level of the tree. In addition to this, the authors also provide a lower bound which is tight for some settings of $g$.

Overall, even though the algorithm is a simple modification of a known algorithm, the algorithm is clean & practical and it demonstrates the power of the privacy expiration model, which might offer an alternative solution to the standard continual release setting. Furthermore, given that the continual counting algorithm is a subroutine for many algorithms, it is likely that the result in this paper can be used to build other algorithms in this model. Due to this, we recommend accepting this paper as a poster.