Continual Counting with Gradual Privacy Expiration
We consider a variant of the continual counting problem in which privacy loss is allowed to grow over time.
Abstract
Reviews and Discussion
This paper studies the continual counting problem under a "relaxed" privacy notion where the privacy budget is a function of time, a.k.a. differential privacy with expiration functions. This notion, proposed by Bolot et al. [2], captures settings where data becomes less sensitive with time. The paper proposes an algorithm based on the pan-private tree-based mechanism of Dwork et al. [12]. They show that this algorithm satisfies DP with expiration function g(d) = O(log d), which improves on the constant expiration function obtained by directly using the analysis of Chan et al. [6], and on the expiration function obtained by directly analysing the pan-private tree-based mechanism of Dwork et al. [12].
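For reference, the privacy notion can be stated as follows (our paraphrase based on the summaries on this page; the paper's Definition 1.1 may differ in its exact indexing). For all neighboring streams x, x' that differ only in the element at time step s, all times t >= s with d = t - s, and all sets S of output tuples,

$$\Pr[\mathcal{A}(x)_{1:t} \in S] \;\le\; e^{\varepsilon\, g(d)} \cdot \Pr[\mathcal{A}(x')_{1:t} \in S],$$

where g is a non-decreasing expiration function; the constant function g = 1 recovers standard ε-differential privacy in the continual release setting.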
Strengths
- The problem is well-motivated and interesting.
- The paper is well-written and easy to follow.
- The analysis is complete (upper/lower bounds, different types of expiration functions), and the main ideas are stated clearly.
- The experiments validate the theoretical findings.
Weaknesses
- The technical gap is limited compared to Dwork et al. and Chan et al.
- Algorithm 3 is less intuitive, especially the role of its parameter, and its analysis is less clear.
- There is a need for better/more examples to motivate the setting. The example of tracking visits to a location/website arguably does not capture the definition used. For example, if the website changes its content from offensive to less offensive, one may argue that visiting the website in the old version is even more revealing and more sensitive than visiting the new version, which is the opposite of your setting.
Questions
- How should the parameter in Algorithm 3 be chosen?
- Can the techniques/ideas proposed to analyse Algorithm 3 be adapted to improve the analysis of the pan-private algorithm of Dwork et al. [12] for the classic setting without expiration?
Minor comments/typos:
- In Definition 1.1, shouldn't S be of length t?
- In the algorithms' pseudocode, shouldn't the noise-drawing steps (Line 2 in Algorithm 1 and Lines 3-4 in Algorithm 2) be inside the "for t" loop? Or do you draw all the noise beforehand, i.e., the whole sequence?
- In Algorithm 3, shouldn't Line 8 be changed, since the set defined in Line 2 only includes earlier steps?
- It would be better if the paper included a proper definition of the accuracy of a counting task and the different types of errors.
Limitations
As noted by the authors, the only limitation is that the "gradual expiration" setting may only be relevant for specific applications. As expressed in the weaknesses section, a better motivating example would give better intuition.
We thank the reviewer for their reading and questions.
Q1: How to set the parameter?
The parameter is provided by the user together with the input, and should reflect their privacy goals. The larger we make it, the weaker the asymptotic privacy guarantee becomes, to the benefit of the error, as outlined by Theorem 1.2. Ultimately, running the algorithm with any setting of the parameter is more private than a direct non-private computation.
Q2: Can our analysis improve Dwork et al.'s analysis?
Our idea of not including the left-most intervals in the dyadic interval set can be used to add fewer noise terms at each time t than in their original analysis. This translates into a constant-factor improvement for their mean squared and ℓ∞ error. We note, however, that it would be interesting future work to formalize a notion of privacy expiration with pan-privacy.
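To illustrate the accounting (a toy sketch in Python; the exact counts, and how the excluded left-most block is released, are as specified in the paper, not here): the canonical dyadic decomposition of [1, t] contains one interval per set bit of t, and exactly one of them, the largest, contains time 1.

```python
def decomposition(t):
    """Canonical dyadic decomposition of [1, t]: one interval per set
    bit of t, largest block first, e.g. t = 13 -> [1,8], [9,12], [13,13]."""
    out, start = [], 1
    for j in reversed(range(t.bit_length())):
        if t & (1 << j):
            out.append((start, start + (1 << j) - 1))
            start += 1 << j
    return out

for t in [5, 13, 16]:
    ivals = decomposition(t)
    # Dropping the left-most interval (the one containing time 1) removes
    # one noise term from the output at time t.
    print(t, ivals, len(ivals), "->", len(ivals) - 1)
```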
C1: Length of S
S is a set of tuples of outputs. The probability of any tuple whose length differs from t is 0, so the definition is valid.
C2: drawing of noise
We assume the noise is 'drawn on demand' inside the for-loops, not before entering them. Lines 175-177 describe how noise is drawn for our algorithms, but we will try to make this clearer.
C3: Line 8 of Algorithm 3
Correct, we will fix it.
C4: proper definition of accuracy
We will include a formal definition in the full paper.
Thank you for the rebuttal.
The paper studies differential privacy with gradual expiration models and gives upper and lower bounds for a large class of expiration functions g. The proposed algorithm can achieve an additive error of O(log t) for logarithmic expiration, matching the lower bound. Empirical evaluation shows that the algorithm can achieve smaller privacy loss.
Strengths
- An analysis of upper and lower bounds for privacy with gradual expiration is provided. The proposed algorithm has an error that matches the lower bound, theoretically guaranteeing its performance.
- The algorithm can run on unbounded streams, unlike previous methods that require a fixed time horizon T. This adds more flexibility to its usage.
Weaknesses
- I find the description in Section 3.2 very abstract. Providing examples similar to those in [2] could help readers understand the descriptions better.
[2] Bolot, Jean, Nadia Fawaz, Shanmugavelayutham Muthukrishnan, Aleksandar Nikolov, and Nina Taft. "Private decayed predicate sums on streams." In Proceedings of the 16th International Conference on Database Theory, pp. 284-295. 2013.
Questions
- Is it possible to use the matrix mechanism (techniques from references [8], [10], [11]) to improve the constant term in the error analysis? Could you discuss the challenges of applying these techniques in differential privacy models with gradual expiration?
Limitations
See Weaknesses and Questions.
We thank the reviewer for their reading and question.
W1: Section 3.2 is too abstract
We will include examples to make the related concepts in Section 3.2 more intuitive.
Q1: matrix mechanisms with privacy expiration
It is possible that techniques related to matrix mechanisms can be leveraged to improve the constant term in the error analysis. In the case of our algorithm, it is not clear how it could be expressed as a matrix mechanism, nor is it clear what the structure of the matrices must be to implement a given privacy expiration function. We believe it is an interesting future direction and will include a discussion of it.
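For context, a sketch of the factorization view we refer to (standard background on matrix mechanisms for counting, not the paper's construction): continual counting corresponds to the lower-triangular all-ones matrix A, and a matrix mechanism releases

$$M(x) = L(Rx + z), \qquad A = LR, \qquad z \sim \mathcal{N}(0, \sigma^2 I),$$

where the privacy guarantee is governed by the largest column norm of R and the per-output error by the row norms of L. Implementing a privacy expiration function g would, roughly, require the column of R corresponding to time step s to incur privacy loss growing like g(t - s) over the outputs at times t >= s, and it is this structural constraint on the factorization that is not obviously satisfiable.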
Thank you!
Modern DP solutions often operate in a setting where new data arrives daily. However, DP doesn't allow querying the data indefinitely; therefore, these solutions either treat the database as partitioned by time or refresh the budget on some schedule.
Neither of these approaches is ideal. This paper studies an alternative solution: it assumes that older data is less sensitive, and that releasing more information about the past is therefore acceptable. Using this definition, it proposes continual release schemes that give better utility guarantees than solutions for the case where there is no privacy expiration.
Strengths
The setting the paper studies is very important, and this is the first paper that has achieved better guarantees in this regime.
Weaknesses
In most settings where the data is sensitive, sensitivity doesn't diminish with time, or at least does not diminish uniformly. E.g., ads information might reveal something about sexual orientation, and the sensitivity of that information is not guaranteed to expire, while it could also reveal that a person is looking for a new job, the sensitivity of which does decrease with time (however, it is not clear how quickly it decreases).
Questions
On line 21: Here and further down the text, many references are given as a list; it would be better to either cite an overview or say something about each reference.
On line 30: As mentioned before, information about visits to some websites may be less sensitive if they happened a week ago, but for other websites the sensitivity of a visit does not diminish.
On line 35: Such strong statements require more than personal communication.
On line 36: Usually budgets are allocated to datasets, not users.
On line 39: It would be appropriate to give some references on systems that refresh budgets.
On line 42: While I agree that refreshing budgets could be problematic, it is worth noting that it is essentially equivalent to the setting with linear expiration.
On line 54: The claim about O(1) run-time is not supported by a theorem in the text.
On line 111: Why do you need f? Wouldn't it be enough to define everything with g?
On lines 152-157: This section is supposed to give intuition for the proofs, but it is not clear what it says.
On line 187: You need to define q(N).
On line 235: I_t is infinite; how can you bound its size?
On line 242: It should be y = x_j - x'_j.
On line 254: It doesn't seem to be true that z'_I = z_I + y for only one I.
Limitations
NA
We thank the reviewer for their reading and their detailed questions/comments. We will incorporate the comments into the final version, and address a subset of them below.
line 42: refreshing budgets means linear expiration
We agree; we will state this explicitly in the introduction.
line 54: no theorem for time/space complexity
This is true but the argument for it is direct, and it is sketched out on lines 293-298.
line 111: why f
It is only to say that if g dominates f, then by definition the algorithm is also private with expiration function g. See lines 105-108.
line 235: how we bound |I_t|
The critical observation is that the dyadic interval set excludes every interval containing time step 1, and so time t only intersects intervals up to level log t, which bounds |I_t|. A proof sketch is given in Appendix C.2.
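A brute-force check of this observation (our illustration in Python, following the left-most-exclusion idea described above rather than the paper's exact pseudocode):

```python
import math

def intervals_at_level(j, T):
    """Level-j dyadic intervals [k * 2^j + 1, (k + 1) * 2^j] for k >= 1;
    k = 0, the left-most interval (the one containing time 1), is excluded."""
    size = 1 << j
    return [(k * size + 1, (k + 1) * size) for k in range(1, T // size)]

T = 1 << 10
for t in range(2, T):
    for j in range(int(math.log2(T)) + 1):
        for (a, b) in intervals_at_level(j, T):
            # Any surviving interval containing t sits at level <= log2(t).
            if a <= t <= b:
                assert j <= math.log2(t)
```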
line 254: about z'_I = z_I + y
The statement is true. For each interval in the decomposition, we shift its corresponding noise by y. By definition, the decomposition cannot contain overlapping intervals, and so the set of intervals used for any single output intersects at most one interval from it. There is therefore, for each output, a unique interval I where z'_I = z_I + y.
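In symbols (our paraphrase of this coupling argument, with b_I denoting the Laplace scale used for interval I): for neighboring streams differing in x_j, set y = x_j - x'_j and couple the noise as

$$z'_I = z_I + y \ \text{ for } I \in D, \qquad z'_I = z_I \ \text{ otherwise},$$

where D is the (disjoint) decomposition being shifted. Each output then uses at most one shifted noise value, so all outputs are matched, and by the standard Laplace shift bound the total privacy loss is at most

$$\sum_{I \in D} \frac{|y|}{b_I}.$$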
Thanks for the rebuttal!
This paper studies a variant of the continual release model of differential privacy where the privacy parameter degrades (potentially smoothly) with time, i.e., the privacy loss associated with changing a datapoint d steps in the past is ε·g(d) (the model was introduced by Bolot et al.). This model makes sense when a datapoint becomes less important to hide with time. It also makes sense in the face of two facts about vanilla differential privacy under continual release: a) natural problems are subject to strong lower bounds; b) in practical implementations, it is often assumed that the privacy budget 'renews' each day, and a privacy expiration function could allow for a smoother decay of the privacy of a datapoint.
The main results of the authors are as follows: a) They show that, assuming a logarithmic privacy loss function (g(d) = O(log d)), a variant of the binary tree mechanism gives error O(log t) (absolute difference). In contrast, in the vanilla privacy model, the best known error is O(log^2 t), and a long-open problem is whether O(log t) is achievable. This paper shows that under the weaker notion of privacy, this error is achievable. They use a variant of the binary tree mechanism inspired by pan-privacy (adding noise to each interval); a toy sketch of this idea appears after this list. Intuitively, a change at timestep t would only affect intervals between timesteps t and t + d that include datapoint t, which is roughly log d intervals, so composing over these intervals would give privacy degradation ε·log d.
b) They also show a richer tradeoff between privacy degradation functions (which are roughly offset polylogarithmic functions) and error via a more careful variant of the binary tree mechanism.
c) They use a packing argument to argue that these tradeoffs are tight in the worst case.
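A minimal sketch of the per-interval-noise idea from a) (our illustration: uniform Laplace scale, no level-dependent budgets, and no exclusion of left-most intervals, so it is deliberately simpler than the paper's Algorithm 3):

```python
import numpy as np

rng = np.random.default_rng(0)

def pan_private_counts(x, eps):
    """Answer each prefix sum as the sum of noisy dyadic-interval counts:
    one Laplace noise per interval, drawn on demand and then reused."""
    prefix = np.concatenate([[0], np.cumsum(x)])
    noise = {}  # (level, interval start) -> Laplace noise

    def noisy(a, b, level):
        key = (level, a)
        if key not in noise:
            noise[key] = rng.laplace(scale=1.0 / eps)
        return prefix[b] - prefix[a - 1] + noise[key]

    outputs = []
    for t in range(1, len(x) + 1):
        est, start = 0.0, 1
        for j in reversed(range(t.bit_length())):
            if t & (1 << j):  # canonical decomposition of [1, t]
                est += noisy(start, start + (1 << j) - 1, j)
                start += 1 << j
        outputs.append(est)
    return outputs

print(pan_private_counts([1, 0, 1, 1, 0, 1, 1, 1], eps=1.0))
```

A change at time s then affects one interval per level among those containing s, matching the roughly-log d composition intuition above.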
Strengths
- The privacy model is interesting and merits further work, since it suggests a way to get around lower bounds/gaps in the continual release model of differential privacy while giving meaningful privacy. It's nice that the longstanding logarithmic gap in accuracy for continual counting can be plugged under this relaxation.
- The paper is well written and easy to follow.
Weaknesses
- The results presented seem relatively unsurprising given the intuition based on composition discussed in the summary.
- The O(log t) vs. O(log^2 t) gap is relatively small; it would be much more interesting if larger (say, polynomial) gaps could be plugged using this model with reasonable privacy expiration functions.
Questions
- I wonder if the pan-private-inspired counting algorithm is needed. In particular, I wonder if the vanilla binary tree mechanism would give you the same bounds given the intuition in the summary (this might be formalizable using post-processing functions that depend on the pair of neighboring datasets but not on the individual dataset within the pair, and noise couplings similar to what the authors do, but perhaps I am mistaken). If such an argument worked, it might be nice to write some general 'composition' theorems that can be used in follow-up work to prove privacy in this model.
- Do the authors have any thoughts on whether this model could help plug larger gaps between the continual release and batch models? Some discussion on this and an open problems section would be good to include in the final version of the paper.
Limitations
Yes.
We thank the reviewer for their reading and questions.
Q1: Why not use the vanilla binary mechanism?
It can be shown that the vanilla binary mechanism satisfies the same error/privacy expiration as our algorithm for logarithmic expiration. We however do not believe that the vanilla binary mechanism can be easily adapted (by varying the noise distributions across levels) to match our guarantees on error and privacy expiration for sublogarithmic expiration. We sketch an argument for why we believe so next.
Our privacy argument relies on the fact that shifting the noise values covering the interval [j, t] can hide the value of x_j at any time t >= j. It is important that the noise values used in the cover are from level at most roughly log(t - j). For the vanilla binary mechanism, however, to hide the exact value of an input x_j, we need to shift all ancestors of x_j that are used when producing outputs at times t >= j, and for this set of noise values we do not have the same guarantee on what level in the tree they are from. For example, there exist times t where the single node that needs shifting is at level log t. This makes the privacy proof in Appendix C.3.1 break down for sublogarithmic expiration.
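A concrete instance of the last point (our reading of the argument): take t = 2^k. In the vanilla binary tree, the output at time t is the single noisy node covering [1, t],

$$t = 2^k: \quad \text{the decomposition of } [1, t] \text{ is the single interval } [1, 2^k] \text{ at level } k = \log_2 t,$$

so hiding a change at time t itself (distance d = 0) already requires shifting a level-(log t) noise value, whatever level-dependent noise scales are chosen.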
Q2: privacy expiration for other problems
We believe it is plausible that the model could be useful for obtaining new trade-offs in other continual problems. Specifically, it would be an interesting direction for future research to study problems in this model where there is known to be a polynomial error gap between the batch model and the continual release model (for example, max-sum and counting distinct elements). We will include a discussion of this direction as potential future work in the final version.
Thanks to the authors for the clarification re the vanilla binary tree mechanism.
This paper studies the DP counting problem in the continual release setting. The difference between the problem studied by this paper and the standard problem is that changes to earlier elements are allowed a more relaxed privacy budget. The high-level idea is to follow the previous pan-private continual counting framework but with a different noise setup.
Strengths
I believe the problem is fundamental and has many potential applications in the real world. The DP continual release model with gradual privacy expiration offers a good tradeoff between accuracy and the potential privacy risk of the classic refreshing of privacy budgets.
The algorithm is clean and easy to understand, but the theoretical analysis requires solid work.
The authors also show almost-matching lower bounds.
Weaknesses
When the function g grows faster than a constant but slower than logarithmically (e.g., as log^{o(1)}), it is not clear what the tight error bound would be.
Questions
Can the authors provide any discussion of the tight bound when the function g grows as log^{o(1)}?
Limitations
There is no obvious potentially negative societal impact.
We thank the reviewer for their reading and question.
Q1: Error bound for expiration growing as log^{o(1)}.
Our algorithm also supports sublogarithmic privacy expiration, as shown in Appendix C.3.1. The accuracy guarantees for this case are given by setting the parameters accordingly in Theorem 1.2. For this case our lower bound is not tight. Achieving tightness for sublogarithmic privacy expiration is an interesting future direction.
This paper considers a continual counting problem in the privacy expiration model of [Bolot et al., ICDT 2013]. The input here is a stream x_1, x_2, ...; the goal is to, after receiving each x_t, release an estimate of the sum x_1 + ... + x_t. The privacy expiration model states that the privacy loss of step t can increase with the time d elapsed since then; namely, it can be ε·g(d) for some non-decreasing function g. The main result of this paper is that, for a large class of functions g, there is an algorithm with improved error; e.g., for logarithmic g the (ℓ∞-)error is O(log t), which improves on the standard model (where g is a constant), whose best known error is O(log^2 t). The algorithm is based on the so-called binary tree mechanism [Dwork et al.; Chan et al.] with two main changes: (i) delaying the release of recent values by ignoring them completely for a bounded number of steps, and (ii) adjusting the privacy budget allocation of each level of the tree. In addition to this, the authors also provide a lower bound which is tight for some settings of g.
Overall, even though the algorithm is a simple modification of a known one, it is clean and practical, and it demonstrates the power of the privacy expiration model, which might offer an alternative to the standard continual release setting. Furthermore, given that continual counting is a subroutine for many algorithms, it is likely that the result in this paper can be used to build other algorithms in this model. Due to this, we recommend accepting this paper as a poster.