PaperHub

Average rating: 4.5/10 · Rejected · 4 reviewers (min 3, max 5, std. dev. 0.9)
Individual ratings: 5, 3, 5, 5 · Average confidence: 3.8

ICLR 2024

Learning Differentially Private Rewards from Human Feedback

OpenReview · PDF
Submitted: 2023-09-23 · Updated: 2024-02-11

Abstract

We study the privacy of reinforcement learning from human feedback. In particular, we focus on solving the problem of reinforcement learning from preference rankings, subject to the constraint of differential privacy, in MDPs where true rewards are given by linear functions. To achieve this, we analyze $(\epsilon,\delta)$-differential privacy (DP) for both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. We provide a differentially private algorithm for learning rewards from human rankings. We further show that the privately learned rewards can be used to train policies achieving statistical performance guarantees that asymptotically match the best known algorithms in the non-private setting, which are in some cases minimax optimal.
Keywords
Learning to Rank · Differential Privacy · Minimax Optimal

Reviews and Discussion

Review (Rating: 5)

This paper studies the privacy of reinforcement learning from human feedback. More specifically, given the dataset consisting of preference rankings, this paper proposes a method to learn rewards under the constraint of differential privacy. Furthermore, with the rewards satisfying DP constraints, the authors present an algorithm to learn a near-optimal policy. The sub-optimality bounds are shown to be minimax optimal.

Strengths

  1. The setting of privately learning rewards and a policy from human feedback (preference rankings) is important.
  2. This paper considers both the contextual bandit setting and the MDP setting.
  3. The result is shown to be minimax optimal given that $\epsilon$ is a constant, and the proof looks correct to me.
  4. The presentation is clear in general, and the paper is easy to follow.

Weaknesses

  1. My main concern is about the technical difficulty. Given the algorithm in [1], Algorithm 1 is a standard application of objective perturbation, Algorithm 2 is a straightforward application of Gaussian mechanism, and Algorithm 3 is derived from replacing the estimations by their private counterparts. It would be better if the authors could highlight their technical contributions.

[1] Banghua Zhu, Jiantao Jiao, and Michael I Jordan. Principled reinforcement learning with human feedback from pairwise or k-wise comparisons.

  2. The sub-optimality bound has dependence $\tilde{O}(1/\sqrt{n\epsilon})$, where the dependence is not optimal and the second term could dominate if $\epsilon$ goes to 0. As is shown in [2], the additional cost due to DP could be of $\tilde{O}(1/n\epsilon)$ for empirical risk minimization. In addition, [3] shows that the additional cost due to DP could be of $\tilde{O}(1/n\epsilon)$ for offline RL tasks. What is the challenge in deriving such a dependence?

[2] Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression.

[3] Dan Qiao and Yu-Xiang Wang. Offline reinforcement learning with differential privacy.

  3. For the MDP setting, the reward for a whole trajectory is still a linear function of $\theta$. Given that the occupancy measure $\rho$ is an input of Algorithm 3, is this setting identical to the contextual bandit setting?

  4. It would be better if the authors could further discuss the papers on RL with JDP or LDP guarantees and their relationship to this work. For instance, would an algorithm with JDP/LDP still be JDP/LDP if the rewards are learned privately (as in this paper)? Here are some papers regarding online RL with JDP/LDP that may be relevant: [4,5] consider RL with JDP and LDP for tabular MDPs, while [6,7] consider RL with JDP and LDP for linear mixture MDPs.

[4] Sayak Ray Chowdhury and Xingyu Zhou. Differentially private regret minimization in episodic markov decision processes.

[5] Dan Qiao and Yu-Xiang Wang. Near-optimal differentially private reinforcement learning.

[6] Chonghua Liao, Jiafan He, and Quanquan Gu. Locally differentially private reinforcement learning for linear mixture markov decision processes.

[7] Xingyu Zhou. Differentially private reinforcement learning with linear function approximation.

  5. There are many typos in the paper. Here I list the typos I found.

For the definition of $\Sigma_{D}$, the summation should go from $k=j+1$ instead of $j=k+1$ (Page 2).

The transition $T_h$ should be a mapping from $S\times A$ (Page 3).

For the occupancy measure $\rho_\pi$, where is the dependence on $h$ (Page 3)?

For the probability of the ranking, the summation should go from $j=k$ instead of $j=m$ (Page 4).

In equation (5), the '$\leq$' is missing (Page 8).

For the definition of $\hat{J}(\pi)$, the dependence on $v$ is missing (Page 9).

Questions

Please see the weakness section. I will be willing to raise the score if my concerns (especially 1 & 2) are addressed.

Comment

"1. My main concern is about the technical difficulty. Given the algorithm in [1], Algorithm 1 is a standard application of objective perturbation, Algorithm 2 is a straightforward application of Gaussian mechanism, and Algorithm 3 is derived from replacing the estimations by their private counterparts. It would be better if the authors could highlight their technical contributions."

The main technical contribution is that standard objective perturbation (as used in Algorithm 1) requires strong convexity of the original loss in order to achieve the best possible rates for DP. In our case, the loss is not in fact strongly convex in the standard sense (i.e., with respect to the $\ell_2$-norm), but rather is strongly convex with respect to a data-dependent norm. This makes it completely unclear whether there are instantiations of these DP algorithms that, when combined, will achieve DP with the same statistical rates as the non-private algorithms. Our main contribution is in showing that this is indeed possible, which requires a somewhat delicate analysis of how these different DP methods interact with strong convexity with respect to a dataset-dependent norm.
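To make the comparison concrete, below is a minimal sketch of generic objective perturbation (in the spirit of Kifer et al., 2012) applied to a BTL-style pairwise comparison loss. This is not our Algorithm 1: the function names, noise scale, and regularizer are illustrative placeholders.

```python
import numpy as np
from scipy.optimize import minimize

def btl_loss(theta, X_win, X_lose):
    """Average negative log-likelihood of the BTL model with linear rewards."""
    margins = (X_win - X_lose) @ theta          # <theta, phi(winner) - phi(loser)>
    return np.mean(np.log1p(np.exp(-margins)))  # logistic loss on the reward margins

def objective_perturbation(X_win, X_lose, noise_std, alpha, seed=0):
    """Minimize loss(theta) + <b, theta> + (alpha / 2) * ||theta||^2 with Gaussian b.

    noise_std and alpha would be calibrated from (epsilon, delta), the gradient
    sensitivity, and the smoothness of the loss; that calibration is omitted here.
    """
    d = X_win.shape[1]
    b = np.random.default_rng(seed).normal(0.0, noise_std, size=d)

    def perturbed(theta):
        return btl_loss(theta, X_win, X_lose) + b @ theta + 0.5 * alpha * theta @ theta

    return minimize(perturbed, np.zeros(d), method="L-BFGS-B").x
```

The delicate part our response refers to is precisely the calibration of `noise_std` and `alpha` when the loss is only strongly convex in a data-dependent norm rather than in $\ell_2$.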

"2. The sub-optimality bound has dependence O~(1/nϵ)\tilde{O}(1/\sqrt{n\epsilon }), where the dependence is not optimal and the second term could dominate if ϵ\epsilon goes to 0. As is shown in [2], the additional cost due to DP could be of O~(1/nϵ)\tilde{O}(1/n\epsilon) for empirical risk minimization. In addition, [3] shows that the additional cost due to DP could be of O~(1/nϵ)\tilde{O}(1/n\epsilon) for offline RL tasks. What is the challenge to derive such dependence?

The additional cost of DP from [3] in the linear function approximation case includes an explicit dependence on $\frac{1}{\kappa}$, where $\kappa$ is the minimum eigenvalue of the data covariance matrix (analogous to our $\Sigma_D$). In contrast, our work, exactly as in Zhu et al. (2023), requires no assumption on the eigenvalues of $\Sigma_D$, and indeed the result still holds when some eigenvalues are zero. This is the main source of the difference between the bounds achieved in our case and those in the offline RL setting of [3].

"For the MDP setting, the reward for a whole trajectory is still a linear function of θ\theta. Given that the occupancy measure ρ\rho is an input of Algorithm 3, is this setting identical to the contextual bandit setting?"

The settings are similar but not quite identical. In particular, the loss on line 1 of Algorithm 3 needs to be modified to

$$\hat{J}(\pi) = \min_{\theta\in\Theta(\tilde{\theta},\lambda)} \mathbb{E}_{s\sim\rho_{\pi}}\big[\langle\theta,\phi(s,\pi(s))-v\rangle\big],$$

where $\rho_{\pi}$ is the occupancy measure induced by the policy $\pi$. That is, the expectation defining the loss is now over a policy-dependent distribution, rather than a fixed distribution as in the bandit case.
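To make the modification concrete, here is a toy Monte-Carlo sketch of this objective. Purely for illustration it takes $\Theta(\tilde{\theta},\lambda)$ to be an $\ell_2$ ball, in which case the inner minimization has the closed form $\langle\tilde{\theta},\bar{x}\rangle-\lambda\|\bar{x}\|_2$ with $\bar{x}$ the empirical mean of $\phi(s,\pi(s))-v$; our actual confidence set is defined via a data-dependent norm, and all names below are hypothetical.

```python
import numpy as np

def pessimistic_value(theta_tilde, lam, states, policy, phi, v):
    """Estimate J_hat(pi) from samples s ~ rho_pi, pessimistically over an l2 ball."""
    feats = np.stack([phi(s, policy(s)) - v for s in states])  # rows: phi(s, pi(s)) - v
    x_bar = feats.mean(axis=0)                                 # empirical expectation
    return theta_tilde @ x_bar - lam * np.linalg.norm(x_bar)   # worst case over the ball
```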

Review (Rating: 3)

This paper aims to offer DP guarantees for Zhu et al. (2023). To this end, the authors follow Zhu et al. (2023) to derive a private version of the estimation error bound and then use it to derive guarantees for offline RL as in Zhu et al. (2023).

Strengths

Introducing DP is an interesting topic, especially considering the private information in the labeling process.

Weaknesses

  1. The techniques are quite standard and the results are straightforward.
  2. Only upper bounds are presented, with no lower bound to show the tightness of the bounds, especially in terms of the dependence on the privacy parameters.
  3. There are no simulation results, which is somewhat unusual for an ICLR submission.

Questions

I do have several questions about this paper. In general, I think this paper is written in a somewhat sloppy way.

  1. In the first equation of the proof of Theorem 4.2, $\lambda$ should be $\sqrt{\lambda}$.
  2. The equation right above eq. (14) is not correct in terms of $\gamma$.
  3. The equation right after eq. (14) is not correct in terms of $\lambda$.
  4. More importantly, there must be some condition on the accuracy parameter $\beta$ with respect to the privacy parameter and the smoothness of the loss. This will in turn limit the value of the privacy parameter, I think.

Ethics Concerns Details

NA

Comment

Thank you for your careful reading and for identifying typos in the proofs. We have fixed the problems you pointed out and updated the paper.

" there must be some condition of the accuracy parameter \beta with respect to the privacy parameter and smoothness of the loss."

Note that in Algorithm 1, $\beta < \frac{1}{n}$ is required for our results. The regularization parameter $\alpha$ is then set based on the strong convexity parameter $\gamma$; this setting for $\alpha$ only achieves the desired accuracy and privacy goals when $\beta < \frac{1}{n}$. The noise added to the output for privacy further depends on the relationship between $\beta$ and $\alpha$.

Note also that by Lemma A.3 the smoothness of the loss is bounded by $2L$, and we have assumed that $L$ is a fixed constant independent of $n$. Thus, extra factors depending on the smoothness are absorbed into the $O(\cdot)$ in our main theorems.
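For reference, the output-perturbation step mentioned above can be compared with the textbook Gaussian mechanism, sketched below. The `l2_sensitivity` argument is a placeholder; in our setting the noise scale is instead derived from the relationship between $\beta$, $\alpha$, and $\gamma$.

```python
import numpy as np

def gaussian_mechanism(value, l2_sensitivity, eps, delta, seed=0):
    """Release value + N(0, sigma^2 I); standard (eps, delta)-DP calibration (for eps < 1)."""
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    noise = np.random.default_rng(seed).normal(0.0, sigma, size=np.shape(value))
    return np.asarray(value) + noise
```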

Review (Rating: 5)

This paper introduces differential privacy (DP) to reinforcement learning from human feedback (RLHF). In RLHF, humans provide rankings for multiple (state, action) pairs (in LLM applications, (prompt, text) pairs) to train a reward model, which is used for downstream reinforcement learning tasks. This paper studies how to train the reward model while ensuring DP on the ranking dataset. To do this, it adopts the theoretical model of [Zhu et al, 2023], which assumes that humans have real numerical preferences for each (state, action) pair in the linear form $r_{\theta^*}(s, a) = \langle \theta^*, \phi(s, a)\rangle$ and sample rankings according to classical probability models based on these numerical preferences (Bradley-Terry-Luce, Plackett-Luce). So the central task is to learn the parameter $\theta^*$ differentially privately. The paper shows that this can be done by standard DP techniques. Also, the paper shows that the downstream reinforcement learning performance using a reward model with the DP-learned parameter is similar to that with the non-DP-learned parameter.
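As a quick illustration of the ranking model described above, the sketch below samples a Plackett-Luce ranking of $K$ candidate (state, action) pairs from linear rewards $r_{\theta^*}(s,a)=\langle\theta^*,\phi(s,a)\rangle$; the Bradley-Terry-Luce model is the $K=2$ special case. The code is a generic illustration with made-up names, not code from the paper.

```python
import numpy as np

def sample_pl_ranking(theta_star, features, rng):
    """features: (K, d) array of phi(s_k, a_k); returns indices ranked best to worst."""
    scores = np.exp(features @ theta_star)            # exp of latent linear rewards
    remaining = list(range(len(scores)))
    ranking = []
    while remaining:
        probs = scores[remaining] / scores[remaining].sum()
        pick = rng.choice(len(remaining), p=probs)    # pick the next-ranked item
        ranking.append(remaining.pop(pick))
    return ranking

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))                    # 4 candidate pairs, 3-dim features
print(sample_pl_ranking(rng.normal(size=3), features, rng))
```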

Strengths

(1) The problem is very well motivated and practically important. As RLHF is getting more popular, privacy becomes an increasingly important issue.

(2) This paper might be a good starting point for future work to explore further in the direction of "differentially private RLHF".

Weaknesses

(1) My major concern is the lack of experimental results. I know that this is a theoretical paper, but I do think experimental results are necessary here, for the following reasons: the purpose of introducing differential privacy to RLHF is to protect the privacy of the human data providers in practice. This paper's theoretical results provide upper bounds on the privacy loss and on the learning performance lost due to the privacy guarantee, under an idealized model ([Zhu et al, 2023]'s model). Humans' ranking behavior may not follow that idealized model, and the RLHF algorithms (for both the reward training and the downstream RL) used in practice are not necessarily the algorithms analyzed in this paper. Whether the theoretical results in this paper can be applied is questionable. Given that the theoretical contribution of this paper is only marginal (see below), I think an empirical contribution (trying DP on real datasets with real RLHF algorithms) is needed here.

(2) My second concern is that the technical contribution of this paper is marginal. The DP technique (adding Gaussian noise to the loss function and the solution in a convex optimization) is a standard technique from [Bassily et al, 2019a] and [Kifer et al, 2012]. The linear reward + BTL/PL model is the same as [Zhu et al, 2023]. The proofs of the theorems are basically a combination of the proofs from these two lines of previous work.

Due to the above two concerns, I recommend weak reject.

Questions

Questions:

(1) How does [Zhu et al, 2023]'s idealized model capture real-world RLHF algorithms?

(2) How does adding DP to their model really inform RLHF in practice?

Suggestions:

Typos (that do not affect my rating):

  1. Page 3: $T_h: S \times A \to \Delta(S)$.

  2. Page 4, Assumption 2.1: $r_\theta(s, a) = \langle \theta, \phi(s, a)\rangle$.

  3. Page 4, Definition 2.2: better to say "... private if for all datasets $\mathcal D, \mathcal D'$ with $||\mathcal D - \mathcal D'||_1\le 1$, for all $\mathcal O\subseteq \mathrm{Range}(A)$, ...".

  4. Page 8, equation (5): $\tilde \Sigma_{\mathcal{D}_K}$.

  5. Page 14, equation (8): $\langle \nabla \ell_D(\theta), \Delta \rangle$.

  6. Page 14, Lemma A.5: "Then $f(\theta) - f(\hat \theta) \ge \frac{\gamma}{2}|| \hat \theta - \theta||_M^2$".

Comment

"(1) How does [Zhu et al, 2023]'s idealized model capture real-world RLHF algorithms?"

One main way in which [Zhu et al, 2023]'s model captures real-world issues is in the necessity of using pessimistic policy optimization in the RLHF setting. That is, they show that one gets asymptotically worse convergence rates using standard policy optimization. Real-world RLHF algorithms generally include a KL-divergence regularizer designed to ensure that the learned policy does not deviate too far from the original LLM, which can be seen as a form of pessimism in policy optimization.
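For concreteness, that KL regularizer is usually applied as a per-sample penalty on the reward in PPO-based RLHF pipelines; the sketch below shows the standard shaping $r(x,y)-\beta\log\big(\pi(y\mid x)/\pi_{\mathrm{ref}}(y\mid x)\big)$, which is the common practical recipe rather than an algorithm from our paper.

```python
def kl_penalized_reward(reward, logprob_policy, logprob_ref, beta=0.1):
    """Standard RLHF reward shaping: penalize deviation from the reference policy."""
    return reward - beta * (logprob_policy - logprob_ref)
```

A larger $\beta$ keeps the learned policy closer to the reference model, which is the sense in which the regularizer acts like pessimism.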

"(2) How does adding DP to their model really inform RLHF in practice?"

The high-level takeaway of our DP results for practical RLHF is that one can maintain privacy without asymptotically affecting the amount of data that needs to be collected for RLHF to succeed. This sets a target for practical research by demonstrating that, in the theoretical setting where provable statistical guarantees are possible, privacy for RLHF comes at no additional cost.

Review (Rating: 5)

The primary aim of this paper is to explore offline Reinforcement Learning in situations where the agent is limited to observing human feedback in the form of preference rankings rather than direct reward information. In contrast to prior studies, the authors incorporate the Gaussian mechanism to safeguard sensitive information and put forth a private Maximum Likelihood Estimation (MLE) algorithm. The authors contend that the proposed algorithm achieves a near-optimal sub-optimality gap with a guarantee of differential privacy.

Strengths

  1. Protecting private information is of paramount importance in reinforcement learning, and the proposed algorithm attains a differential privacy guarantee without compromising performance.

  2. The paper is well-written and easy to comprehend.

Weaknesses

  1. This study lacks novelty, as the algorithm essentially integrates previous Reinforcement Learning with human feedback results [1] with a Gaussian mechanism, which is a widely employed approach for ensuring differential privacy guarantees.

[1] Principled reinforcement learning with human feedback from pairwise or k-wise comparisons.

  2. In this work, the authors assert that the sub-optimality gap is near-optimal. However, the absence of a lower bound in this study creates confusion regarding the actual near-optimality of the results.

  3. In addressing the general Markov Decision Process (MDP) setting, the authors exclusively focus on estimating the reward function and undertaking pessimistic policy optimization. However, a fundamental challenge in learning an MDP lies in acquiring knowledge about the transition probability function $P$. How the occupancy measure $\rho$ for a given policy $\pi$ is determined remains unclear. The assumption appears to presume that the transition process is already known, effectively simplifying the MDP problem to a bandit problem and streamlining the learning process. It is imperative to explicitly state all assumptions before making any claims.

  4. The proposed algorithm lacks experimental results to substantiate its efficacy.

Questions

  1. On page 4, in the definition of the Plackett-Luce model, $m$ is not mentioned. It seems more appropriate for it to be $k$.

  2. The Gaussian mechanism in Algorithm 2 is commonly utilized to ensure differential privacy guarantees. Nevertheless, it would be beneficial for the author to provide additional explanations for the private process in Algorithm 1, as it appears to deviate from the standard procedure.

  3. In the related work section, it appears that the author overlooked several works that also aim to provide differential privacy guarantees [1,2]. It would be valuable for the authors to provide comments on the relationship or differences between their approach and these existing works.

[1] Locally differentially private reinforcement learning for linear mixture markov decision processes.

[2] Differentially private reinforcement learning with linear function approximation.

Comment

Related work: [1] Locally differentially private reinforcement learning for linear mixture markov decision processes, [2] Differentially private reinforcement learning with linear function approximation.

Thank you for pointing out this related work, we will gladly add it to the related work section of the paper. To briefly highlight the differences with our work, note that in the RLHF setting the goal is to learn a reward function, and then train a policy using the learned reward. In contrast, the papers [1] and [2] both achieve differential privacy in standard reinforcement learning settings, where a reward function is already given as part of the interaction with the MDP. The difference in settings is significant even without differential privacy, and different algorithmic choices are required (e.g. optimistic algorithms in standard MDPs, vs pessimistic policy optimization needed for RLHF).

AC Meta-Review

This paper proposes a framework for achieving DP in RLHF.

Pros: The problem is well motivated and the paper is well written.

Cons: Reviewers raised various concerns about the lack of technical novelty and the insufficient experiments.

Why Not a Higher Score

The Cons were not addressed after the rebuttal, unfortunately.

Why Not a Lower Score

n/a

Final Decision

Reject