PaperHub
Overall rating: 6.4 / 10
Poster · 5 reviewers
Ratings: 5, 4, 3, 5, 3 (min 3, max 5, std dev 0.9)
Confidence: 3.4
Novelty: 3.0
Quality: 3.0
Clarity: 2.8
Significance: 2.8
NeurIPS 2025

Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

In the context of off-policy RL, we give a theoretical analysis of the role of an additive reward correction in improving performance, accompanied by experiments on bandits and LLM posttraining.

Abstract

Reinforcement learning (RL) is increasingly used to align large language models (LLMs). Off-policy methods offer greater implementation simplicity and data efficiency than on-policy techniques, but often result in suboptimal performance. In this work, we study the intermediate range of algorithms between off-policy RL and supervised fine-tuning by analyzing a simple off-policy REINFORCE algorithm, where the advantage is defined as $A=r-V$, with $r$ a reward and $V$ some tunable baseline. Intuitively, lowering $V$ emphasizes high-reward samples, while raising it penalizes low-reward ones more heavily. We first provide a theoretical analysis of this off-policy REINFORCE algorithm, showing that when the baseline $V$ lower-bounds the expected reward, the algorithm enjoys a policy improvement guarantee. Our analysis reveals that while on-policy updates can safely leverage both positive and negative signals, off-policy updates benefit from focusing more on positive rewards than on negative ones. We validate our findings experimentally in a controlled stochastic bandit setting and through fine-tuning state-of-the-art LLMs on reasoning tasks.
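
For intuition, here is a minimal, self-contained sketch (our own illustration, not code from the paper) of the off-policy REINFORCE update described above, on a small softmax bandit policy trained from a frozen behavior policy:

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms, V, lr = 10, 0.2, 0.5           # V is the tunable baseline; lowering it emphasizes high-reward samples
rewards = rng.uniform(size=n_arms)     # fixed (unknown) reward per arm
logits = np.zeros(n_arms)              # trained policy pi = softmax(logits)
mu = np.full(n_arms, 1.0 / n_arms)     # frozen behavior policy (uniform), i.e. fully off-policy data

for _ in range(2000):
    y = rng.choice(n_arms, p=mu)                      # sample from mu, not from pi
    pi = np.exp(logits - logits.max()); pi /= pi.sum()
    grad_logpi = -pi; grad_logpi[y] += 1.0            # d log pi(y) / d logits for a softmax policy
    logits += lr * (rewards[y] - V) * grad_logpi      # REINFORCE step with advantage A = r - V

print("best arm:", rewards.argmax(), "| most likely arm under pi:", logits.argmax())
```
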
Keywords

Reinforcement learning, off-policy RL, LLM finetuning, bandits

Reviews and Discussion

Review
Rating: 5

The paper 'Reinforcement Learning From Others' Successes, Not Their Failures' proposes to study the off-policy RL surrogate objective and discusses the implications of different choices of the baseline value function $V$. The paper develops a clean theoretical analysis of the off-policy vanilla policy gradient (off-VPG) method. It also provides a simple off-policy algorithm for LLM fine-tuning with almost no engineering overhead. Experimental validation verifies the finding that a small baseline $V$ benefits training.

Strengths and Weaknesses

Strengths: The paper has a clear motivation (off-policy RL) and theoretical analysis. The experiments (both bandit and LLM) are well presented and verify the theoretical findings with convincing evidence.

Weaknesses (please respond to the Questions section directly): it is unclear why only the surrogate loss is analyzed; the theoretical analysis requires more interpretation, as it is not yet intuitive; the experiments might not be adequate and require further comparison with on-policy algorithms.

Questions

  1. First, I'm curious why the authors chose to study the surrogate loss function (1). As the authors claimed "As it does not involve any off-policy correction, such as importance sampling, we do not expect this algorithm to converge to an optimal policy in general." If the purpose is to study the off-policy algorithm as close to what it is in practice, I'm wondering if the analysis will be altered if a KL regularized or clipped objective is used (as in PPO objective).
  2. It seems not very clear to me what Theorem 4.2 implies. In particular, when $V<V^{\mu}$, the optimal policy seems to be a reweighting of the sample policy $\mu$ by the reward minus the value, $r(y)-V$, but for $V\geq V^{\mu}$ I find it very hard to interpret the results.
  3. What if we use Adam or another adaptive algorithm in the experiments? Would the effect of different $V$ be mitigated?
  4. The "Intuition" paragraph is indeed intuitive; however, for LLM fine-tuning we often have continuous reward values rather than a binary reward. Could the authors comment on the implications of the theory for this kind of reward?
  5. Is the same collapse phenomenon observed in on-policy algorithms? I'm also curious how the proposed method performs compared to on-policy algorithms.

Limitations

The limitations are reflected in the questions section.

I don't think there are any potential negative societal impacts for this work.

Formatting Issues

NA

Author Response

We are very thankful to the reviewers for their time and for their insightful comments.

While the reviewers seem to agree that the paper's contributions are impactful, two shared concerns appear:

  • a lack of diversity in the experimental setup (not enough models, not enough datasets), and
  • a lack of comparison to popular RL algorithms.

We want to address these shortcomings in this general response (which we include in all individual responses), before answering each reviewer's questions separately.

Our initial intent was for the article to be equally balanced between a theoretical section, a bandits section, and a section on real-world applications, which is why we chose to keep the experimental section concise. However, the reviewers' comments have helped us realize that the paper would be strengthened by the inclusion of additional experiments, and we thank them for that.

We have run a series of new experiments, which we hope will fully answer their concerns. In particular, we add two additional models (Qwen 2.5 3B and Llama 3.2 3B), a much larger dataset (NuminaMath) and a comparison to GRPO. The resulting tables can be found below (sadly, a recent change in the conference's rules does not allow us to share graphs anymore):

Table 1: Training dynamics of the off-VPG algorithm across different baselines. This extends our Figure 3 with additional models and datasets, with the same hyperparameters. Reported values are the maximum pass@1 accuracy (in %) on the test set observed during training (2000 steps, with evaluation every 200 steps). Some runs crash during training, as in Figure 3; these are marked in bold.

| Dataset | Model | δV=-0.5 | δV=-0.3 | δV=-0.2 | δV=-0.1 | δV=0 | δV=0.1 | δV=0.2 | δV=0.3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MATH | Llama8B | 48.7 | 49.1 | 49.3 | 49.3 | 48.9 | 47.5 | 46.1 | 43.5 |
| MATH | Llama3B | 45.9 | 46.7 | 46.4 | 46.4 | 45.4 | 44.5 | 43.9 | 45.0 |
| MATH | Qwen3B | 62.2 | 63.5 | 65.1 | 65.6 | 64.9 | 37.8 | 14.7 | 13.0 |
| NuminaMath | Llama8B | 20.5 | 21.5 | 22.2 | 23.6 | 23.1 | 17.8 | 18.0 | 17.2 |
| NuminaMath | Llama3B | 20.5 | 21.8 | 21.8 | 21.7 | 22.7 | 20.7 | 19.8 | 19.6 |
| NuminaMath | Qwen3B | 20.3 | 37.6 | 40.6 | 42.0 | 40.8 | 23.7 | 21.9 | 19.7 |

In our article, Figure 3 provides empirical evidence of how the choice of baseline δV affects the performance of our off-VPG algorithm. We extend the same experiment to two additional models (Llama3B and Qwen3B) and an additional dataset (a 144k-token subset of NuminaMath).

We consistently observe the same two phenomena highlighted in Figure 3:

  • When the baseline is large (δV ≳ 0), training becomes unstable and often crashes (bold entries in Table 1).
  • As the baseline increases, the training becomes faster—reflected in the average left-to-right progression of higher scores within each row (for runs that did not crash).

The general appearance of the training and test accuracies is also consistent with our earlier experiments. These results reinforce the robustness of our recommendation: use a slightly negative baseline, such as δV = -0.1.

Table 2: Comparison of off-VPG and GRPO, off-policy (10000 steps, with evaluation every 200 steps)

| Dataset (Train & Test) | Model | off-VPG (off-policy) | GRPO (off-policy) |
| --- | --- | --- | --- |
| MATH | Llama8B | 52.0 | 50.9 |
| MATH | Llama3B | 49.0 | 49.1 |
| MATH | Qwen3B | 64.8 | 61.7 |
| NuminaMath | Llama8B | 26.1 | 24.8 |
| NuminaMath | Llama3B | 25.5 | 25.1 |
| NuminaMath | Qwen3B | 44.3 | 40.8 |

We conduct in Table 2 a performance comparison between GRPO and off-VPG. We use the GRPO variant from (DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, Shao et al.), with KL regularization coefficient β = 0.001 and symmetric clipping at 0.2. For off-VPG, in line with our previous conclusions, we use a fixed baseline of δV = -0.1.

Under these conditions, off-VPG appears to perform slightly better on average than GRPO, despite being simpler. All training runs are stable. We also observe that off-VPG achieves faster score improvements than GRPO, especially when training Qwen3B.

Note that we do not yet claim that off-VPG is universally superior to GRPO--a fully rigorous comparison is beyond the scope of this work. Rather, we argue that its performance is competitive and reliable across diverse setups (three models and two datasets), and we plan on pursuing a more systematic investigation of this comparison in future work.

If the article is accepted, we plan on including two additional graphs in the main text (a variant of Figure 3 with Qwen3B, and a comparison between off-VPG and GRPO with Llama 8B), and to put the other results in the Appendix, as they do not change the article's overall narrative (though they make our conclusions more robust).

Note that we are very willing to run additional experiments in the remaining discussion period, should the reviewers deem it necessary.

--- End of general response ---

  • First, I'm curious why the authors chose to study the surrogate loss function (1). [...] I'm wondering if the analysis will be altered if a KL regularized or clipped objective is used (as in PPO objective).

It was precisely the simplicity of the proposed loss which raised our interest: we wanted to study the potential impact of the simplest lever of action that we could think of, namely our baseline V. Adding a KL regularization or an importance ratio would certainly dramatically impact the theoretical analysis, though we expect that a KL regularization would mainly have the effect of keeping the trained policy closer to the reference policy, as in other algorithms. We leave this for future work. Note however that we have included preliminary comparisons to GRPO in our general response (Tables 2 and 3), where our loss seems to be competitive.

  • It seems not very clear to me what Theorem 4.2 implies. In particular, when $V<V^\mu$, the optimal policy seems to be a reweighting of the sample policy $\mu$ by the reward minus the value, $r(y)-V$, but for $V\geq V^\mu$ I find it very hard to interpret the results.

In the case $V<V^\mu$, the limit policy (which is not necessarily optimal) is more precisely the positive part of a reweighting of $\mu$, minus a constant. The key takeaway is that the probability mass of an action $y$ under the limit policy is a somewhat complicated increasing function of $\mu(y)(r(y)-V)$. When $V\geq V^\mu$, the limit policy is generically going to be a singleton (generically w.r.t. the choice of rewards and of $\mu$): large values of $V$ lead to a loss of diversity in the limit policy.

  • What if we use Adam or another adaptive algorithm in the experiments? Would the effect of different $V$ be mitigated?

In fact, our LLM experiments are already conducted using AdamW, as is common when training large transformers (this is stated in Appendix B). Figure 4 shows that the effect is not mitigated.

  • The "Intuition" paragraph is indeed intuitive, [...] for this kind of rewards?

All of our theorems hold for continuous rewards (we never assume that the rewards are discrete). We only give the example of $\{0,1\}$ rewards to help the readers' intuition, but the general idea is the same in the case of continuous rewards: letting $V$ be large puts comparatively more emphasis on the trajectories with the lowest rewards, and vice versa.

  • Is the same collapse phenomenon observed in on-policy algorithms? I'm also curious how the proposed method performs compared to on-policy algorithms.

Regarding the impact of the baseline $\delta V$ on the risk of collapse, we have run some very preliminary experiments in an on-policy setting and observed similar results, but the risk of collapse seems to be more dependent on the choice of model, hyperparameters, and possibly even codebases (as many efficient RL codebases are never 100% on-policy, to minimize the downtime of the worker GPUs tasked with generating training trajectories). These are still preliminary experiments from which we do not draw definitive conclusions. We have also run a comparison with GRPO in an on-policy setting, reported in Table 3 below.

Table 3: Comparison between off-VPG and GRPO, on-policy (10000 steps with evaluation every 200 steps, bold values indicate that the run collapsed during training)

| Model | off-VPG seed 1 | off-VPG seed 2 | GRPO seed 1 | GRPO seed 2 |
| --- | --- | --- | --- | --- |
| Qwen2.5-3B | 65.4 | 64.2 | 61.8 | 61.9 |
| Llama3.1-8B | 52.0 | 52.4 | 50.0 | 50.1 |

Though it is not the main focus of our work, we also conducted a comparison between the off-VPG algorithm and GRPO in an on-policy setting to assess whether off-VPG remains competitive. The results are presented in Table 3. The experiment uses the MATH dataset, with hyperparameters identical to those used in Table 2, except for the update frequency of the actor policy. In this experiment, updates occur at every gradient step, unlike previous experiments where updates occurred every 250 steps.

We run two seeds for each configuration and report the highest score achieved on the test set. All learning curves increase steadily and plateau after approximately 3000 steps. The plateau is consistently higher for off-VPG than for GRPO, across both seeds and models. We also observe a sharp performance collapse during training in 3 out of 4 GRPO runs (highlighted in bold in the table), while no such collapse occurs with off-VPG.

These results suggest that off-VPG remains a robust and effective choice in on-policy settings, with the added benefit of a simple implementation.

Comment

I thank the authors for their detailed response. I'll maintain my positive score.

Review
Rating: 4

The paper analyses a simple off-policy REINFORCE algorithm (off-VPG) whose advantage is $A=r-V$, with a user-tunable baseline $V$.

  • Theory: For tabular softmax policies, the authors prove exponential convergence of off-VPG to a limit distribution $\pi^{*}_{\mu,V}$ (Theorem 4.2) and show a phase transition at the critical value $V=V^{\mu}$ (the behavior-policy mean reward), whereby the policy’s support suddenly collapses to a singleton. Repeated application forms a policy-improvement scheme with a guaranteed reward increase when $V<V^{\mu}$ (Theorem 4.3).
  • Intuition: A lower $V$ accentuates high-reward (“success”) trajectories, while a higher $V$ emphasises pushing down low-reward (“failure”) ones; unlike the on-policy case, this baseline changes the expected gradient direction off-policy.
  • Experiments: (i) A 100-arm bandit shows the predicted phase transition in reward, support, and entropy when $V$ crosses $V^{\mu}$. (ii) Fine-tuning Llama-3 8B on the MATH dataset confirms that setting $\delta V \approx -0.1$ (i.e. slightly below the prompt-wise mean reward) yields stable learning, whereas $\delta V \ge 0$ causes catastrophic collapse.

The work offers a computationally cheap knob—just shift the baseline—to trade off stability versus exploration when using off-policy trajectories for LLM RLHF.

Strengths and Weaknesses

Strengths

  1. Rigorous characterization of off-VPG dynamics; the explicit closed form of $\pi^{*}_{\mu,V}$ and the monotone support shrinkage clarify how the baseline controls the bias/variance trade-off.
  2. Actionable guideline: choose $V$ slightly below the behavior-policy mean to avoid support collapse, validated on both bandits and an 8B LLM.
  3. Engineering simplicity: no importance sampling, no critic, minimal code changes, yet competitive stability.
  4. Clear discussion of why on-policy intuition fails off-policy and how this links to practical RLHF sampling delays.

Weaknesses

  1. Limited model and task diversity: only one LLM architecture (Llama-3 8B) and one dataset (MATH).
  2. Missing baselines: lacks direct comparison to IS-PPO, GRPO, or regularized PG, so relative performance ceiling is unclear.
  3. Statistical under-reporting: LLM results use three seeds without CIs; bandit plots omit error bars.
  4. Code not yet released, reducing reproducibility.

Questions

  1. Lines 596–599: The Lyapunov function $\Phi_t$ is said to be “bounded above on the simplex”. Strictly speaking, $F$ is concave when $b>0$; here “above” could be re-worded as “has a finite supremum”.
  2. Lines 628–631: The bound $\|\nabla^2F\|\le 2b$ → step-size $<1/b$. Readers may ask why the factor 2 disappears; it is worth stating that any constant $< 2/b$ suffices, but $1/b$ is chosen for margin.
  3. In the third statement of Theorem 4.3, does the author mean that by setting the value of $V$ within a certain range, we can ensure that the policy value converges to the optimal value? If so, the use of the phrase “if and only if” is quite confusing.

Limitations

  1. Proofs rely on a finite-arm bandit with a tabular softmax policy; behavior with function approximation and delayed updates remains open.
  2. The monotonic-improvement claim hinges on picking a baseline $V$ that is less than the unknown expected return $V^{\mu}$, yet in real-world RLHF that return is buried under reward noise, so telling practitioners to “just choose a slightly smaller $V$” simply dumps the toughest task back onto them.

Justification for Final Rating

The authors have addressed my concerns during the rebuttal phase. I maintain my positive evaluation.

Formatting Issues

N/A

Author Response

  • Limited model and task diversity: only one LLM architecture (Llama-3 8B) and one dataset (MATH), and
  • Missing baselines: lacks direct comparison to IS-PPO, GRPO, or regularized PG, so relative performance ceiling is unclear.

We have run additional experiments with two other models, one additional dataset, and a baseline (GRPO); see the general response above.

  • Statistical under-reporting: LLM results use three seeds without CIs; bandit plots omit error bars.

Figure 3 was only run with 3 seeds because training $8 \cdot 3$ 8B models is quite expensive. Do you think that showing intervals representing the variance of these trajectories would improve the figure? In Figure 4, we show all trajectories for 7 different seeds to demonstrate the robustness of the studied phenomenon to randomness. In the bandit setting, we study the expected version of the algorithm: as such, the gradient descent is non-stochastic. The only randomness comes from the drawing of the rewards associated with each arm, i.e. when defining the bandit task. Combining results corresponding to different tasks (e.g. with error bars) yields a graph that is very hard to interpret. We have observed that the general appearance of the curves is very stable w.r.t. the random seed; should we add variants of these curves for other random seeds in the Appendix?

  • Code not yet released, reducing reproducibility.

Our code is shared with another team within our lab; we should be able to release it within a month (perhaps even in time for the camera-ready version).

  • Lines 596–599: The Lyapunov function $\Phi_t$ is said to be “bounded above on the simplex”. Strictly speaking, $\Phi_t$ is concave when $b>0$; here “above” could be re-worded as “has a finite supremum”.

Thank you for this wording suggestion.

  • Lines 628–631: The bound $\|\nabla^2 F\|\leq 2b$ → step-size $<1/b$. Readers may ask why the factor 2 disappears; worth stating that any constant $< 2/b$ suffices, but $1/b$ is chosen for margin.

Thank you for pointing out this potential confusion. The factor of 2 disappears because we are using the bound on the step size, $< 2 / \|\nabla^2 F\|$. We are not aware of a tighter result.

  • In the third statement of Theorem 4.3, does the author mean that by setting the value of $V$ within a certain range, we can ensure that the policy value converges to the optimal value? If so, the use of the phrase “if and only if” is quite confusing.

As you guessed, the statement should be understood as "the limit policy is optimal if and only if $V<V_{0,\mu}$". We agree that the phrasing is awkward, and will change it.

  • Proofs rely on a finite-arm bandit with a tabular softmax policy; behavior with function approximation and delayed updates remains open.

The behaviour with respect to delayed updates is described, in the case of finite-arm bandits, by Thm 4.3. We agree that going from a tabular setting to the case of LLMs is a non-trivial change, which our theory does not fully capture (though our experiments show that the intuition from bandits still applies in the case of LLMs). In our defense, there is no agreed-upon mathematical framework that convincingly captures the regularity of transformers with respect to the prompts $x$ (e.g. the fact that similar prompts are treated similarly), and such a framework would be needed to reach theoretical conclusions regarding LLMs.

  • The monotonic-improvement claim hinges on picking a baseline $V$ that is less than the unknown expected return $V^\mu$, yet in real-world RLHF that return is buried under reward noise, so telling practitioners to “just choose a slightly smaller $V$” simply dumps the toughest task back onto them.

In the case of LLMs, which is our main motivation, this problem is partially averted thanks to the additive renormalization by the prompt-dependent average reward $V^{\mu(\cdot|x)}$, which we estimate with the group average $\hat{V} = \frac{1}{G}\sum_{i=1}^G r(y_i,x)$: the objective becomes $\mathbb{E}\left[\frac{1}{G} \sum_{i=1}^{G} \left(r(y_i,x) - (\hat V + \delta V)\right) \log \pi(y_{i}|x)\right]$, where the expectation is w.r.t. $x \sim D$, $\{y_i\}_{i=1}^{G} \sim \mu(\cdot|x)$. As an estimator of $V^{\mu(\cdot|x)}$ is subtracted from the rewards for prompt $x$, the expected return after the renormalization is $0$, which means that the critical value now becomes $\delta V = 0$ uniformly across prompts $x$. Our experiments (including the new ones) show that $\delta V = -0.1$ seems like a safe bet, though we do not claim that it is optimal.
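
A minimal sketch of how this group-renormalized loss could look in practice (hypothetical code, not the authors' implementation; the function name `off_vpg_loss` and the tensor layout are our own assumptions):

```python
import torch

def off_vpg_loss(logprobs: torch.Tensor, rewards: torch.Tensor, delta_v: float = -0.1) -> torch.Tensor:
    """Off-policy REINFORCE loss with a group-average baseline shifted by delta_v.

    logprobs: (B, G) sum of log pi(y_i | x) over tokens, for G completions per prompt.
    rewards:  (B, G) scalar reward r(y_i, x) for each completion (sampled from the behavior policy mu).
    """
    v_hat = rewards.mean(dim=1, keepdim=True)       # group-average reward, estimate of V^mu(.|x)
    advantage = rewards - (v_hat + delta_v)         # r - (V_hat + delta_V); delta_v < 0 emphasizes successes
    # No importance ratio pi/mu is applied; the advantage is treated as a constant weight.
    return -(advantage * logprobs).mean()           # minimizing this ascends the surrogate objective

# Example usage: loss = off_vpg_loss(logprobs, rewards, delta_v=-0.1); loss.backward()
```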

Comment

Thank you for your response. I don't have further questions.

Review
Rating: 3

This work discusses the design of off-policy RL for LLM fine-tuning. By introducing a baseline V, the paper analyzes its impact on policy convergence and explores how different baseline settings affect model performance. The conclusion drawn is that higher optimization weight should be given to positive trajectories, achieved by setting a baseline slightly lower than the expected value of the behavior policy.

Strengths and Weaknesses

Strengths: Utilizing off-policy RL to optimize LLMs holds research value and may enhance the efficiency and deployment of RL-based fine-tuning. The theoretical analysis provides substantial support for the algorithm.

Weaknesses:

  1. The conclusion that baselines in off-policy learning not only influence the variance, as in on-policy algorithms, but also introduce policy bias is not particularly novel within the off-policy RL literature. Similar conclusions have been discussed for previous off-policy RL algorithms. Besides, the final conclusion (optimize more on positive trajectories) aligns closely with intuition. These factors somewhat diminish the contribution of the theoretical analysis.
  2. The off-VPG is evaluated merely with one model on one dataset. Thus, the experimental support for the algorithm is limited.
  3. There is no comparison or discussion regarding on-policy fine-tuning. Although off-policy methods may have advantages in efficiency or implementation, is there a trade-off between performance and efficiency?
  4. This off-policy algorithm design does not specifically target LLMs and lacks discussions related to existing off-policy RL algorithms.

Questions

  1. There are relevant discussions on the influence of bias in off-policy contexts within Off-Policy Actor-Critic[1]. It would be better if a reasonable citation were added.
  2. Off-VPG does not account for the relationship between behavior policy and current policy when utilizing trajectories. Off-policy policy gradient algorithms often correct bias through the importance weights. Should this be considered as a baseline? For instance, naive Off-Policy PG [2].
  3. Should the experiments be conducted on larger datasets to further validate the effectiveness of Off-VPG? Additionally, the evaluation is only performed on the MATH dataset, without assessing the LLM's performance on general textual tasks or other mathematical benchmarks. The paper lacks a more comprehensive experimental validation.
  4. Some specific implementation details for Off-VPG are missing. For example, is the behavior policy completely unupdated or updated with a delay? If it remains unupdated, could it degrade into rejection sampling fine-tuning?
  5. There is an absence of comparisons between on- and off-policy algorithms concerning performance and sample efficiency to substantiate claims such as "Off-policy methods offer greater implementation simplicity and data efficiency than on-policy techniques."

[1]Off-Policy Actor-Critic. Thomas Degris, Martha White, and Richard S. Sutton. ICML 2012.

[2]Off-Policy PG: https://lilianweng.github.io/posts/2018-04-08-policy-gradient/

Trivial:

  1. The organization and formatting of theorem sections make them difficult to read.
  2. The x-axis title in Figure 2 lacks clarity.
  3. Color differentiation in line graphs is too low. It takes effort to discern them.

Limitations

yes

Justification for Final Rating

This article offers a perspective on how to weight good and bad samples in off-policy data when updating LLMs, which is worth studying. However, the authors do not clearly delineate prior work on off-policy RL, the article lacks a discussion of applicability compared with on-policy methods for LLMs, and the experiments are insufficient to adequately support the claims. Therefore, I still believe this paper is not ready for NeurIPS 2025, from both the presentation and experimental aspects.

Formatting Issues

No formatting issues.

Author Response

  • The conclusion that the baselines [...] contribution of the theoretical analysis slightly.

It is true that the observation that baselines introduce biases is not novel; our main contribution here is a theoretical analysis of the effects of such a bias, as well as empirical confirmation that such biases can have a positive impact. Regarding the final conclusion, it is in fact very common to hear that negative trajectories can be just as important as, or even more important than, positive ones (see e.g. [1]). A common misconception is that penalizing negative samples has the effect of encouraging diversity and exploration, because the probability mass diverted from the penalized action would be somewhat uniformly distributed over the other actions. This is in fact not the case, as shown by our results: when penalizing a negative action, most of the probability mass goes to the modes of the distribution $\pi$, as visible in the proof of Thm 4.2, which leads to the lack of diversity described in Thm 4.2 ("the rich get richer").

[1] The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning, Zhu et al.
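
As a quick illustration of this effect (our own sketch, not taken from the paper): for a softmax policy $\pi(y) = e^{z_y}/\sum_{y'} e^{z_{y'}}$, a gradient step of size $\eta$ that pushes down $\log \pi(a)$ for a penalized action $a$ moves the logits by

$$\Delta z_j = -\eta\,\frac{\partial \log \pi(a)}{\partial z_j} = \eta\left(\pi(j) - \mathbf{1}\{j=a\}\right),$$

so for $j \neq a$ the logit increase is proportional to the current probability $\pi(j)$: the diverted mass flows mostly to the modes of $\pi$ rather than being spread uniformly.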

  • The off-VPG is evaluated merely with one model on one dataset. Thus, the experimental support for the algorithm is limited.
  • Should the experiments be [...] Lack a more comprehensive experimental validation.

We agree; see the additional experiments in our general response.

  • This off-policy algorithm design does not specifically target LLMs and lacks discussions related to existing off-policy RL algorithms.

We see the simplicity of our algorithm (combined with its good performance), which is not specific to LLMs, as an advantage. Are there any particular works that you think would be good additions to our related works section (besides [1])?

  • There are relevant discussions [...].

Thank you very much, we will be sure to add this work to our related works section.

  • Off-VPG does not account for the relationship between behavior policy and current policy [...] For instance, naive Off-Policy PG [2].

It is common in off-policy RL to correct bias with an importance ratio, as in $\mathrm{E}_{y \sim \mu}\left[\frac{\pi(y)}{\mu(y)}A(y)\right]$, where $A(y)$ is some advantage term (e.g. the reward itself). In the context of LLMs, the action $y$ is a long sequence of tokens $t_1,\ldots, t_k$, and the importance ratio $\frac{\pi(y)}{\mu(y)} = \frac{\pi(t_1)\,\pi(t_2|t_1)\cdots}{\mu(t_1)\,\mu(t_2|t_1)\cdots}$ is often almost $0$ or extremely large when $\pi \neq \mu$, which makes it impractical. In practice, the ratio is usually clipped on a per-token basis, e.g. in PPO or GRPO. As GRPO is considered one of the state-of-the-art losses, we compare it to our algorithm in our additional experiments (see the general response).
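
A small numeric sketch of the point about sequence-level ratios (our own illustration, with hypothetical numbers):

```python
import math

# Even a modest per-token log-prob gap between pi and mu explodes at the sequence level.
k = 200                      # tokens in one completion
per_token_gap = 0.2          # log pi(t_i|.) - log mu(t_i|.), assumed constant for illustration
seq_ratio = math.exp(k * per_token_gap)
print(f"sequence-level ratio pi(y)/mu(y) ~ {seq_ratio:.3e}")   # ~ e^40, astronomically large

# Per-token clipping (as in PPO/GRPO) bounds each factor's influence instead.
eps = 0.2
token_ratio = math.exp(per_token_gap)               # ~1.22
clipped = max(min(token_ratio, 1 + eps), 1 - eps)   # clipped to 1.2
print(f"per-token ratio {token_ratio:.2f} -> clipped to {clipped:.2f}")
```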

  • Some specific implementation details for Off-VPG are missing. For example, [...] into rejection sampling fine-tuning?

In Definition 4.1, we define the algorithm that we call off-VPG, in which the behaviour policy remains fixed. The behaviour of this algorithm is described in Thm 4.2, and illustrated in the bandit setting in Figure 1. In the paragraph "Policy improvement scheme with off-VPG", we describe the derived policy improvement scheme in which the behaviour policy is updated with the (current) trained policy $\pi$ with a delay (i.e. every N gradient steps on $\pi$, we update $\mu \leftarrow \pi$), and which we also call the "delayed updates setting". Thm 4.3 applies to this case, and Figures 2, 3 and 4 illustrate it in the bandit and LLM settings, respectively.
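
A schematic of the delayed-update scheme described above (our own sketch; `sample_completions`, `compute_loss`, and `optimizer_step` are placeholder callables supplied by the user, not functions from the paper's code):

```python
import copy

def train_off_vpg(policy, prompts, sample_completions, compute_loss, optimizer_step,
                  refresh_every=250, num_steps=10_000):
    """Delayed-update policy improvement: train pi on data from a frozen behavior policy mu,
    and refresh mu <- pi every `refresh_every` gradient steps."""
    behavior = copy.deepcopy(policy)                   # mu, frozen between refreshes
    for step in range(num_steps):
        batch = sample_completions(behavior, prompts)  # trajectories drawn from mu (off-policy data)
        loss = compute_loss(policy, batch)             # e.g. the off-VPG surrogate with baseline V_hat + delta_V
        optimizer_step(policy, loss)                   # gradient step on pi only
        if (step + 1) % refresh_every == 0:
            behavior = copy.deepcopy(policy)           # mu <- pi (delayed update)
    return policy
```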

  • There is an absence of comparisons [...] than on-policy techniques." and
  • There is no comparison [...] between performance and efficiency?

Our phrasing might have been overly enthusiastic. The goal of this article was not to prove that off-policy RL is better than on-policy RL (e.g. in terms of the performance/efficiency trade-off). Our point of view was rather: practitioners often want or have to train off-policy (to benefit from some of the advantages of off-policy training, or because they are constrained by their setup), which makes off-policy RL worth studying. We will rephrase this sentence in a more neutral way. Though a rigorous study of the trade-offs between on-policy and off-policy RL is beyond the scope of this article and left to future work, we note that our preliminary experiments suggest that off-policy training (with off-VPG or with GRPO) with an update frequency of 250 gradient steps for the actor policy leads to final scores that are similar to those obtained with on-policy training (e.g. compare Tables 1 and 2 from our general response with Table 3 in our response to reviewer oGYm).

Comment

Thanks for the response. While I will raise the score to borderline reject, my view remains slightly negative because the experiments lack adequate evaluation of off-policy RL methods and comparison with potential on-policy baselines, leaving the claims insufficiently supported.

Review
Rating: 5

This paper analyzes off-policy policy gradient (in the bandit setting) and shows how different baselines can affect the learned policy. It shows that when the baseline is lower than the behavior value, the learned policy improves; the behavior value is a critical point, and using a baseline beyond it can lead to a collapse of the policy's support. The algorithm is tested on a synthetic bandit problem and on training an 8B Llama. The experimental results support the theory's insights.

Strengths and Weaknesses

S1. To my knowledge, despite many off-policy PG papers, this paper is the first to directly analyze off-policy PG without importance weights or state-occupancy correction.

S2. The theoretical analyses reveal an interesting landscape of off-policy PG's dynamics and are quite complete. It is also good to see the experimental results corroborate the theoretical analyses.

S3. The paper is overall well-written and well-motivated. The analyses here are timely, given that PG algorithms are gaining interest again in the LLM era.

W1. The current analyses assume a bandit setting, which is different from the contextual setting used in practice. However, I think this limitation is minor given the contribution of the paper.

W2. Some minor writing clarity can be improved.

Questions

  1. In the analyses, are the rewards implicitly assumed to be non-negative?

  2. In Theorem 4.3, does the set $Y^{\inf}$ always exist, since the policy can be stochastic?

  3. I do not understand the statement of point 3 in Theorem 4.3 from the current writing, and reading the proof doesn't help. Can you expand more on what you wish to convey?

  4. Do you have a bound on how large $V^{\inf}$ is? Maybe this is related to the misunderstanding above.

  5. How do the convergence rate and $V^{\inf}$ change with $V$ in Theorem 4.3?

Limitations

Yes

Formatting Issues

No

Author Response

  • In the analyses, are the rewards implicitly assumed to be non-negative?

No, the analyses are invariant to any transformation of the rewards of the form $\alpha \cdot r + \beta$ (and the same holds for the baseline $V$).
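
One way to see this (our own one-line check, assuming the baseline is transformed together with the rewards):

$$\left(\alpha\, r(y) + \beta\right) - \left(\alpha V + \beta\right) = \alpha\left(r(y) - V\right),$$

so the advantage, and hence the surrogate gradient, is only rescaled by $\alpha$, which (for $\alpha > 0$) can be absorbed into the learning rate; the sign structure driving the analysis is unchanged.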

  • In Theorem 4.3, does the set $Y^{\inf}$ always exist, since the policy can be stochastic?

Theorem 4.3 assumes access to the full population of samples (i.e., we know $\mu$) and the ability to perfectly optimize our surrogate objective (i.e., we can apply $\mathcal{T}$ exactly). This removes the stochasticity usually referred to as approximation and optimization error. In this setting, even if $\pi$ is stochastic, $Y^{\inf}$ always exists. We did not address approximation and optimization errors, as they can be handled with classical techniques, which would lengthen the proofs without adding novel insights.

  • I do not understand the statement of point 3 in Theorem 4.3 from the current writing, and reading the proof doesn't help. Can you expand more on what you wish to convey?

The phrasing is indeed a bit ambiguous; thank you for pointing it out, we will rephrase it. The third point of Theorem 4.3 states that the limit reward is going to be optimal if and only if the baseline $V$ is smaller than some threshold $V_{0,\mu}$ (which depends on the interplay between $\mu$ and $\pi^*$).

  • Do you have a bound on how large $V^{\inf}$ is? Maybe this is related to the misunderstanding above.

We have not sought an explicit bound for $V^{\inf}$. A natural lower bound is $V^{\inf} > \max_y \mu(y) r(y)$.

  • How do the convergence rate and $V^{\inf}$ change with $V$ in Theorem 4.3?

This is a very good question that highlights a trade-off: if $V$ is too large, there is a risk of eliminating good arms (indeed, $V^{\inf}$ is a decreasing function of $V$); conversely, if $V$ is too small, the constant $c$ becomes small (indeed, $c$ is an increasing function of $V$). In other words, the larger $V$ is, the faster the convergence, but the higher the risk of converging to a suboptimal policy.

Comment

Thanks for the rebuttal. I maintain my positive rating. Glad to see this phenomenon empirically validated in other settings too.

Review
Rating: 3

RL has been very successful in aligning LLMs for various downstream tasks. However, the most popular RL methods are on-policy and have several drawbacks. This paper proposes a new off-policy method to address the limitations of on-policy RL methods when aligning LLMs. The paper proves two theorems: one regarding the limits of the algorithm in relation to various baselines, and the other concerning policy improvement. Additionally, the paper conducts several experiments using LLaMA on mathematical tasks. The experimental results show that the various baselines, as shown in the proofs, empirically affect the models' downstream performance.

Strengths and Weaknesses

Strengths:

Training off-policy for LLM alignment is a very important problem. The paper's proposal is scalable and has the potential to impact training models in an asynchronous manner. The paper comes with guarantees that hold in practice, which makes it a promising direction.

Weaknesses:

The paper has several shortcomings. First, it omits some key baseline algorithms necessary for a thorough comparison. The evaluation is also limited to a single model and task, which raises concerns about its robustness and applicability to other tasks. Moreover, without including baseline algorithms, it’s unclear how the proposed method compares to existing approaches in the literature, such as [1]. Finally, some of the theorems in the appendix seem underexplained.

[1] Fine-Tuning Language Models with Advantage-Induced Policy Alignment by Banghua Zhu et al. 2023

Questions

  • How does this paper compare to [1], to PPO [2] (in terms of the importance ratio enabling off-policy updates), and to GRPO?
  • The original motivation of the paper was efficiency, but there are no comparisons to on-policy algorithms to demonstrate efficiency in any way.
  • For equation (1), how are you avoiding the importance sampling issue?
  • How do you choose the N to use for delayed updates? How sensitive is the proposed algorithm with respect to N?
  • Are the bounds in the bandit RL setting or the multi-step RL setting?
  • Could you provide more motivation for why you have the additional $\delta V$ term? The fact that the value of $V$ varies for each prompt is expected, so I'm not sure why adding the correction term is necessary.
  • How is $\delta V$ trained in practice? It seems to be different from $V^\mu$, which is computed using multiple rollouts.
  • For most of the bounds in the appendix, there are no assumptions made regarding the ratio $\pi(y)/\mu(y)$, which could be bad if the ratio is not assumed to be bounded, right?

[1] Fine-Tuning Language Models with Advantage-Induced Policy Alignment by Banghua Zhu et al. 2023

[2] Batch size-invariance for policy optimization by Jacob Hilton et al 2022

Limitations

Yes

Formatting Issues

No

Author Response

  • First, it omits some [...] off-policy updates, and GRPO?
  • The evaluation is also [...] in the literature, such as [1].

Regarding the diversity of datasets and models, as well as comparisons to other algorithms (GRPO), we refer to Tables 1 and 2 of the general response above, where it is shown that our method is competitive and that our results are robust. Thank you for pointing out the APA paper, which we will reference in our revised manuscript. This paper shares goals with ours. However, the approaches differ significantly in terms of methodology, theory, and experiments.

  • Finally, some of the theorems in the appendix seem underexplained.

Do you have particular examples in mind? We are happy to give you additional explanations, and potentially add them to the appendix.

  • The original motivation [...] efficiency in any way.

In this context, efficiency should not be understood purely in terms of "final score at the end of training". Some of the practical advantages of off-policy RL, such as the capacity to handle asynchronicity between the worker GPUs (in charge of generating trajectories) and the trainer GPUs (in charge of updating the weights), or the greater simplicity of implementation (no constant communication between worker GPUs and trainer GPUs), are already known to the community. Others, such as the fact that it allows several passes over the same data points (resulting in more data efficiency), have not yet been analyzed in depth in the context of LLMs, but we plan on rigorously studying these in future work. (For a pure score comparison, consider also Tables 1 and 2 from our general response and Table 3 in our response to reviewer oGYm: off-policy and on-policy final accuracies seem comparable, but these are preliminary experiments.)

  • For equation (1), how are you avoiding the importance sampling issue?

We do not face this issue, because the objective function that we optimize is NOT the expected reward $\mathrm{E}_{y \sim \pi}[r(y)]$ of the trained policy $\pi$ (though it is not far from it, as the actor policy $\mu$ is kept close to $\pi$ in the setups that we consider). This is what makes Theorems 4.2 and 4.3 interesting, as they describe the limit policy of a gradient descent on this unusual off-policy objective function.
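
For context, the contrast here is between the on-policy objective and an off-policy surrogate presumably of the form below (our reading, inferred from the group-normalized formula quoted elsewhere in this discussion, not a quote from the paper):

$$J_{\mathrm{on}}(\pi) = \mathbb{E}_{y \sim \pi}[r(y)], \qquad J_{\mathrm{off}}(\pi) = \mathbb{E}_{y \sim \mu}\left[(r(y) - V)\,\log \pi(y)\right],$$

with no importance ratio $\pi(y)/\mu(y)$ in $J_{\mathrm{off}}$; Theorems 4.2 and 4.3 characterize the limit policy of gradient descent on this surrogate rather than on $J_{\mathrm{on}}$.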

  • How do you choose the N to [...] respect to N?

The larger the number $N$ of steps between two policy updates, the closer we are to the limit distribution $\pi_{\mu,V}^\star$ (using the notation of Thm 4.2), hence to the situation that our theory describes. In our bandit experiments, we have observed that the training curves in the policy improvement setting remain similarly shaped when we let $N$ vary, showing the robustness of our conclusions. In real-world applications, the choice of $N$ is often dictated by practical concerns, such as the ratio between the number of trainer GPUs and worker GPUs available.

  • Are the bounds in the bandit RL setting or the multi-step RL setting?

Theorem 4.2 applies in the case where the actor policy $\mu$ is fixed, and gradient descent is performed on the trained policy $\pi$'s logits. Theorem 4.3 describes the policy improvement case, where the policy $\pi$ is trained against a given actor policy $\mu$, then the actor policy is updated ($\mu \leftarrow \pi$) and the process is repeated.

  • Could you provide more motivation [...] is necessary.
  • How is delta V trained in practice [...].

In the bandit setting, $V$ is not trained: it is a fixed hyperparameter. The behaviour of the algorithm depends on whether $V$ is larger or smaller than $V^\mu$. In the LLM setting, as you rightly observe, $V^\mu = V^{\mu(\cdot|x)}$ depends on the prompt $x$. We subtract a correction $\hat V + \delta V$ from each reward $r(x,y)$. $\hat V$ is an estimator of $V^{\mu(\cdot|x)}$, computed as in GRPO (for example). $\delta V$, just like $V$ in the bandit setting, is a fixed hyperparameter: it is the degree of freedom of our algorithm. The reward corrected by $\hat V$, i.e. $r(x,y) - \hat V$, has expected value $0$ w.r.t. $\mu(\cdot|x)$: $\mathrm{E}_{y \sim \mu(\cdot|x)}[r(x,y) - \hat V] = 0$ (here the notation hides the group sampling). Hence having $\delta V < 0$ is the same as having $V < V^\mu$ in the bandit setting, and having $\delta V \geq 0$ is the same as having $V \geq V^\mu$ in the bandit setting (see also our answer to a similar question from Reviewer sBrs).

  • For most of the bounds in the appendix, [...] to be bounded, right?

We assume that the set of $y$ is finite and that $\mu$ has full support, so the ratio is necessarily bounded. The ratio $\pi(y)/\mu(y)$ does not appear explicitly in our derivations, but the interplay between $\pi^\star$ and $\mu$ can be seen implicitly. For example, if $\pi^*$ differs significantly from $\mu$, and the subtracted baseline $V$ is not conservative enough, good arms may be eliminated too early. This can be seen, e.g., in Theorem 4.2: when $V < V^\mu$, we have

$$\operatorname{supp} \pi^*_{\mu, V} = \{y \mid \mu(y)(r(y) - V) - \tau_{\mu, V} > 0\}.$$

Comment

I thank the authors for their thorough response. I will keep my score.

Comment

Dear reviewer,

In response to the weaknesses that you had listed, we have included new models and tasks to our experiments, as well as a comparison to a baseline. Is there any other concern we haven’t yet addressed that, if answered, would make you consider raising your score?

Final Decision

The paper analyzes a simple off-policy policy gradient algorithm, showing both theoretically and empirically that emphasizing positive rewards while down-weighting failures yields policy improvement guarantees and more stable performance when fine-tuning LLMs and in bandit settings. The reviewers generally recognized the novelty of the theoretical analysis and appreciated the empirical validation, though some raised concerns about the limited diversity of experiments and missing comparisons to key baselines. The additional experiments and comparisons added during the rebuttal helped address some of these concerns, but the paper would benefit from a more candid discussion of its limitations. In particular, the reliance on bandit-style analyses and the narrow experimental scope leave open questions about generalization. The authors are encouraged to explicitly acknowledge these limitations and discuss how future work could broaden the empirical validation and strengthen connections to on-policy baselines.