PaperHub
6.8 / 10
Poster · 4 reviewers
Ratings: 4, 4, 4, 5 (min 4, max 5, std 0.4)
Confidence: 3.8
Novelty: 2.5 · Quality: 2.5 · Clarity: 2.5 · Significance: 2.8

Abstract

Keywords

Reinforcement Learning, Test-Time Training, Large Language Model

Reviews and Discussion

Review
Rating: 4

This paper introduces Test-Time Reinforcement Learning (TTRL), a framework for adapting Large Language Models (LLMs) during inference on unlabeled data. The key idea is to apply Reinforcement Learning (RL) using rewards derived from majority voting among model-generated outputs, even at test time, where ground-truth labels are absent. The authors argue that this leads to self-evolving models that generalize better to unseen data. Experimental results show significant performance gains when applying TTRL on various tasks and models.
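To make the reviewed mechanism concrete, the following is a minimal sketch of a majority-voting reward as described above; it is an illustration reconstructed from this summary, not the authors' code, and `extract_answer` is a toy \boxed{}-based parser standing in for whatever verifier the paper actually uses.

```python
import re
from collections import Counter

def extract_answer(text):
    """Toy parser: take the last \\boxed{...} expression, if any (an assumption, not the paper's verifier)."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1].strip() if matches else None

def majority_vote_rewards(rollouts):
    """Return the majority-vote pseudo-label and a binary reward per rollout (1 if it matches the majority)."""
    answers = [extract_answer(r) for r in rollouts]
    valid = [a for a in answers if a is not None]
    if not valid:
        return None, [0.0] * len(rollouts)
    pseudo_label, _ = Counter(valid).most_common(1)[0]
    return pseudo_label, [1.0 if a == pseudo_label else 0.0 for a in answers]
```

For instance, three rollouts ending in \boxed{42}, \boxed{41}, \boxed{42} would yield the pseudo-label "42" and rewards [1.0, 0.0, 1.0], which are then fed to the RL objective in place of ground-truth rewards.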

Strengths and Weaknesses

Strengths

  • Creative use of majority voting: The proposed use of consensus-based surrogate rewards is an interesting direction and aligns with recent trends in weak or self-supervised reward shaping.
  • Strong empirical results: The experiments show substantial improvements on reasoning tasks, suggesting the majority voting mechanism has a nontrivial signal even without labels.

Major Concerns

Methodological Validity of “Training at Test Time”

The core assumption of Test-Time Reinforcement Learning (TTRL) is fundamentally problematic. The paper proposes to train a pre-trained model during inference using rewards inferred via majority voting over generated outputs, thereby effectively updating model weights at test time. This goes against a foundational tenet of machine learning: test time is reserved for inference, not training. While test-time adaptation is a known concept in some domains (e.g., Test-Time Augmentation, few-shot prompting), training on the test set as explicitly described in Section 2.1, even with pseudo-labels, raises major scientific concerns:

  • The TTRL setting violates the standard separation between training and testing data. Despite the lack of ground-truth labels, updating weights using test inputs, regardless of how reward is obtained, constitutes training on the test set.
  • This blurs the line between generalization and memorization. Using test data to update the model, even via a proxy reward, still exploits the test distribution, potentially inflating performance claims.
  • The term “Test-Time RL” is misleading and could give the impression of a valid ML paradigm.

Unless the authors clearly define this as a simulation scenario (e.g., what-if offline adaptation using test-like data) or reframe the setting as an unsupervised domain adaptation task, the current formulation is not scientifically grounded.

Reward Inference is Non-Causal and Optimistic

The rewards in TTRL are generated after the model samples outputs from itself and checks for majority agreement. This is inherently a non-causal, retrospective reward shaping method that I believe may introduce confirmation bias:

  • Majority voting across generations does not reliably capture correctness in tasks like reasoning or math where the majority could be confidently wrong.
  • Optimizing the model to maximize agreement among its own outputs risks reinforcing poor behavior, especially early in training when the model is uncertain.
  • The reported performance boosts might be partially due to overfitting to the test distribution via pseudo-reward engineering.

Minor Issues

  • Line 27: “has recently” → “have recently”
  • Figure 1: The caption could include more details about what is being visualized and its relevance.
  • Mathematical notation: Line 71: the symbol \pi is introduced without definition
  • Writing style: The paper often reads more like a technical report than a formal scientific manuscript. Several sections lack formalism, and some claims (e.g., "we fully advance AI evolution") are overstated and should be toned down.

问题

Below are some questions that I would appreciate being addressed:

  • Why should I consider Test-Time Reinforcement Learning (TTRL) a sound and scientifically valid machine learning approach? Specifically, how do you justify updating model weights during inference time, and in what way does this respect standard ML evaluation principles?

  • How do you ensure that training on test data—even without ground-truth labels—does not lead to overfitting or inflated results?

  • Could the proposed approach be reframed as offline adaptation or self-supervised training, avoiding the test-time terminology?

I believe these questions are critical to the scientific validity of the proposed setting. While I appreciate the strong empirical results of the paper, I remain unconvinced that Test-Time Reinforcement Learning, as currently framed, constitutes a sound machine learning methodology. That said, I am open to reconsidering my evaluation if the authors can provide a clear and rigorous justification for the proposed setting and address the core methodological concerns.

Limitations

yes

Final Justification

See my comment about the clarification about Test-Time RL

Formatting Concerns

No formatting concerns

Author Response

Response to Major Concerns

Weakness 1: Methodological Validity of “Training at Test Time”

We appreciate the reviewer’s thoughtful concern regarding the scientific framing of Test-Time Reinforcement Learning (TTRL) and the potential conflict with traditional machine learning principles. We would like to clarify that TTRL is best viewed as part of a well-established and growing body of research on test-time adaptation and unsupervised test-time training, which has gained substantial traction in recent years [1, 2, 3, 4].

In particular, prior works such as Test-Time Training [1] and Meta Test-Time Training (MT3) [5] have demonstrated that updating model parameters at test time using unsupervised or self-supervised objectives can significantly improve generalization under distribution shift. These methods operate under the same core idea as TTRL: leveraging unlabeled test inputs to adapt the model dynamically, without accessing ground-truth labels.

Further, more recent work such as TeST [6] and BiTTA [7] has applied similar paradigms across diverse domains, showing that model updates based on pseudo-rewards or binary correctness feedback at test time are not only scientifically grounded, but also practically effective. These approaches, like ours, adhere strictly to label-free adaptation settings on test data and do not violate the separation between training and ground-truth-labeled test data.

Thus, rather than violating core ML principles, our work extends this paradigm to LLMs and RL, combining self-supervised test-time adaptation with reward-driven optimization. For clarity and to avoid confusion with conventional evaluation protocols, we will revise our terminology in the final version to emphasize the connection with unsupervised online adaptation or self-supervised test-time optimization.

[1] Test-Time Training with Self-Supervision for Generalization under Distribution Shifts. ICML 2020 (1000+ citations)

[2] Tent: Fully Test-time Adaptation by Entropy Minimization. ICLR 2021 Spotlight (1500+ citations)

[3] A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts. IJCV 2024 (300+ citations)

[4] Continual Test-Time Domain Adaptation. CVPR 2022 (600+ citations)

[5] MT3: Meta Test-Time Training for Self-Supervised Test-Time Adaption. ICAIS 2022 (100+ citations)

[6] TeST: Test-time Self-Training under Distribution Shift. WACV 2023 (30+ citations)

[7] Test-Time Adaptation with Binary Feedback. ICML 2025

Weakness 2: Reward Inference is Non-Causal and Optimistic

The reviewer's concern about confirmation bias introduced by Majority Voting (MV) is valid: errors in MV labels can propagate during training. However, in our method, this issue is significantly mitigated, and we believe it can be further alleviated. We address your concern from several perspectives:

2.1 Inherent Limitation of the Self-Training Paradigm: The use of MV or self-consistency as supervision has become a general paradigm in LLM self-training with unlabeled data [1, 2, 3]. This stems from two key observations: (a) model consistency correlates with correctness, and (b) self-training setups exclude external information, leaving self-consistency as the only viable signal. Thus, confirmation bias is a shared limitation across current self-training approaches, with no fully satisfactory solution yet.

2.2 Reward Accuracy as a Diagnostic Tool: We have proposed "Reward Accuracy" to quantify the quality of the training signal and assess the impact of MV errors. Our experiments across 13 models (LLaMA, DeepSeek, Qwen, Mistral series; 1.5B, 3B, 7B, 8B, 32B; Base, Instruct, Reasoning) show consistent gains with no collapse, even after up to 80 episodes of training. The consistently high Reward Accuracy (with initial values surpassing 90%) across all experiments directly indicates that the confirmation bias introduced by MV errors in TTRL is minimal. We will include these metrics in the next version.
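For reference, the two diagnostics contrasted in this rebuttal (Label Accuracy, i.e., maj@n correctness of the pseudo-label, and Reward Accuracy, i.e., agreement of the majority-vote rewards with ground-truth rewards) could be computed as in the sketch below; this is an illustrative paraphrase of the description, not the released code, and gold answers are used only for offline measurement, never for training.

```python
from collections import Counter

def label_accuracy(answers_per_question, gold_answers):
    """Fraction of questions whose majority-vote pseudo-label matches the gold answer (i.e., maj@n)."""
    hits = 0
    for answers, gold in zip(answers_per_question, gold_answers):
        pseudo = Counter(answers).most_common(1)[0][0]
        hits += int(pseudo == gold)
    return hits / len(gold_answers)

def reward_accuracy(answers_per_question, gold_answers):
    """Fraction of per-rollout binary rewards (agreement with the majority-vote label) that
    coincide with the rewards a ground-truth verifier would have assigned."""
    correct, total = 0, 0
    for answers, gold in zip(answers_per_question, gold_answers):
        pseudo = Counter(answers).most_common(1)[0][0]
        for a in answers:
            correct += int((a == pseudo) == (a == gold))
            total += 1
    return correct / total
```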

2.3 TTRL Improves Supervision Quality and Reduces Confirmation Bias: Unlike traditional self-training (e.g., LMSI [1]), which directly uses MV outputs for SFT, TTRL leverages MV labels in an RL manner, which is inherently more tolerant of label noise (see Section 4.2, Reward Calculations):

  • Improved Supervision Quality: Label Accuracy and Reward Accuracy are the metrics for evaluating the quality of the supervision signals of traditional self-training methods and TTRL, respectively. As shown in Figure 9, when Label Accuracy (maj@n) is only 37%, the corresponding Reward Accuracy reaches 92%, demonstrating the superiority of TTRL over previous methods. This implies that our method extracts much higher-quality learning signals even when the underlying MV label is weak, explaining why TTRL improves even low-performing models (even a model with 2% pass@1), as noted in Response to Weakness 2 of Reviewer 2bYH (last table).
  • RL’s Tolerance to Reward Inaccuracies: As noted by Reviewer JAq8, RL relies on directional and relatively imprecise feedback, making RL inherently more robust to reward inaccuracies, a property that has also been observed in prior work [4]. For example, when the reward accuracy is 92%, the actual noise introduced into training might be even less than 8%.

2.4 TTRL Requires a Minimal Base Competence: Self-improvement methods like TTRL do require a base model with minimal capability. This is natural and expected. Consider an extreme scenario: could a randomly initialized 7B model develop intelligence or complex world knowledge purely through self-supervision? We believe that evolution through self-training builds upon the model’s pre-existing capabilities. Moreover, TTRL can drive substantial improvement even when the model achieves only 2 points pass@1, as noted in our response to Weakness 2 of Reviewer 2bYH (last table). We consider this to represent a sufficiently weak model.

2.5 TTRL Generalizes Beyond the Training Distribution: As shown in Response to Weakness 2 of Reviewer 2bYH (First Table) and Figure 3 in our paper, TTRL demonstrates strong generalization ability, rather than merely overfitting via pseudo-reward engineering.

Lastly, TTRL can be extended to further mitigate confirmation bias, for instance, by selectively using only negatively rewarded responses (since it may be difficult to determine whether a response is correct), introducing randomized rewards (akin to dropout), or applying entropy-based regularization.

[1] Large Language Models Can Self-Improve. EMNLP 2023 (600+ citations)

[2] Recursive Introspection: Teaching LLM Agents How to Self-Improve. NeurIPS 2024

[3] Importance Weighting Can Help Large Language Models Self-Improve. AAAI 2025

[4] What makes a reward model a good teacher? an optimization perspective. Arxiv 2025

Response to Minor Issues

Thank you for carefully reviewing our manuscript and providing such detailed and thorough feedback! We will address these issues and carefully revise the entire paper to improve formalism and avoid any overstated claims.

Response to Questions

  1. Why should I consider Test-Time Reinforcement Learning (TTRL) a sound and scientifically valid machine learning approach? Specifically, how do you justify updating model weights during inference time, and in what way does this respect standard ML evaluation principles?

Regarding the validity of the TTRL setting, we have already cited several highly influential papers above to explain that TTT is currently a widely used experimental setup, and many works have been published through peer review under this setting. Please further refer to Response to Methodological Validity of “Training at Test Time”.

  2. How do you ensure that training on test data—even without ground-truth labels—does not lead to overfitting or inflated results?

As for distribution overfitting, our paper includes comparisons where the model is trained on a single dataset (AIME/MATH/AMC) and tested on other datasets, demonstrating that TTRL possesses strong generalization ability. In addition, we have stated in the paper that our primary goal is to address self-/unsupervised RL problems. Test-time training is a specific scenario (which is also widely studied at present), but our method can also be adapted to conventional training settings, and recent work has provided validation for its effectiveness in these scenarios as well [1]. We will highlight and discuss this point in future revisions of the paper. TTT is merely a broadly accepted evaluation setting, while TTRL can be applied to all unsupervised RL scenarios.

  3. Could the proposed approach be reframed as offline adaptation or self-supervised training, avoiding the test-time terminology?

We will incorporate your suggestions to discuss the relationships among these concepts in the paper, and we will further clarify in the paper that our primary aim is to explore RL in LLMs under unsupervised settings, with the applications to tasks such as test-time training (TTT) and test-time adaptation (TTA) being only preliminary. We will also refine the relevant descriptions and terminology to avoid potential misunderstanding by readers.

[1] Can Large Reasoning Models Self-Train?


We hope these revisions will fully address your concerns. Our work essentially follows and builds upon existing tasks and previous works, and it is not our intention to propose an entirely new ML paradigm. We respectfully hope you could reconsider the score assigned to our paper.

Comment

I would like to thank the authors for taking the time to clarify the concept of test-time reinforcement learning and for pointing to relevant prior work. I have carefully reviewed the suggested references and now have a clearer understanding of the setting and its significance. This additional context has helped me better appreciate the value of the contributions. Given this and the strength of the empirical results, I am now inclined to vote for accepting the paper.

Review
Rating: 4

Over the last year, RL fine-tuning of LLMs has led to significantly improved reasoning abilities in domains such as mathematics and coding. This work introduces Test-Time Reinforcement Learning (TTRL), a method for enhancing an LLM's task performance by fine-tuning on test samples via self-generated rewards. To avoid test-set leakage, these synthetic binary rewards are obtained by generating multiple rollouts and considering the majority answer as correct (reward=1) and others as incorrect (reward=0). TTRL improves performance on a variety of math and coding benchmarks, including MATH-500 and AIME2024. Consequently, this work introduces a simple but effective method to improve performance on hold-out data.

Strengths and Weaknesses

Strengths:

  1. The authors propose a simple method that results in considerable performance gains.
  2. They provide evaluations for 4½ model families (Qwen2.5-Math, Qwen2.5, Qwen3, Llama3.1 IT, Skywork OR1-Math) at the small to medium scale (1.5B, 7/8B, 32B), and show interesting analyses.
  3. The writing is clear, and the takeaways are presented well.

Weaknesses:

  1. The performance improvements are unclear, as the performances for the base models seem to be weaker than those reported in the original publications. For example, when comparing Table 18 in [1] and Table 1 in this work, Qwen 2.5 7B attains 77.6% on MATH-500 in [1] vs. 60.5% in this work. Similarly, the GPQA performances are 36.4% vs. 31.8%. Similar trends are observed for other models (e.g., for MATH 500, 54.8% vs. 48.6% for Llama3.1 8B-Instruct, Qwen3 8B 87.4 vs. 89.3). Therefore, it is unclear where the performance differences come from, and how large the gains of TTRL are.
  2. Connected to point 1, it is a bit unclear what effect TTRL has. The primary performance benefit of TTRL may come from learning to adhere to the restricted token budget and the output format. Moreover, there are differences in the token generation budget between [1] and this work.
    • Therefore, within the restricted budget, the base models may not produce a valid answer in the right format. The majority vote may primarily help filter out too-long answers. Consequently, via TTRL, the model may only learn to adhere to the token budget. In contrast, the base models do not have this advantage and may run against budget limits, producing no valid answer.
    • To reject this hypothesis, an ablation with a larger token generation budget for both the base model and the TTRL fine-tuned models would strengthen the paper. In addition, statistics on how often the base models run into generation budgets would be interesting.
    • This point is very important because the claimed performance gains of TTRL (e.g., 211% on AIME 2024) are pretty large and would need to be revised if the base model performances change.
  3. It is stated that TTRL is very sensitive to hyper-parameters, but no in-depth ablations on this are presented. Also, it would be good to highlight the sensitivity of TTRL to hyperparameters in the main text.

[1] Qwen3 Technical Report

Questions

  1. Could the authors clarify the differences in base model performances between this work and the results reported in other works?
  2. What are your thoughts on the restricted token generation budget? Can you provide an ablation to reject the hypothesis presented above?
  3. Do you see any signs of collapse or reward hacking? For example, does the fraction of the majority class approach 1?
  4. Can you clarify the sensitivity to hyperparameters of TTRL (Appendix C)? It would be valuable to see more in-depth analysis on various temperatures and batch sizes (similar to Figure 10) to understand the sensitivity of TTRL better.

Limitations

yes

Final Justification

New experiments (in Weakness 2) on extended generation budgets strengthen the core message of the paper.

Formatting Concerns

none

Author Response

Response to Weaknesses

Weakness 1: Weak Baselines

We fully understand your concerns, as reliable baseline performance is essential for validating claims, and we would like to address them from the following perspectives:

1.1 Our evaluation strictly follows prior works.

a. First, we would like to clarify that our evaluation strictly follows the evaluation setup used in prior studies without introducing any custom components, using only third-party or officially released tools.

  • Decoding: We follow R1 [1] with temperature = 0.6 and top-p = 0.95 to compute avg@16 (with a 3k context length, following the baseline Dr GRPO [2] for a fair comparison, which significantly affects performance).
  • Prompt Template: We use the official prompt template of the corresponding models via apply_chat_template of transformers, along with the formatting instructions specified in R1.
  • Others: Publicly available datasets, Qwen-Math Verifier, official verl code, etc.

Note: To the best of our knowledge, none of our configurations deviates from accepted practices. Should the reviewer find anything questionable in our evaluation setup, we strongly welcome suggestions and will make the necessary modifications and update the experiments.

b. We are not the only ones achieving such results, and our scores are not consistently lower than those of other studies as the reviewer has claimed. As shown in Table 1 of Luffy [3] and Table 5 (Default) of Spurious Rewards [4], other studies employing similar configurations have reported scores comparable to ours.

1.2 The performance gap is entirely due to differences in evaluation settings, especially for Base models.

First, the large performance gap introduced by prompts and evaluation settings is well studied and acknowledged [2, 4]. Our settings differ from Qwen3, so different results are expected. Comparing scores under different settings is not meaningful, particularly for Base models. By examining the evaluation settings in Table 18 of Qwen3 (as described in Section 4.6), we identify two main differences:

a. Context length: Qwen3 uses context lengths up to 32k, and as high as 38k for AIME2024, whereas our context length is only 3k. This difference alone is sufficient to create a large performance gap, making the results non-comparable.

b. Decoding parameters: The following is a comparison table of decoding parameters, which shows a substantial discrepancy. We simply followed R1’s configuration without tuning these parameters:

| Parameter | Qwen3 (Table 18) | TTRL (follows R1 settings) |
| --- | --- | --- |
| temperature | 0.7 | 0.6 |
| top-p | 0.8 | 0.95 |
| top-k | 20 | -1 |
| presence penalty | 1.5 | 0.0 |

Additionally, the prompt template may differ. Qwen3 did not specify the prompt used for non-Qwen3 models, so we cannot make a direct comparison. However, prompt differences can significantly affect results. Dr GRPO [2] shows that performance on MATH-500 can vary by up to 30 points depending on the prompt. Spurious Rewards [4] even show that randomly generated prompts can sometimes yield the best scores. This does not imply that one should search for random prompts solely to maximize a baseline. We strictly followed the settings used in previous studies and never attempted to intentionally report lower baselines.

1.3 We have provided our code and are willing to assist with reproduction and verification of the above details.

We have provided our code in Supplementary Material and are willing to open-source and maintain it to facilitate reproducibility. We strongly encourage you to run the code for reproduction and verification of the implementation details mentioned above. Moreover, our improvements are already substantial, and minor variations in the scores do not affect the overall conclusions. There is no incentive for us to misreport baseline results.

Weakness 2: Learning to Adapt to Limited Context and Output Format

In our initial implementation, we simply followed the 3k context used by Dr GRPO, as shown in Appendix E. To address the concern that TTRL may only be adapting to the constrained token budget and output formatting, we conducted three sets of experiments:

2.1 Larger token generation budget (on Base models):

As the most direct approach, we extended the context to 8k tokens on Qwen-2.5-Math-1.5B, using 32 rollouts per prompt for both label estimation and training (without downsampling). This setup also reduces the overall computation compared to the original paper’s configuration and matches the computational cost of standard RL.

We perform TTRL on AIME 2024 and evaluate its performance not only on AIME 2024 but also on unseen benchmarks AIME 2025, AMC 2023, and MATH-500 to assess generalization. The evaluation metrics include avg@32 and maj@32.

| Metric | Model | AIME 2024 | AIME 2025 (OOD) | AMC 2023 (OOD) | MATH-500 (OOD) |
| --- | --- | --- | --- | --- | --- |
| avg@32 | Qwen-2.5-Math-1.5B | 7.7 | 3.5 | 29.3 | 31.7 |
| avg@32 | w/ TTRL | 17.9 | 6.5 | 42.5 | 63.7 |
| maj@32 | Qwen-2.5-Math-1.5B | 16.4 | 6.8 | 36.9 | 37.7 |
| maj@32 | w/ TTRL | 27.3 | 12.4 | 55.8 | 76.2 |
  • Sufficient token budget: Throughout training, the token budget was sufficient (average maximum response length: 1968.11 tokens; maximum clip ratio: 1.9%), yet TTRL still yielded substantial performance improvements, indicating that the gains are not due to optimizing the model’s output length.
  • Generalization: Applying TTRL in AIME 2024 led to improved performance on various unseen datasets, including the more challenging AIME 2025. If TTRL were merely optimizing for output length, its effects would likely be limited to prompts seen during training (i.e., AIME 2024) and would not generalize well to new data.
  • Rising Maj@32: We further analyzed the evolution of maj@32 performance. To ensure the validity of this analysis, we controlled for computational budget as mentioned above. The results show that maj@32 increases consistently during training, indicating that TTRL does not simply optimize superficial properties of the original model distribution, such as output length.
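For readers unfamiliar with the metrics reported above, a minimal sketch of how avg@k and maj@k are typically computed per question follows (illustrative only; `is_correct` stands in for the rule-based verifier, and benchmark scores average these per-question values).

```python
from collections import Counter

def is_correct(answer, gold):
    # Placeholder equivalence check; a real math verifier would normalize expressions symbolically.
    return answer is not None and str(answer).strip() == str(gold).strip()

def avg_at_k(sampled_answers, gold):
    """avg@k: mean accuracy of k independently sampled answers for one question."""
    return sum(is_correct(a, gold) for a in sampled_answers) / len(sampled_answers)

def maj_at_k(sampled_answers, gold):
    """maj@k: 1 if the majority-vote answer over the k samples is correct, else 0."""
    majority = Counter(sampled_answers).most_common(1)[0][0]
    return float(is_correct(majority, gold))
```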

2.2 Larger token generation budget (on Reasoning models):

The base models discussed above have limited reasoning lengths and therefore do not fully utilize a larger context. To further evaluate the effectiveness of TTRL under a larger token generation budget, we conduct additional experiments on reasoning models (some results have already been provided in Table 2 of the paper). Specifically, we perform experiments using DeepSeek-R1-LLaMA-8B with a 32k context. The results demonstrate that even with a sufficiently large budget, such as 32k tokens, TTRL continues to deliver substantial performance improvements.

| Model | AIME 2024 | AMC 2023 | MATH-500 |
| --- | --- | --- | --- |
| DeepSeek-R1-LLaMA-8B | 51.7 | 81.6 | 89.6 |
| w/ TTRL | 69.2 | 88.9 | 90.9 |
| Δ | +17.5 | +7.3 | +1.3 |

2.3 Output format (on Instruct models):

It is undeniable that base models often struggle with instruction following, and may benefit from improvements in output formatting during training. To further confirm that the performance gains from TTRL are not merely due to optimizing the output format, we conducted additional experiments using various Instruct models, including the DeepSeek, Mistral, and LLaMA series. These Instruct models have stronger instruction-following capabilities, allowing a more accurate assessment of the improvements introduced by TTRL. The results show that TTRL continues to yield substantial performance gains across all tested models:

| Model | AIME 2024 | AMC 2023 | MATH-500 |
| --- | --- | --- | --- |
| LLaMA-3.2-3B-Instruct | 6.0 | 19.4 | 43.9 |
| w/ TTRL | 13.3 | 31.3 | 61.6 |
| Mistral-Nemo-Instruct | 0.8 | 15.4 | 40.8 |
| w/ TTRL | 0.0 | 24.8 | 51.0 |
| DeepSeek-Math-7B-Instruct | 1.9 | 16.3 | 42.3 |
| w/ TTRL | 2.5 | 22.9 | 52.4 |

Weakness 3: Sensitivity to Hyperparameters

Please see Response to Weakness 2 of Reviewer cYnp.

Response to Questions

  1. Please refer to Response to Weakness 1.
  2. Please refer to Response to Weakness 2.
  3. We have observed no signs of hacking in the vast majority of our experiments. Taking AIME 2024 as an example, we observed almost no hacking behavior even after 80 training episodes (i.e., each sample was seen by the model 80 times). Therefore, we believe that the impact of this issue is currently manageable. Moreover, we consider that it can be addressed by monitoring and controlling the number of training steps; intervening during training can help mitigate the problem, similar in spirit to techniques such as gradient clipping in conventional deep learning (see the monitoring sketch after this list).
  4. Please see Response to Weakness 2 of Reviewer cYnp.
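As one concrete way to implement the monitoring mentioned in point 3, the fraction of rollouts agreeing with the majority answer can be tracked per batch and training stopped if it approaches 1; the sketch below is hypothetical (the function names and the 0.98 threshold are assumptions, not values from the paper).

```python
from collections import Counter

def majority_ratio(answers):
    """Fraction of rollouts for one prompt that agree with the majority-vote answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

def collapse_suspected(batch_answers, threshold=0.98):
    """Flag potential reward hacking / collapse when the batch-average majority ratio nears 1."""
    ratios = [majority_ratio(answers) for answers in batch_answers]
    return sum(ratios) / len(ratios) > threshold
```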

References

[1] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

[2] Understanding R1-Zero-Like Training: A Critical Perspective

[3] Learning to Reason under Off-Policy Guidance

[4] Spurious Rewards: Rethinking Training Signals in RLVR

Comment

Thank you for your detailed response. My comments in weakness 1 specifically targeted the differing generation budgets between your work (3K) and other works (e.g., 32K for Qwen, as you noted in your rebuttal). I understand that the results are therefore not directly comparable. It is good to see that works with comparable settings achieve comparable results. My hypothesis was that TTRL may primarily help the model produce valid answers within the constrained token budget and adhere to the output formatting. However, your new experiments (in Weakness 2) address this, and I think they strengthen the core message of your paper. I will increase my score to 4.

Review
Rating: 4

This paper introduces Test-Time Reinforcement Learning (TTRL), a novel framework for training Large Language Models (LLMs) with Reinforcement Learning (RL) on unlabeled test data. The core challenge addressed is reward estimation during inference without ground-truth labels. The authors leverage common Test-Time Scaling (TTS) practices, such as majority voting, to generate effective reward signals for RL training. TTRL enables LLMs to self-evolve by exploiting priors from pre-trained models, demonstrating consistent performance improvements across various tasks and model scales. Experiments show that TTRL can surpass the upper bound of initial model performance (e.g., majority voting accuracy) and approach the efficacy of RL with ground-truth labels, even using only unlabeled data.

Strengths and Weaknesses

Strengths

  1. This paper proposes TTRL, a novel method for RL on unlabeled data using majority voting, addressing the need for label-free training.
  2. It shows significant improvements (up to 211%) on math reasoning tasks across different model sizes. TTRL surpasses initial model limits and approaches supervised RL performance without ground-truth labels.
  3. It works across tasks (AIME, AMC, GPQA) and RL algorithms (PPO, PRIME), proving generalizability.

Weaknesses

  1. It doesn’t compare directly with the latest unsupervised RL methods (e.g., Genius), weakening claims of superiority.
  2. Its performance depends on settings like temperature and episode counts, which aren’t fully optimized.
  3. It requires many samples for voting, making it resource-intensive (8×A100 GPUs).
  4. It mainly focuses on math tasks; effectiveness in open-domain NLP is unproven.

Questions

  1. Is it equivalent to expanding the number of N in BoN during the RL process?
  2. In Fig. 6, what about increasing the voting number for the backbone? I think a fair comparison is necessary because you have consumed more voting samples in the RL process.

Limitations

yes

Final Justification

With all discussions and other comments of reviewers considered, I intend to maintain my score.

格式问题

None

Author Response

Response to Weaknesses

  1. It doesn’t compare directly with the latest unsupervised RL methods (e.g., Genius), weakening claims of superiority.

Thank you for your suggestion! We note that according to the NeurIPS 2025 FAQ, papers appearing online after March 1st, 2025 are generally considered concurrent work, and authors are not expected to compare against them. Genius was uploaded to arXiv on April 11, 2025, just five weeks before the submission deadline of May 15, 2025. We apologize for being unable to conduct this experiment in time. We have briefly discussed it in the related work, in accordance with NeurIPS policy. Nevertheless, the reviewer's suggestion is greatly appreciated, and we will further investigate this line of work and include additional experiments in future revisions. We believe this weakness should not affect the overall assessment of our manuscript and sincerely hope for your reconsideration.

  2. Its performance depends on settings like temperature and episode counts, which aren’t fully optimized.

First, it is important to clarify that in our experiments, apart from minor adjustments to the episode and batch sizes due to the test-time setup, all other configurations, such as the temperature, follow prior work. The sensitivity we mentioned primarily refers to observations made during our exploration of alternative settings for improving performance, where we noticed a degradation in results, consistent with recent findings [1]. Moreover, the training dynamics metrics proposed in Appendix F can be used to further optimize the hyperparameters. At the time, we did not conduct these experiments due to the high computational cost. However, we fully agree that fine-grained ablation and in-depth analysis of the hyperparameters is necessary and could lead to further insights. We will include additional experiments on hyperparameters, such as ablation studies on batch size and training temperature. We sincerely appreciate your suggestion.

  3. It requires many samples for voting, making it resource-intensive (8×A100 GPUs).

In fact, increasing the number of vote samples followed by downsampling is optional. TTRL can rely solely on the rollouts of standard RL for both voting and training. In this case, the computational cost of TTRL is the same as that of standard RL, without any additional overhead. We provide an experiment under this setting in Response to Weakness 2 of Reviewer 2bYH (2.1). The results show that, with the same computational cost as standard RL, the performance is comparable to that achieved by incorporating additional generations for voting during rollout.

  4. It mainly focuses on math tasks; effectiveness in open-domain NLP is unproven.

We acknowledge that the current version of our experiments mainly focuses on math tasks. This choice was intentional, driven by the verifiability of task outcomes and the stability of the majority-vote-based reward signal. As we emphasized (in line 5 of our paper), TTRL is compatible with arbitrary Test-Time Scaling methods, including generative reward models, and is therefore applicable to general NLP tasks. We will include a discussion on the potential of applying TTRL to open-domain NLP tasks and identify this direction as a key focus for future work. Thanks for your suggestion!

Response to Questions

  1. Is it equivalent to expanding the number of N in BoN during the RL process?

We understand that the concern here refers to increasing the number of samples (N) during inference. Unlike simply enlarging N at inference time, TTRL leverages the reward signal to update the model’s policy distribution, rather than directly adopting the majority vote as the final output. This approach is more favorable for reducing inference-time costs, especially when transferring across different scenarios.

  2. In Fig. 6, what about increasing the voting number for the backbone? I think a fair comparison is necessary because you have consumed more voting samples in the RL process.

This is an excellent question! We are willing to include majority voting over a backbone model with a very large number of votes (denoted as maj@n with large n) as a baseline. The inference cost of this setup matches the total training cost of TTRL, enabling a comparison under equivalent computational budgets. In fact, this baseline is unfavorable to TTRL, as majority voting incurs the full inference cost at prediction time for each evaluation on the same dataset, whereas TTRL is trained only once and avoids repeated inference overhead. To more thoroughly address your concerns, we conduct experiments under this setting, performing a series of TTS evaluations on the AMC2023 using Qwen2.5-Math-1.5B and Qwen2.5-Math-7B. The results demonstrate that the performance of majority voting converges as the number of votes increases, which aligns with previous findings, resulting in diminishing returns (see Figure 5 in [2] and Figure 7 in [3]). Therefore, the maximum achievable backbone maj@n scores are approximately 37.3 and 50.79, respectively, even when more voting is applied. In contrast, after applying TTRL, Qwen2.5-Math-1.5B and Qwen2.5-Math-7B achieve scores of 48.9 and 68.1, respectively, significantly surpassing the backbone maj@n with a large n.

| Metric | Qwen2.5-Math-1.5B | Qwen2.5-Math-7B |
| --- | --- | --- |
| maj@2 | 28.09 | 33.23 |
| maj@4 | 33.68 | 41.2 |
| maj@8 | 37.23 | 45.73 |
| maj@16 | 38.1 | 47.98 |
| maj@32 | 38.43 | 49.19 |
| maj@64 | 38.17 | 49.87 |
| maj@128 | 37.86 | 50.14 |
| maj@256 | 37.55 | 50.4 |
| maj@512 | 37.41 | 50.6 |
| maj@1024 | 37.3 | 50.79 |
| w/ TTRL (pass@1) | 48.9 | 68.1 |
| Δ | +11.6 | +17.31 |

Therefore, even under a computationally fair setting, TTRL retains a clear advantage: it converts compute into generalization capability via learning, rather than brute-force sampling. Moreover, in scenarios with distribution shift, TTRL-trained models can adapt to new test distributions using significantly less computation at inference time compared to large-N majority voting. In our revision, we will provide a more thorough discussion on this topic.

References

[1] POLARIS: A POst-training recipe for scaling reinforcement Learning on Advanced ReasonIng modelS

[2] Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

[3] Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Comment

I'd like to thank you for your detailed response, which has addressed most of my concerns. However, my main concerns regarding Weaknesses 2 and 4 remain unresolved. As a general RL approach for LLMs, its potential across all domains should be demonstrated. With the other reviewers' comments considered, I intend to maintain my score. Thanks again for your effort.

Review
Rating: 5

The paper introduces a novel method for training Large Language Models (LLMs) using Reinforcement Learning (RL) on unlabeled/self-generated data. The core idea is to use majority voting as a way to estimate a reward function used for post-training of LLMs. The paper shows that this simple approach can significantly improve the base models and achieve performance similar to more involved RL schemes (which require human annotation). This paper can be interpreted as an alternative way to perform Test-Time Scaling (TTS) (which has seen much attention in the recent literature).

Key findings:

  • Self-improvement of LLMs: TTRL enables LLMs to self-improve by leveraging the priors in pre-trained models and utilizing majority voting to estimate rewards from unlabeled test data.
  • Consistent Improvement: Experiments demonstrate that TTRL consistently improves performance across various tasks and models.
  • Sustained Self-Improvement: The paper highlights that TTRL achieves sustained self-evolution during "online" RL, dynamically generating higher-quality supervision signals as its capabilities improve.
  • Scalability and Generalization: TTRL scales with model size, meaning larger models can produce more accurate majority voting rewards, leading to more effective learning.

Strengths and Weaknesses

Strengths:

  • The setting of trying to achieve self-improvement of an LLM based on unlabeled examples is highly relevant for the community.
  • The proposed extraction of rewards based on majority voting is simple and concise. Leading to an appealing method with practical impact.
  • The method gives improved results across a variety of base models, a big plus given that many recently proposed test time scaling methods did not show consistent improvements across models.
  • The experiments are extensive, consisting of fine-tuning of multiple different base models.
  • The ablation studies are insightful and were good to read. I like the fact that the authors investigated scaling behavior, which suggests that the larger the model, the bigger the gain provided by TTRL.

Weaknesses:

  • There is one main concern I have: the observed behavior is also explainable from the recently popularized perspective that the pre-trained model already contains the knowledge for significant performance gains; but achieving those gains is difficult via sampling alone (since in long sampling chains the likelihood of a hit is small). If that is the case, couldn't it be that the presented method merely reduces entropy (and does so towards the majority vote) and similar results could be achieved without any RL with a simple tightening of the logits around the majority vote? This concern is corroborated by the fact that the authors observed a clear binary threshold effect (models need to be good enough for TTRL to work effectively).
  • While it is great to see experiments with different models, it is important to recognise that the gains achieved by TTRL do not lift any of the models anywhere close to the state-of-the-art. Additionally, other simple TTS methods such as s1 achieve far superior results. Perhaps this is mainly because the paper only experiments with smaller models and does not contain 32B-parameter model experiments, but it should be discussed (and perhaps experiments with a larger model should be added if feasible).
  • There are some odd/short figure captions. For example the caption in Figure 1 is very terse and what does "Position" mean here?
  • Some typos and ungrammatical sentences. E.g. line 27 "Silver and Sutton has recently" -> "Silver and Sutton have recently". Line 30 "LRM" -> "LLM". Line 41 "How to obtain rewards [...]" -> "How can one obtain rewards [...]". Line 39 "Therefore" does not follow on from the previous sentence? Perhaps say "In this paper". Many more such examples throughout. The paper could benefit from some careful proof-reading.
  • The paper uses the term "self-evolution" a lot, hinting at some connection to an evolutionary process, which I find somewhat misplaced. Wouldn't it make more sense to just call it self-improvement?
  • The definition of "rule-based-rewards" (l 80) seems a bit off. I understand that the authors were trying to make a connection to sparse rewards extracted based on rule matching but there is no such rule here, it's just the score from majority voting. It would be better to refer to it as "a sparse reward analogous to rule-based rewards".
  • The setting of self-improving models with their own feedback is not entirely novel and there has been lots of recent work around scaling data that requires less human feedback or ‘labels’ and I wish there were more comparisons made with respect to approaches such as RL with AI Feedback [1] for scaling feedback to train using RL.
  • There seems to be some sensitivity to hyperparameters.

[1] Bai, Yuntao, et al. "Constitutional ai: Harmlessness from ai feedback." arXiv preprint arXiv:2212.08073 (2022).

Questions

Please consider the concerns I raise in Weaknesses above and discuss them. Additionally I had a few questions:

  • The authors defined the reward function in Equation 3, which is simply a delta function with respect to the label that has the majority vote. I was wondering if the authors considered any other softer distance functions such that the reward is higher when it is closer to the majority vote and less when it is further away but more as a continuous function.
  • I’m unconvinced with the ‘lucky hit’ phenomenon. It would seem that as the majority ratio increases the reward accuracy actually decreases (in Figure 9). To me it seems that: as the model becomes more confident, the reward gets worse. I’m not sure that’s the behavior we want, the lucky hit seems to be just hoping that there is enough uncertainty when the labels are wrong. But if the majority ratio increases during training surely the lucky hit phenomenon would disappear?
  • I am very curious why the curves are so close in Figure 7! The authors give some suggestions in the paper, saying that it's great that TTRL is so close to RL (leakage) here, but to me it seems to suggest that the RL labels do not provide much signal?
  • It would be interesting to see if models can further improve using TTRL even when RL is employed. I.e. what happens if you use TTRL not just after pre-training but as an additional phase after RL training?
  • Majority voting is shown to work well in practice to improve model capabilities for math reasoning but there are many cases in which one would expect that the assumption of relying simply on the most frequent model output is wrong. Have the authors thought about this?

Limitations

Yes.

Final Justification

I maintain that this paper is a clear accept to me. The experiments are interesting to the community and the rebuttal further added to the strength.

Formatting Concerns

None.

Author Response

Response to Weaknesses

  1. Couldn't it be that the presented method merely reduces entropy (and does so towards the majority vote) and similar results could be achieved without any RL with a simple tightening of the logits around the majority vote?

We acknowledge that this perspective is insightful. Indeed, TTRL can, to some extent, be interpreted as a form of entropy regularization. However, the RL process is fundamentally different from simple logit scaling or temperature reweighting. We emphasize that the reward is derived from cross-sample majority voting, rather than from static adjustments to the logits. Essentially, TTRL allows the model to endogenously select more strategic behavior sequences through policy learning, rather than relying on post-hoc filtering via majority vote.

  2. It is important to recognise that the gains achieved by TTRL do not lift any of the models anywhere close to the state-of-the-art. Additionally, other simple TTS methods such as s1 achieve far superior results. Perhaps this is mainly because the paper only experiments with smaller models and does not contain 32B-parameter model experiments, but it should be discussed.

We would like to note that experiments on Qwen2.5-32B have been provided in Table 1. However, we fully understand that the reviewer's concern is that TTRL may not lift models close to the state-of-the-art. Regarding this, in Table 2 we present the results of applying TTRL to the highly challenging AIME 2024 using Skywork-OR1-Math-7B and Qwen3-8B. It is evident that TTRL improves the performance of Qwen3-8B from 72.5 to 82.5 on AIME 2024, surpassing models such as OpenAI-o1, DeepSeek-R1, OpenAI-o3-mini (medium), Qwen3-32B, and Qwen3-30B-A3B [1].

  3. There are some odd/short figure captions. For example the caption in Figure 1 is very terse and what does "Position" mean here?

Thank you for your suggestion! We will make improvements in the next version. In Figure 1, the term "Position" mainly refers to the placement of TTRL within the existing machine learning paradigms. Our primary intention is to describe its distinctions, positioning, and role in comparison with paradigms such as SFT and RL.

  4. Some typos and ungrammatical sentences. The paper could benefit from some careful proof-reading.

Thank you very much for pointing this out! We will carefully proofread the entire manuscript to correct these errors and address any others that may be present.

  5. The paper uses the term "self-evolution" a lot, hinting at some connection to an evolutionary process, which I find somewhat misplaced. Wouldn't it make more sense to just call it self-improvement?

Our intention was to highlight the model’s ability to progressively improve by leveraging its own outputs for learning, without external supervision. However, we agree that the term may imply a connection to biological or evolutionary processes, which is not our intended framing. To avoid potential confusion, we will adopt the more precise term “self-improvement” in the revised version and revise relevant sections accordingly.

  6. The definition of "rule-based-rewards" (l 80) seems a bit off. I understand that the authors were trying to make a connection to sparse rewards extracted based on rule matching but there is no such rule here, it's just the score from majority voting. It would be better to refer to it as "a sparse reward analogous to rule-based rewards".

Your suggestion is reasonable. Our consideration is based on the definition in DeepSeek-R1, which is "rule-based verification of correctness". Therefore, we understand "rule-based" as primarily referring to whether the verifier used is rule-based or generative. The score obtained through majority voting here is still computed using a rule-based verifier. Hence, we consider this to still fall under the category of rule-based rewards. Nevertheless, we will consider your suggestion and explore refining the terminology to better distinguish it from traditional rule-based rewards.

  7. The setting of self-improving models with their own feedback is not entirely novel and there has been lots of recent work around scaling data that requires less human feedback or ‘labels’ and I wish there were more comparisons made with respect to approaches such as RL with AI Feedback [1] for scaling feedback to train using RL.

Yes, there are indeed numerous works on self-improvement models. We acknowledge that our discussion of related literature may be somewhat limited. We are grateful for your suggestion and for pointing out this relevant line of work, such as Constitutional AI. We will incorporate the discussion of these works and offer a more thorough comparison.

  8. There seems to be some sensitivity to hyperparameters.

First, it is important to clarify that in our experiments, apart from minor adjustments to the episode and batch sizes due to the test-time setup, all other configurations, such as the temperature, follow prior work. The sensitivity we mentioned primarily refers to observations made during our exploration of alternative settings for improving performance, where we noticed a degradation in results, consistent with recent findings [2]. Moreover, the training dynamics metrics proposed in Appendix F can be used to further optimize the hyperparameters.

Response to Questions

  1. I was wondering if the authors considered any other softer distance functions such that the reward is higher when it is closer to the majority vote and less when it is further away but more as a continuous function.

You have provided an insightful suggestion! This approach is more general, as it can be extended to tasks that cannot be verified through rule-based methods. Since our experiments at the time primarily focused on verifiable tasks such as math, we did not explore this direction. We sincerely appreciate your thoughtful comments and will try it in future work.
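One possible continuous relaxation of the delta reward in Equation 3, along the lines the reviewer suggests, is to reward each rollout by the vote share of its answer; the sketch below is hypothetical and not something evaluated in the paper.

```python
from collections import Counter

def soft_vote_rewards(answers):
    """Reward each rollout by the vote share of its extracted answer (in [0, 1]) instead of a
    hard 0/1 agreement with the single majority-vote answer."""
    counts = Counter(a for a in answers if a is not None)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(answers)
    return [counts[a] / total if a is not None else 0.0 for a in answers]
```

For example, with extracted answers ["42", "42", "41", "42", "7"], the rewards would be [0.6, 0.6, 0.2, 0.6, 0.2], so near-majority answers still receive partial credit.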

  2. I’m unconvinced with the ‘lucky hit’ phenomenon. It would seem that as the majority ratio increases the reward accuracy actually decreases (in Figure 9). To me it seems that: as the model becomes more confident, the reward gets worse. I’m not sure that’s the behavior we want, the lucky hit seems to be just hoping that there is enough uncertainty when the labels are wrong. But if the majority ratio increases during training surely the lucky hit phenomenon would disappear?

First, your understanding is entirely correct. We may not have explained this clearly: our intention is not to attribute the sustained improvement of TTRL to the "Lucky Hit", nor to claim that the impact of "Lucky Hit" remains constant throughout training. On the one hand, we consider "Lucky Hit" to be one of the reasons why TTRL can be effective even with weak models. It is undeniable that this phenomenon indeed increases the proportion of accurate rewards, even when the reward accuracy declines at later stages. On the other hand, as you pointed out, its influence naturally diminishes during the later stages of training. This attenuation is expected, because "Lucky Hit" is more likely to occur when the model is weak, and as the model improves through training, the likelihood of such occurrences decreases.

  3. I am very curious why the curves are so close in Figure 7! The authors give some suggestions in the paper saying that it's great that TTRL is so close to RL (leakage) here, but to me it seems to suggest that the RL labels do not provide much signal?

We believe that this is indeed one of the reasons, which aligns with our perspective: the foundation of TTRL is to estimate labels for training through majority voting, and we consider that one reason why TTRL works is that using these estimated labels in an RL manner mitigates the noise contained in them. Moreover, we believe that the comparable performance can also be attributed to the fact that majority voting provides high-quality supervision.

  4. It would be interesting to see if models can further improve using TTRL even when RL is employed. I.e. what happens if you use TTRL not just after pre-training but as an additional phase after RL training?

Your thoughts are exactly the same as our initial intuition! This was indeed the intended outcome when we initiated this work: positioning TTRL as a post-training stage. In fact, we believe this represents a future trend, as the amount of unlabeled and highly challenging data will inevitably grow. Using TTRL to enable the scaling of RL and the self-evolution of models on more challenging and larger data (unlabeled data) is valuable. To further verify that TTRL has the potential to serve as a post-training stage even when RL is employed, we conducted further experiments on DeepSeek-R1-LLaMA-8B:

| Model | AIME 2024 | AMC 2023 | MATH-500 |
| --- | --- | --- | --- |
| DeepSeek-R1-LLaMA-8B | 51.7 | 81.6 | 89.6 |
| w/ TTRL | 69.2 | 88.9 | 90.9 |
| Δ | +17.5 | +7.3 | +1.3 |

After applying TTRL, the performance on AIME 2024 improves significantly. The gains on AMC 2023 and MATH-500 are relatively modest likely due to the saturation of these benchmarks. Moreover, as mentioned in Response to Weaknesses 2, we also provide results of TTRL on several LRMs in Table 2 of the paper. These LRMs have already been trained with costly RL, and we hope these results help address your concerns.

  5. Majority voting is shown to work well in practice to improve model capabilities for math reasoning but there are many cases in which one would expect that the assumption of relying simply on the most frequent model output is wrong. Have the authors thought about this?

Please see Response to Weakness "Reward Inference" of Reviewer KyvW.

Reference

[1] Qwen3 Technical Report

[2] POLARIS: A POst-training recipe for scaling reinforcement Learning on Advanced ReasonIng modelS

Comment

Thank you for the detailed reply and additional experiments; these all make sense and will be great to include in the final version of the paper.

My belief is that the paper should be accepted as mentioned in my initial review and reflected in my score!

Final Decision

This paper introduces Test-Time Reinforcement Learning (TTRL), a method for improving language models on reasoning tasks using reinforcement learning with rewards generated from a majority vote of the model's own outputs. This simple technique allows self-improvement on unlabeled data, yielding strong performance gains.

The paper's main strength is its simple yet effective method, which is backed by strong empirical results on math reasoning tasks. Reviewers initially raised major concerns about the validity of test-time training (KyvW) and the strength of the experimental baselines (2bYH). The authors delivered a convincing rebuttal, providing new experiments that clarified the method's context within test-time adaptation and confirmed that the performance gains were not artifacts of the evaluation setup. The thorough response strengthened support for the paper. The primary remaining limitation is the narrow focus on mathematical tasks, though this is not considered a major drawback, as the focused setting still provides substantial value.