PaperHub
Overall rating: 5.5/10
Poster · 4 reviewers
Reviewer scores: 4, 4, 2, 2 (min 2, max 4, std 1.0)
ICML 2025

Thinking LLMs: General Instruction Following with Thought Generation

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We use RL to improve the thought process.

Abstract

Keywords
thought generation, CoT, preference learning, RL

Reviews and Discussion

Review
Rating: 4

This paper introduces Thinking LLMs, a novel approach aimed at improving general instruction following in large language models (LLMs) by explicitly incorporating internal thought processes before generating responses. Traditional LLMs respond directly to user instructions without intermediate reasoning steps, which can be inefficient for complex queries. The proposed method, Thought Preference Optimization (TPO), enables LLMs to generate and refine internal thoughts in an unsupervised manner without requiring additional human-annotated data.

The key idea behind TPO is to:

  1. Generate multiple candidate thought-response pairs for a given instruction using an instruction-tuned LLM.
  2. Evaluate the responses using a reward model that only assesses the response quality, not the thought process itself.
  3. Optimize the thought generation through Direct Preference Optimization (DPO) by selecting the thought-response pairs that lead to the highest-rated responses.
  4. Iteratively refine the model, ensuring that the generated thoughts consistently contribute to better response quality (a minimal sketch of this loop follows the results summary below).

The paper provides empirical validation across multiple instruction-following benchmarks, demonstrating that TPO-trained models:

  • Outperform standard LLMs on AlpacaEval (+4.1% win rate) and Arena-Hard (+4.3% win rate).
  • Improve response quality not just in reasoning tasks but also in non-reasoning domains like marketing, health, and general knowledge.
  • Reduce reliance on manual Chain-of-Thought (CoT) prompting by allowing LLMs to learn when and how to think.
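For concreteness, here is a minimal sketch of one TPO iteration, written with the candidate generator, thought/response splitter, reward scorer, and DPO update supplied as callables; all names are illustrative stand-ins, not the authors' implementation.

```python
from typing import Callable, List, Tuple

def tpo_iteration(
    generate: Callable[[str], str],            # samples one "thought + response" output for a prompt
    split: Callable[[str], Tuple[str, str]],   # splits an output into (thought, response)
    score: Callable[[str, str], float],        # reward model: judges (instruction, response) only
    dpo_update: Callable[[List[Tuple[str, str, str]]], None],  # DPO on (prompt, chosen, rejected)
    instructions: List[str],
    k: int = 8,
) -> None:
    """One TPO iteration: sample k thought+response candidates per instruction,
    rank them by the reward of the response alone, and run DPO on best-vs-worst pairs."""
    pairs: List[Tuple[str, str, str]] = []
    for instruction in instructions:
        candidates = [generate(instruction) for _ in range(k)]
        # The judge sees only the response part; the thought stays hidden from it.
        ranked = sorted(candidates, key=lambda c: score(instruction, split(c)[1]), reverse=True)
        pairs.append((instruction, ranked[0], ranked[-1]))  # chosen = best, rejected = worst
    dpo_update(pairs)
```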

Questions for Authors

How does TPO handle cases where the initial thought generation is incorrect? Does the model attempt self-correction, or does it remain committed to the initial flawed reasoning path?

What are the primary failure modes of TPO? Are there specific task types where the thought-generation process degrades performance instead of improving it?

Can TPO be integrated with retrieval-augmented generation (RAG) systems? Would the model's internal thought process benefit from external knowledge retrieval, and how would that affect optimization?

Claims and Evidence

The paper presents several claims that are supported by empirical evidence:

Claim 1: Thinking LLMs improve instruction-following accuracy across various tasks.

The experiments on AlpacaEval and Arena-Hard demonstrate consistent gains over direct-response baselines, with TPO achieving a 52.5% win rate on AlpacaEval and 37.3% on Arena-Hard. A category-wise analysis indicates that TPO benefits both reasoning and non-reasoning tasks, unlike prior CoT-based methods that primarily enhance mathematical and logical reasoning.

Claim 2: Thought generation can be optimized without direct human supervision.

The proposed reward-based optimization strategy allows the model to learn effective thought processes without requiring labeled thought data. Through Direct Preference Optimization (DPO; the objective is written out after these claims), the system refines thought generation iteratively, producing better responses over time. Ablation studies confirm that models trained without TPO fail to achieve similar improvements, reinforcing that the optimization of thought generation is a crucial component.

Claim 3: Thinking helps general instruction-following beyond traditional reasoning tasks.

The fine-grained analysis (Figure 4) demonstrates improvements in domains such as marketing, translation, and content writing, areas where CoT prompting was previously considered ineffective. This contrasts with prior research (e.g., Sprague et al., 2024), which suggested that CoT primarily benefits logic-based tasks.
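For reference, the optimization behind Claim 2 is the standard DPO loss applied to concatenated thought+response sequences; the response-only judging enters through how the chosen and rejected pairs are selected, not through the loss itself. The notation below is the generic DPO formulation (x = instruction, z = hidden thought, y = response), not copied from the paper.

```latex
% Standard DPO loss over full thought+response sequences; the pair (z^+, y^+) vs (z^-, y^-)
% is chosen by the reward model's scores on the responses y alone.
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}\!\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(z^{+}, y^{+} \mid x)}{\pi_{\mathrm{ref}}(z^{+}, y^{+} \mid x)}
    \;-\;
    \beta \log \frac{\pi_\theta(z^{-}, y^{-} \mid x)}{\pi_{\mathrm{ref}}(z^{-}, y^{-} \mid x)}
  \right)
\right]
```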

Methods and Evaluation Criteria

The paper follows a rigorous evaluation framework:

  • Uses AlpacaEval and Arena-Hard, widely accepted benchmarks for instruction-following models.
  • Compares against strong baselines, including direct-response models and those trained with generic thought prompting.
  • Provides a detailed breakdown of performance across 20 instruction categories, illustrating the types of tasks that benefit most from TPO.
  • Conducts ablation studies to evaluate the impact of:
    • Different thought prompting strategies (generic vs. specific).
    • Different reward models (ArmoRM vs. STE).
    • Iterative training on thought refinement.

Potential improvements:

  • Human evaluation of response quality would provide stronger validation of TPO's effectiveness.
  • Additional experiments on real-world applications (e.g., dialogue systems, tutoring systems) would further demonstrate the practical benefits of TPO.

Theoretical Claims

The paper does not focus on formal theoretical contributions but provides strong empirical justification for the proposed approach.

The method is based on Reinforcement Learning from AI Feedback (RLAIF) and Direct Preference Optimization (DPO), but lacks a formal theoretical analysis of thought generation dynamics. A deeper exploration of whether TPO converges to optimal thought processes or if certain types of errors persist over training iterations would strengthen the theoretical foundation.

Experimental Design and Analysis

The experimental setup is well-structured and controlled:

  • Uses standardized prompts, models, and evaluation protocols.
  • Clearly demonstrates win rate improvements over baselines.
  • Implements a length-control mechanism to ensure response quality does not simply improve due to verbosity.

Areas that could be expanded:

  • Failure case analysis is limited. The paper does not discuss scenarios where TPO might fail, such as backtracking-heavy tasks or reasoning tasks requiring multiple revisions.
  • Automated evaluation reliance: the paper depends on GPT-4-based judges, which may introduce evaluation biases that are not addressed.

Supplementary Material

The supplementary material includes:

  • Expanded experimental results, including category-wise performance comparisons.
  • Additional ablations on different prompt types and training setups.
  • Examples of thought-response pairs, illustrating qualitative benefits of TPO.

A stronger discussion of failure cases and potential limitations would further enhance the supplementary content.

Relation to Existing Literature

  • Chain of Thought (CoT) (Wei et al., 2022) – Demonstrated the benefits of explicit reasoning but was mostly effective for math and logic tasks.
  • Reinforcement Learning from AI Feedback (RLAIF) (Bai et al., 2022) – Introduced reward-based optimization for LLMs, but TPO adapts this for thought generation.
  • Quiet-STaR (Zelikman et al., 2024) – Explored unsupervised thought generation but was primarily focused on structured reasoning tasks.
  • DeepSeek-R1 (Guo et al., 2025) – Used reinforcement learning to train structured reasoning templates, whereas TPO provides a more flexible, data-driven approach.

Essential References Not Discussed

Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models

Other Strengths and Weaknesses

Strengths:

  • The paper presents an original approach to optimizing thought generation in LLMs, extending beyond traditional Chain-of-Thought (CoT) techniques.
  • The proposed Thought Preference Optimization (TPO) method is well-motivated and effectively eliminates the need for manually annotated thought supervision, making it scalable and adaptable across different tasks.
  • The empirical evaluation is comprehensive, covering AlpacaEval, Arena-Hard, and category-wise breakdowns, providing strong evidence that TPO improves instruction-following models in both reasoning and non-reasoning tasks.
  • The paper is well-written and structured, making it easy to follow, with clear explanations of methodology, evaluation, and experimental findings.
  • The ablation studies effectively demonstrate the contribution of different components of TPO, particularly the role of DPO in refining thought generation.

Weaknesses:

  • The paper does not explore failure cases in depth. It would be helpful to provide more insights into when and why TPO fails, particularly in tasks that require iterative backtracking or fine-grained numerical precision.
  • The evaluation relies entirely on GPT-4-based reward models, which introduces potential biases. A complementary human evaluation would help confirm the effectiveness of TPO in real-world scenarios.
  • The work primarily focuses on GPT-based models. It remains unclear how well TPO generalizes to open-source models such as LLaMA, Mistral, or Claude.
  • While the method is technically sound, there is no theoretical analysis of whether TPO converges to an optimal thought process or if certain errors persist over time.

Other Comments or Suggestions

  • Providing examples of failure cases in the supplementary material would help clarify the limitations of TPO.
  • A discussion on the computational efficiency of TPO relative to existing approaches would be useful. Does training with TPO introduce significant additional overhead compared to standard instruction tuning?
  • The paper could benefit from more real-world application demonstrations, such as deploying TPO-trained models in interactive agents or dialogue systems.

Author Response

We thank the reviewer for the constructive feedback. Below, we address your concerns and propose our revisions.


Failure case analysis is limited. The paper does not discuss scenarios where TPO might fail, such as backtracking-heavy tasks or reasoning tasks requiring multiple revisions. Automated evaluation reliance—the paper depends on GPT-4-based judges, which may introduce evaluation biases that are not addressed.

Our results on GSM8K, where both TPO and standard DPO degraded performance, illustrate a limitation. We attribute this primarily to the reward model's limitations in accurately judging complex, multi-step reasoning, highlighting that reward model quality is critical, especially for such tasks. We also acknowledge the potential for inherent biases in automated evaluations using LLM-based judges like GPT-4. To mitigate this, we implemented controls where feasible; for instance, output length was controlled across compared methods to prevent models from exploiting potential length bias in the automated evaluator.

Essential References Not Discussed

We agree these references are relevant and appreciate the reviewer pointing them out. We'll discuss these papers in our final draft.

The paper does not explore failure cases in depth. It would be helpful to provide more insights into when and why TPO fails, particularly in tasks that require iterative backtracking or fine-grained numerical precision. The evaluation relies entirely on GPT-4-based reward models, which introduces potential biases. A complementary human evaluation would help confirm the effectiveness of TPO in real-world scenarios.

We agree that a deeper analysis of failure cases would be valuable. Although the patterns TPO exhibits are generally helpful, such as drafting a reminder list for the answer (Figures 6 & 15), refining the response (Figures 5 & 16), and reflections, we also observed cases where TPO failed. Notably, in Figure 18, we found the model exhibits non-terminating behavior by iteratively criticizing its own answer without being able to produce a fix. This indicates the model might have a tendency to deviate from the thought-answer structure, despite being specifically trained to follow that structure.

The work primarily focuses on GPT-based models. It remains unclear how well TPO generalizes to open-source models such as LLaMA, Mistral, or Claude. While the method is technically sound, there is no theoretical analysis of whether TPO converges to an optimal thought process or if certain errors persist over time.

Regarding the request for theoretical analysis, we are not aware of any theoretical analysis of the optimality of TPO. Our intuition for why TPO works is that it gives the LLM more freedom to explore during the RL process by placing no constraint on the thought part (since the judge does not see what is inside the thought), though some errors might persist, e.g., not following the format or non-stopping behaviors.

Does training with TPO introduce significant additional overhead compared to standard instruction tuning?

TPO training doesn’t add additional complexity in implementation compared with DPO, but potentially generates slightly more tokens than standard DPO due to the thought part.

The paper could benefit from more real-world application demonstrations, such as deploying TPO-trained models in interactive agents or dialogue systems.

These are interesting ideas to explore in our future work.

How does TPO handle cases where the initial thought generation is incorrect? Does the model attempt self-correction, or does it remain committed to the initial flawed reasoning path?

Figure 18 provides an example where the model recognized an error but failed to recover or correct it.

What are the primary failure modes of TPO? Are there specific task types where the thought-generation process degrades performance instead of improving it?

One primary failure mode stems from the reward model's limitations. If the reward model struggles to accurately judge the final answers for certain types of tasks (like complex reasoning), it can provide poor guidance, causing TPO to optimize in the wrong direction and potentially degrade performance more. Another observed failure, illustrated in Figure 18, involves process instability where the model might enter non-stopping loops, such as repeated self-criticism without resolution, or deviate from the intended thought-answer structure.

Can TPO be integrated with retrieval-augmented generation (RAG) systems? Would the model's internal thought process benefit from external knowledge retrieval, and how would that affect optimization?

Thank you for the insightful comment; unfortunately, we are unable to provide a direct answer at this time, but it is an interesting topic worth exploring in the future.


We greatly appreciate the reviewer’s constructive feedback, which significantly enhances the quality and clarity of our work.

Review
Rating: 4

The authors propose TPO, a method that finetunes an instruction-tuned LLM to output discrete thought tokens for harder tasks, without any supervision signal. The model undergoes iterative RLAIF preference learning, where the reward comes from a judge model that judges based on the LLM's final answer. Finally, the authors show TPO achieves performance improvements against baselines on AlpacaEval and Arena-Hard.

Questions for Authors

  • Can we also see a variant of Figure 4 with win rate against the seed model? As the authors have already mentioned, CoT only helps with math/logic-related tasks, so the seed model should be a stronger baseline here for most of the categories shown.
  • The authors show TPO does worse than baselines on GSM8K; any ideas why it does better than the baseline on the “math and calculation” subset of UltraFeedback?
  • Why is it problematic that “the seed model … uses CoT anyway due to its instruct training”? If it is because the accuracy is too high for further improvement, can the authors also evaluate on MATH, which is a harder task than GSM8K?
  • The authors mentioned that the poor performance on GSM8K might be “due to our setup not being oriented toward such tasks”. Have the authors tried simply adding the GSM8K training examples as part of the training data for TPO?

Claims and Evidence

Yes

Methods and Evaluation Criteria

  • TPO uses a judge that is finetuned from Llama 70B. What if the authors simply prompt the 70B model with the thought prompts from Figure 2, and SFT the smaller Llama 8B on this dataset? How would simply doing this SFT compare with using TPO? An even simpler variant is to just use the 70B to generate without the thought prompts, and SFT on this dataset.
  • How would TPO compare with parallel sampling techniques? I.e., sample multiple answers and use the judge model to select the final answer.
  • In the related works section, the authors mention several prior works on thinking tokens that also do not require any labeled data (STaR, Quiet-STaR); is there a reason why the authors didn't compare empirically with those two baselines?

Theoretical Claims

N/A

Experimental Design and Analysis

N/A

Supplementary Material

No.

Relation to Existing Literature

The authors propose TPO, a method for LLMs to learn to generate thinking tokens during inference time. Test-time compute scaling is an important and timely topic for LLM capabilities research.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

Strengths:

  • Paper writing and figure illustrations are clear and easy to digest.
  • Comprehensive evaluation and ablations.
  • The method proposed is simple and achieves good results in the evaluation tasks.
  • CoT has more commonly been associated with reasoning tasks, so the surprising results showing that TPO improves on various instruction-following tasks are interesting to see.

Weaknesses:

  • Missing baselines that the authors should include (see questions below).
  • See further questions below.

Other Comments or Suggestions

N/A

Author Response

We sincerely thank the reviewer for the valuable and insightful feedback. We address your concerns as follows:


How would simply doing this SFT compare with using TPO?

That's an insightful question regarding potential training alternatives. The Llama 70B model in our study functions strictly as a reward model, similar to the other RM evaluated (ArmoRM), providing reward signals rather than generating target outputs for supervision. Our TPO framework utilizes these reward signals and is agnostic to the specific nature of the reward model (e.g., a traditional RM or an LLM-as-a-Judge). This contrasts fundamentally with the proposed alternatives, which would use the 70B model as a generator to create a dataset for supervised fine-tuning (SFT). Because TPO operates via preference optimization driven by reward signals, while the suggested approaches rely on SFT using generated targets, a direct comparison might not be appropriate.

How would TPO compare with parallel sampling techniques? I.e. sample multiple answers and use the judge model to select the final answer.

We appreciate the reviewer raising the comparison with inference-time techniques. TPO as a training-time optimization method, is orthogonal to inference-time techniques such as sampling multiple responses and selecting the best using a reward model. These approaches are not mutually exclusive; applying inference-time sampling and selection strategies to a model already optimized with TPO is a plausible approach. Investigating the interplay and relative benefits of TPO training versus various inference-time selection methods remains an interesting direction for future work.
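To make the contrast concrete, the inference-time alternative the reviewer describes is roughly the best-of-n selection sketched below (placeholder callables, not code from the paper); it could in principle be layered on top of a TPO-trained model rather than replacing TPO training.

```python
from typing import Callable, List

def best_of_n(
    generate: Callable[[str], str],      # sampler for a (possibly TPO-trained) model
    judge: Callable[[str, str], float],  # scores (instruction, response)
    instruction: str,
    n: int = 8,
) -> str:
    """Inference-time best-of-n: sample n responses and return the judge's favorite."""
    responses: List[str] = [generate(instruction) for _ in range(n)]
    return max(responses, key=lambda r: judge(instruction, r))
```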

authors mention several prior works on thinking tokens that also do not require any labeled data (STaR, Quiet-STaR), is there a reason why authors didn’t compare empirically with those two baselines?

Thank you for asking about the comparison with STaR and Quiet-STaR. STaR requires ground truth answers, and Quiet-STaR relies on supervised fine-tuning data. In contrast, TPO requires only unlabeled prompts and preference signals from a reward model. These fundamental differences in data make a direct empirical comparison challenging. Furthermore, STaR and its variants were primarily evaluated on reasoning-intensive tasks, whereas our work focuses on general instruction-following capabilities.

Can we also see a variant of Figure 4 with win rate against the seed model? As authors have already mentioned, COT only helps with math/logic related tasks, so seed model should be a stronger baseline here for most of the categories shown.

We understand the reviewer's interest in seeing a direct comparison against the base seed model. The 'seed model' refers to the base pre-trained model prior to any preference optimization. The 'direct baseline' actually presented in Figure 4 is this seed model after it has undergone standard Direct Preference Optimization (DPO) training directly on final answers. This DPO-optimized direct baseline is generally stronger than the original un-optimized seed model, making it the more relevant and challenging point of comparison for evaluating TPO.

The authors show TPO does worse than baselines on GSM8K; any ideas why it does better than the baseline on the “math and calculation” subset of UltraFeedback?

This is a keen observation regarding the differing performance on math-related datasets. These two datasets exhibit significant distributional differences in the types of mathematical problems presented. We hypothesize this is because the reward model we used is better at judging questions in the UltraFeedback distribution but worse on the GSM8K distribution.

Why is it problematic that “the seed model … uses CoT anyway due to its instruct training”? If it is because the accuracy is too high for further improvement, can authors also evaluate on MATH, which is a harder task than gsm8k?

We understand the reviewer's query regarding the implications of the seed model's inherent CoT capabilities. The goal of that experiment was to understand the effect of thinking and CoT on GSM8K performance. In particular, we tried to measure the accuracy of the seed model when directly outputting answers without any CoT. However, this was tricky to measure because the model often did CoT before answering, sometimes even when explicitly instructed not to do CoT.

Authors mentioned that the poor performance on GSM8K might be “due to our setup not being oriented toward such tasks”. Have the authors tried to simply add the training examples in GSM8K as a part of the training data for TPO?

That's a relevant suggestion concerning the training data composition for specific benchmarks. We did not include examples from the GSM8K training set within the dataset used for TPO training in this study. This is an interesting direction which we’ll leave for future work.


We deeply appreciate the reviewer’s feedback and hope our responses fully address your concerns.

Reviewer Comment

Thank you to the authors for answering my questions and providing further clarifications. I will maintain my score.

Review
Rating: 2

The paper proposes a method to enhance LLMs by enabling them to "think" explicitly before generating responses. This is aimed at improving performance on complex tasks requiring reasoning and planning, as well as general instruction-following tasks. The authors introduce the so-called Thought Preference Optimization (TPO), a training method that equips LLMs with the ability to generate internal "thoughts" before producing a response. These thoughts are hidden from the user and serve as an intermediate step to improve response quality. The method does not require additional human-labeled thought data.

Questions for Authors

See above.

Claims and Evidence

  1. Claim: Generated thoughts are meaningful and improve response quality.

The paper focuses on optimizing thoughts based on the quality of the final response but does not rigorously evaluate the quality of the thoughts themselves. For example, are the thoughts interpretable and aligned with the optimal or correct human-like reasoning steps?

  2. Claim: TPO introduces a novel and significant advancement in LLM training.

While the method is well-executed, its core idea—using preference optimization to improve intermediate outputs—is not fundamentally novel. It builds heavily on existing techniques like Direct Preference Optimization (DPO) and Reinforcement Learning from AI Feedback (RLAIF). A clearer comparison with existing methods (e.g., standard DPO, CoT) and a discussion of how TPO uniquely advances the field would strengthen this claim.

Methods and Evaluation Criteria

  1. The proposed method focuses on optimizing thoughts based on response quality but seems to lack a rigorous evaluation of the quality of the thoughts themselves. For example, are the thoughts interpretable and meaningful?

  2. The proposed method underperforms on math tasks (e.g., GSM8K), which is surprising given the emphasis on reasoning. This suggests that TPO may not be well-suited for tasks requiring precise, step-by-step reasoning. The authors attribute this to the small proportion of math-related instructions in the training data, but this explanation feels insufficient.

  3. The paper claims that specific thought prompts (e.g., drafting and evaluating responses) perform slightly better than generic prompts, but the difference is marginal. This raises questions about whether the added complexity of specific prompts is justified.

Theoretical Claims

No theoretical claims

Experimental Design and Analysis

  1. The paper claims that specific thought prompts (e.g., drafting and evaluating responses) perform slightly better than generic prompts, but the difference is marginal. This raises questions about whether the added complexity of specific prompts is justified.

  2. It is better to also evaluate the computational cost of generating and optimizing thoughts relative to the improvements in response quality. This would help establish the method's practical value.

  3. It may be also helpful to investigate why TPO underperforms on math tasks and explore ways to improve its performance in this domain.

Supplementary Material

No.

Relation to Existing Literature

The key contributions of the paper are related to areas like CoT prompting, thought generation, iterative training, and self-improvement.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

Strengths:

  1. The paper evaluates the proposed method, Thought Preference Optimization (TPO), on well-established benchmarks like AlpacaEval 2 and Arena-Hard, demonstrating clear improvements over baselines. The use of GPT-4 as an auto-evaluator adds credibility to the results.

  2. The fine-grained evaluation on 20 categories (e.g., marketing, health, math) provides valuable insights into where TPO excels, showing gains even in non-reasoning tasks.

  3. The finding that TPO improves performance in non-reasoning tasks (e.g., marketing, health) is interesting and suggests that "thinking" may have broader applications than previously assumed.

Weaknesses:

  1. The method builds on existing techniques (e.g., DPO, RLAIF) and does not introduce significant algorithmic novelty.

  2. The quality of the generated thoughts is not rigorously evaluated, and their interpretability remains unclear.

  3. The practical benefits of the method, such as computational cost and real-world applicability, are not thoroughly analyzed.

  4. The paper claims that specific thought prompts (e.g., drafting and evaluating responses) perform slightly better than generic prompts, but the difference is marginal. This raises questions about whether the added complexity of specific prompts is justified.

In summary, while the application of preference optimization to thought generation is interesting, the core idea builds heavily on existing techniques like DPO and RLAIF. The paper does not introduce a fundamentally new paradigm, making the work feel incremental.

Other Comments or Suggestions

  1. How interpretable are the generated thoughts? Can they be used to debug or explain the model's decision-making process? Are there examples where the thoughts are nonsensical or misleading, despite leading to good responses?

  2. It's better to include a qualitative analysis of the generated thoughts to assess their interpretability, alignment with human reasoning, and potential utility for model transparency.

  3. It's better to evaluate the computational cost of generating and optimizing thoughts relative to the improvements in response quality.

  4. It is better to investigate why TPO underperforms on math tasks and explore ways to improve its performance in this domain.

Author Response

We sincerely thank the reviewer for the thoughtful and constructive feedback. We address each concern raised and propose revisions:


Are the thoughts interpretable and aligned with the optimal or correct human-like reasoning steps?

While our methodology does not impose explicit constraints on the structure or content of the thoughts during reinforcement learning, our analysis revealed the emergence of patterns that resemble effective human problem-solving strategies. These include:

  • Generating Checklists/Key Points: As seen in Figures 6 and 15, the model often formulates preliminary lists of elements to include or emphasize in the final answer.
  • Iterative Refinement: Figures 5 and 16 show the model refining potential answers within the thought process, adding arguments or examples.
  • Self-Correction and Evaluation: Figure 18 illustrates the model engaging in self-correction, sometimes even overriding pre-specified formats.

It's better to include a qualitative analysis of the generated thoughts to assess their interpretability, alignment with human reasoning...

We agree that a dedicated qualitative analysis assessing thought interpretability is a valuable suggestion. However, rigorously and explicitly evaluating the intrinsic quality of machine-generated thoughts presents significant challenges. First, benchmarks specifically designed for assessing thought processes are not well-established. Secondly, defining objective criteria for a "good" thought is inherently complex. For instance, is a thought process containing flawless step-by-step logic but ultimately unhelpful superior to one that suggests a promising overall direction despite containing intermediate errors? Developing methods for more direct evaluation of thought quality itself remains an important direction for future research.

It builds heavily on existing techniques like Direct Preference Optimization (DPO) and Reinforcement Learning from AI Feedback (RLAIF)...

We wish to clarify that the core innovation of TPO is not presented as a fundamentally new type of reinforcement learning or preference optimization algorithm itself. The primary objective of TPO is to demonstrate that substantial performance gains are achievable by affording the model this unjudged "thinking space," without requiring significant alterations to the underlying training algorithm architecture.

We provide empirical evidence supporting this claim through direct comparisons with standard DPO and a CoT approach (representing iteration 0, before TPO). As presented in our results, optimizing the hidden thought process via TPO yields significant performance improvements over these methods.

...specific thought prompts perform slightly better than generic prompts, but the difference is marginal...

Specific thought prompts introduce negligible implementation complexity because the TPO training methodology remains identical irrespective of the prompt format. While average performance gains across all tasks may appear modest, specific prompts yield substantial improvements on challenging benchmarks, notably Arena-Hard.
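For illustration only, the distinction at issue is roughly the difference between the two templates below; these are paraphrases of the idea, not the exact prompts from Figure 2 of the paper.

```python
# Paraphrased, illustrative prompt templates (NOT the paper's exact Figure 2 prompts).
GENERIC_THOUGHT_PROMPT = (
    "Respond to the user's instruction. First write out your internal thoughts, "
    "then write your final response. The thoughts will not be shown to the user."
)

SPECIFIC_THOUGHT_PROMPT = (
    "Respond to the user's instruction. In your hidden thought section, first draft a "
    "candidate response, then evaluate and critique that draft, and only then write "
    "the final response that will be shown to the user."
)
```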

It is better to also evaluate the computational cost...

We’ll add discussion and plots of the generation cost vs the final performance in our final draft.

investigate why TPO underperforms on math tasks...

We hypothesize this may stem from the deficiency of the reward model we used, which focused on general instruction following rather than specialized mathematical reasoning. For domains requiring strict correctness like mathematics, directly evaluating answer accuracy against reference solutions might be more effective than relying solely on the preference judge. Incorporating such correctness checks represents a promising direction for future work.

the core idea builds heavily on existing techniques like DPO and RLAIF...

TPO's contribution lies not in proposing a new optimization algorithm variant, but in demonstrating the methodology of optimizing a free-form thought process based only on judgments of the final answer. This indirect supervision approach facilitates the emergence of complex reasoning behaviors during training, which are not seen in standard answer-only training.

Are there examples where the thoughts are nonsensical or misleading, despite leading to good responses?

We did not observe cases where nonsensical thoughts produced high-quality final answers; we hypothesize the KL regularization used during training helps maintain coherent and interpretable thought structures by penalizing significant deviations from the base model's distribution.
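For context on the regularization referred to here: DPO implicitly solves the standard KL-regularized reward-maximization problem, so the β coefficient controls how strongly the policy is kept close to the reference (base) model. This is the generic formulation, not a result specific to the paper.

```latex
% KL-regularized objective implicitly optimized by DPO; \pi_{ref} is the reference (seed) model.
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\big[ r(x, y) \big]
\;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\!\big( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
```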

It's better to evaluate the computational cost of generating and optimizing thoughts relative to the improvements in response quality.

We’ll add analysis in the final draft.


We greatly appreciate the reviewer’s constructive feedback, which improves the clarity of our work. We respectfully ask the reviewer to reconsider their rating, given our revisions and clarifications.

Review
Rating: 2

This paper presents a method and studies how to get LLMs to output initial thought traces before a final answer on instruction-following tasks. Their main idea is to prompt LLMs to initially produce these thought traces before a final response, score just the final response with an LLM-as-a-judge, and train over pairs of entire thought + final response sequences rated by just the response parts using DPO. Doing so enables training "thinking LLMs" without the need for any human-sourced thoughts, and leads to improved performance on AlpacaEval and Arena-Hard benchmarks (>50% win rate against direct finetuning on good responses). They finally perform finer-grained analysis to see where the thinking helps across instruction topic, and study properties such as impact of initial prompt type and thought lengths.

Questions for Authors

  1. Did the authors consider studying the cost-quality trade-off between having to generate more tokens in the hidden thoughts vs direct prompting?

Claims and Evidence

The claim of "investigating the possibility of converting existing LLMs into Thinking LLMs that work across a wide variety of tasks, without any additional data" is supported, albeit modestly.

Support / positives

  • Experiments demonstrate that this paradigm works through their proposed method
  • The authors pick several interesting axes of study / points of comparison, such as (1) prompt for structuring the initial thoughts, (2) different judge models, (3) post-training procedure (DPO vs IRPO)
  • They also find some interesting empirical nuggets, such as the initial responses following thoughts being worse than the initial instruction-tuned model's, but then improving via the procedure. I found the example of self-correction on GSM8K also a nice highlight, though it would have been interesting to see how frequently this phenomenon occurred.

Insufficiencies / negatives

  • Lack of model support. All the experiments for the model generation are done on 1 model: Llama 3.1 8B Instruct. While this is a pretty modest model not known for having "reasoning" capabilities (i.e., it's cool to see these thinking traces emerge here), just showing the evaluation on one model seems insufficient re: the scope of a general method, and "converting existing LLMs into Thinking LLMs" at large.
  • Lack of understanding or insight into why thoughts help. I appreciated the study into the different topics and how thinking could help to varying degrees, as well as the examples of when thinking helped or hurt in the appendix. However, I would have liked to see more (hypotheses + validation or not) on why the thought processes help, especially on non-reasoning instruction-following tasks.
    • For example, what kinds of patterns or additional context emerge from the thoughts that contribute to higher quality responses?
    • Do different patterns emerge for different topics?
    • How robust is this emergence? e.g., the Figure 15 response example doesn't strike me as something that benefits truly from the thoughts. The response also seems like it could have come from the model without thoughts.
  • Justification for parts of the method. I appreciated the study on length-control and DPO vs IRPO, but some parts of the method came across a bit ad-hoc, e.g., what was the motivation for why the preference pair building was done as proposed? Could other techniques work?
    • Regarding the question on studying whether things are possible, I'm curious if this is a result strictly from preference optimization (TPO), or could we get Thinking LLMs via purely an outcome-based signal (answer the question or not) and techniques that just use this (KTO, SFT after sampling for positive sequences, RLVR)

Nit, but the phrasing "we allow the model to independently learn to think" (L079) is a bit misleading, given that we use additional (larger) LLMs for prompt generation (Llama 70B) and response scoring (STE, ArmoRM).

Methods and Evaluation Criteria

I think AlpacaEval and Arena-Hard are reasonable given the span of instruction categories (e.g., marketing, health and general knowledge).

However, insofar as contrasting against the popular "logic-based tasks like math or coding", did the authors consider additionally broadening the tasks or "skills" beyond general instruction-following (and evaluation based on LLM-as-a-judge preferences)?

  • For example, can TPO and reasoning help with non-logic tasks like summarization or question-answering over (long) contexts?

The method comparison is also a bit lacking, where I think the authors should at least compare against STaR. As the authors point out in their related work (L431):

However, these methods rely on supervised training so ground-truth thought data is required. STaR (Zelikman et al., 2022) removes this constraint by generating both thought and answer from a model using few-shot prompting.

Given that STaR also enables reasoning thoughts without the need for human thought data, and can be applied to the instruction-following tasks (e.g., filtering by using an LLM-as-a-judge on whether the response satisfies the instruction or not), it seems worth comparing to assess the novelty + impact of contribution for TPO.

Theoretical Claims

N/A. No theoretical claims made.

Experimental Design and Analysis

Yes. I checked model comparison, benchmark selection, and ablations. See issues pointed out in Claims And Evidence and Methods And Evaluation Criteria.

Namely:

  • Lack of support for Thinking LLMs beyond Llama 3.1 8B Instruct
  • Evaluation on "only" AlpacaEval and Arena-Hard. Or at least, beyond the topic-based granularity breakdown, I think looking into the nature of the skills needed to follow the instructions faithfully (e.g., question-answering, summarization, content generation) and comparing this to the improvement in performance via the thoughts would be more insightful (an example of this is the factoid highlight in Figure 5, though a more systematic analysis would be better).

Supplementary Material

Yes. Experimental and implementation details (Evaluation, ELO computation). Additional artifacts (thought examples).

Relation to Existing Literature

The authors show interesting results where thinking can emerge for non logic-based tasks, can be done without needing to do SFT on human thought traces, and can help non-logic based task quality.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

See the supports/positives in the Claims and Evidence response above.

Other Comments or Suggestions

N/A

Author Response

We sincerely thank the reviewer for the thoughtful and constructive feedback. We address your concerns and propose corresponding revisions:


Lack of model support. All the experiments for the model generation are done on 1 model: Llama 3.1 8B Instruct...

We acknowledge the reviewer's concern regarding model diversity. While we aim to conduct further evaluations involving larger models such as the latest Llama 3.3 70B and Deepseek R1, our current model selection is primarily constrained by limited training resources. Subject to the availability of time and resources, we intend to expand our experimental scope to include a wider range of models. However, this expansion may be relegated to future work.

Lack of understanding or insight into why thoughts help... For example, what kinds of patterns or additional context emerge from the thoughts that contribute to higher quality responses?

We agree that understanding why thoughts contribute to better responses is crucial. Our analysis identified several patterns within the thought processes that correlate with higher-quality final responses:

  • Generating Checklists/Key Points: As illustrated in Figures 6 and 15, the model often drafts a preliminary list of essential elements or topics it determines should be included or emphasized in the final answer, effectively creating a structured plan.
  • Iterative Refinement: Figures 5 and 16 demonstrate instances where the model refines its potential answer. This includes augmenting initial drafts with stronger arguments, incorporating more specific examples, or restructuring the content for clarity.
  • Self-Correction and Evaluation: Figure 18 provides an example of the model engaging in self-correction and evaluation. Notably, this reflective behavior can emerge even when it leads to deviations from pre-specified output formats.

These observed behaviors are consistent with established strategies known to enhance the quality and reliability of responses generated by large language models.

Do different patterns emerge for different topics? How robust is this emergence? ...

We appreciate the reviewer inquiring about the robustness and specificity of these patterns. We did not statistically evaluate and classify these thought patterns because doing so requires manual inspection and we do not have a way to do it at scale. However, we want to emphasize that some of the behaviors exhibited by our thought model are almost never seen in a direct-answer model, for example the self-reflection, refinement, and self-reminder behaviors, which we believe are the main reasons why our model can perform better.

I appreciated the study on length-control and DPO vs IRPO, but some parts of the method came across a bit ad-hoc...

Regarding length control, our approach incorporates established mechanisms documented in prior work. Extensive research has demonstrated that without effective length constraints, model performance can be significantly impacted, often negatively. We adopted these standard techniques primarily to ensure fair and meaningful comparisons between different methods evaluated in our study, as consistent length controls were applied across all conditions. This also enhances the practical relevance of our findings by preventing models from artificially inflating reward scores through excessive verbosity.
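As an aside on what such a control can look like when building preference pairs, one generic option is to penalize the raw reward by response length before ranking candidates; the snippet below is an assumed illustration of that general technique, not the paper's exact scheme.

```python
def length_controlled_score(raw_score: float, response_len: int, alpha: float = 0.001) -> float:
    """Penalize longer responses so that verbosity alone cannot win a preference pair."""
    return raw_score - alpha * response_len
```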

if this is a result strictly from preference optimization (TPO), or could we get Thinking LLMs via purely an outcome-based signal ...

Regarding the outcome-based signal, this aligns with methodologies explored in recent literature, such as DeepSeek-R1. Our proposed framework, which explicitly separates the generation process into distinct 'thought' and 'answer' components, is indeed adaptable to such outcome-based evaluation paradigms. By evaluating only the final 'answer' part, the framework grants the model greater flexibility in the 'thought' generation phase, encouraging more exploration during reinforcement learning, potentially leading to more diverse reasoning strategies and ultimately improving final task performance. Furthermore, alternative optimization algorithms like KTO or RLVR could readily replace DPO within our framework; our initial selection of DPO was based on its established effectiveness and relative simplicity of implementation.

The method comparison is also a bit lacking, where I think the authors should at least compare against STaR.

STaR requires ground-truth answers; in contrast, TPO requires only a reward model. These fundamental differences in data make a direct empirical comparison challenging. Furthermore, STaR and its variants were primarily evaluated on reasoning-intensive tasks, whereas our work focuses on general instruction-following capabilities.


We greatly appreciate the reviewer’s constructive feedback, which enhances the quality and clarity of our work. We respectfully ask the reviewer to reconsider their rating, given our revisions and clarifications.

Final Decision

This paper proposes a novel method for training models to develop reasoning abilities without requiring additional chain-of-thought (CoT) data annotation. By evaluating only the final outputs rather than the quality of the CoT itself, the proposed Thought Preference Optimization (TPO) method bypasses the challenges of directly evaluating CoT quality. The experimental results demonstrate that TPO significantly improves model performance on general-domain benchmarks.

All reviewers acknowledged the substantial performance gains achieved by this method and expressed interest in its ability to enhance performance even in non-reasoning domains (e.g., marketing, health, and general knowledge). However, some concerns were raised regarding the evaluation of the actual generated CoT quality and comparisons with alternative approaches. The authors addressed most of these concerns in their rebuttal.

While half of the reviewers strongly supported acceptance (4: Accepted), the other half leaned toward rejection (2: Weakly Rejected). In my assessment, although the paper could benefit from deeper analysis of the generated thought chains, the novelty of unsupervised CoT learning and the demonstrated performance improvements justify acceptance.