PaperHub
Overall rating: 6.0 / 10 (Poster; 4 reviewers; min 3, max 5, std 0.8)
Individual ratings: 4, 5, 3, 3
Confidence: 3.0
Novelty: 2.5 · Quality: 2.8 · Clarity: 2.8 · Significance: 2.5
NeurIPS 2025

Reinforced Context Order Recovery for Adaptive Reasoning and Planning

OpenReview · PDF
Submitted: 2025-04-30 · Updated: 2025-10-29
TL;DR

We propose an algorithm to recover the correct generation order from textual data and solve reasoning and planning problems adaptively without order annotations.

Abstract

Modern causal language models, followed by rapid developments in discrete diffusion models, can now produce a wide variety of interesting and useful content. However, these families of models are predominantly trained to output tokens with a fixed (left-to-right) or random order, which may deviate from the logical order in which tokens are generated originally. In this paper, we observe that current causal and diffusion models encounter difficulties in problems that require adaptive token generation orders to solve tractably, which we characterize with the $\mathcal{V}$-information framework. Motivated by this, we propose Reinforced Context Order Recovery (ReCOR), a reinforcement-learning-based framework to extract adaptive, data-dependent token generation orders from text data without annotations. Self-supervised by token prediction statistics, ReCOR estimates the hardness of predicting every unfilled token and adaptively selects the next token during both training and inference. Experiments on challenging reasoning and planning datasets demonstrate the superior performance of ReCOR compared with baselines, sometimes outperforming oracle models supervised with the ground-truth order.
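
As an illustration only (our sketch, inferred from the abstract; generate_adaptive_order, order_policy, and token_predictor are hypothetical names, not the paper's API), the adaptive generation loop described above can be pictured as follows:

# Fill whichever position the model currently finds easiest, then repeat with the
# enlarged context, yielding a data-dependent generation order.
def generate_adaptive_order(prompt, unfilled_positions, order_policy, token_predictor):
    context = dict(prompt)                # position -> token pairs given in the prompt
    remaining = set(unfilled_positions)
    order = []
    while remaining:
        # score every remaining position by its estimated "easiness" under the current context
        pos = max(remaining, key=lambda p: order_policy(context, p))
        context[pos] = token_predictor(context, pos)  # fill the chosen position
        order.append(pos)
        remaining.remove(pos)
    return context, order
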
Keywords
Adaptive Token Order, Autoregressive Models, Diffusion Models, Reinforcement Learning

Reviews and Discussion

Review (Rating: 4)

This paper proposes ReCOR, a reinforcement-learning-based framework that aims to resolve the performance bottleneck caused by fixed or random generation orders in traditional causal language models (CLMs) and masked diffusion models (MDMs) on complex reasoning tasks. By modeling order selection as a reinforcement learning problem and combining it with a multi-stream Transformer architecture, ReCOR achieves self-supervised learning of the generation order without labeled data. Experiments verify the effectiveness of the method on tasks such as arithmetic, Sudoku, and logic puzzles; on some tasks it even surpasses a baseline supervised with the ground-truth order.

Strengths and Weaknesses

1) For the first time, the generation-order problem is formalized as a reinforcement learning task of maximizing predictive $\mathcal{V}$-information. 2) The method significantly outperforms baselines on arithmetic tasks with non-causal dependencies (ARG, MUL) and on complex reasoning tasks (Sudoku, Zebra).

Questions

  1. Training complexity. RL relies in part on online policy sampling, which may require a large amount of interaction data. The authors mention using a pre-trained model for warm-starting, but do not specify a concrete strategy (such as imitation learning or curriculum learning).

  2. Generalization risk. The modification of the backbone model (such as GPT-2) by the multi-stream architecture may limit its transferability. If it is replaced with another architecture (such as LLaMA), would the hyperparameters need to be adjusted?

Limitations

  1. Interpretability of the generation order: We recommend visualizing the generation order ρ predicted by ReCOR on the Sudoku task (as shown on the right side of Figure 1) to verify whether it conforms to the "easy-to-hard" intuition.

  2. The calculation of $\mathcal{V}$-information in Eq. 7 depends on the shared token predictor $p_\psi$; the impact of its approximation error, relative to the ideal $\mathcal{V}$-information, on policy learning needs further discussion.

  3. Evaluation limitations: The experiments only use exact-match accuracy as a metric. For partially correct generations (such as a partially correct Sudoku), a more fine-grained evaluation (such as cell-level accuracy) could be added.

Formatting Issues

NO

Author Response

Thank you for your review! We address your concerns below.

Q1. Training procedure of ReCOR.

R1: We clarify that we do not use a pretrained model for initialization and train all methods from scratch. In contrast to other domains like robotics or games, the RL interaction procedure required by ReCOR amounts to transformer inference with the order policy, which can be implemented highly efficiently.

Q2. Generalization to other architectures.

R2: We note that ReCOR is specifically designed to be highly generalizable. The multi-stream design makes minimal modifications to the original transformer architecture, introduces few hyperparameters, and is compatible with any transformer variant, including LLaMA.

Q3. Visualization in Sudoku.

R3: Unfortunately, due to the NeurIPS rebuttal policy, we cannot display images or videos in the rebuttal. We instead show an example from the Sudoku test set below; it can be seen that ReCOR follows an easy-to-hard order, e.g.:

  • The given cells in the prompt are trivial and filled first;
  • The first non-trivial cell to be filled is the 17th one at (3, 2), which can be deduced from the prompt alone by elimination on number 9;
  • The last cells to be filled, e.g., the 81st at (4, 2), have very few clues initially in their row, column, and subgrid.
Puzzle:
 6  0  0  0  0  9  0  0  0 
 4  5  0  0  6  3  0  9  2 
 0  0  2  0  0  0  0  8  0 
 0  0  6  0  0  4  0  0  0 
 0  0  0  0  9  0  0  0  0 
 9  0  0  8  1  0  0  0  4 
 0  8  0  0  4  0  2  0  0 
 0  0  0  5  0  0  0  0  0 
 0  2  0  0  0  0  7  0  5 

Generation order:
15 68 71 35 26  4 59 62 70 
 6 14 25 31  2  9 30  8 13 
51 17  1 34 32 33 56 16 55 
64 81  7 37 75 21 49 58 69 
65 36 67 38 20 73 60 78 57 
22 74 72 10 24 77 54 79 18 
53  3 52 42 12 45 19 46 47 
50 23 48 11 80 76 61 66 63 
40  5 43 41 27 29 28 44 39 

Full response:
 6  1  3  2  8  9  5  4  7 
 4  5  8  7  6  3  1  9  2 
 7  9  2  4  5  1  3  8  6 
 8  7  6  3  2  4  9  5  1 
 2  4  1  6  9  5  8  7  3 
 9  3  5  8  1  7  6  2  4 
 5  8  7  1  4  6  2  3  9 
 3  6  9  5  7  2  4  1  8 
 1  2  4  9  3  8  7  6  5
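
For readers who want to check the example mechanically, the following small standalone script (ours, not the authors' code) parses the two grids above and verifies the claims: the first 16 generated cells are all given in the prompt, the 17th lands at (3, 2), and the last one at (4, 2).

puzzle = [
    [6, 0, 0, 0, 0, 9, 0, 0, 0],
    [4, 5, 0, 0, 6, 3, 0, 9, 2],
    [0, 0, 2, 0, 0, 0, 0, 8, 0],
    [0, 0, 6, 0, 0, 4, 0, 0, 0],
    [0, 0, 0, 0, 9, 0, 0, 0, 0],
    [9, 0, 0, 8, 1, 0, 0, 0, 4],
    [0, 8, 0, 0, 4, 0, 2, 0, 0],
    [0, 0, 0, 5, 0, 0, 0, 0, 0],
    [0, 2, 0, 0, 0, 0, 7, 0, 5],
]
order = [
    [15, 68, 71, 35, 26, 4, 59, 62, 70],
    [6, 14, 25, 31, 2, 9, 30, 8, 13],
    [51, 17, 1, 34, 32, 33, 56, 16, 55],
    [64, 81, 7, 37, 75, 21, 49, 58, 69],
    [65, 36, 67, 38, 20, 73, 60, 78, 57],
    [22, 74, 72, 10, 24, 77, 54, 79, 18],
    [53, 3, 52, 42, 12, 45, 19, 46, 47],
    [50, 23, 48, 11, 80, 76, 61, 66, 63],
    [40, 5, 43, 41, 27, 29, 28, 44, 39],
]

# Map generation step -> (row, col), 1-indexed.
step_to_cell = {order[r][c]: (r + 1, c + 1) for r in range(9) for c in range(9)}

# Steps 1-16 should all land on given (non-zero) cells of the prompt.
for step in range(1, 17):
    r, c = step_to_cell[step]
    assert puzzle[r - 1][c - 1] != 0, f"step {step} did not land on a given cell"

print("first non-given cell:", step_to_cell[17])  # (3, 2)
print("last cell generated:", step_to_cell[81])   # (4, 2)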

Q4. Approximation error of $\mathcal{V}$-information.

R4: We note that our $\mathcal{V}$-information estimator, the token predictor $p_\psi$, is trained in a supervised manner and learns much faster than the RL-trained order policy $\pi_\theta$. As a result, its estimates of $\mathcal{V}$-information are very accurate, as indicated by the superior performance of ReCOR. We can further improve the performance of ReCOR by using a more accurate estimate. For example, we can sample $K > 1$ actions at every state and supervise $p_\psi$ at all of these positions, reducing the approximation error and further improving performance, as shown in Sec. 5.6.
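
As an illustration of this K-action variant (our sketch; the function name and shapes are assumptions, not the paper's code), sampling several distinct positions from the order policy and supervising $p_\psi$ at all of them could look like:

import torch

def sample_k_positions(policy_logits, k=4):
    # policy_logits: scores over the currently unfilled positions, shape (num_unfilled,)
    probs = torch.softmax(policy_logits, dim=-1)
    k = min(k, probs.numel())
    # draw k distinct positions; the token predictor is then supervised at every one of them
    return torch.multinomial(probs, num_samples=k, replacement=False)

positions = sample_k_positions(torch.randn(10), k=4)
print(positions)  # indices of 4 distinct positions to supervise the token predictor at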

Q5. Evaluation metrics.

R5: For reasoning and planning problems, it is often the complete correctness of the solution that matters; for example, any single error in a long mathematical proof could invalidate the entire proof. We thus use full solution match as the primary evaluation metric, following the common practice of our baselines. To address your concerns more thoroughly, we provide a cell-level analysis on ARG, as shown in the table below. Note that this is a significantly easier setup than a full solution match. While the performance of baselines improves, ReCOR achieves a near-perfect result and still significantly outperforms the baselines.

Method        | ReCOR | CLM [1] | MDM [2,3] | AdaMDM [2,3]
Cell Accuracy | 0.999 | 0.505   | 0.868     | 0.912
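
For clarity, here is a minimal sketch (ours, not the authors' evaluation code) of the difference between the two metrics discussed above:

def full_match_accuracy(preds, targets):
    # fraction of examples whose entire predicted solution equals the target
    hits = sum(1 for p, t in zip(preds, targets) if list(p) == list(t))
    return hits / len(targets)

def cell_accuracy(preds, targets):
    # fraction of individual cells/tokens predicted correctly, pooled over all examples
    correct = sum(sum(pi == ti for pi, ti in zip(p, t)) for p, t in zip(preds, targets))
    total = sum(len(t) for t in targets)
    return correct / total

# Toy usage: one of two solutions fully correct, 7 of 8 cells correct overall.
preds = [[1, 2, 3, 4], [1, 2, 3, 9]]
targets = [[1, 2, 3, 4], [1, 2, 3, 4]]
print(full_match_accuracy(preds, targets))  # 0.5
print(cell_accuracy(preds, targets))        # 0.875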

[1] Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. "Language models are unsupervised multitask learners." OpenAI blog 1, no. 8 (2019): 9.

[2] Ye, Jiacheng, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, and Lingpeng Kong. "Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning." In The Thirteenth International Conference on Learning Representations (ICLR 2025).

[3] Kim, Jaeyeon, Kulin Shah, Vasilis Kontonis, Sham M. Kakade, and Sitan Chen. "Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions." In Forty-second International Conference on Machine Learning (ICML 2025).

Comment

Thanks for the response. I will keep my score.

Review (Rating: 5)

(Motivation) Language models are predominantly trained to output tokens with a fixed (left-to-right) or random order, which may deviate from the logical order in which tokens should ideally be used. Causal LMs always follow a rigid left-to-right generation paradigm, encountering many intractable tokens along the way, and random order or masked diffusion-based models struggle with performance from introducing large distribution shifts between training and inference or attending to sub-problems (the latter of which is also shown as a result in this work).

The authors create an objective based on $\mathcal{V}$-information to turn token order selection into a decision-making problem that can be formulated as an MDP and solved with soft Q-learning. The resulting algorithm is called ReCOR ("Reinforced Context Order Recovery"). The submitted work demonstrates that ReCOR outperforms or performs similarly to baselines and oracles on simple arithmetic tasks, synthetic autoregression, and puzzle datasets (Sudoku, Zebra), using GPT-2 as the backbone for all algorithms. The authors use their results to discuss algorithmic performance, balanced compute requirements and scaling, and the impact of adaptive ordering during inference and training.

Strengths and Weaknesses

Reasons to accept:

  • The paper is very well written and, to the best of my knowledge, original work (Clarity). As I am not familiar with most of the token-reordering literature, I will reduce my confidence to 3.

  • The authors provide a good theoretical description and background for the introduction of ReCOR and conduct a series of experiments to support all their claims. The proposed algorithm is of high quality and optimized for parallel computation (Quality).

  • The idea of using a learned and adaptive re-ordering for reasoning and planning problems is elegant and promising. The submission has the potential to advance the field of (low-compute, task-specific) reasoning and planning work significantly. (Significance).

Reasons to reject:

  • It is unclear to me how well the learned adaptive reordering generalizes across tasks or to more complex tasks. It would have been ideal to see ReCOR tested on something like (simple) coding benchmarks with smaller models in the 1.5-3B parameter regime.

Questions

  • I assume the results for puzzle benchmarks and arithmetic benchmarks were from different training runs and models. I would like to understand if ReCOR limits a trained model to a very specific narrow task or not. Would ReCOR perform better across the tested tasks than the baselines if trained on both? Can a ReCOR-trained GPT2 still perform general language generation?

  • Have the baselines been evaluated for harder tasks (e.g., coding) with larger models (larger than 1B parameters)? Based on your results, how would you expect ReCOR to perform in comparison at these scales for these tasks?

Limitations

Yes.

Justification of Final Rating

I had few concerns regarding the quality of this submission and only asked for clarifications on cross-task performance and on scaling the proposed algorithm to larger models.

I still believe this is an elegant and promising approach that can be the foundation of crucial future work and should be featured at NeurIPS 2025.

Formatting Issues

No.

Author Response

Thank you for your thoughtful review! We are glad to see that you think ReCOR is "elegant and promising". We address your questions below.

Q1. Limitation of training tasks.

R1: We note that ReCOR is not limited to a narrow task. Similar to standard causal language models, ReCOR can learn to model all of the training data provided to it without any annotation requirements. We support this statement with a mixed data experiment where we mix the training sets of ARG and MUL and retrain ReCOR and baselines. The average evaluation accuracies on both test sets are shown below. We can see that ReCOR learns from both datasets and achieves uniformly good performance, while baselines still struggle. We also note that ReCOR, like its baselines, is trained from scratch without initializing from pretrained GPT-2 checkpoints and never trained on general language data.

Method           | ReCOR | CLM   | MDM   | AdaMDM
Average Accuracy | 0.945 | 0.265 | 0.398 | 0.558

Q2. Scaling to larger models and problem domains.

R2: Our baselines primarily focus on the same reasoning and planning domains and have similar parameter scales to ReCOR. Scaling up model size and problem domains is a key future work that we are actively pursuing to further demonstrate the potential of ReCOR. For example, as mentioned in your review, we believe ReCOR can be applied to coding domains to automatically recover the hierarchical structure of code and enable better modelling of the training data with its order recovery capabilities, while our baselines are limited to fixed linear (causal language models) or random (masked diffusion models) structures.

Comment

Thank you for the clarifying comments! I will maintain my favorable score as I think this would be a good paper I'd like to see at NeurIPS.

Review (Rating: 3)

This paper studies the observation that, for certain tasks (such as arithmetic calculation and Sudoku), solving the task in an easy-to-hard order is preferable to the traditional left-to-right order. The paper proposes Reinforced Context Order Recovery (ReCOR), which predicts the token order before predicting the token at each selected position. Specifically, ReCOR leverages $\mathcal{V}$-information to cast token order recovery as a Markov decision process and solves it with Q-learning; at inference time, a multi-stream architecture is used for order and token prediction. Evaluated on various tasks (including synthetic autoregression, multiplication, Sudoku, and Zebra), the results show that a GPT-2-based ReCOR performs better than autoregressive methods (CLM) and diffusion methods (MDM, AdaMDM) and is close to an oracle-based AR-GT.

Strengths and Weaknesses

Strengths:

  1. This paper is well motivated and targets an interesting problem concerning the decoding order of autoregressive models.
  2. The proposed ReCOR is intuitive and straightforward, and achieves better performance on various tasks.

Weaknesses:

  1. The scope and generalization of the proposed method are limited. Typically I wouldn't criticize the base model used (in this case GPT-2) because I understand resource limitations. However, in this particular case, on the one hand, GPT-2 is a very shallow and small LM that is not representative of more recent LMs, whose additional layers may already take ordering into consideration (even though they still decode in a left-to-right order), especially when easy-to-hard prediction is easily targeted by reasoning models. On the other hand, GPT-2 is trained in an autoregressive way, which raises the question of how ReCOR, a framework that takes arbitrary orders into account, interacts with the pretrained weights. Furthermore, it is not clear how the baselines are implemented (e.g., how the CLM and MDM are trained, including the base model, training examples, architecture, and regularization). Moreover, although MDM is similar to ReCOR in the sense that it is not autoregressive, diffusion methods are designed from a totally different perspective (and are not a meaningful baseline). Therefore, it is not convincing to me that this work would inspire a wide range of audiences interested in similar topics.
  2. Some details are missing. See details below.

Questions

  1. Equations 4-7 define how the ordering objective is built up from entropy to $\mathcal{V}$-information, but how is Eq. 8 derived from Eq. 7? I.e., how do you model $H_\mathcal{V}(B \mid \varnothing)$? And how do you condition on arbitrary contexts, given that you mention the naive solution is not tractable?
  2. In Eq. 13, where does the language modeling objective come from? What is considered the ground truth?
  3. Why is the threshold in Eqs. 14-15 required? What is the interpretation of adding this threshold?

Limitations

yes

Justification of Final Rating

Because of the concerns I raised, including the generalization of the proposed method, which requires more careful analysis and changes to the narrative, I believe a major revision is required to make the paper beneficial to a broader audience.

Formatting Issues

NA

Author Response

Thank you for your review! We are happy to see that you think ReCOR is "intuitive", "well motivated", and "targets an interesting problem." We address your concerns below.

Q1. Comparison with LMs with more layers and reasoning models.

R1: We note that simply scaling up under the next-token prediction paradigm is insufficient for handling the order problem; furthermore, reasoning models require suitable CoT annotations to warm up, and they find it extremely challenging to solve novel, unfamiliar problems. To demonstrate the ineffectiveness of existing state-of-the-art large reasoning models, we perform an experiment with o3 on ARG. We design two prompts where prompt A includes the actual generation rule of the dataset and examples, and prompt B includes in-context examples only. Note that in our main experiments, only the generated data is available to the models without the ground-truth generation rule, corresponding to the harder scenario B. We evaluate 50 random samples from the test set and allocate a token budget of 10000 tokens each. As a result,

  • Under prompt A with ground-truth rules, o3 achieves only 0.3 success rate, where the majority of failures come from depletion of token budgets. Even considering only the completely generated solutions, the success rate of o3 is 0.938 and still lower than ReCOR's 0.987. Furthermore, o3 takes ~4min on average to solve each individual question, while ReCOR's inference procedure solves the entire 1000-sample test set within 10 seconds.
  • Under prompt B without access to ground-truth rules, o3 fails to recover the correct data generation algorithm and has 0 success rate. We stress again that ReCOR operates under this harder setting without any additional annotations.

In conclusion, we show that scale alone is inefficient and insufficient, while our ReCOR solves the problem effectively and efficiently.

Q2. Interaction with pretrained weights.

R2: We note that ReCOR is a framework for pre-training from scratch without relying on any pretrained weights. We consider fine-tuning from existing checkpoints as an important direction for future work, where we can adapt left-to-right pretrained weights into an adaptive-order model. This has been shown to be feasible by recent literature, e.g., [1] adapts a LLaMA checkpoint to be a masked diffusion model, but doing so qualifies as an independent contribution and is out of scope for the current work.

Q3. Training details of baselines.

R3: We have included training and architecture details of all methods in the supplementary material. To summarize here, we largely followed the training setups of our baselines, in particular [2,3], and built upon their released codebase to ensure a fair comparison. All methods (including CLM and MDM) are trained from scratch and use the same GPT-2 backbone architecture, training data (except for the oracle AR-GT, which has additional access to privileged order information), and weight decay regularization of 0.1.

Q4. Comparison between ReCOR and MDM.

R4: We'd like to clarify that ReCOR is autoregressive; it is just not left-to-right, differing from standard causal language models. For a complete view of related works, we compare ReCOR with diverse baselines, including both causal language models and masked diffusion models, and show that ReCOR outperforms all of these approaches. We additionally provide detailed analysis and comparison between ReCOR and MDM in Sec. 5.4 and 5.5, extending the results of recent high-profile works in diffusion language modelling [2, 3]. We believe our algorithms and findings are of value to these communities and beyond.

Q5. Modelling of $H_\mathcal{V}(B \mid \varnothing)$; conditioning on arbitrary contexts.

R5: Since $H_\mathcal{V}(B \mid \varnothing)$ conditions on an empty set, it remains a constant with respect to the generation order $\rho$, and consequently with respect to our optimization parameter $\theta$ in Eq. 7. Thus we can remove the term from our objective, leading to the simplification from Eq. 7 to Eq. 8. For the conditioning on arbitrary contexts, as shown in Fig. 2, we feed the context consisting of a set of tokens at arbitrary positions into the transformer. The token is represented by vocabulary embeddings while the position is represented by additive absolute positional embeddings, as stipulated by GPT-2. This design allows us to tractably encode and condition on an arbitrary context with the same set of parameters, without having to instantiate a separate model for each set of positions.
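
To make the mechanism above concrete, here is a minimal PyTorch-style sketch (ours, not the paper's implementation; the class and parameter names are assumptions): a set of tokens at arbitrary positions is encoded by summing vocabulary embeddings with learned absolute positional embeddings, so one shared transformer handles any subset of positions.

import torch
import torch.nn as nn

class ArbitraryContextEncoder(nn.Module):
    def __init__(self, vocab_size, max_len, d_model=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # GPT-2-style learned absolute positions
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens, positions):
        # tokens, positions: (batch, context_size); positions may be any subset of 0..max_len-1
        h = self.tok_emb(tokens) + self.pos_emb(positions)
        return self.encoder(h)  # (batch, context_size, d_model)

# Usage: condition on tokens at positions {3, 17, 40} only.
enc = ArbitraryContextEncoder(vocab_size=12, max_len=128)
out = enc(torch.tensor([[5, 9, 2]]), torch.tensor([[3, 17, 40]]))
print(out.shape)  # torch.Size([1, 3, 256])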

Q6. Derivation and training of $L_\text{LM}$ (Eq. 13).

R6: The language modelling objective $L_\text{LM}$ is directly derived from the overall objective Eq. 8; in particular, we optimize Eq. 8 with respect to $\psi$ using $L_\text{LM}$. The ground-truth response token sequence $\mathbf{y}$ comes from the training data, which is the textual corpus we're training on.

Q7. Interpretation of the thresholded reward (Eq. 14-15).

R7: We can interpret the threshold as a form of reward shaping, a commonly used technique in the field of reinforcement learning to improve learning efficiency. Intuitively, in ReCOR, we use the threshold to provide a sharper, binary distinction between "easy" and "hard" tokens to assist policy learning.
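
One plausible reading of this (our illustration only; the exact form of Eqs. 14-15 is defined in the paper) is that a raw score such as the predictor's log-likelihood of the ground-truth token at the chosen position is binarized at a threshold tau:

import math

def shaped_reward(log_prob_of_gt_token, tau=math.log(0.5)):
    # binary reward: 1.0 if the predictor finds the chosen token "easy" enough, else 0.0
    return 1.0 if log_prob_of_gt_token > tau else 0.0

print(shaped_reward(math.log(0.9)))  # 1.0 -> treated as an easy token
print(shaped_reward(math.log(0.1)))  # 0.0 -> treated as a hard token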

[1] Gong, Shansan, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An et al. "Scaling Diffusion Language Models via Adaptation from Autoregressive Models." In The Thirteenth International Conference on Learning Representations.

[2] Ye, Jiacheng, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, and Lingpeng Kong. "Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning." In The Thirteenth International Conference on Learning Representations.

[3] Kim, Jaeyeon, Kulin Shah, Vasilis Kontonis, Sham M. Kakade, and Sitan Chen. "Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions." In Forty-second International Conference on Machine Learning.

Comment

Q1. Comparison with LMs with more layers and reasoning models. In o3, the reasoning traces are hidden, but this is a good direction (although I don't think comparing latency is meaningful here, because I understand you are using a GPT-2 model, which is much more efficient, and it could be more impressive to achieve similar or better scores). I would like to see more comparisons to models such as DeepSeek-R1, where, through reasoning training (rather than simply asking the model to "think step-by-step" in the most naive CoT setup), it is very likely that the autoregressive models already take ordering into consideration. You can verify this from the full reasoning traces on tasks such as Sudoku. I am not saying that you need to perform better than these large models with your proposed method, but it would be necessary to understand the difference, since researchers in similar areas are well aware of these methods and need corresponding comparisons.

Q3: Sorry, I didn't find the appendix when reading the paper. I strongly suggest attaching the appendix to the main paper in your revision, which makes it easier to trace. Since you implement all baselines from scratch, can you add language modeling benchmarks as well (such as the ones used in the GPT-2 paper) so that the audience can get a fair sense of the model?

Q4. Comparison between ReCOR and MDM. I apologize if my original phrasing was not clear. I understand how ReCOR works in the sense of "not just left-to-right". What I meant was that it is not fundamentally comparable to a diffusion model, which in the literature has not achieved quality comparable to autoregressive models in general.

Comment

Thank you for the additional comments! Our response is as follows.

R1 (continued): We would like to clarify that we agree reasoning models can handle orders to some extent. In our previous experiment with o3, with explicit ground-truth inference rules, o3 can execute the computation in the desired order, although the performance is worse than ReCOR. Our core arguments regarding reasoning models are as follows:

  1. Reasoning models require CoT annotations to warm up before they can learn to deduce the correct order. When facing entirely novel problems, reasoning models are prone to failures. This is evidenced by our experiment with o3, where it failed to solve ARG without access to the ground-truth rules (prompt B). This phenomenon is also observed in recent literature [1] showing that reasoning training struggles to elicit fundamentally new reasoning patterns and relies on existing reasoning patterns in the pretraining data. ReCOR, on the other hand, can learn to extract the correct order from pretraining data without any annotations.

  2. Even with access to the correct algorithm for solving the problem (during training or inference), the execution of such orders can still be slow and unreliable. This is shown in our experiment with o3 (prompt A). To further demonstrate this, we also task DeepSeek-R1 with solving Sudoku puzzles, as you suggested. While the reasoning trace (too long to be shown here) contains deliberations about which cells to fill, the final answer is wrong and contains conflicts after 20 minutes of thinking (27670 tokens). Closer inspection of the reasoning trace reveals that while the model is aware of the overall principle, it often cannot pinpoint the easiest cell.

R3 (continued): Thank you for your suggestion! We apologize for the inconvenience and will merge the supplementary material PDF with the main text when we have the chance. Regarding language modelling benchmarks, we note that ReCOR focuses on hard reasoning and planning problems and is trained on such datasets, following the setups of our baselines [2,3]. The training dataset does not include general language data, and as a result, standard language modeling benchmarks do not apply here. We consider scaling up ReCOR to more diverse problem domains to be an important direction of future work.

R4 (continued): We do agree that MDMs are quite different from ReCOR, and dedicate Sec. 5.4 and 5.5 to discussions about the differences (and the consequences thereof) between ReCOR and MDM. We compared with MDM primarily because it is the most relevant and strongest baseline we could find in terms of the token ordering problem [5,6,7]. For example, [2,3,4] are all very recent, state-of-the-art works that also focus on the token ordering problem under reasoning and planning scenarios, of which [2] is an ICML 2025 outstanding paper. We compared ReCOR with them and argued about the need for adaptive token ordering during training in Sec. 5.4. In addition, we note that MDMs are actually stronger than standard autoregressive, causal language models (CLMs) without access to ground-truth orders under our setup. We found CLMs to underperform AdaMDM (and ReCOR) due to their inability to handle adaptive orders.

Thank you again for your time and engagement in the discussion period! We believe this is vital for a healthy scientific process and hope our response can address your remaining concerns.

[1] Yue, Yang, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. "Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?." arXiv preprint arXiv:2504.13837 (2025).

[2] Kim, Jaeyeon, Kulin Shah, Vasilis Kontonis, Sham M. Kakade, and Sitan Chen. "Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions." In Forty-second International Conference on Machine Learning (ICML 2025).

[3] Ye, Jiacheng, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, and Lingpeng Kong. "Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning." In The Thirteenth International Conference on Learning Representations (ICLR 2025).

[4] Shah, Kulin, Nishanth Dikkala, Xin Wang, and Rina Panigrahy. "Causal language modeling can elicit search and reasoning capabilities on logic puzzles." Advances in Neural Information Processing Systems 37 (2024): 56674-56702.

[5] Bachmann, Gregor, and Vaishnavh Nagarajan. "The Pitfalls of Next-Token Prediction." In Forty-first International Conference on Machine Learning.

[6] Zhang-Li, Daniel, Nianyi Lin, Jifan Yu, Zheyuan Zhang, Zijun Yao, Xiaokang Zhang, Lei Hou, Jing Zhang, and Juanzi Li. "Reverse that number! decoding order matters in arithmetic learning." arXiv preprint arXiv:2403.05845 (2024).

[7] Golovneva, Olga, Zeyuan Allen-Zhu, Jason E. Weston, and Sainbayar Sukhbaatar. "Reverse Training to Nurse the Reversal Curse." In First Conference on Language Modeling.

Comment

R1: I would suggest adding more of this kind of analysis in your revision, including why you think natural reasoning patterns (e.g., DeepSeek-R1) would not solve the task while the proposed method can, and whether the proposed method can generalize to broader tasks in the way those reasoning models, including R1, do.

R3: If the baselines do not include general language modeling tasks and are trained from scratch, how can an audience evaluate the fairness of the baselines (and, accordingly, of the proposed method)? How would a vanilla baseline (e.g., GPT-2) perform? Why can't you add language modeling tasks during model training?

Comment

R1 (continued): Thank you for your suggestion! We will update the main text with more analysis and discussions. We note that ReCOR performs reinforcement learning during pretraining and allows the model to control the order of its response, while standard language models and reasoning models do not have this capability to control orders by taking actions. Therefore, they can only rely on existing patterns in the data to perform reasoning about orders, which is inefficient and error-prone, while ReCOR leaves much more room for exploration. Furthermore, we note that these post-training techniques are orthogonal to ReCOR and can also be applied to ReCOR-pretrained models.

Regarding the scope of tasks, we have conducted a mixed-dataset experiment that trains ReCOR and baselines on ARG and MUL simultaneously. The average evaluation accuracies on both test sets are shown below. We can see that ReCOR learns from both datasets and achieves uniformly good performance, while baselines still struggle. Further scaling up the scope of tasks to other scenarios, e.g. coding, more complex math, or even multimodal perception, is an important direction for future work that we are actively pursuing.

Method  | ReCOR | CLM   | MDM   | AdaMDM
Average | 0.945 | 0.265 | 0.398 | 0.558

R3 (continued): We feel there has been a misunderstanding here. The core contribution of our work is a new training paradigm focused on the token ordering problem, which has recently attracted much attention (see references above and in the main text) as one of the important caveats of the dominating next-token prediction paradigm. We choose reasoning and planning problems as suitable testbeds to demonstrate this problem and the performance of our solution.

We do not claim to have a model with better general language generation capabilities, which would require extensive training at a much larger scale and is also beside the point for the current paper. In particular, we use the GPT-2 architecture rather than the (pretrained) GPT-2 model per se, and do not finetune from pretrained checkpoints. The reason we choose this setup, like our baselines, is that pretraining on a general language corpus likely won't help with the reasoning and planning tasks. To clarify, we do compare with the vanilla GPT-2 architecture trained from scratch with the standard next-token prediction objective, which is the CLM baseline in our paper.

As for the pretrained GPT-2, we pull the GPT-2 checkpoint (124M) from huggingface and evaluate it on the MUL dataset. This is arguably the easiest one for GPT-2 since multiplication is likely present in its training corpus, and this checkpoint has an order of magnitude more parameters than ReCOR in the main text. However, pretrained GPT-2 performs abysmally and obtains a success rate of 0. This is expected since even much larger pretrained models like o3 and R1 struggle to solve the problems, as shown by the experiments in our previous responses. These negative results shed light on the inherent limitations of the next-token prediction objective, which form the fundamental motivation of ReCOR.

Review (Rating: 3)

The paper proposes Reinforced Context Order Recovery (ReCOR), a self-supervised framework that learns to adaptively determine the optimal token generation order without explicit order annotations. ReCOR estimates the hardness of predicting every unfilled token and adaptively selects the next token during both training and inference. The authors show that their method outperforms baselines on arithmetic problems and the logic puzzles Sudoku and Zebra.

Strengths and Weaknesses

Strengths:

  • The problem is well motivated and automatically recovering the correct generation order from textual data is very important in some planning tasks.

  • The proposed method jointly optimizes the token prediction model and the order prediction policy, generating rewards with the former as self-supervision for the latter.

  • The proposed method solves arithmetic problems without special data preprocessing.

Weaknesses:

  • The paper formulates automatically recovering the correct generation order from textual data as a decision-making problem and uses RL techniques to solve it. Even though the problem is well motivated and seems interesting, I am not sure how this method can scale. Specifically, how can it be compared against existing large models? What is the potential here? It might be the case that it is able to recover the correct generation order, but other capabilities might be sacrificed.

  • In the experiments, details about the model architecture and training are missing.

  • I think that, since the proposed approach changes the training of the model, the limitations need to be extended to cover scalability, comparisons of other capabilities, and even comparisons with existing models at different scales.

  • Some terms or abbreviations seem unconventional and lack references: for example, L_SQL (I think it means soft Q-learning) and the $\mathcal{V}$-information framework.

Questions

Please refer to weaknesses above.

Limitations

yes

Formatting Issues

no

Author Response

Thank you for your time in reviewing our paper! We are glad to see that you regard our problem as "well motivated and interesting". We address your concerns below.

Q1. Scalability of method; potential when compared with existing large models.

R1: We have demonstrated the scalability of ReCOR in Section 5.6 with compute scaling experiments. ReCOR can scale along multiple axes, improving with more training and test-time compute. We note that ReCOR is highly scalable due to its self-supervised nature; ReCOR requires only purely textual data to train without any additional annotations, making it as general as standard pretraining objectives like next-token prediction. Compared with existing large models that only learn to predict the next token, ReCOR-trained models can predict orders in addition to tokens, capturing structures in the training data, which in turn enhances the token prediction capability, as shown in our experiments.

Q2. Architecture and training details.

R2: We have included the architecture and training details in our supplementary material PDF with pseudocode in the appendix of the main text. To summarize, we largely followed the training setups of our baselines, in particular [1], and built upon their released codebase to ensure a fair comparison. All methods are trained from scratch and use the same GPT-2 architecture, training data (except for the oracle AR-GT, which has additional access to privileged order information), and weight decay regularization of 0.1.

Q3. Comparison with models of different scales.

R3: We note that simply scaling up under the next-token prediction paradigm is not sufficient for handling the order problem; furthermore, reasoning models require suitable CoT annotations to warm up, and they find it extremely challenging to solve novel, unfamiliar problems. To demonstrate the ineffectiveness of existing state-of-the-art large reasoning models, we perform an experiment with o3 on ARG. We design two prompts where prompt A includes the actual generation rule of the dataset and examples, and prompt B includes in-context examples only. Note that in our main experiments, only the generated data is available to the models without the ground-truth generation rule, corresponding to the harder scenario B. We evaluate 50 random samples from the test set and allocate a token budget of 10000 tokens each. As a result,

  • Under prompt A with ground-truth rules, o3 achieves only a 0.3 success rate, where the majority of failures come from depletion of token budgets. Even considering only the completely generated solutions, the success rate of o3 is 0.938 and is still lower than ReCOR's 0.987. Furthermore, o3 takes ~4min on average to solve each individual question, while ReCOR solves the entire 1000-sample test set within 10 seconds.
  • Under prompt B without access to ground-truth rules, o3 fails to recover the correct data generation algorithm and has 0 success rate. We stress again that ReCOR operates under this harder setting without any additional annotations.

In conclusion, we show that scale alone is inefficient and insufficient, while ReCOR solves the problem effectively and efficiently.

Q4. Terminology references.

R4: We apologize for any confusion; $L_\text{SQL}$ indeed means soft Q-learning, which we cite as [33] in Sec. 4.3, lines 161 and 166, before we define the soft Q-learning loss function in Eq. 11. $\mathcal{V}$-information is cited as [30] in Sec. 4.1, line 120, where we invoke the theoretical formulation from existing literature to characterize the token hardness problem. We will revise the manuscript for better clarity.
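
For reference, the generic soft Q-learning objective from the RL literature (the paper's Eq. 11 may differ in notation and details) regresses the Q-function onto a soft Bellman target:

$$\mathcal{L}_\text{SQL} = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1})}\Big[\Big(Q_\theta(s_t, a_t) - r_t - \gamma\,\alpha \log \sum_{a'} \exp\big(Q_{\bar\theta}(s_{t+1}, a')/\alpha\big)\Big)^2\Big],$$

where $\alpha$ is the entropy temperature and $Q_{\bar\theta}$ denotes a target network.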

[1] Ye, Jiacheng, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, and Lingpeng Kong. "Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning." In The Thirteenth International Conference on Learning Representations (ICLR 2025).

Comment

Dear Reviewer,

This is a gentle reminder to carefully read the rebuttal and respond to the authors about whether it addresses the concerns.

Final Decision

A. For certain tasks (such as arithmetic calculation and Sudoku), solving the task in an easy-to-hard order is preferable to the traditional left-to-right order. The paper uses RL to predict the token order before filling in the token at each predicted position. Results are shown on several tasks (e.g., multiplication, Sudoku, Zebra), comparing to autoregressive methods and diffusion models.

B. Reviewers found the paper well written and the method a well motivated approach to an important problem. They also appreciated the strong results, the theoretical justification, and the fact that the method could be parallelized.

C. Reviewers raised questions about scalability to larger models, missing experimental details, and the scope and generalization of the method (only shown on GPT-2).

D. I think that the authors did a sufficient job addressing the reviewer concerns, and that the paper should be accepted. Reviewer UpeJ did not engage in the discussion; another reviewer and I believe that their concerns have been addressed, so we're treating the actual scores as 5/4/"4"/3. I also feel that the primary concerns in Reviewer MWbF's review (score=3) have been addressed in the rebuttal. I agree with all reviewers that the empirical results are good, but I feel the improvements are actually rather marginal (e.g., compare with AdaMDM in Table 2). However, the paper has a great discussion of the comparison with AdaMDM, including some nice conceptual experiments.

E. During the rebuttal, the authors answered reviewer questions, provided results with the o3 model, and demonstrated that the method can generalize across multiple tasks and datasets. The reviewers who engaged in the discussion said that some of their concerns had been addressed. However, one reviewer said that concerns about the generalization of the approach and the correctness of the analysis remained. The authors contested reviewer MWbF's understanding of the paper.