PaperHub
Overall rating: 6.5 / 10 · Poster · 4 reviewers
Ratings: 6, 6, 8, 6 (min 6, max 8, std 0.9)
Confidence: 3.3 · Correctness: 3.0 · Contribution: 3.0 · Presentation: 2.5
ICLR 2025

Progress or Regress? Self-Improvement Reversal in Post-training

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-28
TL;DR

A complex study of iterative self-improvement in Large Language Models (LLMs), focusing on various fine-tuning strategies and their effectiveness.


Keywords
Iterative Self-improvement · Problem-solving AI

Reviews and Discussion

Official Review
Rating: 6

This work goes beyond pass@1 to evaluate self-improvement methods for mathematical reasoning in greater depth. The authors show that improvements in benchmark performance from self-improvement are accompanied by signs of regression in areas such as output diversity and OOD generalization. They develop a new evaluation framework for assessing current self-improvement methods.

Strengths

  • very relevant topic for the conference
  • good empirical results and comprehensive experiments
  • interesting findings that can inspire future research directions:
    • most methods decline in task performance after ~4 iterations
    • declined abilities in output diversity and OOD generalization
    • less powerful models gain more, but more powerful models achieve better maximum performance
    • M_1 performance affects subsequent iterations

Weaknesses

  • (minor) In the introduction, it wasn't apparent why the new metrics are important. For example, why is diversity needed in solving these reasoning tasks?

Questions

  • For Table 3, can you label the x-axis as sampling N? Also, what model is used here?
Comment

Weakness (minor): In the introduction, it wasn't apparent why the new metrics are important. For example, why is diversity needed in solving these reasoning tasks?

Thank you for your suggestions. We have already introduced the purpose of defining new metrics in the introduction:

  • Improvement Set: This helps us understand where the improvements come from during the iterative posttraining process. It further shows that iterative self-posttraining is essentially reranking the solutions, increasing the model's pass@1 but not necessarily improving pass@n.
  • Solution Diversity: Solution diversity helps the model become more robust in problem-solving. It allows the model to verify and self-check through multiple solution approaches. Furthermore, in the context of inference scaling laws, solution diversity helps improve model performance during the inference stage.
  • OOD Generalization: Testing the model's out-of-distribution (OOD) generalization ability.

Question: For Table 3, can you label the x-axis as sampling N? Also, what model is used here?

  • Thank you for your reminder. We have added the x-axis label in Figure 3.
  • As mentioned at the beginning of Section 5, the models used throughout Section 5 are all Mistral-7B.
Official Review
Rating: 6

This work studies the post-training steps of supervised fine-tuning and preference optimization during iterative self-improvement, where an LLM improves by training on its own generated solutions. The authors claim that such iterative post-training may show progress according to traditional top-1 test accuracy measures, but may also negatively impact semantic diversity and OOD performance on other tasks. They propose an iterative self-improvement algorithm which, given a training set of inputs and ground-truth outputs, iteratively fine-tunes a model on self-generated outputs labelled according to whether they are correct or incorrect, serving as synthetic preferences similar to RLAIF. The authors compare three methods of fine-tuning: SFT, DPO, and alternating SFT-DPO. They find that improvement plateaus for most tasks after ~3 iterations before declining, along with a few other findings. Next, they show that while accuracy on the test set increases over post-training, diversity across a number of metrics may be reduced, along with accuracy on OOD tasks.

Strengths

  • Well-motivated and timely. Review of related work is thorough.
  • Iterative self-improvement post-training algorithm and most of the analyses seem general enough to be applied to arbitrary domains with minimal changes.
  • Interesting reversal results with diversity and OOD generalization decreasing during post-training. Used a variety of diversity metrics.

Weaknesses

  • I'm not sure I understand Improvement Sets (Section 5.1). Perhaps I'm missing something here, but in my understanding of figure 3, it seems somewhat trivial that accuracy@N increases with greater N.
  • I'm also not sure I understand the point of Group Disparity (Section 5.3). Why not just compare level 1 accuracy vs. level 5 accuracy directly, if level 1 accuracy is increasing and level 5 is decreasing?
  • Correct answer coverage seems potentially at odds with solution diversity, so I'm not exactly sure how to resolve these two metrics. Is it possible that better performing models always have lower solution diversity?
  • Many of the results figures could be improved, including with longer captions:
    • Figure 1 - It's currently hard to compare SFT vs. DPO vs. SFT-DPO across (a) - (c) since different rows have different scales for the y-axis.
    • Figure 2 - The left figure took some effort for me to understand. Maybe example questions/answers would help.
    • Figure 3 - The x-axis is unlabeled and the caption is very short. As stated above, I also wasn't sure what the main takeaway of this is.
    • Figure 5 - The bottom y-axis has a very small range [90, 95], and I don't have a good sense for what the scale of these values should be. The caption for this also seems too short, and I don't know what the blue dotted line means.

Questions

  • Figure 3: why is it surprising that pass@N accuracy increases with greater N?
  • How does correct answer coverage relate to solution diversity? Are the two at odds with each other?
  • Is solution diversity inherently desirable in domains like mathematical reasoning? Is it possible that higher performing models on certain domains always have lower diversity?
Comment

Question3: Is solution diversity inherently desirable in domains like mathematical reasoning? Is it possible that higher performing models on certain domains always have lower diversity?

In the field of mathematics, accuracy is undoubtedly the most important evaluation metric. However, diversity still holds significance:

  • Having multiple ways to solve a problem can increase the robustness of the solution. If one method fails or is flawed, another approach can be used to verify or correct it.
  • Different solutions can serve as checks against each other. If multiple methods yield the same result, it increases confidence in the correctness of the solution.
  • In the context of the growing emphasis on inference-time scaling laws, lower solution diversity is detrimental to inference-time scaling. Models with poor diversity are not data-efficient.

Moreover, as we mentioned earlier, in the process of iterative self-posttraining, the model's performance improves while diversity decreases, but this does not imply that all models follow this pattern.

Finally, we emphasize that we choose mathematics because of its strong verifiability. Iterative self-posttraining can be extended to other reasoning domains, such as agent planning, which may greatly benefit from solving problems with diverse solutions.

Comment

Weakness 4: Many of the results figures could be improved, including with longer captions (see the per-figure points on Figures 1, 2, 3, and 5 in the review above).

  1. We have updated Figure 1 to ensure that different rows have the same y-axis scale.

  2. Regarding the issue in Figure 2, in Appendix E, we provide a real example of how the model's output on a specific problem changes during the iterative SFT process. This allows us to observe how the model's output evolves on a real problem as iterations increase. For the left panel of Figure 2, we have added more text to clarify the information conveyed by the image.

  3. In Figure 3, we have added a label to the x-axis indicating the number of samples N. Additionally, we have included more experimental details in the caption and emphasized the key message the figure aims to convey: "As N increases, pass@N quickly approaches 100%."

  4. Since the MATH benchmark is difficult for Mistral-7B and the accuracy difference between Level 1 and Level 5 is large, the expression (Acc(Level 1) - Acc(Level 5)) / Acc(Level 1) concentrates between 90% and 100%. The blue dashed line represents the group disparity at iteration 5, which is always higher than at iteration 1, indicating that iteration 5 focuses more on solving simpler problems rather than more complex ones. We have added more details to the captions, including experimental setup information for both figures, and have highlighted important points to improve readability.

Comment

Weakness 3: Correct answer coverage seems potentially at odds with solution diversity, so I'm not exactly sure how to resolve these two metrics. Is it possible that better performing models always have lower solution diversity?

Question2: How does correct answer coverage relate to solution diversity? Are the two at odds with each other?

Your understanding is thorough and detailed, and we understand your concern. During the iterative self-posttraining process, the model's correct outputs are repeatedly used to reinforce the model, which improves performance but can reduce the diversity of the model's outputs. This is the phenomenon we have observed and the problem with iterative self-posttraining. We encourage future work to address this issue by improving diversity while maintaining performance, for example, by using diversified sampling to increase data diversity and thereby sustain model diversity.

Furthermore, we would like to clarify that the trade-off between performance and diversity only applies within the context of iterative self-posttraining. In a broader sense, this trade-off does not hold. For instance:

  • Llama2-7b-gsm8k and Llama3-8b-gsm8k (models fine-tuned with GSM8K on Llama2-7b and Llama3-8b, respectively) show almost no difference in diversity on the GSM8K dataset, but their accuracy rates are significantly different, at 42% and 67%, respectively.
  • Comparing Llama2-7b with iterative SFT on GSM8K for 4 rounds (Llama2-7b-gsm8k-iter4) with Llama3-8b-gsm8k (which undergoes only a single round of SFT on GSM8K without iterative self-posttraining), the former shows both lower diversity and accuracy compared to the latter.

Therefore, in more general comparisons, solution diversity and model accuracy are not inherently related.

Comment

Weakness 2: I'm also not sure I understand the point of Group Disparity (Section 5.3). Why not just compare level 1 accuracy vs. level 5 accuracy directly, if level 1 accuracy is increasing and level 5 is decreasing?

Apologies for any potential lack of clarity. To provide a more detailed definition, the "group disparity" metric is used to evaluate the model's capability to solve complex problems. It helps determine whether the model can only solve simple problems (e.g., achieving high scores at MATH level 1 but low scores at MATH level 5), or whether it is capable of solving more difficult problems. A model that can only solve simple problems is not desirable, as it has little potential to tackle more complex challenges. If, during the iterative self-posttraining process, the group disparity continually increases, it suggests that the model is focusing more on solving easy problems rather than improving its ability to solve more complex ones.

Additionally, regarding why the definition of group disparity is expressed as $\frac{\mathrm{Acc}(\text{Level 1}) - \mathrm{Acc}(\text{Level 5})}{\mathrm{Acc}(\text{Level 1})}$ instead of $\mathrm{Acc}(\text{Level 1}) - \mathrm{Acc}(\text{Level 5})$, consider the following scenario: if the original model has Acc(Level 1) = 40 and Acc(Level 5) = 10, and after iterative self-posttraining the model has Acc(Level 1) = 50 and Acc(Level 5) = 20, using the second formula would result in a group disparity of 30 in both cases. However, in reality, after iterative self-posttraining, the model correctly solved twice as many problems at Level 5, while at Level 1, it only solved 25% more problems. Thus, we normalize the metric you proposed to better capture the model's improvement in solving complex problems during the iterative self-posttraining process.
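Working through those numbers makes the point concrete: before iterative self-posttraining the normalized disparity is $\frac{40 - 10}{40} = 0.75$, while afterwards it is $\frac{50 - 20}{50} = 0.60$, so the metric registers the relative improvement at Level 5; the unnormalized difference, by contrast, stays at $40 - 10 = 50 - 20 = 30$ in both cases.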

Comment

Thank you for your detailed comments.

Weakness 1: I'm not sure I understand Improvement Sets (Section 5.1). Perhaps I'm missing something here, but in my understanding of figure 3, it seems somewhat trivial that accuracy@N increases with greater N.

Question1: Figure 3: why is it surprising that pass@N accuracy increases with greater N?

We apologize for not clearly defining the concept of the improvement set. Here, we provide a more intuitive definition.

  1. In the iterative post-training process, we start with an initial SFT model, denoted as $\mathcal{M}_1$. After several rounds of self-posttraining on $\mathcal{M}_1$, we obtain a new model, $\mathcal{M}_t$.
  2. In the greedy decoding and pass@1 evaluation setup, both $\mathcal{M}_1$ and $\mathcal{M}_t$ correctly answer some questions from the test set, $\mathcal{D}_{\text{test}}$. Let the sets of correctly answered questions for $\mathcal{M}_1$ and $\mathcal{M}_t$ be denoted as $\mathcal{D}_1$ and $\mathcal{D}_t$, respectively.
  3. The improvement set (IS) can be defined as $\text{IS}(t) = \mathcal{D}_t - \mathcal{D}_1$.
  4. Intuitively, this set consists of the questions that $\mathcal{M}_t$ answers correctly but $\mathcal{M}_1$ does not, i.e., the questions where $\mathcal{M}_t$ shows improvement over $\mathcal{M}_1$.
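For concreteness, the set relation above can be sketched in a few lines of Python; the question ids and the `sample_solution` / `is_correct` helpers below are hypothetical placeholders for illustration, not the paper's actual code:

```python
# Minimal sketch of the improvement set IS(t) and of M_1's pass@N on it.
# correct_m1 / correct_mt are the sets of test questions answered correctly
# under greedy decoding by M_1 and M_t (hypothetical ids).
correct_m1 = {"q1", "q2", "q5"}            # D_1
correct_mt = {"q1", "q2", "q3", "q7"}      # D_t

improvement_set = correct_mt - correct_m1  # IS(t) = D_t - D_1
print(sorted(improvement_set))             # ['q3', 'q7']

def pass_at_n(questions, n, sample_solution, is_correct):
    """Fraction of questions solved by at least one of n sampled solutions.

    sample_solution(q) draws one solution from M_1 for question q;
    is_correct(q, sol) checks the final answer against the ground truth.
    Both are hypothetical helpers standing in for the evaluation pipeline.
    """
    solved = sum(
        any(is_correct(q, sample_solution(q)) for _ in range(n))
        for q in questions
    )
    return solved / len(questions)

# The observation discussed below corresponds to
# pass_at_n(improvement_set, N, ...) approaching 1.0 for large N
# when sampling from M_1.
```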

You are absolutely correct in observing that accuracy@N increases with larger N. However, the key insight here is not just the general trend, but that as N increases, accuracy@N (or pass@N) quickly approaches 100%. This indicates that $\mathcal{M}_1$, through multiple samplings, is able to correctly answer nearly all questions in the improvement set of $\mathcal{M}_t$, $\text{IS}(t)$. From this observation, we can draw two conclusions:

  • $\mathcal{M}_1$ already possesses the potential to solve the problems in the improvement set.
  • Iterative self-posttraining primarily serves to re-rank solutions, enhancing the probability of sampling correct answers, but it does not fundamentally improve the model's ability to solve new problems.
Comment

Thank you for your replies and clarifications; these helped answer most of my questions. I've raised my score from 5 to 6.

I'm still uncertain whether decreasing solution diversity is necessarily undesirable (or "regressive"), though.

In the context of the growing emphasis on inference-time scaling laws, lower solution diversity is detrimental to inference-time scaling. Models with poor diversity are not data-efficient.

This raises more questions for me. Perhaps I'm missing something, but I don't see any mention of inference-time scaling laws in the paper.

I get that when you draw a large number of samples at inference time, if responses are diverse, a set of samples might be more likely to include the correct solution. However, diversity also might mean more solutions to filter through, e.g. if you draw 100 responses and have 30 unique solutions (more diversity) instead of 3 (less). For math problems, this could mean a greater number of unique wrong answers, and for other domains, a greater number of unhelpful responses.

Official Review
Rating: 8

This work points out a phenomenon called self-improvement reversal. Specifically, the authors formulate three post-training paradigms (iterative SFT, iterative DPO, and iterative SFT-DPO) and conduct experiments to examine how various factors (iteration steps, different models, task difficulties) influence self-improvement. Although these strategies show a general trend of improvement in pass@1 accuracy, they still struggle with other problems such as solution diversity and OOD generalization. This paper provides a deeper understanding of self-improvement.

Strengths

  1. The experiments are well-designed and provide a comprehensive understanding of the internal mechanisms of self-improvement.
  2. This paper finds that the pass@1 accuracy metric alone may lead to a wrong judgment about model performance.
  3. The experimental results in Figure 2 are insightful. The relationship between correct answer coverage and iterative post-training methods has not been discussed before.

Weaknesses

  1. Some of the conclusions in this paper are as expected, e.g. results from Table 3 are related to [1].

[1] Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Questions

No more questions.

Comment

Thank you for your recognition of our work.

  1. Some of the conclusions in this paper are as expected, e.g. results from Table 3 are related to [1].

Thank you for your reminder; we have already included this paper in the references.

Official Review
Rating: 6

This paper analyzes post-training via self-improvement loops for LLMs. Their results suggest that a sort of "Goodhart's Law" is sometimes at play, where improving the benchmark scores will often lead to decreased performance after some point. They particularly note diversity as an important metric which often drops.

Strengths

Good idea and experiments. It's an important and timely topic. The observation that some metrics like diversity are sometimes unmeasured with these methods is very important.

Weaknesses

I think the writing can be improved somewhat, even if the core ideas are clear.

Questions

Do you expect these results to hold with RL posttraining?

Comment

Thank you for your comments.

Do you expect these results to hold with RL posttraining?

Thank you for your comment, it's an excellent question.

Considering that iterative DPO/SFT has become a key component of post-training processes, such as in the LLaMA3[1] series models, iterative SFT/DPO is the primary focus of this paper. Of course, RL is also a significant technique in post-training, and we may discuss it in future work.

Regarding the use of RL in post-training, will the conclusions in this paper still hold? We believe that during the RL process, the reward model plays a crucial role in determining the optimization direction of the model. If the reward model continues to focus solely on the correctness of answers, we think the issues raised in the paper may still arise during iterative self-post-training. However, if the reward model also considers factors such as problem diversity and answer readability, these conclusions might change.

[1] Dubey, Abhimanyu, et al. "The llama 3 herd of models." arXiv preprint arXiv:2407.21783 (2024).

Comment

Thank you for the response and the great paper!

AC Meta-Review

This work studies post-training (specifically, iterative SFT, iterative DPO, and iterative SFT-DPO) and its "self-improvement reversal" problem. Specifically, the work conducts experiments to examine how various factors (iteration steps, different models, task difficulties) influence self-improvement, and shows that although these post-training methods exhibit a general trend of improvement in pass@1 accuracy, they still struggle with other problems such as solution diversity and OOD generalization. The paper is well-written, with ample results and analysis for this well-motivated problem of practical significance. The paper presents interesting reversal results with diversity and OOD generalization decreasing during post-training, using a variety of diversity metrics. It would be desirable to include more discussion of why diversity is important in the "inference scaling law".

Additional Comments on Reviewer Discussion

The reviewers were generally happy with the work by giving a score of 6 or higher. Thus the discussion is succinct.

Final Decision

Accept (Poster)