Diving into Self-Evolving Training for Multimodal Reasoning
Abstract
Reviews and Discussion
The paper analyses the key components of the self-evolving training procedure for Multimodal Large Language Models (MLLMs), with the aim of gaining new insights into the reasoning capabilities of MLLMs. In particular, three fundamental aspects of self-evolving training, i.e., training method, reward model, and prompt variation, are examined in a series of experiments involving the MiniCPM-V 2.5 model. Furthermore, the authors delve into the dynamics of the self-evolution process by monitoring four metrics representative of the analysed process.
Strengths
- The paper is clearly written, easy to follow, and its underlying theme is well motivated.
- The experiments are well-designed and formally correct.
- The considered topic is related to MLLMs, and as such is potentially of interest to a broad subset of the ICLR community.
Weaknesses
- The main problem is the limited experimental evaluation with respect to the number of MLLMs employed in the study. The authors present the results for one particular MLLM with no discussion of how the outcomes generalize to other models. This generalization is obviously a critical issue.
- Also, the authors consider only one dataset in their experiments. There are quite a few datasets devoted to verifying the abstract reasoning abilities of ML models, including MLLMs. Having a more diverse selection of these datasets (problem types) would be beneficial.
Questions
- Are the presented observations/conclusions related to MiniCPM-V 2.5 also valid for other MLLMs? If so, what is the foundation of such a claim?
- A similar question regarding the validity of conclusions for other types of problems / other reasoning domains.
- In Table 1 the results for In-Domain test samples are lower than those for OOD, which is surprising. What is the reason for a better OOD than ID performance?
Thank you for your time and effort in reviewing our paper. We appreciate your recognition of our writing, experimental design, and research topic. We will address your concerns below.
Weakness 1: The main problem is limited experimental evaluation with respect to the number of MLLMs employed in the study. Question 1: Are the presented observations/conclusions related to MiniCPM-V 2.5 also valid for other MLLMs? If so, what is the foundation of such a claim.
Thanks for your suggestion and question. We have added two extra models, i.e., Phi-3.5-Vision-4B and InternVL2-2B, to our experiments; please see our General Response for more details.
Weakness 2: Also, the authors consider only one dataset in their experiments. …. Having a more diverse selection of these dataset (problem types) would be beneficial. Question 2: A similar question regarding the validity of conclusions for other types of problems / other reasoning domains.
Thanks for your suggestion and question. We would like to clarify that the MathVista benchmark is already an aggregation of 31 datasets that cover visual question answering, logical reasoning, statistical reasoning, and many other skills – it is not math-only but rather broad. Besides, we have added four extra benchmarks, i.e., M3CoT, MMStar, MMBench, and AI2D, to our experiments; please see our General Response for more details.
Question 3: In Table 1 the results for In-Domain test samples are lower than those for OOD, which is surprising. What is the reason for a better OOD than ID performance?
There are 3 main reasons:
- Difficulty - our in-domain test set is sampled from the MathV360K dataset, which is constructed by difficulty-aware resampling to increase the ratio of hard questions and then further augmented. So although it is an ID test set, its difficulty is inherently higher than that of MathVista, which lacks such difficulty control.
- Problem Type - Unlike MathVista, which predominantly requires simple numbers or multiple-choice options as the final answer, a significant portion of the ID MathV360K test set consists of open-ended questions that require full sentences as answers. This makes achieving an exact match (the default metric used) significantly more challenging for these types of questions.
- Relative Improvement - a more reasonable view of our ID and OOD evaluation is to consider the relative improvement. Compared with the base model and the warmed-up model, we achieve a 25.2% improvement on the ID test set, but only 0.2% on the OOD test set.
I'd like to thank the authors for the rebuttal, which cleared some of my doubts. I'm considering raising my score, though I still believe that the paper is below the acceptance threshold.
Thank you for your valuable feedback and for considering revising the score. In light of potential concerns regarding our motivation and contributions, we have provided further clarification in our general response. Please refer to the general response for more details.
In summary, the key contributions of this paper include:
- Conducting a pilot study to enhance multimodal reasoning, validated through comprehensive experiments,
- Improving methodologies for self-evolving training,
- Systematic investigations,
- Developing resources (e.g., CoT data, PRM, and automatically annotated data).
Our empirical experiments validate the effectiveness of these contributions step by step in detail.
If you have any remaining concerns, please don't hesitate to let us know. Thank you!
Dear Reviewer wMNX,
We hope this message finds you well. With only one day remaining for the review period, we are eager to know if our revisions and responses have adequately addressed your concerns. If you have any additional questions or require further clarification regarding our work, please do not hesitate to let us know. We would be more than happy to provide any additional information or address any remaining concerns.
Thank you for your time and effort in reviewing our submission!
This work focuses on self-evolving training for multimodal reasoning. The authors mainly study three components of self-evolving training, that is, training method, reward model, and prompt variation. By comparing the performance of various configurations, which can be considered a kind of grid search, they finally determine an optimal design. Along the way, they also give some in-depth analysis and insights on different topics.
Strengths
- The layout of this paper is clear; the authors separated the process into the determination of three static components and stick to this layout by presenting each part in a reasonable order.
- This work provides a comprehensive comparison of various configurations, which can serve as a reference for others working in this field. It complements the study of self-evolving training methods in the multimodal reasoning area.
- Along the way, the authors also provide some in-depth analysis on diverse topics.
Weaknesses
- This work can be considered a kind of grid-search process and does not propose new techniques in terms of methodology.
Questions
- Why are the components studied in the presented order? In the paper, the authors first study the training method and determine a best configuration. Then they directly use this configuration for the study of the reward model, and finally a similar procedure is applied to prompt variation. This can be seen as an incomplete grid-search process, and the final optimal configuration could differ if the components were studied in a different order. So it would be good for the authors to provide a reasonable explanation of this point.
Thank you for your time and effort in reviewing our paper. We appreciate your recognition of our paper's layout, comprehensive experimental comparisons, and in-depth analysis. We will address your concerns below.
Weakness 1: This work can be considered a kind of grid-search process and does not propose new techniques in terms of methodology.
We would like to highlight that our main contribution is to view self-improving algorithms through the lens of RL, identify several important aspects, and then empirically study good practices for each to arrive at an effective training recipe. We believe these empirical contributions are meaningful, helpful to practitioners, and significant.
Question 1: Why are the components studied in the presented order? … So it would be good for the authors to provide a reasonable explanation of this point.
This is a valid point. For this type of work, it is typically challenging and computationally expensive to perform a full grid search across all possible combinations of factors in a training approach. As a result, the analysis is often conducted by examining one aspect at a time while keeping the others fixed at reasonable configurations to reduce computational cost. This approach is also commonly used in previous works [1,2]. While this method may not guarantee finding the optimal combination of factors, it is generally sufficient to identify a relatively effective training recipe that can benefit both researchers and practitioners.

Regarding the specific order in which we studied the training method, reward model, and prompt variation, we chose this order based on the following reasoning: the training method is a fundamental and general aspect, and we expected that a successful training method should be broadly applicable across different settings. Therefore, we first focused on determining the training method. Next, our exploration of prompt variation involved unlabeled prompts that relied on reward signals provided by the reward models. Consequently, we studied reward models before investigating prompt variation.
[1] Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback; NeurIPS 2024
[2] What matters when building vision-language models?; NeurIPS 2024
Thank you once again for your valuable feedback and suggestions! We sincerely hope that our previous responses have addressed your concerns. If any questions or issues remain, please let us know and we will do our utmost to address them. If you feel that your concerns have been fully resolved, we would be truly grateful if you could consider revisiting your review score.
I appreciate the authors' effort in addressing my concerns. I will maintain my score.
Thank you for your response! If you have any further questions or need additional information, please don’t hesitate to reach out to us.
This paper explores self-evolving training for enhancing multimodal reasoning in large multimodal models (LMMs), focusing on training without chain-of-thought annotations. The authors investigate three primary factors influencing the effectiveness of this training approach: training methods, reward model design, and prompt variation. They introduce a dynamic framework, M-STAR, built on MiniCPM-V 2.5, to optimize these factors. Key contributions include establishing best practices for each component, implementing a reward model to enhance response selection, and proposing an automatic temperature adjustment mechanism to balance exploration and exploitation during training.
Strengths
- Relevance: The paper addresses an important and timely problem—enhancing reasoning in large multimodal models without chain-of-thought annotations.
- Quality: The paper demonstrates high technical quality in its systematic breakdown of training configurations, use of an adaptive exploration-exploitation strategy, and ablation studies.
- Clarity: The paper is well-organized, with each section logically following from the last, making it easy to understand the flow from problem identification to solution.
Weaknesses
- Claim: The CoT warm-up phase conflicts with the paper’s definition of annotation/CoT-free self-evolving training. It is essential to clarify this reliance and provide a stronger rationale for the warm-up setting.
- Empirical results: The paper primarily evaluates MathVista, a single multimodal reasoning benchmark focused on math-based tasks. To show broader applicability, the paper would benefit from additional benchmarks such as VQA or other scientific QA. Besides, there is only one LLM used in all experiments, so it's hard to justify whether the benefit of the method can be transferred to other architectures.
- Theoretical analysis: Since self-evolving training shares similarities with RL, a theoretical analysis comparing these methods could clarify the unique aspects of M-STAR, such as optimality, stability, and guarantees. Also, the paper lacks a hypothesis-driven structure that ties the findings to the central research question, which appears more like a tech report rather than a research paper.
- Contribution: While the paper has explored different configurations exhaustively, the overall contribution is vague given its lack of theoretical grounding and limitations in the empirical study.
Questions
- Ablation studies showing the model’s performance with and without the CoT phase to quantify its impact on results.
- What's the computation cost of the proposed method compared to baselines?
- Can the best configurations explored in this study be applied to other task domains without losing their effectiveness?
Thanks for your time and effort in reviewing our paper. We appreciate your recognition of the relevance, the high technical quality, and the writing clarity. We will address each of your concerns as follows:
Weakness 1: Claim: The CoT warm-up phase conflicts with the paper’s definition of annotation/CoT-free self-evolving training. It is essential to clarify this reliance and provide a stronger rationale for the warm-up setting.
We would like to clarify that our paper only claims to be annotation-free, not CoT-free. Our paper studies multimodal reasoning, where CoT is necessary. Since human-annotated CoT data is scarce for multimodal reasoning, our approach does not use any human-annotated CoT data: in the warmup stage, the training data is synthesized by prompting the base model (policy model) itself, so no extra annotation is needed and no other models are involved. Therefore, we do not think this warmup stage violates any self-evolving claims in this paper.
Weakness 2: Empirical results: The paper primarily evaluates MathVista, a single multimodal reasoning benchmark focused on math-based tasks. To show broader applicability, the paper would benefit from additional benchmarks such as VQA or other scientific QA. Besides, there is only one LLM used in all experiments, so it's hard to justify whether the benefit of the method can be transferred to other architectures.
- Regarding the evaluation datasets, we would like to clarify that the MathVista benchmark is already an aggregation of 31 datasets that cover geometry reasoning, logical reasoning, statistical reasoning, and many other skills – it is not math-only but rather broad. It is just that we only reported the average score on MathVista in the submission. To provide more evidence of our performance on multiple tasks, we report results on each subtask of MathVista, as shown in the General Response as well as in Appendix F of the updated PDF. Most of the subtasks show improvements as we improve the configuration step by step, and the final recipe, M-STAR, also leads to optimal results for most of them.
- Besides MathVista, to better address your concern, we also add several new multimodal reasoning benchmarks, including M3CoT[1], MMStar[2], MMBench[3] and AI2D[4], as shown in the General Response as well as in Appendix G. We can see that M-STAR gives consistent improvements on all of these benchmarks.
- Regarding your comment on using only one model, we have also added new results on two new models, Phi-3.5-vision-4B and InternVL2-2B. The results are included in the General Response and Appendices F and G of the updated PDF. Similar to what we observed on MiniCPM-V-2.5-8B, our findings hold for them in general, and M-STAR can boost their performance on a wide range of benchmarks.
Weakness 3: Theoretical analysis: Since self-evolving training shares similarities with RL, a theoretical analysis comparing these methods could clarify the unique aspects of M-STAR, such as optimality, stability, and guarantees.
Thank you for your feedback and for raising these important points. We would like to further clarify our motivation and claims.
In this paper, we start from the observation that self-evolving training can be subsumed under a general reinforcement learning (RL) framework, as described in Eq. 1. However, prior works [1, 2, 3] have primarily treated self-evolving training as an iterative training paradigm and have largely overlooked the deeper connection between self-evolving training and the reinforcement learning framework. Based on this observation, we aim to investigate whether the three key factors within reinforcement learning—online/offline training, reward model, and prompt variation—can contribute to improving self-evolving training. Additionally, we analyze the training dynamics of the self-evolving training process, focusing on the balance between exploration and exploitation. We believe that these perspectives set our paper apart from previous works on self-evolving training.
[1] STaR: Bootstrapping Reasoning With Reasoning; NeurIPS 2022
[2] Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models; TMLR 2024
[3] V-STaR: Training Verifiers for Self-Taught Reasoners; COLM 2024
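For concreteness, the generic objective underlying this view can be written as below (a simplified paraphrase of the standard RL formulation; the paper's Eq. 1 may include additional terms such as regularization):

$$\max_{\theta} \;\; \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{\hat{y} \sim \pi_{\theta}(\cdot \mid x)} \big[ r(x, \hat{y}) \big],$$

where $\pi_{\theta}$ is the policy (the LMM being trained), $\mathcal{D}$ is the prompt distribution, and $r$ is the reward signal (e.g., final-answer correctness or a process reward model). Under this view, the training method determines how $\pi_{\theta}$ is updated, the reward model instantiates $r$, and prompt variation controls $\mathcal{D}$.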
Weakness 3: Also, the paper lacks a hypothesis-driven structure that ties the findings to the central research question, which appears more like a tech report rather than a research paper.
Our central research question is to investigate several key components of the self-improving algorithm, as motivated and outlined in the earlier sections of the paper (Section 2, Lines 120–131). Different papers may adopt different writing structures, and in this work, we follow a structure that systematically explores various aspects and draws conclusions step by step through our experiments. We do not think that this writing style constitutes a significant flaw in the paper. For instance, many prior works have also adopted similar structures to empirically study best practices, such as [1,2].
[1] Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback; NeurIPS 2024
[2] What matters when building vision-language models?; NeurIPS 2024
Weakness 4: Contribution: While the paper has explored different configurations exhaustively, the overall contribution is vague given its lack of theoretical grounding and limitations in the empirical study.
Our main contribution is to approach self-improving algorithms through the lens of reinforcement learning (RL), identify several critical aspects, and empirically study best practices to propose an effective training recipe. While we did not provide deep theoretical grounding for many of the phenomena observed, we argue that our empirical findings are both meaningful and valuable for practitioners, making the contribution significant. In particular, the additional models and evaluations presented in the General Response further strengthen this contribution.
Question 1: Ablation studies showing the model’s performance with and without the CoT phase to quantify its impact on results.
We would like to emphasize that the warmup phase with CoT training is essential because we observed that most open-weight multimodal models, such as LLaVA and MiniCPM-V, rarely generate CoT reasoning and typically only produce a short final answer. For self-evolving training on reasoning tasks, the key improvements are achieved by enhancing the reasoning process, specifically the CoT reasoning – this is simply how the setting is defined, and it follows the standard setting adopted by nearly all prior works in this area [1,2,3,4]. Thus, conducting self-evolving reasoning training when the model cannot generate CoT reasoning would be largely ineffective, which is precisely why we include the CoT warmup phase in the first place. As stated in our response to Weakness 1, we do not rely on any additional annotations or external resources during the warmup phase. Therefore, we believe there is no issue or limitation associated with this approach.
[1] Scaling relationship on learning mathematical reasoning with large language models; arXiv 2308.01825
[2] STaR: Bootstrapping Reasoning With Reasoning; NeurIPS 2022
[3] V-STaR: Training Verifiers for Self-Taught Reasoners; COLM 2024
[4] Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models; TMLR 2024
Question 2: What's the computation cost of the proposed method compared to baselines?
Admittedly, self-evolving training requires more computational resources than the SFT baseline. This is a common characteristic of self-evolving training approaches [1, 2] and reinforcement learning from human feedback (RLHF) [3, 4]. The additional computation cost mainly comes from the extra data generation of these RL-like approaches and from typically more training steps than SFT. However, when comparing different variations of self-evolving training, the computational costs are largely similar, as all self-evolving runs are trained for the same number of steps with the same batch size.
[1] STaR: Bootstrapping Reasoning With Reasoning; NeurIPS 2022
[2] ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search; NeurIPS 2024
[3] TÜLU 3: Pushing Frontiers in Open Language Model Post-Training; Tech Report
[4] The Llama 3 Herd of Models; arxiv 2407.21783
Question 3: Can the best configurations explored in this study be applied to other task domains without losing their effectiveness?
As shown in the general response, we have applied our training recipe to two other models, Phi-3.5-Vision-4B and InternVL2-2B, and evaluated them on several added benchmarks and tasks. The results demonstrate that our best configurations are effective in most of these settings.
I appreciate the detailed replies and new results from the authors. Most of my specific questions have been addressed. However, my concerns remain regarding the novelty and contribution of the paper. More specifically, the paper focuses more on the empirical side, without providing convincing theoretical backing or demonstrating strong results compared to existing work. While the findings could be very useful for some practical use cases, I would recommend that the paper (in its current state) either dive deeper into the problem of self-supervised reasoning or target a more practically focused venue or workshop.
The paper, "Diving into Self-Evolving Training for Multimodal Reasoning," explores the enhancement of reasoning abilities in Large Multimodal Models (LMMs) through self-evolving training, a method where models iteratively improve by learning from their own outputs. The absence of multimodal chain-of-thought annotated data has led to this innovative approach. The study identifies and systematically examines three critical factors—Training Method, Reward Model, and Prompt Variation—that influence the effectiveness of training. The authors present a comprehensive analysis and establish best practices for each factor within a newly proposed framework named M-STAR (Multimodal Self-evolving Training for Reasoning), built on MiniCPM-V 2.5. This framework achieved a significant improvement in accuracy on the MathVista dataset, demonstrating its efficacy. The paper also explores the dynamics of self-evolution and introduces an automatic mechanism to balance model exploration and exploitation, further enhancing performance.
Strengths
(1) The paper introduces an original framework, M-STAR, for self-evolving training in multimodal reasoning. This approach is particularly innovative as it leverages the model's own outputs for iterative improvement, a method relatively underexplored in the context of multimodal reasoning. Additionally, the focus on three specific components (Training Method, Reward Model, and Prompt Variation) for optimizing training presents a novel angle for investigation.
(2) The paper is well-structured and clearly written. The authors effectively communicate complex ideas, such as the dynamics of self-evolution and the implementation of an automatic balancing mechanism during training. The systematic breakdown of each key factor and the subsequent analysis make the paper accessible to readers with varying levels of expertise in the field.
Weaknesses
(1) The motivation or evidence behind the importance of the three components: the training method, the use of the reward model, and the prompt variation is insufficiently substantiated. The authors need to provide more detailed justification or empirical evidence to support the significance of these components in the context of multimodal reasoning.
(2) The ablation experiments are solely based on a single model: MiniCPM-V-2.5, and two datasets from the Math domain. It would be beneficial to explore the effects of different model sizes (understanding the constraints of increased training time with larger models, experimenting with smaller models could be insightful) and datasets from varied domains such as code generation to generalize the findings.
(3) The settings of the ablation studies focus primarily on minor hyperparameter adjustments, leading to conclusions that align with conventional expectations. It is recommended that the authors delve deeper into algorithmic comparisons. For instance, contrasting with techniques like Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), or Direct Preference Optimization (DPO), as well as exploring different training methodologies (e.g., multiple training stages) or network architecture designs (e.g., with different multimodal encoders), could provide more robust insights.
Questions
(1) Could the authors elaborate on the specific motivations or additional evidence that underscore the criticality of the training method, reward model, and prompt variation in enhancing multimodal reasoning? A deeper understanding or empirical backing could significantly strengthen the paper's foundation.
(2) Have the authors considered expanding the ablation studies to include a broader range of model sizes, including smaller ones, despite the acknowledged increased training time with larger models? Additionally, could the use of datasets from different domains, such as code generation, provide more comprehensive insights into the model's capabilities and limitations?
(3) In terms of algorithmic comparisons and training methodologies, could the authors provide a comparative analysis with other prevalent techniques like SFT, RLHF, DPO, or different training stages and network structures? Such comparisons could offer a clearer differentiation and possibly highlight the advantages or limitations of the proposed M-STAR framework in a broader context.
(4) In the section "MONITORING THE TRAINING DYNAMICS," it is observed that nearly all training reaches its peak performance quickly (within < 2500 steps), after which the model's performance tends to decline as training progresses. Does this suggest that the base self-evolving training configuration might not be optimally set? For instance, issues such as an overly small dataset, an excessively large model, or inappropriate regularization settings could be contributing factors. How do these factors influence the conclusions drawn from the training baseline, and could this impact the accuracy of the study's outcomes?
(5) Beyond assessing the correctness of results in the Math domain, should the evaluation of the model's outputs also consider other dimensions? For instance, evaluating aspects such as the interpretability, robustness, or even the creativity of the responses could provide a more holistic view of the model's capabilities in multimodal reasoning. How do the authors envision incorporating these additional evaluation metrics into their framework?
Thanks for your time and effort in reviewing our paper. We appreciate your recognition of our investigation of self-evolution in multimodal reasoning, and our systematic breakdown of each key factor in it. We will address each of your concerns as follows:
Weakness 1: The motivation or evidence behind the importance of the three components: the training method, the use of the reward model, and the prompt variation is insufficiently substantiated. Question 1: Could the authors elaborate on the specific motivations or additional evidence that underscore the criticality of the training method, reward model, and prompt variation in enhancing multimodal reasoning?
Thank you for the suggestion. Our selection of these three components as the main research focus in this paper is primarily based on our modeling of self-evolving training within a general RL framework (as discussed in Section 2), which has been similarly modeled in many RL-related papers [1, 2, 3, 4, 5].
These three selected components, training methods, reward models, and prompt variation, have been recognized as essential and studied in many similar works, particularly in text-only settings. Regarding the training method, [3, 4, 5] discovered that factors such as online/offline updates and on-policy/off-policy rollouts in the training algorithm can greatly impact model performance. For the reward model, [6, 7, 8] investigated the impact of different reward functions and reward models on the training results; and for prompt variation, [1] explored how different prompts affect RLHF alignment results. However, most previous works, except [1], only consider one factor instead of comprehensively investigating all of these factors. Meanwhile, none of the previous works studies how these important factors in RL training affect the results of self-evolving training. Since self-evolving training follows the general RL framework, we identify these three important and popular factors inside RL training following previous works in this direction. We then systematically study these factors in a more complex and rarely discussed domain, multimodal reasoning.
We also acknowledge that other factors, such as model architecture, hyper-parameter configurations (e.g., learning rate, batch size), and the specific updating algorithms, may influence the overall pipeline. However, some of these factors are orthogonal to our RL framework. Thus, we have made practical selections, such as using an advanced LMM, MiniCPM-V-2.5, for all ablations and keeping the learning rate and batch size unchanged.
[1] Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback; NeurIPS 2024
[2] Decoupling Exploration and Exploitation for Meta-Reinforcement Learning without Sacrifices; PMLR 2021
[3] RLHF Workflow: From Reward Modeling to Online RLHF; TMLR, 2024
[4] Direct Language Model Alignment from Online AI Feedback; ICML 2024
[5] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models; arxiv.2402.03300
[6] V-STaR: Training Verifiers for Self-Taught Reasoners; COLM 2024
[7] DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence; arxiv 2406.11931
[8] Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning; arxiv 2410.08146
Weakness 2: The ablation experiments are solely based on a single model: MiniCPM-V-2.5, and two datasets from the Math domain. It would be beneficial to explore the effects of different model sizes (understanding the constraints of increased training time with larger models, experimenting with smaller models could be insightful) and datasets from varied domains such as code generation to generalize the findings. Question 2: It would be beneficial to explore the effects of different model sizes and datasets from varied domains
Thanks for the suggestion and the question.
- Regarding the evaluation datasets, we would like to clarify that the MathVista benchmark is already an aggregation of 31 datasets that cover geometry reasoning, logical reasoning, statistical reasoning, and many other skills – it is not math-only but rather broad. It is just that we only reported the average score on MathVista in the submission. To provide more evidence of our performance on multiple tasks, we report results on each subtask of MathVista, as shown in the General Response as well as in Appendix F of the updated PDF. Most of the subtasks show improvements as we improve the configuration step by step, and the final recipe, M-STAR, also leads to optimal results for most of them.
- Besides MathVista, to better address your concern, we also add several new multimodal reasoning benchmarks, including M3CoT[1], MMStar[2], MMBench[3] and AI2D[4], as shown in the General Response as well as in Appendix G. We can see that M-STAR gives consistent improvements on almost all of these benchmarks.
- We also add new results on two new models, Phi-3.5-vision-4B and InternVL2-2B. The results are included in the General Response and Appendices F and G of the updated PDF. Similar to what we observed on MiniCPM-V-2.5-8B, our findings hold for them in general, and M-STAR can boost their performance on a wide range of benchmarks.
Weakness 3: It is recommended that the authors delve deeper into algorithmic comparisons. For instance, contrasting with techniques like Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), or Direct Preference Optimization (DPO).... Question 3: could the authors provide a comparative analysis with other prevalent techniques like SFT, RLHF, DPO...
Thank you for the suggestion. Comparing different loss functions and algorithms is indeed a non-trivial task. In this paper, we primarily focus on the key components of the iterative rejection fine-tuning approach, which is one of the most widely used and robust self-improving algorithms, as demonstrated in STaR [1] and Llama 3 [2]. Given the widespread adoption of iterative rejection fine-tuning, we argue that a thorough investigation of its critical components is significant enough. We leave a comprehensive comparison of different self-improving algorithms to future work.
[1] STaR: Bootstrapping Reasoning With Reasoning; NeurIPS 2022
[2] The Llama 3 Herd of Models; arxiv 2407.21783
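To make this concrete, below is a minimal sketch of one iterative rejection fine-tuning loop in the style of STaR [1]. It is an illustration only: the callables `sample_fn`, `is_correct_fn`, and `finetune_fn` are placeholders to be supplied by the caller, and the sketch does not reproduce the exact pipeline used in our paper.

```python
import random
from typing import Callable, List, Tuple

def iterative_rejection_finetuning(
    prompts: List[dict],                                    # each item: {"question": ..., "answer": ...}
    sample_fn: Callable[[dict, float], str],                # draws one CoT response from the current policy
    is_correct_fn: Callable[[str, dict], bool],             # checks the final answer against the reference
    finetune_fn: Callable[[List[Tuple[dict, str]]], None],  # one SFT pass on (prompt, response) pairs
    iterations: int = 3,
    samples_per_prompt: int = 8,
    temperature: float = 1.0,
) -> None:
    """Minimal rejection-sampling self-training loop (illustrative, not the paper's code)."""
    for _ in range(iterations):
        kept: List[Tuple[dict, str]] = []
        for prompt in prompts:
            # Explore: sample several candidate CoT responses from the current policy.
            candidates = [sample_fn(prompt, temperature) for _ in range(samples_per_prompt)]
            # Reject: keep only responses whose final answer matches the reference.
            correct = [c for c in candidates if is_correct_fn(c, prompt)]
            if correct:
                kept.append((prompt, random.choice(correct)))
        # Exploit: fine-tune the policy on its own filtered generations.
        finetune_fn(kept)
```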
Question 4: it is observed that nearly all training reaches its peak performance quickly (within < 2500 steps), after which the model's performance tends to decline as training progresses. Does this suggest that the base self-evolving training configuration might not be optimally set?
We would like to clarify several points regarding the mentioned phenomenon:
- The model's performance under greedy decoding does not decline; it increases first and then saturates (Figure 3).
- The declining trend in Figure 4(a)(b) is in terms of pass@K accuracy. This is due to the decline of the model's exploration ability when the model is trained on its own outputs, which is a known issue of self-improvement, as also observed in [1] (see the short pass@k sketch after the references below).
- Quick saturation in self-improving training is very common [1,2,3], and it can stem from various factors, such as the limitations of the training set (our training set contains 180K examples, which is not overly small) and the quality of the reward models. Based on our analysis in Section 3.3, which highlights the challenges of obtaining an effective PRM verifier, we hypothesize that the reward model may play a significant role in causing this quick saturation. Overcoming quick saturation and achieving scalable improvements is currently one of the most important challenges for self-improving algorithms and involves multiple research problems, which we leave to future work.
[1] Progress or Regress? Self-Improvement Reversal in Post-training; arXiv 2407.05013
[2] Entropic Distribution Matching in Supervised Fine-tuning of LLMs: Less Overfitting and Better Diversity; arXiv:2408.16673
[3] Mitigating Tail Narrowing in LLM Self-Improvement via Socratic-Guided Sampling; arXiv:2411.00750
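For reference, pass@k denotes the probability that at least one of k sampled responses to a problem is correct. A standard unbiased estimator computed from n samples with c correct ones is sketched below; this is a generic reference implementation and is not claimed to match the exact computation behind Figure 4.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).

    n: total sampled responses per problem, c: number of correct samples,
    k: attempt budget. Illustrative helper, not tied to the paper's code.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 4 of them correct, evaluated at k = 8.
print(round(pass_at_k(16, 4, 8), 3))  # 0.962
```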
Question 5: Beyond assessing the correctness of results in the Math domain, should the evaluation of the model's outputs also consider other dimensions? For instance, evaluating aspects such as the interpretability, robustness, or even the creativity of the responses could provide a more holistic view of the model's capabilities in multimodal reasoning. How do the authors envision incorporating these additional evaluation metrics into their framework?
Thanks for this suggestion. As we explained in the response to Weakness 2 & Question 2, MathVista covers rather broad domains, and we have also added several other multimodal reasoning benchmarks to support the empirical effectiveness of our approach.
Besides, we definitely agree with your suggestion that incorporating extra evaluation dimensions such as interpretability, robustness, or even creativity, beyond correctness alone, would make the analysis more comprehensive. However, there are no well-established multimodal reasoning benchmarks that evaluate these aspects, and we have to admit that reliable, automatic assessment of these metrics is challenging. For now, we focus on question-answering settings, which are inherently designed to measure correctness; evaluating the additional dimensions would require more suitable data, which is quite scarce in the multimodal domain.
Nevertheless, we are still open to making the evaluation as comprehensive as possible, for instance by involving other LMMs (like GPT-4o) or fine-tuning judge LMMs [1] to automatically assess multiple dimensions, or even building a complex evaluation system [2, 3] for it. Although this may not be within the short-term scope of this paper, we will pursue it in the long run.
[1] Prometheus-Vision: Vision-Language Model as a Judge for Fine-Grained Evaluation; ACL 2024
[2] Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist; arXiv 2407.08733
[3] LiveBench: A Challenging, Contamination-Free LLM Benchmark; arXiv 2406.19314
Thank you for the detailed response, which addressed most of my concerns. After reviewing the other reviewers' comments and the rebuttal, I have decided to maintain my original score.
We thank all the reviewers for the comments! We have revised the PDF to reflect the reviewers' comments, and responded to each reviewer separately in the respective thread. Here we summarize the main revisions of the manuscript.
Adding new models (Appendix F): We introduce two additional models of different sizes: Phi-3.5-vision (4B) [1] and InternVL2-2B (2B) [2]. To provide a more comprehensive analysis, we leverage the full results on MathVista to study the various training recipes in greater depth. It is important to note that MathVista is not solely a benchmark for math word problems; it is a comprehensive multimodal benchmark encompassing a wide range of reasoning tasks, including visual question answering, figure-based question answering, science question answering, and more. The results are briefly shown below and the detailed results can be found in Appendix F. (addressing Reviewer tYHA, rZP2, wMNX)
Adding new benchmarks (Appendix G): To further validate the effectiveness of our M-STAR recipe, we evaluate it on four additional benchmarks: M3CoT[3], MMStar-R[4], MMBench-R[5], and AI2D[6] (-R indicates the selected reasoning subset). The results are briefly shown below and detailed in Appendix G as well, demonstrating that M-STAR significantly enhances the multimodal reasoning abilities of models across different sizes and benchmarks. (addressing Reviewer tYHA, rZP2, wMNX)
More comprehensive report on a breakdown of the different datasets in MathVista (here we report MiniCPMV-2.5; full results for the other models are in Table 5, Appendix F):
| Model | ALL | FQA | GPS | MWP | TQA | VQA | ALG | ARI | GEO | LOG | NUM | SCI | STA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MiniCPMV-2.5 | 52.4 | 59.2 | 44.7 | 50.5 | 53.8 | 48.0 | 42.7 | 46.5 | 46.0 | 29.7 | 36.1 | 56.7 | 60.1 |
| +warmup | 52.8 | 58.4 | 47.1 | 57.0 | 53.8 | 45.8 | 45.5 | 49.6 | 48.5 | 16.2 | 31.9 | 53.3 | 62.8 |
| SFT | 54.7 | 58.7 | 50.5 | 56.5 | 55.7 | 50.8 | 47.0 | 49.0 | 51.0 | 18.9 | 43.1 | 58.2 | 57.5 |
| Iterative RFT | 55.7 | 59.1 | 49.5 | 65.6 | 55.1 | 48.0 | 47.3 | 53.8 | 50.6 | 16.2 | 37.5 | 55.7 | 65.1 |
| RestEM | 55.1 | 58.0 | 49.5 | 64.5 | 55.1 | 47.5 | 47.7 | 53.8 | 50.2 | 16.2 | 38.2 | 56.6 | 63.5 |
| Cont. optim. | 57.2 | 57.6 | 56.3 | 65.1 | 57.0 | 49.7 | 52.0 | 54.4 | 56.1 | 10.8 | 36.1 | 60.7 | 65.5 |
| + PRM Re-Rank | 59.2 (+6.4) | 59.1 (+0.7) | 61.1 (+14.0) | 68.3 (+11.3) | 55.1 (+1.3) | 51.4 (+5.6) | 54.8 (+9.3) | 55.2 (+5.6) | 60.3 (+11.8) | 10.8 (-5.4) | 43.1 (+11.2) | 59.0 (+5.7) | 66.5 (+3.7) |
| M-STAR | 59.5 (+6.7) | 59.5 (+1.1) | 59.1 (+12.0) | 65.6 (+8.6) | 58.9 (+5.1) | 54.2 (+8.4) | 54.5 (+9.0) | 56.7 (+7.1) | 58.2 (+9.7) | 10.8 (-5.4) | 43.1 (+11.2) | 61.5 (+8.2) | 69.1 (+6.3) |
Results with additional models and benchmarks:
MiniCPM-V-2.5
| Model | MathVista | M3CoT | MMStar-R | MMBench-R | AI2D | Average |
|---|---|---|---|---|---|---|
| MiniCPM-V-2.5 | 52.4 | 41.2 | 44.6 | 72.6 | 64.4 | 55.0 |
| + warmup | 52.6 | 47.8 | 45.1 | 76.9 | 65.9 | 57.7 |
| M-STAR | 59.5 (+6.9) | 48.7 (+0.9) | 50.7 (+5.6) | 79.9 (+3.0) | 69.1 (+3.2) | 61.6 (+3.9) |
Phi-3.5-vision
| Model | MathVista | M3CoT | MMStar-R | MMBench-R | AI2D | Average |
|---|---|---|---|---|---|---|
| Phi-3.5-vision | 46.5 | 39.4 | 42.5 | 56.8 | 47.5 | 46.5 |
| + warmup | 49.3 | 46.5 | 44.2 | 70.9 | 65.5 | 55.3 |
| M-STAR | 54.5 (+5.2) | 51.3 (+4.8) | 48.8 (+4.6) | 73.6 (+2.7) | 67.9 (+2.4) | 59.2 (+3.9) |
InternVL2-2B
| Model | MathVista | M3CoT | MMStar-R | MMBench-R | AI2D | Average |
|---|---|---|---|---|---|---|
| InternVL2-2B | 46.4 | 16.7 | 20.0 | 14.2 | 33.5 | 26.2 |
| + warmup | 47.6 | 45.6 | 41.8 | 68.8 | 60.0 | 52.8 |
| M-STAR | 50.3 (+2.7) | 47.1 (+1.5) | 42.0 (+0.2) | 67.3 (-1.5) | 59.7 (-0.3) | 53.3 (+0.5) |
[1] Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone; arXiv 2404.14219
[2] InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks; CVPR 2024
[3] M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought; ACL 2024
[4] Are We on the Right Way for Evaluating Large Vision-Language Models?; NeurIPS 2024
[5] MMBench: Is Your Multi-modal Model an All-around Player?; ECCV 2024
[6] A Diagram Is Worth A Dozen Images; ECCV 2016
Dear Reviewers,
We have carefully revised the paper and included additional results to address the valuable feedback you provided. As the rebuttal period is coming to a close, we kindly request if you could review our responses to ensure they adequately address your concerns. We would greatly appreciate that!
Thank you, The authors
General Response for Motivation and Contributions
We thank all the reviewers once again for taking the time to review our work and for providing valuable feedback. We appreciate your insights and suggestions. However, we believe there are still some areas of misalignment regarding the motivation and contributions of our work. To address this, we would like to provide further clarification and highlight the key points. The motivation behind our work lies at the intersection of two key areas: multimodal reasoning and self-evolving training. However, we noted that our contributions to multimodal reasoning are not fully recognized. Below, we detail our contributions within each domain:
Multimodal Reasoning
- This is the first work to equip Large Multimodal Models (LMMs) with CoT reasoning for a broad set of multimodal reasoning tasks via self-generation. Given the scarcity of high-quality CoT data in MM reasoning, many popular open-source LMMs, such as LLaVa, MiniCPM, and InternVL, do not demonstrate CoT reasoning abilities for reasoning tasks – they often directly produce the answer even though we used CoT prompts in our preliminary experiments.
- We propose a novel and effective self-evolving training framework for multimodal reasoning, relying solely on the base model itself. This is the first work to study a complete self-evolving framework for multimodal reasoning blended with online learning, unlabeled prompts, and process reward models.
- We train the first multimodal process reward model (PRM), which has not been explored in previous literature, and integrate it into self-evolving training. We will open-source the full data collection and training pipeline.
Finally, we emphasize that the effort behind this work is nontrivial. Unlike text-only LLMs, MM reasoning lacks adequate support and resources. We invested significant computation, effort, and iterative trials to develop a robust and effective pipeline in this domain, while also analyzing the impact of various contributing factors.
Self-Evolving Training
- Self-evolving training is effective but far from fully explored. In this work, we therefore summarize the key components of self-evolving training through the lens of RL, study them in detail, and arrive at a successful recipe.
- We study the self-evolution dynamics from the perspective of the "exploration & exploitation" trade-off and propose a novel dynamic monitor to enable adaptive exploration during the training process (an illustrative sketch follows below).
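As a purely illustrative example of what temperature-based adaptive exploration could look like, the sketch below raises the sampling temperature when a monitored exploration signal drops and lowers it otherwise. The monitored signal, thresholds, and update rule here are hypothetical placeholders, not the exact mechanism proposed in the paper.

```python
def adapt_temperature(
    temperature: float,
    exploration_score: float,  # e.g., gap between pass@k and greedy accuracy on a monitor set (hypothetical signal)
    target: float = 0.2,       # desired exploration level (hypothetical value)
    step: float = 0.1,
    t_min: float = 0.3,
    t_max: float = 1.3,
) -> float:
    """Raise the temperature when exploration drops below the target, lower it otherwise.

    A hypothetical rule for illustration; not the paper's monitor or schedule.
    """
    if exploration_score < target:
        temperature += step  # encourage more diverse rollouts (explore)
    else:
        temperature -= step  # sharpen sampling (exploit)
    return min(max(temperature, t_min), t_max)
```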
Overall, in this paper, we have contributed:
- A pilot study to enhance multimodal reasoning validated by comprehensive studies,
- Improved methodologies for self-evolving training,
- Systematic investigations,
- Constructed resources (e.g., CoT data, PRM and automatically annotated data).
Our empirical experiments substantiate the effectiveness of these contributions step by step in detail.
The paper proposes M-STAR, used for training multi-modal reasoning into language models.
- Assuming that the reader is familiar with basic RL-HF methods, the main new contributions consist of applying well-known techniques (e.g. Process-based Reward models, variants of self-training over correct synthetic data) over multi-modal benchmarks
- Ablations were performed over different hyperparameter settings (e.g. sampling methods, temperature).
- Results are shown for MathVista, although post-rebuttal, results over more multi-modal benchmarks and base models were provided.
The core issue of this paper is that we don't find any new or surprising conclusions, other than validating that previous techniques also work for multi-modal reasoning. This issue is nearly unanimously agreed by the reviewers, and it's not clear how the setting being multimodal makes the RL training any different from regular text-only training, i.e. all the results are what we'd expect from general RL for LLMs.
Additional Comments on Reviewer Discussion
This is an extremely borderline paper, with scores (5,5,5,6), leaning towards rejection.
Reviewers generally agree with the assessment that the paper doesn't provide any new fundamental learnings or important results. As quoted directly from reviewers:
- Reviewer tYHA: "studies focus primarily on minor hyperparameter adjustments, leading to conclusions that align with conventional expectations"
- Reviewer rZP2: "Also, the paper lacks a hypothesis-driven structure that ties the findings to the central research question, which appears more like a tech report rather than a research paper"
- Reviewer PDVC: "This work can be considered a kind of grid-search process and does not propose new techniques in terms of methodology."
The question then lies as to whether it's a significant contribution that the paper shows that regular RL methods do work on multimodal reasoning. Regarding this contribution, the reviewers raised the lack of additional base models and benchmarks. This was addressed most recently in the rebuttal phase, where the authors provided many more experiments on additional evals and models.
However, reviewers still found the contribution insufficient post-rebuttal and decided to maintain their scores. Following this, the recommendation is to reject.
Reject