Thank you for your detailed and critical feedback - we have carefully reviewed each point and respectfully address them below:

Q1: (Quality and significance) Although the evaluation is extensive, on many different benchmarks, I feel the results are somewhat of limited impact, … To experimentally evaluate this, more comparisons with methods employing such approaches should be provided (for example, taken from the related works, [1][2] etc ) instead of testing multiple VLMs.

References:

[1] Videorag: Retrieval augmented generation over video corpus.

[2] Reflexion: Language agents with verbal reinforcement learning.

A1: Thank you for the insightful question. Our evaluation focuses on whether large vision-language models (LVLMs) benefit from the ReAgent-V framework under different parameter settings. As shown in the results, ReAgent-V consistently outperforms its base models and surpasses most strong baselines, such as InternVL-2.5-8B and BIMBA-LLaVA. We further conducted additional experiments comparing ReAgent-V with the retrieval-based VideoRAG approach and a Reflexion-inspired variant, as shown in the tables below. Although Reflexion [2] was originally designed for language tasks, we adapted its core mechanisms. All models were evaluated under identical settings for fairness. Due to space constraints, detailed implementation specifics will be included in the final version release.

The results show that ReAgent-V consistently outperforms [1], [2], and other baselines, demonstrating its effectiveness. All experiments were repeated three times, and our method achieves statistically significant improvements over models without ReAgent-V across all benchmarks (p < 0.01, two-tailed test), further validating its robustness.

Model (Backbone)	VideoMME (avg)	EgoSchema	LongBench
ReAgent-V (Qwen2.5-VL-7B)	60.7	61.9	54.3
VideoRAG (Qwen2.5-VL-7B)	60.3	61.2	53.7
ReAgent-V (LLaVA-Video-7B)	57.9	60.8	53.1
VideoRAG (LLaVA-Video-7B)	57.2	60.3	52.6

Model (Backbone)	VideoMME (avg)	EgoSchema	LongBench
ReAgent-V (Qwen2.5-VL-7B)	60.7	61.9	54.3
Reflexion (Qwen2.5-VL-7B)	50.6	53.2	43.8
ReAgent-V (LLaVA-Video-7B)	57.9	60.8	53.1
Reflexion (LLaVA-Video-7B)	46.7	50.5	42.8

Beyond video understanding, we evaluate ReAgent-V on two additional tasks: Video LLM Reasoning and VLA Alignment. In Video LLM Reasoning, ReAgent-V outperforms the GRPO-trained model from [3] using only 45% of the training data. In VLA Alignment, our reflection-based reward replaces the hand-crafted cost function in GRAPE [4], yielding improved performance.

References:

[3] Video-r1: Reinforcing video reasoning in mllms, 2025.

[4] GRAPE: Generalizing robot policy via preference alignment, 2025.

Q2: (Clarity) Discussion/tests about the sensitivity of the entropy-based frame selection to the various hyperparameters is missing.

A2: Thank you for the question. We provide a robustness analysis of the two key hyperparameters in Eq(4) - the scaling factor and threshold - using Qwen2-VL-7B on the short split of VideoMME.

First, we fix and vary :

**Table R1: Accuracy under different values of **

	Accuracy (%)
1	73.5
2	71.5
3	74.5
4	72.8

Next, we fix and vary :

Table R2: Accuracy under different values of

Value	Accuracy (%)
0.5	71.8
0.6	72.3
0.7	73.5
0.8	71.2

The results show that our method is stable across a wide range of parameter values. Overly strict thresholds (e.g., ) may select too few frames and hurt performance. We adopt and as default for a good balance of robustness and effectiveness, and will include this in the revised paper.

Q3: There are some other problems:

i) the section 2.2 on tool selection has little … How?

ii) Similarly, 2.3 could be improved. I understand the specific details are in the appendix, but I feel the main text should still include more details.

iii)Small typo in line 200: missing space 'ReAgent-V as...'

iv) In figure 5, I feel the question at the top …burger ..."

A3: Thank you for your careful reading and valuable comments. We will fix the typo on line 200 (“ReAgent-Vas...”) and revise unclear phrasing in Figure 5 (e.g., “What is the telling…”). We also acknowledge that Sections 2.2 and 2.3 lack implementation details. For tool selection, the base model determines which tools to invoke based on the question, the video content, and its own reasoning needs. In the revision, we will clarify how base models select tools based on the question and video content, and provide a more detailed explanation of tool selection and reflection strategies in the main text rather than in the appendix

Q4: (Originality) Besides the entropy-based frame selection, the proposed algorithm provides limited insights. … I get limited insights for possible future works, only the 'quantitative' results, and the choices made seem pretty arbitrary. I'd like to see more discussions in this sense.

A4: Thank you again for the insightful comments. The three reflection strategies are designed to mitigate overconfidence in single-strategy responses, which often cause hallucinations and errors [1, 2]. Each strategy provides a distinct perspective, and their combination enables the model to integrate complementary signals, leading to more accurate answers. As shown in Figure 4, this combined approach outperforms individual strategies on benchmarks such as VideoMME, LongBench, and EgoSchema.

To further support this, we analyzed the VideoMME benchmark using ReAgent-V with Qwen2.5-VL-7B, focusing on the Temporal Perception task where reflection is frequently triggered. The average output confidences for the conservative, neutral, and aggressive strategies were 0.63, 0.71, and 0.45, respectively, with the aggressive strategy showing the highest variance. This suggests base models are less likely to produce conservative or aggressive responses due to lower confidence, but by explicitly prompting these strategies and merging their outputs, we can capture cases that single-strategy responses may overlook. ReAgent-V’s reward module not only generates reward reports to guide data selection for GRPO/DPO training but also directly improves base model reasoning through answer revision - a capability absent in prior agent frameworks. In the Video LLM Reasoning task, using ReAgent-V’s reward reports to filter GRPO training data yields more efficient performance gains than standard methods.

References:

[1] Analyzing and mitigating object hallucination in large vision-language models

[2] VL-Uncertainty: Detecting hallucination in large vision-language model via uncertainty estimation

Q5: How does frame selection reduce the inference time so much? (Table 3). Is the time required by this procedure offset by the reduction in computation?

A5: Thank you for the question. While ECRS-based frame selection introduces some overhead, it significantly reduces the number of frames passed to the vision-language model - the main inference bottleneck - resulting in efficiency gains that far outweigh the selection cost.

Q6: Why is the amount of data used in Table 2 different for the various models? To validate the implicit claim (that ReAgent-V is better with less data) … or that ReAgent-V stays constant even after adding more data.

A6: Thank you for this interesting and careful question. In the Video LLM Reasoning application, ReAgent-V is used to filter GRPO training data via a reflection-based mechanism. As pointed out in [1], more challenging examples often lead to greater improvements in GRPO training.

Using the dataset from [2], we leveraged ReAgent-V’s reflection signals to identify difficult samples - those that triggered reflection - and retained only these difficult samples, comprising 45% of the original data. Training the base model from [2] on this subset under the same GRPO setup resulted in better performance than using the full dataset. This shows that ReAgent-V not only improves reasoning but also serves as an effective data filtering mechanism for reinforcement learning pipelines.

To further validate this, we conducted a control experiment by randomly selecting an equal-sized subset (52k samples) from the original training set. As shown below, the model trained on randomly selected data underperformed compared to both the full-data baseline and our reflection-based filtering:

Method	Steps	#Data	VSI-Bench	VideoMMMU	MMVU	MVBench	TempCompass	VideoMME
Vanilla GRPO [2]	16	116k	32.3	45.8	60.6	60.9	69.8	53.8
GRPO + Random select	16	52k	31.2	44.3	59.1	59.6	68.4	52.5
GRPO + ReAgent-V	16	52k	33.1	47.9	63.0	61.4	70.3	54.2

These results highlight the importance of selecting informative training samples rather than relying on quantity, and confirm the utility of ReAgent-V as a high-quality data selector for downstream learning tasks.

References:

[1] Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement, 2025.

[2] Video-r1: Reinforcing video reasoning in mllms, 2025.