Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration
Abstract
Reviews and Discussion
The paper presents a novel two-system architecture (Global Reasoning System and Detail Understanding System) that effectively addresses the trade-off between long-horizon temporal reasoning and fine-grained spatial understanding. GRPO is leveraged to select the key frames. The paper demonstrates strong performance on challenging benchmarks like Referring Audio-Visual Segmentation (RefAVS) and Reasoning Video Object Segmentation (REVOS). The motivation and idea are easy to follow. The writing is good.
Strengths and Weaknesses
[Strengths]
- Leveraging GRPO to reformulate key-frame selection is interesting.
- The performance is good.
[Weaknesses]
- Leveraging GRPO to reformulate key-frame selection is interesting, but why do the authors apply this idea to the referring video segmentation task? For long-video understanding tasks, selecting key frames is important; however, referring video segmentation needs to segment objects in every frame. Why is it still necessary to select key frames? Moreover, how are segmentation masks obtained for frames that are not selected by System 1?
- The authors claim that the proposed method addresses the trade-off between long-horizon temporal reasoning and fine-grained spatial understanding. However, this method increases computational complexity by introducing Systems 1 and 2. With a similar computational budget, one could instead increase the frame resolution or the sampling rate to improve performance. More discussion is needed.
- The proposed hierarchical reward design for System 1 is an important part of this paper. However, it appears to decrease performance. If it is not useful, why is it still included in the main paper? Moreover, it is important to report its performance alone.
Questions
Please refer to [Weaknesses].
Limitations
Please refer to [Weaknesses].
Final Justification
My concerns are solved.
Formatting Concerns
N/A
W1: Motivation for Using Referring Video Segmentation (RVS)
W1.1 Why Referring Video Segmentation Task?
The Referring Video Segmentation (RVS) task can be naturally decomposed into three key components:
- Understanding complex user instructions and identifying the referred object across the video, which requires the model to perform global temporal reasoning over the entire video.
- Once the object is identified, the model needs to accurately localize and segment it in high-resolution frames, requiring strong fine-grained spatial understanding.
- Propagating segmentation masks from sparse keyframes to the entire video via object tracking, which is more about low-level visual perception and can be handled by off-the-shelf tools like SAM2—thus not the focus of this paper.
The first two components align well with the challenge we highlight in the Introduction: the trade-off between long-horizon temporal reasoning and fine-grained spatial understanding. In contrast, many standard long-video understanding tasks primarily require global reasoning and often do not demand such precise spatial understanding. In those cases, System 1 alone may be sufficient to handle the task. Therefore, RVS provides a more rigorous testbed for our proposed two-system architecture.
W1.3 How Are Masks Generated for Non-Key Frames?
For frames not directly selected by System 1, we adopt SAM2 as a tracking and propagation module. SAM2 is a strong visual foundation model capable of propagating object masks across time based on the sparse predictions on keyframes.
This approach follows the standard paradigm in RVS literature: high-cost grounding and segmentation are performed on sparse frames, and light-weight temporal propagation is used to fill in the rest. Our contribution lies in learning to select the most informative keyframes, which significantly boosts both efficiency and segmentation consistency, especially under challenging conditions like motion blur, occlusion, or scene transitions.
W1.2 Why Is Key Frame Selection Still Necessary?
Given the above two premises, the necessity of selecting key frames can be justified from two main perspectives:
- Grounding models lack long-range temporal reasoning: Modern grounding-capable vision-language models (e.g., Sa2VA, Qwen2.5-VL, Qwen2.5-Omni) excel at precise localization in static images but struggle when given long, complex video sequences and ambiguous or complex instructions. If we directly input the full video together with a temporally complex prompt, these models struggle to identify the correct object. In contrast, by leveraging the global reasoning ability of System 1, we can first select informative key frames and reformulate the instruction into simpler, localized prompts. As a result, the grounding model only needs to perform fine-grained segmentation on single high-resolution frames guided by short, unambiguous instructions. These two steps, key frame selection and instruction simplification, are strongly coupled, and together they enable the model to more effectively fulfill the first two components described in W1.1.
- More intelligent than rule-based sparsity: Prior works such as Sa2VA also rely on sparse-frame segmentation followed by mask propagation. However, their frame sampling is rule-based (e.g., uniform sampling or the first five frames), lacking flexibility and potentially leading to information loss. In contrast, our System 1 can adaptively select frames with large semantic transitions, such as those involving object occlusion, reappearance, or camera scene cuts. These challenging segments are often hard cases for standard tracking models. By letting System 1 select such frames for precise grounding by System 2, we significantly reduce the tracking burden and the overall segmentation error. A minimal sketch of this two-system inference flow is given below.
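Below is a minimal sketch of the resulting inference flow, shown purely for illustration; the wrapper objects and method names (`system1.select_keyframes`, `system2.segment`, `sam2_tracker.propagate`) are hypothetical placeholders rather than the authors' actual API.

```python
# Hypothetical sketch of the two-system inference flow described above.
# Object and method names are illustrative placeholders, not the authors' code.

def segment_video(video_frames, audio, instruction,
                  system1, system2, sam2_tracker):
    """Coarse-to-fine referring segmentation with sparse keyframes."""
    # System 1: global reasoning over low-resolution frames (plus audio),
    # returning a few keyframe indices and a simplified, frame-local
    # prompt for each of them.
    keyframes, local_prompts = system1.select_keyframes(
        video_frames, audio, instruction)

    # System 2: fine-grained grounding/segmentation on each selected
    # high-resolution keyframe, guided by its localized prompt.
    keyframe_masks = {
        idx: system2.segment(video_frames[idx], prompt)
        for idx, prompt in zip(keyframes, local_prompts)
    }

    # Off-the-shelf tracker (e.g., SAM2) propagates the sparse keyframe
    # masks to the remaining frames.
    return sam2_tracker.propagate(video_frames, keyframe_masks)
```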
W2: Computational Complexity
We would like to clarify that the trade-off between long-horizon temporal reasoning and fine-grained spatial understanding primarily affects the training phase, rather than inference.
At inference time, increasing frame resolution or sampling rate is indeed a straightforward way to enhance performance. However, if the model has not been trained with long sequences and high-resolution inputs, it will lack the necessary capacity for temporal and spatial reasoning—thus limiting the benefit of high-resolution inference. A concrete example can be found in Appendix Table 3, where merely increasing input quality does not close the performance gap without structural support.
In our framework, we decouple training as follows:
- System 1 operates on inputs with resolution 100×28×28, with a maximum of 24 frames, and is trained end-to-end.
- System 2 does not require training or backpropagation, and handles only 5 high-resolution frames (1200×28×28). Furthermore, because System 2 processes frames independently during inference, we leverage the vLLM engine to parallelize its execution efficiently.
If we were to train the whole pipeline end-to-end using high-resolution frames (e.g., 1200×28×28), the compute budget would limit the number of frames to around 5, making it impossible to learn long-range video reasoning. This constraint directly limits global temporal understanding.
As a concrete example, Sa2VA, an end-to-end trained model, only supports 5 frames for both training and inference—confirming that a monolithic approach struggles with long-horizon temporal modeling.
Thus, our proposed two-system architecture offers a computationally efficient and scalable way to address this trade-off, enabling both global video reasoning and fine-grained visual grounding.
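To make this budget argument concrete, here is a back-of-the-envelope comparison under the illustrative assumption that an N×28×28 resolution budget corresponds to roughly N visual tokens per frame:

```python
# Rough visual-token accounting; assumes an "N x 28 x 28" budget means
# roughly N visual tokens per frame (an illustrative assumption).

sys1_train_tokens = 24 * 100    # System 1 training: 24 low-res frames  -> 2,400 tokens (with gradients)
sys2_infer_tokens = 5 * 1200    # System 2 inference: 5 high-res frames -> 6,000 tokens (no gradients)

# A monolithic model trained end-to-end at the high resolution would pay
# the per-frame cost with gradients as well:
mono_5_frames  = 5 * 1200       #  6,000 tokens with gradients (~2.5x System 1's training load)
mono_24_frames = 24 * 1200      # 28,800 tokens with gradients (~12x), impractical for long videos

print(sys1_train_tokens, sys2_infer_tokens, mono_5_frames, mono_24_frames)
```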
W3: Reward Design
Thank you for raising this point. We have conducted additional experiments to more comprehensively evaluate the effectiveness of our hierarchical reward design.
As discussed in the paper, the grounding-based reward is strongly coupled with System 2, meaning it depends heavily on the specific outputs of a particular System 2 model. This tight coupling can lead to reduced generalization when switching to a different System 2. Although adding this reward term to the other components may slightly reduce performance under the same System 2, it significantly improves generalization to other System 2 implementations.
To validate this, we perform ablation studies where we replace System 2 with Sa2VA at inference time and evaluate how different reward combinations affect final performance:
| Reward Setting | Referring | Reason | Single Object | Multi Object | Overall |
|---|---|---|---|---|---|
|  | 47.31 | 28.79 | 38.93 | 40.39 | 38.05 |
|  | 45.50 | 34.20 | 39.60 | 42.60 | 39.90 |
|  | 44.20 | 32.50 | 38.20 | 41.90 | 38.40 |
| + Sa2VA | 58.12 | 43.45 | 50.57 | 53.53 | 50.79 |
| + Sa2VA | 60.37 | 41.83 | 51.70 | 51.02 | 51.10 |
| + Sa2VA | 61.84 | 46.14 | 54.59 | 50.68 | 53.99 |
The results show that:
- When using Sa2VA as System 2, adding this reward term leads to notable performance improvements, especially on the Reasoning sub-task.
- This is because it is a weakly coupled reward, focusing on the alignment between visual features and language prompts at the single-frame level. It provides better transferability across different System 2 implementations.
- In contrast, the grounding-based reward tends to overfit to the specific structure and behavior of the System 2 used during training, limiting generalization.
In summary, even if this reward term shows limited benefit in fixed setups, it plays an important role in enabling general-purpose and modular policy learning, which is essential for a scalable architecture.
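Purely as a schematic illustration of this modular reward combination (the term names and weights below are placeholders, not the paper's notation):

```python
# Schematic combination of a hierarchical reward. Names and weights are
# illustrative placeholders, not the paper's notation.

def total_reward(keyframe_quality, frame_level_alignment, grounding_feedback,
                 w_kf=1.0, w_align=1.0, w_ground=1.0):
    """Weakly coupled terms (keyframe quality, frame-level alignment) transfer
    across System 2 choices; the grounding-based term is tied to the specific
    System 2 used during training and can be down-weighted (or disabled with
    w_ground = 0) when cross-System-2 generalization matters most."""
    return (w_kf * keyframe_quality
            + w_align * frame_level_alignment
            + w_ground * grounding_feedback)
```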
Thanks for the reply. I have no other questions.
We would like to express our sincere gratitude for your feedback. We are glad to have clarified your concerns through our response. Your comments have provided valuable perspectives that have helped us refine our methodology and improve the clarity and rigor of our presentation.
This paper addresses the challenge that traditional models struggle to balance global temporal reasoning and local detail understanding simultaneously, proposing a two-system architecture: a Global Reasoning System is trained via GRPO, and then tasks are handed over to an additional downstream task-specific system. The approach demonstrates excellent performance in audio-visual segmentation and video understanding tasks.
Strengths and Weaknesses
Strengths
- The problem addressed in this paper holds significant application value, enabling remarkable performance improvement with minimal data usage.
- The paper features thorough and well-documented experiments, ensuring both readability and scientific rigor.
Weaknesses
- There is an error in the experimental data, and the delta values in Table 1 are incorrect.
- The paper does not discuss the relationship between data scale and RL-driven improvements in target application scenarios. Adding this discussion would help assess the need for labeled data in practical applications.
Questions
- One of the core contributions highlighted in the study is the improvement in out-of-domain generalization. Does System 1 still demonstrate performance gains in scenarios with more significant distribution shifts compared to the training data, such as autonomous driving or medical domains?
- The design of the Key Frame Quality Reward appears to be influenced by specific data distributions. Will this affect its application in joint training with more diverse data distributions?
Limitations
Yes
Final Justification
All of my initial concerns have been addressed. I will keep my rating.
Formatting Concerns
No Paper Formatting Concerns
W1: Correction of Δ Values
Thank you for pointing this out. We acknowledge the error in the originally reported delta values and have corrected them accordingly. The revised table is shown below:
| Model | Seen (J&F) | Seen (J) | Seen (F) | Unseen (J&F) | Unseen (J) | Unseen (F) |
|---|---|---|---|---|---|---|
| Qwen2.5-Omni-7B (SFT) | 39.1 | 35.4 | 42.8 | 66.2 | 63.1 | 69.3 |
| Omni-R1-7B | 47.2 | 43.0 | 51.4 | 74.2 | 71.3 | 77.0 |
| Δ (corrected) | +8.1 | +7.6 | +8.6 | +8.0 | +8.2 | +7.7 |
We will correct this in the next version of the paper. Thanks again for your careful reading.
W2: Data Scale
We believe the relationship between data scale and RL-driven improvements can be analyzed from two perspectives:
Impact of Data Volume on a Single Task
We conducted an experiment on RefAVS, comparing models trained with 1600 and 10400 samples. The results are shown below (see Appendix Table 3 for more details):
| Training Samples | RefAVS(F) Seen | RefAVS(F) Unseen | AVHBench Acc |
|---|---|---|---|
| 1600 | 51.35 | 76.98 | 60.77% |
| 10400 | 54.49 | 77.63 | 58.85% |
We observe that increasing data size improves task-specific metrics, particularly in the seen subset. However, it may also introduce slightly higher hallucination rates, as reflected in AVHBench accuracy.
Multi-Datasets Joint Training
We further investigate the effect of multi-task data mixing, especially combining ReVOS and RefAVS datasets.
As shown in Appendix Table 4, training on both datasets leads to consistent performance gains even on ReVOS—which was not directly targeted during joint training:
- ReVOS overall score: 44.7 → 47.6 (+2.9)
This demonstrates that RL-based joint training across tasks and distributions can enhance general video understanding capabilities, without requiring dataset-specific tuning.
Q1: Out-of-Domain Generalization
Thank you for raising this important point. To evaluate out-of-domain generalization, we tested both Qwen2.5 Omni-7B and Omni-R1 on the DriveBench[1] dataset, which targets the autonomous driving domain—a distribution notably different from our training data.
The behavior prediction task in DriveBench requires the model to predict future steering angles and vehicle speeds based on multi-view images (from six camera directions). It is worth noting that DriveLM, the strongest baseline, is specifically fine-tuned on large-scale driving data.
We evaluate the behavior accuracy metric (as it does not depend on GPT APIs for scoring):
| Model | Behavior Accuracy |
|---|---|
| LLaVA 1.5-7B | 0.100 |
| LLaVA-NeXT | 0.230 |
| Qwen2VL-7B | 0.300 |
| Qwen2.5 Omni-7B | 0.330 |
| Omni-R1 | 0.345 |
| DriveLM (oracle) | 0.440 |
While DriveLM benefits from domain-specific fine-tuning, Omni-R1 is trained in a general-purpose fashion without any driving-specific data. Despite this, it still outperforms its base model (Qwen2.5-Omni-7B), demonstrating clear improvements in out-of-domain generalization brought by our reinforcement learning framework.
Q2: Key Frame Quality Reward
Thank you for the insightful question. We would like to clarify that the underlying principles of the Key Frame Quality Reward are generally applicable across video tasks, although the specific hyperparameter settings may vary depending on the dataset and task characteristics.
Specifically:
- The diversity term encourages the model to select non-overlapping segments as keyframes, promoting diversity. This reward is task-agnostic and generally useful across different video scenarios.
- The keyframe-count term softly constrains the number of selected keyframes. We acknowledge that different videos may require different numbers of keyframes, so we do not enforce a fixed value. Instead, we define a reasonable range bounded by a lower and an upper threshold, and penalize the model only if it selects too few or too many frames. The optimal number of keyframes is learned implicitly through the downstream rewards. When training on multi-domain data, we simply need to ensure that this range is sufficiently broad.
- The grounding-based term is only applied when ground-truth object-level annotations (e.g., masks or boxes) are available, such as in tasks like RefAVS or ReVOS. For high-level video understanding tasks that lack such annotations, we set its weight to zero.
In summary, only one of the three components requires mild adjustment when applied to joint training over diverse datasets, while the other two reward components remain broadly applicable and robust across various settings.
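For concreteness, a minimal sketch of how these three components could be computed is given below; all function names, thresholds, and weights are illustrative placeholders rather than the paper's definitions.

```python
# Illustrative sketch of the three keyframe-quality components described
# above. Names, thresholds, and weights are placeholders, not the paper's
# definitions.

def keyframe_quality_reward(segments, n_min, n_max,
                            grounding_score=None, w_ground=1.0):
    """segments: list of (start, end) frame intervals chosen as keyframe segments."""
    # 1) Diversity: penalize overlapping / redundant segments.
    segments = sorted(segments)
    overlaps = sum(
        max(0, min(a_end, b_end) - max(a_start, b_start))
        for (a_start, a_end), (b_start, b_end) in zip(segments, segments[1:])
    )
    r_diversity = 1.0 if overlaps == 0 else 1.0 / (1.0 + overlaps)

    # 2) Soft count constraint: no penalty inside [n_min, n_max],
    #    decaying reward as the count moves outside the range.
    n = len(segments)
    violation = max(0, n_min - n) + max(0, n - n_max)
    r_count = 1.0 / (1.0 + violation)

    # 3) Grounding-based term: only used when object-level annotations
    #    (masks/boxes) are available; contributes zero otherwise.
    r_ground = w_ground * grounding_score if grounding_score is not None else 0.0

    return r_diversity + r_count + r_ground
```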
Reference
[1] Xie, S., Kong, L., Dong, Y., Sima, C., Zhang, W., Chen, Q. A., ... & Pan, L. (2025). Are vlms ready for autonomous driving? an empirical study from the reliability, data, and metric perspectives. arXiv preprint arXiv:2501.04003.
Dear Reviewer zCqj,
This is a reminder that we are nearing the end of the author-reviewer discussion period. Please carefully read the authors' responses to your reviews, as well as the responses to other reviews. If you have any follow-up questions, please post them ASAP, so there is time for back-and-forth discussion with the authors.
If you don’t have any additional questions and if your concerns have been addressed, please update your ratings or final justification accordingly. If you are not willing to update your ratings, please explain in your final justification why concerns are still not addressed.
Note that the author-review discussion period will end on Aug 8, 11.59pm AoE.
Best,
AC
Thank you for the detailed and thorough response.
The correction to Table 1 is noted. The new experiments on data scaling and out-of-domain generalization (DriveBench) are insightful and convincingly support the paper's claims. Furthermore, the detailed explanation of the Key Frame Quality Reward components clarifies their adaptability.
All of my initial concerns have been addressed.
Thank you very much for your thoughtful review and for acknowledging our work. Your constructive comments have been invaluable in enhancing the quality of our paper. We are delighted that our responses addressed your concerns.
Thank you again for your time and expertise.
Dear Reviewer zCqj,
As the discussion phase is approaching its conclusion, we would like to kindly check if you have any further questions, concerns, or suggestions regarding our rebuttal. Your additional input would be invaluable in helping us further clarify and improve our work.
Thank you again for your time and thoughtful engagement throughout the review process.
Best regards
The core contributions of this work are embodied in its two-system collaborative architecture based on Qwen2.5-Omni, which decouples long-horizon reasoning from fine-grained pixel understanding via reinforcement learning. By leveraging hierarchical rewards and simulated collaboration between systems, this study demonstrates that reinforcement learning can effectively optimize keyframe selection and task reformulation in omnimodal settings, even with minimal training data. The contributions of this work include: (1) a scalable two-system architecture, with a Global Reasoning System (System 1) for efficient long-range temporal reasoning and keyframe selection, and a Detail Understanding System (System 2) for high-resolution visual grounding and precise segmentation; (2) an end-to-end reinforcement learning framework built on Group Relative Policy Optimization (GRPO), requiring only one epoch of RL training on small datasets, with hierarchical rewards (keyframe quality, instruction alignment, global consistency) guiding policy optimization; (3) state-of-the-art performance and generalization, outperforming specialized models (e.g., Sa2VA, EEMC) on the RefAVS and ReVOS benchmarks and improving out-of-domain generalization in both video and multimodal understanding tasks.
Strengths and Weaknesses
Strengths:
- This work proposes a coarse-to-fine reasoning system and designs three appropriate reward functions to stimulate the GRPO-trained system's understanding and reasoning ability.
- The method proposed in this work achieves strong performance on several types of benchmarks: referring audio-visual segmentation, referring video object segmentation, and general understanding QA benchmarks.
Weaknesses:
1. This work verifies the superiority of the architecture design on many different tasks, but lacks ablation experiments on the models within the architecture itself, such as replacing the grounding model or the segmentation model. These experiments could further verify the reliability of the method.
Questions
According to my understanding (from lines 237 to 250), in the few-shot RL training, System 1 is trainable and System 2 (including SAM2) is frozen. However, according to the results in Table 2, the best combination is System 1 + Sa2VA, and the paper does not seem to explain how they are combined. I guess that this step uses System 1's Qwen2.5-Omni only as a tool for accurately and reasonably extracting frames, and these key frames are then sent to Sa2VA, which achieves such good performance. I am particularly interested in the details of this step, because it seems that Sa2VA does not participate in training as System 2. What causes the performance improvement of this combination?
Limitations
yes
Final Justification
All my concerns have been fully addressed; the paper is ready for acceptance.
Formatting Concerns
No formatting concerns.
W1: Architecture-Level Ablations
We appreciate the reviewer’s suggestion. To evaluate the impact of architectural choices within our pipeline, we conduct ablation studies by replacing both the grounding model and the segmentation model, examining how these changes affect training and inference.
Grounding Model Replacement
We replaced the original Qwen2.5-Omni-7B with VisionReasoner [1], a unified model for visual perception tasks enhanced via GRPO across grounding, counting, and segmentation. To ensure a fair comparison, we aligned all training hyperparameters and tested two settings:
- Replace only during training
- Replace during both training and inference
Results on ReVOS are shown below:
| Method | Referring | Reason | Single Object | Multi Object | Overall |
|---|---|---|---|---|---|
| Omni-R1 (original) | 52.5 | 36.9 | 45.0 | 46.6 | 44.7 |
| Replace grounding model (training only) | 52.2 | 39.4 | 45.4 | 49.2 | 45.8 |
| Replace grounding model (full) | 55.5 | 41.7 | 49.1 | 44.7 | 48.6 |
As seen, simply replacing the grounding model during training leads to noticeable gains. A full switch to VisionReasoner during both training and inference yields even greater improvements across most sub-tasks.
Segmentation Model Replacement
We further replaced the segmentation model used during inference. Specifically, we switched from SAM2 to TAM [2], a more efficient segmentation tracker, while keeping VisionReasoner as the grounding model (training only).
| Segmentation Model | Referring | Reason | Single Object | Multi Object | Overall |
|---|---|---|---|---|---|
| SAM2 (baseline) | 52.2 | 39.4 | 45.4 | 49.2 | 45.8 |
| TAM | 50.6 | 39.1 | 44.0 | 51.3 | 44.9 |
Although the overall score drops slightly (-0.9), TAM shows better performance on multi-object tracking, indicating its advantage in handling complex scenarios.
Conclusion: These results validate the robustness and generalizability of our architecture. Stronger grounding or segmentation models can further enhance performance, and our pipeline remains effective under architectural variations.
Q1: Integrating Sa2VA as System 2
Yes, your understanding is correct. The purpose of this experiment is to demonstrate that System 1 and System 2 are modular and can be decoupled—replacing System 2 with a stronger module like Sa2VA leads to better performance.
Although Sa2VA was not involved in our reinforcement learning training, it naturally fits the role of System 2. In its original setting, Sa2VA takes as input either the first five frames of a video or five frames uniformly sampled from the whole video, along with the original global instruction.
In our setup, we leverage System 1 to select five key frames and generate localized, simplified instructions for each segment. These keyframes and instructions are then fed into Sa2VA for fine-grained segmentation.
This decoupled design allows System 1 to fully exploit its strength in global video reasoning, while Sa2VA focuses on precise, local-level understanding. As shown in Table 2, this integration yields improved performance without any modification to Sa2VA’s parameters or architecture, further validating the effectiveness of our modular two-system design.
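The modularity can also be illustrated with a small, hypothetical adapter sketch; class and method names are placeholders, not the authors' code.

```python
# Hypothetical adapter showing how a frozen Sa2VA model can serve as System 2.
# Class and method names are illustrative placeholders, not the authors' code.

class Sa2VAAdapter:
    """Wraps a frozen Sa2VA model behind a minimal System 2 interface."""
    def __init__(self, sa2va_model):
        self.model = sa2va_model

    def segment(self, frame, local_prompt):
        # Sa2VA receives a single keyframe plus the short, localized
        # instruction produced by System 1, instead of the full video
        # with the original global prompt.
        return self.model.predict_mask(frame, local_prompt)

# System 1 stays unchanged; only the System 2 module is swapped, e.g.:
# system2 = Sa2VAAdapter(sa2va_model)
# mask = system2.segment(keyframe, local_prompt)
```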
Reference
[1] Liu, Y., Qu, T., Zhong, Z., Peng, B., Liu, S., Yu, B., & Jia, J. (2025). VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning. arXiv preprint arXiv:2505.12081.
[2] Xiong, Y., Zhou, C., Xiang, X., Wu, L., Zhu, C., Liu, Z., ... & Chandra, V. (2024). Efficient track anything. arXiv preprint arXiv:2411.18933.
Thanks. I have no other questions.
We sincerely appreciate your insightful questions and constructive suggestions on architectural ablation study and are glad that we solved your concerns. Your comment has provided valuable insights into our proposed method and played a significant role in improving the quality of our paper.
This paper introduces Omni-R1, an RL framework built on the GRPO training objective for the tasks of Referring Audio-Visual Segmentation and Reasoning Video Object Segmentation. The proposed approach adopts two systems, where System 1 is responsible for performing global reasoning and outputting a set of key frames, and System 2 performs fine-grained reasoning to generate the final answer. Experiments on the RefAVS and ReVOS tasks demonstrate the effectiveness of the proposed framework.
Strengths and Weaknesses
Strengths
- This paper proposes an effective framework involving GRPO to solve the tasks of RefAVS and ReVOS.
- The paper is well-written and clearly organized, making it easy to follow.
Weakness
- The paper claims that the proposed framework is an omnimodal approach capable of solving video, image, and audio tasks. However, the major improvements and contributions lie in the two-stage reasoning on videos.
- Insufficient training details for the GRPO training process, for example the reward curves and the number of rollouts during training.
- Lack of case studies on the reasoning results of system-1 and system-2, as well as the rollout results in between. Providing these results would help demonstrate the entire process more effectively.
- Lack of evaluations on image and video understanding tasks.
Questions
- For the Key Frame Quality Reward, how are the diversity and the number of key frames determined for different videos, which are then used to calculate the corresponding reward terms? I think it is impossible to tell the diversity and key-frame number without analyzing the content of videos of varying types and durations.
Limitations
yes
Final Justification
Most of my concerns, including the training details have been addressed. I will keep my positive score.
Formatting Concerns
NA
W1: Omnimodal Approach
One of the central challenges in building an OmniLLM lies in handling long-term temporal reasoning in combination with high-resolution fine-grained understanding. This challenge is especially prominent when integrating modalities like video and audio, which have vastly different token densities: for example, 1 second of audio may require only ~25 tokens, while a single 448×448 video frame typically takes 256 visual tokens. Thus, the core conflict—and bottleneck—resides in efficiently performing global reasoning over long-range video sequences while preserving detailed understanding at key moments.
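As a rough illustration of this token-density imbalance, consider a hypothetical 60-second clip using the approximate per-modality rates quoted above:

```python
# Back-of-the-envelope token counts for a hypothetical 60-second clip,
# using the approximate rates quoted above (~25 audio tokens/s and
# 256 visual tokens per 448x448 frame). Purely illustrative.

clip_seconds = 60
audio_tokens = clip_seconds * 25            # 1,500 tokens for the whole audio track
video_tokens_1fps = clip_seconds * 1 * 256  # 15,360 tokens at just 1 frame per second

print(audio_tokens, video_tokens_1fps)      # video dominates the token budget
```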
Our dual-system architecture is specifically designed to address this challenge. As mentioned in L149, our System 1 can process both video and audio inputs jointly. While our method section focuses primarily on video understanding, we also validate the generality of the framework on RefAVS, which requires both audio and visual cues for temporal localization and grounding. In this task, System 1 learns to reason over the audio and visual sequence and generate global-to-local instructions.
Importantly, our results on RefAVS show that joint audio-video reinforcement learning improves performance significantly, demonstrating that our system does not rely solely on vision. Additionally, we show that the same omnimodal reinforcement learning paradigm improves performance on Omnibench—an image-audio task—further verifying the generality and scalability of our proposed Omni-R1 framework.
Therefore, while the most substantial improvements are observed on video tasks (which are inherently more complex), our method is truly omnimodal, and effective across vision, audio, and multi-modal reasoning scenarios.
W2: Training Details
Thank you for the suggestion. We will include reward curves in the next version of the paper. Due to image restrictions during the rebuttal phase, we provide a brief summary here:
During training, one reward component converges quickly and stably, indicating that the model rapidly adapts to the temporal distribution of selected frames. A second component exhibits fluctuations around 0.3, which we attribute to inconsistencies in local frame-level grounding. The remaining component shows an oscillating trend throughout training; notably, for high-quality rollout trajectories it ultimately stabilizes around 0.8, suggesting effective convergence under high-performing behavior.
Regarding the number of rollouts, we clarify in L248 that the group size is 8, which corresponds to 8 rollouts per update in the GRPO training process.
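For completeness, here is a minimal sketch of the group-relative advantage computation used by GRPO with the stated group size of 8; clipping, the KL regularizer, and token-level credit assignment are omitted.

```python
import numpy as np

# Group-relative advantages for one prompt's group of 8 rollouts (sketch).

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: scalar rewards for the rollouts sampled for one prompt."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 rollouts sampled for the same video and instruction.
rollout_rewards = [0.2, 0.8, 0.5, 0.9, 0.1, 0.6, 0.7, 0.3]
print(group_relative_advantages(rollout_rewards))
```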
W3: Case Studies
Thank you for your suggestion. We will include additional case studies in the next version of the paper to illustrate the outputs of both System-1 and System-2, as well as the intermediate results from the rollout process.
While we are currently unable to add new figures during the rebuttal phase, we kindly refer you to Appendix Figures 3 and 4, which already show examples of the reasoning behaviors of both systems. These examples provide an initial demonstration of how System-1 and System-2 work together to perform multimodal reasoning.
W4: Evaluations on Image and Video Understanding Tasks
Thank you for the suggestion. In the main paper (Table 3), we evaluate our model on video understanding benchmarks such as VideoMME and MVBench, as well as the OmniBench benchmark for image understanding.
To further assess image-level understanding capabilities, we additionally evaluate on MME[1] and DriveBench[2]. These results support the generality of our framework across both video and image modalities.
MME Results:
| Model | Total Score | Perception Score | Cognition Score |
|---|---|---|---|
| Qwen2.5-Omni | 2071.4 | 1556.8 | 514.6 |
| Omni-R1 | 2100.2 | 1572.0 | 528.2 |
DriveBench Results (Behavior Accuracy):
| Model | Behavior Accuracy |
|---|---|
| LLaVA 1.5-7B | 0.100 |
| LLaVA-NeXT | 0.230 |
| Qwen2VL-7B | 0.300 |
| Qwen2.5 Omni-7B | 0.330 |
| Omni-R1 | 0.345 |
| DriveLM | 0.440 |
We would like to note that DriveLM is specifically trained on driving-domain data, while Omni-R1 is trained in a general-purpose manner without any driving-specific fine-tuning. Despite this, Omni-R1 still achieves competitive performance across domains. Importantly, compared to the base Qwen2.5 Omni-7B, Omni-R1 consistently improves performance on both tasks (e.g., MME and DriveBench), demonstrating the effectiveness of our reinforcement learning–based optimization strategy.
Q1: The Two Keyframe Quality Reward Terms
Thank you for the thoughtful question. We would like to further clarify the design and intent behind the two keyframe quality rewards:
- The diversity term is designed to encourage the selection of non-overlapping and semantically diverse video segments as keyframes. It penalizes redundant or closely clustered selections. This formulation is generalizable across different types and durations of videos, as it operates purely on segment-selection patterns without requiring prior content-specific assumptions.
- The keyframe-count constraint acknowledges that different videos may require different numbers of keyframes. Therefore, we do not enforce a fixed number of selected keyframes. Instead, we define a soft constraint via two thresholds (a lower and an upper bound), and a penalty is applied only when the number of selected keyframes falls outside this range. The model then learns to select the optimal number of keyframes based on the downstream rewards.
This design enables the system to adaptively balance diversity and sufficiency of information while remaining general-purpose and content-agnostic.
Reference
[1] Fu, C., Chen, P., Shen, Y., Qin, Y., et al. (2023). MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394.
[2] Xie, S., Kong, L., Dong, Y., Sima, C., Zhang, W., Chen, Q. A., ... & Pan, L. (2025). Are VLMs ready for autonomous driving? An empirical study from the reliability, data, and metric perspectives. arXiv preprint arXiv:2501.04003.
Dear Reviewer vim3,
This is a reminder that we are nearing the end of the author-reviewer discussion period. Please carefully read the authors' responses to your reviews, as well as the responses to other reviews. If you have any follow-up questions, please post them ASAP, so there is time for back-and-forth discussion with the authors.
If you don’t have any additional questions and if your concerns have been addressed, please update your ratings or final justification accordingly. If you are not willing to update your ratings, please explain in your final justification why concerns are still not addressed.
Note that the author-review discussion period will end on Aug 8, 11.59pm AoE.
Best,
AC
Thanks for the rebuttal comment and explanations provided by authors. I will keep my score as positive.
We sincerely appreciate your thoughtful feedback and are truly grateful for your recognition of the strengths in our work. Your constructive comments and insightful suggestions have been invaluable in helping us refine and enhance the quality of our paper.
Thank you again for dedicating your time and sharing your expertise.
Dear Reviewer vim3,
As the discussion phase is approaching its conclusion, we would like to kindly check if you have any further questions, concerns, or suggestions regarding our rebuttal. Your additional feedback would be invaluable in helping us further clarify and improve our work.
Thank you again for your time and thoughtful engagement throughout the review process.
Best regards
In this work, the authors present Omni-R1, a two-system RL algorithm for Omnimodal Reasoning. Omni-R1 contains a global reasoning stage to select informative keyframes, and a detailed understanding stage for high-resolution inputs. The two-system architecture is trained via GRPO with specially designed reward functions. The authors provide the experimental results on the RefAVS and REVOS tasks, and the results demonstrate the effectiveness of the proposed method.
All reviewers agree that this work's contribution is strong with the two-system collaborative architecture and GRPO-based RL training.
- Citing Reviewer zCqj (score 5): The problem addressed in this paper holds significant application value, enabling remarkable performance improvement with minimal data usage.
- Citing Reviewer 7n6f (score 4): This work proposes a coarse-to-fine reasoning system and designs three appropriate reward functions to stimulate the GRPO-trained system's understanding and reasoning ability.
The majority of the reviewers' concerns were about missing implementation details or limited ablation studies. In the author-reviewer discussion phase, the authors addressed most of the concerns; there are a few additional results that I feel particularly made the work stronger:
- additional training details for the GRPO training process.
- more ablation experiments on architectural designs.
- additional results on self-driving datasets, which demonstrate the out-of-domain generalization ability.
- results and further analysis of the reward design.
Overall, this work is strong and received positive feedback from all reviewers, and I recommend acceptance. I clearly see its value for the NeurIPS community.