$\mathcal{V}ista\mathcal{DPO}$: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
Abstract
Reviews and Discussion
In this paper, the authors propose Video Hierarchical Spatial-Temporal DPO (VistaDPO), a DPO strategy to optimize the alignment between video and language in LVMs. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Furthermore, the authors construct and release an annotated QA-pair dataset, which can serve as a resource for follow-up video DPO research.
Questions for Authors
Check the Weaknesses.
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
Looks good to me.
Experimental Designs and Analyses
Looks good to me.
Supplementary Material
Yes.
Relation to Broader Scientific Literature
Lack of comparison with highly related work.
[1]Nguyen, Trong-Thuan, Pha Nguyen, and Khoa Luu. "Hig: Hierarchical interlacement graph approach to scene graph generation in video understanding." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18384-18394. 2024.
Essential References Not Discussed
No.
Other Strengths and Weaknesses
Strengths: VistaDPO demonstrates notable strengths by effectively addressing video-language misalignment through a hierarchical approach involving instance-level, temporal-level, and perceptive-level optimization. Its detailed spatial-temporal grounding dataset (VistaDPO-7k) significantly enhances the model's fine-grained semantic understanding and generalization capability, resulting in good performance on benchmarks for hallucination reduction, video QA, and captioning tasks, alongside strong robustness in adversarial conditions.
Weaknesses: However, the method also exhibits some limitations.
- Hierarchical optimization increases computational complexity due to multi-level spatial-temporal modeling.
- Limited generalizability beyond the specific base models tested.
- Lack of comparison with related works, like [1].
[1]Nguyen, Trong-Thuan, Pha Nguyen, and Khoa Luu. "Hig: Hierarchical interlacement graph approach to scene graph generation in video understanding." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18384-18394. 2024.
Other Comments or Suggestions
Check the Weaknesses.
Thank you all for your thoughtful and constructive comments! We are really encouraged to see that the reviewers appreciate some positive aspects of our paper, such as strong performance (Reviewer NY3g, CzUw, zu9j, sRt7), high robustness (Reviewer NY3g, sRt7), comprehensive & informative experiments (Reviewer NY3g, CzUw, zu9j), innovative contributions (Reviewer NY3g, CzUw, zu9j), and sound methodology (Reviewer NY3g, CzUw, zu9j).
Below, we present our detailed responses. For any further possible concerns and questions, please feel free to let us know, and we are more than happy to engage in interactions and discussions.
Relation To Broader Scientific Literature: Lack of comparison with highly related work.
Thanks! As noted, HIG and VistaDPO both employ hierarchical modeling for video understanding but differ in focus and application. HIG focuses on spatiotemporal relationship modeling to generate scene graphs, addressing object relationship challenges. In contrast, VistaDPO emphasizes multimodal alignment between video and language, tackling issues like hallucination and human intuition drift.
Methodologically, HIG uses graph neural networks to capture spatiotemporal interactions, while VistaDPO employs preference optimization with multi-level alignment to enhance semantic consistency in large video models. Both approaches uniquely contribute to their respective domains—HIG in visual relationship modeling and VistaDPO in multimodal semantic alignment.
We will incorporate this reference in the revision to provide a more comprehensive discussion.
Weakness 1: Hierarchical optimization increases computational complexity due to multi-level spatial-temporal modeling.
Hierarchical optimization, despite increasing computational complexity through multi-level spatial-temporal modeling, delivers notable performance gains for video-language alignment in LVMs and offers valuable directions for future work. Balancing this trade-off presents a challenge and opportunity to develop efficient, scalable approaches.
Weakness 2: Limited generalizability beyond the specific base models tested.
Although neither VistaDPO nor VistaDPO-7K is model-specific in terms of methodology or data, we have validated their effectiveness on mainstream Large Video Language Models (PLLaVA and Video-LLaVA) and presented the results in the paper.
To further support this point, we additionally included results for LLaVA-Next-Video [1] within a limited timeframe. Its performance on Video Hallucination (VideoHallucer & EventHallusion), VideoQA (ActivityNet-QA), and Captioning (VideoChat-GPT Bench) tasks is as follows:
Table A: Evaluation on VideoHallucer and EventHallusion.
| Models | VideoHallucer Basic↑ | VideoHallucer Hallucinated↑ | VideoHallucer Overall↑ | EventHallusion Entire↑ | EventHallusion Mix↑ | EventHallusion Misleading↑ | EventHallusion Overall↑ |
|---|---|---|---|---|---|---|---|
| LLaVA-Next-Video (7B) | 64.8 | 58.7 | 31.5 | 49.1 | 70.5 | 71.6 | 64.8 |
| LLaVA-NeXT-Video-DPO (7B) | 62.5 | 60.9 | 32.0 | 51.8 | 72.5 | 75.5 | 67.5 |
| LLaVA-NeXT-Video-VistaDPO (7B) (Ours) | 67.2 | 63.5 | 38.8 | 54.4 | 73.1 | 77.5 | 68.9 |
Table B: Evaluation on ActivityNet-QA and VideoChat-GPT Captioning Bench.
| Models | ActivityNet-QA (Acc./Score)↑ | Correct↑ | Detail↑ | Context↑ | Temporal↑ | Consist↑ |
|---|---|---|---|---|---|---|
| LLaVA-Next-Video (7B) | 53.5/3.2 | 3.4 | 3.3 | 3.9 | 2.6 | 3.1 |
| LLaVA-NeXT-Video-DPO (7B) | 60.2/3.5 | 3.6 | 3.5 | 4.2 | 3.0 | 4.1 |
| LLaVA-NeXT-Video-VistaDPO (7B) (Ours) | 62.0/3.6 | 3.8 | 3.6 | 4.4 | 3.4 | 4.1 |
These experiments demonstrate that VistaDPO remains effective across different base models, supporting its broader generalizability.
[1] LLaVA-NeXT: A Strong Zero-shot Video Understanding Model
Weakness 3: Lack of comparison with related works, like HIG.
As discussed in the response to Relation To Broader Scientific Literature, the mentioned references are not directly related to this work and do not require direct comparison.
Nevertheless, we believe they can enrich the discussion of the paper from the perspectives of data and spatiotemporal modeling. Therefore, we will further include citations and discussions of these works in future revisions.
The paper tackles the problem of video large language models. The authors claim that existing methods for open-ended video-language understanding often suffer from misalignment with human intuition and video hallucination issues. In order to address these issues, they proposed VistaDPO, a new framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. The proposed method provides direct preference optimization on three levels including instance-level, temporal-level, and perception-level. Extensive experiments on multiple public datasets demonstrate the effectiveness of the proposed scheme.
Questions for Authors
N/A
Claims and Evidence
The main claim of the paper is that "the proposed VistaDPO method can effectively solve the problem of misalignment with human intuition and video hallucination". This claim is supported by extensive experiments on public benchmarks, which is reasonable.
Methods and Evaluation Criteria
Yes. It is reasonable to conduct DPO at a fine-grained level, which could inherently provide more detailed supervision to optimize the model.
Theoretical Claims
The paper does not contain any proofs or theoretical claims.
Experimental Designs and Analyses
The experiments were conducted on public benchmarks for general video understanding. Results on these benchmarks clearly demonstrate the effectiveness and significance of the proposed scheme.
Supplementary Material
The appendix provides more details about the limitations, related work, model details, prompt templates, more ablation studies, and visualizations.
Relation to Broader Scientific Literature
The key contribution of the paper is the fine-grained VistaDPO design, which could effectively alleviate the misalignment with human intuition and video hallucination issues, as demonstrated by the experiments.
Essential References Not Discussed
N/A
Other Strengths and Weaknesses
Generally, this is a good paper on LLM-based video understanding and DPO for Video-LLMs. Extending existing DPO paradigms from coarse-grained to fine-grained is a non-trivial setting. The proposed VistaDPO model handles such scenarios well.
Other Comments or Suggestions
In my opinion, it might be better to revise the term "three levels" (instance-level, temporal-level, and perception-level) into "three aspects", as these three "levels" do not seem to have a clear hierarchical relationship.
Ethics Review Issues
N/A
Thank you to all the reviewers for your thoughtful and constructive comments! We are really encouraged to see that the reviewers appreciate some positive aspects of our paper, such as strong performance (Reviewer NY3g, CzUw, zu9j, sRt7), high robustness (Reviewer NY3g, sRt7), comprehensive & informative experiments (Reviewer NY3g, CzUw, zu9j), innovative contributions (Reviewer NY3g, CzUw, zu9j), and sound methodology (Reviewer NY3g, CzUw, zu9j).
Your expertise significantly helps us strengthen our manuscript! Here we address your thoughtful comments point by point. Also, for any further possible concerns and questions, please feel free to let us know, and we are more than happy to engage in interactions and discussions.
Other Suggestion: It might be better to revise the term "three levels" (instance-level, temporal-level, and perception-level) into "three aspects".
Thank you for your insightful comment! While we understand your point, we use "three levels" to reflect a hierarchical progression: instance-level provides foundational analysis, temporal-level builds on it to capture dynamics across time, and perception-level integrates both for higher-level understanding. This dependency establishes a clear hierarchy.
We hope this clarifies our rationale, but we remain open to further discussion to improve clarity. Thank you again for your valuable feedback.
Thanks for the response from the authors. I'm keeping my original rating.
Dear Reviewer zu9j,
Thank you so much for your valuable feedback and for helping improve our paper!
Best Regards
This paper introduces VistaDPO, a method designed to enhance video-text preference alignment at three levels: instance, temporal, and perceptive. The authors also contribute VistaDPO-7k, a new dataset for DPO training, and demonstrate substantial performance improvements on various video benchmarks.
Questions for Authors
See my questions in the comments above.
Claims and Evidence
The claims are well-supported by the presented evidence.
Methods and Evaluation Criteria
The method's design, which decomposes video hallucination into three hierarchical levels for fine-grained preference optimization, is logically sound.
Regarding object-level spatial alignment, it appears to operate at the image level if my understanding is correct. Would it make more sense to consider joint spatial-temporal alignment? For instance, could rejected responses be generated by masking an object's bounding box across multiple frames?
Theoretical Claims
N/A
Experimental Designs and Analyses
The experimental designs and analyses are generally well-executed.
Supplementary Material
I skimmed the Supplementary.
Relation to Broader Scientific Literature
This paper contributes to the field of video-language models, specifically addressing visual hallucination using DPO. It introduces a valuable benchmark and proposes an advancement in training methods for mitigating visual hallucination.
Essential References Not Discussed
No
Other Strengths and Weaknesses
Strengths:
- The VistaDPO-7k dataset is a potentially valuable resource for the research community.
- The experiments are comprehensive and demonstrate substantial performance gains.
- The ablation study (Table 3) and qualitative examples (Figure 6) are particularly informative, effectively illustrating the efficacy of VistaDPO.
Weaknesses:
- (Not a requirement for acceptance and no pressure for conducting experiments, just a general comment) Evaluating VistaDPO on a wider range of base models beyond Video-LLaVA and PLLaVA would strengthen the paper. Models like QwenVL, LLaVA-Next-Video, or Tarsier are more "video-centric" and may provide a stronger test of generalization.
- The proposed method seems dependent on the fine-grained annotations of VistaDPO-7K. Could the authors comment on the applicability of their method to preference datasets that lack the fine-grained annotations present in VistaDPO-7K?
Other Comments or Suggestions
N/A
Thank you to all the reviewers for your thoughtful and constructive comments! We are really encouraged to see that the reviewers appreciate some positive aspects of our paper, such as strong performance (Reviewer NY3g, CzUw, zu9j, sRt7), high robustness (Reviewer NY3g, sRt7), comprehensive & informative experiments (Reviewer NY3g, CzUw, zu9j), innovative contributions (Reviewer NY3g, CzUw, zu9j), and sound methodology (Reviewer NY3g, CzUw, zu9j).
Your expertise significantly helps us strengthen our manuscript. Here we address your thoughtful comments point by point. For any further possible concerns and questions, please feel free to let us know, and we are more than happy to engage in interactions and discussions.
Methods: Would it make more sense for object-level spatial alignment to consider joint spatial-temporal alignment?
Thank you for your detailed review and thoughtful comment! Object-level spatial alignment is performed at the keyframe(s) level. In VistaDPO-7K, two key annotations are provided: the first and last frames where a key object appears (clip level) and the keyframe(s) of the key object (object level). Some videos have a single keyframe, while others include multiple keyframes, depending on the interaction duration between key subjects (see Figure 2(a)).
For the proposed object-level spatial alignment, our core objective is to mask out key objects in the keyframe(s) to ensure that the corresponding objects and their relationships are absent. This process creates negative samples, further enhancing the model’s spatial reasoning capabilities. When multiple keyframes are involved, object-level alignment also serves as a form of joint spatial-temporal alignment.
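For illustration, below is a minimal sketch of how such a masked negative clip could be constructed (the function and variable names here are hypothetical and simplified, not the exact implementation used in our codebase):

```python
import numpy as np

def mask_object_in_keyframes(frames, keyframe_ids, bbox):
    """Zero out a key object's bounding box in the annotated keyframes.

    frames: np.ndarray of shape (T, H, W, 3), the sampled video frames.
    keyframe_ids: indices of the keyframes where the key object appears.
    bbox: (x1, y1, x2, y2) box of the key object in pixel coordinates.
    """
    x1, y1, x2, y2 = bbox
    masked = frames.copy()
    for t in keyframe_ids:
        masked[t, y1:y2, x1:x2, :] = 0  # remove the object region in this keyframe
    return masked

# The masked clip is then paired with the original response as a rejected
# (non-preferred) visual sample for object-level preference optimization.
```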
Thank you again for highlighting this issue. We will provide a more detailed explanation in future revisions.
Weakness 1: Evaluating VistaDPO on a wider range of base models beyond Video-LLaVA and PLLaVA would strengthen the paper.
Thank you for your thoughtful suggestion! While neither our proposed VistaDPO-7K nor VistaDPO is model-specific by design, extensive model testing indeed facilitates better validation of generalization. Due to time and computational constraints, we have included results for LLaVA-Next-Video. Its performance on Video Hallucination (VideoHallucer & EventHallusion), VideoQA (ActivityNet-QA), and Captioning (VideoChat-GPT Bench) tasks is as follows:
Table A: Evaluation on VideoHallucer and EventHallusion.
| Models | VideoHallucer Basic↑ | VideoHallucer Hallucinated↑ | VideoHallucer Overall↑ | EventHallusion Entire↑ | EventHallusion Mix↑ | EventHallusion Misleading↑ | EventHallusion Overall↑ |
|---|---|---|---|---|---|---|---|
| LLaVA-Next-Video (7B) | 64.8 | 58.7 | 31.5 | 49.1 | 70.5 | 71.6 | 64.8 |
| LLaVA-NeXT-Video-DPO (7B) | 62.5 | 60.9 | 32.0 | 51.8 | 72.5 | 75.5 | 67.5 |
| LLaVA-NeXT-Video-VistaDPO (7B) (Ours) | 67.2 | 63.5 | 38.8 | 54.4 | 73.1 | 77.5 | 68.9 |
Table B: Evaluation on ActivityNet-QA and VideoChat-GPT Captioning Bench.
| Models | ActivityNet-QA (Acc./Score)↑ | Correct↑ | Detail↑ | Context↑ | Temporal↑ | Consist↑ |
|---|---|---|---|---|---|---|
| LLaVA-Next-Video (7B) | 53.5/3.2 | 3.4 | 3.3 | 3.9 | 2.6 | 3.1 |
| LLaVA-NeXT-Video-DPO (7B) | 60.2/3.5 | 3.6 | 3.5 | 4.2 | 3.0 | 4.1 |
| LLaVA-NeXT-Video-VistaDPO (7B) (Ours) | 62.0/3.6 | 3.8 | 3.6 | 4.4 | 3.4 | 4.1 |
Weakness 2: Could the authors comment on the applicability of their method to preference datasets that lack the fine-grained annotations present in VistaDPO-7k?
The fine-grained information in VistaDPO-7K significantly supports VistaDPO's performance and provides insights for video-language alignment. For datasets lacking such annotations, instance, temporal, and object-level details—validated in this study—can be supplemented using advanced spatial and temporal grounding techniques without manual effort. Future work will focus on automated pipelines and improved metrics to better align with human preferences, making VistaDPO's sample construction a valuable reference.
I appreciate the authors' provided responses and additional experiments, which strengthen the paper's claims. I maintain my score as Accept.
Dear Reviewer CzUw,
We extend our heartfelt thanks to you for supporting our paper! We are pleased that our rebuttal has sufficiently addressed your concerns.
Warm Regards
In this work, a DPO-based framework is proposed for Video-LLMs across different tasks, including video QA, hallucination, and captioning. Inspired by DPO-based methods for LLMs, a spatio-temporal-aware DPO framework is proposed that optimizes the preferences of the Video-LLM across three axes: 1) instance level, 2) temporal level, and 3) perception level. To realize these objectives, a new training dataset is proposed that consists of preferred and non-preferred responses across these different hierarchies. The proposed method has been applied to different Video-LLMs, where it shows substantial performance gains compared to baselines and other methods. Extensive ablation studies are performed to justify the different DPO losses in the proposed framework.
Update after rebuttal
Dear Authors,
Thank you for providing the rebuttal response. Overall, my major concerns have been sufficiently addressed. I would strongly recommend that the authors incorporate the ablations and clarifications presented in the rebuttal response into the main manuscript in the final version.
The proposed DPO-based framework for video understanding tasks holds significant importance in advancing these models on multiple performance fronts. I would recommend acceptance for the manuscript.
Questions for Authors
Additional questions:
- It is unclear how the representation embeddings are obtained for the t-SNE plot visualizations.
- It would be good to have an analysis showing which component of the model (LLM, vision backbone, etc.) is most crucial to tune to leverage the proposed DPO objectives.
Claims and Evidence
Yes the claims made in the submission are supported by strong empirical and qualitative results.
Methods and Evaluation Criteria
Yes, the evaluation criteria and proposed framework are relevant to the problem at hand. The manuscript provides results on MVBench in the supplementary section, where the performance gains are not on the same scale as on the other benchmarks; it is important to discuss, via analysis and qualitative results, potential factors behind the different performance trends.
Theoretical Claims
The manuscript has not listed any explicit theoretical proof. The extension of LLM based DPO to different spatio-temporal hierarchies is intuitive.
Experimental Designs and Analyses
Overall, the experimental design is sound. However, it is not entirely clear what data the model is trained on. Are the Video-LLMs that use the VistaDPO method trained using only the 7K dataset proposed in the manuscript? The overall dataset size seems a bit small compared to other video-language training datasets. I understand preference-based data curation is inherently difficult, but there should be additional ways to overcome the dataset scarcity problem. Have the authors tried to use both the proposed dataset and a general video-language dataset together in the same training run by activating the objective functions accordingly?
Supplementary Material
Yes I have read the pdf document after the references.
Relation to Broader Scientific Literature
The main contributions aim to advance the robustness and reliability of Video-LMMs such that they refrain from generating hallucinated content.
Essential References Not Discussed
Relevant references have been included in the submission manuscript.
Other Strengths and Weaknesses
Overall, the proposed spatio-temporal variants of DPO for aligning Video-LMMs with human intents are intuitive and have shown strong empirical results. These contributions are encouraging in general.
Regarding the weaknesses, I would strongly recommend the authors to provide resolutions to all comments made in earlier sections of the review.
Other Comments or Suggestions
Minor: Space typo in the impact statement.
Thank you to all the reviewers for your thoughtful and constructive comments! We are really encouraged to see that the reviewers appreciate some positive aspects of our paper, such as strong performance (Reviewer NY3g, CzUw, zu9j, sRt7), high robustness (Reviewer NY3g, sRt7), comprehensive & informative experiments (Reviewer NY3g, CzUw, zu9j), innovative contributions (Reviewer NY3g, CzUw, zu9j), and sound methodology (Reviewer NY3g, CzUw, zu9j).
Your expertise significantly helps us strengthen our manuscript. Here we show our detailed responses. For any further possible concerns and questions, please feel free to let us know, and we are more than happy to engage in interactions and discussions. Also, if you find our responses useful, please kindly consider raising your score.
MVBench Evaluation: It is important to discuss via analysis and qualitative results potential factors for different performance trends.
Your observations are truly insightful! The proposed method outperforms both the base model and Hound-DPO on most aspects of MVBench. However, MVBench’s significantly larger scale and diversity (e.g., 4,000 QA groups across 20 types vs. 1,800 QA groups across 2 types in VideoHallucer) make it inherently more challenging. Additionally, the relatively lightweight scale of the VistaDPO-7K dataset, which is used as a post-training dataset, somewhat limits the potential for significant performance improvements on MVBench.
We observed smaller gains in UA, MA, and EN due to the following reasons:
- UA (Unexpected Action): While VistaDPO effectively aligns the model with spatiotemporal semantic hierarchies, it does not explicitly align the model with human-defined abstract concepts, such as humor.
- MA (Moving Attribute): Only 4% of the VistaDPO-7K dataset focuses on dynamic attributes, and the model lacks explicit enhancements for handling physical affine transformations caused by moving viewpoints.
- EN (Egocentric Navigation): The majority of data in VistaDPO-7K is still based on mainstream third-person perspectives. This domain gap between third-person and first-person video data negatively impacts the model's performance in egocentric tasks.
In future work, we plan to expand the scale of the VistaDPO-7K dataset to further enhance its diversity and comprehensiveness.
Experimental Designs: It is not clear exactly what data the model is trained on. Are the Video-LLMs that use the VistaDPO method trained using only the 7K dataset proposed in the manuscript?
Yes, Video-LLMs in this study are directly post-trained (fine-tuned) using only the VistaDPO-7K dataset, based on their pre-trained weights, following the standard DPO approach.
In this manuscript, we did not utilize any additional datasets for post-training, as they lack the well-annotated temporal and object-level information provided by VistaDPO-7K. Nevertheless, incorporating larger and more diverse annotated datasets could further enhance the model’s effectiveness. Future research will explore the development of automated data pipelines to address this.
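For reference, the standard DPO objective we refer to above is the widely used formulation, with policy $\pi_\theta$, frozen reference model $\pi_{\text{ref}}$, chosen response $y_w$, and rejected response $y_l$ (this is the generic objective, not the paper's level-specific variants):

$$
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

VistaDPO applies this type of preference objective at the instance, temporal, and object levels, with the exact per-level formulation given in the paper.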
Additional Question 1: It is unclear how the representation embeddings are obtained for the tsne plot visualizations.
Thank you for pointing this out, and we apologize for any potential confusion caused! For the t-SNE visualization in Figure 3, the video and text semantics are represented using the last token embedding from the LVMs.
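As a minimal sketch of this procedure (assuming a Hugging Face-style model interface; the function names and hyperparameters below are illustrative, not our exact code):

```python
import torch
from sklearn.manifold import TSNE

def last_token_embedding(model, inputs):
    """Return the final-layer hidden state of the last token (assumes no right padding)."""
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][:, -1, :]  # shape: (batch, hidden_dim)

def tsne_project(embeddings):
    """Project stacked (N, hidden_dim) embeddings to 2D for visualization."""
    return TSNE(n_components=2, perplexity=30, init="pca").fit_transform(
        embeddings.float().cpu().numpy()
    )
```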
Additional Question 2: It will be good to have an analysis that shows which component of the model (LLM, vision backbone etc) is most crucial to tune to leverage the proposed DPO objectives.
Thank you for your thoughtful question! In the manuscript, we follow prior LVM DPO works (e.g., LLaVA-Hound-DPO, LLaVA-Next-DPO) by integrating a LoRA into the language model backbone (e.g., Vicuna-7B). This ensures efficient training while maintaining fairness for comparison with previous studies.
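As an illustration, attaching LoRA adapters to the language backbone can be done along these lines (a sketch using the PEFT library; `language_model`, the target module names, and the hyperparameter values are placeholders, not the exact configuration used in the paper):

```python
from peft import LoraConfig, get_peft_model

# Wrap only the language model backbone with LoRA; the vision encoder and
# projector remain frozen, following prior LVM DPO setups.
lora_config = LoraConfig(
    r=64,                      # example rank, not the paper's value
    lora_alpha=128,            # example scaling, not the paper's value
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections in a LLaMA/Vicuna-style LLM
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
# `language_model` is the LLM backbone loaded elsewhere (e.g., Vicuna-7B).
language_model = get_peft_model(language_model, lora_config)
```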
Below, we provide a comparison on EventHallusion of tuning only the LLM, only the vision backbone, and both:
Table A: Comparison of Tuning Module on EventHallusion.
| Methods | Entire↑ | Mix↑ | Misleading↑ | Overall↑ |
|---|---|---|---|---|
| Video-LLaVA | 30.7 | 57.5 | 41.2 | 45.9 |
| + Only Vision Backbone | 32.5 | 58.0 | 39.2 | 46.2 |
| + Only LLM (Ours) | 50.9 | 62.2 | 95.1 | 67.2 |
| + Both Vision Backbone & LLM | 51.8 | 61.1 | 93.1 | 66.5 |
As shown, the vision backbone is typically already well-tuned during large-scale pre-training to align visual embeddings with the language model. We hope this ablation provides a clearer overview of our approach.
Overall, the reviewers are positive about the paper. The proposed VistaDPO framework and the accompanying dataset represent significant contributions to improving LVM alignment and reducing hallucination through a novel hierarchical DPO approach. The empirical results are strong and convincingly demonstrate the method's effectiveness. While concerns about dataset size, computational cost, and generalizability were raised, the authors provided reasonable clarifications and additional experiments in their rebuttal.
Given the consensus on the novelty and significance of the contributions, the strong empirical validation, the value of the new dataset, and the largely positive reception (3 Accepts, 1 Weak Accept) bolstered by a constructive rebuttal, the recommendation is to Accept this paper.
The authors should ensure they incorporate the clarifications, additional ablation results (component tuning), and the promised discussion of related work (HIG) into the final version.