PaperHub
Score: 7.8 / 10
Poster · 4 reviewers (min 4, max 5, std 0.4)
Ratings: 5, 4, 5, 5
Confidence: 3.0
Novelty: 2.5 · Quality: 3.0 · Clarity: 2.8 · Significance: 3.0
NeurIPS 2025

Generative RLHF-V: Learning Principles from Multi-modal Human Preference

OpenReview · PDF
Submitted: 2025-05-09 · Updated: 2025-10-29
TL;DR

A novel alignment framework that integrates generative reward models with multi-modal RLHF.

Abstract

Keywords
Alignment, Safety, RLHF, Preference Learning, Multi-modal LLMs

Reviews and Discussion

Review
Rating: 5

This work introduces a framework for learning more generalizable reward models for preference learning by inferring the principles that drive the preferences. This is done in two stages: first, learn the principles from a preference dataset (chosen/rejected pairs given a prompt); then use the principles to refine scores by pairwise comparing candidate responses within a group. By comparing the responses within the group, a more robust estimate of the reward is learned. A number of models are tested using a number of benchmarks.

Strengths and Weaknesses

Strengths

  • The code and models are being made available.
  • Public datasets are used for training/evaluation.
  • Comprehensive evaluation against a number of benchmarks, and the evaluation demonstrates the efficacy of the approach.

Weaknesses

  • As noted, the use of RL means the approach is vulnerable to reward hacking.
  • From what I understand, the principles are not learned explicitly, which leads to a couple of questions (see below).

Questions

In the caption for Figure 4 it states that “Response B provides accurate and comprehensive information, thus receiving the highest score”. It was not clear to me where the model learns the explicit principles “accuracy” and “comprehensiveness”. Is it the case that this is implied only because of the chosen/rejected responses? If this is the case, might it be that these are not the actual preferences captured, but rather the model representation of the principles is some entanglement of accuracy and comprehensiveness (i.e., not interpretable to humans, but equivalently meaningful to the model)?

Lines 161-164: If the point about entangled representations above is correct, for the case of GRM+RL might it be that performance is harmed when providing principles as a hint because the provided principles do not align with the model interpretation of the learned principles? If it was due to redundancy because the principles are already learned/accounted for (as suggested) then might one just expect the model to ignore that redundant input?

Suggestions

Figure 4 — in the figure where the scores are shown (Left, Right, Average) replace with (Left, Right, Mean)

It seems the symbol “r” has dual meanings — rewards and reasoning traces. Symbols should not be reused to avoid confusion.

Remove the period after D before the equation above Line 103.

Line 107: “to evaluates multiple'” > “to evaluate multiple”

Line 111: Remove the repeated sentence that begins “Each generated response ..”

Line 119 (and subsequently throughout the paper): Capitalize references to subsections, i.e., Subsection 4.2.

In text, for numbers less than ten, write them out in full. E.g., Line 182 “4” > “four”, etc.

Limitations

Yes

Final Justification

This is a nice body of work and is a valuable contribution.

Formatting Concerns

No concerns.

Author Response

Dear reviewer vM9w:

Thank you for your high recognition and affirmation of our work! We are greatly encouraged. We will address each of your comments point by point, and hope that our responses will strengthen your confidence in our work.

Q1: As noted, the use of RL means the approach is vulnerable to reward hacking.

Response 1: Clarification of Reward Hacking

You raise a very insightful question. Reward hacking is indeed an unavoidable challenge in reinforcement learning approaches [1, 2]. It essentially occurs when the policy model discovers vulnerabilities in the reward model and exploits them to obtain high rewards, while fundamentally contradicting human intentions.

Our proposed generative RLHF-V aims to provide a more robust reward for reinforcement learning fine-tuning of MLLMs, thereby improving model performance. This has been thoroughly demonstrated in our paper's experiments and has also received your recognition.

Beyond performance improvements, the contribution of generative RLHF-V is not merely an algorithm, but rather an empirical study on the application of multimodal generative reward models. Therefore, we candidly acknowledge that as a reinforcement learning method, generative RLHF-V is also susceptible to reward hacking issues. We do not wish to conceal this fact and are willing to expose the reward hacking problems that may exist in extreme scenarios with our method.

We hope this will inspire subsequent research, whether in more thorough mechanistic analysis or more robust algorithm design.

For example, during the rebuttal period, we implemented four structured scoring methods: Spectral Ranking [3], and three methods from social choice techniques [4]: Borda Count, Instant Runoff, and Ranked Pairs. We tested them in two scenarios: reward model scoring and reinforcement learning fine-tuning.

We selected the batch ranking task from MLLM-as-a-Judge, which requires the reward model to rank multiple responses, using the minimum edit distance from human expert results as the metric (lower is better). The results are shown in Table 1, where the "Ours" column is the model trained with our proposed method and the "Init." column is the initial Qwen2.5-VL-Instruct-7B:

Table 1: Performance Comparison of Different Scoring Methods on MLLM-as-a-Judge Batch Ranking Task

Methods                Ours     Init.
Average Score (Ours)   0.4451   0.5143
Borda Count            0.4931   0.5523
Instant Runoff         0.4640   0.5071
Ranked Pairs           0.4352   0.4965
Spectral Ranking       0.5301   0.5656
  1. Social choice methods, represented by Ranked Pairs, can indeed improve the performance of grouped comparison methods, outperforming the Average Score method implemented in the paper.
  2. Models fine-tuned with our method show the most significant improvements on both Average Score and Ranked Pairs methods.
  3. While the Average Score method is not optimal, it still achieves relatively good performance.

Based on the above results, we find that social choice methods can indeed raise the upper limit of grouped comparison methods, which is encouraging; meanwhile, our adopted Average Score method also achieves relatively good performance and can serve as a strong baseline.
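
For concreteness, here is a minimal Python sketch (illustrative only, not our exact implementation) of two of these aggregations over a group of candidate responses; pair_scores is a hypothetical stand-in for the scores the GRM assigns in each ordered comparison.

import itertools

def average_score(pair_scores):
    # Average Score: mean of the scores a response received across all its comparisons.
    n = len(pair_scores)
    return [sum(pair_scores[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]

def borda_count(pair_scores):
    # Borda Count: one point for every pairwise comparison a response wins.
    n = len(pair_scores)
    wins = [0] * n
    for i, j in itertools.permutations(range(n), 2):
        if pair_scores[i][j] > pair_scores[j][i]:
            wins[i] += 1
    return wins

# Toy example with three candidate responses; pair_scores[i][j] is the score
# response i received when compared against response j.
pair_scores = [
    [None, 8.0, 7.5],
    [6.0, None, 9.0],
    [5.5, 4.0, None],
]
print(average_score(pair_scores))  # [7.75, 7.5, 4.75]
print(borda_count(pair_scores))    # [2, 1, 0]

Either vector can then be converted into a ranking; the differences in Table 1 come entirely from how the same pairwise information is aggregated.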

Finally, thank you again for your kind understanding and support.

[1]: Gao L, Schulman J, Hilton J. Scaling laws for reward model overoptimization[C]//International Conference on Machine Learning. PMLR, 2023: 10835-10866.

[2]: Ji J, Qiu T, Chen B, et al. AI alignment: A comprehensive survey[J]. arXiv preprint arXiv:2310.19852, 2023.

[3]: Vigna S. Spectral ranking[J]. Network Science, 2016, 4(4): 433-445.

[4]: Conitzer V, Freedman R, Heitzig J, et al. Social choice should guide AI alignment in dealing with diverse human feedback[J]. arXiv preprint arXiv:2404.10271, 2024.


Q2: It was not clear to me where the model learns the explicit principles “accuracy” and “comprehensiveness”. Is it the case that this is implied only because of the chosen/rejected responses? If this is the case, might it be that these are not the actual preferences captured, but rather the model representation of the principles is some entanglement of accuracy and comprehensiveness (i.e., not interpretable to humans, but equivalently meaningful to the model)? Lines 161-164: If the point about entangled representations above is correct, for the case of GRM+RL might it be that performance is harmed when providing principles as a hint because the provided principles do not align with the model interpretation of the learned principles? If it was due to redundancy because the principles are already learned/accounted for (as suggested) then might one just expect the model to ignore that redundant input?

Response 2: Clarification of Principles Learning

Thank you very much for your thoughtful and constructive feedback on our paper. We are grateful for your positive assessment and for the insightful questions you raised, which have helped us to clarify and strengthen the core contributions of our work.

We would like to address the two excellent points you made regarding how the model learns principles and why providing them as hints can be detrimental.

1. On the Implicit Learning of Principles (e.g., "Accuracy" and "Comprehensiveness")

You asked whether principles like "accuracy" and "comprehensiveness" are learned implicitly from the preference data, and suggested that the model’s internal representation might be an "entanglement" of these concepts rather than a discrete, human-interpretable one.

Your intuition is entirely correct, and we thank you for articulating it so clearly. This is precisely the mechanism at play. Our model learns these principles implicitly. The RL objective (lines 103-104) simply rewards the Generative Reward Model (GRM) for correctly predicting the human-preferred response. To succeed at this task across a diverse dataset, the model must learn to identify the underlying features that consistently lead to a positive preference.

The principles it generates in its analysis (as seen in Figures 3 and 11) are its attempt to translate these complex, learned features into human-readable language. The model does not have a discrete "accuracy" module. Instead, as you insightfully suggest, its internal representation is a high-dimensional entanglement of features that, when decoded into text, are best described by words like "accuracy," "completeness," or "step-by-step guidance." These are emergent concepts, not pre-programmed ones.

We will revise the manuscript to make this point more explicit for the reader. We agree that your framing of this as an "entangled representation" is a more precise and powerful description of the process.

2. On Performance Degradation When Providing Principles to GRM+RL

You raised a very sharp point regarding our findings in lines 161-164 and RQ6, where providing principles as hints harmed the performance of the GRM+RL model. You hypothesized that this might be because the provided generic principles do not align with the model's internally learned, specific principles, rather than being merely redundant.

This is an excellent point, and we believe your hypothesis is a more accurate explanation than our initial suggestion of simple redundancy. The GRM+RL model, through its training, develops a sophisticated policy for generating highly specific and context-dependent principles for any given pair of responses. When we provide a static, generic principle from the dataset's annotation guidelines, we are not just giving it redundant information; we are imposing a constraint that conflicts with its learned optimal policy. This distracting constraint forces the model away from its flexible, tailored analysis and toward a more rigid one, which ultimately degrades its performance on out-of-distribution tasks.

We will update the discussion in lines 161-164 and the analysis in RQ6 to reflect this more nuanced and accurate explanation.

If you feel that our response in this round has not fully addressed your concerns, we are very willing to engage in further communication with you. Please feel free to ask questions at any time.

Once again, we thank you for your time and for providing such valuable and thought-provoking feedback. Your comments have given us a clearer lens through which to view our own results and have helped us significantly improve the paper.

Suggestions:

Figure 4 — in the figure where the scores are shown (Left, Right, Average) replace with (Left, Right, Mean)

It seems the symbol “r” has dual meanings — rewards and reasoning traces. Symbols should not be reused to avoid confusion.

Remove the period after D before the equation above Line 103.

Line 107: “to evaluates multiple'” > “to evaluate multiple”

Line 111: Remove the repeated sentence that begins “Each generated response ..”

Line 119 (and subsequently throughout the paper): Capitalize references to subsections, i.e., Subsection 4.2.

In text, for numbers less than ten, write them out in full. E.g., Line 182 “4” > “four”, etc.

Response 3: Thank you for your detailed and helpful suggestions

It's difficult to put into words the delight we felt when we saw your detailed feedback. We have made corresponding modifications point by point in our paper and believe they are of tremendous help. We would like to express our heartfelt gratitude once again.

Comment

Thank you for the follow up. I will maintain my score and recommend accepting the paper. This is a nice body of work and is a valuable contribution.

Comment

We sincerely appreciate your positive feedback and detailed comments. We will incorporate the changes from the rebuttal period into the final version. If you have any other questions, please feel free to ask anytime!

Comment

Dear Reviewer vM9w,

Your support for us is reflected not only in the scores but also in a long series of detailed and insightful feedback. Words cannot adequately express our gratitude to you. As the discussion period draws to a close and our communication nears completion, we hope to summarize this pleasant discussion.

W1: As noted, the use of RL means the approach is vulnerable to reward hacking.

We clarified that the main contribution of our paper is an empirical study of GRMs for multi-modal RLHF. Therefore, while reward hacking behavior does indeed affect RL performance, it is also one of the findings of our empirical research.

W2: Clarification of Principles Learning

Your questions about the principles learning section reflect that there is still room for improvement in our writing. We have provided detailed feedback and commit to updating the corresponding explanations in the revision to further enhance readability.

W3: Presentations and Minor Issues

We are deeply moved to receive such detailed and thorough feedback from you, which makes our paper of higher quality.

We are greatly honored to see your recognition of our rebuttal content, maintaining your strong support for us, and even evaluating our work as a valuable contribution. Thank you once again for your recognition and support!

Review
Rating: 4

This work introduces Generative RLHF-V, a novel framework for aligning multi-modal large language models (MLLMs) with human intentions. Traditional score-based reward models in RLHF suffer from poor accuracy and generalization, while existing generative reward models struggle to produce learnable scalar rewards due to their pairwise design. Generative RLHF-V addresses these issues through a two-stage process: (1) training generative reward models using reinforcement learning to better capture human preferences, and (2) optimizing RL via grouped response comparisons to improve multi-modal scoring accuracy. The authors conduct extensive experiments to verify the effectiveness of their method.

Strengths and Weaknesses

Strengths:

  1. The writing is clear and easy to follow.

  2. The author studied a novel method for training generative reward models for multi-modal language models.

  3. The author conducts extensive experiments to verify the effectiveness of their proposed method.

Weaknesses:

  1. The authors propose using group comparisons based on model-generated responses to train reinforcement learning. Have the authors considered exploring alternative RL algorithms, such as GRPO, for training the reward models?

  2. It is unclear which specific algorithm the authors use to train the MLLM based on the proposed reward models. Do the authors employ PPO, GRPO, or another RL algorithm for this purpose?

  3. The authors mention the emergence of self-praise behavior in the experimental results, which appears similar to reasoning patterns induced by reinforcement learning. Is this observation specific to the Qwen-based model, or can it be generalized to other multi-modal large language models (MLLMs) as well?

Questions

Please refer to Strengths and Weaknesses part.

Limitations

yes

Final Justification

I agree with the opinions of most of the reviewers. The paper should be accepted.

Formatting Concerns

No

Author Response

Dear reviewer NwUp:

We are grateful for your recognition of the clarity of our writing, method innovation, and the breadth of our experiments. You must have carefully read our paper to propose such detailed and insightful feedback, and we will address each point. If you believe we have resolved your concerns, we sincerely and kindly hope that you will give us higher support. Also, if you have additional questions about our responses or other aspects you would like to discuss, we also welcome further communication with you.

W1: The authors propose using group comparisons based on model-generated responses to train reinforcement learning. Have the authors considered exploring alternative RL algorithms, such as GRPO, for training the reward models?


Response 1: Exploration of Alternative RL Algorithms

You are certainly experienced in RL research! In fact, during the early research stage, we tried both the GRPO and PPO algorithms for training the generative reward model. GRPO performed slightly better in reward modeling, and because it does not require training an additional critic model, it consumed fewer computational resources, so we ultimately chose GRPO. The experimental results are shown in Table 1 and Table 2.

Table 1: Performance Comparison of GRM+GRPO vs GRM+PPO on the In-Distribution Test

Method     GRM + GRPO   GRM + PPO
Accuracy   0.81         0.78

Table 2: Performance Comparison of GRM+GRPO vs GRM+PPO on Out-of-Distribution Tests

Datasets   Beavertails-V   LLaVA-Critic   MLLM-as-a-Judge
GRM-GRPO   0.79            0.78           0.68
GRM-PPO    0.75            0.72           0.64

We believe the reason for this might be that the reward for the preference learning task is relatively "simple", and the model only needs to provide a higher score for the human-preferred response to obtain the reward. Therefore, this task does not require a critic model to provide additional information, and averaging the trajectory reward values within the group can achieve good results.
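
For reference, here is a minimal sketch of the group-relative advantage that lets GRPO drop the critic (illustrative only; it omits the clipping and KL terms of the full GRPO objective and assumes the simple binary reward described above).

import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    # Normalize each rollout's reward against the mean/std of its own group,
    # so the sampled group serves as the baseline instead of a critic model.
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Example: binary rewards for four GRM rollouts on the same preference pair
# (1.0 = the rollout scored the human-preferred response higher).
print(grpo_advantages([1.0, 0.0, 1.0, 1.0]))  # roughly [0.58, -1.73, 0.58, 0.58]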


W2: It is unclear which specific algorithm the authors use to train the MLLM based on the proposed reward models. Do the authors employ PPO, GRPO, or another RL algorithm for this purpose?

Response 2: Clarification of RL Algorithm

Thank you for pointing this out! We use the GRPO algorithm to train the multi-modal generative reward model. We will add this detail to Section 4.1 (Experiment Setup); we believe it will make the paper clearer.


W3: The authors mention the emergence of self-praise behavior in the experimental results, which appears similar to reasoning patterns induced by reinforcement learning. Is this observation specific to the Qwen-based model, or can it be generalized to other multi-modal large language models (MLLMs) as well?

Response 3: Exploration of "Self-Praise" Behavior

Thank you for your attention to our exploration of the "self-praise" behavior. Extending this conclusion to other models will undoubtedly further strengthen our contribution. During the rebuttal period, we conducted additional experiments on LLaVA-1.5-7B, with results as follows:

We found that LLaVA-1.5-7B exhibits a different reward hacking behavior from the Qwen-VL series models under over-training conditions. Compared to the behavior mentioned in our main text where models add extra text to praise their own responses, LLaVA tends to use overly complimentary vocabulary to describe the image itself. Here is an example:

...[Complete Response]...The composite image of all the elements of the branded outfit, including the custom-made Revo-Knit pink dress, the matching headband and necklace, the Wilson head and racket, the Delta Duck logo on the left side, the W logo on the right side, the multiple accessories, the white wristband,socks and bracelets, all contribute to the overall powerful and memorable delivery of the brand's message, making it an emphatic signature for both the athlete and the endorsed brand. It effectively conveys the essence of the Wilson brand to the audience, creating a lasting impression of the brand's values, commitment, and the quality of their products, ultimately serving as a powerful promotional tool.

We found that this behavior cannot hack higher performance in MLLM-as-a-Judge evaluations, and the actual performance of over-trained models is not good, as shown in Table 3.

Table 3: Performance Comparison of Over-trained Models on LLaVA-Wild and MIA-Bench

Model                LLaVA-Wild   MIA-Bench
Initial Model        72.71        61.15
Over-trained Model   69.98        57.85

We believe this may be because the capabilities of the LLaVA-1.5-7B model are limited and not sufficient to explore reward hacking behaviors similar to those observed in the Qwen-VL series. We will add discussions on reward hacking behaviors in models with different architectures to the future work section, and acknowledge in the paper that this addition is thanks to the kind reminder from reviewer NwUp. We sincerely appreciate your insightful suggestions once again.

Comment

Thanks for your response! I think the author has resolved my problem. I will maintain my score for weak accept.

Comment

Thank you very much for your positive response and recognition!

Comment

Dear Reviewer NwUp,

Thank you once again for your continued support and detailed suggestions! As the rebuttal period draws to a close, it appears our communication with you is also nearing completion. Please allow us to provide a final summary to conclude this pleasant exchange.

We thank you again for your recognition of our writing, experiments, and methodological novelty, and we are even more grateful for your suggestions regarding supplementary details of our methodology.

W1: Exploration of Alternative RL Algorithms

We conducted an empirical comparative analysis of RL algorithms (PPO, GRPO) used for GRM training, and explained our rationale for choosing GRPO as the RL algorithm.

W2: Clarification of RL Algorithm

We sincerely acknowledge the details we omitted in the paper and commit to making corresponding changes in the revision.

W3: Exploration of "Self-Praise" Behavior

We are delighted that you are interested in "Self-Praise" behavior and other reward hacking behaviors. During the rebuttal period, we explored new reward hacking behaviors on the LLaVA-1.5-7B model.

Your response informing us that we have clarified your concerns and that you are willing to maintain your support fills us with excitement. Thank you once again for your support and feedback!

Review
Rating: 5

This paper introduces Generative RLHF-V, a framework for aligning multimodal large language models (MLLMs) with human preferences by combining generative reward modeling (GRM) and reinforcement learning from human feedback (RLHF) in a multimodal context. Empirical results demonstrate that Generative RLHF-V significantly improves performance across multiple benchmarks and multiple model sizes, outperforming traditional score-only RMs and supervised GRMs. The authors further analyze generalization capabilities, scaling properties, and potential vulnerabilities such as reward hacking.

Strengths and Weaknesses

Strengths

The paper proposes a novel framework that combines generative reward models (GRMs) with multi-modal reinforcement learning from human feedback (RLHF), addressing the limitations of score-only reward models.

The experiments are comprehensive, covering 7 benchmarks and 4 MLLM variants. The reported performance significantly outperforms the selected baselines, showing the effectiveness of the proposed method.

The authors include a careful analysis of reward hacking in over-optimized settings, which is a critical concern in RL-based alignment. These insights could serve as a valuable reference for future research.

Weaknesses

The main experiments are conducted only on the Qwen-VL family, so it remains uncertain how well the pipeline transfers to models with different architectures or modalities. Including results on several diverse model series would give a clearer picture of the method’s robustness and broader applicability.

The core ideas are closely related to prior work in the text-only NLP literature, and thus the novelty of the contribution is somewhat limited (though I acknowledge that the paper thoughtfully extends generative reward modeling and grouped comparison to the multi-modal setting and provides useful insights).

Questions

See above

Limitations

Yes

Final Justification

The authors have added additional experiments on LLaVA and more discussions on the novelty of the paper, which seems reasonable to me.

Formatting Concerns

None

Author Response

Dear reviewer ecEr:

We appreciate your recognition of the effectiveness and novelty of our method. In response to your concerns about robustness and applicability, we have supplemented our results with experiments on LLaVA-1.5-7B and provided a detailed analysis. Regarding the novelty of our contributions, we have also provided a comprehensive discussion.

We hope these responses address your concerns. And we sincerely and kindly hope that you can provide greater support. If you have additional questions about our responses or other aspects you would like to discuss, we also welcome further communication with you.

W1: The main experiments are conducted only on the Qwen-VL family, so it remains uncertain how well the pipeline transfers to models with different architectures or modalities. Including results on several diverse model series would give a clearer picture of the method's robustness and broader applicability.

Response 1: Additional Experiments on LLaVA-1.5-7B

You are certainly a rigorous and responsible reviewer! Our experiments cover two versions of Qwen-VL (2.0 and 2.5), two modes (pretrained and instruction-following models), and three sizes (2B, 3B, and 7B).

Nevertheless, we sincerely agree with your suggestion and have supplemented experiments on LLaVA-1.5-7B during the rebuttal period, reproducing our key experimental RQs from our paper, as detailed below:

RQ1 in paper: Does the GRM+RL facilitate more generalizable principle learning from preferences?

Table 1: Principle Learning Performance Comparison on LLaVA-1.5-7B Across Different Training Methods

Datasets        Beavertails-V   LLaVA-Critic   MLLM-as-a-Judge
Score-Only RM   0.52            0.55           0.51
GRM             0.57            0.59           0.56
GRM (P)         0.58            0.61           0.57
GRM + SFT       0.54            0.52           0.54
GRM + SFT (P)   0.55            0.52           0.55
GRM + RL        0.69            0.72           0.64
GRM + RL (P)    0.64            0.66           0.58

The results show that on LLaVA-1.5-7B, GRM + RL still demonstrates the highest scoring accuracy and optimal principle learning capability across three out-of-distribution datasets, consistent with the experimental conclusions in the paper.

It is worth noting that despite LLaVA-1.5-7B's limited model capacity, our method still achieved significant performance improvements. This further expands the applicability scope of our approach.

RQ3 in paper: Are GRM+RL and grouped comparison competitive methods for multi-modal RLHF?

Table 2: Multi-Modal RLHF Performance Comparison on LLaVA-1.5-7B Using Different Training Methods and Reward Models

Model              Feedback   MIA-Bench   LLaVA-Wild   LLaVA-Wilder   MM-Vet   MM-Vet-V2
LLaVA-1.5-7B       N/A        61.15       72.71        45.01          37.15    35.78
+ DPO              RM         64.30       70.08        47.58          38.52    36.10
+ PPO              RM         66.45       73.92        49.23          40.18    38.45
+ GRPO             RM         68.12       76.34        52.67          42.73    41.28
+ GRPO             GRM        69.85       78.56        55.42          44.91    43.67
+ GRPO             GRM+SFT    70.23       77.34        54.89          42.52    43.91
+ GRLHF-V (Ours)   GRM+RL     72.59       81.98        59.65          47.86    47.73

The results show that on five mainstream benchmarks, the method combining GRM + RL with grouped comparison achieves optimal results, consistent with the experimental conclusions in the paper.

RQ5 in paper: What are the reward hacking behaviors of an over-trained Generative RLHF-V model?

We found that LLaVA-1.5-7B exhibits a different reward hacking behavior from the Qwen-VL series models under over-training conditions. Compared to the behavior mentioned in our main text where models add extra text to praise their own responses, LLaVA tends to use overly complimentary vocabulary to describe the image itself. Here is an example:

...[Complete Response]...The composite image of all the elements of the branded outfit, including the custom-made Revo-Knit pink dress, the matching headband and necklace, the Wilson head and racket, the Delta Duck logo on the left side, the W logo on the right side, the multiple accessories, the white wristband,socks and bracelets, all contribute to the overall powerful and memorable delivery of the brand's message, making it an emphatic signature for both the athlete and the endorsed brand. It effectively conveys the essence of the Wilson brand to the audience, creating a lasting impression of the brand's values, commitment, and the quality of their products, ultimately serving as a powerful promotional tool.

We found that this behavior cannot hack higher performance in MLLM-as-a-Judge evaluations, and the actual performance of over-trained models is not good, as shown in Table 3.

Table 3: Performance Comparison of Over-trained Models on LLaVA-Wild and MIA-Bench

Model                LLaVA-Wild   MIA-Bench
Initial Model        72.71        61.15
Over-trained Model   69.98        57.85

We believe this may be because the capabilities of the LLaVA-1.5-7B model are limited and not sufficient to explore reward hacking behaviors similar to those observed in the Qwen-VL series. We will add discussions on reward hacking behaviors in models with different architectures to the future work section, and acknowledge in the paper that this addition is thanks to the kind reminder from reviewer ecEr. We sincerely appreciate your insightful suggestions once again.


W2: The core ideas are closely related to prior work in the text-only NLP literature, and thus the novelty of the contribution is somewhat limited (though I acknowledge that the paper thoughtfully extends generative reward modeling and grouped comparison to the multi-modal setting and provides useful insights).

Response 2: Clarification of Novelty

Thank you for your recognition of our work and for your questions about its novelty. As you pointed out, Generative RLHF-V builds on an existing method from the text-only NLP literature: self-principled critique tuning (SPCT). However, we would like to further clarify that this work does not simply extend SPCT from the text modality to the multi-modal setting; rather, it builds a complete pipeline around the multi-modal human principle learning problem, including reward modeling and grouped comparison (as you mentioned in the strengths, this is also a thoughtful methodology that provides useful insights). Within this pipeline, we have also conducted a more in-depth and distinctive analysis of Generative RLHF-V's ability to learn multi-modal principles.

  1. We found that compared to SPCT's approach of providing reference principles, directly letting the model actively infer the principles behind human preferences can achieve stronger and more accurate principle learning effects. We demonstrate this conclusion in Figure 5 and Table 3.

  2. We found that in multimodal scenarios, integrating grouped comparison and GRM+RL nearly linearly enhances multi-modal RLHF performance across various settings of candidate responses. We demonstrate this conclusion in Figure 7.

  3. We found that over-trained GRM+RL models exhibit self-praising reward hacking behavior on Qwen-VL series models, while during the rebuttal period we discovered that LLaVA-1.5-7B exhibits reward hacking behavior of overly praising the image itself. As shown in Figure 8, these analyses provide unique insights into the use of generative reward modeling in multimodal scenarios.

In summary, we believe that the contribution of this work goes far beyond algorithmic improvements, but rather represents the first systematic empirical study of the entire pipeline from reward modeling, grouped comparison, to various ablation analyses on the multimodal principles learning problem. We again thank the reviewer for their tolerant attitude and positive rating of our work. Your attention has prompted us to more clearly summarize and express the contributions of our work, and we hope the above supplementary explanations can effectively address your concerns.

Comment

I appreciate the authors' efforts in providing more experiments on LLaVA and more discussion on the novelty of the paper. I will revise my score, conditional on the authors adding these experimental results and discussion to the final version of the paper.

Comment

Thank you for such a timely and positive response. It is a great honor that our efforts have addressed your concerns. We promise to incorporate these experiments and discussions into the final version. If you have any other questions of our paper, please feel free to communicate with us at any time!

Comment

Dear Reviewer ecEr,

Thank you once again for your strong support and insightful feedback! As the rebuttal deadline approaches, it appears our communication is nearly complete. We hope to provide a final summary to conclude this pleasant discussion.

We greatly appreciate your recognition of our method's effectiveness and the comprehensiveness of our experiments. In response to your concerns about applicability and novelty, we have supplemented experiments on LLaVA-1.5-7B during the rebuttal period and provided detailed clarifications regarding the novelty of our work.

W1 (Generalizability of Model Architecture): We have supplemented key experiments on LLaVA-1.5-7B. The results demonstrate that our GRM+RL approach still exhibits the highest scoring accuracy and optimal principle learning capability on three out-of-distribution datasets, consistent with the experimental conclusions in the paper.

W2 (Novelty Limitations): We clarify that this work is not simply extending SPCT from text modality to multimodal, but rather constructs a complete pipeline around the multimodal human principle learning problem. Our contributions include:

  1. Discovering that having models actively infer principles behind human preferences achieves stronger learning effects than providing reference principles
  2. In multimodal scenarios, the integration of grouped comparison and GRM+RL approximately linearly improves multimodal RLHF performance
  3. Finding that over-trained models exhibit unique reward hacking behavior

We are pleased to see that you have expressed recognition of our rebuttal and correspondingly updated the score. We commit to updating the corresponding changes in the revision, confident that this will further enhance the quality of the paper. Thank you once again for your thorough review and the valuable energy and time you have invested.

Review
Rating: 5

The authors address the problem of aligning a multimodal LLM (MLLM) to human preferences using RLHF. Their proposed method has two major stages. First, they propose using multi-modal generative reward modeling instead of traditional reward modeling, and an important part of this is automatically learning human principles that appear to be aligned with the expressed preferences. Next, they propose a fine-tuning scheme where, instead of one pair of candidate responses at a time, the authors generate multiple candidate responses and use the pairwise comparisons between each pair to compute average scores that provide better rewards for the RL method. Finally, they present extensive experimental results to show the efficacy of their methods in improving the performance of small-scale (2B-7B) open-source LLMs.

优缺点分析

Strengths

  • The paper clearly shows how combining generative reward models (GRMs) with pairwise preferences improves the performance of the MLLM on benchmark tasks. The experimental results show that fine-tuning with a GRM in an RL methodology and grouped comparison-based score computation each individually increase performance on benchmark tasks, but together the two methods yield the biggest improvement.

  • The extensive experimentation, and the presentation of specific research questions that can be answered based on the experiments, were great.

Weaknesses

  • Some parts of the paper were hard to follow. A key contribution, grouped comparison, is only explained halfway through. A short description like "we generate multiple candidates at the same time, compare each pair of candidates to each other, and use all available comparisons to calculate average scores for all the candidates" could've been mentioned earlier on, in the introduction. Some motivational statements in the introduction seemed out-of-place to me. For example, claiming "Pair-wise comparison is blocking multi-modal principles from generalizing to learnable rewards" seemed imprecise, as the authors still use pairwise comparisons, just multiple ones simultaneously, rather than list-based or rank-based methods which would be a more stark contrast to pairwise comparisons. Another example of a writing issue: in RQ6 of Section 4, the first sentence ("SPCT mentions that providing principles as a reference within the user prompt.") is not a proper sentence. I suggest revising the paper to proofread other possible errors that can lead to lack of clarity.

  • For the grouped comparison part, I mentioned how the authors still use pairwise comparisons. There are more structured approaches to predicting scores from a pairwise comparison matrix, such as spectral ranking (which works well with cardinal scores) (ref: https://arxiv.org/pdf/0912.0238). It would be interesting to see if calculating rewards in that manner shows any visible improvement. Another possible avenue of work to explore would be the use of social choice techniques in AI alignment (which suggest using more than pairwise preferences). See here: https://arxiv.org/pdf/2404.10271

  • The authors mention how there is an almost linear increase in some benchmark scores as the number of candidate responses in the grouped comparison method grows. The authors don't really discuss the computational tradeoffs here, or how long the linear growth is sustained.

Questions

  1. Have you considered alternative approaches to predicting scores from pairwise comparison matrices, such as spectral ranking methods?

  2. The authors demonstrate a near-linear increase in benchmark scores as the number of candidate responses increases in grouped comparison. What are the computational tradeoffs involved? At what point does this linear growth taper off?

Limitations

Despite grouped comparison being a pivotal part of the paper, the authors' exploration of utilizing the grouped data in a structured way was limited.

Final Justification

The authors have added more experimental results that introduce more structured ways of going beyond pairwise preferences. Conditional on their adding the extra experimental results to the revised paper, I'm happy to update the score and give a stronger recommendation for acceptance.

Formatting Concerns

N/A

Author Response

Dear reviewer mrSo:

Thank you for your invaluable and insightful feedback. Your thoughtful comments and constructive suggestions have been instrumental in helping us enhance the clarity and quality of our paper. We sincerely appreciate the time and effort you dedicated to reviewing our work. We have carefully considered all your points and have revised the manuscript accordingly.

We have provided a detailed point-by-point response to your questions. If you feel we have addressed your concerns, we sincerely and kindly hope you will give us higher support.


W1: Some parts of the paper were hard to follow. A key contribution, grouped comparison, is only explained halfway through. A short description like "we generate multiple candidates at the same time, compare each pair of candidates to each other, and use all available comparisons to calculate average scores for all the candidates" could've been mentioned earlier on, in the introduction. Some motivational statements in the introduction seemed out-of-place to me. For example, claiming "Pair-wise comparison is blocking multi-modal principles from generalizing to learnable rewards" seemed imprecise, as the authors still use pairwise comparisons, just multiple ones simultaneously, rather than list-based or rank-based methods which would be a more stark contrast to pairwise comparisons. Another example of a writing issue: in RQ6 of Section 4, the first sentence ("SPCT mentions that providing principles as a reference within the user prompt.") is not a proper sentence. I suggest revising the paper to proofread other possible errors that can lead to lack of clarity.

Response 1: Clarification of "Grouped Comparison"

We completely agree with your excellent suggestion that a clearer, earlier explanation of "grouped comparison" would significantly improve the reader's understanding of our core contributions. We have revised the introduction to include a concise description of this mechanism, much like the one you proposed. This ensures that the concept is clearly defined from the outset, providing a better foundation for the sections that follow.

Also, we thank you for pointing out the imprecision in some of our motivational statements. Upon reflection, we concur that the claim "Pair-wise comparison is blocking multi-modal principles from generalizing to learnable rewards" could be misconstrued. Our method indeed leverages and extends pairwise comparisons in a novel way rather than abandoning them. We have refined this statement in the introduction to more accurately articulate that our contribution overcomes the limitations of traditional single pairwise comparisons by aggregating information from multiple simultaneous comparisons within a group.


W2 & Q1: For the grouped comparison part, I mentioned how the authors still use pairwise comparisons. There are more structured approaches to predicting scores from a pairwise comparison matrix as such as spectral ranking (which works well with cardinal scores). (ref: https://arxiv.org/pdf/0912.0238). It would be interesting to see if calculating rewards in that manner shows any visible improvement. Another possible venue of work to explore would be the use of social choice technique in AI alignment (which suggests using more than pairwise preferences). See here: https://arxiv.org/pdf/2404.10271

Response 2: Additional Experiments For Structured Prediction

We are delighted to see your insightful comments. This line of research will undoubtedly become a very valuable direction for improving Generative RLHF-V in the future, and it also has reference value for other fields.

During the rebuttal period, we implemented the four scoring methods you mentioned: Spectral Ranking, and three methods from social choice techniques: Borda Count, Instant Runoff, and Ranked Pairs. We tested them in two scenarios: reward model scoring and reinforcement learning fine-tuning.

  • Experiment (1/2): Reward Model Scoring

We selected the batch ranking task from MLLM-as-a-Judge, which requires the reward model to rank multiple responses, using the minimum edit distance from human expert results as the metric (lower is better). The results are shown in Table 1, where the "Ours" column is the model trained with our proposed method and the "Init." column is the initial Qwen2.5-VL-Instruct-7B:

Table 1: Performance Comparison of Different Scoring Methods on MLLM-as-a-Judge Batch Ranking Task

Methods                Ours     Init.
Average Score (Ours)   0.4451   0.5143
Borda Count            0.4931   0.5523
Instant Runoff         0.4640   0.5071
Ranked Pairs           0.4352   0.4965
Spectral Ranking       0.5301   0.5656
  1. Social choice methods, represented by Ranked Pairs, can indeed improve the performance of grouped comparison methods, outperforming the Average Score method implemented in the paper.
  2. Models fine-tuned with our method show the most significant improvements on both Average Score and Ranked Pairs methods.
  3. While the Average Score method is not optimal, it still achieves relatively good performance.

Based on the above results, we find that social choice methods can indeed raise the upper limit of grouped comparison methods, which is encouraging; meanwhile, our adopted Average Score method also achieves relatively good performance and can serve as a strong baseline.

  • Experiment (2/2): Reinforcement Learning Fine-tuning

Table 2: Multi-Modal Benchmark Performance of Different Scoring Methods in Reinforcement Learning Fine-tuning

Model         Score Method       MIA-Bench   LLaVA-Wild   LLaVA-Wilder   MM-Vet   MM-Vet-v2
Qwen2-VL-7B   N/A                52.58       81.30        47.18          32.12    27.15
+ GRLHF       Score Average      62.31       103.55       62.84          41.25    45.16
+ GRLHF       Borda Count        55.76       83.06        55.23          36.87    35.08
+ GRLHF       Instant Runoff     57.89       84.25        59.12          35.12    36.53
+ GRLHF       Ranked Pairs       59.94       89.76        57.52          39.56    39.75
+ GRLHF       Spectral Ranking   57.63       87.76        56.89          40.08    38.01

We find that in the context of reinforcement learning fine-tuning, the improvements from the above social choice methods are not significant. We believe this is because the Score Average method can provide more fine-grained scoring for various answers, while methods like Ranked Pairs can only provide coarse-grained rankings, making it difficult to provide sufficiently fine-grained supervisory signals for reinforcement learning.

Still, we believe that social choice methods have the potential to enhance the performance of grouped comparison methods, and we hope our empirical study can inspire more research in this direction and in other applications. We will include the above experiments and analysis in a new appendix section on potential research directions, noting that this was inspired by reviewer mrSo's suggestion. We expect researchers in fields such as social choice theory to further explore the applications of these methods in alignment scenarios.


W3 & Q2: The authors mention how there's almost a linear increase in some benchmark scores as number of candidate responses in the grouped comparison method. The authors don't really mention the computational tradeoffs here. Or how long the linear growth sustains.

Response 3: Clarification of Computational Tradeoffs

Thank you for your thorough and rigorous feedback. We present the computational overhead of the grouped comparison method. The data in Table 3 represents the computation time for a single step, including rollout, scoring, and parameter updates, with batch size and other hyperparameters consistent with those in the appendix. The experiments were conducted on 16×H800 GPUs.

Table 3: Computational Overhead of Grouped Comparison Method with Different Numbers of Candidate Responses

Number of Candidate Responses   Time (ms)
3                               409
5                               675
7                               1176

We sincerely acknowledge that as the number of candidate responses increases, the computational overhead of the grouped comparison method also increases significantly. We commit to updating these results in the limitations section of our paper.

We are willing to engage in further discussion on this issue. Specifically, we believe that the primary source of computational overhead is the $A_n^2$ comparisons performed on $n$ candidate responses, which leads to a significant increase in computation time as $n$ grows. However, it may not be necessary to perform pairwise comparisons on all responses to achieve results comparable to, or even better than, those obtained through $A_n^2$ comparisons. We believe that methods from social choice theory and other fields may offer valuable insights to help us reduce the number of comparisons and the computational overhead. We will also discuss this in the appendix section on potential research directions, noting that this was inspired by reviewer mrSo's suggestion. Thank you again for your insightful comments.
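
To make the scaling concrete, the sketch below counts the GRM calls per group; the subsampling variant is purely a hypothetical illustration of the reduction idea and is not something we have evaluated.

import itertools
import random

def full_comparisons(n):
    # Full grouped comparison: every ordered pair of the n candidates,
    # i.e. A_n^2 = n * (n - 1) GRM calls per group.
    return list(itertools.permutations(range(n), 2))

def sampled_comparisons(n, k, seed=0):
    # Hypothetical reduction: score only k randomly chosen ordered pairs.
    random.seed(seed)
    return random.sample(full_comparisons(n), k)

for n in (3, 5, 7):
    print(n, len(full_comparisons(n)))   # 3 -> 6, 5 -> 20, 7 -> 42
print(len(sampled_comparisons(7, 12)))   # 12 instead of 42 comparisons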

评论

It's nice to see that the other methods for using rank data are showing some promise. For experiment 1, can you explain in more detail what the two columns are indicating? Is this equivalent to Table 1 in the original manuscript? For example, what is the exact difference between Init + Borda and Ours + Borda (or any of the other combinations) in an MLLM-as-a-judge paradigm? In the same vein, and I apologize for not pointing this out in the original review, for completeness' sake it might be better to briefly describe the MLLM-as-a-judge paradigm in the full paper before describing the results as well.

Following up on this, will this result be presented in the updated paper? In that case, the authors should mention some plans on which results will change.

Comment

Dear reviewer mrSo,

Thank you for your timely and constructive feedback! We are very pleased that our previous rebuttal, despite its limited character count, addressed some of your concerns. We are very willing to provide more detailed responses to the new questions you have raised.

It's nice to see that the other methods for using rank data are showing some promise.

For experiment 1, can you explain in more detail what the two columns are indicating? Is this equivalent to Table 1 in the original manuscript?

We will add an introduction to all three tasks of the MLLM-as-a-Judge benchmark in Appendix Section 6 (Experiment Details) of the paper, with the following content:

The MLLM-as-a-Judge benchmark introduces three distinct tasks to evaluate the judgment capabilities of MLLMs.

  • Scoring Evaluation requires the MLLM to rate a single response on a scale of 1 to 5 based on criteria like relevance, accuracy, and creativity.
  • Pair Comparison involves the model comparing two responses to the same prompt and determining which is superior, with the option of declaring a tie.
  • Batch Ranking tasks the MLLM with arranging multiple responses in descending order of quality based on a given instruction, without the possibility of a tie.

Our experiments on MLLM-as-a-Judge in the main paper primarily include its Scoring Evaluation (Table 1 and Figure 6 in the main paper) and Pair Comparison (Figure 5, component (c)) components.

To explore the potential of structured prediction methods, we supplemented experiments with the Batch Ranking evaluation set during the rebuttal period, as it naturally requires ranking multiple responses.

For example, what is the exact difference between Init + Borda and Ours + Borda (or any of the other combination) in an MLLM-as-a-judge paradigm?

We sincerely apologize for not clearly explaining the details of these method abbreviations in the first round. We will make the following clarification:

  • Init represents the starting point for training the generative reward model, which is Qwen-2.5-VL-7B-Instruct.
  • Ours represents the generative reward model trained using the GRPO algorithm starting from Qwen-2.5-VL-7B-Instruct.

An example of the evaluation dataset is:

{
    "id": "1",
    "image": "...",
    "question": "...",
    "responses": [
        {"model_name": "gpt4o", "response": "..."},
        {"model_name": "gemini-2.5-pro", "response": "..."},
        {"model_name": "llava", "response": "..."},
        {"model_name": "cogvlm", "response": "..."},
    ],
    "ground_truth": "ABCD"
}

Here, id is the unique identifier for the evaluation item, responses are the model-generated answers, and ground_truth is the human-annotated ranking. In this case, it indicates that gpt4o's response (A) is the best and cogvlm's response (D) is the worst.

In the specific implementation, we split each item with the same id containing $n$ responses into $A_n^2$ response pairs, where the data format is:

{
    "id": "1",
    "image": "...",
    "question": "...",
    "responses": [
        {"model_name": "gpt4o", "response": "..."},
        {"model_name": "gemini-2.5-pro", "response": "..."},
    ]
},
{
    "id": "1",
    "image": "...",
    "question": "...",
    "responses": [
        {"model_name": "gpt4o", "response": "..."},
        {"model_name": "llava", "response": "..."},
    ]
},

Subsequently, we have the generative reward model, either Init or Ours, perform a pairwise comparison on each pair. We then aggregate the comparison results with different ranking methods, namely {Average, Borda, Instant Runoff, Ranked Pairs, Spectral Ranking}, producing the normalized minimal edit distances reported under the corresponding method names in Table 1 (our Round 1 rebuttal).
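
For illustration, here is a minimal sketch of this evaluation flow (not our exact code): grm_score is a hypothetical stand-in for the reward model call, the aggregation shown is the simple sum-of-scores variant, and normalizing the edit distance by the longer string length is an assumption.

import itertools

def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def rank_item(item, grm_score):
    # Score every ordered pair of responses, sum the scores each response
    # receives, and return the predicted ranking as a letter string.
    responses = item["responses"]
    totals = [0.0] * len(responses)
    for i, j in itertools.permutations(range(len(responses)), 2):
        totals[i] += grm_score(item, responses[i], responses[j])
    order = sorted(range(len(responses)), key=lambda k: totals[k], reverse=True)
    return "".join(chr(ord("A") + k) for k in order)

def normalized_edit_distance(pred: str, gold: str) -> float:
    # Assumed normalization: divide by the length of the longer string.
    return levenshtein(pred, gold) / max(len(pred), len(gold))

# Toy item in a reduced form of the format shown above, with a toy scorer
# standing in for the GRM.
item = {"responses": [{"response": r} for r in ("a", "b", "c", "d")], "ground_truth": "ABCD"}
toy_scores = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.1}
pred = rank_item(item, grm_score=lambda it, x, y: toy_scores[x["response"]])
print(pred, normalized_edit_distance(pred, item["ground_truth"]))  # BCAD 0.5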

In the same vein, and I apologize for not pointing this out in the original review, for completeness' sake it might be better to briefly describe the MLLM-as-a-judge paradigm in the full paper before describing the results as well.

Excellent suggestion! We indeed overlooked the introduction to the MLLM-as-a-Judge benchmark. As mentioned above, we will add the corresponding introduction in Appendix Section 6 Experiment Details of the paper.

Following up on this, will this result be presented in the updated paper? In that case, the authors should mention some plans on which results will change.

We are very willing to share our revision plans with you. We will add a new section called Potential Directions in the appendix. We will discuss the potential of using the aforementioned social choice methods in Batch Ranking tasks and Reinforcement Learning Fine-tuning tasks, without making changes to the existing results in the main paper.

Finally, we greatly appreciate your quick feedback and suggestions. We hope our response in this round can address your concerns, and we look forward to further communication with you.

Comment

Thank you for the responses. I wasn't clear before that the newer results were for the batch ranking task only. On that point, the newer techniques (social choice methods, and spectral ranking) should be possible to test even for the Pair Comparison task with an MLLM-as-a-judge framework, right?

Finally, it does seem like you plan to put a lot of the work you did during the rebuttal phase as future work, and mention this in the appendix. I think the full experiments with all the different angles of considering ranked info beyond pairwise comparisons will only make this paper stronger. Based on your replies and the other reviews, I will maintain my score of weak accept, as I do think this paper is good in terms of direction and can be accepted in its current state. But in case it's not accepted, the additions will make a resubmission much stronger.

Comment

Thank you sincerely for your support and endorsement of our work! To be honest, when we received your suggestion regarding social choice methods, we were thrilled and excited, as it opened our eyes to more possibilities for generative RLHF-V.

Please allow us to sincerely and respectfully restate our understanding of social choice methods to avoid any potential misunderstandings.

In our view, social choice methods can serve as an important extension of grouped comparison, capable of reconstructing rankings of all responses from comparisons between multiple pairs, thereby mitigating issues such as preference cycles and other internal contradictions. Since RLHF samples multiple responses and uses reward models to provide corresponding scores, these responses can naturally be combined into multiple pairs in the generative RLHF-V pipeline, allowing for further optimization using social choice methods.

We implemented a simple version of this idea in RLHF during the rebuttal period. As shown in Table 2 of our Round 1 response, we used various social choice methods to reproduce the core RLHF experiments from the paper. Unfortunately, we found that these simple implementations of social choice methods did not demonstrate significant performance improvements. As we mentioned in Round 1, this may be because the discrete scores obtained from rankings cannot provide sufficiently fine-grained supervision signals. For example, given four responses A, B, C, D, through social choice methods, we can only obtain the ranking A>B>C>D and assign them scores of 4, 3, 2, 1 respectively.

Therefore, we downgraded the task and found that social choice methods can achieve better results than direct averaging in the Batch Ranking task of MLLM-as-a-Judge for ranking multiple responses, as demonstrated in Table 1 of our Round 1 response.

On that point, the newer techniques (social choice methods, and spectral ranking) should be possible to test even for the Pair Comparison task with an MLLM-as-a-judge framework, right?

We appreciate your insightful suggestion; a multi-angle analysis of social choice methods would certainly enhance the analytical depth of the paper. In this round of response, we attempted to apply social choice methods to Pair Comparison tasks. Specifically, we collected the $n$ responses with the same id from the MLLM-as-a-Judge evaluation set, performed the $A_n^2$ pairwise comparisons with GRMs (Ours or Init), applied social choice methods to the collected results, and used the relative order in the final ranking as the final pair comparison result. The results are as follows:

Table 1: Performance Comparison of Different Scoring Methods on MLLM-as-a-Judge Pair Comparison Task

Methods                Ours     Init.
Average Score (Ours)   0.6812   0.6334
Borda Count            0.6952   0.5973
Instant Runoff         0.6877   0.5868
Ranked Pairs           0.7233   0.6159
Spectral Ranking       0.6652   0.5578

We found that social choice methods can bring clear improvements on the Pair Comparison task: for our RL-trained GRM, Ranked Pairs raises the score from 0.6812 to 0.7233. Moreover, the gains appear on the GRM that has undergone RL training rather than on the initial model, which further suggests broad prospects for combining the two approaches.
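For concreteness, the sketch below illustrates the spectral-ranking variant in the spirit of Rank Centrality (a simplified illustration; all names are placeholders and the exact implementation details may differ). Responses are ranked by the stationary distribution of a random walk over the pairwise win matrix, and the final pair verdict is then read off from the relative positions of the two original responses in this ranking:

```python
import numpy as np

def spectral_rank(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """Rank responses from a pairwise win matrix, Rank Centrality style.

    wins[i, j] counts how often response i was preferred over response j
    by the GRM. Returns indices sorted from best to worst.
    """
    n = wins.shape[0]
    total = wins + wins.T
    with np.errstate(divide="ignore", invalid="ignore"):
        frac = np.where(total > 0, wins / total, 0.0)  # frac[i, j] = P(i beats j)
    # Random-walk transition from i to j proportional to how often j beat i,
    # so stationary mass accumulates on frequently winning responses
    # (assumes the comparison graph is connected).
    P = frac.T / n
    np.fill_diagonal(P, 0.0)
    P += np.diag(1.0 - P.sum(axis=1))  # lazy walk: remaining mass stays put
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):             # power iteration toward the stationary distribution
        pi = pi @ P
    return np.argsort(-pi)
```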

We apologize for not fully understanding your expectations earlier regarding where social choice methods should appear in the paper. As you suggested, we believe that including them in the main paper would substantially improve its quality.

I think the full experiments with all the different angles of considering ranked info beyond pairwise comparisons will only make this paper stronger.

During the rebuttal period, our limited resources only sufficed to report our preliminary attempts at applying social choice methods to RLHF (which were not ideal) and the results of GRMs on MLLM-as-a-Judge. We plan to extend our experiments with social choice methods for RLHF beyond the discussion period, applying them wherever they are applicable. We are glad to share the internal experimental plan we developed in response to your suggestions:

  1. For Table 2: Since directly converting rankings into discrete scores does not provide sufficiently fine-grained supervision, we will attempt multiplicative weighting between the aggregate ranking and the original scores, yielding a reward that both avoids preference cycles and remains fine-grained (see the sketch after this list).

  2. For Figure 7: After validating the improvements of social choice methods in RLHF, we will observe whether their benefits increase with the number of candidate responses.
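As a rough sketch of the weighting idea in item 1 (the weight form, `alpha`, and all names are placeholders; we have not yet fixed the final formulation), the aggregate ranking could modulate the raw GRM scores multiplicatively so that the reward avoids preference cycles while remaining fine-grained:

```python
from typing import List

def rank_weighted_scores(raw_scores: List[float],
                         ranking: List[int],
                         alpha: float = 0.5) -> List[float]:
    """Combine fine-grained GRM scores with a cycle-free aggregate ranking.

    raw_scores: per-response scores from the generative reward model
                (assumed non-negative in this sketch).
    ranking:    response indices sorted best-to-worst by a social choice method.
    alpha:      placeholder coefficient for how strongly the ranking
                reweights the raw scores.
    """
    n = len(raw_scores)
    # Higher-ranked responses receive a larger multiplicative weight, so the
    # reward keeps the original score gaps instead of collapsing to 4/3/2/1.
    weight = {idx: 1.0 + alpha * (n - pos) / n for pos, idx in enumerate(ranking)}
    return [raw_scores[i] * weight[i] for i in range(n)]
```

Additive bonuses or temperature-scaled rank weights are equally plausible alternatives, and we will compare these variants in the extended experiments.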

Whether or not this submission is accepted, we will add these new experiments to the experimental section of the revised main paper. We sincerely appreciate your suggestions and hope our efforts can strengthen your confidence in us.

Thanks again for your insightful suggestions and kind support!

Comment

Dear Reviewer mrSo,

We would like to once again express our sincere gratitude for the time and effort you have dedicated during the rebuttal period. You engaged in multiple rounds of discussion with us, and we are pleased to see that your concerns have progressively diminished; indeed, you expressed your willingness to support the acceptance of our paper. However, it appears that one final point of discussion has not yet received your response.

We deeply value your feedback, as it has been immensely helpful in improving our paper. Therefore, we sincerely request that you take the time to review our responses and provide your comments, so that our paper can be further enhanced.

Comment

Thank you again for the engaging back-and-forth. As I have said before, I am highly excited by how different ways of going beyond pairwise comparisons seem to be strong approaches for RLHF-V.

I agree with your assessment that social choice methods are less likely to perform well in predicting scores, as opposed to pairwise comparison and batch ranking tasks. Thank you for providing the results for pairwise comparison.

Now, with all these updated results, and provided they can be incorporated into the paper, I'm happy to update my rating to Accept.

Comment

Dear Reviewer mrSo,

We are deeply grateful for your thorough and constructive review of our work. Your insightful feedback has been instrumental in significantly enhancing the quality and clarity of our paper. We sincerely appreciate the substantial time and effort you dedicated to reviewing our manuscript and engaging in productive discussions throughout the rebuttal process. Below, we summarize the key improvements made in response to your feedback:

1. Enhanced Clarity and Presentation

  • Revised the introduction to provide a clearer, earlier explanation of "grouped comparison"
  • Refined motivational statements to more accurately articulate our contributions
  • Improved overall writing quality and precision throughout the paper

2. Exploration of Advanced Methods

  • Implemented and evaluated spectral ranking and social choice techniques (Borda count, Instant Runoff, Ranked Pairs)
  • Conducted comprehensive experiments across different scenarios (reward model scoring and RL fine-tuning)
  • Added valuable insights about the potential and limitations of these methods in our context

3. Computational Analysis

  • Provided detailed computational overhead analysis with concrete timing measurements
  • Discussed the scalability challenges and potential optimization directions
  • Added a comprehensive discussion of computational tradeoffs in the limitations section

Your suggestions regarding social choice techniques have opened exciting new research directions for generative RLHF-V. The exploration of these methods has not only improved our current work but also laid the foundation for future investigations in this promising area. We particularly appreciate how your feedback helped us recognize the broader applicability of our approach and its connections to established theoretical frameworks.

We commit to incorporating all discussed improvements in our revision. Your suggestions have enriched our understanding and will continue to guide our future work in this field.

Thank you once again for being such a dedicated and insightful reviewer. Your thorough engagement and constructive criticism exemplify the best of academic peer review, and we are honored to have benefited from your expertise.

Comment

Dear Reviewers and ACs,

We extend our sincere gratitude to the reviewers (mrSo, ecEr, NwUp, vM9w) for their invaluable and constructive feedback. We are pleased that the reviewers acknowledged the effectiveness of our Generative RLHF-V approach, its comprehensive experimental evaluation across multiple multimodal benchmarks, as well as its contribution to multimodal principle learning and alignment.

We are deeply honored that the reviewers' initial scores (5, 4, 4, 4) were all above the acceptance threshold, and we are even more delighted to have received their insightful feedback, which has significantly enhanced the quality of our paper. Our rebuttal responses were met with satisfaction from all reviewers, and we are truly privileged to have gained their further support. We extend our heartfelt gratitude once again; the collective efforts of all reviewers have made this a truly exceptional submission experience.

We promise to incorporate the reviewers' feedback into the revised version. The main updates include:

  • (Reviewer mrSo) We enhanced the clarity of our grouped comparison explanation by providing earlier, more intuitive descriptions in the introduction. We implemented and evaluated advanced structured prediction methods including spectral ranking and social choice techniques (Borda count, Instant Runoff, Ranked Pairs), demonstrating their potential in both reward model scoring and reinforcement learning scenarios.

  • (Reviewer mrSo) We provided comprehensive computational overhead analysis with concrete timing measurements, discussing scalability challenges and potential optimization directions for future work.

  • (Reviewer ecEr) We conducted extensive additional experiments on LLaVA-1.5-7B to demonstrate the generalizability and robustness of our approach across different model architectures, showing consistent improvements in principle learning and multimodal RLHF performance.

  • (Reviewer ecEr) We provided detailed clarification of our novel contributions, emphasizing that our work goes beyond simple modality extension to include systematic empirical study of the complete pipeline from reward modeling to grouped comparison in multimodal settings.

  • (Reviewer NwUp) We conducted comparative analysis of different RL algorithms (GRPO vs PPO) for training generative reward models, providing empirical justification for our algorithmic choices and clarifying implementation details.

  • (Reviewer NwUp) We explored reward hacking behaviors across different model architectures (LLaVA-1.5-7B vs Qwen-VL series), revealing diverse manifestations of over-optimization and their implications for multimodal alignment.

  • (Reviewer vM9w) We provided comprehensive analysis of the reward hacking phenomenon as an inherent challenge in RL-based approaches, positioning it as a valuable empirical finding rather than a limitation, and exploring potential mitigation strategies through social choice methods.

  • (Reviewer vM9w) We clarified the mechanism of implicit principle learning, explaining how models develop entangled representations of concepts like "accuracy" and "comprehensiveness" and why providing explicit principles can harm performance in GRM+RL settings.

  • (All Reviewers) We improved presentation quality, addressed grammatical issues, enhanced figure clarity, and added comprehensive discussions of limitations and future research directions inspired by reviewer suggestions.

We sincerely appreciate the reviewers' time and expertise. All discussions and experiments will be incorporated into the final version of the paper.

Once again, we deeply appreciate the reviewers' time and insightful feedback that has significantly enhanced the quality and impact of our work.

With best regards,

Authors

Final Decision

(a) Summary of Scientific Claims and Findings

The paper introduces Generative RLHF-V, a framework for aligning multimodal large language models (MLLMs) with human preferences. Unlike traditional score-based reward models, the authors propose generative reward modeling (GRM) that captures richer human principles. Instead of comparing only one pair of candidate responses, the grouped comparison method generates multiple responses, compares them pairwise, and aggregates scores for more accurate reward estimation. The method is evaluated on small-to-medium-scale open-source models, across seven benchmarks and four MLLM variants. Empirical results demonstrate gains over both score-only RMs and supervised GRMs, along with further analysis.

(b) Strengths

  • Novel framework: Integrates GRM and grouped comparisons into multimodal RLHF

  • Strong empirical validation: Results show improvements over baselines, with complete benchmarks and ablation studies.

  • Clear research questions: the paper is framed around specific RQs

  • Analysis of limitations: Discussion of vulnerabilities such as reward hacking adds value to the paper

(c) Weaknesses

  • Clarity and presentation: Several reviewers noted that explanations (especially of grouped comparisons) could be introduced earlier

  • Computational tradeoffs: Despite the promising results, computational costs were not sufficiently analyzed.

(d) Reasons for Final Decision

I am recommending acceptance since all the reviewers agree that the paper has merits and represents a valuable contribution. The authors are strongly encouraged to include the additional experimental results and additional discussion in the camera-ready version.

(e) Summary of the discussion

The reviewers are satisfied with the amendments promised by the authors (additional experimental evaluation and additional discussion). They agree that the paper merits acceptance.