PaperHub
6.4 / 10 · Poster · 4 reviewers
Ratings: 4, 4, 3, 5 (min 3, max 5, std 0.7) · Average: 4.0
Confidence
Novelty: 2.5 · Quality: 2.5 · Clarity: 3.3 · Significance: 2.5
NeurIPS 2025

Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization

OpenReview · PDF
Submitted: 2025-05-05 · Updated: 2025-10-29
TL;DR

We propose a unified policy optimization framework that supports autoregressive generation across both text and image modalities, enabling enhanced multimodal interleaved generation

Abstract

Keywords

multimodal interleaved generation · unified models · reinforcement learning

Reviews and Discussion

Review
Rating: 4

This paper presents a reinforcement learning-based post-training strategy that enables unified vision-language models to generate coherent, interleaved sequences of text and images. It introduces a warm-up phase using limited interleaved data and a novel multimodal extension of Group Relative Policy Optimization with hybrid and process-level rewards, and it demonstrates significant improvements on dedicated interleaved generation benchmarks without sacrificing general multimodal understanding or generation capabilities.

Strengths and Weaknesses

Strengths:

  1. The paper addresses an important and underexplored challenge: enabling unified vision-language models to produce truly interleaved multimodal outputs (text and images), which is crucial for applications like visual storytelling and step-by-step reasoning.

  2. The method does not require large-scale, high-quality interleaved datasets. Instead, it leverages limited interleaved data and a well-designed warm-up phase, making it practical and accessible for real-world applications.

Weakness:

  1. While the method is empirically effective, the authors offer limited theoretical analysis or formal justification for why the proposed multimodal GRPO and reward design are particularly suited to interleaved generation.

  2. The process-level reward modeling and the need to generate multiple candidate sequences per prompt in GRPO introduce substantial computational costs, which may limit scalability.

  3. The approach relies heavily on the effectiveness and robustness of multiple reward models (textual, visual, and structural) during training, but the authors do not thoroughly analyze how these reward models generalize to out-of-distribution scenarios or novel tasks, potentially limiting the method’s adaptability.

  4. Tables 3, 4, 5, and 6 are poorly presented with inconsistent sizes and font styles, making the paper appear unpolished and disorganized.

Questions

  1. Process-level rewards and group-wise sampling increase training costs. Can you provide more quantitative analysis on the computational and memory overhead introduced by the GRPO stage (e.g., compared to standard RLHF or supervised fine-tuning)? Please also discuss potential strategies for reducing this overhead, such as more efficient candidate sampling or reward estimation.

  2. Could you provide more details or experiments on the robustness and generalization ability of these reward models, especially when applied to out-of-distribution data or tasks beyond the current benchmarks? If not already included, consider testing or discussing how the method would adapt if reward models are imperfect or biased.

Limitations

yes

Final Justification

The authors addressed most of my concerns during the rebuttal, and the additional experiments and explanations improve the manuscript. Therefore, I lean toward acceptance, but remain open to rejection.

Formatting Issues

N/A

Author Response

Weakness1: While the method is empirically effective, the authors offer limited theoretical analysis or formal justification for why the proposed multimodal GRPO and reward design are particularly suited to interleaved generation.

A: We thank the reviewer for pointing out the lack of theoretical justification. To address this concern, we provide conceptual justifications for the design from two theoretical perspectives:

  1. Unified Policy Optimization: In the main paper, we have introduced a unified policy optimization formulation in Eq. 3 and 4:
$$L_{GRPO}(\theta) = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|Y_i|} \sum_{t=1}^{|Y_i|} \min \left[ D_{uni}(t) \cdot A_{i,t}, \, \text{clip}\left(D_{uni}(t), 1 - \epsilon, 1 + \epsilon\right) \cdot A_{i,t} \right],$$

$$D_{uni}(t) = \frac{\pi_{\theta}(y_{i,t}^{txt} \mid X_i, y_{i, \leq t})}{\pi_{\theta_{old}}(y_{i,t}^{txt} \mid X_i, y_{i, \leq t})} \cdot 1_{t \leq k}(t) + \frac{\pi_{\theta}(y_{i,t}^{img} \mid X_i, y_{i, \leq t})}{\pi_{\theta_{old}}(y_{i,t}^{img} \mid X_i, y_{i, \leq t})} \cdot 1_{t > k}(t).$$

This formulation allows actions from the visual modality to be seamlessly integrated into the overall generation trajectory and jointly optimized alongside textual actions. By treating all modality transitions within a unified policy space, our approach enables global optimization over the entire multimodal sequence. This not only ensures semantic consistency across modalities but also facilitates effective gradient propagation through the interleaved structure, which is critical for learning coherent cross-modal dependencies.
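To make the formulation concrete, a minimal PyTorch-style sketch of Eq. 3 and 4 is given below. This is an illustrative reconstruction from the equations above, not the authors' implementation; tensor names, shapes, the boolean modality mask, and the omission of padding masks are assumptions made for exposition.

```python
import torch

def unified_grpo_loss(logp_txt_new, logp_txt_old,
                      logp_img_new, logp_img_old,
                      is_text, advantages, eps=0.2):
    """Minimal sketch of the unified clipped GRPO objective (Eq. 3-4).

    All tensors have shape (G, T) for a group of G trajectories padded to a
    common length T (padding masks omitted for brevity). `is_text` is True for
    positions t <= k (text tokens) and False for t > k (image tokens), i.e. the
    indicator functions in Eq. 4; `advantages` holds the group-relative A_{i,t}.
    """
    # D_uni(t): text-branch or image-branch importance ratio at each position.
    ratio_txt = torch.exp(logp_txt_new - logp_txt_old)
    ratio_img = torch.exp(logp_img_new - logp_img_old)
    d_uni = torch.where(is_text, ratio_txt, ratio_img)

    unclipped = d_uni * advantages
    clipped = torch.clamp(d_uni, 1.0 - eps, 1.0 + eps) * advantages
    per_token = torch.minimum(unclipped, clipped)

    # (1/G) * sum_i (1/|Y_i|) * sum_t min(...)  -> maximized, so negate as a loss.
    return -per_token.mean(dim=1).mean()
```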

  2. Process-Level Reward Design: Previous works focusing on single-modality generation, such as Deepseek-R1, have explored introducing process-level rewards to better align generation with human preferences. However, they encountered a critical limitation: it is inherently difficult to accurately define and annotate what constitutes a "correct step" in complex reasoning or generation processes. This ambiguity in step boundaries leads to reward diffusion, where local credit assignment becomes unclear, resulting in noisy gradient updates and unstable learning dynamics. In contrast, interleaved multimodal generation naturally provides discrete and semantically grounded boundaries through modality switching (e.g., from text to image and back). These modality transitions serve as anchors for defining process-level steps, enabling more accurate and interpretable credit assignment during optimization. To facilitate understanding, we derive the optimization objective as:
$$\nabla_\theta L(\theta) = E_{Y \sim \pi_\theta} \left[ R(Y) \cdot \nabla_\theta \log \pi_\theta(Y) \right], \qquad \text{Var} \left[ R(Y) \cdot \nabla_\theta \log \pi_\theta(Y) \right] \propto \text{Var} \left[ R(Y) \right].$$

When step-level definitions are incorrect or inconsistent, the resulting reward distribution becomes noisy and ill-posed, leading to unstable learning signals and suboptimal policy updates. This leads to high variance in gradient estimation. In our interleaved multimodal setting, modality transitions provide explicit segmentation of the trajectory, allowing us to define rewards locally per modality segment. This reduces ambiguity in credit assignment and improves sample efficiency during optimization.
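To illustrate how modality switches define process-level steps, the following pure-Python sketch splits an interleaved trajectory into modality segments. The `is_image_token` predicate on the unified vocabulary is a hypothetical helper introduced here for exposition, not part of the authors' code.

```python
def split_into_modality_segments(tokens, is_image_token):
    """Split an interleaved token sequence into modality-specific segments.

    Each switch between text and image tokens closes the current segment and
    opens a new one, so segment boundaries coincide with modality transitions
    and can serve as process-level reward steps.
    """
    segments, current, current_is_img = [], [], None
    for tok in tokens:
        is_img = is_image_token(tok)
        if current and is_img != current_is_img:   # modality switch = step boundary
            segments.append((current_is_img, current))
            current = []
        current.append(tok)
        current_is_img = is_img
    if current:
        segments.append((current_is_img, current))
    return segments
```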

We will include a more detailed theoretical discussion in the revised version to support this rationale.

Weakness2 & Q1: The process-level reward modeling and the need to generate multiple candidate sequences per prompt in GRPO introduce substantial computational costs, which may limit scalability. Can you provide more quantitative analysis on the computational and memory overhead introduced by the GRPO stage (e.g., compared to standard RLHF or SFT)? Please also discuss potential strategies for reducing this overhead, such as more efficient candidate sampling or reward estimation.

A: We appreciate the reviewer’s concern regarding the potential computational and memory overhead introduced by our GRPO stage.

  1. We first would like to clarify that our process-level reward does not introduce additional computational burden—it simply redistributes the original outcome-level reward across different modality-specific steps. Therefore, it incurs negligible additional cost compared to standard RLHF reward computation. Furthermore, our group-wise sampling reduces training overhead compared to standard RLHF methods such as PPO, as it eliminates the need for training a separate critic model, which is often the most computationally intensive component in PPO-based pipelines.

  2. To further support our claim, we provide a quantitative comparison of the computational and memory overhead introduced by GRPO relative to standard RLHF and supervised fine-tuning (SFT) in the table below. For fair comparison, the number of samples per iteration is fixed to 4. The results show that GRPO introduces only marginal overhead compared to SFT, and is significantly more efficient than PPO-style RLHF in both compute and memory. We will include this quantitative analysis in the revision.

| Method | GPU Hours (per 1K steps) | Peak Memory (GB) |
|:----|:----|:----|
| SFT | 9.7 | 30 |
| PPO-based RLHF | 15.3 | 58 |
| Our GRPO | 11.1 | 42 |

  3. To further reduce cost, we are actively exploring more efficient group-wise candidate sampling strategies (e.g., top-k filtering or nucleus sampling) and lighter reward estimation techniques that approximate process-level rewards without relying on full model evaluations at each step.

Weakness3 & Q2: Could you provide more details or experiments on the robustness and generalization ability of these reward models, especially when applied to out-of-distribution data or tasks beyond the current benchmarks? If not already included, consider testing or discussing how the method would adapt if reward models are imperfect or biased.

A:

  1. To examine the robustness of the reward models, we have included experiments in the main paper (Table 6) that specifically consider imperfect or biased reward models. In this setting, we replace the original reward model with CLIP, which serves as a weaker and less aligned alternative. The results show that while there is a slight performance drop (-2.56 on MMIE and -0.55 on InterleavedBench), our GRPO framework remains stable and continues to outperform baseline methods, highlighting its resilience to noise and bias in the reward signal.

  2. To further address this concern, we conduct additional experiments following the MVoT protocol [1], where a suite of out-of-distribution, maze-specific tasks is used to evaluate the model's visual reasoning capabilities across multimodal contexts. As shown in the table below, without any changes to the reward models, our method achieves substantial performance gains on these challenging benchmarks. This demonstrates the robustness and generalization ability of our reward design beyond the current benchmark tasks.

| Method | MAZE | MINIBEHAVIOR | FROZEN-LAKE |
|:----|:----|:----|:----|
| MVoT | 0.9295 | 0.9514 | 0.856 |
| Vila-U | 0.7046 | 0.6527 | 0.5839 |
| Ours | 0.9593 | 0.9706 | 0.8742 |

Weakness4: Tables 3, 4, 5, and 6 are poorly presented with inconsistent sizes and font styles, making the paper appear unpolished and disorganized.

A: We thank the reviewer for highlighting the formatting inconsistencies in Tables 3, 4, 5, and 6. We acknowledge the issue regarding inconsistent sizes and font styles, and we will carefully address these presentation details in the revised version to ensure clarity and visual consistency throughout the paper. We appreciate the reviewer's feedback, which will help improve the overall quality and polish of the paper.

[1] Li, Chengzu et al. “Imagine while Reasoning in Space: Multimodal Visualization-of-Thought.” arXiv preprint, 2501.07542.

Comment

I appreciate the authors' response, which addressed some of my concerns. I am particularly interested in how the performance would be affected by applying top-k filtering or nucleus sampling. By the way, the presentation of the theoretical analysis is quite difficult to follow.

Comment

We thank the reviewer for the constructive feedback.

Q1: Impact of top-k filtering or nucleus sampling.

A:

  • By default, we adopt nucleus sampling with p = 0.7 in our experiments. To explore the impact of the sampling parameter, we conducted experiments under different p values. As shown below, model performance is somewhat sensitive to the value of p. In particular, performance degrades when p is too small (e.g., p = 0.4), likely due to overly restrictive sampling that limits output diversity (a minimal top-p/top-k filtering sketch is given after this list).

| Nucleus sampling | MMIE | InterleavedBench |
|:----|:----|:----|
| p=0.7 | 59.5 | 3.13 |
| p=0.9 | 59.1 | 3.04 |
| p=0.4 | 58.4 | 2.81 |

  • Experiments with top-k filtering are currently in progress. We will update the results as soon as they become available and would be happy to share them with the reviewer.
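For reference, the sketch mentioned above: a minimal implementation of top-k and nucleus (top-p) filtering over a 1-D logits vector, following the standard definitions of these strategies rather than the authors' actual sampling code.

```python
import torch

def filter_logits(logits, top_k=None, top_p=None):
    """Mask a 1-D logits vector so that sampling is restricted to the top-k
    tokens and/or the smallest set of tokens whose probability mass exceeds
    top_p (the nucleus)."""
    logits = logits.clone()
    if top_k is not None:
        kth_value = torch.topk(logits, top_k).values[-1]
        logits[logits < kth_value] = float("-inf")           # keep the k largest
    if top_p is not None:
        probs = torch.softmax(logits, dim=-1)
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        outside_nucleus = cumulative - sorted_probs > top_p  # always keeps >= 1 token
        logits[sorted_idx[outside_nucleus]] = float("-inf")
    return logits

# e.g. nucleus sampling with p = 0.7 as used in the rebuttal experiments:
# probs = torch.softmax(filter_logits(next_token_logits, top_p=0.7), dim=-1)
# next_token = torch.multinomial(probs, num_samples=1)
```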

Q2: Clarity of theoretical analysis.

A: We appreciate the reviewer’s feedback regarding the clarity of our theoretical analysis. Below, we provide a simplified explanation of our key ideas:

  • We propose a unified policy optimization framework (Eq. 3 and 4) that treats both text and image actions within the same policy space. This allows the model to learn a coherent generation trajectory across modalities. By jointly optimizing actions from different modalities, we ensure global consistency and enable smooth gradient flow through the interleaved sequence, which is essential for learning meaningful cross-modal dependencies.

  • In previous methods (e.g., Deepseek-R1), defining accurate step-level rewards is difficult, often leading to ambiguous credit assignment and unstable training. In contrast, our interleaved multimodal setup naturally segments the generation process through explicit modality switches (e.g., text → image → text), which serve as clear and semantically meaningful boundaries. These transitions allow us to assign rewards to specific modality segments, improving the interpretability of the reward function, reducing gradient variance, and stabilizing learning.

We hope this explanation helps clarify the motivation and formulation of our method.

Please feel free to let us know if you have any further questions. We would be happy to address them.

Comment

Dear Reviewer 3D4S,

Thanks again for your valuable comments and suggestions. As the discussion phase is nearing its end, we wondered if the reviewer might still have any concerns that we could address.

We believe our responses addressed all the thoughtful questions raised by the reviewer. It would be great if the reviewer can kindly check our responses and provide feedback with further questions/concerns (if any). We would be more than happy to address them. Thank you!

Best wishes,

Authors

Comment

Dear Reviewer 3D4S,

We have supplemented our experiments with top-k sampling. As shown in the table below, the model achieves relatively better performance at k=50, while performance drops significantly at k=10 and k=500, indicating that policy sampling is sensitive to diversity and quality. Furthermore, the overall performance of top-k sampling does not match that of nucleus sampling, suggesting that nucleus sampling is better suited for balancing diversity and quality in this task.

| Top-k sampling | MMIE | InterleavedBench |
|:----|:----|:----|
| k=10 | 52.4 | 2.17 |
| k=50 | 58.7 | 3.01 |
| k=100 | 56.5 | 2.62 |
| k=500 | 50.3 | 1.86 |

We hope our response effectively addresses the reviewer’s concerns.

Please feel free to let us know if there are any further questions or clarifications needed. We would be happy to discuss them.

Best wishes,

Authors

Comment

The authors' substantial efforts during the rebuttal period have greatly enhanced my evaluation of this manuscript. As a result, I have decided to raise my score to borderline accept.

Comment

We appreciate the reviewer's time and thank the reviewer again for the valuable comments and the positive score!

Review
Rating: 4

This paper extends unified VLMs, which previously handled either understanding or generation, to generate interleaved text and images. The authors use GRPO to post-train VLMs with textual, visual, and format rewards, together with process-level rewards, after a warm-up stage that enables interleaved outputs. They conduct experiments on MMIE and InterleavedBench and achieve performance gains compared to the original VLM base model.

Strengths and Weaknesses

Strengths:

  1. The paper enables interleaved image and text generation via GRPO post-training for unified VLMs and achieves improved performance compared to the baseline.
  2. The method is intuitive and easy to implement.
  3. The structure/writing of the paper is easy to follow.

Weaknesses:

  1. The experiments are conducted only on Vila-U, which is less powerful than other baselines for multimodal understanding or generation, as shown in Table 5; the generalization of the proposed method to more powerful unified VLMs is not demonstrated.
  2. The baselines compared in the main results (Table 1 and Table 2) are somewhat outdated and lack more powerful baselines such as EMU3 [1], which would ensure a more comprehensive evaluation.
  3. There is limited insight the community can gain after reading this paper. It is expected that, after GRPO with combined rewards, the model achieves better performance on interleaved generation of text and images; adding an SFT-only post-training baseline and comparing it with RL post-training would be more insightful.

[1] Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., Zhao, Y., Ao, Y., Min, X., Li, T., Wu, B., Zhao, B., Zhang, B., Wang, L., Liu, G., He, Z., Yang, X., Liu, J., Lin, Y., Huang, T., & Wang, Z. (2024). Emu3: Next-Token Prediction is All You Need. ArXiv, abs/2409.18869.

Questions

  1. Can more comprehensive evaluations be conducted as stated in weaknesses?
  2. Can a baseline of pure SFT post-training be added to provide more insight into the advantage of RL-based post-training in the domain of interleaved text and image generation?

Limitations

yes

Final Justification

I have read the authors' rebuttal and the further discussion from the authors, which address my concerns.

Formatting Issues

NA

Author Response

Weakness1: The experiments are conducted only on Vila-U, which is less powerful than other baselines for multimodal understanding or generation, as shown in Table 5; the generalization of the proposed method to more powerful unified VLMs is not demonstrated.

A: To address this concern, we have conducted extensive experiments on Janus-Pro [1], a more powerful unified VLM than Vila-U, with stronger performance on both multimodal understanding and generation tasks. As shown in the tables below, our proposed method consistently improves performance across various evaluation metrics on Janus-Pro, demonstrating its effectiveness and generalizability beyond a single model architecture.

| Method | MMIE | InterleavedBench |
|:----|:----|:----|
| EMU-2 | 45.33 | 1.68 |
| EMU-3 | 50.17 | 2.19 |
| Ours (Vila-u) | 59.5 | 3.13 |
| Ours (Janus-Pro) | 61.76 | 3.35 |

| Method | MME-P | MMvet | SEEDBench-img | POPE | #Train images | Image res. | GenEval |
|:----|:----|:----|:----|:----|:----|:----|:----|
| Janus-Pro | 1516.7 | 45.1 | 70.1 | 78.9 | 72M | 384 | 0.80 |
| Ours (Janus-Pro) | 1552.9 | 45.4 | 71.3 | 80.6 | 16M | 384 | 0.83 |

Weakness2 & Q1: The baselines compared in the main results (Table 1 and Table 2) are somewhat outdated and lack more powerful baselines such as EMU3, which would ensure a more comprehensive evaluation.

A: We thank the reviewer for the suggestion. However, not all unified models currently support interleaved generation, which is the core focus of our method. For completeness, we faithfully report the results of baseline models in Table 1 and Table 2 based on their original papers to ensure fairness and reproducibility.

Furthermore, we have additionally included the performance of EMU3 on standard benchmarks where applicable, to offer a more comprehensive view of model capabilities. Our method still demonstrates competitive or superior performance, further confirming its effectiveness.

| Method | MMIE-SA | MMIE-PL | MMIE-MR | MMIE-AVG | InterleavedBench |
|:----|:----|:----|:----|:----|:----|
| EMU-2 | 39.65 | 46.12 | 50.75 | 45.33 | 1.68 |
| EMU-3 | 46.14 | 50.85 | 52.86 | 50.17 | 2.19 |
| Ours | 56.87 | 62.28 | 54.31 | 59.5 | 3.13 |

Weakness3 & Q2: There is limited insight the community can gain after reading this paper. It is expected that, after GRPO with combined rewards, the model achieves better performance on interleaved generation of text and images; adding an SFT-only post-training baseline and comparing it with RL post-training would be more insightful.

A: We thank the reviewer for the valuable suggestion. In the main paper, we have included a comparison in Table 3 where the model is trained only with interleaved data (only w/ warm-up stage) using supervised fine-tuning. The results (53.31 on MMIE and 1.97 on InterleavedBench) show that SFT alone yields suboptimal performance on interleaved generation tasks, indicating that exposure to interleaved data is helpful but not sufficient for optimal performance.

Furthermore, to ensure a fair comparison, we extend the experiment by increasing the SFT data to match the scale used in RL post-training. As shown in the table below (Row 2), even under this setting, RL post-training with our combined reward design consistently outperforms SFT, which only gains minimal benefits. The results suggest that reinforcement learning brings non-trivial gains beyond what SFT can achieve, especially in aligning fine-grained generation behavior across modalities. We will highlight these insights more clearly in the revision to better convey the added value of our approach.

| Method | MMIE | InterleavedBench |
|:----|:----|:----|
| SFT (Warm-up) | 53.31 | 1.97 |
| SFT (Full) | 54.82 | 2.04 |
| GRPO | 59.5 | 3.13 |

[1] Xiaokang Chen, et al. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint, 2025.

Comment

Thanks for the effort in the rebuttal, which partially addressed my concern. The experiments on Janus-Pro/Vila-U show marginal improvement on general understanding/generation tasks; could you offer more insight into why the training cannot increase performance on general tasks as much as on interleaved text-and-image generation?

Comment

We appreciate the reviewer’s insightful question. We would like to address this concern from three perspectives:

  1. Unlike interleaved consistency tasks or mathematical reasoning tasks, which are relatively well-defined and focused, general understanding tasks are inherently open-ended and multi-dimensional, making them more challenging to evaluate and optimize for. As such, performance improvements on these tasks can be more nuanced and difficult to attribute directly to specific components of the training framework.

  2. Our reward functions are primarily tailored for interleaved generation, especially focusing on cross-modal alignment and image-text consistency. Consequently, the rewards are not specifically designed to optimize general understanding or generation capabilities. This design choice explains why performance gains on general understanding benchmarks are not as substantial as those observed on interleaved benchmarks.

  3. To further explore this point, we conducted additional experiments incorporating general generation rewards into our training paradigm.

  • Specifically, we followed Flow-GRPO [1] and integrated GenEval-related rewards. As shown in the table below, this modification led to improved performance on GenEval, but had no positive effect, or even slightly degraded performance, on interleaved benchmarks. This suggests that current GenEval-related rewards may not align well with the requirements of interleaved generation, highlighting the importance of task-specific reward design.

| w/ GenEval reward | GenEval | MMIE | InterleavedBench |
|:----|:----|:----|:----|
| ✗ | 0.46 | 59.5 | 3.13 |
| ✓ | 0.67 | 58.2 | 2.85 |
  • Furthermore, designing a reward model that generalizes well across both general understanding and generation tasks remains an open challenge that the community continues to actively explore.

Please feel free to let us know if you have any further questions. We would be happy to address them.

[1] Liu, Jie, et al. "Flow-grpo: Training flow matching models via online rl." arXiv preprint, 2505.05470.

Comment

Thanks for the explanation, which resolves my concern. I will raise my score accordingly.

Comment

We appreciate the reviewer's time and thank the reviewer again for the valuable comments and the positive score!

Review
Rating: 3

This paper introduces a policy optimization framework for unified multimodal models that significantly enhances their capability for interleaved generation. To achieve this objective, the method employs a warm-up stage using hybrid datasets to develop initial interleaved generation capabilities, followed by policy optimization using an extended GRPO framework. The optimization process incorporates comprehensive multi-type rewards that evaluate textual relevance, visual-text alignment, and structural fidelity to guide the training process effectively. Experimental results on established interleaved generation benchmarks demonstrate the effectiveness of the proposed method, showing substantial improvements in both generation quality and coherence compared to existing approaches.

Strengths and Weaknesses

Strengths:

  • The construction pipeline for the interleaved data used during warm-up is well-designed and robust.
  • The final performance of interleaved generation demonstrates strong quality and coherence.

Weaknesses:

  • The proposed framework lacks novelty. The GRPO algorithm remains unchanged, still producing token-by-token outputs without any modifications. As a result, this framework is only applicable to autoregressive (AR)-based models such as Janus-Pro. It cannot be applied to other mainstream unified models such as Transfusion[1], MetaQuery [2], or BAGEL [3].
  • The use of multiple rewards is rather straightforward and does not constitute a significant contribution.
  • The interleaved generation primarily focuses on coherence and creativity, but does not demonstrate capabilities such as interleaved reasoning or other complex interleaved scenarios, which limits its applicability.

[1] Zhou, Chunting, et al. "Transfusion: Predict the next token and diffuse images with one multi-modal model." arXiv preprint arXiv:2408.11039 (2024).

[2] Pan, Xichen, et al. "Transfer between modalities with metaqueries." arXiv preprint arXiv:2504.06256 (2025).

[3] Deng, Chaorui, et al. "Emerging properties in unified multimodal pretraining." arXiv preprint arXiv:2505.14683 (2025).

Questions

  • Regarding Table 3, I would like to see the performance of the model when applying GRPO directly without the warm-up stage. While the model can generate interleaved outputs even without warm-up (albeit of poor quality), it would be valuable to understand how much the warm-up stage contributes to the final performance by comparing the direct GRPO approach against the full two-stage pipeline.

  • The ablation study on group size in GRPO presented in Table 6 appears limited in scope. Given that recent work such as DeepSeek-R1 [1] has explored larger group sizes (8, 12, 16), it would be beneficial to extend this analysis to investigate whether larger group sizes could further improve performance. This expanded ablation would provide better insights into the scalability of the approach and help determine the optimal group size configuration for multimodal interleaved generation tasks.

  • The training process for interleaved generation appears to primarily benefit the interleaved output capability without demonstrating clear improvements to the model's core image understanding abilities. Could the authors provide insights into whether this specialized training offers any additional benefits beyond interleaved generation? For instance, does the multi-modal policy optimization process enhance visual reasoning, cross-modal alignment, or other fundamental multimodal understanding capabilities?

[1] Guo, Daya, et al. "Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning." arXiv preprint arXiv:2501.12948 (2025).

Limitations

Yes

Final Justification

While the paper presents a practical extension of GRPO to multimodal settings, the adaptation relies heavily on existing formulations and does not introduce fundamentally novel techniques. The proposed process-level reward is essentially a form of reward shaping, combining known signals heuristically rather than offering a new design. Key claims, such as model-agnostic applicability, are not empirically supported, which leads me to lean toward borderline rejection.

Formatting Issues

No major issue.

Author Response

Weakness1: The proposed framework lacks novelty. The GRPO algorithm remains unchanged, still producing token-by-token outputs without any modifications. As a result, this framework is only applicable to autoregressive (AR)-based models such as Janus-Pro. It cannot be applied to other mainstream unified models such as Transfusion, MetaQuery, or BAGEL.

A:

  1. We respectfully disagree with the reviewer’s comment that our GRPO framework is unchanged and lacks novelty. In this work, we make several substantial extensions beyond the original GRPO algorithm.
  • We extend GRPO to the multimodal setting by introducing a unified policy optimization framework, which allows actions from the visual modality to be seamlessly integrated into the overall generation trajectory and jointly optimized alongside textual actions. This enables global optimization over the interleaved sequence, ensuring semantic consistency and facilitating effective gradient propagation across modalities.

  • We further design a novel hybrid process-level reward that allows for finer-grained credit assignment and better alignment of intermediate generation steps. These innovations are non-trivial and crucial for effectively adapting GRPO to large-scale multimodal generation tasks.

  2. Furthermore, our approach is not inherently tied to autoregressive models. We simply adopt VILA-U as our base model, which operates in an autoregressive fashion. In fact, GRPO strategies have also been successfully applied in diffusion- and flow-matching-based frameworks, such as Flow-GRPO [1].
  • Mathematically, regardless of whether generation is performed via AR decoding or diffusion sampling, both paradigms can be abstracted as sampling from a parameterized policy $\pi_\theta(y \mid x)$, where $y$ denotes the generated output (either an autoregressive sequence or a diffusion sampling trajectory) and $x$ is the input context. Both approaches optimize $\pi_\theta(y \mid x)$ through gradient-based updates guided by relative preferences over grouped samples (see the interface sketch after this list).

  • Therefore, our framework—particularly the design of process-level hybrid rewards—is inherently model-agnostic, and can be seamlessly integrated into a wide range of generation paradigms beyond the AR setting. We will further investigate its applicability to diffusion-based or flow-matching-based generators in future work.
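The interface sketch referred to above makes the model-agnostic claim concrete: group-relative optimization only requires a way to sample a group of candidates and to score their (log-)likelihood, whether those come from AR decoding or a diffusion/flow sampler. The names and method signatures below are illustrative assumptions, not an existing API.

```python
from typing import Any, Protocol, Sequence

class GroupPolicy(Protocol):
    """Anything usable with group-relative policy optimization: it can sample a
    group of candidate outputs for a prompt and score their log-probability
    (token-level for AR models, trajectory-level for diffusion/flow models)."""

    def sample_group(self, prompt: Any, group_size: int) -> Sequence[Any]: ...
    def log_prob(self, prompt: Any, output: Any) -> float: ...
```

Under this abstraction, the clipped group-relative update consumes only log-probability ratios and group-normalized rewards, which is why the same update rule can in principle be applied to either generation paradigm.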

We hope our response helps clarify the novelty and contributions of our work, as well as eliminate potential misunderstandings—particularly regarding the misconception that our framework is limited to autoregressive models.

Weakness2: The use of multiple rewards is rather straightforward and does not constitute a significant contribution.

A:

  1. First, we would like to clarify that developing new reward models is not the primary motivation of this work. Our core goal lies in building a unified framework under the GRPO paradigm that enables advanced interleaved multimodal generation. Therefore, we deliberately adopt multiple straightforward yet effective reward models as functional tools to guide policy learning. The innovation of our approach lies in leveraging multimodal hybrid reward signals that provide comprehensive supervision for interleaved multimodal generation.

  2. Moreover, for credit assignment, we do propose a novel process-level reward specifically tailored for interleaved multimodal generation.

  • Previous works, such as Deepseek-R1, attempted to use process-level rewards, but struggled with the challenge of defining “correct steps” in complex tasks. This ambiguity led to reward diffusion and noisy gradient updates, hindering stable learning.

  • In contrast, interleaved multimodal generation offers natural, semantically grounded step boundaries via modality transitions (e.g., text to image), which act as anchors for defining local rewards. This enables clearer credit assignment along the trajectory and improves sample efficiency during optimization. Experimental results in Table 4 (+1.67 on MMIE and +0.34 on InterleavedBench) underscore the significance and contribution of the proposed process-level reward.

Therefore, we believe the choice of reward models does not undermine the novelty or contributions of our work; rather, our innovation lies in how rewards are assigned, particularly through the proposed process-level design tailored for interleaved generation.

Weakness3 & Q3: The interleaved generation primarily focuses on coherence and creativity, but does not demonstrate capabilities such as interleaved reasoning or other complex interleaved scenarios, which limits its applicability.

A: We thank the reviewer for raising this point.

  1. First, we would like to clarify that the benchmarks selected in our main experiments already include scenarios involving interleaved reasoning and other complex interleaved tasks. For example, we report the "Multi-step reasoning" metric on the MMIE benchmark in Table 1, and our result (53.28%) outperforms current advanced models.

  2. To further strengthen this evaluation, we conduct additional experiments following the MVoT protocol [2], where a suite of maze-specific tasks is used to evaluate the model's visual reasoning capabilities across multimodal contexts. As shown in the table below, our method achieves substantial performance gains on these benchmarks, further validating its ability to handle complex interleaved reasoning beyond simple coherence and creativity.

| Method | MAZE | MINIBEHAVIOR | FROZEN-LAKE |
|:----|:----|:----|:----|
| MVoT | 0.9295 | 0.9514 | 0.856 |
| Vila-U | 0.7046 | 0.6527 | 0.5839 |
| Ours | 0.9593 | 0.9706 | 0.8742 |

Q1: Regarding Table 3, I would like to see the performance of the model when applying GRPO directly without the warm-up stage. While the model can generate interleaved outputs even without warm-up (albeit of poor quality), it would be valuable to understand how much the warm-up stage contributes to the final performance by comparing the direct GRPO approach against the full two-stage pipeline.

A: We appreciate the reviewer’s insightful suggestion. As requested, we conducted an additional experiment where we directly train the unified model using GRPO without the warm-up stage. The results (Row 2) show that generation quality after GRPO is significantly lower compared to the full two-stage pipeline. We attribute this to the model’s poor initial ability to generate interleaved outputs, which limits its capacity to produce high-reward samples during the GRPO stage. These findings underscore the importance of the warm-up stage in providing a reasonable initialization for modality transitions and content generation.

| Warm-up | GRPO | MMIE | InterleavedBench |
|:----|:----|:----|:----|
| | | – | 0.51 |
| | ✓ | 17.84 | 1.25 |
| | ✓ (w/ initialization) | 25.96 | 1.58 |
| ✓ | ✓ | 59.5 | 3.13 |

To further investigate this phenomenon, we explored an alternative strategy under the pure GRPO setting by explicitly encouraging the model to sample modality-switching actions through modifications to the sampling policy and reward formulation. Specifically, during the initial training phase, we manually sampled an interleaved content trajectory by adjusting the sampling probability of modality-switching tokens. Additionally, we introduced a multimodal content reward to enhance its relative advantage within the group and incentivize the generation of interleaved content.

However, performance (Row 3) still fell short compared to the warm-up+GRPO setup, indicating that starting from a cold initialization hinders the model’s ability to discover meaningful interleaved structures solely through reinforcement signals. These results support our two-stage training design and highlight the critical role of the warm-up stage in stabilizing training and guiding the model toward effective interleaved generation.
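As a concrete illustration of the sampling-policy modification described above, a minimal sketch of upweighting modality-switching tokens during the early GRPO phase is given below; `switch_token_ids` and the additive `bias` value are hypothetical names introduced here for exposition, not the authors' actual implementation.

```python
import torch

def bias_modality_switch_tokens(logits, switch_token_ids, bias=2.0):
    """Additively boost the logits of modality-switching tokens (e.g. the token
    that opens an image segment) so that sampled trajectories are more likely to
    contain interleaved text-image content during the cold-start phase."""
    biased = logits.clone()
    biased[..., switch_token_ids] += bias
    return biased
```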

Q2: The ablation study on group size in GRPO presented in Table 6 appears limited in scope. Given that recent work such as DeepSeek-R1 has explored larger group sizes (8, 12, 16), it would be beneficial to extend this analysis to investigate whether larger group sizes could further improve performance. This expanded ablation would provide better insights into the scalability of the approach and help determine the optimal group size configuration for multimodal interleaved generation tasks.

A: We appreciate the reviewer’s suggestion and agree that exploring larger group sizes can offer valuable insights into the scalability of GRPO. However, due to the heavy memory usage of unified multimodal models, especially during group-wise sampling and interleaved generation, we were only able to extend our ablation study up to group size = 6 and 8 within our current hardware constraints.

As shown in the table below, increasing the group size to 6 yields noticeable improvements, indicating the potential benefit of larger groups. However, further increasing the group size to 8 brings only marginal gains. We acknowledge the value of further exploration at even larger group sizes (e.g., 12, 16), and we plan to include such extended experiments in future work when more computational resources become available.

| G | MMIE | InterleavedBench |
|:----|:----|:----|
| 2 | 55.14 | 2.27 |
| 4 | 59.5 | 3.13 |
| 6 | 61.43 | 3.38 |
| 8 | 61.85 | 3.42 |

[1] Liu, Jie, et al. "Flow-grpo: Training flow matching models via online rl." arXiv preprint, 2505.05470.

[2] Li, Chengzu et al. “Imagine while Reasoning in Space: Multimodal Visualization-of-Thought.” arXiv preprint, 2501.07542.

Comment

Dear Reviewer Ksvg,

We sincerely appreciate the reviewer's time, and we would welcome further discussion to see whether our detailed explanations and additional results resolve the concerns. We have addressed all the thoughtful questions raised by the reviewer (e.g., applying GRPO directly without the warm-up stage), and we hope that our work's contribution and impact are better highlighted by our responses. As the discussion phase is nearing its end, it would be great if the reviewer could kindly check our responses and provide feedback with any further questions or concerns. We would be more than happy to address them. Thank you!

Best wishes,

Authors

Comment

Dear Reviewer Ksvg,

Thanks again for the valuable comments and suggestions. As the discussion phase is nearing its end, we wondered if the reviewer might still have any concerns that we could address. We believe our responses addressed all the thoughtful questions raised by the reviewer (e.g., novelty, reward design, interleaved reasoning, and the ablation studies).

It would be great if the reviewer can kindly check our responses and provide feedback with further questions/concerns (if any). We would be more than happy to address them. Thank you!

Best wishes,

Authors

Comment

Thank you for the clarification. However, I would like to respectfully maintain my original concerns. The extension of GRPO to multimodal settings seems relatively straightforward, given the unified token space—making the core formulation largely unchanged. This limits the novelty of the adaptation. While the authors claim model-agnostic applicability (to AR and diffusion models), such settings are not explored in the current work, weakening this claim. Additionally, the proposed process-level reward appears to be a heuristic combination of existing rewards (e.g., format, text, image), rather than a fundamentally new design. Overall, I appreciate the response, but I believe the concerns about novelty and completeness remain.

Comment

We appreciate the reviewer’s continued engagement and constructive critique. However, we respectfully disagree with the reviewer’s assessment that our GRPO framework is “unchanged” and lacks novelty. In this work, we make substantial extensions beyond the original GRPO algorithm.

  • Our unified policy formulation integrates textual and visual actions into a single policy space, enabling global optimization with stable gradient propagation. To the best of our knowledge, this capability is not achieved by any prior GRPO adaptation, thereby demonstrating the novelty of our approach.
  • The reviewer’s interpretation that our framework is inherently limited to AR models is, in our view, overly restrictive. We clarified this misconception from a theoretical perspective: both autoregressive and diffusion-based models can be mathematically unified as sampling from a parameterized policy, indicating the model-agnostic property of our framework.
  • The core innovation of our reward design lies not in the choice of reward functions, but in process-level credit assignment. By anchoring steps at modality transitions, we localize rewards to modality segments, providing fine-grained and stable credit assignment. This approach directly addresses the reward diffusion problem seen in prior works (e.g., DeepSeek-R1) and represents a key novelty of our method.

We hope this clarifies the novelty and scope of our contributions, as well as eliminates potential misunderstandings about the applicability of our framework.

Review
Rating: 5
  • This paper proposes a reinforcement learning-based post-training strategy to enable existing unified vision-language models to generate multimodal interleaved outputs. Specifically, the paper proposes a unified policy optimization framework that extends Group Relative Policy Optimization (GRPO) to the multimodal setting.
  • The paper conducts experiments on two challenging benchmarks, MMIE and InterleavedBench, demonstrating the effectiveness of the proposed approach.
  • The paper provides comprehensive ablation studies showing the effectiveness of its submodules.

Strengths and Weaknesses

Strengths

  • The paper proposes an effective RL training strategy to enable unified vision-language models to perform high-quality interleaved multimodal generation.
  • The paper introduces a hybrid group-wise reward signal composed of textual relevance, visual-text alignment, and structural fidelity—addressing multiple aspects of interleaved generation quality.
  • The proposed method is evaluated on two challenging benchmarks, MMIE and InterleavedBench, demonstrating the effectiveness of their proposed approach.

Weaknesses

  • The computation of the process-level reward (L200) is unclear to me. Is it computed using the modality-specific segment as the target output, following the same reward computation strategy as in Eq. 7?
  • In Table 5, the proposed method underperforms compared to VILA-U on MMVet and POPE

Questions

I wonder whether the authors have tried a version without the warm-up stage; how does the model perform?

Limitations

yes

Final Justification

I think the idea of leveraging a relatively small amount of data with RL to enable interleaved generation is interesting, although the paper could benefit from a clearer and more detailed explanation of the proposed method. I appreciate the authors' effort during the rebuttal, and given that the promised clarifications will be incorporated into the revised version, I would like to maintain my score.

Formatting Issues

N/A

Author Response

Weakness1: The computation of the process-level reward (L200) is unclear to me. Is it computed using modality-specific segment as target output, and follows the same reward computation strategy as in Eq. 7?

A: Thank you for the question. In our approach, we treat modality-specific segments as the target outputs. However, the reward computation for each such step $Y^k$ differs slightly from Eq. 7. We compute a modality- and format-specific local reward $r^k$ as follows:

$$r^k = I_{Y^k \in Y_{\text{txt}}} \cdot r_t(X, Y^k) + I_{Y^k \in Y_{\text{img}}} \cdot r_v(X, Y^k) + r_f(X, Y^k).$$

Specifically, the local reward is only computed based on the specific modality segment, and is then uniformly assigned to all tokens within that segment.

For credit assignment across steps within the trajectory, we follow a group-wise outcome comparison strategy similar to outcome-level supervision in Eq. 5:

$$\hat{A}^{k_i}_{i,t} = \frac{r(X, Y^{k_i}_i) - \text{mean}\left( \{ r(X, Y^{1}_1), \dots, r(X, Y^{k_1}_1), \dots, r(X, Y^{k_G}_G) \} \right)}{\text{std}\left( \{ r(X, Y^{1}_1), \dots, r(X, Y^{k_1}_1), \dots, r(X, Y^{k_G}_G) \} \right)}.$$

We normalize the per-step scores within the group, and compute the relative advantage of each step. This strategy enables a more fine-grained and diverse credit assignment across modality-specific steps. We will clarify this implementation detail more explicitly in the refined version of the paper.
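A small NumPy sketch of the two formulas above, under the assumption that the textual, visual, and format rewards are provided by external reward models; the function and argument names are illustrative, not the authors' code.

```python
import numpy as np

def segment_reward(is_image_segment, r_text, r_visual, r_format):
    """Local reward r^k for one modality segment: the modality-specific term
    (textual or visual) selected by the indicator, plus the format reward."""
    return (r_visual if is_image_segment else r_text) + r_format

def group_relative_step_advantages(segment_rewards):
    """Normalize per-step rewards pooled over the whole group.

    segment_rewards: list over trajectories, each a list of r^k values.
    Returns advantages with the same nested structure; each advantage is then
    assigned uniformly to all tokens within its segment.
    """
    pooled = np.array([r for traj in segment_rewards for r in traj], dtype=float)
    mean, std = pooled.mean(), pooled.std() + 1e-8
    return [[(r - mean) / std for r in traj] for traj in segment_rewards]
```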

Weakness2: In Table 5, the proposed method underperforms compared to VILA-U on MMVet and POPE.

A: We thank the reviewer for pointing this out. The slightly lower performance of our method on MMVet and POPE compared to VILA-U is within a reasonable margin of experimental variance (approximately 2% and 1%, respectively), and we believe it does not indicate a fundamental limitation of our approach.

To further address this concern, we conducted additional experiments using Janus-Pro, a stronger unified model. The results demonstrate consistent improvements across all benchmarks, including MMVet and POPE, validating that our GRPO framework remains effective and generalizable on these benchmarks as well.

| Method | MME-P | MMvet | SEEDBench-img | POPE | #Train images | Image res. | GenEval |
|:----|:----|:----|:----|:----|:----|:----|:----|
| Janus-Pro | 1516.7 | 45.1 | 70.1 | 78.9 | 72M | 384 | 0.80 |
| Ours (Janus-Pro) | 1552.9 | 45.4 | 71.3 | 80.6 | 16M | 384 | 0.83 |

Q1: I wonder if the authors have tried the version without the warmup stage, how does the model perform?

A: We appreciate the reviewer’s insightful suggestion. As requested, we conducted an additional experiment where we directly train the unified model using GRPO without the warm-up stage. The results (Row 2) show that generation quality after GRPO is significantly lower compared to the full two-stage pipeline. We attribute this to the model’s poor initial ability to generate interleaved outputs, which limits its capacity to produce high-reward samples during the GRPO stage. These findings underscore the importance of the warm-up stage in providing a reasonable initialization for modality transitions and content generation.

| Warm-up | GRPO | MMIE | InterleavedBench |
|:----|:----|:----|:----|
| | | – | 0.51 |
| | ✓ | 17.84 | 1.25 |
| | ✓ (w/ initialization) | 25.96 | 1.58 |
| ✓ | ✓ | 59.5 | 3.13 |

To further investigate this phenomenon, we explored an alternative strategy under the pure GRPO setting by explicitly encouraging the model to sample modality-switching actions through modifications to the sampling policy and reward formulation. Specifically, during the initial training phase, we manually sampled an interleaved content trajectory by adjusting the sampling probability of modality-switching tokens. Additionally, we introduced a multimodal content reward to enhance its relative advantage within the group and incentivize the generation of interleaved content.

However, performance (Row 3) still fell short compared to the warm-up+GRPO setup, indicating that starting from a cold initialization hinders the model’s ability to discover meaningful interleaved structures solely through reinforcement signals. These results support our two-stage training design and highlight the critical role of the warm-up stage in stabilizing training and guiding the model toward effective interleaved generation.

Comment

Dear Reviewer 7qgK,

We sincerely appreciate the reviewer's time, and we would welcome further discussion to see whether our detailed explanations and additional results resolve the concerns. We have addressed all the thoughtful questions raised by the reviewer (e.g., directly training using GRPO without the warm-up stage), and we hope that our work's contribution and impact are better highlighted by our responses. As the discussion phase is nearing its end, it would be great if the reviewer could kindly check our responses and provide feedback with any further questions or concerns. We would be more than happy to address them. Thank you!

Best wishes,

Authors

Comment

Thank you for the detailed responses, which have addressed most of my concerns. I find the idea of leveraging a relatively small amount of data with RL to enable interleaved generation interesting. Although the paper would benefit from a clearer and more detailed explanation of the proposed method, given that the promised clarifications will be incorporated into the revised version, I would like to maintain my score.

Comment

We appreciate the reviewer's time and thank the reviewer again for the valuable comments and the positive score!

Final Decision

This paper addresses the challenging task of enabling vision-language models to generate interleaved sequences of text and images, which is crucial for applications such as visual storytelling and stepwise reasoning. The authors propose a reinforcement learning approach using Group Relative Policy Optimization (GRPO) complemented by a warm-up stage with curated hybrid datasets. Their novel hybrid reward function guides training to improve the quality and coherence of multimodal generation, which is empirically validated on relevant benchmarks like MMIE and InterleavedBench.

Strengths highlighted by the reviewers include the paper’s focus on an underexplored and significant problem, the practical and effective RL training strategy, and strong empirical performance gains over baselines. The construction of interleaved training data is well-designed and robust, and the authors have addressed many reviewer concerns during rebuttal, such as expanding experimental scope to stronger models and clarifying reward computations. The approach is noted for not requiring large-scale interleaved datasets, making it more accessible for practical use. However, novelty is somewhat limited due to the use of an existing RL algorithm framework, and theoretical justification is minimal. Reviewers also have concerns about generalizing reward models to out-of-distribution scenarios.

While one reviewer remained unconvinced about the framework’s novelty, others acknowledged the authors’ rebuttal efforts and improved their scores. Overall, the paper stands out for its practical contribution to multimodal interleaved generation despite incremental methodological novelty, warranting acceptance as a solid poster paper at the conference.