PaperHub
Overall rating: 7.3 / 10 (Spotlight), 3 reviewers
Individual ratings: 8, 8, 6 (min 6, max 8, std. dev. 0.9)
Confidence: 3.7
Correctness: 3.3
Contribution: 3.0
Presentation: 3.0
ICLR 2025

LLaVA-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

OpenReview | PDF
Submitted: 2024-09-23 | Updated: 2025-03-04

Abstract

Keywords
large language model, multimodal learning, interleaved image-text

Reviews and Discussion

Official Review
Rating: 8

The paper proposes LLaVA-NeXT-Interleave, which unifies the multi-image, multi-frame (video), multi-view (3D), and multi-patch (single-image) scenarios in one MLLM, which is very interesting. A dataset, namely the M4-Instruct dataset, and the LLaVA-Interleave Bench are proposed, which are very helpful for pushing the progress of MLLM research. Detailed ablation studies are provided, offering useful and meaningful insights.

Strengths

The paper attempts to unify different visual formats, such as multiple images, multiple frames, multiple views, and multiple patches, into one single MLLM, which can facilitate the application of MLLMs and improve their generalization ability in processing visual signals of different formats. With such unification, each visual format only needs to be processed into the corresponding template, which can thereby be learned and tuned on the MLLM.

The composed dataset and benchmark are very useful for the research of MLLM.

Some very interesting ablation studies are conducted, such as on the mixed interleaved data formats during training, as well as on how combining different data scenarios improves individual task performance.

Weaknesses

Some insights could be studied further.

  1. Mixed interleaved data formats can help improve performance. I am wondering about the underlying reasons. One possibility is that the different formats increase the diversity of the data. Another possibility is data insufficiency: with mixed formats, the data are effectively trained for two rounds (if I understand correctly), and the performance improves. The authors are highly encouraged to perform detailed studies.

Please specify the experimental settings and also provide results where the data are used in both formats. In this case, we can tell whether data sufficiency matters for the performance gain.

  2. The constructed data are used in the continual training stage. I am wondering, if we want to improve the ability of the MLLM to handle different visual formats, should we add more data (of different visual formats) to the stage-i pre-training, besides the single image-text pairs? Or is training only in the continual training stage sufficient to improve the MLLM's ability to handle multiple images, patches, frames, and views?

If possible, add the data in the pre-training stage. I am wondering whether there would still be a performance gain.

Questions

Please check the detailed information in the Weaknesses part.

Comment

Continued training from the image model vs. joint training.


Thank you for the valuable suggestion. We have conducted an ablation experiment to compare direct training with all data formats against continued training from a single-image pre-trained model, and we provide the detailed results below, grouped into multi-image, single-image, and video benchmarks. In these tables,

  • The first row represents direct training with a combination of single- and multi-image data.
  • The second row illustrates fine-tuning from single-image models (stage-i) using multi-image data and a subset of single-image data.
  • The results clearly indicate that continual training yields superior performance.
Multi-image Benchmarks

| Continue Training | Mantis-Eval | BLINK | Q-Bench | NLVR2 | ScanQA |
| --- | --- | --- | --- | --- | --- |
| from pretrain | 41.0 | 37.6 | 47 | 54.0 | 27.7 |
| from stage-i | 45.6 | 39.2 | 52 | 67.8 | 29.3 |

Single-image Benchmarks

| Continue Training | AI2D | ChartQA | DocVQA | MME* | POPE | ScienceQA-IMG |
| --- | --- | --- | --- | --- | --- | --- |
| from pretrain | 46.3 | 38.3 | 47.5 | 47.1 | 85.4 | 59.4 |
| from stage-i | 52.2 | 52.2 | 59.2 | 52.0 | 86.8 | 60.6 |

Video Benchmarks

| Continue Training | ActivityNet-QA (Acc/Score) | MVBench | Video Detailed Description | VideoChat-Correctness | VideoChat-Detail | VideoChat-Context | VideoChat-Temporal | VideoChat-Consistency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| from pretrain | 44.7 / 2.17 | 43.0 | 2.96 | 2.97 | 2.87 | 3.49 | 2.42 | 3.14 |
| from stage-i | 48.0 / 2.84 | 45.6 | 3.25 | 3.12 | 2.97 | 3.62 | 2.36 | 3.27 |
Comment

Thank you for your thoughtful comments and acknowledgment of our work. We have thoroughly responded to your feedback to address your concerns.

Mixed interleaved data formats.


We apologize for not providing sufficient details about our use of mixed data formats.

  • In our methodology, each data sample is randomly assigned to either an interleaved format or a front-loaded format, rather than being trained in both formats simultaneously. This approach ensures that the total number of training samples remains constant.
  • The observed improvement in performance is not a result of increased data quantity but rather the enhanced diversity introduced by this mixed-format strategy.
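To make this concrete, below is a minimal sketch of the per-sample assignment described above, assuming an `<image>` placeholder per visual input as in LLaVA-style templates; the function name, the 50/50 split, and the exact string layouts are illustrative assumptions rather than our released preprocessing code.

```python
import random

def format_sample(frames: list, question: str, rng: random.Random) -> str:
    """Randomly place one sample in either a front-loaded or an interleaved layout.

    Sketch only: the 50/50 split and the exact layouts are assumptions,
    not the released M4-Instruct preprocessing code.
    """
    if rng.random() < 0.5:
        # Front-loaded layout: all <image> placeholders first, then the text.
        return "<image>" * len(frames) + "\n" + question
    # Interleaved layout: each placeholder sits next to the text that names it.
    body = "\n".join(f"Image {i + 1}: <image>" for i in range(len(frames)))
    return body + "\n" + question

rng = random.Random(0)
print(format_sample(["img_a.jpg", "img_b.jpg"], "What changed between the two images?", rng))
```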

In addition, we conducted an ablation study to evaluate the impact of data sufficiency on multi-image performance.

  • Setting: We randomly sampled 25%, 50%, and 75% of the M4-Instruct dataset to fine-tune the LLaVA-NeXT-Image model.
  • Results: As summarized in the table below, as the data size increases, the model exhibits strong scaling behavior, highlighting the high quality and effectiveness of our M4-Instruct dataset. We also observe that the performance improvement from 0% to 50% is more substantial than the improvement from 50% to 100%.
  • Observation: The results suggest that, for multi-image capabilities, the enhancement of data diversity (achieved in the 0% to 50% range) in M4-Instruct is more critical than merely increasing data quantity.
| M4-Instruct | Avg | Spot the Difference | Image Edit Instruction | Visual Story Telling | Text-rich VQA | Multi-image VQA | Puzzle | Q-Bench | NLVR2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0% | 32.4 | 12.9 | 13.2 | 10.1 | 59.6 | 39.4 | 9.0 | 51.0 | 68.0 |
| 25% | 47.9 | 27.5 | 17.6 | 26.3 | 64.3 | 70.3 | 30.9 | 66.3 | 80.3 |
| 50% | 52.8 | 32.8 | 20.3 | 28.3 | 69.2 | 77.8 | 41.2 | 69.2 | 84.2 |
| 75% | 56.3 | 35.6 | 22.8 | 30.6 | 73.8 | 82.7 | 44.5 | 72.7 | 88.2 |
| 100% | 58.6 | 37.1 | 24.3 | 33.1 | 76.1 | 87.5 | 48.7 | 74.2 | 88.8 |
Comment

The authors have done additional experiments and addressed my concerns. I am standing by my previous rating.

Official Review
Rating: 8

This paper introduces LLaVA-NeXT-Interleave, a large multimodal model (LMM) designed to unify and enhance capabilities across multi-image, video, and 3D tasks while maintaining performance on single-image tasks. The authors propose using an interleaved image-text data format as a universal template to represent different computer vision scenarios. They compile a comprehensive dataset called M4-Instruct, consisting of over 1 million samples spanning four primary domains with 14 tasks and 41 datasets. Additionally, they create the LLaVA-Interleave Bench, a diverse set of benchmarks for evaluating multi-image performance. Extensive experiments demonstrate that LLaVA-NeXT-Interleave achieves state-of-the-art results across various benchmarks and exhibits emerging capabilities such as cross-task transfer.

Strengths

Innovative Unification Framework: The paper presents a novel approach by using an interleaved image-text format to unify multi-image, video, and 3D tasks within a single LMM framework. This unification is both practical and innovative, addressing a significant gap in the current research.

Comprehensive Dataset and Benchmark Creation: The compilation of the M4-Instruct dataset and the LLaVA-Interleave Bench provides valuable resources for the community. These datasets are extensive and cover a wide range of tasks and scenarios, enhancing the reproducibility and applicability of the research.

State-of-the-Art Performance: The proposed model achieves leading results across various benchmarks in multi-image, video, and 3D tasks while maintaining strong performance on single-image tasks. This demonstrates the effectiveness of the approach.

Emerging Capabilities: The model exhibits emerging capabilities such as cross-task transfer and zero-shot task composition, indicating strong generalization and adaptability to new scenarios.

Detailed Experimental Analysis: The paper includes extensive experiments and ablation studies that thoroughly evaluate the model's performance and validate the proposed techniques.

Weaknesses

Limited Details on Data Curation: While the paper introduces a large and diverse dataset, it provides limited information on the data curation process, quality control, and potential biases. More details on how data quality and diversity were ensured would strengthen the work.

Computational Efficiency Concerns: The paper does not thoroughly discuss the computational efficiency or resource requirements of the proposed model, especially when handling multiple modalities. Comparisons in terms of model complexity and inference speed with existing models would be beneficial.

Scope of Evaluation Metrics: The evaluation focuses mainly on quantitative performance metrics. Incorporating qualitative analyses or user studies could provide additional insights into the model's real-world applicability and limitations.

Scalability and Practical Deployment: Potential challenges related to scaling the model to larger datasets or deploying it in practical applications are not fully explored. Discussion on these aspects would enhance the paper's impact.

Questions

Data Curation Process: Can the authors provide more details on the data collection and curation process for the M4-Instruct dataset? Specifically, how were data quality and diversity ensured, and what measures were taken to mitigate potential biases?

Computational Resources: How does LLaVA-NeXT-Interleave perform in terms of computational efficiency compared to other state-of-the-art models? Are there any optimizations implemented to handle the increased complexity of processing multiple modalities?

Scalability: Have the authors explored the scalability of the model when trained on even larger datasets or when applied to more complex tasks? Are there any observed limitations or degradation in performance?

Generalization to Unseen Tasks: While the model demonstrates emerging capabilities, how well does it generalize to completely unseen tasks or modalities not included in the training data?

Comment

Scope of evaluation metrics.


  • We conducted qualitative evaluations to assess the emerging capabilities of our model, as presented in Tables 9–12 and 17 in our paper.
  • Additionally, we have included new examples in the appendix of our revised submission (Tables 18–20) to further illustrate these capabilities. While a user study would provide additional insights, due to time constraints, we plan to include such studies in future research.

Scalability.


We scaled our model from 0.5B to 14B parameters, with the 14B model representing a robust, large-scale LLM. To illustrate the scaling law, we created a figure (Figure 6, appendix of the revised submission) that demonstrates the model's performance across multi-image evaluations for both in-domain and out-domain metrics, averaged across tasks. The results indicate that scaling up LLMs significantly enhances performance. However, due to limited time and computational resources, further scaling will be explored in future research.

Generalization to Unseen Tasks.


We highlighted emerging capabilities of our model through qualitative visualizations.

  • For unseen tasks, we conducted evaluations on out-domain data, as shown in Table 1 of our paper. The results demonstrate that our model achieves state-of-the-art performance compared to prior models. However, defining "completely unseen" tasks is inherently challenging, and qualitative evaluations are limited by these ambiguities.
  • For now, we demonstrate the model's capabilities on certain unseen tasks via visualization (Tables 9–12 and 17) and have included additional examples in the appendix of our revised submission (Tables 18–20). We will continue to expand this type of evaluation in future research.

Regarding unseen modalities, our current focus is solely on vision and language modalities. Research on other modalities is planned for future work.

Comment

Computational efficiency.


Our model training was conducted using 32 A100 GPUs.

For inference speed, we note that since the vision encoder is relatively small compared to the language model, the primary factors affecting performance are the size of the language model and the number of input tokens. The model's complexity scales with the number of input tokens, which includes both image and language tokens. Thus, inference speed and model complexity are more closely tied to token count than to other factors.

To evaluate the inference speed for multi-image scenarios, we based our analysis on the Qwen-7B model, using an average of six images as input. We compared this with two existing models: the single-image model LLaVA-Next-Image and the multi-image model Mantis. The inference speed results are provided below.

It is worth noting that LLaVA-Next-Image employs the AnyRes technique, which typically divides an image into four patches and combines them with the original image, effectively processing five images in total. Despite this, our results indicate that our inference speed is comparable to existing models. Moving forward, we aim to explore more efficient methods for processing multiple images to further optimize performance.

| Model | Language Model | #Images | #Tokens Per Image | #Total Tokens | Seconds/Sample |
| --- | --- | --- | --- | --- | --- |
| LLaVA-Next-Image | Qwen-7B | 1 | 729 x 5 | 3782 | 2.3 |
| LLaVA-Next-Interleave | Qwen-7B | 6 | 729 | 4470 | 2.5 |
| Mantis | LLaMA-8B | 6 | 576 | 3552 | 2.1 |
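(As a rough check, the totals decompose approximately as #Images × #Tokens Per Image plus the text tokens of the prompt: 6 × 729 = 4374 image tokens plus about 96 text tokens for LLaVA-Next-Interleave, 6 × 576 = 3456 plus about 96 for Mantis, and 5 × 729 = 3645 plus about 137 for LLaVA-Next-Image with AnyRes.)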
Comment

We genuinely appreciate your insightful comments and recognition of our work. We have carefully addressed your feedback with detailed responses and have updated the relevant sections in the revised manuscript to address your concerns.

Details on data curation, quality control, and potential bias.


We apologize for any missing details in the description of our data construction process. In the M4-Instruct dataset, we not only aggregated existing datasets but also created new ones through innovative methods, such as leveraging GPT-4 Vision (GPT-4-V) prompts. Notable contributions include the construction of the MMChat_Twitter_Post, ScanQA, and MagicBrush datasets.

  • For the MMChat_Twitter_Post dataset, we utilized images sourced from an existing social media dataset MMChat. These images were input into GPT-4-V, paired with a language prompt specifically designed to generate Twitter-like posts. This process allowed us to create realistic and contextually relevant data entries, enhancing the overall dataset quality.

  • For the ScanQA dataset, since the original ScanQA only includes question-answer pairs paired with 3D point clouds, we sourced the earlier ScanNet dataset and identified the mapping between each point cloud and its corresponding video. From each video, we uniformly sampled 16 frames and prompted GPT-4-V to refine the aligned question-answer pairs in ScanQA, constructing multi-image instructions (a minimal sketch of this frame sampling follows this list).

  • For the MagicBrush dataset, we transformed its original image editing task into various conversational formats. In general, we concatenate the source and target images in the instruction as the condition and set the task as predicting the corresponding editing prompt. For cases where multiple edits are required between two images, we either structure the data into multi-turn conversations, enabling models to progressively interpret the differences, or prompt GPT-4-V to generate a consolidated editing instruction encapsulating several editing points.
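As referenced above, here is a minimal sketch of the uniform 16-frame sampling used when constructing the ScanQA multi-image instructions; the function and the center-of-segment rule are illustrative assumptions rather than our exact sampling code.

```python
def uniform_frame_indices(total_frames: int, num_frames: int = 16) -> list:
    """Pick `num_frames` indices spread evenly over a video of `total_frames` frames.

    Sketch only: the center-of-segment rule is an assumption; the exact
    sampling used for M4-Instruct may differ slightly.
    """
    if total_frames <= num_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    # Take the center of each of the `num_frames` equal-length segments.
    return [int(step * (i + 0.5)) for i in range(num_frames)]

# Example: a 480-frame ScanNet clip -> [15, 45, 75, ..., 465]
print(uniform_frame_indices(480))
```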

It is important to note that we did not directly incorporate existing datasets into our new combined dataset. Instead, we implemented a rigorous data curation process to ensure quality and consistency, which included the following steps:

1. Task Description and Ground-Truth Conversion:

  • Generated a task-specific prompt to describe each dataset.
  • Converted ground-truth labels into natural language answers for consistency.

2. Quality Assessment of Candidate Datasets:

  • Sampled approximately 100 entries from each candidate dataset.
  • Evaluated the quality and relevance of the dataset for training.
  • Rejected datasets deemed low quality or irrelevant.

3. Filtering of Selected Datasets:

  • Removed samples that did not meet token length requirements (e.g., entries with fewer than two images or more than 15 images).

4. Dataset Splitting:

  • Divided the newly collected dataset into training and validation subsets.

5. Integration and Performance Evaluation:

  • Trained the new dataset alongside previously verified datasets.
  • Evaluated its impact on performance for both existing and new tasks.
  • Revisited data samples if the new dataset negatively affected performance or exhibited poor validation performance.

6. Conflict Resolution and Refinement:

  • Resolved task/prompt conflicts between the new and existing datasets.
  • Double-checked data quality during this process.

7. Rebalancing Dataset Sizes:

  • Adjusted the representation of datasets to ensure proportional and meaningful contributions to the combined dataset.

Following these procedures, the final dataset was rigorously optimized to maintain performance on previous tasks while excelling on new tasks.
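To illustrate the image-count part of the filtering in step 3, a minimal sketch is given below; the `images` field and the dictionary schema are hypothetical and do not reflect the actual M4-Instruct data format.

```python
def keep_sample(sample: dict, min_images: int = 2, max_images: int = 15) -> bool:
    """Keep only samples whose image count lies in the supported range.

    Sketch only: the `images` field name is hypothetical, not the
    actual M4-Instruct schema.
    """
    return min_images <= len(sample.get("images", [])) <= max_images

samples = [
    {"images": ["a.jpg"]},                    # dropped: fewer than two images
    {"images": ["a.jpg", "b.jpg", "c.jpg"]},  # kept
]
print(sum(keep_sample(s) for s in samples))  # 1
```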

Regarding potential biases, all our data were collected from publicly available datasets, and we did not impose any preferences that might exacerbate biases. However, we recognize the importance of addressing biases in datasets and will investigate this issue further in future work.

We believe that this new dataset represents a valuable resource for future research, thanks to its enhanced quality, diversity, and comprehensive nature.

Official Review
Rating: 6

The authors collect an interleave-formatted dataset from existing multi-image, video, multi-view, and image datasets. After that, they fine-tune an LMM from an existing single-image LMM to support multiple multi-image scenarios based on their collected dataset. Moreover, they curate a multi-scenario interleave benchmark to comprehensively evaluate an LMM's interleave capabilities.

Strengths

In addition to the clear writing and logical flow of this work, I have outlined its strengths below:

  1. This work presents a solid benchmark for comprehensively evaluating the interleaving capabilities of LMMs. This does help the LMM community.

  2. The experiments in this work are thorough, providing an excellent baseline for future research in this field.

Weaknesses

In my view, the technical innovation of this work is limited. I have outlined more detailed reasons below.

  1. The proposed M4-Instruct dataset is merely a combination of several existing datasets, lacking sufficient technical innovation in its construction process.

  2. The model proposed in this paper adopts the architecture and even checkpoints of previous methods, with only fine-tuning applied to the collected dataset. Technical innovation in the training process is also quite limited.

  3. Although this work primarily follows a path of data collection and model fine-tuning, it only validates the data's effectiveness on a single model, LLaVA-Next. The authors should supplement their study by applying the M4-Instruct data to other 7B-sized single-image LMMs, such as InternVL-1.5, Cambrian, and MiniCPM-V2.5.

Questions

See the weaknesses above.

Comment

Technical innovation during training.


Thanks for this insightful question. We summarize our response below.

  1. Unified Training Formats:
  • Our main contribution is to integrate diverse data formats, including multi-image, video, 3D, and single-image, into a single interleaved training framework.
  2. Preservation of Model Architecture for Ease of Adaptation:
  • We achieved this unification without modifying the existing model architecture, ensuring compatibility and adaptability across other models.
  3. Simplicity as a Strength:
  • We prioritized simplicity over unnecessary complexity, providing practical advantages in implementation and scalability.
  4. Recognition of Novelty:
  • The novelty and effectiveness of the unified framework were acknowledged by Reviewer jqfW (Strength 1).
Comment

Applying the M4-Instruct data to other 7B-sized single-image LMMs.


Thank you for this valuable question. Given that many multi-modal LLMs adopt a LLaVA-like architecture, we first conducted a comparison between Qwen and LLaMA within our model framework. Additionally, as some of the models you mentioned do not have open-source training code, we performed an ablation study on ShareGPT4V and SPHINX by training them with our M4-Instruct data. The results, shown below, demonstrate that our constructed dataset provides substantial benefits and utility when applied to other models.

| Model | Language Model | Setting | Spot the Difference | Image Edit Instruction | Visual Story Telling | Text-rich VQA | Multi-image VQA | Puzzle | Q-Bench | NLVR2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-Next-Image | Qwen-7B | Zero-shot | 12.9 | 13.2 | 10.1 | 59.6 | 39.4 | 9.0 | 51.0 | 68.0 |
| LLaVA-Next-Interleave | Qwen-7B | SFT with M4-Instruct | 37.1 | 24.3 | 33.1 | 76.1 | 87.5 | 48.7 | 74.2 | 88.8 |
| LLaVA-Next-Image | LLaMA-8B | Zero-shot | 17.1 | 11.4 | 10.2 | 60.1 | 45.5 | 9.6 | 51.7 | 68.8 |
| LLaVA-Next-Interleave | LLaMA-8B | SFT with M4-Instruct | 36.8 | 24.6 | 33.4 | 76.0 | 86.9 | 48.5 | 73.6 | 86.3 |
| SPHINX-MoE-Image | Mixtral-8×7B | Zero-shot | 8.4 | 9.7 | 8.8 | 42.3 | 31.9 | 9.5 | 40.2 | 54.2 |
| SPHINX-MoE-Interleave | Mixtral-8×7B | SFT with M4-Instruct | 26.5 | 18.7 | 26.2 | 70.3 | 75.4 | 44.9 | 64.1 | 82.7 |
| ShareGPT4V-Image | Vicuna-7B | Zero-shot | 10.3 | 11.2 | 10.1 | 45.5 | 34.3 | 9.2 | 42.8 | 52.3 |
| ShareGPT4V-Interleave | Vicuna-7B | SFT with M4-Instruct | 32.4 | 25.4 | 27.8 | 66.8 | 79.8 | 48.8 | 68.4 | 80.6 |
Comment

Thanks for your responses. My main concerns are addressed. So I will raise my score to 6.

Comment

We sincerely appreciate your valuable comments and recognition of our work. We have provided detailed responses to your comments and updated the relevant content in the revised manuscript, hoping to address your concerns.

Technical innovation in data.


We apologize for any lacking details about the technical innovation in our data. In the M4-Instruct dataset, we not only aggregated existing datasets but also created new ones through innovative methods, such as leveraging GPT-4 Vision (GPT-4-V) prompts. Notable contributions include the construction of the MMChat_Twitter_Post, ScanQA, and MagicBrush datasets.

  • For the MMChat_Twitter_Post dataset, we utilized images sourced from an existing social media dataset MMChat. These images were input into GPT-4-V, paired with a language prompt specifically designed to generate Twitter-like posts. This process allowed us to create realistic and contextually relevant data entries, enhancing the overall dataset quality.

  • For the ScanQA dataset, since the original ScanQA only includes question-answer pairs paired with 3D point clouds, we sourced the earlier ScanNet dataset and identified the mapping between each point cloud and its corresponding video. From each video, we uniformly sampled 16 frames and prompted GPT-4-V to refine the aligned question-answer pairs in ScanQA, constructing multi-image instructions.

  • For the MagicBrush dataset, we transformed its original image editing task into various conversational formats. In general, we concatenate the source and target images in the instruction as the condition and set the task as predicting the corresponding editing prompt. For cases where multiple edits are required between two images, we either structure the data into multi-turn conversations, enabling models to progressively interpret the differences, or prompt GPT-4-V to generate a consolidated editing instruction encapsulating several editing points.

It is important to note that we did not directly incorporate existing datasets into our new combined dataset. Instead, we implemented a rigorous data curation process to ensure quality and consistency, which included the following steps:

1. Task Description and Ground-Truth Conversion:

  • Generated a task-specific prompt to describe each dataset.
  • Converted ground-truth labels into natural language answers for consistency.

2. Quality Assessment of Candidate Datasets:

  • Sampled approximately 100 entries from each candidate dataset.
  • Evaluated the quality and relevance of the dataset for training.
  • Rejected datasets deemed low quality or irrelevant.

3. Filtering of Selected Datasets:

  • Removed samples that did not meet token length requirements (e.g., entries with fewer than two images or more than 15 images).

4. Dataset Splitting:

  • Divided the newly collected dataset into training and validation subsets.

5. Integration and Performance Evaluation:

  • Trained the new dataset alongside previously verified datasets.
  • Evaluated its impact on performance for both existing and new tasks.
  • Revisited data samples if the new dataset negatively affected performance or exhibited poor validation performance.

6. Conflict Resolution and Refinement:

  • Resolved task/prompt conflicts between the new and existing datasets.
  • Double-checked data quality during this process.

7. Rebalancing Dataset Sizes:

  • Adjusted the representation of datasets to ensure proportional and meaningful contributions to the combined dataset.

Following these procedures, the final dataset was rigorously optimized to maintain performance on previous tasks while excelling on new tasks. Therefore, we believe the data curation itself is also innovative.

AC Meta-Review

This paper proposes LLaVA-NeXT-Interleave, a unified model to tackle multi-image, multi-frame, multi-view, and multi-patch scenarios in LMMs. The paper curates the M4-Instruct dataset, containing around 1M samples across 4 primary domains with 14 tasks. The developed unified model is comprehensively evaluated and shows improvements over prior art on multi-image, multi-frame, and multi-view tasks.

Strength:

  1. As agreed by all reviewers, the proposed M4-Instruct dataset and benchmark are solid, comprehensive, and useful for MLLM research.
  2. The unification of the framework is innovative.
  3. The experiments are thorough.
  4. The model achieves state-of-the-art performance.

Weakness:

  1. Reviewers raised some weaknesses, and the authors addressed them in the discussion. Reviewers bKSe and 5CHq mentioned that their concerns are addressed, and the concerns from reviewer jqfW are also answered.
  2. The template used for this paper is slightly different from the template of ICLR draft. As the difference is small, we do not take this into consideration while making decisions. However, the authors should pay attention to the template usage for paper submission.

Additional Comments from the Reviewer Discussion

Reviewer bKSe mentioned the following weaknesses of the paper:

  1. There is not enough technical innovation in constructing the M4-Instruct dataset.
  2. There is no technical contribution to the model development.
  3. The method is only tested on a single model.

The authors provided more details of their effort in curating the dataset to address point 1. They also emphasized the novelty of the model training, including the unified training formats and the preservation of the model architecture, to address point 2. The authors applied the proposed dataset to other models to address point 3. The reviewer acknowledged that the concerns have been addressed.

Reviewer 5CHq pointed out that:

  1. The paper does not provide insights about why the proposed dataset could help improve the performance.
  2. It is unclear whether using the proposed dataset during pre-training would also improve the performance.

The authors addressed point 1 by providing additional experiments demonstrating that the improvement comes from data diversity. They also provided an additional experiment showing that using the proposed dataset in continual training is better than using it in the pre-training stage. The reviewer acknowledged that the concerns have been addressed.

Reviewer jqfW asked the authors to provide more details, including:

  1. Details of data curation.
  2. Discussion about computational efficiency.
  3. Scope of evaluation.
  4. Scalability and practical deployment.

The authors provided more details about the above four points. From my point of view, these details address these concerns.
Final Decision

Accept (Spotlight)