CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
Abstract
Reviews and Discussion
This paper is motivated by the fact that traditional scaling approaches are computationally expensive and overlook the significance of efficiently improving model capabilities from the vision side. The authors introduce CuMo, a method to enhance MLLMs by incorporating sparsely-gated Mixture-of-Experts (MoE) blocks into the vision encoder and MLP connector. CuMo employs a three-stage training process and has achieved impressive results across multiple benchmarks.
Strengths
- This paper is well-written, and the proposed method is simple and easy to follow.
- This paper has addressed an important issue: traditional MoE-based MLLMs are computationally expensive and overlook the vision encoder.
- CuMo achieves good performance across various benchmarks.
Weaknesses
- The authors employ the MoE method in the vision encoder and connector but do not explain why this approach enhances the model's capabilities.
- The MoE method inherently increases the number of trainable parameters to some extent. The authors should examine whether merely increasing the parameters of the vision encoder alone would also enhance the model's capabilities; this would provide readers with valuable insight.
- CuMo's LLM is mainly limited to Mixtral-8×7B. Since this paper primarily explores scaling up the vision encoder and connector, I think the authors should include comparisons with other LLM backbones, such as Qwen [1] and LLaMA [2], to demonstrate that the improvements are not due to differences in the inherent capabilities of the LLM itself.
- Since CuMo uses three-stage training, the increased amount of training data may lead to an unfair comparison with other baselines, such as LLaVA-1.5.
[1] Qwen Technical Report.
[2] LLaMA: Open and Efficient Foundation Language Models.
Questions
See weaknesses.
Limitations
The authors mention that the main limitation of this work is the hallucination problem and indicate that it will be improved in future work.
Q1: Why MoE in the vision encoder and connector enhances the model's capabilities
A1: MoE has been widely used in LLMs to improve the capacity of text generation [1], as it increases model size during training while keeping costs low at inference. In our work, we apply MoE to the vision encoder and connector of multimodal LLMs, which has the potential to generate more versatile visual tokens and further improve model capabilities on many vision-language instruction-following tasks. We verify this assumption with detailed ablation studies on the effectiveness of the proposed MLP-MoE and CLIP-MoE.
[1] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
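For readers who want a concrete picture of the block being discussed, here is a minimal PyTorch sketch of a top-2-of-4 sparsely-gated MoE MLP of the kind that can replace a dense MLP in the connector or inside a ViT block. The class name and dimensions are illustrative assumptions, not the CuMo implementation.

```python
import torch
import torch.nn as nn

class SparseMoEMLP(nn.Module):
    """Top-k sparsely-gated MoE that can stand in for a dense MLP block."""
    def __init__(self, dim=1024, hidden=4096, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                    # x: (num_tokens, dim)
        gates = self.router(x).softmax(dim=-1)               # routing probabilities
        weights, idx = gates.topk(self.top_k, dim=-1)        # keep the top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                        # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

tokens = torch.randn(8, 1024)
print(SparseMoEMLP()(tokens).shape)                          # torch.Size([8, 1024])
```

Only the selected experts run for each token, which is why such a block adds parameters during training while keeping per-token compute at inference close to that of a dense MLP.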
Q2: Whether merely increasing the parameters of the vision encoder alone will also enhance the model's capabilities.
| Model | ImageNet Acc. | Res. | Params | TextVQA | MMVet | SEED |
|---|---|---|---|---|---|---|
| CLIP-ViT-L | 76.2 | 336 | 0.30B | 57.6 | 32.1 | 66.4 |
| SigLIP-SO400M | 83.1 | 384 | 0.43B | 58.1 | 32.5 | 67.5 |
| CLIP-ViT-H | 78.0 | 224 | 2.52B | 49.2 | 29.5 | 58.2 |
A2: It depends. We compare the CLIP-ViT-L encoder with two larger vision encoders, CLIP-ViT-H and SigLIP-SO400M. CLIP-ViT-H has roughly 8x more parameters than CLIP-ViT-L yet performs much worse due to its low input resolution. SigLIP-SO400M has 0.13B more parameters and performs consistently better than CLIP-ViT-L. Similar findings have also been reported in recent multimodal LLMs [2,3]: model size is not the top factor affecting the overall performance of multimodal LLMs.
[2] MM1: Methods, Analysis, and Insights from Multimodal LLM Pre-training
[3] Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
| Model | TextVQA | MMVet | SEED |
|---|---|---|---|
| SigLIP-SO400M | 58.1 | 32.5 | 67.5 |
| + CLIP-MoE & MLP-MoE | 59.4 | 34.1 | 69.8 |
We further apply CLIP-MoE and MLP-MoE to the stronger SigLIP-SO400M vision encoder, and they still bring improvements on top of this larger and more powerful vision encoder.
Q3: CuMo's LLM is mainly limited to Mixtral-8x7B, and the authors should include comparisons with other LLM backbones.
| Model | TextVQA | MMVet | SEED |
|---|---|---|---|
| Mistral-7B | 57.6 | 32.1 | 66.4 |
| + MLP-MoE & CLIP-MoE | 59.3 | 34.3 | 69.6 |
| Vicuna-7B-v1.5 | 58.2 | 30.5 | 66.1 |
| + MLP-MoE & CLIP-MoE | 59.4 | 32.6 | 68.5 |
A3: CuMo mainly focuses on Mistral-7B and Mixtral-8x7B. Our ablations are based on Mistral-7B and verify the effectiveness of CLIP-MoE and MLP-MoE separately. As shown in the table above, we further evaluate CuMo with Vicuna-7B-v1.5 plus CLIP-MoE and MLP-MoE, which confirms their effectiveness on another LLM backbone.
Q4: Since CuMo uses three-stage training, the increased amount of training data may lead to an unfair comparison with other baselines, such as LLaVA-1.5.
A4: In Table 2, we maintain a fair comparison with LLaVA-1.5 under the same training data, and CuMo-7B consistently performs better than LLaVA-1.5 with either Vicuna-7B or Mistral-7B. In Table 1, we compare CuMo with models trained on various datasets, including private datasets, such as LLaVA-NeXT and MM1, while CuMo is trained on fully open-sourced datasets. We may remove LLaVA-1.5 from Table 1 and keep LLaVA-NeXT for comparison in the updated version to avoid confusion.
Thanks for the authors' response. Part of my concerns have been addressed during the rebuttal. I would like to raise my score.
The paper introduces CuMo, a novel approach to enhancing multimodal large language models (LLMs) by integrating sparsely-gated Mixture-of-Experts (MoE) blocks into both the MLP connector and vision encoder. CuMo addresses the challenge of scaling multimodal LLMs effectively by leveraging MoE's efficiency in parameter usage during training and inference. Key contributions include a detailed exploration of MoE integration strategies across different components of multimodal LLMs, a three-stage training methodology to stabilize model training, and the introduction of auxiliary losses for expert load balancing. Experimental results demonstrate that CuMo outperforms existing state-of-the-art multimodal LLMs on various benchmarks, showcasing its efficacy in enhancing model performance while managing computational efficiency.
Strengths
Innovative Integration of MoE: Integrating sparsely-gated Mixture-of-Experts (MoE) blocks into both the MLP connector and vision encoder of multimodal large language models (LLMs) represents a novel approach. This approach is not only innovative in terms of architecture but also in its application to enhance multimodal understanding.
Experimental Rigor: The authors provide a thorough experimental evaluation, comparing CuMo against state-of-the-art models on multiple benchmarks. This comprehensive evaluation includes ablation studies, which validate the effectiveness of their proposed approach.
Weaknesses
While the paper touches upon the scalability benefits of MoE, there is limited discussion of the computational efficiency and training time required to integrate MoE blocks into multimodal LLMs. This is crucial, as MoE designs can introduce additional computational costs during training.
While the paper focuses on performance metrics, there's a lack of discussion on the interpretability of MoE decisions within CuMo. Understanding how MoE blocks make decisions and whether they exhibit consistent behavior across different inputs is crucial for deploying models in real-world applications.
The paper shows that CuMo with Mixtral-8×7B does not consistently outperform Mini-Gemini on several datasets, particularly TextVQA and MME. This discrepancy could be due to architectural differences, especially Mini-Gemini's specialization in high-resolution inputs. The authors should conduct a comparative study with non-MoE architectures like Vicuna or Llama 3, focusing solely on CuMo's projection MoE and vision encoder MoE. This would provide insight into how CuMo's MoE-based approach performs against architectures that do not leverage MoE.
Table 1 highlights CuMo's results on datasets like VQAv2, SEED-Img, and MMBench even though it does not consistently achieve the best performance there, which can mislead readers about CuMo's comparative performance against other models.
The experiments primarily focus on Mistral or Mixtral-8×7B architectures with MoE integration. There's a lack of exploration into how CuMo's MoE-based enhancements compare against non-MoE architectures like Vicuna or Llama3.
Questions
See weaknesses.
Limitations
See weaknesses.
Q1: Computational efficiency and training time of MoE.
| CuMo variant | CLIP params | MLP params | LLM params | Total params | Training time |
|---|---|---|---|---|---|
| Mistral-7B | 0.30B | 0.025B | 7.25B | 7.58B | ~16h |
| + Top 2-in-4 MLP-MoE | 0.30B | 0.10B | 7.25B | 7.65B | ~16h |
| + Top 2-in-4 CLIP-MoE | 0.91B | 0.10B | 7.25B | 8.26B | ~20h |
A1: We provided the breakdown of additional parameters during training in Table 6, and here we further include the training time of CuMo-Mistral-7B with MLP-MoE and CLIP-MoE, measured on a single 8xA100 machine with the LLaVA-665K data for reference. Note that we only used model parallelism in DeepSpeed for the implementation; training can be faster if expert parallelism is further added.
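As a back-of-the-envelope illustration of the table (not the exact accounting used in Table 6), a top-2-in-4 MoE replicates the upcycled dense module four times during training while only two experts run per token at inference; router parameters are ignored here for simplicity.

```python
# Rough parameter accounting for a top-2-in-4 MoE upcycled from a dense module
# (illustrative only; router parameters ignored, numbers in billions as in the table above).
def moe_params(dense_params: float, num_experts: int = 4, top_k: int = 2):
    total = dense_params * num_experts   # every expert starts as a copy of the dense module
    active = dense_params * top_k        # only the top-k experts run per token at inference
    return total, active

print(moe_params(0.025))  # MLP connector: (0.1, 0.05) -> matches the 0.10B total above
# For CLIP-MoE the total grows by less than num_experts x (0.30B -> 0.91B in the table),
# presumably because only the MLP sub-layers inside each ViT block gain experts.
```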
Q2: Lack of discussion on the interpretability of MoE decisions within CuMo.
| Subset | Layer ID | Top 1 Expert Ratio |
|---|---|---|
| OCR | 8 | 31.54% |
| Color | 7 | 33.97% |
| Code | 18 | 34.49% |
| Reasoning | 1 | 35.01% |
A2: Following Section 5 of Mixtral-8x7B [1], we performed the expert distribution analysis in Figure 4, which shows that the experts are loaded in a fairly balanced way across the layers overall. We further used the topic-based subsets of images in MME and found that they show a preference toward certain experts in some layers, which may imply hidden patterns in how experts are assigned depending on the application or topic. We may add these results to Section 4.4 as part of the analysis.
[1] Mixtral of Experts, https://arxiv.org/abs/2401.04088
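For reference, the per-layer top-1 expert ratios above can be computed from logged router outputs roughly as follows; this is an illustrative sketch with assumed shapes, not the authors' exact analysis code.

```python
import torch

def top1_expert_ratio(router_logits: torch.Tensor) -> float:
    """router_logits: (num_tokens, num_experts) collected from one MoE layer over a
    subset of images; returns the share of tokens captured by the most-used expert."""
    top1 = router_logits.argmax(dim=-1)                               # top-1 expert per token
    counts = torch.bincount(top1, minlength=router_logits.shape[-1])  # tokens routed to each expert
    return (counts.max() / counts.sum()).item()

# With 4 experts, a perfectly uniform router gives ~25%; the ~31-35% ratios in the
# table indicate a mild per-topic preference toward one expert in those layers.
print(top1_expert_ratio(torch.randn(10_000, 4)))
```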
Q3: The paper shows that Mixtral-8×7B's CuMo does not consistently outperform Mini-Gemini on several datasets, particularly on TextVQA and MME.
A3: CuMo-Mixtral-8x7B is better than Mini-Gemini-Mixtral-8x7B on MME (+0.5), MMMU (+3.2), and MM-Vet (+2.9), while worse on TextVQA (-3.2) and MMBench (-0.3), as shown in Table 1. One main reason is that Mini-Gemini takes high-resolution inputs, and benchmarks like TextVQA are sensitive to input resolution, as shown in Tables 2 & 3 of Mini-Gemini.
Q4: The experiments primarily focus on Mistral or Mixtral-8×7B architectures with MoE integration. The author should conduct a comparative study with non-MoE architectures like Vicuna or Llama3, focusing solely on CuMo's projection MoE and Vision Encoder MoE.
| Model | TextVQA | MMVet | SEED |
|---|---|---|---|
| Mistral-7B | 57.6 | 32.1 | 66.4 |
| + MLP-MoE & CLIP-MoE | 59.3 | 34.3 | 69.6 |
| Vicuna-7B-v1.5 | 58.2 | 30.5 | 66.1 |
| + MLP-MoE & CLIP-MoE | 59.4 | 32.6 | 68.5 |
A4: Our ablation studies are mainly based on Mistral-7B, a non-MoE LLM, to verify the effectiveness of MLP-MoE and CLIP-MoE. We further evaluate CuMo with Vicuna-7B-v1.5 under the same LLaVA-665K training data, as shown in the table above; CLIP-MoE and MLP-MoE also bring improvements over the Vicuna-7B-v1.5 baseline.
Q5: Table 1 highlights are misleading.
A5: Thanks for the suggestion. We'll revise that and highlight the best performance across models in each section of Table 1.
The paper presents upcycling for large multimodal models (LMMs). It specifically looks at how to enable upcycling for the different components of an auto-regressive multimodal model (e.g., LLaVA). It shows that the MLP connector and vision encoder (in this case, CLIP) are the two modules that should be upcycled and that, instead of relying on upcycled LLMs, it is better to use pretrained MoE models (in their specific example, upcycled Mistral vs. Mixtral). The paper outlines the training recipe, which is based on 3 stages, with the first focusing on obtaining a stable multimodal model and the next two incorporating their CuMo recipe. To ensure stability and balance across the introduced experts, the authors enabled both the load balancing loss for the experts and the router z-loss (as suggested in the ST-MoE paper) as auxiliary losses. They apply these auxiliary losses to the two upcycled MoE modules separately. This is followed by a detailed training recipe and evaluation (both qualitative and quantitative) and ablations that explain their design choices.
Strengths
- The paper outlines a clear recipe for upcycling in the context of multimodal models, which has not been explored in literature before.
- The authors support their design choices through well-designed ablations - particularly for the MoE blocks (which form the core of their method) - for the MLP connector and CLIP model. They show the benefits of using pretrained MoE LLM models over upcycling a dense LLM.
- They present the benefits of using different auxiliary losses (load balancing loss + router z-loss, which they term bzloss) to train the upcycled model; a reference sketch of these losses is given after this list.
- The paper relies on fully open datasets, and it presents all training settings and hyper-parameters, enabling the reproduction of its results.
- The authors show quantitative results demonstrating that, for a similar number of active parameters during inference, the models are competitive with other popular LMMs of similar sizes and outperform them on some benchmarks. They show this with two different LLMs (Mistral + Mixtral), demonstrating that their method is composable with different LLMs. For results that use the GPT-4 API, they report the average of 3 API calls to calibrate their results. They give a full breakdown of how the number of active parameters is computed for their models based on the upcycled components.
- They follow this up with a qualitative analysis of the balance among experts (relying on the bzloss), observing an approximately equal distribution of tokens across the experts. They also show some dialogue examples based on sample images for 3 different LMMs, highlighting the benefits of using their method.
- For some of their training recipes, they show studies of using high-fidelity data and how relying on the latest training methods, such as multi-resolution features, helps boost their overall model quality.
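For reference, here is a minimal sketch of the two auxiliary losses mentioned above in the Switch-Transformer/ST-MoE style; this is not necessarily the paper's exact formulation, and the shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """Encourages tokens to be spread evenly across experts
    (fraction of tokens dispatched to each expert * mean router probability)."""
    probs = router_logits.softmax(dim=-1)                        # (num_tokens, num_experts)
    num_experts = probs.shape[-1]
    frac_tokens = F.one_hot(probs.argmax(dim=-1), num_experts).float().mean(dim=0)
    frac_probs = probs.mean(dim=0)
    return num_experts * torch.sum(frac_tokens * frac_probs)

def router_z_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """Penalizes large router logits for numerical stability (ST-MoE)."""
    return torch.logsumexp(router_logits, dim=-1).pow(2).mean()

logits = torch.randn(64, 4)                                      # dummy router logits
aux_loss = load_balancing_loss(logits) + router_z_loss(logits)   # bzloss-style auxiliary term
```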
Weaknesses
- One main criticism of the upcycling setup is the limit on the gains from upcycling. The original setting [1] compares the upcycled model to the original MoE model in a 100% dense compute setting and shows that it takes ~20% additional capacity to catch up to the upcycled model (in the LLM setting). While it does take more compute for the ViT models, do the authors have any intuition on when training MoE-based ViTs (such as V-MoE [2]) and MoE connectors from scratch will be much better / potentially start outperforming the dense-only case?
- Another criticism of the setup is the diminishing gains as the base model size increases (see Figures 2 & 3 in [1]). Do the authors have an intuition of how this will apply to the CuMo setup? If we scale the CLIP-ViT model or use a much stronger model like SigLIP-ViT, will the gains from CuMo still hold? Please note that I'm not expecting new experiments or results here, just a sense of the authors' intuition.
[1] Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints, https://arxiv.org/abs/2212.05055
[2] Scaling Vision with Sparse Mixture of Experts (V-MoE)
Questions
Suggestions:
- In Table 1, please highlight all the best numbers, not just where the CuMo model is the best (both for the 7B dense & MoE sections)
- In Table 2, TextVQA is highlighted as the best with the CuMo model, but the QwenVL model has a higher score. How is the highlighting done in this case - if it is still the "best" result (despite # data samples used), then it will be good to highlight the appropriate model.
- For both table captions, it would be good to mention "best results are highlighted in bold" so that it is clear to the user.
- For Figure 6, please mention which model generates the responses - CuMo with Mistral or Mixtral.
- In the Appendix intro paragraph, the authors mention the "M3" model in the last line, but no other reference has been made to this model before. It would be good to clarify this.
Questions:
- The authors mention a non-society license in the impacts section - is there a specific license being applied that the authors cannot speak about as it will break double-blind? Or is it a general enough license so it is well understood how the release is regulated?
- To clarify, the authors mention only the datasets in the checklist (question 5), but in the impacts section, they mention releasing both code and model. It would be good to be consistent on this front.
- For the usage guidelines (question 11), there is nothing directly mentioned in the paper. Is the idea that the license, when released, will have clear guidelines for this?
Limitations
- The authors list hallucination as the main limitation of the model, which they propose can be mitigated using RLHF to help improve reliability. While RLHF can help to some extent with the hallucination problem in the context of multimodal LLMs, it primarily aligns models with human preferences on tasks of interest (e.g., improving captioning capabilities). Additional systems such as RAG also play a role in this case. I'd recommend potentially mentioning this.
- In a similar vein to RLHF, since the model has not been aligned, I'd also recommend that the authors mention the potential for biased / non-helpful outputs from their model.
Q1: The limit on the gains from upcycling, and when will training MoE-based ViTs or MoE connectors from scratch be much better / potentially start outperforming the dense-only case?
A1: The conclusion that ~20% additional capacity is needed to catch up to the upcycled model in the original sparse upcycling and V-MoE papers is based on a much more intensive compute and data budget for pre-training LLMs. In the CuMo setup, however, our motivation for using upcycling is to stabilize the training of the MoE modules under a small data and compute budget, because training the connectors or ViTs from scratch is not comparable to the dense models due to training instabilities, as shown in Table 3(a). Estimating the gains of upcycling compared to training from scratch would require training a CLIP-MoE from scratch, which is beyond our training budget and the scope of this work.
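For concreteness, the upcycling initialization discussed here can be sketched as below: each expert starts as a copy of the pre-trained dense MLP, and only the router is freshly initialized. This is an illustrative PyTorch sketch under assumed dimensions, not the authors' code.

```python
import copy
import torch.nn as nn

def upcycle_from_dense(dense_mlp: nn.Module, dim: int = 1024, num_experts: int = 4):
    """Initialize every expert as a copy of the pre-trained dense MLP and pair them
    with a new, randomly initialized router, following the sparse-upcycling recipe."""
    experts = nn.ModuleList([copy.deepcopy(dense_mlp) for _ in range(num_experts)])
    router = nn.Linear(dim, num_experts, bias=False)   # the only newly initialized weights
    return experts, router

dense = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
experts, router = upcycle_from_dense(dense)
print(len(experts), router.weight.shape)               # 4 torch.Size([4, 1024])
```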
Q2: Diminishing gains as the base model size increases: how will this apply to the CuMo setup? If we scale the CLIP-ViT model or use a much stronger model like SigLIP-ViT, will the gains from CuMo still hold?
| Model | ImageNet Acc. | Res. | Params. | TextVQA | MMVet | SEED |
|---|---|---|---|---|---|---|
| CLIP-ViT-L | 76.6 | 336 | 0.30B | 57.6 | 32.1 | 66.4 |
| + MoE | - | 336 | 0.50B | 59.3 | 34.3 | 69.6 |
| SigLIP-SO400M | 83.2 | 384 | 0.43B | 58.1 | 32.5 | 67.5 |
| + MoE | - | 384 | 0.72B | 59.4 | 34.1 | 69.8 |
A2: We think the diminishing gains would also exist in the CuMo setup if we used a larger and stronger pre-trained CLIP with MoE while keeping the training data unchanged. Here we use the pre-trained SigLIP-SO400M as the vision encoder and add MoE to it, as shown in the table above. SigLIP-SO400M performs much better on ImageNet zero-shot classification than CLIP-ViT-L (83.2 vs 76.6). The added MoE still improves this stronger vision encoder, but the average improvement shrinks compared to CLIP-ViT-L. However, the training data here is limited to LLaVA-665K for quick verification, which may not show the full potential of the model when training with more data.
Q3: Suggestions regarding Tables 1 and 2, captions, Figure 6, and the Appendix.
A3: Thanks for the suggestions. We'll update Table 1 by highlighting the best performance numbers in each section, Table 2 by highlighting QwenVL's TextVQA number, and the table captions to make them clear to readers. For Figure 6, the responses are generated by CuMo-Mistral-7B. In the Appendix, the 'M3 model' refers to CuMo-Mistral-7B; we'll revise this in the updated version as well.
Q4: Non-society license, release, and usage guidelines.
A4: We plan to release the code under Apache 2.0 and the weights of all CuMo checkpoints under CC BY-NC 4.0 for non-commercial use. All the datasets and pre-trained weights we used for training and evaluation are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses.
Q5: Limitations.
A5: Thanks for the suggestions. We'll add discussions of RAG and the potential for biased outputs in the limitation section.
Thank you for the response to the reviews and the additional experimental results. After reading all the reviews, the responses, and the overall global response, I will retain my score.
Please make the necessary changes for the figure and table captions and other recommended changes in the final version.
We sincerely thank all the reviewers for their thoughtful comments. We feel encouraged that the reviewers find that:
- Our method is innovative in integrating the MoE design into the vision side of current multimodal LLMs (NGUV, rTFY, 5mox), and the implementation is simple and easy to follow (5mox).
- We provide detailed ablations to validate the effectiveness of the proposed method (NGUV, rTFY) and achieve good performance across various benchmarks (rTFY, 5mox).
- Our work is based on fully open datasets and we present all training settings with hyper-parameters for reproduction of the results (NGUV).
We also appreciate the suggestions from all reviewers, which help us continue to improve the draft. We have tried our best to address the questions in the individual responses below.
Dear Reviewers,
Please read the authors' responses carefully and provide your feedback.
Thanks, AC
This paper proposes to use sparse MoE blocks in the MLP connector and the vision encoder for multimodal LLMs. In particular, for stable and fast MoE training, the proposed method exploits three-stage training with an upcycling strategy and auxiliary losses. Experimental results on various benchmark multimodal understanding tasks show that the proposed MoE-based MLLM outperforms existing MLLMs of similar model sizes, and rigorous ablation studies validate the effectiveness of the proposed architecture and training strategy.
Overall, the paper is well-written and easy to understand, and the authors sufficiently address most of the concerns and issues raised by the reviewers.
In terms of technical novelty, the proposed MLLM is somewhat incremental, combining existing methods for its MoE blocks, initialization, and even training losses, even though it is the first to apply an MoE-based vision encoder to MLLMs. However, rigorous empirical validation from extensive experiments on multiple benchmark tasks, with in-depth analysis and comparison with recent state-of-the-art MLLMs, sufficiently supports its feasibility and effectiveness.
Given that all reviewers scored in a positive direction and the contributions summarized above, I recommend the paper be accepted to NeurIPS.