PaperHub
Average rating: 6.8 / 10 · Oral · 4 reviewers
Individual ratings: 8, 8, 5, 6 (min 5, max 8, std 1.3)
Confidence: 3.3 · Correctness: 3.3 · Contribution: 3.0 · Presentation: 3.3
ICLR 2025

ChartMoE: Mixture of Diversely Aligned Expert Connector for Chart Understanding

Submitted: 2024-09-14 · Updated: 2025-03-25

Abstract

Keywords
Multimodal Large Language Models · Chart Reasoning · Mixture of Expert

Reviews and Discussion

Official Review
Rating: 8

This paper proposes ChartMoE, which employs a Mixture of Experts (MoE) architecture in place of the traditional linear projector to bridge the modality gap.

Strengths

  1. First to introduce an MoE-based MLLM for chart tasks; the architectural design of using MoE in the connector is novel.
  2. Detailed experiments and analysis, with extensive quantitative and qualitative studies.
  3. Introduces a large dataset for chart pre-training.

Weaknesses

  1. Limited innovation: MoE on multimodal language models has been explored in other domains. There's a trade-off between the performance gain and the increase in model parameters and inference time.
  2. The contribution of the MoE module is unclear; more theoretical analysis is needed, such as how knowledge is routed to different experts.

Questions

Have you ever tried training on your alignment data from scratch, with random initialization of the expert parameters and a balancing loss? It would better prove the significance of this work.

Comment

Thanks for your detailed review!


Q1: About the Novelty and Trade-off of MoE

MoE on multimodal language models has been explored in other domains. There's a trade-off between the performance gain and the increase of model parameters and inference time.

We would like to highlight the distinctions between ChartMoE and other MoE models in MLLMs. Some prior work, such as MoE-LLaVA [1], DeepSeek-VL [2], and CuMo [3], has employed MoE architectures in MLLMs. However, these approaches all apply MoE to LLMs or ViTs to increase model capacity, introducing a large number of learnable parameters to boost performance. In contrast, our ChartMoE introduces several distinctive innovations:

  1. Motivation: Our goal is not to expand model capacity but to enhance the model's chart comprehension through alignment tasks while preserving performance on other general tasks. Hence, we retain the original connector parameters as one expert initialization manner.

  2. Initialization: Unlike previous methods that rely on random or co-upcycle initialization, we leverage diverse alignment tasks for expert (connector) initialization. This approach enables ChartMoE to exhibit remarkable interpretability (Fig.6,11,12). From the perspective of visual tokens, experts aligned with Table and JSON focus more on regions with text, such as titles and legends, excelling in tasks similar to OCR. In contrast, the Code expert focuses more on data points and trend directions, excelling in visual logical reasoning and overall analysis.

  3. Complexity: We are the first to apply MoE exclusively to the MLP connector (projector) in LLaVA-like MLLMs. In ChartMoE (based on InternlmXC-v2), the MoE architecture introduces minimal additional parameters (model size 8.364B → 8.427B, only +63M) and training complexity (Fig. 4). It also shows negligible impact on inference speed (0.945 → 0.952 seconds per QA on the ChartQA test set) and peak memory usage (23.72 GB → 23.86 GB, fp16 on an A100-40G GPU).

We have added this discussion to Appendix D.1 to make our paper more comprehensive. We hope our explanation helps you reassess the novelty of ChartMoE!
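
To give readers a concrete picture of the connector described above, here is a minimal PyTorch sketch of a top-k MoE connector. The hidden sizes, number of experts, and top-k value are illustrative assumptions (chosen so that three extra MLP experts add roughly the +63M parameters quoted above), not the exact released configuration.

```python
import torch
import torch.nn as nn

class MoEConnector(nn.Module):
    """Sketch of an MoE connector: a lightweight router picks the top-k MLP experts
    for each visual token. Expert 0 can be initialized from the original (vanilla)
    connector, experts 1-3 from table/JSON/code-aligned connectors. Dimensions are
    illustrative, not the released configuration."""
    def __init__(self, vis_dim=1024, llm_dim=4096, num_experts=4, top_k=2):
        super().__init__()
        self.top_k, self.llm_dim = top_k, llm_dim
        self.router = nn.Linear(vis_dim, num_experts)   # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                # x: [B, N_tokens, vis_dim]
        logits = self.router(x)                          # [B, N, num_experts]
        weights, idx = logits.softmax(dim=-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over selected experts
        out = torch.zeros(*x.shape[:-1], self.llm_dim, dtype=x.dtype, device=x.device)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out                                       # [B, N, llm_dim], fed to the LLM
```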


Q2: More Theoretical Analysis of MoE

Our motivation for using the MoE connector is primarily based on the following two points:

  1. Different structured texts contain varying core information and information volumes, leading the aligned experts to focus on different regions of the charts.

  2. We aim to enhance chart understanding capabilities while preserving the original model's performance on general tasks.

The visualization results (Figures 6, 11, 12) show that the vanilla expert tends to handle background tokens, the Table and JSON experts focus more on text and textures, and the Code expert pays more attention to data trends. This insight led us to remove the balancing loss (bz-loss) from the standard MoE architecture: the visual tokens are not evenly distributed, so forcing an equal workload for each expert is not optimal. We will explore more in-depth theoretical analysis in future work.
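
For readers who wish to reproduce this kind of expert-preference analysis, here is a short sketch of how per-token expert assignments can be counted from the router logits. It assumes a connector like the one sketched above; the function name and shapes are illustrative.

```python
import torch

@torch.no_grad()
def expert_assignment_map(router, vis_tokens, grid_hw):
    """Return the top-1 expert index for each visual token, reshaped to the patch
    grid -- the statistic behind expert-preference visualizations. `router` is the
    gating linear layer; `vis_tokens` is [N_tokens, vis_dim]; shapes are illustrative."""
    top1 = router(vis_tokens).argmax(dim=-1)   # [N_tokens]
    h, w = grid_hw
    return top1.reshape(h, w)                  # e.g. a 16x16 grid of expert ids

# usage (illustrative): counting assignments per expert shows that visual tokens
# are not evenly distributed across experts, e.g.
#   grid = expert_assignment_map(moe.router, tokens, (16, 16))
#   print(torch.bincount(grid.flatten(), minlength=4))
```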


Q3: More Ablation Study

Have you ever tried training on your alignment data with random initialization of expert parameters and balanced loss from scratch? It will better prove the significance of this work.

Thank you for providing an important baseline setting in the ablation study! We have added the results for this setting in Table 5. Align refers to using ChartAlign for alignment, while Init. refers to random initialization without alignment. The experimental results demonstrate the significance of ChartMoE.
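
For clarity on what the balancing loss in this baseline refers to, below is a brief sketch of a Switch-Transformer-style load-balancing auxiliary loss. This is our own illustrative re-implementation for reference, not necessarily the exact loss used in the experiment.

```python
import torch

def load_balancing_loss(router_logits: torch.Tensor, top1_idx: torch.Tensor, num_experts: int):
    """Auxiliary loss that encourages tokens to be spread evenly across experts.
    router_logits: [N_tokens, num_experts]; top1_idx: [N_tokens] (argmax expert ids)."""
    probs = router_logits.softmax(dim=-1)
    frac_tokens = torch.bincount(top1_idx, minlength=num_experts).float() / top1_idx.numel()
    frac_probs = probs.mean(dim=0)             # mean routing probability per expert
    return num_experts * torch.sum(frac_tokens * frac_probs)
```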


We are glad to see your recognition of our contribution. We would be happy to answer any remaining questions or concerns!

Reference:

[1] MoE-LLaVA: Mixture of Experts for Large Vision-Language Models, arXiv 2024.

[2] DeepSeek-VL: Towards Real-world Vision-Language Understanding, arXiv 2024.

[3] CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts, arXiv 2024.

Comment

Dear Reviewer Ka6r,

With the discussion phase ending in 3 days, we would greatly appreciate any further guidance or confirmation from your side. We wanted to follow up on our previous response to your valuable suggestions. Your insights are very important to us, and we are eager to ensure that we have addressed all concerns appropriately.

Warm Regards, :-)

Comment

I have checked the response, and I realize that the authors have addressed most of my concerns. Therefore, I slightly raise my rating.

Thanks.

Official Review
Rating: 8

The paper introduces ChartMoE, a multi-task aligned and instruction-tuned MLLM designed for complex chart understanding and reasoning. The key contribution is the replacement of the traditional linear connector with the Mixture of Expert (MoE) architecture, which improves chart understanding by bridging the modality gap. Additionally, a new dataset called ChartMoE-Align is introduced, containing nearly 1 million chart-table-JSON-code quadruples for alignment training. The proposed three-stage training paradigm and high-quality knowledge learning approach result in significantly improved performance compared to the previous state-of-the-art on various benchmarks.

Strengths

This paper incorporates a Mixture of Expert architecture to bridge the gap between charts and language models and offers a valuable insight for the expert initialization manner. The creation of the ChartMoE-Align dataset with nearly 1 million chart-table-JSON-code quadruplets is a significant contribution to the field, allowing for detailed and meticulous chart alignment pre-training.

The paper is clear and well written in general, is well motivated, and comes with an extensive and comprehensive ablation study.

Weaknesses

The paper focuses on conducting experiments on a single Vision Encoder and Large Language Model, which limits the generalizability of the proposed method. It would be beneficial to test the effectiveness of ChartMoE on a diverse set of MLLMs to ensure its applicability across different models and scenarios.

The paper does not thoroughly discuss potential limitations or challenges that may arise when implementing ChartMoE in practical applications.

Questions

  1. In line 369, the error analysis suggests that many errors stem from the limitations of the evaluation metric, namely string matching. Are other comparison models also limited by this?
  2. Can you provide more detailed explanations of the data construction process, such as how code templates for different types of charts were obtained? Were they manually constructed?
Comment

Thank you for your review and detailed comments! Your suggestion is valuable for us to make this paper comprehensive.


Q1: ChartMoE Based on Other MLLMs

Thank you for your very reasonable suggestions. ChartMoE can be applied to all LLaVA-like MLLMs. Therefore, we conducted experiments based on LLaVA-v1.5-7B, using 10% alignment data (Tab. 1) and ChartQA training data, and the results are shown in the table below.

| Models | Human @0.05 | Aug @0.05 | Acc @0.05 | Human @0.10 | Aug @0.10 | Acc @0.10 | Human @0.20 | Aug @0.20 | Acc @0.20 |
|---|---|---|---|---|---|---|---|---|---|
| LLaVA-v1.5-7B | 7.60 | 7.36 | 7.48 | 7.92 | 8.08 | 8.00 | 9.04 | 9.52 | 9.28 |
| LLaVA-v1.5-7B + ChartQA | 6.08 | 23.04 | 14.56 | 8.24 | 32.96 | 20.60 | 10.32 | 42.16 | 26.24 |
| LLaVA-v1.5-7B + ChartMoE | 18.13 | 32.11 | 25.12 | 20.20 | 42.32 | 31.36 | 24.24 | 52.12 | 38.18 |

As shown in the table, our proposed components significantly improve the base model. This is partly because LLaVA is trained with less chart data, leading to a lower baseline performance, and it also indicates that the additional alignment data and the MoE connector greatly enhance chart understanding. We have added this discussion to Appendix D.2 to make our paper more comprehensive.


Q2: About Limitations of ChartMoE

ChartMoE has two limitations:

  1. Dependency on alignment tasks. ChartMoE requires chart-Table/JSON/Code alignment tasks for expert initialization; non-chart multimodal tasks would need new alignment designs to initialize the MoE experts.

  2. Limited flexibility. Modifying the projector into a multi-expert architecture means ChartMoE is not plug-and-play like LoRA; the router network must be retrained when new experts are added.


Q3: About Metric on ChartQA

Are other comparison models also limited by ChartQA evaluation metric?

Yes, other methods are also affected by the evaluation criteria. We report results that we have reproduced ourselves or re-evaluated using new criteria based on the inference result files provided by the original authors.


Q4: About ChartMoE-Align Details

We have updated Appendix C, where we detail the construction process of ChartMoE-Align. We provide code templates for chart type-agnostic attributes and adopt rule-based methods to fit different chart types. Please refer to Appendix C for details. We will release all resources of ChartMoE-Align and ChartMoE for research purposes.
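
To illustrate what a type-agnostic code template combined with rule-based, type-specific plotting might look like, here is a small self-contained sketch. The JSON field names and the template below are hypothetical examples, not the actual ChartMoE-Align templates.

```python
import json
import matplotlib.pyplot as plt

def render_chart(spec_json: str, out_path: str = "chart.png"):
    """Render a chart from a JSON spec. Type-agnostic attributes (title, axis labels,
    legend) share one template; the plotting call is chosen by simple per-type rules.
    Field names are hypothetical."""
    spec = json.loads(spec_json)
    fig, ax = plt.subplots(figsize=(6, 4))
    x, series = spec["x"], spec["series"]                 # series: {"name": [values], ...}
    if spec["type"] == "line":
        for name, y in series.items():
            ax.plot(x, y, marker="o", label=name)
    elif spec["type"] == "bar":
        width = 0.8 / len(series)
        for i, (name, y) in enumerate(series.items()):
            ax.bar([j + i * width for j in range(len(x))], y, width=width, label=name)
        ax.set_xticks(range(len(x)), x)
    else:
        raise ValueError(f"unsupported chart type: {spec['type']}")
    ax.set_title(spec.get("title", ""))                   # type-agnostic part of the template
    ax.set_xlabel(spec.get("xlabel", ""))
    ax.set_ylabel(spec.get("ylabel", ""))
    ax.legend()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)

render_chart(json.dumps({
    "type": "line", "title": "Monthly sales", "xlabel": "Month", "ylabel": "Units",
    "x": [1, 2, 3, 4], "series": {"A": [3, 5, 4, 6], "B": [2, 4, 5, 7]},
}))
```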


We hope to have addressed your concerns and are thankful for your recognition of our contributions!

Comment

Dear Reviewer WXLS,

With the discussion phase ending in 3 days, we would greatly appreciate any further guidance or confirmation from your side. We wanted to follow up on our previous response to your valuable suggestions. Your insights are very important to us, and we are eager to ensure that we have addressed all concerns appropriately.

Warm Regards, :-)

Comment

Thank you for your thoughtful response. I appreciate your efforts, and my score remains equally positive. Best regards.

Official Review
Rating: 5

This work introduces ChartMoE, a multimodal large language model that enhances automatic chart understanding through a MoE architecture. Unlike traditional linear connectors, ChartMoE uses diverse expert connectors aligned with specific tasks (chart-table, chart-JSON, and chart-code) to bridge the gap between visual encoders and large language models. The paper also presents ChartMoE-Align, a dataset with nearly 1 million chart-table-JSON-code quadruples for training.

Strengths

  1. ChartMoE’s use of task-specific expert connectors in a Mixture of Experts (MoE) framework provides a solution to multimodal chart understanding.

  2. ChartMoE-Align, a large-scale dataset with varied chart alignments (table, JSON, code).

  3. The three-stage training paradigm increases its accuracy in extracting and interpreting numerical data.

Weaknesses

  1. The use of MoE in MLLMs is not particularly novel, as several prior works have already explored MoE structures to enhance model performance. From an innovation standpoint, this reliance on MoE does not introduce a distinctly new approach and could be considered a weakness in terms of contribution.

  2. The multi-expert structure, along with the diverse alignment tasks, adds significant complexity to ChartMoE’s architecture. Also, the training data, being mostly synthetic, might limit the model’s ability.

Questions

see above

Comment

Q3: Synthetic Data Might Limit the Model’s Ability

Also, the training data, being mostly synthetic, might limit the model’s ability.

Yes, real data is typically more effective than synthetic data. However, collecting a sufficient number of real-world charts with meta tables is challenging, and obtaining charts with meta JSON and code is even more difficult. Recent studies [4-5] have shown that synthetic data can significantly aid model training. The synthetic data in ChartMoE-Align has also been carefully diversified and quality-controlled.

Additionally, we only use synthetic data for the alignment stages. During the SFT stage for training the MoE router and LLM, we still use real data from MMC, ChartQA, and ChartGemma, which helps ChartMoE achieve SOTA performance on multiple benchmarks. The ablation experiments in Tables 5-8 also demonstrate the effectiveness of synthetic data in ChartMoE training.


We hope to have addressed your concerns.

Reference:

[1] MoE-LLaVA: Mixture of Experts for Large Vision-Language Models, arXiv 2024.

[2] DeepSeek-VL: Towards Real-world Vision-Language Understanding, arXiv 2024.

[3] CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts, arXiv 2024.

[4] Self-Rewarding Language Models, arXiv 2024.

[5] Reinforced Self-Training for Language Modeling, arXiv 2024.

Comment

Thanks for your review and detailed comments! We hope the following discussion can address your concerns!


Q1: About the Novelty of MoE in MLLMs

The use of MoE in MLLMs is not particularly novel, as several prior works have already explored MoE structures to enhance model performance. From an innovation standpoint, this reliance on MoE does not introduce a distinctly new approach and could be considered a weakness in terms of contribution.

We would like to highlight the distinctions between ChartMoE and other MoE MLLMs. Some prior work, such as MoE-LLaVA [1], DeepSeek-VL [2], and CuMo [3], has employed MoE architectures in MLLMs. However, these approaches all apply MoE to LLMs or ViTs to increase model capacity, introducing a large number of learnable parameters to boost performance. In contrast, our ChartMoE introduces several distinctive innovations:

  1. Motivation: Our goal is not to expand model capacity but to enhance the model's chart comprehension through alignment tasks while preserving performance on other general tasks. Hence, we retain the original connector parameters as one expert initialization manner.

  2. Initialization: Unlike previous methods that rely on random or co-upcycle initialization, we leverage diverse alignment tasks for expert (connector) initialization. This approach enables ChartMoE to exhibit remarkable interpretability (Fig.6,11,12). From a visual token perspective, experts aligned with Table and JSON prioritize text regions like titles and legends, excelling in OCR-like tasks. In contrast, the Code expert emphasizes data points and trend directions, demonstrating strengths in visual logical reasoning and overall analysis.

  3. Complexity: We are the first to apply MoE exclusively to the MLP connector (projector) in LLaVA-like MLLMs. In ChartMoE (based on InternlmXC-v2), the MoE architecture introduces minimal additional parameters (model size 8.364B → 8.427B, only +63M) and training complexity (Fig. 4). It also shows negligible impact on inference speed (0.945 → 0.952 seconds per QA on the ChartQA test set) and peak memory usage (23.72 GB → 23.86 GB, fp16 on an A100-40G GPU).

We have added this discussion to Appendix D.1 to make our paper more comprehensive. We hope our explanation helps you re-assess the novelty of ChartMoE. Thank you!


Q2: MoE Architecture and Differentiated Alignment Tasks Add Complexity

The multi-expert structure, along with the diverse alignment tasks, adds significant complexity to ChartMoE’s architecture.

  1. Model Complexity. In Q1, we described the minimal increase in model parameters, training difficulty, inference speed, and inference memory introduced by the MoE connector (projector). This is because we only replace the linear projector with an MoE at the connector position, so the overall changes are minimal.

  2. Training Complexity. ChartMoE does not significantly increase computation costs. The expert alignment process in ChartMoE uses 800k training samples (500k tables for expert 1 / 200k JSONs for expert 2 / 100k codes for expert 3), and the trainable parameters include only a single MLP expert. The computation cost of the alignment process is equivalent to training a single MLP connector on part of the ChartMoE-Align data (800k samples), the only difference being that the training data is split by task type into 3 stages (see the sketch below).
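
To make the staged setup in item 2 concrete, here is a minimal sketch of a per-stage freeze/unfreeze schedule. The attribute names (`connector.experts`, `connector.router`, `llm`) are illustrative assumptions rather than the released training code; in practice the aligned connectors can equally be trained standalone and then copied into the MoE.

```python
def configure_stage(model, stage: str, expert_idx: int = 0):
    """Freeze all parameters, then unfreeze only what the given stage trains."""
    for p in model.parameters():
        p.requires_grad = False
    if stage == "align":      # alignment stages: one MLP expert per task (table / JSON / code)
        for p in model.connector.experts[expert_idx].parameters():
            p.requires_grad = True
    elif stage == "sft":      # SFT stage: the MoE router and the LLM are trainable
        for p in model.connector.router.parameters():
            p.requires_grad = True
        for p in model.llm.parameters():
            p.requires_grad = True
    else:
        raise ValueError(f"unknown stage: {stage}")

# illustrative usage:
#   configure_stage(model, "align", expert_idx=1)   # e.g. the table-aligned expert
#   configure_stage(model, "sft")
```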

Therefore, ChartMoE introduces almost no additional model complexity or training complexity compared to other MLLMs with LLaVA structures. We have added this discussion to Appendix D.1 to make our paper more comprehensive. We hope our explanation is helpful for you.

Comment

Dear Reviewer 6BC6,

With the discussion phase ending in 3 days, we would greatly appreciate any further guidance or confirmation from your side. We wanted to follow up on our previous response to your valuable suggestions. Your insights are very important to us, and we are eager to ensure that we have addressed all concerns appropriately.

Warm Regards, :-)

Comment

Dear Author,

I have reviewed your rebuttal, and while it has addressed some of my questions, I still have some doubts about the novelty of the work. As a result, I will not be changing my score (I considered increasing it by 0.5, but ICLR does not allow such increments). If the AC decides to accept, then I am fine with that too.

Thank you for clarifying my questions.

Comment

Dear Reviewer 6BC6,

Thank you very much for your timely response. We are glad that our reply has addressed some of your concerns!

If you have a moment, could you share more detailed thoughts on the novelty aspect? We’re looking forward to delving deeper into the application of MoE in chart MLLMs and believe our discussion could spark new ideas for our future work.

Your support is incredibly valuable to us!

Warm Regards, :-)

Official Review
Rating: 6

The paper introduces ChartMoE, a novel approach that leverages a Mixture of Expert (MoE) architecture to improve automatic chart understanding with Multimodal Large Language Models (MLLMs). ChartMoE addresses the limitations of existing MLLMs by using specialized linear connectors for diverse expert initialization, coupled with a unique dataset, ChartMoE-Align, containing nearly one million chart-related quadruples. This setup significantly enhances data interpretation from charts, pushing the accuracy on the ChartQA benchmark from 80.48% to 84.64%, showcasing a substantial improvement over previous methods.

Strengths

S1. Innovative Methodology: The introduction of the ChartMoE method, which utilizes a Mixture of Experts (MoE) architecture, represents a significant innovation in the field of automatic chart understanding. This approach addresses the modality gap effectively and could potentially set a new direction for future research in multimodal learning systems.

S2. Comprehensive Dataset: The creation of the ChartMoE-Align dataset is commendable. Its large scale and diversity are well-suited for robust pre-training in chart alignment tasks. This dataset not only serves the immediate needs of the study but also provides a valuable resource for the broader research community to explore complex multimodal tasks involving charts and text.

S3. Extensive Experimental Validation: The paper presents extensive experiments that demonstrate the effectiveness of the ChartMoE approach. The thoroughness of these experiments, which include a variety of scenarios and detailed performance metrics, establishes a strong benchmark for future comparative studies.

S4. Clear Writing: The manuscript is exceptionally well-written, providing clear explanations and methodical presentation of the concepts and methodologies involved. This clarity enhances the reader's understanding and appreciation of the work's contributions to the field.

Weaknesses

W1: Details on Dataset Construction

The paper lacks critical details on the dataset construction process. Clarifications are needed regarding the criteria used to select and filter charts for inclusion in the dataset. Specifically, the process for generating meta CSV data via Large Language Models (LLM) requires more transparency. More details on which LLMs were used and the code templates for different types of charts are missing. Such information is crucial for reproducibility and for understanding the dataset's applicability to other multimodal tasks. The manuscript should discuss the steps taken to ensure the quality of the data, including any validation mechanisms or controls used during dataset assembly.

W2: Clarification of Experiment Results

The paper briefly mentions that the proposed method shows weaknesses in some settings compared to baselines, as detailed in Tables 2 and 3. However, these points are not adequately addressed or explained. A more thorough analysis of why ChartMoE underperforms in these instances would be valuable for readers and for future improvements to the method.

Questions

Q1: Modality-Specific Contributions

It would be beneficial for the paper to elaborate on the unique contributions of different representations (JSON, code, and chart) in the context of chart-related tasks. Understanding how each representation impacts the model's learning and performance could provide insights into optimizing future models for similar tasks.

Q2: Necessity and Efficiency of Large-Scale Alignment Dataset

The heuristic approach to generating a large-scale dataset raises questions about the efficiency and necessity of such a volume of data. Is there potential to achieve similar performance with a smaller, possibly more curated dataset? This exploration could lead to more resource-efficient training processes and better generalization in practical applications.

Comment

Q3: Modality-Specific Contributions

It would be beneficial for the paper to elaborate on the unique contributions of different representations (JSON, code, and chart) in the context of chart-related tasks. Understanding how each representation impacts the model's learning and performance could provide insights into optimizing future models for similar tasks.

Thanks for your suggestions. To demonstrate the individual contributions of each component, we conduct two ablation studies:

  1. Table 7 shows the performance of ChartMoE when experts are manually activated (without considering the router). The vanilla expert demonstrates the best performance, which aligns with its activation frequency and the number of tokens it processes. The code expert also performs well, indicating that this modality brings significant benefits to the system.

  2. Table 8 presents the results when using a single linear projector instead of the MoE architecture, trained with different alignment tasks. Without supervised fine-tuning (SFT) on ChartQA, each alignment task degrades the performance, likely due to the significant differences between alignment tasks and QA tasks. After SFT on ChartQA, all three experts outperform the vanilla expert, demonstrating the effectiveness of the alignment tasks.

We illustrate each expert's preference for selecting visual tokens in Figures 6, 11, and 12. From the perspective of visual tokens, experts aligned with Table and JSON focus more on regions with text, such as titles and legends, excelling in tasks similar to OCR. In contrast, the Code expert focuses more on data points and trend directions, excelling in visual logical reasoning and overall analysis.

We hope these experimental results are helpful to you.


Q4: Necessity and Efficiency of Large-Scale Alignment Dataset

Can a smaller but more carefully designed dataset achieve similar alignment effects?

We believe it can. However:

  1. Collecting real, finely annotated Table-Chart pairs requires more resources, and collecting Table-JSON-Code-Chart quadruples is even more challenging.

  2. The alignment task is inherently a relatively simple and fixed task. Therefore, we believe that using the cleaned and quality-checked ChartMoE-Align dataset is sufficient. This is reflected in the data ratios shown in Table 1, where the loss for certain modalities saturates after training for a specific number of iterations. As a result, we do not use the full amount of data for each modality.


We hope to have addressed your concerns!

Reference:

[1] OneChart: Purify the Chart Structural Extraction via One Auxiliary Token, arXiv 2024.

[2] ChartReformer: Natural Language-Driven Chart Image Editing, arXiv 2024.

Comment

Thanks for the detailed response from the authors. I think your reply has addressed most of my concerns. I will maintain my positive score.

Comment

Thank you for your review and detailed comments! We are glad to see your recognition of our contribution. You can find below our detailed response to your questions and concerns. Please let us know if you have any further concerns or suggestions.


Q1: Details on ChartMoE-Align Construction

Thanks for your suggestions! We have updated Appendix C, where we detail the construction process of ChartMoE-Align. Specifically, ChartMoE-Align uses real-world tables (from ChartQA and PlotQA) and GPT-generated tables (from ChartY provided by OneChart[1]). We further filter the data in ChartY, removing duplicates and maintaining a balance of chart categories. We adopt the JSON template provided by ChartReformer[2] and further categorize attributes into chart type-specific and type-agnostic. During code generation, we use fixed templates for type-agnostic parts and rule-based methods to ensure faithful representation of each attribute for type-specific parts. Finally, we discard all quadruples that produce rendering errors or warnings. To ensure data quality, we randomly sample 200 quadruples and evaluate them using both human and GPT4 judges to assess chart quality and the match between chart, table, JSON, and code. The evaluation results show that ChartMoE-Align is sufficient for the alignment training task. Please refer to the appendix for a better reading experience. Combined with the description in Section 3.2, we hope this addresses your concerns.


Q2: Clarification of Experiment Results

The paper briefly mentions that the proposed method shows weaknesses in some settings compared to baselines, as detailed in Tables 2 and 3. However, these points are not adequately addressed or explained. A more thorough analysis of why ChartMoE underperforms in these instances would be valuable for readers and for future improvements to the method.

Thank you for your suggestions! We add a discussion in Appendix D.3 regarding why ChartMoE performs less well than the previous SOTA in certain settings. In Tab. 1, our ChartMoE significantly outperforms the SOTA. However, some models perform better than ours on the Augment part of the ChartQA test set. Given that the Augment part of ChartQA is considerably easier than the Human part, we conduct a more detailed analysis of the performance of various models on numeric (Human: 43%, Augment: 39%) and non-numeric (Human: 57%, Augment: 61%) questions in ChartQA. As shown in the following table, ChartMoE excels in all subcategories except for non-numeric questions in the Augment part. We find that ChartMoE's errors primarily occur in string-matching cases. For instance, the prediction "It is between 2003 and 2005" is marked incorrect when the ground truth is "(2003, 2005)". High accuracy in this category may instead indicate overfitting.

| Method | Number (Human) | Non-Number (Human) | Human | Number (Aug) | Non-Number (Aug) | Augment | Acc |
|---|---|---|---|---|---|---|---|
| TinyChart | 58.52% | 58.03% | 58.24% | 92.43% | 96.25% | 94.32% | 76.28% |
| ChartAst | 67.04% | 65.35% | 66.08% | 93.20% | 93.07% | 93.12% | 79.00% |
| ChartMoE-PoT | 73.89% | 75.49% | 74.80% | 93.20% | 90.98% | 91.84% | 84.64% |
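
For context on the metric itself: ChartQA-style relaxed accuracy allows a 5% tolerance for numeric answers but falls back to string matching otherwise, which is what penalizes paraphrased predictions like the example above. Below is a minimal sketch of such a metric, our own illustrative re-implementation rather than the official evaluation script.

```python
def relaxed_accuracy(pred: str, target: str, tol: float = 0.05) -> bool:
    """ChartQA-style relaxed accuracy: numeric answers may deviate from the target
    by `tol` (5% by default); non-numeric answers fall back to exact string matching."""
    def to_float(s):
        try:
            return float(s.strip().rstrip("%").replace(",", ""))
        except ValueError:
            return None
    p, t = to_float(pred), to_float(target)
    if p is not None and t is not None:
        return p == t if t == 0 else abs(p - t) <= tol * abs(t)
    return pred.strip().lower() == target.strip().lower()

# the paraphrased-but-correct answer from the example above is scored as wrong:
print(relaxed_accuracy("It is between 2003 and 2005", "(2003, 2005)"))  # False
print(relaxed_accuracy("84.9", "85"))                                   # True (within 5%)
```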

Table 3 shows the results on ChartBench, where ChartMoE does not achieve the best performance in the Pie and Box subcategories. This is because:

  1. The sample sizes for these two subcategories are small, which may introduce random bias.

  2. A more detailed error analysis reveals that most of the errors occur in the Chart Type tasks for these subcategories (e.g., Q: "This is a combination chart, not a pie chart."). We further instruct ChartMoE to provide reasons for its judgments (A: "This graph contains multiple charts, including pie, so it can be considered either a combination chart or a pie chart, depending on what you are asking about."). Therefore, the performance gap is due to the ambiguity in these cases combined with the small sample size of these subcategories, making them non-representative.


Comment

Dear Reviewer jNMd,

With the discussion phase ending in 3 days, we would greatly appreciate any further guidance or confirmation from your side. We wanted to follow up on our previous response to your valuable suggestions. Your insights are very important to us, and we are eager to ensure that we have addressed all concerns appropriately.

Warm Regards, :-)

Comment

Thank all reviewers for the thorough review and detailed comments. To address the concerns, we have made the following revisions to the paper:

  1. Emphasized the novelty of ChartMoE. ChartMoE differs from previous methods in terms of motivation, initialization manner, and MoE application position. We have added this discussion to Appendix D.1.
  2. Analyzed the model complexity and training complexity of ChartMoE and included the results in Appendix D.1.
  3. Added a detailed description of the ChartMoE-Align dataset, including data sources, data processing methods, JSON / Code template, data filtering & cleaning, and data quality control. We have added this discussion to Appendices C.1-4.
  4. Included the performance results of ChartMoE on other MLLMs (e.g., LLaVA-7B) to demonstrate the generalizability of our methods (MoE connector, Diverse Align, etc.). We have added the new experimental results to Appendix D.2.
  5. Provided a deeper analysis of the experimental results in Tables 2-3. By examining fine-grained results, we explain the performance of ChartMoE. We have added the new experimental results and discussions to Appendix D.3.
  6. Added a discussion on the limitations of ChartMoE, presented in Appendix D.4.
  7. Improved the ablation experiment settings and added more discussions on the effectiveness of synthetic data based on the reviewers' suggestions. Please refer to the specific discussions with the reviewers.
AC Meta-Review

The paper introduces ChartMoE, a model that enhances complex chart understanding through a Mixture of Expert (MoE) architecture, replacing traditional linear projectors. It also presents the ChartMoE-Align dataset, designed for three specific alignment tasks, demonstrating the model's practical utility. The experimental results are robust and convincing. I recommend that the authors strengthen the presentation of their novel approach in the Introduction to better highlight the advancements over existing methods.

Additional Comments from the Reviewer Discussion

Discussion summary during the rebuttal period:

  1. Novelty of ChartMoE: Although the Appendix provides further clarification on the novelty of ChartMoE, it still fails to demonstrate that the approach is highly novel. It is recommended that the authors enhance the explanation of ChartMoE's novelty in the Introduction to strengthen its motivation.

  2. Generalization and contribution: The authors have provided additional experiments and further elaborations, which help in understanding the general applicability and contributions of ChartMoE.

  3. Complexity analysis: It has been demonstrated that ChartMoE does not significantly increase computational costs.

  4. Data construction details and experiment analysis: More details on data construction and deeper analysis of experiments are discussed, providing greater insight into the methodologies and findings.

Final Decision

Accept (Oral)