On Path to Multimodal Generalist: General-Level and General-Bench
Abstract
Reviews and Discussion
This paper pioneers the idea of a General-Level framework to evaluate MLLMs, enabling an accurate assessment of MLLM capabilities. The authors provide significant observations and principles on benchmark design and construct a sophisticated, level-wise evaluation metric to maintain the rationality of the benchmark. They build General-Bench based on the proposed framework, a massive benchmark for evaluating the comprehension and generation capabilities of MLLMs across modalities such as image, video, audio, 3D, and language. A comprehensive evaluation with extensive analysis and discussion is provided in the paper.
Questions for Authors
- The idea of utilizing General-Level to expand upon classical MLLM benchmarks is novel. However, General-Bench focuses on capability while overlooking critical risks (e.g., hallucination, bias amplification). Could high-scoring models in your framework inadvertently reward unsafe behaviors? Should 'safety synergy' be introduced as a separate dimension in the newly proposed framework?
- Since the benchmark contains more than 700 tasks, which is tremendous, evaluating models on General-Bench likely requires massive compute. Does this contradict the push for sustainable AI? Have you quantified the carbon footprint, and if not, should this be a mandatory disclosure for future benchmarks?
- In the current setting, is each modality equally important on the path to AGI? The authors mention that they would like to incorporate more modalities into this benchmark. The current modalities seem to be the primary modalities for intelligence, while the newly added modalities do not seem as important as the current ones. If each modality is equally important, the current configuration may not fairly reflect the model's capabilities.
Claims and Evidence
The claims are clear, and this paper provides sufficient evidence. The authors claim that existing benchmarks rely too heavily on single-task performance metrics; they present evidence by evaluating 100+ MLLMs and showing that most fail to meet higher-level synergy requirements. They further claim that current benchmarks lack coverage of diverse formats and advanced capabilities.
Methods and Evaluation Criteria
This paper proposes General-Bench to evaluate multimodal models across comprehension, generation, and cross-modal synergy. It introduces a five-level classification system that measures how well models preserve synergy at increasingly complex levels. Experimental results show that even the best-performing MLLMs struggle with higher-level synergy tasks, suggesting current models are far from achieving full multimodal generalization. The framework offers a systematic and accurate measure of progress toward AGI.
Theoretical Claims
This paper provides a detailed definition of General-Level. I think the proof is sound.
Experimental Design and Analysis
- The task selection in the benchmark is comprehensive, containing more than 700 tasks and covering all the main tasks across various modalities, fully reflecting the capability of each MLLM.
- The selected models cover most of the primary open-source and closed-source MLLMs, reflecting high coverage of the field.
- The observations and analyses for the experiments provide in-depth insights, such as current MLLMs focusing more on content comprehension than supporting generation, and that multimodality does NOT enhance language.
Supplementary Material
I have reviewed the supplementary material. It provides more detail than I expected on experimental settings (e.g., SoTA specialist selection), multimodal generalist (MLLM) selection, level definitions, and benchmark datasets. The supplementary material is very comprehensive and provides enough support for the main claims and background of the proposed benchmark.
Relation to Prior Literature
The paper provides a novel MLLM benchmark, inheriting from previous MLLM benchmarks, such as MME, MMMU, and MMT-Bench. However, it also introduces innovations, such as expanding the simple ranking into a five-level ranking, providing a more comprehensive and reasonable evaluation for MLLMs.
Missing Essential References
This paper includes enough references, and the authors have adequately discussed the relevant works.
Other Strengths and Weaknesses
Strengths:
- This paper derives a stunning and impressive idea: introducing the five-level capability grading mechanism from the autonomous driving industry into MLLM evaluation. This category-based ranking can comprehensively assess the MLLM’s synergy capabilities across comprehension, generation, and multimodal interactions.
- The definition of General-Level is reasonable and conforms to the requirements of the two assumptions in Section 3.2. In addition, Appendix C.1 provides further explanation, and General-Level is convincing as an evaluation criterion for ranking MLLM capabilities.
- This paper proposes a panoramic evaluation across various multimodal tasks. In both modality coverage and task count, it surpasses previous MLLM benchmarks. The task selection is varied and covers diverse domains, making it well suited to serve as a standard MLLM benchmark.
- The paper conducts a thorough evaluation on the benchmark, selecting more than 100 MLLMs. This is a substantial workload and clearly demonstrates the current MLLMs’ performance on the new evaluation criteria. Also, the observations gained from the results, such as "multimodality does NOT enhance language", provide valuable insights for MLLM research.
Other Comments or Suggestions
There might be some minor issues:
- Line 2328: all task -> all tasks
- Caption in Figure 1: hinge -> hinges
- Line 60: Yhe -> The
We are more than excited to receive your very strong recognition of our work, which means a lot! Also thanks for your detailed review and valuable suggestions. We have done our best to address all concerns and are happy to engage in further discussion to improve the clarity and quality of our work.
Q1. General-Bench focuses on capability while overlooking critical risks (e.g., hallucination, bias amplification). Could high-scoring models in your framework inadvertently reward unsafe behaviors? Should 'safety synergy' be introduced as a separate dimension in the newly proposed framework?
A: Thank you for the reviewer’s thoughtful and forward-looking comment. When designing the scope definition and task list curation for General-Bench, we paid special attention to mitigating potential bias by ensuring diversity in data sources, annotation methods, and task formats. Moreover, we prioritized using well-established datasets with high-quality annotations, and in some cases, relied on manual annotations to further minimize the risk of hallucinated content.
Regarding the safety perspective, we explicitly focused on evaluating positive and safe skill sets when defining the skill scope of our benchmark. During data collection and task construction, we carefully reviewed whether the expected model outputs could involve safety-sensitive content. If any safety risks were detected, we excluded those instances from the final benchmark.
We sincerely appreciate the reviewer’s suggestion to consider "safety synergy" as a separate evaluation dimension. We agree that safety is a critical component of trustworthy multimodal AGI systems and should be an integral part of generalist benchmarking. In future versions of General-Bench, we plan to incorporate safety-focused tasks and metrics to better evaluate models not just on capability, but also on responsible and safe behavior.
Q2. Since the benchmark contains more than 700 tasks, which is tremendous, evaluating models on General-Bench likely requires massive compute. Does this contradict the push for sustainable AI? Have you quantified the carbon footprint, and if not, should this be a mandatory disclosure for future benchmarks?
A: Thank you for raising this important concern. We agree that sustainability is a crucial consideration in designing and deploying large-scale AI benchmarks.
Although General-Bench includes over 700 tasks, it is intentionally designed to be modular and flexible. Model developers are not required to run evaluations on the entire task suite. In practice, given prior knowledge of a model’s capability boundaries, developers can select a relevant subset of tasks to obtain a fair and rigorous comparison with other models.
We also acknowledge the importance of quantifying carbon footprint in building more responsible and sustainable AI systems. While we have not yet included CO₂ impact reporting in the current version, we plan to integrate optional carbon estimation tools in future releases, following emerging best practices in large-scale evaluation.
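As an illustration only, the sketch below shows how per-task evaluation runs could be wrapped with the open-source codecarbon package to log an estimated CO₂ footprint. The function name, project name, and the `task.name`/`task.evaluate` API are hypothetical placeholders, not part of the current General-Bench tooling.

```python
from codecarbon import EmissionsTracker

def evaluate_with_carbon_report(model, tasks):
    """Run an evaluation loop while logging an estimated carbon footprint (sketch)."""
    tracker = EmissionsTracker(project_name="general-bench-eval")  # hypothetical project name
    tracker.start()
    results = {task.name: task.evaluate(model) for task in tasks}  # placeholder task API
    emissions_kg = tracker.stop()                                  # estimated emissions in kg CO2-eq
    results["estimated_co2_kg"] = emissions_kg
    return results
```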
Q3. In the current setting, is each modality equally important on the path to AGI? The authors mention that they would like to incorporate more modalities into this benchmark. The current modalities seem to be the primary modalities for intelligence, while the newly added modalities do not seem as important as the current ones. If each modality is equally important, the current configuration may not fairly reflect the model's capabilities.
A: Thank you for this thoughtful question. We would like to clarify that General-Bench does not assume all modalities are equally important on the path toward AGI, nor do we assign equal weight to each modality in our evaluation. On the contrary, we recognize that different modalities play distinct and complementary roles in intelligent systems, and their importance can vary depending on the task, environment, or application context.
The currently included modalities—language, image, video, audio, and 3D—have been extensively studied and are more readily available at scale, which naturally draws more research attention. However, other modalities such as heatmaps, sensor data, charts, or even tactile signals also encode rich, structured information about the world. These may be particularly crucial in specialized domains such as medical diagnosis, physical reasoning, or embodied intelligence.
Our motivation for incorporating more modalities is not to suggest that each one is equally critical for AGI, but rather to build a broader and more flexible evaluation framework that can assess how well models generalize across diverse input types. We appreciate the reviewer’s suggestion and will revise the paper to make our position on modality importance and evaluation scope more explicit and transparent.
Q4. Typos: Line 2328: all task -> all tasks ...
A: Thanks! We will correct this.
This paper presents a comprehensive benchmark that includes over 700 existing tasks, providing a foundation for evaluating multimodal large language models (MLLMs). It introduces a five-level classification framework designed to systematically categorize MLLMs based on their capabilities and functionalities. Furthermore, the paper conducts an extensive evaluation of hundreds of models using the proposed benchmarks, offering valuable insights into their performance across various dimensions. By establishing a standardized assessment methodology, this work aims to advance research in the field and facilitate the development of more effective and versatile MLLMs.
update after rebuttal
I would like to thank the authors for the rebuttal. As the authors didn't update Table 1 of the draft as they mentioned in the rebuttal, I am not able to judge whether they will do so properly. I keep my original rating.
Questions for Authors
Please address my questions above.
Claims and Evidence
Strength:
- This paper serves as a benchmark study, featuring an impressive number of tasks, diverse data domains, and a substantial number of evaluated models. I greatly appreciate the authors' extensive efforts in constructing this comprehensive evaluation benchmark.
Weakness:
- The five-level classification method lacks clarity, as it does not provide well-defined criteria for each level. In particular, the concept of "synergy" between different modalities is not clearly articulated, making the classification ambiguous. For instance, Unified-IO-2 and Next-GPT, which are any-to-any modality generative models capable of producing visual and audio outputs from multimodal inputs, are categorized as Level-2, suggesting they lack synergy in comprehension and generation. Meanwhile, DeepSeek-VL and LLaVA-One-Vision, which do not even possess visual generation capabilities, are classified as Level-3, implying they exhibit synergy across tasks. This inconsistency raises concerns about the validity of the classification framework, as it does not appear to align with the actual capabilities of these models. A more precise and well-justified definition of "synergy" is necessary to ensure the classification accurately reflects the models’ multimodal abilities.
Methods and Evaluation Criteria
- The evaluation criteria are well-founded; however, as previously mentioned, I disagree with the classification method applied to different MLLMs.
Theoretical Claims
No theoretical claims.
Experimental Design and Analysis
The experimental design is thorough, encompassing a wide range of domains, tasks, data, and models.
However, the analysis in Section 5.2 feels somewhat superficial, especially considering the large-scale experiments conducted in this study. The observations are fairly straightforward and lack deeper insights, making them less compelling.
Supplementary Material
I roughly went through the supplementary material, given that it has almost 300 pages.
Relation to Prior Literature
It is related to many works in multi-modal large language models.
Missing Essential References
No.
Other Strengths and Weaknesses
No, the primary concern lies in the reasonableness of the classification method and the depth of the analysis.
Other Comments or Suggestions
It would be much more engaging to see more unique insights drawn from the vast amount of experiments conducted.
We sincerely thank the reviewer for your time, meaningful questions, and constructive suggestions. Also, your recognition of our paper means a lot to us, which is the source of power to push us forward and enhance this work/project for a greater meaning to the community. We address each concern in detail below and hope that our clarifications help improve your evaluation of our work.
Q1. The five-level classification method lacks clarity, as it does not provide well-defined criteria for each level. In particular, the concept of "synergy" between different modalities is not clearly articulated, making the classification ambiguous. For instance, Unified-IO-2 and Next-GPT, which are any-to-any modality generative models capable of producing visual and audio outputs from multimodal inputs, are categorized as Level-2, suggesting they lack synergy in comprehension and generation. Meanwhile, DeepSeek-VL and LLaVA-One-Vision, which do not even possess visual generation capabilities, are classified as Level-3, implying they exhibit synergy across tasks. This inconsistency raises concerns about the validity of the classification framework, as it does not appear to align with the actual capabilities of these models. A more precise and well-justified definition of "synergy" is necessary to ensure the classification accurately reflects the models’ multimodal abilities.
A: Thank you for the detailed and thoughtful feedback. Due to space constraints, we provided the full definitions and criteria for each of the five levels in Appendix C. We kindly refer the reviewer to this section for a comprehensive explanation.
First, it is important to note that the levels in our framework are not mutually exclusive, but rather hierarchical. That is, a model classified at Level-3 will also have valid scores at Level-2, and so on. In Table 1, Unified-IO-2 and Next-GPT are shown as example models under Level-2 to illustrate that they satisfy the baseline criteria for this level—not to imply they are limited to Level-2. In fact, as shown in Table 19 (Appendix), both Unified-IO-2 and Next-GPT also receive valid Level-3 scores, indicating that they exhibit synergy across comprehension or generation tasks. However, they do not appear at Level-4, which requires synergy across both comprehension and generation.
Secondly, as for Level-3, the definition explicitly includes models with synergy in comprehension and/or generation. Although DeepSeek-VL and LLaVA-One-Vision do not support generative outputs, their performance on comprehension tasks exceeds that of single-modality specialists, thereby qualifying them for Level-3 based on comprehension synergy alone.
We appreciate the reviewer pointing out this potential source of confusion, and we will revise the paper to clarify the hierarchical nature of the levels, emphasize that the examples in Table 1 are illustrative rather than exclusive, and provide clearer articulation of "synergy" as it applies to comprehension and generation tasks.
Q2. However, the analysis in Section 5.2 feels somewhat superficial, especially considering the large-scale experiments conducted in this study. The observations are fairly straightforward and lack deeper insights, making them less compelling. It would be much more engaging to see more unique insights drawn from the vast amount of experiments conducted.
A: Thank you for the valuable feedback. Due to space limitations in the main paper, we have provided more in-depth analyses and insights in Appendices B.5, B.6, and B.7, including detailed discussions on synergy across skills, modalities, and comprehension/generation dimensions. In addition, we include fine-grained performance breakdowns for each model across individual skills, which allow for more precise diagnosis of model weaknesses and emerging capability trends. These detailed results offer a solid foundation for identifying underexplored areas and informing future research directions.
In the revision, we will work to highlight additional non-obvious patterns and findings in the main text to better reflect the depth of our analysis.
The paper introduces General-Level, a framework inspired by the autonomous driving industry's capability grading system, to classify Multimodal Language Models (MLLMs) across five levels based on their synergy in comprehension, generation, and multimodal interactions. To support this classification, the authors propose General-Bench, an extensive multimodal benchmark covering 700+ tasks and 325,800 instances across diverse modalities, including text, images, video, audio, and 3D. The evaluation of over 100 MLLMs reveals that most models fail to exhibit true synergy across modalities and tasks, challenging the idea that current MLLMs are progressing toward Artificial General Intelligence (AGI). The authors argue that synergy—the ability to transfer knowledge between modalities and tasks—should be the key metric for evaluating multimodal generalists. However, their assumptions, such as the claim that multimodal synergy can enable generalist models to outperform task-specific SoTA specialists, remain debatable. The benchmark itself is massive and computationally expensive, making it impractical for fast model iteration. The results suggest that multimodality does not necessarily enhance language abilities, contradicting some empirical findings. The paper aims to redefine multimodal evaluation standards but introduces several questionable assumptions and practical challenges.
Questions for Authors
No
Claims and Evidence
see below.
Methods and Evaluation Criteria
see below.
Theoretical Claims
No
Experimental Design and Analysis
see below.
Supplementary Material
N/A.
Relation to Prior Literature
The authors need to discuss in more depth how their work relates to prior holistic evaluation benchmarks such as HELM, VHELM, etc.
Missing Essential References
No
Other Strengths and Weaknesses
Strengths:
- General-Bench covers a vast range of modalities, making it one of the most extensive multimodal evaluation suites to date.
- The proposed General-Level system offers a structured way to assess MLLM capabilities beyond simple task performance.
- The paper correctly highlights the lack of cross-task and cross-modal synergy in existing MLLMs.
- The authors claim they will maintain an open benchmark and leaderboard, potentially aiding long-term MLLM progress.
Weaknesses:
- Benchmark Construction Relies on Existing Datasets and Is Computationally Expensive: The main contribution of the paper, General-Bench, is largely a repurposing of existing benchmarks rather than a fundamentally new dataset. The biggest issue is its sheer size, making model evaluation prohibitively compute- and time-intensive. While the authors claim they will maintain the benchmark and leaderboard, this does not address the core problem: practical usability during model development. Large-scale evaluations are not feasible for fast iteration, which is why even in the current era, development sets remain essential.
- Overly Strong and Questionable Assumptions: The paper assumes that "a model’s synergy capability enables it to outperform SoTA specialists in specific tasks by leveraging knowledge across tasks or modalities." This is highly unrealistic. There is no current evidence that a general-purpose LLM can outperform a domain-specific model in its specialized task. For instance, no LLM has surpassed a fine-tuned BERT model in NER. Furthermore, pursuing an all-encompassing LLM for every task is inefficient, both computationally and economically. The cost of using a large LLM for a task that a smaller, specialized model can perform better is unjustified.
- Misleading Claim About Multimodal Synergy Not Enhancing Language Performance: The paper concludes that multimodality does not enhance language abilities, but this is not always correct. Empirical results from training Vicuna, Qwen2, and LLaMA show that models trained with both image-text and text-only data consistently outperform those trained with text-only data on language benchmarks like MMLU. This trend is reproducible across multiple models, contradicting the paper’s claim. The authors fail to acknowledge variations in training setups that could impact this conclusion.
Other Comments or Suggestions
Line 60: Yhe -> The
We appreciate you carefully reviewing our paper and raising meaningful questions and constructive suggestions. Below we address your concerns one by one and are open to further discussion. Hope you can raise your evaluation.
Q1. Discussion of related evaluation benchmarks (HELM, VHELM, etc.)
A: Thank you for highlighting this. We acknowledge the contributions of holistic benchmarks like HELM, VHELM, and MMBench, which provide open, standardized infrastructures for evaluating language and vision foundation models. Our Generalist Benchmark aligns with this direction and is designed to serve as an open platform enabling broad community participation and transparent comparison of MLLMs. We will include a more detailed comparison with these existing efforts in the revision.
Q2. The benchmark’s large scale makes evaluation costly and impractical for rapid iteration...
A: We believe the benchmark's scale is a strength, reflecting diversity across tasks, modalities, and formats. For any single task, we limit evaluation to ~500 instances—computationally reasonable for most models.
To enhance usability, we will further propose a multi-scope evaluation structure with three leaderboard types:
- Scope-1: Full-spectrum leaderboard (current General-Bench), for general-purpose MLLMs.
- Scope-2: Modality-specific leaderboards, for modality-specialized models.
- Scope-3: Fine-grained task-cluster leaderboards under each modality.
All rankings are computed via our General-Level framework, allowing users to select appropriate evaluation scope based on their model’s capacity and resource constraints. This makes the benchmark scalable and adaptable: lightweight scopes allow fast iteration; broader scopes offer more visibility at higher computational cost—it's up to the user.
We do not plan to release dev sets for each task, as this benchmark is strictly designed for zero-shot evaluation. Model development and training are left to the developers.
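To make the multi-scope structure above concrete, here is a purely illustrative sketch of how a developer might filter the benchmark task list by scope. All names (`SCOPES`, `select_tasks`, and the field names) are hypothetical and not part of the released benchmark code.

```python
# Illustrative only: scope definitions mirroring the three leaderboard types above.
SCOPES = {
    "scope-1": {"modalities": None, "clusters": None},          # full-spectrum General-Bench
    "scope-2": {"modalities": {"image"}, "clusters": None},     # single-modality leaderboard
    "scope-3": {"modalities": {"image"}, "clusters": {"ocr"}},  # fine-grained task cluster
}

def select_tasks(all_tasks, scope):
    """Return the subset of benchmark tasks covered by the chosen scope (sketch)."""
    cfg = SCOPES[scope]
    return [
        t for t in all_tasks
        if (cfg["modalities"] is None or t["modality"] in cfg["modalities"])
        and (cfg["clusters"] is None or t["cluster"] in cfg["clusters"])
    ]
```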
Q3. The assumption that synergy enables generalists to outperform SoTA specialists is unrealistic.
A: We understand the concern regarding synergy and its comparison to SoTA specialists.
As explained in Appendix C.3 (“Rationality of Scoring Relaxation”), our design follows two steps:
(1) Define synergy as the core metric for capability levels;
(2) Propose a practical scoring method, given real-world model training constraints.
Ideally, synergy could be defined as a model performing better on a joint task distribution than on each constituent task distribution separately. However, isolating such distributions is infeasible, as large generalist models are already jointly trained on many tasks. Retraining to cleanly separate task spaces is impractical.
Thus, we relax the evaluation: we treat cases where generalists match or exceed SoTA specialist performance in a task (without in-domain fine-tuning) as indirect but valid signals of synergy—i.e., effective cross-task/modality generalization.
This is supported by multiple studies showing generalists can outperform fine-tuned specialists:
- [1] shows GPT-4, via prompt engineering alone, achieves SoTA in medical QA benchmarks without fine-tuning.
- [2] shows OpenMedLM, via prompting techniques, surpasses prior fine-tuned open-source models on multiple medical benchmarks.
- Flamingo [3] outperforms fine-tuned specialist models on six vision-language tasks.
These findings support our core assumption: the stronger a model’s synergy capability, the more likely it is to surpass SoTA specialists when synergy is effectively activated. This avoids costly pairwise modeling and enables practical, scalable synergy evaluation.
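For illustration, this relaxed criterion can be written as a simple per-task check. The sketch below uses hypothetical function and variable names and is not the paper's actual scoring implementation.

```python
def synergy_signals(generalist_scores, specialist_sota_scores):
    """Flag tasks where the generalist matches or exceeds the SoTA specialist
    without in-domain fine-tuning; each flag is read as an indirect synergy signal."""
    wins = {
        task: generalist_scores[task] >= sota
        for task, sota in specialist_sota_scores.items()
        if task in generalist_scores
    }
    return wins, sum(wins.values())
```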
[1] Can Generalist Foundation Models Outcompete Special-Purpose Tuning?
[2] OpenMedLM: Prompt Engineering Can Outperform Fine-Tuning in Medical QA
[3] Flamingo: A Visual Language Model for Few-Shot Learning
[4] Perceiver IO: A General Architecture for Structured Inputs & Outputs
[5] Segment Anything
Q4. Multimodal synergy does improve language—models like Vicuna, Qwen2, and LLaMA benefit from image-text data.
A: We appreciate this feedback and apologize for the unclear statement. We do not deny that multimodal data can improve language understanding. Our point is more specific: such improvement has not yet enabled models to outperform SoTA NLP specialists on core language tasks.
There is a clear distinction between enhancing language performance and exceeding SoTA NLP models. While models like Vicuna, Qwen2, and LLaMA show better results with image-text pretraining, our large-scale evaluation shows they still fall short of outperforming fine-tuned language specialists. Therefore, our statement does not contradict existing evidence. But for sure we will refine the statement for clarity in the revision.
Q5. Typos: Line 60 “Yhe” -> “The”
A: Thanks! We will correct this.
This paper introduces a five-tier General-Level framework that assesses multimodal generalists based on their synergy across comprehension, generation, and cross-modal interactions. Also, inspired by autonomous driving grading, it proposes a new benchmark, General-Bench, covering over 700 tasks and 325K instances across diverse modalities. Evaluation of 100+ state-of-the-art models reveals that most multimodal large language models (MLLMs) struggle to achieve true cross-task and cross-modal synergy. Authors show many interesting insights via their framework and benchmark, highlighting significant challenges on the journey toward genuine AGI overall progress.
Questions for Authors
I have a few minor questions:
- Do the skills in Tables 3–7 refer to meta-tasks, corresponding to the specific tasks shown in Appendix Table 103? Given the enormous scale and hierarchical structure of the dataset, I couldn’t fully grasp this aspect from the paper.
- In Table 9, I noticed that some pure language LLMs are included. Since the evaluation targets MLLMs, why evaluate the performance of LLMs?
- Although explanations may have been provided in the experiments, I remain curious: why did the GPT series models fail to achieve leading rankings across all levels, and why do only three models have scores at Level 4? Does this result seem reasonable, as it appears somewhat counterintuitive to me?
Claims and Evidence
- The authors claim they have developed a new evaluation framework and have indeed proposed a completely new theoretical basis for it.
- They assert that they have introduced the most comprehensive and largest-scale benchmark dataset to date; subsequent comparisons with other datasets clearly demonstrate that its scale and scope exceed those of existing benchmarks.
- The authors contend that current MLLMs still face numerous issues that existing benchmarks fail to evaluate, and they have verified these problems in their experiments.
Methods and Evaluation Criteria
This is a benchmark paper. The authors propose a new evaluation approach for multimodal generalists (MLLMs/agents), focusing on the models' synergy effects across comprehension, generation, and cross-modal interactions. They have also introduced an entirely new benchmark dataset to evaluate over 100 MLLMs from different perspectives and methodologies. Further, the authors provide extensive and detailed information in the appendix—nearly 300 pages—to substantiate the reliability of both the evaluation framework and the dataset.
Theoretical Claims
The authors propose a five-tier General-Level evaluation framework that incorporates innovative theoretical contributions. Their core claim is that current benchmarks for multimodal generalists or MLLMs merely compare performance across individual tasks, which fails to fully assess the true capabilities of these models. Consequently, they introduce a new evaluation approach based on the synergy effects of MLLMs in comprehension, generation, and cross-modal interactions. Further, to validate the soundness of their evaluation framework, the authors provide extensive mathematical proofs in the appendix, which I have reviewed and found to be both mathematically correct and robust.
Experimental Design and Analysis
The authors conducted extensive evaluations on over 700 multimodal tasks for more than 100 MLLMs. The assessments include individual task results, meta-task outcomes, the number of supported tasks, comparisons where models surpass state-of-the-art specialists, and the final ranking of models across different levels. Extensive visualization analyses also reveal the performance preferences of various models, all of which are both interesting and insightful. I found the experimental scope to be vast, with validation approaches and perspectives that are both reasonable and comprehensive.
Supplementary Material
The supplementary material is detailed and spans nearly 300 pages, providing extensive information for a comprehensive understanding of the work, including:
- All the evaluated multimodal large models
- The complete experimental results across 700 tasks
- The full ranking of the large models at each level
- Various visualization analyses
- Extensive theoretical proofs
- A detailed extended introduction to the benchmark data
Relation to Prior Literature
I believe there are two core contributions:
- The authors introduce an entirely new perspective for evaluating the rapidly increasing number of MLLMs. Instead of simply comparing performance across various tasks, they propose a capability grading system akin to that in the autonomous driving industry, based on the core idea of synergy. This approach is poised to revolutionize the field of MLLM evaluation and guide the development of the MLLM community.
- I think the authors have developed an ultra large-scale evaluation benchmark for MLLMs that encompasses 145 skills across more than 700 tasks with over 325K samples, involving five common modalities and covering 29 domains. This unprecedented comprehensiveness and high quality ensure that the evaluation results should be extremely reliable, which, to my knowledge, might largely position this benchmark as the future standard for performance assessment in the field.
Missing Essential References
The references provided in the paper are adequate.
Other Strengths and Weaknesses
I appreciate this work because it makes a clear contribution to the community. I believe the significance and value of this paper will be revolutionary in the field. Currently, research on multimodal generalists—whether MLLMs or agents—is gaining increasing traction and is progressively oriented toward developing more powerful models, as the authors claim. An important question for the community is how these multimodal generalists should evolve: should they focus on achieving higher performance or on supporting a broader range of capabilities? Simply assuming that higher scores on various multimodal tasks equate to a more capable generalist is too simplistic. The authors reject this notion and propose an entirely new evaluation approach. They apply a five-level classification system, borrowed from the autonomous driving industry, to rate multimodal generalists, where each level represents a specific range of capabilities, with further differentiation possible within each level. This idea is truly eye-opening, as it is not only theoretically rigorous and correct—thanks to a carefully designed scoring algorithm ensuring key attributes—but also highly feasible. The authors have introduced a novel evaluation perspective and methodology that is set to revolutionize the field.
For the second point, I was struck by the enormous effort reflected in this work. For instance, the authors have contributed a new dataset—possibly the largest benchmark I have seen—which includes 145 skills across more than 700 tasks with over 325K samples, involving five common modalities and covering 29 domains. Its high quality ensures extremely reliable evaluation outcomes, and this benchmark is likely to become the future standard for performance assessment in the field. The team also evaluated 100+ current state-of-the-art MLLMs. Lastly, the paper, including an appendix spanning nearly 300 pages, provides exceptionally detailed information, which I find impressive.
Other strengths include the meticulous craftsmanship of the paper; both the writing and the visual presentation (such as the organization and visualizations) are of high quality. The paper is exceedingly detailed, providing necessary detail, and the experimental findings and conclusions are both fascinating and highly instructive for the community.
As for potential weaknesses, the only concern I can identify is that the authors might need to further strengthen their guidance for the community by providing clearer directions for future research. Although they include a "Limitations and Future Investigation" part in Section 7, I feel it could be even more detailed, as noted in the following comments.
Other Comments or Suggestions
I believe the authors could offer further guidance from multiple perspectives to the future research community on how to steer the development of MLLMs to achieve higher performance within the General-Level evaluation framework.
We greatly appreciate the reviewer’s recognition of our work and the thoughtful, constructive feedback. Below, we provide point-by-point responses to each comment. Hope you can reevaluate our work if you feel the response is effective and useful.
Q1. The “Limitations and Future Investigation” section is helpful, but future research directions could be elaborated further.
A: Thank you for this helpful suggestion. In addition to the Limitations and Future Investigation in Section 7, we have also included a more actionable guide in Appendix C.5: “Path to Advancing Higher in General-Level”. This section outlines concrete guidelines for how future MLLMs can progress through the levels in our framework, offering clear directions for advancing synergy, modality coverage, and generalization—key factors in moving toward Level-5 multimodal generalists.
Q2. Do the “skills” in Tables 3–7 refer to meta-tasks? Are they directly mapped to the specific tasks in Appendix Table 103?
A: Not exactly: the “skills” listed in Tables 3–7 do refer to meta-tasks, but they do not correspond one-to-one with the specific tasks in Appendix Table 103. Instead, each skill (i.e., meta-task) includes multiple specific tasks. For example, the skill Crack Detection includes the specific tasks Tire Crack Detection and Road Crack Detection. This hierarchical organization allows us to abstract task capabilities at a higher level, making the benchmark both manageable and scalable.
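For illustration only, the hierarchy described above can be pictured as a simple mapping from skills (meta-tasks) to their specific tasks; the structure and names below are a sketch, not the benchmark's actual data layout.

```python
# Illustrative skill (meta-task) -> specific-task mapping, following the example above.
SKILL_TO_TASKS = {
    "Crack Detection": ["Tire Crack Detection", "Road Crack Detection"],
    # ... each skill in Tables 3-7 similarly groups several specific tasks
}
```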
Q3. Why are pure language LLMs included in Table 9, given that the focus is on MLLMs?
A: We intentionally included language-only LLMs in the comparison to provide a reference point. This helps readers assess the performance gap between unimodal LLMs and MLLMs on NLP tasks. It also highlights a key finding of our benchmark: multimodality has not yet enabled MLLMs to outperform SoTA LLMs on core NLP tasks, which is critical for understanding the current limitations of multimodal synergy and what it would take to reach Level-5 performance.
Q4. Why do GPT-series models not achieve top rankings across all levels? And why do only three models reach Level 4? These results feel somewhat counterintuitive.
A: This is an excellent question. Although GPT models demonstrate strong performance in several areas, our analysis shows that they often excel in specific task types, such as vision comprehension, where they behave like specialists. However, they lack broad modality and task support, and in many cases, do not even support certain modalities or task formats.
Our General-Level scoring framework is based on two core principles:
- The model should be a true generalist, i.e., capable across a wide range of modalities and task types.
- Synergy is the key metric—performance gains must come from meaningful cross-modal, cross-task integration.
GPT models, while powerful, do not consistently meet these criteria across all levels, which explains their limited presence at Level 4 and absence from the top in other levels. Thus, the result is not only reasonable but aligns with our framework’s goals: to reward generality and synergy, not just isolated task excellence.
Thanks a lot to the authors for the detailed explanations in response to my questions. I re-read the paper and would like to confirm that my overall evaluation remains unchanged—I continue to be very supportive of this work and am willing to raise my score.
That said, I was a bit surprised that GPT-4o did not achieve a higher ranking in the current version of the paper. For example, GPT-o1/o3 (and DeepSeek as well) have recently demonstrated quite strong long-chained reasoning capabilities, and I would have expected better performance. Also, the very recent updates to GPT-4o have shown impressive advancements in image generation (I bet you tried it).
Given this, I would suggest that the authors consider timely updates to the leaderboard, so that the rankings can more accurately reflect the evolving capabilities of the latest multimodal foundation models. If the goal of this work is to make a long-term contribution to the field, maintaining and updating the leaderboard over time would be essential.
We extend our heartfelt thanks to the reviewer. Your recognition and support are the greatest encouragement for us to continue refining our work. We will keep improving this paper, and more importantly, we are committed to maintaining this evaluation platform and turning it into the most beneficial resource for the multimodal large foundation model community.
Regarding the powerful capabilities recently demonstrated by models such as GPT-o1/o3, GPT-4o, and DeepSeek, we will definitely include these models in future evaluations. However, objectively speaking, we still do not expect them to make a significant improvement on our overall leaderboard. The fundamental reason lies in our General-Level scoring principle: our leaderboard (as its name highlights: towards multimodal generalist) prioritizes broader coverage across modalities and tasks, rather than only rewarding models that achieve expert-level performance in isolated capabilities or specific domains.
Fortunately, as mentioned in our rebuttal to Reviewer 7pDZ, we will further propose a multi-scope evaluation structure with corresponding leaderboards based on different scopes of capability:
- Scope-1: A full-spectrum leaderboard (the current version of General-Bench) covering all modalities and task types, intended for highly capable, general-purpose multimodal models.
- Scope-2: Modality-specific leaderboards that focus on a single modality, accommodating models that specialize in one area (i.e., modality-specific generalists).
- Scope-3: Finer-grained, task-cluster-specific leaderboards under each modality, designed for meta-task generalists with partial or specialized abilities.
All sub-leaderboard rankings are derived using our General-Level framework, allowing users to select their evaluation scope based on model capability, resource constraints, and intended impact. Those prioritizing rapid iteration and low cost can opt for smaller-scope evaluations, while more powerful models seeking higher visibility may choose to participate in full-scope rankings. We believe that the latest models you mentioned above like GPT-o1/o3, GPT-4o, and DeepSeek are likely to achieve leading rankings in Scope-3 or Scope-2 evaluations.
Anyway, thank you again for your great support!
The authors introduce General-Level, a comprehensive evaluation framework for multimodal generalist models that emphasizes synergy across tasks and modalities. It further presents General-Bench, an extensive benchmark covering over 700 tasks spanning various modalities to assess both comprehension and generation capabilities. The framework categorizes model performance into five levels, reflecting the progression from task-specific skills to cross-modal generalization critical for achieving AGI. Extensive experiments reveal that while current models show progress, significant gaps remain in true synergy and broad task support.
Questions for Authors
See Weakness.
Claims and Evidence
This paper is quite extensive in both length and content, and thus presents a lot of scientific claims. Many claims are backed by extensive experimental results and comprehensive benchmark data, particularly regarding the framework’s ability to differentiate models based on their task and modality support. Other claims, such as those concerning the detailed characteristics of the datasets, are supported by sufficient details. Also, the findings and conclusions in the experimental section are validated and supported by corresponding experimental data. In particular, regarding the general-level properties, the authors provide ample mathematical proofs in the appendix. But I think the normalization and metric mapping methods are mentioned without thorough empirical validation, leaving their effectiveness in accurately comparing heterogeneous tasks not 100% substantiated.
Methods and Evaluation Criteria
The authors propose a completely new evaluation approach, called the General-Level framework. The proposed methods and evaluation criteria, in my opinion, are innovative and largely appropriate for assessing multimodal generalist capabilities. The multi-level General-Level framework and the expansive General-Bench dataset provide a comprehensive way to capture both comprehension and generation across various modalities. Overall, I have not found any obvious issues with the validation methods or processes.
Theoretical Claims
The paper’s theoretical claims, especially those regarding the “synergy effect” that underpins the proposed 5-layer evaluation framework, are more conceptual than rigorously proven. Their core argument is that current benchmarks for multimodal generalists or MLLMs only compare performance across different tasks, failing to fully assess these models’ true capabilities. The authors provide some mathematical proofs in the appendix concerning certain properties at the General level. I examined the rationale for defining synergy levels and the assumptions underlying the ability of multimodal generalists to outperform specialized models.
Experimental Design and Analysis
The experimental design is very extensive, evaluating over 700 tasks with diverse modalities, which provides a broad view of multimodal capabilities. The analysis comparing generalists against SoTA specialists is methodically structured. Also the paper includes very appealing visualization-based analyses. But there might be several points to be further improved. 1) The synergy effect design would benefit from more ablation studies to isolate the contribution of individual components. 2) While the large-scale benchmark is impressive, some analyses appear to emphasize breadth over in-depth evaluation of failure cases, which limits understanding of the underlying challenges.
Supplementary Material
Yes, I reviewed the supplementary material, which looks remarkably comprehensive, extending to nearly 300 pages and offering an abundance of details that give a thorough understanding of the work. It includes information on every evaluated multimodal large model, complete experimental results for 700 tasks, a detailed ranking of these models at each level, an array of visualization analyses, rigorous theoretical proofs, and an extensive introduction to the benchmark data.
Relation to Prior Literature
The paper’s contributions are deeply rooted in the evolving landscape of MLLMs or multimodal generalist models. It provides a novel perspective on understanding and evaluating the capabilities of multimodal generalists. The notion of “synergy”, a central theme in the paper, probably builds on earlier ideas of cross-modal joint learning or transfer learning, where knowledge transfer between modalities should be explored in various studies. But to my knowledge, there isn’t any prior related work that evaluates MLLMs in this way. Furthermore, the comprehensive benchmark (General-Bench) and the tiered evaluation framework resonate with existing efforts in creating standardized tests for model performance, such as LVLM-eHub, MME, and others. By synthesizing these ideas, I think the paper provides a structured method to compare and improve multimodal generalists, advancing the broader conversation on achieving AGI.
Missing Essential References
There are no essential related works missing from the citations.
Other Strengths and Weaknesses
In the above fields, I have thoroughly emphasized the value and strengths of this work. Overall, I believe this work will bring a significant revolutionary impact to the MLLMs community, potentially leading and even changing the current development direction of large multimodal foundation models.
However, I do have a few minor concerns (which might be potential weaknesses) that I hope the authors can address or clarify during the rebuttal phase:
- The experimental study may lack sufficient discussion of failure cases with detailed analyses to fully summarize the common errors made by existing MLLMs. Such insights could help guide future research directions.
- The paper assumes effective cross-modal synergy without adequately addressing integration and robustness issues; I think future work should focus on these aspects for improved model interoperability.
- While the paper posits that non-language modalities can enhance language intelligence, the experimental evidence for this reverse synergy might be largely absent.
Other Comments or Suggestions
See Weakness.
We greatly appreciate the reviewers' recognition of our work and the valuable feedback provided. Below, we provide detailed responses.
Q1. But I think the normalization and metric mapping methods are mentioned without thorough empirical validation, leaving their effectiveness in accurately comparing heterogeneous tasks not 100% substantiated.
A: We would like to clarify that the proposed normalization and metric mapping techniques are primarily designed to enable fair and consistent comparisons across heterogeneous tasks. This is necessary because certain evaluation metrics, such as FID, are not naturally bounded within the [0, 1] range, and lower values indicate better performance. While it is reasonable to directly compare models using FID within a single task, averaging performance across multiple tasks becomes problematic when combining metrics with different scales and monotonicities—such as FID (lower is better) and ACC (higher is better).
To address this, we carefully designed a normalization and metric mapping strategy that brings different evaluation metrics into a unified range. We also took special care in choosing appropriate scaling factors to ensure that the transformed scores remain faithful representations of the original quality measurements. For example, an FID of 25 and an FVD of 100 are mapped in a way that preserves their relative performance levels. This normalization process allows us to compute an overall average performance across tasks without introducing unintended biases.
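A minimal sketch of the kind of mapping described above is shown below. The functional form and scale constants are hypothetical placeholders chosen only to illustrate how an unbounded, lower-is-better metric such as FID or FVD can be folded into the same [0, 1], higher-is-better range as ACC; they are not the paper's actual mapping parameters.

```python
import math

def normalize(value: float, metric: str) -> float:
    """Map heterogeneous metrics onto a shared [0, 1], higher-is-better scale (sketch)."""
    if metric in {"ACC", "F1"}:           # already bounded in [0, 1], higher is better
        return value
    if metric == "FID":                   # unbounded, lower is better
        return math.exp(-value / 25.0)    # hypothetical scale factor
    if metric == "FVD":
        return math.exp(-value / 100.0)   # hypothetical scale factor
    raise ValueError(f"unsupported metric: {metric}")

# With these placeholder factors, FID = 25 and FVD = 100 both map to exp(-1) ~ 0.37,
# i.e., comparable normalized levels, echoing the example in the answer above.
```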
Q2. The synergy effect design would benefit from more ablation studies to isolate the contribution of individual components.
A: Thank you for the suggestion. We would like to point out that a detailed analysis and discussion of the synergy effect in our multimodal generalist framework—across skills, comprehension and generation capabilities, and different modalities—is provided in Appendix B.7. We kindly refer the reviewer to that section for a comprehensive breakdown.
Q3. While the benchmark provides a broad and impressive evaluation across diverse tasks, it currently lacks in-depth analyses of failure cases, which could limit insights into the specific challenges faced by MLLMs.
A: Thank you for raising this important point. The core motivation behind General-Bench is to provide a comprehensive and systematic evaluation of MLLMs that goes far beyond conventional VQA-style assessments. To this end, we deliberately designed the benchmark to cover a wide range of task formats, modalities, and skills. This enables us to quantitatively and qualitatively measure MLLM performance from multiple perspectives, offering valuable insights into their generalization capabilities and limitations.
That said, we fully agree with the reviewers that in-depth failure case analysis is crucial for understanding model weaknesses and guiding future improvements. While our current focus has been on establishing broad coverage and performance patterns across dimensions, we acknowledge the value of qualitative case studies. In future work, we plan to include more detailed failure analyses and case-based evaluations to better illuminate the specific challenges MLLMs face and foster more targeted research efforts in this direction.
Q4. The paper assumes effective cross-modal synergy without adequately addressing integration and robustness issues; I think future work should focus on these aspects for improved model interoperability.
A: We thank the reviewer for this constructive suggestion. In Appendix B.7, we provide a preliminary analysis of cross-modal synergy, particularly in terms of whether different models have learned synergy effects across modalities. We agree, and in the revision we will conduct a deeper exploration of integration mechanisms and robustness to modality-specific noise or failure to improve model interoperability.
Q5. While the paper posits that non-language modalities can enhance language intelligence, the experimental evidence for this reverse synergy might be largely absent.
A: We appreciate the reviewer’s comment and would like to clarify a possible misunderstanding regarding our claim. Our primary argument is that language serves as a strong prior to enhance other modalities, rather than the reverse. While we do not deny that non-language modalities can, to some extent, aid language understanding, our empirical findings suggest that current multimodal signals are not yet capable of boosting language performance beyond that of SoTA NLP models. In other words, we distinguish between helping language understanding and surpassing NLP SoTA performance through multimodal input—two very different thresholds. Our intention was to highlight this gap and motivate future work toward developing truly synergistic models where multimodal information can meaningfully and reliably elevate language intelligence beyond what is achievable through text alone.
Thank you for the authors’ response.
Once again, I find the ideas presented in this paper both meaningful and thought-provoking—especially the discussion around cross-modal synergy. I generally agree with the perspective that most current MLLMs achieve a form of pseudo-intelligence by leveraging the emergent capabilities of language models, rather than realizing true multimodal intelligence.
In my own team, we’re also exploring ways to enable native cross-modal emergent intelligence—a truly foundational and native form of multimodal intelligence. Within this framework, one of the key goals is to observe symmetric cross-modal synergy, where multimodal inputs not only benefit from language but also actively enhance language intelligence itself.
In this regard, I'm curious, have the authors considered how the concept of native multimodal foundation models might be integrated into your general-level evaluation framework?
By the way, I’d be happy to champion this paper.
We sincerely thank the reviewer again for your recognition and support, which is the most crucial driving force behind our continued efforts to advance and maintain this grand benchmark. We will keep investing resources to ensure the long-term maintenance of this open evaluation platform.
Regarding the idea you raised about “how to achieve a truly native multimodal foundation that enables native bidirectional cross-modal synergy (e.g., multimodality synergizing language intelligence)”, we believe it's a very thought-provoking and trending question.
We actually touched upon some preliminary discussions related to this topic in the paper. Overall, we firmly confirm that achieving Level-5 multimodal generalist intelligence must involve this kind of bidirectional or symmetric cross-modal synergy, where different modalities and tasks can assist and enhance each other.
From a technical perspective, given the current SoTA research landscape of the MLLM community, we believe two key aspects need attention:
- Model Architecture: It is essential that an MLLM treats all modalities equally, including adopting a universal approach to task modeling. We believe that using an autoregressive framework, combined with a unified tokenization method across modalities, is one of the most promising approaches for unifying both understanding and generation across modalities. The AR solution has also attracted considerable debate in the community recently.
- Training Paradigm: The training process must involve very large-scale data from all modalities, not just language. Moreover, the training should explicitly model cross-modal reasoning. For example, a modality-interleaved reinforcement learning mechanism (e.g., as used in long-chained reasoning LLMs) could be employed to facilitate mutual enhancement and learning among different modalities.
Once again, thank you so much for your continued support!
The paper presents a new General-Level framework and the General-Bench benchmark, a hierarchical evaluation system for multimodal large language models, evaluating across comprehension, generation, and cross-modal interactions. The benchmark, code, and evaluation framework will be open-sourced to support community advancements. Reviewers praised the work’s ambitious scale, innovative evaluation framework, and transformative potential for guiding MLLM development. Reviewers also suggested improvements in synergy definition clarity, computational practicality, and failure analysis depth. The authors addressed concerns by clarifying hierarchical scoring logic, proposing modular evaluation scopes, and committing to long-term maintenance. Despite minor issues, reviewers unanimously recognized the paper’s technical contribution and community impact. Given the resolution of concerns and strong support for its contributions, the paper is recommended for acceptance.