PaperHub
4.9 / 10
Poster · 4 reviewers (scores 3, 3, 2, 3; min 2, max 3, std 0.4)
ICML 2025

Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24

Abstract

Keywords
Graph, Visual Reasoning, Reasoning, LLM, Multimodal LLM

Reviews and Discussion

Review
Rating: 3

This paper introduces a new dataset, named Visual Graph Arena (VGA), specifically designed to evaluate and enhance deep models for visual graph analysis. VGA comprises six distinct tasks: Easy Isomorphism, Hard Isomorphism, Hamiltonian Path, Shortest Path, Hamiltonian Cycle, and Biggest Chordless Cycle. With its diverse visual layouts, VGA effectively challenges models to recognize invariant conceptual properties despite variations in graphical layout.

Questions for Authors

Is it possible to introduce more different layouts for the same visual graph?

Claims and Evidence

Empirical evidence demonstrates that models consistently struggle with path and cycle tasks across diverse layouts, whereas human participants achieve near-perfect accuracy. This supports the claim that current multi-modal large language models face significant challenges in recognizing and reasoning about the same concept when presented in varying visual forms.

Methods and Evaluation Criteria

This paper does not propose a method.

Theoretical Claims

This paper does not involve any theoretical claims.

Experimental Design and Analysis

The authors compare the performance of ViT-based models, CNN-based models, and closed-source MLLMs on the proposed dataset. However, they do not fine-tune open-source MLLMs, such as LLaVA and Qwen-VL, which would have enabled a more comprehensive comparison. Incorporating such models could provide valuable insights into the adaptability of open-source alternatives.

Additionally, the authors emphasize the significance of varying visual representations of identical concepts across different real-world domains. However, they do not present any empirical evidence to support this claim. At a minimum, conducting preliminary experiments on real-world datasets would be feasible and could strengthen their argument.

Supplementary Material

I reviewed the appendix section, which provides comprehensive details on the implementation of the proposed dataset, including template prompts and a clear visualization of the dataset structure.

Relation to Prior Work

The proposed dataset could enhance AI models' ability to recognize and reason about visual representations of identical concepts across different scientific and technical domains, such as electrotechnics and chemistry.

Missing Important References

The core argument of this paper closely aligns with that of [1], which emphasizes the importance of layout augmentations in visual graphs. In [1], the authors identified layout augmentation as the most effective method for improving AI models' recognition and reasoning abilities on visual graph reasoning, despite variations in visual form. In other words, [1] has already demonstrated the significance of this concept. While the proposed dataset includes one additional task (i.e., Isomorphism) that was not covered in [1], it is recommended to cite [1] to acknowledge its contributions.

[1] Wei, Yanbin, et al. GITA: Graph to visual and textual integration for vision-language graph reasoning. NeurIPS 2024.

Other Strengths and Weaknesses

Weaknesses:

The contribution of this paper appears marginal due to significant overlap with previously published work [1]. While I appreciate the authors' effort in incorporating the performance of human participants, the novelty remains limited. Additionally, to enhance the potential value of the proposed VGA dataset, I recommend providing more empirical evidence across diverse domains to better differentiate it from prior works [1, 2, 3].

[1] Wei, Yanbin, et al. GITA: Graph to visual and textual integration for vision-language graph reasoning. NeurIPS 2024.

[2] Deng, Yihe, et al. GraphVis: Boosting LLMs with Visual Knowledge Graph Integration. NeurIPS 2024.

[3] Hu, Baotian, et al. VisionGraph: Leveraging Large Multimodal Models for Graph Theory Problems in Visual Context. ICML 2024.

Other Comments or Suggestions

For Figure 2, it is recommended to provide additional explanations for each layout, as not all readers may be experts in this field.

Author Response

Thank you for your thoughtful comments and for the time you spent on the detailed review. We hope to address your concerns below:

More Baselines

We acknowledge your suggestion to include more baseline models. We agree this would strengthen our analysis and have already begun running these additional experiments; our GPUs are currently processing them. Given that our framework requires six distinct tasks per model, and considering our computational constraints during the ICML rebuttal period, these experiments will take some time to complete. We will update you with the results during the discussion period as they become available. Thank you for the valuable recommendation.

Real-world domains:

Regarding our emphasis on the significance of varying visual representations of identical concepts across different real-world domains, we would like to be clearer: the ability to understand concepts is a fundamental capability expected from an "intelligent" agent, and this makes investigating our models' capability in this regard significant. The simplest instance of conceptualization for which we could build a dataset was graphs. Our work deliberately focuses on graphs as a clear, well-defined example of such visual representation challenges. Upon testing the existing models on these basic graph tasks, especially isomorphism, we saw them struggle significantly. To be clear, we do not in any way claim that our dataset directly enhances model conceptualization in other scientific domains like chemistry. We believe that these, like simple graphs, are other instances of conceptualization, only more complex. We hope this work will inspire future research to introduce diverse real-world representation datasets for conceptualization.

Regarding similarities with [1]:

We thank the reviewer for bringing this work to our attention. We missed this reference as our creation of VGA took several months, and at the time we began, [1] was not yet published. We will certainly add a comprehensive discussion of it to our final paper.

Our work is fundamentally distinct from [1] in several important ways:

  1. Representation approach: [1] inputs textual graph data (node and edge lists) in addition to images. In contrast, our work focuses exclusively on visual conceptualization without additional text information. More importantly, the edge list they provide as text is layout independent, which is in striking contrast with the whole idea of our paper.

  2. Purpose of layout variation: We deliberately employ different layouts in training vs. testing (with no two graphs being the same) to assess models' conceptualization abilities and to prevent pattern memorization between training and testing. Conversely, [1] uses different layouts as an augmentation technique to expand training data, while also providing layout-invariant textual representations. This is essentially the opposite of what we intend to do.

  3. Human interpretability: The nature of the plotted data and the way it is plotted differ substantially between the two works. We carefully selected layouts and representations to ensure they remain fully understandable by human subjects, as confirmed by our human evaluation studies. This was critical since our approach relies solely on visual information. By contrast, [1] employs layouts and graphs that are often visually complex and challenging even for humans to interpret without supplementary textual information. For reference, see the example of graph 100 rendered in different layouts from [1].

Regarding [2]: we have already cited it in L132. We will include [1] and [3] in our final version.

Regarding your question about additional layouts: we carefully chose the current layouts from standard graph representation methods that remain human-understandable. For instance, we deliberately avoided circular layouts, since they become visually challenging even for humans as the number of nodes increases. We are currently working on a carefully hand-crafted test set with special graph representations to publish on the dataset website in the future. However, since these are manually created, it takes time to develop a collection that is both rich enough and of high quality.

Reviewer Comment

Thank you for the detailed response. However, I disagree with the discussion on the "Representation approach". Specifically, reference [1] performs augmentation experiments using only images with different layout augmentations and the necessary prompting text (this paper also needs to input such prompt information into the models). I recommend that the authors review [1] carefully to fully appreciate its methodology.

Regarding the "Purpose of layout variation", I concur with the authors' perspective. It is indeed an interesting attempt to intentionally employ different layouts during training and testing.

Lastly, I would like to discuss the application of visual graphs in real-world scenarios and human interpretability. Often, there is a tension between the two: real-world graph data are typically large, containing numerous nodes and edges, whereas visual graphs that are easily interpretable by humans tend to be relatively simple. As the examples provided by the authors demonstrate, increasing the number of edges or nodes can lead to visually complex graphs that are difficult to interpret.

In conclusion, I recommend that the authors introduce novel elements that clearly differentiate their work from previous studies [1, 2, 3]. This approach will enhance the overall contribution of the paper and help it meet the high standards expected at a top-tier conference. Personally, maybe it is a promising future direction to consider integrating attribute information directly into visual graphs.

Author Comment

Thank you for your continued engagement with our paper. We would like to address your concerns. We first confirm that we have carefully reviewed paper [1] and fully appreciate its methodology. Our responses here are not to downplay paper [1] but to clarify the significant differences between our paper and [1].

  • Layout as data augmentation: Paper [1] has clearly stated in many instances that they use the different layouts as data augmentation. For example, in Section 3.3: "Visual graphs generated for the same graph G can be considered as a unique data augmentation technique."
    This is far from the whole idea of our paper and shows how fundamentally different the two papers are. While paper [1] uses layouts as a tool for augmenting data, our paper is about conceptualization across layouts, and trainers are strictly prohibited from using test layouts as augmentation.
  • Graph data as text: We think there is a misunderstanding here. The difference is not the natural-language prompt text; the difference is that they give the graph information itself as text to the model. They express the graph as a node and edge list in the text input. Not only do we not give text information to the model, but more importantly, the edge-list data they provide is layout independent, meaning the model does not even need to look at the visual layouts to understand the graph.
    We believe it is obvious how different this is from our paper. We are testing visual conceptualization over layouts, while they provide layout-independent textual information as input. For instance, the graph textual input for both graphs in our most important task, isomorphism, would be the same (a small illustration follows this list).
  • Real-world scenarios and visual interpretability: Visually interpretable graphs are extensively used in education, research papers, textbooks, and technical documentation. These are legitimate real-world applications where conceptualization across visual layouts is critical. Our work deliberately focuses on visually interpretable graphs because they are well suited to testing conceptualization: they are simple enough for humans to understand intuitively yet complex enough to reveal AI limitations. Both papers are valuable; we are simply examining different aspects of reasoning, with our work specifically targeting the conceptualization gap between human and AI understanding. Paper [1] is not doing that, and this is clear from the images it produces.
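
To make the layout-independence point concrete, here is a minimal illustration (our own sketch, not code from either paper): two drawings of the same graph produced by different layout algorithms share an identical edge list, so a model given that edge list as text never has to rely on the image.

```python
# Minimal sketch (not from either paper): the textual edge list of a graph is
# identical no matter which layout is used to draw it, so a model receiving
# the edge list as text can ignore the visual layout entirely.
import networkx as nx

G = nx.petersen_graph()

pos_kamada = nx.kamada_kawai_layout(G)    # one visual layout
pos_random = nx.random_layout(G, seed=0)  # a very different visual layout

# The layout only changes node coordinates; the "graph as text" is unchanged.
print(sorted(G.edges()))                  # same edge list for both drawings
print(pos_kamada[0], pos_random[0])       # only the coordinates differ
```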

We strongly assert that our paper is not only novel but fundamentally different from [1,2,3] (arguably more different than these three cited papers are from each other) in its core research question, methodology, and findings. The conceptualization problem we explore is a new direction not addressed in prior work. Our experimental design deliberately evaluates this specific capability, our human comparison reveals a striking cognitive gap absent from previous studies, and our analysis uncovers fundamental limitations in current AI systems.

For clarity on our paper's focus, we direct the reviewer to our "Defining conceptualization" and "Why Graphs?" sections on the first page. In this work, graphs are just a "first step" or "toy example" to explore the bigger challenge of conceptualization which is a subset of out-of-distribution generalization.

We appreciate your suggestion about possible future work on textual conceptualization.

Two additional experiments are ready; our machines are still running to conduct more experiments with more base models. Interestingly, SigLIP is the first model that does not fail the easy isomorphism task, scoring 4 percentage points above a random agent:

| Model  | Isomorphism 1 | Isomorphism 2 | Path 1 | Path 2 | Cycle 1 | Cycle 2 |
|--------|---------------|---------------|--------|--------|---------|---------|
| SigLIP | 54.4          | FAIL          | 59.5   | 25.2   | 63.5    | 28.0    |
| DINOv2 | FAIL          | FAIL          | 56.8   | 36.4   | FAIL    | 31.1    |

In light of these clarifications, we respectfully request that you reconsider your evaluation score if possible. Our paper addresses a novel aspect of visual reasoning and introduces a task that is simple for humans yet one that SOTA models fail.

We remain open to further discussion and are happy to address any additional questions you may have.

Review
Rating: 3

The paper proposes a new benchmark to evaluate whether visual models can understand underlying concepts across different visual appearances. The benchmark contains generated graphs in three layouts: Random, Kamada-Kawai, and Planar. The authors train the visual models on the Kamada-Kawai layout, then ask them to determine whether two graphs are isomorphic, to find a required path, and to find a required cycle.

Questions for Authors

How will GPT-o1 perform when given the graphs described in natural language?

Claims and Evidence

The paper claims the benchmark can evaluate visual models' ability of visual abstraction. The benchmark can mostly support the claim.

  • A model needs to understand the concept and logic behind the graph image to predict the correct answer.
  • The concepts are mostly related to logic and graphs rather than the image itself. For example, one can perform a similar evaluation by describing the graph in natural language, without images. Visual concepts are more related to color, material, and object relations in a natural image, whose details are hard to describe fully within limited text.

Methods and Evaluation Criteria

The proposed benchmark evaluates models based on yes/no classification and multiple-choice accuracy. Additionally, all training and test samples have the same candidate choices. Thus, the evaluation criteria can correctly reflect the models' ability.

Theoretical Claims

No theoretical claims or proofs.

Experimental Design and Analysis

The experimental designs are sound. The paper evaluates several closed-source MLLMs, as well as open-source models that have been fine-tuned on the large provided training set. The results can show the new difficulties for current vision models.

Supplementary Material

I have reviewed the supplementary material for more implementation details and dataset examples.

Relation to Prior Work

The proposed dataset can inspire future work on extracting logical information represented in different visual forms.

Missing Important References

The paper only cites graph datasets but neglects datasets such as Geometry problems, Table QA, and Science Figure QA that also require vision models to perceive underlying information. Authors may include some of these works and explain the differences.

Other Strengths and Weaknesses

The proposed dataset contains large numbers of training samples, which can evaluate the effectiveness of a training method for disentangling visual and logical information.

Other Comments or Suggestions

N/A

Author Response

We thank you for your thoughtful feedback and positive evaluation of our paper. We're happy that you found our benchmark sound and valuable for evaluating visual models' ability to understand concepts across different visual representations. Below we try to address your concerns.

Additional datasets to discuss

Regarding the datasets you mentioned (Geometry problems, Table QA, and Science Figure QA), we appreciate the recommendation. We have discussed and cited FigureQA in our related work section, ~L160 (Kahou et al., 2017). We will expand our discussion to include the other suggested datasets and clarify the key differences between our approach and these works. Our benchmark uniquely focuses on isolating the specific challenge of conceptualization across visual representations using graph-theoretic concepts as the test case.

Graphs described in natural language

Your question about how GPT-o1 would perform when given graphs described in natural language is insightful. While this is an interesting direction, our focus was specifically on the visual format. Graphs visually represent complex data and relationships in ways that are intuitive for humans to understand without textual descriptions. The visual aspect is critical to our research question about conceptualization across different visual layouts. There is indeed existing work on textual graph representation for language models, such as Fatemi et al.'s "Talk Like a Graph: Encoding Graphs for Large Language Models" (ICLR 2024), which explores different textual encoding strategies for graph reasoning tasks. However, our work deliberately targets the visual domain to evaluate a different capability: the ability to recognize the same underlying concept despite variations in visual form, which is fundamental to human visual reasoning. We hope to revise the paper to include this reference and clarify this focus.
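
For illustration only (the prompt wording below is our own, not taken from the paper or from Fatemi et al.), a textual encoding of a graph along the lines the reviewer asks about might look like this:

```python
# Hypothetical sketch of a natural-language graph encoding; the prompt format
# is our own illustration, not the benchmark's or Fatemi et al.'s templates.
import networkx as nx

def graph_to_text(G: nx.Graph) -> str:
    edges = ", ".join(f"({u}, {v})" for u, v in sorted(G.edges()))
    return (f"G is an undirected graph with {G.number_of_nodes()} nodes and "
            f"the following edges: {edges}. Does G contain a Hamiltonian path?")

print(graph_to_text(nx.cycle_graph(5)))
```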

Review
Rating: 2

This paper investigates multimodal models' 'conceptualization': the ability to recognize and reason about the same concept despite variations in visual form, a basic ability of human reasoning. The authors introduce the Visual Graph Arena (VGA), a dataset featuring six graph-based tasks designed to evaluate and improve AI systems' capacity for visual abstraction. VGA uses graphs rendered in diverse layouts to test reasoning independent of visual form. Experiments with state-of-the-art vision models (ViT, Swin Transformers, ConvNeXt) and multimodal LLMs (GPT-o1, Claude 3.5 Sonnet) reveal a striking divide: human participants achieved near-perfect accuracy (88-100%) across tasks, while models totally failed on isomorphism detection and showed limited success in path/cycle tasks.

update after rebuttal

I appreciate the authors' replies and the additional experiments. This paper is focused on evaluation and analysis, and my concern about limited baseline MLLM experiments still remains, which I think is very important for this evaluation paper, simply to show how different models behave. Therefore I will maintain my original score, but if this concern were addressed I would raise it to 3.

Questions for Authors

See above. Happy to raise score if questions are addressed.

Claims and Evidence

I generally like the idea of VGA. It is interesting to see how multimodal models learn graphs. But I wonder whether these graphs should be given in a 3D format instead of 2D, e.g., as in Figure 2?

Methods and Evaluation Criteria

More baseline multimodal language models should be included, especially ones designed for perceptual tasks, e.g., Phi-4 Vision, LLaVA-Video, Gemini 2.0 Flash, MetaMorph, etc. Also, testing on 20 examples seems less convincing because the number is small and the result is reported only as FAIL. The authors should also add more vision model baselines, e.g., DINO, SigLIP, CLIP, etc.

Theoretical Claims

N/A

Experimental Design and Analysis

See above. Also, in Table 2, it is a bit confusing which models are fine-tuned on the training set and which are not. A test set of 20 samples is not very convincing.

Supplementary Material

N/A

Relation to Prior Work

It is related to testing the graph understanding abilities of multimodal language models.

Missing Important References

N/A

Other Strengths and Weaknesses

Experiments are good. Since this dataset is synthetic by nature, I wonder whether training on it would hurt model performance on natural images a bit?

Other Comments or Suggestions

See above.

Author Response

We sincerely thank you for your thoughtful and detailed feedback. We appreciate the positive reception to the core idea of VGA. Below, we address each of the points raised:

3D format

You raise an interesting point about testing graphs in 3D format. Our current focus was on testing visual conceptualization across different 2D layouts (Kamada-Kawai, planar, random) as these represent common ways graphs are visualized in practice. Testing 3D representations would be a valuable extension; however, it would present significant technical challenges. Current top multimodal LLMs like ChatGPT and Claude only support 2D image inputs, with no native capability to process or reason about true 3D representations. This limitation would make direct testing of 3D graph comprehension infeasible.
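
As a rough illustration of what these 2D layout variations look like (a sketch using networkx, not the authors' actual generation pipeline), the same graph can be drawn under all three layouts:

```python
# Sketch (not the authors' generation pipeline): one graph drawn under the
# three 2D layouts mentioned above, showing how visual form varies while the
# underlying graph stays fixed.
import matplotlib.pyplot as plt
import networkx as nx

G = nx.wheel_graph(8)  # a small planar graph, so planar_layout is applicable

layouts = {
    "Kamada-Kawai": nx.kamada_kawai_layout(G),
    "Planar": nx.planar_layout(G),
    "Random": nx.random_layout(G, seed=1),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (name, pos) in zip(axes, layouts.items()):
    nx.draw(G, pos=pos, ax=ax, node_size=120)
    ax.set_title(name)
fig.savefig("same_graph_three_layouts.png")
```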

More Baselines

We acknowledge your suggestion to include more baseline models. We agree this would strengthen our analysis and have already begun running these additional experiments; our GPUs are currently processing them. Given that our framework requires six distinct tasks per model, and considering our computational constraints during the ICML rebuttal period, these experiments will take some time to complete. We will update you with the results during the discussion period as they become available. Thank you for the valuable recommendation.

Table 2 Clarification

Thank you for pointing this out. You're absolutely right that this needs clarification. In our study, all vision models (ViT, Swin, ConvNeXt) were fine-tuned on VGA's training sets, while the multimodal LLMs were evaluated in a zero-shot setting without fine-tuning. We will make this distinction explicit in the revised table, caption, and methodology section.
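
For concreteness, a minimal sketch of this kind of fine-tuning setup (the backbone choice and hyperparameters are illustrative assumptions, not the paper's exact configuration): a pretrained vision model whose classification layer is replaced with a two-way head for a yes/no VGA task.

```python
# Illustrative sketch only; model choice and hyperparameters are assumptions,
# not the paper's exact setup. A pretrained ViT gets a fresh 2-way head and is
# fine-tuned on a yes/no task such as isomorphism detection.
import timm
import torch

model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=2)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

images = torch.randn(8, 3, 224, 224)   # placeholder batch of rendered graph images
labels = torch.randint(0, 2, (8,))     # yes/no labels (e.g., "are the graphs isomorphic?")

logits = model(images)                 # shape (8, 2)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```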

Sample Size for LLM testing

We appreciate your concern about the sample size for LLM testing. Our 20-sample approach was sufficient to detect random-level performance in most cases; we expanded to 100 samples only where GPT-o1 showed promise, to obtain better precision. We adopted the "FAIL" designation from Tay et al. (2021) for cases where performance was not better than random guessing, as precise values near the random baseline offer minimal scientific insight. It is worth noting that querying these models is not free: accessing models like GPT-4o, Claude, and Gemini involves substantial API costs. Testing across different layouts means each model required ~160 experiments, and we have conducted nearly 900-1000 evaluations including unreported models (Claude Opus, Gemini 1); we also have to update the existing models to Claude 3.7 and o3-mini. Expanding to 50+ samples would significantly increase costs with low returns in clearly failing conditions. However, if the reviewer believes that larger sample sizes are essential for the scientific contribution of this paper, we would absolutely prioritize and carry out additional testing accordingly.
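
As a purely statistical illustration of this point (our own sketch, not the authors' procedure), an exact binomial test shows how far above the 50% baseline an accuracy must be before it is distinguishable from random guessing at n = 20 versus n = 100:

```python
# Statistical illustration (ours, not the paper's procedure): accuracies near
# the 50% baseline on yes/no tasks are hard to distinguish from chance,
# especially at small sample sizes.
from scipy.stats import binomtest

for n, correct in [(20, 12), (20, 15), (100, 55), (100, 65)]:
    p = binomtest(correct, n, p=0.5, alternative="greater").pvalue
    print(f"{correct}/{n} correct ({correct / n:.0%}): p = {p:.3f} vs. random guessing")
```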

Effect on model performance on natural images

This is an insightful question. Since we start with pretrained models and finetune them on VGA tasks—replacing the classification layer with a binary one—it inevitably alters the underlying weights. As such, it's unclear how to meaningfully assess "performance on natural images" post-finetuning. In practice, a multitask setup would likely be needed to preserve general visual capabilities. If the reviewer has thoughts on how to evaluate this more effectively, we’d be very interested to hear them.

We greatly appreciate your feedback, which will help us strengthen the paper. We believe addressing these points will improve the work and hope you'll reconsider your assessment based on our planned revisions.

Reviewer Comment

My major concerns remain unresolved: limited baselines and sample size. I will maintain my original score.

Author Comment

Dear Reviewer,

Thank you for your comment, as it allowed us to reply (OpenReview does not allow adding a response without a reviewer comment). We understand your concern and wanted to report that two additional experiments are ready; our machines are still running to conduct more experiments with more base models. Interestingly, SigLIP is the first model that does not fail the easy isomorphism task, scoring 4 percentage points above a random agent:

| Model       | Isomorphism 1 | Isomorphism 2 | Path 1 | Path 2 | Cycle 1 | Cycle 2 |
|-------------|---------------|---------------|--------|--------|---------|---------|
| SigLIP Base | 54.4          | FAIL          | 59.5   | 25.2   | 63.5    | 28.0    |
| DINOv2      | FAIL          | FAIL          | 56.8   | 36.4   | FAIL    | 31.1    |

Regarding sample sizes, we will run experiments with more samples for the closed-source models. Since these experiments cost money (especially o1), could you please let us know how many samples you would consider sufficient? We will add the results here once they are ready.


UPDATE

Dear reviewer, since you have not responded to our question regarding a sufficient number of experiments for the closed-source models, we had to make an independent decision and chose to conduct 100 experiments per task (1,800 in total). These experiments incurred personal costs, and it is financially challenging for us to increase the sample size beyond this point.

We also analyzed the cases where the models produced correct final outputs. However, in the strong majority of these instances, the underlying reasoning provided by the LLMs was fundamentally flawed despite arriving at the right answer. In conclusion, we find little evidence that these models perform better than a random agent on most tasks.

| Model             | Isomorphism 1 | Isomorphism 2 | Path 1 | Path 2 | Cycle 1 | Cycle 2 |
|-------------------|---------------|---------------|--------|--------|---------|---------|
| GPT-o1            | 50            | 48            | 53     | 54.8   | 66.6    | 25      |
| GPT-4o            | 49            | 55*           | 47     | 26     | 48      | 21      |
| Claude 3.5 Sonnet | 52            | 51            | 47     | 24     | 50      | 27      |
* For GPT-4o's 55% accuracy, we believe it is just random chance: examining the "true" outputs revealed that the reasoning was completely incorrect, and the model scored 49% on the easy isomorphism task.

In light of these updates, we want to respectfully request that you reconsider your evaluation score if possible.

Best regards, Authors

Review
Rating: 3

This paper presents visual graph arena (VGA), a multimodal dataset designed to evaluate and improve AI systems’ capacity for visual abstraction.

Although being straightforward for humans, the authors find that VGA is very challenging for current MLLMs: they totally fail on some of the tasks in VGA and show limited success on other tasks.

Through deeper analysis, the authors find that the behavioral patterns of current MLLMs, especially o1, are significantly different from humans'. This finding raises an open question about whether current models are able to conceptualize as humans do.

update after rebuttal

The distinction from previous benchmarks generally makes sense to me. The analysis provided in this paper is helpful for the community to understand MLLMs' ability in visual conceptualization and abstract reasoning. Therefore I will keep my rating at 3.

Questions for Authors

I would like to hear how VGA is different from the papers I mentioned above.

Claims and Evidence

The problem and evaluation scopes are well defined and motivated, and the experiments are well designed. Claims made by this paper are convincingly supported by the experimental evidence.

Methods and Evaluation Criteria

Yes, the choice of the evaluation design is well motivated and is suitable for evaluating MLLMs' conceptualization ability.

Theoretical Claims

There are no complex proofs of theoretical claims that need to be checked in this paper; it is an empirical study and all proposed methods are justified by experiments.

Experimental Design and Analysis

Yes, please refer to paper summary.

Supplementary Material

Yes, I have checked the detailed dataset description and example cases.

Relation to Prior Work

The scope of this paper is relevant to visual abstract reasoning. Compared with prior works, this work clearly connects its evaluation with the conceptualization ability of MLLMs and makes a detailed analysis of current MLLMs' behaviors.

Missing Important References

Since this paper studies the conceptualization/cognitive behavior of MLLMs/vision models, the papers listed below explore and discuss similar topics and should be included in the discussion in this paper.

[1] Jiang, Y., Sun, K., Sourati, Z., Ahrabian, K., Ma, K., Ilievski, F., & Pujara, J. (2024). Marvel: Multidimensional abstraction and reasoning through visual evaluation and learning. Advances in Neural Information Processing Systems, 37, 46567-46592.

[2] Zhang, C., Gao, F., Jia, B., Zhu, Y., & Zhu, S. C. (2019). Raven: A dataset for relational and analogical visual reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5317-5327).

[3] Campbell, D., Kumar, S., Giallanza, T., Griffiths, T. L., & Cohen, J. D. (2024). Human-like geometric abstraction in large pre-trained neural networks. arXiv preprint arXiv:2402.04203.

[4] https://github.com/fchollet/ARC-AGI

Other Strengths and Weaknesses

No

Other Comments or Suggestions

My hypothesis on the middle-score anomaly and the easier-worse anomaly is that the models learn to memorize some specific patterns during training (similar to Section 5.3). So it would be good if there were some experiments that can fully isolate the memorization problem of the MLLMs. This is a suggestion for a future extension, not for the current work, so the authors do not have to add any additional experiments in response to this.

Author Response

We thank you for your positive assessment of our work and for your helpful comments. Below, we address your questions and suggestions.

Distinction from Related Work

The reviewer asked how VGA differs from the cited papers on visual reasoning. While all these works explore aspects of visual abstraction and reasoning, VGA offers several important distinctions:

  • VGA uniquely tests whether models can recognize the same graph concept across different visual layouts. By training on one layout (e.g., Kamada-Kawai) and testing on another (e.g., planar), we isolate the ability to abstract away from visual form to understand invariant properties.

  • Graph-theoretical foundation: Unlike more general visual reasoning datasets, VGA's tasks are grounded in well-defined graph theory problems with deterministic answers. This provides clear evaluation criteria without subjective interpretation.

  • Scale and separation: With 27,000-150,000 training samples per task, VGA is substantial in scale, while maintaining careful separation between train and test sets with no graph repetition (one possible way to enforce such a constraint is sketched after this list).
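
A minimal sketch of one way a no-repetition constraint could be enforced (our illustration, not the authors' actual deduplication procedure): a Weisfeiler-Lehman hash as a cheap filter, followed by an exact isomorphism check on hash collisions.

```python
# Sketch only (not the authors' procedure): reject a candidate graph if an
# isomorphic copy was already generated, using a WL hash as a fast pre-filter.
import networkx as nx

def add_if_new(G: nx.Graph, seen: dict) -> bool:
    """Return True and record G unless an isomorphic graph is already stored."""
    key = nx.weisfeiler_lehman_graph_hash(G)
    for H in seen.get(key, []):
        if nx.is_isomorphic(G, H):
            return False                    # duplicate up to isomorphism
    seen.setdefault(key, []).append(G)
    return True

seen = {}
print(add_if_new(nx.cycle_graph(6), seen))  # True: first occurrence
print(add_if_new(nx.cycle_graph(6), seen))  # False: isomorphic duplicate
```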

Other comments or suggestions

We appreciate the reviewer's hypothesis about the Middle-Score and Easier-Worse anomalies. We're interested in exploring controlled experiments that could isolate memorization effects from genuine conceptualization, and agree that evaluating the memorization effect is a promising direction for future work. We used different layouts for training and testing sets to avoid memorization when trained on our dataset.

To test on MLLMs, one could design experiments with systematically varied graph structures that share superficial patterns but require different abstract reasoning. We would appreciate it very much if the reviewer has experiment suggestions in this direction for future work.

Final Decision

This paper received 1 weak reject and 3 weak accepts after the rebuttal. On the positive side, the reviewers recognized the value of the benchmark, as it may inspire more future work in this direction. On the negative side, reviewer qrzb had concerns about limited baselines and sample sizes. The authors had a couple of rounds of discussion with the reviewer, in which results for both open-source and closed-source models were reported.

Overall, the AC found that the positive side outweighs the negative side and recommends acceptance. The authors are highly encouraged to incorporate the new results reported in the rebuttal into the final version.