PaperHub

Overall rating: 3.8 / 10 (withdrawn; 4 reviewers; min 1, max 6, std. dev. 1.9)
Individual ratings: 5, 3, 6, 1
Average confidence: 4.0
Correctness: 2.3 · Contribution: 2.3 · Presentation: 2.5
ICLR 2025

AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?

Submitted: 2024-09-26 · Updated: 2024-11-13

Abstract

Keywords

Large Vision-Language Model · automatic evaluation · benchmark

Reviews and Discussion

Review (Rating: 5)

This paper introduces AutoBench-V, an innovative framework for automatically benchmarking LVLMs. AutoBench-V combines text-to-image generation with visual question answering performed by LVLMs: the framework generates images according to specific evaluation criteria and evaluates LVLM performance through customized visual questions, combining self-verification and error-control mechanisms to ensure the correctness of the generated visual question-answer pairs.
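Roughly, the generate-then-verify loop described above amounts to something like the following sketch; every helper name here is my own placeholder, not AutoBench-V's actual interface.

```python
# Rough sketch of the generate-then-verify loop; all helpers are hypothetical
# placeholders standing in for the paper's components, not its actual API.

def build_test_case(aspect, difficulty, examiner, t2i_model, max_retries=3):
    """Produce one (image, QA-pairs) test case for a given aspect and difficulty."""
    for _ in range(max_retries):
        description = examiner.generate_description(aspect, difficulty)  # text-only prompt
        image = t2i_model.generate(description)                          # text-to-image step
        qa_pairs = examiner.generate_qa(description, difficulty)         # questions + answers
        # Self-verification / error control: keep the case only if the examiner
        # can answer its own questions consistently from the description.
        if examiner.verify(description, qa_pairs):
            return image, qa_pairs
    return None  # discard cases that repeatedly fail verification
```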

Strengths

  1. This study addresses the automatic evaluation of multimodal large language models and has a certain degree of novelty.
  2. The paper is elegantly written, and the automatic evaluation pipeline designed in the methodology section is clearly described.

Weaknesses

  1. "they lack the flexibility xxx" in 053 may involve overclaim. For some charts and flowcharts (ChartQA, ScienceQA), it is difficult to draw these pictures using the T2I model, but this ability is also need to be considered. Many benchmarks consider inputting charts, documents and other pictures (MMT-Bench[1]). Although the author construct many types of tasks, all of these input pictures seem to be natural images. So I don't think this can completely become the reason for the author to claim that he is more flexible than other benchmarks.

  2. The questions are all generated by an LLM and therefore lack visual information. Although using LLMs to construct QA data is very common, visual information is also critical in multimodal evaluation, and generating questions from text alone overlooks it (for example, questions such as "What is the object in the upper-left corner of the picture?" or "What is the red square object in the picture?"); see the image-conditioned generation sketch after this list.

  3. The authors evaluated only seven LVLMs, and only one of them is open source. Most comparable benchmarks (such as MME [2]) evaluate 20+ LVLMs, which undercuts the authors' second stated contribution in line 123 ("Extensive experiments..."): the experiments do not seem extensive.

  4. The contributions claim that AutoBench-V greatly reduces the cost of manually constructing benchmarks, which may be an overclaim. Although the authors manually evaluated the correctness of the automatically constructed benchmark, I would prefer to see a manual comparison between the automatically constructed benchmark and manually collected benchmarks (such as MME [2], MMBench [3], MEGA-Bench [4]) in terms of quality, diversity, etc. Such a comparison would better support this contribution; in other words, it is not hard to build a benchmark automatically, but it is hard to match the quality of manually collected ones.

  5. The benchmark built by AutoBench-V may be too easy. GPT-4o reaches 75% accuracy on the authors' "hard" split and 90% on the "easy" split, which is very high compared with other benchmarks; e.g., GPT-4o reaches only 54% on MEGA-Bench [4].
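To make point 2 concrete, generating questions with the image in the loop would look roughly like the sketch below, which feeds the rendered image back to a vision-capable model instead of prompting on the text description alone. The model name and prompt are illustrative assumptions; only the standard OpenAI chat-completions image-input format is used.

```python
# Sketch of image-conditioned question generation (the alternative implied in
# point 2). Model name and prompt are illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI()

def generate_grounded_questions(image_path: str, aspect: str) -> str:
    """Ask a vision-capable model to write questions about the actual image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed examiner model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Write visually grounded questions about this image "
                         f"(spatial layout, colors, object attributes) that test {aspect}."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```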

[1] MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

[2] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

[3] MMBench: Is Your Multi-modal Model an All-around Player?

[4] MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

Questions

Please refer to the weaknesses for details.

Review (Rating: 3)

This work proposes an LVLM-as-Judge pipeline, AutoBench-V, that automatically produces VQA examples for LVLM evaluation. AutoBench-V can be configured to cover multiple aspects of vision-language understanding and to produce test cases at diverse difficulty levels.

Strengths

The authors have put tremendous effort into the comprehensiveness of AutoBench-V's design, and the manuscript reads fluently when following the flow of the pipeline.

Weaknesses

However, my biggest concern lies in the experiments. In fact, there is one critical issue that greatly undermines my overall impression of the value of this work:

Since all the prompts and test cases are generated on the fly by GPT-4o, which is also the sole Examiner/Judge in the paper's setting, aren't GPT-4o's baseline results in Table 2 technically obtained by testing on the training set?

Even if the stated safeguards against self-enhancement leakage are applied to the image-description text-encoder factor, as described in Section 4.2, the ad hoc generated visual examples used in the test cases, as well as the guidelines used along the pipeline, are not fully free of the confounding prior introduced by GPT-4o. Thus, the top benchmark performances of GPT-4o, and to a lesser extent GPT-4o-mini, in Table 2 are very unconvincing results to me.

Questions

To yield fairer benchmark baselines, I strongly suggest the authors address this "player and referee at the same time" issue by employing multiple Examiners in the AutoBench-V pipeline with the same prompts from the Appendix. As in (Ye et al., 2024), please first make sure each candidate LVLM is evaluated across multiple Examiners that are isolated from the candidate model itself.
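Concretely, the protocol I have in mind looks roughly like the sketch below, in which a candidate is never scored by itself and its score is averaged over the remaining judges; the judge pool and the `run_benchmark` callable are illustrative assumptions rather than part of the current pipeline.

```python
# Sketch of a multi-Examiner protocol: a candidate is never judged by itself.
# The judge pool and `run_benchmark` are illustrative assumptions, not part of
# the current AutoBench-V implementation.
from statistics import mean

EXAMINER_POOL = ["gpt-4o", "claude-3-5-sonnet", "gemini-1.5-pro"]  # assumed judges

def evaluate_candidate(candidate: str, run_benchmark) -> float:
    """Average the candidate's score over all Examiners except itself."""
    scores = [run_benchmark(examiner=judge, candidate=candidate)
              for judge in EXAMINER_POOL if judge != candidate]
    return mean(scores)
```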

Also, may I know how suboptimal the other T2I models, such as Stable Diffusion, actually are, as mentioned in Section 4.1? Model size could also be an important contributing factor to performance on AutoBench-V.

Ye et al. Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge. 2024. https://arxiv.org/abs/2410.02736

Review (Rating: 6)

The paper presents a novel automated framework for evaluating Large Vision-Language Models and conducts a detailed experimental analysis on seven popular LVLMs.

Strengths

The idea in this paper is excellent as it can reduce the human cost involved in constructing datasets and provides multiple evaluation categories.

Weaknesses

The paper should include more discussion on quality control for the dataset, such as avoiding issues with generating incoherent or low-quality images and questions.

Questions

  1. Data Quality: How can the quality of generated data be controlled while automatically constructing datasets? Is there a difference in quality across different data categories?
  2. "Unnecessary Visual Content" Issue: Does the AutoBench-V incorporate effective strategies to address the issue of "unnecessary visual content" as discussed in MMStar and NaturalBench? The serious issue can lead to questions being answered correctly without the image.

Review (Rating: 1)

The authors suggest an automatic way to evaluate LVLMs by using GPT-4 to generate image descriptions, questions, and answers, and FLUX to generate images from those descriptions. The resulting benchmark covers five evaluation dimensions, and the authors evaluate seven different VLMs on it.

Strengths

The generation procedure is interesting and the generated samples look reasonable. The models are evaluated in detail.

Weaknesses

Unfortunately, the presentation quality is absolutely not ready for publication. The very first sentence of the introduction already contains several errors: the authors cite "Attention Is All You Need" from 2017 as a 2023 paper; LLaVA is cited as an LLM even though it is a vision-language model; and the formulation suggests that the LLM works from 2023 paved the way for NLP works from 2020, which cannot hold.

This continues in the related work section, where one paper is cited as past work, (1) "Deep Visual-Semantic Alignments for Generating Image Descriptions", with a subsequent study (2) "Microsoft COCO: Common Objects in Context" then mentioned. In reality it is the other way around: paper (1) uses the COCO dataset (2) for evaluation. Then ImageNet is cited as a "benchmark for assessing LVLMs", although it is an image classification dataset. Finally, ImageNet is cited as an arXiv paper instead of being properly cited as an International Journal of Computer Vision paper. Clearly these citations were not thoroughly examined.

Citations in the references are all lowercased, probably due to incorrect usage of BibTeX. The English quality is low, which makes the paper difficult to read; e.g., the sentences in lines 85-87 and 136-137 could be written more clearly. The paper contains typos such as the "fine-frained" aspect on line 215.

Method: The benchmark covers "atmospheric understanding", but the cited paper does not contain the word "atmospheric"; it discusses emotion recognition (one of the four aspects under atmospheric understanding). Figure 1 does not properly show which question belongs to which aspect, leaving the reader guessing. The hierarchical structure also seems arbitrary, e.g. it contains "emotional and psychological" on the left under semantic understanding, and "emotional cues" under "atmospheric understanding". This hints at problems with automatically generating this hierarchy of aspects using GPT-4. Since there are only a few aspects, they could have been constructed manually.

The guidelines section is not understandable to me. Sentences like "This guideline acts as a guideline…" do not help, and details are missing: what is the exact procedure? The description of the method is more prosaic than scientific, and the reader has to find and read the prompt in Figure 23, which is not referenced in the section.

The same holds for the next paragraph, image description with difficulty grading. Even after reading Appendix C.1 it is unclear which prompts are used, and how any of this would not suffer from the common weaknesses of VLMs (e.g. hallucinations, or ignoring parts of the text when generating images).

In the semantic graph part, the authors describe what the function f could be, but never state what it actually is in the end.

A big problem in VQA is that questions can be ambiguous. E.g. in figure 198 the question is formulated in a way that the answer “the stem appears to be pointing upwards” is reasonable but does not match the ground truth.

In Figure 19 the ground truth is simply wrong, since the child is not wearing red clothing. Just from these two examples, the generated benchmark seems to be of too low quality to provide meaningful results.

Summary: The presentation quality must be improved substantially, which suggests a resubmission. The benchmark quality is at least questionable, and in its currently presented form it cannot be judged well. Not properly citing related work is a knockout criterion.

Questions

Please provide some random samples of your dataset.

In Section 3.4, the defects might come either from the text-to-image model not properly rendering that part of the text, or from the VLM not being able to answer despite the image being correct. How do you separate those two cases?
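One way to separate the two cases would be an explicit attribution step along the lines of the sketch below, which first checks image-description faithfulness with a reference model and only then scores the candidate; both helpers are hypothetical.

```python
# Sketch of one way to attribute failures: check whether the generated image is
# faithful to its source description before blaming the candidate LVLM.
# `judge_faithfulness` and `answer_question` are hypothetical helpers.

def attribute_error(description, image, question, gold_answer,
                    judge_faithfulness, answer_question) -> str:
    if not judge_faithfulness(image, description):    # reference model checks the image
        return "t2i_failure"                           # image does not match the description
    prediction = answer_question(image, question)      # candidate LVLM answers from the image
    return "correct" if prediction == gold_answer else "lvlm_failure"
```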

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.