Understanding, Abstracting and Checking: Evoking Complicated Multimodal Reasoning in LMMs
Abstract
Reviews and Discussion
This paper introduces UnAC, an advanced multimodal prompting technique that significantly improves the reasoning capabilities of large multimodal models (LMMs) through adaptive visual prompting, image abstraction, and gradual self-checking, as evidenced by superior performance on the MathVista, MM-Vet, and MMMU benchmarks.
Strengths
- The paper introduces the novel UnAC method, combining adaptive visual prompting, image abstraction, and self-checking, which is a new approach for enhancing LMM reasoning.
- The research is thoroughly validated with extensive experiments on the MathVista, MM-Vet, and MMMU benchmarks, showing significant performance improvements.
- The paper is well-organized and clearly explains the methods and results, making the innovative contributions easy to understand.
Weaknesses
- Although the paper includes very detailed ablation experiments, it lacks comparisons with other methods.
- It would be even better if the method could also be validated on test sets that do not involve OCR and do not require reasoning.
Questions
- Why is the gradual checking in Fig. 2 formulated as a sequential problem? Are the subproblems related to each other?
- Why does the adaptive visual prompts phase primarily categorize reasoning problems as segmentation/OCR? Are there other possible methods?
- If the OCR results are incorrect, are there any remedial measures? (One possible remedy is sketched after this list.)
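To illustrate the kind of remedy I have in mind for the last question: a minimal sketch of confidence-gated OCR, assuming pytesseract. The threshold and the abstain-on-low-confidence policy are my own suggestions, not anything taken from the paper.

```python
import pytesseract
from PIL import Image

# Hypothetical cut-off -- not from the paper.
CONF_THRESHOLD = 60.0

def ocr_or_abstain(image: Image.Image) -> str | None:
    """Return detected text, or None when the OCR output looks unreliable."""
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    # conf == -1 marks layout rows (blocks/lines), not recognised words.
    pairs = [(w, float(c)) for w, c in zip(data["text"], data["conf"])
             if w.strip() and float(c) >= 0]
    if not pairs:
        return None
    mean_conf = sum(c for _, c in pairs) / len(pairs)
    if mean_conf < CONF_THRESHOLD:
        # Abstain: better to drop the OCR prompt than to let a misread
        # propagate into the LMM's reasoning chain.
        return None
    return " ".join(w for w, _ in pairs)
```

With such a gate, unreliable OCR would simply be omitted from the prompt rather than silently injected as ground truth; it would be useful to know whether UnAC does anything along these lines.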
This manuscript proposes a multimodal prompting method for large multimodal models (LMMs) named UnAC. UnAC consists of three modules: adaptive visual prompting building, image abstraction, and a gradual checking scheme. The experimental results on three distinct benchmarks demonstrate its effectiveness on multiple LMMs, achieving satisfactory improvements.
Strengths
- The paper is easy to follow.
- Strong experimental results.
Weaknesses
- Very weak technical contributions. This paper basically borrows off-the-shelf models to produce additional annotations and then uses existing prompt-engineering methods to guide the LMM to do reasoning.
- Lack of details on Gradual Checking in Section 3.3, e.g., what happens if the LMM finds an answer is wrong? How does it correct the wrong answer? Figure 2 does not clarify how this module works and lacks necessary information. It could be improved with some examples (see the sketch after this list for the kind of loop I would expect to be spelled out).
- Line 352 indicates that UnAC introduces extra errors on the mathematical reasoning task, but there is no analysis of these errors. It would be better to report and analyse some failure cases.
- Line 420 states that UnAC makes the reasoning more stable under different temperature values, but in Figure 4 there is no comparison between models with and without UnAC. This statement is not supported by any experimental results in the manuscript.
- Very rushed writing. Lines 263, 317, and 356 state that more experiments and examples are provided in the Appendix; however, there is no appendix in the PDF.
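To make the Gradual Checking point concrete, here is a minimal sketch of the kind of correction loop Section 3.3 should spell out. The `lmm.generate` interface, the retry budget, and the revise-with-feedback policy are all my own assumptions, not the paper's design.

```python
def gradual_check(lmm, subproblems: list[str], max_retries: int = 2) -> list[str]:
    """Solve subproblems in order, verifying each answer before moving on."""
    answers: list[str] = []
    for sub in subproblems:
        context = "\n".join(answers)  # earlier answers condition later steps
        answer = lmm.generate(f"{context}\nSub-question: {sub}")
        for _ in range(max_retries):
            verdict = lmm.generate(
                f"{context}\nSub-question: {sub}\nProposed answer: {answer}\n"
                "Is this answer consistent with the image and the context above? "
                "Start your reply with 'yes' or 'no', then give a reason."
            )
            if verdict.strip().lower().startswith("yes"):
                break
            # The manuscript leaves this step unspecified: regenerate,
            # backtrack, or revise? Here we simply revise with feedback.
            answer = lmm.generate(
                f"{context}\nSub-question: {sub}\n"
                f"A checker rejected the previous answer: {verdict}\n"
                "Give a corrected answer."
            )
        answers.append(f"{sub} -> {answer}")
    return answers
```

Even a figure at this level of detail would answer the questions above about how wrong answers are detected and corrected.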
Questions
- How is OCR specifically implemented in the adaptive visual prompts? Does it just add extra boxes as markers for detected texts on the image (Lines 429 to 431)? If so, why not utilize the detected texts with their locations as extra text prompts, as in MathVista [1]? (The two alternatives are sketched after this list.)
- The adaptive visual prompts module helps the model focus on objects with semantic meaning (Line 164), but it is said to be unhelpful for problems such as solving a geometry problem or understanding a function plot (Line 431). However, in Table 4 the improvements from adopting adaptive visual prompts on figure question answering (FQA), textbook question answering (TQA), and visual question answering (VQA) differ considerably. FQA and TQA mainly involve solving and understanding geometry/chart/function problems, yet their improvements are relatively high (5.7% and 4.1%). In contrast, on VQA, which mainly involves understanding objects with semantic meaning, the improvement is limited (1.3%). These experimental results contradict the aforementioned statements; what is the reason?
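For clarity, the two alternatives contrasted in the first question, sketched with pytesseract and PIL; neither is claimed to be UnAC's actual implementation.

```python
import pytesseract
from PIL import Image, ImageDraw

def ocr_as_visual_markers(image: Image.Image) -> Image.Image:
    """Option A: draw boxes around detected text directly on the image."""
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for i, word in enumerate(data["text"]):
        if word.strip() and float(data["conf"][i]) >= 0:
            x, y = data["left"][i], data["top"][i]
            w, h = data["width"][i], data["height"][i]
            draw.rectangle([x, y, x + w, y + h], outline="red", width=2)
    return marked

def ocr_as_text_prompt(image: Image.Image) -> str:
    """Option B (MathVista-style): pass detections and locations as text."""
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    spans = [
        f'"{word}" at ({data["left"][i]}, {data["top"][i]})'
        for i, word in enumerate(data["text"])
        if word.strip() and float(data["conf"][i]) >= 0
    ]
    return "Detected texts: " + "; ".join(spans)
```

Option B hands the LMM exact strings instead of asking it to re-read pixels inside a box, which is why the MathVista-style text prompt seems preferable whenever the OCR is reliable.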
This paper proposes an adaptive visual prompting method for multimodal LLMs to boost visual reasoning in vision-language tasks. Experiments on three public benchmarks verify the effectiveness of the proposed method.
Strengths
This paper tackles an interesting problem in multimodal LLMs: how to enhance reasoning over the visual context, which is challenging. The proposed method can lift the reasoning ability of base MLLMs to another level.
Weaknesses
More thorough experiments on base MLLMs such as GPT-4o, Claude 3, and Claude 3.5 would be expected.
Questions
Please see the weaknesses.
This work presents UnAC, a multimodal prompting method for complex multimodal problems in large multimodal models. It includes an adaptive visual prompting method for better image understanding, an image-abstracting prompting scheme for effective information extraction, and a gradual self-checking scheme for improved reasoning.
Strengths
The paper introduces UnAC, a new multimodal prompting approach. It consists of an adaptive visual prompting method to enhance image understanding, an image-abstracting prompting scheme to extract information, and a gradual self-checking scheme for better reasoning. Extensive experiments on three public benchmarks validate the effectiveness of this method.
Weaknesses
- The method presented in this paper appears overly simple, yet it is claimed to be highly effective. For both pure large language models (LLMs) and vision-language models (VLLMs), adding more context and chains of thought has been proven effective many times.
- Some references can be added to the experiments in this paper, such as CCOT: Compositional Chain-of-Thought Prompting for Large Multimodal Models, CVPR 2024.
- The authors conduct experiments on only three public benchmarks (MathVista, MM-Vet, and MMMU). Why were only these datasets chosen, and what makes them more challenging than other datasets?
Questions
See the weaknesses.
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.