PaperHub

Overall Rating: 5.0 / 10 (Rejected, 3 reviewers)
Ratings: 5, 6, 4 (min 4, max 6, std dev 0.8)
Confidence: 4.3 · Correctness: 2.3 · Contribution: 3.0 · Presentation: 3.0

NeurIPS 2024

CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations

Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

Keywords
Vision-Language Models · Multimodal Large Language Models · Visual Reasoning · Chain of Manipulations

Reviews and Discussion

Review (Rating: 5)

The paper presents CogCoM, a novel approach to training large Vision-Language Models (VLMs) using a mechanism called Chain of Manipulations (CoM). This mechanism enables the model to solve visual problems step-by-step with evidence, inspired by human cognitive processes like marking and zooming into images. CogCoM integrates manipulations such as grounding, zooming, and OCR into the VLM architecture, allowing it to handle various visual problems without external tools. The model is trained using a robust data generation pipeline and evaluated across multiple benchmarks, demonstrating state-of-the-art performance.
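
For concreteness, here is a minimal, hypothetical sketch (in Python) of what such an evidential chain of manipulations might look like for a detailed VQA question. The step names, bounding boxes, and data structure are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Step:
    manipulation: str                    # e.g. "grounding", "crop_and_zoom", "ocr"
    argument: str                        # what the manipulation is applied to
    evidence: Tuple[int, int, int, int]  # a bounding box (x1, y1, x2, y2) on the current image
    note: str                            # natural-language reasoning carried along with the step

# A toy chain for: "What is written on the small sign near the door?"
chain = [
    Step("grounding", "the small sign near the door", (412, 305, 468, 352),
         "Locate the referenced region before trying to read it."),
    Step("crop_and_zoom", "box (412, 305, 468, 352), factor 3x", (412, 305, 468, 352),
         "The sign is too small in the original image; zoom in to make the text legible."),
    Step("ocr", "the zoomed region", (0, 0, 168, 141),
         "Read the text from the zoomed image and answer with it as evidence."),
]

for i, step in enumerate(chain, 1):
    print(f"Step {i}: {step.manipulation}({step.argument}) -> evidence {step.evidence}: {step.note}")
```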

Strengths

Advantages of the Paper

  1. Explainable Reasoning and Manipulation Mechanism: CogCoM generates intermediate steps with evidence, making the reasoning process transparent and explainable, which is crucial for complex visual tasks. The model incorporates a flexible set of manipulations that can be adapted to various visual problems, improving its versatility and problem-solving capabilities.

  2. Data Generation Pipeline: The paper introduces an efficient pipeline for generating high-quality training data, which is essential for training VLMs to perform detailed visual reasoning.

  3. Superior Performance: CogCoM achieves superior results across multiple benchmarks, including detailed visual question answering and visual grounding, showcasing its effectiveness and robustness.

These advantages highlight the paper's contributions to advancing the capabilities of VLMs in solving detailed and complex visual problems through a novel, human-inspired approach.

Weaknesses

Weaknesses in Points

This paper is generally good but I can still spot the following issues.

  1. Design of Figures and Tables: The figures in the paper are not well-designed. The first and second figures are repetitive in meaning, and the colors in the first figure are too light (consider adding black outlines to the boxes). The font size in the second figure is too small to be legible on smaller screens. Additionally, the captions for Table 2 and Table 3 are too close to the tables, violating the submission guidelines.

  2. Lack of Discussion on Related Work: The paper lacks a discussion of existing related work. It should consider citing and comparing with at least other agentic LMMs such as LLaVA-Plus [1] to provide a comprehensive comparison and context.

[1] https://arxiv.org/abs/2311.05437

Questions

See the Weaknesses section above.

Limitations

See the Weaknesses section above.

Author Response

Dear Reviewer, we are very grateful for the valuable time you have spent reviewing our paper and for your recognition of our work, which is of great significance to us. Concerning the issues you have raised, we will make the following improvements:

  • Design of Figures: Thank you very much for pointing out this issue. Following your suggestion, we have adjusted Figure 1 by adding a dark background to reduce the overall brightness of the image. Compared to Figure 1, which introduces the different capabilities of the model, Figure 2 is primarily intended to depict the example mentioned in the introduction (VLMs cannot answer detailed recognition problems correctly, or even output hallucinations, without reasoning). We have increased the font size in Figure 2 and will add a diagram of the image decoder structure to depict content that differs from Figure 1.
  • Design of the Tables: We have adjusted the tables in our paper to increase the spacing between the tables and their captions. Please refer to the tables in the supplemented PDF file to view the effects of these adjustments.
  • Discussion of the related work: Thank you very much for the reminder. We have added the following description of this related work to our Related Works section: The authors of LLaVA-Plus [1] have contributed efforts to train VLMs to develop the capability of invoking external tools. They constructed an instruction-tuning dataset incorporating tool-use examples and trained VLMs to call external tools to solve challenging tasks. In comparison to their efforts, this work focuses on stimulating the model's inherent abilities to solve problems in an end-to-end manner through active reasoning. Though at a disadvantage in producing pixel-level masks, it offers advantages and potential for enhancing the model's inherent reasoning abilities and reducing the time complexity of reasoning.

[1] https://arxiv.org/abs/2311.05437

Comment

Dear Reviewer,

Thank you once again for your time and valuable suggestions. We sincerely hope that our efforts can address your concerns and look forward to your response.

Review (Rating: 6)

Drawing inspiration from how humans solve visual problems through localizing, zooming, etc., this paper introduces a new framework called CogCoM, which solves visual problems by automatically combining six types of basic manipulations. When facing a visual problem, CogCoM can reason through each step and employ basic tools to aid the problem-solving process. To achieve this goal, the authors construct a data generation pipeline that leverages GPT-4 to build the training data for CogCoM. CogCoM yields performance gains over its baseline CogVLM on several benchmarks.

Strengths

  1. CogCoM makes gains over its baseline CogVLM on several benchmarks.
  2. The pipeline that leverages GPT-4 to construct manipulation pipelines for problem-solving is reasonable.

Weaknesses

  1. The VQA benchmarks reported in Table 1 are not very convincing. It would be beneficial to consider more modern and challenging benchmarks such as MMBench, MathVista, and SeedBench.
  2. The comparison of baseline methods seems to be based on relatively outdated approaches. It might be more informative to compare them with more recent LVLMs like LLaVA-1.5, Monkey, and ShareGPT4V.
  3. It would be helpful to discuss the closely related works ViperGPT [3] and V* [4]. ViperGPT shares the idea of solving visual problems by planning tool pathways. V* shares the idea of searching and zooming in progressively.
  4. The differences with some other related works should be discussed [5][6].

[1] Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

[2] ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

[3] ViperGPT: Visual Inference via Python Execution for Reasoning

[4] V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs

[5] CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

[6] DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models

Questions

  1. The performance improvements of CogCoM over CogVLM appear relatively insignificant. Could this be attributed to the low success rate (around 55% in Figure 6)? Further explanation is needed for the cases where the planned pathway fails.

I will be pleased to raise my rating if my concerns as in weaknesses and questions are resolved.

Limitations

The limitations have been discussed.

Author Response

Dear reviewer, thank you for taking your valuable time to review our paper and for the insightful suggestions. We have added experiments (in the supplementary PDF) and supplemented the content of our paper (on this page), as detailed below:

  • We have evaluated our model on the suggested new benchmarks, MMBench, SEED-Bench, and MathVista, and the experimental results are listed in Table 2 of the supplementary PDF file. In conclusion, our model achieves stronger performance than the concurrent and newly suggested baseline models, outperforming LLaVA-1.5 by 10.2, 4.9, and 8.1 percentage points on the three benchmarks, respectively.

  • We compared our model with the suggested up-to-date baseline models, and the results are listed in Table 1 of the supplementary PDF file. In conclusion, our model consistently outperforms these baselines across the VQA and general multimodal benchmarks, demonstrating effective multimodal capabilities.

  • Discussion of closely related works: (1) ViperGPT shares the same basic idea as our work of decomposing complex visual problems into reasoning steps. In contrast to their focus on building a training-free framework that combines external VLMs via a code LLM, we focus on training an end-to-end VLM to enhance its multimodal reasoning ability for solving complex visual problems. (2) V* is a concurrent work that also aims to solve VQA problems by progressively acquiring visual cues. Their two-part framework first utilizes a VQA LLM to list the visual objects needed for answering, followed by a dedicated search model to accurately locate those objects. In contrast, our approach focuses on using a single model to actively perform reasoning and to identify or mark the most useful visual information, which may offer the potential to solve more complex reasoning tasks beyond detailed identification, such as challenging geometric math problems.

  • Discussion of the differences with other related works: (1) CogVLM is a multimodal foundation model that aims to achieve reliable performance on broad multimodal tasks (e.g., VQA, grounding, captioning). This work is motivated by the observation that CogVLM produces hallucinated answers for detailed reasoning questions (i.e., Figure 2), and aims to study an effective methodology to enhance the multimodal reasoning capabilities of VLMs while reducing hallucinations through visual evidence. (2) DualFocus was released around the same time as our work. They also constructed a training dataset that includes intermediate cues (bounding boxes) and trained the model in two stages based on accurate localization. Compared to their work, our CoM training places more emphasis on answering questions in a single reasoning process and on marking the image to assist in solving complex problems. We will include the above works and discussions in the Related Works section.

  • The insignificant improvements: Because this paper focuses on enhancing the reasoning capabilities of VLMs, the improvements of our model over CogVLM on the grounding and general multimodal benchmarks are relatively limited. However, on the evaluation sets that require logical reasoning (i.e., GQA), complex counting (i.e., TallyVQA), and detailed recognition (i.e., TextVQA, ST-VQA), our model achieves improvements of 6.5, 10.9, 1.04, and 9.0 percentage points, respectively. Additionally, we found that mistakes in reasoning paths are indeed a major factor affecting performance on the general multimodal benchmarks.

Comment

Dear Reviewer,

Thank you once again for your time and valuable suggestions. We sincerely hope that our efforts can address your concerns and look forward to your response.

Comment

Thank you for your efforts in addressing my concerns. One additional question: since mistakes in reasoning paths are indeed a major factor affecting performance on the general multimodal benchmarks, the authors may add some discussion (experiments are not required) about how to compensate for the negative effects of planned pathway failures, as in the concurrent work DualFocus.

Overall, most concerns are solved. I will raise my rating to 6.

Comment

Dear Reviewer,

We are very pleased to learn that our efforts have addressed your concerns and that you kindly raised the rating. Regarding your additional question about how to mitigate the negative effects of incorrect reasoning paths, we have the following responses:

  • DualFocus adopts a good strategy that compares the PPLs of generated answers to switch between the direct-answering and reasoning-before-answering modes, alleviating paths with low confidence. In comparison to DualFocus, we add explicit launching prompts so that users can switch the mode actively, and randomness during training to encourage the model to switch the mode by itself according to the problem scenario. As the PPL-based method is general, we will compare against and implement this strategy in CogCoM in future work (a minimal sketch of such PPL-based switching is given after this list).
  • Using reinforcement learning to penalize negative paths during training is another strategy to mitigate the negative effects, as we have already derived multiple negative reasoning paths in our data collection pipeline. However, an RL-based objective introduces unstable training and lower efficiency compared to straightforward cross-entropy on positive answers, and improving the effectiveness of this solution is a promising topic.
  • Humans solve difficult questions with a period of thinking before giving answers. As LLMs/VLMs generate outputs immediately following the question prompt, the CoT mechanism serves as an effective substitute for this thinking. Similar to human thinking, a backtracking mechanism might be a reliable way to form correct and concise reasoning paths. However, in our experiments, we found that integrating backtracking results in heavy time complexity and relatively marginal benefits. Therefore, developing an effective backtracking strategy that enhances the correctness of reasoning paths while maintaining efficiency is meaningful for mitigating the negative impact of path failures.
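
A minimal sketch of the PPL-based mode switching mentioned in the first point above, assuming a generic HuggingFace-style causal (V)LM interface over text prompts only; the prompt templates, function names, and the `generate` wrapper are illustrative assumptions, not CogCoM's or DualFocus's actual code.

```python
import torch

def answer_ppl(model, tokenizer, prompt, answer):
    """Perplexity of the `answer` tokens conditioned on `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt tokens; score only the answer span
    with torch.no_grad():
        loss = model(input_ids=full_ids, labels=labels).loss  # mean NLL over answer tokens
    return torch.exp(loss).item()

def answer_with_mode_switch(model, tokenizer, question, generate):
    """Generate in both modes and keep the answer the model itself finds more probable (lower PPL)."""
    direct_prompt = f"Question: {question}\nAnswer:"
    reasoning_prompt = f"Question: {question}\nLet's solve this step by step with manipulations:"
    candidates = [(p, generate(p)) for p in (direct_prompt, reasoning_prompt)]
    _, best_answer = min(candidates, key=lambda pa: answer_ppl(model, tokenizer, pa[0], pa[1]))
    return best_answer
```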

Thank you once again for your time and valuable comments. We will add the above discussion to our paper.

Comment

Thank you for the feedback. The new discussions are valuable.

Review (Rating: 4)

This paper introduces the Chain of Manipulations (CoM) mechanism for data generation to enhance visual reasoning in VLMs. The authors developed a data generation pipeline, producing 70K high-quality samples, and created the CogCoM model. CogCoM achieves state-of-the-art results across nine benchmarks, demonstrating significant improvements in various visual tasks.

Strengths

  1. The CoM introduces a new data generation mechanism that enables VLMs to perform step-by-step visual problem solving with supporting evidence.
  2. A data generation pipeline is proposed, producing a dataset of 70K high-quality samples.
  3. The trained model, CogCoM, achieved SOTA on nine benchmarks.
  4. The paper is well written and easy to understand.

Weaknesses

  1. During data generation, the process relies entirely on GPT-4 for prompting and on existing models (GroundingDino, PaddleOCR) for generation. As mentioned in the appendix, inaccuracies in these visual models can affect the quality of the generated data and the model's reasoning capabilities. However, the system lacks validation or filtering mechanisms to enhance data quality.
  2. To highlight the specific improvements brought by CoM, it would be helpful to provide results both with and without the incorporation of CoM data. This would clarify the impact of CoM, especially since CogCoM integrates a significant amount of additional data such as MultiInstruct and LLaVAR during the instruction-tuning stage, as shown in Table 1.
  3. The CoM dataset includes 6K high-quality manually annotated math samples, but no test results for math problems are provided. Clarification is needed on whether the purpose of this math data is solely to enhance the model's reasoning capabilities.
  4. The paper emphasizes that CogCoM is a model capable of multi-image, multi-turn understanding, but no corresponding test results (qualitative or quantitative) are provided.
  5. In the model section, some parameters are not specifically explained, such as the maximum number of turns the model can accept and the predefined threshold.
  6. Typo: Line 288, CogOM -> CogCoM.

Questions

See the Weaknesses section above.

Limitations

See the Weaknesses section above.

Author Response

Dear reviewer, thank you for taking your valuable time to review our paper and for the insightful review. In response to your suggestions and questions, we have conducted experiments and provide explanations, as follows:

  • The filtering strategy for quality control in data collection: Thank you for your careful reading of our analysis of tool recognition errors during data generation, as stated in Appendix C.5. During the collection of our 70K CoM training data, we discard wrongly recognized data (we refer to these as negative paths in our paper), as such data cannot terminate at the golden answer node during the DFS traversal. Therefore, we used this filtering strategy to ensure that only correct data capable of reaching the golden answer (i.e., positive paths) was included in the 70K training data (a minimal sketch of this filtering is given after this list). We look forward to utilizing these negative paths as negative rewards in future work.
  • The ablation experiment to validate the effectiveness of CoM data: Thanks for your suggestion. We have conducted ablation experiments training our model with and without the collected CoM training data, and the comparison results on three typical categories of benchmarks (TextVQA, MMVet, and MathVista) are shown in Table 3 of the supplementary PDF file. In conclusion, our model benefits from the CoM corpus, which contains informative visual reasoning evidence, achieving improvements of 6.6, 0.2, and 0.9 percentage points on the three benchmarks, respectively.
  • The experiments to validate the effectiveness of the manually annotated CoM-Math data: Motivated by the effectiveness of AlphaGeometry, we annotated 6K reasoning chains for the purpose of advancing the development of VLMs on this specific task. Solving geometry math problems is highly challenging for current VLMs. We conducted experiments on MathVista, and the results are shown in Table 3 of the supplementary PDF file. In conclusion, our model, trained on the 6K annotated samples, achieved an improvement of 0.90 percentage points on MathVista. Despite the relatively limited improvement in answer-matching metrics, we believe that annotated data involving human-like multimodal reasoning processes (such as drawing auxiliary lines) may contribute to the development of this field.
  • The clarification of the multi-turn, multi-image capability: CogCoM can solve visual reasoning problems through multiple rounds of interaction with an input image (e.g., marking auxiliary lines), re-inputting the resulting image at each new round (e.g., a marked image with auxiliary lines) in an end-to-end manner. Taking geometry math problem-solving as an example (e.g., the last case in Figure 1), given the initial inputs of an image I_0 and a question Q, our model undergoes the first round of reasoning (outputting multiple reasoning steps) and then draws an auxiliary line on the image to obtain a new image I_1; it then re-inputs this marked image to continue the next round of reasoning (which may draw additional lines or crop and zoom in on a specific region). In contrast to most existing methods, we refer to this approach, which is similar to the human thinking process, as a multi-turn, multi-image process.
  • The maximum number of multi-image turns: Thanks for your suggestion. We have detailed the statistics for the training and testing data in Appendix C.3, including the total number of reasoning chains, the average number of reasoning steps, and the average number of manipulation types. In accordance with your suggestion, we have also computed statistics on the number of turns delimited by multiple images. The results are as follows: in the training data sourced from TextVQA and ST-VQA, which may involve generating new images (e.g., through zooming), the average number of turns is 1.42 and the maximum is 7 (we restrict the maximum number of turns to 4 during training to prevent OOM). On the TextVQA test set, our model produced an average of 1.54 turns involving multiple images. It is worth noting that not every image requires manipulations such as zooming; some questions can be answered through reasoning with evidence or direct observation. We will add these statistics to the paper.
  • Thank you for pointing out the typo. We will thoroughly check the paper again to ensure there are no ambiguities or mistakes.
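
As referenced above, here is a minimal sketch of the positive-path filtering idea: DFS over a tree of candidate manipulation steps, keeping only paths whose final answer matches the golden answer. The tree layout, field names, and toy example are assumptions for illustration, not the authors' actual data-generation code.

```python
def collect_positive_paths(node, golden_answer, path=None, positives=None):
    """`node` is assumed to look like {"step": ..., "answer": ... or None, "children": [...]}."""
    path = (path or []) + [node["step"]]
    positives = positives if positives is not None else []
    if not node["children"]:                      # leaf: the manipulation chain has terminated
        if node.get("answer") == golden_answer:   # keep only chains that reach the golden answer
            positives.append(path)
        return positives                          # negative paths are simply dropped
    for child in node["children"]:
        collect_positive_paths(child, golden_answer, path, positives)
    return positives

# Toy tree: two candidate chains; only one ends at the golden answer "EXIT".
tree = {
    "step": "grounding(sign)", "answer": None, "children": [
        {"step": "crop_and_zoom(box, 2x)", "answer": None, "children": [
            {"step": "ocr(region)", "answer": "EXIT", "children": []},
        ]},
        {"step": "ocr(full image)", "answer": "EXII", "children": []},  # tool error -> negative path
    ],
}
print(collect_positive_paths(tree, golden_answer="EXIT"))
```
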
Comment

Dear Reviewer,

Thank you once again for your time and for pointing out the shortcomings in our work. We sincerely hope that our additional experiments and explanations can address your concerns, and look forward to your response.

Comment

Dear Reviewer,

I sincerely apologize for disturbing you again. In response to the multiple questions you raised, we hope that our experiments and explanations are thorough enough to address your concerns. As the discussion period is nearing its end, please let us know if you have any further questions or feedback. Thank you very much.

Author Response (to all reviewers)

We extend our gratitude to all the reviewers for the time and effort they have invested in reviewing our paper. In response to the review comments, we have added (1) evaluations on the suggested new benchmarks, (2) comparisons with new baseline models, and (3) an ablation experiment isolating the effect of the CoM training data. The detailed results are included in the supplementary PDF file.

Final Decision

The idea of having a chain of manipulations for VLMs makes intuitive sense. However, as pointed out by the reviewers, it is unclear whether such a mechanism truly helps with VLM reasoning capabilities. The authors argue that CogVLM targets a wider range of tasks, while CogCoM mainly focuses on the aspects discussed and designed by the authors; this makes the performance gains, whether on general tasks or the claimed reasoning tasks, less valuable, as the proposed method is expected to excel on the tasks it is designed for. The draft received diverging opinions without a strong champion among the reviewers; given its current form and presentation, which need further improvement, the AC suggests rejection.