Unveiling the Tapestry of Consistency in Large Vision-Language Models
We propose a Consistency benchmark (ConBench), conduct an in-depth analysis, and design a simple method to improve VLMs.
Abstract
Reviews and Discussion
This paper introduces a multimodal Consistency Benchmark (ConBench) to systematically evaluate the capabilities of LVLMs via diverse question formats. ConBench has a total of 4k questions on 1k images and corresponding 3k discriminative ground truths, as well as two special metrics to evaluate the consistency of LVLMs. Based on ConBench, this work conducts a comprehensive analysis of inconsistency in LVLMs.
Strengths
- The constructed ConBench can better evaluate LVLMs and encourage further advancements in the consistency domain.
- Their findings provide insight for future community research.
- This paper proposes a trigger-based diagnostic refinement method (TDR) to improve the generation ability of LVLMs without any additional training.
Weaknesses
- The results in Table 4 suggest that TDR can markedly improve the consistency of baseline LVLMs. In addition to ConScore[C], how does TDR affect LVLMs' performance on comprehensive multimodal benchmarks?
Questions
See Weaknesses
Limitations
Yes, the authors discuss limitations in the Appendix.
We sincerely thank the reviewer for the efforts in reviewing our paper and the positive evaluation. Our responses to the reviewer's comments are summarized as follows.
- In addition to ConScore[C], how does TDR affect LVLMs' performance on comprehensive multimodal benchmarks?
Our proposed method TDR aims to utilize the discriminative prompt template and VLMs' judgment capability to enhance their open-ended generation ability.
Given the difficulty in evaluating open-ended responses, we randomly selected 100 images from MMMU, MMBench, and MMStar. Then, LLaVA-v1.5-7B was employed to generate initial captions for these images, which were then improved by our TDR method. After conducting five rounds of manual evaluation, we observed that the TDR method resulted in improvements in 37% of cases, no change in 60% of the cases, and a slight decrease in 3% of the cases (GSB = 37:60:3). The results further confirm the effectiveness of the TDR approach.
Furthermore, although TDR is not designed to directly improve discriminative responses, motivated by your question we also tried it on two discriminative benchmarks (rewriting the prompts for some low-probability answers). As shown in the table below, we are pleased to see that our method improves LLaVA-v1.5-7B by 0.84 and 0.92 points, and boosts MiniGemini-7B by 0.53 and 0.79 points on GQA and POPE, respectively. These promising results will inspire our further research.
| Method | GQA | POPE |
|---|---|---|
| LLaVA-v1.5-7B (baseline) | 61.94 | 85.88 |
| + TDR (ours) | 62.78 | 86.80 |
| MiniGemini-7B (baseline) | 63.66 | 87.65 |
| + TDR (ours) | 64.19 | 88.44 |
This paper presents ConBench, a multi-modal benchmark to intuitively analyze how LVLMs perform when different prompts are used for one model around a knowledge point. Based on the proposed benchmark, several interesting findings are pointed out, such as the relationships between the prompt space and the answer accuracy in both the discriminative and generative realms. Based on the findings, the paper proposes a way to improve the consistency of LVLMs by trigger-based diagnostic refinement. This is specifically applied to improving the captioning capabilities of LVLMs.
Strengths
- The benchmark is comprehensive and the related findings are interesting.
- The motivation for the proposed method to improve the consistency of LVLMs is straightforward based on the findings.
- The evaluations conducted on the designed benchmarks provide evidence of the effectiveness. However, there are some concerns regarding the experimental results, which will be further discussed in the weaknesses section.
Weaknesses
- Some of the presentation is unclear. For example, the definitions of the discriminative and generative domains should be clearly explained at the beginning to avoid ambiguity.
- Some metrics are not clearly explained, including but not limited to ConScore[C], metric[C] and metric[D] in Sec 4.1.
- For the generative problem, the paper uses GPT-4 to compare the consistency of the LVLM output answer and the ground truth. However, several issues arise here: (1) does GPT-4 have a certain bias; (2) does GPT-4 also have inconsistent judgment issues; (3) how can the impact of GPT-4's inconsistent judgments be eliminated, e.g., is it feasible to conduct several runs and report the variation in results?
- About the Trigger-based Diagnostic Refinement: it seems that previous work [a] used a similar approach to obtain more accurate answers in the LLM realm, i.e., prompting the model with its previous answers and equivalently phrased questions. Further explanation is suggested to clarify the commonalities and differences between the proposed Trigger-based Diagnostic Refinement and other works (e.g., [a]) in the LLM realm.
[a] Chuanyang Zheng, et al. Progressive-Hint Prompting Improves Reasoning in Large Language Models, 2023.
Questions
Shown in the weaknesses part
Justification of final rating:
I appreciate the opportunity to evaluate this submission. Overall, the proposed benchmark is interesting, the findings based on the benchmark are reasonable, and the proposed method based on the findings is practical. I give my rating as borderline accept.
Limitations
The author has claimed it in the checklist.
We sincerely thank Reviewer aYRC for the efforts in reviewing our paper and the positive evaluation. Our responses to the reviewer's comments are summarized as follows.
- Explain the definition of the discriminative and generative domains clearly.
(1) Discriminative questions provide prior information about the image in the text and restrict the model to answering based on this information, making them closed-ended questions.
(2) Generative questions, on the other hand, do not provide the model with specific image priors and allow the model to generate answers entirely based on its own reasoning, making them open-ended questions.
(3) Furthermore, we will carefully review the clarity of the other parts of the presentation and make corrections in the revised version.
- Explain the definition of ConScore[C], metric[C], and metric[D].
(1) ConScore[C]
ConScore[C] can be clearly defined by the following formula:
ConScore[C] = (Con[T] + Con[C] + Con[V]) / 3,
where Con[·] is the Consistency ratio between each discriminative answer type and the caption. ConScore[C] is a generative-domain evaluation metric that indirectly quantifies and assesses the actual user experience (a short computational sketch is given after point (2) below).
(2) Metric[C] and Metric[D]
Metric[C] and Metric[D] actually refer to the tables with ConScore[C] and Score[D] as evaluation metrics, respectively. We will provide clearer explanations in the revised version.
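For concreteness, here is a minimal sketch of how ConScore[C] in (1) can be computed; the function name and the way the three per-type Consistency ratios are passed in are illustrative rather than our exact evaluation code:

```python
def conscore_c(con_t: float, con_c: float, con_v: float) -> float:
    """Average of the Consistency ratios between the caption and the three
    discriminative answer types (Con[T], Con[C], Con[V])."""
    return (con_t + con_c + con_v) / 3


# Example: conscore_c(46.00, 52.20, 46.80) ≈ 48.3
```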
- The concerns about using GPT-4: (1) Does GPT-4 have a certain bias? (2) Does GPT-4 also have inconsistent judgment issues? (3) How can the impact of GPT-4's inconsistent judgments be eliminated?
(1) Initially, we only employ GPT-4 to determine the Consistency of model responses (e.g., with prompts like "Based on the caption, is my response correct? Please answer with yes or no only."). This is therefore a binary-choice task for GPT-4, and our primary focus is on its accuracy (a minimal sketch of this judging call is given after point (2) below). For this task, we manually checked its performance on five different VLMs by randomly sampling 200 cases in each trial. As shown in the following table, GPT-4 achieves an average accuracy of 95.4% (as of August 4, 2024) with a standard deviation of 0.0035, which is highly reliable and stable.
| Responder | Accuracy rate of GPT-4 (6 trials) |
|---|---|
| GPT-4V | 95.1% |
| Gemini-Pro-Vision | 95.8% |
| Qwen-VL-Max | 94.9% |
| LLaVA-NeXT-34B | 95.7% |
| InternVL-v1.2P-40B | 95.5% |
(2) Inconsistency refers to the model providing inconsistent answers when faced with different solution spaces of prompts for the same knowledge point. Based on the above, GPT-4 in our evaluation always answers 'yes' or 'no', so the solution space of the prompt never changes and the notion of Inconsistency does not need to be considered in this situation.
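For illustration, a minimal sketch of the binary judging call described in (1), assuming the official `openai` Python client; the helper and prompt assembly are hypothetical, and only the quoted question follows our evaluation prompt:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge_consistency(caption: str, response: str) -> bool:
    """Binary check: is the discriminative response consistent with the caption?"""
    prompt = (
        f"Caption: {caption}\n"
        f"My response: {response}\n"
        "Based on the caption, is my response correct? "
        "Please answer with yes or no only."
    )
    out = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return out.choices[0].message.content.strip().lower().startswith("yes")
```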
- The comparison between the proposed Trigger-based Diagnostic Refinement and other works (e.g., [1]) in the LLM realm.
(1) Similarity
The similarity is that both our method and [1] ask follow-up questions based on the model's previous responses and confirm the final answer, in order to enhance response quality.
(2) Differences
However, the motivation and technical details of our method and [1] are completely different. [1] uses all the accumulated responses as options for the next round of prompts, iteratively searching through historical responses to find the correct answer, which is mainly effective for mathematical and reasoning abilities. Our method only picks out the uncertain words in the model's responses and constructs discriminative templates for questioning, which proves effective for comprehensive abilities (a rough sketch follows the reference below).
[1] Chuanyang Zheng, et al. Progressive-Hint Prompting Improves Reasoning in Large Language Models, 2023.
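To make the contrast concrete, here is a rough sketch of the TDR prompting step as described above; the confidence threshold and the exact wording of the discriminative template are illustrative, not the implementation in the paper:

```python
def build_tdr_prompts(caption_words, word_confidences, threshold=0.5):
    """Pick out low-confidence (uncertain) words in the generated caption and
    turn each one into a discriminative question that the VLM re-answers
    while looking at the image."""
    prompts = []
    for word, conf in zip(caption_words, word_confidences):
        if conf < threshold:
            prompts.append(
                f'Your caption mentions "{word}". Looking at the image again, '
                f'is "{word}" correct? Please answer with yes or no only.'
            )
    return prompts
```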
Dear Reviewer aYRC,
We sincerely thank you for your efforts in reviewing our paper. We have provided corresponding responses and results, which we believe cover your concerns. We hope to further discuss with you whether your concerns have been addressed. Please let us know if any part of our work remains unclear.
Best,
Authors
The paper presented a comprehensive study of LVLMs and their inconsistent answers given different prompt solution spaces (true/false, multiple choice, and limited QA). Specifically, the authors introduced the ConBench benchmark to evaluate the performance of various models. They investigated the relationship between the size of the solution space and the accuracy of answers provided by these models, and analysed the consistency between answers and the generated captions. They also presented a simple yet effective method to improve consistency.
Strengths
- It is an interesting perspective to construct a benchmark to evaluate the model's performance when prompts with different solution spaces are provided. In fact, this is a common problem that we might encounter when using LVLMs to answer our questions.
- The analysis was comprehensive, with some insightful findings, in particular the positive correlation between accuracy and consistency, and the relationship between confidence and consistency.
- The authors also provided a simple yet effective method, i.e., picking those words with lower confidence scores for re-prompting, to generate more accurate captions. The method did not require re-training the model.
Weaknesses
- The proposed benchmark was not sufficient to evaluate the models, with a size of only 1K images and three types of questions. It would be great if the benchmark could be more diverse and representative of various domains.
- It was not clear what the insight of Fact 4.4.1 was. Did the authors want to show that closed-source models put more effort into improving consistency compared with open-source ones?
- The current method to improve consistency was ad hoc; based on such an idea, it is likely that simply changing generation arguments like temperature, top_p, or top_k might have similar effects.
- The paper seems more suitable to be submitted to the Datasets & Benchmarks track instead of the main track.
Questions
See above for questions.
Limitations
The authors discussed potential social impacts as well as limitations of their paper in the appendix. I agreed with them that using GPTs for evaluation might pose a bias and a more rigorous human evaluation was preferred.
We sincerely thank Reviewer AnYL for the efforts in reviewing our paper. Our responses to the reviewer's comments are summarized as follows.
- It would be beneficial to increase the diversity and scale of ConBench.
We greatly appreciate your attention to the scale and diversity of ConBench. The current version of ConBench contains 1,000 images with 3,000 questions and covers 19 subdomains, comparable to popular multimodal datasets such as MME [1] and MMBench [2]. In this version, multimodal models have not yet achieved high scores (e.g., the current best model Qwen-VL only achieves 37.00 on ConScore[D]). With the further development of VLMs, we will update and scale the benchmark up to tens of thousands of cases, providing a more comprehensive evaluation of Consistency.
[1] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models.
[2] MMBench: Is your multi-modal model an all-around player?
- Is Fact 4.4.1 meant to show that closed-source models put more effort into improving consistency compared with open-source ones?
In the paper, Fact 4.4.1 provides an alternative perspective on why closed-source models, despite sometimes having lower accuracy on benchmarks, offer a better user experience in practical applications. One possible reason, based on our paper, is that closed-source models have a bias toward Consistency. Therefore, evaluating the Consistency of VLMs is a feasible approach to indirectly quantifying and assessing the actual user experience.
- Changing generation arguments like temperature, top_p, or top_k might have effects similar to the proposed method TDR.
In fact, adjusting these hyperparameters only affects the randomness of sampling and does not improve the quality of the generation. As shown in the following table, we conducted comprehensive experiments on LLaVA-NeXT-34B and MiniGemini-34B, keeping all other settings at the baseline values and varying the temperature in {0.2, 0.8} and top_p in {0.7, 0.9} in the ablation studies. The results show that tuning the temperature or top_p values cannot enhance the Consistency of model responses. In contrast, our approach makes the model aware of where it lacks confidence and has errors, leading to corrections that boost generation quality. Therefore, changing generation arguments hardly achieves similar effects (a decoding-only sketch is given after the table).
| Method | ConScore[C] | Con[T] | Con[C] | Con[V] |
|---|---|---|---|---|
| LLaVA-NeXT-34B (baseline) | 48.3 | 46.00 | 52.20 | 46.80 |
| + temperature (0.2) | 48.0 | 45.75 | 51.75 | 46.51 |
| + temperature (0.8) | 44.1 | 42.55 | 47.15 | 42.57 |
| + temperature (0.2), top_p (0.9) | 47.8 | 45.57 | 51.72 | 46.12 |
| + temperature (0.2), top_p (0.7) | 48.2 | 45.88 | 52.12 | 46.60 |
| + TDR (ours) | 57.4 | 69.10 | 57.40 | 45.70 |
| MiniGemini-34B (baseline) | 49.6 | 56.80 | 48.00 | 44.10 |
| + temperature (0.2) | 48.9 | 56.00 | 47.32 | 43.38 |
| + temperature (0.8) | 45.6 | 52.22 | 47.99 | 36.59 |
| + temperature (0.2), top_p (0.9) | 48.1 | 55.08 | 50.62 | 38.60 |
| + temperature (0.2), top_p (0.7) | 49.1 | 55.97 | 47.64 | 43.69 |
| + TDR (ours) | 60.2 | 76.10 | 53.80 | 50.80 |
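For reference, the decoding-only change covered by this ablation looks like the following sketch (a generic Hugging Face-style `generate` call; the helper name and the model/processor objects are placeholders, not our exact scripts):

```python
import torch


@torch.no_grad()
def sample_caption(model, processor, image, prompt, temperature=0.2, top_p=0.7):
    """Temperature / top_p only reshape the token-sampling distribution;
    they never let the model revisit the parts of the caption it is
    uncertain about, which is what TDR does."""
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        max_new_tokens=256,
    )
    return processor.decode(output_ids[0], skip_special_tokens=True)
```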
- The paper seems more suitable to be submitted to the Datasets & Benchmarks track instead of the main track.
Our paper differs from typical papers in the Datasets & Benchmarks track. It not only introduces a new dataset, ConBench, but, more importantly, provides deep analyses of the Inconsistency phenomenon based on it. Our findings establish a relationship between the discriminative and generative realms, highlighting the importance of Consistency between the discriminative answers and the caption. Moreover, we propose a solution that forces VLMs to self-think, where a discriminative prompt is constructed from uncertain words in the caption. Taken together, this is a comprehensive evaluation pipeline rather than solely the introduction of a dataset or benchmark.
Dear Reviewer AnYL,
We sincerely thank you for your efforts in reviewing our paper. We have provided corresponding responses and results, which we believe cover your concerns. We hope to further discuss with you whether your concerns have been addressed. Please let us know if any part of our work remains unclear.
Best,
Authors
Dear ACs and Reviewers,
We thank all the reviewers for their valuable comments and efforts in reviewing our paper.
We are delighted that Reviewers AnYL, aYRC, and wv4L stated that our findings are interesting and the benchmark is comprehensive, and that Reviewers AnYL and wv4L acknowledged that our method TDR is simple yet effective and makes a technical contribution.
Regarding the questions and concerns of the reviewers, we have provided our responses in their respective sections.
This paper presents a new multi-modal consistency benchmark, called ConBench, to analyze how LVLMs perform when the solution space of a prompt changes. The benchmark consists of 1,000 public images from available benchmarks with their original prompts and two additional discriminative types of questions generated by GPT-4. All reviewers gave slightly positive feedback (2 borderline accept and 1 accept). All reviewers agree that the proposed benchmark is interesting and provides insight for the community on the relationship between consistency and accuracy. In addition, the proposed trigger-based diagnostic refinement method is simple, practical, and effective on the new benchmark. There are also some concerns raised by the reviewers: Reviewer AnYL thinks that 1k images is a very small volume for an evaluation benchmark and the types of questions are also limited, so more diverse images and questions covering more domains would be needed; Reviewer aYRC questions the accuracy of using GPT-4 as a judge and the delta with respect to another trigger-based method; Reviewer wv4L wants to understand the improvement of the proposed method on other benchmarks. The authors have provided rebuttals to address these questions. After reading all reviews and rebuttals, the AC agrees with the reviewers and recommends accepting this paper. The AC encourages the authors to scale this benchmark up to tens of thousands of cases and increase its diversity, so that the community can widely adopt it.