PaperHub
7.0 / 10 · Poster · 4 reviewers (min 6, max 8, std. dev. 1.0)
Ratings: 6, 8, 6, 8
Confidence: 3.5 · Correctness: 3.0 · Contribution: 3.0 · Presentation: 3.0
ICLR 2025

Natural Language Inference Improves Compositionality in Vision-Language Models

OpenReview · PDF
Submitted: 2024-09-25 · Updated: 2025-02-28

Abstract

Compositional reasoning in Vision-Language Models (VLMs) remains challenging as these models often struggle to relate objects, attributes, and spatial relationships. Recent methods aim to address these limitations by relying on the semantics of the textual description, using Large Language Models (LLMs) to break them down into subsets of questions and answers. However, these methods primarily operate on the surface level, failing to incorporate deeper lexical understanding while introducing incorrect assumptions generated by the LLM. In response to these issues, we present Caption Expansion with Contradictions and Entailments (CECE), a principled approach that leverages Natural Language Inference (NLI) to generate entailments and contradictions from a given premise. CECE produces lexically diverse sentences while maintaining their core meaning. Through extensive experiments, we show that CECE enhances interpretability and reduces overreliance on biased or superficial features. By balancing CECE alongside the original premise, we achieve significant improvements over previous methods without requiring additional fine-tuning, producing state-of-the-art results on benchmarks that score agreement with human judgments for image-text alignment, and achieving an increase in performance on Winoground of +19.2% (group score) and +12.9% on EqBen (group score) over the best prior work (finetuned with targeted data).
Keywords
text to image evaluation, image to text generation, natural language inference, sentence decomposition, large language models, visual question answering, question generation, benchmark

Reviews and Discussion

Official Review
Rating: 6

This work proposes to utilize entailments and contradictions generated by LLMs conditioned on captions from visio-linguistic compositionality benchmarks. Specifically, the authors prompt LLMs to generate entailment and contradiction statements in a one-shot manner, then calculate a weighted average of the conditional likelihood of a 'yes/no' response from a VL model, conditioned on image-statement pairs. The statements include the generated entailment/contradiction statements and the original caption, and the weights are hyperparameters. The proposed method shows effectiveness on both the EqBen and Winoground benchmarks.
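
As a rough illustration of the weighted-average scoring described above, here is a minimal sketch; the weight values, the yes-probability interface, and the treatment of contradictions via 1 − P(yes) are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch of CECE-style scoring; weights and the handling of
# contradictions are assumptions, not the authors' exact implementation.
from typing import Callable, List


def cece_score(
    image,
    caption: str,
    entailments: List[str],
    contradictions: List[str],
    p_yes: Callable[[object, str], float],  # VLM: P("yes" | image, statement)
    w_caption: float = 0.5,
    w_entail: float = 0.25,
    w_contra: float = 0.25,
) -> float:
    """Weighted average over the original caption and its NLI expansions.

    Entailments should hold for a matching image, so their yes-probability is
    used directly; contradictions should be false, so they contribute 1 - P(yes).
    """
    s_caption = p_yes(image, caption)
    s_entail = sum(p_yes(image, e) for e in entailments) / max(len(entailments), 1)
    s_contra = sum(1.0 - p_yes(image, c) for c in contradictions) / max(len(contradictions), 1)
    return w_caption * s_caption + w_entail * s_entail + w_contra * s_contra
```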

Strengths

  • The paper is clear and easy to follow.
  • The proposed method shows significant gains compared with other baselines on two visio-linguistic compositionality benchmarks.

Weaknesses

  • Although the proposed method demonstrates empirical effectiveness, the motivation for including contradiction statements lacks clarity. According to the prompt in Appendix A.1, the LLM is instructed to generate contradiction statements as sentences opposite to the entailment statements. This setup implies that each pair of entailment and contradiction statements may be semantically equivalent. Given this, it is unclear where the benefit lies in incorporating both entailments and contradictions.
  • In Table 6, it would be helpful to include the performance results for using contradictions alone.
  • In the qualitative error analysis in Fig. 5, the entailments and contradictions generated by the proposed method generally represent only a subset of the caption’s full semantics, rather than comprehensively capturing its meaning. This limitation may introduce semantic bias. An ablation study on the impact of the number of entailment/contradiction statements, along with an analysis of the semantic diversity among these statements, would enhance the understanding of their effects.

Questions

Refer to weaknesses.

Comment

We thank Reviewer G1JW for all the feedback. We are glad you found our paper clear and easy to follow. We address the weaknesses and questions below.

[W1] Why contradictions help. Natural Language Inference (NLI) tasks inherently require models to leverage common-sense knowledge to determine semantic sub-sets and relations. Our experiments show that CECE provides a strong prior for caption expansion and exclusion by considering both entailments and contradictions. Specifically, we instruct the LLM to exclude negations; without this instruction, the LLM would output: “the banana is not in two or more pieces” instead of “the banana is in one piece” (example from Figure 2), or “no water is coming out of the spraying device” instead of “the spraying device is dry” (example from Figure 5).
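
The exact one-shot prompt is given in Appendix A.1 of the paper and is not reproduced here; a purely hypothetical prompt in the same spirit, reusing the banana example above and the no-negation instruction, might look like the following sketch.

```python
# Hypothetical expansion prompt; the real prompt lives in Appendix A.1 of the paper.
def build_expansion_prompt(caption: str, n: int = 2) -> str:
    """One-shot prompt asking an LLM for entailments and contradictions without negations."""
    return (
        f"Given a premise, write {n} entailments and {n} contradictions of it.\n"
        "Do not use negation words (no, not, never); phrase contradictions affirmatively.\n\n"
        "Premise: there is a split banana\n"
        "Entailments:\n"
        "1. The banana is cut into pieces.\n"
        "2. The banana is in two or more pieces.\n"
        "Contradictions:\n"
        "1. The banana is in one piece.\n"
        "2. The banana is whole.\n\n"
        f"Premise: {caption}\n"
    )
```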

NLI-based expansion produces semantically equivalent statements that contain different vocabulary and sentence structures; thus, the resulting subset of captions (both entailments and contradictions) aids the VLM in covering cases where the caption is too complex or ambiguous.

Contradictions contribute uniquely to this process by offering an additional mechanism for validation. If a given image-caption pair is correct, its contradiction represents an untrue statement of the image. For example, in Figure 2, the correct caption is “there is a split banana”. The generated contradiction, “the banana is in one piece”, is a description inconsistent with the image. When the VLM correctly identifies the contradiction as untrue, this adds further evidence supporting the correctness of the original image-caption pair. Thus, contradictions enhance compositional reasoning by providing a complementary layer of validation that enriches the semantic understanding of the image-text pairs. Contradictions help the model disambiguate nuanced relationships, such as in the example of “split banana” vs. “banana split.” This interplay between entailments and contradictions ensures robust coverage of the semantic space, improving the model's performance on complex tasks.

[W2] Include contradictions alone in Table 6. We include the results when using only the contradictions below and in the revised version. As expected, contradictions alone do not yield state-of-the-art results; however, the result is competitive against prior SDS approaches.

| Entail. | Contrad. | Caption | WinoG Text | WinoG Image | WinoG Group | EqBen Text | EqBen Image | EqBen Group |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  |  | 49.3 | 47.3 | 36.0 | 45.0 | 57.9 | 33.6 |
|  |  |  | 31.7 | 38.0 | 24.2 | 20.7 | 27.1 | 13.6 |
|  |  |  | 46.8 | 57.5 | 39.0 | 47.1 | 60.0 | 38.6 |
|  |  |  | 52.0 | 61.3 | 42.8 | 58.6 | 64.3 | 47.1 |

[W3] Impact of the number of generated captions and semantic diversity. We take the number of entailments and contradictions for Winoground, and compute some statistics:

| Statistic | Entailments | Contradictions |
| --- | --- | --- |
| Mean | 2.225 | 2.1375 |
| Standard deviation | 0.7726 | 0.6234 |
| Maximum value | 10 | 9 |
| Minimum value | 2 | 1 |
| Median | 2.0 | 2.0 |

The LLM generally follows the instruction of 2 entailments and 2 contradictions; however, it sometimes outputs up to 10. In our experiments, we take all outputs from the LLM to compute the final values. We found that, in some cases where the caption is correctly structured and contains more than one action, the LLM tends to provide shorter and more nuanced entailments and contradictions, resulting in more captions.

For semantic diversity, we report the Jaccard similarity to assess the lexical diversity of the generated captions (L428-L431). CECE captions and the original captions show a low similarity score, which supports the diversity claim. We also included human evaluations of the caption expansions (L444-L457). Using a Likert scale, the entailment captions received an average score of 4.7, while the contradiction captions received an average score of 1.7. This indicates that both the entailments and contradictions are accurately generated.
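
The Jaccard similarity used for this lexical-diversity check is the standard token-set overlap; a minimal sketch follows (the exact tokenization used in the paper may differ).

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Token-set Jaccard similarity; lower values indicate more lexical diversity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


# Example: a CECE-style entailment shares few tokens with the original caption.
print(jaccard_similarity("there is a split banana", "the banana is cut into pieces"))
```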

Each set of entailments and contradictions provides evidence about the relationship between the image and caption, with some being more discriminative than others in specific cases. This does not necessarily imply a bias but reflects the complementary nature of these expansions in covering diverse semantic dimensions. Entailments often affirm subtle consistencies, while contradictions highlight mismatches that are crucial for distinguishing between closely related visual-text pairs.

We hope our response answers your questions. Please let us know if you would like to ask for any further clarifications.

Comment

Dear Reviewer G1JW,

We sincerely thank you for your valuable comments and feedback on our submission. As the rebuttal period is coming to an end, we wanted to follow up to see if you have any additional questions or concerns about our responses. We have also submitted a revised version of our work addressing your feedback. If there’s anything further you’d like us to clarify, we’d be happy to address it. Thank you again for your time and for reviewing our paper.

Official Review
Rating: 8

The authors propose a new text-image alignment metric, CECE, that considers the original caption alongside a set of LLM-generated entailed and contradictory statements, as opposed to past work that breaks captions down into smaller sub-statements. This is motivated by a desire to increase lexical diversity and improve performance on captions involving reasoning or world knowledge. The authors validate their approach on Winoground and EqBen and also measure alignment with human judgements on five benchmarks for text-to-image generation and one benchmark for text-to-3D generation, outperforming prior work on all but StanfordT23D. This is coupled with finer-grained analyses of performance, lexical diversity, and correctness on Winoground.

Strengths

  1. The method achieves state-of-the-art performance on Winoground and EqBen, outperforming the second-best evaluated model by large margins of 11.2% and 9.3%, respectively. Its alignment with human judgements is similarly strong, albeit with a smaller gap over prior work.
  2. This improvement is achieved via a relatively simple inference-time strategy requiring no finetuning.
  3. It was unexpected for me that "expanding" a caption via entailed and contradictory statements would lead to better performance than breaking down statements into smaller components as in prior work. This could inspire further inference-time strategies either refining the current approach or combining it with elements of past work.

Weaknesses

  1. One unaddressed limitation of the approach compared to the VQAScore baseline is the inference-time cost of additionally using Llama3.1 70B for caption expansion. This is an integral component of the approach and I would have expected ablations of the LLM (for instance by using smaller variants of Llama3.1) in the same way that there were ablations of the VLM.

  2. I think the paper would be improved by further discussions of when CECE underperforms against the baselines. On Table 4, for instance, CECE without ensembling underperforms against the VQAScore baseline for the Text score for the Object_L and Both_L categories (and by a roughly 25% margin for the latter). Similarly, for the Pragmatic_V split, CECE can only decisively outperform the VQAScore baseline after ensembling and is either competitive with or poorer than the VQAScore baseline otherwise. An analysis of why would lead to a better understanding of the strengths and weaknesses of the approach and would help future work improve the method further.

  3. On a related note, I think the comparisons for the ensembling strategy would be fairer if a similar ensembling strategy were applied for VQAScore and DSG, particularly when considering the boost ensembling affords on Table 4. For instance, would VQAScore still perform worse if some weighted average of the Llava-1.5 and 1.6 probabilities were used instead?

Questions

  1. In most of the Winoground splits on Table 4, CECE performs better on selecting images rather than text unlike the VQAScore baselines. Any ideas why this may be the case?

  2. Is there any reason why CECE with InstructBLIP was not included for Tables 2 and 3 (especially considering that VQAScore with InstructBLIP achieves superior performance compared to Llava-1.5)?

  3. Is the value of M for the number of entailments and contradictions set in stone or variable based on the input caption?

Comment

[W2] Why the text score of the baselines is better on Both_L and Symbolic_V. Thank you for raising this question! Upon further investigation, we identified an error in the reported results. The baseline values for LLaVA-1.5 were slightly overestimated, while the CECE values with LLaVA-1.6 included only the entailment scores, leading to a slight underestimation. We updated Table 4 in the manuscript with the correct scores and attach the table below for reference:

| Method | Object_L Text | Object_L Image | Object_L Group | Relation_L Text | Relation_L Image | Relation_L Group | Both_L Text | Both_L Image | Both_L Group | Symbolic_V Text | Symbolic_V Image | Symbolic_V Group | Pragmatics_V Text | Pragmatics_V Image | Pragmatics_V Group |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Human | 92.20 | 90.78 | 88.65 | 89.27 | 90.56 | 86.70 | 76.92 | 57.69 | 57.69 | 96.43 | 92.86 | 92.86 | 58.82 | 41.18 | 41.18 |
| InstructBLIP | 42.5 | 49.7 | 27.7 | 34.3 | 33.9 | 20.2 | 65.4 | 38.5 | 34.6 | 31.7 | 21.9 | 14.6 | 25.0 | 29.2 | 8.3 |
| LLaVA-1.5 | 46.1 | 46.8 | 28.4 | 45.1 | 48.3 | 30.8 | 65.4 | 46.2 | 38.5 | 46.3 | 31.7 | 20.8 | 32.9 | 41.7 | 12.5 |
| LLaVA-1.6 | 48.2 | 53.9 | 35.5 | 43.8 | 40.8 | 27.9 | 65.4 | 46.1 | 38.5 | 46.3 | 41.5 | 26.8 | 39.2 | 41.7 | 16.7 |
| LLaVA-1.5+1.6 | 51.7 | 53.9 | 36.2 | 46.2 | 44.2 | 34.3 | 61.5 | 46.1 | 38.5 | 56.1 | 43.9 | 31.7 | 37.5 | 33.3 | 20.8 |
| DSG | 45.4 | 44.0 | 27.6 | 42.9 | 41.2 | 30.9 | 61.5 | 53.8 | 46.1 | 46.3 | 41.5 | 29.3 | 33.3 | 33.3 | 20.8 |
| DSG* | 52.5 | 50.3 | 35.5 | 51.0 | 45.5 | 34.3 | 53.8 | 42.3 | 34.6 | 41.4 | 31.7 | 19.5 | 50.0 | 41.7 | 33.3 |
| Ours |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| CECE | 51.8 | 66.0 | 43.3 | 51.5 | 59.2 | 42.5 | 57.7 | 38.3 | 42.3 | 53.7 | 65.9 | 39.0 | 45.8 | 50.0 | 33.3 |
| CECE* | 56.7 | 68.8 | 49.7 | 53.6 | 56.7 | 45.9 | 57.7 | 61.5 | 50.0 | 58.5 | 58.5 | 43.9 | 37.5 | 41.7 | 33.3 |

The text score measures if the model can select the correct caption given an image. In our manual evaluation, we observed that the LLM outputs tend to be noisy when the original captions contain multiple objects, attributes, and actions. These captions are generally tagged as Both_L (the subset of samples where both objects and relations are swapped, or a single swap that changes parts of speech).

Comment

We thank Reviewer ZcB9 for all the feedback. We are glad you found our method simple yet effective, as it requires no fine-tuning and operates entirely at inference time. We also hope our work can inspire further inference-time strategies building on our findings. We address the weaknesses and questions below.

[W1] Ablations of the LLM. We include additional experiments using Llama 3.1-8B. This was also pointed out by reviewer Xyex, and we thank both for bringing it to our attention. We include these experiments in the revised version of the manuscript. We also show a fraction of the table below for reference.

| Method | Tools | LLM | Winoground Text | Winoground Image | Winoground Group | EqBen Text | EqBen Image | EqBen Group |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random Chance | - | - | 25.0 | 25.0 | 16.7 | 25.0 | 25.0 | 16.7 |
| Human Evaluation | - | - | 89.5 | 88.5 | 85.5 | - | - | - |
| End-to-end models |  |  |  |  |  |  |  |  |
| VQAScore (Lin et al., 2024) | LLaVA-1.5 | - | 45.5 | 41.3 | 29.8 | 45.0 | 47.1 | 28.6 |
| VQAScore (Lin et al., 2024) | LLaVA-1.6 | - | 46.8 | 45.8 | 31.3 | 46.4 | 54.3 | 32.9 |
| Sentence Decomposition via Semantics (SDS) |  |  |  |  |  |  |  |  |
| DSG (Cho et al., 2023) | LLaVA-1.5 | Llama3.1 (8B) | 5.7 | 9.5 | 3.7 | 10.0 | 14.3 | 6.4 |
| DSG (Cho et al., 2023) | LLaVA-1.6 | Llama3.1 (8B) | 4.5 | 10.2 | 2.7 | 10.7 | 14.3 | 6.4 |
| Ours |  |  |  |  |  |  |  |  |
| CECE | LLaVA-1.5 | Llama3.1 (8B) | 47.7 | 49.7 | 35.5 | 48.6 | 54.3 | 35.0 |
| CECE | LLaVA-1.6 | Llama3.1 (8B) | 48.0 | 57.5 | 38.7 | 50.7 | 64.3 | 40.0 |
| CECE | LLaVA-1.5, LLaVA-1.6 | Llama3.1 (8B) | 50.0 | 53.5 | 39.0 | 53.6 | 57.1 | 40.7 |

Our results show that even with a significantly smaller model, our CECE approach consistently outperforms the end-to-end approaches, including the GPT4-Vision model (which leverages the GPT4V-Eval method proposed in Zhang et al., 2023).

As we mentioned in our response to Reviewer Xyex, it is also worth noting that Llama3.1 8B outputs hurt other Sentence Decomposition via Semantics (SDS) methods. Our experiments show that a smaller LLM tends to output unrelated or truncated outputs with these approaches. We hypothesize that this behavior is due to the complexity of the prompt and the required task decomposition in these methods (e.g., dividing the sentence into a subset of tasks that require separate processing, along with the larger textual context due to the multi-shot example demonstrations).

Comment

[W3] Ensembling VQAScore LLaVA-1.5+LLaVA-1.6 (Table 4). We show results when ensembling VQAScores with LLaVA-1.5 and LLaVA-1.6. In this case, the ensembled VQAScore surpasses our CECE without LLaVA-1.5+LLaVA-1.6 ensembling only on the Symbolic_V subset. The Symbolic_V tag includes samples with non-natural images such as drawings or illustrations; while the LLM outputs are aligned with common-sense knowledge, drawings are not necessarily grounded in reality. For the samples tagged as Pragmatics_V, the images need to be interpreted non-literally due to idiomatic uses of language in the caption.
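
For reference, the kind of VLM ensembling compared here reduces to a weighted average of the two backbones' per-statement scores; a minimal sketch follows, with the equal weighting being an assumption.

```python
from typing import Callable


def ensemble_score(
    image,
    statement: str,
    score_llava15: Callable[[object, str], float],
    score_llava16: Callable[[object, str], float],
    w: float = 0.5,  # assumed equal weighting between the two backbones
) -> float:
    """Weighted average of the scores produced by two VLM backbones."""
    return w * score_llava15(image, statement) + (1.0 - w) * score_llava16(image, statement)
```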

In addition, we add qualitative samples in the appendix (Figures 10 and 11). These observations align with the semantic drift issue we mitigate by balancing CECE with the whole caption score. We include additional examples in the appendix (please also take a look at Figures 8 and 9, which are closely related).

[Q1] Why CECE performs better on the image score. The text score evaluates the model's ability to distinguish between captions for a given image, and the image score evaluates its ability to distinguish between images for a given caption. Prior work has pointed out that some visual features may be subtle or difficult for the models to capture at their input resolutions, making the image score task harder [2]. This is particularly true for almost all end-to-end models (e.g., in [1], all models shown in Table 3 perform worse under the image score metric).
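
For context, the Winoground text, image, and group scores are computed per example from the four caption-image alignment scores (standard definitions from [1]); any alignment score, including CECE, can be plugged in. A minimal sketch:

```python
def winoground_metrics(s00: float, s01: float, s10: float, s11: float) -> dict:
    """Per-example Winoground metrics, where s_ij = score(caption_i, image_j).

    Text score: each image's matching caption outscores the mismatched caption.
    Image score: each caption's matching image outscores the mismatched image.
    Group score: both conditions hold. Dataset-level numbers are the fraction
    of examples for which each flag is True.
    """
    text = s00 > s10 and s11 > s01
    image = s00 > s01 and s11 > s10
    return {"text": text, "image": image, "group": text and image}
```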

A key advantage of the CECE and SDS approaches is their interpretability. By expanding or decomposing captions into smaller semantic units, these methods provide models with more fine-grained information to match against images. With CECE, noise or errors in the expansion process for one caption apply equally to both images, maintaining consistency in the scoring process for the image score case. In contrast, the text score involves comparing two captions with separate sets of entailments and contradictions against one image, which introduces variance and increases the likelihood of errors or inconsistencies affecting the result. This asymmetry makes the image score less sensitive to erroneous expansions, justifying CECE’s stronger performance in this metric.

[Q2] Including CECE with InstructBLIP in Tables 2 and 3: We run experiments with CECE using InstructBLIP as the VLM backbone. Our results show that CECE outperforms the base model, and particularly in Table 3, CECE outperforms both InstructBLIP and LLaVA-1.5. We include both updated tables in the revised version. We also add Table 3 below for reference.

| Method | Pairwise Acc. | Pearson | Kendall |
| --- | --- | --- | --- |
| End-to-end models |  |  |  |
| CLIPScore | 61.0 | 48.1 | 32.6 |
| BLIPv2Score | 56.6 | 34.3 | 23.4 |
| InstructBLIP | 68.0 | 59.5 | 47.5 |
| LLaVA-1.5 | 64.9 | 55.8 | 40.8 |
| Finetuned on human feedback |  |  |  |
| ImageReward | 66.3 | 57.1 | 43.6 |
| PickScore | 60.1 | 41.3 | 30.3 |
| HPSv2 | 55.9 | 31.5 | 21.9 |
| CECE |  |  |  |
| w/ InstructBLIP | 68.5 | 64.0 | 48.4 |
| w/ LLaVA-1.5 | 65.3 | 57.4 | 41.8 |

[Q3] Number of entailments and contradictions. We take the number of entailments and contradictions for Winoground, and compute some statistics:

| Statistic | Entailments | Contradictions |
| --- | --- | --- |
| Mean | 2.225 | 2.1375 |
| Standard deviation | 0.7726 | 0.6234 |
| Maximum value | 10 | 9 |
| Minimum value | 2 | 1 |
| Median | 2.0 | 2.0 |

Broadly, the LLM follows the instruction of 2 entailments and 2 contradictions; however, sometimes it outputs up to 10. In our experiments, we take all outputs from the LLM to compute the final values. We found that, in some cases where the caption is correctly structured and contains more than one action, the LLM tends to provide shorter and more nuanced entailments and contradictions, resulting in more captions.

We hope our response answers your questions. Please let us know if you would like to ask for any further clarifications.


[1] Thrush, Tristan, et al. "Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

[2] Diwan, Anuj, et al. "Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality." Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.

Comment

I thank the authors for their response, which addresses all of my questions and concerns.

Comment

I want to write a follow-up comment to my original response. I am leaning towards accepting the paper due to the robustness of the method to smaller LLMs, the correction of the results in Table 4 and the added results for the SugarCrepe dataset and have accordingly increased my score to an 8.

Comment

Thank you very much for your detailed feedback and for taking the time to review the updates. We also appreciate your following up on your original response! Your suggestions are invaluable in improving our work, and we sincerely appreciate your recognition of our efforts to address your concerns.

Official Review
Rating: 6

This paper proposes Caption Expansion with Contradictions and Entailments (CECE) to enhance compositional reasoning capabilities in vision-language models (VLMs) by leveraging a large language model (LLM) to generate entailments and contradictions through natural language inference (NLI) prompts. The core idea is to prompt the LLM to expand each image caption by producing semantically related statements, both entailments and contradictions. The intuition is that CECE provides the VLM with lexically diverse cues that encourage deeper visual-textual alignment for complex compositional reasoning. The authors conduct extensive evaluations on two common compositionality benchmarks, Winoground and EqBen, where CECE achieves competitive results.

Strengths

This paper is well-written and relatively easy to follow. The core ideas are intuitive. I think the proposed method, CECE, addresses a critical limitation in vision-language models, as vanilla pretrained VLMs have traditionally struggled with compositional reasoning tasks. CECE provides a novel solution that enriches the semantic understanding of VLMs, and I see it as beneficial to the broader community. The experimental results are competitive, with clear improvements over existing methods.

Weaknesses

A crucial limitation of this method is its reliance on two balancing hyperparameters. While the authors provide some analysis, the need for these balancing hyperparameters could hurt CECE's generalizability to new tasks and datasets. I think documenting example cases of semantic drift in the appendix would be valuable to the community, as it would provide insight into how auxiliary prompting can affect the semantic meaning of the original captions.

Questions

Minor experimental detail:

Did the authors use any system prompt or chat template for the open-source LLMs during the experiments?

Comment

We thank Reviewer 1DuW for all the feedback. We are glad you found our core idea intuitive, considered CECE a novel solution that enriches semantic understanding, and found our experimental results competitive, with clear improvements over existing methods. We address the weaknesses and questions below.

[W1] Semantic drift with CECE. We add detailed explanations for cases where CECE captions drift away from the original semantic meaning in the appendix. Particularly, in Figures 8 and 9 (in the Appendix), we add examples where the LLM outputs imprecise entailments, or the caption is not decomposed appropriately. We also add detailed descriptions in Figures 5-6 to explain cases where CECE fails in the scoring process.

[Q1] Minor experimental detail. System prompt: we do not use a particular instruction for the system (i.e., an empty string). Chat template: we do not modify the chat template (i.e., tokenizer.apply_chat_template has the default functionality from the huggingface model card).
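
Concretely, this default setup corresponds to something like the following Hugging Face usage (the checkpoint id and prompt string are illustrative).

```python
from transformers import AutoTokenizer

# Illustrative checkpoint id; the rebuttal experiments use Llama 3.1 Instruct models.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# No system message is added; the expansion prompt goes in a single user turn.
messages = [{"role": "user", "content": "Given a premise, write 2 entailments ..."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
```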

We hope our response answers your questions. Please let us know if you would like to ask for any further clarifications.

Comment

Dear Reviewer 1DuW,

We sincerely thank you for your valuable comments and feedback on our submission. As the rebuttal period is coming to an end, we wanted to follow up to see if you have any additional questions or concerns about our responses. We have also submitted a revised version of our work incorporating all reviewers' feedback. If there’s anything further you’d like us to clarify, we’d be happy to address it. Thank you again for your time and for reviewing our paper.

Official Review
Rating: 8

The paper presents Caption Expansion with Contradictions and Entailments (CECE), a new approach to improve compositionality in the vision-language models by transforming the caption into a natural language inference task. The main idea is to paraphrase the caption into contradictions and entailments with a large language model and then combine them with a vision-language model for final prediction. The results on two important benchmarks (image-to-text and text-to-image) show that CECE outperforms existing baselines without any fine-tuning.

Strengths

The paper is well-written, and the motivation and method are clearly explained. The related work is thorough, which makes it easy to understand the existing body of work on the topic.

The method focuses on manipulating the captions and shows improved performance without any fine-tuning, which is impressive and useful.

The analysis shows that all the components — caption, entailment, and contradiction — are important for the highest performance on the Winoground and EqBen datasets.

Weaknesses

Inference cost. The main weakness of the method is that it requires a very powerful LLM, such as Llama 3.1 70B, for caption expansion. This adds to the inference and compute cost for generating lexically diverse captions. It would be helpful to know if a smaller model can improve performance similarly. It would be ideal to have an experiment showing the performance of the VLM across a range of language models.

More datasets. The paper experiments on two datasets, namely Winoground and EqBen. However, Winoground is a small dataset with only 1600 examples. It would be helpful to know if the method works on larger datasets such as SugarCrepe [1] or larger synthetic datasets such as the concept binding dataset [2].

[1] SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality. NeurIPS: Datasets and Benchmarks Track, 2023.

[2] Does clip bind concepts? probing compositionality in large image models. EACL Findings, 2024.

Questions

See the weaknesses.

Comment

[W2] More datasets. We show results using CECE variants with SugarCrepe [1]. It is worth noting that prior work that reports strong results in this benchmark tends to perform poorly on Winoground and EqBen. CF2C uses GPT-4V, where the input is an image and the pair of captions, and the model outputs which caption is more likely to describe the image. We include this number in the table below, even though we evaluate using only one image and one caption at a time.

| Method | Tools | LLM | Replace Object | Replace Attribute | Replace Relation | Swap Object | Swap Attribute | Add Object | Add Attribute | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Human Evaluation | - | - | 100 | 99 | 97 | 99 | 100 | 99 | 99 | 99.1 |
| VQAScore | LLaVA-1.5 | - | 82.0 | 86.4 | 89.6 | 85.7 | 85.6 | 91.4 | 85.5 | 86.7 |
| VQAScore | LLaVA-1.6 | - | 92.9 | 90.1 | 92.1 | 90.6 | 89.0 | 93.7 | 89.4 | 91.0 |
| CF2C [2] | GPT-4V | - | 96.3 | 93.5 | 90.3 | 83.1 | 90.1 | 91.6 | 91.8 | 90.6 |
| DSG | LLaVA-1.5 | Llama-3.1 | 97.7 | 92.1 | 78.9 | 82.0 | 86.3 | 98.2 | 93.1 | 90.7 |
| DSG | LLaVA-1.6 | Llama-3.1 | 97.6 | 93.8 | 80.6 | 83.2 | 91.1 | 98.0 | 96.0 | 91.6 |
| CECE | LLaVA-1.5 | Llama-3.1 | 93.5 | 88.1 | 88.6 | 85.2 | 81.7 | 91.9 | 84.5 | 87.2 |
| CECE | LLaVA-1.6 | Llama-3.1 | 96.6 | 93.0 | 93.3 | 88.5 | 88.6 | 95.5 | 89.3 | 91.8 |
| CECE | LLaVA-1.5, LLaVA-1.6 | Llama-3.1 | 96.1 | 93.0 | 93.8 | 90.2 | 90.8 | 95.9 | 90.2 | 92.6 |

These results provide strong evidence that supports the effectiveness of our work. We expect future research directions that leverage caption expansion techniques (i.e., entailments and contradictions) to develop inference-time methods and training strategies that improve compositional reasoning in vision-language models. We hope our response answers your questions. Please let us know if you would like to ask for any further clarifications.

[1] SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality. NeurIPS: Datasets and Benchmarks Track, 2023.

[2] https://github.com/RAIVNLab/sugar-crepe/tree/main/gpt-4v-results

Comment

Thanks for the response and the additional results. The results on SugarCrepe look impressive. It would be awesome if you could include these results in the paper.

Increasing my score to 8 and confidence to 4.

Comment

Thank you very much for your detailed feedback and for taking the time to review the updates. Your suggestions are invaluable in improving our work. We sincerely appreciate your recognition of our efforts to address your concerns. We will include the results on SugarCrepe in the re-revised version.

Comment

We thank Reviewer Xyex for all the provided feedback. We are glad you find our paper well written, with a clear motivation and method. We are also glad you acknowledge CECE’s strong performance, especially since it’s architecture-agnostic and does not require any fine-tuning. We address the weaknesses and questions below.

[W1] Inference cost. We include results using a smaller open-sourced model, i.e., Llama 3.1-8B in the revised version. For a fair comparison, we also run DSG using this same LLM. We show a fraction of the table below for reference.

| Method | Tools | LLM | Winoground Text | Winoground Image | Winoground Group | EqBen Text | EqBen Image | EqBen Group |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random Chance | - | - | 25.0 | 25.0 | 16.7 | 25.0 | 25.0 | 16.7 |
| Human Evaluation | - | - | 89.5 | 88.5 | 85.5 | - | - | - |
| End-to-end models |  |  |  |  |  |  |  |  |
| VQAScore (Lin et al., 2024) | LLaVA-1.5 | - | 45.5 | 41.3 | 29.8 | 45.0 | 47.1 | 28.6 |
| VQAScore (Lin et al., 2024) | LLaVA-1.6 | - | 46.8 | 45.8 | 31.3 | 46.4 | 54.3 | 32.9 |
| Sentence Decomposition via Semantics (SDS) |  |  |  |  |  |  |  |  |
| DSG (Cho et al., 2023) | LLaVA-1.5 | Llama3.1 (8B) | 5.7 | 9.5 | 3.7 | 10.0 | 14.3 | 6.4 |
| DSG (Cho et al., 2023) | LLaVA-1.6 | Llama3.1 (8B) | 4.5 | 10.2 | 2.7 | 10.7 | 14.3 | 6.4 |
| Ours |  |  |  |  |  |  |  |  |
| CECE | LLaVA-1.5 | Llama3.1 (8B) | 47.7 | 49.7 | 35.5 | 48.6 | 54.3 | 35.0 |
| CECE | LLaVA-1.6 | Llama3.1 (8B) | 48.0 | 57.5 | 38.7 | 50.7 | 64.3 | 40.0 |
| CECE | LLaVA-1.5, LLaVA-1.6 | Llama3.1 (8B) | 50.0 | 53.5 | 39.0 | 53.6 | 57.1 | 40.7 |

Our results show that even with a significantly smaller model, our CECE approach consistently outperforms the end-to-end approaches, including the GPT4-Vision model (which leverages the GPT4V-Eval method proposed in Zhang et al., 2023). It is also worth noting that Llama3.1 8B outputs hurt other Sentence Decomposition via Semantics (SDS) methods.

Our experiments show that a smaller LLM tends to output unrelated or truncated outputs with these approaches. We hypothesize that this behavior is due to the complexity of the prompt and required task decomposition in these methods (e.g., dividing the sentence as a subset of tasks that require separate processing, along with the larger textual context due to the multi-shot example demonstrations).

AC Meta-Review

CECE improves compositional reasoning in vision-language models (VLMs) by using a large language model (LLM) to generate entailments and contradictions of captions. These expanded captions provide more diverse cues for better visual-textual alignment, leading to improved performance on compositional tasks like Winoground and EqBen, often without fine-tuning.

The paper is well-written, and the motivation and method are clearly explained. The core ideas are intuitive and the experiment results demonstrate the merit of the work.

Additional Comments from the Reviewer Discussion

The reviewers agree on the novelty and contribution of the work. The authors also addressed the reviewers' concerns by adding more experimental results and analysis.

Final Decision

Accept (Poster)