Hidden in plain sight: VLMs overlook their visual representations
VLMs perform worse on vision-centric tasks than their underlying vision models, relying on their language priors instead. Improving their integration of visual data—not just adding stronger vision backbones—is key to unlocking their full potential.
Abstract
Reviews and Discussion
This paper studies the performance of VLMs and compares it against evaluations that use only the visual encoder. The study finds that VLM performance is much lower than that obtained from the visual encoder alone.
Reasons To Accept
The experiments are systematic.
Reasons To Reject
- Under the task of Depth Estimation: The comparison between the visual encoder and the VLM is not fair, as the visual encoder benefits from additional training data, effectively incorporating extra knowledge. Unfortunately, this undermines the validity of this task within the context of the paper, and I believe it does not offer meaningful insight in its current form.
- Other Tasks: While these tasks have some value, the scientific contribution is quite limited. Fundamentally, the output of the visual encoder serves as the input to the VLM’s language model. If the visual encoder’s output does not contain sufficient information to perform these tasks, it is natural to expect that the VLM will not either. In other words, the performance obtained by (sufficiently) using only the visual encoder’s output is effectively an upper bound on the performance of the VLM, since the VLM uses the visual encoder’s output as its input. As far as I understand, these experiments can only show that the VLM performs poorly on these tasks, and the authors’ analysis shows that at least this is not caused by the visual encoder.
- Figure 5: How many trials were averaged to compute the accuracy shown in Figure 5? The curves exhibit substantial fluctuation, which likely stems from an insufficient number of trials. Based on Figure 5, I can somewhat agree with the claim that “vision representations remain intact as they pass through the projector (gray region) and LLM (white region).” However, I do not find the interpretation in lines 199 to 218 convincing. In particular, the last layer does not consistently show a performance drop. Instead, I observe a mix of increases and decreases, which again may be attributed to the small number of trials.
Questions To Authors
see "Reasons To Reject"
We thank the reviewer for their thoughtful response to our work. Below, we reference and respond to each concern:
W1: The reviewer raises a valid point that this task is the only one whose visual evaluation involves training on data that the VLM evaluation does not use. We maintain that the Depth Estimation task contributes meaningfully to the thesis of the paper. The DPT head is commonly used to read out spatial structure already encoded in a vision model, not as an external source of “extra knowledge.” To the best of our knowledge, learning a lightweight depth decoder is the standard method to evaluate the representations from a generic vision model incapable of producing depth maps on its own, and we make sure to train the probe on a different dataset (NYUv2) from the model evaluation (Omni3D).
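To make the visual evaluation concrete, here is a minimal sketch of the kind of lightweight depth read-out probe we have in mind, trained on frozen vision features. This is illustrative only: the actual evaluation uses a DPT head, and the `encoder` interface and NYUv2 `loader` below are hypothetical placeholders rather than our exact implementation.

```python
# Minimal sketch of a depth read-out probe on frozen vision features.
# Illustrative only: the paper's evaluation uses a DPT head; `encoder` and
# `loader` (e.g., NYUv2 image/depth pairs) are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthProbe(nn.Module):
    """Lightweight decoder that maps frozen patch features to a depth map."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(feat_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, kernel_size=1),  # one depth value per patch
        )

    def forward(self, feats: torch.Tensor, out_hw) -> torch.Tensor:
        depth = self.head(feats)  # (B, 1, h, w) on the patch grid
        return F.interpolate(depth, size=out_hw, mode="bilinear", align_corners=False)

def train_probe(encoder, probe, loader, epochs=5, lr=1e-4, device="cuda"):
    """Train only the probe; the vision encoder stays frozen."""
    encoder.eval().requires_grad_(False)
    opt = torch.optim.AdamW(probe.parameters(), lr=lr)
    for _ in range(epochs):
        for images, gt_depth in loader:  # assumed (B, 3, H, W), (B, H, W) pairs
            images, gt_depth = images.to(device), gt_depth.to(device)
            with torch.no_grad():
                feats = encoder(images)  # assumed (B, C, h, w) patch features
            pred = probe(feats, gt_depth.shape[-2:]).squeeze(1)
            loss = F.l1_loss(pred, gt_depth)
            opt.zero_grad(); loss.backward(); opt.step()
```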
This observation also gets at another important point: language is a flexible way to allow users to specify tasks without bespoke procedures such as training a depth head or computing texture features. Finding solutions that close the visual-VLM gap within this closed set of tasks can lead to improved performance on tasks that are difficult to specify zero-shot without language.
W2: We agree with the reviewer’s point that, if the visual encoder does not contain sufficient information to perform a task, the rest of the VLM should not be expected to “rescue” performance. We point out a concrete example of this when evaluating InternVL (L161-164). On the other hand, it is not impossible for the VLM to outperform our visual evaluations, which are designed to be as simple as possible. For example, the LLM backbone could learn a more complex computation over the vision representations that outperforms a simple pointwise cosine-similarity function for correspondence tasks. In theory, the VLM could also identify paintings by name or artist, and reason through the Art Style task in a more robust way than texture comparison.
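For reference, a minimal sketch of the simple pointwise cosine-similarity read-out we refer to for the correspondence tasks. This is illustrative only; the exact feature extraction and keypoint handling are simplified away, and the function and argument names are placeholders.

```python
# Minimal sketch of a cosine-similarity correspondence baseline on frozen features.
# Illustrative only; names and shapes are assumptions, not the exact implementation.
import torch
import torch.nn.functional as F

def match_keypoints(feats_src: torch.Tensor,   # (Hs*Ws, C) patch features, image 1
                    feats_tgt: torch.Tensor,   # (Ht*Wt, C) patch features, image 2
                    src_idx: torch.Tensor      # (K,) indices of query patches in image 1
                    ) -> torch.Tensor:
    """For each query patch, return the index of the most similar patch in image 2."""
    q = F.normalize(feats_src[src_idx], dim=-1)  # (K, C)
    t = F.normalize(feats_tgt, dim=-1)           # (Ht*Wt, C)
    sim = q @ t.T                                # pointwise cosine similarity
    return sim.argmax(dim=-1)                    # predicted correspondence per query
```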
W3: Our data for Figure 5 was computed over the whole benchmark, so there was a single trial. We will add shaded intervals for standard deviation in the final revision of the paper by subsampling 80% of the benchmark data for 10 trials. While we cannot provide an updated figure in OpenReview comments, we provide tables below with statistics from computing the standard deviation in this bootstrapped way. We include the min, max, and mean STD values computed across layers, as well as the last-layer STD to contextualize its variability against that of the other layers. Hopefully these stats provide more context for the results in Figure 5.
Semantic Correspondence
| Model | Mean | Min | Max | Last Layer |
|---|---|---|---|---|
| CLIP | 0.0191 | 0.0115 | 0.0285 | 0.0220 |
| DINOv2 | 0.0197 | 0.0106 | 0.0295 | 0.0278 |
| IN-1k | 0.0196 | 0.0086 | 0.0318 | 0.0148 |
| SigLIP | 0.0203 | 0.0111 | 0.0267 | 0.0129 |
Low-level Matching
| Model | Mean | Min | Max | Last Layer |
|---|---|---|---|---|
| CLIP | 0.0177 | 0.0083 | 0.0276 | 0.0113 |
| DINOv2 | 0.0167 | 0.0093 | 0.0248 | 0.0190 |
| IN-1k | 0.0169 | 0.0089 | 0.0271 | 0.0266 |
| SigLIP | 0.0174 | 0.0080 | 0.0269 | 0.0159 |
Object Affordance
| Model | Mean | Min | Max | Last Layer |
|---|---|---|---|---|
| CLIP | 0.0178 | 0.0086 | 0.0269 | 0.0239 |
| DINOv2 | 0.0180 | 0.0109 | 0.0308 | 0.0161 |
| IN-1k | 0.0189 | 0.0098 | 0.0314 | 0.0247 |
| SigLIP | 0.0171 | 0.0087 | 0.0250 | 0.0250 |
Art Style
| Model | Mean | Min | Max | Last Layer |
|---|---|---|---|---|
| CLIP | 0.0185 | 0.0093 | 0.0282 | 0.0141 |
| DINOv2 | 0.0188 | 0.0107 | 0.0295 | 0.0281 |
| IN-1k | 0.0199 | 0.0118 | 0.0314 | 0.0118 |
| SigLIP | 0.0193 | 0.0128 | 0.0265 | 0.0185 |
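For clarity, here is a minimal sketch of the bootstrapping procedure behind these numbers (subsample 80% of the benchmark, 10 trials, standard deviation of accuracy across trials). `per_example_correct` is a hypothetical 0/1 array of per-example correctness for a single model and layer; the real evaluation differs in detail.

```python
# Minimal sketch of the bootstrapped standard deviation reported above.
# `per_example_correct` is a hypothetical 0/1 array for one (model, layer) setting.
import numpy as np

def bootstrap_std(per_example_correct: np.ndarray,
                  frac: float = 0.8, trials: int = 10, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    n = len(per_example_correct)
    k = int(frac * n)
    accs = []
    for _ in range(trials):
        idx = rng.choice(n, size=k, replace=False)    # subsample 80% of the benchmark
        accs.append(per_example_correct[idx].mean())  # accuracy on the subsample
    return float(np.std(accs))
```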
Thank you for the detailed response.
W3: I recommend that the authors include these results in the final version of the paper.
W1: Thank you for acknowledging that "visual evaluation involves training on data that the VLM evaluation does not use." This clarification highlights the potential unfairness in the comparison.
W2: It is obvious that current VLMs have limitations in tons of tasks. I am concerned that merely identifying such limitations offers insufficient scientific contribution for a top-tier AI conference. A meaningful advancement would involve not only highlighting these shortcomings but also proposing solutions, or at least preliminary solutions, to address them.
I sincerely appreciate the authors' efforts in this paper. However, at this point, I would prefer to maintain my current score, as explained above.
Dear Reviewer Nkzv,
The authors have responded to your questions. Does their answer address your reasons to reject?
This paper shows that VLMs do not make optimal use of the visual representations that they ingest. It does so by comparing, on "vision-centric" tasks (that is, tasks that do not require and cannot rely on non-visual world knowledge), classifiers trained directly on top of the visual encoders against the performance of VLMs that use those same encoders. It shows that across the 6 selected tasks (all from existing benchmarks), the VLMs perform worse. Additionally, the paper shows that the visual information does not get lost in the token stream of the transformer, by putting probing classifiers on the contextualised vision tokens and showing that they still perform well. The authors conclude that it is in the process of mapping the textual prompt to the textual reply that the task performance gets lost.
Reasons To Accept
The paper is very well written and clear. It makes a straightforward point, that it supports well with experiments.
Reasons To Reject
The paper does not go beyond stating the problem, but that's not a big problem; the first step in alleviating a problem is recognising it.
Questions To Authors
Please look up the difference between \cite, \citep, and \citet.
We thank the reviewer for the observation that identifying and diagnosing a problem is alone an important step in the process of improving VLMs. We will also revise the citation format to make references more readable.
I've read the response to my and the other reviews, and will maintain my score.
The paper compares VLMs with their underlying visual encoders to understand their ability to integrate text and vision modalities. The paper demonstrates that despite strong performance on knowledge tasks, VLMs struggle with vision-centric tasks because they underutilize the visual information from their encoders, relying instead on language priors. Contrary to previous work suggesting weak vision encoders, this work argues that the issue lies in the poor integration of visual and linguistic information within VLMs. Evaluating vision encoders through VLMs can be misleading, as performance often degrades and rankings change compared to direct visual evaluation. The paper suggests that while stronger vision encoders are still needed, improving how VLMs integrate visual information is crucial for better vision-centric performance.
Reasons To Accept
- The paper identifies and highlights a significant problem in current VLMs - their inability to effectively utilize visual information, even when the visual encoders themselves demonstrate strong performance on visual tasks.
- The authors conduct a thorough analysis to diagnose the issue, examining potential factors such as vision representation degradation, prompt sensitivity, and the LLM's role. The paper employs well-designed experiments and benchmarks to compare VLMs and their visual encoders across various vision-centric tasks.
- The study reveals key findings, such as the performance drop from visual encoders to VLMs, the inconsistency in rank ordering of vision encoders, and the LLM's underutilization of visual representations. Furthermore, the paper provides valuable insights and directions for future research on improving visual understanding in VLMs.
Reasons To Reject
I do not have reasons to reject the paper.
Questions To Authors
- Although the paper provides evidence for the LLM's underutilization of visual information, is it possible that there might be other contributing factors that were not fully explored?
Although the paper provides evidence for the LLM's underutilization of visual information, is it possible that there might be other contributing factors that were not fully explored?
Great question! Our investigations fully cover the main architectural components of open-source VLMs to draw a general conclusion applicable across models, but there can be more nuanced reasons for how and why a particular model or training scheme leads to the failure of LLMs to use their input vision representations. For example, contributing factors may include issues in the alignment or instruction-tuning data that do not train the LLM to look carefully at its vision inputs. We will clarify in the discussion that our analysis does not preclude additional contributing factors, but highlights the LLM’s underutilization of visual inputs as the most consistent and actionable bottleneck.
Thank you for your response.
There have already been multiple existing papers showing the shortcomings of VLMs. The unique point in this paper is that they show these shortcomings side-by-side with the performance of only the visual components / visual encoders. The simple assumption would be that the poor VLM performance of visual tasks can be attributed to poor visual encoders, but the paper actually shows otherwise -- these visual encoders by themselves work well, but the performance drops when integrated into a VLM.
Relatedly, the paper also shows that the LLM’s ability to use its vision representations is a limiting factor in VLM performance, and fine-tuning the LLM increases the performance more than fine-tuning the ViT or the projector.
Tasks evaluated were drawn from vision benchmarks like CV-Bench, BLINK, and MOCHI, covering a variety of tasks like depth estimation, correspondence-based evaluations, 3D object awareness, and art style.
Reasons To Accept
- The paper sticks to this single topic and really dives deep, approaching it from different angles and testing out different hypotheses.
- The discussion is very thorough. Aside from the main conclusion, there are nuggets of insight sprinkled around the paper. For instance, there is a section comparing the VLM performance to the "blind" VLM performance (blank image input). There is also another section showing the performance differences when we fine-tune the projector layer vs the LLM vs the ViT. These supporting experiments really add to the strength of the paper.
Reasons To Reject
I do not have many issues with the execution of the paper. I feel this part was done well. Maybe there are a few knobs missing to make it a truly complete paper. For instance, the paper claims that fine-tuning the LLM is very important, but then when they train the VLM, they freeze the LLM and only train the VLA. I think it would be helpful to see experiments where this is not frozen.
I have two somewhat conflicting thoughts here about the premise/conclusion of the paper.
- I think the conclusion is somewhat obvious. I think it's fair to expect that the VLM will perform worse than the Vision Encoder, simply because it has to learn more things and has to juggle multiple representation spaces. Additionally, there have been several existing papers that are not exactly the same as this one but basically also show that VLMs are generally quite poor at tasks that require paying attention to the image.
- While I think the conclusion is obvious, I'm also somehow not convinced that the conclusion is actually the correct conclusion to make here. This reminds me a lot of Sutton's bitter lesson, where he was saying basically that in the short term, these smaller more specialized components (in this case, the vision encoders) might work better, but in the long term as we get more data and better models, the VLM might just "figure it out", and this phenomenon the authors observed will just be completely irrelevant.
- (Note: Definitely not expecting any large scale experiments here, but I think this idea might be worth mentioning somewhere. Maybe in the discussions or appendix)
Questions To Authors
Minor comments:
- I'm a bit unclear on the choice of task. In the paper you highlight that you specifically select vision-centric tasks, which I think is good. My question is: Doesn't this apply to most VLM evaluations? Like, even simple VQA ones I feel like most of these do not require any high-level reasoning.
- Essentially I just want more justification to be sure that these tasks were not chosen specifically because they supported your hypotheses the most.
- When citing sources, you can use \citep to put the names in parentheses.
We thank the reviewer for their thoughtful response to our work. Below, we reference and respond to each concern:
the paper claims that fine-tuning the LLM is very important, but then when they train the VLM, they freeze the LLM and only train the VLA. I think it would be helpful to see experiments where this is not frozen.
In Section 4.3, we report results on VLMs after fine-tuning 1) the vision encoder only, 2) the projector only, and 3) the language model only. These three settings cover the VLM components at the level of abstraction at which we study them. We are happy to discuss this point further, however, if this section of the paper did not address the reviewer's concerns.
I have two somewhat conflicting thoughts here about the premise/conclusion of the paper. [...]
Both of these are very valuable discussion points. We agree that the main observation - that VLMs underperform vision encoders on vision-centric tasks - is unsurprising. The results supporting that point, in fact, exist when multiple separate papers are combined (e.g., [1] + [2] + [3] + [4]), though to our knowledge, we are the first to directly point out this persistent phenomenon.
What is not apparent, however, is to what part of the VLM these performance drops should be attributed. By understanding that the bottleneck for visual understanding within VLMs lies in the LLM’s usage of vision representations, the community can concentrate efforts to better exploit the visual abilities of stronger vision models rather than focus on overoptimizing for a particular vision model (e.g., CLIP-based ones).
We appreciate the reflection on Sutton’s bitter lesson, and agree that the storyline of progress often favors end-to-end systems trained at scale. We observe the reality to be that many open-source VLMs are still constructed from existing single-modal components, so understanding how these components interact - and where integration breaks down - remains critical for diagnosing limitations in today’s systems. Even if future VLM developments phase out late-fusion architectures, we hope that our experimental procedures and findings can serve as a framework for auditing multimodal models, evaluating representation usage, and guiding training data/task choices. Without such mindfulness toward diagnosing where VLMs fail, future models risk scaling blindly over persistent inefficiencies.
I'm a bit unclear on the choice of task. [...]
This is a great point - most VLM evaluations depend on proper understanding of the visual input. Among the large pool of possible tasks, we have a few core desiderata that narrow us down to the measures in the paper:
- We don't evaluate for complex reasoning or alignment to language description, since such tasks require more ability than is encoded in the vision model.
- To ensure fair comparison, we focus on tasks with widely accepted evaluation methods for vision models (e.g., correspondence and depth), which eliminates some benchmarks such as object counting, the jigsaw task in [2], and spatial relationship tasks more complicated than the Depth Estimation task.
- We aim to cover a variety of visual understanding capabilities in both single-image and multi-image settings with little redundancy: these six tasks together cover object identity/pose (3D Object Awareness), scene understanding (Depth Estimation), correspondence within the same scene (Low-Level Matching), and correspondence between different scenes (Semantic Correspondence). Additionally, since these tasks all include a localization component, we also include Art Style, which emphasizes texture and whose visual evaluation explicitly removes spatial information (see the sketch below).
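As a rough illustration of a texture comparison that discards spatial layout, in the spirit of the Art Style visual evaluation above, here is a minimal sketch based on Gram-matrix statistics of patch features. Gram matrices are one common choice of texture statistics; the exact procedure in the paper may differ, and the function names are placeholders.

```python
# Minimal sketch of a texture comparison that discards spatial layout.
# Gram matrices are one common choice; illustrative only, not the exact evaluation.
import torch
import torch.nn.functional as F

def gram(feats: torch.Tensor) -> torch.Tensor:
    """feats: (N, C) patch features for one image -> (C, C) texture statistics."""
    feats = F.normalize(feats, dim=-1)
    return feats.T @ feats / feats.shape[0]

def texture_similarity(feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between flattened Gram matrices; no spatial information used."""
    ga, gb = gram(feats_a).flatten(), gram(feats_b).flatten()
    return F.cosine_similarity(ga, gb, dim=0)
```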
We will make our definition for “vision-centric tasks” more precise and differentiate them from other benchmarks used by the VLM community in our paper revision.
References
[1] Tong, Peter, et al. "Cambrian-1: A fully open, vision-centric exploration of multimodal llms." Advances in Neural Information Processing Systems 37 (2024): 87310-87356.
[2] Fu, Xingyu, et al. "Blink: Multimodal large language models can see but not perceive." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.
[3] El Banani, Mohamed, et al. "Probing the 3d awareness of visual foundation models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[4] Fu, Stephanie, et al. "Dreamsim: Learning new dimensions of human visual similarity using synthetic data." arXiv preprint arXiv:2306.09344 (2023).
Thank you for your detailed response!
I think this has addressed my issues. (1) (re: freezing) and (3) (re: task choice) were mostly detail-oriented. I'm happy with the authors' clarifications here. My second question was more philosophical. I'm also happy with the authors' response here. Specifically, I agree there's lots of value in understanding what exactly makes current VLMs work well or not work well. I still think the value of this paper is somewhat limited by the fact that the findings are tied to one particular implementation of one particular architecture, and that I'm not sure how long that particular implementation will stay relevant, but in general I do think at least for the next year or so, this paper will likely provide some good value to the community.
I'm keeping my score since I think I already gave a pretty high score to begin with. I do think it's a great paper and I'm willing to fight for it to be accepted. Best of luck!
Quick note: This decision to keep my score also comes after reading some of the other reviews and responses in this paper. Specifically, I think that Reviewer Nkzv's points are very valid. I think W3 was a valid criticism that I didn't initially think of but it seems the authors have addressed that. For W1, I see the argument but I feel like this is a minor point, and even if the argument holds, I don't think it detracts from the overall value of the paper.
Dear Reviewer 34qn,
The authors have responded to your review. Does their answer address your concerns?
We thank all reviewers for their insightful questions and feedback. We are glad they found:
- The paper is well-written and clear [34qn, zfkz]
- The experiments are thorough and focused [34qn, Nkzv, 6sUQ, zfkz]
- The analyses provide valuable insights into the identified problem [34qn, 6sUQ]
We also thank reviewers 34qn and zfkz for pointing out a more readable way to embed citations, and will make this change in the revision.
Individual questions and concerns are addressed in reviewer-specific responses below. We hope to engage in fruitful discussion in the coming week to strengthen this work!
This paper investigates to what extent VLMs make use of the information encoded in the representations obtained from their vision encoders (CLIP, DINOv2, SigLIP, IN-1k). The authors test the capabilities of VLMs and vision encoders in vision-centric tasks such as visual semantic correspondence, depth estimation, object affordances and art style. The results of comprehensive experiments reveal that VLMs underutilize the information encapsulated in vision encoders, suggesting that there is a bottleneck in LLMs’ use of the visual representations.
Pros
Clarity: The paper is very well written and clear. The figures and tables are easy to follow.
Quality: The experiments on probing the visual representations, prompt-tuning, and exploring the LLM’s role were nice and comprehensive. The reviewers appreciated the thoroughness of the experiments, analysis and discussion.
Originality: Although there exist various works focusing on the shortcomings of VLMs, this paper has a clear focus on vision-centric tasks, comparing VLMs to their vision encoders utilizing multiple methodologies that follow each other nicely.
Significance: The paper identifies a significant problem regarding the underutilization of the information encoded in vision encoders when integrated in VLMs, locating a potential bottleneck to focus on in the future. These findings would benefit the research on improving and understanding multimodal models.
Cons
The choice of tasks was questioned by Reviewer Nkzv asking whether it is fair to use extra data to learn a lightweight head to perform one of the tasks with the vision encoders, and by Reviewer 34qn regarding the rationale of choosing vision-centric tasks. I suggest acknowledging the potential unfairness and justifying the motivation behind the choice of tasks more clearly, as the authors do in their response.
Additionally, Reviewer Nkzv indicated their concern that merely identifying limitations in models does not offer sufficient contributions to COLM. Following COLM's review guidelines, I would like to point out that this paper has merit in multiple dimensions such as its empirical thoroughness, depth of the analysis and the implications of the findings. That said, the authors can extend their discussion to touch more on potential solutions.
Please make sure to incorporate the results with multiple trials and use the citation format indicated by the reviewers.
143 vision-langauge
151 Table 3 links to Table 1
360 there there