We thank the reviewer for the detailed questions and feedback and aim to address them below.

“Is the unembedding matrix trained by the authors?”

We use the unembedding matrix from the LVLM and do not train, as we intend to provide a training-free method for interpreting the internal representations of LVLMs. The logit lens method, applied to text-only models, showed that the model’s unembedding matrix effectively interprets language models. We show that this capability can surprisingly be extended to LVLMs.

Vision-language modality gap

We agree with the important point about the modality gap and token distributions. We include classification results in Table 8 in the Appendix, performing patch-level evaluation on 500 images in the COCO dataset. We find that the classification accuracy varies highly between different COCO classes, with strong top-3 classification accuracy with classes such as “toothbrush” (78.9%), “toilet” (92%), and “banana” (77.2%) and much lower accuracy with some classes such as “person” (0.5%) and cup (9.4%). We hypothesize that the variation stems from how consistently objects are represented linguistically - classes that map to specific, consistent tokens perform better than those that can be described with many specific terms (e.g., "person" → "doctor", "skier", "girl"). The LVLM sometimes captions the image with more specific terms (such as in the case of class “person”) so we can interpret the image representations with these more specific terms well, e.g. since there is only one way to describe “banana”. More importantly, our quantitative results demonstrate that the logit lens effectively captures the model’s learned semantic alignments in practice. Section 5.1 shows that our method can distinguish objects present vs. not present, achieving significant improvements in hallucination detection, despite this modality gap. Additionally, in our zero-shot segmentation results (Section 5.3), we demonstrate that these projections accurately localize classes spatially. These quantitative results across multiple tasks suggest that the logit lens captures the model’s internal understanding of visual semantics, specifically with practical applications in hallucination intervention, despite modality differences.

Does random sampling choose some objects “obviously” not present in the image?

We sample only from the set of 80 COCO classes, where many objects commonly co-occur, and believe that the strong performance across applications validates that distributions of objects present and not present are reliably separable. In Section 5.1 (“Hallucination detection”), we narrow down the scope of randomly selected objects to only hallucinated objects, where hallucinations are often objects that are not present in the image but less obviously so. Through this specific application, we intend to show here that the logit lens can classify these objects as present or not present with strong results (mAP improvement by 22.45% in LLaVA and 47.17% in InstructBLIP), even while these objects not being present may be less obvious.

Last Tokens for Multi-Token Object Representations

Our approach of using the last tokens is motivated by past work that find that information about multi-token entities is moved to the last token position. For example, [1] finds that the last subject token encodes crucial factual associations, and [2] demonstrates that information is carried over to the last token position through relation propagation and attribute extraction. Thus, extracting a residual hidden representation of the last token, which is conditioned on the previous tokens of the class, is the most likely to contain the concept of the whole class (ex. “traffic light”) and not merely a single part.

“Is the generated class name from LLaVA or the ground truth label used?

Similar to the reviewer’s findings, we found that LLaVA tends to generate some very general class when classifying an image. We use the generated class name from LLaVA in zero-shot segmentation for two reasons. (1) We generate the segmentation without knowing the ground truth label, and only the LLaVA object prediction, to have true end-to-end zero-shot segmentation. (2) If LLaVA predicts a “dog” rather than a “husky” in the image, we find that it maps the image representations closer to “dog” tokens than to “husky” tokens. This is likely because this is how it internally processes the objects in the image representations (as “dog,” not “husky,” resulting in higher internal confidence for “dog”), which we interpret with text.

[1] https://rome.baulab.info/

[2] https://arxiv.org/abs/2304.14767