PaperHub
ICLR 2024 · Oral · Score: 8.0/10
4 reviewers · Ratings: 8, 8, 8, 8 (min 8, max 8, std 0.0) · Confidence: 3.5

Interpreting CLIP's Image Representation via Text-Based Decomposition

Submitted: 2023-09-18 · Updated: 2024-03-26
TL;DR

We investigate the CLIP image encoder by analyzing how individual model components affect the final representation

Keywords

CLIP, interpretability, explainability

Reviews and Discussion

Review (Rating: 8)

The research dissects the CLIP image encoder, identifying specialized roles of its internal components, and reveals an inherent spatial awareness in its processing by using CLIP's text representation. Insights gained from this analysis facilitate the removal of extraneous information and the enhancement of the model, notably demonstrated through the creation of an effective zero-shot image segmenter. This underscores the potential for in-depth, scalable understanding and improvement of transformer models.

Strengths

  1. Well written text

  2. Excellent figure to explain the pipeline. (Fig 1)

  3. Tested on various backbones and datasets

  4. Carefully designed ablation study to show why we should focus on MSA blocks instead of MLP blocks. This also provides nice reasoning for the decomposition algorithm design.

  5. An excellent way to provide explanations that help researchers understand trained models. Instead of providing an enormous number of uninterpretable feature values to explain the model, the proposed method is able to align the reasoning with text. This could serve as a nice tool for collaboration with human researchers.

  6. Smart way to utilize ChatGPT 3.5 to provide text prompts

Weaknesses

  1. It seems like human users are still required to provide some sort of heuristic to decide the role of a head. I would like to know how hard this is from the user's point of view.
  2. It seems like most of the experiments have been done on general image datasets. I am curious about the results on some fine-grained tasks or datasets, for example human face recognition.

Questions

  1. Human-AI teaming
  • I am thinking about the task from a human-AI teaming point of view. How hard is it for the human to identify the role of each head? Is it possible to introduce humans during inference time to prune/select which heads to use for the final prediction?
  2. “Not everything has a direct role”
  • Does this mean there exist complicated features that we need to leave as a black box, or can we ignore them as redundant features?
Comment

We thank the reviewer for the valuable comments.

I am thinking about the task from a human-AI teaming point of view. How hard is it for the human to identify the role of each head? Is it possible to introduce humans during inference time to prune/select which heads to use for the final prediction?

In our experiments, we manually annotated the heads by examining the output text descriptions of our algorithm. For many of the heads, this task is simple (it requires finding a commonality between 60 text descriptions). A possible approach for automating it is to query an LLM to summarize the role of the head based on its text descriptions (e.g. “What is common between all these image descriptions?”). Moreover, for pruning the heads for a specific task (e.g. bird classification), we can provide the discovered head roles to an LLM and query it about the spurious cues that each role could provide for the given task (e.g. “Can geo-location be a spurious cue for bird image classification?”). This model-driven strategy would probably be more efficient than introducing humans during inference time, which might be too costly.
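
As a rough illustration of this model-driven strategy, a minimal Python sketch is shown below; it is not part of the paper's code, the prompts and model name are placeholders, and the helper functions are assumptions introduced for illustration (the OpenAI client calls themselves are standard).

```python
# Minimal sketch of LLM-assisted head annotation (illustrative; not the paper's code).
# Assumes the OpenAI Python client (>=1.0); prompts and model name are placeholders.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-3.5-turbo"


def summarize_head_role(descriptions: list[str]) -> str:
    """Ask an LLM what the text descriptions selected for one head have in common."""
    prompt = (
        "What is common between all of these image descriptions?\n"
        + "\n".join(f"- {d}" for d in descriptions)
    )
    response = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content


def is_spurious_cue(head_role: str, task: str) -> str:
    """Ask whether a head's discovered role could act as a spurious cue for a task."""
    prompt = f"Can '{head_role}' be a spurious cue for {task}?"
    response = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content


# Hypothetical usage:
# role = summarize_head_role(head_descriptions)            # e.g. "geo-location"
# answer = is_spurious_cue(role, "bird image classification")
```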

“Not everything has a direct role” - Does this mean there exist complicated features that we need to leave as a black box, or can we ignore them as redundant features?

By “not everything has a direct role”, we are referring to the fact that our analysis studies the direct effects of individual components on the output representation rather than indirect effects.

In more detail, the direct effects are the contributions of each layer to the residual stream (which is later projected into the output space). The indirect effects are the contributions of a layer that are processed by subsequent layers. Most of the early layers, for example, have meaningful indirect effects - they contribute to the inputs of later layers. To analyze these features, one should examine these contributions and their downstream effects. As a preliminary example of this, we unrolled the direct effect of a “counting head”, and found a second-order effect from an MLP layer that fires mostly when digits appear in the image, which could produce spurious cues if the presented digit usually corresponds to the number of objects in the images (e.g. a pack of 6 tomatoes with the text “6 tomatoes” appearing on the pack). Removing the second-order effect of this MLP on this head may result in more accurate counting abilities.
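
For reference, the decomposition behind the notion of “direct effect” can be written schematically as follows (simplified; layer normalization and the paper's exact indexing are omitted, so this is an illustration rather than the exact formula):

$$ M(I) \;=\; P\Big(Z^{0} \;+\; \sum_{l=1}^{L} \mathrm{MSA}^{l} \;+\; \sum_{l=1}^{L} \mathrm{MLP}^{l}\Big), $$

where $P$ projects the residual stream into the joint image-text space, $Z^{0}$ is the initial token embedding, and each $\mathrm{MSA}^{l}$ / $\mathrm{MLP}^{l}$ term is that layer's contribution to the residual stream. The direct effect of a component is its own summand; its indirect effect is how it changes the inputs, and hence the summands, of later layers.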

Comment

Thanks for the clarification. I noticed the typo in my question. Sorry about that.

I found your proposed method could be useful to human observers. Nice work.

Review (Rating: 8)

This paper delves deep into the famous CLIP vision-language model to understand its image-encoder components with the help of text representations. This includes understanding the role of attention heads, self-attention layers, and MLP layers, as well as the effect of each layer on the final representation in terms of downstream task performance. As the image backbone is completely aligned with the text embedding representations, this work takes advantage of this property to interpret the image encoder components. Interestingly, each head's role is associated with capturing specific attributes and properties, which leads to several downstream applications, including image retrieval with specific properties defined by a head and removing spurious features by neglecting the heads that focus on such features. The analysis is performed on CLIP models at various scales, which shows the generality of this study.

Strengths

  1. This paper presents an important and timely study to understand the inner workings of the image backbones of large-scale vision-language models like CLIP, which have become a very popular paradigm. This will help improve the next generation of such models in terms of architecture and training.
  2. The use of text to interpret the image backbone components is motivated by the fact that the CLIP backbone is aligned with text representations.
  3. Extensive analysis shows the main performance drivers of the image backbone of CLIP, and could help in rejecting the redundant modules present in such vision-language architectures.
  4. The proposed TextBase algorithm to associate specific roles per head using text is well motivated, and its effectiveness is shown via downstream retrieval experiments.
  5. This analysis unlocks several improvements for downstream tasks, including reducing known spurious cues and zero-shot segmentation.
  6. Finally, the paper is well written and easy to understand.

Weaknesses

I could not think of any significant weaknesses in this work. I have some questions as follows:

  1. How do the zero-shot results compare against methods like MaskCLIP [1]?
  2. It has been shown that the zero-shot accuracy of the embeddings from only the late attention layers is very competitive with the original performance. Will this also hold true when the representations are used for downstream tasks that require adaptation? For example, it would be good to see the linear-probe performance of the filtered embeddings on ImageNet or other datasets like CIFAR100 and Caltech101.

[1] Extract Free Dense Labels from CLIP, ECCV 2022, Oral

Questions

Please refer to the weaknesses section! Thank you.

Comment

We thank the reviewer for the valuable comments.

“the zero-shot accuracy of the embeddings from only late attention layers is very competitive with the original performance. Will this also hold true when the representations are used for downstream tasks that require adaptation?”

To verify that the effect of the early attention layers and MLPs is negligible even after adaptation, we followed the suggested experiment and applied linear probing on ViT-B-14 for CIFAR100 classification. The test-set accuracy of linear probing is 84.3%. When we mean-ablate all the MLPs and all attention layers except the last 4, we get an accuracy of 84.2%. This suggests that even after adaptation, the late attention layers still account for most of the effect on the output.
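
For concreteness, here is a minimal sketch of this kind of mean-ablation followed by a linear probe (illustrative only; it assumes the per-component direct contributions have already been extracted, and all variable names are placeholders):

```python
# Sketch: mean-ablate a subset of components, then fit a linear probe (illustrative).
# `contribs` is assumed to hold precomputed per-component direct contributions with
# shape (num_images, num_components, dim); `keep` indexes the components left intact
# (e.g. the last 4 attention layers).
import numpy as np
from sklearn.linear_model import LogisticRegression


def mean_ablate(contribs: np.ndarray, keep: np.ndarray) -> np.ndarray:
    """Replace every component not in `keep` by its dataset mean, then sum components."""
    ablated = contribs.copy()
    mask = np.ones(contribs.shape[1], dtype=bool)
    mask[keep] = False
    ablated[:, mask, :] = contribs[:, mask, :].mean(axis=0, keepdims=True)
    return ablated.sum(axis=1)  # (num_images, dim)


# Hypothetical usage with precomputed train/test contributions and CIFAR-100 labels:
# X_train = mean_ablate(train_contribs, keep=last_attn_layer_indices)
# X_test = mean_ablate(test_contribs, keep=last_attn_layer_indices)
# probe = LogisticRegression(max_iter=2000).fit(X_train, y_train)
# print("linear-probe accuracy:", probe.score(X_test, y_test))
```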

“How do the zero-shot results compare against methods like MaskCLIP”

MaskCLIP uses a different approach that works only on CLIP models with attention pooling before the projection layer. Unlike these models, other CLIP variants (and specifically all the OpenCLIP models) use the class token directly as the input to the final projection, without attention pooling. We note that it is possible to compare our method to MaskCLIP by modifying the decomposition to include attention pooling, but we think this is beyond the scope of this paper.

Comment

Thank you for providing a response to my queries!

The linear-probing results with only the attention-layer embeddings seem encouraging!

Regarding MaskCLIP, I was actually referring to comparing the current zero-shot segmentation results (presented in your paper) with the MaskCLIP results on the same datasets.

Comment

Thank you for the clarification.

We compared our decomposition to MaskCLIP for zero-shot segmentation. We used OpenAI's ViT-B-16 and calculated the metrics on the ImageNet-Segmentation dataset:

| Model    | mIoU | Pixel-wise Acc. | mAP  |
|----------|------|-----------------|------|
| MaskCLIP | 54.8 | 73.0            | 83.6 |
| Ours     | 57.7 | 77.2            | 82.6 |

As shown, our decomposition is better in pixel-wise accuracy and mIoU, and slightly worse in mAP.

Comment

Thank you for providing the results; they look encouraging.

Overall I am satisfied with the paper and will keep my rating as it is.

Review (Rating: 8)

The paper “Interpreting CLIP's Image Representation via Text-Based Decomposition” studies the internal representation of CLIP by decomposing the final [CLS] image representation as a sum across different layers, heads, and tokens. Using direct-effect ablations of certain layers, the authors first establish that the late attention layers are important for the image representation. The authors then study the attention-head and token decompositions in the later layers and arrive at two applications: (i) annotating attention heads with image-specific properties, which is leveraged primarily to reduce spurious cues; (ii) decomposition into image tokens, which enables zero-shot segmentation.

Strengths

  • This paper is one of the best-executed papers on understanding the internal representations of CLIP. The writing is excellent and clear!
  • The text-decomposition-based TextBase algorithm is technically simple, sound, and a good tool for annotating the internal components of CLIP with text. This tool can in fact be extended to study other, non-CLIP vision-language models too.
  • Strong (potential) applications in removing spurious correlations, image retrieval, and zero-shot segmentation.

Weaknesses

Cons / Questions:

  • While the authors analyze direct effects in the paper, can the authors comment on how indirect effects can be leveraged to understand the internal representations? I think this is an important distinction for understanding whether studying the internal representations in more detail can unlock further downstream capabilities. If so, what downstream capabilities would be feasible?
  • I am not sure if the current way of creating the initial set of descriptions is diverse enough to capture “general” image properties or attributes. I believe the corpus of 3.4k sentences is too small for this analysis. While this set is a good starting point, can the authors comment on how it can be extended to make it more diverse?
  • Did the authors analyse the OpenAI CLIP variants using this framework? The OpenCLIP and OpenAI variants are trained on different pre-training corpora, so a good ablation is to understand whether these properties somehow depend on the pre-training data distribution.
  • Can the authors comment on how the image set in Sec. 4.1 is chosen? This is not very clear from the text. Is this a generic image set that you use to obtain m text descriptions per head from a bigger set of text descriptions?

Questions

Please see Cons / Questions above.

Overall, the paper is an interesting take on understanding the internal representations of CLIP, with the additional benefit of showing applications to image retrieval and reducing spurious correlations. I am leaning towards acceptance and happy to increase my score if the authors adequately respond to the questions.

Comment

We thank the reviewer for the valuable comments.

"can the authors comment on how indirect effects can be leveraged to understand the internal representations?"

Examining the indirect effect can result in a finer understanding of the internal computation graph in these models. For example, we can analyze the second-order effects going through a specific layer/head (e.g. what does MLP 2 contribute to attention head 5 in layer 30). This way, one can remove more fine-grained spurious correlations.

As a preliminary example of this, we unrolled the direct effect of a “counting head”, and found a second-order effect from an MLP layer that fires mostly when digits appear in the image, which could produce spurious cues if the presented digit usually corresponds to the number of objects in the images (e.g. a pack of 6 tomatoes with the text “6 tomatoes” appearing on the pack). Removing the second-order effect of this MLP on this head may result in more accurate counting abilities.

“Did the authors analyse the OpenAI CLIP variants using this framework? The OpenCLIP and OpenAI variants are trained on different pre-training corpora, so a good ablation is to understand whether these properties somehow depend on the pre-training data distribution.”

We provide here an additional comparison between OpenCLIP ViT-L-14 and OpenAI's CLIP ViT-L-14. First, our observation that the last few attention layers have most of the direct effect on the final representation still holds: keeping the last 5 attention layers and ignoring the direct effects of the other layers results in only a small decrease in ImageNet accuracy, from 75.5% to 73.2%. Second, we evaluate OpenAI's CLIP ViT-L-14 for zero-shot image segmentation:

| Model             | mIoU  | Pixel-wise Acc. | mAP   |
|-------------------|-------|-----------------|-------|
| OpenAI ViT-L-14   | 55.24 | 76.19           | 81.37 |
| OpenCLIP ViT-L-14 | 54.50 | 75.21           | 81.61 |

As shown here, the localization properties of OpenAI CLIP are comparable to those of OpenCLIP.

“While this (initial descriptions pool) set is a good starting point, can the authors comment on how this set can be extended to make it more diverse?”

We repeat our experiments with a larger and more diverse image description corpus and present the results in section A.4 in the revised version. We query ChatGPT for class-specific descriptions and create an additional pool of 28,767 unique sentences (8 times larger than our previous corpus). This results in higher accuracy with fewer basis vectors per head (smaller m) compared to the other pools that contain 3,498 vectors, as shown in Figure 7.

Aside from this, and more importantly for future work, our method can be made adaptive by applying it iteratively and using a data-driven refinement step to update the pool. Specifically, at each iteration, we can apply our algorithm to retrieve a basis set and the corresponding descriptions (which indicate roughly what concepts the model is using), query an LLM to generate “more like these” descriptions (to retrieve more fine-grained descriptions), and add the newly generated examples to the pool before re-computing the basis set. This approach may result in more fine-grained descriptions that better capture the output space of each head. We plan to explore this idea further in future work.
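
A rough sketch of this iterative refinement loop is given below (hypothetical pseudocode; `textspan_select`, `encode_texts`, and `ask_llm_for_similar` are assumed helpers standing in for the basis-selection step, the CLIP text encoder, and an LLM query, respectively):

```python
# Sketch of the adaptive, LLM-driven refinement of the description pool (hypothetical).
# textspan_select(head_outputs, text_embeds, pool, m) -> (basis, descriptions) stands in
# for the basis-selection step; encode_texts and ask_llm_for_similar are assumed helpers.

def refine_pool(head_outputs, pool, m=60, num_iters=3):
    """Iteratively grow the description pool around the concepts a head appears to use."""
    for _ in range(num_iters):
        text_embeds = encode_texts(pool)
        _, descriptions = textspan_select(head_outputs, text_embeds, pool, m)
        # Ask an LLM for "more like these" to obtain finer-grained candidate descriptions.
        new_descriptions = ask_llm_for_similar(descriptions)
        pool = list(dict.fromkeys(pool + new_descriptions))  # de-duplicate, keep order
    return textspan_select(head_outputs, encode_texts(pool), pool, m)
```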

Clarification about the image set in Sec. 4.1

We apply our algorithm to the ImageNet validation set (50,000 images), as it is the largest and most diverse dataset we could use given our compute budget.

Comment

I thank the authors for their responses. They have resolved my comments, and I therefore increase my rating as promised!

Review (Rating: 8)

The paper delves into the analysis of the CLIP image encoder, breaking down the image representation by considering individual image patches, model layers, and attention heads. Using CLIP's text representation, the authors interpret these components, discovering specific roles for many attention heads, such as identifying location or shape. They also identify an emergent spatial localization within CLIP by interpreting the image patches. Leveraging this understanding, they enhance CLIP by eliminating unnecessary features and developing a potent zero-shot image segmenter. The research showcases that a comprehensive understanding of transformer models can lead to their improvement and rectification. Furthermore, the authors demonstrate two applications: reducing spurious correlations in datasets and using head representations for image retrieval based on properties like color and texture.

Strengths

Strengths:

  1. The paper is well-organized.
  2. The proposed analysis (importance of different attention layers) and the two use cases (removal of spurious correlations and head-based image retrieval) are interesting.

Weaknesses

Weakness: There are no ablation studies on the impact of the pool size (M) and the basis size (m) on the performance of the decomposition.

Questions

Questions:

  1. “all layers but the last 4 attention layers has only a small effect on CLIP's zero-shot classification accuracy”: maybe this is just because the early layers' features are not semantically distinctive? But they should still be important for extracting low-level features.
  2. Is it possible to achieve the “dual” analysis on the text encoder of CLIP: 1) find the layer-wise importance of the text encoder; 2) find and remove redundant heads to reduce spurious correlations; 3) perform head-based text retrieval based on query images?
Comment

We thank the reviewer for the valuable comments.

“No ablation studies on the impact of the … basis size (m)”

We believe that Figure 3 already implicitly contains the desired ablation. Specifically, in Figure 3 we simultaneously replace each head's direct contribution with its projection onto the m text representations found by our algorithm. We consider a variety of output sizes m ∈ {10, 20, 30, 40, 50, 60}. We evaluate the reconstruction by comparing the downstream ImageNet classification accuracy. As shown in the figure and discussed in Section 4.2, the accuracy improves with larger m values. We found that 60 descriptions (m = 60) were sufficient to reach an accuracy that is close to the original model's accuracy, but with fewer descriptions the accuracy drops.
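
As a minimal sketch of the reconstruction being ablated here (illustrative NumPy only; the variable names are placeholders, not the paper's code): each head's direct contributions are replaced by their least-squares projection onto the span of the m selected text embeddings, and the downstream accuracy is then recomputed from the projected sum.

```python
# Sketch: project a head's direct contributions onto the span of m text embeddings.
# `head_contribs` (num_images, dim) and `text_basis` (m, dim) are assumed precomputed
# and to live in CLIP's joint image-text space (names are illustrative).
import numpy as np


def project_onto_text_span(head_contribs: np.ndarray, text_basis: np.ndarray) -> np.ndarray:
    """Least-squares projection of each contribution onto span{text_basis}."""
    # Solve text_basis.T @ coeffs ~= head_contribs.T, then map the coefficients back.
    coeffs, *_ = np.linalg.lstsq(text_basis.T, head_contribs.T, rcond=None)
    return (text_basis.T @ coeffs).T  # (num_images, dim)
```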

“No ablation studies on the impact of the pool size (M)”

To vary the pool size M, we repeat our experiments with a larger and more diverse image description corpus and present the results in section A.4 in the revised version. We query ChatGPT for class-specific descriptions and create an additional pool of M = 28,767 unique sentences (8 times larger than our previous corpus). This results in higher accuracy with fewer basis vectors per head (smaller m) compared to the other pools that contain M = 3,498 vectors, as shown in Figure 7 in the revised version.

“all layers but the last 4 attention layers has only a small effect on CLIP's zero-shot classification accuracy”: maybe this is just because the early layers' features are not semantically distinctive? But they should still be important for extracting low-level features.

In our analysis, we consider only the direct effects and ignore the indirect effects (which would include extracting low-level features that get used in later stages of processing). Our claim is only that the direct effect of the early layers on the output is negligible (Figure 2). In contrast, we expect the indirect effects to be large, since the early layers contribute to the inputs of later layers.

“Is it possible to achieve the ‘dual’ analysis on the text encoder of CLIP?”

While we think that decomposing the text encoder is out of scope for this paper, we believe that such a dual analysis is possible, since the text encoder has a similar transformer architecture. Moreover, decomposing both the text-encoder representation and the image-encoder representation would allow us to examine what each (patch, word) pair contributes to the overall similarity score. We plan to explore this in future work.
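
One reason such a joint decomposition is natural: the CLIP score is an inner product between the two (normalized) representations, so if both sides decompose into sums of component contributions (this is our reading of the proposal, not a result shown in the paper), the score splits bilinearly into per-(image component, text component) terms,

$$ \big\langle M_{\text{image}}(I),\, M_{\text{text}}(T) \big\rangle \;=\; \Big\langle \sum_{c} c,\; \sum_{d} d \Big\rangle \;=\; \sum_{c}\sum_{d} \langle c, d \rangle, $$

where $c$ ranges over image-side contributions (e.g. per patch and head) and $d$ over text-side contributions (e.g. per word and head).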

Comment

Dear reviewer,

As the discussion phase comes to an end, we hope that our response has successfully addressed your questions. We look forward to your feedback on whether our reply resolves your concerns, or whether further clarification is needed.

Thank you,

Authors

AC Meta-Review

This paper provides an interesting and scalable interpretability study of CLIP. Specifically, it leverages the alignment of the image backbone with text representations, allowing for insightful interpretations of each head's role in capturing specific attributes. This interpretability also enables several downstream applications, including targeted image retrieval and an effective zero-shot image segmenter. Overall, all reviewers enjoyed reading this paper and highly appreciated its in-depth interpretability analysis of CLIP, with interesting applications. Only a few minor concerns were raised, mainly about further ablations/clarifications to improve the quality of this work. The rebuttal addresses all of them; all reviewers unanimously (and strongly) recommend accepting it.

Why not a higher score

N/A

Why not a lower score

I (personally) believe this paper brings a very novel approach to (scalably) interpreting the CLIP model. By leveraging the alignment between image backbones and text representations, the paper not only deepens our understanding of CLIP but also demonstrates practical and interesting applications. Furthermore, the strong endorsement from reviewers, emphasizing the paper's in-depth analysis and practical implications, underscores its high quality and relevance.

Final Decision

Accept (oral)