What do we learn from inverting CLIP models?
Abstract
Reviews and Discussion
The paper examines the biases present in the CLIP model and proposes using model inversion to generate content that is assigned to a chosen class label with high confidence. The authors claim to demonstrate that CLIP can blend concepts, produce NSFW images from seemingly harmless prompts, and exhibit inherent gender bias. They also show that more training data can lead to better model inversions and that textual components are present within the inverted images, which they link to TV regularization not being used in the loss function.
Strengths
The paper presents a solid analysis of CLIP models through a novel approach using model inversion. The paper is well written and the motivation is clear. The problem studied in the paper is timely and will be of good use to the community. To my knowledge, the idea is indeed novel.
Specifically, the authors clearly demonstrate that CLIP models possess the capability to blend concepts, akin to generative models like DALLE and IMAGEN. Their study reveals associations between seemingly harmless prompts and NSFW content, particularly concerning female celebrities, highlighting the necessity for better data curation. Additionally, the paper uncovers gender biases related to professions and roles. Furthermore, the investigation into textual components within inverted images, especially the effects of TV regularization, enriches the understanding of model behavior and of unwanted artefacts that might arise from the use of CLIP models. Overall, the paper offers good contributions to the understanding of CLIP models, emphasizing both their capabilities and areas for improvement.
Weaknesses
Overall, the paper appears to be poorly formatted, giving the impression of being rushed without proper attention to formatting guidelines. For example, Table 4 is misaligned and requires reformatting, and there are 10 unnecessary empty lines between Figure 2 and the text. The same issue occurs with Table 6 in the appendix. This is disappointing, as the text within the paper is well-written and addresses an important topic.
Additionally, while the experiments in Table 4 seem convincing, I believe additional experiments would be beneficial, such as examining potential racial biases (for example, you could repeat the Table 4 analysis for black/white people). The paper generally lacks comprehensive experimental analyses - for example, for "a beautiful landscape," "the map of the african continent," or "a scientist conducting groundbreaking research," how many times did you see NSFW images? Do all of them fail the safety checker because the residuals (as marked with red squares in your images) are always the same? Is there anything else that appears in these images (for the same prompts) that is flagged as NSFW? The findings are very interesting and leave the reader wanting more, but even after reading through the appendix, the paper falls short. Also, the authors should provide details regarding the model with which they perform classification (this model is only vaguely mentioned in line 392).
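To be concrete about the kind of quantification I have in mind, one could simply count, per prompt, how many inverted images the Stable Diffusion safety checker flags. The sketch below is only illustrative and uses the public diffusers/transformers checkpoints; `inverted_images` is a hypothetical list of PIL images produced by your inversion procedure for one prompt, not a reference to your actual code.

```python
import numpy as np
from transformers import CLIPImageProcessor
from diffusers.pipelines.stable_diffusion.safety_checker import StableDiffusionSafetyChecker

# Public checkpoints; the feature extractor must match the checker's CLIP vision backbone.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
checker = StableDiffusionSafetyChecker.from_pretrained("CompVis/stable-diffusion-safety-checker")

def nsfw_flag_rate(inverted_images):
    """Fraction of PIL images that the safety checker marks as NSFW."""
    clip_input = processor(images=inverted_images, return_tensors="pt").pixel_values
    np_images = np.stack([np.asarray(im, dtype=np.float32) / 255.0 for im in inverted_images])
    _, has_nsfw = checker(images=np_images, clip_input=clip_input)
    return sum(has_nsfw) / len(has_nsfw)
```

Reporting this rate for, say, 100 inversions per prompt would make the claim much stronger.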
Finally, regarding the fourth point you raise (line 106), I am not sure what contribution it brings to the paper. Although it is visible that scale affects the quality of the inversions, why is this not obvious? Furthermore, since the paper does not improve inversion techniques, what does this insight contribute to the paper, in particular to examining whether CLIP contains biases or not? In my opinion, it would be better to remove this point from the paper, as it does not really relate to the problem that you are trying to investigate, but I would love to hear why you think it is important if you decide to keep it.
Since there are no space constraints and an additional page is available, I would be inclined to raise my score if the authors reformat the paper, improve its overall structure, and provide further experiments similar to those in Table 4.
Other minor details:
- Line 107: The abbreviation "TV" is used without introduction.
- Line 112: Please clarify what "activating a target class maximally" means.
- Line 146: Reference error.
- Line 150: Change "delves" to "delve."
- Line 183: Typo ("which").
- Tables 1 and 2: More words should have an asterisk, e.g., "sh*thead," and the filtering should be stronger.
- Line 321: Reference the word lists by providing links (e.g., English words, Naughty Obscene and Otherwise bad words).
- Lines 395 and 396 overlap.
- Line 448: Typo with a double period.
Questions
Does using random affine transformations instead of ColorShift affect the conclusions? I believe an analysis like this would be beneficial as well. You mention in line 198 that it has a significant impact on the quality of images, but it is not clear whether the underlying conclusions change and whether the generated images semantically change or remain the same.
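To make this concrete, the ablation I have in mind would only swap the augmentation inside the inversion loop, roughly as in the sketch below. This assumes the standard OpenAI `clip` package and is a schematic stand-in, not your actual implementation (which additionally uses ColorShift, ensembling, and regularizers).

```python
import torch
import torch.nn.functional as F
import torchvision.transforms as T
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)
model.float().eval()

text = clip.tokenize(["a beautiful landscape"]).to(device)
with torch.no_grad():
    text_feat = F.normalize(model.encode_text(text), dim=-1)

# Random affine views instead of ColorShift, plus CLIP's input normalization.
augment = T.Compose([
    T.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    T.Normalize((0.48145466, 0.4578275, 0.40821073),
                (0.26862954, 0.26130258, 0.27577711)),
])

x = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
opt = torch.optim.Adam([x], lr=0.05)
for _ in range(1000):
    views = torch.cat([augment(x) for _ in range(8)])       # several augmented copies per step
    img_feat = F.normalize(model.encode_image(views), dim=-1)
    loss = -(img_feat @ text_feat.T).mean()                  # maximize cosine similarity to the prompt
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        x.clamp_(0, 1)                                       # keep pixels in a valid range
```

Comparing the resulting images (and the conclusions drawn from them) against the ColorShift-based inversions would answer the question.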
What classification model is used for the experiments in Table 4? Could you provide details about the architecture, training data, and some performance metrics of the model used, please?
The paper employs model inversion on CLIP models to visualize images that induce high similarity with a variety of text prompts. By studying the generated images, the paper concludes that:
- CLIP can blend multiple semantic concepts,
- the prompts of (primarily female) celebrities are close to sexually explicit words and can produce sexually explicit images,
- the inverted images reflect gender biases learned from data,
- CLIP models trained on a larger dataset produce higher quality inversions, and
- CLIP inversions include written text in the images.
Strengths
- Applying model inversion to CLIP models is a suitable way to obtain insights into its proprietary and unavailable training data.
- The paper reveals that the CLIP training data was not cleaned from potentially harmful content.
- It is interesting to see that CLIP model inversion can produce (somewhat) coherent objects and text.
Weaknesses
- While CLIP model inversion is interesting for the sake of scientific curiosity, the paper does not discuss any practical implications for downstream tasks, such as retrieval, classification, segmentation, or text-to-image modeling. Sec. 7 (l. 470) even states that "these behaviors do not have to be represented in other operational modes".
- The paper makes rather strong claims which are mostly supported by few qualitative examples. Furthermore, the experimental analysis is not thorough enough. The following elaborates on each contribution.
- Very few images show blending of concepts, and it is not clear if such combined concepts existed in the original training data (e.g., fictional art). If the study were performed on CLIP models trained on open data, it would be relatively easy to verify through retrieval whether a given concept combination is novel (or at least to provide stronger support).
- For the prompt analysis, a word pool of the 10,000 most common English words and 1,913 additional unsafe words is used. Such overrepresentation of unsafe words can skew and bias the results and their interpretation. For example, it is likely that unsafe words are much closer to each other in the embedding space than most of the top 10,000 words, which have greater variety. Hence, if one unsafe word is retrieved, multiple similar words will follow, potentially exaggerating the qualitative results of Tab. 2. Extending the most common word list naturally would be a better way to incorporate more words, especially because the top 10k words already include unsafe words.
- Top-k closest word lists (Tab. 1, 2) can be misleading because they do not reveal the similarity of the words. If the retrieval word list does not contain relevant words, similarity can be low and decrease quickly. The closest words in Tab. 2 contain stop-words, characters, or other non-words such as d, j, to, ok, lol, mm, yea, yo, ha, um, ia, da, si, which suggests that the surrounding words have rather low similarity with the celebrity names (a sketch of how such similarity values could be reported is given after this list).
- Tab. 4 suggests strong gender biases. However, model inversion is not a faithful generative process but an optimization process. If CLIP were gender-biased 55% to 45%, i.e., had slightly higher similarity for a prompt when depicting one gender over the other, an optimization process would always seek to generate that one gender. As a result, model inversion likely exacerbates model biases. This could explain the very one-sided values in Tab. 4, which do not allow one to draw conclusions about the severity of the CLIP model's biases.
- Contributions 4 + 5: The findings are not particularly surprising and are not quantitatively supported, e.g., through prompt similarity or OCR. As stated in line 434, it is well known that CLIP recognizes text in images.
- While the paper presents some insights from model inversion, it does not try to explain them. E.g., could a confounding latent concept such as "(female) celebrity" cause these findings? An ablation could test for regular male and female names to support this hypothesis. Similarly, more general concepts/words such as "woman" or "man" could be tested.
- The paper does not try to make a connection to real images. Inverted images are out-of-distribution for CLIP. Are the image-text similarities also out-of-distribution, e.g., much higher than what real images typically achieve? Do the conclusions change if the inversion process is restricted/regularized to similarity magnitudes of real images?
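Regarding the second and third bullet points above, simply reporting the cosine similarities next to the retrieved words would address much of this. A minimal sketch (assuming the OpenAI `clip` package; `word_pool` is a hypothetical list combining the 10k common words with the unsafe-word list, and this is not the paper's code) could look like:

```python
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)
model.float().eval()

@torch.no_grad()
def top_k_with_scores(prompt, word_pool, k=20, batch_size=1024):
    """Return the k nearest words to `prompt` together with their cosine similarities."""
    q = F.normalize(model.encode_text(clip.tokenize([prompt]).to(device)), dim=-1)
    feats = []
    for i in range(0, len(word_pool), batch_size):           # encode the word pool in batches
        toks = clip.tokenize(word_pool[i:i + batch_size]).to(device)
        feats.append(F.normalize(model.encode_text(toks), dim=-1))
    sims = (torch.cat(feats) @ q.T).squeeze(1)                # cosine similarity per word
    vals, idx = sims.topk(k)
    return [(word_pool[i], round(v.item(), 3)) for i, v in zip(idx.tolist(), vals)]
```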
Minor:
- L. 107: The abbreviation 'TV' is used without prior introduction.
- L. 146: Missing citation
- L. 181: Equation should be numbered; x and p are not introduced
- Sec. 5 seems to better fit as Sec. 4.5
- Tab. 4 extends beyond page boundaries and there is not enough whitespace between caption and main text
- While the text mentions that ViT-B16 (OpenAI?) is used unless otherwise specified, it would be clearer if the caption of each image specified the model
Questions
Please address the points raised in the Weaknesses section. Here are additional relevant questions:
- How do the findings affect typical downstream tasks of CLIP (embeddings)?
- The generated images of Fig. 4 seem to be most surprising because the prompts are harmless. Is there a reasonable explanation for this behavior? The paper does not include a quantitative evaluation such as the ratio of images flagged by the Stable Diffusion safety checker, e.g., when generating 100 images for each of the three prompts, how many are flagged?
- Please explain the hypothesis of line 320 that the prompts of Fig. 4 are near to NSFW language when Fig. 1 (top) only shows safe words as close in the embedding space.
- What is the similarity value of each word in Tab. 1 (top) and in Tab. 2? Do the unsafe words have a high similarity with celebrity names or other harmless words?
- Regarding Sec. 4.3, what if the prompt is "a successful male or female student in university"? Is the bias still one-sided?
- In this study, cosine similarity is optimized in isolation. In practice, CLIP is often used contrastively (the way it was trained), e.g., in zero-shot classification. What happens if the softmax of the cosine similarity is optimized, e.g., for the three prompts in Fig. 4, or for the different celebrities? (A sketch of this variant is given below.)
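A sketch of the contrastive variant from the last question: instead of maximizing a single raw cosine similarity, one optimizes the softmax over temperature-scaled similarities to a set of prompts (here the three prompts of Fig. 4). This is purely illustrative, assumes the OpenAI `clip` package, and omits the paper's augmentations and regularizers.

```python
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)
model.float().eval()

prompts = ["a beautiful landscape",
           "the map of the african continent",
           "a scientist conducting groundbreaking research"]
with torch.no_grad():
    text_feats = F.normalize(model.encode_text(clip.tokenize(prompts).to(device)), dim=-1)

x = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
opt = torch.optim.Adam([x], lr=0.05)
target = torch.tensor([0], device=device)                       # maximize P(prompt 0 | image)
for _ in range(1000):
    img_feat = F.normalize(model.encode_image(x), dim=-1)
    logits = model.logit_scale.exp() * img_feat @ text_feats.T  # contrastive, temperature-scaled logits
    loss = F.cross_entropy(logits, target)                      # softmax over the prompt set
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        x.clamp_(0, 1)
```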
Verdict:
The paper applies model inversion to CLIP models, yielding interesting insights about their training data, as well as some expected results regarding model behavior. However, it is unclear how these findings would translate to downstream applications of CLIP. Unfortunately, the experimental evaluation falls short of ICLR publication standards and does not adequately support the claims made. To strengthen the study, quantifying the frequency or severity of each insight wherever possible and seeking explanations for the observed behaviors could provide a more robust analysis. This could involve inspecting open CLIP models where both model and data are available.
Details of Ethics Concerns
I did not take ethical concerns into account for the main review and score. These are the potential ethical concerns:
- While the paper includes a warning at the beginning, it could do a better job of protecting readers who do not want to be exposed to potentially harmful content. For instance, the blurring in Fig. 4 and 5 does not sufficiently censor sexually explicit content. Instead, one solution could involve proper censorship in the main paper (including a brief, objective textual description), while linking to the current figures in the supplementary with a warning. The same applies to the text in Tab. 1 and 2.
- The study includes the names of real people (celebrities) for which sexually explicit images (blurred) are generated by the inversion method and unsafe words are retrieved by the CLIP model.
This work presents insights into the embeddings learnt by CLIP through the lens of model inversion. Since the data used for training CLIP is proprietary, the authors essentially leverage model inversion to generate images whose embeddings align closely with the embeddings of certain text prompts. The analysis leads to the following conclusions on both safety and model capabilities: a) seemingly normal prompts unexpectedly elicit NSFW images, specifically for many female celebrities; b) there exists prevalent gender bias when providing prompts about factors like profession, etc.; c) the images exhibit blending in the semantic space, as expressed by concept blending in the prompt, implying the model's understanding of semantic relationships (as opposed to more visual, pixel-level features). The paper is accompanied by appropriate visualizations and statistics, where apt, to support the presented claims.
Strengths
This work presents an interesting analysis on interpreting embedding-based image features from the widely popular CLIP model (where information about the training data is proprietary). It reveals important drawbacks elicited via model inversion that point to potential flaws in the training data, flagging an important issue given that CLIP image embeddings are widely used. Further, the authors have aptly presented the different implications in an organized and coherent fashion, making the paper easy to follow.
Weaknesses
While the work presents an interesting analysis, it is unclear how these insights can be concretely leveraged to improve image generation pipelines today. Can we inform any of the following? a) Modelling strategies, making models more robust to such potentially bad data points, or any kind of safety finetuning? b) Data curation strategies, if any? Further, some modelling choices (e.g., the choice of transformations) are not well motivated. These questions are outlined in the next section.
Questions
I would love to hear from the authors regarding the following questions:
- From the above section on weaknesses, how can the research community today concretely leverage these insights to inform data, modelling or evaluations? For example, how do any of these insights translate into useful modelling pointers for generative diffusion models? Do any of the learnings from CLIP-like models transfer to the generative context?
- Tangentially, this work also demonstrates the ability to inspect training data: is it accurate to say that this can imply training data leakage (and inspection via model inversion), and thus that there could be evidence of memorization? If yes, how can we ensure downstream models are robust to such inversion analysis?
- How can we improve text rendering based on the analysis presented? What is the intuition behind the choice of regularization being correlated with the text rendering quality?
- On the statement "Similarly, in CLIP inversion, if an image aligns with a given prompt, its augmentations must align with that prompt as well." - why was a different set of transformations chosen? What is the motivation? It is surprising that this is a big discriminator of image quality.
The paper investigates CLIP models using image inversion techniques to analyze their learned representations and biases. Their analysis reveals three findings: (a) CLIP models can effectively blend different concepts in inverted images, (b) models contain embedded NSFW associations that emerge even from innocent prompts, and (c) strong gender biases exist, particularly in occupational and status-related concepts.
Strengths
- The analysis shows that CLIP models trained on more data are more amenable to image inversion (better quality of inverted images).
Weaknesses
- No novelty in methodology or findings. The problem of image inversion has been studied in the context of discriminative (deepdream and papers cited in the submission) and generative models (e.g., https://arxiv.org/abs/2405.15012, https://dl.acm.org/doi/abs/10.1145/3372297.3417270). The problem of identifying CLIP biases has also been studied in the past (e.g., https://arxiv.org/abs/2311.05746, https://ojs.aaai.org/index.php/AIES/article/view/31657 and references therein).
- Fairly indirect approach to studying model biases. The image inversion approach proposed in this paper requires a second, human-in-the-loop inspection of inverted images to identify biases. This can be tedious, error-prone, and is not scalable. Also, it is unclear if there is a principled way to quantify the extent to which a given model is biased w.r.t. a given concept, which is essential for making progress in this direction of forgetting / unlearning undesirable concepts.
- This paper focuses specifically on CLIP models, even though the image inversion approach can be applied to more recent (and better performing) image-text models. Studying biases across models (e.g., do inverted images w.r.t. CLIP models transfer to other models?) would be interesting.
Questions
- Chris Olah's work on feature visualization (https://distill.pub/2017/feature-visualization/) shows that applying image inversion on Fourier-transformed images can yield better results; it would be interesting to see if the results are different if this Fourier-based parameterization is used (a rough sketch is given after these questions).
- Is there a way to leverage the proposed image inversion approach to mitigate identified biases in CLIP models?
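For the first question, a rough sketch of such a Fourier parameterization, adapted from the feature-visualization work linked above (purely illustrative, not the paper's method), is:

```python
import numpy as np
import torch

def fourier_parameterized_image(h=224, w=224, decay=1.0, device="cpu"):
    """Parameterize the image in frequency space with a 1/f amplitude scaling."""
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.rfftfreq(w)[None, :]
    scale = 1.0 / np.maximum(np.sqrt(fx ** 2 + fy ** 2), 1.0 / max(h, w)) ** decay
    scale = torch.tensor(scale, dtype=torch.float32, device=device)

    spectrum = (torch.randn(3, h, w // 2 + 1, 2, device=device) * 0.01).requires_grad_(True)

    def to_image():
        spec = torch.view_as_complex(spectrum) * scale        # per-frequency amplitude scaling
        img = torch.fft.irfft2(spec, s=(h, w))                # back to pixel space
        return torch.sigmoid(img).unsqueeze(0)                # (1, 3, h, w) in (0, 1)

    return spectrum, to_image

# The inversion loop would then optimize `spectrum` with the usual CLIP similarity loss,
# rendering the image via `to_image()` at every step.
spectrum, to_image = fourier_parameterized_image()
```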
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.