PaperHub
Average rating: 5.8 / 10 · Poster · 4 reviewers (min 3, max 7, std 1.6)
Individual ratings: 3, 7, 6, 7
Confidence: 4.3 · Correctness: 2.8 · Contribution: 2.8 · Presentation: 3.0
NeurIPS 2024

CLIP in Mirror: Disentangling text from visual images through reflection

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2025-01-14

Abstract

Keywords
Disentanglement of CLIP, Flip invariance, Typographic attack, Text recognition

Reviews and Discussion

Review
Rating: 3

This paper attempts to address typographic attacks by disentangling visual and language representations. The proposed framework, MirrorCLIP, leverages the observation that visual models struggle to recognize text semantics in mirrored images. By using both original and flipped images, MirrorCLIP compares their features in the latent space to create disentangling masks. These masks aim to separate textual and visual elements more precisely, leading to improved disentangled representations. However, while the experimental results demonstrate improvements on the current dataset, they also prompt questions about MirrorCLIP's actual effectiveness in disentangling visual and textual representations.

Strengths

  1. Clarity and Conceptual Simplicity: The idea behind MirrorCLIP is straightforward, and the writing is consistently clear and accessible.
  2. Training-Free Methodology: Employing a training-free approach, MirrorCLIP demonstrates an effective strategy for tackling typographic attacks.

Weaknesses

  1. The proposed MirrorCLIP framework might be easily circumvented. For instance, by overlaying text and its mirrored version on one image, the method's ability to disentangle textual features could be compromised.
  2. The hypothesis that "mirrored texts are difficult for visual models" is overly generalized. While this may hold true for most cases, exceptions exist, such as palindromic words like "mom" or ambiguities arising from handwritten text, which can still be recognized or confused by visual models.
  3. In Figure 7, the results in the first row show that the textual features do not generate corresponding semantics (e.g., dog and earphones), but rather produce nonsensical words. This raises questions about whether MirrorCLIP truly disentangles the semantics of text or merely separates text-form visual features.

Questions

  1. Does the disentangling mask still work if the typography includes both text and its mirrored version?
  2. Why do the "textual features results" in the first row of Figure 7 not produce corresponding semantics of the typography?
  3. In Table 6, why is the text recognition accuracy as high as 61.03 when textual features are zeroed out? Does this indicate that the disentangling of textual features is insufficient?

Limitations

Please see weaknesses.

Author Response

W1&Q1: Does … still work … text and its mirrored version? The proposed … might be easily circumvented.

A1: Yes, the disentangling mask still works. Although there is a 10.22-point drop in performance (59.71 to 49.49) compared to the accuracy with ordinary typography (see Table Ⅱ in the attached PDF), MirrorCLIP still achieves disentanglement and defends against typographic attacks.

As shown in Figure Ⅱ(b), we constructed a dataset that contains both the original and the mirrored text. Our results revealed that, even after adding the original and mirrored text, the cosine similarity between image features before and after flipping still exhibits a marked decrease, from 0.9855 to 0.8566, as shown in Table Ⅳ. Since the core idea of our method is to leverage the lack of feature invariance in CLIP when flipping images, MirrorCLIP can still locate the textual components by comparing image features before and after flipping, as shown in the activation map in Figure Ⅱ(b). Moreover, according to Table Ⅱ, MirrorCLIP still achieves disentanglement with a 9.73-point improvement (39.76 to 49.49) over the baseline and defends against typographic attacks. We suspect that, besides semantic information, the positional information of the text may also affect the disentanglement of MirrorCLIP. Still, the performance declines compared to the accuracy with ordinary typography, due to significant interference from the original and mirrored text.
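For concreteness, a minimal sketch of this flip-comparison idea is given below, using the Hugging Face CLIP implementation. The sign-agreement rule used to build the masks and the `disentangle` helper are illustrative assumptions for this sketch, not our exact mask construction.

```python
import torch
from PIL import Image
from torchvision.transforms.functional import hflip
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def disentangle(image: Image.Image):
    # Encode the original image and its horizontally flipped copy.
    inputs = processor(images=[image, hflip(image)], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)  # shape (2, D)
    f_orig, f_flip = feats[0], feats[1]

    # Dimensions that keep their sign under flipping are treated as visual
    # (flip-invariant); the remaining dimensions as textual (flip-variant).
    # NOTE: this sign rule is an illustrative assumption for the sketch.
    visual_mask = (torch.sign(f_orig) == torch.sign(f_flip)).float()
    textual_mask = 1.0 - visual_mask

    # Hadamard product of each mask with the original image features.
    return f_orig * visual_mask, f_orig * textual_mask
```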

We respectfully disagree in part: defending against circumvention is not our main focus, and MirrorCLIP is primarily proposed as a disentanglement method. Compared to ordinary typography, typography containing both the original and the mirrored text is a strong attack targeted specifically at our method, and it is not common in the real world. We sincerely appreciate your thorough insights; we will add a discussion of this in the limitation section and explore defenses against such strong attacks in future work.

W2: The hypothesis … overly generalized … palindromic … "mom" … handwritten text …

A2: MirrorCLIP is capable of handling ordinary palindromes like "did" and "radar", as well as handwritten text, since these change upon mirroring. However, it struggles to achieve disentanglement for special palindromes like "mom" and "wow". Note, though, that such special palindromes are extremely rare and hence have essentially no impact on our hypothesis.

For the case of handwritten text, we have already conducted experiments on 3 real-world typographic datasets where all the text is handwritten, and they show excellent disentanglement results (Table 4 and Table 5).

For the case of palindromes, we categorized them into two types: ordinary palindromes, where the shape of the word changes after flipping (e.g., "did" to "bib"), and special palindromes, where the shape remains basically unchanged (e.g., "mom" to "mom"). We constructed corresponding datasets: the ordinary palindrome dataset includes 26 words ("dad", "madam", "radar", etc.), while the special palindrome dataset includes 5 words ("wow", "noon", "mom", "nun", "minim"). Both types are illustrated in Figure Ⅱ(c) and Figure Ⅱ(d) in the attached PDF, and the results are shown in Table Ⅲ. For ordinary palindromes, MirrorCLIP achieves disentanglement with a 13.85-point improvement over the baseline, comparable to the improvement on other words. For special palindromes, however, MirrorCLIP struggles to achieve disentanglement and improves accuracy by only 5.29 points. As special palindromes are quite rare compared to other words, their impact, according to the results in Table Ⅲ, is limited.

Thank you for pointing this out. We will include a description of the special palindrome scenario in the limitation section.

W3-1&Q2: Why … not produce … semantics … typography?

A3: This issue is likely due to the limitations of the Stable UnCLIP model we used for feature visualization. It is not capable of directly generating semantically relevant images from text-only images; the generated images are often meaningless characters. More examples are shown in Figure Ⅱ(a) in the attached PDF.

As seen in the first row of Figure 7, after disentanglement, images generated with visual features do not carry textual components, and images generated with textual features do not carry visual components. This shows the effective disentanglement of MirrorCLIP.

W3-2: Whether … disentangles the semantics … text-form visual features.

A4: Our method can disentangle the textual semantics. This is verified through text recognition. According to the results in Tables 5 and 6, with disentangled textual features, the accuracy of text recognition improved significantly. This indicates the excellent disentanglement capability of MirrorCLIP for features with textual semantics, not only text-form visual features.

Q3: In Table 6, why … 61.03 when … zeroed out? Does … disentangling … insufficient?

A5: There might be some misunderstanding. The text recognition accuracy when textual features are zeroed out is actually 23.18 (shown after "visual features (zero)" in Table 6), not 61.03; 61.03 is the text recognition accuracy when visual features are zeroed out. We would like to clarify that the label "(zero)" denotes textual or visual features obtained by taking the Hadamard product of the textual or visual masks with the image features, as defined in L247. We will clarify the meaning of the label "(zero)" more explicitly in the revision to avoid any confusion.

The disentanglement of textual features is sufficient, as shown by the large decrease (from 72.51 to 5.29) in text recognition accuracy in Table 6: the text recognition accuracy of the textual features obtained with the textual filter is 72.51, while that of the visual features is 5.29. This large decrease reflects the effective removal of textual information.

Comment

Dear Reviewer ueRM,

I hope this message finds you well. As the deadline for the discussion phase approaches, we wanted to check in and see if our rebuttal has addressed your concerns. We would greatly appreciate it if you could reconsider your rating based on the responses and updates we've provided.

Best regards from all authors

Comment

My primary concern remains whether mirrorCLIP can genuinely achieve semantic-level disentanglement, rather than merely a formal-level disentanglement. The current experimental results do not convincingly address this issue. My specific questions are as follows:

  1. Could the authors provide accuracy results using typography semantics as ground truth (similar to Table II but with disentangled textual features)? This would allow for a more objective assessment of mirrorCLIP's performance in semantic disentanglement.

  2. Regarding Figure 7, my question pertains to the "textual features results" in the top row, whereas your response focuses on the "image features results" in the bottom row. If the "textual features results" in the bottom row can generate semantically correct images, this would demonstrate the semantic capability of Stable UnCLIP. Consequently, the "textual features results" in the top row should also produce images with specific, corresponding semantics rather than meaningless text.

  3. Text recognition appears to address formal-level (texture-like) disentanglement, which does not necessarily demonstrate semantic-level disentanglement. While text recognition can be a preprocessing step for semantic understanding, it does not itself provide semantic-level features.

  4. The results in Table II indeed show that mirrorCLIP has some effectiveness against mirrored text attacks, but the reasoning behind this remains unclear. Could the authors provide further analysis and explanation of this experimental result?

  5. In Table IV, could you provide the similarity results for the normal typographic attack?

Comment

Thank you for your thoughtful response. Before addressing your points individually, we want to clarify a key aspect. Our main disagreement seems to be whether MirrorCLIP extracts true semantic-level features or merely format-level (text-like) features.

We would like to clarify that all text recognition experiments in our paper are conducted by directly calculating the similarity between the disentangled textual embeddings from MirrorCLIP and text embeddings from CLIP’s text encoder (as shown in the pipeline in Figure 5). No additional network or layers are used (as inferred from your comment about 'text recognition as a preprocessing step').

Text recognition experiment setup: compared to the image recognition experiment, the only adjustments were changing the ground truth from visual categories to typographic categories and modifying the prompt from "a photo of [CLS]" to "text of [CLS]", as detailed in Appendix A.
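For illustration, the sketch below computes this similarity with the "text of [CLS]" prompt. The `recognize_text` helper and the use of the Hugging Face CLIP implementation are assumptions made for this sketch; `textual_embedding` stands for a disentangled textual embedding produced by MirrorCLIP.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def recognize_text(textual_embedding: torch.Tensor, categories: list) -> str:
    # Build the typographic prompts "text of [CLS]" for every category.
    prompts = ["text of " + c for c in categories]
    inputs = processor(text=prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_embeds = model.get_text_features(**inputs)  # shape (C, D)

    # Cosine similarity between the disentangled textual embedding and
    # every category text embedding; the most similar category wins.
    sims = torch.nn.functional.cosine_similarity(
        textual_embedding.unsqueeze(0), text_embeds, dim=-1
    )
    return categories[int(sims.argmax())]
```

Taking a softmax over `sims` instead of the argmax yields prediction probabilities of the kind reported later in this discussion.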

CLIP's text encoder has been shown to learn robust representations that capture the semantics of input text, and it has been widely used in text-conditional image generation [1,2] and phrase understanding [3,4]. Our text recognition experiment directly utilizes the text embeddings of CLIP and thus naturally validates semantic-level disentanglement: the content of the typography is predicted by selecting the category whose text embedding has the highest similarity to the disentangled textual embeddings from MirrorCLIP. Therefore, we are convinced that our validation is effective. We will describe this setup more clearly in our manuscript.

[1] CLIP-Forge: Towards zero-shot text-to-shape generation. CVPR, 2022.
[2] Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 2022.
[3] When does CLIP generalize better than unimodal models? When judging human-centric concepts. ACL Workshop, 2022.
[4] CLIP also understands text: Prompting CLIP for phrase understanding. arXiv preprint arXiv:2210.05836, 2022.

Comment

Here are our point-to-point responses to your further questions:

  1. We think there might be some misunderstanding here. In our paper, all text recognition experiments already use typography semantics as the ground truth, as you suggested. The detailed experimental setup is elaborated above, and the accuracy results have already been presented in Table 5, where CLIP's typography semantics recognition accuracy improves significantly after disentanglement.

  2. The generation of meaningless text-only images by Stable UnCLIP can be attributed to two main factors:

    • Lack of Optimization on Typographic Images: As highlighted in [5], Stable UnCLIP was not specifically optimized for typographic images during its training. This limitation makes the model unstable when generating such images. Our validation experiment, shown in Figure II(a), confirms this issue, where generated images often fail to align with the input image’s semantics, sometimes resulting in nonsensical characters.

      [5] High-resolution image synthesis with latent diffusion models. CVPR, 2022.

    • Textual-Visual Entanglement: As discussed in our Limitation Section, while MirrorCLIP is designed for text-visual disentanglement, the separation is not always perfect, which can negatively impact image generation. Our experiments in Figure 7 illustrate that in low-noise scenarios, such as with solid-color backgrounds, the generated images maintain semantic consistency. However, in more complex scenes with multiple elements, the generation process is prone to producing meaningless text.

    In summary, the challenges Stable UnCLIP faces with noise sensitivity complicate its performance in generating typographic images, and by extension, the validation of MirrorCLIP. We will elaborate on this issue in our revised manuscript. Nevertheless, we believe that with the ongoing advancements in generative technologies, this challenge will not hinder the validation of MirrorCLIP’s effectiveness. We sincerely appreciate your valuable feedback.

  3. We think there might be some misunderstanding here. As introduced in our supplement setup, the text recognition experiment was not a preprocessing step for semantic understanding. Instead, it directly determines the semantic category of typography, much like image recognition, by calculating the similarity between the disentangled textual embedding from MirrorCLIP and the text embeddings. As prior work [1,2,3,4] suggests, these text embeddings contain different category semantics. Thus, the observed performance improvement after disentanglement suggests that MirrorCLIP indeed achieves semantic-level disentanglement, aligning with our claims.

  4. We conducted further analysis on how the positioning of original and flipped text within an image, in conjunction with CLIP's position embeddings, contributes to its ability to distinguish content. Specifically, we designed an experiment where the original and flipped text were placed vertically close, separated by only 10 pixels.

    The experiment, conducted under the same conditions outlined in Table II (original and mirrored text), showed that MirrorCLIP's performance in image classification dropped by 3.72 points (from 49.49 to 45.77, as shown below) when the text was closely positioned. This outcome shows that positional proximity diminishes the model's classification accuracy because positional information contributes less in that case. We will discuss and dive deeper into this in the final version.

    | | ImageNet | Food | Flowers | Avg. |
    |---|---|---|---|---|
    | random position | 45.99 | 68.21 | 34.27 | 49.49 |
    | vertically close | 43.72 | 62.77 | 30.82 | 45.77 |
  5. The detailed similarity results for the normal typographic attack have already been provided in Table 1; we reproduce them below.

    | | ImageNet | Food | Flowers | Avg. |
    |---|---|---|---|---|
    | normal typographic attack | 0.8164 | 0.8643 | 0.8074 | 0.8294 |

    It can be seen that, compared to the various special scenarios, the cosine similarity between image features before and after flipping is lower for normal typographic attacks.

We hope the above answers address your concerns. Thank you again for your feedback, which has prompted us to think about and evaluate MirrorCLIP more deeply. We sincerely look forward to your response.

Comment
  1. There seems to be a misunderstanding regarding the text recognition question. I did not state that text recognition is used as a preprocessing step for handling typography attacks in this paper; rather, I mentioned that it 'could be'. My main point is that text recognition, as a task, does not necessarily require semantic understanding. Therefore, improvements in text recognition performance do not demonstrate semantic-level disentanglement, as the task primarily depends on morphological similarity.

  2. The visualization results in Figure 7 further support my concern. The authors’ response continues to focus on the "image features results" in the bottom row, attributing the nonsensical outputs in the top row’s "textual features results" to limitations of Stable UnCLIP. However, if Stable UnCLIP’s performance were indeed the issue, the bottom row’s "textual features results" should also be nonsensical. If semantic-level disentanglement were achieved, the "textual features results" in both the top and bottom rows should be consistent, regardless of Stable UnCLIP’s performance. Figure E in the appendix illustrates this issue across examples, with only the 'chair' example in the middle row producing the expected results. These results suggest that mirrorCLIP's disentanglement of textual semantics is insufficient.

A closely related work [1], despite some noise, effectively demonstrates its semantic-level textual disentanglement by visualization.

  3. Additionally, in response to "why mirrorCLIP is effective against mirrored text attacks," the authors note that accuracy improvements can arise from factors such as position. Given that different tasks emphasize different aspects, including position, morphology, semantics, and so on, it would be beneficial for the authors to provide a more detailed analysis of the specific sources of the performance improvements in their method. This would offer valuable insights into the effectiveness of the approach.

Based on the current results, I remain doubtful about mirrorCLIP’s ability to achieve semantic-level disentanglement, which is claimed as their core contribution. Therefore, I maintain my original score.

[1] Disentangling visual and written concepts in CLIP. CVPR 2022.

Comment

We need to clarify that in our original paper, we have never claimed semantic-level disentanglement as our core contribution. In fact, our core contribution is the observation that CLIP's image embeddings do not exhibit horizontal flip invariance for typography, and we proposed the zero-shot framework MirrorCLIP to achieve the disentanglement of visual and textual embeddings based on this observation. The idea of semantic-level disentanglement was introduced during the rebuttal phase to address your question about whether MirrorCLIP truly disentangles the semantics of text or merely separates text-form visual features, and our experiments confirm that MirrorCLIP is indeed capable of disentangling the semantics of text.

Detailed answers are shown below.

  1. There seems to be a clear misunderstanding about the semantics of CLIP's representations. If CLIP's text encoder merely focused on the shape of text-like images, it would struggle to recognize images that do not visually resemble their textual labels; for instance, an image of a cat obviously does not resemble the word "cat". Given this, our text recognition experiment is far from typical morphological analysis. Apart from the differences in ground truth categories and input embeddings, the process for our text recognition experiments is identical to that of image recognition: both use CLIP's text encoder to establish the text embeddings of the ground truth categories. Rather than matching shapes, the experiment validates how effectively MirrorCLIP preserves and aligns semantic content with text embeddings, demonstrating that our disentanglement process maintains semantic integrity.

    Furthermore, to confirm that our text recognition task relies on semantic disentanglement, we conducted an experiment where images containing the typography of "little" were used as input to the image encoder, and the texts ["little", "Iittle", "littIe", "IittIe"] were used as input to the text encoder, where "l" is the lowercase of "L" and "I" is the uppercase of "i", so their shapes are almost identical. We then recognized the content of the typography by comparing the cosine similarity between the disentangled textual embeddings from MirrorCLIP and the text embeddings from the text encoder. The final prediction probabilities are shown below.

    "little""Iittle""littIe""IittIe"
    predicted probability0.95360.01620.02980.0005

    Results of the same experiment with images containing the typography of "apple" and text ["apple", "appIe", "opple", "oppIe", "abble"] are shown below.

    "apple""appIe""opple""oppIe""abble"
    predicted probability0.99950.00020.00010.00020

    According to the above experimental results, it is evident that our text recognition experiments rely almost entirely on semantics rather than shape.

    In summary, we would like to emphasize that our approach differs significantly from standard text recognition methods, in that we use CLIP's text encoder, which is known for capturing the semantics of input text rather than just morphological patterns [1,2,3,4], to establish the ground truth.

    [1] CLIP-Forge: Towards zero-shot text-to-shape generation. CVPR, 2022.
    [2] Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 2022.
    [3] When does CLIP generalize better than unimodal models? When judging human-centric concepts. ACL Workshop, 2022.
    [4] CLIP also understands text: Prompting CLIP for phrase understanding. arXiv preprint arXiv:2210.05836, 2022.

Comment
  1. We highlight our conclusion here: due to the limitations of Stable UnCLIP and of our method, we cannot generate the expected results from textual semantics with Stable UnCLIP; in fact, Stable UnCLIP barely works in such extreme cases. This does not indicate that our disentanglement of textual semantics is insufficient. Instead, we have shown the effectiveness of our disentanglement of textual semantics by contrasting the disentangled textual embeddings from MirrorCLIP with text embeddings from CLIP's text encoder. Moreover, below are some key points that we need to clarify in comparison with [1]:

    • As described in Section 7.1 of [1], the image generation method used in [1] is entirely different from ours: the method in [1] is text-to-image generation with a text prompt, whereas our method is image-to-image generation with no text prompt. Moreover, the generation model used in [1] was optimized together with their proposed models, while ours is not.

    • It is obvious that the image generation experiments in [1] can only demonstrate that they achieved format-level disentanglement, not semantic-level disentanglement.

    • In the third row of Figure 7 in our paper, the images generated using disentangled textual embeddings contain semantically relevant visual components rather than typography. This clearly demonstrates that our method operates at the semantic level rather than the format level.

      [1] Disentangling visual and written concepts in CLIP. CVPR 2022.

    The examples of generated images in [1] demonstrate that they can only control whether they prefer to generate visual or textual components within the same semantic context using a trained projection matrix. This precisely indicates that their work only achieves format-level disentanglement rather than semantic-level disentanglement, as there is no second semantic input provided to the generative model. For example, as shown in Figure 1 of [1], when the input text prompt is "corn", the approach in [1] can only control whether to generate visual components of corn or the typography of the word "corn". This clearly demonstrates that only format-level disentanglement is achieved, not semantic-level disentanglement.

    Moreover, the generative model in [1] is optimized with their projection matrix, as described in Section 7.1 of [1], while ours is not. Despite this, the image quality produced by our method using disentangled visual embeddings is noticeably better than that produced by the "forget to spell" model used in [1]. Additionally, the model we used was trained on images with normal visual components and not specifically optimized for typography. As a result, our generated results are more susceptible to noise interference when dealing with textual embeddings of typography. It is evident that the textual embedding in the first row of Figure 7 is much more affected by noise than that in the third row, which explains why the first row struggles to generate semantically relevant images.

    Furthermore, in the third row of Figure 7, the images generated using textual embeddings of typography contain semantically relevant visual components rather than typography. This clearly demonstrates that our method operates at the semantic level rather than the format level.

We hope the above answers address your concerns. We sincerely look forward to your response.

Review
Rating: 7

This paper proposes a simple yet effective disentanglement framework for CLIP that leverages the different behavior of visual and textual semantics under mirror reflection, revealing that the CLIP model does not exhibit horizontal flip invariance for text; this demonstrates a certain degree of innovation. The framework achieves zero-shot disentanglement of textual and visual features, and various experiments, using methods such as CAM and image generation, validate the effectiveness of the disentanglement framework. Additionally, it enhances the robustness of CLIP against typographic attacks without any additional training, surpassing the defense performance of existing methods.

Strengths

The paper is easy to follow. The proposed zero-shot, training-agnostic method could have similar performance to non-zero-shot methods.

Weaknesses

  1. In the ablation experiments, detailed experimental results of the disentangling framework when dealing with images containing flipped text were not provided.
  2. More description of the potential applications of this disentanglement framework in practical tasks should be provided in conclusions.
  3. Incorrect mathematical notation: the cross product symbol (×) is used throughout the equations in the manuscript, which may cause misunderstanding.

Questions

Please refer to weaknesses.

Limitations

Limitations are adequately addressed.

Author Response

W1: Detailed experimental results of the disentangling framework when dealing with images containing flipped text were not provided in the ablation experiments.

A1: Thank you for your thorough review of the paper. We show the detailed experimental results with images containing flipped text in Table Ⅰ below. Based on the results in Table 6 and Table Ⅰ, the visual features obtained through MirrorCLIP achieve high accuracy in image classification when handling both normal and flipped text. These results will be added in the revision.

Table Ⅰ: Results of different features on image recognition with flipped text.

| | original | typographic |
|---|---|---|
| image features | 61.38 | 55.97 |
| flipped image features | 61.59 | 37.56 |
| visual features | 61.84 | 50.30 |

W2: More description of the potential applications of this disentanglement framework in practical tasks should be provided in conclusions.

A2: Thanks for your advice. We have initially explored object detection and text segmentation by combining MirrorCLIP with RegionCLIP and SAM. The results show the potential of MirrorCLIP for different downstream tasks or applications. Relevant examples are shown in Figure Ⅰ in attached PDF. By using MirrorCLIP to get the disentangled visual region features of RegionCLIP, we can reduce the influence of textual factors and get more accurate detection results. By using the textual features obtained from MirrorCLIP to generate prompts for SAM, we can achieve text localization within images and perform preliminary text segmentation. In our revision, we will include a description of the potential applications of MirrorCLIP.

W3: Cross multiplication (×) is used throughout equations in the manuscript, which may cause misunderstanding.

A3: Thank you for pointing out the notation issue. We will correct it and thoroughly check all mathematical notations in the revision.

Comment

I want to thank the authors for their rebuttal. The concerns are adequately addressed. I will raise my score.

Comment

Dear Reviewer b2Kp,

Thanks for your recognition of our work and constructive suggestions. We appreciate your careful consideration of the suggestions and are glad to hear you view the paper more positively.

Best regards from all authors

Review
Rating: 6

The paper highlights that CLIP may erroneously identify visual objects due to the influence of textual information, thereby reducing the accuracy of visual object recognition. The objective is to extract more precise visual and textual features from the image. The paper proposes that mirroring the image can preserve the consistency of visual semantics while disrupting textual semantics. Based on this insight, a zero-shot framework has been designed. Specifically, a disentangling mask is generated by inputting both the original and flipped images. Additionally, filters are designed to separate textual and visual factors, resulting in disentangled representations. Experiments using stable diffusion models and class activation mapping (CAM) validate the effectiveness of the proposed method.

Strengths

  • The proposed methodology is straightforward and easy to implement.

  • The results are comprehensive, covering experiments across various settings, including typographic attacks.

  • The appendix contains additional results, demonstrating an extensive empirical effort.

Weaknesses

  • The proposed method claims to achieve more precise visual semantics by disentangling the semantics of images. I am curious whether this kind of visual semantics can be generalized to other tasks.

  • The paper primarily presents experiments for classification. I am interested in whether this approach can be extended to other tasks, such as object detection.

Questions

The results are comprehensive. But I am still curious if the disentangled representations can be explored for other downstream tasks or applications. It would be better to have a discussion on this.

Limitations

The authors discussed the limitations in the paper.

Author Response

W1&W2&Q1: It would be better to have a discussion on whether MirrorCLIP can be explored for other downstream tasks or applications.

A1: To explore MirrorCLIP's applications and downstream tasks, we combined it with RegionCLIP and SAM for object detection and text region segmentation. Specific examples can be found in Figure Ⅰ in the attached PDF.

For detection, RegionCLIP extends CLIP to learn region-level visual representations, allowing for detailed alignment between image regions and textual concepts. This capability supports region-based reasoning tasks, such as zero-shot and open-vocabulary object detection. However, the vanilla RegionCLIP is susceptible to textual components during object detection tasks. By using the MirrorCLIP framework to disentangle the region features of RegionCLIP, we can similarly reduce the influence of textual factors. In Figure Ⅰ(a), vanilla RegionCLIP mistakenly identified a price tag with text "papaya" as papaya. Moreover, after adding the text "television" on the laptop screen, vanilla RegionCLIP was misled and identified the laptop monitor as a television set. These errors were corrected by replacing the region features with the disentangled visual features obtained through MirrorCLIP. This highlights the potential of MirrorCLIP for applications in object detection.

For text region segmentation, by using the disentangled textual features obtained from MirrorCLIP to generate prompts for SAM, we can achieve text localization within images and perform preliminary text segmentation. Specific examples can be seen in Figure Ⅰ(b). This shows that disentangled features through MirrorCLIP can be used for downstream tasks such as image segmentation. Our future work will continue to explore the applications of MirrorCLIP in various tasks.
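To make the SAM integration concrete, below is a hedged sketch of one way to turn a disentangled textual embedding into a point prompt for SAM. The patch-level scoring (projecting CLIP patch tokens and prompting SAM at the best-matching patch) is an assumed simplification rather than our exact pipeline, and the checkpoint path is a placeholder.

```python
import numpy as np
import torch
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder path
predictor = SamPredictor(sam)

def segment_text_region(image: Image.Image, textual_embedding: torch.Tensor):
    # Score each CLIP patch token against the disentangled textual embedding.
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        tokens = clip.vision_model(**inputs).last_hidden_state[0, 1:]  # (P, H)
        patch_embeds = clip.visual_projection(tokens)                  # (P, D)
    sims = torch.nn.functional.cosine_similarity(
        patch_embeds, textual_embedding.unsqueeze(0), dim=-1
    )

    # Map the best-matching patch back to pixel coordinates (7x7 grid of
    # 32-pixel patches for ViT-B/32 at 224x224), rescaled to the input image.
    grid = 224 // 32
    idx = int(sims.argmax())
    py, px = divmod(idx, grid)
    x = (px + 0.5) / grid * image.width
    y = (py + 0.5) / grid * image.height

    # Use that location as a foreground point prompt for SAM.
    predictor.set_image(np.array(image.convert("RGB")))
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]), point_labels=np.array([1])
    )
    return masks[int(scores.argmax())]
```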

Comment

Dear Reviewer J5xD,

Thank you for your recognition of our work and constructive suggestions. We hope our additional experiments have addressed your questions. As the deadline for the discussion phase approaches, if you have any other questions or would like to discuss further, please let us know. We sincerely look forward to your response.

Best regards from all authors

Review
Rating: 7

This paper introduces a zero-shot framework, MirrorCLIP, to solve the confusing issues of CLIP facing text-visual images. Unlike existing methods, this method exploits CLIP’s invariance for visual factors and variance for textual factors of images when horizontally flipped. In particular, this paper reveals the difference in mirror effects between visual objects and text on CLIP representation. It first develops a dual-stream disentanglement framework that generates masks by comparing the original text-visual images with the flipped ones in latent space. Additionally, the designed filters generate textual and visual features, respectively, ensuring disentangling quality. This paper compares the proposed method with various methods across multiple datasets, including clean images, synthetic typographic images, and real-world typographic images. During the experiments, MirrorCLIP showed better disentanglement effectiveness and quality for the textual and visual parts of the images, as well as robustness for typographic-attacked images. The paper also uses CAMs and generative models to further evaluate the disentanglement performance.

Strengths

  1. The finding of the mirror effects of CLIP is novel. This work analyzes the differences between the effects of visual factors and text factors under horizontal flipping. Exploiting this, the work proposes an efficient and simple solution to disentangle textual and visual factors in latent space and to address the issues that text-visual images cause for CLIP.

  2. The experiments are sufficient, and the performance is excellent. Compared to existing baselines and SoTA, MirrorCLIP has better performance and robustness in image classification on both original images and typographic-attacked images. Furthermore, the qualitative results obtained using the CAM method and the generative method demonstrate MirrorCLIP's disentanglement performance.

Weaknesses

  1. This approach can achieve good disentanglement and solve confusion in text-visual image understanding. However, it would be beneficial to delve deeper into the differences by comparing this approach to the existing CLIP-based method for textual and visual disentanglement in related works.

  2. The pipeline is based on CLIP. Providing a preliminary introduction to CLIP would be better. Moreover, adding an image that introduces the concept of textual and visual objects of images will improve the clarity of the paper.

Questions

See above.

Limitations

N/A

Author Response

W1: It would be beneficial to delve deeper into the differences by comparing this approach to the existing CLIP-based method for textual and visual disentanglement in related works.

A1: Thanks for your constructive advice. Compared to other CLIP-based works, MirrorCLIP is the only training-free method that requires no additional parameters or data, yet it exhibits superior disentanglement performance. Moreover, MirrorCLIP leaves performance on the original datasets unaffected, whereas other methods may degrade it.

Due to space constraints, we briefly introduced CLIP-based methods in L90. We will highlight the differences between MirrorCLIP and these methods in the revision. Specifically, Lemesle et al. introduced methodological tools from the cognitive science literature to assess the language biases of CLIP and found that the textual and visual factors of an image do not share semantic representations in CLIP, by presenting words that distort image classification across different category levels. However, their approach cannot produce disentangled representations of CLIP. Materzynska et al. disentangled visual and textual features by training different projection matrices and applying them to the CLIP outputs. However, this requires additional model parameters and data for training, and it also degrades performance on the original datasets.

W2: Providing a preliminary introduction to CLIP would be better. Moreover, adding an image that introduces the concept of textual and visual objects of images will improve the clarity of the paper.

A2: Thanks for your advice. In our final version, we will add a preliminary introduction to CLIP, along with a more straightforward presentation of the visual and textual components of images, to enhance the clarity of our work.

Comment

My concerns are well addressed and I have no further questions. Thanks!

Comment

Dear Reviewer xWuF,

Thank you for acknowledging our work and putting forward precious suggestions. As the deadline for the discussion phase approaches, if you have any other questions or would like to discuss further, please let us know. We sincerely look forward to your response.

Best regards from all authors

Author Response

Dear Reviewers,

Please see the attached one-page PDF with added experimental results.

We sincerely thank all the reviewers for their positive and constructive comments:

  • All reviewers appreciate that our paper introduces a simple yet effective training-free approach to disentangle textual and visual factors of CLIP image embedding in latent space (reviewer 1,2,3,4),

  • The observation of difference in mirror effects between visual objects and text on CLIP representation is novel (reviewer 1,3),

  • The experiment is sufficient and the results are comprehensive (reviewer 1,2,3).

They also voiced several valid concerns. We have been diligently enhancing the paper on multiple fronts, addressing concerns, and providing a point-to-point response. We summarize the changes updated below.

1. Exploring the potential applications of MirrorCLIP in various downstream tasks.

To explore MirrorCLIP's applications and downstream tasks, we combined it with RegionCLIP and SAM, for detection and text region segmentation. Specific examples can be found in Figure Ⅰ in attached PDF. By using MirrorCLIP to get the disentangled visual region features of RegionCLIP, we can reduce the influence of textual factors and get more accurate detection results. By using the textual features obtained from MirrorCLIP to generate prompts for SAM, we can achieve text localization within images and perform preliminary text segmentation. These examples demonstrate the potential of MirrorCLIP for various downstream tasks.

2. Further tested MirrorCLIP's disentanglement capability in various extreme scenarios.

We further tested the disentanglement capability of MirrorCLIP in three special scenarios: typography with original and mirrored text, ordinary palindromes, and special palindromes. We constructed corresponding datasets and conducted experiments; detailed results are shown in the attached PDF.

According to the results, when handling ordinary palindromes, where the shape of the word changes after flipping ("did" to "bib"), MirrorCLIP still achieves disentanglement performance comparable to that on other normal words. However, when handling special palindromes, where the shape of the word remains basically unchanged after flipping ("mom" to "mom"), MirrorCLIP struggles to achieve disentanglement. Yet, because special palindromes are quite rare compared to other words, their impact is limited.

When handling typography with original and mirrored text, MirrorCLIP can still achieve disentanglement, but there is a noticeable decline in performance. Compared to ordinary typography, however, typography with original and mirrored text is a strong attack targeted at our method and is not common in the real world. Also, MirrorCLIP is primarily proposed as a disentanglement method, not a defense method.

We will discuss MirrorCLIP's performance in these scenarios in the Ablation and Limitation sections of the revision.

3. More revisions that help enhance the clarity of the paper.

  • We will further clarify the differences between MirrorCLIP and other disentanglement methods.
  • We will add a preliminary introduction to CLIP, along with a more straightforward presentation of the visual and textual components of images.
  • We will correct and clarify all symbols and definitions that could lead to misunderstandings.

Please see our reviewer-specific feedback for detailed information.

Final Decision

The reviewers and the authors agree that the proposed method is straightforward and "simple" (quoting the authors' own word). The AC believes and agrees that the presented approach offers an interesting way to peek into how CLIP represents images. It is still a small step toward diagnosing the common failures of CLIP (and similar models), but it is insightful enough to motivate similar lines of future research. The AC recommends Accept.