Whitened CLIP as a Likelihood Surrogate of Images and Captions
Direct likelihood approximation of images and captions based on CLIP's learned distribution.
Abstract
Reviews and Discussion
update after rebuttal
I will keep my score as is (3)
This paper proposes Whitened CLIP (W-CLIP), a transform of the CLIP latent space that provides direct access to a log-likelihood function. The whitening matrix W is computed only once from data, a priori.
Questions for Authors
- I think "Hyperbolic Image-Text Representation" (https://arxiv.org/abs/2304.09172) and "Embedding Geometries of Contrastive Language-Image Pre-Training" (https://www.arxiv.org/abs/2409.13079) are two works related to this paper. I do not see them referenced anywhere, unless I am missing something. What do the authors think about these two works? What would happen if the whitening technique were applied to those two representations?
- The finding that the log-likelihood representation helps separate real images from synthetic images is interesting. Can it help identify synthetic images when there is no artifact?
Claims and Evidence
Yes. The whitening transform in Section 3.2 and the whitening of CLIP embeddings in Section 3.3 contain convincing evidence. The authors ran quantitative statistical experiments using the Anderson-Darling and D'Agostino-Pearson tests, indicating that the features in the whitened space can be well approximated by a normal distribution.
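For reference, such per-feature normality checks amount to a few lines with SciPy; the sketch below is illustrative only (the placeholder array and the 5% significance level are assumptions, not necessarily the paper's exact setup):

```python
import numpy as np
from scipy import stats

# whitened: (n_samples, n_features) matrix of W-CLIP embeddings (placeholder data here)
whitened = np.random.randn(5000, 768)

ad_pass = dp_pass = 0
for j in range(whitened.shape[1]):
    feat = whitened[:, j]
    # Anderson-Darling: compare the statistic against the 5% critical value
    ad = stats.anderson(feat, dist="norm")
    ad_pass += ad.statistic < ad.critical_values[2]  # index 2 corresponds to the 5% level
    # D'Agostino-Pearson: p-value above 0.05 means normality is not rejected
    _, p = stats.normaltest(feat)
    dp_pass += p > 0.05

n_features = whitened.shape[1]
print(f"Anderson-Darling pass rate: {ad_pass / n_features:.2%}")
print(f"D'Agostino-Pearson pass rate: {dp_pass / n_features:.2%}")
```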
Methods and Evaluation Criteria
Since this work concerns a CLIP embedding transformation, the paper would be much stronger if the authors evaluated downstream tasks such as zero-shot transfer, classification, and OCR, similar to the original CLIP paper: https://arxiv.org/pdf/2103.00020
Theoretical Claims
The output log-likelihood is consistent with intuitive understanding: specific words have lower log-likelihood than common words, and synthetic images with artifacts have lower log-likelihood than real images. The claims are sound, except that it is not clear in lines 897-900 why the W matrix becomes unstable (and may not be invertible) when the features are highly correlated, especially for text embeddings.
Experimental Designs or Analyses
The experiments in Section 4 are sound. However, I think this paper could include experiments on how W-CLIP affects the performance of downstream tasks such as image classification and OCR.
Supplementary Material
Yes, I reviewed all of Sections A, B, C, and D, specifically A and D for reproducibility and the calculation of W.
Relation to Prior Literature
This work builds on the original CLIP paper (https://arxiv.org/pdf/2103.00020). It transforms the CLIP latent space so that each feature in the embedding space has zero mean, unit standard deviation, and no correlation with the other features. As a result, one can evaluate this embedding through a direct log-likelihood function, which is a novel idea.
Essential References Not Discussed
I think "Hyperbolic Image-Text Representation" (https://arxiv.org/abs/2304.09172) and "Embedding Geometries of Contrastive Language-Image Pre-Training" (https://www.arxiv.org/abs/2409.13079) are two works related to this paper. They do not do what this paper proposes, but they seem to be good references for the related work, since they show alternative representations of the CLIP embedding. I do not see those two papers referenced anywhere, unless I am missing something.
Other Strengths and Weaknesses
Strengths:
1. Transforms the CLIP latent space, ensuring each feature in the embedding space has zero mean, unit standard deviation, and no correlation with the other features.
2. Formulates the CLIP embedding in terms of a log-likelihood.
3. The log-likelihood representation helps separate real images from synthetic images with artifacts, the latter having lower log-likelihood, which is intuitive.
4. A very well written paper with clear instructions for reproduction.
Weaknesses:
1. Lack of evaluation of W-CLIP on downstream tasks such as image classification, similarity search, and OCR.
Other Comments or Suggestions
None
We thank the reviewer for their constructive feedback and comments. Below, we address the key concerns regarding the applicability of our approach to zero-shot settings and answer the reviewer's questions.
Methods and Evaluation Criteria
To address the reviewer's comment regarding zero-shot transfer to a downstream task, we conducted a large-scale experiment on the generated-image detection task. The generated images were synthesized from text prompts taken from three benchmarks [1,2,3], using 20 generative models, for a total of 100k generated images. Since these datasets include fewer real images, we supplemented them with real samples from the MSCOCO training set.
Here, zero-shot refers to having no exposure to generated content and no task-specific training. We use only real images (from MSCOCO validation set) to compute the whitening matrix and do not fine-tune on this task. Thus, we compare against other zero-shot image detection baselines.
We benchmark against four zero-shot detection methods: AEROBLADE [4] (CVPR 2024), RIGID [5] (arXiv 2024), ZED [6] (ECCV 2024), and Manifold-Bias [7] (ICLR 2025). We use the official implementations for [4] and [7], and implement [5] and [6] ourselves based on their papers and the available code of the models they rely on.
Each method outputs a continuous criterion score, which we binarize using a threshold from a small calibration set of 1k real images:
th = mean(C) + std(C), where C denotes criterion values. This calibration set is disjoint from the evaluation set.
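For clarity, this calibration and binarization step amounts to the following minimal sketch (function and variable names are illustrative, not the exact code used):

```python
import numpy as np

def calibrate_threshold(calib_scores):
    """Threshold from the real-image calibration set: mean plus one standard deviation."""
    c = np.asarray(calib_scores)
    return c.mean() + c.std()

def binarize(scores, threshold):
    """Label a sample as generated when its criterion exceeds the calibrated threshold.
    Assumes a higher criterion indicates 'more likely generated'; flip the sign otherwise."""
    return (np.asarray(scores) > threshold).astype(int)
```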
We use four metrics: AUC and AP (Average Precision) as separation metrics, and F1-score and Accuracy as classification metrics. Evaluation is based on 100k generated and 100k real images. As shown below, our method outperforms all baselines across all metrics. This setting follows the protocol used in [7], the most comprehensive of the compared works. Our results for baseline methods align with those reported in [7]. We believe this validates our method’s ability to distinguish real from generated images in a zero-shot setting.
Performance Comparison
| Method | AUC | AP | F1 | Acc |
|---|---|---|---|---|
| AEROBLADE | 0.52 | 0.48 | 0.64 | 0.53 |
| RIGID | 0.51 | 0.53 | 0.28 | 0.52 |
| ZED | 0.69 | 0.66 | 0.69 | 0.62 |
| Manifold-Bias | 0.85 | 0.88 | 0.76 | 0.78 |
| Ours | 0.89 | 0.89 | 0.82 | 0.81 |
While methods [6] and [7] occasionally perform competitively on some generative models, ours is more consistent across all generative models, whereas the others experience significant drops on specific ones. Full per-model results are available here:
https://drive.google.com/file/d/1hQFwAqpo3opByTqC70mb4-fDz0z8TkVQ/view?usp=drive_link
Theoretical Claims
When CLIP embedding features are highly correlated, the covariance matrix can become ill-conditioned, yielding near-zero eigenvalues. During whitening, eigenvectors are scaled by the inverse square root of these eigenvalues, so eigenvalues close to zero can result in a numerically unstable and non-invertible whitening matrix W.
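To make the scaling step concrete, here is a minimal sketch of covariance-based whitening and where the instability enters; the small regularization term `eps` is an illustrative safeguard and not part of the paper's formulation:

```python
import numpy as np

def whitening_matrix(X, eps=0.0):
    """Whitening from the covariance eigendecomposition: W ~ U diag(1/sqrt(lambda)) U^T.
    When features are highly correlated, some eigenvalues approach zero, so
    1/sqrt(lambda) blows up and W becomes numerically unstable (and hard to invert)."""
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    scale = 1.0 / np.sqrt(eigvals + eps)         # near-zero eigenvalues -> huge scales
    W = (eigvecs * scale) @ eigvecs.T            # ZCA form; PCA form: np.diag(scale) @ eigvecs.T
    return W, eigvals

# The condition number of the covariance, eigvals.max() / eigvals.min(),
# signals how unstable the resulting W will be; whitened features are Xc @ W.
```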
Questions for Authors
- We thank the reviewer for highlighting two relevant papers, which we will cite in the final version. These works propose alternative, hierarchical embedding spaces to CLIP. The whitening transform is agnostic to the embedding space, as long as the data adequately represents it; in the context of these embedding spaces, we assume this requires diverse samples, including many complex examples. For our likelihood approximation, the key factor is the distribution of each feature. As shown in Fig. 4c, CLIP features exhibit a Gaussian-like distribution, which becomes standard normal after whitening, as validated in Tab. 1. If the original features of these embedders follow a different distribution, whitening may not yield standard normal features, undermining the assumptions of our likelihood model.
- This question is addressed in the zero-shot detection experiment discussed above.
We hope these additional results and clarifications further support our proposed approach.
[1] Wang, Sheng-Yu, et al. "CNN-generated images are surprisingly easy to spot... for now." CVPR 2020.
[2] Ojha, Utkarsh, et al. "Towards universal fake image detectors that generalize across generative models." CVPR 2023.
[3] Zhu, Mingjian, et al. "GenImage: A million-scale benchmark for detecting AI-generated image." NeurIPS 2023.
[4] Ricker, Jonas, et al. "AEROBLADE: Training-free detection of latent diffusion images using autoencoder reconstruction error." CVPR 2024.
[5] He, Zhiyuan, et al. "RIGID: A training-free and model-agnostic framework for robust AI-generated image detection." arXiv 2024.
[6] Cozzolino, Davide, et al. "Zero-shot detection of AI-generated images." ECCV 2024.
[7] Brokman, Jonathan, et al. "Manifold induced biases for zero-shot and few-shot detection of generated images." ICLR 2025.
The authors propose the use of a whitening transform in the CLIP space, which offers an efficient closed-form solution via the SVD. Using "WCLIP" they explore a wide range of practical downstream tasks one can tackle using the now-quantifiable likelihood (OOD image detection, caption complexity, quantifying artifacts, etc.).
update after rebuttal
The authors provide useful clarifications during the rebuttal, and additional evidence of the usefulness of the proposed methodology. However, I maintain that work is needed on the experimental section to clarify the authors' key contributions--ultimately, I now lean towards a weak acceptance based on these considerations.
Questions for Authors
n/a
Claims and Evidence
Yes — the authors do not make many claims in the paper, and the few small claims they do make are reasonable and sound.
Methods and Evaluation Criteria
The method is well-formulated and makes sense theoretically. The absence of a need to train a mapping end-to-end (and tune the hyperparameters that come with such approaches) is a key strength of the authors' proposal to use the whitening transform. However, there is a lack of appropriate experimental results comparing the proposed method to alternative baselines for the tasks the authors explore (please see my comments in the “experimental design” section for more on this).
Theoretical Claims
N/a — no theoretical claims are made in the paper.
Experimental Designs or Analyses
The authors explore multiple properties and downstream use cases of WCLIP. However, I am not convinced by the majority of the experiments.
The authors did an insufficient job of motivating the practical benefits of the properties induced by the method, and the downstream tasks explored lack any comparisons to baselines to confirm that the proposed method confers advantages over existing/simpler techniques. Concretely:
Text complexity: the authors show that longer captions yield lower likelihood scores. However, I am confused about what insights this finding adds — CLIP is (presumably) trained with short image-caption pairs. Thus, I would argue the decreasing likelihood as a function of caption complexity likely predictably arises from the data itself, and likelihood here offers no unique insights. I fully expect to see the same relationship with CLIP’s original cosine similarity for short vs long/specific captions. What value does this analysis provide over the same analysis with the cosine similarity metric? The authors could perform the same experiments with the original CLIP cosine similarity to show why the likelihood is more informative, or why this is useful.
Uniformity: the authors show in Figure 4: “the effectiveness of the whitening transform in achieving unit variance and zero correlation among features”. Isn’t this just by definition of using the whitening transform? There is no motivation for why this is a desirable property or the unique benefits of it. Furthermore, the authors next mention uniformity as a “desirable” property too, but do not elaborate at all on this point.
Data analysis: In the authors’ experiments showing the use of likelihood to distinguish between real/fake/OOD data, they once again fail to benchmark against even a simple baseline to show why one would use this method over existing techniques—for example, what about simple k-means anomaly detection in the original CLIP space?
I am not convinced there is any practical benefit of WCLIP over much simpler analyses currently.
Supplementary Material
Yes, I went through the additional experiments included here. I did not pay much attention to the theoretical preliminaries describing the standard whitening transform, however.
Relation to Prior Literature
I am not familiar with the wider literature for analysis of CLIP. However, even without specific knowledge of the literature, I am confident in my assessment of the paper as needing experiments comparing their method to simpler alternative baseline approaches to justify and motivate why one would use the whitening transform.
Essential References Not Discussed
The paper appears to do quite a good job in the literature review. I don’t know the literature well enough to identify whether they have missed important works.
Other Strengths and Weaknesses
I like the authors' idea conceptually. Whilst I wouldn't totally agree with the authors' characterization that it is "training-free" (perhaps one could say instead that it offers a closed-form solution, based on the SVD), I do think the efficiency of the method is a key strength of the paper.
However, as discussed above, the experiments in the paper need a lot of work. I would encourage the authors to prioritize depth over breadth here, and polish the motivation for the whitening analysis.
Other Comments or Suggestions
n/a
We value the reviewer’s critical insights and suggestions. Below, we provide detailed clarifications and supporting evidence addressing the concerns raised.
Experimental Designs or Analyses
Text Complexity
One example in Fig. 5 does show higher likelihood for a shorter sentence (left), but this does not imply a bias toward shorter inputs. In Fig. 7c, our method's likelihood is shown to be unaffected by caption length, unlike the LLMs and VLMs. Thus, our findings are contrary to the reviewer's claim. In fact, in Fig. 7a and Tab. 4, we present a surprising finding: LLMs and VLMs show similar distributions for captions with and without nouns. As illustrated, the noun-free captions are semantically illogical, yet because LLMs and VLMs are strongly biased toward shorter inputs, their distributions remain unchanged. In contrast, our method, agnostic to caption length, assigns significantly lower likelihood values to the noun-free captions.
Regarding cosine similarity as an alternative to likelihood: the two serve fundamentally different roles. Cosine similarity compares two inputs and outputs a scalar based on angular distance, whereas likelihood evaluates a single input. Therefore, cosine similarity is not suitable as a substitute for likelihood in this context.
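To make the distinction concrete, here is a minimal sketch contrasting the two quantities under the standard-normal assumption in the whitened space (names are illustrative):

```python
import numpy as np

def log_likelihood(w):
    """Log-density of a single whitened embedding under a standard multivariate normal:
    log p(w) = -0.5 * (||w||^2 + d * log(2*pi)). Evaluates one input on its own."""
    w = np.asarray(w)
    d = w.shape[-1]
    return -0.5 * (np.sum(w ** 2, axis=-1) + d * np.log(2 * np.pi))

def cosine_similarity(a, b):
    """Angular agreement between two embeddings; requires a pair of inputs by definition."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```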
Uniformity
- Unit Variance and Zero Correlation: As the reviewer notes, these properties are directly achieved through the whitening transform. While this holds by construction, the experiment empirically confirms that the transformation results in an isotropic space where features have zero mean, unit variance, and are uncorrelated (a short verification sketch follows this list). These characteristics, along with the normality verified in Sec. 3.3, are essential for computing the likelihood using a multivariate Gaussian model (Sec. 3.4).
- Uniformity: We agree that we did not elaborate on uniformity in the main paper, due to space constraints. Here we provide a more detailed explanation: uniformity in latent representations ensures that embeddings are evenly distributed across the latent space, preventing representation collapse, where different inputs become indistinguishable due to overly similar embeddings [1]. A uniform space enhances discrimination between inputs, improving downstream tasks such as classification [2]. It also promotes generalization and avoids overfitting to specific latent regions [3]. In contrastive learning, uniformity complements alignment by pushing dissimilar inputs apart [1], resulting in a well-structured, robust, and generalizable representation space. We will add a more detailed explanation to the appendix and refer to it from the main paper.
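The verification referenced in the first point is straightforward; a minimal sketch, assuming whitened embeddings stacked in a matrix `Xw` (names and tolerances are illustrative):

```python
import numpy as np

def check_isotropy(Xw, atol=0.05):
    """Empirically verify zero mean, unit variance, and near-zero cross-correlation."""
    mean_ok = np.allclose(Xw.mean(axis=0), 0.0, atol=atol)
    var_ok = np.allclose(Xw.var(axis=0), 1.0, atol=atol)
    corr = np.corrcoef(Xw, rowvar=False)
    max_off_diag = np.abs(corr - np.eye(corr.shape[0])).max()
    return mean_ok, var_ok, max_off_diag
```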
Data Analysis
We refer the reviewer to our new experiment on zero-shot generated image detection (see response to Reviewer 531Y). Specifically in relation to cosine similarity, we note that both RIGID and Manifold-Bias use cosine similarity as part of their detection pipelines. However, our likelihood-based method is simpler, significantly faster (see table below), and outperforms both baselines. Moreover, other cosine-based methods [4,5] for this task are not applicable in a zero-shot setting. This experiment highlights both the effectiveness and practical advantages of our approach. To demonstrate the efficiency of our method, we report per-image inference times on a single A100 GPU for the zero-shot detection task:
| Method | Running time [sec per image] |
|---|---|
| AEROBLADE | 4.66 |
| RIGID | 0.59 |
| ZED | 0.26 |
| Manifold-Bias | 1.66 |
| Ours | 0.05 |
Other Strengths and Weaknesses
Not "Training-Free"
The reviewer raises a valid point: the PCA in the whitening transform can be seen as unsupervised training. However, this is analogous to CLIP pretraining, performed once on unlabeled data and reused across downstream tasks. In our case, this step is extremely lightweight (under 1 second). As demonstrated above, leveraging the likelihood for a downstream task results in a very efficient method.
We hope these clarifications and the new experiment address the reviewer's concerns and underscore the practical strengths of our approach. We would appreciate it if the reviewer would consider raising the final rating of the paper.
[1] Wang, Tongzhou, and Phillip Isola. "Understanding contrastive representation learning through alignment and uniformity on the hypersphere." ICML 2020.
[2] Chen, Ting, et al. "A simple framework for contrastive learning of visual representations." ICML 2020.
[3] Balestriero, Randall, and Yann LeCun. "Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods." NeurIPS 2022.
[4] Cozzolino, Davide, et al. "Raising the Bar of AI-generated Image Detection with CLIP." CVPR 2024.
[5] Sha, Zeyang, et al. "De-fake: Detection and attribution of fake images generated by text-to-image generation models." ACM SIGSAC 2023.
Thanks to the authors for their thorough response!
- Text complexity & uniformity: thanks to the authors for helping me interpret the results; my apologies to the authors for my misunderstanding here. This result is rather more interesting than I understood it to be! I do maintain my opinion that "depth over breadth" would be helpful in preventing confusion, however; for example, the lengthier discussion here in the rebuttal could be included in Sec. 4.1, and some of the discussion of uniformity could be deprioritized.
- Additional results: the additional experiments on generated image detection in response to Reviewer 531Y are appreciated. Whilst this is not an experiment they asked for explicitly, I think the authors' response does highlight some additional practical value of the method and outperforms some very recent baselines.
Ultimately, I now change my rating to weakly support the paper (after reading the other reviewers' responses), whilst maintaining that the experimental section needs a rework for the camera-ready version to highlight the strengths of the method.
This paper studies the CLIP features to approximate the likelihood of images and captions. The presented method, Whitened CLIP (W-CLIP), uses an invertible linear operation to convert the CLIP features into a zero-mean, unit standard deviation space. This normalized embedding space can be used in many cases, including detecting artifacts in synthetic images, analyzing the domain drifts for different datasets, and enhancing image manipulation of two images.
Questions for Authors
I am open to discussion and would be happy to reassess my rating during the rebuttal.
Claims and Evidence
This paper is clearly motivated and proposes a simple yet effective method with abundant experiments.
Methods and Evaluation Criteria
Some key results lack quantitative metrics to validate the effectiveness of the proposed method.
- While we have qualitative examples in Fig 2 and Fig 8 showing that W-CLIP can detect artifacts in synthetic images, we don't have a quantitative metric to know how good it is. Moreover, can the W-CLIP embedding be used to assess the aesthetics or quality of images?
- When using full circle SLERP to do image manipulation, we have qualitative examples in Fig 20 and Fig 21, but again, we don't have any large-scale evaluation to show the superiority of W-CLIP compared with CLIP.
Theoretical Claims
This paper doesn't have any theoretical contributions.
Experimental Designs or Analyses
The experimental design is holistic, covering both the image and text domains. However, the paper lacks quantitative evaluation for some experiments (see the Methods and Evaluation Criteria part).
Supplementary Material
The paper provides code and evaluation documentation, but I didn't run the code. I also checked the appendix of the paper.
Relation to Prior Literature
This paper claims to be the first to whiten the CLIP features for image and text likelihood analysis. I haven't tracked this domain closely, so I cannot verify the claim. I'll look at the other reviewers' comments on this.
Essential References Not Discussed
I haven't tracked this domain closely, so I cannot verify their claims here either. I'll look at the other reviewers' comments on this.
Other Strengths and Weaknesses
See other parts.
Other Comments or Suggestions
The figures in this paper are vague and have low resolution. For instance, the subfigures in Figure 4 have small x-y axis labels and legends. Figure 3 is a good example; I highly recommend the authors redraw all the figures in that style.
We appreciate the reviewer's detailed and thoughtful feedback. Below, we address each point raised.
Methods and Evaluation Criteria
First Point, first part
"While we have qualitative examples in Fig 2 and Fig 8 to show that W-CLIP can detect artifacts in synthetic images, we don't have a quantitative metric to know how good it is."
To address this, we conducted a large-scale experiment on zero-shot detection of generated images. Full details are provided in our response to Reviewer 531Y. A summary of the comparative results is given below:
| Method | AUC | AP | F1 | Acc |
|---|---|---|---|---|
| AEROBLADE | 0.52 | 0.48 | 0.64 | 0.53 |
| RIGID | 0.51 | 0.53 | 0.28 | 0.52 |
| ZED | 0.69 | 0.66 | 0.69 | 0.62 |
| Manifold-Bias | 0.85 | 0.88 | 0.76 | 0.78 |
| Ours | 0.89 | 0.89 | 0.82 | 0.81 |
First Point, second part
"Moreover, can W-CLIP embedding be used to tell the aesthetics or quality of images?"
We refer to Fig. 3 (top right), where an ImageNet-C experiment shows that norms of whitened embeddings respond to image corruptions (e.g., impulse noise). Additional results appear in Fig. 10 (Appendix B), covering 12 corruptions including blur, defocus, and low contrast. In all cases, corrupted images exhibit higher norms (in the whitened space) than the original images. As explained in Sec. 4.2 (lines 316–322), higher norms in the whitened space correspond to lower likelihood values, indicating reduced image quality. These results suggest that W-CLIP embeddings capture information related to image quality.
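The monotone relation between norm and likelihood invoked here follows directly from the standard-normal model in the whitened space (the symbols below are ours for illustration: w denotes a whitened embedding of dimension d):

```latex
\log p(\mathbf{w}) \;=\; -\tfrac{d}{2}\log(2\pi) \;-\; \tfrac{1}{2}\lVert \mathbf{w}\rVert^{2},
\qquad\text{so}\qquad
\lVert \mathbf{w}_{\text{corrupted}}\rVert > \lVert \mathbf{w}_{\text{clean}}\rVert
\;\Longleftrightarrow\;
\log p(\mathbf{w}_{\text{corrupted}}) < \log p(\mathbf{w}_{\text{clean}}).
```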
Second Point
"When using full circle SLERP to do image manipulation, we have qualitative examples in Fig 20 and Fig 21, but again, we don't have any large-scale evaluation to show the superiority of W-CLIP compared with CLIP."
To provide quantitative insight, we conducted a full-circle SLERP experiment on the MSCOCO validation set (5k images). For each image, we performed full-circle SLERP in both the CLIP and W-CLIP embedding spaces. In this process, a source image is interpolated toward a destination image along a circular path within the embedding space. Crucially, the image generated at the 180° position from the source—referred to as the “opposite image” (generated from the “opposite embedding”)—is invariant to the chosen destination and determined solely by the source. While other positions along the path are influenced by the destination embedding, the 180° embedding is a fixed, symmetric counterpart.
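For concreteness, here is a minimal sketch of how the full-circle path and its 180° "opposite embedding" could be computed on L2-normalized embeddings; this is an illustrative reconstruction, not the exact implementation:

```python
import numpy as np

def full_circle_slerp(src, dst, angle):
    """Point at `angle` radians along the great circle through src and dst (unit vectors).
    At angle = pi the result is -src, independent of dst: the "opposite embedding"."""
    src = src / np.linalg.norm(src)
    dst = dst / np.linalg.norm(dst)
    # Unit direction orthogonal to src within the plane spanned by src and dst
    ortho = dst - np.dot(src, dst) * src
    ortho = ortho / np.linalg.norm(ortho)
    return np.cos(angle) * src + np.sin(angle) * ortho

# opposite_emb = full_circle_slerp(src_emb, dst_emb, np.pi)  # equals -src_emb up to numerics
```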
We generate these opposite images using both CLIP and W-CLIP embeddings and observe a stark contrast: in the CLIP space, opposite images degrade into structured noise, whereas in the W-CLIP space, they remain visually natural and semantically meaningful, as shown in Figs. 20, 21.
The structured noise produced by CLIP exhibits 4×4 pixel blocks and a restricted palette of fully saturated colors: black (0 in all channels), white (1 in all channels), and pure red, green, blue, magenta, cyan, and yellow (each channel either 0 or 1), suggesting synthetic artifacts. We provide visual examples (20 opposite images) at the following link:
https://drive.google.com/drive/folders/1Q85pz8y-36K2eHXDHsRcciblihvHgwgX?usp=drive_link
To quantify these differences, we compute Total Variation (TV), Entropy, and the percentage of extreme saturation values (top or bottom 1% of the pixel range). All metrics are computed per channel and averaged per image across three sets: original MSCOCO images, CLIP opposites, and W-CLIP opposites. The results are summarized below:
| Method | TV | Entropy | Saturation Values [%] |
|---|---|---|---|
| MSCOCO | 222.3 | 7.3 | 4.2 |
| CLIP Opposite | 156.7 | 4.8 | 55.5 |
| W-CLIP Opposite | 215.9 | 7.2 | 6.4 |
These findings confirm that W-CLIP opposites are statistically similar to natural images, whereas CLIP opposites exhibit significantly reduced entropy and variation and a much higher percentage of saturated values, indicating a lack of natural structure.
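A minimal sketch of how such per-channel statistics could be computed, assuming 8-bit RGB images; the exact normalization conventions are illustrative and may differ from those used for the table above:

```python
import numpy as np

def image_stats(img):
    """Per-channel total variation, histogram entropy, and % of extreme pixel values
    (bottom or top 1% of the 0-255 range), averaged over the channels of an HxWx3 uint8 image."""
    img = img.astype(np.float64)
    tv, ent, sat = [], [], []
    for c in range(img.shape[2]):
        ch = img[:, :, c]
        # Total variation: mean absolute difference between neighboring pixels (both axes)
        tv.append(np.abs(np.diff(ch, axis=0)).mean() + np.abs(np.diff(ch, axis=1)).mean())
        # Shannon entropy of the 256-bin intensity histogram, in bits
        hist, _ = np.histogram(ch, bins=256, range=(0, 256))
        p = hist / hist.sum()
        p = p[p > 0]
        ent.append(-(p * np.log2(p)).sum())
        # Fraction of pixels at the extremes of the pixel range, in percent
        sat.append(((ch <= 255 * 0.01) | (ch >= 255 * 0.99)).mean() * 100)
    return np.mean(tv), np.mean(ent), np.mean(sat)
```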
Other Comments or Suggestions
"The figures in this paper are vague and have low resolution. For instance, subfigures in Figure 4 have small x-y axis labels and legends. Figure 3 is a good example, I highly recommend the authors redraw all the figures like Figure 3."
We thank the reviewer for this helpful suggestion. We will improve the visual quality of all figures in the final version. Specifically, we will enlarge the legends and the axes ticks in Figs. 4, 6, and 7, and reformat histograms for clearer presentation, following the style of Fig. 3.
We hope our additional experiments and clarifications address the concerns raised and further strengthen the validity and impact of our proposed approach. We would appreciate it if the reviewer would consider raising the final rating of the paper.
Hi,
Thanks to the authors for their detailed rebuttal. It has resolved my major concerns about the quantitative metrics of the proposed W-CLIP. I have raised my rating to weak accept.
This paper introduces W-CLIP, a simple way to normalize CLIP embeddings so they can be used like likelihood scores. It’s fast, training-free, and works across tasks like artifact detection and image editing.
After a strong rebuttal with new experiments, including solid zero-shot detection results, most concerns were addressed, and all reviewers lean towards weak accept. The idea is clear and useful, though some analyses could be better motivated. The ACs hope the authors can revise and improve the method based on the feedback from the reviewers.