PaperHub
Overall rating: 5.8 / 10 · Poster · 5 reviewers
Individual ratings: 6, 5, 8, 4, 6 (lowest 4, highest 8, standard deviation 1.3)
Confidence: 3.8 · Correctness: 2.6 · Contribution: 2.6 · Presentation: 3.0
NeurIPS 2024

When does perceptual alignment benefit vision representations?

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

We find that aligning the representations of large vision models to human perceptual judgements improves downstream performance on a variety of tasks.

Abstract

Keywords
representation learning, alignment, perception, transfer learning, computer vision, foundation model

Reviews and Discussion

Official Review
Rating: 6

The paper investigates the benefits of aligning vision model representations with human perceptual judgments to improve their performance across various computer vision tasks. The study fine-tunes state-of-the-art models using a dataset of human similarity judgments and demonstrates that this alignment enhances performance in tasks like semantic segmentation, depth estimation, and instance retrieval. The results suggest that integrating human perceptual knowledge as an inductive bias can improve vision model representations without significantly sacrificing performance on other tasks, including specialized domains like medical imaging and 3D environments.

Strengths

  1. The experimental evaluation is comprehensive. The paper evaluates the impact of human perceptual alignment on a wide range of computer vision tasks, including semantic segmentation, depth estimation, instance retrieval, and counting. The paper also uses multiple state-of-the-art models (e.g., CLIP, DINO, DINOv2, SynCLR) and a detailed experimental setup.

  2. The idea of aligning vision models with human perceptual judgments is reasonable and innovative.

  3. The analysis is insightful. The paper provides a nuanced discussion on the benefits and limitations of human perceptual alignment, offering insights into how different levels of perceptual judgments (low-level, mid-level, high-level) impact model performance on various tasks.

Weaknesses

Writing Part:

  1. There is a question mark in a citation on line 95.
  2. The order of the references is shuffled; reference pages usually come before the supplementary files.
  3. The paper lacks a conclusion section.

Suggestions for Experiments:

  1. More experiments on advanced vision-language models (VLMs) are expected, such as LLaVA, MiniGPT-4, and InstructBLIP.
  2. While not necessarily a weakness, it would be interesting to see the performance improvement on CLIP-blind image pairs, as introduced in the paper "Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs."

Questions

How does the patch-level objective differ from the image-level objective? Does it show different performance on various downstream tasks?

Limitations

Introduced in Section 5.

Author Response

We thank the reviewer for their helpful comments. We are glad that the reviewer found our evaluation to be comprehensive, the paper idea innovative, and the analysis insightful. We address questions and concerns below.

Writing comments

We thank the reviewer for their helpful notes and will fix the citation typos in revision. We intend for the “Discussion” section at the end of the paper to serve as the conclusion, and can revise it to clarify this – and expand upon our conclusions – in revision.

Further experiments on VLMs

We thank the reviewer for the suggestion. In preliminary experiments, we found that LLaVA, MiniGPT-4, and InstructBLIP were not well suited to in-context prompting (potentially due to their instruction tuning and optimization for other tasks). However, we replicated our RAG experiment on IDEFICS2, a recently released 8B multimodal model achieving state-of-the-art results across several multimodal benchmarks [1]. Our results are shown in Fig. 1 of the PDF attached to our global response. We find that across the same four datasets evaluated in Section 4.2 of the paper, performance consistently improves when using NIGHTS-tuned models in the RAG pipeline. The exceptions are DINOv2 with the Diabetic Retinopathy dataset and DINO/DINOv2 for the SVHN dataset. Overall, these results support and validate our results on OpenFlamingo; we will add them to our revision.
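For readers unfamiliar with the setup, the sketch below illustrates the general shape of such a retrieval-augmented classification pipeline: a (NIGHTS-tuned) vision encoder retrieves the most similar labelled images, which are then given to a multimodal model as in-context examples. The function names, prompt format, and `vlm_generate` call are hypothetical placeholders, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical names: encode_image() stands for any (NIGHTS-tuned) vision encoder
# returning a global embedding, and vlm_generate() for any multimodal LLM call
# (e.g., an IDEFICS2-style model). Neither comes from the paper's released code.

def retrieve_examples(query_emb, support_embs, support_labels, k=3):
    """Return (index, label) pairs for the k support images most similar to the query."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), support_embs)  # (N,)
    topk = sims.topk(k).indices
    return [(int(i), support_labels[int(i)]) for i in topk]

def rag_classify(query_image, support_images, support_labels, encode_image, vlm_generate, k=3):
    query_emb = encode_image(query_image)                                   # (d,)
    support_embs = torch.stack([encode_image(x) for x in support_images])   # (N, d)
    examples = retrieve_examples(query_emb, support_embs, support_labels, k)
    # Retrieved labelled images become in-context examples, followed by the query image.
    prompt_images = [support_images[i] for i, _ in examples] + [query_image]
    prompt_text = "".join(f"<image> Label: {lbl}\n" for _, lbl in examples) + "<image> Label:"
    return vlm_generate(images=prompt_images, text=prompt_text)
```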

CLIP-Blind image pairs

We thank the reviewer for suggesting the MMVP-VLM benchmark [2]. Unfortunately, evaluating on this benchmark requires computing the similarity of text-image pairs. We only fine-tuned the CLIP vision encoder; thus it no longer shares the same feature space as the text encoder, making it difficult to fairly run a multimodal evaluation and interpret any positive or negative results.

Clarification on patch-level objective

We provide further details and intuition regarding patch-level training in section 2 of the global response. Below we address specific questions:

How does the patch-level objective differ from the image-level objective? Does it show different performance on various downstream tasks?

Finetuned patch tokens were necessary to evaluate NIGHTS fine-tuning on dense tasks (depth estimation, segmentation). In preliminary experiments, we found that the image-level objective did not induce any significant changes in the patch tokens. The patch-level objective is functionally the same as the image-level objective, the only difference being that it propagates the similarity annotations more directly to the patch tokens.

Note that only the fine-tuned patch tokens are needed for segmentation/depth-estimation, and only the CLS token for image-level tasks (all others).

[1] Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? arXiv preprint arXiv:2405.02246, 2024.

[2] S. Tong, Z. Liu, Y. Zhai, Y. Ma, Y. LeCun, and S. Xie, “Eyes wide shut? exploring the visual shortcomings of multimodal llms,” CoRR, vol. abs/2401.06209, 2024.

Comment

Thank you for the response. I will maintain my score at this stage.

Comment

Dear Reviewer VQay,

We appreciate your positive feedback and are very grateful for your suggestions and questions that have made our paper stronger. Thank you once again for dedicating your time and effort to reviewing our work and providing us with insightful suggestions!

Official Review
Rating: 5

This paper investigates the effects of aligning pretrained vision models to human judgments. Image-level and patch-level learning objectives are proposed for fine-tuning the pretrained models. The experimental results demonstrate that fine-tuning pretrained models (such as CLIP, DINO, and DINOv2) on the NIGHTS dataset leads to performance improvements in various downstream tasks, including semantic segmentation, depth estimation, retrieval-augmented generation (RAG), counting, and instance retrieval. An ablation experiment on the fine-tuning dataset is conducted to demonstrate the perceptual qualities embedded in the NIGHTS dataset.

Strengths

  1. The proposed perceptual alignment process is simple yet effective, leading to performance improvements across various downstream tasks.
  2. This paper discusses the effects of aligning vision models using datasets with different levels of variation, providing empirical evidence for further research.

Weaknesses

  1. This paper demonstrates the effects of perceptual alignment by showing the performance improvement in downstream tasks. However, the presented evidence cannot fully explain "how" perceptual alignment affects model features as general-purpose representations. Specifically, it is unclear what the difference is between the features before and after perceptual alignment, and why such differences lead to performance improvements.

  2. This paper shows that aligning vision models using a dataset with mid-level variations leads to a better “general-purpose representation”. However, the reasons why using datasets with such mid-level variations (instead of datasets with high or low levels of variation) for alignment results in better representations remain unclear.

  3. Experiments in this paper are conducted using ViT-based models. However, the effects of perceptual alignment for CNN-based models are not explored.

  4. This paper compares the effects of aligning vision models using datasets with low, mid, and high levels of variations on counting and instance retrieval tasks. However, to better demonstrate the high perceptual qualities embedded in the NIGHTS dataset, comparisons should be conducted in a wider range of downstream tasks.

Questions

Please refer to the weaknesses.

Limitations

The authors discuss some of the limitations of their work in Section 5. But I would like them to consider some of my concerns above.

Author Response

We thank the reviewer for their helpful comments. We address questions and concerns below.

How does perceptual alignment affect model features?

We emphasize that this paper aims to answer this question in terms of the competency of representations. There is rich precedent for understanding and comparing representations via their competencies at downstream tasks [1-2]. Evaluating general-purpose features on transfer tasks – particularly with simple methods such as KNNs and linear probes – allows us to quantify what the features capture and make decodable.

We further flesh out our empirical evaluation in the global response, section 1.1, where we additionally compare different levels and types of perceptual alignment across segmentation, depth estimation, and classification. We hope that our additional evaluations provide insight into which datasets and tasks are represented in a useful way by finetuned models. We also provide further discussion of the learned feature space in section 3.

We finally note that while we probe representations in terms of competency, understanding them in terms of their mechanism is an important direction for future work.

Why is mid-level alignment better than other levels?

We address this question directly in our global response, section 3. Please let us know if further detail is needed, or there are follow-ups.

Perceptual alignment for CNNs

We thank the reviewer for their suggestion. We assess the effects of perceptual alignment with NIGHTS on two popular CNNs: ResNet50 and ConvNeXt-B, using the same loss described in the paper*. We evaluate on counting and instance retrieval tasks; our results are shown below:

Counting (Clevr-Count):

| Model | RMSE | MAE |
|---|---|---|
| ConvNeXt | 2.045 | 1.522 |
| ConvNeXt-HA | 1.631 | 1.193 |
| ResNet | 3.140 | 2.551 |
| ResNet-HA | 1.729 | 1.282 |

Instance Retrieval (DF2) **:

| Model | Top-1 | Top-3 | Top-5 |
|---|---|---|---|
| ConvNeXt | 2.12 | 3.56 | 4.59 |
| ConvNeXt-HA | 2.81 | 4.8 | 5.98 |
| ResNet | 0.018 | 0.12 | 0.14 |
| ResNet-HA | 0.018 | 0.074 | 0.17 |

*Due to time constraints we were unable to implement LoRA tuning for the CNNs, and instead trained MLPs on top of the final-layer features. Previous work [3] indicated that training on NIGHTS using MLPs can steer alignment in the right direction but may be less effective; nonetheless we still see downstream improvements using this method.

**We report results for both ConvNeXt and ResNet, however acknowledge that the accuracy numbers for ResNet are likely too small to draw conclusions from.

Comparisons against other perceptual datasets.

We thank the reviewer for their suggestion. In addition to our comparison on counting and instance retrieval, we further evaluate models trained on BAPPS, THINGS, and ImageNet triplets on a wider range of downstream tasks: segmentation, depth estimation, and a selection of natural/specialized/structured datasets from VTAB. Our full results can be found in the global response, section 1.1, and we will include these in our revision.

[1] Yonglong Tian, Lijie Fan, Kaifeng Chen, Dina Katabi, Dilip Krishnan, and Phillip Isola. Learning vision from models rivals learning vision from data. ArXiv, abs/2312.17742, 2023.

[2] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

[3] Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data. In NeurIPS, 2023

Comment

Dear reviewer, The author-reviewer interaction period has started. Please read the responses provided by the authors, respond to them early on in the discussion, and discuss points of disagreement. Thanks

Comment

Thanks for your feedback. The hypotheses outlined in Section 3 are interesting and have partly addressed my concerns regarding "why mid-level alignment can lead to better general-purpose representations". I will raise my score accordingly.

Comment

Dear Reviewer rQrx,

We appreciate your positive feedback and are truly delighted to see that our response has addressed your concerns and questions.

Thank you once again for dedicating your time and effort to reviewing our work and providing us with insightful suggestions!

Official Review
Rating: 8

As the title clearly indicates, this article investigates how aligning vision model representations to human perceptual judgments impacts the performance of models relying on these representations for downstream tasks such as dense prediction (semantic segmentation and depth estimation), retrieval-augmented generation, object counting and instance retrieval. Given a vision model backbone, it is finetuned with LoRA on the NIGHTS dataset, which contains human similarity judgments over synthetically-generated image triplets (NeurIPS 2023). For each considered downstream task, several backbones are compared to their "human-aligned version" with globally better results when the backbone is fine-tuned on human perception.

优点

  • the scientific question raised by the paper is simple but interesting, and the study shows it is worth investigating. The problem is clearly stated and introduced, the method and protocol are well explained, and the article is well written overall. This bullet point is not just a false pretext to artificially add a "strength" to the review, it is a real pleasure to read.

  • the method to fine-tune the backbones relies on well-known and standard methods and tools (visual transformer, triplet loss...), which leads to a convincing methodology. The reader is not confused by an incomprehensible labyrinthine system and can therefore fully concentrate on the results of the study.

  • the experimental part is impressive, with an evaluation of a large variety of tasks, with various backbones as well. This is obviously a strong part of the paper: although the original idea was simple and interesting, the general quality of the study is finally supported by all these experiments. The supplementary material and the code (zip file) allow a precise investigation of the experiments and ensure reproducibility.

  • Section 4.5 investigates the use of alternative datasets to fine-tune the model. Some were proposed by previous works and others were built by the author to test a specific question. All of them have a size similar to NIGHTS. The resulting analysis is interesting and is a nice complement to the study.

Weaknesses

  • the alignment is performed both at the image level (section 3.2) and patch level (section 3.3), and the authors claim (lines 164-166) that "Additionally, we find that local patch-level representations can be improved by tuning on image-level perceptual judgments, and show performance increases on semantic segmentation and depth estimation". However, in the following, all experiments only compare a backbone with its "human aligned version" (HA), and it is not clear what the contribution of the alignment is at the global and local levels. On line 148, the text says that the protocol is to "train heads for semantic segmentation and depth estimation" [at the local level]. Globally, the manuscript could be more explicit on these "details": is the local patch level used for training for segmentation and depth estimation only, or is it the same training protocol (with global and local levels) for all experiments? If so, what is the contribution of adding (or not) the heads that are trained at a local level?

  • one may regret that the negative result on VTAB is not reported in the main paper and is only discussed in the Limitations without further investigation. However, one must recognize that the paper already contains significant experimental work that supports the main demonstration.

  • the fact that the same backbones are used in several experiments/tasks is interesting but sometimes it leads to models that are very weak baselines in comparison to the state of the art. In particular, in Section 4.3, specialized class-agnostic models on few-shot counting have much better performance than those reported (less than 0.15 MAE/RMSE on FSC147 for SafeCount or CounTR)

minor

  • the experimental protocol for "object counting" reports that k is determined on the training set among 4 values, then the best one is used on the test set. Reporting the performance for each k on the training set, or at least the value that was retained for the test set, would be useful.
  • line 95: a reference is missing

Questions

  • is the local patch level used for training for segmentation and depth estimation only or is it the same training protocol (with global and local levels) for all experiments? If so, what is the contribution of adding (or not) the heads that are trained at a local level?

  • will the code (zip file) be released with the paper?

Limitations

An experiment on the VTAB benchmark, whose results are reported in the supplementary material, leads to lower performance with fine-tuned representations. This case is significantly discussed in Section 5, with several hypotheses proposed to explain the phenomenon.

Author Response

We thank the reviewer for their helpful comments. We are glad that the reviewer finds the method convincing, the paper well-written, and the experiments comprehensive. We address specific concerns and questions below.

Global v. local representations

We provide details regarding training patch tokens in the global response, Section 4 of the paper, and further clarifications below.

Global representations refer to the output embedding for CLIP/OpenCLIP, and the CLS token for all other backbones. These alone are used for counting, instance retrieval, RAG, and classification, as they contain information across the image. In these global tasks, we either compute k-Nearest Neighbors over these global representations, or train linear probes. Local features are used for dense tasks (segmentation, depth estimation); good spatial features are necessary, as the output depends on per-pixel labels rather than image-level labels. Following the procedures from [1], we train segmentation and depth estimation heads on top of only the local features.
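As a concrete illustration of the "global" evaluation protocol mentioned above, here is a minimal linear-probe sketch over frozen, precomputed global embeddings. The hyperparameters and full-batch training loop are illustrative only, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def train_linear_probe(feats, labels, num_classes, epochs=100, lr=1e-2):
    """Fit a linear classifier on frozen, precomputed global embeddings
    (CLS tokens, or the output embedding for CLIP/OpenCLIP)."""
    probe = nn.Linear(feats.shape[1], num_classes)
    opt = torch.optim.AdamW(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(feats), labels)  # full-batch for simplicity
        loss.backward()
        opt.step()
    return probe
```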

We thank the reviewer for raising these questions, and will clarify our methodology in our revision.

VTAB results

We appreciate that the reviewer brought up this concern yet recognized the significant coverage of experiments in the main paper. Our main contribution is to empirically identify how tuning on perceptual datasets impacts transfer performance; for a thorough investigation, this includes negative impact. We acknowledge, however, that readers may be most directly interested in where this alignment succeeds, and structure the paper to highlight these applications while retaining all our findings.

Comparison to state-of-the-art task-specific models

As the reviewer notes, we use the same backbones across all experiments and tasks. This was done to evaluate general-purpose representations, and we acknowledge that these may be outperformed in cases such as counting by task-specific models. We do not aim to achieve state-of-the-art performance on any single task; rather, we demonstrate how a representation becomes comparatively better (or worse) over an array of multiple tasks. By evaluating the competency of a single representation over multiple tasks – which would not be possible with task-specific models – we gain insight into how general-purpose feature spaces are affected by alignment.

Counting experiment parameters

Below we report the training performance for each value of k in the counting task.

DINO:

| k | Acc. | RMSE | MAE |
|---|---|---|---|
| 1 | 0.267 | 1.749 | 1.313 |
| 3 | 0.284 | 1.753 | 1.292 |
| 5 | 0.285 | 1.692 | 1.255 |
| 10 | 0.277 | 1.742 | 1.293 |

DINO-HA:

| k | Acc. | RMSE | MAE |
|---|---|---|---|
| 1 | 0.271 | 1.706 | 1.331 |
| 3 | 0.288 | 1.701 | 1.284 |
| 5 | 0.289 | 1.694 | 1.251 |
| 10 | 0.290 | 1.708 | 1.261 |
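For readers who want the selection protocol spelled out, below is a minimal sketch of how k could be chosen: evaluate each candidate on held-out training data and fix the best one for the test set. The kNN-with-mean aggregation and the helper names are our assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_count(train_feats, train_counts, query_feats, k):
    """Predict a count for each query embedding from its k nearest training
    embeddings (cosine similarity); aggregation by mean is an assumption here."""
    sims = F.normalize(query_feats, dim=-1) @ F.normalize(train_feats, dim=-1).T
    nn_idx = sims.topk(k, dim=-1).indices                 # (n_query, k)
    return train_counts[nn_idx].float().mean(dim=-1).round()

@torch.no_grad()
def select_k(train_feats, train_counts, val_feats, val_counts, candidates=(1, 3, 5, 10)):
    # Keep the k with the lowest MAE on held-out data; that k is then fixed for the test set.
    maes = {k: (knn_count(train_feats, train_counts, val_feats, k) - val_counts.float())
               .abs().mean().item()
            for k in candidates}
    best_k = min(maes, key=maes.get)
    return best_k, maes
```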

Code release

Our code will be fully open-sourced along with the release of our paper.

We also appreciate the reviewer’s note of a missing reference and will address this in revision.

[1] Oquab, Maxime, et al. "Dinov2: Learning robust visual features without supervision." arXiv preprint arXiv:2304.07193 (2023).

Comment

I acknowledge the authors for their feedback and encourage them to include in their camera ready version as much as possible clarification provided in the rebuttal.

I also read the other reviews and I note that my perception is significantly more positive than the average other reviews. Nevertheless, the weaknesses raised do not seem sufficiently convincing to me, and I therefore maintain that this article deserves to be brought to the attention of the community.

Comment

Dear Reviewer 9gB3,

Thank you for your positive feedback throughout the entire reviewing process; we are truly delighted to see appreciation for this line of work. Thank you once again for dedicating your time and effort to reviewing our work and providing us with insightful suggestions!

Official Review
Rating: 4

This paper aligns representations with human perception on mid-level semantics to improve performance on various downstream tasks. Specifically, it does this by pre-training models on additional synthetic image triplets, where the visual similarity within each triplet is annotated by human subjects.

Strengths

  • The dataset this paper used for additional pre-training is rather small compared to standard datasets, yet the authors show that there is still a noticeable improvement on multiple tasks.

Weaknesses

  • Lacks general instructions on when aligning with human perception would benefit representation learning. From the title, I would expect a series of studies providing insights into which levels of alignment benefit or impair different types of tasks. The empirical study in its current version is not comprehensive enough to compensate for the lack of theoretical novelty compared to previous work [1].

    [1] "Improving neural network representations using human similarity judgment," Oh et al., NeurIPS 2023

  • It would be better if the authors could provide some empirical analysis to give insights into why mid-level alignment is better than other levels.

Questions

  • Regarding dense prediction tasks, there are cases where pre-training with human alignment impairs performance, e.g. DINOv2(-HA) on VOC, etc. The authors stated that DINOv2 has seen these data, and considering that DINOv2-HA has also seen these data, how does this explain the performance drop? Since the only difference in the dataset for models with or without HA should be NIGHTS, this needs clarification.

  • It seems odd that the performance change from using ImageNet and THINGS for pre-training is reversed in Figure 8 for the counting task, and pre-training on THINGS looks extremely bad for retrieval tasks. Can the authors provide an explanation for this phenomenon?

  • Compared with previous work [1], the authors use additional patch-level alignment during pre-training. Isn't there some redundancy between the average pooling of patches and the [CLS] token? How does patch-level alignment affect performance? Additionally, since some patches from negative samples and reference images are similar, it seems counterintuitive to push them away in the embedding space.

    [1] "Improving neural network representations using human similarity judgment," Oh et al., NeurIPS 2023

Limitations

See above

Author Response

We thank the reviewer for their helpful comments. We address questions and concerns below.

Which levels of alignment benefit or impair different tasks?

To strengthen our empirical results, we extend our comparison of different “levels” of alignment (i.e. fine-tuning on THINGS, BAPPS, ImageNet) to the majority of tasks evaluated in the paper: counting, retrieval, segmentation, depth estimation, and a selection of natural/specialized/structured datasets from VTAB. We refer the reviewer to section 1.1 of the global response for our new results and observations. We also examine how the strength of alignment (i.e. number of training steps) impacts performance, in section 1.2 of the global response.

Taken together with our paper, we glean the following insights:

  • Finetuning on NIGHTS generally benefits representations over base models on object counting, segmentation, depth estimation, instance retrieval, RAG, and some structured classification datasets (such as smallnORB).
  • Fine-tuning on perceptual datasets that are solely low-level (BAPPS) or high-level (THINGS) typically performs worse than fine-tuning on NIGHTS, and in many cases worse than the base models themselves. This demonstrates that representation quality depends on the type of perceptual data, and in fact some forms of perceptual alignment can harm representations.
  • Perceptual alignment impairs performance on most natural classification datasets, particularly fine-grained. This phenomenon is consistent across perceptual datasets; we discuss potential reasons in section 6 of the paper. Low-level alignment (BAPPS) seems to best preserve base model performance in this case.
  • A small amount of alignment may yield the most benefits; training for >1000 timesteps appears to cause a decline in retrieval performance.

Why is mid-level alignment better than other levels?

We address this question directly in our global response, section 3. Please let us know if further detail or follow-up is needed.

Performance drop on dense tasks.

The reviewer is correct that NIGHTS is the only difference for models with/without HA. We flag downstream datasets that the pretrained model has seen because if a dataset is already in-distribution for a backbone, fine-tuning on different data may be more likely to change the feature space such that that dataset is more out-of-distribution. We do not see this as the sole explanation for the results, simply a possible factor. We will clarify this in our revision.

Why does THINGS hurt retrieval?

The triplet choices in THINGS reflect coarse-grained/high-level semantic similarity, i.e., the concepts that humans use for judging object similarity [1, 2]. Thus, THINGS is ill-suited to retrieval tasks (of which counting is one) because they rely on the local/fine-grained similarity structure of representations, rather than the global/coarse-grained similarity structure. Muttenthaler et al. [3] found that learning a linear transform to match the THINGS triplet odd-one-out choices on top of ImageNet-trained models deteriorates downstream performance for tasks where local similarity structure is important, such as few-shot learning on single domain datasets. The authors observed that changing a model’s representation space without preserving the local similarity structure of the pretrained representations decreased few-shot learning performance, and harmed their nearest neighbor structures.

Fu et al. [4] further found that fine-tuning vision models on NIGHTS, which better reflects human local similarity structure, leads to decreased performance on THINGS. This indicates that the concepts captured in NIGHTS – which are helpful for retrieval – are different from those useful for THINGS.

Questions on patch-level training

We provide further details and intuition regarding patch-level training in section 2 of the global response. Below we address specific questions:

Isn’t there redundancy between the average pooling of patches and the CLS token?

While there is some redundancy among patch and CLS features, they do not encode identical information as the former is a uniform pooling over the image space, and the latter does not have such a constraint (previously, Zhai et al. [5] train the SigLIP model and use attention-pooled patch tokens as their global embedding instead of using a separate CLS token). In our work, we take the simplest approach to include all feature information available to tune on, and we find it effective for dense prediction tasks.

Since some patches from negative samples and reference images are similar, it seems counterintuitive to push them away in the embedding space.

We agree with the reviewer; this was our intuition for supervising on the average-pooled patches, rather than using a dense contrastive loss. Average pooling allows us to supervise on a global representation, but also propagates that supervision to the patch tokens, enabling evaluation on dense tasks. In preliminary experiments, we found that patch-level alignment without pooling led to poor results, likely for this reason.

[1] Hebart, M.N., Zheng, C.Y., Pereira, F. et al. Revealing the multidimensional mental representations of natural objects underlying human similarity judgements. Nat Hum Behav 4, 1173–1185 (2020).

[2] Muttenthaler, L., Dippel, J., Linhardt, L., Vandermeulen, R. A., & Kornblith, S. (2022). Human alignment of neural network representations. In ICLR 2023.

[3] Muttenthaler, L., Linhardt, L., Dippel, J., Vandermeulen, R. A., Hermann, K., Lampinen, A., & Kornblith, S. Improving neural network representations using human similarity judgments. In NeurIPS, 2023.

[4] Fu, S., Tamir, N., Sundaram, S., Chai, L., Zhang, R., Dekel, Tali., and Isola, P. Dreamsim: Learning new dimensions of human visual similarity using synthetic data. In NeurIPS, 2023.

[5] Zhai, Xiaohua, et al. "Sigmoid loss for language image pre-training." In ICCV, 2023.

Comment

Dear reviewer, The author-reviewer interaction period has started. Please read the responses provided by the authors, respond to them early on in the discussion, and discuss points of disagreement. Thanks

Comment

Dear reviewer,

The author-reviewer interaction period is about to end. Please read the responses provided by the authors, respond to them early on in the discussion, and discuss points of disagreement. Thanks

Comment

Dear Reviewer LKEh,

Thank you again for your detailed review! We have presented various results and clarifications in the rebuttals and global response, which we believe address your questions and concerns. In the remaining short time, we would be very happy to alleviate any of the remaining questions or concerns you still have about our paper.

Thank you once again for dedicating your time and effort to reviewing our work and providing us with insightful suggestions!

Official Review
Rating: 6

This paper studies the question of when perceptual alignment supervision makes better vision representations. They fine-tune existing pre-trained vision backbones (e.g., CLIP and DINO) with LoRA on NIGHTS, a synthetic triplet-based dataset of human similarity judgments. Models fine-tuned in this way generally perform better on vision tasks ranging from dense prediction to RAG, counting, and object retrieval. However, the results can be the opposite when evaluating classification tasks, especially on natural classes. They also considered fine-tuning on different types of human preference datasets, and found mid-level supervision to be more beneficial.

Strengths

  1. The tasks considered in this paper for evaluation are comprehensive and cover many different properties of the representation.
  2. The delivery of this paper is clear and easy to follow. Key motivation, setting, and conclusions of each part are easy to spot.
  3. Evaluating human preference for vision representations is an emerging topic and interesting for research.

Weaknesses

  1. My major concern is that although the topic is somehow interesting and the evaluation is comprehensive, many conclusions are expected and unsurprising. Provided in the form of triplets, the supervision of human preference is just another form of "classes", being very fine-grained and having flexible class definitions. This is similar to the learning signal of self-supervised contrastive learning, with a better form of data augmentation. In this regard, it is expected that human preference makes generally better representations.
  2. Regarding the performance drop in fine-grained classification (especially natural classes), this might be a result of the domain gap. Given that the model is fine-tuned on all synthetic images, it is unsurprising that discriminating natural images is harder afterward. The definition of human preference can be ambiguous, and the dataset for fine-tuning can have distracting factors (eg, synthetic distribution). These factors are not tackled well in the design of comparisons and could make some conclusions unreliable.

Questions

I still find this paper could spark some interesting questions. More investigations on these aspects might increase the significance of this paper.

  1. When can human preference be harmful? One motivation of this paper is to provide some empirical guidance for future vision research, if human preference is introduced as the way in the language community. As I mentioned above, the benefit of it to vision is not very surprising, but the prevention of possible risks could be valuable. This paper has shown some and more in this direction might be more valuable than the benefits (as most improvements are just marginal).
  2. How should human preference be better defined and categorized? A precise definition of human preference is hard to obtain, but important. This paper has considered the level of supervision, and a discussion of other aspects could help readers.
  3. What distracting factors exist in current human preference datasets and how to isolate the effects of them? This is important to ensure reliable conclusions.

Limitations

Limitations are discussed in the paper. Suggestions are flagged above.

Author Response

We thank the reviewer for their helpful comments and appreciate that they found the evaluation comprehensive, the paper clear, and the topic interesting. We address questions/concerns below.

“It is expected that human preference makes generally better representations.”

We broadly agree with the reviewer that human preference is a form of fine-grained “classes”, with flexible class definitions, depending on the level/type of similarity judgments. Our key contribution is to elucidate how these classes should be defined to benefit representations: we find that the particular preferences in NIGHTS improve performance across many tasks.

A key finding, in fact, is that not all human preferences are always better. In the global response section 1.1, and section 4.5 of the paper, we compare finetuning on multiple types of human preferences (both high- and low-level). Finetuning on NIGHTS outperforms other perceptual datasets (BAPPS, THINGS) across segmentation, depth estimation, retrieval, and counting. Finetuning on triplets formulated without any human preferences (from ImageNet) often outperforms models fine-tuned on BAPPS/THINGS, showing that some human preferences harm representations.

Synthetic-Real domain gap

We briefly discuss the synthetic-real domain gap in section 4.5. We ablate the use of synthetic images in NIGHTS by including SynCLR in our experiments, which was pre-trained solely on generated images. SynCLR exhibits the same performance drops across fine-grained/natural datasets as real-trained backbones, indicating that they must result from the perceptual alignment. We will clarify this in our revision.

When can human preference be harmful?

As detailed further in the next section, “human preference” lacks a rigorous definition in vision and language literatures; it can include aesthetic preferences and other harmful types of annotations. The scope of our work is to evaluate human perceptual alignment, which extracts human preference at a psychophysical level rather than a more cognitively-penetrable level. Nevertheless, we empirically observe several cases in which human preferences may harm representations:

  • In the global response, section 1.2, we show that overfitting to perceptual datasets harms performance.
  • In our dataset ablations (global response section 1.1, paper section 4.5) we find that finetuning on perceptual datasets with solely high-level (THINGS) or low-level (BAPPS) variations hurts performance for many tasks.
  • Finetuning on NIGHTS seems to degrade how well representations distinguish between distinct fine-grained categories that are very visually similar (paper section 6.1). As mentioned above, due to our evaluation of the synthetically-pretrained SynCLR, we attribute the performance drop to perceptual alignment.

There are other possibilities for harm outside the scope of the type of perceptual annotations we study:

  • Human preferences may reflect unwanted biases. A long-standing problem in both language and vision is that biases (e.g. gender, racial) reflected in Internet language/images are inherited by large models, and reflected in their embeddings. One can imagine similar phenomena in visual preferences [3].
  • Humans may disagree on a preference label, or even disagree with themselves if asked at different times. This may lead to noisy data if not filtered carefully, thus harming representations [1].
  • Without sufficient demographic diversity in the annotator group, emerging biases may be reflected in the model. For example, some RLHF-trained language models have been shown to develop a bias towards the opinions of high-income, liberal individuals over their non-RLHF counterparts. [2]

How should human preference be better defined and categorized?

We thank the reviewer for pointing this out; it is important to define human preferences in vision. In the language space, human preferences largely refer to ethical judgments, or aspects of the user interaction experience. In vision, we define preferences as concepts (or categories) that humans use for making image similarity judgments [1]. By training on these preferences, we make models learn a concept space that is aligned with the concepts that humans use to navigate the visual world. We will clarify this in our paper. In addition, we will refer the reader to a recent review paper, “Getting aligned on representational alignment” [1], that clarifies definitions related to aligning vision representations with human judgments.

We also note that aesthetic judgments (i.e. concepts that humans use to determine visual appeal) have been used to improve diffusion models [3]. We consider this a separate category of annotation; however, we can refer the reader to relevant works.

Distracting factors in human preference datasets

The signal-to-noise ratio of human cognitive data (be it behavioral or neural) is often low. Thus, it is important to perform a denoising step before using the data downstream. For behavioral data this can be achieved by collecting a large number of judgments and filtering out low-quality judgments, or by applying a strong regularization technique. Distracting factors may relate to long response times or other confounding factors. Isolating such effects could be achieved by showing human participants triplets of different kinds, with varying background or object complexity, where complexity can be the visual richness of the scene(s) or the number of objects.
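As one concrete, hypothetical instance of such a denoising step, triplets can be filtered by annotator agreement; the thresholds below are illustrative, not values used by any of the datasets discussed.

```python
from collections import Counter

def filter_triplets_by_agreement(votes_per_triplet, min_votes=5, min_agreement=0.8):
    """votes_per_triplet: dict mapping a triplet id to the list of per-annotator
    choices (e.g. 'A' or 'B' for which distortion looks closer to the reference).
    Keep only triplets with enough votes and a sufficiently dominant majority label."""
    kept = {}
    for tid, votes in votes_per_triplet.items():
        if len(votes) < min_votes:
            continue
        label, count = Counter(votes).most_common(1)[0]
        if count / len(votes) >= min_agreement:
            kept[tid] = label
    return kept
```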

[1] I. Sucholutsky, L. Muttenthaler, A. Weller, A. Peng, A. Bobu, B. Kim, B. C. Love, E. Grant, J. Achterberg, J. B. Tenenbaum, et al. Getting aligned on representational alignment. arXiv, 2023.

[2] Santurkar, S., Durmus, E., Ladhak, F., Lee, C., Liang, P. and Hashimoto, T.. Whose opinions do language models reflect?. In ICML, 2023.

[3] Y. Kirstain, A. Polyak, U. Singer, S. Matiana, J. Penna, and O. Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. 2023.

Comment

Dear reviewer, The author-reviewer interaction period has started. Please read the responses provided by the authors, respond to them early on in the discussion, and discuss points of disagreement. Thanks

Comment

Thanks to the authors for the detailed response. I found them profound, and they have substantially addressed my concerns and questions. I hope the additional discussions (including the general response) can be reflected in the next version of this paper. Overall I think the contents of this paper should be shared with the community, and I'll raise my score accordingly.

Comment

Dear Reviewer kSUr,

We appreciate your positive feedback and are truly delighted to see that our response has addressed your concerns and questions.

Thank you once again for dedicating your time and effort to reviewing our work and providing us with insightful suggestions!

Author Response

We thank all reviewers for their insightful questions and feedback. We are glad they found:

  • The paper is clear, well-written, and easy to follow. [kSUr, 9gB3]
  • The experiments are comprehensive. [kSUr, 9gB3, VQay, rQrx]
  • The analysis is insightful and interesting. [9gB3, VQay]
  • The paper studies a simple but interesting scientific question worth investigating [kSUr, 9gB3]; the key method is convincing and effective [9gB3, rQrx, VQay]

We present key results and details here, and respond to individual questions in reviewer-specific responses.

1. What are the benefits/drawbacks of different levels of perceptual alignment?

1.1 Ablating perceptual datasets

We extend our dataset ablation to our full range of tasks. We LoRA-tune DINO on triplets from BAPPS, THINGS, and ImageNet, using the procedure from Section 4.5. We emphasize that fine-tuning on these datasets ablates the level and type of perceptual alignment. BAPPS contains judgments of low-level distortions, THINGS contains semantic-level distortions, and ImageNet contains no perceptual judgments, instead grouping images by semantic category.
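For concreteness, here is a minimal sketch of the kind of two-alternative-forced-choice objective used to align a backbone to triplet judgments (a DreamSim-style hinge on the cosine-similarity gap). The margin value and function name are assumptions rather than the paper's verbatim implementation, and the LoRA wrapping of the backbone is omitted.

```python
import torch
import torch.nn.functional as F

def perceptual_alignment_loss(ref_emb, emb_a, emb_b, human_choice, margin=0.05):
    """ref_emb, emb_a, emb_b: (B, d) embeddings of each reference image and its two
    distortions; human_choice: (B,) with +1 if annotators judged A closer to the
    reference and -1 if B. A hinge on the cosine-similarity gap pushes the
    embedding space to agree with the human votes."""
    sim_a = F.cosine_similarity(ref_emb, emb_a, dim=-1)
    sim_b = F.cosine_similarity(ref_emb, emb_b, dim=-1)
    agreement = human_choice * (sim_a - sim_b)   # positive when the model agrees with humans
    return F.relu(margin - agreement).mean()
```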

We evaluate models tuned on these datasets on segmentation, depth estimation, and a VTAB subset. Results are in Fig. 2-4 of the attached PDF. We observe:

  • For all dense tasks, training on NIGHTS outperforms all other datasets. The dataset ranking varies across evaluations, but NIGHTS-tuned models consistently transfer best.
  • For many tasks, training on ImageNet triplets, or even the base model, outperforms THINGS or BAPPS. This indicates that the types of perceptual judgments in NIGHTS specifically are helpful, whereas training solely on low-level or high-level variations may harm representations.
  • On VTAB datasets, base models perform best. The exception is sNORB (pose prediction), for which NIGHTS is best. Amongst perceptual datasets, NIGHTS is sometimes outperformed by BAPPS/ImageNet.

1.2 Ablating training time

We further evaluate how the strength of alignment – i.e. training loss when finetuning on perceptual datasets – impacts features. We evaluate DINO checkpoints tuned for increasing # steps on NIGHTS/BAPPS/THINGS/ImageNet on instance retrieval. Results are in Fig. 5 of the PDF. We observe:

  • Tuning on NIGHTS outperforms other datasets over the full training trajectory.
  • Performance rises significantly with a small amount of alignment to NIGHTS, however it consistently trends down after >1000 steps. This indicates that a small amount of alignment is helpful for the downstream task, however overfitting to human preferences may be harmful.

2. Clarification on patch token training

For dense prediction, we modify the original training objective for stronger supervision on patch tokens, which are output alongside the CLS token and are the basis for our segmentation and depth maps. Although supervising only on the CLS token can still modify patch tokens (as tuning model weights affects all feature outputs), our initial experiments showed that the CLS-based objective was too global to change the patch tokens in any impactful way. Thus, we switch to a loss more directly tied to local features.

In more detail: our local objective only differs from the global objective in how the feature is extracted: instead of computing $L(\mathrm{CLS}_A, \mathrm{CLS}_B)$, we compute $L(\mathrm{cat}[\mathrm{CLS}_A, \mathrm{pool}(\mathrm{PATCH}_A)],\ \mathrm{cat}[\mathrm{CLS}_B, \mathrm{pool}(\mathrm{PATCH}_B)])$. $\mathrm{CLS}$ is of dimension $(1, d)$, $\mathrm{PATCH}$ is $(s, s, d)$ where $s$ is the number of patches along each spatial dimension, and we spatially average the patch tokens to get dimension $(1, d)$. We then concatenate the CLS and pooled patch tokens to get dimension $(1, 2d)$.
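In code, the only change from the image-level objective is how the per-image feature is built before the (unchanged) loss is applied; the sketch below follows the tensor shapes described above (the function name and `use_patches` flag are ours, not from the paper's code).

```python
import torch

def triplet_feature(cls_token, patch_tokens, use_patches=True):
    """cls_token: (B, d); patch_tokens: (B, s, s, d).
    The image-level objective uses the CLS token alone; the patch-level objective
    concatenates CLS with the spatially average-pooled patch tokens, giving a
    (B, 2d) feature to which the same triplet loss is applied."""
    if not use_patches:
        return cls_token                               # (B, d)
    pooled = patch_tokens.mean(dim=(1, 2))             # (B, d)
    return torch.cat([cls_token, pooled], dim=-1)      # (B, 2d)
```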

Only our experiments in 4.1 (semantic segmentation and depth estimation) use this patch objective, as they are the only ones that require local features; all other applications in the paper are evaluated with CLS tokens (original objective detailed in 3.2). We appreciate the reviewer feedback to clarify this, and will do so in our revision.

3. Why does tuning on NIGHTS outperform high/low-level datasets?

We hypothesize that variations found in both BAPPS and THINGS are restricted to solely high- or low-level, whereas the mid-level judgments in NIGHTS cover some measure of both (see Fig. 11 in the paper for examples of the difference). Previous work [1] has found that similarity judgments by perceptual metrics trained on BAPPS correlate better with low-level attributes such as color than with semantic attributes. THINGS contains solely semantic distortions, reflecting the concepts humans use to judge object-level similarity rather than visual concepts.

In contrast, NIGHTS contains a broad set of visual variations including style, pose, color, and count. Previous work [1] has found that models fine-tuned on NIGHTS seem to attend to both low-level attributes and semantic attributes. Aligning a feature space to these concepts may be useful for visual tasks requiring some semantic knowledge, such as retrieval, counting, segmentation, etc. This hypothesis may also explain why tuning on NIGHTS hurts performance on fine-grained tasks, in which images that are perceptually similar may belong to different categories.

The main contribution of our paper is to empirically identify how tuning on mid-level variations, in comparison to other perceptual datasets, impacts transfer performance across a wide variety of downstream tasks. Further understanding the mechanism of the respective learned representations is a rich avenue for future work.

We greatly appreciate that the reviewers see perceptual alignment as an interesting scientific question worth investigating. We hope that our paper encourages research in this emerging topic and sparks fruitful discussion.

[1] Fu, S., Tamir, N., Sundaram, S., Chai, L., Zhang, R., Dekel, Tali., and Isola, P. Dreamsim: Learning new dimensions of human visual similarity using synthetic data. In NeurIPS, 2023.

Final Decision

More perceptual alignment with human similarity judgments benefits visual representations.

Four out of five reviewers recommend accepting the paper (ratings of 6, 8, 5, and 6), while one reviewer (LKEh, rating 4) did not give follow-up feedback on the authors' responses after the rebuttal period. After reading the paper, I find that the reviewer's questions have been adequately addressed.

Overall, the paper is well-written and easy to follow. As commonly acknowledged by the four reviewers, the idea of perceptual alignment with human similarity judgments is interesting. The experiments are comprehensive, spanning many different types of tasks and model backbones. The results are strong and convincing. The experimental design addresses the claims about alignment very well.

Please ensure the reviewers' comments are incorporated into the final version.