Deep Neural Networks Can Learn Generalizable Same-Different Visual Relations
Abstract
Reviews and Discussion
This work looks at whether deep neural networks can learn generalizable same-different relations. The authors extend prior work by testing different architectures, such as ViT, and training schemes, such as CLIP. In addition, they test a variety of datasets (including one containing natural objects) and evaluate out-of-distribution generalization. They show strong in-distribution and out-of-distribution same-different generalization using a CLIP-pretrained ViT. Finally, the authors assess color and texture bias, showing that CLIP pre-training and fine-tuning on shape-centric datasets reduce color and texture bias.
Strengths
The paper is well written. The authors clearly state how their work addresses limitations of prior studies on same-different visual-relations.
The authors contributed a number of new evaluations. These include: testing out-of-distribution generalization of same-different relations, testing ViT models, testing CLIP pre-training, and evaluating both abstract and natural object datasets.
The finding of strong same-different generalization using a CLIP pre-trained ViT is novel and interesting to me. I appreciate how the authors carefully varied parameters such as architecture and dataset to test their hypotheses. In addition, the analyses of texture and color bias provided new insights into the factors that contribute to learning generalizable same-different relations. The authors did a good job of motivating the analysis by citing prior work.
Weaknesses
Generally, the idea of testing same-different relations in deep neural networks is not new. The authors make an interesting contribution in studying out-of-distribution generalization, but the work is not especially novel.
Along those lines, I appreciate that the authors tested models beyond the usual ImageNet-trained CNN, but they only extend the study to a ViT and CLIP pre-training. I think the work could have been strengthened by testing more widely. For example, self-supervised pre-training could have been examined, or even models that have been shown to align well with human visual processing in other areas (such as top performers on metrics like BrainScore).
Regarding the datasets, I agree that the authors extend prior work by testing more natural objects in addition to the standard shape datasets. However, I would argue that these datasets are still highly unnatural. They do not contain the regularities and context found in natural scenes. I think the study could have been enriched by using even more natural stimuli, such as full images, but I recognize that the same-different task is harder to set up in that scenario.
Questions
I am curious about why the authors chose to evaluate CLIP pre-training over other methods. I am wondering why you did not test more pre-training methods and if there was something particularly interesting to you about CLIP.
Could the authors explain more about their rationale in choosing the datasets they evaluated? In particular, I am curious to hear your thoughts on the point I made about using more natural scene images.
Finally, could the authors explain the significance of their study to a broader representation learning audience? My main hesitation for strongly accepting this paper is that the findings seem incremental and niche to a subset of cognitive science.
I generally liked the paper and would be open to raising my score if questions and weaknesses are addressed.
Thank you for the valuable feedback. We agree with your point that our testing datasets were not yet sufficiently natural. While they were designed to match prior work, these evaluations were missing additional complexities involved in judging same-different relations in naturalistic settings. To that end, we have added a new section (3.3 Out-of-Distribution Generalization to Photorealistic Stimuli) to the paper exploring generalization to images consisting of two 3D objects placed in a realistic scene. We find that in fact, CLIP ViT has a median test accuracy of up to 90% (0.98 AUC-ROC) for these far-OOD scenes (depending on the type of objects in our 2D fine-tuning task) without any additional fine-tuning on the realistic 3D scenes. This result is quite surprising to us, given that there was no incentive for models to learn a same-different relation that generalizes beyond pixel-wise similarity. We hope that the reviewer will find our article improved with these additional results.
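For concreteness, the evaluation behind these numbers can be thought of roughly along the lines of the following minimal sketch (illustrative only, not our actual evaluation code; names such as `model`, `loader`, and `models_per_seed` are placeholders). It computes per-model accuracy and AUC-ROC on the photorealistic test set and takes the median across fine-tuning seeds:

```python
# Minimal sketch of the zero-shot photorealistic evaluation (illustrative only).
# Assumes a binary same/different classifier and a DataLoader over the 3D scenes;
# `model`, `loader`, and `models_per_seed` are hypothetical names.
import numpy as np
import torch
from sklearn.metrics import roc_auc_score

def evaluate(model, loader, device="cuda"):
    model.eval()
    scores, labels = [], []
    with torch.no_grad():
        for images, y in loader:
            logits = model(images.to(device))             # (batch, 2) logits
            probs = torch.softmax(logits, dim=-1)[:, 1]   # P("same")
            scores.append(probs.cpu().numpy())
            labels.append(y.numpy())
    scores, labels = np.concatenate(scores), np.concatenate(labels)
    accuracy = ((scores > 0.5).astype(int) == labels).mean()
    auc = roc_auc_score(labels, scores)
    return accuracy, auc

# Median test accuracy across independently fine-tuned seeds:
# accs = [evaluate(m, photoreal_loader)[0] for m in models_per_seed]
# print(np.median(accs))
```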
We agree that it would be nice to test more models and see if there are any others able to consistently generalize OOD. Among the models that we tested that are currently on the BrainScore leaderboard, CLIP ResNet-50 (0.406) scores much higher than both ViT and ResNet-50 pretrained on ImageNet-1k (0.190 and 0.137, respectively), which is intriguing. However, due to the combinatorial nature of our experiments, expanding beyond 2-3 pretraining methods and architectures quickly became intractable for the scope of our work. Our original goal was to see whether there might exist any model that can learn a generalizable same-different relation, addressing an open cognitive science question for which only negative results had so far been reported. Thus, we do not believe that omitting other models or pretraining paradigms from our evaluations weakens our argument. Future work on self-supervised pretraining and alternative architectures would be interesting in this space and may shed more light on what conditions allow for consistent OOD generalization.
Answers to your questions:
- We chose CLIP over other pretraining methods because of an intuition that linguistic supervision may allow visual models to better develop abstract relations. More concretely, it might be the case that being trained alongside a language model forces the ViT to develop more systematic semantic representations, allowing for abstract relations to be more easily learnable. Work in developmental cognitive science has shown that children begin to succeed on same-different tasks at the age where they can justify their choices using the words “same” and “different,” suggesting one possible influence of linguistic supervision (Hochmann et al., 2017). Whether this actually explains CLIP’s success is up for debate, and would require comparison against a model trained on a similar amount of data in a self-supervised, non-linguistic manner.
- We chose the Squiggles dataset due to its difficulty in prior work and its “abstractness,” since each shape is procedurally generated and completely unique. We constructed the Alphanumeric dataset as a point of comparison to Squiggles: ALPH objects are similar to Squiggles in that they are black lines, but they are not enclosed shapes. They also may have been seen by the model already in pretraining. The Shapes dataset is the analogue of the Squiggles dataset (novel objects), except with the added variation of color and texture. The Naturalistic dataset was chosen because we wondered whether using objects that pretrained models were already familiar with would make learning easier. We agree that it is important to consider naturalistic stimuli during testing. Interestingly however, our results seem to suggest that training on naturalistic stimuli may not actually lead to the best results, mainly due to the color/texture bias we observe in Section 4.
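To make the dataset descriptions above more concrete, here is a rough, simplified sketch of how a single 2D same-different trial of the kind used across these datasets could be composed (an illustration rather than our released generation code; the fixed canvas size, object slots, and `object_paths` variable are placeholders):

```python
# Toy sketch of composing one same/different trial (illustrative; not the paper's
# dataset code). "Same" pastes one object crop twice; "different" pastes two
# distinct crops. Object crops are assumed to be small RGBA images on disk.
import random
from PIL import Image

def make_trial(object_paths, same: bool, canvas_size=224, obj_size=64):
    canvas = Image.new("RGB", (canvas_size, canvas_size), "white")
    if same:
        path = random.choice(object_paths)
        crops = [Image.open(path).convert("RGBA")] * 2
    else:
        p1, p2 = random.sample(object_paths, 2)
        crops = [Image.open(p1).convert("RGBA"), Image.open(p2).convert("RGBA")]
    positions = [(20, 80), (140, 80)]  # fixed, non-overlapping slots for simplicity
    for crop, pos in zip(crops, positions):
        crop = crop.resize((obj_size, obj_size))
        canvas.paste(crop, pos, mask=crop)  # alpha channel used as paste mask
    return canvas, int(same)
```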
References
Hochmann, J., Tuerk, A.S., Sanborn, S., Zhu, R., Long, R., Dempster, M., & Carey, S. (2017). Children’s representation of abstract relations in relational/array match-to-sample tasks. Cognitive Psychology, 99, 17-43.
I appreciate the new experiment you added with more natural images. I think these results increase the contribution of your work and provide interesting insights into OOD generalization.
In light of these new results, I will raise my score to a 6. I do not raise it further because I still feel that the overall concept and methods are not particularly new.
This paper presents a series of experiments examining the generalizability of two kinds of deep neural networks, i.e., ResNet and ViT, with and without CLIP pretraining, in classifying same-different visual relations on four datasets. The conclusion is that a ViT pretrained with CLIP can learn a pixel-level same-different relation that is generalizable to out-of-domain datasets. The experiments also find that fine-tuning the model on abstract shapes yields stronger OOD generalization for same-different prediction.
Strengths
- In my opinion, understanding the ability to learn generalizable same-different visual relations is important, especially if one would like a DNN to also perform basic logical operations in addition to instance-level perception. This paper gives a well-organized, experiment-based summary that attempts to answer how and why recent DNN architectures and pretraining datasets enable generalizable same-different visual relation classification.
- The paper is well-written and easy to follow. A comprehensive set of experiments is conducted to support its conclusions.
Weaknesses
I am afraid that the definition of the same-different visual relation should go beyond comparing objects at the pixel level. Thus, the observations and analyses may offer limited insight. The reasons are explained below:
- In the human visual system, two objects are usually considered the same based on specific semantic and/or attribute similarities, rather than on counting how many pixels have exactly the same values. This means the dataset/task should define same-different visual relations by more criteria, such as geometry, texture, color, category, and identity. Moreover, the dataset should consider more visual distortions found in real scenarios, such as cluttered backgrounds, color jittering, slight-to-moderate shape distortion, or overlap between objects. Sec. 4.2 tries to dissociate color, texture, and shape, but more test scenarios could be included (see the example transforms sketched after this list).
- Defining the same-different visual relation at the pixel level actually gives quite a strong cue that can be captured by DNNs. This is possibly why the prediction results are almost saturated (100% accuracy can even be achieved after fine-tuning) across different pretrained models, especially when texture similarity is involved (i.e., the SHA and NAT datasets). Therefore, some conclusions drawn from these experiments may not be sound enough. For example, the claim that CLIP-pretrained ViT models can achieve 100% in-distribution and nearly 100% out-of-distribution test accuracy is not a universal conclusion, but depends on the carefully designed test datasets. The conclusion that fine-tuning on abstract shapes leads to more generalizable prediction may stem from the fact that all four datasets (SQU, ALPH, SHA, NAT) allow the same-different relation to be classified by shape alone. If OOD generalization were evaluated on texture-based relations, the observations might be quite different.
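For instance, a quick robustness check along these lines could be assembled from standard torchvision transforms; the exact parameter values below are only illustrative:

```python
# Example stress-test transforms a future evaluation could include (illustrative).
from torchvision import transforms

distortions = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.RandomPerspective(distortion_scale=0.3, p=1.0),  # mild shape distortion
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.8, 1.2)),
])
# Applying `distortions` to each test image before the same/different judgment
# would probe robustness beyond exact pixel-level matches.
```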
Questions
In the weaknesses above, I have mentioned several questions that should be addressed in the rebuttal. Here is one more question:
- Sec. 3.2: As I have mentioned, the OOD generalization in this paper may come from how well shape is extracted. The closeness of stimuli may not be a reliable cue for better shape extraction, and thus may not be well correlated with OOD generalization. The authors use another dataset containing patches with random noise to show that closeness of stimuli is not a perfect correlate of OOD generalization. But classifying this dataset relies more on texture similarity, so it is reasonable that models fine-tuned on it exhibit weaker generalization. I would be interested in another dataset combining SQU with noise filling each object. Such a dataset would likely have closer stimuli, but possibly also higher OOD generalization.
We would like to thank the reviewer for their comments on our work. In particular, the reviewer's comment encouraging us to go “beyond comparing objects at a pixel level” was very helpful in guiding our revisions and in inspiring a new experiment, which we added to the article. We hope the reviewer will find our work improved.
To address your main points:
- We agree that same-different in real-world environments also requires invariance to factors like perspective, overlap, and background noise. To address this point, we have added a new experiment (3.3. Out-of-Distribution Generalization to Photorealistic Stimuli) testing how well our models generalize to highly realistic images that include object rotation, perspective (which distorts shape), object overlap, and textured/cluttered backgrounds. We do not fine-tune or train any of our models on these more realistic images, only using them as an additional evaluation (much like how we evaluated our models on Puebla and Bowers’ datasets in Appendix A.4). Surprisingly, our CLIP ViT models have a median test accuracy of up to 90% on these realistic stimuli (0.98 AUC-ROC), even when fine-tuned only on very simple synthetic 2D stimuli. This suggests that models actually learned a much more “human-like” representation of same-different than they needed to during fine-tuning, enabling them to further generalize to photorealistic stimuli under very different viewing conditions. This hopefully also addresses your concern of saturated prediction results due to carefully constructed testing datasets: generalization success is not actually limited to 64x64 non-overlapping items on a white background.
- It is also a valid point that all four of our datasets can be classified using shape, which makes shape-based models look more “correct” than models biased in other manners. In fact, we see that our shape-biased CLIP ViT fine-tuned on SQU does slightly worse generalizing to photorealistic images (82% median test accuracy) than the same model fine-tuned on NAT (87%) or SHA (90%), because apparent object shape changes with rotations in the photorealistic setting. We also fully expect that our SQU and ALPH models would not perform very well on a texture-based same-different relation; we can already see hints of this in our Table 13 results, which show that CLIP ViT fine-tuned on a shape-only dataset consistently classifies objects with the same texture but different shape as “different” (the T and CT columns in the last row). However, we designed the experiment in this manner to match experimental designs from the pre-existing cognitive science literature. The consensus in cognitive science and developmental psychology literature is that human object recognition is based on shape above all other factors (Hummel, 2013; Landau et al., 1988), so favoring shape-biased models during evaluation is a natural choice.
Your idea of a dataset combining SQU with noise filling in each object is interesting because it would combine the “closeness” idea from cosine similarity results with a dataset only solvable by shape. Our guess as to why the experiments in Section A.7 failed is because there was no incentive for the model to learn a global shape-based comparison, with the task being solvable even if a subset of pixels were sampled and compared. We worry that filling in SQU with noise would simply result in the same issue.
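To make this proposal concrete, here is a rough sketch of how such a noise-filled Squiggles variant could be generated, assuming each squiggle is available as a binary silhouette mask (purely illustrative; we have not run this experiment, and the function and file names here are hypothetical):

```python
# Illustrative sketch of the reviewer's proposed variant: fill the interior of a
# Squiggles silhouette with i.i.d. pixel noise (assumes a binary mask per object).
import numpy as np
from PIL import Image

def noise_fill(mask_path, seed=None):
    rng = np.random.default_rng(seed)
    mask = np.array(Image.open(mask_path).convert("L")) > 128   # True inside the shape
    h, w = mask.shape
    out = np.full((h, w, 3), 255, dtype=np.uint8)               # white background
    noise = rng.integers(0, 256, size=(h, w, 3), dtype=np.uint8)
    out[mask] = noise[mask]                                      # noise inside the shape
    return Image.fromarray(out)

# For a "same" pair, the identical noise-filled object would be pasted twice
# (fixed seed); for "different", two distinct squiggles (and noise draws) would be used.
```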
References
Hummel, J. E. (2013). Object recognition. The Oxford Handbook of Cognitive Psychology, 810, 32–46.
Landau, B., Smith, L.B., & Jones, S.S. (1988). The importance of shape in early lexical learning. Cognitive Development, 3, 299-321.
The paper focuses on learning a same-different relation from images and analyzes 2 networks (ResNet and ViT) and 3 different pre-training strategies (none/from-scratch, pre-trained on ImageNet, pre-trained with CLIP). Extensive experiments show that CLIP-pretrained ViT models can generalize same-different relations, whereas prior works, which studied only ResNet models, needed strong inductive biases (e.g., separately processing the two objects in the image).
Strengths
- The paper is clearly written.
- The authors present their empirical results in a sound way. For example, in section 3.2, they provide additional experiments fine-tuning on random noise to illustrate that "closeness" of stimuli is not a perfect correlate of OOD generalization.
Weaknesses
- It is hard to tell how much the paper can contribute to the community. In my opinion, this work is an empirical study. For this category of work, the criteria are usually a) how many new observations are found, how surprising they are, and how useful they are for future work; b) whether the study is systematic and convincing; c) whether a new perspective is proposed or a new methodology is used to study the problem. I think this paper focuses more on a) and b). I am not sure how novel and important the findings introduced in this paper are, since the authors study a network architecture (ViT) that is already widely used in VLMs and in tasks that require relational understanding, such as VQA. I suggest the authors clarify the generality and importance of their findings if only the same-different relation is studied, or try tasks beyond the same-different relation.
Questions
- Figure 5 shows that when a randomly initialized ViT is trained on Masked Shapes, it generalizes poorly to all datasets (including the in-distribution one). According to the authors' claim that "fine-tuning on abstract shapes that lack texture or color provides the strongest out-of-distribution generalization", a model trained on Masked Shapes should generalize the best, which contradicts the results.
- In Appendix B.1, "Because training datasets are constructed by sampling random objects, the exact objects used between the original, grayscale, and masked datasets are not the same" is a little confusing to me. Given the description above this claim, it seems that the authors could convert the original datasets to grayscale or masked versions with the same objects.
- As ViT takes image patches as inputs, it natively "segments" objects in the image. Have the authors tried changing object sizes in the image or experimenting with other transformer architectures like Swin-Transformer?
We thank the reviewer for their thoughtful comments on our work. We view learning the same-different relation as a case study for the larger problem of whether neural networks have the capacity to learn abstract relations. Since prior work has claimed that NNs are not capable of acquiring abstract same-different relations, our work centers around a question of existence, and whether we are able to find any neural network that is able to solve the same-different task in a generalizable way.
We agree that ViT models performing well on tasks like VQA may already suggest that they have the capacity to handle certain kinds of abstract relations. However, this is only implicit evidence for relational reasoning in ViTs. Our experiments strip away confounding variables that may be present in more realistic tasks in order to explicitly test models on OOD generalization for the same-different task. Only some ViT models are able to generalize well (ViT pretrained with CLIP), while others fail (randomly-initialized ViT and ImageNet ViT, as seen in Table 6 and Table 8).
Notably, regarding the reviewer’s questions about generality and importance, we have updated the article with a new section (3.3. Out-of-Distribution Generalization to Photorealistic Stimuli) showing that CLIP ViT models can generalize to 3D objects placed in a realistic scene with up to 90% median test accuracy (0.98 AUC-ROC) when fine-tuned only on our toy 2D stimuli. This is a surprising result, as “same” in our fine-tuning datasets was defined by pixel-level similarity, making generalization risky when transferring across differences in dimension, depth, orientation, lighting, occlusion, etc. Remarkably, we find that the best models retain strong performance. We hope that this new result, combined with the above clarification regarding the significance of our findings, will improve the reviewer’s view of our paper.
Regarding your criteria for empirical studies: apart from our surprising results (your point (a)), the novelty of our approach to this problem (your point (c)) also comes from our detailed analysis of the impact of dataset distribution on generalization (Section 4). This is a factor that has not yet been thoroughly explored, despite clearly having a large impact on how models learn to solve a given task. As our results show, models will learn qualitatively different “solutions” to the same-different task depending on things like the presence of color variation in their training data.
Here are our answers to your questions:
- Because in-distribution accuracy is so low for randomly-initialized ViT-B/16 trained on Masked Shapes, it is difficult to interpret these OOD results. Nonetheless, the model seems to perform better on color/texture datasets than on shape-only datasets, which would support the idea that “training on abstract shapes that lack texture or color provides the strongest out-of-distribution generalization” (relative to in-distribution accuracy, which we clarify in Appendix B.1). Still, you raise a good point: why is there any difference in performance between models fine-tuned on the Masked Shapes, Squiggles, and Alphanumeric datasets? Although these are all shape-based, there are still differences in generalization performance between the three. From this fact we know that color and texture bias is not the full story, motivating our analysis of image cosine similarity as a way to differentiate between Squiggles and Alphanumeric in Section 3.2 (a rough sketch of this pairwise-similarity computation appears after these answers). Table 11 in the Appendix also shows updated pairwise cosine similarity results for our Masked and Grayscale datasets, with further discussion on that point.
- In order to construct the Grayscale Shapes and Masked Shapes datasets, we first created grayscale and graymasked versions of all the objects in the Color Shapes dataset, then randomly sampled objects that would actually be included in the train/val/test data. As all of our datasets contained 1600 unique objects, we sampled 1600 objects out of the 1793 unique Shapes objects we had access to, meaning that one specific object may have made it into the Grayscale Shapes dataset, but not the Color Shapes dataset (for example). We were not worried about test set contamination as shapes were modified across datasets.
- This was an intuition we had as well, but it is important to point out that even though objects are not actually aligned to ViT image patches in any of our main experiments, models still succeed on the task. In terms of object size, two of our experiments indirectly address this: the new section we added to the PDF testing on photorealistic images (Section 3.3) involves changes in object size due to distance from the camera, and Appendix A.4 involves testing on objects of different sizes. CLIP ViT is robust to these changes in object size in both cases. We also added experiments to the appendix (Appendix E) showing that performance does not seem to improve when objects are aligned with ViT image patches during fine-tuning. While it would certainly be interesting to see how Swin-Transformer would compare to our models, we leave this investigation to future work.
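As a supplement to the first answer above, the pairwise cosine similarity measure we refer to in Section 3.2 can be sketched roughly as follows (a simplified, flattened-pixel version with illustrative names, not our exact analysis code):

```python
# Minimal sketch of mean pairwise cosine similarity over a dataset's images
# (illustrative of the Section 3.2 "closeness" analysis; not the exact code).
import numpy as np

def mean_pairwise_cosine(images):
    """images: array of shape (n, H, W, C) with pixel values in [0, 255]."""
    flat = images.reshape(len(images), -1).astype(np.float64)
    flat /= np.linalg.norm(flat, axis=1, keepdims=True)   # unit-normalize each image
    sims = flat @ flat.T                                   # all pairwise cosines
    iu = np.triu_indices(len(images), k=1)                 # exclude self-similarity
    return sims[iu].mean()

# Higher values indicate that the dataset's stimuli are more alike at the pixel level.
```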
This work investigates the learning and generalization of some existing architectures on same-different relation image data.
Strengths
The work attempts to better understand the challenges that the same-different task poses for current architectures, and it can provide useful conclusions for this task.
Weaknesses
The novelty and contributions of this work are limited. The work fine-tunes some existing pretrained architectures to improve same-different recognition performance. The work is also very incremental relative to: Guillermo Puebla and Jeffrey S Bowers. Can deep convolutional neural networks support relational reasoning in the same-different task? Journal of Vision, 22(10):11–11, 2022. I think the contributions of this work are not enough for publication at such a strong venue. In my opinion, this work is more appropriate as a workshop paper.
Questions
See my concerns above.
We appreciate the reviewer’s comparison with Puebla and Bowers (2022), although we respectfully disagree that this is a reason to dismiss our article. Most notably, Puebla and Bowers’s article arrives at the opposite conclusion from ours! While Puebla and Bowers do not find evidence of deep neural networks being able to learn a “same-different” relation in a generalizable way across different evaluation datasets, we in fact do find a model that learns a sufficiently abstract relation to generalize well out-of-distribution. (A discussion of the reasons for these differences is available in Appendix A.4.) In other words, the novelty of our work is due to the novelty of our findings, not our methodology. Please also see our updated PDF, which contains a new section (3.3. Out-of-Distribution Generalization to Photorealistic Stimuli) providing further evidence that certain models can generalize same-different relations that are not just based on pixel similarity, but that also extend to more naturalistic stimuli.
We hope that our comment better contextualizes our work in relation to previous related work. In addition to discovering models that can generalize the same-different relation, our results also speak to why previous attempts were unable to, as generalization requires the right combination of architecture, pretraining, and fine-tuning data.
This paper presented systemic analyses for learning the generalizable same-different relation with neural networks. For this, the authors investigated multiple factors including architectural biases, pre-training methods, and fine-tuning datasets.
Three out of four reviewers recommended rejection of this paper. The primary concerns from the reviewers were the lack of novel technical contributions and insights and the limited experimental results in a synthetic setting. The authors provided an extensive volume of additional results in the rebuttal, which addressed some of the reviewers' concerns. However, most of the negative reviewers maintained their initial recommendation. After reading the paper, reviews, and rebuttal, the AC agrees with the reviewers that the significance of the work to the field is unclear and that the paper needs resubmission to properly assess the newly added content. Hence, the AC recommends rejection this time.
Why not a higher score
N/A
Why not a lower score
N/A
Reject