Spawrious: A Benchmark for Fine Control of Spurious Correlation Biases
Abstract
Reviews and Discussion
The paper studies the spurious correlation (SC) problem. To study this problem, the paper introduces the Spawrious dataset. Unlike previous SC benchmarks that contain only one-to-one SCs, the Spawrious benchmark introduces new many-to-many SCs that jointly consider spurious correlations and domain generalization. The images are generated using Stable Diffusion v1.4. The experimental results show that many group robustness methods struggle with the new benchmark.
Strengths
- The paper evaluated many (11) group robustness methods (Table 3) on the newly introduced dataset.
- The discussion on potential ethical concerns about using Stable Diffusion to generate training images (Appendix C) is appreciated.
- The paper is well-written.
Weaknesses
Major Concerns
[Benchmark W2D]: Although the paper evaluates many group robustness methods in Table 3, none of them is designed to handle both correlation shift and domain shift. Why not evaluate the W2D method (Huang et al. 2022) that is designed to handle two shifts, which is the main focus of Spawrious?
[More comprehensive benchmark of architecture]: Wenzel et al. [1] did a comprehensive evaluation of OOD generalization. One of the conclusions is that architectures (such as Deit, Swin, and ViT) play a key role in improving OOD robustness. Although the paper benchmarks many group robustness methods and compares two architectures (Appendix D), I think it is necessary to see more results in terms of different neural architectures on the new benchmark based on conclusions in [1].
Minor Concerns
[More results of foundation models]: To show that Spawrious is really challenging, I think the authors should evaluate the performance of foundation models (e.g., CLIP [2] with zero-shot transfer) pretrained on web-scale datasets.
I wonder why the authors argued that ImageNet-W (Li et al., 2023) is synthetic (Section 2, page 3). The watermark shortcut in ImageNet-W naturally exists in the real-world ImageNet dataset.
References
[1] Florian Wenzel, Andrea Dittadi, Peter Vincent Gehler, Carl-Johann Simon-Gabriel, Max Horn, Dominik Zietlow, David Kernert, Chris Russell, et al., “Assaying Out-Of-Distribution Generalization in Transfer Learning,” in NeurIPS, 2022.
[2] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, et al., “Learning Transferable Visual Models From Natural Language Supervision,” in ICML, 2021.
Questions
In the rebuttal, I expect the authors to address my concerns:
- Add results of W2D.
- Add more results by using different architectures.
- Add results of CLIP or other foundation models to better demonstrate how challenging the benchmark is.
Thank you for the constructive feedback that improves the quality of our submission! We address your comments below.
Add results of W2D.
Thank you for highlighting this method to us; we have since collected results for W2D and added the results to Table 3 in the revised submission. Averaged across all Spawrious challenges, W2D ranks 6th out of 12 methods.
Add more results by using different architectures.
Thank you for this suggestion! We have since included results in the appendix on the effects of using the ViT-B/16 architecture instead of the ResNet-50. The results we included in the initial submission remain the strongest.
Add results of CLIP or other foundation models to demonstrate better how challenging the benchmark is.
Thank you for suggesting an interesting direction for investigation! We shall return to you with the results of fine-tuning a head applied to a CLIP feature extractor on the Spawrious challenges.
We believe there is some slight confusion – the problem setup of domain generalization assumes some distribution shift between training and test domains. Consequently, training methods addressing this problem only have access to training domains that differ from the test domains. For a fair evaluation of such methods, the models must be trained on training data that strictly differs from the test data. In the case of CLIP, the training set is unknown and vast, consisting of internet-crawled data. Hence, the CLIP model may have already been trained on {class, background} combinations similar to the ones in the test data. However, this result may interest the OOD community, and we will attempt to gather results for the next revision.
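For concreteness, a linear-probe setup of the kind we have in mind would look roughly like the sketch below (the open_clip model name, the hardcoded feature dimension, and all hyperparameters are illustrative assumptions, not our final experimental configuration):

```python
# Minimal sketch of linear probing on frozen CLIP features (illustrative;
# model name, feature dimension, and hyperparameters are assumptions).
import torch
import torch.nn as nn
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="openai")
clip_model = clip_model.to(device).eval()

num_classes = 4   # Spawrious classifies four dog breeds
feat_dim = 512    # image-embedding width of CLIP ViT-B/16
head = nn.Linear(feat_dim, num_classes).to(device)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One ERM step on the linear head; the CLIP backbone stays frozen."""
    with torch.no_grad():
        feats = clip_model.encode_image(images.to(device)).float()
    logits = head(feats)
    loss = criterion(logits, labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```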
Results of W2D
I appreciate the authors for adding the results of W2D, which addressed my concerns.
Results of different architectures
I appreciate the authors' efforts in adding the ViT-B/16 architecture. However, it still lacks the comprehensiveness to cover more architectures, such as Deit, Swin, and different sizes of these architectures. Thus, my concern on this is not fully addressed.
Results of foundation models
I appreciate the authors' promise to add the results of CLIP. However, I wonder whether the results will be added during the rebuttal.
I appreciate the explanations on the domain generalization setup. However, the purpose of evaluating foundation models is to see if the proposed Spawrious challenge is sufficiently challenging to fail foundation models. Otherwise, if the domains can be easily seen through web-crawled images, then I am afraid that combining domain generalization with spurious correlation (i.e., the key contribution of Spawrious compared to existing benchmarks) may not be realistic since the former can be easily addressed by models trained on web-crawled data (e.g., CLIP) whereas the latter remains unaddressed.
I appreciate the authors' efforts in adding the ViT-B/16 architecture. However, it still lacks the comprehensiveness to cover more architectures, such as Deit, Swin, and different sizes of these architectures. Thus, my concern on this is not fully addressed.
Sorry, we were just in the process of posting more architecture results and it took us a bit more time than we anticipated.
Here are more test accuracy results for the hardest challenge (M2M-Hard) trained with ERM; for comparison, we also include the ResNet50 results from our initial submission.
| Architecture | M2M Hard |
|---|---|
| Swin-B-4 | 67.07% |
| DeiT-III-B-16 | 77.41% |
| Beit-B-16 | 66.79% |
| LeViT-128s | 45.41% |
| Eva-B-14 | 68.02% |
| ViT-B-16 | 30.20% |
| ResNet50 | 58.70% |
After the rebuttal, we will add results for all other challenges too (i.e., results for {all challenges} x {all architectures}).
I appreciate the authors' promise to add the results of CLIP. However, I wonder whether the results will be added during the rebuttal.
Again, our sincere apologies for keeping you waiting. Instead of the original, by now outdated CLIP checkpoint, we fine-tune via ERM the state-of-the-art SigLIP-14 model [1], pre-trained on 400 million images (like CLIP) from the WebLI dataset. For comparison, we also include the ResNet50 results from our initial submission.
| Architecture | O2O Easy | O2O Medium | O2O Hard | M2M Easy | M2M Medium | M2M Hard |
|---|---|---|---|---|---|---|
| SigLip | 60.32 | 72.77 | 61.42 | 73.58 | 49.56 | 57.51 |
| ResNet50 | 77.49 | 76.60 | 71.32 | 83.80 | 53.05 | 58.70 |
I appreciate the explanations on the domain generalization setup. However, the purpose of evaluating foundation models is to see if the proposed Spawrious challenge is sufficiently challenging to fail foundation models. Otherwise, if the domains can be easily seen through web-crawled images, then I am afraid that combining domain generalization with spurious correlation (i.e., the key contribution of Spawrious compared to existing benchmarks) may not be realistic since the former can be easily addressed by models trained on web-crawled data (e.g., CLIP) whereas the latter remains unaddressed.
Agreed, based on the above results, we conclude that the Spawrious challenges remain sufficiently challenging, even for foundation models pre-trained on web-crawled data.
This paper introduces a new image classification benchmark dataset called "Spawrious" that addresses the problem of spurious correlations in classifiers. Spawrious contains both one-to-one (O2O) and many-to-many (M2M) spurious correlations between classes and backgrounds in images. The dataset is carefully designed to meet six specific desiderata and is generated using text-to-image and image captioning models, resulting in ~152k high-quality images.
Experimental results show that even state-of-the-art group robustness methods struggle with the Spawrious dataset, especially in challenging scenarios (Hard-splits) where accuracy remains below 73%. Model misclassifications reveal a reliance on irrelevant backgrounds, highlighting the significant challenge posed by the dataset. Experimental results demonstrate the difficulty of the dataset and the limitations of current group robustness techniques.
Strengths
The strengths of the paper can be summarized as follows:
Novel Benchmark Dataset: The paper introduces a novel benchmark dataset, Spawrious, which contains a wide range of spurious correlations, including both one-to-one and many-to-many relationships. This dataset offers three difficulty levels (Easy, Medium, and Hard) for evaluating the robustness of classifiers against spurious correlations. The dataset consists of approximately 152,064 high-resolution images of 224 × 224 pixels. The dataset's size and quality make it a valuable resource for testing and probing classifiers' reliance on spurious features.
Experimental evaluation: The paper explores different model architectures and robustness methods, evaluating their performance on the dataset, revealing that larger architectures can sometimes improve performance but the gains are inconsistent across methods. The experimental results demonstrate that state-of-the-art methods struggle to perform well on the Spawrious dataset, particularly in the most challenging scenarios (Hard-splits) where accuracy remains below 73%. This highlights the dataset's effectiveness in pushing the boundaries of current classifier robustness. The paper provides evidence for the reliance of models on spurious features through an analysis of model misclassifications.
Overall, the strengths of the paper lie in its creation of a challenging benchmark dataset and the empirical evidence it provides about the limitations of state-of-the-art methods in handling spurious correlations in image classification, stimulating the need for future research and developments in this domain.
Weaknesses
It would have been better if the paper included empirical results on some well-known domain generalization datasets (using the same methods as the ones in Table 3). By comparing the accuracy of various methods on multiple such datasets, the case could be made stronger for the paper introducing a strong benchmark for spurious correlations. One such dataset could be FOCUS: Familiar Objects in Common and Uncommon Settings.
However, that paper is not cited, nor is any such comparison provided.
Moreover, no details are provided on any filtering of the model-generated images. There should be some human study in the paper that shows what percentage of images generated by the diffusion models are aligned with the prompts. If the generated images are not aligned with the prompts, the dataset cannot be trusted to contain the variety of spurious correlations.
Questions
Did the authors conduct any crowd study to show that the images generated by the diffusion model follow the intent of the prompt? If not, it is hard to say whether the generated data actually contains the different images in different backgrounds and can be useful for research.
Thank you for the constructive feedback that improves the quality of our submission! We address your comments below.
Did the authors conduct any crowd study to show that the images generated by the diffusion model follow the intent of the prompt? If not, it is hard to say whether the generated data actually contains the different images in different backgrounds and can be useful for research.
We have since sent out 480 random samples of our dataset to human volunteers, who were asked to judge image–prompt alignment, and have found that 97.2% of the dataset is clean. Additional details are provided in Appendix H of the updated submission.
By comparing between the accuracy of various methods on multiple such datasets, the case could be made stronger for the paper introducing a strong benchmark for spurious correlations. One such dataset could be: FOCUS: Familiar Objects in Common and Uncommon Settings. However, the paper is not cited. Nor is any such comparison provided.
Thanks for pointing this out! We admire the effort by the authors of the FOCUS work to curate a dataset with unfamiliar class-background combinations. While they examine an image classifier’s OOD performance on their dataset, they do not attempt to introduce spurious correlations between class and background in their benchmark, nor do they attempt to balance the datasets by class and group. Hence, Spawrious provides a fundamentally different contribution to the OOD community. We will cite the FOCUS dataset and add this comparison to the revision.
Thank you for the comment.
I admire the authors' efforts to conduct a human study to evaluate the alignment of generated images with the human intent.
However, I disagree with the authors' assertion that FOCUS does not attempt to introduce spurious correlations between class and background. FOCUS tries to discover images with spurious correlations via retrieval, while Spawrious attempts to do the same via generation.
To make a case that image generation is a better approach compared to image retrieval for obtaining images with spurious correlations, it is imperative that a comparison be made between the two approaches.
This paper presents a new benchmark for assessing algorithms for training models to be robust to spurious signals. The benchmark uses a text-to-image model to generate inputs under different specifications, i..e input text, which allows one to control the difficulty of the task. Because of this text-based control, the benchmark has a many-to-many spurious signal set, which can be completely reversed between training and testing---a challenging condition for current algorithms. The primary task is dog classification. This paper then tests several approaches for training a model to be robust to spurious correlations, and finds that Mixup does particularly well for many-to-many spurious correlation and Just-train-twice is the best performing for One-to-One spurious settings.
Strengths
Overall, I enjoyed reading this paper, and think it was well executed. Here I discuss some of these key strengths.
- Nice Dataset Design: I particularly enjoyed the use of text-to-image models here for performing dataset design. I think this type of dataset design is going to be increasingly common for various settings. Essentially, the design here is to use a text-conditioned generative model to create toy datasets where the data generation process is carefully controlled to induce various proportions of features of interest. This approach was also used in the InstructPix2Pix paper.
- Scale of Empirical Assessments: The coverage of algorithms here is also quite substantial, since this literature is quite active. By my count, the authors test 10 methods across 6 settings, which is a substantial amount of work, and commendable.
Weaknesses
I see two weaknesses in this work, but they don't factor into my rating.
- Failures of Dataset Design: My first issue is about how to verify whether the output of the text-to-image model matches and satisfies all conditions or features specified in the prompt. The authors discuss this issue in Appendix F. The authors attempt a manual and an automatic filtering process. However, both of these might also be susceptible to failure in different ways.
- Insights: Table 1 is a very compelling result for Mixup, and it points at trying to better understand its properties theoretically w.r.t. spurious signals. Given the already wide scope of this paper, it is unable to delve into explaining the effectiveness of various methods.
Questions
- In Table 3, how is this average computed? Also, does it make sense to report an average if it includes a combination of different settings/environments?
- Section E of the appendix is very interesting. Did you also happen to compute the saliency maps for the Mixup models on these same inputs? It would be interesting to compare both of them.
Details of Ethics Concerns
N/A
Thank you for the constructive feedback that improves the quality of our submission! We address your comments below.
The authors attempt a manual and an automatic filtering process. However, both of these also might be susceptible to failure in different ways.
This is a valid point, and we appreciate your insight. Our automatic filtering process may fail due to errors from the captioning model, and the human filtering part might fail if we do not reach out to a sufficiently varied set of human volunteers to assess image–prompt alignment. We have since sent out 480 random samples of our dataset to human volunteers, who were asked to judge image–prompt alignment, and have found that 97.2% of the dataset is clean. Additional details are provided in Appendix H of the updated submission.
In Table 3, how is this average computed? Also, does it make sense to report an average if it includes a combination of different settings/environments?
This average is computed as the row average, i.e., the mean performance across all Spawrious challenges for a given optimization algorithm.
Section E of the appendix is very interesting. Did you also happen to compute the saliency maps for the mixup models on these same inputs? Would be interesting to compare both of them.
Thank you for raising an interesting direction of investigation! We are currently running a training loop for ERM and Mixup models, and will present saliency maps of their misclassifications in order to investigate reasons for Mixup's superior test performance.
The new human study done to check prompt alignment is quite helpful and allays my concern. I'll be keeping my score as is.
I have some additional questions about JTT vs Mixup: There is quite a drastic difference between the one-to-one performance and the many-to-many performance, which is somewhat surprising. For example, in the hard setting, JTT is 7 percent better than the closest performing method, which is not even Mixup. However, Mixup is effective in the many-to-many setting. Is it because the first classifier in JTT can only capture one 'group'/'environment' in the many-to-many setting?
Are there works that have studied Mixup and JTT theoretically or tried to show what it is that makes these approaches effective (or ineffective) in different settings? If this is the first time the one-to-one vs many-to-many effectiveness of these approaches is being studied then that'll be interesting future work, but I wonder if there is enough known already about how these approaches confer robustness to spurious signals.
We could not find any works that theoretically analyze the OOD performance gains from JTT, nor specifically in the subproblem of robustness to spurious correlations. Zhang et al. (2020) [1] find that the Mixup loss acts similarly to a second-derivative adversarial loss, explaining the adversarial robustness improvements gained from it. This seems unlikely to explain its performance on Spawrious, which is a spurious correlation benchmark. Perhaps the background information is mostly degraded in the mixing of the images while the dog features are preserved better, in which case it is harder for the classifier to identify the background than the dog features, making the dog features more useful for prediction. This could be tested by switching the benchmark to background prediction with spurious dogs, and measuring the ability of Mixup-trained classifiers to rely on the backgrounds rather than the dogs. Given the time constraints, this experiment remains future work.
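To make the speculated mechanism concrete, here is a minimal sketch of a standard mixup training step (illustrative only; alpha, the model, and the optimizer are placeholders rather than our benchmark training configuration):

```python
# Minimal sketch of a standard mixup training step (illustrative; alpha and
# the model/optimizer are placeholders, not our actual benchmark configuration).
import numpy as np
import torch
import torch.nn.functional as F

def mixup_step(model, optimizer, x, y, alpha=0.2):
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(x.size(0), device=x.device)
    # Pixel-wise interpolation: background textures from the two images blend
    # into each other, while the labels still point at the two dogs.
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    logits = model(x_mixed)
    loss = lam * F.cross_entropy(logits, y) + (1.0 - lam) * F.cross_entropy(logits, y[perm])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```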
It is hard to say why JTT performs well on O2O but not on M2M. Our speculation is that O2O can be partially solved by reweighting the training samples so that the backgrounds are less predictive. This is possible since corr(background, label) < 1, and for every class there exist samples where the background is not spuriously correlated with the label, for example the beach-background samples in O2O-Easy. Reweighting the samples so that the beach-background samples carry a larger proportion of the loss then reduces the utility of the background as a predictive feature and increases the utility of the dog features. How to find the right reweighting? JTT does this by letting ERM models overfit, making mistakes on the beach-background samples, and then upweighting these samples accordingly.
This strategy will not work on M2M, however, because the spurious correlation between groups is corr(B, C) = 1; thus all samples contain spurious features, and JTT cannot easily identify the samples that would rebalance the spurious correlation.
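For clarity, the two-stage upweighting strategy described above can be sketched as follows (schematic only; `train_erm` is an assumed helper, and the epoch counts and upweighting factor are illustrative, not tuned values):

```python
# Schematic of the two-stage JTT procedure described above. `train_erm` is an
# assumed helper; the epoch counts and lambda_up are illustrative, not tuned.
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def jtt(model_fn, train_dataset, lambda_up=20.0, id_epochs=2, final_epochs=30):
    # Stage 1: briefly train a plain ERM model and record its mistakes,
    # e.g. the beach-background samples it misclassifies in O2O-Easy.
    id_model = model_fn()
    train_erm(id_model, DataLoader(train_dataset, batch_size=64, shuffle=True),
              epochs=id_epochs)                      # assumed helper
    error_idx = []
    id_model.eval()
    with torch.no_grad():
        for idx in range(len(train_dataset)):
            x, y = train_dataset[idx]                # assumes tensor images, int labels
            pred = id_model(x.unsqueeze(0)).argmax(dim=1).item()
            if pred != y:
                error_idx.append(idx)

    # Stage 2: retrain from scratch with the error set upweighted, shifting
    # loss mass onto samples where the background does not predict the class.
    weights = torch.ones(len(train_dataset))
    weights[error_idx] = lambda_up
    sampler = WeightedRandomSampler(weights, num_samples=len(train_dataset))
    final_model = model_fn()
    train_erm(final_model, DataLoader(train_dataset, batch_size=64, sampler=sampler),
              epochs=final_epochs)                   # assumed helper
    return final_model
```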
[1]: How Does Mixup Help With Robustness and Generalization? https://arxiv.org/abs/2010.04819
Previous benchmarks testing robustness to spurious correlations faced problems, such as over-saturation and a lack of many-to-many (M2M) spurious correlations. The authors introduce Spawrious-{O2O, M2M}-{Easy, Medium, Hard}, an image classification benchmark with 152k high-quality images. State-of-the-art group robustness methods struggle with Spawrious, especially on the hard splits, achieving less than 73% accuracy using an ImageNet pre-trained ResNet50. Model misclassifications expose dependencies on spurious backgrounds, underscoring the dataset's significant challenge.
Strengths
- The authors do a good job of elaborating six desiderata for a spurious correlations benchmark including multiple training environments, photo-realism and high fidelity backgrounds. This can act as general guidelines that future works in this area can build upon.
- The authors formally present the O2O and M2M spurious correlation settings, which helps make their contribution clear. See questions for some clarifications on these.
- Overall the paper is well written (Figure 3 is particularly nice) and addresses an outstanding concern of insufficient benchmarks for the spurious correlation/distribution shift community. Though, there are some other recent benchmarks (like PUG) that do the same (see Weaknesses), and addressing some of the points below (especially the hardness of M2M setting) would help distinguish this work from those.
Weaknesses
- Comparison with/discussion on some other relevant spurious correlation benchmarks that also use a synthetic/combinatorial construction pipeline like the PUG dataset is missing.
- Some understanding of the hardness of the proposed benchmark would be relevant. Presumably there is some optimal re-weighting function for this dataset. How does the JTT assigned weights compare with this?
- Explanation/discussion of why MixUp does better than other baselines in M2M setting would be helpful.
- Performance of some other recent baselines like RWY (uses group info), BR-DRO/LfF (does not use group info) is missing.
Overall, this paper makes an attempt towards a useful and much needed SC benchmark, but falls slightly short of building some understanding of the distinguishing characteristics of the proposed dataset -- it is unclear how M2M is different from O2O but with superclass/superattribute labels. It is possible that I missed some details. Therefore, I would be happy to consider raising my score after the authors have had a chance to respond to my questions.
Questions
- In the M2M case, there is still a one-to-one relationship between disjoint subgroups of classes and attributes. I did not fully understand why it is M2M, since it can still be thought of as O2O with respect to labels and attributes at a higher granularity? I imagined that the M2M case would involve overlapping subsets of classes and attributes.
- Why is the correlation completely flipped in the M2M case only, and not O2O case?
- In Figure 2c and 2d why is the correlation flipped and not randomized, i.e., zero correlation, which is typical of test distribution on existing datasets like waterbirds (unless you are looking at only the worst group as the test set)?
- What weights were used for group DRO, since the test set has correlations that do not appear at all in the training set, so theoretically the weights are infinite in this case?
Thank you for the constructive feedback that improves the quality of our submission! We address your comments below.
Comparison with/discussion on some other relevant spurious correlation benchmarks that also use a synthetic/combinatorial construction pipeline like the PUG dataset is missing.
We appreciate you highlighting the PUG dataset to us and have added a reference to PUG in the revised submission. Firstly, the PUG dataset was released long after (>6 months) our benchmark had been publicly released and already adopted in several DG works. Unfortunately, the PUG paper missed mentioning our benchmark. Secondly, to answer your actual question, PUG does not introduce spurious correlations between class and background in their benchmark, and neither do they attempt to balance the datasets by class and group. In this regard, we feel that PUG addresses a different problem and Spawrious directly contributes to the OOD community. Please let us know if there are other synthetic construction pipelines you have in mind and we will make sure to cite these.
Some understanding of the hardness of the proposed benchmark would be relevant.
Thanks for pointing this out! The hardness of the challenges differs due to the variation of the background features for a given location. For example, while the Jungle and Dirt locations seem to have little variation in the observed features, with forestry and dirt paths, the Mountain background varies substantially. Further, there does not seem to be much overlap between the Mountain and Snow background features and those of other locations, while Jungle and Dirt often overlap with each other in grassy dirt scenes. We will clarify this further in the revision.
Presumably there is some optimal re-weighting function for this dataset. How does the JTT assigned weights compare with this?
The optimal (reweighting) importance weights can be inferred from Table 2 in our initial submission. We will add the JTT-assigned weights to the next revision.
Explanation/discussion of why MixUp does better than other baselines in M2M setting would be helpful.
Thank you for pointing this out! MixUp reduces spurious correlations between background and class because much of the background information from the two interpolated images may be lost while edges and curves remain in greater detail, rendering the backgrounds less predictive of the class and the dog features more predictive. We will add this explanation to the next revision.
Performance of some other recent baselines like RWY (uses group info), BR-DRO/LfF (does not use group info) is missing.
Spawrious is designed to be class-balanced in O2O and both class- and location-balanced in M2M. Hence, re-weighting algorithms like RWY are not applicable. In other words, RWY rebalances the dataset for each class by its size in each group, but Spawrious classes are already balanced in each group. We will clarify this in the next revision, thanks for asking this.
We acknowledge that some baselines are missing from our benchmark; we curated those we believe are representative of the literature at large, such as IRM for group-info optimization and JTT for group-agnostic optimization. We intended to capture a wide variety of baselines to demonstrate the utility of our dataset. We look forward to extending our benchmark results to other methods (such as BrDRO) within a reasonable time frame.
Unfortunately, we could not find any repository for the LfF code. BrDRO is exciting to look into; however, the documentation provided is sparse, making it challenging to evaluate quickly within the rebuttal period. We have already evaluated JTT, another method that does not use group info.
Overall, this paper makes an attempt towards a useful and much needed SC benchmark, but falls slightly short of building some understanding of the distinguishing characteristics of the proposed dataset -- it is unclear how M2M is different from O2O but with superclass/superattribute labels. It is possible that I missed some details. Therefore, I would be happy to consider raising my score after the authors have had a chance to respond to my questions.
We agree that the community lacks an SC benchmark, and thank you for your recognition. M2M is qualitatively different from the O2O case because the spurious feature is no longer fully predictive of the class. In the O2O case, setting the spurious correlation to 1 would render the background features equally as predictive of the class as the dog features. In the M2M case, setting the spurious correlation to 1 renders the background predictive of 2 classes but insufficient to distinguish between the two classes. In our dataset, we set the O2O spurious correlation to less than 1, so the dog features are marginally more predictive than the background features. We set the M2M spurious correlation to 1 because the dog features are already much more predictive than the background features.
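To make the structural difference concrete, the following sketch shows how {class, background} pairs could be drawn under the two settings (purely illustrative; the class/background names and the O2O correlation strength are placeholders, not the exact construction used to build Spawrious):

```python
# Illustrative sketch of the two correlation structures discussed above
# (class/background names and the O2O correlation strength are placeholders).
import random

CLASSES = ["c1", "c2", "c3", "c4"]
BACKGROUNDS = ["s1", "s2", "s3", "s4"]

def sample_o2o_train(cls, p_spurious=0.9):
    # O2O: each class co-occurs with "its own" background with probability
    # p < 1, so the background alone is almost, but not fully, predictive.
    i = CLASSES.index(cls)
    if random.random() < p_spurious:
        return BACKGROUNDS[i]
    return random.choice([b for j, b in enumerate(BACKGROUNDS) if j != i])

def sample_m2m_train(cls):
    # M2M: the class group {c1, c2} always appears on background group
    # {s1, s2}, and {c3, c4} on {s3, s4} (group-level correlation of 1), so a
    # background narrows the label down to two classes but cannot separate them.
    g = 0 if cls in ("c1", "c2") else 1
    return random.choice(BACKGROUNDS[2 * g : 2 * g + 2])

def sample_m2m_test(cls):
    # M2M test split: the group-level pairing is flipped relative to training.
    g = 0 if cls in ("c1", "c2") else 1
    return random.choice(BACKGROUNDS[2 * (1 - g) : 2 * (1 - g) + 2])
```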
I thank the authors for their detailed response. Thank you for acknowledging PUG including that in your discussion.
- I understand that some SC baselines may not be easy to implement in the given time frame, but would encourage adding some discussion on why these baselines (LfF/BR-DRO/JTT/LISA etc.) may or may not fail in the M2M setting (in addition to some of the empirical observations already in the paper). This can motivate the need for algorithmic interventions specifically for M2M that current algorithms may lack, and can also expose some overfitting of these algorithms to the O2O setting.
- Regarding the final question about differences between M2M and O2O, it seems that in the M2M case the spurious correlation is less fatal on the target since it is not fully predictive of a single class on the source. Then why does ERM perform worse in M2M vs. O2O? Can the authors please explain? It is possible I missed something.
Thanks for getting back!
The LfF algorithm is weaker than the JTT algorithm on other O2O spurious correlation benchmarks, so we suspect that it would also perform worse than JTT on the O2O challenges. LISA should perform similarly to random shuffle on our benchmark because both approaches will mix across and within classes and groups. We suspect Br-DRO will perform similarly to JTT since it aims to reweight the dataset based on feature occurrences; however, since Spawrious is both group- and class-balanced in M2M, this is unlikely to be majorly helpful.
Algorithms that aim to maximise test performance on the M2M challenges can aim to regularise more layers than just the final layer, such as using IRM on the last 5 layers instead of just the last layer, because the spurious feature dependencies may be represented at earlier layers, and the dog features may be downstream of this. For example, imagine a decision tree, as depicted in Figure 2(b) of the submission, where the model first represents the background and decides which group of dogs the image could be representing. After using the background for this triaging, the classifier decides between the two dogs in this branch. Within this setting, the spurious feature dependence arises at the beginning of the decision tree. In the test data, this decision tree fails because the background group is wholly unpredictive of the class groups. As seen in Figure 2(d), the blue background group (s3, s4) is a feature used by the model to decide between classes (c3, c4), when in fact the model should be deciding between (c1, c2).
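A rough sketch of what such multi-layer regularisation could look like is given below (speculative pseudocode for the suggestion above, not an implementation from the paper; the tapped layers, auxiliary heads, and penalty weight are all assumptions):

```python
# Speculative sketch: apply an IRMv1-style penalty not only to the final logits
# but also to auxiliary linear heads reading intermediate features.
import torch
import torch.nn.functional as F

def irmv1_penalty(logits, y):
    # Standard IRMv1 penalty: gradient of the risk w.r.t. a dummy scale of 1.
    scale = torch.tensor(1.0, device=logits.device, requires_grad=True)
    loss = F.cross_entropy(logits * scale, y)
    grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
    return (grad ** 2).sum()

def multi_layer_irm_loss(feats_per_env, labels_per_env, heads, penalty_weight=100.0):
    """feats_per_env: {env: [features tapped at several depths]};
    heads: one linear head per tapped depth (assumed); labels_per_env: {env: labels}."""
    erm_loss, penalty = 0.0, 0.0
    for env, feats in feats_per_env.items():
        y = labels_per_env[env]
        for feat, head in zip(feats, heads):
            logits = head(feat.flatten(1))   # penalise non-invariance at each depth
            erm_loss = erm_loss + F.cross_entropy(logits, y)
            penalty = penalty + irmv1_penalty(logits, y)
    return erm_loss + penalty_weight * penalty
```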
Thank you for the clarification. It would be great if you can include the above two discussion points in the paper (at least in the appendix), since it gives weight to the exposition and helps the community understand why the different proposed algorithms may or may not fail in capturing M2M.
Overall I am OK with the presented arguments, though they are sometimes anecdotal and vague (for example the description of the dog classification through the decision tree lens). At the same time, I understand that this was probably only meant as a motivation. I will certainly consider raising my score after discussion with other reviewers.
The authors present a new dataset of images of four dog breeds, generated via a text-to-image model so that spurious correlations between the dog breed and the kind of background imposed in the generation prompt (e.g., jungle, mountain, snow, ...) can be created. In this way they generate spurious correlations of background and dog breed in the training data, which can then be changed in the test set. One reviewer is very positive and the others vote for weak reject, but some of them did not acknowledge the rebuttal.
Strengths:
- the dataset is with 152k images quite large and balanced across classes
- the authors test a large number of methods from the area of worst-group-accuracy-optimization
- during the rebuttal, the authors tested the validity of the generated images and report that 10 humans checked 480 images in total, of which 97.2% followed the prompt (however, to get an idea of the errors, some example images together with their prompts would have been useful)
Weaknesses:
- my main concern with the paper is that it provides little insight into the error structure of the trained classifiers. Given that one knows exactly which errors to expect, I would expect an evaluation of this using the confusion matrices; just presenting accuracies is insufficient. The analysis in Figure 5 is superficial and has to be made quantitative so that one can see how the different methods manage to compensate for the spurious features
- the authors mention the notion of harmful spurious features but they did not test how images just showing the background are classified
- only one architecture (ResNet50) is tested in the paper. In the rebuttal the authors reported additional results for several other architectures, but only in one setting, which seems to contradict the message that transformer architectures are worse for this task
- there are other real-world image datasets which can be used to test for spurious features, e.g. FOCUS (Kattakinda, Feizi, ICML 2022) or Spurious ImageNet (Neuhaus et al, ICCV 2023). It is important to discuss pros and cons of real-world versus synthetic datasets like PUG.
I appreciate the effort of the authors in creating this dataset, which allows for a detailed analysis of spurious features in a controlled environment. This could be a great contribution to the community, but at the moment the benchmark is lacking in terms of analysis and in providing guidance to the community, i.e., supplying not only the data but also the right measures that should be evaluated.
Why not a higher score
The benchmark could be useful to the community and allows for fine-grained control of object vs. background, but only one architecture is tested and, most importantly, no explicit evaluation of the errors due to the spurious correlation is introduced.
Why not a lower score
N/A
Reject