Neural Collapse Inspired Feature Alignment for Out-of-Distribution Generalization
Abstract
Reviews and Discussion
This paper addresses the problem of spurious correlations caused by the environments from which data are collected. The proposed method applies a mask to the input data to separate spurious and semantic features. The masked input data are fed into a local model specialized to each environment, and each local model is trained to induce neural collapse for OOD generalization.
Strengths
- S1: Making use of neural collapse for OOD generalization is interesting.
Weaknesses
- W1: Comparison with not only OOD generalization methods but also spurious correlation (sometimes called bias or shortcut) methods is necessary. Methods that can automatically detect and split spurious and semantic features have been developed [a-e].
- W2: Types of spurious features that the proposed method can handle need to be clarified. Can the proposed method handle spurious features in superposition, e.g., objects and textures?
- W3: The rationale behind the proposed method needs to be clarified. For instance, it is unclear why the method adds the noise to the mask when learning it.
- W4: Deeper analyses in the experiments would make the paper more interesting. For example,
  - Whether neural collapse is achieved by the proposed method should be confirmed in the experiments.
  - Visualizing learned masks would produce more valuable insights.
- W5: What is described in the introduction and what is done in the proposed method seem to be different. Although L42 states that "we propose to compute the Frobenius norm (F-norm) of the difference between the feature prototypes and the standard simplex ETF," the F-norm does not appear in the proposed method.
- W6: Writing and formatting can be improved. There are many inconsistent spellings. For example,
  - Is "variable features" in L153 the same as spurious features?
  - The meaning of "interaction" in L189, 192, and so on is unclear. Maybe "training"?
  - Such inconsistent spellings occur from Section 4 onward.
[a] Tiwari, Rishabh, and Pradeep Shenoy. "Overcoming simplicity bias in deep networks using a feature sieve." ICML 2023.
[b] Bahng, Hyojin, et al. "Learning de-biased representations with biased representations." ICML 2020.
[c] Yang, Wanqian, et al. "Chroma-VAE: Mitigating shortcut learning with generative classifiers." NeurIPS 2022.
[d] Liu, Evan Z., et al. "Just train twice: Improving group robustness without training group information." ICML 2021.
[e] Nam, Junhyun, et al. "Learning from failure: De-biasing classifier from biased classifier." NeurIPS 2020.
Questions
- Q1: Why do the accuracies of IRM (w/ env) in Tables 1 and 2 differ?
- Q2: When environment labels are not available, how are spurious and semantic features learned? Is there a possibility that the two types of features are learned conversely?
- Q3: How do we determine the number of local models when the total number of environments is unknown?
Limitations
Discussed in Section 6.
Response to W1: Thanks for the comments. We would like to explain as follows:
OOD generalization methods encompass techniques for addressing spurious correlations. Our comparison methods are comprehensive, covering approaches that handle spurious correlations, such as IRM, VREx, and GroupDRO, as well as other domain generalization (DG) methods, including MLDG, MMD, and CDANN. Additionally, we have now compared our work with the methods suggested by the reviewer. Our setting is distinct in that [a-e] address bias issues, while we focus on spurious correlations and follow [1] to benchmark our approach against state-of-the-art baselines.
Comparison with [a]: [a] utilizes an auxiliary network to identify and eliminate spurious features in the feature space early in training. In contrast, our method leverages the phenomenon of neural collapse to achieve feature alignment, thereby eliminating spurious features.
Comparison with [b]: [b] utilizes HSIC to achieve independence and learns invariant representations on biased data. In contrast, our method employs a two-stage interactive process, utilizing neural collapse to ultimately achieve learning of invariant features.
Comparison with [c]: [c] achieves invariant representation learning during network training by partitioning the hidden layer space into shortcut and complementary subspaces, guided by loss functions. In contrast, our method automatically partitions across different environments, leveraging the properties of class-invariant features to align class-specific features in various environments.
Comparison with [d]: [d] also follows a two-stage process: the first stage utilizes ERM for training, and the second stage handles instances where ERM identification errors occur. In contrast, our approach first partitions environments. Then, we employ neural collapse to align class-specific features.
Comparison with [e]: [e] involves constructing two networks: one amplifies bias information using GCE, representing variable information, while the other learns debiased information, representing semantic information. In contrast, we employ a two-stage interactive approach to learn invariant networks.
[1] Map: Towards balanced generalization of iid and ood through model-agnostic adapters, 2023.
Response to W2: Thanks for your question. We would like to explain as follows: We currently follow existing work, in which neither the baseline models in the literature nor the existing benchmarks incorporate superposed (stacked) features such as objects combined with textures.
Response to W3: Thanks for the comments. We would like to clarify as follows: Following the approach in [1, 2], we introduce some noise during the initialization of the masks to enhance their randomness, which enables the masks to better capture invariant features.
[1] Invariant Representation Learning for Multimedia Recommendation.
[2] Kernelized Heterogeneous Risk Minimization.
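As a minimal sketch of this initialization idea (our own illustrative code, not the paper's implementation; the function and parameter names are assumptions):

```python
import numpy as np

def init_mask_logits(d, noise_scale=0.1, seed=0):
    """Initialize mask logits near zero (sigmoid(0) = 0.5, i.e. an
    uncommitted mask) plus small Gaussian noise for randomness."""
    rng = np.random.default_rng(seed)
    return noise_scale * rng.normal(size=d)

def soft_mask(logits):
    """Map logits to a soft mask with entries in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-logits))
```

Starting near 0.5 means no feature dimension is committed to "semantic" or "spurious" at the outset, while the noise breaks symmetry so gradient updates can diverge across dimensions.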
Response to W4: Thanks for the comments. We would like to clarify as follows: In the introduction, we employ the F-norm to assess the feature configurations produced by different methods. According to the theory of neural collapse, after network training, features collapse onto a simplex. To visually illustrate this phenomenon, we use the F-norm to measure the discrepancy between the post-training features and a standard simplex, as depicted in Figure 1. In the main text, we utilize Eq. 5 to align the post-training features to the standard simplex using a push-pull mechanism. In different environments, invariant features are inherently more similar, and thus they consistently align to the same ETF.
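As a sketch of this F-norm measurement (our own illustrative code; we assume the comparison is between normalized Gram matrices of class-mean prototypes and the standard simplex ETF):

```python
import numpy as np

def simplex_etf(K):
    """K x K Gram matrix of a standard simplex ETF:
    1 on the diagonal, -1/(K-1) off the diagonal."""
    return (K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)

def etf_fnorm_gap(prototypes):
    """Frobenius distance between the scale-normalized Gram matrix of
    class-mean prototypes (shape (K, d)) and the simplex ETF Gram matrix.
    Smaller values indicate features closer to neural collapse."""
    K = prototypes.shape[0]
    G = prototypes @ prototypes.T
    G = G / np.linalg.norm(G)            # scale-invariant comparison
    target = simplex_etf(K)
    target = target / np.linalg.norm(target)
    return float(np.linalg.norm(G - target))
```

Normalizing both Gram matrices before taking the difference makes the gap insensitive to the overall feature scale, so it measures only the angular configuration of the prototypes.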
Response to W5: We will incorporate significant improvements in subsequent revisions of the paper.
Response to Q1: Thanks for the very constructive comments. We would like to clarify as follows: In Table 1, we directly apply the IRM algorithm to restrict parameter variations, thereby achieving invariant learning without a masking mechanism. In Table 2, to demonstrate that our approach with masking can also yield improved results, we substitute the mask-learning process with the IRM loss, which consists of a classification loss, a constraint term across environments, and a regularization term, where the per-environment term is the average loss value within environment $e$. Consequently, the accuracies differ between the two tables.
Response to Q2: Thanks for the comment. We would like to clarify as follows: As stated in our paper, when the environments are unknown, we first randomly partition the data into environments. Then, through iterative interaction between environment partitioning (Algorithm 1) and mask learning, we achieve convergence in both processes. We learn an invariant mask to distinguish between spurious and semantic features in the input data; the learning phase of the invariant mask minimizes the loss function defined in Eq. 5. When this process converges, the features align well with the standard simplex after training. Importantly, since we minimize the loss on the masked features, which represent semantic features, there is no issue of the two types of features being learned conversely.
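A minimal sketch of an Eq. 5-style alignment loss (our illustrative reading: a class-count-weighted softmax over fixed ETF prototypes $\mathbf{v}_k$, with scale $\beta$ and count exponent $\gamma$; the names and the exact form are assumptions, not the paper's verbatim definition):

```python
import numpy as np

def etf_align_loss(f, k, prototypes, counts, beta=1.0, gamma=1.0):
    """-log[ n_k^gamma * exp(beta * v_k . f) / sum_j n_j^gamma * exp(beta * v_j . f) ]
    f: (d,) masked feature; k: true class index;
    prototypes: (K, d) fixed ETF class prototypes;
    counts: (K,) per-class sample counts in the current environment."""
    scores = gamma * np.log(counts) + beta * (prototypes @ f)
    # log-sum-exp for numerical stability
    zmax = scores.max()
    log_z = zmax + np.log(np.sum(np.exp(scores - zmax)))
    return float(log_z - scores[k])
```

Minimizing this loss pulls the masked feature toward its class prototype and pushes it away from the others, which is the push-pull mechanism that aligns semantic features from all environments onto the shared simplex ETF.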
Response to Q3: Thanks for the comments. We would like to clarify as follows: When the number of environments is unknown, we treat it as a hyperparameter, in line with prior methods. By tuning this hyperparameter, we aim to achieve optimal out-of-distribution (OOD) generalization performance. In our paper, we also conducted ablation experiments investigating scenarios where the number of environments is unknown.
We compared against the papers provided by the reviewer. Because code is unavailable for papers [a, c, e], we evaluated the methods from [b, d] on ColoredCOCO and COCOPlaces. We also compared our approach with RUBi [f] and LearnedMixin [g]. Our method demonstrates superior performance compared to these approaches.
| Methods | ColoredCOCO | COCOPlaces |
|---|---|---|
| Ours | 63.9±0.5 | 43.7±0.7 |
| ReBias [b] | 56.0±0.2 | 39.2±0.2 |
| JTT [d] | 55.3±0.3 | 38.0±0.4 |
| RUBi [f] | 53.8±0.5 | 32.7±0.7 |
| LearnedMixin [g] | 52.0±1.4 | 30.2±0.5 |
[a] Overcoming simplicity bias in deep networks using a feature sieve. ICML 2023.
[b] Learning de-biased representations with biased representations. ICML 2020.
[c] Chroma-VAE: Mitigating shortcut learning with generative classifiers. NeurIPS 2022.
[d] Just train twice: Improving group robustness without training group information. ICML 2021.
[e] Learning from failure: De-biasing classifier from biased classifier. NeurIPS 2020.
[f] RUBi: Reducing unimodal biases for visual question answering. NeurIPS 2019.
[g] Don't take the easy way out: Ensemble based methods for avoiding known dataset biases. EMNLP 2019.
Thank you for the rebuttal and additional experiment.
For Q2, my question is whether the mask learns to remove the spurious features from input data. For instance, in ColoredCOCO, is there any possibility that the mask learns to remove the semantic features, i.e., the objects, and the predictive model learns classification using background colors? Without prior knowledge such as environment labels, the model and mask cannot distinguish which is the semantic feature.
We thank the reviewer for the further discussion. Below, we provide a detailed analysis of how, when environment labels are unknown, our method achieves environment partitioning and semantic-feature mask optimization through a two-stage interaction process. We also explain how the mask is utilized to learn invariant semantic information.
We define $m \odot x$ as the semantic or invariant features of the input $x$; consequently, $(1-m) \odot x$ represents the spurious or variable features of $x$. Our method iteratively learns and interacts in two stages, gradually achieving environment partitioning and distinguishing between semantic and spurious information.
Step 1: Environment Partitioning Process: Since the distinction between different environments arises from variations in spurious features, we use these spurious features to partition environments. For each sample $x$, we obtain the spurious features $(1-m) \odot x$ and initially assign environment labels at random. We then use the environment partitioning models to make predictions on these spurious features and reassign environment labels by maximizing the predicted values, as illustrated in Equation 4.
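A minimal sketch of this Step 1 reassignment (our illustrative reading of Equation 4: each sample moves to the environment whose local model scores its true label highest; the names and shapes are assumptions):

```python
import numpy as np

def reassign_environments(logits_per_env, labels):
    """logits_per_env: (E, N, K) logits from each environment's local model
    evaluated on the spurious features; labels: (N,) true class indices.
    Returns the new environment index for each sample."""
    E, N, K = logits_per_env.shape
    # Logit each environment's model assigns to each sample's true class.
    true_class_logits = logits_per_env[:, np.arange(N), labels]  # (E, N)
    # Assign each sample to the environment that explains it best.
    return np.argmax(true_class_logits, axis=0)                  # (N,)
```

Because each local model is fit on the spurious features of its own partition, a sample is pulled toward the environment whose spurious statistics it matches, which is what drives the partition toward spurious-feature clusters.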
Step 2: Mask Optimization and Learning Stage: For the data in each environment, we apply the invariant feature mask to obtain the semantic features of that environment. These semantic features are then used for model training across environments. Since the categories are consistent across environments, their semantic features are similar. Therefore, we employ neural collapse methods for feature alignment, aiming to align semantic features from different environments onto a common simplex equiangular tight frame (ETF), as shown in Equation 5,

$$\mathcal{L}(\mathbf{f}, k) = -\log\frac{n_{e,k}^\gamma\exp(\beta\cdot\mathbf{v}_k^T\mathbf{f})}{\sum_{k'\in K} n_{e,k'}^\gamma\exp(\beta\cdot\mathbf{v}_{k'}^T\mathbf{f})}, \quad k\in K,\ e \in \mathcal{E}.$$

This process yields the updated invariant mask and the newly defined environment partitions after one complete iteration. After multiple rounds of **Step 1: environment partitioning** and **Step 2: mask optimization**, if the data within the newly defined environments do not significantly differ from those in the environments of the previous iteration, we consider the environment partitioning to have converged. At this stage, we use the data from these environments to train the invariant network, aiming to achieve out-of-distribution (OOD) generalization. The final results reflect the effectiveness of our method, demonstrating the successful separation of semantic and spurious features using semantic masks. We look forward to your feedback if you have any further questions!

Thank you for your further clarification. I would like to increase the score. But I am still curious about how the two types of features are actually split, such as via learned masks.
We sincerely appreciate the reviewer raising the score. Regarding the remaining concern, we would like to offer further clarification. Although we are currently unable to use CAM visualization to demonstrate the mask's behavior, we plan to include a mask visualization in the final version for a more intuitive presentation. Below, we present more granular experimental results and a detailed analysis to clarify how the mask separates the two types of features.
Experiments
We designed an experiment using our proposed two-stage interactive learning process to obtain partitioned environments and the invariant mask $m$. To verify that $m$ can effectively distinguish between spurious and invariant (or semantic) features, we conducted additional experiments. Specifically, we used the feature information from $m \odot x$ and $(1-m) \odot x$ to predict labels during the test phase and compared their performance. The results show that using $m \odot x$ yields significantly better out-of-distribution (OOD) generalization performance than using $(1-m) \odot x$. This demonstrates that our proposed mask accurately separates invariant (or semantic) features from spurious ones.
| Methods | ColoredMNIST | ColoredCOCO | COCOPlaces |
|---|---|---|---|
| Ours (m) | 66.9±2.4 | 56.9±1.1 | 36.7±0.9 |
| Ours (1-m) | 21.8±0.9 | 16.3±1.8 | 8.4±0.6 |
Analysis
We randomly initialize the masks, so $m$ may not initially reflect the input's semantic features accurately. To obtain a semantically representative mask $m$, we align features of different categories from various environments to a standard simplex equiangular tight frame (ETF), leveraging the property of neural collapse (features from different classes form a canonical simplex ETF upon balanced training). In response to the reviewer's concerns, we address them using two metrics from our paper.
- The variation in environment partitioning: It is significant at the beginning because the initial mask does not accurately capture the invariant features of the input. Consequently, $1-m$ also fails to effectively represent spurious features, leading to suboptimal environment partitioning. However, as training progresses with the neural collapse algorithm, the semantic mask $m$ improves, leading to better separation of semantic and spurious features. Ultimately, environment partitioning stabilizes, showing minimal changes upon completion. The table below reports the changes in environment partitioning between the new and old environments over multiple interactions of our method on the ColoredCOCO dataset. The variation decreases over successive interactions, suggesting that $1-m$ increasingly captures spurious information for better environment partitioning.
| Number of interactions | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Changed samples | 3200 | 2011 | 1099 | 696 | 442 | 509 | 405 | 325 | 280 | 235 | 174 | 75 | 23 |
Here the count represents the number of samples whose environment assignment changed between the previous iteration (the old environments) and the current iteration (the new environments). As the number of iterations increases, this count gradually converges toward 0, indicating that the environment partitioning has converged.
- The loss incurred when using the neural collapse method for alignment across the models in different environments: Initially, the loss is high because $m$ is randomly initialized. However, as the number of interactions increases, the loss gradually decreases to a low level; at this point, $m$ effectively captures the semantic features across different environments. The table below shows the alignment loss at different stages of interaction of our method on the ColoredCOCO dataset. The successive decrease in loss indicates that $m$ increasingly captures the invariant (or semantic) features of the input.
| Number of interactions | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Alignment loss | 10.69 | 5.48 | 2.25 | 1.69 | 1.55 | 1.47 | 1.21 | 1.02 | 0.82 | 0.65 | 0.50 |
We sincerely hope the reviewer will reevaluate our paper based on the additional experiments and explanations provided above. We look forward to your feedback if you have any further questions!
Many thanks for taking time on reviewing our paper. We would like to provide a summary of the key issues raised by the reviewer. We addressed the reviewer's questions through both rigorous analysis and comprehensive experiments.
- Experiments: To verify whether $m$ can effectively distinguish between spurious and invariant features, we trained the invariant network using $m \odot x$ and $(1-m) \odot x$, and compared their performance during the final testing phase.
| Methods | ColoredMNIST | ColoredCOCO | COCOPlaces |
|---|---|---|---|
| Ours (m) | 66.9±2.4 | 56.9±1.1 | 36.7±0.9 |
| Ours (1-m) | 21.8±0.9 | 16.3±1.8 | 8.4±0.6 |
- Analysis: we address the reviewer's concerns using two metrics from our paper.
- The variation in environment partitioning:
| Number of interactions | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ColoredCOCO | 3200 | 2011 | 1099 | 696 | 442 | 509 | 405 | 325 | 280 | 235 | 174 |
- The loss incurred from using the neural collapse method for alignment in the models in different environments:
| Step | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ColoredCOCO | 10.69 | 5.48 | 2.25 | 1.69 | 1.55 | 1.47 | 1.21 | 1.02 | 0.82 | 0.65 | 0.50 |
We sincerely hope the reviewer will re-evaluate our paper based on the additional experiments and explanations provided above. We look forward to your feedback if you have any further questions!
Thanks for the further experiments. Most of my concerns about experiments were addressed, but I think writing should be improved. I would like to retain the current score.
The paper leverages the neural collapse inspired ETF behavior to simulate different environments in datasets, and uses it for OOD classification.
Strengths
The paper uses a phenomenon that is apparent in the standard setting for a task that differs from the standard setting. It applies intuitive notions to tackle the task of OOD classification. The paper's experiments are generally convincing.
Weaknesses
The paper seems generally consistent and well merited. The experiments are a bit lacking, but are convincing.
Questions
The following papers seem missing from the neural collapse literature that may be helpful:
Limitations
N/A
Response to Q1: Thank you for the suggestions. We will include citations to these papers in a subsequent version.
We sincerely appreciate the reviewer's positive feedback on our manuscript. Due to the time constraints of the rebuttal period, training on datasets like NICO, which require extensive computational resources, took longer than anticipated; we are therefore only now able to present all the supplementary experimental results. We sincerely hope that the reviewer will re-evaluate our work and decide the final rating based on the following discussions.
The following papers seem missing from the neural collapse literature that may be helpful.
We thank the reviewer for providing three important neural collapse literature, and would like to incorporate the following discussion in our final version.
- [1] employs the neural collapse algorithm in transfer learning to facilitate few-shot learning, enabling the training of a linear classifier on top of the learned penultimate layer and achieving good results. In contrast, our approach utilizes the neural collapse phenomenon in the context of imbalanced categories across different environments to learn invariant masks and perform environment partitioning, ultimately enhancing OOD generalization.
- [2] primarily leverages the phenomenon of neural collapse to guide semantic clustering of categories in self-supervised learning. This is achieved by using regularization to promote clustering within the same category and to separate clusters of different categories. In contrast, our method aligns features of the same category across different environments to learn invariant feature masks. These masks are then used to enable the network to better learn invariant features, thereby enhancing OOD generalization.
- [3] represents a highly significant contribution that can greatly enhance the quality of our work. It primarily focuses on using the neural collapse phenomenon to introduce a novel method for measuring the generalization bounds of neural networks. This method, being empirically non-vacuous and largely independent of depth, offers superior generalization bounds compared to traditional measurement approaches. In future versions, our method can draw on the approach presented in this paper to conduct a theoretical analysis of the generalization bounds.
We also investigated recent advances in applying neural collapse to other settings, as below.
- [4] investigates the use of the neural collapse phenomenon within a continual learning setting. The scenario involves introducing new classes in a few-shot manner after the model has been trained on a large dataset, leading to an imbalance between old and new classes and resulting in severe catastrophic forgetting. By employing the neural collapse method, the model can align old class features and separate new class features, thereby preserving knowledge of old classes while learning new class knowledge effectively.
- [5] addresses the problem of sample imbalance in large models using the neural collapse method. By leveraging the property of neural collapse where features between classes exhibit a standard ETF distribution after balanced training, the method guides the collapse of features across different categories. This approach enhances the performance of large models in the context of sample imbalance.
The experiments are a bit lacking, but are convincing.
To make the experiment section of our paper more convincing, we have added extensive experiments, as below.
- We conducted additional ablation experiments exploring the effect of mask position (i.e., different neural network layers) on OOD performance, applying masking to the output features of different layers in the network.
| Methods | ColoredCOCO | COCOPlaces | ColoredMNIST |
|---|---|---|---|
| Output Layer (suggested) | 63.9 ± 0.5 | 43.7 ± 0.7 | 58.7 ± 2.8 |
| Layer 0 | 62.3 ± 0.6 | 41.2 ± 0.5 | 54.2 ± 1.5 |
| Layer 1 | 63.3 ± 0.4 | 44.5 ± 0.3 | 53.6 ± 1.7 |
| Layer 2 | 61.2 ± 0.2 | 40.2 ± 0.1 | 50.6 ± 0.3 |
| Layer 3 | 63.2 ± 0.3 | 43.2 ± 0.6 | 51.2 ± 0.9 |
We found that placing the mask at the output layer (i.e., the layer preceding the linear classifier) gives the best results in general, which is also consistent with the findings of neural collapse.
- In comparison with the two-stage HRM method, we found that our approach stably outperforms HRM.
| Methods | ColoredMNIST | ColoredCOCO | COCOPlaces | NICO |
|---|---|---|---|---|
| Ours | 66.9 ± 2.4 | 56.9 ± 1.1 | 36.5 ± 0.8 | 85.9 ± 0.2 |
| HRM [6] | 61.3 ± 0.6 | 52.7 ± 0.7 | 33.4 ± 0.6 | 79.6 ± 0.9 |
- We also compared our method with more competing baselines and found that our method outperforms them.
| Methods | ColoredCOCO | COCOPlaces | NICO |
|---|---|---|---|
| Ours | 63.9 ± 0.5 | 43.7 ± 0.7 | 85.4 ± 0.6 |
| ReBias [7] | 56.0 ± 0.2 | 39.2 ± 0.2 | 78.4 ± 1.7 |
| JTT [8] | 55.3 ± 0.3 | 38.0 ± 0.4 | 75.5 ± 0.3 |
| RUBi [9] | 53.8 ± 0.5 | 32.7 ± 0.7 | 73.4 ± 0.2 |
| LearnedMixin [10] | 52.0 ± 1.4 | 30.2 ± 0.5 | 73.7 ± 0.1 |
We will definitely put the above additional experimental results into our revised manuscript -- thank you so much!
[1] "On the role of neural collapse in transfer learning." arXiv, 2021.
[2] "Reverse engineering self-supervised learning." NeurIPS, 2023.
[3] "Comparative generalization bounds for deep neural networks." TMLR, 2023.
[4] "Neural collapse inspired feature-classifier alignment for few-shot class incremental learning." ICLR, 2023.
[5] "Bridging the gap: neural collapse inspired prompt tuning for generalization under class imbalance." KDD, 2024.
[6] "Heterogeneous risk minimization." ICML, 2021.
[7] "Learning de-biased representations with biased representations." ICML, 2020.
[8] "Just train twice: Improving group robustness without training group information." ICML, 2021.
[9] "Rubi: Reducing unimodal biases for visual question answering." NeurIPS, 2019.
[10] "Don't take the easy way out: Ensemble based methods for avoiding known dataset biases." EMNLP, 2019.
Dear reviewer F1YK,
Since the discussion period will end in a few hours, we will be online waiting for your feedback on our rebuttal, which we believe has fully addressed your concerns.
We would highly appreciate it if you could take into account our response when updating the rating and having discussions with AC and other reviewers.
Authors of # 10491
The spurious correlation between image background features and their labels is a significant research problem, and existing research suffers from difficulty in decoupling the two. This paper proposes a new approach to the spurious correlation problem that alternately performs environment partitioning and semantic mask learning from the perspective of neural collapse. Extensive experiments are conducted on four datasets, and the results show that the proposed method significantly improves out-of-distribution performance.
Strengths
This paper explores an important and widespread problem in real-world applications with solid and extensive experiments. The writing is clear and the narrative is easy to follow, facilitating an understanding of the spurious correlations problem. The use of neural collapse is particularly innovative.
Weaknesses
W1: In lines 48-50, it is mentioned that IRM-based methods learn similar representations from different environments, indicating a lack of proper alignment. Could you provide a corresponding experiment to demonstrate this phenomenon?
W2: In Figure 3, the explanation of the middle module that uses logits to judge the environment is unclear. Could you please clarify the structure of the local models, the number of local models used, and the specific meaning of the logit values?
W3: Could you explain the differences between masks based on pixel-level and feature-level approaches? If using feature-level masks, what is the impact of different network-layer features on model performance?
W4: This work addresses an important and interesting question by introducing neural collapse from an invariance perspective, which I believe can provide valuable insights to the community. However, my main concern is that the same mask is used to learn both invariant and variable feature information. What are the advantages of the mask learning mechanism proposed in this paper compared to HRM's [1] mask mechanism?
[1] Heterogeneous Risk Minimization
Questions
For more information see Weaknesses.
Limitations
Yes, the authors have adequately described the limitations in their submission.
Response to W1: Thank you for pointing out the issue. We indeed omitted an explicit comparison in the manuscript; however, we have demonstrated this phenomenon in Figure 1, where we used the F-norm to measure the degree of alignment. A smaller F-norm indicates that, after training, the feature prototypes are closer to the standard ETF.
Response to W2: Thank you for the reminder. We sincerely apologize for the misunderstanding caused by our oversight. A precise description will be provided in future revisions.
As shown in Figure 3, our method represents the variable features of the input as $(1-m) \odot x$. Using $1-m$, we obtain the variable features (equivalent to the environmental information) of the previously partitioned data. The neural networks corresponding to the different environments are then trained on these variable features. After training, the networks for the different environments predict their respective logits; for each input, the environment whose network assigns the maximum logit to the input's true label determines the input's environment. The structure of each local model is the same as that of the local model later used for alignment with the neural collapse algorithm.
Response to W3: Thanks for the valuable comments. We would like to explain as follows: The primary difference between the two approaches is the location of the masking. Pixel-level masking is applied to the input image and offers stronger interpretability. Feature-level masking, on the other hand, is applied to the output features of the neural network, achieving better results and reducing the dimensionality of the mask. To address the reviewer's concern, we performed experiments applying masking to the output features of different layers in the network.
| Methods | ColoredCOCO | COCOPlaces |
|---|---|---|
| Ours | 63.9±0.5 | 43.7±0.7 |
| Ours layer0 | 62.3±0.6 | 41.2±0.5 |
| Ours layer1 | 64.3±0.3 | 44.5±0.3 |
| Ours layer2 | 61.2±0.2 | 40.2±0.1 |
| Ours layer3 | 63.2±0.3 | 43.2±0.6 |
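As a minimal sketch of the feature-level masking described above (our own illustrative code; we assume a soft mask gating the chosen layer's output features, with $1-m$ giving the complementary variable part):

```python
import numpy as np

def split_features(features, mask_logits):
    """features: (N, d) outputs of the chosen network layer;
    mask_logits: (d,) learnable parameters of the mask.
    Returns (semantic, variable) feature tensors that sum to the input."""
    m = 1.0 / (1.0 + np.exp(-mask_logits))  # soft mask in (0, 1)
    return features * m, features * (1.0 - m)
```

Because the mask lives in the d-dimensional feature space rather than the pixel space, it has far fewer parameters than a pixel-level mask, which matches the dimensionality-reduction point made above.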
Response to W4: Thanks for the valuable comments. We would like to explain as follows: Our method aligns invariant features across different environments by leveraging the phenomenon of neural collapse. For the same category, invariant features should be similar; thus, we use the same ETF for alignment, facilitating mask learning. Although the HRM algorithm also employs a two-stage approach, it partitions environments through clustering and then learns the mask through regularization. We conducted comparative experiments with the HRM algorithm to evaluate our method.
| Methods | ColoredMNIST | ColoredCOCO | COCOPlaces |
|---|---|---|---|
| Ours | 66.9±2.4 | 56.9±1.1 | 36.5±0.8 |
| HRM | 61.3±0.6 | 52.7±0.7 | 33.4±0.6 |
Thank you for your rebuttal!
I want to thank the authors for their tremendous efforts during the rebuttal process. The additional comparative experiments have significantly improved the quality of the paper. The authors conducted relevant experiments to demonstrate the impact of mask positioning, thoroughly analyzed the differences between the proposed method and the HRM approach, and provided extensive experimental evidence to validate the effectiveness of their method, addressing my concerns. Based on these points, I have decided to raise my score to 7.
This paper received 3 reviews from experts in the field. The paper received the following reviews: 1 Accept, 1 Borderline Accept and 1 Borderline Reject.
The main concerns presented by the reviewers included experimentation and clarity. The author rebuttal with additional experiments persuaded the reviewers to increase their ratings. The AC agrees with the points presented by the reviewers and the decision for this paper is to accept. Please include the additional experiments in your paper and address the reviewers’ concerns about clarity in the camera ready version.