Enhancing Compositional Generalization via Compositional Feature Alignment
Proposes a two-stage finetuning method to enhance the compositional generalization ability of pretrained vision encoders
Abstract
Reviews and Discussion
The paper deals with the challenge of compositional generalization, which in detail is the generalization to unseen domain-class combinations. To this end, the paper proposes CG-Bench, a suite of CG benchmarks derived from existing real-world image datasets. Furthermore, Compositional Feature Alignment (CFA), a two-stage finetuning technique is proposed. Evaluation is performed on CG-Bench using CLIP and DINOv2 vision foundation models fine-tuned using the proposed CFA approach.
Strengths
· The paper is well written and easy to understand, e.g., Figure 2 is very useful for understanding the definition of Compositional Feature Structure.
· The proposed method is novel and interesting, and it seems to work well in the case of datasets such as Color-CIFAR. The results in Figure 4 show that the learned features are disentangled across classes and domains.
· The proposed CG-Bench compositional generalization benchmark is also novel and well-curated.
Weaknesses
· Theorem 1 holds only at the global minimum of the objective function (4). However, in practice when the features Z are from large neural networks such as CLIP the global minimum is unlikely to be obtained. In that case, features would likely not conform to the compositional feature structure defined in Definition 1. Therefore, it is unclear if Theorem 1 has any practical significance.
· While the features seem to be disentangled in the case of the simple Color-CIFAR dataset as shown in Figure 2, it is not clear if the same effect can be observed in the case of more complex datasets such as DomainNet.
· Stability of the model during training is not discussed in detail. This is important because of the complex two-stage training process. The paper should include experiments where the number of training steps in the first and second stage are varied and analyze the effect on the performance of the final model.
· From the results in Table 1, the performance gain over the reweight strategy is minimal (<1%) for all datasets. The biggest performance gain comes from the use of WiSE-FT (Wortsman et al., 2022). Also, compared to the reweight strategy, the proposed approach uses two-stage training. Therefore, it is not clear whether the increased training complexity is justified by the small performance gain.
· In Table 1, the reweight baseline with WiSE seems to be missing for the DINOv2 model. Comparing reweighting and the proposed CFA method, the gain in performance without WiSE seems to be <0.3%, especially in the case of OfficeHome and DomainNet.
· Can the proposed approach take advantage of unlabeled data? This is important because prior work such as CLIP or DINOv2 does not need explicit domain/class labels, unlike the proposed CFA approach.
Questions
· For models trained using CFA, do we observe the disentanglement similar to Color-CIFAR in Figure 2 in case of larger datasets such as DomainNet?
· Additional details of the reason for the small performance gain of the proposed CFA approach over the reweight baseline in Table 1 would be helpful.
1. Feature Alignment in Large Neural Nets (e.g., CLIP)
To address your concern regarding whether large neural networks can conform to the compositional feature structure (as defined in Definition 1) using our CFA algorithm, we conducted feature visualization for a CLIP model in both its pretrained and CFA-trained versions. The comparative visualization is presented in Figure 7 in Appendix D of our revised manuscript, which you can review at this link. From this figure, it is evident that the features fine-tuned with CFA align well with a compositional feature structure, unlike the original CLIP features. We believe this visualization further justifies the ability of our proposed CFA objective to align features of large neural networks with the desired structure.
2. Feature Disentanglement on Real-World Dataset
We believe the feature visualization for the DomainNet dataset, as mentioned above, also addresses the reviewer's concern: the CFA-finetuned CLIP model exhibits a feature structure with disentanglement similar to what is seen in the Color MNIST visualization in the paper.
3. Training Complexity
To address your concerns regarding the training stability of CFA's two stages, we conducted experiments with different training lengths for each stage. The results are provided in item 3 of our General Response above (link). The table shows that Stage-1 training is quite stable, as it involves only optimization over linear models. Additionally, the training cost of Stage-1 (linear probing) is almost negligible compared to Stage-2 (backbone finetuning). Stage-1 requires only a single forward pass over the training samples for feature gathering (features are stored in CPU memory), and the linear probing can then be optimized quickly on the CPU. The training cost for Stage-2 is slightly less than that of standard full finetuning, as the last layer remains frozen during Stage-2 of CFA. Overall, we conclude that the implementation complexity and additional training costs of CFA are mild, making it a practical method for addressing compositional generalization problems in real-world applications.
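To make the Stage-1 cost profile concrete, below is a minimal sketch (not our exact implementation) of the one-off feature gathering and CPU-side linear probing; `backbone` and `train_loader` are placeholder objects, and the actual CFA Stage-1 additionally trains an orthogonal domain head.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def gather_features(backbone, loader, device="cuda"):
    """One forward pass over the training set; features are cached on CPU."""
    backbone.eval().to(device)
    feats, labels = [], []
    for images, ys in loader:
        z = backbone(images.to(device))        # (batch, feat_dim) features
        feats.append(z.cpu().numpy())
        labels.append(ys.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# Stage-1 cost profile: a single feature-extraction pass, then a linear
# model fit on CPU (negligible next to Stage-2 backbone finetuning).
features, class_labels = gather_features(backbone, train_loader)
class_head = LogisticRegression(max_iter=1000).fit(features, class_labels)
```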
4. Performance Gain
Regarding the performance gain of CFA over reweighting, a strong baseline method, we acknowledge that the improvement is not significant. However, our main contribution lies in designing a principled algorithm for compositional generalization, a novel challenge in OOD generalization. As our theoretical analysis and feature visualization in Figures 4 and 7 demonstrate (on Color MNIST and DomainNet, respectively), our method aligns features with a compositional structure suitable for compositional generalization. In OOD generalization research, it's typical for specialized algorithms to only modestly outperform baselines like ERM, as evidenced by extensive benchmarking in studies like DomainBed [1] and WILDS [2]. Thus, we consider CFA's performance gain to be meaningful and convincing. Additionally, the mild implementation complexity and training costs (discussed in the last item) make CFA a viable method for enhancing compositional generalization.
Moreover, we want to emphasize that the real-world datasets used in our paper contain many labeling issues (for both class and domain labels), which may prevent CFA from achieving a larger performance gain. Notably, CFA employs both class and domain labels for training, leading to an increased susceptibility to label noise, particularly from domain labels. Below, we illustrate two types of labeling noise encountered:
- (i) Mislabeling: We observe that the DomainNet dataset contains many mislabeled images. Many images in the “Clipart” domain are actually real photos of a single object on a white background, which should be categorized into the “Product” domain. On the other hand, numerous class labels are also mislabeled: images in the “bush (plant)” class contain pictures of President George Bush of the USA; the “cooler” class incorrectly includes images of electric fans; the “square (shape)” class contains tables, which should be placed in the “table” class instead. Note: We only manually examined a very tiny subset of the DomainNet dataset, which comprises 0.6 million images, and have already found many mislabeled images. Therefore, overall, we believe the mislabeling ratio is not negligible.
- (ii) Ambiguous Labels: We observe that certain domains contain a large number of images that are visually similar to those in another domain. For example, in both Office-Home and DomainNet, the images in the “Product” domain are real photos of objects, making them almost indistinguishable from their counterparts in the “Real” domain. The distinguishing feature of the “Product” domain is that its images all have a white background; however, some images in the “Real” domain also share this characteristic. Additionally, in DomainNet, the “Infograph” domain contains many images stylistically similar to those in “Clipart” or “Real”; some images in the “Painting” domain are sketches, despite the presence of a separate “Sketch” domain. This ambiguity issue extends to class labels as well. In DomainNet, the classes “cup,” “coffee cup,” and “mug” lack clear stylistic distinctions.
In addition to these issues, other aspects of these datasets may affect learning. For instance, the iWildCam dataset includes a class labeled “empty,” signifying the absence of animals, which comprises a significant portion of the dataset.
5. Reweighting Does Not Work with WiSE-FT for the DINOv2 Backbone
We did not include the results of Reweighting+WiSE-FT for DINOv2 in Table 1, as we had observed its ineffectiveness before the submission of this paper. We re-ran the experiments for Reweighting with DINOv2 on DomainNet over 3 random seeds and provide the mean accuracy results in the table below.
It is evident that WiSE-FT significantly decreases the performance of Reweighting in terms of both ID and OOD accuracy. The primary reason is that, unlike CLIP, which includes a text encoder capable of providing a zero-shot linear classifier on top of its image encoder, DINOv2 is solely an encoder without an initial classifier (the last linear layer). Therefore, when applying WiSE-FT to the Fine-tuning and Reweighting methods on DINOv2, the initial parameters of the last linear layer are set to zeros and then interpolated with the final finetuned last linear layer. However, the interpolated image encoder may not align well with the interpolated last layer, leading to a decrease in both ID and OOD performance.
| Method | ID Acc | ID Acc (WiSE) | OOD Acc | OOD Acc (WiSE) |
|---|---|---|---|---|
| Reweight-E | 81.8 | 40.7 | 5.1 | 3.6 |
| Reweight-YxE | 81.5 | 41.2 | 5.3 | 3.1 |
We will run Reweighting+WiSE-FT experiments for DINOv2 on more datasets, and provide results in the appendix of the next revision.
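For clarity, here is a minimal sketch of the WiSE-FT weight interpolation discussed above; `alpha` and the state-dict handling are illustrative assumptions rather than the exact code behind the numbers in the table.

```python
import copy

def wise_ft(pretrained_state, finetuned_state, alpha=0.5):
    """WiSE-FT: linearly interpolate matching parameters of two checkpoints."""
    merged = copy.deepcopy(finetuned_state)
    for name, w_ft in finetuned_state.items():
        merged[name] = (1 - alpha) * pretrained_state[name] + alpha * w_ft
    return merged

# For DINOv2 there is no pretrained classifier head, so the "pretrained"
# entry for the last linear layer has to be filled in before interpolation,
# e.g. with zeros (our original choice) or with a random initialization.
```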
6. Use of Unlabeled Data
First, we wish to clarify the nature of label supervision for CLIP and DINOv2. Specifically, CLIP training involves using text captions for each image, which typically include the object class and often domain or contextual information (see examples at this article). Therefore, CLIP's training approach is essentially weak supervision rather than unsupervised learning. This textual supervision enables CLIP's direct application in zero-shot classification. In contrast, DINOv2 undergoes unsupervised training and thus requires supervised finetuning before it can be used for classification tasks.
Unlike CLIP and DINOv2, which are pretraining methods for large-scale image data (e.g., datasets with 400 million images or more), our CFA method is designed for finetuning pretrained image backbones. Our objective is not to replace CLIP or DINOv2. Instead, we aim to demonstrate how the typical applications of CLIP (for zero-shot classification) and DINOv2 (for supervised finetuning) encounter challenges within the compositional generalization problem setup. CFA is introduced to finetune these pretrained image models, enhancing their capabilities in compositional generalization contexts.
In our compositional generalization benchmark (CG-Bench), we observed that the zero-shot prediction accuracy of CLIP was unsatisfactory, indicating the need for further finetuning with class labels. Similarly, the DINOv2 model, a bare image encoder, has to be fine-tuned with supervision to obtain classification capabilities. Overall, class labels are necessary for achieving effective results in CG-Bench. We recognize that the additional requirement of domain labels may not always be feasible in real-world applications. To address this, we propose two solutions to lessen the need for a fully domain-labeled dataset:
- Use partially available domain labels, e.g., only 10% of training samples have domain labels.
- Use zero-shot predicted domain labels by CLIP when no domain labels are provided.
The experimental results, detailed in the General Response above, show that our CFA method maintains most of its OOD performance advantages even with limited domain label availability, such as with only 10% of labels. In situations where no domain labels are available, employing CFA with zero-shot predicted domain labels by CLIP still shows improvement compared to the standard finetuning method. Overall, these results indicate that CFA's dependency on domain labels is quite mild.
References
[1] Gulrajani et al. In Search of Lost Domain Generalization. ICLR 2021
[2] Koh et al. Wilds: A benchmark of in-the-wild distribution shifts. ICML 2021
Dear Reviewer ZjmZ:
Thanks for the review. The authors have uploaded their responses to your comments. Please check if the rebuttal addresses your concerns and if you have further questions/comments to discuss with the authors. If the authors have addressed your concerns, please adjust your rating accordingly.
AC
The rebuttal has addressed my concerns. I urge the authors to include the discussion, for example, limited performance gains (point 4) in the paper.
The discussions regarding training stability and overhead are also not included in the revision of the paper, nor do the authors promise to include them. Regarding training, the rebuttal does not quantitatively discuss the overhead: parameters, iterations, speed, memory.
(5) For reweighting with DINOv2, why are the layer weights set to zero? Would random initialization not be better in this case, if we finetune the layers for better gradients?
Given the above, I will keep my score.
Dear Reviewer ZjmZ,
We appreciate the comments and suggestions in your latest feedback today, and we responded to it 5 hours ago [link]. However, we have just noticed that the "Readers" list for that response does not include reviewers, as the OpenReview system does not give us the option to add reviewers as "Readers" to our response. We are not sure if our previous response is visible to you. Therefore, we have copied our response below for your reference.
We are pleased to know that our response has addressed your concerns. We also appreciate your suggestion to integrate our discussions into the manuscript. Accordingly, we have updated the manuscript with the following new content:
- Appendix B.1: Discussion on Training Stability and Overhead
- Appendix B.2: Discussion of Performance Gain
- Appendix B.3: Partial Availability of Domain Labels
As for the reweighting experiments with DINOv2, we thank you for pointing out that weight interpolation with a randomly initialized head might be more appropriate than our choice of zero head weights. Following your suggestion, we perform weight interpolation (WiSE-FT) between the pretrained DINOv2 with a randomly initialized head and the final trained model (starting from this random initialization). We repeat the evaluation over three random seeds (the head initialization depends on the random seed) and report the mean accuracy below. From the results shown in the table below, we do observe an improvement of WiSE-FT with the randomly initialized head over the zero head weights. However, the WiSE version's ID performance still significantly underperforms the original finetuned version; the OOD performance approaches that of the original version but does not surpass it. Hence, our conclusion that WiSE-FT does not work for reweighting with DINOv2 still applies.
| Method | ID [Finetuned] | ID [WiSE] (Zero-Init) | ID [WiSE] (Random-Init) | OOD [Finetuned] | OOD [WiSE] (Zero-Init) | OOD [WiSE] (Random-Init) |
|---|---|---|---|---|---|---|
| Reweight-E | 81.8 | 40.7 | 50.9 | 5.1 | 3.6 | 5.1 |
| Reweight-YxE | 81.5 | 41.2 | 51.4 | 5.3 | 3.1 | 5.2 |
We hope this new response and the updated manuscript can address your concerns. Please let us know if you have further questions.
Dear Reviewer ZjmZ,
Please check Paper 6129 authors' revised paper and the latest response to your questions to see if they have addressed all your concerns. If so, you might consider adjusting your rating accordingly.
AC
This paper tackles the challenge of data distribution shifts in machine learning applications. In particular, it emphasizes the multi-domain, multi-class setup where obtaining training data for every domain-class combination becomes impractical, and examines the compositional generalization (CG) ability of learned models. The authors propose a simple solution in the form of CFA and introduce "CG-Bench," a suite of CG benchmarks derived from real-world image datasets. They test CLIP and DINOv2 and show the effectiveness of both the proposed benchmark and the method.
Post rebuttal
I appreciate the two solutions provided (to lessen the need for a fully domain-labeled dataset). I encourage the authors to incorporate all the discussion into their revision.
Strengths
I personally believe the studied topic of compositional generalization is crucial in real-world machine learning applications, especially in scenarios with multi-domain, multi-class setups. The introduction of "CG-Bench" is commendable, providing the research community with a dedicated suite of benchmarks for evaluating CG performance.
The proposed two-stage process is clear and logically structured, with a rationale that suggests a theoretical underpinning for the method. I personally quite like Figure 4, which provides a great comparison between vanilla CLIP features and the features obtained by this paper.
Weaknesses
It is hard to obtain the class and domain labels. Therefore, the studies in this paper are hard to scale up to real-world systems.
Based on Table 1, it seems the proposed method usually has a negative effect on ID accuracy.
Questions
Besides, I feel the caption of Figure 2 could be improved to help readers understand the goal of the proposed method.
Please also address the concerns raised above.
1. Label Availability
CLIP is trained in a weakly supervised manner, and its classification performance can often be enhanced through supervised finetuning, such as on ImageNet. In our compositional generalization benchmark (CG-Bench), we observed that the zero-shot prediction accuracy of CLIP was unsatisfactory, indicating the need for further finetuning with class labels. Similarly, the DINOv2 model, a bare image encoder, has to be fine-tuned with supervision to obtain classification capabilities. Overall, class labels are necessary for achieving effective results in CG-Bench. We recognize that the additional requirement of domain labels may not always be feasible in real-world applications. To address this, we propose two solutions to lessen the need for a fully domain-labeled dataset:
- Use partially available domain labels, e.g., only 10% of training samples have domain labels.
- Use zero-shot predicted domain labels by CLIP when no domain labels are provided.
The experimental results are detailed in item 2 of our General Response above (please review at this link). The results show that our CFA method maintains most of its OOD performance advantages even with limited domain label availability, such as with only 10% of labels. In situations where no domain labels are available, employing CFA with zero-shot predicted domain labels by CLIP still shows improvement compared to the standard finetuning method. Overall, these results indicate that CFA's dependency on domain labels is quite mild.
2. ID-OOD Performance Trade-off
In the literature of OOD generalization and domain generalization, it is widely observed that methods aiming to improve OOD performance usually suffer from some mild in-distribution (ID) performance degradation [1,2]. Additionally, from the theoretical perspective of invariant feature learning [3,4], enhancing OOD performance often requires the omission of domain-specific features, which naturally leads to a decrease in ID accuracy.
In the empirical results of Table 1 of our manuscript, when comparing the two rows of CFA vs. Fine-tuning in a pairwise manner, one can observe that for the CLIP model, the ID (in-distribution) performance of Fine-tuning is always slightly better than that of CFA. However, for DINOv2, the ID performance of CFA consistently outperforms Fine-tuning across all four datasets. Overall, the difference in ID performance between CFA and other methods is quite small for both models across all datasets. We believe these findings justify that CFA maintains competitive ID performance while improving OOD performance in compositional generalization benchmarks.
References
[1] Gulrajani et al. In Search of Lost Domain Generalization. ICLR 2021
[2] Koh et al. Wilds: A benchmark of in-the-wild distribution shifts. ICML 2021
[3] Arjovsky et al. Invariant Risk Minimization. 2019
[4] Rosenfeld et al. The Risks of Invariant Risk Minimization. ICLR 2021
Dear Reviewer d1KC:
Thanks for the review. The authors have uploaded their responses to your comments. Please check if the rebuttal addresses your concerns and if you have further questions/comments to discuss with the authors. If the authors have addressed your concerns, please adjust your rating accordingly.
AC
This paper studies the challenge of compositional generalization in machine learning and focuses on generalization to unseen domain-class combinations. The authors present a real-world benchmark named CG-Bench and propose the Compositional Feature Alignment method to improve the CG performance of pretrained models. Extensive experiments demonstrate the effectiveness of the method.
Strengths
S1: This paper solves a new setting or problem: can the model generalize to unseen domain-class combinations?
S2: The overall writing is clear, including the problem introduction and the theoretical and experimental verification of the proposed method.
S3: Some visualization experiments are shown to help the understanding of the reviewer or reader.
Weaknesses
W1: The setting of compositional generalization (CG) means that the pre-trained model is tested on unseen domain-class combinations, while the setting of domain generalization means that the pre-trained model is tested on samples from unseen domains? If yes, what is the difference between CG and papers on open-set tasks that solve domain generalization problems?
W2: To solve the CG problem, the authors propose a method to align the class information across different domains. I am curious about why the two-stage training method can achieve this goal. If possible, I would like to see visualization experiments for each stage.
W3: I am also concerned about the different training stages and ablation experiments related to the orthogonal loss.
Questions
see weaknesses
If the authors resolve my concerns, I tend to improve my score.
2. Explanation of Two Stages of CFA and Feature Visualization on Real-World Dataset
Stage-1 of CFA involves linear probing, which trains two orthogonal linear classifiers, a class head and a domain head, on features of the pre-trained backbone. The backbone remains frozen in Stage-1, and the primary objective of this stage is to establish a compositional feature structure via the two trained orthogonal classifiers.
In Stage 2, CFA freezes the two trained classifier heads and only fine-tunes the backbone (image encoder) – this process modifies the features. Drawing from research on neural collapse [3,4,5], it is understood that supervised learning with cross-entropy loss causes the features of samples from a given class to collapse towards the corresponding column vector of the classifier weight matrix (for a visual illustration of neural collapse, refer to Fig. 1 in this ICLR 2022 paper). In our context of two orthogonal linear classifiers, our theoretical analysis indicates that for each training sample associated with a given class and domain, its feature will collapse towards the composition of the corresponding class and domain classifier vectors. Given the orthogonality constraint on the two classifiers, the resulting collapsed features will adhere to the compositional feature structure defined in Definition 1.
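To make the two-stage procedure concrete, the following is a minimal PyTorch sketch of Stage-1 linear probing with a class head, a domain head, and an orthogonality penalty; the loss form, the coefficient `lam`, and all names are illustrative assumptions based on the description above, not a verbatim copy of our implementation. In Stage-2, the heads of such a probe would be frozen while only the backbone is updated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadProbe(nn.Module):
    """Frozen-backbone linear probe with a class head and a domain head."""
    def __init__(self, feat_dim, num_classes, num_domains):
        super().__init__()
        self.class_head = nn.Linear(feat_dim, num_classes, bias=False)
        self.domain_head = nn.Linear(feat_dim, num_domains, bias=False)

    def forward(self, z):
        return self.class_head(z), self.domain_head(z)

def stage1_loss(probe, z, y, e, lam=100.0):
    """Cross-entropy on both heads plus a penalty encouraging orthogonal heads."""
    logits_y, logits_e = probe(z)
    ce = F.cross_entropy(logits_y, y) + F.cross_entropy(logits_e, e)
    # Orthogonality penalty: every class-head row should be orthogonal to
    # every domain-head row, i.e. the Gram matrix between them is near zero.
    ortho = (probe.class_head.weight @ probe.domain_head.weight.T).pow(2).sum()
    return ce + lam * ortho
```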
As per your request, we visualized features using both the pretrained CLIP ViT model and the one fine-tuned with CFA on the DomainNet dataset. Due to limitations in 3D visualization, we selected 2 out of 6 domains and 3 out of 345 classes for representation. As figures cannot be inserted in OpenReview posts, the visualization is provided in Fig. 7 of Appendix D in the revised manuscript. Please refer to this PDF for review. This visualization clearly demonstrates that CFA can align features with the intended compositional feature structure. We are confident that this visualization effectively addresses your concern regarding the feature alignment of CFA with real data.
3. Ablation Study on Orthogonal Loss
We fixed other hyper-parameters as presented in the manuscript, only changing the orthogonal loss coefficient in Stage-1 of CFA. For each set of hyper-parameters, we repeated the experiment with three seeds. The table below shows that the out-of-distribution (OOD) performance peaks with the coefficient set to 100, which is the value used in our manuscript, and that a large orthogonal loss coefficient (e.g., >=100) is necessary to encourage orthogonality between the heads.
| Orthogonal Loss Coef. | ID | ID (WiSE) | OOD | OOD (WiSE) |
|---|---|---|---|---|
| 1 | 93.2 | 92.8 | 46.4 | 46.6 |
| 10 | 94.0 | 93.6 | 50.8 | 53.3 |
| 100 (Chosen) | 94.0 | 93.1 | 54.3 | 56.9 |
| 1000 | 93.8 | 92.8 | 53.9 | 55.2 |
In addition to studying the orthogonal loss coefficient, we also conducted experiments on the training steps of the two stages of CFA. These results, detailed in item-3 ("Two-Stage Training Stability") of our General Response above (link), demonstrate that the final performance is relatively stable with respect to the number of training steps in each stage of CFA.
References
[1] Shu et al. Open Domain Generalization with Domain-Augmented Meta-Learning. CVPR 2021
[2] Zhu et al. CrossMatch: Cross-Classifier Consistency Regularization for Open-Set Single Domain Generalization. ICLR 2022
[3] Papyan et al. Prevalence of neural collapse during the terminal phase of deep learning training. PNAS. 2020
[4] Zhu et al. A Geometric Analysis of Neural Collapse with Unconstrained Features. NeurIPS 2021
[5] Kothapalli et al. Neural Collapse: A Review on Modelling Principles and Generalization. TMLR. 2023
1. Problem Setting
Compositional generalization (CG) explicitly assumes that all domains and classes are seen during training, with only certain combinations of them being unseen, which are used for OOD evaluation. As a comparison, in Open-Set Domain Generalization (OS-DG) [1,2], test samples are from a new domain, and classes unseen in the training phase can appear during the test phase. Notably, OS-DG typically requires models to assign the “unknown” class label to test samples from classes unseen in the training set, which essentially constitutes an auxiliary task of out-of-distribution (OOD) detection.
Note: while OS-DG does not explicitly state requirements on test domains, it generally poses certain implicit assumptions about the test domains. For example, the OS-DG problem setup as described in [1] implicitly places distributional assumptions on domains, implying that the test domain is related to the training domains in a specific manner (so that data augmentation techniques, such as feature-level MixUp applied to training domains, can partially encompass test domains).
Below, we use two tables to illustrate the differences between CG and OS-DG, where "✓" marks a domain-class combination seen during training.
- Compositional Generalization (CG): In the CG table, combinations marked with “✓” are present in the training set, while those labeled “Test” are not. The CG challenge is to determine if models trained on “✓” combinations can accurately generalize to “Test” combinations.

|  | Class 1 | Class 2 | Class 3 |
|---|---|---|---|
| Domain 1 | ✓ | Test | Test |
| Domain 2 | Test | ✓ | Test |
| Domain 3 | ✓ | Test | ✓ |
- Open-Set Domain Generalization (OS-DG): In the OS-DG table, the symbols “✓” and “Test” have the same implications as in the CG table. Additionally, “X” denotes a domain-class combination that is unavailable during training and is also not evaluated. The class “Unknown” encompasses all classes not present during training, with models expected to predict the “Unknown” class for these.

|  | Train Class 1 | Train Class 2 | Train Class 3 | Class “Unknown” |
|---|---|---|---|---|
| Train Domain 1 | ✓ | X | X | X |
| Train Domain 2 | X | ✓ | X | X |
| Train Domain 3 | ✓ | X | ✓ | X |
| Test Domain | Test | Test | Test | Test |
In summary, while CG and OS-DG problems are similar, they aim for different outcomes in OOD generalization. CG seeks to generalize to unseen domain-class combinations composed of domains and classes seen during training. OS-DG can be viewed as domain generalization (DG) with an additional task of OOD detection.
Dear Reviewer H15m:
Thanks for the review. The authors have uploaded their responses to your comments. Please check if the rebuttal addresses your concerns and if you have further questions/comments to discuss with the authors. If the authors have addressed your concerns, please adjust your rating accordingly.
AC
I have understood the difference between compositional generalization (CG) and open-set domain generalization (OS-DG) in the problem setting from the two tables. However, from the first table, the problem setting of CG is somewhat similar to compositional zero-shot learning (CZSL) [1]. It is recommended that the authors provide comparative explanations in the related work.
After carefully reading the author's responses, I feel that my concerns have been resolved and I raised the score to 6.
[1] Hao et al. Learning Attention as Disentangler for Compositional Zero-shot Learning. CVPR2023.
Dear Reviewer H15m,
We are pleased to know that our response has addressed your concerns. Regarding Compositional Zero-Shot Learning (CZSL), in the first version of this manuscript, we provided a discussion of it and compared it with Compositional Generalization (CG) in the related works section.
When comparing these two approaches, one can see that CG does not limit the type of image classifiers, while CZSL is specifically exclusive to vision-language models (e.g., CLIP) that are trained with image-text paired data. In certain real-world domains, such as remote sensing or medical imaging, there is a lack of paired image-text data to train strong vision-language models. Therefore, using self-supervised encoders (e.g., DINO, MAE), which do not need labels, presents a more practical strategy for these domains [2,3]. In this work, as evidenced by our experiments on CLIP and DINOv2, our proposed CFA can work with both vision-language models and self-supervised models. In contrast, CZSL cannot be directly applied to self-supervised models.
We thank you for bringing the paper [1] to our attention, which we did not cite in our initial manuscript. We have added this reference to the related works section and updated our manuscript accordingly.
Let us know if you have more questions or suggestions regarding our manuscript. We greatly appreciate your feedback.
References
[1] Hao et al. Learning Attention as Disentangler for Compositional Zero-shot Learning. CVPR 2023.
[2] Cong et al. SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery. NeurIPS 2022
[3] Wanyan et al. DINO-MC: Self-supervised Contrastive Learning for Remote Sensing Imagery with Multi-sized Local Crops. 2023
We would like to thank all the reviewers for their insightful and constructive feedback on our paper. Below, we provide responses to comments shared across reviewers.
1. Feature Visualization
Following the suggestions of Reviewers H15m and ZjmZ, we conducted a feature visualization study for the CLIP ViT-B/16 image encoder on the DomainNet dataset. Specifically, we take the encoder both before and after finetuning with CFA, visualizing features for 2 domains and 3 classes. This resulted in 6 unique domain-class combinations (5 out of which are present in the training set). The visualization, provided in Figure 7 in Appendix D of the revised manuscript, clearly shows that the features finetuned with CFA conform to a compositional feature structure, similar to the ColorMNIST visualization shown in Figure 4(b). In contrast, the pretrained features do not exhibit this structure. This visualization not only demonstrates the feature alignment ability of CFA but also provides further evidence of its effectiveness for large neural networks trained on real-world data.
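For reference, a visualization of this kind can be produced roughly as in the sketch below (project cached features of the selected domain-class combinations to 3D with PCA and scatter them); the function and variable names are illustrative and not taken from our codebase, and integer-coded class/domain labels are assumed.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection)
from sklearn.decomposition import PCA

def plot_feature_structure(features, class_labels, domain_labels, out_path):
    """3D PCA scatter of features, colored by class and marked by domain."""
    coords = PCA(n_components=3).fit_transform(features)
    fig = plt.figure()
    ax = fig.add_subplot(111, projection="3d")
    markers = ["o", "^", "s", "x", "d", "*"]
    for d in np.unique(domain_labels):          # one marker per domain
        mask = domain_labels == d
        ax.scatter(coords[mask, 0], coords[mask, 1], coords[mask, 2],
                   c=class_labels[mask], marker=markers[int(d) % len(markers)],
                   cmap="tab10", s=8)
    fig.savefig(out_path, dpi=200)
```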
2. Partial Availability of Domain Labels
To address the concerns of Reviewers d1KC and ZjmZ on the availability of domain labels, we divide this problem into two scenarios:
i) domain labels are partially available
ii) domain labels are completely unavailable.
Corresponding experiments on the Office-Home dataset are designed and conducted to demonstrate that CFA remains effective even in the absence of domain labels. In the first scenario, we conduct experiments using only 10%, 20%, and 50% of the domain labels in Stage-1 to learn the linear head for finetuning in Stage-2. We repeat each experiment 3 times using different subsets of the data and present the average result to mitigate the randomness caused by subsampling. For the second scenario, we propose to leverage the zero-shot ability of CLIP to predict the domain labels. To be specific, we first manually design a set of labels that can describe the domains of the data (for the experiment we just use the original domain names). Then, we perform zero-shot classification of the training data over these domain labels using the CLIP model. Finally, we take the predicted domain labels as ground-truth domain labels and use them in Stage-1. We adopt the hyperparameters shown in the manuscript and present the mean accuracy of CFA and WiSE-FT over 3 seeds on the Office-Home dataset. From the following table, we can see that CFA works well even when domain labels are only partially available.
| Method | Domain label ratio | ID | ID (WiSE) | OOD | OOD (WiSE) |
|---|---|---|---|---|---|
| CFA | 100% (Original) | 94.0 | 93.1 | 54.3 | 56.9 |
| CFA | 50% | 94.0 | 93.3 | 53.8 | 56.9 |
| CFA | 20% | 93.9 | 93.1 | 53.8 | 56.7 |
| CFA | 10% | 94.0 | 93.2 | 53.4 | 55.9 |
| CFA | 0% (CLIP Predict) | 94.1 | 93.0 | 52.0 | 53.6 |
| Finetune | 0% | 94.3 | 93.7 | 51.0 | 52.5 |
| LP-FT | 0% | 93.5 | 93.0 | 43.9 | 42.8 |
The mild reliance of CFA's performance on domain label availability is beneficial for the practical application of this method but may raise questions about why the reliance is so mild. We believe there are two main reasons: i) the number of domains is small (4-6 domains for each of the 4 datasets in CG-Bench), so the available data per domain is relatively abundant, and ii) domain labels are easier to predict than class labels, since image style or background information is highly indicative of the domain label, and these visual features are easy for neural networks to capture.
We clarify this by conducting a simple ablation study: in the Office-Home dataset with 4 domains, we fit linear classifiers to CLIP features for different amounts of domain labels (1%, 2%,..., 100% of the training data), and show the test prediction accuracy for domains below. Each experiment is repeated three times with different sampling seeds, and the mean accuracy is reported below.
| Domain Label Ratio | 0% (Zero-Shot) | 10% | 20% | 50% | 100% |
|---|---|---|---|---|---|
| Avg. #Data per Domain | 0 | 308 | 616 | 1541 | 3082 |
| Accuracy (%) | 61.5 | 82.0 | 84.6 | 85.5 | 86.3 |
From the table above, we can observe that as the domain labeling ratio increases from 10% to 100%, the domain prediction accuracy modestly improves from 82.0% to 86.3%, indicating a relatively small enhancement. This suggests that the domain label is indeed quite easy to predict, which could explain why our CFA can work well with partial availability of domain labels. Furthermore, the zero-shot prediction accuracy of the CLIP ViT-B/16 model for domain labels is 61.5%, significantly higher than a random guess (25%). This outcome explains why CFA with CLIP-predicted domain labels can also improve over vanilla fine-tuning.
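For completeness, below is a minimal sketch of the CLIP zero-shot domain labeling described in item 2 above; the prompt template, the use of the OpenAI `clip` package, and all variable names are illustrative assumptions (the actual domain names would be the dataset's own, e.g., the four Office-Home domains).

```python
import torch
import clip

@torch.no_grad()
def predict_domain_labels(pil_images, domain_names, device="cuda"):
    """Zero-shot domain prediction with CLIP: pick the closest domain prompt."""
    model, preprocess = clip.load("ViT-B/16", device=device)
    prompts = clip.tokenize([f"a {d} image" for d in domain_names]).to(device)
    text_feat = model.encode_text(prompts)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    batch = torch.stack([preprocess(img) for img in pil_images]).to(device)
    img_feat = model.encode_image(batch)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ text_feat.T).argmax(dim=-1)   # predicted domain indices
```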
Class Label Availability. As for class labels, we do require their full availability for the training set, since our work focuses on supervised finetuning of pretrained encoders. Meanwhile, although CLIP is not explicitly supervised by domain and class labels, the texts in the image-text pairs (used for CLIP pretraining) provide abundant class and environment information. Furthermore, CLIP's unsatisfactory zero-shot performance on our CG-Bench indicates the need for supervised finetuning to improve its effectiveness. On the other hand, DINOv2 is a self-supervised pretraining method, so it needs to be fine-tuned with supervision before being applied to downstream classification tasks. Overall, the above discussion is meant to justify that class label availability cannot be waived for the compositional generalization task considered in our paper.
3. Two-Stage Training Stability
In this ablation study, we address the concerns of Reviewers H15m and ZjmZ and justify the training stability of our two-stage method. All of the following experiments are conducted for CLIP ViT-B/16 on the Office-Home dataset and are an average over 3 seeds.
We first show that our method is stable with respect to the number of iterations in Stage-1. From the following table, we can see that as the linear heads are trained longer in Stage-1, the ID (WiSE) performance slightly increases and the OOD accuracy reaches its peak at 4000 to 6000 iterations. Aside from the trade-off between ID and OOD performance, our method is stable, with no drastic drop in accuracy. The 6000 iterations for Stage-1 were the hyperparameter used in our work.
| Stage-1 Steps | ID Acc | ID Acc (WiSE) | OOD Acc | OOD Acc (WiSE) |
|---|---|---|---|---|
| 2000 | 94.0 | 92.8 | 54.3 | 56.9 |
| 4000 | 94.0 | 93.1 | 54.3 | 57.3 |
| 6000 (Chosen) | 94.0 | 93.1 | 54.3 | 56.9 |
| 8000 | 94.0 | 93.6 | 53.3 | 56.5 |
| 10000 | 94.0 | 93.7 | 52.1 | 55.0 |
We then demonstrate how training epochs in Stage-2 will affect the final performance of our model. The table below shows that for longer training epochs, the ID accuracy increases and the OOD accuracy decreases. Hence, in our paper, we used 3 epochs for the best OOD performance.
| Epochs | ID Acc | ID Acc (WiSE) | OOD Acc | OOD Acc (WiSE) |
|---|---|---|---|---|
| 3 (Chosen) | 94.0 | 93.1 | 54.3 | 56.9 |
| 5 | 94.1 | 93.5 | 52.3 | 56.3 |
| 10 | 94.1 | 93.7 | 52.2 | 55.7 |
Dear Reviewers,
Thanks for the reviews. The authors have uploaded their responses to your comments. Please check if the rebuttal addresses your concerns and if you have further questions/comments to discuss with the authors. If the authors have addressed your concerns, please adjust your rating accordingly or vice versa.
AC
Dear Reviewers,
We want to thank you again for providing your valuable feedback on our paper, "Enhancing Compositional Generalization via Compositional Feature Alignment."
We have addressed all the reviewer comments in author responses, which were submitted 2 days ago. We are writing to kindly request you to take a look at our detailed responses when you get a chance. We would greatly appreciate any additional feedback you may have or if you could help engage in a discussion with us.
As authors, we are very eager to address any remaining concerns you may have and clarify any parts of our work to strengthen the paper. Please let us know if you need any additional information from our side.
Sincerely,
Authors of Submission 6129
This paper investigates the compositional generalization (CG) ability of machine learning models on unseen domain-class combinations. It presents a suite of CG benchmarks for evaluating the CG capability of ML models, and further proposes a two-stage fine-tuning algorithm, Compositional Feature Alignment (CFA), to learn compositional features from pre-trained models. Extensive experiments demonstrate the effectiveness of the method.
Strengths:
- The paper investigates a new CG setting/problem.
- It provides both theoretical and experimental verification.
- It introduces a new CG benchmark.
- It is well written and easy to understand.
Weaknesses:
- An ablation study on each training stage is missing.
- It is hard to obtain domain/class labels and to scale up to real-world problems.
- Stability of the model during training is not discussed in detail.
- Theorem 1 holds only at the global minimum of the objective function.
Why not a higher score
This is a borderline paper. Reviewers acknowledge that this investigation is worthwhile. However, there are shared concerns about the partial availability of domain labels, training stability, and scalability to real-world settings. The authors address some of these, providing additional experiments and a revised paper, but these were not enough to sway the reviewers. I think this paper is not ready for publication at the current stage, and this decision gives the authors more time to further improve the paper.
Why not a lower score
N/A
Reject