PaperHub
Overall rating: 5.3/10 (Rejected; 4 reviewers)
Individual ratings: 3, 6, 6, 6 (min 3, max 6, standard deviation 1.3)
Average confidence: 3.5; Correctness: 2.8; Contribution: 2.5; Presentation: 3.3
TL;DR

Practical algorithm that block disentangles environment-invariant and environment-dependent features for domain generalization and batch correction

Abstract

Real-world datasets often combine data collected under different experimental conditions. Although this yields larger datasets, it also introduces spurious correlations that make it difficult to accurately model the phenomena of interest. We address this by learning two blocks of latent variables to independently represent the phenomena of interest and the spurious correlations. The former are correlated with the target variable $y$ and invariant to the environment variable $e$, while the latter depend on $e$. The invariance of the phenomena of interest to $e$ is highly sought-after but difficult to achieve on real-world datasets. Our primary contribution is an algorithm called Supervised Contrastive Block Disentanglement (SCBD) that is highly effective at enforcing this invariance. It is based purely on supervised contrastive learning, and scales to real-world data better than existing approaches. We empirically validate SCBD on two challenging problems. The first is domain generalization, where we achieve strong performance on a synthetic dataset, as well as on Camelyon17-WILDS. SCBD introduces a single hyperparameter $\alpha$ that controls the degree of invariance to $e$. When we increase $\alpha$ to strengthen the degree of invariance, there is a monotonic improvement in out-of-distribution performance at the expense of in-distribution performance. The second is a scientific problem of batch correction. Here, we demonstrate the utility of SCBD by learning representations of single-cell perturbations from 26 million Optical Pooled Screening images that are nearly free of technical artifacts induced by the variation across wells.
Keywords
disentanglement, block disentanglement, out-of-distribution generalization, domain generalization, distribution shift, spurious correlations, robustness

Reviews and Discussion

Official Review (Rating: 3)

The work proposes a new disentanglement method that is able to learn predictors (e.g., inferring a disease from histology images) that are invariant to spurious correlations arising from different environments in which the training data was collected (e.g., histology images coming from different hospitals). The ansatz is most closely related to adversarial approaches in which representations are learnt that are invariant to the environment. Instead of an adversarial objective, however, this paper introduces an easier-to-optimise contrastive objective.

Strengths

  • Provides an interesting new ansatz to domain generalisation which might be easier to optimise than adversarial approaches to domain generalisation.
  • It's encouraging to see that the regularisation parameter induces a clear trade-off between validation and test performance.
  • The problem is still highly relevant, in particular in data-constrained settings.

Weaknesses

  • The central weakness of the work is the experimental validation: the proposed method sits squarely in a long line of work on domain generalisation. However, the related benchmarks (e.g., DomainBed) are only mentioned, but there is almost no comparison to existing methods.
  • The paper introduces its own baseline based on variational approaches, but it's not clear to me why we would expect this baseline to learn invariances against the environment. The argument in lines 210 - 213 does not hold up because it's unclear why $q_\phi(z_c \mid x)$ wouldn't learn to use environmental features in order to infer $y$ (which it should to match $p_\theta(z_c \mid y)$).
  • The relation to identifiability (lines 490 - 496) is not correct: identifiability doesn't make a causal argument as to how a feature y is extracted from the data - it's generally only applicable IID and to probe causality or feature reliance, one would need to probe OOD which identifiability (usually) says nothing about.
  • I found the theoretical outline in chapter 2 rather convoluted.

Questions

  • Why are you not following the typical protocol for evaluating domain generalization methods?
  • Why do you create your own baseline VAE instead of comparing against the many existing works in domain generalisation?
  • Why is the proposed VAE approach expected to learn invariances against the environment?

Comment

Thank you for taking the time to review our paper, and for providing cogent criticisms regarding our domain generalization baselines. We agree with your points, and conducted additional experiments to address them.

The central weakness of the work is the experimental validation: the proposed method sits squarely in a long line of work on domain generalisation. However, the related benchmarks (e.g., DomainBed) are only mentioned, but there is almost no comparison to existing methods.

We agree, and conducted additional experiments so that our domain generalization experiments now compare against six standard baseline algorithms: ERM, CORAL, DANN, IRM, Fish, and GroupDRO. Therefore, our evaluation is now consistent with the convention in the literature. CORAL and Fish are state-of-the-art methods without block disentanglement, and ERM does not handle distribution shifts. The results are in Table 1 in the revision, and show that SCBD outperforms these baselines.

The paper introduces its own baseline based on variational approaches, but it's not clear to me why we would expect this baseline to learn invariances against the environment. The argument in lines 210 - 213 does not hold up because it's unclear why $q_\phi(z_c \mid x)$ wouldn't learn to use environmental features in order to infer $y$ (which it should to match $p_\theta(z_c \mid y)$).

We agree: our argument is flawed because environment-dependent spurious correlations could be used to match $p_\theta(z_c \mid y)$. We removed this paragraph from the revision.

We have reduced our expectations for the iVAE. We argue it should be less sensitive to $e$ compared to the other VAE baselines that do not condition on $e$. This is because the iVAE decoder uses both $z_c$ and $z_s$ to reconstruct $x$, and there is no incentive to put all of the environment-dependent correlations in $z_c$, since $z_s$ is designed to encode these correlations. Therefore, at least some of the spurious correlations should move away from $z_c$ into $z_s$. In contrast, the VAEs that do not condition on $e$ have no mechanism to enforce this.

However, we do not believe the iVAE will be invariant to the environment to the level of SCBD, since the latter explicitly regularizes for this. Also, as discussed above, we no longer consider iVAE as being the baseline for domain generalization.
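
To make the architectural point about the iVAE concrete, here is a minimal PyTorch-style sketch of an iVAE-like model of the kind described (the class name and layer sizes are our hypothetical choices, not the paper's exact architecture): $q(z_c \mid x)$ ignores $e$, $q(z_s \mid x, e)$ conditions on $e$, and the decoder reconstructs $x$ from both blocks, so environment-dependent signal has a natural home in $z_s$.

```python
import torch
import torch.nn as nn

class BlockConditionalVAE(nn.Module):
    """Illustrative iVAE-style architecture (a sketch with hypothetical layer sizes,
    not the paper's exact model): q(z_c | x) ignores e, q(z_s | x, e) conditions on e,
    and the decoder reconstructs x from both blocks."""
    def __init__(self, x_dim, e_dim, zc_dim, zs_dim, hidden=256):
        super().__init__()
        self.enc_c = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 2 * zc_dim))
        self.enc_s = nn.Sequential(nn.Linear(x_dim + e_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 2 * zs_dim))
        self.dec = nn.Sequential(nn.Linear(zc_dim + zs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x, e_onehot):
        mu_c, logvar_c = self.enc_c(x).chunk(2, dim=1)
        mu_s, logvar_s = self.enc_s(torch.cat([x, e_onehot], dim=1)).chunk(2, dim=1)
        # Reparameterized samples from each posterior block.
        z_c = mu_c + torch.randn_like(mu_c) * (0.5 * logvar_c).exp()
        z_s = mu_s + torch.randn_like(mu_s) * (0.5 * logvar_s).exp()
        x_hat = self.dec(torch.cat([z_c, z_s], dim=1))
        return x_hat, (mu_c, logvar_c), (mu_s, logvar_s)
```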

The relation to identifiability (lines 490 - 496) is not correct: identifiability doesn't make a causal argument as to how a feature y is extracted from the data - it's generally only applicable IID and to probe causality or feature reliance, one would need to probe OOD which identifiability (usually) says nothing about.

We removed this paragraph in the revision, and focused the discussion on component-wise and block disentanglement.

I found the theoretical outline in chapter 2 rather convoluted.

Would it be possible for you to elaborate on which parts were unclear? We'd like to improve the clarity of this section.

Why are you not following the typical protocol for evaluating domain generalization methods?

We are now following the typical protocol, as discussed above.

Why do you create your own baseline VAE instead of comparing against the many existing works in domain generalisation?

We include the iVAE results as a baseline, solely because it performs well on the batch correction problem - it outperforms CellProfiler on the CORUM task, while having a similar sensitivity to the environment. Our results show that a simple iVAE is effective for batch correction, as long as the posterior over $z_s$ conditions on $e$. We believe this result may be of independent interest, since VAE-based approaches are popular for this problem.

However, we agree that the iVAE should not be the baseline for domain generalization. Therefore, as mentioned above, we now compare SCBD to the conventional set of domain generalization baselines.

Why is the proposed VAE approach expected to learn invariances against the environment?

Please see our answer above.

Comment

I appreciate the efforts of the authors and the corrections you've done. However, while it is great that you now compare against SOTA methods, you still don't test on standard benchmarks like DomainBed as I've asked for in my initial review. Hence, the comparison to the literature is still flawed as I cannot assess whether the method truly outperforms other methods on benchmarks that the current SOTA has been rigorously tested on.

We provide additional results on PACS and VLCS, which are particularly popular datasets within DomainBed. The results are in Appendix Sections A.2.3 and A.2.4 in our revision. We now explain the context of these results.

The recent consensus in the domain generalization literature is that the standard benchmark datasets represent a variable set of distribution shifts [1,2]. This makes it important to characterize the nature of distribution shift that an algorithm is expected to work on, and evaluate on datasets that exhibit this particular shift.

An example of this is Eastwood et al. (2023) [3], who, like us, experimented with one synthetic and one realistic dataset (CMNIST and PACS from DomainBed). The difference between our papers is that theirs includes theory, while we additionally solve batch correction using a large-scale dataset with 26 million images. The algorithm in Eastwood et al. (2023) assumes a different kind of distribution shift compared to SCBD, and is therefore effective on a different set of datasets. This is clear from the title of their paper - they harness spurious features, while we remove them. Their algorithm works on PACS but not on Camelyon17-WILDS, and they provide negative results for the latter in their appendix. In contrast, SCBD works on Camelyon17-WILDS, but not on PACS and VLCS.

So how can we characterize the different kinds of distribution shifts that occur in these datasets? One way is to run a hyperparameter sweep with ERM, and consider the sign and magnitude of correlation between in- and out-of-distribution performance. Wenzel et al. (2022) [1] carried out a large-scale empirical study across 172 datasets, including those from DomainBed and WILDS, and found that in general the correlation is positive. In this case, maximizing in-distribution performance is a good proxy for maximizing out-of-distribution performance. This may explain why the authors of DomainBed found ERM to be state-of-the-art across their datasets.

More recently, Teney et al. (2024) [2] showed that in some cases, in- and out-of-distribution performance are negatively correlated. As we explain in the newly added Section 4.1.2 in our revision, SCBD assumes this negative correlation holds. Such a negative correlation can occur when there exist spurious features in the training environments where the more you learn them, the better you do in-distribution, and the worse you do out-of-distribution. SCBD prevents learning such spurious features, since they are predictive of the training environments.
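
For concreteness, the diagnostic described above can be sketched in a few lines (illustrative code with made-up accuracy values, not results from the paper): run an ERM hyperparameter sweep, record in- and out-of-distribution accuracy for each configuration, and check the sign of their correlation.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical (in-distribution, out-of-distribution) accuracies from an ERM
# hyperparameter sweep; real values would come from actual training runs.
id_acc = np.array([0.91, 0.93, 0.95, 0.96, 0.97])
ood_acc = np.array([0.78, 0.74, 0.71, 0.69, 0.66])

# Sign and magnitude of the ID/OOD correlation across the sweep.
r, _ = pearsonr(id_acc, ood_acc)
if r < 0:
    print(f"Negative ID/OOD correlation (r = {r:.2f}): the kind of shift SCBD assumes.")
else:
    print(f"Positive ID/OOD correlation (r = {r:.2f}): SCBD is not expected to help.")
```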

In our results on PACS, we consider each environment {art painting, cartoon, photo, sketch} as being the test environment, and run a hyperparameter sweep with ERM to show that in- and out-of-distribution performance are positively correlated. These results are in Appendix Figures 17–20 (a), and are consistent with Wenzel et al. (2022) [1]. This is evidence that PACS violates the assumptions of SCBD. We then show that as we increase $\alpha$, we remove environment-dependent features that generalize on both the training and test environments, and observe a decrease in both validation and test accuracy (Appendix Figures 17–20 (b)). The same conclusions hold for VLCS as well (Appendix Figures 21–24). Finally, in Appendix Tables 6 and 7, we compare test accuracy with the standard baseline algorithms.

We believe these additional results on PACS and VLCS have strengthened our paper. We now make a specific assumption regarding the nature of distribution shift, which is that in- and out-of-distribution performance are negatively correlated. We empirically test this assumption, and show that it holds on CMNIST and Camelyon17-WILDS, and is violated on PACS and VLCS. Consequently, SCBD improves robustness on the first two datasets, and does not on the latter two.

We, however, agree with the reviewer that it is much more desirable to have an algorithm for domain generalization that works universally (e.g., across all tasks in DomainBed and WILDS). Unfortunately, our work, as well as many of the existing studies, have not yet made meaningful progress in this direction. We leave it as an important future direction for the community to pursue.

[1] Wenzel et al., Assaying out-of-distribution generalization in transfer learning. NeurIPS, 2022.
[2] Teney et al., ID and OOD performance are sometimes inversely correlated on real-world datasets. NeurIPS, 2024.
[3] Eastwood et al., Spuriosity didn't kill the classifier: using invariant predictions to harness spurious features. NeurIPS, 2023.

Comment

Do you have any questions regarding the additional DomainBed experiments that we performed upon your request? We think these experiments are helpful in showing that SCBD requires datasets to exhibit a negative correlation between in- and out-of-distribution performance. Also, our work is now better-connected to the recent literature suggesting this correlation is an important differentiator between domain generalization datasets.

We're happy to make any clarifications in the last remaining days of the discussion period. If you have no further questions, would you consider increasing your score? Thank you again for your time.

Official Review (Rating: 6)

The paper proposes Supervised Contrastive Block Disentanglement (SCBD), using supervised contrastive learning to separate target phenomena from spurious correlations in data collected under different experimental conditions. The method introduces a single hyperparameter α to control invariance, and is evaluated on domain generalization and biological batch correction tasks.

The paper presents a novel approach to an important problem with promising results, particularly in biological applications. While the theoretical foundations could be stronger and there are some practical limitations, the method makes a clear contribution with its clean formulation and demonstrated utility on real-world data. The limitations around environment labels and decoder training are acknowledged by the authors and provide clear directions for future work. Thus, I recommend marginal acceptance.

Strengths

  • Novel application of supervised contrastive learning for disentanglement
  • Clean formulation with interpretable hyperparameter
  • Thorough experimental evaluation of the proposed method including relevant competing methods
  • Convincing empirical results on biological batch correction applications

Weaknesses

  • Limited theoretical analysis
    • No formal guarantees for disentanglement
    • Lacks justification for why contrastive learning should work better than alternatives
  • Practical limitations (as also acknowledged by the authors)
    • Method requires known environment labels e, limiting broader applicability.
    • Poor reconstruction quality due to separate decoder training.
    • Worse CORUM results compared to iVAE with conditioning.

Questions

  1. How does computational cost compare to competing approaches?
  2. Are there any failure modes of the method?
Comment

Thank you for taking the time to review our paper, and for asking important practical questions.

Limited theoretical analysis

  • No formal guarantees for disentanglement
  • Lacks justification for why contrastive learning should work better than alternatives

We agree on the value of theoretical grounding. However, as Reviewer Nnxb mentioned, there is a lot of theory but limited practical success in the area of invariant representation learning. We therefore took the purely empirical route, and demonstrated practical success on two realistic datasets. In particular, our batch correction dataset represents a real scientific problem, and consists of over 26 million images.

How does computational cost compare to competing approaches?

Competing approaches typically have a single encoder that learns a representation of $x$. SCBD uses separate encoders to learn $z_c$ and $z_s$ from $x$, which adds computational overhead. However, many of the competing approaches require separately computing a penalty for each value of $e$ in the batch, which can be computationally expensive. In contrast, the supervised contrastive and invariance losses in SCBD are very computationally efficient, since the computational cost is dominated by a single matrix multiplication to compute the dot products between pairs of $z_c$ (or $z_s$).
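
To illustrate why the cost is dominated by a single matrix multiplication, here is a minimal PyTorch-style sketch of a supervised contrastive loss over a batch of embeddings (the function and variable names are ours, and this is an illustration, not the paper's exact implementation): all pairwise dot products come from one `z @ z.T`.

```python
import torch
import torch.nn.functional as F

def supcon_loss(z, labels, temperature=0.1):
    """Illustrative supervised contrastive loss over embeddings z (shape [N, d])
    with integer labels (shape [N]). A sketch, not the paper's exact loss."""
    z = F.normalize(z, dim=1)
    # All pairwise dot products in one matrix multiplication: shape [N, N].
    sim = (z @ z.T) / temperature
    n = z.shape[0]
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))  # exclude self-pairs
    # Positives are other samples in the batch with the same label.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Mean log-probability of positives per anchor; anchors without positives are skipped.
    pos_counts = pos_mask.sum(dim=1)
    has_pos = pos_counts > 0
    pos_log_prob = torch.where(pos_mask, log_prob, torch.zeros_like(log_prob)).sum(dim=1)
    return -(pos_log_prob[has_pos] / pos_counts[has_pos]).mean()

# Usage sketch: loss_c = supcon_loss(z_c, y); loss_s = supcon_loss(z_s, e)
```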

Are there any failure modes of the method?

This is a very important question, and led us to write a new section (Section 4.1.2) in the revision to precisely characterize the type of dataset that SCBD should be effective on. Any dataset that doesn’t satisfy this corresponds to a failure mode.

In order for SCBD to work, there must exist spurious features in the training environments such that the more a model learns them, the better it does in-distribution and the worse it does out-of-distribution. In other words, datasets need to exhibit a trade-off between in- and out-of-distribution performance. SCBD prevents the learning of such spurious features, since they are predictive of the training environments. This promotes the learning of features that are invariant to the environment, and thus generalize on the test environments.

It turns out that most domain generalization datasets do not exhibit such a trade-off between in- and out-of-distribution performance. Wenzel et al. (2022) [1] reached this conclusion via a large-scale empirical study involving 172 datasets, including those in the DomainBed and WILDS suites. Teney et al. (2024) [2] found that a particularly strong negative correlation persists on the Camelyon17-WILDS dataset, which justifies our use of it.

We include additional results on PACS and VLCS from DomainBed to empirically demonstrate this failure mode. Our results in Appendix Figures 17–24 (a) show that in- and out-of-distribution performance are positively correlated on these datasets. Since this violates the assumptions of SCBD, we do not expect it to be effective. Our results in Appendix Figures 17–24 (b) reflect this. Here, when we increase $\alpha$, we remove features that generalize on both the training and test environments. Consequently, instead of trading off validation and test accuracy, both get worse.

[1] Wenzel et al., Assaying out-of-distribution generalization in transfer learning. NeurIPS, 2022.
[2] Teney et al., ID and OOD performance are sometimes inversely correlated on real-world datasets. NeurIPS, 2024.

Comment

Are there any failure modes of the method?

Your question regarding the failure modes of SCBD, as well as our discussion with the other reviewers, led us to write a new section (Section 4.1.2) on the specific type of distribution shift that is assumed by SCBD. Any dataset that violates this assumption corresponds to a failure mode.

SCBD requires datasets to exhibit a negative correlation between in- and out-of-distribution performance. The recent consensus in the literature is that domain generalization datasets exhibit significant variation in the sign and magnitude of this correlation [1,2], and that for many datasets, the correlation is positive [1].

We conducted additional experiments to show that the correlation is negative on CMNIST (Appendix Figure 5) and Camelyon17-WILDS (Appendix Figure 11). This means that these datasets satisfy the assumptions made by SCBD, and supports the results from our original submission showing that SCBD is effective on these datasets.

We also conducted additional experiments on the PACS and VLCS datasets from DomainBed to show that these datasets violate our assumptions (Appendix Figures 17–24 (a)), and therefore SCBD does not improve robustness on them, as expected (Appendix Figures 17–24 (b)).

Please let us know if you have any questions regarding this characterization of the failure modes of SCBD. We’re happy to answer any questions in the remaining days of the discussion period. Also, if there are no further questions, would you consider increasing your score? Thank you again for your time.

[1] Wenzel et al., Assaying out-of-distribution generalization in transfer learning. NeurIPS, 2022.
[2] Teney et al., ID and OOD performance are sometimes inversely correlated on real-world datasets. NeurIPS, 2024.

Comment

Thanks for these clarifications and the added experiments. I appreciate the added analysis, which clearly articulates the failure cases of the proposed method. I also agree with Reviewer sAdN that these comparisons were indeed necessary to correctly position this paper within related work. After reading through all the reviews, I choose to keep my score for acceptance and raise my confidence accordingly.

Comment

Thank you again for your time, and for taking our additional experiments into consideration.

Official Review (Rating: 6)

This paper adapts & modifies the Supervised Contrastive Learning algorithm (SCL; Khosla et al., 2020) to solve domain generalization tasks. They use a loss that leverages similar ideas to SCL to block disentangle the representation into "content" and "style" blocks capturing the respective parts of the signal that are invariant and vary across domains. They have an explicit regularization term that encourages invariance across domains, and they show experimentally that increasing this hyperparameter leads to improved test set performance on the downstream tasks.

Strengths

  • I thought it was a very clearly written paper - the various terms in the loss function are well motivated from a probabilistic perspective, and clearly explained.
  • I liked that it was a pragmatic take on an area that has a lot of nice theory but relatively little practical success, suggesting a focus on algorithms is important.
  • The empirical results clearly demonstrate the role that the invariance loss plays.

Weaknesses

  • Given that this is primarily a methods paper that is supported by empirical evidence, it would have been nice to see the empirical results replicated across all of WILDS. Aside from the compute requirements, I don't see what's stopping that?
  • It seems likely that the paper could have been supported with theory that shows that the optimizer of the loss separates the representations (analogous to [Von Kügelgen et al., 2021]). It's not essential, but it would have strengthened the paper.
  • While I totally agree on the importance of having a well-defined validation set metric to optimize for, I wasn't convinced by the argument that you could simply maximize invariance subject to some constraint on accuracy loss (see questions below for why).

Questions

  • How do you choose the accuracy loss threshold? Surely both the rate of accuracy loss and the necessary invariance is domain specific? I.e. on some domains, you need a larger validation accuracy in order to get good test performance?
  • Could the same accuracy trade-off procedure not be applied to any domain generalization method with an invariance penalty? What is specific about this paper?
  • What is preventing you running this on all of WILDS?
Comment

We sincerely appreciate you taking the time to review our paper.

Given that this is primarily a methods paper that is supported by empirical evidence, it would have been nice to see the empirical results replicated across all of WILDS. Aside from the compute requirements, I don't see what's stopping that?

We do not evaluate on the remaining datasets in WILDS because, as we explain below, they do not satisfy the conditions required for SCBD to work.

We wrote a new section in the revision (Section 4.1.2) in order to:

  • Precisely characterize the conditions where SCBD should work;
  • Explain how these conditions are empirically testable;
  • Provide evidence from the literature that Camelyon17-WILDS satisfies these conditions.

In order for SCBD to work, there must exist spurious features in the training environments such that the more a model learns them, the better it does in-distribution and the worse it does out-of-distribution. In other words, datasets need to exhibit a trade-off between in- and out-of-distribution performance. SCBD prevents the learning of such spurious features, since they are predictive of the training environments. This promotes the learning of features that are invariant to the environment, and thus generalize on the test environments.

It turns out that most domain generalization datasets do not exhibit such a trade-off between in- and out-of-distribution performance. Wenzel et al. (2022) [1] reached this conclusion via a large-scale empirical study involving 172 datasets, including those in the DomainBed and WILDS suites. Teney et al. (2024) [2] found that a particularly strong negative correlation persists on the Camelyon17-WILDS dataset, which justifies our use of it.

Eastwood et al. (2023) [3] do a similar evaluation as us, evaluating only on CMNIST and PACS (from DomainBed). Their algorithm works under different conditions compared to SCBD, and does not work on Camelyon17-WILDS.

[1] Wenzel et al., Assaying out-of-distribution generalization in transfer learning. NeurIPS, 2022.
[2] Teney et al., ID and OOD performance are sometimes inversely correlated on real-world datasets. NeurIPS, 2024.
[3] Eastwood et al., Spuriosity didn't kill the classifier: using invariant predictions to harness spurious features. NeurIPS, 2023.

While I totally agree on the importance of having a well-defined validation set metric to optimize for, I wasn't convinced by the argument that you could simply maximize invariance subject to some constraint on accuracy loss (see questions below for why).

We agree with this point. We found that the test accuracy tends to plateau for similar values of $\alpha$ on CMNIST and Camelyon17-WILDS, but we do not believe this will persist on arbitrary datasets. Therefore, in the revision we highlight the difficulty of tuning $\alpha$ as being a limitation (third paragraph in Section 4.1.6). This corresponds to model selection with respect to an unknown test distribution, which is a difficult open problem, and is a limitation also shared by other works [4, 5].

[4] Makino et al., Generative multitask mitigates target-causing confounding. NeurIPS, 2022.
[5] Wortsman et al., Robust fine-tuning of zero-shot models. CVPR, 2022.

How do you choose the accuracy loss threshold? Surely both the rate of accuracy loss and the necessary invariance is domain specific? I.e. on some domains, you need a larger validation accuracy in order to get good test performance?

Please see our answer above.

Could the same accuracy trade-off procedure not be applied to any domain generalization method with an invariance penalty? What is specific about this paper?

This is true, the same argument applies to algorithms such as IRM. We edited the revision to remove mentions of this being a unique property of SCBD.

What is preventing you running this on all of WILDS?

Please see our answer above.

Comment

Given that this is primarily a methods paper that is supported by empirical evidence, it would have been nice to see the empirical results replicated across all of WILDS. Aside from the compute requirements, I don't see what's stopping that?

We conducted additional experiments on the PACS and VLCS datasets from DomainBed, which may align with your suggestion to evaluate on a broader range of datasets.

To briefly summarize, we’ve now linked our work to recent studies demonstrating that domain generalization datasets exhibit significant variation in both the sign and magnitude of correlations between in- and out-of-distribution performance [1,2].

We wrote a new section (Section 4.1.2) explaining that SCBD requires this correlation to be negative, and provide evidence that this holds on CMNIST (Appendix Figure 5) as well as Camelyon17-WILDS (Appendix Figure 11). This is also consistent with the observations in [2]. Since these datasets satisfy our assumptions, the results from our original submission show that SCBD is effective on them.

We now additionally provide some negative results, and show that in- and out-of-distribution performance are positively correlated for PACS and VLCS (Appendix Figures 17–24 (a)). This is consistent with the observations in [1]. Since these datasets violate the assumptions of SCBD, our algorithm cannot improve robustness on them (Appendix Figures 17–24 (b)).

We’re now following the recent convention in domain generalization, which is to make specific assumptions about the nature of distribution shift, and to only expect an improvement on datasets that satisfy these assumptions.

An example of this is Eastwood et al. (2023) [3], who, like us, experimented with one synthetic and one realistic dataset (CMNIST and PACS from DomainBed). The difference between our papers is that theirs includes theory, while we additionally solve batch correction using a large-scale dataset with 26 million images. The algorithm in Eastwood et al. (2023) [3] assumes a different kind of distribution shift compared to SCBD, and is therefore effective on a different set of datasets. This is clear from the title of their paper - they harness spurious features, while we remove them. Their algorithm works on CMNIST and PACS but not on Camelyon17-WILDS, and they provide negative results for the latter in their appendix. In contrast, SCBD works on CMNIST and Camelyon17-WILDS, but not on PACS and VLCS.

Please let us know if you have any questions regarding these additional results; we’re happy to make any clarifications in the remaining days of the discussion period. Also, if there are no further questions, would you consider increasing your score? We thank you again for your time.

[1] Wenzel et al., Assaying out-of-distribution generalization in transfer learning. NeurIPS, 2022.
[2] Teney et al., ID and OOD performance are sometimes inversely correlated on real-world datasets. NeurIPS, 2024.
[3] Eastwood et al., Spuriosity didn't kill the classifier: using invariant predictions to harness spurious features. NeurIPS, 2023.

Official Review (Rating: 6)

This paper introduces a method, named SCBD, for improving domain generalization and reducing spurious correlations and the so-called batch effect among data collected in different environments, a common issue in datasets from experimental biology and clinical data. The proposed method mainly involves modeling the spurious correlation and true signal with two latent vectors and optimizing an objective involving four different parts: one for signals induced from the target labels, one for signals induced from the environment, one for the invariance amongst the environment, one for making the learning invariant to the environment, and a regularization loss on the generation. Empirically, the experiments were done on both the small Colored MNIST and Camelyon17-WILDS. The results show that the method has a tunable parameter that exhibits a trade-off between in-domain and out-of-domain generalization.

Strengths

The research problem of domain generalization is well-motivated. The introduction clearly states the issues with current methods for domain generalization/adaptation, such as the batch effect in experimental biology. The writing of how the method works is straightforward to understand. In terms of novelty, the proposed method demonstrates a monotonic trade-off between validation and test accuracy. Their experiments also demonstrate that the method can achieve the desired U-shaped curve. The method is also applied to biology-related datasets, which the authors have fairly introduced, making the problem well-contained. Overall, the problem is significant and very relevant to today's research landscape.

Weaknesses

The weaknesses are the following:

  1. The novelty of the work seems limited. There already exist works that model signals from environment and target variables with two latent factors [1]. The paper also proposed a modification to iVAE, but as the authors mentioned, it was challenging to learn and the experiments do not yield significant improvements over other baselines.
  2. While the experiment settings are well-designed, each with a clear point that it is trying to demonstrate, having only one synthetic and one real-world dataset is not convincing enough that this method will work in most cases. This is especially important if the authors are trying to claim that their method can show a trade-off between validation and test accuracy in any general case. The authors also mentioned related works on domain generalization, but do not have any comparisons against state-of-the-art methods without block disentanglement, or a baseline against methods that do not handle out-of-distribution shifts.
  3. The visualizations of the experiment results are quite hard to compare. For example, in Figure 3 and 4, instead of having three separate plots, they could be one single plot, where different methods correspond to different colors, with SCBD being a range of colors (such as a rainbow spectrum). The way it’s currently presented, with the tick-labels being on different scales, makes it difficult if not impossible to compare.
  4. There is also a lack of ablation study. In particular, it seems the choice of dimension is important when learning latent factors, but a study of how that affects learning and performance is not in the paper. The effects of batch size, learning rate, and regularization are also not shown.
  5. (Minor) Typos on L781 “lossesl”

Questions

  1. Is there a difference between domain generalization and batch correction? It seems the batch effect is just a special case of domain generalization, but the paper is currently written as if they are two separate issues.
  2. The trade-off between the validation and test accuracy is clear to me if there are only two environments. What is the intuition if there are more than two environments? Would there be multiple validation and test accuracy trade-off?
Comment

Thank you very much for taking the time to review our paper, we really appreciate your feedback.

The novelty of the work seems limited. There already exist works that model signals from environment and target variables with two latent factors [1]. The paper also proposed a modification to iVAE, but as the authors mentioned, it was challenging to learn and the experiments do not yield significant improvements over other baselines.

There are two novel aspects to this work. The first is our use of Supervised Contrastive Learning (SCL) to separately model correlations that are invariant to $e$, and the spurious correlations that depend on $e$. This is a novel approach to a well-studied problem. The second is our invariance loss, which makes $z_c$ less predictive of $e$. Our formulation of this loss relies on the fact that we use SCL - the dot products between $z_c$ can be repurposed to predict $e$, without the need to adversarially train a discriminator as done by existing approaches. We modified the revision to make these points clear.
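
To illustrate the second point, here is a hedged sketch of one way the same pairwise dot products between $z_c$ could be reused to penalize predictability of $e$ without an adversarially trained discriminator (the function name and the specific penalty form are ours, not the exact invariance loss in the paper):

```python
import torch
import torch.nn.functional as F

def invariance_penalty(z_c, e, temperature=0.1):
    """Illustrative environment-invariance penalty on z_c (shape [N, d]) given
    environment labels e (shape [N]). A sketch, not the paper's exact loss: it
    reuses the pairwise dot products and penalizes any gap between same-environment
    and cross-environment similarities."""
    z = F.normalize(z_c, dim=1)
    sim = (z @ z.T) / temperature                     # pairwise dot products, [N, N]
    n = z.shape[0]
    off_diag = ~torch.eye(n, dtype=torch.bool, device=z.device)
    same_env = (e.unsqueeze(0) == e.unsqueeze(1)) & off_diag
    diff_env = (e.unsqueeze(0) != e.unsqueeze(1))
    # If z_c carries no information about e, same- and cross-environment pairs
    # should look alike; the batch must contain at least two environments.
    return (sim[same_env].mean() - sim[diff_env].mean()).abs()

# Usage sketch: total = supcon_loss(z_c, y) + alpha * invariance_penalty(z_c, e),
# where alpha controls the degree of invariance to e, as in the paper.
```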

While the experiment settings are well-designed, each with a clear point that it is trying to demonstrate, having only one synthetic and one real-world dataset is not convincing enough that this method will work in most cases. This is especially important if the authors are trying to claim that their method can show a trade-off between validation and test accuracy in any general case.

We respectfully believe this is a misunderstanding. We have one synthetic and one realistic dataset (>300k images) for domain generalization, and another realistic dataset (>26 million images) for batch correction. In total, that is one synthetic and two realistic datasets. Due to the large scale of the batch correction dataset, we believe the scope of our experiments is large relative to the literature.

The authors also mentioned related works on domain generalization, but do not have any comparisons against state-of-the-art methods without block disentanglement, or a baseline against methods that do not handle out-of-distribution shifts.

We agree with this point, and conducted additional experiments so that our domain generalization experiments now compare against six standard baseline algorithms: ERM, CORAL, DANN, IRM, Fish, and GroupDRO. Therefore, our evaluation is now consistent with the convention in the literature. The results are in Table 1 in the revision, and show that SCBD outperforms these baselines.

The visualizations of the experiment results are quite hard to compare. For example, in Figure 3 and 4, instead of having three separate plots, they could be one single plot, where different methods correspond to different colors, with SCBD being a range of colors (such as a rainbow spectrum). The way it’s currently presented, with the tick-labels being on different scales, makes it difficult if not impossible to compare.

Instead of having three separate plots, we reduced it to a single plot by presenting the baseline results in the form of a table.

There is also a lack of ablation study. In particular, it seems the choice of dimension is important when learning latent factors, but a study of how that affects learning and performance is not in the paper. The effects of batch size, learning rate, and regularization are also not shown.

We agree on the importance of these ablations, and include the results in Appendix Figures 7–9, and 12–14. These additional results demonstrate that SCBD is robust to the choice of these hyperparameters.

Is there a difference between domain generalization and batch correction? It seems the batch effect is just a special case of domain generalization, but the paper is currently written as if they are two separate issues.

Domain generalization and batch correction are similar because in both cases, we want $z_c$ to represent the correlation between $x$ and $y$ that is invariant to $e$, and $z_s$ to encode the remaining spurious correlations that depend on $e$. The key difference between the two problems relates to the evaluation. In domain generalization, we evaluate the ability to predict $y$ given $z_c$ on an out-of-distribution test set. In contrast, in batch correction the evaluation is in-distribution, and measures the degree to which $z_c$ discards the information in $e$, while preserving the information in $y$. We clarified this important point in the first paragraph of Section 4 of our revision.

The trade-off between the validation and test accuracy is clear to me if there are only two environments. What is the intuition if there are more than two environments? Would there be multiple validation and test accuracy trade-off?

In domain generalization, the trade-off is always between performance on the training environments (in-distribution) and performance on the test environment (out-of-distribution). Notice that training environments is plural - there are multiple training environments in all of our experiments.

Comment

Thank you for the response. The authors have addressed my questions and misunderstandings appropriately. With the additional ablation studies, along with the clarification on batch correction and failure modes (suggested by other reviewers), I have adjusted my score accordingly.

Comment

Thank you again for your time, and for taking our response into consideration.

Comment

We sincerely thank the reviewers for taking the time to read our paper and provide constructive feedback. Based on our initial submission, the reviewers unanimously praised the novelty and significance of our approach, and agreed that our experimental results support our claims.

Here are some examples from each reviewer:

  • Reviewer jiAL: “In terms of novelty, the proposed method demonstrates a monotonic trade-off between validation and test accuracy. Their experiments also demonstrate that the method can achieve the desired U-shaped curve.”
  • Reviewer Nnxb: “I liked that it was a pragmatic take on an area that has a lot of nice theory but relatively little practical success, suggesting a focus on algorithms is important. The empirical results clearly demonstrate the role that the invariance loss plays.”
  • Reviewer WV1Y: “Novel application of supervised contrastive learning for disentanglement. Thorough experimental evaluation of the proposed method including relevant competing methods. Convincing empirical results on biological batch correction applications.”
  • Reviewer sAdN: “Provides an interesting new ansatz to domain generalisation which might be easier to optimise than adversarial approaches to domain generalisation. It's encouraging to see that the regularisation parameter induces a clear trade-off between validation and test performance.”

To summarize the discussion period, the reviewers raised three major points:

  1. We should compare our domain generalization results to additional baseline algorithms (Reviewer sAdN).
  2. We should evaluate on additional domain generalization datasets (Reviewers jiAL, Nnxb, sAdN).
  3. We should discuss the failure modes of our algorithm (Reviewer WV1Y).

To address these points, we:

  • Compared our algorithm to four additional baselines that are standard in the literature.
  • Performed additional experiments on two datasets from DomainBed (PACS and VLCS).
  • Wrote a new section describing the exact conditions required for our algorithm to work - violations of these assumptions correspond to the failure modes of our algorithm. This also strengthened the connection of our work to the current literature.

We also addressed all other points mentioned by the reviewers, which were more minor compared to the above three points. Some examples include:

  • We ran additional ablation studies to show that our algorithm is robust to the choice of hyperparameters (Reviewer jiAL).
  • We cleared up a misunderstanding regarding the total number of datasets that we evaluate on (Reviewer jiAL).
  • We clarified the novel aspects of our work (Reviewer jiAL).
  • We removed faulty claims regarding our VAE-based baseline (Reviewer sAdN).

Based on our response, Reviewer jiAL increased their score from a 3 to 6, and Reviewer WV1Y maintained their score of 6 while increasing their confidence from 2 to 3.

Reviewer sAdN acknowledged that we addressed most of their points, and requested additional experiments on DomainBed. We then performed these experiments, but they were not able to respond afterwards.

Overall, we had a fruitful discussion that we believe strengthened our paper. We thank the reviewers and area chairs for their time and consideration.

AC Meta-Review

This paper proposes a supervised contrastive block disentanglement algorithm to learn invariant features that are stable across environments. The proposed algorithm is evaluated on domain generalization, achieving good performance on a synthetic and a real-world dataset. The proposed method is also applied to batch correction on single-cell data. The reviewers raised several concerns, including 1) limited novelty, 2) lack of theoretical justification, and 3) lack of comparison with state-of-the-art DG methods and limited datasets. The authors removed some incorrect statements and added more experiments. However, the results are not consistently better than existing methods. The authors gave some justification for the possible violation of the method's assumptions, which is not fully convincing. If the authors' new claim is that the proposed method works in some scenarios but not in others, it would be better to provide a more formal analysis of the assumptions and a theoretical justification, as conclusions drawn from comparisons on a finite number of datasets do not necessarily reflect the truth. I do not recommend acceptance of this paper in its current form.

Additional Comments on the Reviewer Discussion

The reviewers who initially gave a negative score or low confidence were involved in the discussion with the authors. One of the reviewers raised their rating, and another raised their confidence. While three out of four reviewers gave a rating of 6, the concerns from reviewer sAdN remain unresolved. In particular, sAdN pointed out several flaws in this paper; although the authors corrected them, this showed that the authors did not fully understand the key concepts of causality and disentanglement. In addition, the new observations in the added experiments are not fully justified, as explained in my meta-review.

Final Decision

Reject