Compositional Risk Minimization
Provable method for extrapolating classifiers to novel combinations of attributes (a.k.a. compositional generalization)
Summary
Reviews and Discussion
This paper addresses compositional generalization by tackling compositional shift, where test data contains unseen combinations of attributes. The authors propose Compositional Risk Minimization (CRM), using additive energy distributions to model attributes and providing an alternative to empirical risk minimization. Their approach involves training an additive energy classifier and adjusting it for compositional shifts, with theoretical analysis showing extrapolation capabilities to affine hulls of seen attribute combinations. Experimental results on benchmark datasets demonstrate improved robustness compared to existing methods for handling subpopulation shifts.
Questions for Authors
No
Claims and Evidence
The claims are supported by both theoretical and empirical evidence.
Methods and Evaluation Criteria
Yes, the evaluation criteria (Average Accuracy and Worst Group Accuracy) make sense.
Theoretical Claims
The mathematical derivations and theoretical arguments seem rigorous upon initial examination.
Experimental Design and Analyses
Based on my review of the experimental section, the empirical results support the paper's claims.
Supplementary Material
I conducted a cursory review of the supplementary materials without performing an in-depth analysis.
Relation to Broader Scientific Literature
The authors provide a thorough and systematically structured overview of previous work in this research area.
Essential References Not Discussed
None
Other Strengths and Weaknesses
Strengths:
- The research tackles an important problem in compositional generalization with novel insights.
- Clear implementation details and reproducibility through provided pseudocode.
Weakness:
Limited empirical comparisons against existing baseline methods
Other Comments or Suggestions
The authors use 'Compositional Risk Minimization' as the title. Could you explain the core concept of Compositional Risk Minimization using simple mathematical formulations? Please refer to the classical mathematical formulation of Empirical Risk Minimization (ERM).
We thank the reviewer for their positive and insightful feedback! We now address the concerns raised by the reviewer below.
1. Limited empirical comparisons against existing baseline methods
We have benchmarked CRM extensively against 6 widely used baselines from the subpopulation shift literature, in addition to ERM. Please check Table 5 in Appendix G.1, where we compare CRM with GroupDRO, LC, sLA, IRM, VREx, and Mixup. We find that CRM outperforms all the baselines across diverse benchmarks w.r.t. the worst group accuracy.
Note that in the main body of the paper (Table 1), due to space constraints, we only compared with the best-performing baselines (GroupDRO, LC, sLA). Also note that our primary comparison has been with GroupDRO, as it is the most effective method for addressing subpopulation shifts and hence the main focus of our work. We also included baselines like LA (Logit Adjustment) that share conceptual similarities with our approach.
2. The authors use 'Compositional Risk Minimization' as the title. Could you explain the core concept of Compositional Risk Minimization using simple mathematical formulations? Please refer to the classical mathematical formulation of Empirical Risk Minimization (ERM).
The classical ERM objective can be stated as follows:

$$\min_f R_{tr}(f), \quad R_{tr}(f) = \sum_{z \in \mathcal{Z}} p_{tr}(z)\, \mathbb{E}_{x \sim p_{tr}(x \mid z)}\left[\ell(f(x), z)\right],$$

where $\ell$ is the loss, $p_{tr}(z)$ is the training prior probability, and $\mathcal{Z}$ is the set of all groups. If $\ell$ is the cross-entropy loss, then the output of ERM (with no capacity constraints) matches the true $p_{tr}(z \mid x)$. Also, note that $p_{tr}(z)$ in the above summation is zero on all groups that are not in the support of the training distribution.
To tackle compositional distribution shifts, we want to learn predictors that minimize the test risk

$$R_{te}(f) = \sum_{z \in \mathcal{Z}} p_{te}(z)\, \mathbb{E}_{x \sim p_{te}(x \mid z)}\left[\ell(f(x), z)\right],$$

and not $R_{tr}(f)$, which ERM minimizes. In the above objective, $p_{te}(z)$ can be non-zero on groups that have zero probability under $p_{tr}(z)$. Our results show that our approach outputs a predictor that provably minimizes risk under any compositional shift, hence the name compositional risk minimization.
Specifically, in Theorem 3 we showed that, with high probability, CRM outputs the Bayes optimal predictor $p_{te}(z \mid x)$ and hence provably minimizes $R_{te}(f)$, where $\ell$ is the cross-entropy or the 0-1 loss, provided the number of training groups grows at the rate stated in the theorem.
To clarify, our approach does not require us to explicitly compute the risk on the test distribution, $R_{te}(f)$. For additive energy distributions, in the second step of CRM we adapt the training predictor $p_{tr}(z \mid x)$ to the test predictor $p_{te}(z \mid x)$ with the extrapolated bias term, which equals the true test predictor (Theorem 2). Hence, CRM avoids the computation of $R_{te}(f)$.
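As a toy numeric illustration of a prior-reweighting step of this kind (a logit-adjustment-style sketch with made-up numbers, not the paper's extrapolated-bias computation):

```python
import numpy as np

def adjust_logits(train_logits, log_p_tr, log_p_te):
    # Shift each group's logit by log p_te(z) - log p_tr(z),
    # adapting a classifier trained under p_tr to the prior p_te.
    return train_logits + (log_p_te - log_p_tr)

logits = np.array([2.0, 1.0, 0.5, -1.0])   # group logits from a trained classifier (made up)
p_tr = np.array([0.40, 0.40, 0.19, 0.01])  # training prior: the last group is rare
p_te = np.array([0.25, 0.25, 0.25, 0.25])  # test prior: uniform over groups
adjusted = adjust_logits(logits, np.log(p_tr), np.log(p_te))
probs = np.exp(adjusted) / np.exp(adjusted).sum()
```

The adjustment shifts probability mass toward the group that is rare at training time; CRM's actual bias term is instead derived from the additive energy structure.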
Thanks again for this interesting question! We will add this discussion to the paper as well. We are very open to further discussion and would be happy to address any remaining concerns.
This paper proposes compositional risk minimization (CRM), an approach to compositional generalization that is based on additive energy distributions. The intuition is to train an energy-based classifier on the training set, then modify it to account for the known bias between the observed training and test distributions. The authors show a number of theoretical results as well as empirical results on benchmarks for subpopulation shifts.
Questions for Authors
What practical settings is the additive energy distribution assumption applicable to? For example, does it apply to the blue elephant on the Moon example laid out in the introduction? It's reasonable if it does not hold, but discussing the boundaries of when it holds would be good to include.
Claims and Evidence
Yes, all claims are supported with adequate theoretical and empirical evidence.
Methods and Evaluation Criteria
The method is well constructed and the benchmarks are well chosen.
Theoretical Claims
The proofs appear correct; no issues found.
Experimental Design and Analyses
The experimental setup described in section 5.1 appears sound.
Supplementary Material
I looked at the proofs and additional results; both appear to support the claims of the main text.
Relation to Broader Scientific Literature
This work is related to prior work on compositionality with energy-based models. However, these works typically consider a generative setting; by contrast, the authors consider a discriminative setting and produce a novel set of theoretical results.
Essential References Not Discussed
No missing references as far as I am aware.
Other Strengths and Weaknesses
The paper is quite well written and presented. Experiments are conducted over a number of datasets. Overall, the authors present an interesting, fresh approach to compositional generalization.
My main concern is whether the additive energy distribution assumption is realistic (beyond the particular subpopulation-shift setting considered in the experiments). It would be great to have additional discussion on this point.
Other Comments or Suggestions
Typos
- "boradcasting" in Figure 1 caption
We thank the reviewer for their positive and insightful feedback! We will fix the typo in the caption of Figure 1, thanks for pointing this out. We now address the concerns raised by the reviewer below.
My main concern is whether the additive energy distribution assumption is realistic (beyond the particular subpopulation-shift setting considered in the experiments). It would be great to have additional discussion on this point. What practical settings is the additive energy distribution assumption applicable to? For example, does it apply to the blue elephant on the Moon example laid out in the introduction? It's reasonable if it does not hold, but discussing the boundaries of when it holds would be good to include.
We believe that the additive energy assumption is practical for settings where the image is aptly described by an AND operation among the attributes (and this applies to the blue elephant example). The summation of additive energies leads to a product of exponentials, which acts as a soft AND operation. Hence, each energy term contributes to checking one of the conditions in the AND operation.
Regarding the specific image "blue elephant on the moon", let us start with a simplification of this image, "elephant on the moon". Then we have one energy term that detects elephant and the other energy term detects moon. However, let us now consider the original image "blue elephant on the moon". If we do a simple AND between detecting blue color, elephant, and the moon, then even the image "elephant on blue moon" will have the same energy (density) as the image "blue elephant on the moon", which is not desirable.
But we can model this scenario using an additive energy distribution by having energy components for each object-specific attribute. Hence, the final energy function becomes a sum of per-object terms $E_i(x; o_i, a_i)$, where $o_i$ refers to the location of object $i$ and $a_i$ refers to the attribute for that object. This essentially allows us to "bind" the attribute information to an object and still model the overall distribution with additive energies. In the example above, we have one energy term for the object elephant with attribute blue, which gets added to the energy term for the object moon (with some default attribute value).
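As a small numeric sketch of the soft-AND view (our own notation and numbers, not the paper's code): summing per-attribute energies is exactly equivalent to multiplying per-attribute factors.

```python
import math

def factor(energy):
    # Each attribute contributes a factor exp(-E); low energy = condition met.
    return math.exp(-energy)

# Hypothetical per-attribute energies for one image.
e_color, e_object, e_location = 0.1, 0.3, 0.2

# Sum of energies -> single exponential of the total energy.
joint_unnormalized = math.exp(-(e_color + e_object + e_location))

# Equivalent product of per-attribute factors: the soft AND.
product_of_factors = factor(e_color) * factor(e_object) * factor(e_location)
```

If any one condition is badly violated (one energy is large), its factor is near zero and the whole product collapses, mirroring a failed AND.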
Thanks again for this interesting question! A thorough investigation of this is a fruitful future direction. We are very open to further discussion and would be happy to address any remaining concerns.
This paper introduces a method for addressing compositional shifts in discriminative tasks. The authors propose a theoretical framework built on additive energy distributions, where each energy term represents an attribute. They introduce the discrete affine hull concept to characterize extrapolation capabilities. Their two-step algorithm first trains an additive energy classifier to predict attributes jointly, then adjusts this classifier for compositional shifts. Theoretical guarantees show that the proposed method can extrapolate to test distributions within the discrete affine hull of training distributions. Experiments on several benchmarks demonstrate the effectiveness of the proposed method.
Pros:
- The proposed method is well-motivated and reasonable, and builds on additive energy distributions that are studied in generative compositionality.
- The proposed algorithm is practical and easy to implement, and the authors provide detailed implementation details in the appendix.
- The extensive empirical evaluation demonstrates consistent improvements across diverse benchmarks.
Cons:
- The additive energy assumption may be too limited for many real-world situations where different factors interact in complex ways rather than simply adding together. This could reduce how useful the approach is in practice.
- The additive energy distributions were previously studied in generative compositionality, while the authors extend this framework to discrimination tasks. This is an incremental contribution and the novelty appears limited in scope.
- The paper assumes access to attribute labels during training, which might not always be available in practice.
Questions for Authors
- The paper assumes attribute labels are available during training. Can the proposed method be adapted to settings where attribute labels are only partially available or must be inferred?
- How does the computational complexity of the proposed method scale with the number of attributes and classes?
Claims and Evidence
- The authors provide rigorous theoretical analysis with detailed proofs showing that the proposed method can generalize to novel attribute combinations within the discrete affine hull.
- Empirical validation on several common benchmarks shows the proposed method outperforms other baselines.
Methods and Evaluation Criteria
- The proposed two-step algorithm is consistent with the theoretical framework. The method is well-motivated and reasonable, building on additive energy distributions studied in generative compositionality.
- The authors use average accuracy, group-balanced accuracy, and worst-group accuracy to evaluate performance. The evaluation is comprehensive.
Theoretical Claims
The proofs appear mathematically sound. The theoretical analysis provides a sharp characterization of extrapolation, demonstrating that generalization beyond the discrete affine hull is fundamentally impossible. This establishes clear boundaries on what can be achieved in this domain.
Experimental Design and Analyses
The experiments systematically demonstrate the advantages of the proposed method, especially on worst-group accuracy. The ablation studies clearly show the importance of the extrapolated bias term, which aligns with the theoretical framework.
Supplementary Material
The appendix is comprehensive and provides strong support for the claims in the main paper. It provides detailed theoretical proofs for all theorems, and additional experimental results for each benchmark and compositional shift scenario.
Relation to Broader Scientific Literature
- This work builds on additive energy distributions for generative tasks; the authors extend them to discriminative tasks.
- The problem connects to out-of-distribution generalization and subpopulation shifts. The authors clearly articulate how compositional generalization relates to these established research areas.
Essential References Not Discussed
The idea that compositional generalization can only be achieved within the discrete affine hull is analogous to the assumption in the following papers, which posit that the test distribution should lie within the convex hull of the training distributions:
[1] Qiao, F., & Peng, X. Topology-aware Robust Optimization for Out-of-Distribution Generalization. ICLR 2023.
[2] Yao, H., Yang, X., Pan, X., Liu, S., Koh, P. W., & Finn, C. Improving Domain Generalization with Domain Relations. ICLR 2024.
Other Strengths and Weaknesses
Please see summary.
Other Comments or Suggestions
None.
We thank the reviewer for their positive and insightful feedback! We are glad they appreciate the technical soundness of our work, on both the theoretical and empirical front. We now address the concerns raised by them.
Additive Energy Distribution (AED) Limitations
We emphasize that the benchmarks used to evaluate CRM are both realistic and widely adopted in the subpopulation shift literature. Since CRM consistently outperforms baselines, this suggests that the AED assumption is not overly restrictive and can model realistic datasets effectively.
We now clarify how AED models complex interactions between attributes. Note that AED does not imply additive interactions in data space (i.e., $x = \sum_i f_i(z_i)$) as in additive decoders (Lachapelle et al. 2024). Instead, it models an AND operation between attributes, as illustrated below via examples (see Appendix B for details).
i) Consider images that contain a distinct object varying in shape, size, and color. At any pixel, it is unlikely that the shape, color, and size attributes interact additively; rather, their interactions are complex and cannot be captured via additive decoders. However, under AED, interactions are modeled via an energy component per attribute: one detecting shape, AND another detecting color, AND a third detecting size. Together, these energy terms define the distribution of images conditioned on attributes.
ii) An example from a different data modality is the CivilComments benchmark, where the attributes toxic language (class label) and demographic identity (spurious attribute) interact non-trivially in text space. However, under AED, we can model their interactions via an energy component that checks whether the language is toxic, AND another energy component that checks the demographic identity.
Novelty of the work
We explain the key features that set this work apart.
a) Discrete Affine Hull, a novel mathematical object: Existing AED works in generative compositionality lack theoretical guarantees for generalization beyond training data. To address this, we introduce a novel mathematical object, the discrete affine hull, which precisely characterizes extrapolation to new distributions for both discriminative and generative tasks. For instance, our theoretical guarantee states that it is possible to generalize from the observed training groups to the typically much larger set of groups in their discrete affine hull, in both discriminative and generative tasks.
b) Discriminative training without estimating the partition function: Compositionality in discriminative tasks is a major problem, and our work makes key advances. One way to learn a classifier is via generative classification (lines 171-190, right column), where we first train densities on observed groups, estimate new densities via affine combinations, and then use Bayes rule to derive $p_{te}(z \mid x)$. While this guarantees generalization, it is impractical due to the intractable gradient estimation of log partition functions (lines 193-207, right column). CRM circumvents these issues while retaining the same guarantees.
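To illustrate why the per-group partition functions matter in generative classification, here is a toy numeric sketch (all numbers invented): dropping the unknown normalizer Z_z can flip the Bayes-rule prediction.

```python
import math

# Bayes rule with energy-based densities: p(z|x) is proportional to
# p(z) * exp(-E_z(x)) / Z_z. Ignoring the unknown Z_z changes the answer.
E = {"z1": 1.0, "z2": 1.2}     # energies of a point x under two groups (made up)
Z = {"z1": 10.0, "z2": 1.0}    # per-group partition functions (made up)
prior = {"z1": 0.5, "z2": 0.5}

def posterior(use_partition):
    scores = {
        z: prior[z] * math.exp(-E[z]) / (Z[z] if use_partition else 1.0)
        for z in E
    }
    total = sum(scores.values())
    return {z: s / total for z, s in scores.items()}

with_Z = posterior(True)       # correct Bayes posterior
without_Z = posterior(False)   # what you get if Z_z is dropped
```

Here the correct posterior favors "z2", while dropping the normalizers favors "z1", which is why avoiding partition-function estimation altogether is valuable.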
Prior work regarding convex hulls
We thank the reviewer for pointing us to these references, and we are happy to cite and contrast with them. However, note that the density of a new group in the discrete affine hull cannot be expressed as a convex or even an affine combination of the training group densities. For details, see Eq. (22) in Appendix D.2.
Thus, in our setting, only the energy terms are expressed as an affine combination, and our guarantees apply to distributions that lie outside the convex hull of the training distributions.
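A toy numeric check of this distinction (illustrative values, not from the paper): an affine combination of energies with a weight outside [0, 1] yields a density that is not the corresponding affine combination of the original densities.

```python
import math

E1, E2 = 1.0, 3.0   # energies of some point x under two training groups (made up)
alpha = 2.0         # affine weight outside [0, 1]: extrapolation, not interpolation

# Affine combination in energy space, then exponentiate.
E_new = alpha * E1 + (1 - alpha) * E2
density_from_energy = math.exp(-E_new)

# Affine combination applied directly to the (unnormalized) densities.
affine_of_densities = alpha * math.exp(-E1) + (1 - alpha) * math.exp(-E2)
```

The two quantities differ substantially (and the density-space combination can even go negative for other values), so the extrapolated distribution genuinely lies outside the convex hull of the training densities.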
Missing attribute labels scenario
We believe this is an exciting direction for future work. Existing works such as XRM (Pezeshki et al. 2024) show how one can discover the environments and then use existing domain generalization methods that require environment labels. We believe it would be exciting to extend these works to infer spurious attributes directly in combination with our approach.
Computational complexity of CRM
For CRM training stage 1 (Eq. 7), the per-step cost is similar to that of training an ERM-based classifier for predicting the group $z$, which is proportional to the total number of groups.
In training stage 2, we compute the extrapolated bias (Eq. 11), whose cost is proportional to the number of groups being extrapolated to.
Observe that in the worst case, with $k$ attributes of cardinalities $d_1, \dots, d_k$, the number of group labels $z$ at test time is $\prod_i d_i$, making the inference cost proportional to $\prod_i d_i$. Any method that predicts $p(z \mid x)$ would have to compute a probability vector of this size and thus spend at least $\Omega(\prod_i d_i)$ per inference.
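The group-count arithmetic behind this lower bound can be sketched as follows (hypothetical attribute cardinalities):

```python
from math import prod

# With k attributes of cardinalities d_1, ..., d_k, a classifier over
# groups z must output one score per attribute combination.
attribute_cardinalities = [2, 3, 5]         # hypothetical d_1, d_2, d_3
num_groups = prod(attribute_cardinalities)  # size of the probability vector over z
```

Any predictor of p(z | x) touches all `num_groups` entries, so per-inference cost grows with the product of the cardinalities rather than their sum.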
Thanks again for your constructive comments, and please let us know if there are any remaining concerns.
This paper addresses compositional shifts, a hard type of subpopulation shift, and proposes compositional risk minimization. The method is well-motivated and supported by theoretical analyses. Results on the subpopulation shift benchmark support the proposed method.
Questions for Authors
N/A
Claims and Evidence
- The compositional risk minimization method is reasonable and well-motivated.
- The formulation of the compositional shift setting provides a foundation for further research.
- Experimental results support the method well.
Methods and Evaluation Criteria
The analysis and proposed algorithm are designed to handle multiple attributes, with the theoretical advantages being most relevant for this multi-attribute context. However, the experiments are limited to only 2-3 attributes. I suggest that the authors include empirical results with multiple attributes to better align with the theoretical analysis.
Theoretical Claims
Correct
Experimental Design and Analyses
I understand that the paper’s chosen disjoint setting adds a level of complexity. However, in real-world scenarios, it is often feasible to obtain a small number of samples for different attribute combinations (particularly with only two attributes, as in these experiments). The proposed method should also be evaluated in traditional settings where all attribute combinations have some representation. This would confirm that the method performs well without requiring group-dropping.
Supplementary Material
Yes, I checked the additional experimental results.
Relation to Broader Scientific Literature
This paper proposes a more efficient method for domain generalization.
Essential References Not Discussed
None
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
I would suggest that the authors move additional results into the main body.
We thank the reviewer for their positive and insightful feedback! We now address the concerns raised by the reviewer below.
1. The analysis and proposed algorithm are designed to handle multiple attributes, with the theoretical advantages being most relevant for this multi-attribute context. However, the experiments are limited to only 2-3 attributes. I suggest that the authors include empirical results with multiple attributes to better align with the theoretical analysis.
Thanks for raising this issue; we would like to provide some clarifications. In our theoretical results (Theorems 2 & 3), the key finding is that if we observe a set of groups at training time, then CRM generalizes to test distributions over all groups in their discrete affine hull. Note that the result considers the setting with many groups, which can arise from multiple attributes or from multiple values per attribute. Hence, the theoretical advantages are not restricted to the multi-attribute scenario.
In our experiments, we already consider several scenarios that go beyond a few attributes and a few groups. The NICO++ dataset has 360 groups, and CelebA (multiple spurious attributes, Table 12, Appendix G.3) consists of 5 attributes and a total of 32 groups. In addition, we provided experiments on synthetic data with a varying number of values per attribute (up to 50 values, leading to 2500 groups) and a varying number of attributes (up to 9 binary attributes, leading to a total of 512 groups) in Figure 9, Appendix G.6. In all these experiments CRM performs well, aligning the behavior of the method with the theoretical claims.
Finally, we also want to point out that the guarantees in Theorem 2 apply to settings with a small number of groups as well; e.g., on datasets like Waterbirds, CRM offers a non-trivial Bayes optimality guarantee when only three out of the four groups are observed at training time.
2. I understand that the paper’s chosen disjoint setting adds a level of complexity. However, in real-world scenarios, it is often feasible to obtain a small number of samples for different attribute combinations (particularly with only two attributes, as in these experiments). The proposed method should also be evaluated in traditional settings where all attribute combinations have some representation. This would confirm that the method performs well without requiring group-dropping.
In the paper, we had already carried out a comparison in the traditional setting where all attribute combinations have some representation. These results were presented in the rightmost column of Table 1, "WGA (no groups dropped)", as well as in Table 14 in Appendix G.5, which contains additional metrics. CRM remains competitive with the baselines in this scenario, confirming that the method performs well without requiring group-dropping. We also want to point out that as the number of groups grows (due to an increase in the number of attributes or the number of values per attribute), it is natural to expect the disjoint setting, where some groups have no samples available.
Also, given the page limit, several results are currently in the supplementary material, and we will move some of them to the main body in a future revision. If you want us to move a specific result to the main body, please let us know.
Thanks again for your constructive comments! We are open to further discussion and would be happy to address any remaining concerns.
The paper proposes a novel attribute classification method that achieves Bayes-optimal classification for the test distribution, even when it includes attribute combinations not observed in the training data. Assuming that features follow an additive energy distribution, the authors derive a provable method that classifies, in a Bayes-optimal manner, any attribute composition lying within the discrete affine hull of the attribute compositions present in the training set.
Reviewers agree that the paper is well written, with all the claims supported by rigorous theoretical analysis and extensive experiments on benchmark datasets.