
Poster · 3 reviewers

Lowest score: 1 · Highest score: 5 · Standard deviation: 1.7

ICML 2025

Latent Score-Based Reweighting for Robust Classification on Imbalanced Tabular Data

Yunze Tong,Fengda Zhang,Zihao Tang,Kaifeng Gao,Kai Huang,Pengfei Lyu,Jun Xiao,Kun Kuang


Submitted: 2025-01-23 · Updated: 2025-07-24

TL;DR

We propose using score-based reweighting to achieve the overall robustness under distribution shifts.

Abstract

Machine learning models often perform well on tabular data by optimizing average prediction accuracy. However, they may underperform on specific subsets due to inherent biases and spurious correlations in the training data, such as associations with non-causal features like demographic information. These biases lead to critical robustness issues as models may inherit or amplify them, resulting in poor performance where such misleading correlations do not hold. Existing mitigation methods have significant limitations: some require prior group labels, which are often unavailable, while others focus solely on the conditional distribution $P(Y|X)$, upweighting misclassified samples without effectively balancing the overall data distribution $P(X)$. To address these shortcomings, we propose a latent score-based reweighting framework. It leverages score-based models to capture the joint data distribution $P(X, Y)$ without relying on additional prior information. By estimating sample density through the similarity of score vectors with neighboring data points, our method identifies underrepresented regions and upweights samples accordingly. This approach directly tackles inherent data imbalances, enhancing robustness by ensuring a more uniform dataset representation. Experiments on various tabular datasets under distribution shifts demonstrate that our method effectively improves performance on imbalanced data.

Keywords

robustness, score model, reweighting

Reviews and Discussion

Review

Rating: 4 · 2025-03-13

This paper introduces a latent score-based reweighting framework for improving classification robustness on imbalanced tabular datasets. The approach leverages score-based generative models (diffusion models) to estimate the joint distribution P(X,Y), identifying underrepresented data regions and upweighting samples accordingly. Experiments on six real-world tabular datasets show that the method achieves improved worst-case accuracy under distribution shifts while maintaining competitive mean accuracy compared to baseline robustness-enhancing methods.

update after rebuttal

I read the authors' rebuttal carefully and have updated my evaluation accordingly.

Questions for Authors

It is not intuitive to me how the method benefits learning on the worst group. Since the method upweights samples in low-density areas, are the authors assuming that the worst group is an under-represented group whose few samples lie in a low-density region of the feature space? I would appreciate clarification on this.
It is unclear how the selected datasets contain distribution shifts; please add a more detailed description in the dataset section or the appendix.
Also, why do shifts matter for robust classification? The proposed method simply balances the distribution in feature space via reweighting and helps learning in low-density areas. It should be beneficial even without a train-test shift.
For tabular data, tree-based ensembles (such as XGBoost or LightGBM) are known to typically outperform neural networks and are more commonly used in industry. I also wonder how well this technique works with those classifiers.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

The idea of balancing the data density in feature space makes sense for more robust classification. Conceptually, it should help models focus more on under-represented groups with lower density in specific feature subspaces.

Theoretical Claims

Experimental Design and Analysis

The experimental designs generally make sense, but I wonder how this method works with tree-based models such as XGBoost and LightGBM, which are known to be generally better than neural networks on tabular data.

Supplementary Material

I checked the code base and it seems sound.

Relation to Existing Literature

It may serve as a general technique for more robust learning on tabular datasets.

Essential References Not Discussed

The references seem appropriate to the best of my knowledge.

Other Strengths and Weaknesses

Strengths:

The paper is well-written and easy to follow.
The proposed framework does not require pre-defined group labels; this can be an advantage in real-world scenarios where group labels may be unavailable due to privacy concerns.
I like the visualizations that help illustrate the core ideas.

Weaknesses: Please see my questions.

Other Comments or Suggestions

Author Response

2025-04-01

Thanks for your detailed review and valuable feedback. Below is our concise response:

[A1] The explanation is as follows:

Your interpretation of our method is correct. We first learn a score model to capture the complex data distribution. The subsequent score-based reweighting then ensures balanced representation, improving learning in under-represented regions.
Regarding "worst group accuracy", this metric assesses performance based on groups defined by a non-causal sensitive attribute. The group with the lowest accuracy is termed the "worst group", reflecting model robustness when the chosen attribute does not causally determine the target variable. For example, when predicting income levels, city of residence could serve as the grouping attribute. It does not directly determine an individual's income, but different cities may correlate with varying income distributions. Therefore, such correlations may induce bias and imbalance, adversely affecting accuracy for certain groups. Specifically, data imbalance significantly impacts model robustness, as samples from low-density regions are naturally under-represented and form the "worst group". Achieving robust model performance requires maintaining high accuracy across all groups. Our method addresses this issue by reweighting samples based on our score proxy, ensuring balanced representation independent of inherent data distribution biases.

In summary, worst group accuracy serves as a measure of model robustness but can be negatively affected by data imbalance. Our proposed method mitigates this issue by ensuring balanced learning across all samples, resulting in enhanced worst group accuracy and robust model performance.
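The metric described in this answer can be illustrated with a short sketch. It is not the authors' evaluation code: the function name, the toy labels, and the hypothetical "city" attribute are assumptions made only for illustration. It computes per-group accuracy for one grouping attribute and reports the group with the lowest value.

```python
from collections import defaultdict

def worst_group_accuracy(y_true, y_pred, groups):
    """Return (worst_group, accuracy) over groups defined by one attribute."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for yt, yp, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(yt == yp)
    # Per-group accuracy, then the group where it is lowest.
    acc = {g: correct[g] / total[g] for g in total}
    worst = min(acc, key=acc.get)
    return worst, acc[worst]

# Toy data (hypothetical): city "B" is the smaller group.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 1]
city = ["A", "A", "A", "A", "A", "B", "B", "B"]

worst, acc = worst_group_accuracy(y_true, y_pred, city)
overall = sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)
print(worst, round(acc, 3), overall)
```

In this toy case the overall accuracy hides the gap: the larger group is mostly classified correctly while the smaller one is not, which is the situation worst-group accuracy is meant to expose.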

[A2] In our paper, "distribution shift" refers to the differences between imbalanced training data and test data—the latter implicitly treated as balanced under the robustness measure. Please refer to A3 for detailed explanations. Regarding how our dataset presents train-test shifts (i.e., imbalance in training data), please see Table 10, which shows the distribution of samples across different groups.

[A3] This is a valuable insight. Indeed, our method remains beneficial even in scenarios without explicit train-test shifts. The core intuition behind our approach involves modeling data distribution and balancing the original data via our score-based proxy. Train-test shifts do not directly affect either our method or evaluation criteria.

In our paper, we emphasize "distribution shifts" primarily to highlight the distinction between the imbalanced training distribution and the test distribution, the latter being implicitly approximated as balanced under our robustness measure (which computes worst-group accuracies across all sensitive attributes). Specifically, the "shift" we discuss stresses the imbalance present in the training data with respect to either the covariates $x$ or the target class $y$. To measure model robustness, the ideal evaluation criterion is the sum of worst-group accuracies across all potential non-causal attributes. Under this criterion, an ideal model should perform well in every region of the data space, implicitly corresponding to a balanced test distribution. Therefore, the primary cause of degraded model performance under this testing condition is the imbalance present in the training data, creating an implicit train-test shift highlighted in our main text. To mitigate this imbalance, we model the training distribution with our score-based method, thereby effectively enhancing robustness. We will further clarify this point in our final manuscript.
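One possible way to write down the criterion described in this answer is given below. The notation is assumed here for illustration and is not taken from the paper: $\mathcal{A}$ is the set of candidate non-causal attributes, $\mathcal{G}_a$ is the set of groups induced by attribute $a$, $D_g$ is the set of test samples in group $g$, and $f$ is the classifier.

```latex
\mathrm{Rob}(f) \;=\; \sum_{a \in \mathcal{A}} \; \min_{g \in \mathcal{G}_a} \; \mathrm{Acc}_g(f),
\qquad
\mathrm{Acc}_g(f) \;=\; \frac{1}{|D_g|} \sum_{(x, y) \in D_g} \mathbf{1}\!\left[ f(x) = y \right]
```

Under this reading, a classifier scores well only if no group of any attribute is left with low accuracy, which is why the criterion behaves like evaluation on a balanced test distribution.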

[A4] We deeply appreciate your insightful question. Indeed, our score-based method is compatible with tree-based ensembles. To demonstrate this, we conducted an additional experiment employing three tree-based models to assess whether our sample weights improve robustness for models other than neural networks. The results are presented here. For a comprehensive comparison, we also evaluated JTT, a boundary-based method discussed in Section 1, using these tree-based models. The results indicate that our method consistently enhances worst-group accuracy across base models. In contrast, JTT performs well only on certain models. This finding supports our earlier observation about boundary-based methods: they rely solely on decision boundaries, which are disconnected from training distributions, leading to unstable robustness improvements. Conversely, our strategy effectively captures distributional imbalance, generating reliable weights to enhance worst-group accuracy. These results further confirm that our method can work well with tree-based models.
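To make concrete how per-sample weights can enter a tree learner, here is a minimal sketch that is not the authors' experiment and does not use XGBoost or LightGBM. It fits a one-feature decision stump by minimizing weighted 0/1 error; the data and the weights are hypothetical. Libraries such as scikit-learn expose the same idea through a `sample_weight` argument at fit time.

```python
def fit_weighted_stump(xs, ys, weights):
    """Fit a decision stump on one feature by minimizing weighted 0/1 error.

    Returns (threshold, left_label, right_label); the stump predicts
    left_label when x <= threshold and right_label otherwise.
    """
    best = None
    for thr in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            # Weighted error: total weight of the misclassified samples.
            err = sum(
                w for x, y, w in zip(xs, ys, weights)
                if (left if x <= thr else right) != y
            )
            if best is None or err < best[0]:
                best = (err, thr, left, right)
    _, thr, left, right = best
    return thr, left, right

# Hypothetical 1-D data: the samples at x = 8, 9 play the role of a
# sparse region whose label pattern a single split cannot fit for free.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0, 0, 0, 1, 1, 1, 1, 0, 0]

uniform = fit_weighted_stump(xs, ys, [1.0] * 9)
# Upweight the two sparse-region samples (the weights are illustrative).
reweighted = fit_weighted_stump(xs, ys, [1.0] * 7 + [3.0, 3.0])
print(uniform, reweighted)
```

With uniform weights the stump splits at 3 and misclassifies both sparse-region samples; with the larger weights it splits at 7 and classifies them correctly, at the cost of the samples at x ≤ 3. A single stump is too weak to fit both regions, so the example only shows that weights change which split is chosen, not that accuracy improves everywhere.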

We would like to thank you once again for raising these concerns. We believe that these discussions significantly enhance the rigor of our paper. Please feel free to reach out with any questions—we are more than happy to engage further.

Reviewer Comment

2025-04-03

Thank you for the detailed reply. Most of my concerns are resolved, and I have adjusted my score accordingly.

Author Comment

2025-04-03

We sincerely appreciate your valuable suggestions and supportive feedback. Your insightful comments have greatly contributed to improving the quality of our manuscript. We will carefully incorporate the new experiments and related discussions into the final version.

We are also pleased to confirm that we have already addressed your previous concerns. Should you have any additional suggestions, we would be more than happy to engage in further discussions and make any necessary refinements to the manuscript.

Best Regards,

Authors

Review

Rating: 1 · 2025-03-14

This paper introduces a score-based approach to address data distribution imbalance. First, a variational autoencoder (VAE) is used to transform raw data into a latent representation space, where a diffusion model is applied to learn the joint data distribution. The proposed method estimates relative density using score similarities between noisy data points, mitigating the extreme weight imbalances caused by traditional log-likelihood calculations. A reweighting scheme is then introduced, where each sample is assigned a weight based on its relative density difference, ensuring a more balanced training distribution. Finally, an unbiased classification model is trained using these reweighted samples, and during inference, only the trained encoder and classifier are used for prediction. The proposed method shows its effectiveness on several tabular datasets.
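The point about extreme weights can be seen in a toy calculation. This sketch is not from the paper: it assumes, purely for illustration, that a likelihood-based scheme weights each sample by the inverse of its density, and it uses a standard normal distribution as the data model.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a univariate normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Assumed weighting rule (illustrative only): w(x) = 1 / p(x).
points = [0.0, 1.0, 2.0, 4.0]
inv_density = [1.0 / normal_pdf(x) for x in points]

# Ratio between the weight in the tail (x = 4) and at the mode (x = 0).
# Analytically this is exp(4**2 / 2) = exp(8), roughly 2981.
ratio = inv_density[-1] / inv_density[0]
print([round(w, 1) for w in inv_density], round(ratio, 1))
```

A single tail sample would outweigh thousands of samples near the mode under this rule, which is the kind of imbalance a bounded similarity-based proxy is meant to avoid.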

Questions for Authors

Does using an autoencoder impose restrictions on the choice of network architecture?
Could you provide experimental results on more datasets?
In tabular data classification, there are many datasets where tree-based models outperform neural networks. Do the authors think it is acceptable not to compare the proposed method with tree-based models?
Do the used datasets not contain categorical features?
In tabular datasets, where each dataset has vastly different characteristics, I believe that, as shown in [2], comparing performance across various benchmark datasets [4, 5] is necessary to validate the effectiveness of the proposed method. What do you think about this opinion?

[2] Better by Default: Strong Pre-Tuned MLPs and Boosted Trees on Tabular Data
[4] Asuncion, Arthur, and David Newman. "UCI machine learning repository." Nov. 2007.
[5] Gijsbers, Pieter, et al. "Amlb: an automl benchmark." Journal of Machine Learning Research 25.101 (2024): 1-65.

Claims and Evidence

The mathematical formulations and logical progression appear well-structured with no major issues.

Methods and Evaluation Criteria

The proposed method appears to be appropriately designed for achieving unbiased learning. However, benchmark tabular datasets are highly diverse, with varying characteristics, and evaluations typically involve a large number of datasets. Since the proposed method has not been evaluated on a large number of tabular datasets, additional validation is needed.

Theoretical Claims

I did not identify any major issues.

Experimental Design and Analysis

When measuring performance, only three runs were conducted, and it is unclear why the standard deviation was reported separately in a different table. Additionally, no information is provided on dataset statistics, such as the number of categorical features or the number of samples. Furthermore, there is no discussion of time complexity or scalability.

Supplementary Material

Yes, it contains the source code.

Relation to Existing Literature

If the method is shown to be effective for classification on general tabular datasets, it would have significant potential for broader applications and extensions.

Essential References Not Discussed

The paper has omitted literature on tabular classification [1,2,3].

[1] Gorishniy, Yury, et al. "TabR: Tabular Deep Learning Meets Nearest Neighbors." The Twelfth International Conference on Learning Representations.
[2] Holzmüller, David, Léo Grinsztajn, and Ingo Steinwart. "Better by default: Strong pre-tuned mlps and boosted trees on tabular data." Advances in Neural Information Processing Systems 37 (2024): 26577-26658.
[3] Prokhorenkova, Liudmila, et al. "CatBoost: unbiased boosting with categorical features." Advances in neural information processing systems 31 (2018).

Other Strengths and Weaknesses

Strengths

S1. This paper is well-written.
S2. The motivation is clear.

Weaknesses

See above.

Other Comments or Suggestions

No.

Author Response

2025-04-01

We sincerely appreciate your detailed reviews. Below are our responses to your concerns (R) and questions (A):

[R1] Due to readability considerations, we report standard deviations in Appendix A.3. For detailed dataset statistics, please refer to Appendix A.5, which lists the attributes selected for evaluation, and Table 10, which provides sample statistics. In addition, Table 7 clarifies how our hyperparameters affect runtime. For all these suggestions, the revisions are currently underway and we believe our next version will meet your requirements.

[R2] We appreciate your suggestions on the undiscussed studies and will incorporate them in our manuscript.

[A1] The primary role of the autoencoder in our method is to provide latent representations for subsequent score-based modeling. It serves solely as a module for semantic compression and the architecture design does not directly affect subsequent score-based modeling. In our implementation, we follow the practice from TabSyn [1]. Specifically, a tokenizer first converts both numerical and categorical features into an embedding space of dimension $d \times (N_{\text{num}} + \sum_{i=1}^{N_{\text{cat}}} C_i)$, where $N_{\text{num}}$, $N_{\text{cat}}$, and $C_i$ denote the number of numerical features, the number of categorical features, and the number of categories in the $i$-th categorical feature, respectively. Then a transformer serves as encoder and decoder, and its output is detokenized to reconstruct the original features. Only the tokenizer and encoder are used for further score-based modeling.
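As a worked instance of the dimension formula in this answer, the sketch below simply evaluates $d \times (N_{\text{num}} + \sum_i C_i)$. The function name and the example values (an embedding width of 4, six numerical features, and three categorical features with 2, 5, and 9 categories) are hypothetical and not taken from the paper's datasets.

```python
def token_embedding_size(d, n_num, categories_per_feature):
    """Evaluate d * (N_num + sum_i C_i) from the formula quoted above."""
    return d * (n_num + sum(categories_per_feature))

# Hypothetical table: 6 numerical features, 3 categorical features.
size = token_embedding_size(d=4, n_num=6, categories_per_feature=[2, 5, 9])
print(size)  # 4 * (6 + 16) = 88
```

The sum over $C_i$ means that high-cardinality categorical features dominate this count, so the size grows with the total number of categories rather than with the number of categorical columns.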

[A2] Sure. We have conducted additional experiments on three datasets from [3] and [4], as you suggested. The results are listed here. Our method consistently delivers the best worst-group accuracy, clearly enhancing robustness. While mean accuracy sometimes reflects a trade-off, our method often matches or exceeds baseline performance, confirming that our robustness is improved through effective score-based balancing rather than sacrificing overall accuracy. For more details about datasets, please check A5.

[A3] We would like to first clarify that our main contribution is a novel score-based method designed to enhance robustness, rather than providing a new model architecture. Therefore, our approach (and those boundary-based baselines) is complementary to model architectures (e.g., neural networks, tree-based models), making direct comparisons inappropriate.

However, our reweighting strategy can indeed integrate with tree-based models. To address your concern, we conducted extra experiments using 3 tree-based models on 4 datasets. Details are provided in our A4 to Reviewer eedu. These results confirm that our score-based weights could work well with tree models to enhance robustness.

[A4] Nearly all datasets used in our experiments contain categorical features. How we preprocess these features into latent embeddings is detailed in A1.

[A5] Validating performance across diverse datasets is crucial, and our experiments have already been designed with this principle in mind. Given the primary objective of our paper—to enhance robustness under various data imbalances—we selected datasets reflecting different imbalance scenarios. Specifically, datasets in columns 2–9 of Table 1 primarily exhibit class-label $y$ imbalance. To evaluate robustness under covariate $x$ imbalance, we adopted the ACS dataset with state-level validation (columns 2–7 in Table 2). Thus, we believe the selected datasets comprehensively cover key data imbalance challenges.

Regarding your references, we kindly clarify that [3] is the source repository for several datasets we used (e.g., Adult, Bank). :) Given time and resource constraints, evaluating all datasets from [3, 4] is infeasible. However, our original experiment design has tested our method on datasets used in prior studies [1, 2]. Moreover, we also appended extra experiments on 3 new datasets from [3, 4], as reported in A2. Thus, we believe our current validation sufficiently reveals the effectiveness of our method across diverse imbalance scenarios.

We deeply appreciate your suggestions and will incorporate these clarifications into our revised paper. Please feel free to reach out with any new concerns about our method—we are more than happy to engage further.

[1] Zhang, Hengrui, et al. "Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space." The Twelfth International Conference on Learning Representations.

[2] Liu, Jiashuo, et al. "On the need for a language describing distribution shifts: Illustrations on tabular datasets." Advances in Neural Information Processing Systems 36 (2023).

[3] Asuncion, Arthur, and David Newman. "UCI machine learning repository." Nov. 2007.

[4] Gijsbers, Pieter, et al. "Amlb: an automl benchmark." Journal of Machine Learning Research 25.101 (2024).

Reviewer Comment

2025-04-09

Thank you for your response. However, my concern regarding the choice of evaluated datasets has grown. Tabular data is highly diverse, and no single method consistently outperforms others across all datasets. Accordingly, recent research on tabular data [1, 2] tends to conduct broad evaluations across multiple datasets within benchmarks (e.g., the AutoML benchmark). For example, RealMLP [2] is evaluated on 90 tabular datasets. In contrast, the proposed method reports performance improvements on all evaluated datasets, which raises a strong suspicion that the authors may have cherry-picked only the favorable results.

Furthermore, despite the fact that tree-based methods have demonstrated strong performance on tabular datasets, the authors only include experiments involving tree-based methods on four datasets. While I agree that the proposed method shows some promise, I believe that its effectiveness should be supported by experiments conducted on a large number of datasets.

Therefore, I lowered my score to 1.

[1] Holzmüller, David, Léo Grinsztajn, and Ingo Steinwart. "Better by default: Strong pre-tuned mlps and boosted trees on tabular data." Advances in Neural Information Processing Systems 37 (2024): 26577-26658.
[2] Gorishniy, Yury, et al. "TabR: Tabular Deep Learning Meets Nearest Neighbors." The Twelfth International Conference on Learning Representations.

Review

Rating: 5 · 2025-03-15

The paper proposes a latent score-based reweighting framework to improve robustness in machine learning models on tabular data, addressing biases from imbalanced distributions. Unlike existing methods that require prior group labels or focus only on P(Y|X), the approach leverages score-based (diffusion) models to estimate the joint distribution P(X, Y). By using directional similarity of score vectors as a proxy for density, it identifies and upweights underrepresented data regions without relying on unstable raw density estimates. Experiments on tabular datasets under distribution shifts demonstrate improved performance and fairness, making the method broadly applicable in scenarios with unknown biases.

update after rebuttal

I believe the authors have adequately addressed all of my concerns. I have also reviewed the points raised by the other reviewers and continue to find this work solid, novel, and impactful. There are no remaining concerns—minor or major—from my side, so I have adjusted my score accordingly.

Questions for Authors

See previous parts.

Claims and Evidence

Yes, the claims made in the submission are supported by clear and convincing evidence. The paper provides a well-structured empirical evaluation across six diverse datasets, ensuring a comprehensive validation of its method under different types of distribution shifts. The evaluation metrics focus on worst-group accuracy, aligning with the stated goal of improving robustness. The experimental design follows best practices, including multiple runs with different random seeds and comparisons against strong baseline methods such as ERM, DRO variants, JTT, EIIL, and FAM. The results consistently demonstrate the superiority of the proposed method, with at least a 3% improvement in worst-case accuracy across datasets. The inclusion of standard deviations, detailed dataset preprocessing, and hyperparameter choices further strengthens the credibility of the findings. Additionally, the generalization experiments on the ACS Income dataset validate the method’s robustness across different environments, reinforcing its ability to handle distribution shifts effectively. Overall, the empirical evidence strongly supports the paper’s key claims.

Methods and Evaluation Criteria

The proposed method is well-suited for unbiased learning, effectively handling covariate and label shifts through score-based diffusion models on latent representations. The probability density proxy via similarity difference offers a practical alternative to exact log-likelihood computation, mitigating extreme weight imbalances. Sample reweighting ensures balanced training without requiring prior knowledge.

Theoretical Claims

Yes, the problem formulation and overall theoretical claims are correct.

Experimental Design and Analysis

The experimental design appears methodologically sound in leveraging score-based diffusion models for unbiased learning. The use of latent representations via VAEs ensures computational efficiency while capturing meaningful semantics, reducing reliance on raw feature biases. The probability density proxy (SimDiff) is a novel way to approximate sample importance without explicit likelihood estimation, which helps prevent extreme weighting issues. However, it is unclear whether the choice of noise schedule (σ(t)) and temperature parameter (τ) was systematically validated. Sensitivity analysis on these hyperparameters would strengthen the robustness of the method. Additionally, standard benchmarks with known biases (e.g., Waterbirds, CelebA for fairness, UCI datasets for covariate shift) were not included.

Supplementary Material

Yes, all of it.

Relation to Existing Literature

The key contributions of this paper build upon and extend several important areas in the broader scientific literature, particularly in score-based generative modeling, bias mitigation, and distributional robustness. The foundation of the work relies on score-based diffusion models (Song & Ermon, 2019), which have demonstrated state-of-the-art performance in generative modeling by estimating probability densities through iterative noise perturbation and reconstruction. By integrating latent diffusion techniques (Rombach et al., 2022), the proposed method efficiently models data distributions in a lower-dimensional space, aligning with prior research that has shown latent representations to be more effective for capturing meaningful semantic structures.

In terms of bias mitigation and robustness, this work shares objectives with reweighting-based debiasing approaches (e.g., importance sampling in fairness-aware learning) but introduces a novel probability density proxy (SimDiff) to achieve unbiased sample weighting. Unlike traditional methods that rely on explicit likelihood estimation—which can be sensitive to extreme values—the similarity-based density estimation method provides a more stable alternative. The paper also builds on findings in fair representation learning (Higgins et al., 2017; Karras et al., 2022), where latent space transformations have been used to decorrelate sensitive attributes from predictions. However, its class-conditional score modeling introduces a new perspective by explicitly accounting for label shifts alongside covariate shifts. This distinction enhances its applicability to fairness-sensitive domains, aligning with recent work on distributionally robust optimization (DRO) and adaptive sample weighting for fairness (Xu et al., 2023).

Essential References Not Discussed

Maybe some more recent works on debiasing like SELF (LaBonte et al.) or EVaLS (Ghaznavi et al.) would be nice to mention.

Other Strengths and Weaknesses

The paper demonstrates strong originality by combining score-based generative modeling with unbiased learning techniques, presenting a novel way to address distributional shifts without requiring prior assumptions. This integration of latent diffusion models with class-conditional probability estimation is a meaningful contribution, as it removes restrictive assumptions found in traditional bias mitigation methods. The approach is particularly significant for fairness-sensitive applications, as it provides a solution that adapts to both covariate and label shifts.

Additionally, the visualizations and figures are a notable strength, as they effectively illustrate key concepts such as probability density estimation and reweighting. However, Figure 1’s captions are unclear and difficult to follow, making it harder for readers to interpret the figure’s intended message. Improving clarity in figure descriptions would enhance the overall readability.

Other Comments or Suggestions

The captions in Figure 1 are mixed and hard to follow. Revising them for clarity would improve the reader’s understanding of the figure’s significance.

Author Response

2025-04-01

We would like to thank you for providing helpful comments and positive feedback. Below are our responses to your concerns.

Lack of details about noise preconditioning factor $\sigma$ and hyper-parameter $\tau$

[A1] We appreciate this observation. Regarding the network preconditioning factor $\sigma$, we adopt the methodology proposed in EDM [1] and TabSyn [2] (see Appendix A.2 and Table 4 for details). $\tau$ controls the strength of reweighting based on $\text{SimDiff}(\cdot)$. Its sensitivity analysis is presented in Figure 4 and discussed in Section 4.7.
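The reply does not restate how $\tau$ enters the weights, so the sketch below is only one plausible reading and should not be taken as the paper's formula. It assumes a softmax-style rule, $w_i \propto \exp(-s_i / \tau)$, where $s_i$ is a density proxy that is larger in denser regions; both the rule and the proxy values are assumptions made for illustration.

```python
import math

def temperature_weights(density_proxy, tau):
    """ASSUMED form (not from the paper): w_i proportional to exp(-s_i / tau).

    Weights are rescaled so that they average to 1 over the samples.
    """
    logits = [-s / tau for s in density_proxy]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    n = len(exps)
    return [n * e / z for e in exps]

# Hypothetical proxy values: the last sample sits in a sparse region.
proxy = [0.9, 0.8, 0.85, 0.2]
sharp = temperature_weights(proxy, tau=0.1)   # strong reweighting
flat = temperature_weights(proxy, tau=10.0)   # close to uniform
print([round(w, 3) for w in sharp])
print([round(w, 3) for w in flat])
```

Under this assumed rule, a small $\tau$ concentrates almost all weight on the sparse-region sample, while a large $\tau$ leaves the weights nearly uniform, which is one way a temperature can control the strength of reweighting.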

Lack of discussion for essential references

[A2] We thank the reviewer for highlighting these previous studies. We will incorporate a detailed discussion of these works in the final manuscript.

Revise the captions in Figure 1

[A3] We appreciate this suggestion. Figure 1 displays the results from our synthetic experiment, illustrating how boundary-based methods erroneously upweight samples based on the model's prediction boundary rather than the implicit distribution. In contrast, our score-based proxy effectively models the distribution without requiring prior knowledge, thereby yielding more reliable sample weights for robust classification. We will update the caption in the subsequent version of the manuscript.

We would like to express our sincere thanks once again for your efforts in reviewing our paper. Should you have any further comments or queries, we would greatly appreciate the opportunity to address them promptly.

[1] Karras, Tero, et al. "Elucidating the design space of diffusion-based generative models." Advances in neural information processing systems 35 (2022): 26565-26577.

[2] Zhang, Hengrui, et al. "Mixed-type tabular data synthesis with score-based diffusion in latent space." arXiv preprint arXiv:2310.09656 (2023).

Reviewer Comment

2025-04-02

I would like to thank the authors for their response. I believe this is a novel and valuable work in the field. I encourage the authors to include the missing previous studies in their final version. I have adjusted my score accordingly. All the best.

Author Comment

2025-04-02

We sincerely thank you for your valuable suggestions and supportive feedback. Your previous comments have significantly contributed to improving the quality of our manuscript, and we truly appreciate it. In addition to the missing studies you highlighted, we also plan to include other experiments requested by the other reviewers. For further details, please refer to A4 in the response to Reviewer eedu. We believe the next version of our manuscript will comprehensively address all of your concerns. :)

Should you have any additional suggestions, we are more than happy to engage in further discussions and make any necessary improvements to the paper.

Best regards,

Authors

Final Decision: Accept (poster)

2025-05-01

This paper introduces a novel framework for improving robustness in machine learning models trained on imbalanced tabular data. The authors propose using score-based models (diffusion models) to capture the joint data distribution P(X,Y) without requiring prior group labels. Their approach estimates sample density using score-based models. By identifying low-density regions where data is underrepresented, they upweight samples from these regions, aiming to ensure less biased representation during training. The proposed method is tested on several tabular datasets.

Several reviewers highlighted significant strengths of this work. Reviewer xfS3 noted the paper "provides a well-structured empirical evaluation across six diverse datasets" with results that "consistently demonstrate the superiority of the proposed method, with at least a 3% improvement in worst-case accuracy across datasets." The reviewer also highlighted the "novel way to approximate sample importance without explicit likelihood estimation, which helps prevent extreme weighting issues." Reviewer eedu appreciated that the paper is well-written and that the proposed framework does not require pre-defined group labels, an advantage since in many applications group labels are not available.

Weaknesses were also identified by the reviewers. Reviewer kxsQ pointed out that "no information is provided on dataset statistics, such as the number of categorical features or the number of samples" and "there is no discussion of time complexity or scalability." I find these concerns relatively minor and easily addressed in a camera-ready version, particularly since the datasets used are fairly standard.

The main concern raised by kxsQ, however, was about the evaluation, stating that "the proposed method has not been evaluated on a large number of tabular datasets." The reviewer went so far as to lower their score, suspecting that the results might be cherry-picked.

Despite the criticisms by kxsQ, I believe the paper merits acceptance based on its new approach and effective benchmarks -- a sentiment echoed by the other two reviewers. As Reviewer xfS3 noted, the paper does seem to include novel results, and the experimental results consistently show improvements across various datasets. However, I strongly encourage the authors to include their code with the final submission -- particularly to assuage the concerns raised by reviewer kxsQ. Failure to do so would significantly weaken the contribution.

Overall, the paper addresses a significant challenge in machine learning, namely improving robustness over specific subsets of the input space, and provides a practical solution that shows favourable performance on benchmarks.