PaperHub
7.0/10
Poster · 3 reviewers
Ratings: 4, 4, 5 (min 4, max 5, std 0.5)
Confidence: 3.0
Novelty: 3.3 · Quality: 2.3 · Clarity: 3.3 · Significance: 3.0
NeurIPS 2025

Distributionally Robust Feature Selection

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

A novel approach to perform distributionally robust feature selection in a model-agnostic manner

Abstract

Keywords
DRO, feature-selection

Reviews and Discussion

Review
Rating: 4

The paper under review proposes a new method for distributionally robust feature selection, using an insightful continuous relaxation of an intractable, non-convex problem. The authors provide helpful intuition for their method and evaluate it via simulations and real data analysis.

Strengths and Weaknesses

The primary strength of the paper is its intuitive derivation of the method. Section 3 (Methods) is well written and clear. Furthermore, the authors present a very early (if not explicitly the first) attempt at a solution for distributionally robust feature selection. Since distributionally robust methods are growing in popularity, this constitutes a valuable contribution to the machine learning field.

However, the evaluation of the proposed method leaves much to be desired. The authors only evaluate the methods under one data generating mechanism, with the test set being drawn using the same functional relationship between covariates and outcome as the training data. The advantage of DRO in principle is that good performance is guaranteed for nearly arbitrary distribution shifts; this is not explored at all in the simulations. I would like to see a more expansive suite of simulations exploring different functional forms between the covariates and outcome, as well as varying degrees of distribution shift between the training and test datasets. Since the authors do not provide any theoretical guarantees for their proposed method, I would like to see more empirical evidence for the effectiveness of their method in practice.

Questions

  1. The form of the relaxation of problem (1) as detailed in Section 3 is tailored for a square loss. However, the authors use their method in the experimentation with the cross entropy loss to satisfying effect. What is the authors' understanding of the validity of their proposed procedure under different loss functions or distributions on the outcome $Y$? How does the performance of the proposed method change under different, perhaps non-smooth loss functions?

  2. How does the proposed method fare under larger dimensions? The simulations are run with only 14 features, and the real data examples are not much larger. Is there a severe computational burden in higher dimensions? It would be very helpful to see a table of runtime statistics, since this is nominally an optimization paper.

  3. How does the method perform under different data generating mechanisms between the train and test data? More diverse simulations would be extremely valuable to this paper.

Limitations

Yes

Final Justification

In the rebuttal period, the authors provided strong simulation results in response to my questions, as well as the points raised by the other reviewers. I think that these simulation results, which cover a wider range of scenarios than those explored in the original version of the paper, improve the evidence of the efficacy of the proposed method. I still hold minor reservations over the fact that the proposed algorithm is derived using the bias-variance decomposition of the MSE, which is why I keep my score at 4 rather than raise it higher. However, I understand that this would constitute a non-trivial extension to the submitted paper and is not feasible to address within the rebuttal period.

Formatting Issues

NA

Author Response

We thank reviewer 789K for their feedback and suggestions, which we address below.

Questions:

Q1: The form of the relaxation of problem (1) as detailed in Section 3 is tailored for a square loss. However, the authors use their method in the experimentation with the cross entropy loss to satisfying effect. What is the authors' understanding of the validity of their proposed procedure under different loss functions or distributions on the outcome? How does the performance of the proposed method change under different, perhaps non-smooth loss functions?

Our derivation uses the bias-variance decomposition specific to MSE. However, there exist generalized variance decompositions for other loss functions. Using such generalized decompositions could allow our method to directly optimize cross-entropy or other losses, but this is a nontrivial extension we leave for future work. Our empirical results with cross-entropy (UCI experiment) indeed suggest that optimizing MSE in our method may transfer well to other losses, particularly losses that are proper scoring rules akin to MSE.

Q2: How does the proposed method fare under larger dimensions? The simulations are run with only 14 features, and the real data examples are not much larger. Is there a severe computational burden in higher dimensions? It would be very helpful to see a table of runtime statistics, since this is nominally an optimization paper.

Our method requires $O(P \cdot b \cdot n \cdot K \cdot d)$ operations per iteration, where $P$ is the number of populations, $b$ is the number of Monte Carlo samples, $n = \max_p n_p$ is the maximum population size, $K$ is the number of nearest neighbors used for kernel weight computation, and $d$ is the feature dimensionality.

We now include a synthetic experiment where we vary the dimension of the data under the same generative process (increasing the number of irrelevant/noise variables). Please see the last section of our rebuttal for the experiment results. The runtime statistics per epoch (averaged across 3 runs) for this experiment are given in the table below:

| Dimension (d) | Time per epoch (s) (Mean±Std) |
|---|---|
| 15 | 1.8988±0.0026 |
| 50 | 1.9217±0.0044 |

For $d=15$, we run 50-64 epochs, while for $d=50$, we run 126-150 epochs.

Q3: How does the method perform under different data generating mechanisms between the train and test data? More diverse simulations would be extremely valuable to this paper.

We wish to clarify that our method is designed for Group DRO, which we discuss in our related work section (Section 1.1, citing Sagawa et al. [2019]). Group-DRO optimizes for worst-case performance across a priori known subpopulations (e.g., demographic groups, different hospitals). This setting is distinct from DRO over arbitrary distribution shifts. Thus our experimental setup uses predefined groups rather than exploring arbitrary functional form changes.

We have run new synthetic experiments with different generative processes such as a purely linear model, a mixed linear-nonlinear model with heterogeneous noise, and a sparse linear model with correlated noise variables. Across these, our method outperforms the baselines or closely follows the best performing ones. We include only the first experiment in the last section of this rebuttal due to character limits. Our final paper will have more experiments.

Additional comments

C1: The test set is drawn using the same functional relationship between covariates and outcome as the training data. The advantage of DRO in principle is that good performance is guaranteed for nearly arbitrary distribution shifts; this is not explored at all in the simulations. I would like to see a more expansive suite of simulations exploring different functional forms between the covariates and outcome, as well as varying degrees of distribution shift between the training and test datasets. Since the authors do not provide any theoretical guarantees for their proposed method, I would like to see more empirical evidence for the effectiveness of their method in practice.

We request the reviewer to refer to our response to Q3. Additionally, our algorithmic framework can be extended to more general f-divergence DRO. This would involve fitting a series of conditional mean models for the worst-case distribution at each optimization step, rather than one model per predefined group. The core procedure remains the same, highlighting the flexibility of our approach. We agree that exploring these broader distribution shift scenarios is a valuable direction for future work.

Regarding more simulations, we have run new experiments on synthetic datasets. We include only one of them here due to character limits. However, you can find the other experiments in our response to reviewer ggn5. They also include:

  • Additional downstream model for evaluating performance: we now include an MLP, along with the existing random forest (RF), as a downstream task model.
  • New baseline methods: we've added Forward Selection and Backward Elimination, with their DRO variants (along with the existing baseline methods in the original paper).
  • Additional data generation settings to explore different covariate-outcome functional relationships.

Experiment

Population Structure:

  • A (40%): $Y = 8X_0 + 6X_1 - 4X_2 + 3X_3 + 2X_4 + \epsilon$
  • B (35%): $Y = -8X_0 - 6X_1 + 4X_2 - 3X_3 - 2X_4 + 8X_5 + 6X_6 + \epsilon$
  • C (25%): $Y = 10X_7 + 8X_8 + 6X_9 - 5X_{10} + \epsilon$

Noise: $\epsilon \sim \mathcal{N}(0, 0.1^2)$

We use a dataset of size 36000, with a 40-60 train-(test/val) split for each population. We report the results for two feature budgets (5 and 10).
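For reference, a minimal sketch of this generative process (the covariate distribution is not specified in this rebuttal, so standard normal covariates are an assumption here, and the helper name `make_population` is ours):

```python
import numpy as np

def make_population(n, coefs, d=15, rng=None):
    """One synthetic population: Y = sum_k coefs[k] * X_k + eps, eps ~ N(0, 0.1^2).

    Columns not listed in `coefs` are irrelevant noise features; increasing d
    adds more of them without changing the signal.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = rng.normal(size=(n, d))                 # covariate distribution assumed
    idx, w = zip(*sorted(coefs.items()))
    y = X[:, list(idx)] @ np.array(w, dtype=float) + rng.normal(0.0, 0.1, size=n)
    return X, y

rng = np.random.default_rng(0)
pops = {  # 40% / 35% / 25% of 36000 samples
    "A": make_population(14400, {0: 8, 1: 6, 2: -4, 3: 3, 4: 2}, rng=rng),
    "B": make_population(12600, {0: -8, 1: -6, 2: 4, 3: -3, 4: -2, 5: 8, 6: 6}, rng=rng),
    "C": make_population(9000, {7: 10, 8: 8, 9: 6, 10: -5}, rng=rng),
}
```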

| Dimension | Method | Population | MSE±Std (MLP) — Budget 5 | MSE±Std (RF) — Budget 5 | MSE±Std (MLP) — Budget 10 | MSE±Std (RF) — Budget 10 |
|---|---|---|---|---|---|---|
| 15 | Our Method | A | 0.2251±0.0029 | 0.2486±0.0018 | 0.0342±0.0004 | 0.0535±0.0010 |
| | | B | 0.2888±0.0113 | 0.3194±0.0098 | 0.0207±0.0005 | 0.0804±0.0036 |
| | | C | 0.2787±0.0150 | 0.3039±0.0130 | 0.0039±0.0004 | 0.0339±0.0079 |
| 50 | | A | 0.2227±0.0094 | 0.2420±0.0161 | 0.0241±0.0178 | 0.0450±0.0129 |
| | | B | 0.5805±0.0375 | 0.6158±0.0448 | 0.0150±0.0096 | 0.0809±0.0035 |
| | | C | 0.1208±0.0034 | 0.1388±0.0011 | 0.0480±0.0761 | 0.0697±0.0605 |
| 15 | (BL) DRO Lasso | A | 0.5071±0.0129 | 0.5434±0.0191 | 0.0338±0.0009 | 0.0534±0.0009 |
| | | B | 0.2886±0.0086 | 0.3194±0.0026 | 0.0211±0.0001 | 0.0807±0.0041 |
| | | C | 0.2755±0.0150 | 0.3031±0.0113 | 0.0039±0.0004 | 0.0340±0.0078 |
| 50 | | A | 0.4155±0.1769 | 0.4418±0.1866 | 0.0345±0.0004 | 0.0511±0.0029 |
| | | B | 0.2377±0.0909 | 0.2677±0.0972 | 0.0206±0.0006 | 0.0835±0.0067 |
| | | C | 0.3642±0.1350 | 0.3849±0.1300 | 0.0040±0.0007 | 0.0354±0.0012 |
| 15 | (BL) DRO XGB | A | 0.5107±0.0113 | 0.5441±0.0188 | 0.0738±0.0370 | 0.0887±0.0327 |
| | | B | 0.2888±0.0081 | 0.3187±0.0030 | 0.0408±0.0198 | 0.0928±0.0145 |
| | | C | 0.2762±0.0151 | 0.3032±0.0112 | 0.0036±0.0003 | 0.0337±0.0076 |
| 50 | | A | 0.6827±0.3136 | 0.7120±0.3138 | 0.1041±0.0645 | 0.1139±0.0597 |
| | | B | 0.3857±0.1461 | 0.4209±0.1495 | 0.0613±0.0416 | 0.1098±0.0326 |
| | | C | 0.2259±0.0928 | 0.2479±0.0926 | 0.0040±0.0005 | 0.0350±0.0010 |
| 15 | (BL) Lasso | A | 1.0087±0.0153 | 1.0571±0.0107 | 0.2056±0.1714 | 0.2106±0.1625 |
| | | B | 0.5744±0.0051 | 0.6150±0.0052 | 0.1160±0.0921 | 0.1460±0.0763 |
| | | C | 0.1080±0.0107 | 0.1283±0.0102 | 0.0036±0.0003 | 0.0340±0.0076 |
| 50 | | A | 1.0181±0.0526 | 1.0429±0.0387 | 0.4126±0.5303 | 0.3859±0.4760 |
| | | B | 0.5880±0.0323 | 0.6264±0.0284 | 0.2327±0.2912 | 0.2504±0.2401 |
| | | C | 0.1351±0.0252 | 0.1550±0.0267 | 0.0037±0.0001 | 0.0354±0.0009 |
| 15 | (BL) XGB | A | 1.0136±0.0076 | 1.0623±0.0039 | 0.4051±0.2802 | 0.3905±0.2579 |
| | | B | 0.5702±0.0008 | 0.6170±0.0076 | 0.2279±0.1586 | 0.2373±0.1307 |
| | | C | 0.1314±0.0291 | 0.1538±0.0352 | 0.0037±0.0002 | 0.0338±0.0077 |
| 50 | | A | 1.0211±0.0474 | 1.0421±0.0400 | 1.1132±0.0439 | 1.0216±0.0507 |
| | | B | 0.5867±0.0319 | 0.6258±0.0282 | 0.6607±0.0314 | 0.6135±0.0252 |
| | | C | 0.1356±0.0261 | 0.1551±0.0267 | 0.0036±0.0004 | 0.0347±0.0012 |
| 15 | (BL) DRO Forward Selection | A | 0.2257±0.0030 | 0.2498±0.0005 | 0.0344±0.0009 | 0.0534±0.0012 |
| | | B | 0.3274±0.0747 | 0.3596±0.0808 | 0.0208±0.0010 | 0.0806±0.0041 |
| | | C | 0.3752±0.1545 | 0.3991±0.1581 | 0.0037±0.0002 | 0.0341±0.0079 |
| 50 | | A | 0.3735±0.1506 | 0.3996±0.1636 | 0.0341±0.0006 | 0.0514±0.0029 |
| | | B | 0.3729±0.0553 | 0.3995±0.0553 | 0.0209±0.0012 | 0.0833±0.0066 |
| | | C | 0.4735±0.0805 | 0.5124±0.0883 | 0.0038±0.0001 | 0.0354±0.0012 |
| 15 | (BL) DRO Backward Elimination | A | 0.0314±0.0008 | 0.0492±0.0008 | 0.0341±0.0007 | 0.0535±0.0010 |
| | | B | 0.1833±0.0047 | 0.2129±0.0066 | 0.0209±0.0006 | 0.0806±0.0037 |
| | | C | 1.0281±0.0344 | 1.0455±0.0548 | 0.0038±0.0002 | 0.0342±0.0079 |
| 50 | | A | 0.0320±0.0012 | 0.0474±0.0024 | 0.0346±0.0005 | 0.0512±0.0030 |
| | | B | 0.1757±0.0093 | 0.2079±0.0149 | 0.0208±0.0008 | 0.0834±0.0068 |
| | | C | 1.0008±0.0669 | 1.0355±0.0635 | 0.0038±0.0003 | 0.0354±0.0012 |
| 15 | (BL) Forward Selection | A | 1.0183±0.0147 | 1.0430±0.0118 | 1.0887±0.0176 | 1.0132±0.0219 |
| | | B | 0.5745±0.0043 | 0.6221±0.0061 | 0.6430±0.0226 | 0.5956±0.0086 |
| | | C | 1.0298±0.0587 | 1.0571±0.0500 | 0.3572±0.3066 | 0.3407±0.2694 |
| 50 | | A | 1.0132±0.0372 | 1.0408±0.0435 | 1.1209±0.0483 | 1.0224±0.0403 |
| | | B | 0.5879±0.0331 | 0.6276±0.0384 | 0.6509±0.0360 | 0.6109±0.0362 |
| | | C | 0.9808±0.1161 | 1.0040±0.0989 | 1.1410±0.1413 | 0.9882±0.1095 |
| 15 | (BL) Backward Elimination | A | 0.9972±0.0199 | 1.0427±0.0126 | 1.1153±0.0405 | 1.0136±0.0219 |
| | | B | 0.5672±0.0081 | 0.6077±0.0068 | 0.6486±0.0288 | 0.5954±0.0068 |
| | | C | 0.4222±0.2690 | 0.4523±0.2803 | 0.3514±0.3013 | 0.3399±0.2682 |
| 50 | | A | 1.0095±0.0393 | 1.0463±0.0447 | 1.1267±0.0415 | 1.0244±0.0405 |
| | | B | 0.5897±0.0341 | 0.6267±0.0300 | 0.6640±0.0208 | 0.6111±0.0364 |
| | | C | 0.9883±0.1042 | 1.0071±0.0912 | 1.1047±0.1156 | 0.9895±0.1071 |

Discussion

Although the generative process here is linear and variables $X_0$ to $X_4$ have strong effects in both A and B, the signs of their coefficients are reversed between populations. This reduces the effectiveness of LASSO, which tends to select features based on average effects across all data. As a result, vanilla LASSO does not achieve the best performance even in this linear setting.

For budget $=5$, our method outperforms most baselines, and has a balanced performance across populations. For budget $=10$, our method is comparable with the best performing baselines.

Comment

I thank the authors for providing answers to my questions and additional simulation details. Also, thank you for clarifying that your framework corresponds to Group-DRO, rather than DRO over arbitrary distribution shifts. I think that the paper will greatly benefit from the additional simulations that the authors have run in response to my questions as well as the questions of the other reviewers. I will increase my score to a 4 accordingly.

I think that it would be valuable to include a brief remark in your paper regarding the bias-variance decomposition with other loss functions, even if you do not perform the full derivation in this work.

Review
Rating: 4

The paper introduces Distributionally Robust Feature Selection (DRO-FS), a feature selection method aiming to perform well simultaneously across multiple sub-populations of the data. Instead of using masks, the authors present an interesting mechanism for feature selection: injecting white Gaussian noise into the features. An informative feature would be injected with a small variance noise, whereas a non-informative feature would be injected with high variance noise. Instead of differentiating through an inner training loop, the authors derive an objective based on the conditional variance of the Bayes-optimal predictor, which depends on the variance of the noise through a closed-form kernel weight expression. They use SGD to optimize the worst-case loss across groups. The proposed method is demonstrated using synthetic data and two real-world datasets for predicting household income. The proposed approach yields lower worst-group error than Lasso, XGBoost, and the Group-DRO variants, sometimes by an order of magnitude.

Strengths and Weaknesses

Strengths

  • Previous methods are optimized for a specific model, not for a subpopulation, meaning that they ignore distribution shifts.
  • The noise injection, along with the Bayes-risk objective, is rather cool. The method is model-agnostic, and there's no need for backpropagating through training.
  • The derivation seems correct.
  • The authors provide anonymized code
  • Training details, architecture choices, and optimization schedules are neatly provided.

Weaknesses

  • While the core idea of the paper is interesting, it is poorly presented. The formulation is cumbersome, lacks clarity, and is hard to follow. There are no clear building blocks, just a steady stream of claims. I think the authors should reformulate the way they derive the equations, use definitions, lemmas, and theorems, and provide proofs. Even if that would end up taking more space, the proofs could go to the appendix. Still, I think it would make the reading more fluent and add clarity to the paper.

  • A concise algorithm box and/or a block diagram of the proposed method would improve clarity.

  • How does runtime scale with d (features) and n (samples)?

  • After learning continuous α, how is the final subset obtained? A hard top-k cut or magnitude-based sorting? I did not find an explicit description. The authors should provide a full description, as this is a fundamental issue, and discuss stability across random seeds.

  • So many other methods for feature selection since lasso were not mentioned at all!

  • The authors claim the proposed method is model agnostic, yet they demonstrate the performance of their method using only one choice of downstream model (Random Forest). To convince their choice of features is indeed robust, they should use additional downstream models that are not tree-based.

  • The synthetic task is highly nonlinear. Including a linear-target setting would indicate when Lasso-type methods might still be sufficient.

  • Report the learned sparsity pattern overlap across different random seeds to quantify selection stability.

  • Line 133 – is the ">=0" at the end a sub-index of R?

  • Line 157 – incurr should be incur

  • Line 205 "with with"

  • Line 215 – "involves requires"

  • Line 222 "By applying .." has no subject

  • Line 443 – "initialized"

  • Line 452 a missing reference - ??

  • "Moreoever" should be Moreover appears twice

Questions

  1. Writing - please revise the writing and the derivation of the mathematical formulation, add a concise algorithm box and/or a block diagram of the proposed method to improve clarity.

  2. Discuss other methods for feature selection and compare their performance to the proposed approach's.

  3. Add other data sets AND other downstream predictors.

  4. Interpretability - report the selected features and compare to other methods of feature selection to demonstrate the advantage of the proposed method.

Limitations

Yes

Final Justification

The authors have fully addressed all of my comments. Together with their responses to the other reviewers’ remarks, the authors resolved every concern raised during the review process. In light of these revisions, I am updating my score to 4.

Formatting Issues

none

Author Response

We thank reviewer Guhq for their feedback and suggestions, which we address below.

Questions:

Q1: Writing - please revise the writing and the derivation of the mathematical formulation, add a concise algorithm box and/or a block diagram of the proposed method to improve clarity.

We thank the reviewer for their suggestions on improving Section 3. We have added an algorithm box summarizing our method in the appendix for greater clarity. We also incorporate the suggestion to present our main results as follows:

Theorem 1 (Population-Level Objective). Under the noise-based relaxation $S(\alpha) = X + \epsilon(\alpha)$, where $\epsilon(\alpha) \sim \mathcal{N}(0, \mathrm{diag}(\alpha))$, the distributionally robust feature selection problem

$$\min_{\alpha} \max_{P_i \in \mathcal{P}} \mathbb{E}_{S(\alpha), Y \sim P_i}\left[\mathcal{L}(Y, f_i^*(S(\alpha)))\right] + \lambda\,\mathrm{Reg}(\alpha),$$

where $f_i^*(s) = \mathbb{E}[Y \mid S(\alpha) = s, P_i]$ is the Bayes-optimal predictor, is equivalent to

$$\min_{\alpha} \max_{P_i \in \mathcal{P}} \left( -\mathbb{E}_{S(\alpha) \sim P_i}\left[\mathbb{E}[\mu_i(X) \mid S(\alpha)]^2\right] \right) + \lambda\,\mathrm{Reg}(\alpha),$$

where $\mu_i(X) = \mathbb{E}[Y \mid X, P_i]$.
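As a brief sketch of the reasoning behind this equivalence (our shorthand; the full argument is the step-by-step derivation in Section 3): since the injected noise is independent of $(X, Y)$, the Bayes-optimal predictor satisfies $f_i^*(S(\alpha)) = \mathbb{E}[\mu_i(X) \mid S(\alpha)]$, and expanding the squared loss gives

$$\mathbb{E}\left[(Y - f_i^*(S(\alpha)))^2\right] = \mathbb{E}[Y^2] - \mathbb{E}\left[\mathbb{E}[\mu_i(X) \mid S(\alpha)]^2\right],$$

so minimizing the Bayes risk over $\alpha$ is equivalent to minimizing $-\mathbb{E}[\mathbb{E}[\mu_i(X) \mid S(\alpha)]^2]$, because $\mathbb{E}[Y^2]$ does not depend on $\alpha$.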


Given samples $\{(X_i^j, Y_i^j)\}_{j=1}^{n_i}$ from each population $P_i$, let $\hat{\mu}_i(X)$ be an estimator of $\mu_i(X) = \mathbb{E}[Y \mid X]$ trained on these samples. The empirical form of the population objective from Theorem 1 is:

$$\min_{\alpha} \max_{P_i \in \mathcal{P}} \left( -\hat{\mathbb{E}}_{S(\alpha)}\left[\hat{\mathbb{E}}[\hat{\mu}_i(X) \mid S(\alpha)]^2\right] \right) + \lambda\,\mathrm{Reg}(\alpha)$$

where the expectation $\hat{\mathbb{E}}[\hat{\mu}_i(X) \mid S(\alpha)]$ is taken with respect to the empirical distribution $\hat{P}_i(X) = \frac{1}{n_i}\sum_{j=1}^{n_i} \delta_{X_i^j}(X)$.

Theorem 2 (Kernel Form Equivalence). The empirical expectation

$$\hat{\mathbb{E}}_{S(\alpha)}\left[\hat{\mathbb{E}}[\hat{\mu}_i(X) \mid S(\alpha)]^2\right]$$

is equivalent to

$$\hat{\mathbb{E}}_{S(\alpha)}\left[\left(\sum_{j=1}^{n_i} w_i^j(S(\alpha), \alpha)\, \hat{\mu}_i(X_i^j)\right)^2\right],$$

where the weights are

$$w_i^j(S(\alpha), \alpha) = \frac{\exp\left(-\frac{1}{2}(X_i^j - S(\alpha))^T \mathrm{diag}(\alpha)^{-1}(X_i^j - S(\alpha))\right)}{\sum_{k=1}^{n_i} \exp\left(-\frac{1}{2}(X_i^k - S(\alpha))^T \mathrm{diag}(\alpha)^{-1}(X_i^k - S(\alpha))\right)}.$$


The proofs of these results are exactly the step-by-step derivations currently presented in Section 3. We will create a section in the appendix where we expand the proofs further to improve clarity.
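As an illustration, here is a minimal NumPy sketch of the Theorem 2 quantities for a single population (function and variable names are ours, not from the paper; a practical implementation would restrict the softmax sum to the $K$ nearest neighbors and optimize $\alpha$ via SGD on this estimate):

```python
import numpy as np

def kernel_weights(s, X, alpha):
    """Gaussian kernel weights w_i^j(S(alpha), alpha) from Theorem 2.

    s:     (d,) one noised sample S(alpha)
    X:     (n, d) samples X_i^j from population P_i
    alpha: (d,) per-feature noise variances (all entries > 0)
    """
    diff = X - s                                    # (n, d), broadcasted
    logits = -0.5 * np.sum(diff ** 2 / alpha, axis=1)
    logits -= logits.max()                          # for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def empirical_objective(X, mu_hat_vals, alpha, n_mc=32, rng=None):
    """Monte Carlo estimate of -E_{S(alpha)}[(sum_j w^j mu_hat(X^j))^2];
    mu_hat_vals[j] = mu_hat(X[j]) comes from a pretrained conditional-mean model."""
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(n_mc):
        j = rng.integers(len(X))
        s = X[j] + rng.normal(0.0, np.sqrt(alpha))  # S(alpha) = X + eps(alpha)
        w = kernel_weights(s, X, alpha)
        total += (w @ mu_hat_vals) ** 2
    return -total / n_mc
```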

Q2: Discuss other methods for feature selection and compare their performance to the proposed approach's.

We now include Forward Selection (FS) and Backward Elimination (BE) (plus their DRO variants) as baselines in our experiments. The implementations use k-fold cross-validation to score feature subsets. FS iteratively adds the feature that most improves the cross-validation score. BE starts with all features and iteratively removes the one whose absence hurts performance the least. The DRO variants also follow this greedy logic but optimize for robustness, evaluating feature subsets by finding the worst-case cross-validation score across all individual data populations. We have added experimental results on a new synthetic dataset that includes these baselines to this rebuttal. Our method generally outperforms these baselines on the evaluated metric.
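For concreteness, a minimal sketch of the DRO forward-selection variant as described above (sklearn-based; the function and parameter names are our own, and the exact scoring in our implementation may differ). The DRO backward-elimination variant is the mirror image, starting from all features and removing the least harmful one.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

def dro_forward_selection(model, pops, budget, cv=5):
    """Greedy forward selection scored by the worst-case CV error across groups.

    model: any sklearn regressor (cloned before each fit)
    pops:  list of (X, y) pairs, one per known population
    """
    d = pops[0][0].shape[1]
    selected = []
    while len(selected) < budget:
        best_j, best_score = None, np.inf
        for j in range(d):
            if j in selected:
                continue
            cols = selected + [j]
            # worst-case (max over populations) k-fold CV mean squared error
            worst = max(
                -cross_val_score(clone(model), X[:, cols], y, cv=cv,
                                 scoring="neg_mean_squared_error").mean()
                for X, y in pops
            )
            if worst < best_score:
                best_j, best_score = j, worst
        selected.append(best_j)
    return selected
```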

Q3: Add other data sets AND other downstream predictors.

We now include an MLP (single hidden layer of size 100) as a downstream model in our evaluations, and will do so for all the experiments in the final paper. We have conducted new synthetic experiments; we include only one of them here due to character limits, however our final paper will include more experiments. Please see the last section of our rebuttal for the new experiment.

Q4: Interpretability - report the selected features and compare to other methods of feature selection to demonstrate the advantage of the proposed method.

We will add the selected features for each run in the appendix. Our method is stable across seeds in the subsets of variables that it selects and the order in which it selects them, differing only in a few values. Please see the last section of our rebuttal for an example.

Additional comments

C1: How does runtime scale with d (features) and n (samples)?

Our method requires $O(P \cdot b \cdot n \cdot K \cdot d)$ operations per iteration, where $P$ is the number of populations, $b$ is the number of Monte Carlo samples, $n = \max_p n_p$ is the maximum population size, $K$ is the number of nearest neighbors used for kernel weight computation, and $d$ is the feature dimensionality.

C2: After learning continuous α, how is the final subset obtained?

The final subset is obtained by selecting those variables $i$ with the smallest $\alpha_i$ values (corresponding to minimum noise). We will explicitly state this in our paper.
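In code, this final step is just a sort on the learned noise variances (a minimal sketch; the helper names are ours, and `seed_overlap` is one way the selection stability across random seeds raised by the reviewer could be quantified):

```python
import numpy as np

def select_features(alpha, budget):
    # smallest alpha_i = least injected noise = most informative feature
    return set(np.argsort(alpha)[:budget].tolist())

def seed_overlap(subsets):
    # mean pairwise Jaccard overlap of the subsets selected under different seeds
    pairs = [(a, b) for i, a in enumerate(subsets) for b in subsets[i + 1:]]
    return float(np.mean([len(a & b) / len(a | b) for a, b in pairs]))
```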

C3: Experiments: use additional downstream models that are not tree-based; including a linear-target setting would indicate when Lasso-type methods might still be sufficient; report the learned sparsity pattern overlap across different random seeds to quantify selection stability.

Our new experiments include:

  • Additional downstream model for evaluating performance: we now include an MLP (single hidden layer of size 100), along with the existing random forest (RF), as a downstream task model.
  • New baseline methods: we've added Forward Selection and Backward Elimination, with their DRO variants
  • Additional data generation settings to explore different covariate-outcome functional relationships. Three of these are included in our response to reviewer ggn5: a purely linear model, a mixed linear-nonlinear model with heterogeneous noise, and a sparse linear model with correlated noise variables. Our method outperforms the baselines or closely follows the best performing ones. We include only the first here due to character limits.

C4: Typos

We thank the reviewer for bringing these typos to our notice; we have corrected them. Regarding the question on Line 133 (whether the ">=0" at the end is a sub-index of R): we use this notation, $\mathbb{R}^m_{\geq 0}$, for $m$-dimensional vectors whose entries are non-negative real numbers.

New Synthetic Experiment: Linear generative process

We set dimension $d=15$.

Population Structure:

  • A (40%): $Y = 8X_0 + 6X_1 - 4X_2 + 3X_3 + 2X_4 + \epsilon$
  • B (35%): $Y = -8X_0 - 6X_1 + 4X_2 - 3X_3 - 2X_4 + 8X_5 + 6X_6 + \epsilon$
  • C (25%): $Y = 10X_7 + 8X_8 + 6X_9 - 5X_{10} + \epsilon$

Noise: $\epsilon \sim \mathcal{N}(0, 0.1^2)$

The train-test-val sets for Populations A, B, and C had 14400, 12600, and 9000 samples respectively, and we used a 40-60 train-(test+val) split for each population.

| Method | Population | MSE±Std (MLP) — Budget 5 | MSE±Std (RF) — Budget 5 | MSE±Std (MLP) — Budget 10 | MSE±Std (RF) — Budget 10 |
|---|---|---|---|---|---|
| Our Method | A | 0.2251±0.0029 | 0.2486±0.0018 | 0.0342±0.0004 | 0.0535±0.0010 |
| | B | 0.2888±0.0113 | 0.3194±0.0098 | 0.0207±0.0005 | 0.0804±0.0036 |
| | C | 0.2787±0.0150 | 0.3039±0.0130 | 0.0039±0.0004 | 0.0339±0.0079 |
| (BL) DRO Lasso | A | 0.5071±0.0129 | 0.5434±0.0191 | 0.0338±0.0009 | 0.0534±0.0009 |
| | B | 0.2886±0.0086 | 0.3194±0.0026 | 0.0211±0.0001 | 0.0807±0.0041 |
| | C | 0.2755±0.0150 | 0.3031±0.0113 | 0.0039±0.0004 | 0.0340±0.0078 |
| (BL) DRO XGB | A | 0.5107±0.0113 | 0.5441±0.0188 | 0.0738±0.0370 | 0.0887±0.0327 |
| | B | 0.2888±0.0081 | 0.3187±0.0030 | 0.0408±0.0198 | 0.0928±0.0145 |
| | C | 0.2762±0.0151 | 0.3032±0.0112 | 0.0036±0.0003 | 0.0337±0.0076 |
| (BL) Lasso | A | 1.0087±0.0153 | 1.0571±0.0107 | 0.2056±0.1714 | 0.2106±0.1625 |
| | B | 0.5744±0.0051 | 0.6150±0.0052 | 0.1160±0.0921 | 0.1460±0.0763 |
| | C | 0.1080±0.0107 | 0.1283±0.0102 | 0.0036±0.0003 | 0.0340±0.0076 |
| (BL) XGB | A | 1.0136±0.0076 | 1.0623±0.0039 | 0.4051±0.2802 | 0.3905±0.2579 |
| | B | 0.5702±0.0008 | 0.6170±0.0076 | 0.2279±0.1586 | 0.2373±0.1307 |
| | C | 0.1314±0.0291 | 0.1538±0.0352 | 0.0037±0.0002 | 0.0338±0.0077 |
| (BL) DRO Forward Selection | A | 0.2257±0.0030 | 0.2498±0.0005 | 0.0344±0.0009 | 0.0534±0.0012 |
| | B | 0.3274±0.0747 | 0.3596±0.0808 | 0.0208±0.0010 | 0.0806±0.0041 |
| | C | 0.3752±0.1545 | 0.3991±0.1581 | 0.0037±0.0002 | 0.0341±0.0079 |
| (BL) DRO Backward Elimination | A | 0.0314±0.0008 | 0.0492±0.0008 | 0.0341±0.0007 | 0.0535±0.0010 |
| | B | 0.1833±0.0047 | 0.2129±0.0066 | 0.0209±0.0006 | 0.0806±0.0037 |
| | C | 1.0281±0.0344 | 1.0455±0.0548 | 0.0038±0.0002 | 0.0342±0.0079 |
| (BL) Forward Selection | A | 1.0183±0.0147 | 1.0430±0.0118 | 1.0887±0.0176 | 1.0132±0.0219 |
| | B | 0.5745±0.0043 | 0.6221±0.0061 | 0.6430±0.0226 | 0.5956±0.0086 |
| | C | 1.0298±0.0587 | 1.0571±0.0500 | 0.3572±0.3066 | 0.3407±0.2694 |
| (BL) Backward Elimination | A | 0.9972±0.0199 | 1.0427±0.0126 | 1.1153±0.0405 | 1.0136±0.0219 |
| | B | 0.5672±0.0081 | 0.6077±0.0068 | 0.6486±0.0288 | 0.5954±0.0068 |
| | C | 0.4222±0.2690 | 0.4523±0.2803 | 0.3514±0.3013 | 0.3399±0.2682 |

Discussion

Here, although the generative process is linear and variables $X_0$ to $X_4$ have strong effects in both A and B, the signs of their coefficients are reversed between populations. This reduces the effectiveness of LASSO, which tends to select features based on average effects across all data. As a result, vanilla LASSO does not achieve the best performance even in this linear setting.

For budget $=5$, our method outperforms most baselines, and has a balanced performance across populations. For budget $=10$, our method is comparable with the best performing baselines.

We select those features $i$ whose corresponding $\alpha_i$ values are lowest. Our method consistently selects a similar core ordered set of features across 3 different seeds (in decreasing order of importance): [0, 7, 1, 5, 8, 6, 9, 2, 10, 3], [0, 7, 1, 8, 5, 9, 6, 2, 10, 3], [0, 7, 5, 1, 8, 6, 9, 2, 10, 3]. In contrast, baseline methods, especially non-DRO versions, show significant variability, often selecting irrelevant noisy features (e.g., 13, 14) and failing to consistently identify the true signal variables.

Comment

I would like to thank the authors for addressing my comments. However, one crucial issue remains unresolved, namely, the selection of appropriate baseline methods (as raised in Q2).

Classical feature selection methods are generally divided into three categories:

  1. Filter methods – which are not relevant in this context.
  2. Wrapper methods – which involve searching over feature subsets by repeatedly training a model.
  3. Embedded methods – which incorporate feature selection directly into the model training process by enforcing sparsity.

Your proposed method clearly falls into the third category. The original baseline you used, LASSO, is an embedded method as well. However, the newly added baselines are wrapper methods, which are not a fair comparison.

To convincingly demonstrate the superiority of your approach, it is essential to compare it against other embedded methods, particularly those designed for feature selection in heterogeneous or diverse populations. For example:

Sample-wise feature selection: Yang, Junchen, Ofir Lindenbaum, and Yuval Kluger. "Locally sparse neural networks for tabular biomedical data." International Conference on Machine Learning, PMLR, 2022.

Cluster-level feature selection: Svirsky, Jonathan, and Ofir Lindenbaum. "Interpretable Deep Clustering for Tabular Data." International Conference on Machine Learning, PMLR, 2024.

These (and potentially other) relevant works should be discussed in the Related Work section and included as baselines in the Results section.

Comment

We thank the reviewer for their response. We recognize the importance of comparing our method with other embedded methods. To this end, we add a new 'dro-embedded-mlp' baseline that uses an MLP with a learnable feature mask trained via DRO. We include the tabulated results at the end of this comment.

With respect to the papers shared by the reviewer, we acknowledge these as new works in the space of feature selection, and will include them in our discussion of related works. However, these methods address problem settings different from our Group-DRO setting, and thus cannot be directly translated to our task: Yang et al. (LSPIN) and Svirsky & Lindenbaum (IDC) select different features for individual samples and clusters respectively, while our method requires selecting a single global feature subset that works universally across all known population groups (e.g., health systems often adopt screeners universally for all patients instead of asking an entirely custom set of questions for each patient or subgroup).

Both methods allow different features for different samples/clusters, which does not align with our constraint of selecting a fixed feature subset for universal collection across populations. These represent complementary approaches to feature selection rather than direct alternatives to our distributionally robust framework. The new baselines implemented by us adapt the approach of the first paper suggested by the reviewer. However, instead of a sample-specific mask, we learn a global mask over features for all samples to better suit our Group DRO setting.

Experiments with embedded baselines

To the results shared in our rebuttal we now add two new baselines (Embedded MLP 1, Embedded MLP 2). Both use an MLP with a learnable feature mask trained via DRO. MLP 1 has a single hidden layer of size 100, while MLP 2 has two hidden layers of sizes 64 and 32. We train both models with the joint objective of MSE minimization (for the regression task) and L1 regularization (weighted by hyperparameter $\lambda = 0.01$; we found this $\lambda$ value to work best out of [0.1, 0.01, 0.001]). Our method outperforms these baselines as well, with lower variance.
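For clarity, a minimal PyTorch sketch of how such a baseline can be set up (our reconstruction from the description above, not the exact implementation; `pops` is a list of per-population `(X, y)` tensor pairs, and the final subset is read off as the `budget` features with the largest `|mask|`):

```python
import torch
import torch.nn as nn

class MaskedMLP(nn.Module):
    """MLP with a single global learnable feature mask (embedded selection)."""
    def __init__(self, d, hidden=100):
        super().__init__()
        self.mask = nn.Parameter(torch.ones(d))  # one multiplicative gate per feature
        self.net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x * self.mask).squeeze(-1)

def dro_step(model, pops, opt, lam=0.01):
    """One group-DRO update: worst-group MSE plus an L1 penalty on the mask."""
    losses = torch.stack([nn.functional.mse_loss(model(X), y) for X, y in pops])
    loss = losses.max() + lam * model.mask.abs().sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```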

| Method | Population | MSE±Std (MLP) — Budget 5 | MSE±Std (RF) — Budget 5 | MSE±Std (MLP) — Budget 10 | MSE±Std (RF) — Budget 10 |
|---|---|---|---|---|---|
| Our Method | A | 0.2251±0.0029 | 0.2486±0.0018 | 0.0342±0.0004 | 0.0535±0.0010 |
| | B | 0.2888±0.0113 | 0.3194±0.0098 | 0.0207±0.0005 | 0.0804±0.0036 |
| | C | 0.2787±0.0150 | 0.3039±0.0130 | 0.0039±0.0004 | 0.0339±0.0079 |
| (BL) DRO Lasso | A | 0.5071±0.0129 | 0.5434±0.0191 | 0.0338±0.0009 | 0.0534±0.0009 |
| | B | 0.2886±0.0086 | 0.3194±0.0026 | 0.0211±0.0001 | 0.0807±0.0041 |
| | C | 0.2755±0.0150 | 0.3031±0.0113 | 0.0039±0.0004 | 0.0340±0.0078 |
| (BL) DRO XGB | A | 0.5107±0.0113 | 0.5441±0.0188 | 0.0738±0.0370 | 0.0887±0.0327 |
| | B | 0.2888±0.0081 | 0.3187±0.0030 | 0.0408±0.0198 | 0.0928±0.0145 |
| | C | 0.2762±0.0151 | 0.3032±0.0112 | 0.0036±0.0003 | 0.0337±0.0076 |
| (BL) Lasso | A | 1.0087±0.0153 | 1.0571±0.0107 | 0.2056±0.1714 | 0.2106±0.1625 |
| | B | 0.5744±0.0051 | 0.6150±0.0052 | 0.1160±0.0921 | 0.1460±0.0763 |
| | C | 0.1080±0.0107 | 0.1283±0.0102 | 0.0036±0.0003 | 0.0340±0.0076 |
| (BL) XGB | A | 1.0136±0.0076 | 1.0623±0.0039 | 0.4051±0.2802 | 0.3905±0.2579 |
| | B | 0.5702±0.0008 | 0.6170±0.0076 | 0.2279±0.1586 | 0.2373±0.1307 |
| | C | 0.1314±0.0291 | 0.1538±0.0352 | 0.0037±0.0002 | 0.0338±0.0077 |
| (BL) DRO Embedded MLP 1 | A | 0.7355±0.2702 | 0.7709±0.2636 | 0.5336±0.3503 | 0.5011±0.3107 |
| | B | 0.5599±0.1363 | 0.6043±0.1310 | 0.4170±0.2951 | 0.4009±0.2521 |
| | C | 0.2805±0.1630 | 0.3077±0.1728 | 0.2801±0.2117 | 0.2649±0.1923 |
| (BL) DRO Embedded MLP 2 | A | 0.7126±0.1538 | 0.7499±0.1467 | 0.0458±0.0735 | 0.0709±0.0640 |
| | B | 0.8569±0.0729 | 0.8877±0.0748 | 0.5404±0.0494 | 0.5259±0.0436 |
| | C | 0.4743±0.1496 | 0.5106±0.1601 | 0.5031±0.1682 | 0.4574±0.1339 |
Comment

I thank the authors for the thorough clarifications and the additional simulations. You have fully addressed all of my comments. Together with your responses to the other reviewers’ remarks, you resolved every concern raised during the review process. In light of these revisions, I am updating my score to 4. I wish you the best of luck with your submission!

Review
Rating: 5

This paper addresses the problem of selecting features such that models trained on these features are robust to distributional shifts. The authors formulate this problem as a continuous relaxation of traditional variable selection using a noising mechanism. By optimizing over the variance of a Bayes-optimal predictor, they develop a model-agnostic feature selection framework that achieves robust performance across populations. Experiments on synthetic and real-world datasets are conducted.

Strengths and Weaknesses

Strengths

1. This paper introduces an important problem of distributionally robust feature selection, which lies at the intersection of feature selection and domain generalization.

2. The continuous relaxation of feature selection is novel. It efficiently creates a differentiable measure of feature utility and permits tractable optimization.

3. The writing of this paper is very clear and easy to follow.

4. The authors conduct extensive experiments for validation.

Weaknesses

1. The proposed method lacks a theoretical optimality guarantee.

Questions

See the weakness.

Limitations

Yes

Final Justification

The authors' response has addressed my concern, so I keep my positive score.

Formatting Issues

No

Author Response

We thank reviewer ggn5 for their feedback.

We have conducted additional synthetic experiments and include three of them in this rebuttal. Our new experiments include:

  • Additional downstream model for evaluating performance: we now include an MLP (single hidden layer of size 100), along with the existing random forest (RF), as a downstream task model.
  • New baseline methods: we've added Forward Selection and Backward Elimination, and their DRO variants
  • Additional data generation settings to explore different covariate-outcome functional relationships.

Experiments

For all experiments, we set dimension $d=50$, and use a 40-60 train-(test+val) split for each population.

Expt 1: Linear generative process

We use a dataset of size 36000 with the following composition:

  • A (40%): $Y = 8X_0 + 6X_1 - 4X_2 + 3X_3 + 2X_4 + \epsilon$
  • B (35%): $Y = -8X_0 - 6X_1 + 4X_2 - 3X_3 - 2X_4 + 8X_5 + 6X_6 + \epsilon$
  • C (25%): $Y = 10X_7 + 8X_8 + 6X_9 - 5X_{10} + \epsilon$

Noise: $\epsilon \sim \mathcal{N}(0, 0.1^2)$
| Method | Population | MSE±Std (MLP) — Budget 5 | MSE±Std (RF) — Budget 5 | MSE±Std (MLP) — Budget 10 | MSE±Std (RF) — Budget 10 |
|---|---|---|---|---|---|
| Our Method | A | 0.2227±0.0094 | 0.2420±0.0161 | 0.0241±0.0178 | 0.0450±0.0129 |
| | B | 0.5805±0.0375 | 0.6158±0.0448 | 0.0150±0.0096 | 0.0809±0.0035 |
| | C | 0.1208±0.0034 | 0.1388±0.0011 | 0.0480±0.0761 | 0.0697±0.0605 |
| (BL) DRO Lasso | A | 0.4155±0.1769 | 0.4418±0.1866 | 0.0345±0.0004 | 0.0511±0.0029 |
| | B | 0.2377±0.0909 | 0.2677±0.0972 | 0.0206±0.0006 | 0.0835±0.0067 |
| | C | 0.3642±0.1350 | 0.3849±0.1300 | 0.0040±0.0007 | 0.0354±0.0012 |
| (BL) DRO XGB | A | 0.6827±0.3136 | 0.7120±0.3138 | 0.1041±0.0645 | 0.1139±0.0597 |
| | B | 0.3857±0.1461 | 0.4209±0.1495 | 0.0613±0.0416 | 0.1098±0.0326 |
| | C | 0.2259±0.0928 | 0.2479±0.0926 | 0.0040±0.0005 | 0.0350±0.0010 |
| (BL) Lasso | A | 1.0181±0.0526 | 1.0429±0.0387 | 0.4126±0.5303 | 0.3859±0.4760 |
| | B | 0.5880±0.0323 | 0.6264±0.0284 | 0.2327±0.2912 | 0.2504±0.2401 |
| | C | 0.1351±0.0252 | 0.1550±0.0267 | 0.0037±0.0001 | 0.0354±0.0009 |
| (BL) XGB | A | 1.0211±0.0474 | 1.0421±0.0400 | 1.1132±0.0439 | 1.0216±0.0507 |
| | B | 0.5867±0.0319 | 0.6258±0.0282 | 0.6607±0.0314 | 0.6135±0.0252 |
| | C | 0.1356±0.0261 | 0.1551±0.0267 | 0.0036±0.0004 | 0.0347±0.0012 |
| (BL) DRO Forward Selection | A | 0.3735±0.1506 | 0.3996±0.1636 | 0.0341±0.0006 | 0.0514±0.0029 |
| | B | 0.3729±0.0553 | 0.3995±0.0553 | 0.0209±0.0012 | 0.0833±0.0066 |
| | C | 0.4735±0.0805 | 0.5124±0.0883 | 0.0038±0.0001 | 0.0354±0.0012 |
| (BL) DRO Backward Elimination | A | 0.0320±0.0012 | 0.0474±0.0024 | 0.0346±0.0005 | 0.0512±0.0030 |
| | B | 0.1757±0.0093 | 0.2079±0.0149 | 0.0208±0.0008 | 0.0834±0.0068 |
| | C | 1.0008±0.0669 | 1.0355±0.0635 | 0.0038±0.0003 | 0.0354±0.0012 |
| (BL) Forward Selection | A | 1.0132±0.0372 | 1.0408±0.0435 | 1.1209±0.0483 | 1.0224±0.0403 |
| | B | 0.5879±0.0331 | 0.6276±0.0384 | 0.6509±0.0360 | 0.6109±0.0362 |
| | C | 0.9808±0.1161 | 1.0040±0.0989 | 1.1410±0.1413 | 0.9882±0.1095 |
| (BL) Backward Elimination | A | 1.0095±0.0393 | 1.0463±0.0447 | 1.1267±0.0415 | 1.0244±0.0405 |
| | B | 0.5897±0.0341 | 0.6267±0.0300 | 0.6640±0.0208 | 0.6111±0.0364 |
| | C | 0.9883±0.1042 | 1.0071±0.0912 | 1.1047±0.1156 | 0.9895±0.1071 |

Discussion

Here, although the generative process is linear and variables $X_0$ to $X_4$ have strong effects in both A and B, the signs of their coefficients are reversed between populations. This reduces the effectiveness of LASSO, which tends to select features based on average effects across all data. As a result, vanilla LASSO does not achieve the best performance even in this linear setting. For budget $=5$, our method outperforms most baselines, and has a balanced performance across populations. For budget $=10$, our method is comparable with the best performing baselines.

When we set dimension $=15$, our method is on par with the best performing baselines for all budgets; please see the experiment section of our response to reviewer Guhq for the results.

Expt 2: Mixed Linear-Nonlinear with Heterogeneous Noise

We use a dataset of size 44000 with the following population structure:

  • A (30%): $Y = 4X_0 + 3X_1 + X_2^2 + \epsilon_A$
  • B (30%): $Y = 4X_0 + 3X_1 + X_2^2 + \epsilon_B$
  • C (25%): $Y = 2X_0 + 3X_5 X_6 + 4\sin(2X_7) + \epsilon_C$
  • D (15%): $Y = 3X_0 + 2X_1 + \epsilon_D$

Heterogeneous Noise:

  • $\epsilon_A \sim \mathcal{N}(0, 0.05^2)$ (reduced noise)
  • $\epsilon_B = \exp(0.5X_3 + 0.3X_4) \cdot \eta \cdot 0.1$, where $\eta \sim \mathcal{N}(0, 1)$ (heteroscedastic)
  • $\epsilon_C \sim \mathcal{N}(0, 0.1^2)$ (standard)
  • $\epsilon_D \sim t_3 \cdot 0.2$ (heavy-tailed, t-distribution with df=3)

Results

| Method | Population | MSE±Std (MLP) — Budget 5 | MSE±Std (RF) — Budget 5 | MSE±Std (MLP) — Budget 8 | MSE±Std (RF) — Budget 8 |
|---|---|---|---|---|---|
| Our Method | A | 0.0024±0.0005 | 0.0132±0.0032 | 0.0054±0.0007 | 0.0150±0.0033 |
| | B | 0.0038±0.0010 | 0.0190±0.0016 | 0.0062±0.0010 | 0.0205±0.0016 |
| | C | 0.8184±0.0750 | 0.8548±0.0918 | 0.0123±0.0080 | 0.2460±0.0483 |
| | D | 0.0101±0.0003 | 0.0129±0.0007 | 0.0114±0.0015 | 0.0130±0.0008 |
| (BL) DRO Lasso | A | 0.4107±0.0218 | 0.4333±0.0208 | 0.4115±0.0554 | 0.4020±0.0560 |
| | B | 0.4199±0.0138 | 0.4499±0.0190 | 0.4172±0.0533 | 0.4230±0.0511 |
| | C | 0.2778±0.2397 | 0.3782±0.1259 | 0.3114±0.2644 | 0.3816±0.1018 |
| | D | 0.3225±0.0252 | 0.3408±0.0145 | 0.3569±0.0259 | 0.3375±0.0089 |
| (BL) DRO XGB | A | 0.0026±0.0002 | 0.0133±0.0030 | 0.0054±0.0014 | 0.0150±0.0034 |
| | B | 0.0037±0.0009 | 0.0189±0.0015 | 0.0064±0.0009 | 0.0206±0.0017 |
| | C | 0.4312±0.0635 | 0.4690±0.0652 | 0.0226±0.0005 | 0.2460±0.0479 |
| | D | 0.0102±0.0004 | 0.0127±0.0010 | 0.0114±0.0012 | 0.0132±0.0010 |
| (BL) Lasso | A | 0.0510±0.0417 | 0.0617±0.0447 | 0.0517±0.0417 | 0.0616±0.0433 |
| | B | 0.0547±0.0434 | 0.0653±0.0402 | 0.0562±0.0443 | 0.0657±0.0390 |
| | C | 0.4310±0.0636 | 0.4611±0.0745 | 0.4654±0.0573 | 0.4581±0.0676 |
| | D | 0.0100±0.0003 | 0.0126±0.0011 | 0.0114±0.0014 | 0.0130±0.0012 |
| (BL) XGB | A | 0.0027±0.0007 | 0.0132±0.0031 | 0.0051±0.0013 | 0.0151±0.0033 |
| | B | 0.0041±0.0012 | 0.0188±0.0015 | 0.0067±0.0012 | 0.0206±0.0015 |
| | C | 0.4285±0.0622 | 0.4637±0.0613 | 0.0200±0.0039 | 0.2471±0.0475 |
| | D | 0.0100±0.0002 | 0.0129±0.0010 | 0.0117±0.0017 | 0.0131±0.0007 |
| (BL) DRO Forward Selection | A | 0.3881±0.0598 | 0.4151±0.0607 | 0.4088±0.0707 | 0.4066±0.0570 |
| | B | 0.3989±0.0510 | 0.4239±0.0535 | 0.4208±0.0555 | 0.4186±0.0489 |
| | C | 0.2765±0.2391 | 0.3717±0.1236 | 0.2980±0.2530 | 0.3746±0.1020 |
| | D | 0.3253±0.0192 | 0.3467±0.0118 | 0.3522±0.0265 | 0.3391±0.0083 |
| (BL) DRO Backward Elimination | A | 0.3889±0.0516 | 0.4152±0.0548 | 0.4076±0.0713 | 0.4060±0.0579 |
| | B | 0.3996±0.0496 | 0.4309±0.0482 | 0.4188±0.0563 | 0.4191±0.0494 |
| | C | 0.2771±0.2395 | 0.3685±0.1210 | 0.2930±0.2507 | 0.3756±0.1046 |
| | D | 0.3242±0.0151 | 0.3460±0.0113 | 0.3519±0.0309 | 0.3392±0.0062 |
| (BL) Forward Selection | A | 0.0506±0.0421 | 0.0607±0.0441 | 0.0531±0.0422 | 0.0613±0.0431 |
| | B | 0.0545±0.0440 | 0.0654±0.0403 | 0.0569±0.0436 | 0.0654±0.0389 |
| | C | 0.8114±0.0705 | 0.8523±0.0810 | 0.8811±0.0775 | 0.8445±0.0760 |
| | D | 0.0099±0.0007 | 0.0124±0.0010 | 0.0115±0.0013 | 0.0130±0.0012 |
| (BL) Backward Elimination | A | 0.0504±0.0417 | 0.0609±0.0442 | 0.0537±0.0434 | 0.0614±0.0431 |
| | B | 0.0551±0.0438 | 0.0654±0.0408 | 0.0577±0.0449 | 0.0655±0.0388 |
| | C | 0.8173±0.0702 | 0.8558±0.0753 | 0.8585±0.0741 | 0.8437±0.0742 |
| | D | 0.0100±0.0003 | 0.0127±0.0013 | 0.0113±0.0020 | 0.0129±0.0012 |

Discussion

For budget $=8$, our method is comparable with the best performing baseline (XGB). For the smaller budget, Lasso and the DRO variants have comparable performance, followed by our method. Forward Selection and Backward Elimination consistently underperform the other methods.

Expt 3: Sparse Linear

We use a dataset of size 36000 with the following composition:

  • A (35%): $Y = 5X_0 + 4X_{15} + 3X_{30} + \epsilon$
  • B (35%): $Y = 6X_5 + 5X_{20} + 4X_{35} + \epsilon$
  • C (30%): $Y = 7X_{10} + 6X_{25} + 5X_{40} + 4X_{45} + \epsilon$

Noise:

  • $\epsilon \sim \mathcal{N}(0, 0.1^2)$ (base noise)
  • Correlated noise features: $X_{i+1} = 0.3X_i + 0.7\eta$, where $\eta \sim \mathcal{N}(0, 1)$

Results

| Method | Population | MSE±Std (MLP) — Budget 5 | MSE±Std (RF) — Budget 5 | MSE±Std (MLP) — Budget 10 | MSE±Std (RF) — Budget 10 |
|---|---|---|---|---|---|
| Our Method | A | 0.1186±0.0055 | 0.1299±0.0064 | 0.1274±0.0054 | 0.1277±0.0047 |
| | B | 0.2099±0.0061 | 0.2290±0.0086 | 0.0034±0.0001 | 0.0093±0.0014 |
| | C | 0.6009±0.0356 | 0.6453±0.0318 | 0.0036±0.0002 | 0.0282±0.0021 |
| (BL) DRO Lasso | A | 0.1179±0.0053 | 0.1296±0.0066 | 0.0038±0.0003 | 0.0095±0.0003 |
| | B | 0.2115±0.0070 | 0.2295±0.0088 | 0.0033±0.0004 | 0.0092±0.0013 |
| | C | 0.6054±0.0438 | 0.6456±0.0310 | 0.0034±0.0001 | 0.0281±0.0021 |
| (BL) DRO XGB | A | 0.1189±0.0057 | 0.1298±0.0066 | 0.0038±0.0000 | 0.0095±0.0004 |
| | B | 0.2118±0.0071 | 0.2286±0.0092 | 0.0035±0.0003 | 0.0092±0.0014 |
| | C | 0.6073±0.0432 | 0.6447±0.0311 | 0.0037±0.0004 | 0.0280±0.0018 |
| (BL) Lasso | A | 0.4729±0.0153 | 0.5066±0.0185 | 0.0039±0.0004 | 0.0095±0.0004 |
| | B | 0.2115±0.0057 | 0.2309±0.0090 | 0.0036±0.0004 | 0.0092±0.0013 |
| | C | 0.3202±0.0029 | 0.3485±0.0106 | 0.0037±0.0001 | 0.0281±0.0018 |
| (BL) XGB | A | 1.0188±0.0235 | 1.0651±0.0303 | 0.0036±0.0003 | 0.0095±0.0004 |
| | B | 0.1414±0.1214 | 0.1564±0.1285 | 0.0034±0.0002 | 0.0091±0.0013 |
| | C | 0.1915±0.1125 | 0.2121±0.1196 | 0.0037±0.0004 | 0.0280±0.0019 |
| (BL) DRO Forward Selection | A | 0.5335±0.1147 | 0.5619±0.1252 | 0.0470±0.0746 | 0.0488±0.0687 |
| | B | 0.5032±0.0173 | 0.5417±0.0190 | 0.0826±0.1371 | 0.0830±0.1267 |
| | C | 0.6510±0.1143 | 0.6876±0.1048 | 0.2283±0.0848 | 0.2277±0.0756 |
| (BL) DRO Backward Elimination | A | 0.1182±0.0049 | 0.1297±0.0065 | 0.0039±0.0005 | 0.0095±0.0003 |
| | B | 0.2118±0.0074 | 0.2293±0.0087 | 0.0035±0.0004 | 0.0092±0.0014 |
| | C | 0.6059±0.0430 | 0.6457±0.0309 | 0.0034±0.0003 | 0.0279±0.0019 |
| (BL) Forward Selection | A | 0.4928±0.1395 | 0.5228±0.1508 | 0.3298±0.3016 | 0.3049±0.2713 |
| | B | 0.0013±0.0001 | 0.0075±0.0012 | 0.0036±0.0004 | 0.0091±0.0013 |
| | C | 0.7542±0.2417 | 0.7885±0.2349 | 0.5878±0.1616 | 0.5532±0.1395 |
| (BL) Backward Elimination | A | 0.4056±0.2588 | 0.4297±0.2712 | 0.3312±0.3015 | 0.3055±0.2717 |
| | B | 0.3015±0.1603 | 0.3233±0.1707 | 0.0035±0.0003 | 0.0091±0.0012 |
| | C | 0.7559±0.2519 | 0.7844±0.2232 | 0.5951±0.1720 | 0.5529±0.1390 |

Discussion

For budget $=5$, our method is comparable with the best performing baselines (XGB and DRO XGB). Despite the linear setting, Lasso is not the best performing method. For budget $=10$, the Lasso and XGB variants, along with DRO Backward Elimination, perform best. Our method succeeds in selecting all relevant variables except $X_{30}$, resulting in relatively worse performance on A with budget $=10$.

Comment

I thank the authors for the detailed response. I will keep my positive score.

Comment

Thank you for your rebuttal and participation in the discussion.

Upon reviewing the submission, I noted the following claim (lines 131 ff):

“[…] we introduce an alternative continuous relaxation which incorporates a random component controlled by α; effectively, αᵢ will control the amount of noise added to the observation of Xᵢ. To our knowledge, this relaxation is new to the literature.”

However, the idea of using noise as a regularization mechanism has a long history, dating back at least to 1995 [1]. It has also been formalized in the Bayesian framework [2,3], and optimization over α (in the notation of your paper) has been explored for variable selection in the standard setting (by this, I mean without the distributionally robust treatment) [4], including experiments in discrimination with non-linear models such as MLPs.

Can you provide some feedback regarding this point?

[1] Bishop CM. Training with noise is equivalent to Tikhonov regularization. Neural computation. 1995 Jan;7(1):108-16.

[2] R.M. Neal. Bayesian Learning for Neural Networks, Springer-Verlag, New York, 1996.

[3] M.E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, vol. 1, pp. 211–244, 2001.

[4] Grandvalet Y. Anisotropic noise injection for input variables relevance determination. IEEE Transactions on Neural Networks. 2000 Nov 30;11(6):1201-12.

Comment

We thank the Area Chair for bringing to our attention the reference [4] (Grandvalet, 2000), and for the broader comment on literature on noise injection for regularization. We originally meant to claim that the use of noise was novel as a relaxation of 0-1 variable selection (as opposed to the broader use of noise as a form of regularization to improve predictive performance). However, we were not aware of [4] (Grandvalet, 2000) and acknowledge this oversight in our literature review. We do wish to emphasize that all of the main technical developments (Sections 3.2-3.4) are still novel.

We will revise the related work section accordingly. Specifically, we will replace the sentence on lines 131ff with the following, more detailed explanation of how our contributions relate to prior work and where they differ:

Previous work has proposed to use the injection of random noise to identify the most important variables for a predictive task in a single population [Grandvalet, 2000], as well as through Bayesian relevance estimation methods [Neal, 1996; Tipping, 2001]. However, tunable-noise-based variable selection has seen limited adoption since it was originally proposed, in contrast to the large body of work on noise as a form of regularization [Bishop, 1995]. In this work, we revisit noise-based relaxations in the distributionally robust setting and show how the optimization problem can be reformulated in ways that create significant, previously-unrecognized advantages. In particular, our approach separates variable selection from fitting the predictive model, so that variable selection is agnostic with respect to the downstream model used for each different subpopulation. This also avoids the need to differentiate through the predictive models, as required in previous work (which may be impossible in the case of frequently-employed models like decision trees or random forests).

Additionally, we wish to highlight that our contributions remain distinct from prior works in the following ways:

  1. Distributionally robust setting: previous work does not discuss the DRO setting, the main motivation for our paper
  2. Model agnostic framework: Our approach works with arbitrary downstream models (including non-differentiable ones like random forests) by targeting the performance of the Bayes-optimal predictor rather than specific trained models.
  3. Training-decoupled optimization: We provide a tractable, gradient-based optimization framework directly over the task of feature selection, which can make use of a single set of pretrained models as input.
Final Decision

This paper proposes a novel approach for feature selection in a distributionally robust context. The problem is framed as a continuous relaxation of hard variable selection, and relies on noise injection. The method decouples feature selection from downstream model training, and is thus applicable to non-differentiable models such as decision trees and random forests.

The reviewers concur that this is a relevant problem, at the intersection of feature selection and out-of-distribution generalization. The proposed method is technically sound. Empirical results on synthetic and real-world datasets demonstrate improvements over several baselines, in some cases by large margins. The availability of anonymized code and clear reporting of training details further strengthen the contribution. The experimental evaluation could consider more distribution shifts and more varied downstream models. Additional empirical evidence such as that provided in the rebuttal will help support the claims. Also, an assessment of the robustness of the approach with respect to the Bayes-optimal predictor surrogate would be welcome.

The consensus is that the paper offers novel and impactful contributions. Its formulation of distributionally robust feature selection is original, the proposed formalization is simple and the approach is generic and efficient. While further empirical validation would further strengthen the work, the novelty and potential impact of the paper justify acceptance at NeurIPS.