PaperHub
6.6 / 10
ICML 2025 · Poster · 4 reviewers
Ratings: 3, 4, 3, 4 (min 3, max 4, std 0.5)

Active feature acquisition via explainability-driven ranking

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-08-16
TL;DR

An explainability-driven AFA method uses local explanations and a decision-transformer policy network to dynamically select the most relevant features, achieving better accuracy and efficiency than existing methods.

Abstract

Keywords
Active feature acquisition · Policy network · Feature importance

Reviews and Discussion

Review (Rating: 3)

Edit: Thank you to the authors for their response. Since repeated experiments have been run, and an RL baseline and reproducibility details have been provided, I have raised my score from 2 (weak reject) to 3 (weak accept). There are remaining points, detailed in the Rebuttal Comment; if they are addressed adequately I will raise the score to 4 (accept).

Original Review:

This paper is about Active Feature Acquisition (AFA) - the test time task of sequentially measuring features on an instance-wise basis to improve predictive performance. This is relevant to all applications where at test-time features are not all jointly available, for example, medicine where for each individual patient a doctor will make different measurements to make a diagnosis. Due to cost or time constraints the aim is to measure as few features as possible.

The paper proposes a novel method where a policy is trained to predict, for each instance, which feature will be best to acquire next. The policy is trained as a classification task: an oracle provides the best feature, which is used as the label, and the cross-entropy of the policy's prediction is used as the loss. The oracle is defined using feature importance ranking methods such as FastSHAP.

The training is split into two main phases: in the first, the policy uses the oracle's acquisitions as inputs; in the second, it uses its own acquisitions as inputs (still using the oracle features as labels), so that it trains under the conditions it will experience during inference.
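A minimal PyTorch-style sketch of this two-phase supervision as I understand it (the policy interface, masking convention, and phase-2 label choice are my own assumptions, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def masked_input(x, mask):
    # Unobserved features are zeroed; the mask is concatenated so the policy
    # can distinguish "observed zero" from "unobserved".
    return torch.cat([x * mask, mask], dim=-1)

def train_step(policy, optimizer, x, oracle_ranking, budget, use_own_acquisitions):
    """One instance-level step. oracle_ranking: LongTensor of feature indices
    sorted by importance (the label-generating oracle)."""
    n_features = x.shape[-1]
    mask = torch.zeros(n_features)
    loss = 0.0
    for t in range(budget):
        logits = policy(masked_input(x, mask))      # scores over features
        # NOTE: using the oracle's t-th feature as the label in both phases is a
        # simplification; the exact phase-2 label convention is an assumption.
        target = oracle_ranking[t]
        loss = loss + F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
        if use_own_acquisitions:                    # phase 2: follow the policy
            next_feat = logits.masked_fill(mask.bool(), float("-inf")).argmax()
        else:                                       # phase 1: follow the oracle
            next_feat = target
        mask[next_feat] = 1.0
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```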

The experiments are carried out on multiple real world medical datasets and image datasets. Two SOTA AFA methods are used as baselines as well as additional baselines.

Ablations are carried out investigating the importance of the two stage training, as well as sensitivity to the feature importance ranking method used to generate labels.

Questions for the Authors

I like the idea this paper proposes and it's interesting to see it working. I would like to recommend accept, however, currently I do not find the empirical evaluation strong enough for acceptance. I understand I may be wrong, so following the responses and reading other reviewers' thoughts I will re-evaluate. In particular these are the following concerns:

  • No repeats of experiments
  • Limited information on reproducing the experiments (no learning rates for example)
  • Lack of RL and generative model baseline
  • Limited hyperparameter tuning of baselines
  • Understanding of the oracle, and how it would work with jointly predictive features. How this might affect feature importances and therefore the training of the proposed model. This can either be with another small synthetic experiment, or with theoretical justification. How does the proposed oracle from this work compare to the oracle described in Acquisition Conditioned Oracle for Nongreedy Active Feature Acquisition, Valancius et al. 2024 (https://arxiv.org/abs/2302.13960), which does consider jointly informative features?
  • Addressing the lack of impact statement, it is enough to use the proposed statement from the call for papers: "This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here."

Claims and Evidence

  • The variety of datasets is very good to see.
  • There are no uncertainties provided in the results. Have there not been repeats of the experiments over different seeds? If not these are required to improve the reliability of the results provided.
  • DIME and GDFS are good baselines to include. However, these are greedy methods, showing this method also beats an RL baseline will make the claims stronger. Similarly, beating a generative model for CMI maximization will also make the claim stronger.
  • At the start of Section 5 it is stated that the first three features are fixed to obtain the results in Figure 2. So why do the baselines perform poorly sometimes with 3 or fewer features and not the same as the oracle or the proposed model?

Methods and Evaluation Criteria

The method makes sense for the problem at hand. It's very interesting to see a model trained to predict the feature to select as a classification problem. Using the feature importance rankings is also an innovative way to generate the labels.

Theoretical Claims

There is no theory provided in this paper - which is not a strength or a weakness, just a statement that there are no theoretical claims therefore no errors.

However, there is an issue with how the oracle is defined on page 3. Lines 142 - 164 Right. The claim is that at each step the oracle policy selects the most important feature that maximally improves the predictive performance. This is a greedy oracle, and therefore is not guaranteed to be optimal. This will not select features that are jointly informative but individually non-informative (see https://arxiv.org/abs/2302.13960 for example).

For example we can have features that are only informative if jointly known, and many other very noisy features. The proposed oracle will never select the jointly informative features before the others (unless it already has measured one of them), because a noisy feature will always provide some more immediate information, but less information in the long-term. Therefore this greedy oracle is not guaranteed to be optimal, especially with jointly predictive features.

How does this affect the definition of the oracle? How does it affect how the feature importances are ranked? If a dataset with this property is used, how would the model perform? How does it affect the conclusions drawn from the result in Table 4 if we can't guarantee the oracle is optimal? All greedy methods should "fail" on a dataset like this. The empirical results presented are positive, which would imply this property is not present in those datasets. Another experiment or at least a discussion would greatly improve the paper.

As a starting point: Strong performance is also seen in greedy methods like DIME and GDFS in the literature, additionally theory does exist that supports greedy methods being near-optimal (https://proceedings.mlr.press/v40/Chen15b.html).
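For concreteness, a dataset with this property could be generated as in the following sketch (my own illustrative assumption, not an experiment from the paper): two features are informative only jointly through their product, while the remaining features are noisy but individually correlated with the label, so a greedy criterion will prefer them first.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x12 = rng.choice([-1.0, 1.0], size=(n, 2))          # jointly informative pair
y = (x12[:, 0] * x12[:, 1] > 0).astype(int)          # label = sign of x1 * x2
noisy = y[:, None] * 0.2 + rng.normal(size=(n, 8))   # individually (weakly) informative
X = np.hstack([x12, noisy])

# x1 and x2 individually carry essentially no signal; the noisy features do.
for j in range(X.shape[1]):
    corr = np.corrcoef(X[:, j], y)[0, 1]
    print(f"feature {j}: corr with y = {corr:+.3f}")
```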

Experimental Design and Analysis

The main issue with this paper is that without theory, the empirical results must be very convincing, and currently they can be improved (the model motivation is clear so this is not a problem).

  • Without repeats we can't really know how sensitive this model is to initialization
  • There is no code provided which is not a problem in itself, but the reproducibility details are limited. There are no optimizer parameters for example.
  • There has been limited hyperparameter tuning. This is framed as an advantage - the proposed model performs well with minimal tuning. However, there has also been minimal tuning on the baselines, so we don't know if they have just got poor architecture hyperparameters or optimizer hyperparameters.
  • The result in Table 2 is only meaningful if the first stage of training was continued to acquire its result. As it is currently, there are results from the first stage (200 epochs) then for the second stage (200 epochs of first stage + 16 epochs of second stage). So it is not clear if the result from the second stage is due to training for longer or because it is a useful training change.
  • The dataset selection is good, however the baseline selection is limited. Two AFA methods are used as baselines - GDFS and DIME, and these work in very similar ways, in terms of being greedy methods that are trained by simulating the acquisition process. The results would be improved by including an RL baseline and a Generative Model baseline.

Supplementary Material

I read all of the supplementary material. The algorithms provided are useful for understanding the method.

Relation to Broader Scientific Literature

The paper is appropriately framed within the AFA literature. Missing references are given next.

Missing Important References

Other Strengths and Weaknesses

Strengths:

  • The idea the paper introduces is a very good one.
  • The paper generally is written nicely.
  • The method is clear, with a good use of diagrams and pseudo-code

Weaknesses:

  • There is no impact statement. The paper does not have any ethical issues, so it is not majorly needed. However, it was written as a requirement in the Call for Papers (https://icml.cc/Conferences/2025/CallForPapers): "Authors are required to include a statement of the potential broader impact of their work, including its ethical aspects and future societal consequences."

Other Comments or Suggestions

  • Line 176 Left: "these methods aim identifying" should be "these methods aim to identify"
  • How does center cropping keep working after selecting the center? Does it spiral out? It is not clear how it continues to select image patches. Is it a global fixed ordering as well?
Author Response

Theoretical concern and Q5: We acknowledge that our original description of the oracle definition may have been misleading and we clarify the distinction here. Our oracle is not a purely greedy policy. Rather, it first identifies the optimal subset of features that minimizes prediction loss under a given acquisition budget. This is done by exhaustively evaluating all feasible subsets within the budget. Once the optimal subset is identified, features are ranked in a greedy order to define an acquisition trajectory. Importantly, the acquisition proceeds until all features in the subset are obtained. Thus, the internal ordering does not affect the final outcome. This approach is distinct from a sequential greedy policy that chooses features one step at a time based on marginal gains. Our formulation closely aligns with the Acquisition Conditioned Oracle proposed by Valancius et al. (2024), which also aims to account for joint informativeness. The main distinction is that our formulation imposes a hard budget constraint, whereas Valancius et al. incorporate feature costs into a weighted objective. We will reference this work in the paper and discuss the similarities and differences.

Response to experimental design concerns (Q1–Q4): To improve the robustness of our claims, we conducted additional experiments and updated our results accordingly:

-- (Repeatability and variability): We now report results over three runs (over nine runs on tabular datasets, please see our response to W1 and Q1 of the first reviewer) with different random seeds. Table 2 has been updated to include mean and standard deviation across runs.

-- (First versus second stage comparison): To clarify that the performance gains in Table 2 are not due to additional training epochs alone, we extended the first-stage training from 200 to 250 epochs. Performance did not significantly improve with longer training, while our second-stage approach yielded clear gains, confirming that the second stage provides meaningful training benefits beyond simply more epochs. The results of first-stage training with 250 epochs are nearly identical to those with 200 epochs. However, on BloodMNIST*, the additional epochs do provide a meaningful performance increase. To further assess this, we conducted first-stage training with 300 epochs, yielding 79.73 ± 0.19, which still falls short of the second-stage result and further reinforces the effectiveness of the second-stage training. This also highlights the potential for performance improvements through dataset-specific hyperparameter tuning.

-- (Reproducibility details): We used Adam optimizer with a learning rate of 1e-3 for all datasets, except for CIFAR100 and Imagenette, where we used 5e-4 due to larger model sizes. If our paper gets accepted, then we will release the complete code, including all training scripts and configuration details, to ensure reproducibility.

-- (Baseline diversity – Q3): We agree that incorporating more diverse baselines would strengthen our comparisons. Generative AFA methods like ACF (Li and Oliva, 2020) were not included due to their high computational demands and limited scalability to tabular and high-dimensional image datasets. Moreover, several recent works (e.g., Gadgil et al., 2024) show that discriminative methods often outperform generative counterparts in practice. We appreciate the suggestion and plan to include an additional baseline in future work. Meanwhile, we have already evaluated an RL-based method. Please refer to the table in our response to Reviewer 1.

-- (Hyperparameter tuning of baselines): For baselines with published results on datasets like CIFAR10 and Imagenette, we confirmed alignment with reported performances. For datasets without public results, we made reasonable efforts to tune hyperparameters. We will include detailed tuning procedures in the final version.

| Method | Spam | CIFAR10 | CIFAR100 | BloodMNIST | ImageNette | Metabric | CPS | CTG | SCKD |
|---|---|---|---|---|---|---|---|---|---|
| First-stage (250 epochs) | 0.952 ± 0.001 | 75.96 ± 0.16% | 45.91 ± 0.36% | 79.83 ± 0.19%* | 73.95 ± 0.25% | 62.52 ± 1.27% | 67.23 ± 0.48% | 0.9157 ± 0.0002 | 0.822 ± 0.01 |
| First-stage | 0.951 ± 0.0002 | 75.76 ± 0.19% | 46.05 ± 0.25% | 79.25 ± 0.15% | 73.76 ± 0.42% | 62.48 ± 1.39% | 67.21 ± 0.15% | 0.9155 ± 0.0004 | 0.824 ± 0.008 |
| Second-stage | 0.955 ± 0.0001 | 78.44 ± 0.15% | 46.99 ± 0.15% | 83.87 ± 1.05% | 78.96 ± 0.12% | 69.83 ± 0.41% | 67.45 ± 0.13% | 0.9164 ± 0.0001 | 0.836 ± 0.07 |

Response to missing impact statement - Q6: We have added the suggested statement.

Response to clarification on center cropping (Comment 2): For the details of center cropping, we kindly refer the reviewer to the papers of DIME and greedy methods.

Response to comment on Figure 2 (Q7): Please see our response to Q1 of Reviewer 3.

We also included new references suggested by the reviewer.

Reviewer Comment

Thank you to the authors for taking the time to respond, I have read the rebuttal and other reviews. I will raise my score from 2 to 3. I'm still unsure regarding some issues, if those are convincingly addressed I will raise the score to 4. My detailed feedback is below:

Positives

Repeats: Thank you for running multiple repeats. Please make sure these are also added to Figure 2, the metric during acquisition, by using shaded areas to represent the uncertainty.

Reproducibility Details: Thank you for including these. Please add all details to the appendix, including more than those given here (batch size, dataset details, model sizes, etc.); this should be done for the proposed model and all baselines. In particular, I notice you've promised to publish code after the review period, which is definitely very helpful for reproducibility.

First vs Second Stage Training: Thank you for extending the training. These results are promising.

New Baselines: Thank you for including a new RL baseline, please can this be added to Figure 2. Is it also possible to add AACO (https://arxiv.org/abs/2302.13960, https://github.com/lupalab/aaco), since this does not need any new models to be trained. It would be insightful, since it claims to closely follow an oracle, but this is not essential, I just think it would be a good baseline to add if time permits.

Points for Improvement

Small points

Center Cropping: I looked at DIME and GDFS, searching for "center" and "crop" but neither describes how the acquisition is decided. Please can this be explained.

Baseline Hyperparameters: Thank you for the explanation. Please can you clarify what datasets required tuning and what the tuning process was. E.g. what hyperparameters and values were tested, how the validation set was created, what validation metric was used etc.?

Main Points

First 3 features: Unfortunately the answer has made this more confusing to me. Was fixing the first three features only done in training? And then the model can select freely during testing? In this case it is very unlikely the model would ever select those first three, since it never trained to. If it has been done for both training and testing, then the baseline methods should also be given those three features at testing time, to ensure fair comparison. Or to be even more fair they should receive them at training as well. This flexibility is not unique to the proposed method; DIME, GDFS and OPL can be told three features to always start with during training and testing. Please can this be explained, and if this is what has happened, why would the models be different for the first three acquisitions in Figure 2?

Oracle Definition: Thank you for taking the time to adjust the oracle definition. This is quite a different definition from the original, which said "The oracle policy sequentially constructs $M^*$", and is now "distinct from a sequential greedy policy". That being said, this is still not guaranteed to be optimal. The internal ordering of features does matter for AFA. Even though it is possible to select an optimal subset (is this done by knowing the feature values?), a greedy ordering within that subset can be suboptimal. Consider a case of three features that are independent and can be $-1$ or $1$ with equal probability. Let $p(y=1 \mid \mathbf{x}) = \mathrm{sigmoid}(2x_1x_2 + 0.5x_3)$. If all other features are irrelevant and the aim is to find the smallest subset (the size of the subset has not been mentioned), then only these three features are selected as the subset, which is good. However, because the sign of $x_1x_2$ requires knowing both, they individually give no information, only jointly. The greedy ordering in this optimal subset is $x_3$, then $x_1$ or $x_2$ with equal probability. However, the optimal ordering is to select either $x_1$ then $x_2$ or $x_2$ then $x_1$, because after the second acquisition the prediction is far more accurate, despite the first acquisition giving no information. If $x_3$ is selected first, there is only a small amount of information in the first acquisition, and the second acquisition does not improve this.

I'd recommend looking at these two points, especially the oracle definition. Unfortunately we can't discuss it in detail, but I'm interested in what you think. I would argue that even if a subset is optimal, a greedy ordering within that subset can be suboptimal (see the example). If you agree with this, maybe the oracle can be renamed to "Approximate Oracle"/"Pseudo Oracle" or something like this, with a clarification about the greedy inner ordering. Note that the AACO oracle will also suffer from the same greedy internal ordering; I initially cited it when I thought this paper's oracle was sequentially greedy. Also, your model should actually perform better on the above example than the oracle, since feature importances won't suffer from the greedy behavior, so it's not a criticism of the method, more of the proposed theoretical oracle.

Author Comment

We sincerely appreciate the reviewer’s follow-up and the decision to raise the evaluation score. Below, we provide our responses to the remaining concerns:

Main Points

'First 3 Features': During testing, the first three features were also fixed. These features were selected based on their average importance rankings, derived from the instance-wise feature rankings $\varphi^i$. This ranking information is specific to our method and enables the selection of highly informative features early in the acquisition process. In contrast, other AFA methods do not have access to such rankings. Therefore, to fix the first three features for other AFA methods, one could either randomly select three features (which is unlikely to yield strong performance) or precompute global feature importance rankings using a global feature selection algorithm and fix the top three accordingly.

'Oracle Definition': In our oracle construction, we assume that it has perfect knowledge, including access to the feature values and the label. Under this assumption, the oracle selects the optimal subset of features, minimizing the prediction loss within a given acquisition budget among all feasible subsets. This selection process is guided by the true label, meaning the internal greedy ordering of features is also informed by this label. Therefore, in the provided example, the third feature is not necessarily always selected first. In the table below, we present the internal greedy ordering for all possible input scenarios using the subset {$x_1$, $x_2$, $x_3$}. For ordering, we assume missing features are filled with a value of 0, and when $x_1$ and $x_2$ yield equal loss reductions, we break the tie by selecting $x_1$ before $x_2$.

| y | x₁ | x₂ | x₃ | Ranking Order | Value of 2x₁x₂ + 0.5x₃ at each step |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | {x₃, x₁, x₂} | {0.5, 0.5, 2.5} |
| 1 | 1 | 1 | -1 | {x₁, x₂, x₃} | {0, 2, 1.5}* |
| -1 | 1 | -1 | 1 | {x₁, x₂, x₃} | {0, -2, -1.5}* |
| -1 | 1 | -1 | -1 | {x₃, x₁, x₂} | {-0.5, -0.5, -2.5} |
| -1 | -1 | 1 | 1 | {x₁, x₂, x₃} | {0, -2, -1.5}* |
| -1 | -1 | 1 | -1 | {x₃, x₁, x₂} | {-0.5, -0.5, -2.5} |
| 1 | -1 | -1 | 1 | {x₃, x₁, x₂} | {0.5, 0.5, 2.5} |
| 1 | -1 | -1 | -1 | {x₁, x₂, x₃} | {0, 2, 1.5}* |

Note: According to our oracle definition, in the scenarios marked with *, even the third feature should be excluded from the optimal subset, as its inclusion leads to an increase in the loss value.
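For transparency, the internal ordering in the table above can be reproduced with the short sketch below (illustrative only; it assumes zero-imputation of missing features, cross-entropy loss against the true label, and the stated tie-breaking rule, and it reproduces only the ordering, not the subset selection):

```python
import itertools, math

def loss(y, x1, x2, x3):
    z = 2 * x1 * x2 + 0.5 * x3
    p1 = 1.0 / (1.0 + math.exp(-z))
    p = p1 if y == 1 else 1.0 - p1
    return -math.log(p)                       # cross-entropy against the true label

def greedy_order(y, x):
    observed = [0.0, 0.0, 0.0]                # missing features imputed with 0
    remaining, order = [0, 1, 2], []
    while remaining:
        # acquire the remaining feature whose true value most reduces the loss;
        # min() breaks ties in favour of the lower index (x1 before x2)
        best = min(remaining,
                   key=lambda j: loss(y, *[x[k] if k == j else observed[k]
                                           for k in range(3)]))
        observed[best] = x[best]
        order.append(best + 1)                # report 1-indexed feature names
        remaining.remove(best)
    return order

for x in itertools.product((1, -1), repeat=3):
    y = 1 if x[0] * x[1] > 0 else -1          # rows above use y = sign(x1 * x2)
    print(f"y={y:+d}, x={x} -> order {greedy_order(y, list(x))}")
```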

New Baseline

We thank the reviewer for suggesting the AACO method. While we did not implement it directly, we incorporated its core idea to design a new baseline inspired by the approach. Specifically, for a given masked test instance, we first identified its nearest neighbor from the training set. Subsequently, the next feature to be acquired was determined based on the feature importance ranking of this nearest neighbor. That is, we selected the highest-ranked feature (according to the neighbor's ranking) that has not yet been acquired for the test instance. The results are in the following table. Please note that no special training was conducted for this method; instead, we used the same predictor networks from the second stage of our original method. Additionally, in AACO, the nearest neighbor is identified based on raw feature distances, which may not be effective for image datasets. We leave the exploration of alternative strategies such as computing distances in the embedding space as well as the development of dedicated training procedures, to future work.

| Method | Spam | Metabric | CPS | CTG | SCKD |
|---|---|---|---|---|---|
| Second-stage | 0.955 ± 0.0001 | 69.83 ± 0.41% | 67.45 ± 0.13% | 0.9164 ± 0.0001 | 0.836 ± 0.07 |
| AACO-like baseline | 0.954 ± 0.0005 | 68.12 ± 0.75% | 67.20 ± 0.22% | 0.9092 ± 0.0009 | 0.827 ± 0.003 |
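For clarity, a minimal sketch of this nearest-neighbor acquisition rule is given below (names and conventions are illustrative assumptions rather than our exact implementation):

```python
import numpy as np

def next_feature(x_obs, mask, train_X, train_rankings):
    """Pick the next feature for a partially observed test instance.
    mask: 0/1 array over features; train_rankings[i]: precomputed importance
    ranking (feature indices, most important first) of training instance i."""
    obs_idx = np.where(mask == 1)[0]
    # nearest training neighbour on the currently observed raw features
    dists = np.linalg.norm(train_X[:, obs_idx] - x_obs[obs_idx], axis=1)
    neighbour = int(np.argmin(dists))
    # highest-ranked feature of that neighbour that has not yet been acquired
    for feat in train_rankings[neighbour]:
        if mask[feat] == 0:
            return int(feat)
    return None  # nothing left to acquire
```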

Small Points

'Center Cropping': This approach does not involve iterative selection. Instead, it involves training multiple models with increasing numbers of center patches unmasked in the input. For each height/width pair, that many patches are unmasked at the image center while the others remain masked, and a model is trained using this configuration across the dataset. Patch dimensions were manually predefined, matching other global methods. The implementation was taken directly from the greedy method codebase.
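As an illustration only (not the baselines' actual code), the family of center-crop masks can be thought of as follows: for each budget k, the central k × k block of patches is unmasked and a separate model is trained on inputs masked in that way.

```python
import numpy as np

def center_crop_mask(grid, k):
    """grid x grid patch layout; unmask the central k x k block of patches."""
    mask = np.zeros((grid, grid), dtype=bool)
    start = (grid - k) // 2
    mask[start:start + k, start:start + k] = True
    return mask

# one mask (and hence one trained model) per budget, e.g. on an 8x8 patch grid
masks = {k: center_crop_mask(8, k) for k in range(1, 9)}
print(masks[2].astype(int))
```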

'Baseline Hyperparameters': ImageNette, CIFAR10/100, BloodMNIST, and Metabric datasets required hyperparameter tuning. Specifically, for both GDFS and DIME, we tuned the learning rate (1e-6 to 1e-2), learning rate decay (0.2 to 0.5), learning rate patience (2 to 10), minimum learning rate (1e-9 to 1e-7), and early stopping (2 to 10 epochs). DIME additionally needed the exploration probability (0.05 to 0.6), exploration probability decay (0.1 to 3), and epsilon probability patience (2 to 14). We used a 60/20/20 split for training/validation/test, with validation based on the cross-entropy loss.

Review (Rating: 4)

This paper tackles the problem of active feature acquisition, a setting where all features might not be available at inference and the model needs to make accurate predictions while minimizing the number of features acquired. The authors propose a framework that dynamically selects instance-specific features based on their importance ranking obtained from local explanation methods. They reformulate the problem as a feature prediction task and introduce a policy network based on the decision-transformer architecture which sequentially predicts the next most-informative feature. This network is trained in a two-stage approach along with a predictor network which makes the predictions given a subset of features that have been acquired. Experiments across both image and tabular modalities demonstrate that this approach outperforms other recent methods.

Questions for the Authors

  1. Initializing the input with three features during inference seems counterintuitive; is there an intuition for why it helps stabilize training?
  2. Can this method be adapted so that a different number of features is acquired for different instances? One way could be to use a different stopping criterion for each instance (maybe using the prediction entropy).

Claims and Evidence

One of the main claims is that local explanation methods can provide a reliable signal for instance-specific feature importance and the experiment results do support that claim well, especially considering the oracle implementation which outperforms the others substantially across most datasets.

Methods and Evaluation Criteria

I am a bit skeptical about the two-stage training strategy. Even though the results indicate that the second stage improves performance, I would like more analysis on why the first stage is not enough for the policy network to learn the top features.

I would have liked to see feature costs getting incorporated into the training as well. Since the authors consider feature costs as a potential constraint, there could be a scenario where a feature is highly informative but has a high cost (in terms of time, money, etc.), so it might be better to acquire a less costly feature instead (e.g. lab test features are costlier than demographic features).

Theoretical Claims

There are no theoretical claims provided other than perhaps the intuition behind the oracle policy $q^*$.

Experimental Design and Analysis

I think the experimental design makes sense. In terms of the analyses, I would have liked a plot of the training times for the different methods, since local explanation methods can be expensive to run (see weaknesses).

Supplementary Material

I reviewed the pseudo-code of the training algorithm provided in the supplementary material.

Relation to Broader Scientific Literature

There has been a lot of research work done in the field of feature selection, starting all the way from 1996 [1]. Static feature selection has also been an important subject for a long time [2, 3, 4]. For dynamic feature selection, the literature has focused on formulating it as an RL problem [5, 6] or using mutual information estimations [7, 8]. There are many more works in each of these sub-fields, which indicates that this problem is of great interest to the scientific community.

[1] Geman, Donald, and Bruno Jedynak. "An active testing model for tracking roads in satellite images." IEEE Transactions on Pattern Analysis and Machine Intelligence 18.1 (1996): 1-14
[2] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar):1157–1182, 2003.
[3] Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P Trevino, Jiliang Tang, and Huan Liu. Feature selection: A data perspective. ACM Computing Surveys (CSUR), 50(6):1–45, 2017.
[4] Balın, Muhammed Fatih, Abubakar Abid, and James Zou. "Concrete autoencoders: Differentiable feature selection and reconstruction." International conference on machine learning. PMLR, 2019.
[5] Jaromír Janisch, Tomáš Pevný, and Viliam Lisý. Classification with costly features as a sequential decision-making problem. Machine Learning, 109:1587–1615, 2020.
[6] Mohammad Kachuee, Orpaz Goldstein, Kimmo Kärkkäinen, Sajad Darabi, and Majid Sarrafzadeh. Opportunistic learning: Budgeted cost-sensitive learning from data streams. In International Conference on Learning Representations, 2018.
[7] Aditya Chattopadhyay, Kwan Ho Ryan Chan, Benjamin D Haeffele, Donald Geman, and René Vidal. Variational information pursuit for interpretable predictions. arXiv preprint arXiv:2302.02876, 2023.
[8] Yang Li and Junier Oliva. Active feature acquisition with generative surrogate models. In International Conference on Machine Learning, pages 6450–6459. PMLR, 2021.

Missing Important References

I think the authors cover most of the important works and references related to the topic of feature selection. I would have liked more discussion on the potential limitations of the local explanation methods (reliability, computational feasibility), since they form the basis of the proposed framework.

Other Strengths and Weaknesses

Strengths:

  1. The paper tackles an important and well-studied task for active feature acquisition. As machine learning is getting used in real-world settings, this will be an important practical consideration to reduce costs and improve efficiency.
  2. The authors integrate different existing works like decision transformers and local explanation methods for this task, which is interesting.
  3. Experiments are performed across a variety of datasets and it looks like they achieve state-of-the-art performance on most of them.
  4. I like the way the concept figure (Figure 1) is presented, as it gives a good high-level idea of how the two-stage approach works.

Weaknesses:

  1. The dependence on local explanation methods means that the proposed framework inherits the same limitations as these methods. Feature attribution methods like SHAP grow exponentially in computation with the number of features, which might reduce the scalability of the proposed framework. The authors should show a comparison of the total training time with the other methods as well. Since the gains are marginal for some of the datasets, it might be practically more feasible to use another, more efficient method.
  2. From my understanding, feature costs are not considered during training, which is also an essential factor to consider. I would like to see an extension of this method that incorporates feature costs in training.

Other Comments or Suggestions

Suggestions:

  1. Provide visualizations of the features obtained on the image datasets. Also, for the medical tabular datasets, showing that the features selected correspond with prior knowledge would also be useful.
  2. The results of the ablation experiments on page 8 (using ResNet-10 and replacing the decision transformers) should be shown in a different table
  3. In the training pseudo-code, show how $a^i_{t:t'_i}$ and $r^i_{t:t'_i}$ are computed. Please provide the pseudo-code for the inference stage as well.
Author Response

Methods and evaluation criteria: The reviewer raises a thoughtful point about the role of the second-stage training. The first stage provides clean supervision through ground-truth feature importance rankings generated by explanation methods. However, during inference, the policy must operate on partially observed inputs and select features sequentially. This setting diverges from the idealized full-information assumption of the first stage. It also introduces two key challenges: (a) the transferability of supervision degrades with incomplete or noisy inputs, and (b) the policy must make decisions under uncertainty introduced by prior imperfect acquisitions. Our second-stage training directly addresses these issues by jointly training the policy and predictor under realistic, policy-driven acquisition settings. To further clarify this benefit, we added a comparison in Table 2 showing that extending first-stage training to 250 epochs yields minimal or no improvement, which underscores the utility of the second stage. Please see the updated Table 2 in our rebuttal to the fourth reviewer. The reviewer also notes that incorporating feature costs could further improve the method. We agree that incorporating feature costs is important for real-world applications, especially in domains like healthcare where test costs vary. While our current method assumes uniform costs, we view cost-aware acquisition as an important direction for future work.

Experimental design or analyses: The reviewer suggests comparing training times due to the high cost of generating explanation-based rankings. Our current implementation trains 2-4x slower than baseline methods. However, we expect significant speed-ups with optimization (e.g., using PyTorch Lightning, early stopping). Additionally, as noted in our response to Reviewer 2, W2, our method can leverage precomputed rankings, which aligns with practical scenarios (e.g., clinical models) where such data already exist. We acknowledge the current limitation and will include a discussion in the final version.

Relation to broader scientific literature: The reviewer would like a deeper discussion of the reliability and computational feasibility of explanation methods. We agree, and have discussed this in response to Reviewer 2 (W1 and W2). In short, while explanation methods have known limitations, our experiments show that the AFA policy is robust to variation across seeds and models. Practical approximations (e.g., FastSHAP) are also available for scaling.

Response to W1: Dependency on explanation methods may limit scalability due to computational cost. We addressed this in detail in our response to Reviewer 2, W2. To summarize: our approach can reuse precomputed rankings and has room for significant implementation optimizations. This trade-off will be discussed explicitly in the paper.

Response to W2: Feature costs are not currently incorporated. We acknowledge this limitation and see cost-aware acquisition as a valuable extension. We plan to explore this direction as part of future work.

Other comments: If accepted, we will include visualizations of selected image patches. For tabular data, further work is needed to assess alignment with prior clinical knowledge. We will move the ResNet-10 and decision transformer ablation results to a separate table. Also, we will update the pseudocode and include inference pseudocode.

Response to Q1: Fixing the initial features stabilizes training by reducing uncertainty in early steps and simplifying the learning of conditional distributions. This is especially helpful when many features are available. Notably, this flexibility is unique to our method compared to other AFA methods.

Response to Q2: Our framework can be extended to support adaptive stopping criteria (e.g., based on prediction entropy). While not implemented here, we consider this a promising direction for future work.
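As an illustration of what such a criterion could look like (a sketch under assumed interfaces, not part of the current method):

```python
import numpy as np

def acquire_with_entropy_stop(x, policy_next_feature, predict_proba,
                              budget, entropy_threshold=0.2):
    """Acquire features until the predictor is confident or the budget is hit.
    `policy_next_feature` and `predict_proba` are assumed hooks into the
    trained policy and predictor networks."""
    mask = np.zeros_like(x)
    for _ in range(budget):
        probs = predict_proba(x * mask, mask)
        entropy = -np.sum(probs * np.log(probs + 1e-12))
        if entropy < entropy_threshold:          # prediction confident: stop early
            break
        feat = policy_next_feature(x * mask, mask)
        mask[feat] = 1.0
    return predict_proba(x * mask, mask), mask
```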

Reviewer Comment

I thank the authors for the responses. They have addressed the concern I had with the two-stage training along with most of my questions. As for the computational complexity, while I agree that having precomputed rankings will speed up computation, I am still skeptical of the practicality of this method, especially in medical settings like the authors mention. Time plays a major role in such sensitive settings, and even if there are precomputed explanations, they would likely need to be continually updated due to the dynamic nature of the clinical setting. Nevertheless, this method seems to be an interesting and novel approach to feature selection and I will raise my score accordingly. I urge the authors to include the limitations and additional analyses discussed here in the updated manuscript.

Author Comment

We sincerely thank the reviewer for the thoughtful follow-up and for raising the evaluation score. In the revised manuscript, we will include a discussion of these limitations and trade-offs, highlighting the current constraints and potential mitigation strategies, such as using faster approximation methods (e.g., FastSHAP).

Review (Rating: 3)

The authors propose an active feature acquisition (AFA) framework that selects features based on their importance to each individual case. The method leverages local explanation techniques to generate instance-specific feature importance rankings. The authors reframe the AFA problem as a feature prediction task, introducing a policy network grounded in a decision transformer architecture. The authors conducted experiments on multiple datasets and demonstrate that their approach outperforms current state-of-the-art AFA methods in both predictive accuracy and feature acquisition efficiency.

Questions for the Authors

The authors acknowledge that RL-based methods are theoretically capable of finding the optimal policy, but their method outperforms them empirically. It would be good to have a deeper discussion of why this might be the case.

Claims and Evidence

Looks good to me.

Methods and Evaluation Criteria

Looks good to me.

Theoretical Claims

No theoretical claims.

Experimental Design and Analysis

Looks good to me.

Supplementary Material

N/A

Relation to Broader Scientific Literature

N/A

Missing Important References

N/A

Other Strengths and Weaknesses

Strengths:

  1. The paper addresses an important problem in machine learning: efficient feature acquisition in scenarios where data collection is costly or time-consuming.
  2. The approach is intuitive and leverages explainability methods (SHAP, LIME) in a novel way to determine instance-specific feature importance rankings.
  3. The authors reframe the active feature acquisition (AFA) problem as a feature prediction task, which allows them to use a decision transformer architecture for the policy network.
  4. The proposed method outperforms state-of-the-art AFA techniques in both predictive accuracy and feature acquisition efficiency across multiple datasets.

Weaknesses:

  1. The method relies on the accuracy and reliability of local explanation methods. If the explanations are not accurate, the performance of the AFA framework could be affected.
  2. The computational cost of generating feature importance rankings using methods like SHAP can be high, especially for large datasets or complex models.
  3. The complexity of the decision transformer architecture may require significant computational resources and expertise to implement and train.

Other Comments or Suggestions

A typo in Ln 291: ImageNette -> ImageNet

Author Response

Response to W1: We agree that the performance of our AFA framework is influenced by the accuracy of local explanation methods. However, our experiments show that rankings generated by widely used explanation techniques (e.g., SHAP, LIME) consistently match or exceed the performance of rankings derived from existing AFA baselines. This suggests that current explanation methods already provide sufficiently reliable signals for guiding feature acquisition. Moreover, explainability research is advancing rapidly, and we expect that continued improvements in explanation quality will further strengthen the effectiveness and generalizability of our framework.

Response to W2: We acknowledge the reviewer’s concern regarding the computational cost of explanation methods like SHAP. However, recent advances, such as FastSHAP, have significantly improved efficiency through approximation strategies. Moreover, in practical settings such as medicine, AFA would typically be used with pre-existing, well-trained models tailored to specific conditions. In such scenarios, explanation methods are often already applied to ensure interpretability and clinical trust, which is an important requirement in medical AI [1, 2]. Our framework is designed to take advantage of these precomputed feature importance rankings, avoiding the need to rerun explanation algorithms during training. This makes our approach both practical and computationally efficient in real-world deployments.
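As an illustration of the precompute-and-reuse workflow (a sketch only, not our actual pipeline; a simple occlusion-style attribution stands in here for SHAP/FastSHAP, and all names are illustrative):

```python
import numpy as np

def occlusion_rankings(model, X, baseline=0.0):
    """Rank features per instance by the drop in the predicted probability of
    the instance's predicted class when that feature is occluded."""
    preds = model.predict_proba(X)
    cls = preds.argmax(axis=1)
    base = preds[np.arange(len(X)), cls]
    importances = np.zeros_like(X, dtype=float)
    for j in range(X.shape[1]):
        X_occ = X.copy()
        X_occ[:, j] = baseline
        p = model.predict_proba(X_occ)[np.arange(len(X)), cls]
        importances[:, j] = base - p               # larger drop = more important
    return np.argsort(-importances, axis=1)          # per-instance rankings

# computed once and cached, then reused across policy-training runs:
# rankings = occlusion_rankings(clf, X_train)
# np.save("rankings_train.npy", rankings)
```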

Response to W3: As shown in our ablation experiments, our framework is flexible and can accommodate simpler policy models, albeit with some trade-off in performance. While a decision transformer requires more resources than other architectures that leverage, for example, a ResNet block, the decision transformer consistently achieved superior results across tasks, motivating its inclusion as the default choice. To support ease of adoption and reproducibility, we will release all implementation details and code upon acceptance, enabling researchers and practitioners to readily integrate or adapt our method.

Response to Q1: While RL-based methods are theoretically capable of discovering optimal acquisition policies, in practice they often suffer from instability, high variance, and sensitivity to hyperparameters [3,4,5]. As discussed in [6], greedy-based methods, despite their lack of formal optimality guarantees, frequently outperform RL approaches due to their simplicity and stability during training. Moreover, under certain assumptions about the data distribution, greedy strategies can be provably near-optimal [7], making them a compelling practical alternative. We showed that our framework, which avoids high-variance RL training by using a supervised learning objective with a decision transformer, achieves strong empirical performance while benefiting from interpretability and ease of optimization. Additionally, we evaluated the performance of an RL-based method (OPL [8]) in our experiments (please see Table 3 in our response to Reviewer 1). Our experimental results further suggest that RL-based approaches struggle to match the stability and effectiveness of our method.

References:

[1] Hill, E.D., Kashyap, P., Raffanello, E. et al. Prediction of mental health risk in adolescents. Nat Med. 2025 Mar 5. doi: 10.1038/s41591-025-03560-7.

[2] Dai, L., Sheng, B., Chen, T. et al. A deep learning system for predicting time to progression of diabetic retinopathy. Nat Med 30, 584–594 (2024).

[3] Erion, G., Janizek, J. D., Hudelson, C., Utarnachitt, R. B., McCoy, A. M., Sayre, M. R., ... Lee, S. I. (2022). Coai: Cost-aware artificial intelligence for health care. Nature biomedical engineering, 6(12), 1384.

[4] Chattopadhyay, A., Chan, K. H. R., Haeffele, B. D., Geman, D., Vidal, R. (2023). Variational information pursuit for interpretable predictions. arXiv preprint arXiv:2302.02876.

[5] Covert, I. C., Qiu, W., Lu, M., Kim, N. Y., White, N. J., Lee, S. I. (2023, July). Learning to maximize mutual information for dynamic feature selection. In International Conference on Machine Learning (pp. 6424-6447). PMLR.

[6] Gadgil, S., Covert, I., Lee, S. I. (2024). Estimating conditional mutual information for dynamic feature selection. International Conference on Learning Representations.

[7] Chen, Y., Hassani, S. H., Karbasi, A., Krause, A. (2015, June). Sequential information maximization: When is greedy near-optimal?. In Conference on Learning Theory (pp. 338-363). PMLR.

[8] Kachuee, M., Goldstein, O., Kärkkäinen, K., and Sarrafzadeh, M. Opportunistic learning: Budgeted cost-sensitive learning from data streams. In International Conference on Learning Representations, 2019.

Review (Rating: 4)

This paper introduces a novel approach to active feature acquisition (AFA) by reframing the problem as a feature prediction task guided by explainability-driven rankings. Specifically, the authors leverage local explanation methods (e.g., SHAP, LIME, TreeSHAP) to generate instance-wise feature importance rankings. These rankings are used to supervise a decision transformer that learns a policy for acquiring the next most informative feature for each instance. This network is trained in a two-stage approach along with a predictor network which makes the predictions given the subset of features acquired so far. Empirical results across a diverse set of tabular and image datasets (including several medical datasets) show consistent improvements over state-of-the-art AFA methods. Notably, to my knowledge, this is the first work to directly incorporate local explanation methods as a guiding signal for sequential feature acquisition, offering a more stable and interpretable alternative to traditional reinforcement learning or mutual information-based approaches.

Questions for the Authors

  1. How robust is your method to noise or instability in the feature importance rankings generated by SHAP or LIME?

    Would using a weaker model (or different initialization) to compute rankings degrade downstream acquisition policy quality significantly?

  2. How does your method compare to training-time feature selection methods like INVASE that don’t rely on post-hoc explainability?

    Could this offer better stability or generalization?

Claims and Evidence

Yes. The claims made in the submission are supported by clear and convincing evidence. The experiments are thorough and diverse, covering nine datasets. Performance comparisons with state-of-the-art methods, ablations (e.g., different explanation techniques, architectures), and oracle benchmarks support the central claims.

Methods and Evaluation Criteria

Yes. The use of decision transformers, per-instance explainability-based supervision, and standard evaluation on public benchmarks are all appropriate for the AFA setting. No additional explanation necessary here.

Theoretical Claims

N/A — the paper does not include novel theoretical derivations or formal proofs that require scrutiny.

Experimental Design and Analysis

Yes, the experimental design is sound. The authors use well-established backbones like ResNet variants, a GPT-3 mini version for the transformer, and standard training protocols.

Supplementary Material

Yes. I reviewed the supplementary material — specifically the algorithmic components. They are clearly described and support the main claims.

Relation to Broader Scientific Literature

The paper builds on two key streams of literature: (1) active feature acquisition, especially via reinforcement learning or mutual information-based methods; and (2) explainability methods like SHAP and LIME.

This work is distinct in its fusion of these two domains, using post-hoc explanation tools during training to stabilize and guide sequential feature acquisition. I am not aware of any prior attempts that combine local explanation techniques with sequential policy learning in this manner, making the contribution both novel and timely.

Missing Important References

One possibly relevant line of work the authors may consider discussing is INVASE (Yoon et al., ICLR 2018), which also performs instance-wise variable selection using a policy-based network but learns feature importance end-to-end rather than using post-hoc explanations. It would be helpful for the authors to clarify how their use of SHAP/LIME compares in terms of ranking stability and generalization performance.

Other Strengths and Weaknesses

Strengths:

  1. Strong combination of explainability and AFA.
  2. Effective use of decision transformers in a new setting.
  3. Generalizable two-stage training framework.
  4. Broad empirical evaluation on both image and medical datasets.
  5. Well written paper.

Weaknesses:

  1. The paper provides a helpful analysis (e.g., Table 4) showing that the learned acquisition policy aligns well with the explainability-based feature rankings, which supports the overall training framework. That said, an important open question is the stability of the explanation methods themselves. Post-hoc techniques like SHAP and LIME can produce variable feature rankings across different runs or model initializations, especially in high-dimensional settings. While the two-stage training strategy may help mitigate this, the paper does not directly evaluate how such variability might affect the robustness of the learned policy. It would strengthen the work to include a ranking stability analysis (e.g., across seeds or models), and assess how sensitive the acquisition policy is to such fluctuations.
  2. Additionally, it would be helpful to contrast this approach with training-time feature selectors like INVASE (Yoon et al., 2018), which do not rely on post-hoc explainability and may produce more stable, model-aligned importance scores. A comparison or discussion here would strengthen the positioning and generalizability of the approach.

Other Comments or Suggestions

Line 312: "due to to" → Typo, please correct to "due to".

Author Response

Response to W1 & Q1: We appreciate the reviewer’s comments on the potential variability of post-hoc explanation methods and their impact on the robustness of our learned acquisition policy. To address these concerns, we conducted several robustness evaluations. As shown in Table 3, our method performs consistently across different explanation-based ranking approaches (e.g., SHAP, LIME), demonstrating resilience to variations in feature importance estimations. Additionally, we tested the impact of using a weaker model to compute rankings: when replacing ResNet-18 with ResNet-10 on CIFAR-10, performance declined only marginally (78.61% to 78.22%), suggesting our method is robust to model capacity differences in the ranking stage. As further suggested, we evaluated the stability of feature importance rankings across random seeds. Specifically, we trained three initial models using different random seeds, resulting in three distinct ranking orders for each explanation method in Table 3. For each ranking order, we trained our policy network three times to account for variability in our model’s initialization. The updated Table 3 now reports the mean and standard deviation across these nine runs, demonstrating stable performance and indicating that our method is robust to fluctuations in explanation-based rankings:

Updated Table 3

| Ranking method | Spam | Metabric | CPS | CTG | SCKD |
|---|---|---|---|---|---|
| T-SHAP | 0.955 ± 0.001 | 69.83 ± 0.41% | 67.45 ± 0.13% | 0.916 ± 0.001 | 0.836 ± 0.07 |
| LIME | 0.953 ± 0.002 | 69.15 ± 0.18% | 67.06 ± 0.36% | 0.913 ± 0.001 | 0.822 ± 0.09 |
| K-SHAP | 0.956 ± 0.002 | 69.57 ± 0.33% | 67.32 ± 0.56% | 0.915 ± 0.001 | 0.831 ± 0.005 |
| IME | 0.954 ± 0.001 | 69.78 ± 0.10% | 67.12 ± 0.61% | 0.916 ± 0.001 | 0.8258 ± 0.1 |
| INVASE | 0.927 ± 0.002 | – | 68.37 ± 0.23% | 0.912 ± 0.003 | 0.8305 ± 0.09 |
| OPL* | 0.889 ± 0.002 | 61.54 ± 0.23% | 63.45 ± 0.80% | 0.864 ± 0.005 | 0.7003 ± 0.04 |

*We also evaluated an RL-based method (OPL [1]). Since we are unable to update our figures during the rebuttal period, we present this baseline result in table format.
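As one way to quantify this stability (an illustrative sketch, not our reported protocol), the average Spearman correlation between per-instance rank vectors obtained from models trained with different seeds can be computed:

```python
import numpy as np
from scipy.stats import spearmanr

def mean_rank_agreement(ranks_a, ranks_b):
    """ranks_a, ranks_b: (n_instances, n_features) arrays where entry [i, j]
    is the importance rank of feature j for instance i under each seed."""
    corrs = []
    for a, b in zip(ranks_a, ranks_b):
        rho, _ = spearmanr(a, b)          # agreement between the two rankings
        corrs.append(rho)
    return float(np.mean(corrs))
```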

Response to W2: Our AFA framework fundamentally differs from end-to-end feature selection methods like INVASE. While INVASE integrates feature selection directly into model training via policy gradients, AFA decouples the feature importance estimation from policy learning. This has two advantages: (a) it enables interpretability aligned with widely used post-hoc explanation tools, and (b) it simplifies training by avoiding the high variance and complexity of reinforcement learning.

Response to Q2: Thank you also for suggesting the INVASE [2] method. Our approach is compatible with any ranking order, including those produced by INVASE. Based on your suggestion, we evaluated our method using INVASE-derived rankings and included these results in Table 3 (excluding the Metabric dataset due to computational constraints). Running INVASE on Metabric proved infeasible within the rebuttal window due to its high computational cost.

References:

[1] Kachuee, M., Goldstein, O., Kärkkäinen, K., and Sarrafzadeh, M. Opportunistic learning: Budgeted cost-sensitive learning from data streams. In International Conference on Learning Representations, 2019.

[2] Yoon, Jinsung, James Jordon, and Mihaela Van der Schaar. "INVASE: Instance-wise variable selection using neural networks." International conference on learning representations. 2018.

Reviewer Comment

Thank you for your response. My questions have been addressed, so I will be maintaining the same score.

Author Comment

We appreciate the reviewer's follow-up and are glad that the concerns have been resolved.

Final Decision

This paper presents a novel approach to active feature acquisition by leveraging explainability methods to guide a decision transformer. The reviewers agreed on the strong empirical results, the intuitive nature of the method, and its potential to outperform existing techniques.