TabFSBench: Tabular Benchmark for Feature Shifts in Open Environments
Abstract
Reviews and Discussion
The paper studies the problem of feature shifts in tabular learning. It introduces TabFSBench, a benchmark for evaluating four types of feature-shift scenarios, assessing the robustness of four categories of tabular models. The study shows, among other findings, that most tabular models have limitations in handling feature shifts and that feature importance has a linear relationship with performance degradation.
Questions for Authors
Q. Observation 3 states that models with superior performance in closed environments also perform better in feature-shift scenarios. This seems somewhat counterintuitive, as one might expect that models achieving high performance in closed environments rely more heavily on specific features, making them more vulnerable to feature shifts. Do you have any intuition about this? Or, does this suggest potential improvements to the benchmark to strengthen the evaluation?
Claims and Evidence
The paper evaluates four types of feature shifts and their impact on model performance: (a) single shift, where a single feature with different correlations is shifted; (b) most-relevant shift, where features with the highest correlations are shifted; (c) least-relevant shift, where features with the lowest correlations are shifted; and (d) random shift, where a random subset of features is shifted.
The main claims, including that current state-of-the-art methods fall short in handling the four types of shifts in the 'feature-decrement' scenario and that the degree of performance degradation is closely related to feature correlation, are well supported by the experimental results. However, the scope of the evaluation is somewhat limited. For example, the paper only considers the 'feature-decrement' case, in which some features have missing values and the same shift is applied to all test examples. The evaluation would be more comprehensive if the shifted features varied across different test examples and if the 'feature-increment' scenario were also considered, allowing an assessment of how much the model's performance degrades compared to when additional features are available.
Methods and Evaluation Criteria
TabFSBench is specifically designed to evaluate the types of feature-shift scenarios and tabular models considered in this paper, and so the benchmark is appropriate for the evaluation. However, it could have included more tabular tasks, e.g., those with substantially larger feature sets, to strengthen the evaluation. Moreover, the different types of tabular models considered are evaluated 'in isolation'. Instead, some of the approaches could be combined to potentially better handle the feature-shift scenarios. For example, traditional tree-based models that tend to do better when large training data is available could benefit from features imputed using LLMs as feature generators. Such new proposals for handling feature shifts and demonstrating those on the benchmark would have strengthened the evaluation substantially.
Theoretical Claims
No substantial theoretical claims are made in the paper.
Experimental Design and Analyses
The overall design of TabFSBench to include 12 tabular tasks, with the four types of feature shifts is sound. The mean-value imputation strategy used is a common approach and therefore is a valid choice. Again, the experiments could have been extended in several ways, such as considering cases where different test examples experience different types of feature shifts or exploring other imputation strategies, such as using LLMs as feature generators.
Supplementary Material
No supplementary material has been reviewed.
Relation to Broader Scientific Literature
While prior works tend to focus more on distribution shifts, this paper examines feature shifts in tabular learning, introducing a benchmark for assessing four types of feature shifts. In addition, the paper evaluates different types of tabular models on this benchmark, including both traditional tree-based models as well as more recent LLM-based approaches.
Essential References Not Discussed
While the paper discusses general-purpose and tabular LLMs and evaluates them on the benchmark, LLMs have also been used as feature generators in tabular learning, and discussion of those use cases is missing. Some relevant works in this regard include OpenFE [1], CAAFE [2], and OCTree [3].
[1] Zhang et al., OpenFE: Automated Feature Generation with Expert-level Performance.
[2] Hollmann et al., Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering.
[3] Nam et al., Optimized Feature Generation for Tabular Data via LLMs with Decision Tree Reasoning.
Other Strengths and Weaknesses
None.
Other Comments or Suggestions
Typos:
- near line 192.
Dear Reviewer B7Cf:
Thank you very much for your valuable feedback on our paper. We will take your reviews seriously and make the necessary revisions or additions to the final version.
Q1: Shifted features vary across different examples.
We appreciate your feedback on our experimental design. In our random shift experiments, to evaluate model performance under n/N% feature shift (where n is the number of shifted features and N is the total feature count), we follow this protocol:
- Randomly sample min(10,000, C(N,n)) distinct feature combinations for each shift magnitude.
- Remove each selected combination during testing to assess performance.
- The performance for a given n/N% shift is the mean across all combinations.
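
For concreteness, below is a minimal sketch of this protocol. It assumes a pandas DataFrame test set, a fitted scikit-learn-style model, and mean imputation of the shifted numeric features; these are illustrative choices, not the benchmark's exact implementation.

```python
from itertools import combinations
from math import comb
import random

import numpy as np


def sample_feature_combos(features, n_shift, max_combos=10_000, seed=0):
    """Sample min(max_combos, C(N, n_shift)) distinct feature subsets."""
    rng = random.Random(seed)
    if comb(len(features), n_shift) <= max_combos:
        return list(combinations(features, n_shift))
    seen = set()
    while len(seen) < max_combos:
        seen.add(tuple(sorted(rng.sample(features, n_shift))))
    return list(seen)


def random_shift_score(model, X_test, y_test, metric, n_shift):
    """Mean test performance when n_shift features are shifted (mean-imputed here)."""
    scores = []
    for combo in sample_feature_combos(list(X_test.columns), n_shift):
        X_shift = X_test.copy()
        for col in combo:
            X_shift[col] = X_test[col].mean()  # numeric features; a mode fill would be used for categoricals
        scores.append(metric(y_test, model.predict(X_shift)))
    return float(np.mean(scores))
```

The per-degree numbers reported in our tables correspond to evaluating such a routine at each shift magnitude n/N.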
Q2: Considering cases with different types of feature shifts
Regarding this question, we would like to clarify two potential interpretations:
- Different types of (feature shifts): Please refer to our response to Reviewer Q7p8 Q1. The first leaderboard provides a comprehensive evaluation of model robustness across various types of feature shifts.
- Different types of (feature) shifts: We conducted evaluations of CatBoost performance on a single shift, categorized by Categorical, Numerical, and Boolean features. Results demonstrate that model performance exhibits the highest sensitivity to Categorical features, followed by Boolean features, while showing the lowest susceptibility to Numerical features. The complete experimental results are available in the link.

| Metric | Raw | Categorical | Numerical | Boolean |
|---|---|---|---|---|
| Acc | 0.792 | 0.769 | 0.787 | 0.778 |
Q3: Feature-increment scenario.
Please refer to our response to Reviewer Q7p8 Q2.
Q4: More datasets.
Due to time and computational constraints in the rebuttal phase, we couldn't complete evaluations on larger benchmarks. However, following OCTree, we conducted random shift experiments on two high-dimensional datasets (madelon and nomao) and, limited by space, present model performance here under 10%, 20%, ..., and 100% feature shift degrees. This link has detailed results.
| Dataset | Metric | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Nomao | Acc | 0.931 | 0.926 | 0.918 | 0.907 | 0.880 | 0.871 | 0.845 | 0.822 | 0.831 | 0.765 | 0.747 |
| Madelon | Acc | 0.590 | 0.581 | 0.572 | 0.568 | 0.573 | 0.564 | 0.555 | 0.543 | 0.526 | 0.513 | 0.524 |
The table demonstrates that as the degree of feature shift increases, the model performance decreases significantly. Please refer to our response to Reviewer Q7p8 Q1 for further explanation.
Q5: Imputation methods.
We appreciate your suggestions. As Reviewers Q7p8 and WumN noted, our paper is a benchmark study and need not propose new methods; thus, we did not explore AutoFE under feature shifts in the paper. Additionally, our focus on evaluating model robustness meant we did not extensively cover advanced imputation methods, as explained in Section 3.3.
In light of your proposal, we posit that AutoFE can address feature shift scenarios through two approaches. Detailed results can be found in the link.
- Imputation: CAAFE and OcTree generate rules for imputing specific features, while OpenFE's generated features do not match the originals. We tested single shifts on the heart dataset using CAAFE and OcTree for imputation. Results show that LLM-based AutoFE can effectively generate matching features, enhancing model robustness compared to mean imputation.

| Model | Raw | CAAFE | OcTree |
|---|---|---|---|
| CatB | 0.846 | 0.852 | 0.850 |
| PFN | 0.859 | 0.869 | 0.870 |

- Generation: We used AutoFE to generate new features to offset the impact of missing original features on model performance. Tests on the heart dataset with random shifts via OpenFE, CAAFE, and OcTree (with the number of generated features matching the original dataset) showed that LLM-based AutoFE has significant potential in feature shift scenarios.

| Model | Raw | OpenFE | CAAFE | OcTree |
|---|---|---|---|---|
| CatB | 0.710 | 0.849 | 0.854 | 0.849 |
| PFN | 0.763 | 0.859 | 0.867 | 0.865 |
As the importance of the shifted features increases, the performance of both models declines, further corroborating Observation 2 of TabFSBench. Please refer to our response to reviewer wA1o Q7 regarding other imputation methods.
Q6: Typos.
We will make revisions in the final version.
Q7: Questions For Authors.
Feature importance, calculated via PCC, correlates strongly with the impact of feature absence on performance. Higher importance results in greater performance decline, indicating that model dependency on features is consistent across closed environments and feature shift scenarios. This stability is due to the models' robust feature selection and weight allocation mechanisms. For more details, refer to our response to reviewer wA1o Q6 and the link.
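
As a rough illustration of the PCC-based importance referred to here, a minimal sketch is given below; the DataFrame/Series layout and variable names are assumptions for illustration, not the paper's exact code.

```python
import pandas as pd


def pcc_importance(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Absolute Pearson correlation of each feature with the target, sorted descending."""
    return X.apply(lambda col: col.corr(y)).abs().sort_values(ascending=False)


# Hypothetical usage: rank features, then shift the top-ranked ones for the
# "most-relevant shift" scenario and the bottom-ranked ones for "least-relevant".
# importance = pcc_importance(X_train, y_train)
```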
The paper introduces TabFSBench, a benchmark designed to evaluate feature shifts in tabular data. The authors argue that while distribution shifts have been extensively studied, feature shifts remain an underexplored challenge in open environments. TabFSBench includes 12 datasets, four feature-shift scenarios, and evaluates various model categories, including tree-based models, deep learning approaches, and large language models (LLMs). The key findings from the study include:
- Most tabular models perform poorly under feature shifts.
- The importance of shifted features correlates linearly with performance degradation.
- Model performance in closed environments is indicative of its robustness to feature shifts.
Questions for Authors
See above.
Claims and Evidence
The paper claims that feature shifts significantly degrade model performance, but the supporting evidence is limited by a small selection of datasets (12 datasets), unclear dataset selection criteria, and insufficient hyperparameter optimization (HPO) for baseline models. The claims regarding linear correlation between feature importance and performance degradation are presented with empirical results, but the justification for why Pearson correlation is the appropriate metric for feature importance is weak. Additionally, the claim that LLMs show potential for handling feature shifts is not rigorously validated.
Methods and Evaluation Criteria
The experimental design includes:
- A benchmark with four feature-shift scenarios: single shift, most/least relevant shift, and random shift.
- Comparison of various tabular models, including tree-based models, deep learning approaches, and LLMs.
- Evaluation metrics: accuracy for classification, RMSE for regression, and a robustness metric based on performance drops.
However, the dataset selection process is unclear, and the justification for using 12 datasets is missing. Many benchmarks for tabular learning typically use dozens or even hundreds of datasets to ensure robustness. Additionally, the HPO process for baseline models is underdeveloped, which raises concerns about whether the reported performance reflects true model capabilities.
Theoretical Claims
The paper does not introduce novel theoretical claims but suggests a linear relationship between shifted feature importance and model degradation. While this is weakly supported by empirical evidence, no theoretical justification is provided. The Pearson correlation analysis is simplistic, and additional validation is needed.
Experimental Design and Analyses
The experiments are not sufficiently rigorous due to:
- Limited dataset variety: Only 12 datasets, whereas other tabular learning studies often use significantly more.
- Weak HPO process: Many tabular models, particularly gradient-boosted decision trees (GBDTs), are highly sensitive to hyperparameter tuning, but details on the HPO space are lacking.
- Overgeneralized conclusions: The claim that "most models struggle with feature shifts" is based on a narrow selection of datasets and does not consider variations across different domains.
The benchmark fails to justify why these 12 datasets are representative of real-world tabular learning tasks.
Supplementary Material
The supplementary material includes dataset details and additional experimental results. However, it lacks statistical analysis to strengthen the empirical observations. No justification is provided for dataset selection or alternative methods for defining feature importance.
Relation to Broader Scientific Literature
The paper correctly identifies that distribution shifts have been extensively studied, while feature shifts remain relatively unexplored. However, it does not cite all relevant prior work on feature shift mitigation. The paper could benefit from a more detailed discussion of:
- Domain adaptation and domain generalization techniques relevant to feature shifts.
- Methods for assessing robustness of tabular models beyond standard performance metrics.
Essential References Not Discussed
Most of the recent essential references are discussed.
Other Strengths and Weaknesses
Strengths:
- Addresses an underexplored problem (feature shifts in tabular data).
- Includes a diverse set of tabular models, including LLMs (even though they are only loaded from an existing Python library).
Weaknesses:
- Lack of clarity in key definitions: The paper does not provide a formal, quantifiable definition of "feature shift"; the explanation of distribution shift vs. feature shift is vague.
- Limited dataset selection: Only 12 datasets are used without clear justification.
- Insufficient hyperparameter tuning: HPO significantly affects tabular models, but the paper lacks a proper discussion of its impact.
- Overgeneralized conclusions: Claims about limited applicability of tabular models are made based on a small dataset selection.
- Weak novelty: The benchmark consists of existing datasets and models, with no novel methodological contribution.
Other Comments or Suggestions
- Clarify definitions: Provide formal mathematical definitions of feature shift and distribution shift.
- Improve dataset selection justification: Explain why these 12 datasets were chosen and why they represent real-world tabular learning challenges.
- Enhance HPO methodology: Describe the hyperparameter search space and ensure fair comparisons among models.
- Avoid overgeneralized claims: Reword broad claims such as “most models struggle with feature shifts” to reflect limitations of the current benchmark.
Dear Reviewer wA1o:
Thank you very much for your valuable feedback on our paper. We will take your reviews seriously and make the necessary revisions or additions to the final version.
Q1: Weakness&Suggestion 1
We adopt notations used in our paper and will provide a more detailed explanation in the final version.
- Distribution Shift: During the training phase and the testing phase, the feature sets remain unchanged, i.e., $\mathcal{F}_{tr} = \mathcal{F}_{te}$. However, there are three types of distribution shifts of the samples themselves: $P_{tr}(X) \neq P_{te}(X)$, $P_{tr}(Y) \neq P_{te}(Y)$, and $P_{tr}(Y \mid X) \neq P_{te}(Y \mid X)$, where $X$ denotes the input features and $Y$ denotes the label.
- Feature Shift: During the training phase and the testing phase, the feature sets change while the data distribution of the samples has no shift.
  - Increment: $\mathcal{F}_{tr} \subset \mathcal{F}_{te}$.
  - Decrement: $\mathcal{F}_{te} \subset \mathcal{F}_{tr}$.
Q2: W&S 2.
Please refer to our response to Reviewer Q7p8 Q1.
Q3: W&S 3.
We implement adaptive hyperparameter optimization based on the Optuna framework. Following previous studies [1], we fix the batch size at 1024 and conduct 100 independent trials using train-validation splits to prevent test-set leakage, with the best-performing hyperparameters then fixed across the final 15 seeds.
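A minimal sketch of such an Optuna loop is given below; the model (LightGBM), the search space, and the synthetic data are illustrative placeholders, not the paper's actual grids or datasets.

```python
import optuna
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data standing in for a benchmark dataset's train-validation split.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)


def objective(trial):
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
    }
    model = LGBMClassifier(**params)
    model.fit(X_train, y_train)                  # tuned on the validation split only; no test-set leakage
    return accuracy_score(y_val, model.predict(X_val))


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)          # 100 independent trials
best_params = study.best_params                  # then fixed across the final seeds
```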
Q4: W&S 4
Please refer to our response to Reviewer B7Cf Q4.
Q5: W 5
Our paper focuses on evaluating the robustness of tabular machine learning models in feature-shift scenarios and, as Reviewers Q7p8 and WumN noted, is a benchmark study that does not propose new methods. However, in our response to Reviewer B7Cf Q5, we explored the potential of AutoFE in addressing the challenges of feature shifts.
Q6: The justification for the Pearson correlation is weak.
We calculated Kendall's $\tau$ coefficient among Pearson, Spearman, SHAP, and mutual information to assess the consistency of feature importance rankings. The table shows a high degree of consistency across these metrics:
| Metric | Kendall's $\tau$ |
|---|---|
| Pearson | 0.60 |
| Spearman | 0.61 |
| SHAP | 0.49 |
| Mutual Information | 0.53 |
Despite Spearman's high consistency ($\tau$ = 0.61), we chose Pearson for its widespread use and interpretability. The minor difference in consistency between Pearson and Spearman does not affect the conclusions of the analysis. Detailed data are provided in the link. We will further refine the related experiments in the final version.
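For reference, a minimal sketch of this consistency check is shown below; the importance scores are made-up placeholders, not our measured values.

```python
from itertools import combinations

from scipy.stats import kendalltau

# Importance score per feature (same feature order) under each metric; values are illustrative.
rankings = {
    "pearson":     [0.82, 0.10, 0.55, 0.31],
    "spearman":    [0.79, 0.12, 0.60, 0.28],
    "shap":        [0.40, 0.05, 0.33, 0.22],
    "mutual_info": [0.51, 0.02, 0.47, 0.18],
}

# Pairwise Kendall's tau between the rankings induced by each pair of metrics.
for a, b in combinations(rankings, 2):
    tau, _ = kendalltau(rankings[a], rankings[b])
    print(f"Kendall's tau between {a} and {b} rankings: {tau:.2f}")
```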
Q7: The claim that LLMs show potential in feature shifts is not rigorously validated.
We evaluated LLM performance in scenarios without missing-value imputation. Experimental results demonstrate that LLMs can not only handle feature shifts but are also more robust than tree-based models. Please refer to our response to reviewer 8NVr Q5. Additionally, we tested the performance of LLM-based AutoFE in handling feature shifts. Please refer to our response to reviewer B7Cf Q5.
Q8: Observation 2 lacks theoretical justification.
This observation is drawn from experimental results, which limits the scope of the analytical section. To further investigate this observation, we provide a theoretical justification below.
A high PCC signifies a strong feature-target relation, enabling the feature to provide substantial information increment and enhance model performance. Thus, removing important features leads to a significantly greater performance degradation than removing unimportant ones. Additionally, the model depends strongly on highly correlated features during training, and their absence markedly deteriorates performance (see single-shift experiment results).
We have already pointed out the limitations regarding the impact of interactions among input features on model performance in lines 433-435 in the Conclusion of our paper. Subsequently, we will conduct an in-depth exploration of this. For additional explanations regarding this observation, please refer to our response to Reviewer 8NVr Q2.
Q9: Not cite all relevant work.
Thank you for your suggestion. Although existing Heterogeneous Domain Adaptation (HeDA) methods have achieved significant progress on feature shifts in images, tabular data presents a fundamentally different pattern: its inherent structure makes it challenging to directly apply HeDA to such datasets. Moreover, our review of the literature reveals that what is often referred to as "feature shift" in many papers is essentially a form of distribution shift. For example, [2] regards covariate shift as feature shift. For other related works on feature shift discussed in our paper, please refer to our response to Reviewer 8NVr Q10.
[1] Liu, Si-Yang, et al. TALENT: A Tabular Analytics and Learning Toolbox. arXiv preprint arXiv:2407.04057, 2024.
[2] He, Huan, et al. Domain adaptation for time series under feature and label shifts. International Conference on Machine Learning, 2023.
Summary
This paper proposes a benchmark to study "feature shift" in the tabular data setting -- where the set of features (i.e. columns) available to the model change at test time vs. training time. The authors identify a set of 12 datasets (four binary classification/multiclass classification/regression datasets), along with different shift settings (random, most/least relevant, single shift), and conduct a set of benchmarking experiments across these datasets, highlighting some empirical findings.
Overall, this is a much-needed direction of high relevance to the tabular data research + practitioner community. However, I have some concerns about design decisions in the study, which seem to make it difficult to rely on the conclusions. At the very least, it is critical to see empirical validation that the authors' decisions do not overly bias the results. Additionally, the benchmark seems fairly small, and the empirical results are noisy, which can make the results feel somewhat unreliable. Some clarification is also needed on various details.
Major comments
- In some sense, each "table" is a data point in this study. As such, the authors have 12 data points (four binary classification/multiclass classification/regression datasets), which makes it difficult to rely on the noisy results without clear trends (both Figure 3 and Figure 4 show data that clearly do not conform to the line of best fit drawn on the plot). The benchmark would feel much more reliable if it comprised more tables, or if the authors were at least able to better explain the results (for example, why are there multiple, apparent "trajectories" or lines in Figure 3?). I would suggest that the authors incorporate more high-quality tables from existing benchmarks, including some of those referenced in the paper (such as the Grinsztajn benchmark, TabZilla https://arxiv.org/abs/2305.02997, or the UniPredict suite https://huggingface.co/datasets/mlfoundations/tabula-8b-eval-suite/tree/main/unipredict).
- API design: most users won't want to re-run your experiments (shown in "Example Command" -- this should be a LaTeX code listing). Instead, they will want to get the data and run their own algorithms on it. They may also want to contribute new data to the benchmark and run your implementations against it. Please comment on how this is supported, and consider adding this to your documentation.
- The authors make a significant design decision -- performing mean-value imputation for missing features -- which seems likely to have a major impact on the study's results. In particular, replacing values in the existing columns, after training on the true values, is not the same as "removing" the column. It revalues that column, and changes the model's predictions to $p(y \mid \{x_j\}_{j \neq 1}, x_1 = \bar{x}_1)$ for some imputed feature value $\bar{x}_1$. In this case, it seems an almost trivial observation that "removing" (imputing) important features leads to a linear drop, as a model is likely to learn to rely on these important features, and mean imputation then distorts the distribution of X along that dimension, leading to an unsurprising performance drop. If the authors want to demonstrate that this is a reasonable design decision, I would suggest (1) providing evidence from other studies that this is a realistic and reliable method for feature "removal", and (2) showing that the results of the study are robust to this decision (for example, by performing other forms of removal or imputation and demonstrating that the key claims still hold). I, however, am extremely skeptical of the results without such validation.
- Many methods used in this study are capable of natively modeling missing data. However, the authors do not investigate this decision, despite the fact that any practitioner would likely use this in practice if the schema was known to change. Please also include variants of models where they are trained with missing data (at train time, for example, by randomly masking x% of cells to missing) and then the target column is set to missing at test time.
- It would be helpful to provide "oracle" measures of performance on the shifted data, by training a model directly on the shifted data and evaluating on a held-out set. This would help understand how much of the performance degradation is due to information loss (an irreducible source of error), vs. how much is due to lack of robustness (it is this latter form of drop that the authors seem interested in minimizing/understanding).
Minor comments
- I think the name "feature shift" is not an accurate descriptor of the setting being studied, and I am not aware of this terminology being used in prior tabular studies. Perhaps "column shift" or "schema shift" would be better names.
- Details of the Pearson correlation method (L168) belong in the main text, not in the supplementary.
- Why do the authors use a "sampling with replacement" approach to single shift, instead of simply removing one feature randomly? Please explain or revise.
- While the hyperparameter grids are given, the HPO method is not -- please clarify how hyperparameters are tuned.
- The design of iteratively removing important/unimportant features is also in the Gardner et al. 2024 paper.
Typos etc.
- Abstract: "Python codes" --> Python code
- L079: "feature-shift challenge" --> missing period.
Questions for Authors
See above.
Claims and Evidence
See above.
Methods and Evaluation Criteria
See above.
Theoretical Claims
See above.
Experimental Design and Analyses
See above.
Supplementary Material
No.
Relation to Broader Scientific Literature
See above.
Essential References Not Discussed
See above.
Other Strengths and Weaknesses
See above.
Other Comments or Suggestions
See above.
Dear Reviewer 8NVr:
Thank you very much for your valuable feedback on our paper. We will take your reviews seriously and make the necessary revisions or additions to the final version.
Q1: It is difficult to rely on noisy results without clear trends.
We would like to clarify that an $|R|$ greater than 0.7 can be considered a moderate linear correlation [1]. TableShift [2] Figure 5 also draws a conclusion of linear correlation based on an $|R|$ value of 0.7. In our paper, the $|R|$ value in Figure 3 is 0.7405, and the $|R|$ value in Figure 4 is 0.6. Therefore, despite the presence of noisy data, relevant conclusions can still be drawn based on these values. We will add the evaluation results of more datasets to mitigate the noise.
Q2: why are there multiple trajectories in Figure 3?
The reason for multiple trajectories in Figure 3 is that each trajectory consists of points derived from the same dataset. During the fitting process, the results of all datasets were combined, resulting in each dataset forming a distinct trajectory. It is worth noting that these trajectories reflect the relationship between feature importance and model performance across different datasets. Specifically, each dataset's trajectory demonstrates how its specific feature importance influences the model's performance.
To better support our conclusions, this link provides correlation and accuracy plots for 12 different datasets. Below are the $|R|$ values for the relationship between correlation and accuracy for each of the 12 datasets.
| Dataset | Credit | Electricity | Heart | Miniboone | Eyemovements | Iris | Penguins | Jannis | Abalone | Bike | Concrete | Laptop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\lvert R \rvert$ | 0.8597 | 0.8939 | 0.9244 | 0.9482 | 0.9796 | 0.9646 | 0.9377 | 0.9891 | 0.6866 | 0.9656 | 0.8360 | 0.9896 |
It indicates that although levels of feature importance vary across different datasets, they all support our research conclusion, namely, that there is a significant linear correlation between feature importance and model performance. These trajectories further confirm the generality and reliability of our conclusion.
Q3: Incorporate more high-quality tables.
Please refer to our response to Reviewer B7Cf Q4.
Q4: API design
We have already provided the --export_dataset option in our paper. Users can set this option to True to export specific versions of the datasets they wish to evaluate (e.g., single-column missing, missing to a certain extent, and all possible missing scenarios).
The README.md in Supplementary Material already explains how users can add new datasets and new models. We will further elaborate on the code functions in the final version.
Q5: Major comment 3&4.
- Reviewer B7Cf has mentioned that "The mean-value imputation strategy used is a common approach and therefore is a valid choice." We also found that benchmarks such as TALENT [3] also use mean imputation as a method for handling missing values.
- We compared the performance of various models using their own native missing-value handling, random imputation, and mean imputation. We present the average performance of a representative model (CatBoost) under the different imputation methods.

| Shift degree (CatBoost) | NaN | Mean | Random |
|---|---|---|---|
| 0% | 0.845 | 0.851 | 0.818 |
| 9% | 0.820 | 0.826 | 0.807 |
| 18% | 0.792 | 0.801 | 0.767 |
| 27% | 0.763 | 0.775 | 0.739 |
| 36% | 0.733 | 0.747 | 0.702 |
| 45% | 0.702 | 0.716 | 0.665 |
| 54% | 0.670 | 0.682 | 0.630 |
| 63% | 0.637 | 0.643 | 0.608 |
| 72% | 0.600 | 0.596 | 0.574 |
| 81% | 0.538 | 0.541 | 0.502 |
Our experiments show that the model performs best under mean imputation, and its performance declines under all three types of imputation, indicating that the model still faces challenges from feature shift. Detailed data are presented in this link, and we strongly recommend viewing it.
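For clarity, the three test-time treatments compared above can be sketched roughly as follows; the column name, the numeric-only assumption, and the uniform random draw are illustrative choices, not our exact implementation.

```python
import numpy as np
import pandas as pd


def shift_column(X_test: pd.DataFrame, col: str, mode: str,
                 train_col: pd.Series, seed: int = 0) -> pd.DataFrame:
    """Return a copy of X_test with `col` shifted under one of three treatments."""
    rng = np.random.default_rng(seed)
    X = X_test.copy()
    if mode == "nan":        # rely on the model's native missing-value handling
        X[col] = np.nan
    elif mode == "mean":     # mean-value imputation (TabFSBench's default)
        X[col] = train_col.mean()
    elif mode == "random":   # random values drawn from the training range
        X[col] = rng.uniform(train_col.min(), train_col.max(), size=len(X))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return X
```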
Q6: Major comment 5.
Please refer to our response to Reviewer WumN Q2.
Q7: Minor comment 1&2 & typos ec.
We will make revisions in the final version.
Q8: Minor comment 3.
The purpose of the single-shift experiment is to test the impact of the absence of features with different importance levels on model performance. "Removing one feature randomly" is the experimental setting for a feature shift degree of $1/n$ (where $n$ is the number of columns) in our random shift experiment.
Q9: Minor comment 4.
Please refer to our response to Reviewer wA1o Q3.
Q10: Minor comment 5.
We have already cited the paper you mentioned on lines 683-684 of Section A.3 on page 13 of our submitted paper.
[1] Iversen, Gudmund R, et al. Statistics: The conceptual approach. Springer Science & Business Media, 2012.
[2] Gardner, Josh, et al. Benchmarking distribution shift in tabular data with tableshift. Advances in Neural Information Processing Systems, 2023: 53385-53432.
[3] Liu, Si-Yang, et al. TALENT: A Tabular Analytics and Learning Toolbox. arXiv preprint arXiv:2407.04057, 2024.
This paper establishes a new benchmark for tabular data, focusing on feature shift issues, where feature increments and decrements occur between the training and testing phases. This benchmark includes three types of tasks across twelve datasets and evaluates twenty-four tabular methods. The experimental results reveal three key observations that can guide future research in this area.
Questions for Authors
- What is the performance upper bound for each dataset when the feature shift issue occurs? When feature shift happens, the optimal performance of each dataset should change, and understanding this optimal performance is helpful to gauge the gap between existing methods and a perfect method.
- Could the authors provide some examples of feature shift issues occurring in the real world? This would be helpful for understanding the importance of this problem.
Claims and Evidence
The authors' major claims are based on three key observations from the experiments: (1) tabular models have limited performance in handling feature-shift scenarios; (2) feature importance shows a linear trend with performance degradation; (3) performance under feature shift is related to performance in a closed environment. The authors have demonstrated these key observations through their experiments.
Methods and Evaluation Criteria
This is a benchmark paper, and as such, it does not propose any new method to address the feature shift problem. The evaluation in this paper is comprehensive and accurate. The authors use accuracy and ROC-AUC for classification tasks and RMSE for regression tasks. Additionally, the performance gap is illustrated to demonstrate model robustness. Extensive analyses are provided to highlight the shortcomings of existing methods and suggest future directions.
Theoretical Claims
The paper does not present theoretical claims.
Experimental Design and Analyses
The experimental design and corresponding analyses are comprehensive because this is a benchmark paper. The three key observations are fully supported by the experimental results.
Supplementary Material
The supplementary material includes a URL that links to an anonymous Git repository. This repository contains the full code and instructions for using the code in this paper.
Relation to Broader Scientific Literature
This paper can advance tabular data research towards deployment in real-world applications. Specifically, this research has revealed a weakness in existing tabular methods, that is, feature shift leads to performance degradation issues, and established a benchmark to comprehensively evaluate the performance of each method, guiding future method development.
Essential References Not Discussed
No essential reference is omitted.
Other Strengths and Weaknesses
Strengths:
- This paper studies a novel and practical problem for tabular classification and regression tasks. In tabular tasks, feature failure is a common problem. With progress on this problem, tabular methods can be better deployed in real-world applications.
- The experimental results and corresponding analyses in this paper are comprehensive. The analyses reveal three observations, which can guide future research.
- The authors provide a large-scale benchmark for the studied problem, containing three types of tasks and twelve datasets evaluated by four metrics. This can benefit the whole tabular community.
Weakness:
Although the studied problem and conducted experiments offer a certain contribution, the authors should provide datasets that naturally contain feature shift issues rather than simply analyzing performance on synthetic datasets.
Other Comments or Suggestions
Please see the strengths and weaknesses above.
Dear Reviewer WumN:
Thank you very much for your valuable feedback on our paper. We will take your reviews seriously and make the necessary revisions or additions to the final version.
Q1: The authors should provide datasets that naturally contain feature shift issues rather than simply analyzing performance on synthetic datasets.
Regrettably, there currently exists no dataset specifically designed for feature shift, unlike Tableshift [1] which was developed for distribution shifts. However, we have preliminarily constructed a feature-shifted dataset based on the Heart dataset. Given that different features in the original dataset require distinct measurement instruments, we categorized the features into three groups: basic features, electrocardiogram (ECG) features, and exercise stress test features.
In the constructed feature-shifted Heart dataset, both the training set and the test set step 0 contain all features. However, patients in the test set step 1 lack ECG measurements, resulting in the absence of RestingECG and ST_Slope features. Similarly, patients in the test set step 2 did not undergo an exercise stress test, leading to the absence of ExerciseAngina and Oldpeak. A subset of examples is provided in this link.
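A rough sketch of how such stepwise splits can be constructed is shown below; the `heart_train`/`heart_test` frames, the choice to fill the unavailable measurement group from training statistics (rather than dropping the columns), and the mean/mode fill are assumptions for illustration.

```python
import pandas as pd

ECG_FEATURES = ["RestingECG", "ST_Slope"]          # unavailable in test set step 1
STRESS_FEATURES = ["ExerciseAngina", "Oldpeak"]    # unavailable in test set step 2


def drop_measurement_group(test_df: pd.DataFrame, train_df: pd.DataFrame,
                           missing_cols) -> pd.DataFrame:
    """Simulate patients lacking a measurement group by filling it from training statistics."""
    step = test_df.copy()
    for col in missing_cols:
        if pd.api.types.is_numeric_dtype(train_df[col]):
            step[col] = train_df[col].mean()
        else:
            step[col] = train_df[col].mode()[0]
    return step


# step0 = heart_test.copy()                                          # all features available
# step1 = drop_measurement_group(heart_test, heart_train, ECG_FEATURES)
# step2 = drop_measurement_group(heart_test, heart_train, STRESS_FEATURES)
```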
Additionally, we evaluated CatBoost on this dataset:
| Metric | Test Set Step 0 | Test Set Step 1 | Test Set Step 2 |
|---|---|---|---|
| Accuracy | 0.879 | 0.658 | 0.823 |
Note that for meaningful evaluation of feature-shifted datasets, models must be assessed under specific partitioning schemes. Applying the four experimental settings proposed in our paper would undermine the unique characteristics and practical relevance of such datasets. In the final version, we will include a comprehensive evaluation of the feature-shifted Heart dataset and will open-source the dataset for broader research use.
Q2:What is the performance upper bound for each dataset when the feature shift issue occurs?
We evaluated the performance upper bound of the models by training directly on the shifted data and assessing their performance on a held-out test set. Due to time constraints, we did not test the performance upper bounds of LLMs and tabular LLMs. The detailed performance for all evaluated models on 9 datasets is provided in this link, with the average performance upper bound and feature-shift performance on the Heart dataset presented here.
| Model | Average Vanilla Performance | Average Upper-Bound Performance |
|---|---|---|
| LightGBM | 0.700 | 0.707 |
| XGBoost | 0.698 | 0.761 |
| CatBoost | 0.718 | 0.767 |
| TabPFN | 0.778 | 0.804 |
| DANets | 0.617 | 0.628 |
| MLP | 0.775 | 0.783 |
| NODE | 0.599 | 0.613 |
| ResNet | 0.796 | 0.808 |
| SwitchTab | 0.810 | 0.821 |
| TabCaps | 0.743 | 0.777 |
| TabNet | 0.604 | 0.616 |
| TANGOS | 0.790 | 0.794 |
| AutoInt | 0.629 | 0.642 |
| DCNv2 | 0.776 | 0.786 |
| FT-Transformer | 0.661 | 0.671 |
| GrowNet | 0.578 | 0.589 |
| Saint | 0.770 | 0.778 |
| SNN | 0.718 | 0.729 |
| TabTransformer | 0.502 | 0.515 |
| TabR | 0.806 | 0.818 |
| ModernNCA | 0.781 | 0.819 |
The results demonstrate that, under the same degree of feature shift, models trained on the original dataset exhibit inferior performance compared to those trained directly on the shifted dataset. This indicates that although training on the original dataset provides more comprehensive information, feature shift significantly compromises model robustness, resulting in performance degradation relative to models trained on the shifted data. Furthermore, as the degree of feature shift increases, the performance gap between the original and shifted settings becomes progressively larger, indicating diminishing model robustness.
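A minimal sketch of this upper-bound protocol is given below; CatBoost and the 80/20 split are illustrative choices, and any benchmark model could be substituted.

```python
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def upper_bound_score(X_shifted, y, seed=0):
    """Retrain directly on the shifted feature set and score on a held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X_shifted, y, test_size=0.2,
                                              random_state=seed)
    model = CatBoostClassifier(verbose=0)
    model.fit(X_tr, y_tr)            # the removed features are never seen during training
    return accuracy_score(y_te, model.predict(X_te))
```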
Q3:Could the authors provide some examples of feature shift issues occurring in the real world?
Section 2.2 of our paper employs forest disease monitoring as a case study to demonstrate how sensor degradation leads to a reduction in available features. As further evidenced by the designed heart dataset in Q1, incomplete medical examinations may result in missing diagnostic indicators (features) due to the absence of specific equipment.
The feature shift phenomenon also manifests prominently in financial and transportation domains:
- Finance: Stock prediction models trained on comprehensive features (e.g., financial ratios, macroeconomic indicators) may encounter missing features (e.g., market sentiment indices) during real-world deployment due to unforeseen events.
- Transportation: Accident prediction models relying on features like road conditions and weather data may experience partial feature absence caused by sensor failures or insufficient data collection.
[1] Gardner, Josh, Zoran Popovic, and Ludwig Schmidt. Benchmarking distribution shift in tabular data with tableshift. Advances in Neural Information Processing Systems, 2023: 53385-53432.
This paper proposes a new benchmark, "TabFSBench", for tabular data learning, especially for feature shift in open environments. Feature shift means that features can be decremented or incremented. The authors select open-source datasets from OpenML and Kaggle's dataset library, covering three curated task types (binary classification, multi-class classification, and regression) and various domains such as finance, healthcare, and geology. Four feature-shift scenarios are considered: single shift, most/least relevant shift, and random shift. Some insightful empirical analyses are provided. Moreover, the paper provides callable APIs.
update after rebuttal
My problem has been solved, and I decided to keep my score.
Questions for Authors
In the paper, the benchmark doesn’t consider the feature increment. I think it is also a promising direction to study how to better exploit the new feature information.
Claims and Evidence
Yes, the claims are well supported.
Methods and Evaluation Criteria
Yes, the paper evaluates the performance of different models with ROC-AUC and RMSE, which is reasonable for the evaluation.
Theoretical Claims
The paper doesn’t include theoretical results, but the empirical analysis from the results are reasonable and insightful.
Experimental Design and Analyses
Yes, I have checked the experimental design and analyses. The authors conducted extensive experiments to evaluate the performance of different methods in the feature-shift scenarios. The compared methods include tree-based methods, deep learning methods, LLM methods, and tabular LLM methods. The experimental design is sound and the empirical analyses are insightful, pointing out the challenges that current methods still face in feature-shift scenarios. I believe these results could help promote related research.
Supplementary Material
Yes, I reviewed additional details and results about the experiments in the supplementary material.
Relation to Broader Scientific Literature
This paper is related to the field of tabular data learning, especially benchmarks for tabular learning. Different from previous methods or benchmarks, this paper is the first to study tabular data learning under feature shift. The authors present a definition of feature shift in tabular data learning. This problem is important and unique to tabular data (for images or text, there is no feature decrement or increment). Therefore, it is important to construct a benchmark to help evaluate the performance of tabular data learning methods under the feature shift problem. The benchmark could promote related research.
Essential References Not Discussed
No.
Other Strengths and Weaknesses
It is good to see that the authors provided an easy-to-use API; I believe this could help related researchers. I hope the authors could release a leaderboard for the tabular data learning methods on this benchmark.
Other Comments or Suggestions
No.
Dear Reviewer Q7p8:
Thank you very much for your valuable feedback on our paper. We will take your reviews seriously and make the necessary revisions or additions to the final version.
Q1: I hope the authors can release a leaderboard for the tabular data learning methods on this benchmark.
We have currently implemented a leaderboard for TabFSBench and incorporated additional evaluation results for TabPFN v2. This link provides access to the TabFSBench homepage we designed. Upon completion of the paper review process, public access will be granted. Furthermore, we will continuously update the benchmark with the evaluation results of new models. Researchers are also cordially invited to contribute their own evaluation results on additional datasets or models. Additionally, we will expand TabFSBench in the future in the following two directions:
1. Regarding the datasets selected in TabFSBench.
We selected 12 datasets from Grinsztajn[1] and TabZilla[2]. They exhibit substantial heterogeneity in scale, domain, and task characteristics, deliberately encompassing diverse potential scenarios of feature shift. To comprehensively evaluate the challenges posed by feature shifts, we have implemented four distinct experimental configurations that collectively enhance both the breadth and depth of our benchmark assessment. Multiple experimental repetitions were conducted to mitigate stochastic variability.
Due to rebuttal time constraints, we were unable to review the large table benchmark in a short period of time. We will incorporate evaluation results based on the Grinsztajn benchmark into the final version.
2. Regarding future work for TabFSBench.
- We have established and maintained a project homepage and a comprehensive rank leaderboard. We continuously update the performance evaluation results of newly released models or datasets (for example, the benchmark test data of the recently included TabPFN v2 model have been incorporated), ensuring that the research community can access the latest evaluation information in a timely manner. This homepage will be made open source in the subsequent phase.
- To enhance the scalability of cross-model comparison research, TabFSBench has been specifically designed with a feature-shift dataset export module. Through command-line parameter settings, this module can export datasets under various feature-shift scenarios, thereby effectively supporting researchers in conducting evaluations on their own models (for specific implementation details, please refer to our response to Reviewer 8NVr Q4).
- In the anonymized code repository provided in the Supplementary Materials, TabFSBench has implemented a plug-and-play interface for importing new datasets and new models. This enables researchers to conveniently test their custom datasets or models within the TabFSBench framework (for specific implementation details, please refer to our response to Reviewer 8NVr Q4).
- We encourage researchers to contribute their own evaluation results. Subsequently, we will also be committed to constructing feature-shift datasets with practical significance and opening up both public and private evaluation leaderboards. This will further promote research and evaluation of the feature-shift scenario within the community (for other relevant content, please refer to our response to Reviewer WumN Q1/3).
Q2: In the paper, the benchmark doesn’t consider the feature increment.
Section 2.2 of the paper indicates that this study primarily focuses on the impact of feature shifts on model performance. However, tabular machine learning models evaluated in this work are inherently incapable of handling newly added features in feature-increment scenarios, as they require consistent input and output dimensions. Consequently, in such scenarios, these models automatically disregard the new features while maintaining their original performance.
In the future work, we plan to extend the evaluation scope of TabFSBench by incorporating additional datasets and model evaluations. Specifically, we will include specialized models designed for feature increment scenarios (e.g., [3][4]) to assess their performance improvements under such conditions.
[1] Léo, Grinsztajn, et al. Why do tree-based models still outperform deep learning on typical tabular data? Advances in Neural Information Processing Systems, 2022: 507-520.
[2] McElfresh, Duncan, et al. When Do Neural Nets Outperform Boosted Trees on Tabular Data? Advances in Neural Information Processing Systems, 2023: 76336-76369.
[3] Zhang, Zhen-Yu, et al. Learning with feature and distribution evolvable streams. International Conference on Machine Learning, 2020.
[4] You, Dianlong, et al. Online learning for data streams with incomplete features and labels. IEEE Transactions on Knowledge and Data Engineering, 2024.
This paper contains a new benchmark containing tabular datasets with feature shifts, whereby features are removed or added at test time. The paper observes that performance degradation occurs as feature shifts happen and notes that some models may be more robust than others. Reviewers criticized the statistical analysis, feature imputation methods as a choice for handling missing features, the fact that all feature shifts are synthetic, and dataset shortage. I will also point out that another paper has looked at feature shifts for tabular data but in the transfer learning setting (“Transfer Learning with Deep Tabular Models”), and this is probably worth discussing. While the paper received mixed reviews, I am inclined to accept it since these criticisms seem non-terminal, and the authors addressed the criticisms during rebuttal.