Language Models Are Good Tabular Learners
Abstract
Reviews and Discussion
The authors present a method called TDTransformer to handle tabular data. They use different embedding methods for categorical, numeric and binary columns: categorical is embedded as text, numeric uses an extension of the PLE method to encode numbers in the [-1, 1] range using quantiles, and binary columns are just 0 or 1. They explore removing positional embeddings. TDTransformer outperforms XGBoost on the benchmark they consider.
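For concreteness, here is a minimal sketch of quantile-based piece-wise linear encoding as summarized above; the rescaling into [-1, 1] and all names are illustrative and not the authors' implementation.

```python
import numpy as np

def ple_encode(x, bin_edges):
    """Illustrative quantile-based PLE: bins entirely below x saturate at 1,
    bins entirely above x stay at -1, and the bin containing x gets a
    fractional value rescaled into [-1, 1]."""
    num_bins = len(bin_edges) - 1
    enc = np.full(num_bins, -1.0)
    for i in range(num_bins):
        lo, hi = bin_edges[i], bin_edges[i + 1]
        if x >= hi:
            enc[i] = 1.0
        elif x >= lo:
            enc[i] = 2.0 * (x - lo) / (hi - lo) - 1.0
    return enc

# Bin edges come from the empirical quantiles of a numerical column.
column = np.random.default_rng(0).normal(size=1000)
edges = np.quantile(column, np.linspace(0.0, 1.0, 5))   # 4 quantile bins
print(ple_encode(column[0], edges))
```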
Strengths
- Strong performance on OpenML against XGBoost/CatBoost
Weaknesses
- The presentation is quite lacking, also because of janky notation. For example, in Eq. (5) it seems like the linear transformation is applied to the concatenation of PLE features for all cells in the columns, which seems absurd, because there should be a variable number of cells for each table. All equations that present the architecture are similarly unclear.
- It's also unclear what the difference is between SCL and SSCL in Eq. (11) and Eq. (12). In what sense does one use labels and the other one doesn't?
- I am not an expert in the field, but a cursory look at arXiv pointed out a missing comparison system: https://arxiv.org/pdf/2403.01841, published at ICLR last year.
Questions
- It's not clear whether any tuning has been performed on XGBoost or default hyperparameters have been used, which can have a dramatic effect on performance: see Table 1 in https://arxiv.org/pdf/2403.01841. Can you clarify this issue?
- How does CTA work? Tables are (arguably) permutation invariant wrt rows, but removing positional embeddings altogether would make the two following tables have the same representations:
(1)
| a | b |
|---|---|
| 0 | 0 |
| 1 | 1 |
(2)
| a | b |
|---|---|
| 0 | 1 |
| 1 | 0 |
- Given that you use 0 to represent `false`, would your approach be able to distinguish the two following tables?
(3)
| a | b |
|---|---|
| 1 | 0 |
| 1 | 0 |
(4)
| a | c |
|---|---|
| 1 | 0 |
| 1 | 0 |
- Is PLE invariant to scaling?
- Have the authors tried a simple LLM baseline? This would be more to contextualise the practical implications of the paper, rather than its scientific validity.
The authors thank the reviewer for the comments. Here is the point-by-point response:
- Weakness 1: We focus on relational tables, where the number of cells per column or per row does not change. In other words, within a given table, the number of cells per column is fixed and the number of cells per row is fixed as well. We do not consider cross-table tasks, so it is permissible for the number of cells to change when moving to other tables. Essentially, different datasets correspond to different models. Here "different models" has two meanings: (1) the model weights are different; (2) the sequence length for numerical columns (or binary columns) is different. There is no problem when changing the table, since we do not pre-train the model on one table and fine-tune it on another table.
- Weakness 2: Both SCL and SSCL can be used in the pre-training process. As indicated in Eq. 11 and Eq. 12, SSCL considers positive pairs only between a table row and its corresponding corrupted counterpart. SCL considers positive pairs between a table row and the corrupted rows that have the same label as that row.
- Weakness 3: We tried to obtain the performance of the reference that the reviewer points out. However, there are more than 3 bugs in the official repo (https://github.com/jyansir/tp-berta) preventing us from obtaining the performance. Bugs we found include:
- `bin/tpberta_modeling.py` line 79 indicates that `embeddings` has the same sequence length as `input_scales`. In `lib/data_utils.py`, `embeddings` is related to `num_fix_part` and `input_scales` is related to `num_input_scales`. On line 379 of `data_utils.py`, the length of `num_fix_part` is equal to `len(efn) + 2` because the original code is `[cls_token_id] + efn + [mask_token_id]`. On line 384 of `data_utils.py`, the length of `num_input_scales` is `1 + len(efn)`. When we changed `1 + len(efn)` to `2 + len(efn)`, the error message was gone.
- In `scripts/pretrain/pretrain_tpberta.py`, a method of the `Regulator` class is called on line 154 as `regulator.trainer_end(args, model)`. However, in the `Regulator` class, this method is defined as `def trainer_end(self, model: nn.Module)` on line 461 of `lib/optim.py`. When we changed it to `def trainer_end(self, args, model)`, the error message was gone.
- After pre-training, when loading the pre-trained model weights, one of the related code lines is `state_dict['tpberta.embeddings.position_ids']` (`bin/tpberta_modeling.py`). When we examined the checkpoint, there is no `position_ids` in the checkpoint. What the checkpoint has is:
odict_keys(['tpberta.embeddings.word_embeddings.weight', 'tpberta.embeddings.position_embeddings.weight', 'tpberta.embeddings.token_type_embeddings.weight', 'tpberta.embeddings.LayerNorm.weight', 'tpberta.embeddings.LayerNorm.bias', 'tpberta.intra_attention.W_q.weight', 'tpberta.intra_attention.W_q.bias', 'tpberta.intra_attention.W_k.weight', 'tpberta.intra_attention.W_k.bias', 'tpberta.intra_attention.W_v.weight', 'tpberta.intra_attention.W_v.bias', 'tpberta.intra_attention.W_out.weight', 'tpberta.intra_attention.W_out.bias', 'tpberta.encoder.layer.0.attention.self.query.weight', 'tpberta.encoder.layer.0.attention.self.query.bias', 'tpberta.encoder.layer.0.attention.self.key.weight', 'tpberta.encoder.layer.0.attention.self.key.bias', 'tpberta.encoder.layer.0.attention.self.value.weight', 'tpberta.encoder.layer.0.attention.self.value.bias', 'tpberta.encoder.layer.0.attention.output.dense.weight', 'tpberta.encoder.layer.0.attention.output.dense.bias', 'tpberta.encoder.layer.0.attention.output.LayerNorm.weight', 'tpberta.encoder.layer.0.attention.output.LayerNorm.bias', 'tpberta.encoder.layer.0.intermediate.dense.weight', 'tpberta.encoder.layer.0.intermediate.dense.bias', 'tpberta.encoder.layer.0.output.dense.weight', 'tpberta.encoder.layer.0.output.dense.bias', 'tpberta.encoder.layer.0.output.LayerNorm.weight', 'tpberta.encoder.layer.0.output.LayerNorm.bias', 'tpberta.encoder.layer.1.attention.self.query.weight', 'tpberta.encoder.layer.1.attention.self.query.bias', 'tpberta.encoder.layer.1.attention.self.key.weight', 'tpberta.encoder.layer.1.attention.self.key.bias', 'tpberta.encoder.layer.1.attention.self.value.weight', 'tpberta.encoder.layer.1.attention.self.value.bias', 'tpberta.encoder.layer.1.attention.output.dense.weight', 'tpberta.encoder.layer.1.attention.output.dense.bias', 'tpberta.encoder.layer.1.attention.output.LayerNorm.weight', 'tpberta.encoder.layer.1.attention.output.LayerNorm.bias', 'tpberta.encoder.layer.1.intermediate.dense.weight', 'tpberta.encoder.layer.1.intermediate.dense.bias', 'tpberta.encoder.layer.1.output.dense.weight', 'tpberta.encoder.layer.1.output.dense.bias', 'tpberta.encoder.layer.1.output.LayerNorm.weight', 'tpberta.encoder.layer.1.output.LayerNorm.bias',
... <deeper layers>
We have contacted the authors of the mentioned reference. If we can get the code to work before the deadline, we will update the manuscript.
Due to the time limit, we are sorry that we might not be able to provide the performance on the reference the reviewer mentioned, although we have spent a long time trying to make it work. When debugging, we needed to make sure not only that there was no bug but also that the modification would not cause incorrect calculations.
- Question 1: We use the default hyperparameters for XGBoost. When we ran experiments across 76 tables, we used the same hyperparameters. Ideally, 76 tables might correspond to 76 sets of optimal hyperparameters. We want to make sure the model complexity is the same across different datasets, so we do not do hyperparameter tuning.
- Question 2: CTA adds positional encoding to the embedding of categorical columns, as indicated by Eq. 10. The categorical column embeddings serialize columns into natural language sentences; hence, we add positional encoding there. The numerical and binary column embeddings are similar to sentence embeddings, where the embedding aggregates information from the whole sequence; hence, we do not add positional encoding to them, in order to enforce the permutation invariance of tables. In the ablation study, we do find that removing positional encoding has a remarkable effect on the embedding space, as shown in Figure 4. We would like to emphasize that CTA does not entirely remove positional encoding; it only removes the positional encoding for numerical column embeddings and binary column embeddings.
Our method does not work on cross-table tasks. Besides, the embedding is generated per row, and the prediction is made by feeding the row embedding z_i to the classification head. Although the four rows in the two example tables are the same, we do not think the embeddings will be the same, since the column names are also encoded (see the response to Question 3 below).
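To make the CTA description above concrete, here is a minimal PyTorch sketch of what we understand column-type aware positional encoding to be; the module name, shapes and the learned position table are illustrative placeholders, not the paper's code.

```python
import torch
import torch.nn as nn

class CTAPositionalEncoding(nn.Module):
    """Sketch of column-type aware (CTA) positional encoding as described above:
    positions are added only to the serialized categorical-column tokens, while
    numerical and binary column embeddings stay position-free so the table
    representation remains permutation invariant."""

    def __init__(self, d_model, max_len=512):
        super().__init__()
        self.pos = nn.Embedding(max_len, d_model)

    def forward(self, cat_emb, num_emb, bin_emb):
        # cat_emb: (batch, L_cat, d)  tokens of the serialized categorical text
        # num_emb: (batch, L_num, d)  one embedding per numerical column
        # bin_emb: (batch, L_bin, d)  one embedding per binary column
        positions = torch.arange(cat_emb.size(1), device=cat_emb.device)
        cat_emb = cat_emb + self.pos(positions)   # positional information added only here
        return torch.cat([cat_emb, num_emb, bin_emb], dim=1)
```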
- Question 3: Yes, our method does take column names into consideration. That is a strength of our method, which utilizes the power of language models for semantic understanding. Most recent tabular deep learning works use an embedding layer to encode cells, which does not consider column names or the semantics of cell values. Our method does consider them. Please refer to Section 3.1 for column embeddings.
- Question 4: PLE depends on the distribution of cell values within the same column. Before PLE, we normalize cell values during data preprocessing. We tested z-score and min-max normalization. For min-max normalization, scaling does not change the normalized values; hence, PLE does not change. For z-score normalization, scaling has a limited effect on the normalized values.
- Question 5: Using an LLM would inevitably introduce many more trainable parameters, which makes it less competitive compared to tree-based methods such as XGBoost [1]. We tested RoBERTa [2] as the backbone model (with a number of layers and a hidden dimension much smaller than the defaults). The result is reported in Appendix A.2.
Could the reviewer give examples of a simple LLM baseline?
[1] Chen, Tianqi, and Carlos Guestrin. "XGBoost: A scalable tree boosting system." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.
[2] Liu, Yinhan, et al. "RoBERTa: A robustly optimized BERT pretraining approach." arXiv preprint arXiv:1907.11692 (2019).
Dear Reviewer,
Thank you for your time and efforts in reviewing our submission. We have carefully addressed all the points you raised in the attached rebuttal and made every effort to clarify and improve the manuscript based on your suggestions.
We kindly request you to review our responses and let us know if there are additional concerns or points requiring clarification.
Authors
Dear Reviewer,
We appreciate the time and effort you have invested in reviewing our work. We submitted a detailed response in official response. We kindly request if it would be possible to review our rebuttal and provide your feedback.
Authors
Dear Reviewer,
As the rebuttal period is drawing to a close, I would appreciate your response to the authors' rebuttal at your earliest convenience.
Best Regards,
Area Chair
We thank the reviewer for the feedback. Below is the point-by-point response:
- Weakness 1: Our method can handle arbitrary-sized tables.
We concatenate the embeddings of different types of columns as indicated in Equation 8, which allows different numbers of columns.
We tested 76 tables. The numbers of numerical, categorical and binary columns are shown in Appendix A.3 (please refer to the Supplementary Material). The number of columns does vary across those 76 tables.
- Weakness 2: Contrastive learning essentially builds positive and negative pairs. In our method, we make a prediction for each table row based on the embedding of that row, as indicated in the task formulation in Section 3. Pairs are generated by corrupting table rows. For example, if there are table rows x_i and x_j in a table, their corrupted counterparts are x̃_i and x̃_j. In SCL, if x_i and x_j have the same label, both (x_i, x̃_i) and (x_i, x̃_j) are considered positive pairs, whereas in SSCL, (x_i, x̃_j) is considered a negative pair. Hence, in Equation 12 for SCL we have a summation over the same-label rows x_j, while there is no such summation in Equation 11 for SSCL.
- Weakness 3: When we ran the baselines, we used the default fixed hyperparameters. When changing hyperparameters such as the model depth, the model complexity varies. We want to make sure the model complexity is the same across datasets (the model complexity of different methods is shown in Appendix Figure 7). Hence, we use fixed hyperparameters instead of the optimal hyperparameters for each dataset, i.e., the hyperparameters are the same across all 76 tables.
- Question 4: Table (1) and Table (2) will have different embeddings, as the rows are different (one table has rows of [0, 0] and [1, 1] while the other one has [0, 1] and [1, 0]). The prediction is based on the table row, as indicated in the task formulation in Section 3.
- Additional comment: Our method has only 32.31 M trainable parameters, as shown in Appendix Figure 7, whereas a prevalent LLM such as Llama 7B [1] has 7B trainable parameters. In addition, our method trains the model from scratch and does not rely on prior knowledge from massive pre-training datasets. Hence, we did not compare our method against prompting methods on pre-trained LLMs.
We are happy to provide further clarifications or modifications if the reviewer has more concerns.
[1] Touvron, Hugo, et al. "Llama: Open and efficient foundation language models." arXiv preprint arXiv:2302.13971 (2023).
Dear Reviewer,
Thank you for your time and efforts in reviewing our submission. In our latest response, we provided clarifications regarding your concerns. We noticed there might be some misunderstandings of our work (e.g., support for tables of different sizes, the task formulation in our work, and the table embedding process). We hope our clarifications resolve those misunderstandings. As the deadline for the author-reviewer discussion is approaching, we would like to kindly draw your attention to our latest response. If there is any remaining or additional concern you would like us to address, we would be more than happy to do so.
Authors
I thank the authors for providing a response. I am really sorry for the late response. I am not increasing scores.
I'm not sure I'm understanding parts of the response.
- Weakness 1: do you simply mean that the tasks do not require capability to handle arbitrary sized tables?
- Weakness 2: I don't understand the following sentence "SCL considers positive pairs for the table row and the corrupted rows that have the same label as that row."
- Weakness 3: was there any hparam tuning for the proposed method at all? if the answer is positive, why wouldn't your tuning methodology work for the baseline as well?
- Question 4: Based on your response, it seems to me that tables (1) and (2) would have the same representation. If so, this makes me doubt the practical usefulness of the model, regardless of whether permutation invariance is useful in the experimental benchmark.
- LLM baseline: few/zero-shot prompt with examples and a list of output classes.
Transformer language models excel at natural language processing but underperform on tabular data compared to traditional algorithms. The authors claim that this is due to two factors: a) heterogeneous data in tables, and b) the handling of numerical values. The authors propose TDTransformer to fix these issues. To address a), TDTransformer uses different embedding processes for different types of columns. To address b), TDTransformer adapts piece-wise linear encoding.
They evaluate the model on 76 datasets, improving on SOTA methods.
Strengths
The main strength of this paper is empirical (results are strong). Additionally, the architectural modifications introduced are well-motivated.
The paper itself is well-written and well-presented, and the open-source code looks good at first glance. I haven't attempted it myself, but I assign a high probability of success to reproducing this work.
Weaknesses
While the motivations for the architectural modifications are well-principled, and they do work empirically, I still think that the paper has some claims that in my view are not backed up by the paper itself or by existing literature. For example, "overcome the transformer-based architectures’ incapability of interpreting heterogeneous data". This is a very strong statement. It's not clear to me whether Transformers have any intrinsic incapability of interpreting heterogeneous data, especially given the success of Transformers handling multimodal data in other domains. Perhaps these statements should be more cautious.
Questions
Say that I have a computational budget C and I want to obtain the best possible results on a given tabular dataset with some features (e.g., certain number of rows). Should I choose XGBoost or TDTransformer? I'm not necessarily asking for scaling laws (which would be ideal), but I do think readers would appreciate some pointers on this.
Similarly, "PLE introduces an inductive bias that is beneficial to the training process". Is this an inductive bias that is going to be less relevant at scale?
Contrastive losses are sensitive to batch sizes, how does batch size affect TDTransformer?
The authors thank the reviewer for the feedback. Here is the point-by-point response:
- Weakness 1: The reviewer’s concern makes sense. We made this statement based on existing works [1-4]. The spectral analysis in [2] provides experimental evidence to support the claim that transformer-based architectures are incapable of interpreting heterogeneous data.
- Question 1: We added a discussion of computational costs in the updated Appendix A.5. The updated part is marked in blue.
- Question 2: PLE depends on the distribution of the values in numerical columns. When scaling numerical values, the bin indices of numerical columns do not change. When b_{t-1} <= x <= b_t, ξ_i is invariant to scaling.
- Question 3: We thank the reviewer for the good suggestion! We added experimental results for different batch sizes in the updated manuscript; please refer to Table 5, Table 6, and the Section 4.3 ablation study on batch size. We marked the updated parts in blue. We found that when decreasing the batch size, the performance on the multiclass classification task decreases while that on the binary classification task barely changes.
[1] Shwartz-Ziv, Ravid, and Amitai Armon. "Tabular data: Deep learning is not all you need." Information Fusion 81 (2022): 84-90.
[2] Rahaman, Nasim, et al. "On the spectral bias of neural networks." International conference on machine learning. PMLR, 2019.
[3] Beyazit, Ege, et al. "An inductive bias for tabular deep learning." Advances in Neural Information Processing Systems 36 (2024).
[4] Gorishniy, Yury, Ivan Rubachev, and Artem Babenko. "On embeddings for numerical features in tabular deep learning." Advances in Neural Information Processing Systems 35 (2022): 24991-25004.
Thank you for the answers and updates. "We found that when decreasing batch size, the performance for the multiclass classification task decreases while that for the binary classification task nearly does not change." - interesting result. Re weakness 1, for an hypothetical camera-ready version of the paper, I'd like to see more caution in the writing. I'll update my assessment.
We thank the reviewer for the thoughtful feedback that helps improve our work. We are especially grateful for the positive assessment and the raised scores, which reflect the improvements and clarifications made in our submission.
Dear Reviewer,
As the rebuttal period is drawing to a close, I would appreciate your response to the authors' rebuttal at your earliest convenience.
Best Regards,
Area Chair
This paper presents the Tabular Domain Transformer (TDTransformer) framework, designed to enhance transformer-based models' performance on tabular data. The framework uses distinct embedding processes for categorical, numerical, and binary columns. It also includes a positional encoding. After fine-tuning on data from the OpenML benchmark, it outperforms some traditional tree-based and transformer-based methods.
Strengths
This paper provides comprehensive experiments and evaluations against traditional methods on tabular data.
Weaknesses
The proposed framework lacks novelty; it largely adapts existing techniques without proposing fundamentally new methodologies, which may limit its impact. For example, TAPAS from Google is also a transformer-based table parser.
I think the authors can follow the paper I posted in the review, as well as some other papers on more up-to-date LLM-based methods, to address the concern of outdated baselines; for details please refer to the paper I posted below.
By using transformer-based models with specialized encoding and pre-training techniques, the computational complexity and training cost increase. In many cases, traditional tree-based models like XGBoost or CatBoost achieve comparable results with far less computation, making TDTransformer potentially less attractive for practitioners dealing with tabular data in resource-constrained environments. Most of the baseline methods are outdated. I'd like to see more comparison with up-to-date methods like the LLM-based ones. Please provide runtime and resource usage comparisons between TDTransformer and tree-based models, or discuss scenarios where the increased computational cost might be justified by performance gains.
Questions
Can you add some experiments on LLM-based methods like the ones mentioned in Fang, Xi, et al. "Large language models (LLMs) on tabular data: Prediction, generation, and understanding - a survey." (2024)?
The authors thank the reviewer for the comments. Here is the point-by-point response:
- Weakness 1: TAPAS is an approach for tabular question answering, as opposed to tabular prediction. It directly flattens tables, while our method proposes different ways to process different types of columns. Compared to existing methods, (1) we propose three drastically different encoding strategies for the three types of columns: categorical, numerical and binary; (2) we propose a corruption strategy designed for tabular data; (3) we propose a column-type aware positional encoding designed for our encoding strategies. This type of positional encoding encourages the permutation invariance of tables while preserving the sequence order of natural language sentences.
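As a point of reference for item (2) above, one common way such a row-corruption operator is realized is SCARF-style random feature corruption; the sketch below is an assumption for exposition, not necessarily the corruption strategy proposed in the paper.

```python
import numpy as np

def corrupt_rows(X, corruption_rate=0.3, rng=np.random.default_rng(0)):
    """Illustrative SCARF-style corruption: a random subset of cells in each
    row is replaced by values drawn from the empirical marginal of the
    corresponding column (i.e., from other rows of the same column)."""
    n, m = X.shape
    mask = rng.random((n, m)) < corruption_rate        # which cells to corrupt
    donor_rows = rng.integers(0, n, size=(n, m))       # donor row per cell
    return np.where(mask, X[donor_rows, np.arange(m)], X)
```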
- Weakness 2: The deep learning baselines in our study are SubTab (2021) [1], Scarf (2022) [2] and SwitchTab (2024) [3]. We looked into the survey paper that the reviewer recommended and found that many of its baselines are not suitable for the task in our study. The survey paper covers prediction, data generation and table understanding; the task in our study is prediction. We do use the baselines mentioned in the prediction category: XGBoost, CatBoost and TransTab. Some other baselines mentioned in the survey paper are not up-to-date. For example, TabNet [4] and TabTransformer [5], mentioned in the survey paper, are earlier than most deep learning baselines we used. Can the reviewer elaborate on which specific baselines the authors should add for comparison?
- Weakness 3: We added the computational cost comparison in the updated Appendix A.5. The updated part is marked in blue.
- Question 1: The survey paper covers prediction, data generation and table understanding. Our model is used for classification tasks, which fall into the prediction category. We use 76 datasets to examine the performance. Can the reviewer elaborate on what kind of experiments are needed to examine the performance of TDTransformer?
[1] Ucar, Talip, Ehsan Hajiramezanali, and Lindsay Edwards. "SubTab: Subsetting features of tabular data for self-supervised representation learning." Advances in Neural Information Processing Systems 34 (2021): 18853-18865.
[2] Bahri, Dara, et al. "Scarf: Self-supervised contrastive learning using random feature corruption." arXiv preprint arXiv:2106.15147 (2021).
[3] Wu, Jing, et al. "SwitchTab: Switched autoencoders are effective tabular learners." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. No. 14. 2024.
[4] Arik, Sercan Ö., and Tomas Pfister. "TabNet: Attentive interpretable tabular learning." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 8. 2021.
[5] Huang, Xin, et al. "TabTransformer: Tabular data modeling using contextual embeddings." arXiv preprint arXiv:2012.06678 (2020).
Thank you for your response. However, I am not convinced by the responses. I will keep my rating.
Dear Reviewer,
As the rebuttal period is drawing to a close, I would appreciate your response to the authors' rebuttal at your earliest convenience.
Best Regards,
Area Chair
The authors thank the reviewer for the response. Can the reviewer elaborate on which aspects remain unclear or unconvincing?
The reviewer suggested adding baselines and experiments based on [1]. We tried to add baselines and experiments from the reference [1] that the reviewer gave, but we were not sure which baseline or experiment to add. We used SwitchTab (2024) as one of our baselines. Can the reviewer give more details on which baselines or experiments the authors should add to improve the quality of the work?
We tested one additional baseline, SAINT (2021), which is included in [1]. The performance for the binary classification task (Acc/AUC per dataset, followed by the averages) is:
| Method | Acc | AUC | Acc | AUC | Acc | AUC | Acc | AUC | Avg Acc | Avg AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| SAINT | 86.16 | 0.85 | 84.24 | 0.85 | 86.92 | 0.86 | 86.25 | 0.79 | 85.77 | 0.85 |
| TDTransformer | 87.56 | 0.87 | 91.67 | 0.87 | 83.94 | 0.88 | 95.40 | 0.96 | 87.79 | 0.88 |
| TDTransformer (CTA Pos) | 87.19 | 0.87 | 91.70 | 0.87 | 83.30 | 0.87 | 95.59 | 0.94 | 87.48 | 0.87 |
The performance for the multiclass classification task (Acc/F1 per dataset, followed by the averages) is:
| Method | Acc | F1 | Acc | F1 | Acc | F1 | Acc | F1 | Acc | F1 | Avg Acc | Avg F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SAINT | 63.63 | 0.61 | 64.95 | 0.60 | 70.97 | 0.71 | 65.55 | 0.64 | 75.03 | 0.74 | 68.87 | 0.67 |
| TDTransformer | 76.30 | 0.63 | 78.68 | 0.69 | 81.06 | 0.70 | 80.89 | 0.65 | 79.00 | 0.77 | 80.23 | 0.70 |
| TDTransformer (CTA Pos) | 76.70 | 0.63 | 78.94 | 0.69 | 81.36 | 0.70 | 81.07 | 0.65 | 79.47 | 0.77 | 80.51 | 0.70 |
[1] Fang, Xi, Weijie Xu, Fiona Anting Tan, Jiani Zhang, Ziqing Hu, Yanjun Jane Qi, Scott Nickleach, Diego Socolinsky, Srinivasan Sengamedu, and Christos Faloutsos. "Large language models (LLMs) on tabular data: Prediction, generation, and understanding-a survey." (2024).
Dear Reviewer,
Thank you for the review of our submission. Following your suggestion, we have chosen one baseline, SAINT [2], from the reference [1] you mentioned. Results and details are reported in the previous reply. We would like to kindly ask whether you have any feedback regarding the previous reply.
We would greatly value more specific details about which aspects of our work were unconvincing or unclear. If there are still concerns, could you provide additional insights into your concerns? For instance:
- If you think we need to do more experiments, could you provide more details on what type of experiments are needed to improve the quality of our work?
- If you think more baselines are needed, could you list the recommended baselines?
Your feedback would help us improve our work. We are happy to provide further clarifications or modifications based on your input.
Authors
[1] Fang, Xi, Weijie Xu, Fiona Anting Tan, Jiani Zhang, Ziqing Hu, Yanjun Jane Qi, Scott Nickleach, Diego Socolinsky, Srinivasan Sengamedu, and Christos Faloutsos. "Large language models (LLMs) on tabular data: Prediction, generation, and understanding-a survey." (2024).
[2] Somepalli, G., Goldblum, M., Schwarzschild, A., Bruss, C.B. and Goldstein, T., 2021. Saint: Improved neural networks for tabular data via row attention and contrastive pre-training. arXiv preprint arXiv:2106.01342.
The paper investigates the application of transformer-based language models to tabular data, highlighting their limitations due to the heterogeneous nature of tables and the interpretation of numerical values. To tackle these challenges, the paper proposes the Tabular Domain Transformer (TDTransformer), which utilizes distinct embedding processes for different column types and incorporates piece-wise linear encoding (PLE) for numerical columns. This approach aims to enhance the semantic understanding of language models when interpreting tabular data. Extensive experiments on 76 real-world datasets demonstrate that TDTransformer outperforms existing methods, suggesting a promising direction for leveraging language models in the tabular data domain.
Strengths
- Introduces a novel framework (TDTransformer) specifically designed for tabular data, addressing the limitations of traditional transformer models.
- Employs distinct embedding processes for different column types, improving the model's ability to capture semantic information.
- Employs piece-wise linear encoding (PLE) for numerical columns, improving the model's ability to capture numerical values.
- Demonstrates strong performance across a wide range of real-world datasets, indicating the framework's robustness and applicability.
- The paper is well-organized and easy to follow.
Weaknesses
- The complexity of the TDTransformer architecture may lead to increased computational costs compared to simpler models.
- The reliance on specific embedding techniques may limit the model's generalizability to other types of data or tasks.
Questions
Have you ever considered using a different LM as the table encoder (for example, llama or mistral)?
The authors thank the reviewer for the feedback. Here is the point-by-point response:
- Weakness 1: Yes, compared to tree-based methods such as XGBoost [1], tabular deep learning methods are computationally more expensive. We do consider the computational cost, so we intentionally chose an LM encoder instead of an LLM encoder. The total number of trainable parameters of TDTransformer is merely 32.31 M, whereas the number for large language models is much larger (e.g., Llama 7B has 7B parameters). The high computational cost of deep learning methods is an inherent drawback compared to traditional machine learning methods. However, language models have a unique strength compared to traditional machine learning methods: the ability to understand semantics. This is particularly helpful in tabular learning when tables contain rich semantics. Traditional machine learning methods have difficulty encoding column names. Using our proposed column encodings and the power of the language model, the performance is pronouncedly higher than traditional machine learning methods, as shown in Table 2 and Table 3.
- Weakness 2: TDTransformer uses three different types of column embedding to process categorical columns, numerical columns and binary columns. These three types constitute all types of columns in relational tables. We focus on relational tables, as indicated in Section 3: “Input data are relational tables that have a unique column given a column name”. For each type of column, we use drastically different encoding methods to enhance the capability of transformer models to process heterogeneous data.
- Question 1: Yes, we considered a more powerful LM encoder, but as the reviewer points out, the computational costs of deep learning methods are higher than those of traditional machine learning methods such as XGBoost [1]. We intentionally use LM encoders in lieu of LLM encoders. Our current model is much lighter than Llama or Mistral: the total number of trainable parameters is just 32.31 M, whereas Llama 7B [2] and Mistral 7B [3] each have 7B parameters.
[1] Chen, Tianqi, and Carlos Guestrin. "XGBoost: A scalable tree boosting system." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.
[2] Touvron, Hugo, et al. "LLaMA: Open and efficient foundation language models." arXiv preprint arXiv:2302.13971 (2023).
[3] Jiang, Albert Q., et al. "Mistral 7B." arXiv preprint arXiv:2310.06825 (2023).
Thank you for the response. I will keep my rating.
We thank the reviewer for the thoughtful feedback.
Dear Reviewer,
As the rebuttal period is drawing to a close, I would appreciate your response to the authors' rebuttal at your earliest convenience.
Best Regards,
Area Chair
The paper is about modifications to the transformer architecture and training objectives for solving tabular classification tasks using transformers. The authors propose techniques for linearizing tabular data and encoding it with a pre-trained transformer model such as BERT or RoBERTa. Features are encoded differently depending on whether the feature is categorical, numerical, or binary, with column-type aware position encodings. Further, the paper includes a pre-training strategy based on self-supervised and supervised contrastive learning. The authors compare their proposed method, named TDTransformer, with strong baselines used for tabular classification tasks, e.g., XGBoost, and demonstrate reasonable accuracy improvements over these baselines.
Strengths
- The paper explores an important area -- tabular classification -- where traditional ML methods continue to outperform pre-trained transformers.
- The experiments in the paper seem to be very thorough, covering 76 tabular classification tasks in the openml benchmark and ablations of the key design choices used in the proposed method.
Weaknesses
Despite its strengths, I feel that the paper needs significant improvements.
Major concern: Clarity of the Proposed Method.
- Sections 3.1 and 3.2 introduce an embedding E, but Section 3.3 refers to an embedding z without clarifying if or how z is derived from E. Line 261 states, "z_i is the hidden representation for the i-th table row," yet Sections 3.1 and 3.2 only discuss column embeddings, not row embeddings. It appears that "column embedding" might actually refer to "feature embeddings." Therefore, clearer terminology and more consistent naming of components would enhance understanding.
- Section 3.3 (Training Pipeline) begins by mentioning both training and fine-tuning but only elaborates on pre-training objectives (SSCL and SCL). The absence of detail on the fine-tuning objective leaves it unclear how the model was fine-tuned after pre-training.
Minor concerns:
- Line 33: The reference to "tree-based methods" is not clear. Providing specific examples or references would improve clarity.
- Line 110: There appears to be a typo -- should "j_i" be replaced with "y_i"?
Questions
- How does the proposed method handle columns with string datatype?
- How does the proposed method handle missing values in a table?
- Does the proposed method encode multiple rows in a table, or is it only one row at a time?
- Line 357: Why is subset selection required?
The authors thank the reviewer for the comments. Here is the point-by-point response:
- Weakness 1: z_i is obtained by feeding E in Eq. 10 to the LM encoder, as shown in Figure 1 (a), and extracting the i-th row. We explained the meaning of z_i on page 5 of the original manuscript: “z_i is the hidden representation for the i-th table row”. To clearly show their relation, we added a paragraph illustrating the relationship between E and z_i in our updated manuscript, marked on page 5: “After the contextualization in the LM encoder…”. The updated part is marked in blue.
The numerical column embedding depends on the distribution of scalars in the column, so in Section 3.1 we refer to column embeddings. A table row i consists of the cells of all columns at the i-th row, so we mention the different columns in Section 3.2.
The task is to make a prediction for each row, as shown in the task formulation in Section 3. We train the model to learn a semantically meaningful embedding representing the i-th row and make a prediction for that row. The i-th row contains cell values of multiple columns at this row. We treat different columns differently to solve the heterogeneity issue in tables. After obtaining the embeddings, we combine them to obtain a sequence of embeddings for each row, which is the combination process.
- Weakness 2: The fine-tuning process is supervised, as shown in the task formulation in Section 3. The model is trained to maximize the probability of predicting the correct label, as shown in Eq. 1. We added a paragraph on the fine-tuning process on page 6: "The model weight after the pre-training process is used as the initialized weight for the fine-tuning...".
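A minimal sketch of such a supervised fine-tuning step; the `encoder` and `head` objects and the loop details are placeholders, and the loss is standard cross-entropy in the spirit of Eq. 1, not the paper's exact code.

```python
import torch.nn.functional as F

def fine_tune_step(encoder, head, optimizer, batch_rows, batch_labels):
    """One supervised fine-tuning step: the encoder (initialized from the
    pre-trained weights) produces row embeddings, and the classification
    head is trained with cross-entropy on the row labels."""
    z = encoder(batch_rows)                       # (batch, d) row embeddings
    logits = head(z)                              # (batch, num_classes)
    loss = F.cross_entropy(logits, batch_labels)  # maximizes log-prob of the correct label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```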
- Weakness 3: Examples of tree-based methods include XGBoost [1] and CatBoost [2]. An overview of the comparison between tree-based methods and deep learning methods can be found in [3-4]. We added specific examples of tree-based methods on page 1 of the modified manuscript: “However, they continue to lag tree-based methods such as XGBoost”.
- Weakness 4: Yes, it is a typo; "j_i" should be "y_i". We appreciate the reviewer pointing it out.
- Question 1: Categorical columns are essentially columns with a string datatype. Hence, TDTransformer is able to handle columns with a string datatype. The way categorical columns are handled is reported in Section 3.1 (column embedding).
- Question 2: We replace missing values with the most common value in the column. We will add this detail in Section 4.1 (experimental details).
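A small illustration of this mode imputation; the column names and values are hypothetical, not taken from the paper's datasets.

```python
import pandas as pd

# Each missing cell is filled with its column's most common value (the mode).
df = pd.DataFrame({"color": ["red", None, "red", "blue"],
                   "age":   [31, None, 44, 31]})
df = df.fillna(df.mode().iloc[0])
print(df)
```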
- Question 3: TDTransformer does the preprocessing (e.g., handling missing values and normalizing numerical values within the same column) at the same time. For the encoding process, TDTransformer converts multiple rows of a table into embeddings at a time; the number of rows is equal to the batch size.
- Question 4: Different subsets are selected to examine the performance on different aspects. For example, TDTransformer proposes the embedding method for numerical columns, whereas the embedding process for categorical columns is the same as in traditional tabular deep learning methods using serialization and embedding. Hence, we select a subset of tables to examine the effectiveness of TDTransformer on that aspect.
In addition to the performance on subsets, we report the average performance over all 76 datasets.
[1] Chen, Tianqi, and Carlos Guestrin. "XGBoost: A scalable tree boosting system." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.
[2] Prokhorenkova, Liudmila, et al. "CatBoost: Unbiased boosting with categorical features." Advances in Neural Information Processing Systems 31 (2018).
[3] Shwartz-Ziv, Ravid, and Amitai Armon. "Tabular data: Deep learning is not all you need." Information Fusion 81 (2022): 84-90.
[4] Grinsztajn, Léo, Edouard Oyallon, and Gaël Varoquaux. "Why do tree-based models still outperform deep learning on typical tabular data?" Advances in Neural Information Processing Systems 35 (2022): 507-520.
[5] Borisov, Vadim, et al. "Deep neural networks and tabular data: A survey." IEEE Transactions on Neural Networks and Learning Systems (2022).
Dear Authors,
Thank you for updating the paper based on the review feedback and for providing detailed responses to the questions. I have also reviewed the feedback from other reviewers and your rebuttals to their comments. Based on these inputs, I have made a minor adjustment to my scores.
Thanks.
We thank the reviewer for the thoughtful feedback. We are especially grateful for the positive assessment and the raised scores.
Dear Reviewer,
As the rebuttal period is drawing to a close, I would appreciate your response to the authors' rebuttal at your earliest convenience.
Best Regards,
Area Chair
This paper introduces TDTransformer, a new transformer-based architecture designed for tabular classification. The core claim is that standard transformers struggle with tabular data due to its heterogeneous nature and the way numerical values are represented. TDTransformer attempts to overcome these limitations by employing distinct embedding processes for categorical, numerical (using piece-wise linear encoding), and binary columns, along with a column-type aware positional encoding. The paper pre-trains the model using self-supervised and supervised contrastive learning objectives and demonstrates its performance against strong baselines like XGBoost on a comprehensive set of OpenML datasets. The key findings presented are the empirical improvements TDTransformer shows over these baselines and the effectiveness of the proposed embedding techniques.
The primary strength of the paper lies in its strong empirical performance across a large number of datasets, indicating the potential of the proposed architecture. Reviewers consistently acknowledge the thoroughness of the experiments and the promising results achieved by TDTransformer compared to traditional methods. The architectural modifications, particularly the different embedding strategies for various column types, are also viewed as well-motivated. The paper is generally well-written and presented, with reviewers noting its organization and the open-source nature of the code.
However, several weaknesses are also pointed out by reviewers. A major concern is the clarity of the proposed method, specifically regarding the embedding process and the distinction between different embedding representations. The description of the fine-tuning process is also lacking. There are also concerns about the computational cost compared to simpler tree-based methods. Furthermore, some reviewers point out potentially overstated claims regarding the "incapability" of standard transformers to handle heterogeneous data. A point of contention arises regarding the choice of baselines, with some reviewers advocating for comparisons against more recent, especially LLM-based, approaches.
While this paper demonstrates certain merits, particularly in its empirical results, its current standing suggests necessary enhancements before it meets the acceptance criteria for the conference.
Additional Comments on Reviewer Discussion
Given Reviewer X4q7's lack of engagement in the discussion, their review was not heavily weighted in the final decision. The general sentiment among the remaining reviewers leans towards a cautious evaluation of this paper. A recurring concern involves the consistent application of hyperparameter tuning across the baseline models, and the contrasting feedback, including instances of misunderstanding, underscores the necessity for significant improvements in clarity and presentation to fully address the reviewers' concerns.
Reject