PaperHub
Average rating: 5.2 / 10 · Rejected (5 reviewers)
Scores: 5, 8, 5, 3, 5 (min 3, max 8, std 1.6)
Average confidence: 3.8
ICLR 2024

UniPredict: Large Language Models are Universal Tabular Predictors

Submitted: 2023-09-18 · Updated: 2024-02-11
TL;DR

We developed a universal tabular data predictor that accepts arbitrary tabular inputs and predicts arbitrary targets, with experimental verification on 200+ tabular datasets.

Abstract

Keywords
tabular prediction, large language model, AI for healthcare

Reviews and Discussion

Review
Rating: 5

This paper proposes a novel method for tabular data prediction based on generative modeling using large language models (LLMs), UniPredict. It can handle various prediction tasks without re-training by following the input instructions. The paper shows that UniPredict outperforms existing methods that use discriminative modeling and require re-training for each task. The paper also demonstrates that UniPredict can adapt to new tasks in few-shot learning settings with minimal data. The paper aims to develop a universal tabular data prediction system that can leverage the generative power of LLMs and serve diverse applications.

Strengths

  • This paper proposes a novel idea of using a large language model (LLM) to perform tabular data prediction for any target variable, based on generative modeling and prompt engineering.
  • This work introduces the concept of target augmentation, which is a technique to enhance the LLM’s ability to handle diverse and complex targets.
  • It demonstrates the effectiveness of UniPredict on 169 tabular datasets with diverse targets, and compares its performance with several baselines, including tree-boosting and neural network models.
  • Well-written and organized, with clear problem formulation and framework description.

Weaknesses

  • The Implementation section is not clear enough: what model did you use in the experiment during your model learning process, and what are the specific training parameters?
  • I found that most of the data in the experiment are discrete (the dataset’s targets are not continuous); I wonder what the results for different target types are.
  • Do you do any special processing for continuous target types?
  • In the ablation experiment, you did not ablate whether to use target augmentation or not
  • Using different classifiers for target augmentation
  • From the experimental results, UniPredict's results are not significantly better than XGBoost, and it gives me a feeling that it is distilling XGBoost.
  • These datasets are from Kaggle; can you list the gap between the current results and the Kaggle top results?

Questions

See the Weaknesses section above.

Comment

Dear reviewer,

Thank you for the valuable comments. We hope the following clarifications could address your concern.

1. Training Details

What model did you use in the experiment during your model learning process, and what are the specific training parameters?

To address the concern about clarity in the Implementation section: we instruction fine-tuned a GPT-2 model. The specific training parameters can be found in the global response and the updated manuscript.

2. Target Augmentation

I found that most of the data in the experiment are discrete (the dataset’s targets are not continuous); I wonder what the results for different target types are.

We understand the question to be asking whether only classification tasks were tested in the experiments. We acknowledge that regression tasks are translated into classification tasks in our experiments. We pre-processed the different types of targets into classification labels, addressing the major limitation of fine-tuning LLMs on one-token-long targets. While this approach converts continuous targets into discrete classes, we recognize the trade-off and suggest avenues for improvement in the future.

Do you do any special processing for continuous target types?

As detailed in the Target Augmentation section (Section 2.3), continuous numerical targets are categorized based on their quantiles. An illustrative example is provided in Listing 10 of Appendix A.4, demonstrating how a continuous target is converted into a classification task.
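To make the quantile-based conversion concrete, here is a minimal sketch assuming pandas; the bin count and column name are illustrative and not the paper's exact settings:

```python
import pandas as pd

# Minimal sketch of quantile-based target discretization.
# The bin count and column name are assumptions for illustration only.
def discretize_target(df: pd.DataFrame, target_col: str, n_bins: int = 4) -> pd.Series:
    # pd.qcut assigns each continuous value to a quantile bucket, turning the
    # regression target into ordinal class labels 0..n_bins-1.
    return pd.qcut(df[target_col], q=n_bins, labels=False, duplicates="drop")

df = pd.DataFrame({"price": [3.0, 7.5, 12.0, 18.0, 25.0, 40.0, 55.0, 90.0]})
df["price_class"] = discretize_target(df, "price")  # classes 0-3 split at the quartiles
```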

Using different classifiers for target augmentation

While we did not test different methods for target augmentation, opting for XGBoost as it proved effective, we encourage further experimentation within our framework to determine the most suitable classifier.

3. Ablation

In the ablation experiment, you did not ablate whether to use target augmentation or not

In the ablation experiment, we did ablate whether to use target augmentation or not, as indicated in Table 1, Section 3.5. The results include UniPredict-heavy (UniP-h), UniPredict-heavy-without-target-augmentation (Abl-h), UniPredict-light (UniP-l), and UniPredict-light-without-target-augmentation (listed as a second Abl-h; we apologize for the typo).

4. Relation to Teacher Models

From the experimental results, UniPredict's results are not significantly better than XGBoost, and it gives me a feeling that it is distilling XGBoost.

We want to emphasize that UniPredict is a model designed for universal tabular prediction. The comparison involves one UniPredict model with 100+ specialized XGBoost models, which is inherently unfair as we are evaluating a single model against a diverse set of specialized models.

Additionally, we are not distilling XGBoost; it is used as a distribution fitting method. UniPredict is designed to accommodate alternative methods that can produce better fitting results.

5. Comparison with Kaggle Results

These datasets are from Kaggle; can you list the gap between the current results and the Kaggle top results?

Unfortunately, we cannot provide a direct comparison between UniPredict results and Kaggle top results for each dataset. This is because

  1. the Kaggle datasets we used were randomly selected, and access to state-of-the-art results on each dataset is not available;
  2. some datasets are not among the popular ones, making it challenging to establish a comprehensive benchmark that demonstrates top results;
  3. the test sets of almost all Kaggle datasets are not open.
Review
Rating: 8

This work curated tabular data and trained a single LLM on an aggregation of 169 tabular datasets with diverse targets. It improves the prediction accuracy by 5.4% to 13.4% compared with the SOTA tree-boosting baseline and neural network baseline, respectively. Besides, the trained LLM outperforms others by a large margin in few-shot settings.

Strengths

It's good to see a Tabular LLM, and it could benefit the community if the code and data are released.

The experiment is solid, and the performance improvement is solid.

Weaknesses

Only minor things: For the experiment setting, the baseline setup needs to be clarified in the main text. I read the appendix and found that the setting of TabLLM is different from its original paper. Why do we need to change their backbone and prompts, at least for the few-shot setting? The authors said, "we streamlined the process by instructing the model to predict the class name directly. This approach simplifies the training procedure and conserves computational resources." But is there any performance drop from doing so?

Questions

For the metadata reformatting process that leverages ChatGPT, can we measure the impact of hallucination since feeding column names to LLM may get some weird explanation or irrelevant content? Is there any post-processing to control the quality of reformatted meta information?

According to the ablation study using the augmented teaching signal from XGBoost, I saw the performance gap is significant. Does it mean the distillation from XGBoost is a key to success? If so, what's the performance of other baselines if we also take the XGBoost as the teacher?

Typos in Sec-3.5, the table number is wrong, and there are two Abl-h.

Comment

Dear reviewer,

We sincerely appreciate your thoughtful comments and the opportunity to provide clarifications on the issues raised.

1. Experiment Setups

The baseline setup needs to be clarified in the main text.

This was also a concern for other reviewers, and we have posted a global response on this matter. In summary, we used GPT-2 as the backbone model and hence used it as the backbone of TabLLM as well, to ensure a fair comparison. We have updated the relevant content in the new manuscript.

2. Data Preprocessing

Can we measure the impact of hallucination since feeding column names to LLM may get some weird explanation or irrelevant content?

Concerns about hallucination during metadata reformatting were carefully considered. We attempt to resolve this problem by:

  • Giving the LLM a strict template to follow.
  • Giving clear definitions of the tasks we perform.
  • Giving alternatives for the LLM to respond with if it does not know what to answer. We rule out such datasets from the training set for future fine-tuning.

Listing 3, Appendix A.2 gives an example of our reformatting setup described above. We did not encounter hallucination errors from this setup. For datasets where the LLM cannot generate relevant content, it is instructed to put ‘N/A’ in those sections, and the dataset is omitted.
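For illustration, a hypothetical reformatting prompt in the spirit of the constraints above might look as follows; the wording is ours and differs from the actual prompt in Listing 3:

```python
# Hypothetical constrained reformatting prompt; the wording is illustrative
# and not the actual prompt from Listing 3, Appendix A.2.
REFORMAT_TEMPLATE = """You rewrite tabular dataset metadata.
Task: write a one-sentence, factual description for each column listed below.
Rules:
- Follow the output format exactly: "<column>: <description>".
- Do not add information that is not implied by the column name or dataset description.
- If you cannot describe a column, output "<column>: N/A".

Dataset description: {dataset_description}
Columns: {column_names}
"""

prompt = REFORMAT_TEMPLATE.format(
    dataset_description="Records of blood donors and whether they donated again.",
    column_names="Recency, Frequency, Monetary, Time, Class",
)
```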

3. Insight from UniPredict

Does it mean the distillation from XGBoost is a key to success?

In our design, UniPredict learns from ground-truth labels, and XGBoost is utilized for fitting the distribution. While we did not experiment with other teacher models, we hypothesize that any teacher model capable of providing label probabilities could yield good performance. However, testing other baselines with XGBoost as the teacher was not conducted, as we believe this aspect contributes to the novelty of our work.

4. Typos

Typos in Sec-3.5, the table number is wrong, and there are two Abl-h.

We appreciate your help in identifying typos. We have updated this in our new manuscript.

Review
Rating: 5

The paper proposes UniPredict, a framework for training large language models (LLMs) to serve as universal tabular data predictors. The key ideas and contributions are:

• Most prior tabular data prediction methods are discriminative and make predictions only for a fixed, pre-specified target column. In contrast, UniPredict is a generative model that can accept arbitrary tabular data as input and make predictions for any target column specified at query time.

• The authors aggregate 169 diverse tabular datasets into a large training corpus to train the UniPredict LLM. This exposes it to the diversity needed to handle new datasets and prediction tasks.

• Novel prompt engineering strategies are used to transform tabular data into natural language inputs consumable by the LLM. Reformatted metadata and instructions for specifying the target variable are incorporated into the prompts.

• Target augmentation and training procedures are designed to produce probabilistic predictions from the LLM with reliable confidence estimates.

• Experiments show UniPredict outperforms prior specialized models, with especially strong generalization under low-data regimes. It achieves higher accuracy than the best neural baselines and boosting methods across the aggregated test sets.

In summary, UniPredict demonstrates how scaling up training data and prompt engineering enables LLMs to learn universal tabular prediction capabilities not seen in prior specialized models. The proposed system and training framework enable handling diverse datasets and prediction tasks within a single model.

Strengths

  1. For the first time, this paper introduces a novel approach of outputting confidence scores for predictions made by large language models (LLMs) on tabular data. Specifically, the authors employ XGBoost to first generate confidence scores for the training tabular data. The large language model is then trained to mimic these confidence estimates for its own predictions on tabular data. Applying LLMs to tabular prediction and producing probabilistic outputs is an innovative contribution in the field of using LLMs for tabular prediction.

  2. The paper provides extensive prompt engineering techniques that enable the LLM to accept arbitrary tabular inputs and make predictions. Substantial research has been conducted on designing effective prompts that allow the LLM to comprehend diverse tabular data samples and generate outputs. This represents significant effort and advancement in prompt engineering for applying LLMs to tabular data.

Weaknesses

  1. This paper does not clearly specify which large language model architecture was used for the fine-tuning experiments. Was it a model like LLaMA, Falcon, or GPT-3.5? This omission of key information is a major limitation. Based on the logo in Figure 2, I infer that OpenAI's fine-tuning API was likely used. It is reasonable to hypothesize that the models in the experiments (e.g. UniPredict-light and UniPredict-heavy) were potentially GPT-3.5 or GPT-4.

  2. The previous point raises the concern that the paper seems to conflate the notion that "Only OpenAI's ChatGPT constitutes a large language model." The title reads "LARGE LANGUAGE MODELS ARE UNIVERSAL TABULAR PREDICTORS," but large language models include more than just ChatGPT, such as LLaMA and Falcon. It is well known that ChatGPT has strong generalization abilities, but the authors did not discuss whether their methods would still be effective on other large language models. The experimental results make it difficult to ascertain the validity of the authors' methods, as the results could simply stem from the power of the chosen foundation model rather than the methods themselves. Therefore, the conclusion stated in the title is not convincingly demonstrated.

  3. The authors appear to have limited tabular prediction tasks to only classification problems, which is unreasonable. Tabular prediction also encompasses regression tasks. The authors should clarify which validation datasets involve classification versus regression to properly support the title conclusion of "UNIVERSAL TABULAR PREDICTORS."

  4. Potential data leakage. The tabular data used for testing was sourced from the public Kaggle website. Due to the blackbox nature of the large language models used (not knowing what training data they have seen), we cannot be certain that the experimental results stem from the authors' fine-tuning rather than the LLMs having previously encountered related data from Kaggle.

  5. Poor reproducibility. The authors did not release or provide any materials or code to support reproducibility of their work.

Questions

Please refer to the "Weaknesses" section.

Comment

Dear reviewer,

Thank you for the valuable comments. We hope the following clarifications could address your concern.

1. Implementation Issues

This paper does not clearly specify which large language model architecture was used for the fine-tuning experiments.

The backbone used for UniPredict training is GPT-2, and we have now included this detail in the manuscript.

2. UniPredict's Capability

the paper seems to conflate the notion that "Only OpenAI's ChatGPT constitutes a large language model."

We clarify that our choice of GPT-2 as the backbone is intended to demonstrate the effectiveness of our framework with a minimal-sized LLM. We do not assert that only OpenAI’s ChatGPT constitutes a large language model. We welcome further experimentation with different LLM implementations and believe that if GPT-2 performs well, other larger LLMs will likely perform even better. We apologize, as this misunderstanding might originate from our mistake of omitting the backbone selection.

Tabular prediction also encompasses regression tasks.

We acknowledge the concern regarding the focus on classification tasks and the absence of regression tasks. We do have tasks that are originally regression tasks; however, we translate them into classification tasks by binning the targets into quantiles for prediction. We acknowledge that there are no true regression tasks, and this can be further improved.

The limitation stems from the challenge of fine-tuning LLMs on one-token-long targets. Our approach, Target Augmentation, aims to address this limitation by converting targets into strings of individual label confidence, enabling more effective fine-tuning. We recognize the trade-off and limitations of this solution.

An alternative approach involves augmenting the number of classes associated with a regression task. Consider a regression task with a target range of 0-99; in this scenario, expanding the target into 100 classes would lead to a more gradual and refined response from the framework.

3. Data Leakage

Potential data leakage.

Data leakage is a broader issue in the field, impacting various works, not specific to ours [1]. We want to clarify that the backbone used in our experiments is GPT-2, released in 2019, and the majority of the datasets used were not publicly available at that time. We note that frequently examined datasets like Credit-g, Blood, and Banks are more likely to be known to LLMs. We acknowledge that it is helpful in future work to build novel prediction datasets to verify the true generalizability of these pre-trained tabular data prediction models.

[1] Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. "Benchmark Probing: Investigating Data Leakage in Large Language Models." NeurIPS 2023 Workshop on Backdoors in Deep Learning: The Good, the Bad, and the Ugly, 2023.

4. Reproducibility

Poor reproducibility. The authors did not release or provide any materials or code to support reproducibility of their work.

We are not allowed to provide open-source URLs at this point because that is against the double-blind policy. We will release the GitHub repository after the review process is finished.

We believe that your concern about reproducibility can be resolved now that we report the backbone choice. We propose a simple instruction-tuning process with GPT-2, and all datasets and prompts used are provided in the appendices.

We shared the code as a zip file in the supplementary material section to support the reproducibility of our work.

Comment

Dear authors,

I greatly appreciate your thoughtful response addressing the concerns from my review. Especially, the addition of using GPT-2 architecture as the backbone for UniPredict in the revised draft, alongside providing reproducible code in the supplementary material, is praiseworthy. In light of these amendments, I would be glad to raise my score from [3] to [5]. However, a higher score is precluded due to the following lingering reservations:

  1. I respectfully maintain that the claimed title of "Universal Tabular Predictors" necessitates native handling of both classification and regression tasks, the latter still being an open challenge for UniPredict. As acknowledged in your response, UniPredict currently focuses on classification problems. Perhaps the title could be revised to "Universal Tabular Classifiers" to avoid overclaim or unintended generalization beyond demonstrated capabilities.

  2. There remains room for further ablation studies on the novel and vital contribution of Target Augmentation with label probabilities. Currently, the XGBoost with only one single parameter setting (n_estimators=100) generates the label probabilities for training data to fine-tune UniPredict to output prediction probabilities. Since different models (XGBoost, Random Forests, CatBoost etc.) with different parameters would produce disparate probability estimates on tabular data, whether label probabilities from other models or settings would further aid UniPredict's accuracy merits investigation. Moreover, owing to the opacity of language models, precisely how UniPredict calculates output probabilities - whether a surface correlation or true probabilistic computation - deserves deeper analysis.

  3. Using accuracy as the sole evaluation metric provides an incomplete portrayal of performance for the classification tasks attempted. Alongside accuracy, other metrics like AUC and F1-score would furnish a more wholesome perspective for imbalanced classification datasets experimented upon.

I welcome a constructive discussion on these suggestions at your convenience.

Comment

Dear reviewer,

Thank you for improving the score. We hope the following points address your concerns.

  1. We recognize the potential overclaim in branding UniPredict as 'Universal Tabular Predictors.' The manuscript has been revised, now referring to it as 'Universal Tabular Classifiers.'

  2. Your suggestion for further ablation studies on the contribution of Target Augmentation with label probabilities is valid. While we view our contribution as introducing the concept of target augmentation, we acknowledge that the current implementation may not be optimal. Despite yielding robust results, we encourage future experiments within our framework to identify the most effective classifier.

  3. We agree with your perspective on metrics. Due to time constraints, we plan to incorporate this into the final version. We opted for accuracy in our paper because of its widespread use in practical applications and because previous works have solely employed accuracy, for example [1]. Also, given the comprehensive testing across 160+ datasets using this metric, we believe the results remain representative.

[1] Gorishniy, Yury, et al. "Revisiting deep learning models for tabular data." NeurIPS 2021.

Best regards,

Submission 990 Authors

Review
Rating: 3
  1. The paper presents a pretraining methodology for tabular data prediction. Concretely, the authors propose training on a large collection of tabular datasets (169 datasets).

  2. In order to do so, the authors standardize samples across datasets: i.e., given a sample, they serialize the sample as a description of the metadata (describing the dataset schema), serialize the columns and values as a natural language string, add a description of the target labels, and finally append the target label(s) along with an associated confidence measure (where the confidence measure is derived from an external classifier trained on the dataset). In order for the metadata to be described as a natural language string, the authors propose prompts for rewriting the available metadata information leveraging GPT-3.5. (An illustrative sketch of this serialization is given after this list.)

  3. By leveraging the aforementioned formulation, the authors demonstrate that a single model can be trained across numerous datasets as well as a diverse set of target columns.

  4. The proposed model is evaluated on both the 169 datasets used for training, as well as on 69 datasets in a low resource setup to demonstrate transferability. It achieves substantial improvement over baseline methods on the aggregated 169 datasets. Moreover, it also achieves strong performance on the 69 datasets used for studying the transfer learning setup, especially in the low data setup; substantially outperforming the XG-Boost baseline.
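As a rough illustration of the serialization described in point 2 above, the following sketch shows one plausible format; the exact phrasing, field order, and confidence notation are assumptions rather than the paper's templates:

```python
# Hedged sketch of the serialization summarized in point 2; the phrasing,
# field order, and confidence format are assumptions, not the paper's templates.
def serialize_sample(metadata: str, row: dict, target_desc: str, probs: dict) -> str:
    features = "; ".join(f"{col} is {val}" for col, val in row.items())
    confidences = ", ".join(f"{label} is {p:.2f}" for label, p in probs.items())
    return (
        f"Metadata: {metadata}\n"
        f"Features: {features}.\n"
        f"Target: {target_desc}\n"
        f"Answer: {confidences}."
    )

example = serialize_sample(
    metadata="Patient records; each row describes one patient.",
    row={"age": 45, "bmi": 31.2, "glucose": 148},
    target_desc="Predict whether the patient has diabetes (yes/no).",
    probs={"yes": 0.72, "no": 0.28},
)
```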

Strengths

  1. The paper presents a novel method for serializing data samples from tables for generative pre-training that incorporates not only the column value information from tables but also the associated metadata.
  2. The proposed methodology of leveraging not just the target labels, but also the confidence associated with each class label in the generative modeling setup is quite novel. The ablation studies demonstrate the benefits of modeling the confidence estimates.
  3. The proposed methodology achieves strong performance, especially considering that it is a single model for a diverse number of tasks, when compared to the per dataset baselines.

Weaknesses

  1. The authors do not mention any details about the actual model backbone used for the UniPredict training: for example what (if any) pre-trained model is used for the backbone.
  2. While the authors do evaluate over a suite of Table tasks, it is very hard to position the model's performance compared to other proposed methods. It might be better to demonstrate the performance of the model on datasets that have been used previously, just to get a sense of the model's performance compared to prior literature (e.g., the Blood, Bank, Calhousing, Car, Credit-g, Diabetes, Heart, Income and Jungle datasets as used in [1]).
  3. Regarding baseline methods: some of the stronger baselines like TabPFN ([2]) would also be helpful in trying to understand the utility of the proposed method. In addition to that, the TabLLM baseline implemented is considerably weaker compared to what was proposed in the original paper. Concretely, the original paper uses T0, fine-tuned with T0-Few recipe [3]; which inherently has better instruction following capabilities compared to the GPT-2 backbone used in this paper. This makes the baseline somewhat artificially weak. It would be good to compare against the actual proposed methodology in [1] or (as mentioned above) evaluate on the datasets for which TabLLM reported the results, just to ensure a fair comparison.

[1] Hegselmann, Stefan, et al. "Tabllm: Few-shot classification of tabular data with large language models." International Conference on Artificial Intelligence and Statistics. PMLR, 2023.

[2] Hollmann, Noah, et al. "Tabpfn: A transformer that solves small tabular classification problems in a second." arXiv preprint arXiv:2207.01848 (2022).

[3] Liu, Haokun, et al. "Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning." Advances in Neural Information Processing Systems 35 (2022): 1950-1965.

Questions

  1. Given that model confidence is a part of the autoregressive prediction objective, how calibrated are the generations from the model during evaluation? Concretely, (1) do the confidence estimates produced during generation form a valid probability distribution and (2) would it be possible to compute the calibration score for the produced probabilities (maybe something like the Expected Calibration Error, e.g., as done in [1])? LLMs have been shown to verbalize well calibrated outputs [2], but in my opinion, the degree of calibration observed would probably be a function of the compute used for training the model.

  2. The authors of TabLLM observed hallucinations to be a source of errors while using LLMs for reformatting purposes. Was this something that was also observed while generating the reformatted metadata?

  3. Given that Unipredict-light does comparably / better than Unipredict-heavy, I am not entirely sure if including the metadata actually helps improve the model performance. The ablation study presented on page 8 argues that Unipredict-heavy is more robust because the loss in performance is less when not using the label confidence estimates during training. But I am not sure why having metadata (or lack thereof) should impact how the model handles confidence estimates.

Typographic edits:

  1. Abstract: "Here, we show that scaling up an LLM to extensive tabular inputs and predicting of target variables following the input instructions" -> is a bit unclear what this is trying to convey.
  2. Abstract: "our method outperforms XGBoost over 100% on the low-resource setup" -> not very clear what this means, maybe needs some rewording ?
  3. Page 1, para 2: "most previous methods fall short of assuming a fixed target" -> most previous methods fall short by assuming a fixed target
  4. Page 5: Learning: "update the model based on the discrepancies with augmented target sequences" -> this is a bit unclear. It would be good to specify if this is log-likelihood based training, or something else (e.g., RL training based on the BLEU / Rouge score between the model prediction and the ground truth).

[1] Guo, Chuan, et al. "On calibration of modern neural networks." International Conference on Machine Learning. PMLR, 2017.

[2] Lin, Stephanie, Jacob Hilton, and Owain Evans. "Teaching models to express their uncertainty in words." arXiv preprint arXiv:2205.14334 (2022).

Comment

Dear reviewer,

Thank you for the valuable comments. We hope the following clarifications can address your concern.

1. Backbone Choice

The authors do not mention any details about the actual model backbone used for the UniPredict training.

We have included the information about the model backbone used for UniPredict training in the manuscript, specifying that GPT-2 serves as the backbone.

2. Performance Comparison

While the authors do evaluate over a suite of Table tasks, it is very hard to position the model's performance compared to other proposed methods.

We acknowledge the suggestion to compare the model's performance on well-established datasets from prior literature, such as the Blood, Bank, Calhousing, Car, Credit-g, Diabetes, Heart, Income, and Jungle datasets used in [1]. However, we focus on a more extensive set of tabular datasets and implement the most advanced baselines in various architectures for comparison. This approach enables a broader evaluation and comparison of our proposed model.

Also, some of these datasets are present in our experiments, though not under the same names. For example, the results of the Diabetes dataset can be accessed under the name uciml-pima-indians-diabetes-database in Table 4, Appendix C.1.

The TabLLM baseline implemented is considerably weaker compared to what was proposed in the original paper.

Regarding the performance of TabLLM, we want to clarify that we aimed for a fair comparison by giving both TabLLM and UniPredict the same GPT-2 backbone and hence the same parameter size.

3. Data Preprocessing & Augmentation

Given that model confidence is a part of the autoregressive prediction objective, how calibrated are the generations from the model during evaluation?

The current method does not strictly enforce the return of exact probabilities. While the predictions may not form a valid probability distribution, they are still useful for inspecting the confidence of the model predictions.

While it is possible to use ECE as an objective to train the model, we emphasize that the primary purpose of target augmentation is to provide LLMs with richer supervision, enabling them to learn more about the nature of tabular tasks.
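For reference, expected calibration error could be measured over the verbalized confidences roughly as follows; this is a minimal sketch with the standard equal-width binning of Guo et al. (2017), not a procedure or result reported in the paper:

```python
import numpy as np

# Minimal ECE sketch over verbalized confidences (standard 10-bin equal-width
# binning as in Guo et al., 2017); illustrative only, not reported in the paper.
def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # |accuracy - mean confidence| in the bin, weighted by the bin size
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Example: confidence of the predicted class vs. whether the prediction was right.
print(expected_calibration_error([0.9, 0.8, 0.55, 0.95], [1, 1, 0, 0]))
```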

The authors of TabLLM observed hallucinations to be a source of errors while using LLMs for reformatting purposes. Was this something that was also observed while generating the reformatted metadata?

Hallucinations tend to occur when the LLM does not know what to respond or cannot find anything relevant to respond with. We attempt to resolve this problem by:

  • Giving the LLM a strict template to follow.
  • Giving clear definitions of the tasks we perform.
  • Giving alternatives for the LLM to respond with if it does not know what to answer. We rule out such datasets from the training set for future fine-tuning.

Listing 3, Appendix A.2 gives an example of our reformatting setup described above. We did not encounter hallucination errors from this setup.

4. UniPredict Variants

I am not entirely sure if including the metadata actually helps improve the model performance.

The use of metadata and label probability is considered a pair of dependent factors, and their removal impacts the model's performance. UniPredict-light is a lighter version but may face more failure cases due to instability, for example being more affected by bad feature values. UniPredict-heavy is more robust when the quality of training data is not good enough.

5. Typographic edits

Thank you very much for providing these suggestions. We have updated relevant sections in our manuscript. You can review the changes that are marked in orange. In response to your concern, here is what we meant to say:

“Here, we show that scaling up an LLM to extensive tabular inputs and predicting target variables following the input instructions” – it is a typo. The message we would like to convey is that we demonstrate the scalability of an LLM to extensive tabular datasets, enabling it to comprehend diverse tabular inputs and predict target variables following the provided instructions.

“our method outperforms XGBoost over 100% on the low-resource setup” – In the low-resource few-shot setup, we observed a 100%+ performance advantage compared with XGBoost.

“update the model based on the discrepancies with augmented target sequences” – We used instruction fine-tuning.

Review
Rating: 5

The paper proposes UniPredict, a model for analytical predictions on tabular data. The work applies an LLM to multiple tabular datasets, training a single LLM on an aggregation of 169 tabular datasets with diverse targets. The work compares its performance to other neural-network-based and tree-based baselines and shows improvements in prediction performance, especially in few-shot regimes.

Strengths

  • The major strength of the paper lies in its elegant method of constructing prompts, feature serializations of tabular data, and targets that the LLM (the generative model presented in the paper) can take as input.
  • Incorporating multiple tabular datasets to build a kind of pretrained LLM is another strength that the paper exhibits. Moreover, the strength in the few-shot learning scheme may also provide a hint for extending the work toward a more robust pretrained model for prediction on tabular datasets.

Weaknesses

The major weakness of the paper lies in the experiments:

  • A decent hyperparameter tuning of the compared baselines should be conducted.
  • The results should also show some statistical measures for the performance comparisons (e.g., critical plots).
  • There may be some form of label leakage in the experiments. For instance, the example "Listing 7, A.3" shows an example with "Unnamed: 0 is 2346". This comes from the way a csv file is saved, in which it lists the index of the dataset. In some cases, if the data is ordered by the magnitude of the target, this serves as a rank of the target, which might indicate a leakage of labels (target variables).
  • It would be good to observe how the proposed model performs on datasets with more samples in the few-shot settings.
  • Encoding categorical variables with an ordinal encoder (for the compared methods) might not be the best option for handling categorical variables. It would also be good to have some comparison with models that handle categorical variables well (e.g., CatBoost).

It is unclear how to interpret class probability as confidence. There should be a distinct definition of the terms for readers' understanding.

Questions

  • How are the hyperparameters for the comparing baseline selected?
  • What are some statistical testing results on the performance comparisons? Can we really say that the proposed method outperforms other comparing baselines?
  • How does the model do in the few-shot regime where the number of samples is greater than the reported datasets?
  • Can we interpret the class probability (in the target augmenting step) as the confidence?
  • Are there better approaches in handling numerical values in the prompts? How does the model perform without using the numerical features?
Comment

Dear reviewer,

Thank you for your valuable comments. We appreciate the opportunity to address your queries with the following clarifications.

1. Hyperparameter Selection

How are the hyperparameters for the comparing baseline selected?

We performed hyperparameter tuning, and the best set of hyperparameters is reported in Appendix B.1.

2. Performance Comparisons

What are some statistical testing results on the performance comparisons? Can we really say that the proposed method outperforms other comparing baselines?

The comprehensive evaluation on 100+ tabular datasets presented in our work is one of the most extensive assessments in the literature. We believe these results sufficiently represent the advantages of UniPredict. While we did not perform specific statistical testing, the scale and diversity of the experiments contribute to the robustness of our findings.

3. Interpretation of Class Probability

Can we interpret the class probability (in the target augmenting step) as the confidence?

It is crucial to clarify that the class probability in the target augmenting step is conceptually different from a statistical 'confidence' measure. Although they may share commonalities, we do not claim them to be identical. The class probabilities are more akin to the ground-truth 'logits' used for training.

Our rationale behind target augmentation is that converting targets into strings of individual label confidences makes fine-tuning more effective. To illustrate, consider two cases: in the first, we want the LLM to predict a class, say class ‘1’; in the second, we want it to give the full probabilities, such as ‘class 0 is 0.1, class 1 is 0.5, class 2 is 0.2, class 3 is 0.2’. The latter, involving full probability information, provides more signal for fine-tuning, resulting in better convergence.
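As a concrete sketch of how such an augmented target string could be produced, the snippet below uses XGBoost's predicted probabilities; the data and formatting are illustrative, and only the n_estimators=100 setting is mentioned elsewhere in this discussion:

```python
import numpy as np
from xgboost import XGBClassifier

# Hedged sketch of turning fitted label probabilities into an augmented target
# string like the example above; data and formatting are illustrative only.
X = np.random.rand(200, 6)
y = np.random.randint(0, 4, size=200)

teacher = XGBClassifier(n_estimators=100)  # setting mentioned in this discussion
teacher.fit(X, y)

probs = teacher.predict_proba(X[:1])[0]
augmented_target = ", ".join(f"class {c} is {p:.1f}" for c, p in enumerate(probs))
# e.g. "class 0 is 0.1, class 1 is 0.5, class 2 is 0.2, class 3 is 0.2"
```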

4. Handling Numerical Values

Are there better approaches in handling numerical values in the prompts? How does the model perform without using the numerical features?

The primary focus and contribution of our work lie in presenting a comprehensive framework for handling tabular prediction in various tasks. While we did not propose specific techniques for handling numerical features, UniPredict still outperforms traditional methods in numerous scenarios. We anticipate that the performance of UniPredict will further improve with the incorporation of more advanced numerical representation techniques in the future.

Comment

Dear reviewers,

We appreciate your thorough review of our work. We recognize the valuable insights provided during the review process and would like to address some key points.

The main contributions of our paper include:

  • Unified tabular prediction: Support prediction for tabular inputs with diverse columns and target variables with a unified model. Our method involves fine-tuning a GPT-2 model as the backbone.
  • Large-Scale Experiments: We conducted extensive experiments on 160+ datasets, showcasing that our model belongs to the top-performing group.
  • Claim of Universal Tabular Prediction: Leveraging the exceptional performance of GPT-2, we assert that advanced LLMs have the potential to achieve even better results, positioning them as universal tabular predictors.

We acknowledge concerns raised about certain omitted details in our final manuscript that may have caused confusion. Specifically:

  • In the UniPredict framework, we utilized GPT-2 as the backbone model. We have updated the hyperparameter choices in the revised manuscript. In detail, we instruction fine-tune it with AdamW (lr=5e-5, betas=(0.9, 0.999), epsilon=1e-8, weight_decay=0) for 3 epochs; a hedged configuration sketch is given after this list.
  • To ensure a fair comparison, we standardized all backbones in LLM-based tabular prediction models to GPT-2. Our assertion of LLMs being universal tabular predictors is based on the observed solid performance within our framework.
  • We have also uploaded the runnable code to the supplementary materials to clarify the concern of reproducibility from the reviewers.
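A hedged sketch of this fine-tuning configuration in Hugging Face Transformers is given below; the batch size and the dataset of serialized prompt/target sequences are assumptions, since only the optimizer settings above are reported:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

# Hedged sketch matching the reported hyperparameters (GPT-2, AdamW, lr=5e-5,
# betas 0.9/0.999, eps=1e-8, weight_decay=0, 3 epochs). The batch size and the
# dataset of serialized prompt/target sequences are assumptions.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

args = TrainingArguments(
    output_dir="unipredict-gpt2",
    num_train_epochs=3,
    learning_rate=5e-5,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    weight_decay=0.0,
    per_device_train_batch_size=8,  # assumed; not reported in the response
)

# train_dataset would hold the serialized instruction/target sequences (omitted here):
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```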

We have updated the manuscript with new lines marked in orange. For the reviewers’ questions, please find the detailed explanations below.

Comment

Dear reviewers,

As the review discussion stage approaches its conclusion, we kindly request that you take a moment to review our revised paper and assess whether our response addresses your concerns. If you have any additional questions or require further clarification, please do not hesitate to inform us. We value the opportunity to continue the discussion if necessary.

Best regards,

Submission 990 Authors

Comment

Dear reviewers,

As the review process nears completion, we kindly request your valuable feedback on our revised version and individual responses.

Best regards,

Submission 990 Authors

AC Meta-Review

This submission contributes a fine-tuned LLM model for tabular data. The submission generated many solid discussions. The reviewers found the overall topic interesting and appreciated several aspects of the work, such as the incorporation of multiple tabular datasets. They were disappointed that the work did not position itself well with regard to TabPFN and had doubts about the TabLLM baseline, both of which are very relevant baselines.

Why Not a Higher Score

The ratings are borderline and, in my opinion, fair. The authors could have done a stronger empirical study to really demonstrate the value of the proposed solution. I am not convinced that it would not have revealed weaknesses.

Why Not a Lower Score

There are good things about the manuscript: the idea is an interesting one and the empirical evidence is useful. I would not lower the score.

Final Decision

Reject