PaperHub
Rating: 4.3 / 10 · Rejected · 4 reviewers
Individual ratings: 8, 1, 3, 5 (min 1, max 8, std 2.6)
Average rating: 4.3
Confidence:
Correctness: 2.0
Contribution: 2.0
Presentation: 2.3
ICLR 2025

Quantifying Prediction Consistency Under Model Multiplicity in Tabular LLMs

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05

Abstract

Keywords
trustworthy large language models, reliable machine learning, tabular data, model multiplicity, high-stakes application

Reviews and Discussion

Review (Rating: 8)

This paper addresses the challenge of multiplicity in fine-tuning LLMs (i.e., fine-tuning under different conditions), particularly when applied to tabular predictions. While these models may achieve comparable average performance, they can generate conflicting predictions for the same inputs. To address this, the paper introduces a consistency measure that evaluates model behavior in the embedding space, focusing on the local region around the input. Additionally, the paper provides theoretical guarantees for the metric and supports it with empirical validation.

Strengths

Originality: The paper introduces the concept of fine-tuning multiplicity in LLMs applied to tabular data. The problem is interesting and possibly unexplored. A key originality is the proposed consistency measure which offers a novel way to assess prediction robustness without requiring the retraining of multiple models.

Quality: The paper is well written and motivated with the proposed metric supported by both theoretical analysis and empirical evidence.

Clarity: The paper is clearly written, effectively contextualizing the problem within the scope of tabular data.

Significance: The issue addressed is highly relevant when using LLMs for tabular data, especially in terms of evaluating prediction robustness. The proposed metric tackles a critical challenge in ensuring the reliability of LLM-based classification systems for tabular data.

The extensive ablations across the different dimensions are a particular strength.

Weaknesses

  • The datasets assessed are conventional tabular datasets (<50 features). It would be useful to understand whether the findings translate to the higher-dimensional tabular datasets often found in genomics and physics; suggested examples include the Higgs (physics) and TCGA (genomics) datasets. This is especially relevant to how dimensionality interplays with the embedding space.

  • The authors are encouraged to contrast a bit more explicitly with other related approaches through which multiplicity and consistency can be assessed --- e.g., ensemble-based methods.

Questions

  • How would the results translate to high-dimensional data? Suggested examples include the Higgs (physics) and TCGA (genomics) datasets.

  • The authors mention an easy extension of the theory to multi-class problems. Do the empirical results hold beyond binary classification on multi-class tabular datasets, or are there any nuances? It would be interesting to understand this in terms of the method's generality.

Comment

Thank you for your positive review and feedback!

  • Higher dimensional datasets

Regarding higher dimensional datasets: yes, our method would extend to higher dimensional datasets. However, we are constrained by the context window of the LLM, which is 4096 tokens for our model (and most current encoder-decoder models). Extremely high dimensional inputs would have a hard time fitting within this context window. This is a fundamental challenge of the LLM rather than of our method. The limited context window size of current LLMs therefore constrains applicability to higher-dimensional datasets, though extending context windows is an active area of research [1,2]. We have highlighted this in our limitations section.


  • Multi-Class Problems with an Example

Our consistency measure is tied to the confidence in predicting a specific class (i.e., the probability values derived from softmax logits), and not directly to the predicted labels. In a multi-class setting, consider a classification task with three classes: A, B, and C. For a given input $x$, the model assigns softmax probabilities $f_A(x)$, $f_B(x)$, and $f_C(x)$ to the three classes, where the probabilities satisfy $f_A(x) + f_B(x) + f_C(x) = 1$. Assume that class A has the highest probability ($f_A(x) > f_B(x)$ and $f_A(x) > f_C(x)$), making it the predicted class. In this case, we only need to compute the consistency measure for class A.

To compute the consistency for class A, we sample points $x_i$ from the local neighborhood of $x$ in the embedding space and evaluate the stability of $f_A(x)$:
$$S_{k,\sigma}(x, f_A) = \frac{1}{k}\sum_{x_i \in N_{x,k}} f_A(x_i) - \frac{1}{k}\sum_{x_i \in N_{x,k}} |f_A(x) - f_A(x_i)|.$$

By applying our measure to the softmax logit of the predicted class (class A), the consistency measure provides insight into the robustness of the prediction to that specific class label (class A).
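A minimal sketch of this computation, assuming access to an embedding of the input and a callable that returns the softmax probability of the predicted class; Gaussian perturbations of scale $\sigma$ are used here as one plausible way to sample the neighborhood, and all names are illustrative rather than taken from the authors' code:

```python
import numpy as np

def consistency(embed_x, predict_proba, k=30, sigma=1e-2, seed=0):
    """Estimate S_{k,sigma}(x, f_A) for the predicted class A.

    embed_x       : 1-D array, embedding of the input x
    predict_proba : callable mapping an embedding to the softmax
                    probability f_A of the predicted class A
    k             : number of neighborhood samples
    sigma         : perturbation radius in embedding space
    """
    rng = np.random.default_rng(seed)
    f_x = predict_proba(embed_x)                                    # f_A(x)
    # Sample k perturbed points around x in the embedding space.
    neighbors = embed_x + sigma * rng.standard_normal((k, embed_x.shape[0]))
    f_xi = np.array([predict_proba(z) for z in neighbors])          # f_A(x_i)
    # Local average minus average absolute deviation from f_A(x).
    return f_xi.mean() - np.abs(f_x - f_xi).mean()
```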


[1] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023

[2] Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023a.

Comment
  • Related Approaches

We appreciate the suggestion to contrast our work with other methods for assessing multiplicity. Existing approaches provide valuable perspectives. For example, [1] tackles multiplicity by formulating and solving integer programming problems to identify models that maximize prediction disagreement. Similarly, [2] discusses three aggregation techniques to address multiplicity: (i) mode aggregation, which outputs the majority vote among models to minimize disagreement; (ii) randomized predictions, which randomly samples predictions from multiple trained models; and (iii) random model selection, where a single model is randomly chosen and applied consistently. Other metrics have been proposed to measure predictive multiplicity by quantifying the spread of output scores within the Rashomon set, e.g., Rashomon Capacity [4]. While these metrics provide nuanced insights, they are computationally expensive, particularly for TabLLMs, as they often require retraining numerous models. An exception is the work by [5], which introduces a dropout-based approach to efficiently explore the Rashomon set in neural networks. This method uses dropout to create multiple competing models by randomly dropping weights and aggregates predictions from these models. However, their approach requires an initial validation step to ensure that all dropout-generated models are indeed in the Rashomon set (i.e., that they achieve similar accuracy or loss), adding computational overhead. In contrast, our method addresses the robustness of individual predictions to multiplicity without necessitating expensive retraining or extensive model aggregation, making it more efficient for Tabular LLMs.


[1] Marx, Charles, Flavio Calmon, and Berk Ustun. "Predictive multiplicity in classification." International Conference on Machine Learning. PMLR, 2020.

[2] Black, Emily, Manish Raghavan, and Solon Barocas. "Model multiplicity: Opportunities, concerns, and solutions." Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. 2022.

[3] Watson-Daniels, Jamelle, David C. Parkes, and Berk Ustun. "Predictive multiplicity in probabilistic classification." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 37. No. 9. 2023.

[4] Hsu, Hsiang, and Flavio Calmon. "Rashomon capacity: A metric for predictive multiplicity in classification." Advances in Neural Information Processing Systems 35 (2022): 28988-29000.

[5] Hsu, Hsiang, Guihong Li, and Shaohan Hu. "Dropout-Based Rashomon Set Exploration for Efficient Predictive Multiplicity Estimation." ICLR (2024).

Comment

Thank you for the detailed responses.

Adding these to the revised paper would strengthen it and ensure clarity.

One note: since the authors mention an LLM context window of 4096 tokens, I want to highlight that recent LLMs have context windows of 32k, 64k, and 128k, so it would be useful to try higher-dimensional datasets with LLMs that can handle the larger context window. The authors are encouraged to try this.

That said, I retain my score and my positive assessment of the paper.

Review (Rating: 1)

The paper aims to develop a consistency measure for evaluating prediction robustness under model multiplicity in fine-tuned Tabular Large Language Models (LLMs). The authors claim that this measure provides probabilistic guarantees that predictions will remain stable over a range of equally well-performing fine-tuned models, without needing extensive retraining. However, the main result is incorrect, and the result (except for the experimental part) is irrelevant to "tabular LLMs".

Strengths

The paper brings forward the important problem of efficiently measuring prediction robustness in tabular LLM finetuning.

Weaknesses

  • The introduced “consistency” term lacks meaningful interpretation. Although it is intended to indicate prediction robustness across models, Theorem 1 does not support this interpretation. The result (even if correct) says that if a data point has a sufficiently high consistency score with respect to a random model, the prediction of another random model from the same broad class of fine-tuned models will be at least the consistency level (meaning that it approaches class 1) with high probability. But what if the underlying class label should be 0?

  • The paper’s theoretical formulation has significant issues. It lacks a clear definition of “random functions” and assumes that “functions are independent and identically distributed (i.i.d.)” without providing a rigorous explanation. These assumptions are unconventional and do not align with standard definitions typically found in probability theory.

  • The consistency measure itself is random, as it depends on sampled points, and its properties would only stabilize as the sample size k approaches infinity. This suggests that further study should focus on the properties of the deterministic limit of the measure rather than its finite-sample properties.

  • Assumption 1 is trivial. If F, F' are bounded functions, certainly the variance of Z is bounded.

  • The paper claims "while model multiplicity in machine learning has been studied in various contexts, the unique challenges of fine-tuning multiplicity in Tabular LLMs remain relatively unexplored." While the paper claims to address fine-tuning multiplicity specifically for tabular LLMs, the main results are general and do not leverage any unique characteristics of tabular data or LLMs. The main result is irrelevant to the paper scope.

Questions

Please see my comments and suggestions above.

Comment
  • Clarification on Consistency Measure

We would like to clarify that our measure does apply to label 0, as well as to multi-class classification. We clarify a possible misunderstanding: our consistency measure is tied to the confidence in predicting a specific class (i.e., the probability values derived from softmax logits), and not directly to the predicted labels. This was discussed in Remark 1 of the original submission. We have revised Remark 1 to further clarify this.

Remark 1 Our consistency measure is tied to the confidence in predicting a specific class and not the predicted labels. The concept can be seamlessly applied by considering the softmax logits for predicting any given class. This also extends to multi-class classification by using the softmax logits for each class, thereby maintaining the measure's applicability across various classification tasks.

If the underlying class prediction is 0 (negative class), the softmax outputs would assign a probability greater than 0.5 to the negative class. The consistency measure is then computed on the softmax output.

Take the example of a constant 0 predictor mentioned by Reviewer szsk: a constant 0 predictor, when interpreted in terms of softmax outputs, would assign a softmax probability of 1 to the "negative" class for all inputs. If we compute consistency for the "negative" class (for which the softmax output is 1), the consistency becomes $S(x,f) = 1$. The consistency measure correctly reflects the model's confidence in the target class. The "negative" class for the constant 0 predictor would have maximal consistency, and likewise for a constant 1 predictor. This behavior aligns with the purpose of the measure.
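A tiny numeric check of this point, using the consistency estimator $S(x,f)$ (local mean minus mean absolute deviation); the sampling choices and names below are illustrative:

```python
import numpy as np

# Constant 0 predictor: the softmax probability of the "negative" class is 1 everywhere.
f_neg = lambda z: 1.0

x = np.zeros(4)                                     # any input embedding
rng = np.random.default_rng(0)
neighbors = x + 1e-2 * rng.standard_normal((30, 4))

f_x = f_neg(x)
f_xi = np.array([f_neg(z) for z in neighbors])
S = f_xi.mean() - np.abs(f_x - f_xi).mean()
print(S)  # 1.0 -> maximal consistency for the negative class
```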

Multi-Class Example

In a multi-class setting, consider a classification task with three classes: A, B, and C. For a given input $x$, the model assigns softmax probabilities $f_A(x)$, $f_B(x)$, and $f_C(x)$ to the three classes, where $f_A(x) + f_B(x) + f_C(x) = 1$. Assume that class A has the highest probability ($f_A(x) > f_B(x)$ and $f_A(x) > f_C(x)$), making it the predicted class. In this case, we need to compute the consistency measure with respect to class A:
$$S_{k,\sigma}(x, f_A) = \frac{1}{k}\sum_{x_i \in N_{x,k}} f_A(x_i) - \frac{1}{k}\sum_{x_i \in N_{x,k}} |f_A(x) - f_A(x_i)|.$$

By applying our measure to the softmax logit of the predicted class (class A), the consistency measure provides insight into the robustness of the prediction to that specific class label (class A): $\Pr(F_A(x) \geq S(x, F'_A(x)) - \epsilon)$ is large.

Mathematically, we can also apply our measure to the other non-predicted classes; however, this would only provide a trivial interpretation (e.g., $\Pr(F_C(x) \geq S(x, F'_C(x)) - \epsilon)$ is large, where $S(x, F'_C(x))$ is small).


  • Clarification on Theoretical Formulation

Random functions are a well-established and foundational concept in probability theory, extensively used in numerous fields such as stochastic processes and statistical learning [1].

In our study, we consider a broad class of equally well-performing fine-tuned models, denoted as $\mathcal{F}_{\delta} = \{ f : \text{err}(f) \leq \text{err}(f_0) + \delta \}$, where $\text{err}(f)$ is the empirical error of model $f$ on a given dataset, $f_0$ is a reference model with satisfactory performance, and $\delta$ is a small error tolerance specific to the application. This class includes all models that achieve similar performance to $f_0$, ensuring they are "equally well-performing."

We model the process of fine-tuning as a random process, where each fine-tuned model is a realization of a random function $F(\cdot)$ drawn from an underlying distribution over $\mathcal{F}_\delta$. The randomness arises from factors like different random seeds used during training, which affect weight initialization and data shuffling.

When fine-tuning models with different random seeds while keeping the training data and hyperparameters fixed, each model is generated independently under the same conditions. Each fine-tuning run is independent because it uses a different random seed, leading to independent stochastic updates during training. Since all runs use the same training procedure and data (except for the seed), the models are drawn from the same underlying distribution over $\mathcal{F}_\delta$. Thus, we assume, given a sample $x$, the outputs $F(x)$ and $F'(x)$ are independent and identically distributed random variables, which aligns with standard probabilistic definitions.


[1] https://encyclopediaofmath.org/wiki/Random_function

Comment

Thank you for providing further explanation. However, the multi-class example does not sound right to me. While the main result applies to a given class, it does not apply to the class that has the highest probability -- as this introduces selection bias. In other words, the result does not apply to a class index that is random, which could violate the independence assumption.

Regarding the theoretical results, I would like to see more rigorous definitions of concepts such as "fine-tuning as a random process", "random function", "distribution over $\mathcal{F}$" (a function class), and "each model is generated independently". While I understand stochastic processes, random variables, and function classes, the concepts listed above are not well defined.

Comment
  • Stability of Guarantee with Sample Size k

From Theorem 1, as the sample size $k$ increases, the consistency guarantee holds with a probability that approaches 1 exponentially. This ensures that with sufficiently large $k$, the measure stabilizes without the need for infinitely many samples. Empirically, we observed that $k = 30$ achieves a good balance between computational efficiency and performance, as shown in the ablation study (see Table 7 in Appendix C), while further increasing $k$ incurs higher computational cost with diminishing returns.


  • Bounded Variance of Z

The variance of $Z$ is indeed bounded for bounded functions $F$ and $F'$, and we could have used the trivial bound of $\beta = 2$ in our guarantee. However, we anticipate that $\beta$ is significantly smaller, particularly on the data manifold, because $F$ and $F'$ are models fine-tuned from the same pretrained model on the same dataset. For samples lying on the data manifold—where realistic samples exist—we expect several models (from the same pretrained model) fine-tuned on the same dataset with different training seeds to exhibit "similar" prediction probabilities. However, fine-tuned models can differ significantly in regions outside the data manifold, as the absence of training samples in these areas means there is no shared information to constrain their behavior. We have now included a discussion of this in Remark 3.


  • Unique Challenges of Multiplicity in TabLLMs

Our focus on fine-tuning multiplicity in Tabular LLMs is deliberate, motivated by unique challenges that arise when applying LLMs to tabular data—a domain where they are increasingly used but for which they were not originally designed. Our results are specifically inspired by empirical observations unique to Tabular LLMs, as demonstrated in our experiments (see Section 2 and Figure 1). These experiments reveal that fine-tuned Tabular LLMs exhibit noisy and non-smooth decision boundaries. This highlights that model predictions alone may be unreliable and motivates our perturbation-based approach to quantifying multiplicity.

This instability is likely due to the mismatch between the LLMs' pre-training on language data and their adaptation to tabular tasks, which often involve a mix of numerical and categorical features. These issues are critical in tabular data applications, where arbitrary predictions can have significant impacts in high-stakes fields like finance and healthcare [1].

Our problem is inherently meaningful when considering small model changes (variations in the fine-tuning process), such as changes in random seeds or weight initialization. Large or drastic model changes—such as fine-tuning on entirely different datasets or adapting models pre-trained on vastly dissimilar corpora—are outside the scope of this work, as such changes naturally lead to different predictions. Addressing multiplicity for drastic changes might not only be infeasible but also conceptually irrelevant.


[1] van Breugel, Boris, and Mihaela van der Schaar. "Why tabular foundation models should be a research priority." ICML (2024).

Comment

Thanks for your explanation regarding "Unique Challenges of Multiplicity in TabLLMs". Still, it does not convince me that the developed main result has anything specific to tabular LLMs. It is claimed that "This instability is likely due to the mismatch between the LLMs' pre-training on language data and their adaptation to tabular tasks ..." and "Our problem is inherently meaningful when considering small model changes ...", but I would like to see more analysis or insightful discussion of claims like this.

You mentioned "We anticipate that $\beta$ is significantly smaller, particularly on the data manifold, ...". However, the argument is too vague and lacks rigorous grounding. And it does not directly address my comment that the main result is irrelevant to the paper's scope.

Comment

Thank you for your response!

Rigorous Definitions of Concepts

We restate our notation here for clarity. Capital letters denote random variables. There are multiple sources of randomness here, from the data and from the fine-tuned model parameters.

Let $F$ denote a fine-tuned model, i.e., $F(u) = g(m, W, u)$, where $m$ is the fixed pre-trained model, $u$ is an input, and $W$ denotes the parameters that were fine-tuned (a random variable). Here $W$ (the parameters) is a random variable because of variations in seed, hyperparameters, training process, etc., leading to a random well-performing fine-tuned model every time. The fine-tuned models $F$ are assumed to belong to a set of well-performing models $\mathcal{F}_\delta$.

For simplicity, we can assume each model $f_i$ in the set of well-performing models to be a discrete random variable because it uses parameters $W = w_i$ with an associated probability $p_i$, where $\sum_{i=1}^{|\mathcal{F}_\delta|} p_i = 1$. This description can also be extended to continuous random variables.

Then, the data point $X$ is another random variable. Given an input $X = x$, the output $F(x)$ represents the behavior of a random model $F$ at a fixed point $X = x$. The random variable $F(x)$ can follow any distribution consistent with the constraints imposed by $\mathcal{F}_\delta$, but we do not assume a specific distributional form. We only assume that $F(X) \mid X = x$ and $F'(X) \mid X = x$ are i.i.d., since the parameters $W$ and $W'$ would be i.i.d. (drawn with replacement from the set $\{w_1, w_2, \ldots\}$).

Multi-Class Scenario

We are somewhat confused by this comment and wanted to flag it to make sure there is no misunderstanding. In an attempt to clarify, let us consider a binary classification scenario with two classes, A and B, and corresponding softmax logits $f_A(x)$ and $f_B(x)$, such that $f_A(x) + f_B(x) = 1$. Our measure applies to all classes but is informative for the highest-probability class.

The consistency measure can be defined for class A as $S(x, f_A)$ and for class B as $S(x, f_B)$. The former denotes the consistency of the prediction of A: if $S(x, f_A)$ is high, it means that across many fine-tuned models, the logit for class A is also likely to be high (as per our theoretical guarantee). Similarly, $S(x, f_B)$ is the consistency of the prediction of B: if $S(x, f_B)$ is high, it means that across many fine-tuned models, the logit for class B is also likely to be high (as per our theoretical guarantee).

However, $S(x, f_A)$ and $S(x, f_B)$ are not going to be very high together, because of their complementary nature, so they will not both be informative. For example, if $S(x, f_A) = 0.5 + \epsilon$ and $S(x, f_B) = 0.2 + \epsilon$, we know the logit for class A will likely be higher than 0.5, so predictions of A will be more consistent. We also know that the logit for class B will be higher than 0.2, though this is not very informative.

This complementary nature of the measure ensures that the consistency score for the chosen class aligns with its predicted confidence. For example, just as the predicted probability of the chosen class is given by $\max(f_A(x), f_B(x))$, the consistency score of the chosen class can similarly be expressed as $\max(S(x, f_A), S(x, f_B))$, which is what we do in the experiments.

Comment

Why focus on Tabular LLMs

While our main result could be broadly applicable to multiplicity in LLMs for classification tasks, our work is specifically motivated by, contextualized within, and empirically validated on Tabular LLMs. Below, we outline the reasons for this focus:

Specific Motivational Use-Case in High-Stakes Applications: LLMs are increasingly appealing for tabular tasks in high-stakes applications, e.g., hiring, finance, healthcare, etc [1]. While traditional models like neural networks and gradient boosting machines continue to dominate when large amounts of labeled tabular data are available, their performance deteriorates significantly in data-scarce scenarios. In contrast, LLMs can leverage their pre-trained knowledge and reasoning capabilities to maintain strong performance, even with limited labeled data, through few-shot learning (Hegselmann et al., 2023). However, prediction consistency is critical in these domains [1], e.g., the same application is approved by one fine-tuned model and rejected by another equally-well-performing fine-tuned model for employment, loan, etc. (with just minor changes in seed, hyperparameters, etc.) Arbitrary and conflicting predictions (from fine-tuning multiplicity) in these settings can lead to significant consequences, including reputational damage, erosion of trust, and regulatory non-compliance.

Much more multiplicity could also arise if the underlying pre-trained model is changed: in $F(u)=g(m,W,u)$, we would no longer have a fixed pre-trained model $m$; the model would instead become $F(u)=g(M,W,u)$, where $M$ is also a random variable. Here, since many more parameters are changing and each $M$ is trained on a different data distribution, it might not even be reasonable from an application standpoint to expect the predictions to remain consistent. In fact, multiplicity would be inevitable if the pre-trained LLM changes too (it is like using a different evaluation criterion altogether, since it is learnt from a different data distribution), and one cannot expect a single model $F=f$ to anticipate multiplicity across multiple pre-trained LLMs that were trained on fundamentally different data distributions.

Therefore, we restrict our claim to "small" model changes (hence introducing the term fine-tuning multiplicity), where we use the word small only to mean that fewer parameters (only $W$) change and the pre-trained model $m$ remains fixed. This is the regime where we can also refer to recent theoretical results on LoRA [2] to corroborate our theoretical assumptions. Our strength lies in the fact that, empirically, we still see a strong correlation between regions anticipated to be consistent by our measure using just a single model $f$ and regions that are actually consistent under fine-tuning multiplicity, which is in itself a surprising finding.

[1] Boris van Breugel and Mihaela van der Schaar. Why tabular foundation models should be a research priority. arXiv preprint arXiv:2405.01147, 2024.

[2] Yuchen Zeng and Kangwook Lee. The expressive power of low-rank adaptation, ICLR 2023.


Smaller Variance on the Data Manifold

We appreciate the reviewer's feedback and are happy to revise our presentation by replacing $\beta$ with its upper bound, $\beta \leq 2$, in our guarantees if that better aligns with the expectations for rigorous exposition. Our reason for including $\beta$ was to provide additional nuance and offer an intuitive argument about when $\beta$ might be small or approach its trivial upper bound. This intuition was not intended as a formal claim but rather as a way to contextualize the conditions under which our guarantees might be tighter.

Because $F$ and $F'$ are models fine-tuned from the same pretrained model on the same dataset, for samples lying on the data manifold—where realistic samples exist—we expect several models (from the same pretrained model) fine-tuned on the same dataset with different training seeds to exhibit "similar" prediction probabilities (i.e., lower variance). However, fine-tuned models can differ significantly (i.e., higher variance) in regions outside the data manifold, as the absence of training samples in these areas means there is no shared information to constrain their behavior.

Since the points for which we are measuring robustness are expected to correspond to real individual data, we expect them to lie on the data manifold, leading to $\beta$ being smaller than the trivial upper bound of 2.

Review (Rating: 3)

The work studies the problem of model multiplicity in the application of LLMs to tabular data. Model multiplicity refers to the phenomenon that multiple models of similar accuracy assign conflicting predictions to individual instances. The authors propose a measure of model multiplicity called consistency, provide a theoretical observation about it, and evaluate it on four standard tabular datasets.

Strengths

Model multiplicity is an interesting problem. It's certainly worth studying in the context of tabular datasets, and with regard to models resulting from multiple fine-tuning runs on the same data.

Weaknesses

(1)

The proposed measure of consistency has a few different issues that prevent it from being an informative measure in my opinion.

(a) The biggest issue with it is that it is apparently unable to identify constant predictors adequately. Constant predictors, by definition, cannot exhibit model multiplicity. Unfortunately, the measure is sensitive to the value of the constant: A constant 0 predictor would have consistency 0, whereas a constant 1 predictor would have consistency 1. Note that 0 is the smallest possible value consistency can take on, while 1 is the largest possible value. In other words, two equivalent predictors can have maximal or minimal consistency.

This shows why we cannot read anything into a particular value of consistency. The authors write: "We desire that the measure S(x, f) should be high if the input x is consistent across this broad class models." Unfortunately, even if S(x, f) = 0, it could be that the model output is perfectly "consistent" as the example of the constant 0 predictor shows.

(b) The measure S(x, f) is a random variable that depends on the number of sample points and the radius of the sphere. It's not clear how sensitive the measure is to these design choices. This raises the question of how conflicting values of consistency should be reconciled.

(c) There is no reason that the model family should respect the Euclidean geometry implicit in the definition of the consistency measure. This gives additional reason to believe that the measure cannot be useful across all model families.

In light of these issues, I prefer the definitions from prior work. For example, the measure of pairwise disagreement by Black et al. (2022) already gets around the above issues.

(2)

As a result of the definitional problems with consistency, I find the suggested main "theorem" to be less than helpful. Let's even set aside the fact that it's a direct application of the assumptions and Bernstein's inequality. It's not clear what we can read into it. Again, the constant 0 predictor gives the same output for any input, but the theorem would only say that $\Pr(F(x) > -\epsilon)$ is large. This is, of course, a vacuous statement.

(3)

The selection of datasets is rather disappointing. Although standard, these four datasets are very small, outdated, and problematic for different reasons. This is especially unfortunate, given the wide availability of many better tabular datasets. For example, the tableshift project has more than a dozen useful tabular datasets:

https://github.com/mlfoundations/tableshift

Likewise, the folktables package gives easy access to a plugin alternative and extension to the Adult dataset:

https://github.com/socialfoundations/folktables

There is really no good reason for the empirical evaluation to be so limited.

Questions

Could you please comment on the issues with the definition?

What are the advantages of consistency compared with existing measures?

Do you have results on other/larger datasets? If not, why not?

Comment

Thank you for your detailed and insightful feedback on our paper.

1(a) Clarification on Consistency Measure

We would like to first clarify that our consistency measure does apply to constant predictors and to multi-class classification. We clarify a possible misunderstanding: our consistency measure is tied to the confidence in predicting a specific class (i.e., the probability values derived from softmax logits), and not directly to the predicted labels. This was discussed in Remark 1 of the original submission. We have slightly revised Remark 1 to further clarify this.

Remark 1 Our consistency measure is tied to the confidence in predicting a specific class and not the predicted labels. The concept can be seamlessly applied by considering the softmax logits for predicting any given class. This also extends to multi-class classification by using the softmax logits for each class, thereby maintaining the measure's applicability across various classification tasks.

A constant 0 predictor, when interpreted in terms of softmax outputs, would assign a softmax probability of 1 to the "negative" class for all inputs. If we compute consistency for the "negative" class (for which the softmax output is 1), the consistency becomes $S(x,f) = 1$. The consistency measure correctly reflects the model's confidence in the target class. The "negative" class for the constant 0 predictor would have maximal consistency, likewise for a constant 1 predictor. This behavior aligns with the purpose of the measure.


Multi-Class Example

In a multi-class setting, consider a classification task with three classes: A, B, and C. For a given input $x$, the model assigns softmax probabilities $f_A(x)$, $f_B(x)$, and $f_C(x)$ to the three classes, where $f_A(x) + f_B(x) + f_C(x) = 1$. Assume that class A has the highest probability ($f_A(x) > f_B(x)$ and $f_A(x) > f_C(x)$), making it the predicted class. In this case, we need to compute the consistency measure with respect to class A:
$$S_{k,\sigma}(x, f_A) = \frac{1}{k}\sum_{x_i \in N_{x,k}} f_A(x_i) - \frac{1}{k}\sum_{x_i \in N_{x,k}} |f_A(x) - f_A(x_i)|.$$

By applying our measure to the softmax logit of the predicted class (class A), the consistency measure provides insight into the robustness of the prediction to that specific class label (class A): $\Pr(F_A(x) \geq S(x, F'_A(x)) - \epsilon)$ is large.

Mathematically, we can also apply our measure to the other non-predicted classes; however, this would only provide a trivial interpretation (e.g., $\Pr(F_C(x) \geq S(x, F'_C(x)) - \epsilon)$ is large, where $S(x, F'_C(x))$ is small).


1(b) Sensitivity to Sample Size and Radius

Based on Theorem 1, choosing a larger sample size $k$ is advantageous as it ensures the consistency guarantee holds with higher probability. However, this also increases the computational cost of model inference. We perform an ablation study on the sample size $k$, observing improved performance with increasing $k$ (see Table 7 in Appendix C).

We explore the effect of varying the perturbation radius $\sigma$. Results of this ablation study are summarized in Figure 7 and Table 8 in Appendix C. Best performance is observed at $\sigma = 10^{-2}$. When $\sigma$ is too small (e.g., $10^{-4}$), we essentially sample (almost) the same points, and our consistency measure is no more informative than the prediction probability. When $\sigma$ is too large (e.g., $10^{-1}$), one loses all information about the data point. To guide the choice of $\sigma$, one could consider the spread of the training data points in the embedding space, i.e., choose the perturbation radius $\sigma$ based on the variance of the training data. This is inspired by the adversarial robustness literature, where perturbation budgets are often tied to the spread of the data distribution, as discussed in [1].
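One possible way to operationalize this choice, as a sketch: assuming the embedded training points are available as a matrix, set $\sigma$ to a fraction of their average per-dimension spread (the function name and scale factor are illustrative, not values from the paper):

```python
import numpy as np

def choose_sigma(train_embeddings, scale=0.1):
    """Heuristic perturbation radius: a fraction of the average
    per-dimension standard deviation of the training embeddings.

    train_embeddings : (n, d) array of embedded training points
    scale            : illustrative fraction of the spread to use
    """
    per_dim_std = train_embeddings.std(axis=0)  # spread along each embedding dimension
    return scale * per_dim_std.mean()
```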


[1] Madry, Aleksander, et al. "Towards deep learning models resistant to adversarial attacks." arXiv preprint arXiv:1706.06083 (2017).

Comment

Thank you for your response. I just reviewed your revision.

I read Remark 1 a few times, but it did not help. It only convinced me further that the definition has problems. If you intend to apply the definition to the softmax probabilities, the problem will not go away. There are now many equivalent classifiers that have very different consistency values. For example, the constant (0.8, 0.2) predictor will have a very different consistency value than the (0.6, 0.4) predictor, even though both output the same class on all instances.

Due to the fact that equivalent predictors have very different consistency values, by your definition, the theorem you prove necessarily can't say much.

I continue to think that the two terms in the difference of S(x, f) don't interact well. The first step toward fixing the broken definition is to make consistency a fixed quantity rather than a random variable. This is a useful exercise.

Suppose the expectation $\mathbb{E}\, f(x_i)$ for $x_i$ drawn from the neighborhood $N$ is equal to $f(x)$. Then the first term of your definition is essentially $f(x)$. Suppose the function $f$ is Lipschitz and the radius of the neighborhood is $\sigma$. Then you expect the deviations in the second term to be about $O(\sigma)$. So, in expectation, your measure gives you something like

$$S(x,f) = f(x) \pm O(\sigma)$$

So what you're seeing, modulo random fluctuations, is just $f(x)$. Hence, you're seeing the scaling issues I pointed out.

Try to come up with a definition that is not a random variable, but a fixed quantity. It will reveal some of the issues to you and point the way towards improvement.

Comment

Thank you for your feedback and suggestions!

The classifiers (0.8, 0.2) and (0.6, 0.4) are not equivalent since they have different probability scores; however, they do predict the same label. Our measure captures more than just the predicted label—it also reflects the confidence of the prediction. For the two classifiers, the consistency bounds would be $\Pr(F(x) > 0.8 - \epsilon) \geq 1 - \delta$ and $\Pr(F(x) > 0.6 - \epsilon) \geq 1 - \delta$, respectively.

You are correct in the sense that our measure would tell you that a prediction would remain valid/robust (i.e., stay on the positive side of the decision boundary) under the two scenarios. However, our measure can tell us more. The higher consistency score for (0.8, 0.2) indicates stronger robustness, as it reflects a more confident prediction. While both classifiers predict the same label, our measure provides a more nuanced view by leveraging the softmax probability logits. This additional sensitivity is a strength of the measure, as it offers more information about the robustness of predictions. In our experiments, we already include results for both class 0 and 1.

Consider a binary classification scenario with two classes, A and B, and corresponding softmax logits $f_A(x)$ and $f_B(x)$, such that $f_A(x) + f_B(x) = 1$. Given $S(x, f_A)$ for one class, there is a one-to-one relationship with $S(x, f_B)$ for the other class. We can think of $f$ as a vector of softmax probabilities, e.g., $f = (f_A, f_B)$, and the consistency measure reflects the robustness of this model as a whole. When $f$ has low entropy (i.e., one class has high confidence), $S(x, f)$ is highly informative. Conversely, when $f$ has high entropy (i.e., predictions are distributed more evenly across classes), $S(x, f)$ is less informative, as expected. The measure inherently captures a property of the model output, not the individual classes in isolation. For three or more classes, the same principles apply.

Comment

You may have missed my main point.

You define consistency to be a random variable. It would help you a great deal if you tried to make it a population quantity, i.e., a property of a predictor (at a given input) and a data-generating distribution. A standard way of doing this is to take expectations over the random process. When you try to do this, I think you'll likely see some of the issues with the definition. My argument suggests that for small enough $\sigma$, your current measure is essentially just $f(x)$ plus random noise.

If you take expectations in your definition, you get $f(x)$ (the first term) plus some kind of local $\ell_2$-sensitivity of the function at $x$ (the second term). It's not clear why these should be added to each other and what we can learn from that.

Comment

1(c) Model family and Euclidean geometry

Our measure of consistency is grounded in the observation that non-robust predictions are often sensitive to slight perturbations in the input space (as demonstrated in Figure 2). For samples lying on the data manifold—where realistic samples exist—we expect several models (from the same pretrained model) finetuned on the same dataset with different training seeds to exhibit "similar" prediction probabilities. However, fine-tuned models can differ significantly in regions outside the data manifold, as the absence of training samples in these areas means there is no shared information to constrain their behavior.

Additionally, recent mathematical analyses of LoRA corroborate this perspective, showing that $\mathbb{E}_X \|f(X) - f'(X)\| \leq \Delta$ for a random variable $X$ over a bounded set [1].

Furthermore, our experimental results empirically validate the effectiveness of our measure. They demonstrate that leveraging the local neighborhood around a sample informs multiplicity, outperforming solely using the prediction probabilities $f(x)$.

Our problem is meaningful when considering small model changes (variations in the fine-tuning process), such as changes in random seeds or weight initialization. Large or drastic model changes—such as fine-tuning on entirely different datasets or adapting models pre-trained on vastly dissimilar corpora—are outside the scope of this work, as such changes naturally lead to different predictions. Addressing multiplicity for drastic changes might not only be infeasible but also conceptually irrelevant.


Regarding Black et al 2022

The pairwise disagreement metric by Black et al. (2022) is indeed a valuable metric, but it requires access to multiple trained models to calculate disagreements. In contrast, our approach is specifically designed to measure multiplicity using a single model, making it more practical in scenarios where retraining several models is computationally infeasible. Our measure thus differs from pairwise disagreement in that it assesses multiplicity using only a single model, while in fact aiming to get close to pairwise disagreement in our goal, as reflected in the correlations we measure.


2(a) Clarification on Interpretation of guarantee

To clarify, our consistency measure is based on softmax logits (as mentioned in Remark 1, line 298, of the original submission), reflecting the model's confidence in predicting a specific class. For a constant predictor outputting the negative class (a constant 0 predictor), the softmax probability for the negative class is 1 for all inputs. In this case, the theorem provides that $\Pr(F(x) \geq 1 - \epsilon)$ is large, which is what is expected.


(3) Additional Datasets

We appreciate the reviewer’s suggestion to include a broader range of datasets. In response, we have now included results for two additional datasets (making a total of 6 datasets), Heart and Car (see Tables 1 and 2 in the revised paper). Additionally, we plan to extend our analysis to three more datasets (Jungle, CalHousing, and Blood) in the final version. These updates were limited during the rebuttal phase due to time constraints.


[1] Yuchen Zeng and Kangwook Lee. The expressive power of low-rank adaptation, ICLR 2023.

Comment

I also reviewed the additional datasets you added.

The Heart dataset has 918 patients and 11 features. The Car dataset has 1728 data points and 6 features. I'd consider both, especially the latter, to be a bit of a toy dataset.

I recommend doing experiments on at least one larger dataset.

Comment

Thank you for your response and willingness to have a discussion with us regarding this.

We now understand your comment. If we assume $\mathbb{E}[f(X_i)] = f(x)$, where the $X_i$ are sampled from the neighborhood of $x$, and the model is $\Gamma$-Lipschitz, then in expectation our measure would satisfy $\mathbb{E}_{X}[S(x,f)] \geq f(x) - O(\Gamma\sigma)$.

However, the assumption $\mathbb{E}[f(X_i)] = f(x)$ can be problematic in practice. While this assumption may hold in idealized scenarios (e.g., linear predictors), it does not necessarily hold for all models (it can violate Jensen's inequality). Specifically, $f(x)$ itself can be misleading or unstable due to fine-tuning multiplicity (and indeed, in our experiments on all the datasets, we show that $f(x)$ alone does not correlate well with multiplicity, even with no threshold). For example:

Consider a case where $f(x) = 1$, but the neighboring predictions $f(x_i)$ around $x$ (i.e., for $x_i$ sampled from the neighborhood) are closer to 0. In this scenario, relying solely on $f(x)$ would fail to reflect the true predictions in the neighborhood. This behaviour can be observed in Figure 2, in our visualization of an experiment on synthetic data, where we observe noisy, locally non-smooth decision boundaries. This behaviour, where the prediction at a data point varies significantly from those of its neighboring samples, highlights why we do not assume $\mathbb{E}[f(X_i)] = f(x)$, why model predictions alone may be unreliable, and what motivates our perturbation-based approach to quantifying multiplicity.

Intuition for the variability term – While the mean provides an average measure of the function's outputs, it alone can be misleading without considering variability. For instance, consider a neighborhood with a constant (0.55, 0.45) classifier. In this case, the mean term of our measure is 0.55, and the variability is zero since the outputs are identical. Our guarantee would then state that $\Pr(F(x) > 0.55 - \epsilon)$ is large, indicating that predictions remain consistently on the positive side of the decision boundary.

Consider another neighborhood with a mix of high and low outputs (e.g., 1's and 0's). Here, the mean might still exceed 0.55 due to averaging, even exceeding the first case. Does this imply that this region is more robust? No: the variability term would be high in this case, reflecting the inconsistency of predictions in the region. The guarantee in this case, incorporating the variability penalty, would be $\Pr(F(x) \geq S - \epsilon)$, where $S$ is low due to the high variability. This implies that we cannot certify the robustness of a prediction in this region (the bound still holds but is non-informative). This highlights the importance of variability: without this term, misleading confidence could be assigned to highly variable regions simply due to a favorable mean. The goal of our consistency measure is to quantify proneness to fine-tuning multiplicity, not necessarily to approximate $f(x)$.
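A small numeric illustration of these two neighborhoods (the numbers are illustrative, and $f(x) = 0.55$ is assumed at the query point in both cases for comparability):

```python
import numpy as np

def S(f_x, f_neighbors):
    """Consistency: local mean minus mean absolute deviation from f(x)."""
    f_neighbors = np.asarray(f_neighbors, dtype=float)
    return f_neighbors.mean() - np.abs(f_x - f_neighbors).mean()

# Neighborhood 1: constant 0.55 outputs -> mean 0.55, zero variability.
print(S(0.55, [0.55] * 10))               # 0.55

# Neighborhood 2: mixed 1's and 0's with a higher mean but high variability.
mixed = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]    # mean 0.60
print(S(0.55, mixed))                     # 0.60 - 0.49 ≈ 0.11 -> low consistency
```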

Origin of the variability term – The variability term stems from the reverse triangle inequality step in the proof of our guarantee, which allows us to bound the behavior of $f(x)$ in terms of both the mean and the variability within the neighborhood: $f(x) \geq f(x_i) - |f(x) - f(x_i)|$. This inequality holds with probability 1 and ensures that variability within the neighborhood directly impacts the bounds on $f(x)$. The reverse triangle inequality shows that $f(x)$ is at least $f(x_i)$ minus the fluctuation $|f(x) - f(x_i)|$, ensuring $f(x)$ cannot drop significantly below $f(x_i)$ without a large deviation between $f(x)$ and $f(x_i)$.

Variability term in the theoretical guarantee – The guarantee $\Pr(F(x) > S(x,f) - \epsilon)$, where $S(x,f)$ incorporates both the mean and variability terms, is akin to asking how likely $F(x)$ is to deviate significantly from the central tendency of its neighborhood. This is analogous to quantifying the probability of a random variable exceeding one variability (or standard deviation) away from its mean. Including the variability term strengthens the guarantee by reflecting the stability of predictions in the neighborhood.

Empirical Validation – We have also experimented with using only the mean term of our measure and found it to perform better than using only the predicted probability $f(x)$. However, including the variability term improved the results further, as shown in the table below. This highlights the importance of accounting for both the mean and variability when robustly quantifying prediction consistency.

Comment

| Dataset | Number of Shots | Measure | Arbitrariness | Pairwise Disagreement | Prediction Variance | Prediction Range |
|---|---|---|---|---|---|---|
| Adult | 128 | Consistency | 0.80 ± 0.03 | 0.96 ± 0.05 | 0.84 ± 0.01 | 0.91 ± 0.06 |
| | | Local Averaging Robustness (mean term only) [2] | 0.75 ± 0.01 | 0.90 ± 0.04 | 0.83 ± 0.05 | 0.90 ± 0.04 |
| | | Drop-Out | 0.74 ± 0.01 | 0.83 ± 0.05 | 0.69 ± 0.08 | 0.81 ± 0.08 |
| | | Pred. Prob. | 0.67 ± 0.00 | 0.62 ± 0.02 | 0.30 ± 0.02 | 0.54 ± 0.05 |

Review (Rating: 5)

The paper studies the fine-tuning multiplicity in tabular LLMs, where different settings, such as varying random seeds, lead to different models and correspondingly different predictions for a given sample or dataset. The authors propose a measure called consistency, which is based on the model’s local behavior around the input. Asymptotic guarantees for consistency are provided, and an experimental comparison with other model multiplicity metrics (e.g., arbitrariness, discrepancy, pairwise disagreement), as well as predictive probabilities and dropout methods, is conducted.

Strengths

The paper is clearly written and easy to follow.

The authors provide an extensive literature review.

The authors consider the interesting problem of uncertainty quantification for tabular LLMs and propose using a relatively inexpensive measure to assess it.

A comparison is made with various methods of computing model multiplicity.

Weaknesses

  1. Could you please provide more background and intuition behind the consistency measure? Essentially, it is not clear why the choice of the $f(x_i) - |f(x)-f(x_i)|$ metric was made and what it means locally for a given sample, as well as globally when averaged for a given model. It is also unclear why the discrepancy in the metric between $f(x)$ and $2f(x_i)-f(x)$ occurs, given the sign of the absolute value.

  2. Consistency, as designed, is a measure of robustness or stability, and more comparison or motivation on how consistency differs from existing robustness methods should be provided (different measures exist; as examples, please see (1) and (2)). Further, while there is a high rank correlation with model multiplicity metrics, more discussion is needed on why a local robustness metric is a good approximation of model multiplicity.

  3. Theorem 1 provides interesting asymptotic guarantees. However, as the authors mentioned, it is challenging to use in practice. Specifically, it is not clear whether the current consistency value is high or low for the theorem to apply. To provide more information on the usefulness of the theorem or $k$, is it possible to estimate $\beta$ for some simpler hypothesis spaces, such as linear models?

  4. Tables 1 and 2 in the main paper, as well as other tables in the Appendix, do not contain confidence intervals (either computed or estimated). Therefore, it is difficult to assess the robustness of the presented values and compare them across datasets and metrics.

  5. Methods that enumerate all models in the Rashomon set for different hypothesis spaces, such as RashomonGB, TreeFarms, and CorelsEnum, can also be added to the literature review for completeness.

(1) J. Djolonga, F. Hubis, M. Minderer, Z. Nado, J. Nixon, R. Romijnders, D. Tran, and M. Lucic. Robustness Metrics, 2020.

(2) Tessa Han, Suraj Srinivas, Himabindu Lakkaraju. Characterizing Data Point Vulnerability as Average-Case Robustness, 2024.

Questions

  1. A lot of the model multiplicity metrics used in the paper are label-based. Does the choice of the threshold of 0.5 influence the rank correlation between consistency and these metrics?

  2. Do you have any intuition as to why the rank correlation is low between consistency and arbitrariness/pairwise disagreement for the German dataset?

Comment

W3. Applicability and Interpretation of Theorem 1

Theorem 1 establishes that higher consistency scores $S_{k,\sigma}(x,f)$ indicate greater robustness, ensuring that predictions are likely to remain stable across different fine-tuned models with high probability. While estimating certain unknown parameters, such as $\beta$, could help determine the exact sample size $k$ required for optimal results, this is not a significant limitation in practice. Given the nature of LLMs, we do not expect a linear embedding space, as we aim to leverage pretrained knowledge and few-shot examples to capture complex relationships in the tabular data.

In our experiments, we conducted an ablation study on $k$ (see Table 7) and observed that beyond a threshold of $k = 30$, further increasing the number of samples did not significantly improve the correlation between the consistency measure and the multiplicity metrics. This indicates that a practical choice of $k$ can yield robust results without requiring exact parameter estimation.

Practitioners can use the upper bound $\beta \leq 2$ to determine the worst-case sample size $k$ needed to ensure the theoretical guarantees hold with high probability. This provides a conservative estimate that remains applicable even in the absence of precise parameter knowledge. The number of sampled points $k$ can also be increased to much larger values to ensure higher probability; it remains computationally more efficient to increase $k$ than to retrain multiple models to quantify multiplicity. We have included a discussion of this in Remark 4.


W4. Including Confidence Intervals

We appreciate the reviewer’s suggestion to include standard deviations to better assess the robustness of our results. Calculating standard deviations for all metrics would require repeating experiments across multiple runs, which involves training and evaluating significantly more models (e.g., over 40 models per dataset), making it computationally expensive. However, for one scenario where this was feasible, we observed a low standard deviation (<0.1), indicating that our results are robust.

| Dataset | Number of Shots | Measure | Arbitrariness | Pairwise Disagreement | Prediction Variance | Prediction Range |
|---|---|---|---|---|---|---|
| Adult | 128 | Consistency | 0.80 | 0.96 | 0.84 | 0.91 |
| | | STD | 0.03 | 0.05 | 0.01 | 0.06 |
| | | Drop-Out | 0.74 | 0.83 | 0.69 | 0.81 |
| | | STD | 0.01 | 0.05 | 0.08 | 0.08 |
| | | Pred. Prob. | 0.67 | 0.62 | 0.30 | 0.54 |
| | | STD | 0.00 | 0.02 | 0.02 | 0.05 |

(W5) Cite Relevant Work

We appreciate your suggestion to include methods like RashomonGB, TreeFarms, and CorelsEnum in our literature review. We have revised the related work section to incorporate these methods.

Comment

Questions


Q1 A lot of the model multiplicity metrics used in the paper are label-based. Does the choice of the threshold of 0.5 influence the rank correlation between consistency and these metrics?

Yes, the choice of the threshold (e.g., 0.5) for existing threshold-dependent evaluation metrics (e.g., arbitrariness and pairwise disagreement) would influence the correlation between consistency and these multiplicity metrics. This highlights a key limitation of threshold-dependent metrics: they may fail to capture certain aspects of prediction variability.

For instance, if the classification threshold for predictions of the positive and negative classes were changed from 0.5 to 0.3, predictions with scores in (0.4, 0.6) would all be assigned to the positive class. In this scenario, metrics like arbitrariness or pairwise disagreement evaluated at a threshold of 0.3 would fail to capture the variability within this range, whereas using a threshold of 0.5 would reflect this variability more effectively.

To address this limitation, we also introduce metrics like prediction variance and prediction range, which do not depend on a specific threshold and better evaluate the variability in model outputs. In contrast, a key strength of our consistency measure is that it is inherently not threshold-dependent. Our measure provides a guarantee on the robustness of prediction confidence across a broad range of fine-tuned models, regardless of the classification threshold.


Q2 Do you have any intuition as to why the rank correlation is low between consistency and arbitrariness/pairwise disagreement for the German dataset?

The low rank correlation between consistency and arbitrariness/pairwise disagreement for the German dataset may stem from the threshold-dependent nature of these metrics. It is possible that the observed multiplicity predominantly occurs on one side of the decision boundary (as discussed in Q1). Metrics like prediction variance and prediction range are not influenced by accept/reject thresholds, enabling them to capture multiplicity even when predictions remain consistently on one side of the decision boundary. This could explain the low correlation between consistency and threshold-dependent metrics like arbitrariness and pairwise disagreement while showing a higher correlation with prediction variance and prediction range.

Comment

Thank you to the authors for their responses and for revising the manuscript.

The fact that the definition of consistency was inspired by the theoretical results is logical, given the form of consistency. However, Theorem 1 provides asymptotic guarantees, so it is unclear whether consistency remains meaningful under arbitrary choices of $k$ and $\beta$, or whether it is a reasonable measure for other model classes. The definition of consistency itself still does not seem intuitive to me. Its benefits currently rely on the empirical analysis and the observed correlation with the multiplicity measures. However, other robustness measures might also exhibit high correlations with model multiplicity measures. Therefore, a more careful analysis and comparison between robustness measures and consistency is needed.

The model multiplicity measure can be very sensitive to data properties or small changes in the data. Therefore, confidence intervals are very important when analyzing these measures. Without them, it is difficult to compare measures or draw strong conclusions.

Comment

Thank you for your thoughtful and constructive feedback on our paper.

W1. Clarification and Intuition Behind the Consistency Measure

We appreciate your request for more background and intuition behind our proposed consistency measure $S_{k,\sigma}(x,f)$. The consistency measure quantifies the robustness of a model's prediction at a specific point $x$ by examining the model's behavior in the local neighborhood of $x$ within the embedding space. Our measure is motivated by our observations from synthetic data experiments, where the model exhibited noisy and non-smooth patterns in the decision space (see Figure 2).

  • Local Averaging: The term $\frac{1}{k}\sum_{x_i \in N_{x,k}} f(x_i)$ represents the average prediction of the model on points sampled from a neighborhood around $x$. This captures the general tendency of the model in the vicinity of $x$.

  • Variability Penalization: The term $\frac{1}{k}\sum_{x_i \in N_{x,k}} |f(x) - f(x_i)|$ computes the average absolute difference between the model's prediction at $x$ and its predictions at neighboring points. Subtracting this from the local average penalizes the consistency score when there is high variability in a neighborhood despite a high local mean, reflecting instability in the model's predictions around $x$.

By combining these two terms, $S_{k,\sigma}(x,f)$ provides a measure that is high when the model's predictions are both strong (i.e., high average prediction) and stable (i.e., low variability) in the neighborhood of $x$. The metric is designed to capture the local stability of the model's predictions, which is critical in assessing robustness to fine-tuning multiplicity.
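For concreteness, here is a minimal sketch of how the measure can be evaluated with a single fine-tuned model. The Gaussian neighborhood sampling and the names (`f`, `x_emb`, `sigma`) are our illustrative assumptions for this sketch, not the paper's exact implementation.

```python
import numpy as np

def consistency(f, x_emb: np.ndarray, k: int = 30, sigma: float = 0.1,
                seed: int = 0) -> float:
    """Sketch of the consistency measure S_{k,sigma}(x, f).

    `f` maps an embedding to the model's predicted probability for the class
    of interest, and `x_emb` is the embedding of the input x. The neighborhood
    N_{x,k} is formed here by k Gaussian perturbations of scale sigma in the
    embedding space (one plausible instantiation, assumed for illustration).
    """
    rng = np.random.default_rng(seed)
    neighbors = x_emb + sigma * rng.standard_normal((k, x_emb.shape[0]))
    f_x = f(x_emb)
    f_neighbors = np.array([f(z) for z in neighbors])
    local_mean = f_neighbors.mean()                       # local averaging term
    local_variability = np.abs(f_x - f_neighbors).mean()  # variability penalty
    return local_mean - local_variability
```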


Consistency Interpretation of $f(x)$ and $2f(x_i) - f(x)$

This interesting structure of our consistency measure is not a heuristic design but arises directly from the reverse triangle inequality step in the proof of our theoretical consistency guarantee (Theorem 1): $|f(x)| \geq |f(x_i)| - |f(x_i) - f(x)|$ [1].

For the cases you mentioned: When $f(x_i) \geq f(x)$, the contribution to the consistency score is $f(x)$, indicating that the neighboring prediction $f(x_i)$ reinforces and supports the robustness of the prediction $f(x)$.

When $f(x_i) < f(x)$, the contribution becomes $2f(x_i) - f(x)$. If $f(x_i)$ is significantly less than $f(x)$, the term $2f(x_i) - f(x)$ becomes negative, penalizing the consistency score due to the large discrepancy between $f(x)$ and its neighbor. However, if $f(x_i)$ is only slightly less than $f(x)$ (i.e., $f(x_i) > \frac{f(x)}{2}$), the term $2f(x_i) - f(x)$ remains positive, thereby contributing positively to the consistency measure. The intuition is that we only penalize significant drops in neighboring predictions and allow neighbors that closely support the prediction $f(x)$. We have included this interpretation in the revised paper.
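A short worked example may help: combining both cases, the per-neighbor contribution to the sum equals $\min\{f(x),\, 2f(x_i) - f(x)\}$. Taking $f(x) = 0.8$ for illustration:

$$f(x_i) = 0.9 \;\Rightarrow\; \min\{0.8,\, 1.0\} = 0.8 \quad \text{(neighbor fully supports the prediction)}$$
$$f(x_i) = 0.7 \;\Rightarrow\; \min\{0.8,\, 0.6\} = 0.6 \quad \text{(mild drop; still positive since } 0.7 > 0.8/2\text{)}$$
$$f(x_i) = 0.3 \;\Rightarrow\; \min\{0.8,\, -0.2\} = -0.2 \quad \text{(large drop; penalizes the score)}$$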


[1] Reverse Triangle Inequality, https://www.stat.ubc.ca/~saif.syed/math100V/Math%20100V%20-%20Triangle%20inequality.pdf

Comment

W2: How consistency differs from existing robustness methods

We thank the reviewer for suggesting these references, which we have now cited and addressed in our revised manuscript to clarify how our approach differs from existing methods.

Our focus on model multiplicity distinguishes this work from traditional robustness measures, which address different aspects of model behavior such as out-of-distribution (OOD) generalization, stability under natural perturbations, and uncertainty estimation [1]. OOD generalization typically evaluates how well a model performs on data that differs from the training distribution (e.g., classifying objects seen from novel viewpoints or in cluttered settings). This is often quantified using test datasets with altered conditions or domain shifts, and methods like domain adaptation are employed to enhance robustness. Stability under natural perturbations assesses the sensitivity of predictions and predicted probabilities to small, random changes in the input, such as Gaussian noise or image transformations. Uncertainty estimation, on the other hand, focuses on calibrating the predicted probabilities to reflect true likelihoods, often using measures like Expected Calibration Error or entropy-based metrics to evaluate how well the model quantifies confidence in its predictions. While these methods provide valuable insights into different facets of robustness, their goals differ significantly from ours.

[2] is more closely related to our approach, as it quantifies robustness by measuring the fraction of consistent predictions within a local neighborhood. While both approaches leverage the neighborhood around a data point, the objectives diverge: [2] focuses on quantifying the probability of consistent predictions against perturbations to evaluate robustness to noise. In contrast, our measure captures the consistency of predictions (multiplicity) among competing models within the Rashomon set.

Additionally, our consistency measure's unique mean-variance nature further distinguishes it. Unlike existing metrics, it not only accounts for the average prediction within a neighborhood but also penalizes the variability in predictions. Moreover, we provide theoretical guarantees on the robustness of predictions with high consistency scores over a broad range of equally-well performing models. We also include an ablation study comparing our full consistency measure, which incorporates both mean and variability, to a version that considers only the mean term.

| Dataset | Number of Shots | Measure | Arbitrariness | Pairwise Disagreement | Prediction Variance | Prediction Range |
|---|---|---|---|---|---|---|
| Adult | 128 | Consistency | 0.80 | 0.96 | 0.84 | 0.91 |
| | | Consistency (mean term only) | 0.75 | 0.90 | 0.83 | 0.90 |
| | | Drop-Out | 0.74 | 0.83 | 0.69 | 0.81 |
| | | Pred. Prob. | 0.67 | 0.62 | 0.30 | 0.54 |

Why a local robustness metric is a good approximation of model multiplicity

Our measure of consistency is grounded in the observation that non-robust predictions are often sensitive to slight perturbations in the input space (as demonstrated in Figure 2). For samples lying on the data manifold (where realistic samples exist), we expect several models (from the same pre-trained model) fine-tuned on the same dataset with different training seeds to exhibit "similar" prediction probabilities. However, fine-tuned models can differ significantly in regions outside the data manifold, as the absence of training samples in these areas means there is no shared information to constrain their behavior.

Additionally, recent mathematical analyses of LoRA corroborate this perspective, showing that $\mathbb{E}_X \|f(X) - f'(X)\| \leq \Delta$ for a random variable $X$ over a bounded set [3].

Furthermore, our experimental results empirically validate the effectiveness of our measure. They demonstrate that leveraging the local neighborhood around a sample informs multiplicity, outperforming the sole use of the prediction probability $f(x)$.


[1] J. Djolonga, F. Hubis, M. Minderer, Z. Nado, J. Nixon, R. Romijnders, D. Tran, and M. Lucic. Robustness Metrics, 2020.

[2] Tessa Han, Suraj Srinivas, Himabindu Lakkaraju. Characterizing Data Point Vulnerability as Average-Case Robustness, 2024.

[3] Yuchen Zeng and Kangwook Lee. The expressive power of low-rank adaptation, ICLR 2023.

Comment

Thank you for your response!

More Intuition on the Consistency Measure

The consistency measure is: $$S(x,f) = \frac{1}{k}\sum_{x_i \in N_{x,k}} f(x_i) - \frac{1}{k}\sum_{x_i \in N_{x,k}} |f(x) - f(x_i)|.$$

This exact form of (local average confidence) minus (local variability in confidence) has multiple rationales behind it:

Intuitive rationale - One might intuitively think that the model's confidence at a point, $f(x)$, is by itself a measure of consistency under multiplicity, i.e., a high $f(x)$ of 0.8 is more consistent than 0.51. However, a high $f(x)$ alone may not be sufficient if its neighboring region has low confidence (particularly if the target function is smooth; see Figure 2). The next candidate measure would be $\frac{1}{k}\sum_{x_i \in N_{x,k}} f(x_i)$ (just the first term of our measure). However, the average confidence can be high while there is still considerable local variation in confidence, which leads to our measure: $S(x,f) =$ (local average confidence) $-$ (local variability in confidence) (see the new ablation in Table A).

Theoretical rationale - This form comes from the reverse triangle inequality. It lets us bound the probability that the confidence $F(x)$ of a fine-tuned model exceeds the consistency measure evaluated at $x$ under an independent model $F'$, i.e., $\Pr\left(F(x) \geq S_{k,\sigma}(x,F') - \epsilon\right) \geq 1 - \delta$. This means that if we find a point $x$ with a high $S(x,F')$, then $F(x)$ is also likely to be high, so $x$ is more likely to belong to the class $c$ under other random models $F$. For example, if $S(x,F') = 0.5 + \epsilon$, then $F(x)$ is likely to exceed $0.5$, meaning $x$ is more likely to belong to class $c$ under other random models $F$. Increasing $k$ means we sample more in the local region, so the probability improves, while $\beta$ measures the goodness of the broad class of fine-tuned models.

Experimental rationale - We have conducted experiments on several datasets and find that our proposed measure, evaluated on just one model, is strongly correlated with actual multiplicity across multiple fine-tuned models (without imposing any theoretical assumptions). From our ablation studies (Table 7), we have not observed issues with the practical utility of consistency, even for moderate values of $k$ (e.g., $k = 30$).

Relationship with Other Robustness Measures

Exploring other robustness measures and their relationship with actual multiplicity is interesting. The bulk of robustness measures [1,2] look at the local neighborhood and perform some form of averaging over confidences or final decisions, which shares similarities with the first term of our measure, $\frac{1}{k}\sum_{x_i \in N_{x,k}} f(x_i)$. The closest approach to ours is [2], which quantifies robustness by averaging predictions within a local neighborhood of a datapoint. We therefore include a comparison with this approach in our analysis, highlighting the additional insights provided by the variability term unique to our measure (see Tables A and B for ablation studies on the Adult and Heart datasets). We have included the standard deviations here as well.

Local Averaging Robustness $= \frac{1}{k}\sum_{x_i \in N_{x,k}} f(x_i)$ (also suggested in [2]; the consistency measure without the variability term)
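A toy numeric illustration (with hypothetical neighborhood predictions of our own choosing) of why the variability term matters: two neighborhoods with the same local mean receive the same local-averaging score but different consistency scores.

```python
import numpy as np

def local_avg(f_x: float, f_neighbors: np.ndarray) -> float:
    # Local Averaging Robustness: mean confidence in the neighborhood only.
    return f_neighbors.mean()

def consistency(f_x: float, f_neighbors: np.ndarray) -> float:
    # Our measure: local mean minus local variability around f(x).
    return f_neighbors.mean() - np.abs(f_x - f_neighbors).mean()

f_x = 0.8
stable = np.array([0.78, 0.80, 0.82, 0.80])   # smooth neighborhood
noisy  = np.array([0.99, 0.60, 1.00, 0.61])   # same mean 0.80, high variability

print(local_avg(f_x, stable), local_avg(f_x, noisy))      # 0.80 vs 0.80
print(consistency(f_x, stable), consistency(f_x, noisy))  # 0.79 vs 0.605
```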

Notably, there are other robustness measures too. For instance, calibration methods (e.g., [1,3]) typically adjust a model's predicted probabilities to better align with true likelihoods using techniques like temperature scaling or histogram binning. These methods operate on global prediction probabilities and validation datasets but do not consider neighboring sample points as our approach does, and hence do not capture the noisy local behavior.

Other robustness measures try to find adversarial examples (the closest point on the other side of the decision boundary) using optimization methods. We believe these may be even more computationally expensive for TabLLMs than actual ensembling across multiple fine-tuned versions of the model, whereas our goal in this paper is an easier-to-compute measure that bypasses ensembling across multiple fine-tuned models altogether. Our measure draws inspiration from these previous robustness works but provides a novel and easy-to-compute alternative.

[3] Guo, Chuan, et al. "On calibration of modern neural networks." International conference on machine learning. PMLR, 2017.

Comment

Table A

| Dataset | Number of Shots | Measure | Arbitrariness | Pairwise Disagreement | Prediction Variance | Prediction Range |
|---|---|---|---|---|---|---|
| Adult | 128 | Consistency | 0.80 ± 0.03 | 0.96 ± 0.05 | 0.84 ± 0.01 | 0.91 ± 0.06 |
| | | Local Averaging Robustness [2] | 0.75 ± 0.01 | 0.90 ± 0.04 | 0.83 ± 0.05 | 0.90 ± 0.04 |
| | | Drop-Out | 0.74 ± 0.01 | 0.83 ± 0.05 | 0.69 ± 0.08 | 0.81 ± 0.08 |
| | | Pred. Prob. | 0.67 ± 0.00 | 0.62 ± 0.02 | 0.30 ± 0.02 | 0.54 ± 0.05 |

Table B

| Dataset | Number of Shots | Measure | Arbitrariness | Pairwise Disagreement | Prediction Variance | Prediction Range |
|---|---|---|---|---|---|---|
| Heart | 128 | Consistency | 0.89 ± 0.09 | 0.90 ± 0.06 | 0.97 ± 0.05 | 0.87 ± 0.03 |
| | | Local Averaging Robustness [2] | 0.80 ± 0.03 | 0.88 ± 0.02 | 0.86 ± 0.06 | 0.82 ± 0.06 |
| | | Drop-Out | 0.64 ± 0.02 | 0.76 ± 0.03 | 0.74 ± 0.03 | 0.83 ± 0.03 |
| | | Pred. Prob. | 0.61 ± 0.06 | 0.46 ± 0.05 | 0.50 ± 0.04 | 0.26 ± 0.04 |
Comment

We would like to clarify that our consistency measure can be applied to any prediction class in multi-class classification. Our consistency measure is tied to the confidence in predicting a specific class (i.e., the probability values derived from the softmax logits), not directly to the predicted label. In our experiments, we already include results for both class 0 and class 1.

For intuition, let us consider a binary classification scenario with two classes, A and B, and corresponding softmax logits $f_A(x)$ and $f_B(x)$, such that $f_A(x) + f_B(x) = 1$.

The consistency measure can be defined for class A as $S(x, f_A)$ and for class B as $S(x, f_B)$. The former denotes the consistency of the prediction for class A: if $S(x, f_A)$ is high, then across many fine-tuned models the logit for class A is also likely to be high (as per our theoretical guarantee). Similarly, $S(x, f_B)$ is the consistency of the prediction for class B: if $S(x, f_B)$ is high, then across many fine-tuned models the logit for class B is also likely to be high (as per our theoretical guarantee).

However, $S(x, f_A)$ and $S(x, f_B)$ cannot both be very high at the same time because of their complementary nature, so they will not both be informative. For example, if $S(x, f_A) = 0.5 + \epsilon$ and $S(x, f_B) = 0.2 + \epsilon$, we know the logit for class A will likely be higher than 0.5 and predictions of A will be more consistent. We also know that the logit for class B will be higher than 0.2, though this may not be as informative for us.

This complementary nature of the measure ensures that the consistency score for the chosen class aligns with its predicted confidence. For example, just as the predicted probability of the chosen class is given by $\max(f_A(x), f_B(x))$, the consistency score of the chosen class can similarly be expressed as $\max(S(x, f_A), S(x, f_B))$, which is what we do in the experiments.

Given $S(x, f_A)$ for one class, there is a one-to-one relationship with $S(x, f_B)$ for the other class. We can think of $f$ as a vector of softmax probabilities, e.g., $f = (f_A, f_B)$, and the consistency measure reflects the robustness of this model as a whole. When $f$ has low entropy (i.e., one class has high confidence), $S(x, f)$ is highly informative. Conversely, when $f$ has high entropy (i.e., predictions are distributed more evenly across classes), $S(x, f)$ is less informative, as expected. The measure inherently captures a property of the model, not the individual classes in isolation. For three or more classes, the same principles apply.
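A minimal sketch of this per-class computation (the Gaussian neighborhood sampling and the names `softmax_fn`, `x_emb` are illustrative assumptions, as in the earlier sketch); the reported score for the chosen class is the maximum over classes, mirroring $\max(f_A(x), f_B(x))$.

```python
import numpy as np

def consistency_per_class(softmax_fn, x_emb: np.ndarray, k: int = 30,
                          sigma: float = 0.1, seed: int = 0) -> np.ndarray:
    """Sketch: per-class consistency scores from a single fine-tuned model.

    `softmax_fn` maps an embedding to a vector of class probabilities; the
    Gaussian neighborhood sampling is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    neighbors = x_emb + sigma * rng.standard_normal((k, x_emb.shape[0]))
    p_x = softmax_fn(x_emb)                              # shape (C,)
    p_nb = np.stack([softmax_fn(z) for z in neighbors])  # shape (k, C)
    return p_nb.mean(axis=0) - np.abs(p_x - p_nb).mean(axis=0)

# The consistency of the chosen class is then the maximum over classes:
# scores = consistency_per_class(softmax_fn, x_emb); reported = scores.max()
```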

AC Meta-Review

Based on the reviews, I conclude that the paper cannot be accepted for publication in its current form. Of the four reviews, three recommend rejection. The reviewers raised significant concerns about key aspects of the work, particularly questioning the validity of the notion of 'consistency' and highlighting limitations in the empirical evaluation.

Additional Comments from Reviewer Discussion

The key points raised across reviews centered on the validity of the proposed consistency measure and the limited empirical evaluation:

  • Consistency Measure and Theoretical Concerns: Reviewers zzSS, szsk, and YYAF raised concerns regarding the definition and intuition behind the "consistency" measure. YYAF also questioned the submission's theoretical formulation. The authors attempted to clarify the measure’s intuition and provided additional explanations about its theoretical grounding. Despite these clarifications, the fundamental concerns about the definition and theoretical validity remained unresolved, influencing the rejection recommendations.

  • Empirical Evaluation and Dataset Selection: Several reviewers, including zzSS, szsk, and DHwF, noted limitations in the empirical evaluation, particularly regarding dataset selection. The authors expanded the evaluation by including results from two additional datasets and clarified the dataset choices. However, the added datasets are not sufficient to fully address the raised concerns, contributing to the overall negative assessment.

  • Confidence Intervals and Experimental Robustness: zzSS raised concerns about the absence of confidence intervals in the original submission. The authors added confidence intervals and conducted additional robustness experiments. This update was acknowledged but did not fully address the remaining major concerns regarding the measure’s definition and theoretical formulation.

While the authors made several efforts to clarify their approach and improve the evaluation, these efforts were not fully convincing. Despite Reviewer DHwF’s positive stance, the overall concerns raised by the other reviewers were too significant to overcome, leading to the above recommendation.

Final Decision

Reject