Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering
Abstract
Reviews and Discussion
This paper presents a study on calibration methods for in-context learning. The authors provide a unified and systematic analysis of existing calibration, focusing on their decision boundaries. They also investigate the common use of content-free tokens in calibration. The paper highlights the biases in language models’ predictions and proposes Batch Calibration (BC) as a zero-shot and inference-only calibration method. BC aims to accurately model the contextual bias in prompt contexts by marginalizing the language model scores. The authors extend BC to black-box few-shot learning by introducing a learnable parameter to adapt to available labeled data. The performance of BC is evaluated and compared to baseline methods, showing improved performance in various tasks.
Strengths
This paper is technically sound and well-written (although some writing issues are listed in weaknesses). The authors' revisiting of previous methods is novel to the research community.
Weaknesses
- For writing:
  - "Survival of the Most Influential Prompts: Efficient Black-Box Prompt Search via Clustering and Pruning" has been accepted at EMNLP 2023. Please cite it correctly.
  - Why is the related work in Section 5? Introducing it in Section 2 would help readers understand the background better.
  - Unify the usage of abbreviations: Tab. -> Table.
- When introducing BCL, the authors use some examples to help readers understand the strength term γ. I suggest the authors provide more experiments on how different γ values affect the results.
- As shown in Figure 3, the improvement from BC to BCL seems marginal.
- Any broader impact and limitation discussion?
Questions
See weaknesses.
Dear Reviewer tiQs,
Thank you for your prompt reply! We are glad that our response has alleviated your concerns.
Regarding the related work section, we thank the reviewer for the suggestion. We will merge Sections 2 and 5 in the next revision of the paper, as suggested.
Regarding BCL, there are two points we'd like to make:
- First, BCL, which features an adjustable strength learned from labeled data, allows us to study the ground-truth optimality of the decision boundary asserted by BC under the zero-shot setting. We can see that the performance of BC is consistently strong and, in many cases, indeed close to optimal -- see Fig. 5, where we validate different choices of γ on labeled data that would not otherwise be available to a zero-shot method; we can see that γ = 1, which corresponds to BC, is close to optimal. This provides empirical support for the design of BC described in Section 3.
- Second, while we acknowledge that the extent of improvement from BC to BCL is not always significant in Table 3, there are cases where BCL improves quite significantly (e.g., MRPC and CB). On the other hand, when BCL underperforms BC, the performance decrease is mostly marginal. Thus, on balance, we feel it is still beneficial to include discussions of BCL as a potential improvement that makes use of labeled data when it is available (in contrast to the zero-shot setting).
Again, we will clarify the above points further and incorporate them into the next revision shortly.
Thanks for your clarification. I will closely follow the reactions from the other reviewers to determine my final rating. For now, I will keep my rating.
I am looking forward to further discussion with the other reviewers!
Dear Reviewer tiQs,
We have just uploaded a revision where we have adjusted the position of related works in our paper, following your suggestions.
We thank the reviewer again for engaging in the discussions.
We thank the reviewer for their constructive and valuable feedback and for recognizing the soundness, presentation, and novelty of this work to the research community. We appreciate the suggestions for including more experiments and a section discussing broader impacts and limitations. We have incorporated all the reviewer’s suggestions in the updated manuscript. We would be grateful if the reviewer could consider increasing their rating if they find our response has sufficiently addressed their concerns.
"Survival of the Most Influential Prompts: Efficient Black-Box Prompt Search via Clustering and Pruning" has been accepted by EMNLP-2023. Please cite it correctly. Unify the usage of abbreviated words. Tab. -> Table.
We thank the reviewer for pointing them out. We have corrected them in the updated manuscript.
Why is the related work in Section 5? Introducing it in Section 2 would help readers understand the background better.
We appreciate the reviewer’s comments on the structure of the presentation. Section 2 serves as a unified overview of the most relevant existing calibration methods while addressing two important research questions behind their design principles. We therefore found that combining the related work with the current analysis section could make the section unwieldy and potentially dilute readers' attention. Hence, we decided to move the related work to Section 5 and present the main body of our work, Section 3, to readers as early as possible for a better flow of content.
When introducing BCL, the authors use some examples to help readers understand the strength term γ. I suggest the authors provide more experiments on how different γ values affect the results.
We thank the reviewer for suggesting more experiments. We have now conducted additional experiments to illustrate the impact of the strength term γ on BCL and have included more plots in Figure 11, Appendix D, in the updated manuscript. To better understand the term, we kindly refer the reviewer to Table 1 and Figure 2, which visualize the decision boundary for BC. The strength term γ controls the strength of the shift (BC) operation on the decision boundary, such that γ = 1 recovers zero-shot BC. At γ = 1, the estimate of the contextual prior p̂(y|C) by zero-shot BC serves as a strong but not necessarily optimal prior for shifting the decision boundary. Therefore, by leveraging labelled data, we can further refine the decision boundary by learning γ.
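To make the shift operation concrete, below is a minimal NumPy sketch of BC/BCL as a post-hoc operation on a batch of class scores; the function name, the log-probability convention, and the toy numbers are our illustrative assumptions rather than the paper's code:

```python
import numpy as np

def batch_calibrate(log_probs: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Shift per-class scores by the batch-estimated contextual prior.

    log_probs: (M, K) array of LLM class scores (e.g., log-probabilities)
               for a batch of M inputs and K classes.
    gamma:     shift strength; gamma = 1.0 recovers zero-shot BC.
    """
    prior = log_probs.mean(axis=0, keepdims=True)  # per-class contextual prior
    return log_probs - gamma * prior               # linear, post-hoc shift

# Toy example: scores biased toward class 0 before calibration.
scores = np.log(np.array([[0.70, 0.30], [0.60, 0.40], [0.55, 0.45]]))
preds = batch_calibrate(scores).argmax(axis=-1)  # boundary shifts away from the bias
```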
As shown in Figure 3, the improvement from BC to BCL seems marginal.
We acknowledge that having labelled data brings only marginal improvement over BC for certain tasks, which is also reflected in Figure 5, Page 6, where the performance gain of BCL (green line) over BC (purple line) is observable but not significant. However, this empirical evidence points exactly to a strength of BC: BC at γ = 1 already serves as a good prior for estimating the contextual bias from the target (test) statistics alone. It further consolidates the effectiveness of BC even in the zero-shot setup. By incorporating an additional strength hyperparameter, we can estimate the source prior from source statistics in labelled data. This BCL design further improves the robustness of BC under covariate shift and shows much stronger calibration performance than BC on a range of tasks (SST-2, MRPC, QQP, CB).
Any broader impact and limitation discussion?
We appreciate the reviewer’s suggestion for discussing the broader impacts of our approach and limitations. We have included a new limitation section in the Appendix in the updated manuscript. As a brief summary, we reiterate the limitations and the impact of BC below:
In terms of limitations, we recall that BC is a test-time method that relies on the target statistics from a batched input. To mitigate any potential estimation errors from unlabelled data, we introduce the adjustable BCL extension to incorporate source statistics from labelled data. Though BC has shown remarkable sample efficiency in terms of batch sizes, it still requires a batch of inputs to estimate the contextual bias. We introduce a running estimation for BC over mini-batches, which subsequently stabilizes the predictions of LLMs as more mini-batches arrive.
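As an illustration of this running estimation, the sketch below maintains a streaming mean of the contextual prior across mini-batches; the class name and interface are hypothetical:

```python
import numpy as np

class RunningBatchCalibrator:
    """Streaming estimate of the contextual prior over mini-batches (illustrative)."""

    def __init__(self, num_classes: int):
        self.prior_sum = np.zeros(num_classes)
        self.count = 0

    def update(self, log_probs: np.ndarray) -> None:
        """Accumulate class scores from a new (batch, num_classes) mini-batch."""
        self.prior_sum += log_probs.sum(axis=0)
        self.count += log_probs.shape[0]

    def calibrate(self, log_probs: np.ndarray) -> np.ndarray:
        """Subtract the running-mean prior from new scores."""
        return log_probs - self.prior_sum / max(self.count, 1)
```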
Overall, BC is zero-shot, inference-only, and incurs negligible additional cost for building robust LLM applications. BC mitigates unfair predictions of LLMs due to contextual bias while alleviating the sensitivity of LLMs to prompt designs. We expect BC to ease prompt engineering for users building their own robust and responsible LLM applications.
Thanks for the authors' reply. I have checked the rebuttal. My concerns are partially resolved.
My remaining concern is still the marginal improvement from BC to BCL. However, I do not think it is a major issue. Besides, I still suggest the authors move the related-work discussion to Section 2 and discuss the relationship with previous efforts. In this way, readers will have a clearer understanding of your novelty and distinct contribution to the community.
Looking forward to your reply.
Best,
Reviewer
This paper studies model calibration in the context of LLMs (large language models). The authors first analysed the recently proposed methods with empirical results, and they also discussed the two important design principles behind those ICL calibration methods. A novel calibration method (which is termed batch calibration) is then introduced for zero-shot learning; an extended version that has a hyperparameter is proposed for fine-tuning. The authors conducted extensive experiments on multiple NLP tasks and showed superior performance as compared to the existing ICL calibration methods.
Strengths
- This paper addresses an important and interesting topic: calibrating LLMs in zero- or few-shot settings.
- The paper gives an overview of the most relevant and recent ICL calibration methods and discusses their motivations and design principles with empirical results.
- The experiments are extensive, and the proposed method achieves better performance on most tasks.
Weaknesses
- My main concern goes to the strength parameter γ in BCL; the current results show that γ = 1 seems to give strong performance (though not optimal) on CB and SST-2. Does this generalize to other tasks as well? It would be good to provide the optimal γ for each task; maybe include it in Table 2? Besides, would it be more reasonable to sample γ in [0, 5]?
- There are some statements that are not well presented or supported. In Sec. 4.3, the authors claim BC retains performance even when using emoji pairs as verbalizers. Is there an example or result related to this specific experiment?
Questions
- In Eq. (3), how is the contextual prior p̂(y|C) obtained?
- How would BC work on tasks other than classification?
- It seems the calibration performance is not stable across different model architectures in the 1-shot setting on some tasks, e.g., MNLI: 75.12 / 60.02 / 81.34 for PaLM-2-S/M/L. I understand that the PaLM-2-S/L results are based on 5 runs and the PaLM-2-M result is from a single run. Could you help me understand what causes the high performance variance? I may be missing something here.
In Eq. (3), how is the contextual prior p̂(y|C) obtained?
In Equation 3, the contextual prior is obtained via Equation 2, which is equivalent to taking the estimated mean of the log probability for each class. In the updated manuscript, we have updated the notation in Equation 2 to better emphasize the link between Equations 2 & 3.
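For readers of this thread, the link can be summarized as follows; this is our paraphrase of the two equations based on the surrounding discussion (a batch of M inputs under context C), so the notation may differ slightly from the paper:

```latex
% Contextual prior, estimated by marginalizing over the batch (cf. Equation 2):
\hat{p}(y \mid C) \approx \frac{1}{M} \sum_{i=1}^{M} p\!\left(y \mid x^{(i)}, C\right)
% Calibrated prediction, shifting each score by this prior (cf. Equation 3):
\hat{y} = \arg\max_{y} \left[\, p(y \mid x, C) - \hat{p}(y \mid C) \,\right]
```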
How would BC work on tasks other than classification?
We thank the reviewer for raising this appealing research direction, which is worth exploring by the research community. We suspect that contextual bias may also exist in short-form generation tasks. Motivated by [1], one possible solution for generative tasks is to calibrate the logits of the first output token, since the following tokens are likely deterministic given the first token. We will endeavor to explore generative calibration. We thank the reviewer for this insight, and we have included this idea in the future work section in the Appendix of the updated manuscript.
[1] Guo, C., Pleiss, G., Sun, Y. and Weinberger, K.Q. On calibration of modern neural networks. ICML 2017.
It seems the calibration performance is not stable across different model architectures in the 1-shot setting on some tasks, e.g., MNLI: 75.12 / 60.02 / 81.34 for PaLM-2-S/M/L. I understand that the PaLM-2-S/L results are based on 5 runs and the PaLM-2-M result is from a single run. Could you help me understand what causes the high performance variance? I may be missing something here.
We acknowledge that models of different sizes exhibit distinctive robustness characteristics on different tasks, and it is occasionally possible for a larger model to underperform a smaller model on specific tasks (this is also reflected in, for example, the PaLM 2 technical report [2], Table 2). Possible factors include pretraining strategies, the type and size of training and instruction-finetuning data, and/or slight differences in templates. In general, however, PaLM 2-M still shows stronger 1-shot ICL performance than PaLM 2-S across the set of evaluation tasks, and BC shows consistent performance improvement across all model scales.
[2] Anil, R., Dai, A.M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z. and Chu, E. Palm 2 technical report. arXiv preprint arXiv:2305.10403.
We thank the reviewer for their insightful and constructive feedback. We appreciate the reviewer’s suggestions to present BCL plots on other tasks and detailed per-example performance for Figure 8. We have incorporated all suggestions made by the reviewer into the updated manuscript. We hope that our response sufficiently addresses the reviewer’s concerns and that the reviewer could consider improving their score.
My main concern goes to the strength parameter γ in BCL; the current results show that γ = 1 seems to give strong performance (though not optimal) on CB and SST-2. Does this generalize to other tasks as well? It would be good to provide the optimal γ for each task; maybe include it in Table 2? Besides, would it be more reasonable to sample γ in [0, 5]?
We thank the reviewer for expressing interest in the strength parameter γ in BCL and the generalization of performance with respect to the strength term. We recall that when the strength is 1, Equation 5 becomes zero-shot BC, and BC shows state-of-the-art performance across many tasks over ICL and calibration baselines in Table 2. To better understand the term, we kindly refer the reviewer to Figure 2, which visualizes the decision boundary for BC. The strength term γ controls the strength of the shift (BC) operation on the decision boundary, such that γ = 1 recovers zero-shot BC. At γ = 1, the estimate of the contextual prior, p̂(y|C), by zero-shot BC serves as a strong but not always optimal prior for shifting the decision boundary, due to its dependency on test-time statistics. Therefore, by leveraging labelled data to learn the strength term as a source prior, we can further refine the decision boundary while mitigating calibration errors.
We thank the reviewer for suggesting sampling γ in [0, 5]. By allowing γ to take negative values, we can explore what the decision boundary looks like after shifting it in the opposite direction; we therefore retain both directions over a sufficient range when sampling for the optimal source prior. To provide better intuition for γ across other tasks, we have included more plots in Figure 11, Appendix E of the updated manuscript, where the optimal γ for the test set can be read from each task's plot. BC (γ = 1) demonstrates strong performance across tasks, and further slight gains can be achieved by BCL when we incorporate the source prior via labelled data. We welcome further discussion with the reviewer.
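As a sketch of how the strength term could be selected from labelled data, the snippet below grid-searches γ over [-5, 5] using the `batch_calibrate` helper from our earlier response; the grid resolution and the selection-by-accuracy criterion are illustrative assumptions:

```python
import numpy as np

def fit_gamma(log_probs: np.ndarray, labels: np.ndarray,
              grid=np.linspace(-5.0, 5.0, 101)) -> float:
    """Pick the shift strength maximizing accuracy on labelled data (BCL sketch)."""
    prior = log_probs.mean(axis=0, keepdims=True)  # batch-estimated contextual prior
    best_gamma, best_acc = 1.0, -1.0
    for gamma in grid:
        preds = (log_probs - gamma * prior).argmax(axis=-1)
        acc = (preds == labels).mean()
        if acc > best_acc:
            best_gamma, best_acc = gamma, acc
    return best_gamma  # gamma == 1.0 would correspond to zero-shot BC
```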
There are some statements that are not well presented or supported. In Sec. 4.3, the authors claimed BC retains the performance even when using emoji pairs as verbalizers. Is there an example or result related to this specific experiment?
We thank the reviewer for expressing interest in the individual test performance for each verbalizer choice in Figure 8. We kindly refer the reviewer to Table 10, which lists the detailed verbalizer choices and descriptions for each emoji pair. We agree that it is interesting to interpret the individual experimental results corresponding to each design choice. Therefore, we have updated our manuscript and included each test sample's result before and after calibration in a new Table 7, Page 17. Empirically, we observe that BC can recover the performance of the LLM for emoji pairs with opposed but context-relevant semantics. We hope this addresses your remaining concerns.
Dear Reviewer jGRr,
Thank you for your engagement in the discussion! We are glad that our response has alleviated your concerns.
We would like to provide a further clarification that interprets the strength term from our new Figure 11 following the suggestion of the reviewer:
The gap between the optimal strength term on the labelled data and the best test-time strength term indicates the distribution shift from source to target. This distribution shift arises from the statistics of the dataset, not from any proposed algorithm, and the best test-time strength term is always an unknown prior for the test set. The gap can also be seen, fundamentally, as a manifestation of the generalization gap that any learning algorithm is subject to, rather than something specific to our method. Despite this, we still believe BCL yields empirical benefits, as discussed below.
Incorporating the strength term allows us to leverage an additional source prior when encountering a distribution shift away from a uniform prior. In cases where BCL improves quite significantly (e.g., MRPC and CB), the effectiveness of the strength term is reflected in the relative distance between the optimal strength term from labelled data (red) and the original ICL (blue), where BCL serves as a robust extension of BC with a more significant performance enhancement over ICL across tasks. Empirically, we find that BCL often leads to large gains, and even when it does not, the extent of performance change is minimal. Thus, on balance, we still believe BCL is a useful extension.
Finally, as we responded to Reviewer tiQs, another reason for studying BCL, which features an adjustable strength learned from labeled data, is that it allows us to study the ground-truth optimality of the decision boundary asserted by BC under the zero-shot setting. We can see that the performance of BC is consistently strong and, in many cases, indeed close to optimal -- see Fig. 5, where we validate different choices of γ on labeled data that would not otherwise be available to a zero-shot method; we can see that γ = 1, which corresponds to BC, is close to optimal. This provides empirical support for the design of BC described in Section 3.
Therefore, we believe that BCL is a valuable extension of BC, both empirically and conceptually, for robustness, extending the applications of BC from zero-shot learning only to few-shot learning as well. We appreciate the reviewer's engagement, and we are happy to provide further clarification.
We hope the reviewer could reconsider their rating if they believe their concerns have been clarified better with our latest response.
Thanks for the author response. I have read the author response as well as the other reviews. My concerns are partially addressed. I am still not fully convinced of the effectiveness of learning the strength term as a source prior: looking at Figure 11, there is a gap between the learned optimal strength term and the best one at test time on some tasks. I'd like to keep my score for now.
The paper delves into the current challenges faced in adapting Large Language Models (LLMs) to new tasks through the method of human-designed instructions. Although these models possess a commendable ability for in-context learning (ICL) and can efficiently adapt from few-shot input-label pairs, they are significantly influenced by the choice of templates, verbalizers, and demonstrations. This results in biases that can act as barriers to creating adaptable and robust LLM applications. While several studies have tried to address these biases, a holistic analysis differentiating the merits and demerits of each approach is lacking.
Strengths
The paper provides a thorough and systematic examination of existing calibration methods for LLMs, filling a gap in the existing research landscape. Methodologically, the introduction of Batch Calibration (BC) offers a zero-shot, inference-only calibration method that is computationally efficient, addressing a primary concern in the domain. BC proves effective in reducing prompt sensitivity, a prominent issue in LLMs, thereby facilitating easier prompt engineering.
The scope is also broadened compared to previous works like CB, which only studies GPT-2's biases.
Weaknesses
While BC introduces minimal computational overhead, in highly resource-constrained environments, even small overheads might be significant.
The term "calibration" can be confusing; perhaps "bias" would be better. At the least, a footnote should be added to the Introduction. Another widely used notion of calibration is from the perspective of uncertainty, explainability, and reliability.
Please include model sizes and details in the paper, as the performance of ICL and prompting is quite irreproducible and context-dependent.
Please adjust the margins of the subfigures in Fig. 2; they overlap.
The literature review of test-time tuning is a bit unnecessary, as ICL itself is not a tuning method.
Questions
- In Fig. 5, why does a strength greater than 1 consistently decrease accuracy?
- How significant is the bias/miscalibration of ICL performance? That is to say, is the problem still relevant for models at greater scales? Consider 65B LLaMA-2, GPT-3.5, GPT-4, etc., as the method does not require tuning.
We thank the reviewer for their insightful and constructive feedback, especially the valuable suggestions that further improve the presentation and clarity of terms in the paper. We have made changes in our updated manuscript accordingly. We hope that, in light of this response that addresses all the reviewer’s concerns, the reviewer could consider improving their score.
While BC introduces minimal computational overhead, in highly resource-constrained environments, even small overheads might be significant.
We thank the reviewer for acknowledging the minimal computation overhead of our approach. While we agree in principle that small overheads can be significant in resource-constrained setups, this is true of any method that uses additional computation and/or storage, and hence applies to all baselines short of doing no calibration at all. In comparison to previous methods, batch calibration does not involve any additional forward (API) calls to LLMs that might incur observable inference latency for users, whereas prototypical calibration [1] can incur significant computational overhead with its hundreds of random initializations and iterations of EM. Furthermore, BC can also use an on-the-fly running-mean estimation over mini-batches, which involves only a post-hoc, linear operation on the log-probabilities of the model.
[1] Han, Z., Hao, Y., Dong, L., Sun, Y. and Wei, F. Prototypical calibration for few-shot learning of language models. ICLR 2023.
The term "calibration" can be confusing; perhaps "bias" would be better. At the least, a footnote should be added to the Introduction. Another widely used notion of calibration is from the perspective of uncertainty, explainability, and reliability.
We appreciate the reviewer’s suggestion to clarify the notion of calibration. In our updated manuscript, we have included a clarification of the definition of calibration as a footnote in the introduction. We agree with the reviewer that the term “calibration” is usually associated with the trustworthiness, reliability, and interpretability of neural networks [2]. In the context of our work, we follow the notion of calibration defined in [3]: a technique for mitigating label shift and contextual bias in LLM predictions.
[2] Guo, C., Pleiss, G., Sun, Y. and Weinberger, K.Q. On calibration of modern neural networks. ICML 2017.
[3] Zhao, Z., Wallace, E., Feng, S., Klein, D. and Singh, S. Calibrate before use: Improving few-shot performance of language models. ICML 2021.
Please include model sizes and details in the paper, as the performance of ICL and prompting is quite irreproducible and context-dependent.
We thank the reviewer for highlighting the context-dependent nature of prompting. We acknowledge the sensitivity of LLM predictions to the context and prompt template. Therefore, we include detailed templates in Tables 8, 9, 10, & 11 for reproducibility and conduct all main experiments with 5 random seeds to report the variance for each study. Regarding model sizes, we use the PaLM 2-S, M, and L variants, and we kindly refer the reviewer to [4], which lists the details for each variant, including experiments on PaLM 2 with model sizes comparable to PaLM; PaLM 2-L shows performance competitive with GPT-4 [5]. We follow the notation for model variants in [4] and hope this addresses the reviewer’s concern.
[4] Anil, R., Dai, A.M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z. and Chu, E. Palm 2 technical report. arXiv preprint arXiv:2305.10403.
[5] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
Please adjust the margins of subfigures in fig. 2, it's overlapped.
We appreciate the reviewer's suggestions on the presentation. We have adjusted the margins and provided improved visualization in the updated manuscript.
The literature review of test-time tuning is a bit unnecessary, as ICL itself is not a tuning method.
In Appendix A, we include a discussion of test-time adaptation (TTA), where we explicitly focus on previous literature that uses test-time batch statistics for batch normalization (BN) without tuning. This line of research sheds light on the theoretical foundation of batch calibration, where contextual bias can also be framed as a form of covariate/label shift [6, 7]. Batch calibration takes inspiration from test-time batch adaptation in the novel context of LLMs, and we would be grateful if researchers in the TTA community were inspired to build advanced research ideas on top of batch calibration.
[6] Nado, Z., Padhy, S., Sculley, D., D'Amour, A., Lakshminarayanan, B. and Snoek, J., 2020. Evaluating prediction-time batch normalization for robustness under covariate shift. arXiv preprint arXiv:2006.10963.
[7] Sun, Q., Murphy, K., Ebrahimi, S. and D'Amour, A., 2022. Beyond invariance: Test-time label-shift adaptation for distributions with "spurious" correlations. arXiv preprint arXiv:2211.15646.
In Fig. 5, why does a strength greater than 1 consistently decrease accuracy?
We would like to recall that when the strength is 1, Equation 5 becomes zero-shot BC. We then refer the reviewer to Figure 2, which visualizes the decision boundary for BC. BC is equivalent to shifting the original ICL decision boundary by the amount of the contextual prior p̂(y|C), estimated by marginalizing the output scores. The strength term γ, which can be learned from labelled data, controls the strength of this shift operation, such that γ = 1 recovers BC. As visualized in Figure 2, an optimal linear decision boundary that separates the data points should always exist. At γ = 1, the estimate of p̂(y|C) by zero-shot BC serves as a strong but not necessarily optimal prior for shifting the decision boundary; by leveraging labelled data, we can further refine the decision boundary by learning γ. Consequently, an inappropriate strength term can shift the decision boundary onto an undesirable separating plane, which degrades performance.
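For intuition, the binary case can be written out explicitly; this is our illustrative rendering under the notation above, not an excerpt from the paper:

```latex
% Calibrated score for each class y under strength \gamma:
s_\gamma(y \mid x) \;=\; p(y \mid x, C) \;-\; \gamma\, \hat{p}(y \mid C)
% Predict y = 1 whenever
p(1 \mid x, C) - p(0 \mid x, C) \;>\; \gamma \left[\hat{p}(1 \mid C) - \hat{p}(0 \mid C)\right],
% a linear boundary whose offset scales with \gamma: an overly large \gamma
% over-shifts the boundary past the optimal separator and degrades accuracy.
```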
How significant is the bias/miscalibration of ICL performance? That is to say, is the problem still relevant for models at greater scales? Consider 65B LLaMA-2, GPT-3.5, GPT-4, etc., as the method does not require tuning.
LLMs have been shown to be very sensitive to the context, with biased predictions across all scales. This phenomenon was first revealed by [2] on GPT-2 and three sizes of GPT-3. [1, 8] further observe the same biased-prediction problem in the GPT-J and BLOOM families of models. Since PaLM 2-L [4] achieves results competitive with GPT-4, we follow this line of work and observe the same pattern in three variants of PaLM 2 models, where BC brings a significant performance improvement across all scales. We also extend the current study to a new modality by calibrating vision-language models, and we will endeavor to include other models in future work. We have incorporated this discussion in the revised version of the paper.
[8] Fei, Y., Hou, Y., Chen, Z. and Bosselut, A. Mitigating Label Biases for In-context Learning. ACL 2023.
Dear Reviewer 99FV,
We thank you once again for your feedback! As the discussion period is ending, we wonder if you could take a look at our response if you haven't already. We are eager to know whether our response has addressed your concerns; we'd be grateful if you could reconsider the rating if so. Otherwise, we are always happy to answer any further questions you may have.
Best,
Authors
This paper discusses the issue of bias and unexpected performance degradation in large language models (LLMs) when using prompting and in-context learning (ICL). To address this, the authors first provide a comprehensive analysis of existing calibration methods and their decision boundaries, and then propose a new calibration method called Batch Calibration (BC) with linear decision boundaries, which mitigates the bias from the batched input and is both zero-shot and inference-only. BC can be easily extended to learn the bias from labeled data, and applied to calibrate vision-language models. The authors conduct extensive experiments on over 10 natural language understanding and image classification tasks and show that BC achieves state-of-the-art results.
Strengths
- The proposed Batch Calibration method is simple and empirically effective.
- Extensive experiments are conducted.
Weaknesses
- The proposed method is not thoroughly justified.
- The writing could be improved in terms of the analysis of existing methods and the proposed method.
Questions
- The authors argued that linear decision boundaries produced by calibration methods can be more robust and generalizable across tasks. This argument is not well supported theoretically and empirically.
- The advantage of Batch Calibration over existing methods is not thoroughly justified. Could you provide more intuitive descriptions and theoretical analysis for it?
- The derivations of Table 1 could be provided in more detail in the main content or the supplementary material. Currently, they are not easy to understand.
We thank the reviewer for their insightful review and the acknowledgment of the empirical effectiveness of our method. Please see our response below, which we believe has addressed all the concerns based on the contents we refer to in the paper. We hope that in light of the response, the reviewer could consider improving their score.
The proposed method is not thoroughly justified.
We’d like to refer the reviewer to Section 2, Pages 2-5, where we conduct a comprehensive analysis of existing calibration methods through their decision boundaries. We identify the deficiencies of existing methods with individual case studies, and each design aspect of BC is justified with empirical evidence in Section 2.3 and motivated in Section 3. In particular, we address two important research questions for calibration methods through an interpretation of decision boundaries and model biases, as illustrated in Figures 2, 3, & 10, delivering novel guidelines for opting for a linear decision boundary and content-based calibration. We’d be grateful if the reviewer could point out what precisely, in their opinion, lacks justification, and we are happy to provide further clarification.
The writing could be improved in terms of the analysis of existing methods and the proposed method.
We thank the reviewer for their suggestion. We have revisited the analysis section and cleaned up any language problems. We’d be grateful if the reviewer could let us know of any further concerns, and we are happy to provide further clarifications.
The authors argued that linear decision boundaries produced by calibration methods can be more robust and generalizable across tasks. This argument is not well supported theoretically and empirically.
We state in Section 2.3 that, as the alternative design choice to a linear decision boundary, the non-linear decision boundary produced by GMM tends to be susceptible to overfitting and vulnerable to instability. Empirically, this is further evidenced by the extensive results in Table 2 and discussed in Section 4.2, where the non-linear decision boundary of PC exhibits the worst stability, with the largest standard deviation, even across numbers of shots.
Theoretically, we derive the non-linear boundary equation for GMM in Section 2.3; GMMs are known to be unstable in the presence of outlier data points, and the robustness of GMMs has been theoretically analyzed in previous work [1, 2]. Therefore, we opt for a linear decision boundary for BC due to its robustness, and we provide consistent and generalizable empirical evidence across tasks and modalities. We also conduct an extensive robustness analysis to further validate the empirical robustness of the linear decision-boundary design in Section 4.3, Page 8. We are open to any further discussion.
[1] Diakonikolas, I., Kane, D. M., & Stewart, A, 2017. Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures. FOCS 2017. IEEE.
[2] Diakonikolas, I., Hopkins, S.B., Kane, D. and Karmalkar, S., 2020. Robustly learning any clusterable mixture of gaussians. arXiv preprint arXiv:2005.06417.
The advantage of Batch Calibration over existing methods is not thoroughly justified. Could you provide more intuitive descriptions and theoretical analysis for it?
We thank the reviewer for expressing interest in a more intuitive description of BC. In contrast with existing methods that rely on content-free tokens and fail on multi-sentence classification tasks, BC is a content-based method using test-time statistics, and is thereby more robust to distribution shifts and more generalizable across tasks. Intuitively, in Section 2.3 we discuss how calibration can be framed as an unsupervised decision-boundary learning problem. Figures 2 & 10 then visualize the decision boundaries of BC and existing methods. To constitute a better decision boundary, we recall from Section 3 that the most critical component of calibration is to accurately estimate the contextual bias term that shifts the original ICL decision boundary. This term serves as a prior that can be approximately estimated by marginalizing the output likelihood over all samples within the batch, as in Equation 2. This forms the mathematical foundation of BC, and we welcome further discussion with the reviewer.
The derivations of Table 1 could be provided in more detail in the main content or the supplementary material. Currently, they are not easy to understand.
We appreciate the reviewer for pointing this out, and we have made updates in the caption of Table 1 to provide a supplementary explanation for each cell, including a reference to Section 2.3 that presents detailed derivations of decision boundaries for each method.
Thanks for the detailed responses from the authors, which have clarified most of my concerns. I will raise my rating to 6.
We thank the reviewer for their feedback, and we are glad that we have largely clarified their concerns. We are always happy to discuss further if the reviewer has any additional questions.
Once again, we thank all reviewers and the area chair for their time and attention! We have uploaded the revised manuscript and responded to the reviewers individually. As the end of the discussion period is approaching, we'd really be grateful if the reviewers could look at the feedback/revised manuscript if they have not done so, and we are always happy to answer any remaining questions they may have.
This paper studies calibration methods for language and vision-language models, specifically from the perspective of decision boundaries and model biases. Through a series of derivations and empirical analysis/observations, the authors propose two simple modifications from prior works: Linear decision boundaries and content-based (as opposed to prior content-free) method specifically utilizing a batch of samples to estimating the contextual prior. Results are shown across a number of large models including PaLM 2-S/L.
The reviewers appreciated the well-laid out analysis and empirical observations driving the design of the proposed method, simplicity and zero-shot nature of the method, and extensive experimental results. There were a number of weaknesses raised, including the (admittedly small) computational overhead which might be an issue in some cases, lack of some details including prompt details and model sizes, and the importance of the strength parameter. The authors provided a significant rebuttal justifying the method, computational overhead (which any method will likely have some of), and additional details. After the rebuttal, the reviewers uniformly rated this paper above acceptance and after considering the paper, reviews, rebuttals, and post-rebuttal discussion I agree that the paper should be accepted. It discusses an increasingly important issue of bias in large-scale models, provides a nice set of analysis for thinking about the issues in current methods and an improved simple method, and even has results for vision-language models which will be of interest to the larger community.
I encourage the authors to take significant heed of the discussions and improve the paper for the camera-ready submission, for example clarifying the notion of calibration and perhaps using more appropriate terminology to prevent confusion.
Why not a higher score
While reviewers mentioned that they were satisfied with the rebuttal, they still thought some aspects reduce the paper's impact and that it should not be at the spotlight/oral level.
Why not a lower score
Overall, all of the reviewers agreed that the paper provided a nice analysis and contribution, and they were mostly satisfied with the rebuttal, with all reviewers scoring it above the acceptance threshold.
Accept (poster)