PaperHub
Overall rating: 6.3/10 (Poster) · 4 reviewers
Individual ratings: 6, 6, 7, 6 (min 6, max 7, std 0.4)
Average confidence: 3.5
COLM 2025

Ensemble Debiasing Across Class and Sample Levels for Fairer Prompting Accuracy

OpenReview · PDF
Submitted: 2025-03-18 · Updated: 2025-08-26
TL;DR

We propose a Heaviside step function-based ensemble debiasing method that flexibly rectifies biased ICL output class probabilities at both class and sample levels, achieving fairer prompting accuracy for LLMs.

Abstract

Keywords

ensemble debiasing · accuracy imbalance · Heaviside step function · post-hoc correction

Reviews and Discussion

Review
Rating: 6

This paper proposes a novel method called DCS (Dynamic Class-wise Scaling) to address class-wise accuracy imbalance in classification tasks. The approach employs simulated annealing to determine optimal linear mapping functions f(·) for probability calibration across different classes. By applying these learned linear transformations to model outputs, the method effectively mitigates class accuracy imbalance while improving overall classification performance. Experimental validation in few-shot settings using LLaMA-2-13B demonstrated that DCS achieved accuracy improvements in six out of seven benchmark datasets, with significant enhancement in the balance of class-wise accuracies.

Reasons to Accept

  • This paper demonstrates that the proposed DCS method achieves state-of-the-art (SOTA) performance in addressing class imbalance challenges. In the N-shot setting (where N equals the total number of classes), the approach delivers consistent improvements across six out of seven benchmark datasets, with an average accuracy enhancement of approximately 1%.
  • The paper demonstrates strong writing skills through its well-structured methodology and clear technical exposition.

Reasons to Reject

  • This paper identifies DNIP and FuRud - two current state-of-the-art methods in the field - at line 249, but fails to formally cite these seminal works in that section.
  • Limited novelty: the study demonstrates limited originality, as its core component (the triangular membership function) is directly adopted from previous work by Lin & You (2024b). The primary technical contribution centers on employing simulated annealing to optimize the parameters of these triangular membership functions through a systematic search.

Questions to Authors

  • Could you clarify how the hyperparameters $a_k$, $b_k$, and $c_k$ of the triangular membership function are determined in the paper?
Comment

Thank you for your valuable review and the constructive comments! We appreciate your pointing out the citation issue and will add the proper citations.

Regarding the question on how $a_k, b_k, c_k$ of the triangular membership functions are determined: we use a finite, pre-determined set of triangular membership functions ($\mu_k$'s), with $a_k, b_k, c_k$ controlling their active ranges. Our SA algorithm performs a constrained combinatorial optimization to search for the optimal combination of which specific $\mu_{\xi_i}$ and $\omega_{\xi_i}$ to assign to each class $i$, where $\xi_i$ is the selection index. The Heaviside functions $H(\cdot)$ in Eq. (4) dynamically select $\mu_{\xi_i}$ or $\omega_{\xi_i}$ based on a fuzzy decision index $D_F$ (typically set between 10 and 20, covering 4 fuzzy slope partitions). Specifically, if $\xi_i \le D_F$, $\mu_{\xi_i}$ is selected. For instance, $\mu_1$ with $a_1=0, b_1=0, c_1=0.5$ applies linear downscaling from probability 0 to 0.5, peaking at 1 when $p=0$, enabling smooth low-range probability transformations.
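For illustration only, here is a minimal Python sketch of the two ingredients described above: a triangular membership function parameterized by $(a_k, b_k, c_k)$ and a Heaviside-gated choice between a membership correction and a weight correction. The function names, the multiplicative form of the weight branch, and the indexing convention are assumptions for this sketch, not the paper's exact Eq. (4).

```python
import numpy as np

def triangular_membership(p, a, b, c):
    """Triangular membership value of probability p for parameters (a, b, c).

    Rises linearly from a to a peak of 1 at b, then falls linearly to 0 at c.
    With (a, b, c) = (0, 0, 0.5), it peaks at p = 0 and decays to 0 at p = 0.5.
    """
    left = 1.0 if b == a else (p - a) / (b - a)
    right = 1.0 if c == b else (c - p) / (c - b)
    return float(np.clip(min(left, right), 0.0, 1.0))

def corrected_score(p, xi, D_F, memberships, weights):
    """Correct one class probability p given its selection index xi.

    If xi <= D_F, the xi-th triangular membership correction is applied;
    otherwise a class-level weight correction is used. H(D_F - xi) plays the
    role of the Heaviside gate; the weight branch here is a simple
    multiplicative reweighting, which is an assumption of this sketch.
    """
    gate = np.heaviside(D_F - xi, 1.0)   # 1 when xi <= D_F, else 0
    if gate:
        a, b, c = memberships[xi]        # membership (sample-level) correction
        return triangular_membership(p, a, b, c)
    return weights[xi - D_F] * p         # weight (class-level) correction
```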

Each SA step proposes a new combination of weight and membership correction functions, accepted if it improves the objective or satisfies the Metropolis criterion. This discrete optimization over function combinations allows flexible, post-hoc corrections at both class and sample levels - our key contribution beyond prior work.
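A rough sketch of one such annealing step over the discrete per-class assignments, assuming a generic objective that combines overall error with class- and sample-level bias terms (the proposal scheme and cooling details are simplified, not the paper's exact procedure):

```python
import math
import random

def sa_step(assignment, objective, temperature, num_options):
    """One simulated-annealing move over per-class correction-function choices.

    `assignment[i]` is the selection index currently assigned to class i,
    `objective` maps an assignment to a scalar loss (lower is better), and
    `num_options` = D_F + D_W is the number of candidate functions per class.
    """
    candidate = list(assignment)
    cls = random.randrange(len(candidate))            # perturb one class's choice
    candidate[cls] = random.randint(1, num_options)   # propose a new selection index

    delta = objective(candidate) - objective(assignment)
    # Accept improvements outright; otherwise accept with Metropolis probability.
    if delta <= 0 or random.random() < math.exp(-delta / temperature):
        return candidate
    return assignment
```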

Comment

I agree with the authors' point in their response that the core innovation of their work lies in achieving flexible posterior calibration across both class and sample dimensions. However, the class-level debiasing design is relatively crude, and the sample-level debiasing method shows little innovation, thus limiting the overall novelty. Therefore, I will keep my score.

Review
Rating: 6

This paper aims to improve the overall accuracy of class predictions by large language models (LLMs) by elevating weaker classes through an ensemble debiasing method based on a Heaviside step function. By formulating class prediction as an optimization problem that accounts for overall error, class-level bias, and sample-level bias, the proposed DCS (Debiasing at Class and Sample levels) framework calibrates LLM predictions to mitigate the aforementioned biases inherited from the training data or caused by prompts, thereby enhancing downstream class prediction accuracy.

Reasons to Accept

  1. Generally improved performance across different general domains, especially non-trivial improvements in the two biomedical domains (Table 1)
  2. It demonstrates that the combination of sample-level and class-level correction can elevate the weak classes

Reasons to Reject

  1. Concerns about over-engineering: the DCS framework involves combinatorial optimization techniques with many hyperparameters to set up and tune (e.g., the weight corrections and membership corrections for each class; the simulated annealing (SA) parameters). From a complexity perspective, it may not be that different from directly fine-tuning an LLM on these data, making the performance boost unsurprising.
  2. Generalizability and efficiency: due to the setup and tuning issues mentioned above, DCS is not a plug-and-play framework, so it is hard to generalize quickly to a new dataset.
  3. Lack of baseline comparison: while this work compares to other optimization-based or fuzzy-based methods, it could be improved by incorporating more baselines, for example, fine-tuning with the N-shot examples.

Questions to Authors

  1. The DNIP paper (COBias and Debias) and the FuRud paper (Let the Fuzzy Rule Speak) are not directly cited when these abbreviations first appear.
  2. The paper could be improved by visualizing the ablation of sample-level and class-level rebalancing, e.g., with bar charts of performance.
Comment

Thank you for your insightful review and the constructive comments! We will correct the citation issue and appreciate your suggestion on visualizing the ablation of sample-level and class-level rebalancing. We will include bar charts (temporarily linked as https://anonymous.4open.science/r/colm-0D36/DCS_visualization.pdf).

Regarding hyperparameter tuning, generalizability, and efficiency: most hyperparameters are fixed in practice, including the number of triangular membership functions and most SA parameters. Only five scalars are tuned: the weight correction scale, $\beta, \tau$ in Eq. (5), and $\lambda_1, \lambda_2$ in SA. Our post-hoc optimization averages 46 seconds (as computed from Figure 3 in the paper), significantly lighter than LLM fine-tuning, especially at the 70B scale, where bias remains significant and fine-tuning is costly. Our method adapts outputs directly, making it applicable to large or closed models that expose only logits or probabilities.
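For concreteness, the tuned-versus-fixed split might look like the configuration sketch below; the names, default values, and fixed settings are illustrative assumptions rather than the paper's exact configuration.

```python
# Hypothetical DCS configuration: five tuned scalars vs. settings fixed in practice.
dcs_config = {
    # Tuned per task:
    "weight_correction_scale": 1.0,   # scale of the class-level weight corrections
    "beta": 1.0,                      # Eq. (5)
    "tau": 1.0,                       # Eq. (5)
    "lambda_1": 0.5,                  # SA objective term
    "lambda_2": 0.5,                  # SA objective term
    # Fixed in practice:
    "num_membership_fns": 20,         # the pre-determined triangular functions
    "sa_initial_temperature": 1.0,
    "sa_cooling_rate": 0.95,
    "sa_steps_per_temperature": 100,
}
```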

Regarding baselines: while a class-balanced loss or other related losses can be used when fine-tuning smaller models, they do not guarantee balanced class accuracy, as they remain indirect training surrogates. We appreciate the suggestion to compare with N-shot fine-tuning and will consider it in future or supplementary work.

Review
Rating: 7

The study proposes a post-hoc ensemble debiasing framework to correct in-context learning probabilities at both class and sample levels. It evaluates the method on 7 datasets, and the results show that the method improves the performance of low-accuracy classes while effectively mitigating class performance imbalances.

Reasons to Accept

The proposed method aims to improve the performance of low-accuracy classes while keeping the overall performance at the SOTA level. Results suggest that the method is promising, and it also helps with domain-specific tasks that commonly face the challenge of class imbalance.

Reasons to Reject

The current study analyzed the annealing time for optimization. It would be helpful if the authors could elaborate more on the effectiveness (e.g., time, cost) of the proposed method for implementation in real practice.

Questions to Authors

Please see the previous section for questions and comments.

Comment

Thank you for your valuable review and for highlighting a key practical consideration! We appreciate the opportunity to elaborate on the real-world applicability of our proposed DCS method.

Accessibility

DCS is model-agnostic, requiring only output probabilities, making it particularly suitable for large or closed LLMs where only logits/output probabilities are accessible.

Post-hoc optimization

The post-hoc, offline optimization of DCS requires no model architecture modification or prompt engineering. The annealing time averages 46 seconds (computed from Figure 3 in the paper). Hyperparameter tuning can be done efficiently because most hyperparameters are pre-configured in practice, with only a few scalar values (e.g., $\beta, \tau$) requiring tuning. Moreover, since the class-level correction method DNIP demonstrates low-data optimization capability, we infer that DCS is also amenable to low-data scenarios. These properties make DCS particularly valuable in specialized applications where fairness and accuracy are paramount.

Negligible overhead

The learned correction functions are reused on the fly at prediction time, taking mere milliseconds per prediction and introducing virtually no computational overhead or latency.
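A minimal sketch of what this on-the-fly reuse could look like at prediction time, building on the `corrected_score` helper from the earlier sketch (the function names and the argmax decision rule here are assumptions):

```python
import numpy as np

def predict_with_corrections(class_probs, assignments, D_F, memberships, weights):
    """Apply the offline-learned per-class corrections to one sample's ICL probabilities.

    `class_probs` is the LLM's output probability vector and `assignments[i]` is
    the selection index learned offline for class i. Only a handful of cheap
    elementwise operations are involved, so the added latency is negligible.
    """
    corrected = [
        corrected_score(p, xi, D_F, memberships, weights)
        for p, xi in zip(class_probs, assignments)
    ]
    return int(np.argmax(corrected))  # debiased class prediction
```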

Scalability

Our method exhibits linear scaling with the number of classes ($N$). At each temperature in simulated annealing, the number of searches remains within several multiples of the search space size $N(D_F + D_W)$, ensuring practical feasibility across small and large classification tasks.
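As a back-of-the-envelope illustration of this bound (the class and option counts below are assumed, with $D_W$ denoting the number of weight-correction options as in the expression above):

```python
# Rough per-temperature proposal budget for an assumed 14-class task.
N, D_F, D_W = 14, 20, 10            # classes, membership options, weight options
search_space = N * (D_F + D_W)      # 420 candidate (class, function) pairs
proposals_per_temperature = 3 * search_space  # "several multiples" of the space
print(search_space, proposals_per_temperature)  # 420 1260
```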

Review
Rating: 6

In this paper, the authors propose to use the Heaviside step function as the mapping function for ensemble debiasing to achieve fairer prompting accuracy. Specifically, the mapping function is utilized for in-context learning probability correction. They also propose to solve the non-differentiable framework with a simulated annealing algorithm. Experiments conducted on several multi-class text classification benchmark datasets demonstrate that the proposed method can improve the performance of the Llama-2 LLM. The authors also conduct analyses of probability correction and ablations to validate the effectiveness of the proposed framework.

Reasons to Accept

  • The ensemble-based correction on weights and membership functions enhances the performance.
  • The proposed method generally improves the performance of Llama-2 across many benchmark datasets.
  • Solid analysis from different perspectives in the experiments.

Reasons to Reject

  • Lack of concrete examples of how the proposed method improves performance.
  • Only tested on Llama-2-13B. It would be interesting to see results on different model families.
  • The proposed framework underperforms on some datasets. It would be interesting to discuss what the gap is.
Comment

Thank you for your constructive review and valuable suggestions! We will add examples and discussion of how our method improves performance, address the underperforming case, and include experiments on more model families.

Regarding concrete examples: we add visualizations on how sample-level and class-level corrections benefit the classes, with bar charts temporarily linked as https://anonymous.4open.science/r/colm-0D36/DCS_visualization.pdf.

Regarding more model families: we also appreciate the suggestion on experimenting with more model families. We experimented with a more recent LLM, Gemma-2-2B, and obtained consistent improvements similar to the Llama-2 cases, demonstrating that our method is applicable to LLMs of varied sizes and families. Additional results are shown in the table temporarily linked as https://anonymous.4open.science/r/colm-C8E3/DCS_gemma_results.pdf.

Regarding the underperforming case: DCS underperforms on the binary classification RTE task because the binary setting may not fully exploit the ensemble's flexibility. Compared to multi-class cases, there are only three configuration options (both classes use class-level corrections, one class uses a class-level correction while the other uses a sample-level correction, or both classes use sample-level corrections), so the benefit of diversity diminishes. FuRud (sample-level) performs best on accuracy owing to its capacity to push borderline samples to the correct class, while DNIP (class-level) excels on COBias via consistent global reweighting. DCS adds complexity, but its broad flexibility does not translate into gains in the simpler binary setting.

Final Decision

Reviewers appreciated the importance of the problem and the DCS method, which showed very strong results. All reviewers recommended acceptance.