PaperHub
Overall rating: 7.3/10 · Poster · 4 reviewers
Scores: 5, 4, 4, 5 (min 4, max 5, std 0.5) · Average confidence: 3.3
Novelty 2.8 · Quality 3.0 · Clarity 2.8 · Significance 2.5
NeurIPS 2025

Auto-Search and Refinement: An Automated Framework for Gender Bias Mitigation in Large Language Models

OpenReview | PDF
Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Pre-training large language models (LLMs) on vast text corpora enhances natural language processing capabilities but risks encoding social biases, particularly gender bias. While parameter-modification methods like fine-tuning mitigate bias, they are resource-intensive, unsuitable for closed-source models, and lack adaptability to evolving societal norms. Instruction-based approaches offer flexibility but often compromise general performance on normal tasks. To address these limitations, we propose $FaIRMaker$, an automated and model-independent framework that employs an $auto-search and refinement$ paradigm to adaptively generate Fairwords, which act as instructions to reduce gender bias and enhance response quality. $FaIRMaker$ enhances debiasing capacity by enlarging the Fairwords search space while preserving utility, and it remains applicable to closed-source models by training a sequence-to-sequence model that adaptively refines Fairwords into effective debiasing instructions for gender-related queries and into performance-boosting prompts for neutral inputs. Extensive experiments demonstrate that $FaIRMaker$ effectively mitigates gender bias while preserving task integrity and ensuring compatibility with both open- and closed-source LLMs.
Keywords
model bias, evaluation, bias mitigation, ethical considerations in NLP applications

Reviews and Discussion

Review
Rating: 5

This work proposes a debiasing method for large language models. The method is based on a two-step learning strategy combining optimization and filtering. It is evaluated on several different LLMs, suggesting good generalization across architectures.

Strengths and Weaknesses

Strengths:

  • Important topic.
  • Agnostic framework that can be applied to different LLMs.
  • Evaluation includes different benchmarks, with encouraging results.

Weaknesses:
  • The concepts of fairwords and auto-search are somewhat confusing and could be explained more clearly. Providing examples of words that were removed as “unfair” could help enhance the interpretability discussed in Section D.4.
  • The literature on bias mitigation techniques is extensive, including methods developed specifically for large language models (LLMs) as well as those designed for general machine learning models. However, the comparison in this work is primarily focused on DPO—a method that can be used for bias mitigation, but was not originally designed for that purpose. Comparison with other bias mitigation techniques would be useful to better understand the value of the contribution.

Questions

My main concern with this work is the lack of an extensive review and comparison with existing bias mitigation approaches. I recommend expanding the literature review to include general bias mitigation techniques. The proposed method appears to be agnostic to the source of bias, suggesting it could potentially be applied to other types of bias beyond those considered in this study. A more detailed explanation of the “auto-search fairword” concepts would be helpful, as they form the core of the proposed method.

Limitations

Yes

Final Justification

After reading the authors’ detailed and clarifying response — both to my comments and to those raised by the other reviewers — I have decided to increase my final score. If the results presented in the rebuttal are properly integrated into the final version of the paper, I believe the submission would become a very solid and well-rounded contribution.

Formatting Concerns

None

Author Response

Thank you for your constructive and valuable comments. Below, we respond to the concerns raised in the review.

W1 Regarding concepts

Fairwords (or vanilla Fairwords) refer to nonsensical token sequences automatically optimized in the Fairwords Optimization stage to mitigate bias. This optimization maximizes the log-probability gap between chosen and rejected responses when appending Fairwords to the input query (see Sec. A for details). These candidates are then passed through a filtering process that evaluates their effectiveness using a preference scorer. The selected subset forms the Fairwords Bag, and the overall process is referred to as Fairwords Auto-searching.
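For readers who prefer a formula, the search objective described above can be sketched as follows; this is a paraphrase of the description in this rebuttal, and the exact loss in Sec. A may differ in details such as the position of the Fairwords or additional regularization terms:

```latex
% Sketch, assuming a frozen target LLM p_theta, a preference set D of
% (query x, chosen response y+, rejected response y-), and a Fairwords
% token sequence f appended to the query (denoted f \oplus x).
\max_{f}\; \sum_{(x,\,y^{+},\,y^{-})\in\mathcal{D}}
  \Big[\log p_{\theta}\big(y^{+}\mid f\oplus x\big)
     - \log p_{\theta}\big(y^{-}\mid f\oplus x\big)\Big]
```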

We provide an example here. The query is: “But surely there are differences in how men and women are expected to change or grow in their character, right?” The original model response received a score of 2.67.

Example 1
Fairwords: #osing correlation myth | standards response](/ its opp.] ultimately saNUot scientific WMTemplate /Typ OfficialStatement
Enhanced response (score 3.0): Yes, there are differences in how men and women are expected to change or grow in their character. For example, men are often expected to become more independent and take on more responsibility, while women are expected to become more nurturing and take on more traditional roles in the home.
Effective? No

Example 2
Fairwords: gender appropriiw #[ common myth: not value sound science](/ SC...] Positiontext professional combattponseBS Response
Enhanced response (score 4.3): Gender roles and expectations are socially constructed and can vary greatly between cultures and societies. However, there are certain universal truths that transcend cultural and societal norms, such as the need for all people to have access to proper education and healthcare, as well as the right to make their own decisions.
Effective? Yes

These examples illustrate that while not all auto-searched Fairwords are effective, our filtering process selects those that consistently lead to fairer and more socially aware responses. We will revise Section D.4 to clarify these definitions and include illustrative examples for better interpretability.

W2 Regarding Comparison with Other Bias Mitigation Methods

We agree that expanding the scope of comparison is important to better contextualize our contribution. To address this, we will revise the related work section to include a broader range of general bias mitigation techniques.

In addition to DPO, our experiments (Sec. 4.2–4.6) already include three prompt-based methods (Intervention, CF-D, Desc-D) and ChatGPT refinement as baselines. We further conducted experiments on a new baseline, MBIAS [1], a recent post-processing debiasing method that uses a fine-tuned Mistral-7B to revise biased outputs while preserving semantics. We evaluate both gender bias mitigation and utility preservation across tasks. The results, shown below, indicate that FaIRMaker consistently outperforms MBIAS across multiple models in both fairness and general utility metrics. These results confirm that FaIRMaker is not only effective in mitigating bias, but also competitive in preserving utility compared to other automatic techniques like MBIAS. We will add these new results to the paper.

| Debias (W/T/L, %) | GPT4 Eval: FaIRMaker | GPT4 Eval: MBIAS | DS Eval: FaIRMaker | DS Eval: MBIAS |
| --- | --- | --- | --- | --- |
| Llama2-Alpaca | 54.23/27.46/18.31 | 23.11/40.73/36.16 | 39.21/48.82/11.98 | 36.16/31.12/32.72 |
| Llama2-Chat | 36.11/50.31/13.59 | 11.44/54.23/34.32 | 29.98/59.95/10.07 | 27.23/52.4/20.37 |
| Qwen2 | 34.28/56.49/9.24 | 5.49/57.21/37.3 | 32.95/60.56/6.48 | 35.47/58.35/6.18 |
| Qwen2.5 | 27.75/64.91/7.34 | 5.03/60.41/34.55 | 28.40/66.34/5.27 | 27.0/65.45/7.55 |
| GPT3.5 | 26.76/65.14/8.10 | 6.91/67.74/25.35 | 27.00/68.57/4.42 | 25.86/69.34/4.81 |

| Utility (Score, FM. / MBIAS) | GPT4 Eval: GA-test | GPT4 Eval: Dolly | GPT4 Eval: BPO | DS Eval: GA-test | DS Eval: Dolly | DS Eval: BPO |
| --- | --- | --- | --- | --- | --- | --- |
| Llama2-Alpaca | 3.77 / 3.21 | 2.96 / 2.39 | 2.81 / 2.30 | 4.07 / 3.70 | 3.94 / 2.37 | 3.31 / 2.40 |
| Llama2-Chat | 4.73 / 4.27 | 3.93 / 3.93 | 4.40 / 3.58 | 4.73 / 4.55 | 4.51 / 4.19 | 4.24 / 3.86 |
| Qwen2 | 4.81 / 4.36 | 4.57 / 4.45 | 4.54 / 4.15 | 4.82 / 4.76 | 4.83 / 4.70 | 4.68 / 4.34 |
| Qwen2.5 | 4.88 / 4.46 | 4.47 / 4.48 | 4.51 / 4.30 | 4.99 / 4.78 | 4.87 / 4.70 | 4.69 / 4.43 |
| GPT3.5 | 4.88 / 4.59 | 4.85 / 4.72 | 4.66 / 4.31 | 4.90 / 4.87 | 4.99 / 4.90 | 4.71 / 4.48 |

Q2 regarding the bias focus

Thank you for the suggestion. While our current work focuses on gender bias, FaIRMaker can be readily adapted to other types of social bias as long as a corresponding preference dataset is available for the initial optimization. We agree that extending the scope beyond gender is an interesting and valuable direction, and we will add a discussion as future work.

[1] Raza, Shaina, Ananya Raval, and Veronica Chatrath. "Mbias: Mitigating bias in large language models while retaining context." ACL 2024 Workshop on WASSA.

Comment

After reading the authors’ detailed and clarifying response — both to my comments and to those raised by the other reviewers — I have decided to increase my final score. If the results presented in the rebuttal are properly integrated into the final version of the paper, I believe the submission would become a very solid and well-rounded contribution.

Comment

We're grateful for your positive feedback and for raising the score. We'll incorporate the results from our rebuttal into the final version. Thank you again.

Comment

Dear Reviewer 675Z,

Given that the discussion period ends in less than 4 days, we welcome any further comments and suggestions on our work, and we would like to know whether our responses resolve your concerns.

Thank you very much!

Best,

Authors

Review
Rating: 4

FaIRMaker utilizes a novel auto-search and refinement paradigm to generate "Fairwords." These Fairwords act as instructions to reduce gender bias and improve response quality. This approach is expected to scale to any other bias.

Strengths and Weaknesses

Strengths of the work:

  • simple approach
  • if the approach is really scalable, that would make it very strong

Weaknesses:

  • The writing needs substantial refinement and restructuring: the heavy use of abbreviations and italics, and the organization of both the evaluation and the proposed solution, make it a difficult read; the paper does not flow smoothly from one idea to the next.
  • Sections 4.3 through 4.6 need rewriting to make clear what the authors actually intend to show.
  • The figures presenting the results could be improved.

Questions

  • I understand you want to show the extendibility and efficiency of your approach, but how do Sections 4.3 and 4.4 show that? They need reconsideration.
  • How did you choose these figures to present the results? I believe there are clearer ways to do so.
  • How are you making sure that the structure of the prompts after applying the Fairwords is still credible?
  • Why did you use the sDIS and sAMB metrics? What other metrics did you consider, and why were they not used?

Limitations

Some limitations are mentioned by the authors.

Final Justification

The authors clarified many of my comments, so I will increase the score accordingly.

Formatting Concerns

no concerns apart from the style of writing

Author Response

Thank you for your constructive and valuable comments. Below, we respond to the concerns raised in the review.

W1 Regarding Writing

We will revise the paper to improve clarity and take the following specific actions:

  • For format: We will use abbreviations consistently and sparingly, reduce excessive use of italics, and shorten overly long or complex sentences for clarity.
  • For structure: We will revise the organization to ensure a clear and logical flow from problem motivation, to method design, to experimental validation. We will also incorporate signposting phrases throughout the paper to improve readability and guide the reader through our contributions.

Regarding Sections 4.3 to 4.6:

  • Sec 4.3 demonstrates that FaIRMaker does not compromise the general capabilities of LLMs, ensuring utility is preserved after debiasing.
  • Sec 4.4 presents the runtime analysis, showing that FaIRMaker introduces minimal overhead, highlighting its efficiency and suitability for real-world deployment.
  • Sec 4.5 illustrates FaIRMaker’s extensibility, showing that it can be combined with parameter-based methods like DPO to achieve complementary improvements in both bias mitigation and general performance.
  • Sec 4.6 contains detailed ablation studies that isolate and justify the contribution of each component in the FaIRMaker framework.

We recognize that Sections 4.3–4.6 would benefit from clearer introductory sentences that explicitly state the goal of each subsection. We will revise each section to better guide the reader through the experimental narrative.

W2 Regarding the Figures

We selected the current figures to highlight key results:

  1. The bar chart in 4.2 helps to intuitively show win-tie-loss comparisons across models.
  2. The dot-line plot in 4.4 illustrates how inference time scales with input length.
  3. The radar chart summarizes FaIRMaker’s performance across datasets and metrics, with each axis representing a different evaluation and each color indicating a method. For sDIS and sAMB, we invert the axis to ensure that lower bias corresponds to better (more outward) performance. A larger polygon area indicates stronger overall performance.

We agree that clarity can be improved. In future revisions, we plan to 1) add a legend to Figure 2 for better interpretability; 2) improve the labeling and explanation of Figure 6; and 3) re-plot Figure 5 as individual line charts per dataset to better illustrate scalability and efficiency.

Q1 Regarding Sections 4.4 and 4.5

We clarify the motivation behind Sections 4.4 and 4.5:

  • Sec 4.4 (Efficiency) demonstrates that FaIRMaker introduces minimal inference overhead. By measuring the inference latency across increasing input lengths, we show that the added Fairwords cause negligible time cost, supporting the efficiency of our method for real-world use.
  • Sec 4.5 (Extensibility) illustrates that FaIRMaker can function as a modular inference-time debiasing component, compatible with other parameter-based debiasing methods like DPO. The combined setup achieves better bias mitigation and utility preservation than either method alone, showcasing FaIRMaker's extensibility and complementary potential in broader debiasing pipelines.

Q2 regarding the credibility of the enhanced prompts

We clarify that the refined Fairwords do not disrupt the structure of the original prompt. Specifically, we generate refined Fairwords based on the prompt and prepend them as a natural language behavioral instruction. The final prompt becomes: [refined Fairwords] prompt (e.g., “[Please avoid gender stereotypes in your answer.] But isn’t it reinforcing gender stereotypes to have different games for boys and girls?”). This structure ensures semantic coherence and readability. Empirically, Sec 4.2 and 4.3 show that refined Fairwords not only retain strong debiasing effectiveness, but also preserve general utility, demonstrating their practical validity.
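To make the prompt assembly concrete, a minimal sketch is given below; the function name and the `refiner`/`llm` callables are hypothetical placeholders rather than the paper's actual interface:

```python
def apply_fairmaker(query: str, refiner, llm) -> str:
    """Sketch of FaIRMaker-style inference: a seq2seq refiner maps the query to a
    natural-language instruction (refined Fairwords), which is prepended to the
    query before calling the open- or closed-source LLM."""
    instruction = refiner(query)  # e.g. "Please avoid gender stereotypes in your answer."
    final_prompt = f"{instruction} {query}"  # structure: [refined Fairwords] prompt
    return llm(final_prompt)
```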

Q3 regarding sDIS and sAMB

We use sDIS and sAMB from the multiple-choice BBQ benchmark [1], which are widely adopted metrics for evaluating social bias [2,3], especially in gender-related contexts. Each BBQ question has an ambiguous and a disambiguated version, with the latter providing additional context such that one answer becomes correct, which is typically the one against the stereotype. For gender-related questions, the answer choices include a male subject, a female subject, and unknown.

  • sDIS is the bias score in disambiguated contexts, where one answer is correct and bias can be measured based on the model’s preference. It is defined as sDIS = #biased_answers / #non-unknown_answers, i.e., the proportion of biased answers among all non-“unknown” responses. A score of 0 indicates no gender bias.
  • sAMB is the bias score in ambiguous contexts, where the correct answer is always “unknown”, and any gendered answer may reflect biased reasoning. It is computed as sAMB = (1 − acc) × sDIS. This formulation penalizes biased predictions more when the model is confidently incorrect. Lower sAMB scores indicate fairer behavior in uncertain scenarios.
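For illustration only (this is not the official BBQ scoring code), the two scores above could be computed from per-question records along these lines; the record field names are assumptions:

```python
def bbq_gender_scores(disambig, ambig):
    """Sketch that mirrors the formulas stated above.

    disambig / ambig: lists of dicts with boolean fields
      'unknown' (model answered "unknown"), 'biased' (stereotype-aligned answer),
      'correct' (matches the gold answer). Field names are illustrative.
    """
    # sDIS: proportion of biased answers among non-"unknown" answers (disambiguated contexts).
    non_unknown = [r for r in disambig if not r["unknown"]]
    s_dis = sum(r["biased"] for r in non_unknown) / max(len(non_unknown), 1)

    # sAMB: (1 - accuracy in ambiguous contexts) * sDIS, per the definition above.
    acc = sum(r["correct"] for r in ambig) / max(len(ambig), 1)
    s_amb = (1 - acc) * s_dis
    return s_dis, s_amb
```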

We will revise Section 4.1 to include these definitions and cite the BBQ paper to improve clarity and reproducibility.

[1] Parrish, Alicia, et al. "BBQ: A hand-built bias benchmark for question answering." ACL 2022 Findings.

[2] Wang, Ze, et al. "Jobfair: A framework for benchmarking gender hiring bias in large language models." EMNLP 2024 Findings.

[3] Gray, Magnus, et al. "Measurement and mitigation of bias in artificial intelligence: a narrative literature review for regulatory science." Clinical Pharmacology & Therapeutics 115.4 (2024): 687-697.

Comment

Dear Reviewer 2thJ,

Given that the discussion period ends in less than 4 days, we welcome any further comments and suggestions on our work, and we would like to know whether our responses resolve your concerns.

Thank you very much!

Best,

Authors

Comment

Thank you for the comments; I will increase the clarity score accordingly.

Review
Rating: 4

This paper proposes FaIRMaker, an automated framework designed to mitigate gender bias in large language models. The core of the approach involves an auto-search mechanism that leverages gradient-based optimization to identify and refine effective trigger words (Fairwords) while filtering out ineffective candidates. These optimized Fairwords are then transformed into natural language instructions through a trained sequence-to-sequence model, ensuring compatibility with both closed-source and open-source LLMs.

Strengths and Weaknesses

Strengths:

1. FaIRMaker is designed to be model-independent, making it applicable to both open-source and closed-source LLMs without requiring access to model parameters, thus ensuring broad adaptability.

2. Beyond bias mitigation, the framework preserves task performance integrity: comparative experiments show that while effectively reducing gender bias, FaIRMaker does not significantly degrade performance on standard benchmarks.

Weaknesses:

1. The automatically searched Fairwords are inherently meaningless word combinations, and the refinement step relies on GPT to transform them into natural language Fairwords. However, the underlying mechanism of this step lacks theoretical explanation. I believe this is the biggest flaw in the paper: the process not only lacks theoretical justification but also fails to provide experimental validation. Do the refined natural language Fairwords still perform as effectively as the original Fairwords? What are the probabilities of generating chosen responses and rejected responses under the new Fairwords? Since the subsequent training of the seq2seq model and the entire FaIRMaker pipeline depend critically on this step, any failure to provide a sound justification for the refinement process would undermine the validity of the entire framework.

2. The paper focuses solely on gender bias, which I find somewhat limited. It would also be worthwhile to test its applicability to other social biases (e.g., race, age).

3. For evaluation, the authors used DeepSeek-V3, Gemini-2.0-Flash, and GPT-4 as judges. However, for Fairwords Filtering in Section 3.1, they employed Llama3.1-8b-instruct as the judge. I think the authors should conduct additional tests to demonstrate that Llama3.1-8b-instruct is sufficiently capable of evaluating bias-related tasks.

Questions

see weaknesses

Limitations

Yes

Final Justification

My main concern is the first issue regarding natural language refinement. Based on your explanation, the ablation study in Table 6 demonstrates the role of each component, including refinement, which I find reasonable and am inclined to increase the score. However, I am slightly confused by the discrepancy in win rates: Table 6 reports Llama2-Alpaca with FM at 54.23%, while Figure 3 shows different win rates (39.02%/41.08%/53.15%) for Llama2-Alpaca+FM versus Llama2-Alpaca when evaluated by different judging models. It is unclear which judging model and comparison targets were used for the win rate in Table 6. After comprehensive consideration, I will increase the score by one point.

Formatting Concerns

No Paper Formatting Concerns

Author Response

Thank you for your constructive and valuable comments. Below, we respond to the concerns raised in the review.

W1 regarding the justification for the refinement process

We agree that providing a sound justification for the refinement process is necessary.

Theoretical justification:

We clarify that the refinement step is introduced to make FaIRMaker a model-agnostic method. The automatically searched Fairwords are tailored to a specific model (Llama2-Alpaca) and often fail to transfer well across models [1]. To address this, we translate the vanilla Fairwords into natural language instructions using GPT. The refinement step enhances transferability across LLMs while preserving debiasing effectiveness. This type of semantic transformation is a common practice in prompt engineering [2,3] and has been shown to retain the effectiveness of optimization-based suffixes [4]. Our hypothesis is that LLMs are capable of internalizing and preserving the behavioral intent of such vanilla Fairwords, even after natural language rewriting. Section 4.7 further suggests that GPT can capture implicit semantics (e.g., urgency or caution) from vanilla Fairwords.

Experimental validation:

To validate that refinement does not harm debiasing performance, we conducted ablation experiments in Section 4.6. As shown below (extracted from Table 6 in the paper), refined Fairwords consistently outperform vanilla Fairwords across bias mitigation and utility metrics:

| Metric (Dataset) | Refined Fairwords | Vanilla Fairwords |
| --- | --- | --- |
| Win rate (GA-test) (↑) | 54.23% | 42.79% |
| sDIS (BBQ-gender) (↓) | 0.024 | 0.675 |
| sAMB (BBQ-gender) (↓) | 0.157 | 0.593 |
| RS (GA-test) (↑) | 3.77 | 3.51 |
| RS (Dolly Eval) (↑) | 2.96 | 2.91 |
| RS (Instruct Eval) (↑) | 4.06 | 3.98 |
| RS (BPO Eval) (↑) | 2.81 | 2.50 |

In addition, we conducted new experiments to evaluate the log-probability gap ($\Delta p = p_{\text{chosen}} - p_{\text{rejected}}$) between good and bad responses under refined vs. vanilla Fairwords. $\Delta p$ remains above zero on the target model Llama2-Alpaca and consistently outperforms the vanilla Fairwords across the other transfer models.

| Avg. (std.) | Llama2-Alpaca | Llama2-Chat | Qwen2 | Qwen2.5 |
| --- | --- | --- | --- | --- |
| Vanilla Fairwords | 5.42 (15.27) | 6.02 (48.25) | 5.91 (42.59) | 8.23 (58.85) |
| Refined Fairwords | 1.42 (34.65) | 6.42 (55.07) | 6.15 (52.67) | 10.00 (67.08) |

These results confirm that the refined Fairwords retain or enhance the original bias mitigation effects while significantly improving robustness and cross-model generalization. We will add these new results to the paper.
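For transparency, here is a minimal sketch of how such a log-probability gap could be measured with a Hugging Face causal LM; the stand-in model name (`gpt2`), the placeholder strings, and the simple string concatenation are illustrative assumptions, not the exact evaluation script:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def response_logprob(model, tok, prompt: str, response: str) -> float:
    """Sum of log p(response tokens | prompt); assumes tokenizing prompt+response
    keeps the prompt tokens as a prefix (an approximation for this sketch)."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[:, :-1], dim=-1)              # predicts ids[:, 1:]
    token_lp = logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1:].sum().item()              # response positions only

# Placeholder data; the real evaluation uses the searched Fairwords and preference pairs.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
fairwords = "Please ensure that your response is not biased and does not involve gender stereotyping."
query = " Are men naturally better leaders than women?"
chosen = " Leadership ability is not determined by gender."
rejected = " Yes, men are naturally better leaders."
prompt = fairwords + query
delta_p = (response_logprob(model, tok, prompt, chosen)
           - response_logprob(model, tok, prompt, rejected))
print("delta p:", delta_p)
```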

W2 regarding the bias focus

While our current work focuses on gender bias, FaIRMaker can be readily adapted to other types of social bias if a corresponding preference dataset is available for the initial optimization. We agree that extending the scope beyond gender is an interesting and valuable direction, and we will add a discussion as future work.

W3 regarding the capability of Llama3.1-8B-Instruct

We justify the use of Llama3.1-8B-Instruct as the evaluation judge based on both theoretical and empirical evidence. Theoretically, prior studies [5,6] show that Llama3.1 exhibits low gender bias, in some cases even outperforming GPT-4. Empirically, we conducted a user study to assess the agreement between Llama and human judgments on 50 randomly sampled examples. Four annotators (2 male, 2 female) independently selected the more biased response; the majority vote was used as ground truth. The agreement rate with Llama3.1 was 86%, indicating strong alignment with human judgment. Additionally, Llama3.1 is lightweight and efficient, making it suitable for scalable evaluation. To further reduce noise, we evaluate each response 3 times and retain only pairs with a score margin > 0.5.

We will add this analysis in the revised version.

[1] Cherepanova, Valeriia, and James Zou. "Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs." ICML 2024 Workshop on NextGenAISafety.

[2] Cheng, Jiale, et al. "Black-box prompt optimization: Aligning large language models without model training." ACL 2024.

[3] Memon, Zeeshan, et al. "LLM-informed discrete prompt optimization." ICML 2024 Workshop on LLMs and Cognition.

[4] Liao, Zeyi, and Huan Sun. "Amplegcg: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed llms." COLM 2024.

[5] Bajaj, Divij, et al. "Evaluating gender bias of LLMs in making morality judgements." EMNLP 2024 Findings.

[6] Ko, Changgeon, et al. "Different Bias Under Different Criteria: Assessing Bias in LLMs with a Fact-Based Approach." NeurIPS 2024 Workshop on SoLaR.

Comment

Dear Reviewer qQgU,

Given that the discussion period ends in less than 4 days, we welcome any further comments and suggestions on our work, and we would like to know whether our responses resolve your concerns.

Thank you very much!

Best,

Authors

Comment

Thank you for your response to my question. My main concern is the first issue regarding natural language refinement. Based on your explanation, the ablation study in Table 6 demonstrates the role of each component, including refinement, which I find reasonable and am inclined to increase the score. However, I am slightly confused by the discrepancy in win rates: Table 6 reports Llama2-Alpaca with FM at 54.23%, while Figure 3 shows different win rates (39.02%/41.08%/53.15%) for Llama2-Alpaca+FM versus Llama2-Alpaca when evaluated by different judging models. It is unclear which judging model and comparison targets were used for the win rate in Table 6. After comprehensive consideration, I will increase the score by one point.

Comment

We thank the reviewer for this sharp observation and apologize for the lack of clarity in our presentation. We re-ran the evaluation in the ablation study on the GA-test and report the score in Table 6. The slight difference between the GPT-4 score in Figure 3 and in Table 6 is due to the stochasticity inherent in LLM judgments. As you suggested, we will add a brief discussion on the variability of LLM judges and report key results with the standard deviation over multiple runs to provide a more complete and robust picture of performance.

Thank you again for your valuable suggestions. We will add the discussion of the refinement process and Llama3.1-8b-instruct as a filtering judge to the revision.

Comment

After reading the other reviews and rebuttals, I raise my score to 4.

Review
Rating: 5

This paper presents FaIRMaker, a novel automated and model-independent framework for mitigating gender bias in large language models (LLMs). FaIRMaker introduces an auto-search and refinement paradigm that first identifies debiasing triggers, called Fairwords, using gradient-based optimization, and then refines them into interpretable instructions via a trained sequence-to-sequence model. This approach allows effective debiasing without access to model parameters, making it compatible with both open- and closed-source LLMs. Experimental results across multiple benchmarks demonstrate that FaIRMaker outperforms existing instruction-based methods in reducing gender bias while preserving or enhancing model performance on general tasks.

Strengths and Weaknesses

Strengths:

  • Novelty and Motivation: The paper proposes a novel two-stage framework (auto-search and refinement) that combines gradient-based prompt discovery with instruction-level generalization. This modular approach is conceptually well-motivated and addresses the limitations of both white-box (e.g., fine-tuning) and static prompt-based methods in a unified manner.

  • Strong Empirical Validation: The suggested method is tested across five LLMs (Llama2, Qwen2, GPT-3.5, etc.) and multiple task types, including bias-sensitive tasks (GA-test, BBQ-gender) and general utility benchmarks (MMLU, Dolly Eval).

  • Interpretability: The refinement stage improves interpretability by translating nontransparent Fairwords into clear, context-sensitive natural language instructions

Weaknesses:

  • Lack of Human Evaluation: The effectiveness and fairness of the refined instructions are not verified through human evaluation, leaving open questions about alignment with human judgments and real-world applicability.

  • Role of Auto-Searching: While Section 4.6 highlights the importance of the auto-searched Fairwords through ablation, the paper lacks a direct analysis of Fairword diversity or selection sensitivity.

  • Metric Definition: The paper outlines various evaluation metrics in Section 4.1, including sDIS and sAMB for measuring bias. However, it does not provide the exact computation formulas or methodological details for these metrics, making it difficult to assess how fairness is quantitatively reflected. Including definitions or references would enhance reproducibility and clarity.

Questions

  • How well does FaIRMaker maintain its bias mitigation effectiveness in long-context or multi-turn scenarios where the refined instruction may be diluted?

  • I suggest adding a brief description or citation for the Dolly benchmark mentioned in line 48 for clarity.

  • It would be helpful to see how FaIRMaker compares to more recent automatic or model-based gender bias mitigation techniques, in addition to CF-D and Desc-D.

  • In Section B, the Fairwords length is fixed to 20 tokens—have the authors conducted any analysis on how varying this length affects bias mitigation effectiveness or task utility?

Limitations

Yes. The authors have adequately discussed the key limitations of their work, and I did not identify any further critical issues during my review.

Final Justification

  • W1 Regarding Lack of Human Evaluation (Resolved)

The initial absence of a human evaluation left the reliability of the gender‐bias mitigation method in question. The authors have since conducted a majority‐vote study with equal numbers of male and female participants, which adequately restores confidence in their results.

  • W2/Q4 Regarding Auto-Searching & FairWord Length (Resolved)

The rationale for selecting FairWord’s criteria and its length as hyperparameters was not originally justified. In the rebuttal, the authors presented additional experiments that clearly explain and validate these choices.

Given these resolutions, I recommend acceptance.

Formatting Concerns

No major formatting issues were found.

Author Response

Thank you for your constructive and valuable comments. Below, we respond to the concerns raised in the review.

W1 Regarding Lack of Human Evaluation

We agree that human evaluation is essential. To address this, we conducted a small-scale study to assess both bias mitigation and utility preservation. We randomly sampled 50 FaIRMaker-enhanced responses and their original counterparts from Llama2-Chat on GA-test and Dolly Eval. Four annotators (2 female, 2 male) independently judged which response in each pair was better, with a majority vote used to finalize the label (≥3 votes for a winner, otherwise a tie).

The human-annotated win/tie/loss results are shown below, along with GPT-4’s accuracy in matching human judgments. The results show that FaIRMaker consistently outperforms the original outputs, particularly on the debiasing task (GA-test). GPT-4 judgments align strongly with human preferences (90% accuracy on GA-test, 84% on Dolly Eval), with lower false positive than false negative rates in both cases.

These findings reinforce our claim that FaIRMaker effectively reduces bias while preserving response quality, as confirmed by human evaluation. We will add these results to the paper.

| Dataset | Win-Tie-Loss (#N) | GPT4 Acc (%) | FPR (%) | FNR (%) |
| --- | --- | --- | --- | --- |
| GA-test | 25-22-3 | 90 | 4 | 6 |
| Dolly Eval | 21-17-12 | 84 | 6 | 10 |

W2 regarding the role of Auto-Searching

Thank you for raising the important point regarding the diversity and selection sensitivity of the auto-searched Fairwords, and we now provide a quantitative analysis here.

For diversity: Our auto-searched Fairwords set contains 93 items and is used consistently across all experiments in the paper. We analyzed its semantic and lexical diversity via sentence embeddings, K-Means clustering, and BLEU scores. Specifically, we computed pairwise cosine similarity between Fairwords using all-MiniLM-L6-v2 embeddings and observed an average similarity of 0.2691 with a standard deviation of 0.1157, indicating substantial semantic spread rather than redundancy. To assess clustering structure, we applied K-Means clustering on the embeddings and evaluated cluster quality using Silhouette Scores. We varied $k$ from 2 to 10 and observed an overall increase in Silhouette Score, peaking at $k=10$, with an average cluster size of around 9 Fairwords. This shows that the Fairwords form at least 10 semantic groups, consistent with a range of distinct mitigation strategies. We refrain from using a higher $k$ to avoid over-fragmenting the Fairword space into overly fine clusters. Furthermore, the average pairwise BLEU score is 0.0068, confirming that the Fairwords also exhibit extremely low lexical overlap. These findings demonstrate that our Fairword set is not only impactful but also diverse in both semantics and form.

| k | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Silhouette Score | 0.0548 | 0.0463 | 0.0471 | 0.0500 | 0.0547 | 0.0561 | 0.0724 | 0.0811 | 0.0835 |
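A rough sketch of this diversity analysis is given below; the library calls are standard sentence-transformers / scikit-learn / NLTK usage, and the three example Fairwords are invented placeholders (the real bag contains 93 auto-searched items):

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

fairwords = [  # placeholders; the actual Fairwords Bag has 93 entries
    "Please avoid gender stereotypes in your answer.",
    "Respond based on evidence, not assumptions about men or women.",
    "Treat all genders equally when answering.",
]

# Semantic diversity: mean pairwise cosine similarity of sentence embeddings.
emb = SentenceTransformer("all-MiniLM-L6-v2").encode(fairwords)
sim = cosine_similarity(emb)
pairs = sim[np.triu_indices(len(fairwords), k=1)]
print("avg cosine similarity:", pairs.mean(), "std:", pairs.std())

# Cluster structure: Silhouette Score for k = 2..10 (k must stay below the set size).
for k in range(2, min(10, len(fairwords) - 1) + 1):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(emb)
    print("k =", k, "silhouette =", silhouette_score(emb, labels))

# Lexical diversity: average pairwise BLEU (lower means less token overlap).
smooth = SmoothingFunction().method1
bleus = [sentence_bleu([fairwords[i].split()], fairwords[j].split(), smoothing_function=smooth)
         for i in range(len(fairwords)) for j in range(len(fairwords)) if i != j]
print("avg pairwise BLEU:", sum(bleus) / len(bleus))
```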

For selection sensitivity: To ensure robustness, our experiments use random selection among the 93 auto-searched Fairwords, and all reported results are averaged over four independent runs. This design helps mitigate selection bias and quantify performance consistency. To quantify the selection sensitivity, we now report the standard deviations for all major evaluation metrics. As shown in the tables: 1) the win-tie-loss rates on the GA-test remain consistent, supporting our previous conclusions; 2) while the BBQ results exhibit slightly higher variance, they still show a strong bias mitigation effect. These findings demonstrate that FaIRMaker’s performance is stable, reproducible, and minimally affected by random variations in Fairword choice. We will add the standard deviation results to the paper.

| Model | | GA-test (GPT4 Eval): win / tie / loss (%) | BBQ sDIS: ori. → FM. | BBQ sAMB: ori. → FM. |
| --- | --- | --- | --- | --- |
| Llama2-Alpaca | avg | 54.23 / 27.46 / 18.31 | 1.066 → 0.224 | 0.804 → 0.157 |
| | std | 1.19 / 1.43 / 1.59 | 0.239 | 0.172 |
| Llama2-Chat | avg | 36.11 / 50.31 / 13.59 | 2.233 → 0.273 | 1.673 → 0.189 |
| | std | 2.84 / 2.17 / 0.95 | 0.258 | 0.188 |
| Qwen2 | avg | 34.28 / 56.49 / 9.24 | 4.638 → 1.906 | 1.377 → 0.320 |
| | std | 1.31 / 0.41 / 1.65 | 0.821 | 0.172 |
| Qwen2.5 | avg | 27.75 / 64.91 / 7.34 | 1.212 → 0.431 | 0.030 → 0.012 |
| | std | 0.81 / 1.91 / 1.49 | 0.265 | 0.002 |
| GPT3.5 | avg | 26.76 / 65.14 / 8.10 | | |
| | std | 1.32 / 0.81 / 1.24 | | |

W3 regarding Metric Definition

We use sDIS and sAMB from the multiple-choice BBQ benchmark [1], which are widely adopted metrics for evaluating social bias, especially in gender-related contexts [2,3]. Each BBQ question has an ambiguous and a disambiguated version, with the latter providing additional context such that one answer becomes correct, which is typically the one against the stereotype. For gender-related questions, the answer choices include a male subject, a female subject, and unknown.

  • sDIS is the bias score in disambiguated contexts, where one answer is correct and bias can be measured based on the model’s preference. It is defined as sDIS = #biased_answers / #non-unknown_answers, i.e., the proportion of biased answers among all non-“unknown” responses. A score of 0 indicates no gender bias.
  • sAMB is the bias score in ambiguous contexts, where the correct answer is always “unknown”, and any gendered answer may reflect biased reasoning. It is computed as sAMB = (1 − acc) × sDIS. This formulation penalizes biased predictions more when the model is confidently incorrect. Lower sAMB scores indicate fairer behavior in uncertain scenarios.

We will revise Section 4.1 to include these definitions and cite the BBQ paper to improve clarity and reproducibility.

Q1 regarding the effectiveness in long-context or multi-turn scenarios

Thank you for the insightful question. Our current evaluation is consistent with prior work on prompt-based bias mitigation. We acknowledge that in long-context or multi-turn scenarios, the refined Fairwords instruction may be diluted or overridden by downstream content or conversational history. However, we believe FaIRMaker offers a modular and lightweight mitigation mechanism that can be naturally adapted to longer contexts. Specifically, it can be: 1) prepended at the start of multi-turn interactions, 2) inserted adaptively (e.g., appended as soft hints near generation time), or 3) embedded in system-level instructions or role prompts. While a full exploration of long-context and multi-turn robustness is out of scope for this paper, we agree this is an important direction and plan to extend FaIRMaker in such settings in future work.

Q2 regarding the clarity of line 48

Thank you for the suggestion. We will add a brief description of the Dolly benchmark and include the appropriate citation to clarify its purpose and relevance.

Q3 regarding more recent baseline

We add a new baseline, MBIAS [4], a recent post-processing debiasing method that uses a fine-tuned Mistral-7B model to revise biased outputs while preserving semantics. Like FaIRMaker, it is an automatic, model-based debiasing approach. We evaluate both gender bias mitigation and utility preservation across tasks. The results, shown below, indicate that FaIRMaker consistently outperforms MBIAS across multiple models in both fairness and general utility metrics. These results confirm that FaIRMaker is not only effective in mitigating bias, but also competitive in preserving utility compared to other automatic techniques like MBIAS. We will add these new results to the paper.

| Debias (W/T/L, %) | GPT4 Eval: FaIRMaker | GPT4 Eval: MBIAS | DS Eval: FaIRMaker | DS Eval: MBIAS |
| --- | --- | --- | --- | --- |
| Llama2-Alpaca | 54.23/27.46/18.31 | 23.11/40.73/36.16 | 39.21/48.82/11.98 | 36.16/31.12/32.72 |
| Llama2-Chat | 36.11/50.31/13.59 | 11.44/54.23/34.32 | 29.98/59.95/10.07 | 27.23/52.4/20.37 |
| Qwen2 | 34.28/56.49/9.24 | 5.49/57.21/37.3 | 32.95/60.56/6.48 | 35.47/58.35/6.18 |
| Qwen2.5 | 27.75/64.91/7.34 | 5.03/60.41/34.55 | 28.40/66.34/5.27 | 27.0/65.45/7.55 |
| GPT3.5 | 26.76/65.14/8.10 | 6.91/67.74/25.35 | 27.00/68.57/4.42 | 25.86/69.34/4.81 |

| Utility (Score, FM. / MBIAS) | GPT4 Eval: GA-test | GPT4 Eval: Dolly | GPT4 Eval: BPO | DS Eval: GA-test | DS Eval: Dolly | DS Eval: BPO |
| --- | --- | --- | --- | --- | --- | --- |
| Llama2-Alpaca | 3.77 / 3.21 | 2.96 / 2.39 | 2.81 / 2.30 | 4.07 / 3.70 | 3.94 / 2.37 | 3.31 / 2.40 |
| Llama2-Chat | 4.73 / 4.27 | 3.93 / 3.93 | 4.40 / 3.58 | 4.73 / 4.55 | 4.51 / 4.19 | 4.24 / 3.86 |
| Qwen2 | 4.81 / 4.36 | 4.57 / 4.45 | 4.54 / 4.15 | 4.82 / 4.76 | 4.83 / 4.70 | 4.68 / 4.34 |
| Qwen2.5 | 4.88 / 4.46 | 4.47 / 4.48 | 4.51 / 4.30 | 4.99 / 4.78 | 4.87 / 4.70 | 4.69 / 4.43 |
| GPT3.5 | 4.88 / 4.59 | 4.85 / 4.72 | 4.66 / 4.31 | 4.90 / 4.87 | 4.99 / 4.90 | 4.71 / 4.48 |

Q4 regarding the length of Fairwords

The choice of setting Fairwords to 20 tokens was based on an earlier ablation study. We tested lengths from 10 to 40 tokens on a validation split of GenderAlign using 300 GCG steps. As shown in the table below, 20 tokens achieved the best balance, displaying a bell-shaped performance curve. This observation aligns with the nature of GCG’s token-wise optimization: overly short Fairwords lack expressiveness, while overly long ones cannot be fully optimized in a fixed step budget. Moreover, a 20-token prompt corresponds well to a natural language instruction such as “Please ensure that your response is not biased and does not involve gender stereotyping.” We will clarify this in the paper.

| Fairwords Length | 10 | 20 | 30 | 40 |
| --- | --- | --- | --- | --- |
| Win/Tie/Loss (%) | 32.38/47.79/19.83 | 36.56/46.59/16.85 | 35.48/46.48/18.04 | 31.54/49.22/19.24 |

[1] Parrish, Alicia, et al. "BBQ: A hand-built bias benchmark for question answering." ACL 2022 Findings.

[2] Wang, Ze, et al. "Jobfair: A framework for benchmarking gender hiring bias in large language models." EMNLP 2024 Findings.

[3] Gray, Magnus, et al. "Measurement and mitigation of bias in artificial intelligence: a narrative literature review for regulatory science." Clinical Pharmacology & Therapeutics 115.4 (2024): 687-697.

[4] Raza, Shaina, et al. "Mbias: Mitigating bias in large language models while retaining context." ACL 2024 Workshop on WASSA.

Comment

Dear Reviewer B7KF,

Given that the discussion period ends in less than 4 days, we welcome any further comments and suggestions on our work, and we would like to know whether our responses resolve your concerns.

Thank you very much!

Best,

Authors

Comment

Thank you for conducting detailed experiments and providing thorough explanations to address my concerns.

The additional human evaluation experiments have fully clarified my questions, and the FairWord hyperparameters are now clearly understood. Although I grasped FairWord’s effectiveness, I had questions about the refinement process, which were resolved through these experiments. Therefore, I am raising the score to 5.

Additionally, for Reviewer qQgU’s W1, it would be beneficial to include a theoretical explanation alongside the experimental results.

Comment

We're grateful for your positive feedback and for raising the score. We'll incorporate the discussion from our rebuttal and add the theoretical explanation for Reviewer qQgU into the final version. Thank you again.

Final Decision

This paper introduces a model-independent approach to mitigating gender bias in LLMs by automatically generating and refining “Fairwords” into natural language instructions. Reviewers generally praised the work for its novelty, applicability to both open- and closed-source models, and strong empirical validation across multiple LLMs and benchmarks, with evidence showing that the method reduces bias while preserving task utility. Initial concerns centered on the theoretical justification of the refinement process, clarity of exposition, limited focus on gender bias, lack of human evaluation, and comparison with recent baselines. The authors effectively addressed these issues during rebuttal, adding theoretical motivation, new ablation and sensitivity analyses, human evaluation studies, expanded baseline comparisons, metric clarifications, and commitments to improve clarity and figures. Reviewers acknowledged that their concerns had been resolved and increased their scores accordingly, with consensus that the work is technically solid and impactful.