PaperHub
Overall rating: 5.3 / 10 (Poster; 4 reviewers; lowest 3, highest 6, std. dev. 1.3)
Individual ratings: 3, 6, 6, 6
Confidence: 2.5 | Correctness: 2.8 | Contribution: 2.5 | Presentation: 2.0
NeurIPS 2024

Bias Amplification in Language Model Evolution: An Iterated Learning Perspective

OpenReview · PDF
Submitted: 2024-05-09 · Updated: 2024-11-06
TL;DR

As LLMs undergo self-evolution, it is imperative for us to deepen our understanding of this process.

Abstract

Keywords
Large Language Model · LLM Agent · Self-improvement · Cognitive Science · Bayesian · Iterated Learning

Reviews and Discussion

Review
Rating: 3

This paper investigates iterated learning with large language models (LLMs). The authors present theoretical analyses demonstrating how bias is amplified when LLMs are chained in an iterated learning process. Moreover, they conduct experiments with LLMs that confirm the amplification of bias during iterated learning. When the weights of LLMs are allowed to be updated during this process, the authors find that manipulating the information passed through intergenerational transmission can train LLMs to exhibit a desirable bias, such as providing responses that are both helpful and concise.

Strengths

The experiments were conducted in three successive steps: in-context learning with both explicit and implicit hypotheses, followed by in-weight learning. I appreciate this structure, although the results are somewhat difficult to interpret (as detailed below). These three experiments are quite novel.

Weaknesses

The paper devotes extensive text (Sections 3 and 4) to presenting their key theoretical results: (1) iterated learning with Bayesian agents leads to a stationary distribution that effectively samples from the agent’s prior, and (2) in-context learning can be understood as an implicit Bayesian inference process. I found both results to lack novelty as they are essentially paraphrasing existing findings. For instance, Griffiths & Kalish (2007) have already provided the result (1), and both Xie et al. (2021) and Zhang et al. (2023) have presented the same result (2). Therefore, I do not believe the theoretical results should be featured in such detail given the established prior work.

Due to lack of clarity, I am also uncertain about the results from Sections 5 and 6. The results from Section 7 are the clearest to me, where controlling the transmission channel to pass on more helpful and concise data allows the LLM to amplify this good bias. However, I am unsure if this is a viable long-term approach, as it still involves training a model using data generated by the same model. Numerous studies, including Shumailov et al. (2023) and Lu et al. (2020), have shown that this method can eventually lead to model collapse. The authors might not have observed model collapse because the iterated learning chain was not run for a sufficient number of iterations. Furthermore, the evidence that imposing too strong of a constraint (line 384) negatively impacts the model’s helpfulness during iterated learning seems to support this idea.

Questions

First of all, the font sizes in the tables and figures are too small to read. Second, the results in each table and figure are not self-explanatory. Terms like “Corr”, “r_20”, and “BOTH” in Table 1, as well as the numbers in Table 2, lack clear definitions and justification. What do these terms mean specifically? Are smaller values in Table 2 better? The text does not explain these terms either. Due to the extensive theoretical analysis, the key experimental results received insufficient attention in the main text. Moreover, the paper ends abruptly without sufficient discussion and lacks sections on limitations and future directions.

Limitations

This work contributes useful empirical knowledge to the field of LLMs. However, I would like to suggest that the authors place greater emphasis on the empirical results and ensure these results are clearly presented. A key missing piece is the important comparison between in-weight iterated learning and model collapse. It would be beneficial to identify when in-weight iterated learning is effective and when model collapse is likely to occur.

References

Griffiths, T. L., & Kalish, M. L. (2007). Language evolution by iterated learning with Bayesian agents. Cognitive Science, 31(3), 441-480.

Xie, S. M., Raghunathan, A., Liang, P., & Ma, T. (2021). An explanation of in-context learning as implicit Bayesian inference. arXiv preprint arXiv:2111.02080.

Zhang, L., McCoy, R. T., Sumers, T. R., Zhu, J. Q., & Griffiths, T. L. (2023). Deep de Finetti: Recovering Topic Distributions from Large Language Models. arXiv preprint arXiv:2312.14226.

Shumailov, I., Shumaylov, Z., Zhao, Y., Gal, Y., Papernot, N., & Anderson, R. (2023). The curse of recursion: Training on generated data makes models forget. arXiv preprint arXiv:2305.17493.

Lu, Y., Singhal, S., Strub, F., Courville, A., & Pietquin, O. (2020, November). Countering language drift with seeded iterated learning. In International Conference on Machine Learning (pp. 6437-6447). PMLR.

Author Response
  1. The paper devotes extensive text (Sections 3 and 4) …

There is an important misunderstanding in this claim. The Bayesian-IL framework has been discussed in the cognitive science community for decades; Griffiths et al. formally describe it using the Bayesian framework. In that paper, Sections 1-4 prove the case where $h$ is sampled from the posterior: the converging distribution is just the prior. In Sections 5-6, they experimentally show that when the learner instead picks the $h$ that maximizes the posterior, the converging distribution is a delta distribution centered at the $h$ with the highest prior. That paper only sketches a proof of this; we formalize it in our appendix.
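
For reference, here is a compact sketch of that classical argument in our notation (a paraphrase, not a verbatim equation from either paper): with posterior-sampling learners, the hypothesis chain has the prior as a stationary distribution.

```latex
% Transition kernel of the hypothesis chain when each learner samples h from its posterior:
P(h_t = h \mid h_{t-1} = h') = \sum_{d} P(h \mid d)\, P(d \mid h'),
\qquad
P(h \mid d) = \frac{P(d \mid h)\, P_0(h)}{\sum_{h''} P(d \mid h'')\, P_0(h'')}.
% The prior P_0 is stationary for this kernel:
\sum_{h'} P(h_t = h \mid h_{t-1} = h')\, P_0(h')
 = \sum_{d} P(h \mid d) \sum_{h'} P(d \mid h')\, P_0(h')
 = \sum_{d} P(h \mid d)\, P(d)
 = \sum_{d} P(d \mid h)\, P_0(h)
 = P_0(h).
```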

More importantly, most works on Bayesian-IL, including Griffiths et al., only consider imitation-only IL. However, as demonstrated in many related works, the interaction phase is very important for avoiding mode collapse and enhancing the quality of knowledge, so formalizing this phase is necessary; that is one of our main contributions. Since the interaction phase can take different forms, we introduce $h \in H_{eff}$ as one specific formalization that roughly makes sense across tasks. We verify that by imposing this constraint, the converging posterior becomes the argmax of $P_0$ within $H_{eff}$. If $H_{eff}$ rules out degenerate knowledge, our algorithm will not degenerate; if $H_{eff}$ rules out harmful responses, our LLM will evolve to be less harmful.
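
To make this concrete, below is a toy numerical illustration of the trend described above (MAP learners plus a hard $H_{eff}$ filter). All specific numbers here (hypothesis space, prior, likelihoods, chain length) are our own invention for illustration, not the paper's experimental setup.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
P0 = np.array([0.4, 0.3, 0.2, 0.1])              # agent prior over 4 hypotheses (the "bias")
lik = np.array([[0.7, 0.1, 0.1, 0.1],            # P(d | h): each row is one hypothesis
                [0.1, 0.7, 0.1, 0.1],
                [0.1, 0.1, 0.7, 0.1],
                [0.1, 0.1, 0.1, 0.7]])

def one_generation(h_prev, n_data=2, H_eff=None):
    """Imitation: MAP-learn from data produced by the previous agent; interaction: restrict to H_eff."""
    d = rng.choice(4, size=n_data, p=lik[h_prev])            # data transmitted by the previous agent
    log_post = np.log(P0) + np.log(lik[:, d]).sum(axis=1)    # log posterior over hypotheses
    if H_eff is not None:                                     # interaction phase rules out h outside H_eff
        keep = np.isin(np.arange(4), list(H_eff))
        log_post = np.where(keep, log_post, -np.inf)
    return int(np.argmax(log_post))

def run_chain(generations=30, H_eff=None):
    h = int(rng.integers(4))
    for _ in range(generations):
        h = one_generation(h, H_eff=H_eff)
    return h

# Without constraints, the converged hypotheses pile up on the high-prior ones (bias amplification);
# with H_eff = {2, 3}, the chain spends most of its time at the highest-prior hypothesis inside H_eff.
print(Counter(run_chain() for _ in range(200)))
print(Counter(run_chain(H_eff={2, 3}) for _ in range(200)))
```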

In Section 4, we use a proof technique similar to Xie et al.'s to study a different problem. We show that when $d_{t-1}$ is large enough, sampling $d \sim P(d \mid d_{t-1})$ is equivalent to first sampling $h^*$ and then sampling $d \sim P(d \mid h^*)$. In addition, we discuss the in-weight learning scenario in Section 4.2, which is not covered by Xie et al.

2.1 Due to lack of clarity, I am also uncertain ...

The ACRE experiments include several subtle designs that directly support our theory, rather than only reporting final accuracy: when $h$ is explicit, we can directly measure the evolution of all $P(h)$.

The word-generating experiment simulates a creative writing task. At the same time, this design makes the bias easy to evaluate, by measuring the length or rank (difficulty) of the generated words. We explain why these two experiments are so important for our theory in Appendix A2, and will highlight this more in the next version.

2.2 However, I am unsure if this is a viable long-term approach …

We believe self-improvement, or more generally learning from AI-generated content, will continue to become more important in the future. Such methods have the potential to break the constraints of limited data, as stated in many recent works. Our theory suggests a general trend of bias amplification as long as there is some common bias in the priors of agents in different generations. Since self-interaction is flourishing and bias amplification is unavoidable, our work could inspire more efficient methods to address this problem.

2.3 The authors might not have observed model collapse because the iterated learning …

It is true that if we amplify the bias too much, $P_0$ will degenerate, i.e., some $h$ will disappear from the model. This does not contradict our analysis. In fact, in Table 1, the low number of correct predictions on $d^0$ when only imitation is used falls into the model-collapse scenario. However, as implied by our theory and by many related works on iterated learning, adding a good interaction phase is very effective at avoiding model collapse. The experiments in Appendix B also clearly show that good interaction avoids convergence to degenerate languages.

3.1 First of all, the font sizes ... Second, the results in each table and figure are not self-explanatory. …

We'll fix the font sizes and add more detailed explanations of the notation in the captions. For the terms used in Table 1, all the numbers are evaluations of the converged $h$. An explanation of "Corr. d0" can be found in line 295: it is the number of correct predictions made by $h$ on $d^0$. "r_20" is the ratio of $h$ that contain "screen:off", because under our prompt format the value of the last object, "screen", always occurs at the 20th token. "BOTH" counts how many $h$ satisfy both of the requirements above. The results of these measurements all match our theory well. That is why we believe that, although this experiment is hard to parse, it gives stronger support for our theory than observing final accuracy alone.

3.2 Are smaller values in Table 2 better? …

For the results in Table 2, the numbers verify that we can control bias amplification using the method suggested by our theory. For example, the imitation-only setting amplifies the model's own bias, so the ratio of easy words increases to 96%. If we consider this bias undesirable and want to guide the model to generate more hard words, we can design the corresponding interaction phase; accordingly, in the row named "hard", the ratio of easy words is the lowest among all settings. The average rank and average length behave analogously. We believe combining Figure 3 with the explanation in Section 6 makes Table 2 easier to understand.
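
For illustration, here is a minimal sketch of how such word-level metrics could be computed. The frequency ranks and the "easy" cutoff below are placeholders we invented, not the paper's actual word list or thresholds.

```python
# Toy stand-in for a word-frequency ranking (lower rank = more common = "easier").
word_rank = {"cat": 120, "house": 300, "serendipity": 45000, "obfuscate": 38000}

def table2_metrics(words, easy_cutoff=5000):
    ranks = [word_rank.get(w, easy_cutoff * 10) for w in words]   # unknown words treated as hard
    easy_ratio = sum(r <= easy_cutoff for r in ranks) / len(words)
    avg_rank = sum(ranks) / len(ranks)
    avg_len = sum(len(w) for w in words) / len(words)
    return easy_ratio, avg_rank, avg_len

print(table2_metrics(["cat", "house", "serendipity"]))
```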

  4. This work contributes useful empirical knowledge to the field of LLMs. However, I would like to suggest that …

Thanks for this suggestion. The main motivation of our paper is to provide theoretical support for understanding existing self-data-augmentation methods. Since the theory in this paper differs in important ways from existing ones, we decided to devote more space to the theory and the toy experiments.

Regarding the question of when mode collapse occurs: our answer is “the method is beneficial before mode collapse”, similar to “the model keeps improving before overfitting”. Similar to early-stopping in traditional systems, we can also stop the model’s evolution when we observe mode collapse. Techniques for this, though, are out of the scope of this paper.

Comment

Thank you for your response. I’m still struggling to understand how imposing constraints during the interaction phase helps mitigate the model collapse problem in iterated learning. From my perspective, while introducing certain constraints might cause the iterated learning process to converge to the mode rather than the prior distribution, it doesn’t seem to fully address the model collapse issue, as the model would still ultimately converge to the prior. I may have overlooked something important, so I’ve adjusted my confidence level accordingly.

Comment

Thanks very much for your follow-up question. Actually, the interaction phase used to avoid mode collapse differs slightly from the examples demonstrated in the main text (the example in the Appendix does this). In short, we can understand "mode collapse" as a phenomenon in which some semantic features are gradually lost during multi-generation evolution. Hence we can design an interaction phase that contains tasks requiring those features. For example, if mode collapse makes the model ignore some features (like the number of objects in a sentence or an image), we can design an interaction task that requires the model to predict the number of objects. Or, more directly, in an LLM's evolution we can require the model to generate more diverse responses in terms of xxx features and let another LLM evaluate the diversity, which might also help to avoid mode collapse.

In summary, if we design the constraints to require the model to succeed on tasks that rely on the mode (feature) that might collapse, iterated learning can then help us counter mode collapse during self-improvement.
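
As a rough illustration of this idea (not the paper's implementation), such an interaction phase can be written as a filter over candidate data, where the feature check and the diversity judge are supplied as hypothetical callables:

```python
from typing import Callable, List

def interaction_filter(
    prompts: List[str],
    generate: Callable[[str], List[str]],         # produces candidate responses for a prompt
    keeps_feature: Callable[[str, str], bool],     # does the response preserve the at-risk feature?
    diversity_score: Callable[[str], float],       # e.g., a judge LLM's diversity rating
    min_diversity: float = 0.6,
) -> List[str]:
    """Keep only candidates that pass the feature task and the diversity check."""
    kept = []
    for prompt in prompts:
        for cand in generate(prompt):
            if keeps_feature(prompt, cand) and diversity_score(cand) >= min_diversity:
                kept.append(cand)
    return kept  # only these examples are transmitted to the next generation
```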

However, all the examples above require us to pinpoint which feature would collapse before designing such an interaction task, which is itself an open problem for the community. So we did not highlight this point too much in the main text. But the example in Appendix B is a good toy example for understanding the whole story.

Review
Rating: 6

The paper discusses how the widespread adoption of Large Language Models (LLMs) and their iterative interactions can lead to an evolutionary process similar to human cultural evolution. It leverages the Bayesian Iterated Learning (IL) framework to explain how subtle biases in LLMs are amplified over iterations. The authors outline key characteristics of LLM behavior in this framework, supported by experimental verification, to predict and guide the evolution of LLMs.

Strengths

  1. The application of the Bayesian-IL framework to LLM evolution is innovative and offers a new perspective on understanding and guiding LLM behavior.
  2. The paper provides a solid theoretical foundation and validates its claims with comprehensive experiments across various LLMs.
  3. The proposed strategies for guiding LLM evolution can be highly beneficial for designing more effective algorithms for bias mitigation and alignment.

Weaknesses

  1. The framework relies on several assumptions that may not hold in all practical scenarios, potentially limiting its applicability.
  2. While the framework is validated on specific LLMs, its generalizability to all LLM systems remains to be fully explored.
  3. The paper primarily addresses explicit biases, and while it acknowledges the challenge of implicit biases, it does not provide concrete solutions for detecting and mitigating them.

Questions

  1. Can you provide more details about the experimental setup, specifically the choice of datasets and evaluation metrics used to validate your framework?

Limitations

See weaknesses above.

Author Response
  1. The framework relies on several assumptions that may not hold …

Thanks for this question. The theoretical proof indeed needs a lot of assumptions which might not be true in practical systems. However, as we stated in Appendix A, we only expect the general trend (i.e., the bias amplification and the constraining role of the interaction phase) to hold for practical models. The latter is well supported by our experimental results in different settings.

The main message we want to highlight is that the theory only provides a guarantee in the ideal case, under assumptions. Since models and tasks are becoming more and more complicated, some of these assumptions are easily violated. The experimental results then show that when specific assumptions are mildly violated, some important principles uncovered by the theory still hold, which can guide the design of future systems. For this paper specifically, we conclude that a well-designed interaction phase and manipulating the prompts used when generating new examples are two efficient ways to guide an LLM's evolution (i.e., to help amplify good biases or restrain bad ones).

  2. While the framework is validated on specific LLMs …

Our in-context learning experiments are indeed API-based and are only verified on several commercial LLMs like GPT, Claude, and Mistral. The in-weights learning experiment in Section 7 is on Llama-2. Our experiments cover representative SOTA closed-source and open-source models at different scales. We believe this suggests our method could generalize across different models. We would also like to extend our analysis to more general LLMs and tasks in future work.

More experimental results on various LLMs would certainly argue for or against our proposed theory, and we expect future work to do so. Nevertheless, as we mentioned in the related work section of the paper, we believe we have gathered substantial evidence (experimental results across multiple settings, plus theoretical analysis) to support our claims.

  3. The paper primarily addresses explicit biases …

Thanks for this great question. We also believe that understanding the hidden bias in an LLM's prior is important for applying it to different domains. A general principle for manipulating the bias is that we must first pinpoint what the bias is, and then find some method to mitigate it. The method provided in this paper has the potential to identify such a bias efficiently. Note that, based on our analysis, if we keep doing self-improving in-weight learning, such as self-rewarding, for several generations, some hidden bias will also be amplified. If that occurs, we might have to re-train the model and waste all the computing resources. However, based on our theory, we can first conduct iterated in-context learning for several generations, during which we can use different prompts to elicit the potential bias in the model's prior (just like how we manipulate the prior in the ACRE task). After pinpointing such biases in the prior, we can then design a corresponding interaction phase for the self-improving in-weights learning, and hence avoid amplifying them.

  4. Can you provide more details about the experimental setup …

Thanks for pointing this out. All details of the in-context learning experiments can be found in Appendix D. We will add the details of the experiments in Section 7 in the next version.

Review
Rating: 6

The main idea of this paper is similar to Xie et al., as both attempt to incorporate in-context learning of Large Language Models (LLMs) into the framework of Bayesian inference. This paper extends the concept to include more general multi-agent, multi-round self-improvement of LLM systems within the framework of Bayesian Iterated Learning. Based on the proposed Bayesian-IL framework, the authors prove LLM's bias amplification during self-improvement and propose a potential solution by introducing hypotheses. Experimental results confirm the existence of bias amplification during LLM self-improvement and demonstrate the effectiveness of the proposed solution in specific tasks.

Strengths

  1. The attempt to conceptualize self-improvement of LLM agents with Bayesian Learning is relatively novel.

  2. The proposed framework provides a systematic perspective for understanding bias amplification in LLM agents and potential solutions. It conceptualizes the design of agent-based LLM systems into initialization, imitation, and interaction phases, providing guidance for reducing bias amplification.

  3. The experiment section presents comprehensive empirical evidence of bias amplification during LLM's self-interaction. The experiments are well-designed to cover diverse setups, task scenarios, and base LLMs. Results confirm the existence of bias amplification during LLM self-improvement and demonstrate the effectiveness of the proposed solution in specific tasks.

Weaknesses

  1. The presentation of this paper could benefit from reorganization for better reading flow. Specific comments include:

a. Provide a running example in the introduction, such as the ACRE experiment, to help readers better understand the concepts discussed in the theoretical proof.

b. Reorganize Sections 3 and 4 to make the argument more concrete and concise. There is no need to repeat similar proofs available in previous literature (e.g., Xie et al.). Focus on the novel aspects of the proposed work, such as bias amplification.

c. Add a connection paragraph before Section 5, summarizing the planned experiments and their connection with the proposed theoretical framework. Describe task environments and their nature, or refer readers to the appendix if needed. For example, the compositional language experiment should not be mentioned without context. The nature of ACRE tasks should not be discussed in the middle of a sub-experiment.

d. Move Appendix A into the main text, as it contains valuable discussions highlighting the theoretical and practical value of this work.

  2. Both the theory and experiments do not consider multi-agent scenarios, where embodied LLM agents might be competing/collaborating, heterogeneous/homogeneous, or have partial/global observation.

  3. In the proof of Proposition 1, the authors claim that $h \in H_{eff}$ can be guaranteed by adding constraints during the imitation and interaction phases. This process might not be applicable to all Bayesian agents during iterated learning in general. For example, if agents are unaware of task completions, it is non-trivial to "rule out" unsuitable hypotheses at the end of each iteration. Consider multiple embodied LLM agents interacting with a task environment with sparse rewards, where they might not receive feedback at each timestep.

  4. In the proof of Proposition 2, the Markov property of hidden hypotheses and the LLM sampling process used in the second line lacks proof. Can we always find a hidden hypothesis $h$ such that $P(d \mid d_{t-1}, h) = P(d \mid h)$, meaning the LLM's sampling only depends on this variable instead of the prompt input?

  5. There is a gap between the theoretical framework and empirical evidence. As claimed in Appendix A, most of the assumptions used in the theoretical proof are violated in the experiments. How do the experimental results support the proposed Bayesian-IL framework? The claim in line 314 that the proposed framework can predict LLM's behavior is less accurate. The framework actually predicts the phenomenon of amplifying bias during LLM's self-improvement, rather than an individual LLM agent's behavior (e.g., when it will make mistakes given certain inputs). The proposed framework is one explanation of bias amplification among many others.

  6. The solutions the authors propose to reduce bias amplification all depend on domain knowledge about the task. What if the intrinsic biases of LLMs are unknown and the criteria for task completion are difficult to measure?

Questions

Questions are asked in the previous section.

Limitations

Limitations are discussed in the appendix.

Author Response
  1. The main idea of this paper is similar to Xie et al. …

We'd like to first highlight some differences from Xie et al. They show that LLM agents behave like Bayesian agents when doing ICL, which links LLM evolution to our Bayesian-IL framework. However, our paper focuses on how an LLM's knowledge evolves when the model keeps learning from its predecessors and generating new samples for its descendants, which is different from Xie et al. In addition, the results in this paper can potentially be extended to the in-weights learning scenario, which is not mentioned by Xie et al.

1 - (a, c, d). The presentation of this paper could benefit ...

Thanks very much for the great suggestions; we will update the next version based on them.

1 - b. … There is no need to repeat similar proofs available in previous literature (e.g., Xie et al.). …

As stated in the previous response, we only use the technique from Xie et al. in Proposition 2. We believe all the other discussions are necessary to understand the whole story of the paper (and are not discussed in Xie et al.). For example, the discussions of Bayesian agents and the iterated learning framework in Section 3 make no assumptions about how the agent is implemented: it can be a pure Bayesian agent as in our experiments in Appendix B, a human whose behavior is well approximated by Bayes' rule (common in cognitive science), or an LLM doing in-context learning (as in Xie et al.).

  2. Both the theory and experiments do not consider multi-agent scenarios …

Thanks for pointing this out. We believe that for any type of multi-agent scenario, the interaction between two agents (one of which can be the environment) is the basic building block. Hence this paper starts from this atomic behavior to understand the whole process.

We originally included some experiments on multi-agent settings, particularly auction games, where more than two agents cooperate or compete in the bidding process. We added different personas to different agents (e.g., some are conservative, some are radical, and some are superstitious, always bidding for numbers ending in 8). We found that after several rounds, the agents' preferences are indeed amplified. However, since we already had several experiments under different settings, which made the story a bit hard to digest, we decided to focus on the theory and the most fundamental two-agent setting to verify it. We will add a discussion of multi-agent settings in the next revision.

  3. In the proof of Proposition 1, the authors claim …

Thanks for this insightful question. It is true that in most practical cases, designing a perfect interaction phase that imposes $h \in H_{eff}$ is almost impossible. However, as we mentioned in Appendix A and Section 7, a filtering (or ranking) step in the interaction phase plays a similar role to imposing this constraint: the practical algorithm assigns more weight and credit to the examples coming from such $h$. If this filter is strong enough, it is equivalent to having the constraint. If not, the strict guarantee of the theory no longer holds, but the general trend that "a good interaction phase can down-weight bad examples" still holds.
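
One hedged way to write this soft version of the constraint, in our notation for this reply rather than an equation from the paper:

```latex
% Let w(h) in [0,1] be the pass rate of h in the interaction phase. The filtered update behaves like
P_{\text{filtered}}(h \mid d) \;\propto\; w(h)\, P(d \mid h)\, P_0(h).
% A perfect filter, w(h) = 1 for h in H_eff and 0 otherwise, recovers the hard constraint h in H_eff;
% a weaker filter only down-weights bad hypotheses, so the strict guarantee relaxes to the general
% trend described above.
```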

Note that the theory only provides a guarantee in the ideal case, under assumptions. Since models and tasks are becoming more complicated, some of these assumptions are easily violated. The experimental results then show that when specific assumptions are mildly violated, some important principles uncovered by the theory still hold, which can guide the design of future systems. For this paper specifically, we want to highlight that a well-designed interaction phase and manipulating the prompts used when generating new examples are two efficient ways to guide an LLM's evolution.

  4. In the proof of Proposition 2, the Markov property …

There is a small misunderstanding here. First, we claim that sampling $d \sim P(d \mid d_{t-1})$ is equivalent to first sampling $h^*$ and then sampling $d \sim P(d \mid h^*)$. The expression $P(d \mid d_{t-1}, h) = P(d \mid h)$ never appears in our paper. This is important for the implicit-$h$ case, because the model can only obtain new samples based on the data generated in the previous generation; this proposition extends our Bayesian-IL analysis to these more practical cases. The ACRE task, where $h$ is explicit, does not need Proposition 2; Proposition 1 is enough.

  5. There is a gap between the theoretical framework and empirical evidence …

Thanks for pointing that out. We will revise the imprecise claims throughout the paper and move the discussion of assumptions and practical systems into the main text. We also believe the carefully designed measurements in Sections 5 and 6 (although they are a bit hard to parse) support the results well.

  6. The solutions the authors propose to reduce bias amplification all depend on domain …

Thanks for this great question. We also believe that domain knowledge about the task is important for figuring out the hidden bias in the prior. However, when the bias is hidden, our method provides a reasonable approach to probing and pinpointing the potential bias.

Note that, based on our analysis, if we keep doing self-improving in-weight learning, such as self-rewarding, for several generations, some hidden bias will also be amplified. If that occurs, we might have to re-train the model and waste all the computing resources. However, based on our theory, we can first conduct iterated in-context learning for several generations, during which we can use different prompts to elicit the potential bias in the model's prior. After pinpointing such biases in the prior, we can then design a corresponding interaction phase for the self-improving in-weights learning, and hence avoid amplifying them. Furthermore, even when the intrinsic biases are unknown, we could still design reasonable heuristics for filtering, e.g., using a reward model.
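
A minimal sketch of this probing loop, with hypothetical stand-ins for the generation and measurement steps (not the paper's code):

```python
from typing import Callable, Dict, List

def probe_prior_bias(
    elicitation_prompts: List[str],
    seed_data: List[str],
    run_icl_generation: Callable[[str, List[str]], List[str]],  # prompt + previous data -> new data
    bias_statistic: Callable[[List[str]], float],                # e.g., average response length
    generations: int = 5,
) -> Dict[str, List[float]]:
    drift = {}
    for prompt in elicitation_prompts:
        data, trace = seed_data, []
        for _ in range(generations):
            data = run_icl_generation(prompt, data)
            trace.append(bias_statistic(data))
        drift[prompt] = trace     # a consistent drift direction suggests a bias in the model's prior
    return drift
```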

Comment

Thank you for your responses. They address most of my concerns and help me better understand the contribution of this paper. I have increased my rating.

Review
Rating: 6

The paper explores the evolution of Large Language Models (LLMs) through the lens of Iterated Learning (IL), drawing parallels with human cultural evolution. The authors propose a Bayesian framework to analyze how biases in LLMs are amplified over generations of learning. They introduce the concept of Iterated Learning in the context of Bayesian agents, demonstrating how subtle biases can be magnified as knowledge is transmitted across generations. The authors argue that understanding the evolutionary process of LLMs can help in designing more effective algorithms for alignment, bias mitigation, or amplification.

Strengths

  • The paper has a good theoretical basis.
  • The paper's focus on bias amplification in LLMs is timely and important, given the increasing concern about fairness and bias in LLMs.
  • It is interesting from the perspective of the Bayesian framework and iterative learning.

Weaknesses

  • Do you think that the Bayesian framework and iterative learning are effective and robust across different large language model agents? Could these methods lose their effectiveness with changes in model architecture or parameter scale?

Questions

  • In lines 51-52 of the introduction, you mentioned, "We believe that our analysis can enhance our understanding of LLMs, and aid in designing more effective algorithms for alignment, bias mitigation or amplification, and similar tasks." Could you provide more concrete examples to demonstrate how this analysis plays a role in downstream tasks and applications to strengthen this claim?

Limitations

NA

Author Response
  1. Do you think that the Bayesian framework and iterative learning …

Thanks for this good question. In short, the proposed framework does not depend on the implementation and training details of the LLMs. That is also the most important benefit of using a highly abstract view (i.e., Bayesian behavior) to analyze LLM evolution. Based on our theory, as long as the LLM's in-context learning capability is strong enough and can be approximated by a Bayesian update (as argued by Xie et al.), the subsequent analysis holds regardless of the model's details. (Our results on different LLMs show similar trends, also demonstrating the generalizability of this framework.) Furthermore, as stated in Appendix A, even when some assumptions are violated, important principles uncovered by our theory still hold. For example, we experimentally showed that when the agents in different generations are different LLMs, e.g., a GPT agent playing with a Claude agent, the studied bias is also amplified. That is because, although these two models have different $P_0$, they are both trained on datasets collected from the internet, so some biases are shared in their priors. Hence their evolution still follows our theory well.

Additionally, in Section 7, we showed that when the model conducts in-weight learning instead of in-context learning (where the Bayesian learning behavior assumptions do not strictly hold), the bias could also be amplified, and a good interaction phase can mitigate that. This experiment uses a smaller open-source model (7B), suggesting the effectiveness holds even if we change the parameter scale (and setup). In summary, we are confident that the main trends described by this paper are generalizable to many practical scenarios.

  2. In lines 51-52 of the introduction, you mentioned …

Thanks for this suggestion. The first hint from our analysis is that the interaction phase can play an important role in guiding the model's evolution. The result in Section 7 is a good example. In this experiment, we use carefully designed prompts to change the preference of the model in the interaction phase (i.e., we let the judge model prefer more concise responses). The results show that combining this bias with on-policy DPO makes the model converge to helpful and concise responses, and successfully constrains the amplification of the length bias in $P_0$.
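
As a rough sketch of how such a concise-preference interaction phase could produce preference data for DPO (hypothetical helper functions, not the paper's implementation):

```python
from typing import Callable, List, Tuple

def build_dpo_pairs(
    prompts: List[str],
    sample_response: Callable[[str], str],           # draw one response from the current model
    judge_prefers: Callable[[str, str, str], int],    # 0 or 1: which response a judge prompted to
                                                      # favor helpful and concise answers prefers
    pairs_per_prompt: int = 2,
) -> List[Tuple[str, str, str]]:
    pairs = []
    for prompt in prompts:
        for _ in range(pairs_per_prompt):
            a, b = sample_response(prompt), sample_response(prompt)
            winner = judge_prefers(prompt, a, b)
            chosen, rejected = (a, b) if winner == 0 else (b, a)
            pairs.append((prompt, chosen, rejected))  # fed to a standard DPO trainer
    return pairs
```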

On the other hand, note that in our framework, $P_0$ is the model's confidence given the instruction prompt. Hence we can design clever prompts when the model generates new examples for the next generation. The example in Lines 258-267 shows the feasibility of this idea: it verifies that we can manipulate the bias in $P_0$ by adding a sentence to the prompt. As a result, if we can pinpoint the potential bias we want to mitigate, we can design a corresponding prompt to use when generating new examples from the model. We leave further exploration of this interesting direction to future work.

Comment

Thank you for your response, which addresses most of my concerns. I would keep my rating.

Final Decision

The authors seem to have addressed the various concerns the reviewers expressed in their initial reviews. Ultimately, three of the four reviewers side with accept, while the fourth reviewer maintains a "reject." From my perspective, the authors have satisfactorily addressed many (but potentially not all) of this reviewer's questions. I believe the paper addresses an interesting and important topic and makes a solid contribution. Therefore, I advocate for the paper being accepted if there is space. I encourage the authors to implement what they indicated they would do in the discussion to further improve the paper.