Scalable AI Safety via Doubly-Efficient Debate
We give a complexity-theoretic formalization of the use of natural language debate for AI safety, and prove theorems regarding the power and limitations of debate.
Abstract
Reviews and Discussion
The paper introduces a model called doubly-efficient debate. Here, two competing provers attempt to convince a verifier of a computation's correctness, relying on black-box judgements (like human feedback). The method ensures that any polynomial-time computation can be verified with just a few queries to the human-judgment black box. The model promises efficient verification, but it requires the AI models to produce comprehensive reasoning traces that can stand up to rigorous human analysis.
The researchers formalize a scenario where two AI models compete to convince a verifier, who can consult human judgment, of a solution's accuracy. The aim is to ensure the right solution is reached without excess computation, and that the number of human queries remains limited regardless of the complexity of the task. The protocols developed are shown to succeed in several settings: deterministic human judgment, stochastic human judgment, and settings with witnesses.
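To make the flow described above concrete, here is a schematic sketch (our own illustration, not the paper's actual protocol): one prover produces a reasoning trace, the other challenges a single step, and the verifier consults the human-judgment black box only on the challenged step. The names `prover_a`, `prover_b`, and `human_oracle` are placeholders.

```python
# Schematic sketch (illustrative only) of the verification flow described above.
from typing import Callable, List

def debate_verify(
    question: str,
    prover_a: Callable[[str], List[str]],         # produces a step-by-step argument
    prover_b: Callable[[str, List[str]], int],    # returns the index of the step it claims is flawed
    human_oracle: Callable[[str], bool],          # black-box human judgment on a single step
) -> bool:
    """Return True iff prover A's answer survives B's challenge."""
    steps = prover_a(question)               # full reasoning trace, possibly very long
    challenged = prover_b(question, steps)   # index of the allegedly incorrect step
    # Only the single challenged step is shown to the human judge,
    # so the number of oracle queries per question stays constant.
    step_ok = human_oracle(steps[challenged])
    return step_ok  # A wins iff the challenged step is judged correct
```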
Strengths
The introduction of the "doubly-efficient debate" model is an innovative way to approach the challenge of training and oversight in AI, particularly for Large Language Models (LLMs). By pitting two models against each other to verify the correctness of their outputs, the paper seeks to streamline and make efficient the process of verification, which is a unique proposition.
The paper emphasizes real-world issues, such as the potential high-stakes consequences of language models used for drafting laws or contracts. This grounding in practical applications elevates its relevance and appeal to practitioners in the field.
The paper explores various scenarios, including deterministic and stochastic human judgment. This comprehensive approach ensures that the proposed models and protocols are tested under diverse conditions, enhancing their reliability.
Weaknesses
The paper poses questions about how the model would deal with errors from the oracle, whether due to incorrect judgments or to its stochastic nature. The mere existence of these questions signals a lack of clarity in the paper about how to handle such errors effectively. Models where the oracle might make errors, either randomly or on a limited set of queries, introduce an element of unpredictability into the verification process. The paper does not seem to offer robust strategies to mitigate or address these potential errors.
Although the paper introduces a theoretical model, the practical implementation of such a model and its real-world viability are not deeply explored. The proposed model's scalability, robustness, and efficiency in real-world applications remain an open question.
Questions
Could the authors clarify the connection between their protocols and the framework introduced in [1]? [1] Du, Y., Li, S., Torralba, A., et al. Improving Factuality and Reasoning in Language Models through Multiagent Debate. arXiv preprint arXiv:2305.14325, 2023.
We are very happy to hear that you find that our approach offers a unique proposition that would be of relevance and appeal to practitioners in the field! Thank you for your encouraging words!
"Could the authors clarify the connection between their protocols and the framework introduced in [1]? [1] Du Y, Li S, Torralba A, et al. Improving Factuality and Reasoning in Language Models through Multiagent Debate[J]."
The biggest distinction is that the method of [1] is a test-time-only intervention. That is, several models provide answers to a question, then are prompted to re-evaluate their answers based on the outputs of the other models, and so on until a final answer is produced.
In contrast, our paper is a training-time intervention. That is, we train the models via reinforcement learning to answer questions by winning debates against opposing models. So in our case the debate happens at training time and the goal is to produce models that give honest, correct, and verifiable answers.
Additionally, our paper proves formal theoretical guarantees showing that our debate training method needs only a fixed constant amount of human feedback per question, even when the length of the human-verifiable argument for the correct answer is very long. This is the main high-level takeaway message of our theorems, and it has no counterpart in [1], which is an empirical, test-time method.
This manuscript contributes to the 'scalable oversight' literature, presenting results for an environment in which two adversarial provers argue in favor of and against a result, for human review.
Strengths
originality
The paper seems to make improvements to an existing literature.
quality
The work seems to be of good quality.
clarity
The paper is well written and clear.
significance
AI models are becoming larger and more capable, making AI safety an increasingly important topic. It seems to me that this approach - adversarial provers and human oversight - is promising.
Weaknesses
Caveat: I've given myself a low confidence score as this literature is not one that I know or have worked in. Thus, I would have benefitted from a very simple running example through the paper. I understand that space is tight, and expect that readers actually working in this area would benefit from that less than I would, so would certainly not make including one a strong recommendation.
My main concern about this approach is that it relies on unsound reasoners, overseen by an unsound human. While I agree that, ultimately, there are turtles or elephants all the way down, we can choose how to position the elephants/turtles. The autoformalization project (e.g. Jiang et al.'s Draft, Sketch, Prove) relies to a greater extent on sound reasoners. While I think that each of these approaches has distinct strengths and weaknesses, I think that they should at least be compared.
Minor typos:
- "makes progress both" -> "makes progress on both"
- "currently know for delegating" -> "currently known for delegating"
- the final sentence on p.3 ("For a real number ...") is a fragment
- "for it's correctness" -> "for its correctness"
Questions
Can you present a simple example of a false proof that survives the protocol, because of inconsistencies or errors in the (human) oracle? An ideal example, from my point of view, would display a subtle oracular error (e.g. a minor, buried assumption on real numbers) that spirals into a clearly false result. Perhaps an easy way to do this would be to show the machines yielding both True and False, due to the oracle's replies.
The probabilities in Definition 3.2 stood out: are these merely illustrative (so that in any result they could be replaced by arbitrary constants $a$ and $b$), or would even the qualitative results derived be overturned by use of different fractions (e.g. are there critical values for these numbers)?
Definition 6.1: can you provide an example of an oracle that is not $K$-Lipschitz?
FYI
Not questions re: the reviewing of this paper, but the sort of questions I would ask if talking to the authors about the research more generally. Thus, these do not need to be answered here:
- is the argumentation procedure in Dewatripont and Tirole's "Advocates" a special case of this framework in any way? Their approach to efficiency is different from that taken here, but may be complementary?
- the economic theory literature also contains models of 'cheap talk' and 'long cheap debate' (like a 'debate' in the current paper), in which two biased but informed advisors make comments to a decision maker, who tries to determine the true state of the world from their comments. In the canonical version, the comments are intervals, rather than the present probabilities.
- Foster & Vohra's chapter on calibration, in which a decision maker attempts to identify whether or not a purported expert has true expertise, by means of repeated questioning, also seems generally related.
- Dung's abstract argumentation framework also came to mind, explicitly considering arguments and their attacks/refutations.
We are very encouraged by your comment that our proposed combination of adversarial provers and human oversight is a promising direction in an increasingly important topic. We hope that our proposal can accelerate much needed progress on this subject.
"Can you present a simple example of a false proof that survives the protocol, because of inconsistencies or errors in the (human) oracle?"
Please see our response below regarding the example of the oracle machine that computes majority. This example shows that if we are attempting to find the majority vote over $n$ questions where the correct probability of a yes for each is close to $\frac{1}{2}$, an error of just $O(1/\sqrt{n})$ in the human oracle is enough to flip the answer from 0 to 1. Thus, using debate to conduct polling on a very evenly contested issue can result in large error, even if each individual answer is only slightly biased.
An important point to make here is that after training with debate, one can easily verify at test time whether an oracle machine $M$ is non-Lipschitz at an input $x$ by simply running the protocol while asking the debater $A$ to argue that $M(x) = 1$, and then running it again while asking $A$ to show that $M(x) = 0$. If $A$ is able to win the debate in both cases, then it is clear that $M$ is not sufficiently Lipschitz and the results cannot be trusted.
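As a minimal illustration of this check (the name `run_debate_protocol` is a hypothetical stand-in for a trained debate system, not an API from the paper):

```python
# Illustrative sketch of the test-time consistency check described above:
# ask the trained debater A to win the debate for both possible answers.
def looks_non_lipschitz(run_debate_protocol, x) -> bool:
    """Flag inputs x where the oracle machine's answer cannot be trusted."""
    a_wins_for_1 = run_debate_protocol(x, claimed_answer=1)  # A argues M(x) = 1
    a_wins_for_0 = run_debate_protocol(x, claimed_answer=0)  # A argues M(x) = 0
    # If A can win while arguing for both answers, small shifts in human
    # judgments are deciding the outcome, so M is not sufficiently Lipschitz at x.
    return a_wins_for_1 and a_wins_for_0
```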
"The probabilities in Definition 3.2 stood out: are these merely illustrative (so that any result could be replaced by arbitrary constants and ), or would even qualitative results derived be overturned by use of different fractions (e.g. are there critical values for these numbers)?"
The probabilities in Definition 3.2 are just illustrative constants; all that is needed is that they are bounded away from $\frac{1}{2}$. A key point here is that the probability is over the internal randomness used by the verifier $V$. Therefore, by simply running the machine independently $k$ times on the same input and taking the majority vote of the answers, the probability of outputting the wrong answer goes to zero at a rate exponential in $k$.
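For concreteness, the amplification step is the standard Hoeffding calculation (taking, purely for illustration, a per-run error probability of at most $\frac{1}{4}$; any constant bounded away from $\frac{1}{2}$ works the same way):

```latex
% If a single run errs with probability p <= 1/4, the majority of k
% independent runs is wrong only if at least k/2 runs err:
\Pr[\text{majority of } k \text{ runs is wrong}]
  \le \Pr\!\Big[\#\text{errors} \ge \tfrac{k}{2}\Big]
  \le \exp\!\big(-2k(\tfrac{1}{2}-p)^2\big)
  \le e^{-k/8}.
```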
"Definition 6.1: can you provide an example of an oracle that is not -Lipschitz?"
Let $K \geq 2$ and let $\mathcal{O}$ be the oracle that outputs 1 with probability $\frac{1}{2} + \frac{1}{2K}$ on every query. Let $M$ be the machine that, on input $x$, takes the majority vote of the outputs of $\mathcal{O}$ on the $\ell$ strings lexicographically following $x$, for some $\ell \gg K^2$. Then $M$ is not $K$-Lipschitz at $\mathcal{O}$.
To see why, consider the oracle $\mathcal{O}'$ which outputs 1 with probability $\frac{1}{2} - \frac{1}{2K}$ on every query. By standard concentration bounds, the probability that the majority over the $\ell$ samples from $\mathcal{O}'$ is 1 is at most $e^{-\ell/2K^2}$. On the other hand, the probability that the majority of the $\ell$ samples from $\mathcal{O}$ is 0 is also at most $e^{-\ell/2K^2}$. Thus, by our choice of $\ell \gg K^2$, a change in the probabilities of the oracle by $\frac{1}{K}$ can result in a change in the output of $M$ by nearly 1. Thus $M$ is not $K$-Lipschitz at $\mathcal{O}$.
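A quick numerical check of this example (the specific constants below are ours, chosen only for illustration):

```python
# Numerical illustration of the majority-vote example above: shifting every
# oracle probability by 1/K flips the machine's output almost deterministically.
import random

def majority_machine(oracle_p: float, num_queries: int, trials: int = 2000) -> float:
    """Fraction of trials in which the majority of the oracle's answers is 1."""
    ones = 0
    for _ in range(trials):
        votes = sum(random.random() < oracle_p for _ in range(num_queries))
        ones += votes > num_queries / 2
    return ones / trials

K, num_queries = 10, 10_000  # num_queries >> K^2
print(majority_machine(0.5 + 1 / (2 * K), num_queries))  # close to 1.0
print(majority_machine(0.5 - 1 / (2 * K), num_queries))  # close to 0.0
```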
"FYI"
Thank you for this list of related debate/argument concepts. Your pointers contain some very interesting ideas that could be used to extend or change our debate model, and to reason about debate more generally. We look forward to engaging with these ideas actively in future work. Thank you again!
This paper studies interaction protocols for the debate framework introduced by Irving et al. 2018, focusing on computational complexity aspects. More specifically, the main motivation of this work is that the original framework assumes that the honest strategy can simulate AI systems for an exponential number of steps. The paper introduces new debate-based protocols, where this dependence is polynomial. Inspired by interactive proofs, the debate framework is modeled as a triple $(A, B, V)$, where $A$ and $B$ are provers and $V$ is a verifier, all modeled as oracle Turing machines -- the oracle models a human evaluator. The paper systematically studies the problem at hand, considering both deterministic and stochastic debate protocols. The results show that the proposed protocols are complete and sound, and that the prover that argues for the correct decision and the verifier run in a polynomial number of steps. Moreover, the number of oracle queries that the verifier requires is constant.
Strengths
- The paper is well written and enjoyable to read. It introduces all the relevant content important for understanding the formal framework and the results. The practical examples provided throughout the main paper clearly motivate the protocols studied in this work, and are helpful for understanding technical details.
- To my knowledge, these results are novel, and provide a different perspective on the debate framework. The protocols are relatively simple, but yield guarantees which appear to improve those from prior work.
- The paper provides a rigorous analysis of the protocols, proving that they are sound and complete, and showing that the protocols are efficient in the relevant parameters of the problem setting. The proofs appear to be relatively simple and, at first glance, they seem correct.
- These results would be of an interest to researchers working on alignment problems in AI, and could spark interesting discussions on the practicality of the debate based framework in large language models.
Weaknesses
- The paper primarily provides a theoretical treatment of the protocols it considers. Given that there is a concrete practical scenario that motivates these protocols, and since the protocols appear to be relatively simple, it seems that the authors could have conducted an experimental evaluation akin to the one in (Irving et al. 2018), but focusing on LLMs. Experiments that compare this work to prior debate-based approaches would be useful.
- Although the work relaxes some requirements/assumptions compared to prior work, more specifically (Irving et al. 2018), it doesn't fully address all the challenges related to utilizing the debate framework in practice. For example, the protocols rely on a relatively restrictive assumption that the oracle representing human judgment is correct/unbiased. That said, these limitations are clearly discussed in the conclusion section.
Questions
I don't have specific clarification questions. However, it would be great if the authors could provide a discussion related to my comments about the weaknesses of their work. More specifically, some discussion on whether it would be possible to set up an experiment based on LLMs which showcases the utility of their framework.
We are very happy to hear that you found our paper well written and enjoyable to read with practical examples provided throughout the main paper that clearly motivate the protocols studied in this work, and are helpful for understanding technical details! This type of encouragement means a lot to us.
"I don't have specific clarification questions. However, it would be great if the authors could provide a discussion related to my comments about the weaknesses of their work. More specifically, some discussion on whether it would be possible to set up an experiment based on LLMs which showcases the utility of their framework."
At the scale of the examples given in our paper (e.g. writing long contracts or conducting meta-analyses), LLMs are not currently good enough to "get off the ground" in an RL training setup where one model produces outputs and the other points out flaws. In particular, the first model's outputs will likely always contain many flaws.
However, we believe that it will soon be possible to set up smaller-scale but useful experiments with LLMs utilizing this framework. The simplest version of such an experiment would be to train two models $A$ and $B$ in an adversarial RL setup where the human judge determines who wins. For example, a natural setting could be math or logical reasoning problems that require many steps of chain-of-thought to solve. Here the model $A$ is supposed to provide long chain-of-thought answers to questions, and $B$ is supposed to point out an incorrect step in the chain-of-thought. Human raters are then shown the step pointed out by $B$. If the rater judges the step to be incorrect, $B$ wins; otherwise $A$ wins. The base models are then trained by playing against each other to win at this task. Our method applied to this task would allow the number of human ratings required to scale with the number of problems solved, rather than the number of problems times the number of steps per problem.
The key properties that such an experiment testing our theory needs to have are that:
- It is possible for the models to break down complex arguments into simpler, human-judgeable steps.
- The arguments made are long enough that the efficiency gained by only requiring human feedback for a single step is significant.
Thank you for your response, and for providing a discussion related to my comments about potential experiments.
The paper proposes a doubly-efficient debate, where two polynomial-time provers compete to convince a significantly more efficient verifier that they have correctly solved a computational problem that depends on black-box access to human judgments.
Strengths
See question part
Weaknesses
See the question part
Questions
This paper proposes two efficient-debate protocols for large language model debate. The paper is out of my research domain and it is hard for me to follow the paper. It would be appreciated if the authors could answer the questions below.
- Could the authors give an example to show the difference between the proposed debate method and existing works such as Irving et al. (2018)? Why could the proposed method get a better bound, intuitively?
- How can the method be implemented in practice to help researchers train an LLM?
- "Could authors give an example to show the difference between the proposed debate method and existing works such as Irving et al. (2018)? Why the proposed method could get a better bound intuitively?"
The key difference between our approach and that of Irving et al. (2018) is that we explicitly require that the debater model arguing for the truth can do so efficiently. In contrast, Irving et al. (2018) assume that both debaters have unlimited computational power. To see why this can make a difference, consider the example for Theorem 6.2, in which we want to use a debate between LLMs to perform a meta-analysis of the scientific literature on whether caffeine improves test-taking performance. Suppose that in this example, the truth is that the literature shows that caffeine does not improve performance. The model $A$ could attempt to $p$-hack the analysis by selecting a subset of existing studies that give the desired result, and then inventing an after-the-fact reason to reject all the studies that were not contained in the selected subset.
The key difference between our approach and that of Irving et al. (2018) now arises with regard to the model $B$, which is tasked with arguing for the truth (i.e. that caffeine does not improve performance). In the approach of Irving et al., $B$ is modelled as having unlimited computational resources. So $B$ would be allowed to employ brute-force strategies in order to show that $A$ has $p$-hacked. For example, $B$ could re-run the analysis on every subset of studies of the same size as the subset selected by $A$, and show that the opposite conclusion is reached a large majority of the time. Note that this strategy requires $B$ to run a potentially exponentially large number of meta-analyses in order to refute just one meta-analysis run by $A$.
Thus, using the approach of Irving et al., it is possible that the debater arguing for the truth may need to spend exponentially more compute in order to actually win the debate. Our approach explicitly requires the design of protocols where the debater arguing for the truth must run efficiently. In fact, the protocols we design allow the debater arguing for the truth to win by spending only $O(T)$ compute time, where $T$ is the time that it would take to solve the problem without debate. To rephrase this in the context of the meta-analysis: if it is possible to correctly perform a meta-analysis via $T$ forward passes of an LLM, then our debate protocol requires only $O(T)$ forward passes in total to guarantee that the debater arguing for the truth will win.
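To give a sense of the scale difference (the numbers below are our own illustration, not figures from the paper), compare the brute-force refutation above with the honest debater's cost in the doubly-efficient protocols:

```latex
% Refuting one p-hacked meta-analysis by re-running it on every subset of,
% say, 20 studies out of 40 (brute force), versus the honest debater's cost:
\underbrace{\binom{40}{20} \approx 1.4 \times 10^{11}}_{\text{re-analyses, brute force}}
\qquad \text{vs.} \qquad
\underbrace{O(T)}_{\text{honest debater here}},
\quad T = \text{cost of one honest meta-analysis}.
```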
- "How to implement the method in practice to help researchers to train an LLM?"
In practice, in the simplest version of our protocol, one would train two models $A$ and $B$ via reinforcement learning to play the following zero-sum game (a minimal sketch of one round follows the list below):
- Sample a question $x$ from the dataset.
- Prompt $A$ to give a chain-of-thought answer to $x$.
- Prompt $B$ to point out a step in $A$'s answer which is incorrect.
- Show the step pointed out by $B$ to a human rater. If it was incorrect, $B$ wins; otherwise $A$ wins.
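Below is a minimal sketch of the reward assignment in this game (our own illustration; `model_a`, `model_b`, and `get_human_judgment` are hypothetical stand-ins, and the RL update itself, e.g. any standard policy-gradient method, is omitted):

```python
# Minimal sketch of one round of the zero-sum game described above.
def play_debate_round(question, model_a, model_b, get_human_judgment):
    """Play one round and return (reward_for_A, reward_for_B)."""
    answer_steps = model_a.chain_of_thought(question)                 # A answers with a chain of thought
    challenged = model_b.pick_incorrect_step(question, answer_steps)  # B points at one allegedly wrong step
    step_is_correct = get_human_judgment(answer_steps[challenged])    # a single human rating per question
    if step_is_correct:
        return +1, -1  # A wins
    return -1, +1      # B wins
```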
We thank all of the reviewers for their careful reading of our paper as well as for their constructive questions and comments. Moreover, we are very grateful for your support of our work!
Despite the high ratings, the reviewers' confidence in this paper is relatively low. To make a more objective judgment, AC thoroughly read the paper. Overall, the paper employs the Debate framework to address some misalignment issues. Debate is a common framework in the field of AI Alignment, and the authors have effectively framed its theory and methodology.
However, in terms of the framework, I fail to see significant differences from Irving et al. (2018). Experimentally, the authors have not conducted substantial experiments on any Large Language Models (LLMs). The reviewers share this concern. I have carefully read the exchange between Reviewer TQe8 and the authors, and the process described in the authors' Response 2 does not significantly differ from that in Irving et al. (2018). Furthermore, 'the protocols rely on a relatively restrictive assumption that the oracle representing human judgment is correct/unbiased,' which seems to be a limitation of this work.
In summary, this paper makes some interesting explorations and analyses based on Irving et al. (2018) but lacks sufficient innovation. AC is concerned about the practical applicability of the proposed mechanism in LLMs. One suggestion is for the authors to conduct comprehensive experiments on some open-source LLMs to demonstrate the superiority of their framework. This paper has not met the acceptance standards of ICLR, and it is hoped that the authors will further revise and enhance the persuasiveness of their framework.
Why not a higher score
N/A
Why not a lower score
N/A
Reject