PaperHub
Average rating: 5.3/10 (Rejected, 4 reviewers)
Individual ratings: 5, 5, 5, 6 (min 5, max 6, std 0.4)
Confidence: 3.5 | Correctness: 2.8 | Contribution: 2.8 | Presentation: 3.0
ICLR 2025

Communicating Activations Between Language Model Agents

OpenReview | PDF
Submitted: 2024-09-28 · Updated: 2025-02-05
TL;DR

We propose a simple technique whereby language models communicate via activations.

Abstract

Communication between multiple language model (LM) agents has been shown to scale up the reasoning ability of LMs. While natural language has been the dominant medium for inter-LM communication, it is not obvious this should be the standard: not only does natural language communication incur high inference costs that scale quickly with the number of both agents and messages, but also the decoding process abstracts away too much rich information that could be otherwise accessed from the internal activations. In this work, we propose a simple technique whereby LMs communicate via *activations*; concretely, we pause an LM $B$'s computation at an intermediate layer, combine its current activation with another LM $A$'s intermediate activation via some function $f$, then pass $f$'s output into the next layer of $B$ and continue the forward pass till decoding is complete. This approach scales up LMs on new tasks with *zero* additional parameters and data, and saves a *substantial amount of compute* over natural language communication. We test our method with various functional forms $f$ on two experimental setups—multi-player coordination games and reasoning benchmarks—and find that it achieves up to $27.0$% improvement over natural language communication across datasets with $<$$1/4$ the compute, illustrating the superiority and robustness of activations as an alternative "language" for communication between LMs.
Keywords
large language models, multiagent communication, embedding representation, multiagent debate

Reviews and Discussion

Review
Rating: 5

The paper explores the idea of communication between language model agents via activations rather than in natural language. Specifically, activations from an intermediate layer of model A are merged into the activations of an intermediate layer (j) of model B and used as inputs for subsequent layers (j+1) of model B.

The paper presents two types of experiments: coordination games (where A and B share the same model parameters) and collaborative reasoning (where A and B are different models with different sizes). In the former, the experiments show that AC (activation communication) is more effective than natural language communication. In the latter, the experiments find that AC improves accuracy over both single-model baselines and natural language debates (NLD).

Strengths

Originality: Activation-based communication between LMs is a novel direction, to the best of my knowledge.

Quality: The paper shows modest quality vs. computation gains against the baselines.

Clarity: The motivations for activation communication are clearly explained.

Significance: Collaboration between multiple language agents is a potentially important direction.

Weaknesses

[Clarity] There is some ambiguity in the explanation of the grafting inputs and outputs (see Questions).

[Experiments] The experiment set-up for the coordination game is not realistic—player A can simply send its text prompt to player B without incurring additional computation, since the models are identical. It would be more realistic if model A is smaller than model B in this case.

An additional baseline for collaborative reasoning is CoT with model B alone. This will illustrate whether the quality gains from A+B come from additional inference computation.

A useful direction for experiments is to investigate the case of RAG, where the retrieved contents are processed by a small model A (whose activations are potentially pre-computed), while the response is generated by model B.

[Method] The exploration of the methods seems insufficient. For example, could we use cross attention to incorporate activations from model A to B?

Questions

In sec 3.1 "replace the activation of the last token"replace the activation of the last token (hB,j )tB ∈ R dB ←− f((hA,k)tA ,(hB,j )tB ) for some function f : R dA+dB → R dB ;"

Does this suggest that the inputs to f contain only the activations for the last tokens from A and B and the grafting only affects the activation of the last token of model B?

The computation complexity analysis seems to be contradictory: "here $\mathcal{F}(P, D) = O(PD)$ for non-learned $f$ and $O(D^2)$ when $f$ is the mapping matrix". $O(D^2)$ implies that only the last tokens are used, while $O(PD)$ seems to suggest that the activations of all prefix tokens are used.

Comment

Thank you for your valuable feedback! We appreciate that you found our approach original and well-motivated. We address specific comments from the review below:


The experiment set-up for the coordination game is not realistic—player A can simply send its text prompt to player B without incurring additional computation, since the models are identical. It would be more realistic if model A is smaller than model B in this case.

There is significant theoretical interest in this setup; Countries and Tip Sheets are both examples of Lewis signaling games and two-player language coordination games, which have been heavily studied in the multi-agent communication literature [1, 2]. Also, the core purpose of these toy setups is to show that AC is indeed a viable method while being extremely computationally efficient. (In fact, if $A$ were to simply output its prompt and then pass that as context for $B$, this would require more compute than AC!)

Furthermore, in general (and as is shown by the reasoning benchmark setup), $B$ may need access not just to the necessary information in $A$'s prompt but also to knowledge encoded in $A$'s weights.


An additional baseline for collaborative reasoning is CoT with model B alone. This will illustrate whether the quality gains from A+B come from additional inference computation.

Note that the single-model baselines in Table 3 already involve CoT prompting, showing that the quality gains from AC are independent of additional inference computation. We have made this clearer in the revised paper.


A useful direction for experiments is to investigate the case of RAG, where the retrieved contents are processed by a small model A (whose activations are potentially pre-computed), while the response is generated by model B.

While this experiment would be too costly to run on our setup, it definitely makes sense and is a great application of AC. We plan to pursue this in future work.


Could we use cross attention to incorporate activations from model A to B?

This is exactly the method of CALM [3]. However in this work, we limit our focus to task-agnostic methods; CALM and other composition/grafting methods require a substantial amount of additional task-specific parameters and data, hence we do not compare to these. Furthermore these methods require much more compute in the form of LM finetuning, layer or router training, etc.; our approach is far more compute efficient.


Does this suggest that the inputs to $f$ contain only the activations for the last tokens from $A$ and $B$, and the grafting only affects the activation of the last token of model $B$?

Yes. We have made this clearer in the revised paper.
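For concreteness, below is a minimal, hypothetical sketch of this single-last-token grafting (it assumes LLaMA-style HuggingFace models that expose their decoder blocks as `model.model.layers`; the helper name `graft_last_token` and the hook-based implementation are illustrative rather than our exact code):

```python
import torch

def graft_last_token(model_a, model_b, input_ids_a, input_ids_b, k, j, f):
    """Capture A's layer-k last-token activation, graft f(a_k, b_j) into B's
    layer-j last-token activation, and return B's next-token logits."""
    captured = {}

    def capture_hook(_module, _inputs, output):
        # output[0] is the hidden state, shape (batch, seq_len, d_A)
        captured["a_k"] = output[0][:, -1, :].detach()

    def graft_hook(_module, _inputs, output):
        hidden = output[0]
        # f maps (a_k, b_j) into B's activation space (for f = replace, it just returns a_k)
        hidden[:, -1, :] = f(captured["a_k"], hidden[:, -1, :])
        return (hidden,) + output[1:]

    handle_a = model_a.model.layers[k].register_forward_hook(capture_hook)
    handle_b = model_b.model.layers[j].register_forward_hook(graft_hook)
    try:
        with torch.no_grad():
            model_a(input_ids_a)                   # only layers <= k are needed in principle
            logits = model_b(input_ids_b).logits   # grafting happens inside layer j's hook
    finally:
        handle_a.remove()
        handle_b.remove()
    return logits[:, -1, :]                        # next-token distribution after grafting
```

Full decoding would continue autoregressively from these logits with the hooks removed after this prefill pass, roughly matching the single grafting step described in the paper.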


The computation complexity analysis seems to be contradictory: "$\mathcal{F}(P, D) = O(PD)$ for non-learned $f$ and $O(D^2)$ when $f$ is the mapping matrix". $O(D^2)$ implies that only the last tokens are used, while $O(PD)$ seems to suggest that the activations of all prefix tokens are used.

Thank you, this should say "$\mathcal{F}(D) = O(D)$ for non-learned $f$ and $O(D^2)$ when $f$ is the mapping matrix". We have fixed this in the revised paper.


We hope these improvements adequately address your questions, and we hope that you consider the extent of these improvements in your evaluation. Please let us know if there are any additional questions or concerns; we'd be happy to address them.

[1] David Lewis. Convention: A philosophical study, 2008.

[2] Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. Multi-agent cooperation and the emergence of (natural) language, 2016.

[3] Rachit Bansal, Bidisha Samanta, Siddharth Dalmia, Nitish Gupta, Shikhar Vashishth, Sriram Ganapathy, Abhishek Bapna, Prateek Jain, and Partha Talukdar. Llm augmented llms: Expanding capabilities through composition, 2024.

Comment

We greatly appreciate your time and valuable feedback. If there is anything else that needs clarification or further discussion, please do not hesitate to let us know.

Comment

Thanks for the responses. While I appreciate the computational constraints, the overall motivations for AC remain quite weak for me. Considering concerns raised by other reviewers, I will keep the current score.

Comment

The motivation behind AC is stated clearly in the paper: Communication between multiple LM agents has been shown to scale up the reasoning ability of LMs. While natural language has been the dominant medium for inter-LM communication, it is not obvious this should be the standard: not only does natural language communication incur high inference costs that scale quickly with the number of both agents and messages, but also the decoding process abstracts away too much rich information that could be otherwise accessed from the internal activations. Hence we propose communicating via activations, which we empirically show scales up LMs on new tasks with zero additional parameters and data, and saves a substantial amount of compute (>80%) over natural language communication.


After obtaining more compute, we have measured results on the entire MMLU benchmark (57 datasets), shown in the official comment Complete MMLU Results above.

As shown in the comment, AC matches/outperforms NLD on 48/57 datasets, indicating its superiority and robustness as an alternative "language" for inter-LLM communication.

We hope that with these strong and complete benchmark results now provided, the reviewer considers increasing their score to acceptance.

Review
Rating: 5

This paper proposes a new "model communication" method, which instead of facilitating debate using the output tokens of the models, does so using the intermediate activations of different models.

The paper motivates its approach using compute and information-bottleneck arguments: in communicating via activations, only one of the models is required to undergo a full forward pass, and the information communicated is richer.

The method shows good results, with the approach outperforming the baselines on coordination games across the board, and with either one of its variants (with or without a learnt linear mapping) outperforming the conventional debate baseline on 7 reasoning tasks.

Some interesting analysis is also shown to find the layer combinations for communication between the models, which converge to the same optimal values ($k = j = 26$) across different tasks.

Strengths

Novel and Interesting Idea
This finding could also be related to recent model-merging literature where sharing activations between models both during training and inference improves performance. Multi-agent debate is an important field of study, and discovering new effective and computationally more efficient methods is useful.

I also appreciate the authors' attempts to motivate their method both with mathematical formulations for compute and with more theoretical arguments in terms of the informativeness of the intermediate activations vs. output tokens.

Good results against reasonable benchmarks
The method is compared with the standard debate setup for both coordination- and reasoning-type tasks, and outperforms the baselines across almost all of them.

Useful analysis on intermediate layer sharing
The analysis on which layer combinations are optimal (Figure 2) is interesting, and could inspire potential future lines of work in understanding intermediate model activations and communication.

Weaknesses

  1. The approach only works for different model weights. While it is true that the paradigm of debate between different models is interesting, it still is useful to support debate between identical models. One of the arguments of the method is efficiency, but in a practical setting, having to load different model weights could be impractical.
  2. The approach only works for 2 models. One of the key promises of debate is that it can scale model communication to many agents. This approach only showed communication between 2, which only scratches the surface of the larger debate landscape. It would be useful to show how this method compares with traditional debate between more agents, and perhaps attempt to scale the method to such settings.
  3. The reasoning results are unclear and inconsistent. While one of the stated highlights of the method is that it is "task-agnostic," in the reasoning benchmark, two different variants of the method (with and w/o a learnt linear projection communication bridge) seem to perform inconsistently when compared with each other. For example, in some settings one substantially outperforms the other (on HS Psych, AC is much better; on Prof. Law, AC(W) is much better). It would be prudent to analyze and explain why these inconsistencies exist.

Questions

  1. For the reasoning tasks, why were the particular tasks chosen? Can you show results on the whole of MMLU? How about MMLU-Pro?
  2. Why do you think AC vs. AC(W) works better on some tasks versus others?
  3. Could this technique scale to more models favorably?
Comment

Thank you for your valuable feedback! We appreciate that you found our approach interesting and well-motivated, and our results and analysis useful. We address specific comments from the review below:


The approach only works for different model weights. While it is true that the paradigm of debate between different models is interesting, it still is useful to support debate between identical models.

Thanks for raising this point. We can extend AC to work for debate between identical models as follows: let $A$ and $B$ be instances of the same model. We can sample a completion from $A$ with temperature and graft the last-token layer-$k$ activation of the completion into $B$ at layer $j$ as part of the AC procedure. This still saves a substantial amount of compute over NLD between 2 model instances, showing our technique can apply to this setting. Below are the results of this experiment (512-token completions sampled with temperature 0.7, debate run for 2 rounds with 2 instances of `LLaMA-3.1-8B`, and AC run with $A = B =$ `LLaMA-3.1-8B`):

| Method | Biog. | GSM8k | HS Psych. | Logic | Col. Bio. | Prof. Law | Pub. Rel. |
|---|---|---|---|---|---|---|---|
| `LLaMA-3.1-8B` | **83.9 ± 0.0** | 60.0 ± 4.9 | 65.0 ± 0.1 | **42.0 ± 0.1** | 50.0 ± 0.2 | 20.0 ± 0.8 | 53.0 ± 0.2 |
| NLD | 80.8 ± 0.0 | **70.0 ± 3.7** | **85.0 ± 3.6** | 35.0 ± 4.8 | **78.0 ± 4.1** | **40.0 ± 4.9** | 53.0 ± 5.1 |
| AC | 83.7 ± 0.0 | 60.0 ± 4.9 | **85.0 ± 3.6** | 40.0 ± 4.9 | 74.0 ± 4.4 | **40.0 ± 4.9** | **79.0 ± 4.1** |

Indeed, while communication between multiple model instances doesn't always show improvement over the single model itself (a well-known result from [1]), AC matches/outperforms NLD on five of the seven datasets.

The intuition behind debate between multiple identical model instances is that sampling multiple completions (with temperature) from the same model yields diverse reasoning paths that can be recombined into a stronger final answer. The above experiment shows that the same intuition holds for AC—we are sampling multiple times from the same model, but passing responses between agents via AC rather than as NL messages.

More broadly, however, we claim that the multiple-model setting is just as realistic (perhaps more so than the setting of multiple instances of the same model), as recent advances in LLM development have led to the release of models with specialized abilities and of different sizes that merit complementary usage. Our work thus answers the question: How can we get the best performance by leveraging multiple models of distinct capabilities and sizes, relative to the added inference-time compute over a single forward pass through any single model?
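For concreteness, the identical-model variant described at the start of this comment could be sketched as follows (hypothetical usage of the `graft_last_token` helper sketched in an earlier response; `prompt_ids`, `model_a`, and `model_b` are placeholders for two instances of the same checkpoint):

```python
# Instance A samples a completion with temperature; its last-token layer-k
# activation (computed over prompt + completion) is then grafted into
# instance B at layer j. f = "replace" simply keeps A's activation.
completion_ids = model_a.generate(prompt_ids, do_sample=True,
                                  temperature=0.7, max_new_tokens=512)
next_token_logits = graft_last_token(model_a, model_b,
                                     input_ids_a=completion_ids,
                                     input_ids_b=prompt_ids,
                                     k=26, j=26, f=lambda a, b: a)
```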

Comment

The approach only works for 2 models. One of the key promises of debate is that it can scale model communication to many agents. This approach only showed communication between 2, which only scratches the surface of the larger debate landscape. It would be useful to show how this method compares with traditional debate between more agents, and perhaps attempt to scale the method to such settings.

First, we want to distinguish between "agents" (distinct model instances) and "models" (distinct LMs).

With debate, more agents can help because we are sampling different outputs from each agent (model instance), and this can yield diverse reasoning paths that are recombined to produce stronger final outputs.

This isn't true with AC, as one activation grafting step from $A$ to $B$ inherently communicates to $B$ all of $A$'s knowledge/beliefs about the prompt it was given. We argue this is actually a benefit of AC over NLD, as we don't require increasing token budgets to extract more and more information out of the LMs.

A similar argument can be made for the number of rounds in NLD. Indeed, we find in additional experiments that AC outperforms NLD even with additional rounds of debate:

| Method | Biog. | GSM8k | HS Psych. | Logic | Col. Bio. | Prof. Law | Pub. Rel. |
|---|---|---|---|---|---|---|---|
| NLD (1 round) | 83.6 ± 0.0 | 72.0 ± 4.5 | 65.0 ± 4.8 | 40.0 ± 4.9 | 68.0 ± 4.6 | 30.0 ± 4.6 | 63.0 ± 4.8 |
| NLD (2 rounds) | 80.2 ± 0.1 | 75.0 ± 4.3 | 83.0 ± 0.8 | 37.0 ± 0.1 | 71.0 ± 0.1 | 30.0 ± 0.1 | 63.0 ± 0.7 |
| NLD (3 rounds) | 80.1 ± 4.6 | **79.0 ± 4.1** | 70.0 ± 4.6 | 45.0 ± 5.0 | 63.0 ± 4.8 | **40.0 ± 4.9** | **74.0 ± 4.4** |
| NLD (4 rounds) | 78.0 ± 0.0 | **79.0 ± 4.1** | * | * | * | * | * |
| AC | **84.6 ± 0.0** | 64.0 ± 4.8 | **85.0 ± 0.8** | **47.0 ± 0.1** | **78.0 ± 0.9** | 30.0 ± 0.1 | **74.0 ± 0.1** |

\* Runs required too much compute.

We see that for 5 of the 7 reasoning benchmarks, AC beats NLD even with 3-4 rounds while using substantially less compute, highlighting the superiority and robustness of activations as an alternative “language” for inter-LM communication.

Furthermore, AC, like debate, can scale to more than 2 models. For instance, consider a setup (using the notation from Section 3.1 of our paper) where for any $i, j, k, m$ with $j < k$, we:

  1. run a partial forward pass $B_{\leq j}(x_B)$ to get last-token activation $\mathbf{b}_j$;
  2. run a partial forward pass $A_{\leq i}(x_A)$ to get $\mathbf{a}_i$;
  3. replace $\mathbf{b}_j \longleftarrow f(\mathbf{a}_i, \mathbf{b}_j)$;
  4. continue $B$'s forward pass till layer $k$ to get last-token activation $\mathbf{b}_k$;
  5. run a partial forward pass $C_{\leq m}(x_C)$ to get $\mathbf{c}_m$;
  6. replace $\mathbf{c}_m \longleftarrow f(\mathbf{b}_k, \mathbf{c}_m)$;
  7. then continue $C$'s forward pass till decoding is complete.

This 3-model setup can extend to an arbitrary number of models.
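To make the chaining concrete, here is a toy, hypothetical sketch of the 3-model recipe above (simple linear stacks stand in for real LMs and hidden states are single vectors; helper names like `run_layers` are illustrative only, not our implementation):

```python
import torch
import torch.nn as nn

def make_toy_model(depth, dim):
    return nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

def run_layers(model, h, lo, hi):
    """Run layers lo..hi-1 (0-indexed, half-open) on hidden state h."""
    for layer in model[lo:hi]:
        h = torch.relu(layer(h))
    return h

dim, depth = 16, 8
A, B, C = (make_toy_model(depth, dim) for _ in range(3))
x_A, x_B, x_C = (torch.randn(1, dim) for _ in range(3))
i, j, k, m = 4, 3, 6, 5                # grafting layers, with j < k
f = lambda src, dst: src               # f = "replace", the best-performing choice

b_j = run_layers(B, x_B, 0, j)         # step 1: partial pass of B up to layer j
a_i = run_layers(A, x_A, 0, i)         # step 2: partial pass of A up to layer i
b_j = f(a_i, b_j)                      # step 3: graft A's activation into B
b_k = run_layers(B, b_j, j, k)         # step 4: continue B to layer k
c_m = run_layers(C, x_C, 0, m)         # step 5: partial pass of C up to layer m
c_m = f(b_k, c_m)                      # step 6: graft B's activation into C
out = run_layers(C, c_m, m, depth)     # step 7: finish C's forward pass
```

The same pattern chains to any number of models: each new model receives a grafted activation at its chosen layer before finishing its own forward pass.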

See above for experiments extending AC to more agents.

Comment

The reasoning results are unclear and inconsistent. While one of the stated highlights of the method is that it is "task-agnostic," in the reasoning benchmark, two different variants of the method (with and w/o learnt linear projection communication bridge) seem to perform inconsistently when compared with each other. For example in some settings, one substantially outperforms the other (HS Psych, AC is much better, Prof. Law, AC(W) is much better). It would be prudent to analyze and explain why these inconsistencies exist.

First, note that our results are exceptionally consistent in terms of outperforming baselines. AC consistently outperforms both single-model baselines, and furthermore offers up to 27.0% improvement over NLD across six of the seven reasoning datasets.

Second, note that AC ($\mathbf{W}$) is still task-agnostic, in that $\mathbf{W}$ depends only on $A$ and $B$ (we learn $\mathbf{W}$ only once for each model pair $(A, B)$ using general text, e.g. sequences from $A$'s and/or $B$'s pretraining data mixes).

As to why the AC ($\mathbf{W}$) results seem less consistent, note our reasoning from the paper: "We hypothesize that the benefits from the learned linear layer are less consistent across datasets because the subset of C4 data used to train $\mathbf{W}$ likely contains text more semantically similar to some datasets than others, hence some datasets provide $\mathbf{W}$ with out-of-distribution inputs which reduces performance compared to vanilla AC."

Indeed, we verify this hypothesis by training $\mathbf{W}$ on the GSM8k train set (to produce $\mathbf{W}_{\textrm{in dist}}$) and then evaluating with this task-specific linear layer on the GSM8k test set:

| AC | AC ($\mathbf{W}$) | AC ($\mathbf{W}_{\textrm{in dist}}$) |
|---|---|---|
| 64.0 ± 4.8 | 66.0 ± 4.8 | 78.0 ± 4.1 |

Indeed, learning $\mathbf{W}$ in-distribution significantly boosts performance, confirming our hypothesis. Unfortunately we cannot run this experiment for the other datasets, as there is no in-distribution training data available for MMLU (we use all public data for testing).

Hence, this suggests that AC ($\mathbf{W}$) should uniformly improve over vanilla AC if we choose a training set with good coverage across many tasks and distributions, such that there are sentences semantically similar to prompts across the span of downstream task datasets.


For the reasoning tasks, why were the particular tasks chosen? Can you show results on the whole of MMLU? How about MMLU-Pro?

Following the NLD [1] and CIPHER [2] setups, we evaluated on Biographies and GSM8k, then randomly drew 5 datasets from MMLU across all category divisions. Unfortunately it was computationally infeasible to run on all MMLU/MMLU-Pro data (on our cluster setup, each experiment takes on average 2 hours per (dataset, method) pair, indicating >400 hours to run on all of MMLU/MMLU-Pro). However, this is part of our future work.


We hope these improvements adequately address your questions, and we hope that you consider the extent of these improvements in your evaluation. Please let us know if there are any additional questions or concerns; we'd be happy to address them.

[1] Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate, 2023.

[2] Chau Pham, Boyi Liu, Yingxiang Yang, Zhengyu Chen, Tianyi Liu, Jianbo Yuan, Bryan A. Plummer, Zhaoran Wang, and Hongxia Yang. Let models speak ciphers: Multiagent debate through embeddings, 2024.

Comment

I appreciate the authors' efforts in providing a detailed response. However, I'd appreciate some further clarification on your response.

See above for experiments extending AC to more agents.

Thanks for your theoretical explanation. Where exactly are the results? I only see results for more rounds.

Following the NLD [1] and CIPHER [2] setups, we evaluated on Biographies and GSM8k, then randomly drew 5 datasets from MMLU across all category divisions. Unfortunately it was computationally infeasible to run on all MMLU/MMLU-Pro data (on our cluster setup, each experiment takes on average 2 hours per (dataset, method) pair, indicating >400 hours to run on all of MMLU/MMLU-Pro). However, this is part of our future work.

I do sympathize with computational constraints. At the same time, since MMLU is one of two main baselines (other one being GSM-8k) I find it hard to accept this paper at this level with incomplete benchmarks. ICLR is a conference that deserves full-fledged papers with complete benchmarks. Besides, it would be relatively easy to "game" the system by choosing subsets whose results are favorable (not accusing you of that, just saying.)

While I do think the idea is useful, I find it hard to accept with such incomplete benchmarks. I thank you again for your responses, and I am still open to raising my score to acceptance, should the complete benchmarks be shown, or something much closer to that.

Comment

Recall that we define "agents" as distinct model instances. In our experiments extending AC to communication between identical model instances, we limit to 2 agents as this setup already outperforms NLD with >2 agents and >2 rounds.


After obtaining more compute, we have measured results on the entire MMLU benchmark (57 datasets), shown in the official comment Complete MMLU Results above.

As shown in the comment, AC matches/outperforms NLD on 48/57 datasets, indicating its superiority and robustness as an alternative "language" for inter-LLM communication.

We hope that with these strong and complete benchmark results now provided, the reviewer considers increasing their score to acceptance.

Review
Rating: 5

This paper proposes a method for communication between two LMs using intermediate activations from two LMs A and B. Specifically, the paper proposes to compute partial activations of the two LMs up to layers k and j, followed by a function that merges these two activations at the last token and assigns that value to the last token of the second LM. The authors explore various functional forms for activation merging, with replace being the most effective. Results on synthetic and reasoning datasets show partial improvements over baseline methods, although I believe additional experimental details are required to claim the effectiveness of the proposed method.

Strengths

Strengths:

  • The paper proposes a method that uses only last-token modification between two LMs, which is a more efficient communication technique than communicating in raw tokens. Section 3.2 of the paper presents a good analysis of the computational efficiency of the method.

Weaknesses

Weakness:

  • This method applies activation replacement only to the last-token activation of LM B, using the last-token activation of LM A. Therefore, the information of LM A's entire prompt is compressed into the last token. I am not sure how this will perform compared to natural language dialog, especially in cases where LM A might produce a long response (possibly with chain of thought for intermediate answer computation). This is because last-token activation replacement cannot capture the nuanced response of CoT traces. This might explain the GSM8k performance, where the proposed method does not perform well compared to the NLD baseline, since the task requires intermediate steps for answer computation.
  • The experimental details are lacking to claim the effectiveness of the proposed method. For the Table 2 results on Countries and Tip Sheets, I could not find the details on how k=j=26 was chosen. Was a separate validation set used for choosing this value? Was the same set used to tune the hyperparameters of baseline methods?
  • For learning the mapping parameter W, the authors report that since only 3072 sentences from the C4 dataset were used, the text might not be semantically similar to the evaluation tasks. Does learning W in-distribution help performance? For example, train W on the train set of GSM8k and evaluate it on the test set.

Questions

Please see questions above. Additionally:

  • Could you please comment on how this method would perform on reasoning tasks that require intermediate computations via chain of thought? For example, algorithmic tasks in BigBenchHard [1] might require intermediate CoT computations and I am skeptical that a single token replacement at the last token would capture the nuanced information in a complex CoT response.

[1] Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H.W., Chowdhery, A., Le, Q.V., Chi, E.H., Zhou, D. and Wei, J., 2022. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261.

Comment

Thank you for your valuable feedback! We appreciate that you found our approach effective and compute analysis strong. We address specific comments from the review below:


The last-token activation replacement cannot capture the nuanced response of CoT traces. This might explain the GSM8k performance, where the proposed method does not perform well compared to the NLD baseline, since the task requires intermediate steps for answer computation.

We distinguish between two questions here, and address both in turn:

Is replacing just the last token activation enough to communicate information from $A$ to $B$?

Note that after applying masked attention in each of the previous Transformer layers, the last token activation of $A$ attends to all tokens before it, hence incorporating information from the entire sequence. Indeed, this must be the case for activation communication to recover the gap between the zero-communication and skyline setups on both coordination games, which (for Tip Sheets in particular) require information starting at the first few tokens of $A$'s prompt to be communicated.

In addition, we experiment with summing the activations of all tokens in the sequence rather than just the last (we cannot replace all tokens as this would just replace $B$'s layer-$j$ activation with $A$'s layer-$k$ activation):

| Method | Biog. | GSM8k | HS Psych. | Logic | Col. Bio. | Prof. Law | Pub. Rel. |
|---|---|---|---|---|---|---|---|
| AC ($f$ = replace) | **84.6 ± 0.0** | 64.0 ± 4.8 | **85.0 ± 0.8** | **47.0 ± 0.1** | **78.0 ± 0.9** | **30.0 ± 0.1** | **74.0 ± 0.1** |
| AC ($f$ = sum) | 79.7 ± 0.0 | **66.0 ± 4.7** | 65.0 ± 4.8 | 42.0 ± 4.9 | 50.0 ± 5.0 | 25.0 ± 4.3 | 37.0 ± 4.8 |
| AC (all tokens) | 76.0 ± 0.0 | 62.0 ± 4.9 | 35.0 ± 4.8 | 42.0 ± 4.9 | 61.0 ± 4.9 | 15.0 ± 3.6 | 26.0 ± 4.4 |

Indeed, applying $f$ to all tokens decreases performance relative to applying $f$ to just the last token. Note that the fact that performance generally decreases from $f$ = replace to $f$ = sum, and further with all tokens, is expected. The high performance of AC with $f$ = replace means that the edited last-token activation $\mathbf{b}$ retains some meaning in $B$'s activation space; it is less likely for this to be the case when $f$ = sum (at the very least $\mathbf{b}$ has norm roughly $2\times$ that of $B$'s original last-token activation), and when doing this for all tokens we'd expect performance to decrease even further, as now all activation vectors, not just the last, are out-of-distribution w.r.t. $B$'s activation space.

Does activation communication lose out on the benefits of CoT? [This is also the reviewer's first question]

This is a great question, and we empirically verify the answer is no.

First, note that we still reap the benefits of CoT when we sample a completion from $B$ after AC (where $B$ gets all the information encoding $A$'s "beliefs" about the prompt via AC, hence CoT on $A$'s side is not needed). To verify this intuition, we experiment with prompting $A$ with CoT, generating a full response, and then passing the layer-$k$ last-token activation of the CoT response to $B$ as part of AC:

| Method | Biog. | GSM8k | HS Psych. | Logic | Col. Bio. | Prof. Law | Pub. Rel. |
|---|---|---|---|---|---|---|---|
| AC | 84.6 ± 0.0 | 64.0 ± 4.8 | **85.0 ± 0.8** | **47.0 ± 0.1** | 78.0 ± 0.9 | 30.0 ± 0.1 | **74.0 ± 0.1** |
| AC ($\mathbf{W}$) | **86.8 ± 0.0** | **66.0 ± 4.8** | 70.0 ± 0.1 | 35.0 ± 0.1 | **79.0 ± 0.9** | **45.0 ± 0.1** | 63.0 ± 0.1 |
| AC (CoT) | 82.1 ± 0.0 | **66.0 ± 4.0** | 80.0 ± 4.0 | 26.0 ± 4.4 | 67.0 ± 4.7 | 40.0 ± 4.9 | 63.0 ± 4.8 |

Indeed, we empirically find our above intuition to hold, as there is no significant improvement over vanilla AC when generating from $A$ using CoT.

Comment

For the Table 2 results on Countries and Tip Sheets, I could not find the details on how $k = j = 26$ was chosen. Was a separate validation set used for choosing this value? Was the same set used to tune the hyperparameters of baseline methods?

Figure 2 in the paper shows 2D contour plots of accuracy over different values of $k$ and $j$. For computational reasons we only do this hyperparameter sweep on the Countries and Tip Sheets datasets; we simply cross-apply the optimal values here, $k = j = 26$, to the reasoning benchmarks and find that the values seem to generalize well across datasets, which is quite interesting in its own right.

There are no hyperparameters to be tuned for the baselines in the coordination games, nor for the reasoning benchmarks (we fix the number of rounds and agents both to 2 for apples-to-apples comparison with AC—though in additional experiments we find that AC outperforms NLD with additional rounds, see General Response).


For learning the mapping parameter W, the authors report that since only 3072 sentences from the C4 dataset were used, the text might not be semantically similar to the evaluation tasks. Does learning W in-distribution help performance? For example, train W on the train set of GSM8k and evaluate it on the test set.

We appreciate the reviewer's intent to verify our reasoning about the benefits of $\mathbf{W}$ in the paper, which goes as follows: "We hypothesize that the benefits from the learned linear layer are less consistent across datasets because the subset of C4 data used to train $\mathbf{W}$ likely contains text more semantically similar to some datasets than others, hence some datasets provide $\mathbf{W}$ with out-of-distribution inputs which reduces performance compared to vanilla AC."

While a salient feature of our approach is that it introduces zero additional task-specific parameters and data by virtue of keeping $\mathbf{W}$ task-agnostic, we indeed verify our hypothesis by training $\mathbf{W}$ on the GSM8k train set (to produce $\mathbf{W}_{\textrm{in dist}}$) and then evaluating with this task-specific linear layer on the GSM8k test set:

| AC | AC ($\mathbf{W}$) | AC ($\mathbf{W}_{\textrm{in dist}}$) |
|---|---|---|
| 64.0 ± 4.8 | 66.0 ± 4.8 | 78.0 ± 4.1 |

Indeed, learning $\mathbf{W}$ in-distribution significantly boosts performance, confirming our hypothesis. Unfortunately we cannot run this experiment for the other datasets, as there is no in-distribution training data available for MMLU (we use all public data for testing).


We hope these improvements adequately address your questions, and we hope that you consider the extent of these improvements in your evaluation. Please let us know if there are any additional questions or concerns; we'd be happy to address them.

Comment

We greatly appreciate your time and valuable feedback. If there is anything else that needs clarification or further discussion, please do not hesitate to let us know.

Review
Rating: 6

This paper proposes a novel approach to enable inter-agent communication among language model (LM) agents through direct activation sharing. The authors argue that using activation states for communication, instead of conventional natural language, enhances computational efficiency and performance. Specifically, the method involves grafting activation outputs from one model’s layer into another’s, combining the information through simple operations (sum, mean, replace) or a learned linear mapping. Empirical evaluations on coordination games and reasoning benchmarks demonstrate promising results, with the activation-based method achieving up to 27% higher accuracy than natural language communication while using only a quarter of the computational resources.

Strengths

  • Innovation in Communication Mechanisms: The paper presents a creative shift from token-based natural language communication to activation-based sharing between LMs. This direction is original and leverages intermediate model states, which theoretically encode richer representations than single-token outputs.

  • Compute Efficiency: By bypassing the resource-heavy natural language decoding process, this method reportedly achieves substantial computational savings. This aspect is well-supported with theoretical compute models and validated experimentally.

  • Experimental Validation: The experiments are thorough, covering two main setups—coordination games and reasoning tasks across diverse datasets. The paper also explores different activation fusion functions, providing a nuanced understanding of how these choices impact results.

Weaknesses

  • Interpretability and Debugging Concerns: Activation sharing sacrifices transparency and interpretability, particularly since activations do not translate directly into human-readable information. While the authors acknowledge this, their response is inadequate, failing to consider how the complexity of debugging inter-agent communication is compounded when activations, rather than explicit language, are involved.

  • Sparse Performance Gains on Some Tasks: The empirical results do not uniformly favor activation-based communication. In certain benchmarks, gains are minimal or inconsistent (as in Table 3, where Natural Language Debate occasionally matches or outperforms activation-based approaches). This variability weakens the paper’s claim that activation-based communication is a universal improvement.

  • Selection of Baselines: The paper uses Natural Language Debate (NLD) as the primary baseline. While relevant, it would have been more compelling if additional baselines were included, such as model merging methods or more advanced methods of multi-agent debate. Limiting comparisons may overstate the efficacy of the proposed approach.

  • Scalability and Real-world Applicability: The activation-based communication approach seems limited in real-world settings. Many existing LMs, especially proprietary ones, restrict access to activations, limiting the applicability of this approach in settings beyond research or open-source models. This restriction is addressed minimally, without discussing how activation-based communication could be practical for typical API-based LM access.

Questions

  • Choice of Activation Fusion Functions: Why were basic functions like sum, mean, and replace chosen over more sophisticated functions that might better capture semantic alignment between models?

  • Choice of Layers of Activations to Fuse: What is the effect of the choice of layers to fuse and what is the strategy to choose?

  • Behavior with Different Model Architectures: Have you considered whether this approach generalizes effectively across architectures with different layer configurations or attention mechanisms? The focus on models from a single family (LLaMA) may obscure limitations when used with other architectures.

  • Application to API-accessible LMs: How does the proposed method scale to real-world environments where only API access is available, such as when interacting with black-box models?

  • Missing Reference for LLM-based multi-agent communication: CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society (NeurIPS 2023)

Comment

Thank you for your valuable feedback! We appreciate that you found our approach original and well-supported and our experiments thorough. We address specific comments from the review below:


Interpretability and Debugging Concerns: While we acknowledge the complexity of "debugging AC", we would like to make a few points. First, past work has justified that we do not necessarily need to understand the intermediate debates amongst LLMs [1]. Second, there is a fundamental tradeoff between interpretability and information preservation (as activations, by virtue of being much higher-dimensional than the space of natural language, allow proportionally higher-entropy communication) [1], which merits discussion beyond the scope of this work. But third, we actually posit that our method suggests a new avenue towards interpreting LM activations: "translating" activations based on the beliefs they induce as messages in listening agents, similar to the method put forward in [2]. This would make the communicated activation vectors interpretable, and we recognize this as a promising avenue for future research.


Sparse Performance Gains on Some Tasks: We'd like to push back against this claim, as the consistency of AC outperforming single-model baselines and NLD is well-supported by our results in Table 3. To reiterate our analysis from the paper: Notably, while NLD consistently outperforms `LLaMA-3.2-3B`, it does not always display a performance improvement over `LLaMA-3.1-8B`; but remarkably, AC consistently outperforms both single-model baselines. In fact, AC offers an up to 27.0% improvement over NLD across six of the seven reasoning datasets.


Selection of Baselines: We limit our focus to task-agnostic methods; approaches such as CALM [3] and other composition/grafting methods require a substantial amount of additional task-specific parameters and data, hence we do not compare to these. Furthermore these methods require much more compute in the form of LM finetuning, layer or router training, etc.; our approach is far more compute efficient.

Regarding multiagent debate, the NLD setup from [4] that we evaluate against is the predominant method of NL communication. This is quite a strong and flexible NL approach, incorporating CoT into each model's prompt and allowing varied numbers of agents and rounds. Indeed, we find in additional experiments that AC outperforms NLD even with additional rounds of debate:

| Method | Biog. | GSM8k | HS Psych. | Logic | Col. Bio. | Prof. Law | Pub. Rel. |
|---|---|---|---|---|---|---|---|
| NLD (1 round) | 83.6 ± 0.0 | 72.0 ± 4.5 | 65.0 ± 4.8 | 40.0 ± 4.9 | 68.0 ± 4.6 | 30.0 ± 4.6 | 63.0 ± 4.8 |
| NLD (2 rounds) | 80.2 ± 0.1 | 75.0 ± 4.3 | 83.0 ± 0.8 | 37.0 ± 0.1 | 71.0 ± 0.1 | 30.0 ± 0.1 | 63.0 ± 0.7 |
| NLD (3 rounds) | 80.1 ± 4.6 | **79.0 ± 4.1** | 70.0 ± 4.6 | 45.0 ± 5.0 | 63.0 ± 4.8 | **40.0 ± 4.9** | **74.0 ± 4.4** |
| NLD (4 rounds) | 78.0 ± 0.0 | **79.0 ± 4.1** | * | * | * | * | * |
| AC | **84.6 ± 0.0** | 64.0 ± 4.8 | **85.0 ± 0.8** | **47.0 ± 0.1** | **78.0 ± 0.9** | 30.0 ± 0.1 | **74.0 ± 0.1** |

\* Runs required too much compute.

We see that for 5 of the 7 reasoning benchmarks, AC beats NLD even with 3-4 rounds while using substantially less compute, highlighting the superiority and robustness of activations as an alternative “language” for inter-LM communication.


Scalability and Real-world Applicability: Exploring API-only approaches is highly limiting (see AC's meta review of [1] at last year's conference). Furthermore, recent releases of powerful open-source models [5] merit the development of embedding-based techniques.

Comment

Choice of Activation Fusion Functions: Our paper unifies activation-level grafting techniques under a single framework, parameterized by the function $f$ used to combine activations. Crucially, the more complex $f$ is, the less compute our technique saves over traditional NL communication. We explore simple forms of $f$ (e.g., sum, mean) that require zero additional task-specific parameters and data, are far more compute-efficient, and also consistently outperform natural language debate. The performance of AC even with simple functional forms $f$ illustrates the superiority and robustness of activations as an alternative "language" for communication between LMs.
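As a hypothetical illustration (not our exact code), the simple parameter-free forms of $f$ mentioned above, plus the learned linear variant, could be written as follows; shapes of `a` and `b` are assumed equal for the parameter-free forms, while `W` handles mismatched dimensions:

```python
import torch

def f_replace(a, b):
    # keep A's activation and discard B's (the best-performing simple form)
    return a

def f_sum(a, b):
    return a + b

def f_mean(a, b):
    return 0.5 * (a + b)

def f_linear(a, b, W):
    # W: a (d_A + d_B, d_B) mapping learned once per model pair on general text
    return torch.cat([a, b], dim=-1) @ W
```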


Choice of Layers of Activations to Fuse: Activations of late-enough layers essentially encode the model's entire knowledge about the provided context as well as its predicted completion and confidence in that completion (see Figures 1 and 7 in [6] and [7], respectively, which show that linear probes tasked with predicting certain output characteristics from a Transformer's intermediate layer embeddings of its input work poorly for early layers, extremely well after around the halfway point of computation, but then probe accuracy drops closer to the final layers). Indeed, these curves of probe accuracy by layer indicate that the final layers and LM head "throw away" information not useful for next-token prediction that very well could be useful for communicative purposes. This is precisely why we choose a layer around halfway through the LM; indeed, [7] find that by around the halfway point of an LM's computation, it has developed "enriched entity representations" of the input that would be quite useful for communication over the next word representations of later layers.

We validate this empirically; Figure 2 in the paper shows 2D contour plots of accuracy over different values of $k$ and $j$. For computational reasons we only do this hyperparameter sweep on the Countries and Tip Sheets datasets; we simply cross-apply the optimal values here, $k = j = 26$, to the reasoning benchmarks and find that the values seem to generalize well across datasets, which is quite interesting in its own right.


Behavior with Different Model Architectures: First, note that prior work on inter-LM communication has only focused on large models (`GPT-3.5/4`), leaving the efficacy of debate on smaller, open-source models underexplored; our study addresses this gap by reimplementing NLD [4] in experiments with smaller-scale models. While we only test `LLaMA` models, note the following: (1) the models we use are among state-of-the-art open-source models, meriting our focus; and (2) we extensively vary both the `LLaMA` suite (`LLaMA`-{2, 3, 3.1, 3.2}) and parameter count (1-70B). This is already quite broad coverage of models/model sizes; but further exploration is part of our future work.


Application to API-accessible LMs: See Scalability and Real-world Applicability in Part 1 of the response.


Missing Reference for LLM-based multi-agent communication: Thank you for the suggestion, we have added this reference in the revised version of the paper.


We hope these improvements adequately address your questions, and we hope that you consider the extent of these improvements in your evaluation. Please let us know if there are any additional questions or concerns; we'd be happy to address them.

[1] Chau Pham, Boyi Liu, Yingxiang Yang, Zhengyu Chen, Tianyi Liu, Jianbo Yuan, Bryan A. Plummer, Zhaoran Wang, and Hongxia Yang. Let models speak ciphers: Multiagent debate through embeddings, 2024.

[2] Jacob Andreas, Anca Dragan, and Dan Klein. Translating neuralese, 2018.

[3] Rachit Bansal, Bidisha Samanta, Siddharth Dalmia, Nitish Gupta, Shikhar Vashishth, Sriram Ganapathy, Abhishek Bapna, Prateek Jain, and Partha Talukdar. Llm augmented llms: Expanding capabilities through composition, 2024.

[4] Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate, 2023.

[5] Abhimanyu Dubey et al. The llama 3 herd of models, 2024.

[6] John Hewitt and Christopher D. Manning. A structural probe for finding syntax in word representations, 2019.

[7] Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. Linearity of relation decoding in transformer language models, 2024.

Comment

We greatly appreciate your time and valuable feedback. If there is anything else that needs clarification or further discussion, please do not hesitate to let us know.

Comment

We appreciate all reviewers’ time and valuable feedback! We have conducted additional experiments and incorporated the rebuttal content into our revision to answer the reviewers’ questions and concerns. Below is a summary of main changes:

| Change | Section | Related Reviewers |
|---|---|---|
| Modifying activations of all tokens | Appendix B.1 | Reviewer pCow |
| Incorporating chain-of-thought prompting | Appendix B.2 | Reviewers pCow, Gubh |
| Learning $\mathbf{W}$ in-distribution | Appendix B.3 | Reviewers pCow, fS21 |
| AC between identical models | Appendix B.4 | Reviewer fS21 |
| Additional rounds of NLD | Appendix B.5 | Reviewers 8hgU, pCow, fS21 |
| Additional discussion of Lewis signaling games | Section 4.1 | Reviewer Gubh |

We thank the reviewers for their feedback, and we’re happy to answer any further questions.

Comment

Thank you all reviewers for your valuable time and feedback. We wanted to share one more important result:

After obtaining more compute, we have measured results on the entire MMLU benchmark (57 datasets):

| Dataset | NLD | AC |
|---|---|---|
| Conceptual Physics | 60.0 ± 4.9 | **68.0 ± 4.6** |
| High School Chemistry | **50.0 ± 5.0** | 37.0 ± 4.8 |
| Security Studies | 60.0 ± 4.9 | 60.0 ± 4.9 |
| Jurisprudence | 84.0 ± 3.6 | 84.0 ± 3.6 |
| Logical Fallacies | 63.0 ± 4.8 | **72.0 ± 4.5** |
| College Computer Science | 44.0 ± 5.0 | 44.0 ± 5.0 |
| International Law | 55.0 ± 5.0 | **59.0 ± 4.9** |
| Miscellaneous | 90.0 ± 3.0 | **95.0 ± 2.2** |
| Marketing | 70.0 ± 4.6 | **85.0 ± 3.6** |
| Elementary Mathematics | **75.0 ± 4.3** | 58.0 ± 4.9 |
| Machine Learning | 42.0 ± 4.9 | 42.0 ± 4.9 |
| High School Macroeconomics | 44.0 ± 5.0 | **75.0 ± 4.3** |
| High School US History | 45.0 ± 5.0 | **71.0 ± 4.6** |
| Human Aging | 56.0 ± 5.0 | **72.0 ± 4.5** |
| Astronomy | 79.0 ± 4.1 | **80.0 ± 4.0** |
| Computer Security | 56.0 ± 5.0 | **75.0 ± 4.3** |
| High School Statistics | **55.0 ± 5.0** | 42.0 ± 4.9 |
| Professional Medicine | **79.0 ± 4.1** | 65.0 ± 4.8 |
| Electrical Engineering | 58.0 ± 4.9 | **60.0 ± 4.9** |
| High School Computer Science | 63.0 ± 4.8 | **70.0 ± 4.6** |
| College Physics | **50.0 ± 5.0** | 28.0 ± 4.5 |
| Management | 74.0 ± 4.1 | **75.0 ± 4.3** |
| Moral Scenarios | 40.0 ± 4.9 | 40.0 ± 4.9 |
| World Religions | 58.0 ± 4.9 | **72.0 ± 4.5** |
| Virology | 47.0 ± 5.0 | **50.0 ± 5.0** |
| Philosophy | 67.0 ± 4.7 | **70.0 ± 4.6** |
| Abstract Algebra | **50.0 ± 5.0** | 28.0 ± 4.5 |
| High School Government and Politics | **80.0 ± 4.0** | 61.0 ± 4.9 |
| High School Biology | 60.0 ± 4.9 | **65.0 ± 4.8** |
| College Mathematics | 65.0 ± 4.8 | **66.0 ± 2.4** |
| Global Facts | 33.0 ± 5.0 | **37.0 ± 4.8** |
| High School World History | 71.0 ± 4.0 | **74.0 ± 4.4** |
| High School European History | 68.0 ± 4.0 | **71.0 ± 4.6** |
| College Medicine | **65.0 ± 4.8** | 53.0 ± 5.0 |
| High School Geography | 67.0 ± 4.7 | **79.0 ± 4.1** |
| Anatomy | 74.0 ± 4.4 | 74.0 ± 4.4 |
| Human Sexuality | 75.0 ± 4.3 | 75.0 ± 4.3 |
| Medical Genetics | 79.0 ± 4.1 | **82.0 ± 3.8** |
| Professional Accounting | 40.0 ± 4.9 | **48.0 ± 4.5** |
| US Foreign Policy | 89.0 ± 3.1 | **90.0 ± 3.1** |
| Business Ethics | 43.0 ± 5.0 | **44.0 ± 5.0** |
| College Chemistry | 41.0 ± 5.0 | **47.0 ± 5.0** |
| High School Physics | 40.0 ± 5.0 | **47.0 ± 5.0** |
| Professional Psychology | 54.0 ± 4.8 | **55.0 ± 5.0** |
Comment
| Dataset (cont.) | NLD | AC |
|---|---|---|
| Sociology | 68.0 ± 4.1 | **68.0 ± 4.6** |
| High School Microeconomics | 95.0 ± 2.2 | **95.0 ± 2.2** |
| High School Mathematics | 55.0 ± 5.0 | 55.0 ± 5.0 |
| Prehistory | **75.0 ± 4.3** | 60.0 ± 4.9 |
| Nutrition | 64.0 ± 4.5 | **70.0 ± 4.6** |
| Clinical Knowledge | 65.0 ± 4.3 | 65.0 ± 4.8 |
| Moral Disputes | 58.0 ± 4.8 | **60.0 ± 4.9** |
| Econometrics | 40.0 ± 5.0 | 40.0 ± 4.9 |
| High School Psychology | 83.0 ± 0.8 | **85.0 ± 0.8** |
| Formal Logic | 37.0 ± 0.1 | **47.0 ± 0.1** |
| College Biology | 71.0 ± 0.1 | **78.0 ± 0.9** |
| Professional Law | 30.0 ± 0.1 | 30.0 ± 0.1 |
| Public Relations | 63.0 ± 0.7 | **74.0 ± 0.1** |
| Average | 60.7 ± 2.0 | **62.7 ± 2.2** |

As shown above, AC matches/outperforms NLD on 48/57 datasets, indicating its superiority and robustness as an alternative "language" for inter-LLM communication.

We hope that with these strong and complete benchmark results now provided, the reviewers consider raising their scores to support acceptance.

AC Meta-Review

This paper introduces a novel method for inter-agent communication using activation sharing rather than natural language. The approach pauses one model's computation at an intermediate layer, combines its activation with another model's, and continues the forward pass. As claimed, this method improves performance over natural language debate, using only 1/4 of the compute, with up to 27% accuracy gains on reasoning tasks, and requires no additional parameters or training data. The authors validate these claims with experiments across coordination games and reasoning benchmarks, including results on all 57 MMLU datasets, where the method outperforms natural language debate on 48.

Reviewers generally found the idea novel and promising, noting its clear motivation and strong empirical support. However, they raised three key concerns: (1) limited interpretability compared to natural language communication, though the authors argue this represents a necessary tradeoff between interpretability and information preservation, (2) initial evaluation on only a subset of MMLU datasets, which was later addressed with complete results showing AC outperforms NLD on 48/57 datasets, and (3) questions about scalability beyond two models, though the authors provide a theoretical framework for extending the approach to multiple models. The reviewers seemed unconvinced after the rebuttal and remained mostly negative about this work. But the AC sees the potential of this work and believes it would be a valuable contribution to the community once those concerns are addressed.

Additional Comments on Reviewer Discussion

See above.

Final Decision

Reject