PaperHub
Rating: 6.0/10 (Poster; 3 reviewers; min 5, max 7, std 0.8)
Individual ratings: 6 / 5 / 7
Average confidence: 3.7
COLM 2025

MeMAD: Structured Memory of Debates for Enhanced Multi-Agent Reasoning

OpenReview | PDF
Submitted: 2025-03-19 | Updated: 2025-08-26
TL;DR

We propose Memory-Augmented Multi-Agent Debate (MeMAD), which systematically organizes and reuses past debate transcripts to improve performance on complex reasoning tasks without requiring parameter updates.

Abstract

Keywords
Multi-Agent Debate, Memory Augmentation

Reviews and Discussion

Review (Rating: 6)

Existing Multi-Agent Debate (MAD) frameworks typically treat each debate independently, discarding valuable debate transcripts once an answer is finalized. This severely limits the system's capacity for continuous improvement. This work therefore attempts to incorporate long-term memory into the MAD framework in a parameter-free manner. The authors' approach builds a structured memory system specifically designed to store evidence of successful and failed reasoning pathways. This design enables continuous knowledge accumulation, allowing LLMs to learn from and build upon collective historical experiences.

The idea is novel and highly necessary. Experimental results demonstrate that the method achieves an average performance improvement of 3.3% across several datasets. However, the main weakness of this paper lies in the lack of justification for the necessity of designing a dedicated memory framework tailored to the MAD scenario. The authors do not provide comparisons with other methods such as LangChain or MemGPT, making it difficult to assess the cost-benefit ratio of their proposed solution.

The manuscript currently contains numerous unclear aspects that significantly hinder proper evaluation. While I acknowledge the possibility of misinterpretation on my part, I must emphasize that, given my substantial expertise in both MAD and LLM memory systems, the majority of these clarity issues likely stem from deficiencies in the manuscript itself rather than from reviewer misunderstanding. In light of these unresolved clarity concerns, I have maintained a relatively neutral score at this stage. However, I remain fully prepared to adjust my evaluation during the discussion period, pending satisfactory clarification of these critical issues.

Reasons to Accept

  1. The motivation is well-justified. A key limitation of existing MAD frameworks is that they require reasoning from scratch in every debate, despite the substantial computational cost involved. Enabling models to reuse debate experiences is thus a highly important and natural direction for improvement.

  2. The proposed method demonstrates reasonable feasibility. In fact, similar memory systems have already been deployed in several online products, equipping LLMs with near-infinite long-term memory capabilities.

  3. The method is parameter-free. Notably, recent research trends focus on Inference-Time Training to inject memory into model parameters. However, such approaches are only feasible with training access to the model, which users of API-based LLMs lack. MeMAD offers a distinct and practical alternative for addressing memory challenges.

  4. Experimental results confirm that MeMAD achieves strong performance in both accuracy and generalization.

Reasons to Reject

Indeed, most of my Reasons to Accept are also my Reasons to Reject.

  1. Need for a MAD-specific memory framework: Given the availability of many mature industrial-grade LLM memory frameworks (e.g., LangChain and MemGPT), is a specialized memory framework for MAD scenarios truly necessary? Unfortunately, the paper's analysis only briefly mentions in the Related Work section: "Nonetheless, existing memory-augmentation methods predominantly focus on general memory operations (reading, writing, retrieval), largely neglecting the structured transformation of multi-agent debate interactions into reusable reasoning experiences." This argument is far from convincing to me.

  2. Comparison with Inference-Time Training approaches: Similar to the above point, numerous recent studies employ Inference-Time Training to store memories directly in model parameters. Personally, I find this approach more fundamental and natural. At the very least, the authors should include experiments or analysis demonstrating the orthogonality between MeMAD and Inference-Time Training methods.

  3. Clarity in experimental setup: After extensive searching, I couldn't find clear documentation in the Experiments section about how memories were established for the benchmark datasets before inference. Perhaps I missed it, but I believe it's the authors' responsibility to clearly present this crucial step to readers. As I understand it, the first sample in the test set would have no memory, while the nth sample would have debate memories from samples 1 to n-1. This suggests that in-benchmark order and cross-benchmark order could significantly impact the reasoning results - an analysis I didn't see addressed.

  4. Limited complexity reduction: Surprisingly, MeMAD demonstrates only marginal reductions in complexity, which is quite puzzling. This raises questions about the practical significance of reusing past reasoning experiences. If the MeMAD framework fails to meaningfully reduce inference costs, what compelling advantages does it offer to justify its adoption over existing solutions?

Questions to the Authors

  1. Regarding Weakness 1: I would like to understand what specific advantages the MeMAD framework offers compared to existing open-source memory frameworks (e.g., LangChain, MemGPT). Alternatively, could you explain why those existing frameworks are not suitable for MAD scenarios?

  2. Regarding Weakness 3: Could you please clarify how exactly you built the memory index for the benchmarks in your experiments? Furthermore, what was the impact of inference order on your results?

  3. Regarding computational costs and API usage: Did your cost calculations include both the memory generation costs and inference time costs, or did you only consider the inference time costs?

Comment

Thank you for your insightful review! We are happy to clarify the points that may have caused confusion.

R1 & Q1:Regarding the Need for a MAD-specific Memory Framework

We suspect there might be a misunderstanding regarding our use of the term "framework" in MeMAD. We are NOT proposing a new memory infrastructure system akin to LangChain or MemGPT. Instead, MeMAD represents a methodological framework focused on systematically extracting, structuring, and reusing debate experiences from MAD processes. Our implementation actually utilizes existing tools—we built upon AutoGen and could equivalently use LangChain. Our key innovation is specifically in what we store and how we reuse MAD experiences, rather than the memory system itself.

For instance, MemGPT addresses the limitation of constrained context windows by leveraging virtual memory paging between main memory and disk, similar to traditional operating systems. Such methods focus primarily on memory forms (e.g., chunks or graphs) and operations such as reading, writing, and forgetting. In contrast, MeMAD prioritizes the content of memory, specifically the structured acquisition and effective utilization of debate experiences, a facet relatively underexplored in existing memory-augmentation approaches.
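To make the distinction concrete, here is a minimal sketch of the kind of structured experience record MeMAD stores; the field names are illustrative assumptions, not the paper's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class DebateExperience:
    """One structured record distilled from a completed debate.

    Field names are illustrative, not the paper's exact schema.
    """
    question: str                # the task the agents debated
    final_answer: str            # the answer the debate converged on
    successful: bool             # whether the answer was verified correct
    self_reflections: list[str] = field(default_factory=list)  # each agent's own takeaways
    peer_feedback: list[str] = field(default_factory=list)     # critiques exchanged between agents
```

The point is that the unit of storage is a distilled reasoning experience, not a raw conversation log or a generic memory chunk.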

R2: Regarding Comparison with Inference-Time Training

We appreciate your suggestion regarding Inference-Time Training approaches. When designing MeMAD, we specifically targeted scenarios where modifying model parameters is impractical, such as when using API-based LLMs. Many high-performing models (e.g., GPT-4, Claude 3.7) do not provide training access, thus making parameter-free approaches essential for practical deployment.

We concur that storing memories in model parameters is fundamentally appealing. However, from our perspective, Inference-Time Training may potentially face certain challenges: (1) ensuring parameter updates do not negatively affect the model's existing knowledge during inference, and (2) managing substantial computational costs associated even with minimal parameter updates, especially at scale. Further exploration into these methods is warranted.

Due to current resource constraints, we are unable to perform experiments involving parameter updates.

R3 & Q2: Regarding Experimental Setup Clarity

Our method comprises two phases:

  • Experience Accumulation Phase (Section 4.1): We use a separate training dataset (detailed in Appendix B) where each sample undergoes the full debate process to generate structured experiences, stored in the Memory Bank. Importantly, no memory retrieval occurs during this phase; each sample's experience accumulation is independent.
  • Retrieval and Inference Phase (Section 4.2): Each test sample accesses the complete Memory Bank accumulated from the experience accumulation phase. The Memory Bank remains fixed during testing; no new experiences from test samples are added. Thus, the order of test samples does not influence results, as each sample accesses identical memory resources.

This design ensures fair evaluation and avoids the complexities associated with continuous learning during testing, which would require accurately judging response correctness without ground truth—a substantial challenge in complex reasoning tasks. We believe this area merits future exploration.
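A rough sketch of this two-phase protocol is shown below; the helpers run_debate, reflect, and embed are hypothetical stand-ins for the debate driver, the experience-extraction step, and the text encoder:

```python
import numpy as np

def accumulation_phase(train_samples, memory_bank):
    """Phase 1 (offline): debate each training sample independently,
    distill a structured experience, and store it. No retrieval here."""
    for sample in train_samples:
        transcript, answer = run_debate(sample.question)        # hypothetical debate driver
        exp = reflect(transcript, answer, sample.ground_truth)  # hypothetical reflection step
        memory_bank.append((embed(sample.question), exp))       # hypothetical text encoder

def inference_phase(test_sample, memory_bank, k=3):
    """Phase 2 (online): retrieve the k most similar stored experiences
    and inject them into the debate. The bank is frozen, so test-sample
    order cannot affect the outcome."""
    q = embed(test_sample.question)
    sims = [float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))
            for e, _ in memory_bank]
    retrieved = [memory_bank[i][1] for i in np.argsort(sims)[-k:]]
    return run_debate(test_sample.question, retrieved=retrieved)
```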

R4 & Q3: Regarding Limited Complexity Reduction

We acknowledge MeMAD demonstrates only marginal complexity reduction compared to MAD. However, our primary objective is performance improvement rather than cost reduction. As MeMAD incorporates memory retrieval, it naturally incurs additional costs. Our goal was to minimize this cost increase while achieving significant accuracy gains.

Regarding cost calculations: We measured only inference-time costs, excluding memory generation costs. Memory generation represents a fixed, one-time expense amortized across all test samples; consequently, as the number of inference samples grows, this initial cost per sample becomes negligible.
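As a back-of-the-envelope illustration (our notation, not the paper's): with one-time memory-construction cost $C_{\mathrm{mem}}$, per-query inference cost $c_{\mathrm{inf}}$, and $N$ queries served,

```latex
\text{cost per query} \;=\; \frac{C_{\mathrm{mem}}}{N} + c_{\mathrm{inf}}
\;\longrightarrow\; c_{\mathrm{inf}} \quad (N \to \infty)
```

For example, if building the memory costs the equivalent of 1,000 debate runs and the bank then serves 10,000 queries, the construction overhead amounts to only 0.1 debate-equivalents per query.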

We hope these clarifications address your concerns. If you have any further questions or concerns, we warmly welcome continued discussion.

Comment

The authors have adequately addressed my Q1, R1, R2, Q3, and R4. However, after the clarification of the experimental setup, I have some concerns: after expending a significant amount of inference FLOPs to construct the structured memory, the accuracy improvement obtained is only around 2%. Moreover, this improvement was achieved under the condition that the memory and the test samples belong to the same domain. In real-world scenarios, user input is generally very diverse, and we cannot predict its domain in advance to construct an in-domain structured memory. Under such circumstances, I doubt we could even achieve the 2% improvement.

Comment

Thank you for the thoughtful follow-up. We appreciate your raising concerns regarding the cost-benefit ratio and applicability of MeMAD in more realistic, open-domain settings.

First, regarding memory construction cost: while the experience accumulation phase does require a one-time inference cost, this is conducted entirely offline and amortized across downstream usage. In practical deployment scenarios—such as serving large user bases in domains like STEM education or legal QA—user queries tend to exhibit substantial semantic overlap. In these cases, the same structured memory can be reused across many sessions. As such, the marginal cost per query becomes negligible, while accuracy improvements remain.

Second, regarding domain generalization: we agree that this is a key challenge. Our retrieval mechanism leverages semantic similarity, and our memory structure—based on abstracted reflections rather than raw solutions—facilitates transfer. In fact, our transferability results (Table 5) demonstrate a 4% accuracy gain when applying memory from MATH500 to a structurally different target task (MMLUPro-Math). This suggests that even under distribution shift, the accumulated experiences retain substantial utility.

We recognize that generalization to truly open-domain user input remains an open problem. We appreciate the reviewer highlighting this direction and will consider more robust, adaptive memory strategies in future work. If you have any further questions or concerns, we warmly welcome continued discussion.

Comment

OK, although I still have doubts about the effectiveness of MeMAD in the general domain, the motivation of this work does make sense. I will raise my score to 6.

Comment

Thank you for your constructive feedback throughout the review process. We appreciate your time and consideration.

Review (Rating: 5)

This paper introduces MeMAD (Memory-Augmented Multi-Agent Debate), a multi-agent framework designed to improve multi-agent reasoning by systematically reusing prior debate experiences. While existing MAD approaches treat each reasoning episode independently, MeMAD captures and stores structured records of both successful and failed debates, enriched with self-reflections and peer feedback. MeMAD further retrieves semantically relevant past experiences and injects them into the agent prompts to inform future debates.

Reasons to Accept

  1. MeMAD is a new architecture that systematically accumulates and reuses structured debate experiences.
  2. MeMAD demonstrates good performance improvements across multiple complex reasoning benchmarks (e.g., MATH500, GPQA), outperforming both single-agent and multi-agent baselines with prompting.
  3. The approach shows strong cross-task transferability, indicating practical utility when compared with existing costly multi-agent works.

Reasons to Reject

  1. Lacks a thorough investigation into scenarios where MeMAD fails or underperforms. Since a memory module is involved, it is unclear how robust the method is under noisy memory, domain shifts, or even adversarial examples, which may raise concerns about "fitting" to benchmark settings.
  2. While MeMAD combines debate and memory, the core components (e.g., self/peer reflection, semantic retrieval) can be viewed as adapted from prior works [including but not limited to 1, 2, 3, 4]. The novelty lies mainly in their integration rather than in new collaboration mechanisms, which may limit the perceived conceptual advancement.
  3. Although experiments include MAD and MoA, the paper omits some potential baselines from memory-augmented single-agent systems (e.g., MemGPT) that could offer similar benefits without requiring a multi-agent setup.
  4. Memory retrieval is based entirely on embedding similarity, which may introduce sensitivity to encoding quality and overlook structural aspects of reasoning. The paper does not explore or compare alternative retrieval strategies, potentially limiting robustness.

[1] Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. https://aclanthology.org/2024.emnlp-main.992/

[2] Improving Factuality and Reasoning in Language Models through Multiagent Debate. https://arxiv.org/abs/2305.14325

[3] Corex: Pushing the Boundaries of Complex Reasoning through Multi-Model Collaboration. In COLM. https://openreview.net/forum?id=7BCmIWVT0V

[4] Self-Refine: Iterative Refinement with Self-Feedback.

Questions to the Authors

Some parts of the paper, especially those concerning memory retrieval and the feedback prompts, are a little unclear.

Comment

Thank you for your detailed review. Below, we address your concerns:

R1:"Lacks a thorough investigation into scenarios where MeMAD fails or underperforms."

Our evaluation covers diverse domains to prevent "fitting" to benchmarks. Specifically, we evaluated MeMAD across four distinct domains: mathematics (MATH500), science (GPQA, covering physics, chemistry, biology), law, and economics.

We have already evaluated robustness to domain shifts. In Section 5.4 (Table 5), we demonstrate cross-task transferability by applying experiences accumulated from MATH to MMLUPro-Math. Despite significant format differences (open-ended vs. multiple-choice), MeMAD maintains a 4.0% improvement over the MAD baseline (0.765 vs. 0.725), demonstrating robust transfer capabilities.

Regarding noisy memory and adversarial examples, we recognize these as valuable areas for future exploration and intend to address them in subsequent studies.

R2: "Core components (self/peer reflection, semantic retrieval) adapted from prior works, limiting conceptual advancement."

Prior works ([1,2,3,4]) primarily optimize performance within a single debate or collaborative session without reusing accumulated experiences. For instance, Self-Refine iteratively improves answers within the same question instance, and [1] encourages divergent thinking within individual debate episodes. Corex [3] proposes Discuss, Review, and Retrieve modes.

In contrast, MeMAD is the first to systematically accumulate experiences across multiple debates and reuse them for future tasks. Our key innovation is systematically extracting, structuring, and reusing historical experiences from the MAD process.

R3: "Omission of memory-augmented single-agent baselines (e.g., MemGPT)."

We compared MeMAD against memory-augmented single-agent baselines in Appendix D.1 (Figure 3). The memory-augmented single-agent baselines utilize the same memory retrieval mechanism as MeMAD.

Results (replicated below from Appendix D.1 for clarity) show that memory augmentation alone for single agents (GPT-4o-mini + Memory) did not surpass traditional MAD performance, confirming that both memory augmentation and multi-agent debate are essential and complementary.

Regarding MemGPT specifically: MemGPT addresses the problem of limited fixed-length context windows in long conversations or document analysis by paging information between the context window and external storage. Therefore, MemGPT may not be directly suited to our complex reasoning tasks.

Method                      MATH500  GPQA   Law    Economics  Avg
GPT-4o-mini                 0.485    0.379  0.395  0.683      0.486
GPT-4o-mini + Memory        0.530    0.399  0.433  0.720      0.521
MAD                         0.552    0.409  0.425  0.701      0.522
MAD + Memory (MeMAD, ours)  0.590    0.460  0.440  0.777      0.568

R4: "Memory retrieval solely based on embedding similarity, limiting robustness."

We conducted empirical analyses to evaluate the robustness and generality of our retrieval mechanism in Section 5.3 (Detailed Analysis and Ablations of MeMAD).

  1. Encoding methods (Table 3): We compared four encoding methods across all datasets: token-overlap-based BM25 and three embedding-based methods (bge-m3, nomic-embed-text, mxbai-embed-large). MeMAD consistently improves performance with all encoding methods compared to other baselines, with embedding-based approaches outperforming BM25. This demonstrates robustness regarding encoding choices.
  2. Retrieval strategies (Table 4): On the Economics dataset, we analyzed different selection strategies (Random, Similarity, Diversity) across different memory configurations ($\mathcal{M}^+$, $\mathcal{M}^-$, and $\mathcal{M}^+ \cup \mathcal{M}^-$). All retrieval strategies outperform the MAD baseline. This comprehensive evaluation demonstrates that memory retrieval benefits are robust across various strategies.

The consistent improvements across all encoding methods and retrieval strategies validate the robustness of our approach, addressing concerns regarding sensitivity to specific retrieval implementations.
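For concreteness, here is a minimal sketch of the three selection strategies over a bank of experience embeddings; the greedy MMR-style trade-off used for Diversity is our assumption about one reasonable implementation, not the paper's exact procedure:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select(query_vec, bank_vecs, k=3, strategy="similarity", rng=None):
    """Pick k experiences from the bank (assumes len(bank_vecs) >= k)."""
    sims = np.array([cosine(query_vec, v) for v in bank_vecs])
    if strategy == "random":
        rng = rng or np.random.default_rng(0)
        return list(rng.choice(len(bank_vecs), size=k, replace=False))
    if strategy == "similarity":
        # Top-k by cosine similarity to the query, most similar first.
        return list(np.argsort(sims)[-k:][::-1])
    if strategy == "diversity":
        # Greedy MMR-style pick: trade off relevance to the query against
        # redundancy with already-selected items (weight 0.5 each).
        chosen = [int(np.argmax(sims))]
        while len(chosen) < k:
            def score(i):
                redundancy = max(cosine(bank_vecs[i], bank_vecs[j]) for j in chosen)
                return 0.5 * sims[i] - 0.5 * redundancy
            rest = [i for i in range(len(bank_vecs)) if i not in chosen]
            chosen.append(max(rest, key=score))
        return chosen
    raise ValueError(f"unknown strategy: {strategy}")
```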

Q1: "Some part of the paper, especially in the part about memory retrieval and feedback prompts is a little bit unclear"

The memory retrieval process (Section 4.2) utilizes semantic similarity to identify relevant experiences, with detailed steps provided in the section. The feedback prompt templates are fully specified in Appendix E.1 (Tables 8-9). We are happy to clarify any specific aspects that remain unclear.

We hope these clarifications address your concerns.

Comment

Thank you for your response. Your rebuttal partially addresses some of my concerns, but I still have doubts about the novelty discussed in R2 (core components). The concept of "systematically extracting, structuring, and reusing historical experiences" essentially captures the essence of the works I listed (including but not limited to them). I think the authors should at least cover the mentioned representative works in the revision. In light of your rebuttal, I will raise my rating from 4 to 5.

Comment

Thank you for your feedback and reassessment during the review process.

We have already included discussions on relevant works such as Encouraging Divergent Thinking [1], Improving Factuality (which corresponds to MAD) [2], and Self-Refine [4] in the Introduction and Related Work sections of our paper. In our revision, we will further incorporate and explicitly discuss Corex [3] and other representative works, clearly articulating the distinctions and connections between these existing approaches and our contributions.

We appreciate your suggestions.

Review (Rating: 7)

The paper introduces MeMAD (Memory-Augmented Multi-Agent Debate), an enhancement of the Multi-Agent Debate (MAD) framework that allows agents to learn from past experiences without parameter updates by storing and retrieving relevant structured information accumulated while answering previous questions. Specifically, both self-reflections (generated by an agent itself) and peer reflections (generated for the agent by other agents) are collected, and each agent leverages information from both successful and unsuccessful experiences via similarity-based retrieval.

Experiments over several datasets that require complex reasoning show consistent improvements over single-agent and multi-agent baselines, using GPT-4o-mini as the agent model. Additional experiments validate the contribution of each component of the proposed method. The learned experiences are also shown to be transferable between different tasks and to benefit stronger models (GPT-4o and DeepSeek-V3).

The paper is very well written. The formal notations used in the paper make it easy to follow. The experiments are thorough and the experimental results are convincing.

Reasons to Accept

The paper makes a valuable enhancement to the MAD framework, which is shown to improve performance on complex reasoning tasks. The proposed method may be applicable to a wide range of tasks.

Reasons to Reject

NA

Comment

Thank you for your review and positive feedback. Your positive comments are encouraging to us.

Comment

Thank you

Final Decision

This paper proposes a structured memory for multi-agent debate. Reviewers appreciated the clarity of the writing as well as the empirical effectiveness of the results. Reviewers had some concerns about the extent to which the approach differs from prior work in the multi-agent debate area. After reading the paper, the AC found the approach to be significantly different from prior works; it is an interesting idea that makes for a good poster at the conference.