5.5

/10

Poster4 位审稿人

最低2最高4标准差0.9

3.3

置信度

创新性2.8

质量2.8

清晰度2.5

重要性2.3

NeurIPS 2025

Can Dependencies Induced by LLM-Agent Workflows Be Trusted?

Yu Yao,Yiliao Song,Yian Xie,Mengdan Fan,Mingyu Guo,Tongliang Liu

OpenReview PDF

提交: 2025-05-08更新: 2025-10-29

TL;DR

This paper presents a dynamic framework that enables reliable execution under violated conditional independence assumptions.

摘要

关键词

Multi-LLM-Agent System

评审与讨论

审稿意见

评分: 2置信度: 42025-07-01

The authors delve into the root cause of inter-agent misalignment and argue that it stems from the failure of the “conditional independence” assumption in existing workflow designs. Building on this observation, they introduce SEQCV, a dynamic framework that tackles the problem through sequential execution, cross-verification, and recursive decomposition.

优缺点分析

Strength

This paper is generally in a good structure and easy to follow.
This paper demonstrated performance gains on several benchmarks, suggesting its effectiveness.

Weakness

Token cost would be high. Every trajectory and answer is verified, which appears to incur excessive token costs and risks exceeding the model’s context window. Even for easy query such as 1+1, the workflow method still needs to decompose and verify, which introduces unnecessary cost.
The novelty of the proposed method is unclear. Currently, it seems like another one manually-designed workflow.
The paper does not explain how tasks are decomposed.
No ablation study: subsequent experiments do not directly validate whether inter-agent misalignment has been solved, making the link between the stated problem and the experiments unclear.
The framework’s reliance on strictly sequential execution would make the workflow slow during inference.

问题

局限性

yes

格式问题

作者回复

2025-07-31

Dear Reviewer QP7F,

Thank you very much for your detailed review and constructive comments.

We have carefully followed your suggestions and added comprehensive ablation studies to the revised version.

After receiving your feedback, we recognize that we should have placed greater emphasis on key content in the appendix and highlighted important claims more clearly. In response, we have revised the paper to include more detailed explanations and discussions to avoid further confusion.

Q1. For easy queries such as 1+1, the workflow method still needs to decompose and verify

This will not be decomposed, as mentioned in line 111 under the advantage: "We only split a task when this cross-model consensus indicates that splitting is necessary to resolve disagreement."

We have followed your comments and summarized the decomposition rates on benchmark datasets. Using our method with gpt-4o-mini, gpt-4.1-mini, and gpt-4.1-nano together, the results are shown in response to Q6 of Reviewer QP7F.

Thank you so much for helping improve our paper.

Q2. The paper does not explain how tasks are decomposed

The Task Decompose Prompt and task decomposition template are shown in Appendix F.

Note that this only happens when the answer provided by multiple LLMs on the same task is very inconsistent (50% disagreement). If it is a simple task, it will not be decomposed.

To make it clearer, we have provided generated workflows in our revised paper, see one as follows

{
  "subtasks": [
    {
      "id": 0,
      "objective": "Implement a Maze class that loads and stores the maze layout, including wall and path definitions, and provides collision detection.",
      "depends_on": []
    },
    {
      "id": 1,
      "objective": "Implement a Pellet class defining pellet properties and a draw method for rendering on the maze.",
      "depends_on": []
    },
#### .... for (saving spaces) ....
    {
      "id": 5,
      "objective": "Implement a Tank class handling AI movement (patrol and chase), collision with Maze, shooting persistent bullets, and absorbing bullets to increase firing rate.",  
      "depends_on": [0, 2]
    },
    {
      "id": 6,
      "objective": "Implement a Game class that initializes Pygame, loads the Maze, instantiates Pellet, Bullet, SpecialItem, Player, and Tank objects, and manages overall game state and scoring.",
      "depends_on": [0, 1, 2, 3, 4, 5]
    },
    {
      "id": 7,
      "objective": "Implement the main game loop within the Game class to handle event processing, update all game objects, perform collision checks, render everything each frame, and enforce a consistent frame rate.",
      "depends_on": [6]
    }
  ]
}

Q3. The novelty is unclear. Currently, it seems like another manually-designed workflow

Response to Manual Workflow Concern

Thank you so much for the insightful question. We do not use a manually designed workflow. We believe there may be some confusion. Any system requires a basic workflow definition—this is true even for the simplest setup, such as passing user input to a GPT-4o model to produce output. In that sense, all systems have some form of defined workflow to ensure functionality.

As discussed in Q2, our method does not rely on a fixed, manually designed workflow. Instead, it allows large language models (LLMs) to generate task-specific workflows only when necessary, based on disagreement detection. This dynamic and adaptive workflow generation is a key feature of our approach.

To avoid further confusion, we have revised our paper to clarify the notion of "workflow" and emphasize the distinction between fixed/manual workflows and dynamically generated ones.

Novelty Justification

We found that we should emphasize the contributions and perspectives we want to provide more explicitly. We have carefully added the following to our revised version. Please do not hesitate to let us know where further improvements are needed. Thanks for your time and effort.

Theoretical Formulation and Identification of Inter-Agent Misalignment We provide a systematic formulation of the data-generation process in multi-agent systems using probabilistic graphical models. This formulation enables a rigorous understanding of inter-agent misalignment by showing that it arises from violations of conditional-independence assumptions.
Key Insight and Real-World Impact
- We emphasize that conditioning each subtask on the full history of verified outputs is crucial for inter-agent alignment.
- In some cases, it is necessary to sacrifice parallelism in favor of sequential execution to ensure reliable task execution.
- This insight is particularly important given that DAG-based decomposition is already used in industry AI agent systems. For example, AWS KIRO IDE precisely decomposes a user's coding requirement into a DAG and executes it.
An Effective Solution to Inter-Agent Misalignment Issues Our method provides a framework for mitigating inter-agent misalignment across diverse domains. The same mechanism works for mathematical reasoning (GSM8K: 96.3%), complex multi-hop tasks (HotpotQA: 83.5% F1), and novel code generation scenarios.
Strong Empirical Results on Novel Tasks We evaluate our method in compositional game environments where models must combine rules from different games (e.g., Pac-Man + Tank mechanics, Snake + RPG elements). These hybrid tasks require true reasoning rather than relying on memorization or shallow pattern matching, since these game combinations have not been designed before.
Efficiency Gains Most remarkably, our sequential approach achieves a 2.5× speedup over Flow (665 s → 217 s) and a 2.1× speedup over Atom (523 s → 217 s) while maintaining superior quality.
Other Novel Contributions That May Inspire Future Work
- Investigating how to leverage different models and their differences could be important and warrant further study.
- An agent system should split workflows only when a task is difficult, not at the outset. In our paper, we use agreement across different models as a surrogate for task difficulty.
- Segment-based validation for early stopping can be useful for saving tokens.

Q4. Token cost would be high, there are risks exceeding the model's context window

Thank you for the insightful comments. The output token cost for verification can be ignored, as the output token is minimal, typically just a single token such as “correct” or “wrong.”

The main contributors to token cost are:

The number of input tokens required for verification (i.e., providing the candidate output and relevant context), and
The number of LLMs used to perform cross-verification.

For example, in our current setup, each verification involves three models verifying each other’s outputs, requiring six parallel verification calls. While this increases input token usage, the output remains negligible. Note that input tokens are much cheaper than output tokens in terms of cash cost (typically 1/4 to 1/5). We have carefully included a limitations section that covers this information.

We believe this trade-off is acceptable. As shown in the response to Q1 of Reviewer asVu, introducing cross-validation with multiple models improves performance by 20.4%. We believe this performance gain justifies the modest additional cost. Furthermore, this multi-model, same-task paradigm is gaining traction in the community. For example, the recent open-source project MassGen has already attracted 1.7k Discord users.

For the risk of the model's context window, we acknowledge this limitation but this should not be a major one for many real-word tasks. As current LLMs (e.g., GPT-4o mini) can have a 128,000 token context window which is approximately 96,000 words or around 17,000 lines of code. For Gemini API, it can even allow 50,000 lines of code.

Q5. The framework's reliance on sequential execution would make the workflow slow during inference

Table 10 in Appendix B (page 26) provides runtime comparisons across all tasks.

It is worth mentioning that:

Although SeqCV is slower than the single-agent baseline, it significantly outperforms it in generation quality without using the "high" version of the GPT model. Moreover, it is 2.5× faster than Flow and 2.1× faster than Atom, while achieving better output quality.
While AFlow appears faster at 160.86 s, this is because it generates sparse results (see demo links in our appendix).

The efficiency gains stem from our advanced system design:

We have implemented segment-based cross-validation to verify intermediate outputs before task completion, avoiding unnecessary computation.
We split tasks only when they are genuinely difficult.
Existing methods rely on an additional "summary" agent to integrate subtask outputs by consuming and aggregating the entire dialogue history (e.g., summary.py in the FLOW GitHub repository). By introducing a mandatory sequential post-processing step, they reduce the system to two stages: a parallel stage followed by a sequential stage. Our method directly concatenates the generated results, eliminating the need for a separate summary module in a subsequent stage.

Q6. No ablation study: subsequent experiments do not directly validate whether inter-agent misalignment has been solved

By following your comments, we conducted control experiments using parallel and sequential runs with 4o-mini.

Note that directly measuring inter-agent misalignment is challenging, as it generally requires manual inspection of all the generated answers. To address this, we adopt quantitative performance as a proxy, which is based on the assumption that fixing inter-agent misalignment leads to more coherent and reliable final outputs.

Due to space constraints, the both results are presented in our response to Q1 of Reviewer asVu.

2025-08-09

Dear Reviewer QP7F,

Thank you for having taken your time to provide us with your valuable comments, which help improves our paper. This is a gentle reminder that the discussion period is nearing its conclusion. If you have any additional questions or concerns, please let us know so we can resolve them before the discussion period concludes.

Thank you

The Authors

2025-08-09

Thanks for the responses.

Regarding Q1&Q4, for such simple query, the proposed system will still generate several solutions to check cross-model consensus. In other words, my major concern is that the whole system seems "heavy", repeated/crossed verification/decomposition would incur much unnecessary cost. Meanwhile, given that the proposed system incurs much token cost, I can not find a token comparison in the paper.

Regarding Q3, I am not fully convinced. Your motivation is the misalignment issues, what are the specific functions of different components of your system. In other words, I would like to see more insights on "which part contribute to what issue". It not convincing for me if you think that your design cross-verification solves misalignment because such rationale could be seen from several papers such as multi-agent debate. So, the design rationale should be clearly stated, otherwise, there could be hundreds of design choices for misalignment issue, and why your current design is the best?

Regarding Q6, since you are targeting the issue of misalignment, I would like to see metrics on misalignment. Rather than just performance metric. The accuracy metric could be resulted by many factors, it should be verified that the performance gain actually comes from the reduced misalignment.

2025-08-07

Dear Reviewer QP7F,

Thank you for your constructive feedback.

After reviewing your comments, we realized that some misunderstandings (Q1, Q2, Q3, Q5) may have stemmed from our insufficient emphasis in the main paper. Due to space constraints, a lot of information was placed in the supplementary material. In our revised main paper, we have further emphasized that:

simple tasks are not decomposed (Q1),
the workflow is not manually designed (Q2),
how task decomposition works (Q3), and
runtime comparisons (Q5).

We would very much appreciate it if the Reviewer could double-check the results and demo links in the supplementary material.

More importantly, in light of your suggestions, we have improved our paper by discussing token cost and context window usage, adding ablation studies, and making our contributions more explicit with quantitative results and the immediate real-world relevance of our work to the latest industrial agent systems (e.g., AWS’s KIRO IDE).

Note that this is a gentle reminder that the discussion period will conclude in 2 days. If there are any remaining questions, please don’t hesitate to let us know so that we can address them before the discussion ends. If you believe that our responses have satisfactorily addressed your comments, we would greatly appreciate it if you could kindly consider raising your score to reflect that the issues have been resolved.

All the best wishes,

The Authors

评论- Followup responses from the authors

2025-08-09

Dear Reviewer QP7F

We sincerely thanks for your quick response~

Q1 thorough comparison on the token cost

Thank you for your valuable suggestions. We are adding more experiments for the cost of both input and output tokens.
This should not be a major reason for rejecting this paper considering 1) the above examples we provided clearly show that the concern about our cost being too high does not hold 2) 0.5 runnig time compared to existing system and 3) dramatic improvement shown in the appendix and demo (Appendix A, line 350).
We kindly ask the reviewer to refer to the demo links (Appendix A, line 350) in the appendix to view the impressive performance.
As this is the last few hours for the rebuttal, we do not have sufficient time to show cost of both input and output tokens here, they will be added into the revised version to further rich our evaluation. We sincerely appreciate your understanding.

Q2 turning a parallel manner into a sequential one should be regarded as a novel and interesting idea.

Note that our contribution more than a change of manner of execution. More importantly, at present, many believe that a parallel manner is superior. However, this assumption can be inherently biased.
Our findings indicate that for some use cases (performance and safty as the first priority), a sequential design has to be used, and it can be more effective than a completely parallel design.
These findings are suppored by both theoretical explanation and empirical evidence.
Why might this be the case? As we pointed out at the beginning, the parallel manner based on graph structures can lead to inconsistency in the intermediate agent outputs, which in turn causes misalignment. To the best of our knowledge, no prior study has explicitly investigated this as a leading cause, which we believe highlights the contribution of our work.
Additionally, to improve the running time efficiency, many other designs (different models on the same tasks, early validation, splitting tasks only when necessary, etc.) are carefully implemented, making our sequential framework at least 2 times faster than the SOTA parallel method.

Q3 Accuracy is not a good measure for misalignment

As mentioned before, we have strictly controlled variables and ensured that the only change is from sequential to parallel. For this ablation, the only change is the context provided to the task (conditional on all prior output). If there are any parts overlooked by us, please kindly let us know.
We clearly explained that to directly measure the misalignment, human evaluation is required to review the results one by one, which is not scalable and may not be objective.
Even after reviewing the recent literature, we did not find any direct indicators for evaluating such misalignment. If you are aware of any, we will definitely incorporate them.

Best wishes,

The authors

评论- Followup responses from the authors

2025-08-09

Dear Reviewer QP7F，

Thanks for these insightful comments and the opportunity to provide further explanation.

We believe that these comments stem from a misunderstanding or from differences in background.

We have carefully emphasized the following points in the revision to ensure that readers from diverse backgrounds can clearly understand why our method can solve inner-agent misalignment by including more explanation, ablation, and examples.

Please do not hesitate to leave any comments before the discussion concludes. It is great to communicate and understand each other’s perspectives.

Followup Q1: Q1 & Q4 (Overhead for Simple Queries)

Fact: For simple queries (e.g., “1+1” you mentioned), our system does not trigger task decomposition and does not perform repeated cross-model consensus checks.
Cost: In a trivial case using two models, the only overhead is one extra model output plus six tokens for voting, 7 extra tokens in total.
Scope: Different systems have different purposes. Our system is not designed for such trivial queries.
- Just as you wouldn’t use a quantum computer to calculate “1+1” or take a plane to travel 15 km.
- It is designed for complex, dependency-rich tasks (e.g., improving DAG-based industrial systems such as AWS KIRO).
- In real-world applications, no single agent system can handle all task types optimally — different systems should work together for different types of problems.

Followup Q2: Cross-verification is not our misalignment fix

You are totally correct about cross-verification cannot solve misalignment. To boost running time efficiency, we use it. Spcifically we early cross-verification on partial subtask outputs to avoid wasting computation when an early failure is detected.

1. Our solution for solving misalignment: sequential conditioning

The core mechanism for addressing inner-agent misalignment is to avoid trust in the dependency graph and instead use sequential execution conditioned on all prior outputs.

Example: Task — “Deliver a reinforcement learning lecture” Existing DAG-based agent system (e.g., FlOW) usually generate workflow like below

task 1: Overall structure → task 2: Introduce Q-Learning
task 1: Overall structure → task 3: Introduce Deep Q-Learning

By trusting in the dependency graph, existing mtehods run Introduce Q-Learning and Introduce Deep Q-Learning in parallel once Overall structure is done.
However, this can lead to notation or terminology inconsistency, because Introduce Q-Learning might define notations that Deep Q-Learning should use.
This means that inner-agent misalignment happens

Our method:

Run Overall structure first.
Then run Introduce Q-Learning.
Finally, run Introduce Deep Q-Learning, but condition its execution on the outputs of both previous steps.
Then, this give model the abiltiy to reuse the notation or terminology in Q-Learning when gernating Deep Q-Learning
This is why the inner-agent misalignment can be solved.

2. Runtime Efficiency improvement for sequential running

Early cross-verification checks partial outputs before a subtask is fully completed.
Recursive splitting is triggered only when a validation fails, to avoid unnecessary decomposition.

Followup Q3: metric for reduced misalignment

We fully agree that accuracy gain alone does not necessarily imply reduced misalignment in general.

To address this, we conducted strictly controlled experiments to isolate the effect of misalignment reduction. Specifically:

We took the exact same workflow generated by an existing parallel method.
We replaced only the parallel execution with our sequential conditioning mechanism (as described above).
We removed cross-verification and recursive splitting so that no other components influenced the result.

Under these conditions, the only difference between the two systems was whether subtasks were executed in parallel or sequentially conditioned on all prior outputs. Therefore, any improvement in accuracy can be directly attributed to reduced inter-agent misalignment.

Configuration	Score	Δ vs. Parallel
Parallel (baseline)	52.3%	–
+ Sequential Only	58.4%	+6.1%

This controlled setting ensures that the measured improvement is a direct consequence of addressing inter-agent misalignment caused by trusting graph dependencies, achieved through sequential conditioning.

Many thanks, The authors

2025-08-09

Thanks for the response.

Q1: You should provide a thorough comparison on the token cost (both input and output). Scaling compute to get better results is not a surprise. Such a comparison is always needed even if you think that your method does not cost too much.

Q2: Yes, I agree with you that this would somehow resolve the misalignment. But this is so intuitive that I cannot agree with you that turning a parallel manner into a sequential one should be regarded as a novel and interesting idea.

Q3: This experiment shows that sequential manner is better than parallel manner. But still, there is a gap. So many things will be affected by this manner transition. Don’t you think that a more direct indicator is beneficial to show that your method indeed solves the issue of misalignment and thus improves the performance? Since this is your key motivation, I believe this metric is required.

I do not mean to reject your paper. But right now, I cannot persuade myself to raise the score. Correct me if anything is wrong.

审稿意见

评分: 4置信度: 42025-07-03

This paper introduces SEQCV, a dynamic framework designed to address the issue of inter-agent misalignment in LLM-agent systems. The authors identify the violation of conditional independence assumptions—where subtask responses are assumed to be reliable and dependent only on parent responses—as the root cause of these misalignments. SEQCV tackles this by executing subtasks sequentially, conditioning each on all prior responses, and immediately verifying them through consistency checks across diverse LLM models. If a response is deemed unreliable, a recursive splitting mechanism breaks down the subtask into smaller components. Experiments demonstrate that SEQCV improves accuracy by up to 17.3% and reduces execution time by more than half on complex tasks compared to existing methods.

优缺点分析

Strengths

The paper pinpoints the violation of conditional independence assumptions in LLM-agent workflows as a core issue leading to inter-agent misalignment, which significantly impacts quality and runtime efficiency. This highlights a fundamental challenge in current LLM-agent systems.
SEQCV introduces a dynamic framework that ensures reliable execution despite violated conditional independence assumptions. It achieves this through sequential subtask execution, conditioning on prior responses, consistency checks, and a recursive splitting mechanism for unreliable outputs.
SEQCV shows substantial gains in both accuracy and efficiency across various tasks and datasets, including mathematical reasoning, knowledge-intensive reasoning, logical reasoning, and multi-hop reasoning. It improves accuracy by up to 17.3% and reduces execution time by more than half on complex tasks.

Weaknesses

The paper does not provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments. This hinders reproducibility for other researchers.
The evaluation relies on an LLM (gpt-4.1-nano) to determine whether predictions match ground truths and calculate accuracy. While this can be efficient, it introduces a potential bias or limitation if the evaluating LLM itself has errors or biases, or if its "ground truth" assessment isn't truly objective.
SEQCV's verification step relies on "semantically consistent across diverse LLM models" and "peer-reviewed by agents of other LLM models". While the paper mentions using "gpt-4o-mini, gpt-4.1-mini, and gpt-4.1-nano as the core models" for evaluation and "three gpt-4o-mini models" or "gpt-4o-mini, o4-mini and o3-mini" as backbone models, it doesn't explicitly define what constitutes "diverse" enough models to ensure robust cross-model validation. The effectiveness of this mechanism heavily depends on the actual diversity and complementary strengths of the chosen LLMs, which isn't thoroughly explored.
While the paper claims SEQCV "avoids costly misalignment corrections and delivers higher effective throughput than parallel pipelines" , the fundamental sequential nature of subtask execution and the consistency checks at each token sequence checkpoint inherently introduce overhead. Although it aims to avoid costly misalignment corrections, the frequent verification and potential recursive splitting could still lead to higher overall latency for simple tasks where misalignment is less likely. The efficiency gains might primarily be realized on complex tasks prone to significant misalignment errors in parallel setups.

问题

How is "semantic consistency" precisely defined and measured during the cross-model validation process, especially given that LLMs may generate varied but equally valid responses for certain subtasks?
What is the practical impact of the "maximum recursion depth" in the recursive splitting mechanism? Does reaching this limit imply a failure to solve the subtask, and if so, how does SEQCV handle such scenarios and their impact on the overall task objective?
Could the paper provide a more detailed breakdown of the latency implications of SEQCV's sequential execution and frequent verification steps across different task complexities, perhaps with a comparison to the latency of re-execution after misalignment in parallel approaches?

局限性

Yes.

最终评判理由

I've confirmed the author's rebuttal, and I will keep my original score.

格式问题

N/A

作者回复

2025-07-31

Dear Reviewer KXbZ,

Thank you so much for your postive support and insightful comments! We follow your comments and make clarification as follows.

Q1. Information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments

Our experiments require minimal local computational resources, as they primarily involve calling LLM APIs. Any standard laptop with a stable internet connection is sufficient.

For reference, the hardware configuration we used is:

13th Gen Intel(R) Core(TM) i7-13700H @ 2.40 GHz
16.0 GB RAM (15.7 GB usable)

Q2. The evaluation relies on an LLM to determine whether predictions match ground truths and calculate accuracy

Thanks for this construtive comments.

We have included F1 scores and Hit rate in the appendix E.
The LLM-based evaluation is more suitable for reflecting accuracy because LLM-generated answers often contain formatting variations or additional context.

For example:

Multiple choice problems: The model might answer "answer: A", "A.", or "A" - all correct but requiring semantic understanding
Q&A questions: Answers can be more complex with equivalent but differently phrased responses

Importantly, we use the LLM evaluator with great care. Each question–answer pair is evaluated independently. Specifically, one questions at a time via API calls to determine whether the answer is correct or not. In this way, context length and potential dependencies across questions do not interfere with the judgment. We then manually aggregate the results to compute accuracy.

Q3. The effectiveness of this mechanism heavily depends on the actual diversity and complementary strengths of the chosen LLMs, which isn't thoroughly explored

Thank you for raising this important point. You are correct that diversity alone is not sufficient when selecting backbone LLMs.

Our selection principle: We believe model choice should be task-specific. For a given task, we aim to select different models that (1) achieve similar accuracy (i.e., no significantly weaker model), and (2) are trained on different datasets to ensure diversity.

Practical implementation: Since task-specific accuracy assessment remains an open challenge, in practice, we select models that are publicly accessible via API and demonstrate similar overall performance—for example, gpt-4o-mini, o3-mini, and o4-mini.

Q4. The fundamental sequential nature of subtask execution and the consistency checks at each token sequence checkpoint inherently introduce overhead

While sequential execution does introduce some overhead, our design includes several optimizations that actually make it faster than existing parallel methods:

Efficiency of consistency checks:

The check takes approximately 2 seconds (also depending on network speed), as we only require models to output a single token ("yes" or "no"), making the validation process very lightweight.
More importantly, all cross-model checks are independent and run in parallel. As a result, checking one answer takes roughly the same amount of time as checking multiple answers.

System optimizations: We have implemented several improvements for sequential execution:

Segment-based validation: Early validation allows immediate error detection and prevents wasted computation
Elimination of summary bottlenecks: Unlike existing methods that require a sequential "summary agent" stage after parallel execution, we directly concatenate results
Adaptive splitting: We only decompose tasks when necessary (when consensus fails), avoiding unnecessary overhead

Empirical results: Despite the sequential nature, our method achieves faster runtime than existing parallel methods shown in Appendix B:

SeqCV: 217.14s (our method)
AFlow: 160.86s (generates fewer tokens)
Flow: 665.00s (2.5× slower than ours)
Atom: 522.57s (2.1× slower than ours)

The reason AFlow is faster is simiply because it generates significantly fewer tokens.

Q5. Could the paper provide a more detailed breakdown of the latency implications of SeqCV's sequential execution and frequent verification steps across different task complexities, perhaps with a comparison to the latency of re-execution after misalignment in parallel approaches?

Direct comparison with parallel methods' re-execution latency after misalignment is challenging because existing parallel methods cannot identify when misalignment occurs.
Hard to identify when misalignment occurs is precisely our key motivation for designing a sequential approach that directly reduces inner misalignment issues rather than attempting to detect and correct them post-hoc.

As detailed in Appendix B (and noted in our response to Q4), our method runs in 217.14 seconds, which is faster than several parallel baselines.

As explained earlier, each verification step is very lightweight, requiring approximately 2 seconds, since it only involves generating a single token ("yes" or "no") in parallel across models.

Runtime breakdown: In our experiments, we observe that about 24.7 seconds are spent on verification, each taking1.5 to 2.5 seconds depending on network conditions. The remaining ~192.4 seconds are used for sequential LLM execution.

Component Total Time (s)
Sequential LLM execution ~192.4
Verification ~24.7
Total Runtime (SeqCV) 217.1
It is worth mentioning that the main bottleneck in the verification step is network latency and API response time. The results can vary significantly depending on network conditions.

Component	Total Time (s)
Sequential LLM execution	~192.4
Verification	~24.7
Total Runtime (SeqCV)	217.1

Q6. Potential recursive splitting could still lead to higher overall latency for simple tasks where misalignment is less likely

Thank you so much for the important comments. For simple tasks, recursive splitting is rarely triggered, as our method only performs decomposition when necessary.

We have followed your comments and summarized the decomposition rates on benchmark datasets. Using our method with gpt-4o-mini, gpt-4.1-mini, and gpt-4.1-nano together, the results are as follows:

GSM8K (math reasoning): 4%
MATH (complex mathematics): 6%
HotpotQA (multi-hop reasoning): 6%
LongBench (long-context reasoning): 7%

Q7. How to check if output sequence is "semantic consistent" for cross-model validation process

We apologize for the confusion, this was a writing issue. To avoid further misunderstanding, we have removed the term "semantic consistent." The checking is performed through a structured multi-model voting process.

For each result segment generated by an LLM, we prompt multiple models to independently vote yes or no on three specific criteria (see Appendix F for the prompt):
- Logical continuation: Does the current segment logically continue the task as described?
- Error-free quality: Is it free from critical errors, omissions, or contradictions?
- Sufficiency for progression: Is it sufficient to serve as the basis for the next iteration?
Consensus mechanism: If the majority (or unanimous) vote is yes for all three questions, we treat the segment as semantically consistent and continue generation without splitting. This approach captures semantic coherence beyond simple syntactic matching.

Q8. What is the practical impact of the "maximum recursion depth" in the recursive splitting mechanism? Does reaching this limit imply a failure to solve the subtask, and if so, how does SeqCV handle such scenarios and their impact on the overall task objective?

In our experiments, we set the maximum recursion depth to 3. This limit serves as a practical safeguard against infinite decomposition while allowing sufficient depth for most complex tasks.

Empirical frequency:

Benchmark datasets: None of the standard benchmark tasks reached the maximum recursion depth
Creative tasks (Appendix A): Only 2 out of 35 challenging creative tasks exceeded the limit

Failure handling: If the maximum recursion depth is reached without consensus, our system marks the subtask as failed, then uses a single model to execute the task and ensembles the result. This approach prevents system deadlock while maintaining minimal functionality.

2025-08-07

Dear Reviewer KXbZ,

Thank you again for your positive support and for helping improve our paper. We are sorry about the annoying message.

This is a gentle reminder that the discussion period will conclude in 2 days. If there are any remaining questions, please don’t hesitate to let us know so that we can address them before the discussion ends.

Many thanks,

The Authors

2025-08-03

Dear Reviewer KXbZ,

Thank you for taking the time to provide us with your valuable comments, which have greatly helped improve our paper. We have followed your suggestions, including providing an explanation of why we use LLM-based evaluation, adding results for recursive splitting, and detailing the timing for verification steps. All responses have been carefully incorporated into our revised version.

This is a gentle reminder that the discussion period will conclude in three days. If you have any additional questions or concerns, please let us know so that we can address them before the discussion period ends. If you feel our responses have satisfactorily addressed your concerns, we would greatly appreciate it if you could raise your score to reflect that the issues have been resolved.

All the best,

The authors

审稿意见

评分: 4置信度: 22025-07-03

The paper shows that graph‑based multi‑LLM workflows break down because subtasks’ outputs are not conditionally independent; this causes style, reasoning, or objective misalignment among agents. The authors propose SEQCV, which executes subtasks sequentially, conditioning each on all verified history. In add ition, it performs segment‑level cross‑model voting to accept or discard partial outputs, and recursively splits a failing subtask into simpler ones. Across six benchmarks, SEQCV improves accuracy by 3‑4pp and cuts runtime on complex tasks despite the sequential flow.

优缺点分析

strengths:

Solid empirical gains; diverse tasks; transparent methodology and clear figures, pseudocode, and motivation.
Addresses a real reliability gap in agent systems; technique is broadly applicable.
Fine‑grained cross‑model verification 、 dynamic splitting is novel.

Weaknesses:

Accuracy gains are moderate; sequential flow may bottleneck very wide graphs.
Relies on established ideas (voting, recursion) stitched together.

问题

How much do sequential execution, cross‑validation, and recursive splitting each contribute?
Can independent branches of a DAG run in parallel without re‑introducing misalignment?
What exact criterion defines “agreement,” and how do you guard against unanimous but wrong answers?

局限性

SEQCV requires an acyclic task graph and may struggle with feedback loops .

格式问题

作者回复

2025-07-31

Dear Reviewer asVu,

Thank you for your postive support and professional comments for helping improve our paper. We follow your comments and make further clarification as follows.

Q1. Accuracy gains & Ablation Study

Thank you for the insightful comments and this valuable opportunity for us to strengthen our empirical study.

For benchmarks in Table 1, the average accuracy gain is 3% compared to the best state-of-the-art agent system. We believe that the benchmark tasks do not well reflect the challenges commonly encountered in the wild. For example, the GSM8K task “These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations" can be easily answered. Therefore, applying multi-agent system does not provide a significant advantage here as a single high-performance agent with extensive chain-of-thought reasoning capabilities is enough.

These observations motivated us to design a suite of more creative and novel tasks (in Appendix A). Inspired by the Grok-2 evaluation approach, we ensure that most designed tasks are not present in the training data of LLM-agents by combining multiple rule sets (e.g., Pac-Man + Tank war), thereby requiring advanced reasoning rather than memorization.

We have followed your comments and conducted additional experiments to show quantified results for creative tasks, where our method achieves substantial improvements.

Performance Metrics We evaluate each method on seven challenging agentic tasks using three metrics:

Hard Requirements (HR): % of task-specific requirements fully satisfied
Execution Success (ES): % of runs completing without major bugs
Constraint Adherence (CA): % of runs respecting all specified constraints

Task Specifications:

NeurIPS Website: 4 requirements + 1 constraint (HTML+CSS format) onstraints (Python, no sound, no external images)
...... (for saving spaces)
Tetris + Bejeweled: 5 requirements + 2 constraints (Python, no sound)
Travel Plan: 4 requirements + 1 constraint (LaTeX format)

Total Score = (HR + ES + CA) / 3.

Table 1. Average Performance Across Methods This table is newly introduced to give overall context.

Method	HR	ES	CA	Total
AFlow	36%	30%	40%	36%
Atom	26%	40%	30%	29%
Flow	56%	70%	40%	58%
o4-mini-high	81%	100%	70%	82%
SeqCV	83%	100%	100%	88%

Key Findings

Baseline agent systems use o4-mini. Our method, SeqCV, uses both o4-mini and o3-mini together and achieves 88% accuracy. In contrast, a high-performance single model (o4-mini-high) scores 82%, which demonstrates that our method using multiple weaker models can outperform a stronger model.

We have also followed your comments and conducted ablation studies with each task run over five trials, and the best results were selected and counted for evaluation.

Table 2. Ablation Study: Component Contributions

Configuration	Score	Δ vs. Parallel
Parallel (baseline)	52.3%	–
+ Sequential Only	58.4%	+6.1%
+ SeqExec + Cross-Val	78.6%	+26.3%
+ SeqExec + Cross-Val + RecSplit (Full SeqCV)	88.0%	+35.7%

Q2. Can independent branches of a DAG run in parallel without re-introducing misalignment?

If we can fully trust the output of a branch, i.e., when the conditional independence satisfies, then it can safely run in parallel. For example, suppose we have two independent tasks, T1 and T2, both pointing to a downstream task T3 (i.e., T1 → T3 and T2 → T3). Since T1 and T2 are independent of each other, they can be executed in parallel without causing misalignment, as long as their outputs are properly synchronized before T3 begins. However, this is not a realistic assumption in the wild.

It may also worth mentioning that our method is faster than existing parallel methods:

SeqCV: 217s (our method)
AFlow: 161s (generates far fewer contents, see demos links in the appendix)
Flow: 665s (2.5× slower than ours)
Atom: 523s (2.1× slower than ours)

The speed-up comes from

Eliminates summary bottleneck: Existing methods require a sequential "summary agent" stage after parallel execution (e.g., see summary.py in Flow repository), creating a parallel→sequential workflow
Segment-level early stopping: We validate incrementally and can halt immediately when errors are detected

Q3. What exact criterion defines "agreement"?

For each result segment generated by an LLM, we prompt multiple models to independently vote yes or no on the following questions:

Does the current segment logically continue the task as described?
Is it free from critical errors, omissions, or contradictions?
Is it sufficient to serve as the basis for the next iteration?

If the majority (or unanimous) vote is yes for all three questions, we treat the segment as agreed upon and continue generation without splitting.

Q4. How do you guard against unanimous but wrong answers?

Thanks for this insightful questions. There is no silver bullet for detecting wrong outputs without human intervention. In fact, even with human intervention, it can be challenging to evaluate correctness for novel or complex questions. This should be a fundamental limitation of current AI systems. This also holds for our system: if multiple competitive LLMs unanimously agree on a flawed output, the system will unfortunately still fail.

To mitigate this risk, one strategy is to incorporate external verification and feedback mechanisms whenever possible. For example, in code generation tasks, we can write a script to automatically compile the generated code to check for any compilation errors and further modify the generated results based on those errors.

Q5. Sequential flow may bottleneck very wide graphs

Thank you for this insightful comment. we have acknowledged it to our limitations in the revised paper. It is also worth mentioning that:

Our method only fails to provide benefits in a very specific type of wide graph. Specifically, one where each independent branch (1) has no incoming edges from outside the branch (i.e., no parent nodes pointing into it), and (2) has no connections to nodes in other branches. In such a strictly isolated structure, there is little opportunity for our method to reduce inter-agent misalignment, as there is no misalignment.
In all other cases, our method offers a trade-off between maximizing reliability and improving context management, especially when branches interact or depend on shared upstream information.
It's important to note that
- context size is usually not a limitation, as current LLMs (e.g., even GPT-4o Mini) support a 128,000-token context window—equivalent to approximately 96,000 words or around 17,000 lines of code.
- Maximizing reliability can be the first priority in many domains. For example, in code generation, AWS recently introduced the KIRO IDE, which decomposes a coding task into a DAG of subtasks. In this bug-sensitive domain, conditioning on the full context and executing sequentially is often the safer choice.

Q6. Relies on established ideas

Thank you for this important question about novelty. While individual components like voting and recursion exist in prior work, our contributions lie in the systematic analysis of why existing methods fail and how to combine these techniques effectively to address an important inter-agent misalignment problem in multi-agent systems.

Our key contributions:

Systematic formulation of inter-agent misalignment: We provide the first rigorous analysis using probabilistic graphical models to show why parallel execution violates conditional independence assumptions, leading to misalignment. This theoretical foundation was missing in prior work.
Discovery of hidden dependencies: We identify that latent dependencies can propagate along paths raised by approximation of output generated by LLMs, which explains a major reason for existing inter-agent misalignment occurs.
Real-world impact: Highlighting the need to carefully handle conditional independence assumptions is crucial for future agentic system design, especially in high-stakes applications where safety and correctness are paramount. As noted in Q5, DAG-based decomposition is already used in industry systems (AWS KIRO IDE). Our insights about conditional independence violations have immediate relevance for such systems.
Novel system design innovations:

Sequential based method: propose an effective solution that conditioning on all history rather than just parent nodes, which eliminates hidden dependencies.
Segment-level cross-validation: Unlike traditional voting, we validate incrementally at the segment level, enabling early error detection and preventing wasted computational run time cost.
Adaptive recursive splitting: We split tasks only when necessary (when consensus fails), not upfront like existing methods. This adaptive approach also reduces unnecessary computational and run time cost.
Elimination of summary bottlenecks: Our design removes the need for separate summary agents that create sequential bottlenecks in parallel systems. Existing methods rely on an additional "summary" agent to integrate subtask outputs by consuming and aggregating the entire dialogue history (e.g., summary.py in the FLOW GitHub repository).

2025-08-03

Dear Reviewer asVu,

Thank you for taking the time to provide us with your valuable comments, which have greatly helped improve our paper. We have followed your suggestions, including adding additional experiments on ablation studies, discussing the immediate real-world relevance of our paper to the latest industrial agent systems (e.g., AWS’s KIRO IDE), and explaining cases involving independent branches of a DAG. All the responses have been carefully included in our revised version.

Thank you

The Authors

审稿意见

评分: 4置信度: 32025-07-07

The paper considers failures in multi-agent systems caused by inter-agent misalignments. When complex tasks are decomposed into a graph of subtasks, the system often assumes conditional independence between them. However, this assumption is frequently violated. Since LLMs produce approximate outputs without access to ground truth, hidden dependencies can emerge, leading to cascading errors. To address this, the authors propose a framework called SEQCV. It executes subtasks sequentially, conditioning each new step on the full history of previously verified outputs. SEQCV also introduces a sequential generation and cross-validation mechanism. This allows the system to selectively split complex subtasks into smaller ones. The authors demonstrate that SEQCV improves accuracy across several benchmarks. It also reduces overall execution time by catching errors early, thus avoiding expensive full-context corrections later.

优缺点分析

Strengths:

The efficiency of multi-agent LLM systems is an important and understudied research area. This paper provides an insightful study on this topic.
The paper proposes a new framework, SEQCV, to improve the reliability and efficiency of multi-agent LLM systems.
The paper validates SEQCV on six standard reasoning benchmarks (including MATH and HotpotQA) and show the proposed solution performs well. In addition, the paper includes a challenging, domain-specific (game development) task. Finally, I found the results provided in the Appendix being insightful as well.

Weaknesses:

Questionable claims and assumptions:

The claim that existing literature assumes subtasks in a DAG decomposition are conditionally independent is quite strong. You seem to draw inspiration from reinforcement learning and optimal control theory by suggesting that LLM-based workflows resemble partially observable Markov decision processes (POMDPs) rather than Markov decision processes (MDPs). However, this is not a novel insight. It is already widely accepted that LLM (or any neural model) outputs are approximations, not ground truths. Therefore, ML-based workflows are inherently non-Markovian.
You should clearly justify the differences in core assumptions. In particular, many prior works limit inter-agent input sharing not due to a belief in conditional independence, but for context management reasons. Conditioning on full dialogue history can indeed resolve non-Markovian issues, but it is memory-intensive.

Generality: The main contribution of this paper is a practical method for managing context. By conditioning on previously verified outputs, the method helps mitigate non-Markovian effects in LLM workflows. This is a useful idea. However, it is unclear how the approach would generalize to more autonomous agentic systems. In such settings, agents choose their communication partners, and the interaction graph may not be a DAG, e.g., include cycles to implement things like self-refinement.
Lack of ablation studies: SEQCV introduces several interesting components: Global-Context Construction, Subtask-Context Generation, Cross-Model Validation, and Dynamic Recursive Splitting. However, the paper lacks ablation studies. Without them, it is hard to assess the individual contribution of each module. This limits the overall evaluation and interpretability of the system.
Clarity: You should introduce and define all symbols before using them in Section 2. Otherwise, the formalism is difficult to follow.

问题

What are the latency measurements for the accuracies reported in Table 1? It would be useful to provide an accuracy-latency trade-off curve. Such a curve could illustrate performance across the different system configurations you tested.
How was cross-validation performed? Specifically, how does this methodology apply to a system where agents have distinct and interdependent roles, such as the CEO, Software Engineer, and Code Reviewer in ChatDev?
The paper uses a "consensus of LLMs" as a proxy for ground truth. What is the justification for this assumption? In particular,

How did you account for shared systemic biases given the LLMs were likely trained on similar data?
There is previous work showing that majority voting does not always scale [3]. In this light, can you better justify this method?
Were the LLM consensus results ever grounded? For example, were they validated against a benchmark created by human experts? Please specify.

The fact that LLM outputs are approximations and that errors can cascade in a sequential process is a well-understood challenge in ML-based decision processes [1,2]. Could you clarify wether and how your paper provides a new perspective when it comes to framing the problem beyond previously identified challenges, e.g., (a) LLMs are unreliable and (b) workflows need robust error handling?

[1] Sun, Chuanneng, Songjun Huang, and Dario Pompili. "LLM-Based Multi-Agent Decision-Making: Challenges and Future Directions." IEEE Robotics and Automation Letters (2025). [2] Dai, Peng, and Daniel Weld. "Artificial intelligence for artificial artificial intelligence." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 25. No. 1. 2011. [3] Chen, Lingjiao, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei Zaharia, and James Zou. "Are more llm calls all you need? towards scaling laws of compound inference systems." arXiv preprint arXiv:2403.02419 (2024).

局限性

Yes.

格式问题

None

作者回复

2025-07-31

Dear Reviewer C5z9,

We sincerely thank you for your positive support and thoughtful feedback. Your insightful comments, especially on reinforcement learning and ablation studies are very helpful in guiding our revisions to make the manuscript more accessible and relevant to a broader audience. Below, we have summarized your comments and suggestions, along with our detailed responses.

Q1. Ablation Study

Thank you for much for this important questions and efforts on helping us improve our paper. Due to limited space, please kindly refer to Q1 of Reviewer asVu for more details.

Q2. This paper draws inspiration from an RL concept, i.e., POMDPs

Thank you very much for the insightful question! We formulate the data-generating process using probabilistic graphical models, which allow to directly read off the conditional independencies. POMDPs are a specific instance of such models, which is why our insights can be naturally interpreted through POMDPs. In our revision, we have also provided analysis from the POMDP perspective in the revised paper, e.g., treating the ground truth answer as a latent state.

Q3. Clarify how your paper provides a new perspective?

Thank you so much for the opportunity to clarify our novelty. We realized the need to articulate our contributions and perspectives more explicitly. In our revised version, we have carefully emphasized the following key points. Feel free to highlight further improvements needed. We truly appreciate your time and effort.

Theoretical Formulation and Identification of Inter-Agent Misalignment Building on the analysis on underlying causes of inter-agent misalignment, we provide theoretical formulation using probabilistic graphical models to show that inter-agent misalignment arises from violations of conditional independence and can occur across three task execution patterns:

$T_2 \rightarrow T_1 \rightarrow T_3$
$T_2 \leftarrow T_1 \rightarrow T_3$
$T_2 \leftarrow T_1 \leftarrow T_3$

This theoretical foundation was missing in prior work and offers a new lens for understanding agent system failures.

Key Insight and Real-World Impact

We emphasize that conditioning each subtask on the full history of verified outputs is crucial for maintaining inter-agent alignment.
It is also necessary to sacrifice parallelism in favor of sequential execution to ensure reliability.
This insight has immediate real-world relevanceas DAG-based decompositions are already employed in industrial systems such as AWS’s KIRO IDE, which translates user coding requirements into executable DAGs. In such bug-sensitive domains, reliability must take priority.

Strong Empirical Results on Novel Tasks Beyond benchmarks, we evaluate our method in creative tasks where models must combine rules from different tasks (e.g., Pac-Man + Tank City, Snake + RPG elements). These hybrid tasks cannot be solved by memorization, as they are not in the LLM's training data. On these tasks, our method achieves substantial improvements.
Efficiency Gains Despite using sequential execution, our approach achieves a 2.5× speedup over Flow (665s → 217s) and a 2.1× speedup over Atom (523s → 217s), while maintaining higher quality outputs.
Additional Contributions with Potential to Inspire Future Work

Leveraging differences across multiple models for agreement-based validation could be a promising direction for future research.
We propose that agent workflows should be split only when the task is difficult (determined via model disagreement).
Our segment-based validation also enables early stopping, which helps reduce unnecessary token consumption.

Q4. Prior works limit inter-agent input sharing not due to a belief in conditional independence, but for context management reasons

Thank you so much for the insightful question.

We believe that prior works may not have noticed the issue that conditional independence cannot be trusted. Even they had explicitly recognized this, their system design would differ, such as only partitioning context when approaching the model's input limit. However, in their experiments, despite the context length for all tasks being far away from reaching the model's limitation, they still partition the context, which suggests an implicit and possibly unnecessary precaution.

Q5. Conditioning on full dialogue history is memory-intensive

Thank you for the valuable comment. We agree that relying on full dialogue history introduces significant memory overhead. Existing methods often depend on an additional summary agent to integrate subtask outputs by consuming and aggregating the entire dialogue history (e.g., summary.py in the FLOW GitHub repository). However, such design is not efficient as it not only increases memory consumption but also compromises the benefits of parallelism. By introducing a mandatory sequential post-processing step, it reduces the system to a parallel stage followed by a sequential bottleneck.

In contrast, our method does not require a summary agent. Instead, we perform a single, lightweight sequential pass to integrate results, which maintains efficiency and preserves the scalability advantages of parallel agent execution.

We strongly agree that minimizing reliance on full dialogue history is an important and open challenge. We are actively investigating this direction, aiming to discover patterns of dependence and conditional independence in natural language.

Q6. Latency measurements We use runtime as the metric for measuring latency, as reported in Table 10 in Appendix B (page 26). Although our method follows a sequential design, it benefits from removing the summary agent (as discussed in A4), which allows it to outperform existing methods in generation speed. Furthermore, while AFlow shows comparable speed, this is not due to the system's efficiency but rather because it generates a significantly smaller number of tokens than us.

Q7. Can the method be extended to a system having agent roles? Thank you for the question. Currently, there are two major streams in agentic systems that can work in a complementary manner:

Coordination-based streams, where agents select communication partners and assign tasks (e.g., selecting a Software Engineer to perform a specific subtask).
Content-generation-based streams, where a group of agents collaboratively generate content to complete a task (e.g., a team of Software Engineers working together to develop a solution).

Our work focuses on the second stream, which can be seamlessly embedded into coordination-based systems to support more complex inter-agent interactions. This stream already has impactful real-world applications. For instance, AWS recently introduced the KIRO IDE, which decomposes a coding task into a DAG of subtasks.

More importantly, we believe that timely highlighting the need to carefully handle the conditional independence assumption can be crucial for the future design of agentic systems, especially in high-stakes applications like coding, where safety and correctness are paramount.

Q8. How does the approach include cycles to implement things like self-refinement?

We support task refinement through recursive task splitting: when different LLMs disagree on a task's output, we rerun it by decomposing it into smaller sub-tasks.

Our method also allows to incorporate other refinements. For example, similar to Flow, we can add a task refinement module outside the DAG to support iterative improvements. However, it's worth noting that such refinement is computationally expensive. Flow with refinement requires roughly 5× more runtime than without refinement.

Q9. Justify assumption of using LLMs' consensus as ground truth This assumption is borrowed from conventional ensemble-based methods, which have been shown to improve performance. We assume that when multiple LLMs agree, the answer is likely to be correct.

Q10. There is previous work showing that majority voting does not always scale [3]. In this light, can you better justify this method?

Our method align with the claim that majority voting does not scale linearly with the number of LLMs. This insight directly informed our design: we do not use as many LLMs as possible. Instead, we use a small ensemble of three competitive LLMs (gpt-4o-mini, gpt-4.1-mini, and gpt-4.1-nano) that exhibit sufficient diversity while maintaining comparable performance. This setup balances diversity and efficiency while minimizing noise from weaker or redundant models.

Q11. How to reduce systemic biases if LLMs are trained on similar data?

There is no silver bullet for this issue, not just for LLM-based agents, but even for human collaboration. We acknowledge this as an important direction for future work.

At present, we should encourage the use of models with similar performance that are trained on different datasets. Additionally, designing prompts that encourage models to produce answers via different reasoning processes may also help mitigate shared biases.

Q12. Were the LLM consensus results ever grounded?

To the best of our knowledge, there is no annotated dataset specifically designed to evaluate the quality of LLM consensus, making direct quantification difficult.

We conducted ablation studies by removing the consensus component, which resulted in worse performance, demonstrating its effectiveness.

Q13. Notation table

We have followed your comments and have added a notation table at the start of our method section.

2025-08-07

Dear Reviewer C5z9,

Thank you again for your valuable contribution in reviewing and helping improve our paper.

This is a gentle reminder that the discussion period will conclude in two days.

If there are any remaining questions, please don’t hesitate to let us know so that we can address them before the discussion ends. If you believe that our responses have satisfactorily addressed your comments, we would greatly appreciate it if you could kindly consider raising your score to reflect that the issues have been resolved.

All the best wishes,

The Authors

2025-08-03

Dear Reviewer C5z9

Thank you for your time and effort in reviewing our paper! Thanks to your comments, our manuscript has greatly improved.

We have followed your suggestions by adding additional experiments for ablation studies, framing our approach as a POMDP, making our contributions more explicit, and discussing the immediate real-world relevance of our paper to the latest industrial agent systems (e.g., AWS’s KIRO IDE).

If you have any further questions or concerns, please kindly let us know so that we can address them before the discussion period concludes.

Sincerely,

The Authors

最终决定Accept (poster)

2025-09-17

This paper addresses a real and timely failure mode in multi-LLM agent workflows: inter-agent misalignment arising from violated conditional-independence assumptions. The paper then proposes SeqCV, a sequential, history-conditioned execution with segment-level cross-model verification and adaptive recursive splitting.
Reviewers broadly agree the problem framing is useful, the system design is clear, and the empirical evidence is persuasive: they report consistent gains on standard reasoning benchmarks and larger creative tasks, with runtimes competitive with or better than popular parallel pipelines, despite the move to sequential conditioning.

There were questions and concerns regarding experiments and clarity. The rebuttal seem to address some of the key issues, with ablations that isolate component contributions such as sequentialization, cross-validation, splitting. The discussion also resulted in addition of latency breakdowns, decomposition prompts, and reproducibility details.
Remaining concerns are mostly about scope and positioning: the novelty is largely on careful integration of well known ingredients which might be deemed incremental; reliance on LLM consensus without additional verification; and token-/context-cost accounting is still deemed thin.

After reviewing the paper and author discussion, I would discount the Reviewer QP7F's criticism as to me they largely lean on points that are either not core issues or already addressed. To list of few of these and why they are invalid concenrs:

"the workflow method still needs to decompose and verify, which introduces unnecessary cost." The method splits only when cross-model consensus indicates disagreement. That is stated verbatim.
"The novelty of the proposed method is unclear. Currently, it seems like another one manually-designed workflow." To me there seems to be enough novelty and the split is learned and applied dynamically to the DAG during execution.
"It is a manually designed workflow." The split is learned and applied dynamically to the DAG during execution. Algorithm 2 shows how the graph is updated and recursed.
"No ablation study" The paper explicitly lists investigations using cross-validation modules and workflow-decomposition modules to verify component necessity. While the deepest ablations are added in the discussion/rebuttal, the main text includes these component studies.