EnCompass: Enhancing Agent Programming with Search Over Program Execution Paths
A programming framework that compiles an LLM-based agent workflow into a search space at runtime, enabling independent experimentation with different overlaying inference-time search strategies.
Abstract
Reviews and Discussion
EnCompass is a Python-based framework for performing inference-time search on code-based agent systems. The goal of EnCompass is to enable users to easily implement different search algorithms without having to change many of the underlying details. EnCompass frames different inference-time strategies as a search over different execution paths of a nondeterministic program, which is exactly what is done in LLM tree-based agents. The paper presents many examples of how EnCompass results in more straightforward and robust code for users, especially when trying inference-time algorithms.
Strengths and Weaknesses
Strengths:
- Making LLM and agent systems more robust and easier to use is a high-impact area, since many users rely on unoptimized custom-built implementations that vary from user to user. Easier experimentation and implementation would benefit many people in the area. (Significance)
- Both the underlying observation about the connection between inference-time algorithms and probabilistic programs and the design of the system are novel. (Originality)
- The many examples comparing the original implementation and the EnCompass implementation are very useful in understanding the system. (Clarity, Quality)
- Initial results show that less code is required when using the system. (Quality)
Weaknesses
- EnCompass seems to have a high learning curve, making it hard for users to adopt initially, especially in relatively simple cases. (Quality)
- It is unclear how much compute is required in general for EnCompass to run and what happens if that compute is not available. For example, if an LLM is being run locally and there is only one GPU, how is that parallelized? (Clarity)
Questions
- What kind of hardware is required for EnCompass? Can someone with a consumer laptop run it using a package like Ollama?
- Is there a way to visualize or trace what is happening as the search is running, so the user knows what is going on?
- While this is beyond the scope of the paper, having such a framework for "LLM-in-control" could also be useful. Could EnCompass be extended to include that?
Limitations
yes
Final Justification
The authors addressed all of my concerns during the rebuttal, including computational resources. It can be hard to estimate the learning curve of EnCompass right now, but with tutorials and visualizations it seems that users could learn it easily. I think EnCompass addresses a very important issue and provides a framework for reproducible experiments that does not exist at the moment, so I recommend acceptance.
Formatting Issues
no
We thank the reviewer for their time and effort. We address your comments and questions below.
Weakness 1, learning curve
While the concept of probabilistic angelic nondeterminism (PAN) is unfamiliar to many people, it is the only concept that needs to be understood to learn EnCompass. Otherwise, the basic use of EnCompass is completely encapsulated by three commands: branchpoint(), record_score(), and my_agent(...).search(...). EnCompass is fairly lightweight to use, and one can create tutorials and reference examples for common search strategies.
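As a minimal sketch of that usage pattern (the import paths, call_llm, and score_solution below are placeholders/assumptions for illustration, not the exact EnCompass API):

```python
import encompass
from encompass import branchpoint, record_score  # assumed import paths

def call_llm(prompt):
    ...  # placeholder: use whatever LLM client you like (LangChain, OpenAI API, Ollama, etc.)

def score_solution(text):
    ...  # placeholder: user-defined evaluator returning a numeric score

@encompass.compile
def my_agent(task):
    branchpoint(branching=3)                           # candidate drafts branch here
    draft = call_llm(f"Draft a solution for: {task}")
    branchpoint(branching=2)                           # candidate revisions branch here
    final = call_llm(f"Improve this draft: {draft}")
    record_score(score_solution(final))                # score used by the search strategy
    return final

my_agent("write a sorting function").search("beam", beam_width=2)
```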
Weakness 2, compute and parallelization
How much compute is required and what happens when that is not available will be determined by the code that the user writes, as well as the specific search algorithm being used (which could be user-defined, or one that EnCompass provides out-of-the-box). As an example, consider the following code:
```python
@encompass.compile
def run_agent():
    branchpoint(branching=3)
    call_llm("prompt 1")  # Replace this with whatever framework/API you're using to call LLMs, e.g., LangChain, DSPy, Ollama, OpenAI API, etc.
    branchpoint(branching=5)
    call_llm("prompt 2")  # Replace this with whatever framework/API you're using to call LLMs, e.g., LangChain, DSPy, Ollama, OpenAI API, etc.

run_agent().search("parallel_dfs")
```
This depth-first search will result in 3 LLM calls with "prompt 1" and 3 * 5 = 15 LLM calls with "prompt 2", with maximum parallelism (unless multithreading is disabled via the max_workers=1 keyword argument). How much compute this requires or how multiple LLM requests get queued or dynamically batched will be determined by the framework the user uses for calling LLMs.
Question 1, hardware requirements
EnCompass by itself has no dependencies. This is because the compiler uses the Python standard library ast for AST transformations and the search algorithms also only use Python standard libraries.
However, the agent code you write may have certain requirements, e.g., if you run LLMs locally with Ollama, then you probably need a GPU, and your agent code will depend on the Ollama library:
```python
import ollama

@encompass.compile
def run_agent():
    branchpoint(branching=3)
    response1 = ollama.generate(model="gemma:2b", prompt="prompt 1")
    branchpoint(branching=5)
    response2 = ollama.generate(model="gemma:2b", prompt="prompt 2")

run_agent().search("parallel_dfs")
```
The fact that EnCompass has no dependencies means that you can flexibly use it with any LLM framework, which is one of the strengths of EnCompass.
Question 2, visualization and tracing
Yes – there are a few ways to do this depending on what is being visualized/traced.
For tracing LLM calls/usage, there are existing libraries that do that. For example, when we ran our experiments, we used AgentOps. EnCompass also provides a primitive for recording usage of any resource via record_costs.
For recording the number of times each branchpoint is visited, EnCompass does this for named branchpoints (Appendix B.2): func.branchpoint_step_counts keeps track of this. (To name a branchpoint, do branchpoint(name=<name>).)
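For example, a small sketch of naming branchpoints and reading the visit counts afterwards (call_llm is a placeholder, and the exact shape of branchpoint_step_counts is only described informally here, so the final print is illustrative):

```python
@encompass.compile
def run_agent():
    branchpoint(name="draft", branching=3)
    call_llm("prompt 1")   # placeholder LLM call
    branchpoint(name="revise", branching=5)
    call_llm("prompt 2")   # placeholder LLM call

run_agent().search("parallel_dfs")
print(run_agent.branchpoint_step_counts)  # visit counts for the "draft" and "revise" branchpoints
```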
The user can implement any custom tracing/logging/visualization about the search process by writing a wrapper around branchpoint():
```python
@encompass.compile
def branchpoint_with_tracing(logger_state):
    logger_state = log_and_visualize(logger_state)  # User-defined function to trace/log/visualize
    branchpoint()
    return logger_state
```
Then replacing every branchpoint() in your agent with logger_state = searchover(branchpoint_with_tracing(logger_state)) will enable tracing/visualization.
For example, to visualize the entire search tree every time a new node gets created:
```python
class Node:
    def __init__(self, root=None):
        self.children = []
        self.root = root if root is not None else self

    def create_child(self):
        child = Node(root=self.root)
        self.children.append(child)
        return child

    def visualize_tree(self):
        self.root.visualize_sub_tree()  # visualize_sub_tree() is a user-defined rendering method

@encompass.compile
def branchpoint_with_visualization(parent_node):
    parent_node.visualize_tree()
    branchpoint()
    child_node = parent_node.create_child() if parent_node is not None else Node()
    child_node.visualize_tree()
    return child_node

@encompass.compile
def run_agent():
    cur_node = Node()
    cur_node = searchover(branchpoint_with_visualization(cur_node))  # used to be branchpoint()
    # Step 1
    ...
    cur_node = searchover(branchpoint_with_visualization(cur_node))  # used to be branchpoint()
    # Step 2
    ...
    cur_node = searchover(branchpoint_with_visualization(cur_node))  # used to be branchpoint()
    # Step 3
    ...
```
If certain tracing capabilities (such as the one above) become a common requirement for users, we will implement them within EnCompass itself so that users do not need to implement them themselves.
Question 3, extension to "LLM-in-control" agents
In “LLM-in-control” agents where an LLM chooses actions to take in an environment, since the interface the environment provides already makes it easy to implement search, there isn’t a need for disentangling the core agent workflow from the overlaying search strategy, which is the key offering of EnCompass. Nevertheless, it is still helpful to have a framework that provides out-of-the-box search algorithms that you can easily add to your agent, as well as a standardized interface for search algorithms so that they can be reused across different “LLM-in-control” agents.
Indeed, EnCompass accomplishes both of these goals for “LLM-in-control” agents. Suppose your “LLM-in-control” agent has an environment agent_env: AgentEnv in which the LLM takes actions action: Action, so that a rollout looks like
```python
for step in range(max_steps):
    action = llm_decide_action(agent_env)
    agent_env.act(action)
```
Then by putting this into EnCompass,
```python
@encompass.compile
def run_agent():
    for step in range(max_steps):
        branchpoint()
        action = llm_decide_action(agent_env)
        agent_env.act(action)
        record_score(evaluate_state(agent_env))
```
we now have access to a standardized interface for implementing search (Appendix B.3) and can also use out-of-the-box search algorithms by writing, e.g., run_agent().search("beam", beam_width=3, default_branching=3).
More recently, there have been works that could be described as "LLM+program-in-control" agents (not discussed in the paper), where the LLM writes agent code that could itself call LLMs or spawn LLM subagents, etc. These agents have the flexibility of "LLM-in-control" agents combined with the benefits of programming (more functional and hence better state management, expressive control-flow structures, etc.). For these agents, the fact that agents/subagents are defined programmatically means that the issue of entangling the underlying workflow with the overlaying search strategy will resurface, which is addressed by EnCompass. We conducted preliminary experiments in this direction – see our response to Weakness 2 of Reviewer fzrw.
The paper identifies a gap: the lack of a framework for agent inference-time search strategies. The key bottleneck identified is the absence of a clean abstraction/separation between core agent-workflow logic and search logic. So the paper proposes to separate the expression of the core workflow of the agent from the inference-time strategies. The key insight highlighted by the paper is that inference-time strategies can be viewed as search over different execution paths of a nondeterministic program. With this insight, the paper presents EnCompass, a framework allowing the user to express locations of nondeterminism in code using a special annotation and then compiling the annotated program into a search space object. Overall, the paper contributes the probabilistic angelic nondeterminism (PAN) programming model along with EnCompass, a Python library implementation, and demonstrates case studies using EnCompass.
Strengths and Weaknesses
I highly appreciate the problem motivated by the paper (being able to define an agentic workflow and separately invoke various well-known as well as custom search strategies over it). The abstractions provided by the EnCompass library are convenient and will no doubt enable further research in agentic inference-time search by making it convenient for researchers to experiment with different combinations of agentic-workflow architectures and search algorithms applied over them. The paper highlights multiple case studies augmenting different base agents with different search techniques, hence establishing the flexibility of the EnCompass library. Hence, I believe that EnCompass is going to be a very useful contribution as software.
However, I have the following concerns which I request the authors to discuss:
- While the motivation to decouple the agent workflow from branch selection in nondeterministic execution is clear, there are situations where the branch-selection decision is made nondeterministically but stochastically, depending on branch-local state. In such scenarios, it is desirable that the agent workflow be able to encode a latent parameter which the search strategy can take into account. For example, consider using LLMs to select a specialized subagent for query routing. While this program point is certainly a nondeterministic branchpoint, there are strong priors for selecting nodes. For such scenarios (which I believe are common use cases), it would strengthen the framework to discuss or support latent-state communication between the workflow and search modules.
- Increasingly, agents are being developed with control-flow authority delegated to LLMs, as opposed to the "program-in-control" agents studied in this paper. Such systems go a step further and enable multiple agents to collaborate, making the system highly nondeterministic and making it impossible to annotate branchpoint() ahead of time (which is one of the core abstractions of EnCompass). It would be interesting to discuss how the EnCompass abstractions might be adapted or extended to handle LLM-driven agent workflows, where branchpoints cannot be statically anticipated.
Questions
Can the authors address the 2 main points I raised in the strengths and weaknesses section above?
I am also curious to learn about the following (mostly minor) points:
- Could you explain whether EnCompass works in a multithreaded environment? Specifically, can the refine example in Section 3.2 be used with multithreading?
- How does the system handle exceptions thrown within EnCompass programs? It is not clear from the paper what protect does.
- What is the behaviour when searchover is not used to call sub-modules?
- Can EnCompass search with the notion of a budget (which could be in terms of time, money, system resources, etc.)? I believe it would be especially useful if there is support for a search timeout that can stop the search after a user-provided threshold time.
Limitations
yes
Final Justification
The authors' rebuttal sufficiently addressed my concerns, as noted in my comment.
Formatting Issues
N/A
We thank the reviewer for their detailed questions and comments -- we provide responses to them below.
Weakness 1, communication between underlying workflow and overlaying search process
The current version of EnCompass allows the following way to communicate between the search process and the agent workflow: since the search strategy can see all variables in the state of the agent via checkpoint._frame, one variable can be devoted to latent information that the search process uses to make decisions, and another variable can be devoted to messages sent to the agent from the search process:
```python
@encompass.compile
def run_agent():
    ...
    latent_info = ...
    branchpoint()
    print(received_message)
    ...

checkpoint = run_agent().start()
latent_info = checkpoint._frame['latent_info']
# Make decisions based on latent_info
checkpoint._frame['received_message'] = "some message"  # Send message to agent
```
We plan to implement a cleaner two-way communication feature between the agent workflow and the search process. We will mimic the syntax for communication in Python generators. Recall that Python generators use the syntax received_value = yield sent_value for communication, e.g.,
```python
def example_generator():
    name = yield "Hi! What's your name?"
    age = yield f"Hi {name}! How old are you?"
    yield f"{name} is {age} years old."

gen = example_generator()
print(next(gen))         # Prints "Hi! What's your name?"
print(gen.send("Jack"))  # Prints "Hi Jack! How old are you?"
print(gen.send(34))      # Prints "Jack is 34 years old."
```
We plan to implement similar syntax in EnCompass, so our example above will look like
```python
@encompass.compile
def run_agent():
    ...
    latent_info = ...
    received_message = branchpoint(message=latent_info)
    ...

checkpoint = run_agent().start()
latent_info = checkpoint.message
# Make decisions based on latent_info
checkpoint.step("Some message")  # Send message the next time step() is called
```
Weakness 2, agents where LLMs have more control-flow authority
One way to implement LLM-driven agent workflows is to have an LLM write the agent code that calls LLMs. Systems like this place control flow under the authority of LLMs, since LLMs are writing the code. (If the LLM cannot statically anticipate, e.g., the second half of the agent workflow, it can even write code that implements just the first few steps and, at the end, spawns an LLM sub-agent that does the rest.) Since everything is still in code – just written by LLMs – EnCompass can be directly applied to the code to enable search. In this use case, EnCompass is used to simplify agent programming not for humans, but for LLMs.
In such a system, a solution to not being able to statically anticipate branchpoints is to replace the remainder of the workflow with a sub-agent call; that sub-agent will have more information about the current progress of the workflow, so it can write code knowing where to place branchpoints.
But that replaces the entire second half of the workflow with a subagent call. In the case where one wants to write the entire agent workflow in code but dynamically determine where to place branchpoints, one simple solution that can be currently implemented with EnCompass is to put branchpoints inside if blocks:
```python
@encompass.compile
def run_agent():
    # Step 1
    ...
    if should_branch_after_step1(step1_result):
        branchpoint()
    # Step 2
    ...
    if should_branch_after_step2(step1_result, step2_result):
        branchpoint()
    ...
```
For a related discussion, see our response to the last question by Reviewer 49qq.
Question 1, multithreading
Yes! All out-of-the-box search algorithms are by default implemented with multithreading so that creating the children of a node are done in parallel. (To disable parallelism, use the keyword argument default_max_workers=1 when calling search().)
However, multithreading can cause race conditions when different search branches mutate shared state. For the refine example in Section 3.2, enabling multithreading will likely cause all search branches to see an empty feedbacks, so they do fresh resamples instead of refinement. Because the logic of refinement is sequential and is incompatible with parallelism – later refinements depend on the results of previous refinements – one must disable multithreading here by passing default_max_workers=1 in calling search() (which we omitted for simplicity in the code example).
Besides globally turning multithreading on/off, EnCompass also lets you specify, for each branchpoint, whether children of a node at that branchpoint can be sampled in parallel, via a keyword argument to branchpoint() – by writing branchpoint(max_workers=1), multithreading is turned off at that branchpoint.
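For concreteness, a short sketch combining both levels of control (the agent body and call_llm are placeholders):

```python
@encompass.compile
def run_agent():
    branchpoint(branching=3)                   # children here may be sampled in parallel
    step1 = call_llm("prompt 1")               # placeholder LLM call
    branchpoint(branching=5, max_workers=1)    # children here are sampled sequentially
    step2 = call_llm(f"prompt 2 given {step1}")

# Multithreading on (the default) everywhere except the second branchpoint:
run_agent().search("parallel_dfs")

# Or disable multithreading globally:
run_agent().search("parallel_dfs", default_max_workers=1)
```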
Question 2, exception handling and protect()
An unhandled exception that didn’t occur inside protect will cause the entire program to error out with that exception. While Python natively has try: ... except: ..., EnCompass supports two additional patterns for error handling:
1. Putting the code inside protect(...), as shown in Listing 6 of the paper. This will automatically create a retry loop that retries from the most recent branchpoint when the protected code throws the specified exception, until the exception is no longer thrown. For example, consider the following code.
```python
import random

@encompass.compile
def run_agent():
    sum_so_far = 0
    for i in range(10):
        branchpoint()
        sum_so_far += protect(1 / random.choice([0, 1, 2]), ZeroDivisionError)
    return sum_so_far

run_agent().search("sampling", num_rollouts=1)
```
Although this conducts only 1 rollout, we automatically retry from the most recent branchpoint whenever there's a ZeroDivisionError. Thus, each iteration of the loop gets retried until the randomly chosen integer is not 0, so in the end sum_so_far is the sum of 10 independent random variables that are each 1 or ½ with equal probability.
2. Catching the exception and killing the current search branch with kill_branch(). For example,
```python
@encompass.compile
def run_agent():
    sum_so_far = 0
    for i in range(10):
        branchpoint(branching=5)
        try:
            sum_so_far += 1 / random.choice([0, 1, 2])
        except ZeroDivisionError:
            kill_branch()
        record_score(sum_so_far)
    return sum_so_far

run_agent().search("best_first")
```
Here, we conduct best-first search with branching factor 5. So we sample 5 times iteration 0 of the loop, and the sample with the highest sum_so_far continues on and gets 5 samples of iteration 1 of the loop, and the one that has the highest sum_so_far continues on and gets 5 samples of iteration 2 of the loop, so on and so forth. When we sample iteration i of the loop and it errors out, that search branch is killed, i.e., the sample is discarded. So on average we only have 3.33 samples of each iteration of the loop. It is possible that all search branches get killed before any of them reach the return statement, in which case the entire agent will error out with AllBranchesKilledError.
Question 3, calling sub-modules without searchover
For an EnCompass function sub_agent, sub_agent(...) is a SearchSpaceWithArgs object that provides methods such as search(search_algo, **search_config) (Section 2.2, Appendix B.2). Thus, without using searchover to run the sub-agent (searchover(sub_agent(...))), sub_agent(...) will be that SearchSpaceWithArgs object. This can be used to implement nested search. For example, in the code translation case study, we have both beam search at the class level, and when translating each class, we conduct nested beam search at the method level. Indeed, in the loop that iterates through all the classes, we do sub_agent(...).search("beam", ...) instead of searchover(sub_agent(...)) (Listing 19 line 189).
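A condensed sketch of that nested-search pattern (not the actual code from Listing 19; the sub-agent bodies, the scoring functions, and the assumption that search() returns the best result are illustrative only):

```python
@encompass.compile
def translate_method(method):
    branchpoint(branching=3)
    translation = call_llm(f"Translate this method:\n{method}")  # placeholder LLM call
    record_score(score_method(translation))                      # placeholder evaluator
    return translation

@encompass.compile
def translate_class(cls):
    branchpoint(branching=3)
    translated_methods = []
    for method in cls.methods:
        # translate_method(method) is a SearchSpaceWithArgs object, so instead of
        # searchover(translate_method(method)) we run a nested beam search on it.
        translated_methods.append(translate_method(method).search("beam", beam_width=2))
    record_score(score_class(translated_methods))                # placeholder evaluator
    return translated_methods

translate_class(some_class).search("beam", beam_width=3)         # outer, class-level beam search
```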
Question 4, search with budget
While current out-of-the-box search algorithms provided by EnCompass do not automatically support searching with a budget, the user can track usage of various resources and what the search tree looks like in EnCompass – see our response to Question 2 by Reviewer 49qq.
To implement global budget constraints, there are two ways that come to mind:
- We can make use of the Search.search_generator(self, init_program_state) interface (Search is the base class for all search algorithms). This is a generator that yields a result every time a new result becomes available. So we can implement a search_with_budget() method that iterates through the generator until some budget has been exceeded, where budget tracking is supported via record_costs (Listing 7). Then it returns the best result among all the results that have been yielded so far (see the sketch after this list).
- We can follow a similar implementation to that of early_stop_search(). For early_stop_search(), there's a Checkpoint.early_stopped_search attribute that the search algorithm sees and stops the search process if it is True. Similarly, we can modify the Checkpoint.step() method so that if the budget has been exceeded, the returned checkpoint will have an attribute exceeded_budget set to True, which the search algorithm can see so that it can stop.
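As a rough illustration of the first option (using Search.search_generator), a sketch of what such a wrapper might look like. The .score attribute on results and the total_cost attribute on the search object are assumptions used only for illustration, not the actual EnCompass API:

```python
import time

def search_with_budget(search_algo, init_program_state, max_seconds=600.0, max_cost=5.0):
    """Iterate the search generator until a wall-clock or cost budget is exceeded,
    then return the best result yielded so far."""
    start = time.monotonic()
    best_result = None
    for result in search_algo.search_generator(init_program_state):
        # Keep the best-scoring result seen so far (assumes results expose a .score).
        if best_result is None or result.score > best_result.score:
            best_result = result
        # Stop once either budget is exhausted; cost tracking is assumed to be
        # aggregated from record_costs into search_algo.total_cost.
        if time.monotonic() - start > max_seconds:
            break
        if getattr(search_algo, "total_cost", 0.0) > max_cost:
            break
    return best_result
```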
Dear authors, your responses sufficiently address my 2 primary concerns regarding the ability to communicate between the agent and search process, and also about search for programs with LLM-in-control. Thank you also for clarifying the other questions, which provide me a deeper understanding of the system - I encourage the authors to add these details to the paper, perhaps in the appendix.
Dear Reviewer,
We're glad that we have been able to address your concerns and answer your questions. We will add these clarifying details to the paper. We again thank you for your time and effort reviewing our paper!
Best, The Authors
Dear Reviewer,
This is a gentle reminder that the discussion period is ending in 2 days. Your response to our rebuttal is greatly appreciated! We thank you for your valuable feedback on our work.
Best,
The Authors
This paper addresses inference-time scaling issues for agent programming contexts that rely specifically on "program-in-control" style agents, in which the workflow for the agents is specified in code and LLMs are used to solve each subtask. Borrowing from work in probabilistic programming, the framework, called EnCompass, relies on a programming model that separates the workflow logic from the inference-time (search) strategy. EnCompass allows programmers to easily experiment with different inference-time approaches by defining a set of core primitives via a Python function decorator. The primitives can then be invoked in the workflow code to invoke specific inference-time scaling strategies.
Strengths and Weaknesses
Strengths
- The approach makes it easy for programmers of (program-in-control) agent-based systems to experiment with (and share code for) different inference-time search strategies rather than coding those strategies directly into the agent workflow code. This is nice. It would potentially allow more complex tasks to be solved reliably using agent-based deep-learning approaches.
- Examples are provided to show how multiple standard inference-time search strategies would be implemented. An evaluation using three case studies summarizes the code differences with and without EnCompass; results show that EnCompass pretty significantly simplifies the code.
- EnCompass can be used in conjunction with existing LLM prompting and tool-use frameworks like LangChain.
Weaknesses
- It isn't clear that the programming solution provided isn't already in use by folks building agent-based LLM systems.
- The limitations of EnCompass are not thoroughly discussed.
Questions
- Are there references to show the rise in developing "program-in-control" agent-based systems? This would help to motivate the need for EnCompass.
- Do we know that programmers aren't already using EnCompass-style approaches when developing agent-based LLM systems?
- What is the definition of "angelic" nondeterminism vs. nondeterminism?
- In terms of the limitations, what type of inference-time scaling strategies would NOT benefit from EnCompass? Can EnCompass be used for workflows that combine program-in-control and LLM-in-control approaches, or do these not really occur?
Limitations
Yes.
Final Justification
No unresolved issues for me. The rebuttal together with the other reviews (and their rebuttals) addressed my concerns.
Formatting Issues
None.
We appreciate the reviewer for their time and effort reviewing our paper. We address your concerns and questions below.
Weakness 1 and Question 2, does the solution already exist?
To the best of our knowledge, no openly published agent framework compiles an agent workflow to a search space object to enable abstractly specifying inference-time search strategies. It is certainly hard to say what existing programming solutions are being used in proprietary frameworks, but we feel that speculation falls outside the purview of the NeurIPS review process (consistent with the NeurIPS call-for-papers policy).
Weakness 2, need more discussion of limitations
Here is a more thorough discussion of the limitations of EnCompass:
- EnCompass is most useful for “program-in-control” and “LLM+program-in-control” agents where there’s a program (written by a human or LLM) that defines the agent workflow. It is not as useful for “LLM-in-control” agents where the agent is an LLM outputting actions to act in an environment. However, “LLM+program-in-control” could be considered more expressive/general than “LLM-in-control” agents and has seen increasing interest lately (see our response to Question 1 below).
- At the implementation level, because EnCompass uses a decorator that reads the function's source code and modifies its behavior (supporting additional primitives like branchpoint()), it may not be compatible with other decorators that read the function's source code and assume regular Python. This is not currently a practical issue, since it is rare for decorators to read and rely on the function's source code. Neither is this a fundamental limitation, since it can be addressed with a more feature-rich transpiler that can reconcile decorators.
Question 1, popularity of "program-in-control" style agents
There has been continued interest in “program-in-control” style agents like Hypothesis Search [16], AlphaTrans [17], Self-Discover [18], Syzygy [a], and Parsel [b]. As hand-designed workflows, they often target more specific tasks (e.g., AlphaTrans and Syzygy are code translation agents), but these also include generalists (Hypothesis Search for inductive reasoning, Parsel for coding, and Self-Discover for general reasoning). Agents for specific large-scale tasks — such as AlphaTrans and Syzygy, both for translating entire code repositories — are often "program-in-control" since, at such a scale, having the skeleton of the agent workflow be programmatically defined is important for structuring the workflow and ensuring it follows a well-defined plan.
A few recent works use a hybrid "LLM+program-in-control" approach, where LLMs write agent code (ADAS [33], GodelAgent [c], Chain-of-Code [d]), combining the benefits of programming and natural language. EnCompass continues to be helpful in this situation since it makes it easier for LLMs to write agents with sophisticated inference-time strategies. We demonstrate this in recent experiments applying EnCompass to Chain-of-Code – see our response to Weakness 2 of Reviewer fzrw.
[a] Shetty, Manish, et al. "Syzygy: Dual code-test C to (safe) Rust translation using llms and dynamic analysis." arXiv preprint arXiv:2412.14234 (2024).
[b] Zelikman, Eric, et al. "Parsel🐍: Algorithmic Reasoning with Language Models by Composing Decompositions." Advances in Neural Information Processing Systems 36 (2023): 31466-31523.
[c] Yin, Xunjian, et al. "Godel agent: A self-referential agent framework for recursive self-improvement." arXiv preprint arXiv:2410.04444 (2024).
[d] Li, Chengshu, et al. "Chain of code: Reasoning with a language model-augmented code emulator." arXiv preprint arXiv:2312.04474 (2023).
Question 3, angelic nondeterminism vs. nondeterminism
Nondeterminism is the general concept where the execution of a program has multiple available paths. A program state is not always followed by a unique next program state; instead, potentially multiple next program states are possible. "Angelic" nondeterminism refers to ways of executing a nondeterministic program where execution paths that lead to "better" outcomes are preferred and chosen. For example:
- In nondeterministic programs that solve decision problems, the output "yes" is preferred to the output "no", so if there exists an execution path that leads to the output "yes", then that execution path will be taken. (These nondeterministic deciders are used to define complexity classes like NP.)
- In implementations of the amb operator (found in some Lisp dialects and other languages), amb nondeterministically chooses from a set of values, where the actual choices made are those that satisfy constraints or assertions later in the program. The "good" execution path that gets chosen (if it exists) is the one where all constraints are satisfied.
The opposite of "angelic" nondeterminism is "demonic" nondeterminism, where the worst possible choices will be made (from the perspective of achieving the desired outcome). Probabilistic programs represent another form of nondeterminism, where execution paths are chosen according to specified probability distributions.
Implementation of angelic nondeterminism necessarily involves searching across different execution paths to find those that satisfy the desired criteria.
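To make the distinction concrete, here is a toy, library-independent sketch of angelic (amb-style) choice implemented by backtracking search; it is purely illustrative and unrelated to the EnCompass API:

```python
def angelic_choice(options, continuation):
    """Try each option in turn and return the first continuation result whose
    constraints hold; 'angelic' because the execution path that leads to a
    successful outcome is the one that ends up being taken."""
    for option in options:
        try:
            return continuation(option)
        except AssertionError:
            continue  # this execution path violates a later constraint; backtrack
    raise AssertionError("no execution path satisfies the constraints")

def check_pair(x, y):
    assert x * y == 12   # constraint imposed later in the program
    return (x, y)

# amb-style usage: nondeterministically pick x from {2, 3, 4} and y from {5, 6, 7}
result = angelic_choice([2, 3, 4],
                        lambda x: angelic_choice([5, 6, 7],
                                                 lambda y: check_pair(x, y)))
print(result)  # (2, 6)
```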
Question 4, limitations?
To your first question, inference-time strategies that are designed to be entangled with the agent workflow benefit less from EnCompass. Suppose one designs a complex agent with inference-time strategies that are highly specific to the underlying agent workflow and dependent on runtime details specific to this agent. An example could be a code translation agent that does refinement for errors of a certain kind (e.g. compilation errors) but sampling for other errors (e.g. runtime errors), has a very specific backtracking strategy (e.g. if the last 3 translated methods failed, then backtrack to the previous file), etc. Then there isn't a general search strategy like "beam search" or "best-first search" that replicates such specific inference-time strategies, so EnCompass is not applicable in this situation.
However, it has been on our to-do list to address this by allowing communication between the underlying agent workflow and the search process (see our response to Weakness 1 of Reviewer CQ3F). Then we can define a custom search process that receives messages from the agent on, e.g., which branchpoint it wants to backtrack to, and the search process backtracks to that branchpoint. Here, we continue to see the benefits of separating search from the underlying workflow that we have discussed in the paper (306-352) while also enabling highly customized search processes that are tailored to the workflow. By making checkpoints of the agent program, EnCompass allows the search process to cleanly backtrack to any checkpoint, even if the logic of doing so depends on details of the current state of the agent process.
To your second question, one way to combine program-in-control and LLM-in-control approaches is to have an LLM write the agent program, which could call LLMs and spawn sub-agents, etc. Here, EnCompass can be applied to the programs that the LLM writes. (Also see our responses to Weakness 2 of Reviewer CQ3F and the last question of Reviewer 49qq.)
The rebuttal together with the other reviews (and their rebuttals) addressed my main concerns. I am raising my rating as a result.
Dear Reviewer,
We greatly appreciate your time and effort reviewing our paper, and we are happy that we were able to address your concerns.
Best,
The Authors
Dear Reviewer,
This is a gentle reminder that the discussion period is ending in 2 days. Your response to our rebuttal is greatly appreciated! We thank you for your valuable feedback on our work.
Best,
The Authors
The authors have provided a rebuttal to your comments, and it's an important part of the review process to give their response careful consideration. Please take a moment to review their rebuttal and provide any follow-up comments. This will help ensure there’s sufficient time for discussion and any necessary follow-up.
Best regards,
AC
This paper presents EnCompass, a framework designed to decouple the agent workflow logic from inference-time strategies in large language model (LLM)-based agent systems. The authors introduce the concept of "probabilistic angelic nondeterminism" (PAN) and demonstrate how it can be used to enable flexible experimentation with different search-based inference-time strategies in AI agents. The framework provides a clean way to experiment with strategies such as best-of-N sampling, beam search, and Monte Carlo Tree Search (MCTS). Three case studies are presented, showcasing the advantages of EnCompass in enhancing the reliability and performance of LLM-based agents.
Strengths and Weaknesses
Strengths:
- The paper is clearly written and the programming framework is carefully designed and presented in detail.
- The proposed framework could potentially benefit future research on LLM Agents.
- The authors provide clear showcases of how the proposed framework can facilitate research on test-time strategies for LLM Agents.
Weaknesses:
- The novelty and technical contribution of the paper are limited; it generally looks more suitable as a technical report for an open-source codebase and may not suit the scope of academic conferences very well.
- There is a lack of quantitative experiments on agent benchmarks such as GAIA/HLE/BrowseComp/etc.; therefore, it is unclear whether the proposed framework can achieve similar or higher empirical performance compared to existing agent frameworks. It is also unclear whether and how much the test-time strategies, which can be easily implemented using the framework, actually work.
Questions
How does the proposed framework compare against existing agent frameworks on standard agent benchmarks?
Limitations
The token consumption of building similar agents using the proposed agent framework versus existing ones is not analyzed.
Formatting Issues
N/A
Response to Weakness 1, novelty and technical contribution
EnCompass is novel in a couple of ways.
- EnCompass is, as far as we know, the first (and so far only) inference-time and search framework for agents.
- Conceptually, we introduced a unifying framework of common inference-time strategies as special cases of search over nondeterministic execution paths, and this conceptual framework was implemented as an actual programming framework. Programming frameworks that unify and simplify various aspects of agent programming regularly appear at top ML conferences – examples include:
- DSPy at ICLR 2024 (spotlight): https://iclr.cc/virtual/2024/poster/17642
- AutoGen at COLM 2024: https://openreview.net/forum?id=BAakY1hNKS
- We applied the idea of angelic nondeterminism — a programming construct typically limited to early work on symbolic automated reasoning — to the actual programming of modern LLM agents. The specific version of angelic nondeterminism — probabilistic angelic nondeterminism — is also novel.
Response to Weakness 2 and Question 1, quantitative experiments and comparisons on agent benchmarks
It is widely reported that test-time inference strategies can improve LLM agents [3, 4, 5, 6, 7, 8, 9, 10, 11, 16, 17, 18]. Common examples include Best-of-N, refinement, MCTS, and more, all of which are known to work. Our EnCompass framework enables an elegant way to specify such strategies so that implementation is easy and streamlined. Our three case studies demonstrate this.
To address the comment on lack of additional benchmarks, we conducted an additional set of evaluations on BIG-Bench Hard, and integrated EnCompass with Chain-of-Code (CoC) [a], which is a SOTA method on that benchmark. Our version of Chain-of-Code (CoC) uses an LLM to write agent code where undefined function calls are emulated by LLMs. EnCompass then compiles the LLM-written code so that we can easily search over different combinations of LLM emulation outputs. We ran parallel beam search with beam width 2 and branching factor 2. Search branches that errored out are killed, and for those that survive, we took the majority vote answer. We use 3-shot prompting as in the original paper, and Qwen3-14b to match the discontinued text-davinci-003 which the original authors used. The results are shown below -- CoC+EnCompass achieves the best performance.
| | Direct (reported) | Direct (reproduced) | CoT (reported) | CoT (reproduced) | CoC (reported) | CoC (reproduced) | CoC+EnCompass (ours) |
|---|---|---|---|---|---|---|---|
| All | 55 | 54 | 72 | 79 | 84 | 84 | 87 |
| No LLM emulation needed* | 44 | 46 | 72 | 81 | 94 | 96 | 96 |
| Need LLM emulation | 66 | 63 | 72 | 76 | 72 | 72 | 77 |
*These are the tasks where the LLM-generated code is a regular deterministic Python program where search over nondeterministic execution paths, by definition, would not help.
We accomplished this without the need to prompt the LLM to write code that implements parallelized search by itself. Instead, we simply took the original code it generated, modified it to add branchpoints, and used EnCompass to compile it and apply beam search. This further demonstrates the ease with which EnCompass can be applied to existing agents to improve performance via search.
Note that our evaluations do not compare EnCompass with other frameworks because EnCompass is, to the best of our knowledge, the only inference-time strategy and search framework (see Related Work section). Another framework that is the only framework of its kind is DSPy (ICLR 2024 spotlight). Indeed, the evaluations in the DSPy paper do not compare with other frameworks, but instead focus on showcasing the result of prompt optimization. Similarly, our main evaluation of EnCompass focuses on showcasing how easy it is to use EnCompass to implement and experiment with different inference-time strategies and search algorithms on top of any underlying agent workflow.
Our integration of EnCompass with CoC highlights that EnCompass is designed to be orthogonal to existing agents/agent frameworks so that it can be flexibly used in conjunction with them to achieve the benefits of both. For example, you can also use both EnCompass to implement inference-time strategies and DSPy for prompt optimization in your agent and get the benefits from both the easily-implemented inference-time strategies and the prompt optimization. If you had to choose between EnCompass and DSPy then you'd want to compare them, but you can just use both.
Finally, we note that benchmarks that require an LLM-in-control agent architecture are currently not in the scope of EnCompass. For example, GAIA, HLE and BrowseComp are all designed to evaluate LLMs interacting with an environment via tools (e.g., web search), and are thus out of scope. Nevertheless, an interesting direction for future work is to extend EnCompass to agents that are more "LLM-in-control". For example, an LLM controller could write EnCompass agent code that can itself call LLMs or spawn LLM-written EnCompass subagents. Our experiments on Chain-of-Code above can be seen as preliminary experiments in that direction, and the results have been promising, but a full exploration requires a separate project.
[a] Li, Chengshu, et al. "Chain of code: Reasoning with a language model-augmented code emulator." arXiv preprint arXiv:2312.04474 (2023).
Limitation 1, token consumption comparison
In all of our case studies, we list the API costs involved in each experiment to compare the scaling laws of different inference-time strategies. Since cost is determined by the agent you build and not by the framework(s) used to build it†, we do not compare the cost of an EnCompass agent vs. the cost of an equivalent agent without EnCompass — they will have the exact same token consumption. Instead, we focus on showcasing how EnCompass is used to implement different inference-time strategies, and to compare their scaling laws we have reported cost in our case studies.
†DSPy may be an exception since it can be used to optimize prompts to change the token consumption.
Dear Reviewer,
This is a gentle reminder that the discussion period is ending in 2 days. Your response to our rebuttal is greatly appreciated! We thank you for your valuable feedback on our work.
Best,
The Authors
The authors have provided a rebuttal to your comments, and it's an important part of the review process to give their response careful consideration. Please take a moment to review their rebuttal and provide any follow-up comments. This will help ensure there’s sufficient time for discussion and any necessary follow-up.
Best regards,
AC
Thanks for the responses. I still think the lack of proper comparison with related and similar methods, as well as lack of evaluation on standard and popular benchmarks, make the submission not strong and convincing enough. I therefore maintain my score.
We thank the reviewer for taking the time to respond to our rebuttal. We believe there remain misunderstandings of our paper, so we address your remaining concerns below. We hope our comments below can clarify the relationship between our work and prior work, the main message of our paper, our resultant approach to evaluation, and the intended scope of EnCompass.
Comparison with related methods
As pointed out in our rebuttal to W1 (bullet point 1) and W2/Q1 (paragraph 4), there are currently no existing inference-time search frameworks to the best of our knowledge. So, as far as we know, nothing currently exists that we are able to compare EnCompass against.
Nevertheless, we can present an ablation study to show the utility of EnCompass compared to a feature-incomplete version of EnCompass with only branchpoint() and record_score() ("Partial EnCompass" below). For each of "No EnCompass", "Partial EnCompass", and "Full EnCompass", we report the best-performing search algorithm that can be implemented with minimal changes to the base agent code (aka no changes in the base agent control-flow and no changes in how code is grouped into functions).
| | Best-performing strategy | Reference to Figure 2(a) | Slope of scaling law† | Self-validation (%)‡ | Cost ($)‡ |
|---|---|---|---|---|---|
| Base | — | dark blue point | — | 48.1 ± 2.2 | 0.69 ± 0.04 |
| No EnCompass | GBoN | red line | 2.1 ± 0.8 | 56.0 ± 4.0 | 17.42 ± 0.25 |
| Partial EnCompass | beam (c.) | orange line | 5.4 ± 0.8 | 72.9 ± 3.6 | 16.97 ± 0.16 |
| Full EnCompass | beam (c.) + beam (f.) | pink line | 6.3 ± 0.7 | 80.8 ± 4.9 | 11.64 ± 0.16 |
†Percentage increase in performance per cost doubling
‡Best-performing set of search parameters
By adding the basic constructs of branchpoint() and record_score() to the base agent, we already achieve a competitive scaling law via beam search at the class level. But with the full EnCompass framework that also has searchover() for calling EnCompass functions and branchpoint_choose() to choose among a given set of choices, we can easily implement nested beam search at the method level as well, further improving the scaling law.
Standard and popular agent benchmarks
We would first like to reiterate that the main message of our paper is: EnCompass enables easy implementation and experimentation of inference-time strategies and search in "program-in-control" style agents to improve their performance. The focus is not to develop a new agent that achieves SOTA on a standard and popular agent benchmark. Thus, our paper focuses on demonstrating how inference-time strategies are implemented in EnCompass (Section 3) and how to easily implement inference-time strategies in an agent to improve its performance (the case studies). We believe our paper provides the evidence needed to support our main message.
This focus of our paper also explains our approach to how evaluations were chosen in our case studies. In the case studies where the agent was previously published, we demonstrate how to apply EnCompass to these existing agents to improve performance, so for fair comparison we evaluate on the same benchmarks that the original papers for these agents evaluated on.
Our new evaluation in our rebuttal — Chain-of-Code on BigBenchHard — follows the same approach to maximally support our paper's main message. To best show the utility of EnCompass on an existing agent — Chain-of-Code — we also evaluate on the same benchmark that Chain-of-Code did, i.e., BigBenchHard.
Note also that we believe BigBenchHard is typically considered a standard/popular benchmark.
For many of the other standard/popular benchmarks such as the ones you mentioned, the top-performing agents are all "LLM-in-control" style agents which are outside the current scope of EnCompass, which focuses on "program-in-control" style agents. As a result, demonstrating how we apply EnCompass to a SOTA agent for these benchmarks is out of scope.
We envision a future with hybrid "LLM+program-in-control" style agents (see our rebuttal to W2/Q1), where the programmatic structure of LLM-written agents and subagents will make EnCompass applicable again. When such agents enter the leaderboard of some of the popular benchmarks you mentioned, we will be able to demonstrate EnCompass as applied to these agents. However, building such an agent warrants a separate project, as our current paper focuses on EnCompass.
Nevertheless, we did find a hybrid-style agent (Chain-of-Code) that has leading performance on a popular benchmark (BigBenchHard), and we have demonstrated EnCompass further improving the benchmark performance in our rebuttal to W2/Q1. This further supports our claim that EnCompass can be used to easily improve the performance of agents with some programmatic structure.
Dear Reviewer,
We thank you for your time and effort reviewing our paper. Since the discussion period is ending in 1 day, we would greatly appreciate it if you could take the time to respond to our additional comments that we believe should clear up some misunderstandings. We believe that our responses resolve the concerns you have -- please let us know if you have any further questions. Thank you!
Best,
The Authors
This paper presents EnCompass, a framework that improves robustness and usability of LLM-based agent systems. The paper is clearly written, reduces implementation effort, and enables flexible experimentation with different agentic-workflow architectures and search strategies, which will likely facilitate further research in agentic inference-time search. While one reviewer raised concerns regarding the comparison with other methods, the authors have already addressed the main points sufficiently in the rebuttal. Therefore, I recommend acceptance.