PaperHub
Overall rating: 6.6/10
Poster · 6 reviewers (scores: 4, 3, 3, 3, 4, 4; lowest 3, highest 4, standard deviation 0.5)
ICML 2025

Divide and Conquer: Grounding LLMs as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning

OpenReview · PDF
Submitted: 2025-01-13 · Updated: 2025-07-24
TL;DR

We leverage offline RL with a parameter-efficient and generally applicable hierarchy to ground LLMs as efficient decision-making agents.

Abstract

Keywords
Language agents · Hierarchical reinforcement learning · Offline reinforcement learning

Reviews and Discussion

Official Review
Rating: 4

This paper introduces GLIDER, an approach for fine-tuning LLMs to act as agents in interactive environments. GLIDER relies on a hierarchical architecture in which the LLM is used to both propose a plan (high-level policy) and execute it (low-level policy). Using Behavioral Cloning and offline RL, the authors demonstrate that GLIDER can learn to solve various tasks in ScienceWorld and AlfWorld, achieving strong performance.

One important aspect of GLIDER is that the low-level policy is trained solely on subgoals proposed by the high-level policy: the latter not only generates these subgoals but also rewards the low-level policy based on its trajectory. The authors argue that this approach preserves the low-level policy’s strong generalization abilities, as the subgoals it is trained to solve are not directly tied to the environment’s tasks.

The paper provides in-depth ablation studies on different components of GLIDER, highlighting the importance of offline RL and the hierarchical architecture. It also demonstrates that offline RL can successfully exploit suboptimal demonstrations. Finally, the authors show that GLIDER can efficiently learn new tasks using online RL to fine-tune the high-level policy. Transitioning from offline RL to online RL is achieved seamlessly using the chosen RL method: AWAC.

The authors conclude by discussing GLIDER's broader potential, suggesting that its applications extend beyond interactive textual environments.

Update after rebuttal

Most of my concerns below were due to missing explanations in the manuscript as well as misunderstandings about the generalization tests. The authors have addressed all of these concerns. I therefore maintain my recommendation to accept this paper.

Questions for Authors

  1. Eq. 3 indicates that a regularization on the outputs' length is also applied to the low-level policy. I do not understand this regularization term as the low-level policy is supposed to output the environment's actions. Could the authors provide further explanations on this?
  2. Appendix B mentions "Cross-task Generalization sampling". Were these data necessary?

Claims and Evidence

The proposed method and empirical evidence clearly support the paper's claims by introducing an approach that leverages hierarchical training, seems pretty general, adapts to unseen tasks, and can be easily trained on new tasks in an online regime.

Methods and Evaluation Criteria

GLIDER is a relatively straightforward method. The choice of AWAC appears to be well-suited, as it enables both offline and online RL.

One concern I have relates to the parameter c, which controls the length of low-level trajectories. I could not find detailed information on how this parameter is set, apart from a footnote stating that it depends on the subgoal. Since the subgoals are generated dynamically by the high-level policy, it is unclear how one can determine in advance the possible subgoals and the number of steps required to solve them. The paper suggests that c is fixed, meaning that the high-level policy waits c steps before evaluating the low-level policy's trajectory. However, what happens if the subgoal is completed in fewer than c steps?

Additionally, the paper states that "the subtask completion can be easily accessible from the environment observation" to compute the intrinsic reward for the low-level policy. This assumption seems overly optimistic when moving to the online regime. First, since subgoals are generated by the high-level policy, they could be highly abstract and difficult to evaluate. Second, the feasibility of this approach heavily depends on the complexity of the environment and the LLM's capabilities. For instance, what would happen in Minecraft if the subgoal generated is "build a house"? Furthermore, what occurs if the low-level policy is unable to solve the proposed subgoals? How does the framework handle situations where the high-level policy produces an effective plan, but the low-level policy fails to execute it?

Theoretical Claims

This paper contains no theoretical claims.

Experimental Design and Analysis

ScienceWorld and AlfWorld are two well-established benchmarks for evaluating LLM-based embodied agents. GLIDER demonstrates strong performance, and the analyses using three different LLMs highlight its robustness.

However, I believe the experiments would benefit from additional baselines, particularly those involving online RL methods. I also could not find sufficient details regarding the baselines. For example, what dataset is used by ETO, which is supposed to collect trajectories in an online regime?

Additionally, in Section 4.4, what data did AWAC use? Was the dataset emptied before the experiment, or did it still include the demonstrations used in Section 4.2? Also, what exactly is the AC baseline in Section 4.4? A clearer explanation of these details would improve the study's transparency.

Finally, I would like to mention that I really enjoyed reading the ablations. The results provide valuable insights into the impact of different components, particularly emphasizing the importance of the hierarchical framework and offline RL.

Supplementary Material

I reviewed the appendices, which provide details on the experimental setup, especially on collecting offline data and the prompts used. Algorithm 1 was also useful to understand which weights were changed in the offline-to-online experiments.

Relation to Prior Work

The links with prior work are covered overall. The paragraph discussing online RL approaches for LLM agents in Section 2 could mention other prior work, such as GLAM (Carta et al., 2023) or POAD (Wen et al., 2024). The latter also fits methods using an utterance-level critic, similar to ArCHer (Zhou et al., 2024), which is mentioned in Section 3.4. Prior work using intrinsic rewards generated by an LLM could also be mentioned, e.g. ELLM (Du et al., 2023).
All these references are not essential to understanding the key contributions, they are just suggestions.

Missing Essential References

I do not see any essential missing reference.

Other Strengths and Weaknesses

Concerning weaknesses, I already mentioned all my concerns in previous parts. One additional strength of this paper is that it is well-written and easy to follow.

Other Comments or Suggestions

The authors argue that the high-level policy produces general subgoals, allowing the low-level policy to remain useful even for unseen tasks. However, I would have liked to see empirical evidence supporting this claim. For instance, did the authors analyze the subgoals in the demonstration dataset? What subgoals were generated in the offline-to-online experiments, and how different were they from those in the demonstrations?

Additionally, the authors chose not to finetune the low-level policy in offline-to-online experiments, based on the assumption that subgoals from training tasks would be reused in new tasks. Have they tested finetuning the low-level policy? If so, how does it affect the results?

I also found Figure 2 difficult to interpret due to the overwhelming amount of information and arrows. Even with the caption, it was unclear to me. For instance, in (a), it is unclear what is frozen and what is finetuned—the finetuning block seems to be associated with the low-level trajectory.

Author Response

We deeply appreciate your constructive comments for improving our paper. We would be incredibly grateful for your continued support in reviewing our response and offering further encouragement.


Q1. How is the temporal abstraction parameter c determined, given that subgoals are dynamically generated?

A1. The low-level policy dynamically determines when to end a subtask by analyzing environment observations, rather than strictly waiting for c steps (see L.214-215). This adaptive mechanism ensures efficient execution while maintaining the hierarchical control structure.
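For illustration, the adaptive termination described here could look like the following minimal sketch (our reading of the mechanism, not the authors' code). `env`, `high_policy`, `low_policy`, and `subtask_done` are hypothetical stand-ins, and c only caps the low-level rollout rather than being waited out:

```python
# Minimal sketch of the adaptive low-level rollout (hypothetical helpers, not the paper's code).
def run_episode(env, high_policy, low_policy, subtask_done, c, max_subgoals=20):
    obs, done = env.reset(), False
    for _ in range(max_subgoals):
        if done:
            break
        subgoal = high_policy(obs)                  # high level proposes a sub-task
        for _ in range(c):                          # at most c low-level steps
            action = low_policy(obs, subgoal)
            obs, reward, done, info = env.step(action)
            if done or subtask_done(obs, subgoal):  # end the sub-task early once the
                break                               # observation signals completion
    return obs
```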


Q2. How does the framework evaluate subtask completion and handle failures when subgoals are abstract or when the low-level policy cannot execute the high-level plan effectively?

A2. The low-level policy executes subtasks, while the high-level policy learns to generate finer-grained subtasks through feedback. In ScienceWorld, during early epochs the high-level policy generates unachievable subtasks such as "heat substance" when the stove turns out to be broken, and later adapts to executable ones such as "navigate to foundry". Similarly, in Minecraft it evolves from "build a house" to specific subtasks such as "gather wood" and "create foundations".


Q3. How does ETO collect trajectories in the online regime, and what dataset does it use?

A3. For ETO baseline, we first perform SFT on the base model, then let it interact with the environment online to collect both failure trajectories and expert demonstrations for DPO training. We will provide more detailed baseline implementation details in the appendix.


Q4. What data did AWAC use in Section 4.4? Was the dataset emptied or did it retain the demonstrations from Section 4.2?

A4. Following AWAC, we kept the offline demonstration data rather than emptying the dataset, as it helps maintain performance on the base distribution. Unlike Section 4.2, Section 4.4 focuses on evaluating adaptation to novel scenarios by selecting specific out-of-domain tasks and collecting additional data through online interaction.


Q5. What exactly is the AC baseline in Section 4.4? A clearer explanation of these details would improve the study's transparency.

A5. The AC (Actor-Critic) baseline differs from AWAC (Advantage-Weighted Actor-Critic) by using the standard policy gradient instead of advantage weighting. We will provide more detailed implementation specifications in the appendix.
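To make the distinction concrete, here is a small sketch contrasting the two update weightings (a simplified illustration under our assumptions, not the paper's exact objectives); `token_logps` holds the per-token log-probabilities of one generated action and `advantage` is the critic's scalar estimate:

```python
import torch

def policy_loss(token_logps: torch.Tensor, advantage: torch.Tensor,
                method: str = "awac", beta: float = 1.0) -> torch.Tensor:
    logp = token_logps.sum()                   # log-probability of the whole action
    if method == "ac":
        weight = advantage                     # AC: plain advantage-weighted policy gradient
    else:
        weight = torch.exp(advantage / beta)   # AWAC: exponentiated advantage weighting
    return -(weight.detach() * logp)
```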


Q6. Section 2's literature review could be enriched with additional works (GLAM, POAD, and ELLM) on online RL approaches for LLM agents.

A6. Thanks for your helpful advice. We will adjust the related work section based on these suggestions.


Q7. Where is the empirical evidence showing how the generated subgoals differ between demonstrations and new tasks, and how do they support the claimed generalization ability?

A7. Our approach demonstrates cross-task generalization through qualitative and quantitative evidence. Appendix Figure 9 shows subgoal reuse between tasks, with both directly transferable components and analogous patterns. The effectiveness is validated on three out-of-domain tasks unseen during training, where GLIDER significantly outperforms baselines.

Method  | test-conductivity | find-animal | boil
AC      | 0.30 ± 0.10       | 0.45 ± 0.05 | 0.40 ± 0.15
AWAC    | 0.35 ± 0.15       | 0.60 ± 0.10 | 0.45 ± 0.10
GLIDER  | 0.98 ± 0.02       | 0.99 ± 0.01 | 0.95 ± 0.05

Q8. Why wasn't the low-level policy finetuned in offline-to-online experiments, and what would be the performance impact if it was?

A8. L.252~258 explain this design choice. Low-level skills use intrinsic reward functions instead of task-specific ones, allowing cross-task generalization and robustness to distribution shifts between offline training and online deployment. While joint finetuning could improve performance, adjusting only the high-level policy is sufficient given the task-agnostic nature of low-level skills.


Q9. Figure 2's method diagram is overcomplicated and hard to understand, with unclear connections and training flows.

A9. We will simplify Figure 2. During value-head fine-tuning, the LLM backbone remains fixed, while policy training updates the LLM to learn behaviors.


Q10. What is the rationale behind including a length regularization term in Equation 3?

A10. The length regularization term constrains action sequences to match the inherently short nature of valid environment interactions. By dividing the SFT loss by the sentence length n, as $-E[\log\frac{\pi_{\theta}(a|s)}{n}]=-E[\log\pi_{\theta}(a|s)] + \log n$, it effectively prevents unnecessarily long action generation.


Q11. Appendix B mentions "Cross-task Generalization sampling". Were these data necessary?

A11. Distribution shift in offline RL leads to policy failures on out-of-distribution states. Cross-task sampling simulates potential shifts by exposing the model to varied scenarios, helping handle unseen tasks.


Best regards,

Authors

Reviewer Comment

I thank the authors for their response. I will leave my score as it is, as my review was mostly asking for additional clarifications.

Author Comment

Thank you for your continued support in reviewing our response. We are happy to see that we could address your concerns. We truly appreciate your time and effort in helping us improve our work!

Official Review
Rating: 3

The paper proposes GLIDER, an extension of hierarchical RL to LLM agents where two separate LLM policies control the high and low levels. The proposed training framework relies on a combination of supervised fine-tuning and implicit Q-learning. The authors test on two domains, ScienceWorld and AlfWorld, obtaining promising results in both cases. The authors' experiments use relatively small LLMs (<10B), which is understandable given the need for on-device learning and speed.

Overall, this research is promising, but some aspects require more work or clarification, and I recommend rejection in its current state.

Questions for Authors

  1. Can you demonstrate that the given method obtains the optimal policy? Can you show that Equation (7) is correct? The LLM policy actions are at the token level, whereas the expression is formulated at the MDP transition level. In a way, the LLM has a different MDP at the token level. There should be a way to attribute the full transition reward to each token so that Equation (7) holds. For instance, what would be the equivalent formulation for assigning a reward to each token? Discussion around this topic could be useful, or even lead to more in-depth mathematical analysis.
  2. Unless I am confused, the definition of response length as a regularizer in Equation (3) seems problematic, given that SFT is applied at the token level. Did the authors intend to imply that they down-/up-weight certain dataset elements? Could this be reformulated as a weighted loss to emphasize certain dataset elements? Additionally, did you include an end-of-text token during SFT? In my experience, this kind of regularization hasn't been necessary to produce shorter responses after SFT, but I understand each dataset can be different.
  3. Why did you train for only five epochs during SFT? Since you’re also using SFT as a baseline, did you attempt to optimize its hyperparameters (e.g., train longer, add dropout)?
  4. One of my concerns is that without SFT, the performance is comparable with the baselines NAT and ETO. Could you elaborate further on the role of SFT and why this is still a fair comparison?

Claims and Evidence

The experiments and methodology partially support the claims, but some elements need additional work. For example:

“we propose an offline hierarchical framework GLIDER with superior efficiency and broad applicability.”

Superior parameter efficiency to what? Any sensible baseline involving LLMs would use PEFT.

“Our method enables fast offline-to-online adaptation to non-stationary environments”

Generalization, fine-tuning, and non-stationarity are related but different concepts. The authors could be more precise. I think the authors focus on generalization to new tasks. It does not appear that their online setting involves a continuously changing, non-stationary environment.

“Comprehensive studies on ScienceWorld and ALFWorld show that our method consistently improves [...] generalization capacity, surpassing [...] baselines by a significant margin.”

See my comments in the experiment section.

Methods and Evaluation Criteria

Experiments are run on standard benchmarks: ScienceWorld and ALFWorld. The experiments are limited, but this is common in this literature, given the computational cost. Overall, the proposed evaluation for the methods is sensible.

Theoretical Claims

No theoretical analysis provided.

Experimental Design and Analysis

The experiment design seems sound. I have some questions:

  1. Did the authors experiment with multiple seeds?
  2. What dataset did you use for the SFT: the purely expert one or the mix with medium expertise too? If the latter is the case, could SFT eventually be better than your method if enough expert data is available?
  3. Why did you train for only five epochs during SFT? Since you’re also using SFT as a baseline, did you attempt to optimize its hyperparameters?

Supplementary Material

I read the material.

Relation to Prior Work

The paper is timely and relevant to the RL community, increasingly interested in integrating LLMs. At the same time, its contribution might be limited since there are no theoretical results; the extension to hierarchical agents seems straightforward. Further, it maintains a common limitation with hierarchical RL: that there needs to be domain knowledge (in this case, pre-labelled trajectories) providing options/high-level actions.

Missing Essential References

The paper could benefit from a more extensive literature discussion around LLMs in RL. Examples:

  • Carta, T., Romac, C., Wolf, T., Lamprier, S., Sigaud, O., and Oudeyer, P.-Y. Grounding large language models in interactive environments with online reinforcement learning. In ICML, 2023.

  • Du, Y., Watkins, O., Wang, Z., Colas, C., Darrell, T., Abbeel, P., Gupta, A., and Andreas, J. Guiding pretraining in reinforcement learning with large language models. In ICML, pp. 8657–8677, 2023.

  • Szot, A., Schwarzer, M., Agrawal, H., Mazoure, B., Metcalf, R., Talbott, W., Mackraz, N., Hjelm, R. D., and Toshev, A. T. Large language models as generalizable policies for embodied tasks. In ICLR, 2023.

  • Wang, Z., Cai, S., Chen, G., Liu, A., Ma, X., Liang, Y., and CraftJarvis, T. Describe, explain, plan and select: interactive planning with large language models enables open-world multi-task agents. In NeurIPS, 2023.

  • Wen, M., Wan, Z., Wang, J., Zhang, W., and Wen, Y. Reinforcing LLM agents via policy optimization with action decomposition. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

Other Strengths and Weaknesses

A weakness is the lack of code despite being a mostly experimental paper.

Other Comments or Suggestions

I don’t understand the notation in (1) and (2). Why is this a sum, or what is Sigma? How can you sum “brackets”, and what does the semi-colon inside the brackets mean? What is d? Please explain your notation, as I believe it is non-standard in the literature.

Ethics Review Concerns

N/A

Author Response

Methods

Q1. The optimal policy guarantee and the correctness of Eq. (7), given that actions are at the token level while rewards are formulated at the transition level.

A1. First, our policy derivation inherits from Advantage-Weighted Regression (AWR) [1-2], a popular algorithm that uses a convergent maximum-likelihood loss to convert RL into supervised learning subroutines. Our method follows their theoretical properties, and we will establish the theoretical counterpart in the appendix.

Second, GLIDER correctly attributes the full transition reward to each token through Eq. (7). Assigning the transition-level advantage $A$ to each token $w_i$ is equivalent to assigning the correct advantage to the whole action, since $E[\exp(A(s,u))\cdot\sum_{i=1}^n\log\pi_{\theta}(w_i|s,w_{1:i-1})]=E[\exp(A(s,w_{1:n}))\cdot\log\pi_{\theta}(w_{1:n}|s)]$ by the relationship between the total probability and the conditional probabilities in autoregressive models; a small code sketch of this weighting follows the references below.

[1] Advantage-weighted regression: Simple and scalable off-policy RL, arXiv:1910.00177, 2019.

[2] AWAC: Accelerating Online Reinforcement Learning with Offline Datasets, arXiv:2006.09359, 2020.
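For concreteness, a minimal PyTorch-style sketch of this token-level attribution (our reading of the identity in A1 above, not the authors' implementation); `token_logps` are the per-token log-probabilities of one sampled action and `advantage` is the transition-level estimate from the critic:

```python
import torch

def token_level_awr_loss(token_logps: torch.Tensor, advantage: torch.Tensor) -> torch.Tensor:
    # token_logps: (n,) values of log pi_theta(w_i | s, w_{1:i-1}) for one action
    # advantage:   scalar transition-level advantage A(s, a) from the critic
    weight = torch.exp(advantage).detach()   # one weight shared by every token of the action
    return -(weight * token_logps.sum())     # equals weighting log pi_theta(w_{1:n} | s)
```

Because the same weight multiplies the sum of token log-probabilities, weighting each token is indeed equivalent to weighting the whole action.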


Q2. Superior parameter efficiency to what?

A2. The superior parameter efficiency is mainly reflected in the design of the RL model and the hierarchical structure.

First, we share the same frozen LLM backbone for actor and critic to greatly reduce model parameters, in contrast to many RL algorithms using independent actor and critic. Second, we let the high and low levels share the same model and differentiate them with a lightweight hierarchy prompt, as opposed to typical hierarchical methods with independent models at each level. Also, we adopt PEFT to further improve parameter efficiency following the common practice in LLMs.
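As a schematic sketch of this weight sharing (hypothetical module and prompt names, not the authors' code), one frozen LLM backbone with LoRA adapters can act for both levels, distinguished only by a hierarchy prompt prefix, while a small value head on top of the same backbone serves as the critic:

```python
import torch.nn as nn

HIGH_PREFIX = "[HIGH-LEVEL PLANNER]\n"   # hypothetical hierarchy prompts,
LOW_PREFIX = "[LOW-LEVEL EXECUTOR]\n"    # not the paper's exact wording

class SharedActorCritic(nn.Module):
    """Schematic only: one backbone serves both levels and both actor and critic."""
    def __init__(self, llm, tokenizer, hidden_size):
        super().__init__()
        self.llm = llm                                # frozen LLM plus LoRA adapters
        self.tokenizer = tokenizer
        self.value_head = nn.Linear(hidden_size, 1)   # lightweight sentence-level critic

    def forward(self, observation: str, level: str):
        prefix = HIGH_PREFIX if level == "high" else LOW_PREFIX
        ids = self.tokenizer(prefix + observation, return_tensors="pt").input_ids
        out = self.llm(ids, output_hidden_states=True)
        value = self.value_head(out.hidden_states[-1][:, -1])
        return out.logits, value                      # token-level actor, scalar critic
```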


Q3. Clarification about generalization, fine-tuning, and non-stationarity.

A3. In the offline-to-online stage, we focus on GLIDER’s fast adaptation to unseen tasks using a small number of fine-tuning steps, i.e., few-shot generalization. We will clarify this more precisely.


Q4. Hierarchical RL needs domain knowledge.

A4. We design a generally applicable hierarchy to minimize reliance on domain knowledge. The low-level policy is instructed by an intrinsic reward indicating sub-task completion, which is easily accessible from environment observations, alleviating the necessity for any manual or task-specific design.


Experiments

Q5. A weakness is the lack of code.

A5. We have uploaded the code as supplementary materials for reproducibility when submitting the main paper. Further, we will release detailed tutorials and datasets to ensure full reproducibility.


Q6. Why training for only 5 epochs in SFT? Did you attempt to optimize its hyperparameters?

A6. We trained SFT for 5 epochs to avoid overfitting. Also, we optimized the hyperparameters and added dropout in LoRA layers, which can be found in our submitted code.

SFT epochs | 2     | 3     | 4     | 5     | 6     | 7
Score      | 34.14 | 42.02 | 45.11 | 50.17 | 40.16 | 37.43

Q7. The dataset for SFT, the role of SFT, the comparison between SFT and GLIDER on expert datasets.

A7. We use expert datasets for SFT, matching the nature of supervised learning. The role of SFT is to construct a base agent that improves learning stability and sample efficiency for subsequent stages. GLIDER adopts RL to steer the base agent towards user-specified tasks, excelling at both imitating demonstrations and adapting behaviors through trial-and-error. Experiments show an 11.6%~19.6% performance improvement over SFT.

                           | SFT         | GLIDER
ScienceWorld (seen/unseen) | 69.40/57.12 | 77.43/68.34
AlfWorld (seen/unseen)     | 62.24/65.21 | 71.56/75.38

Q8. Did the authors experiment with multiple seeds?

A8. Yes. Our method obtains close results across different seeds. The following table shows an example result for GLIDER with Llama-3-8B, scored on unseen tasks in ScienceWorld.

Seed  | 0     | 1     | 2     | 3
Score | 68.34 | 69.72 | 68.23 | 67.11

Q9. Regularizer in Eq. (3), and using the end-of-text token in SFT.

A9. Here $n$ is the length of the sentence generated by the policy $\pi_{\theta}$. We divide the original SFT loss by the sentence length, as $-E[\log\frac{\pi_{\theta}(a|s)}{n}]=-E[\log\pi_{\theta}(a|s)] + \log n$. A smaller length $n$ induces a larger gradient of the loss, thus regularizing the model to output relatively shorter sentences. Also, we did use the end-of-text token in SFT, and our length regularizer is complementary to it.


Summary Response

We appreciate your time in reviewing our manuscript and your valuable comments. We have made a number of changes and clarifications to address your concerns, and we are more than delighted to have further discussions to improve our manuscript. If our response has addressed your concerns, we would be grateful if you could re-evaluate our work.

Reviewer Comment

Thank you for your responses. I have some remaining questions:

Q1. Ok.

Q2. I still think that claiming "superior" efficiency as a contribution can be misleading. First, sharing some layers/parameters between the actor and critic is common. This is even mentioned in the PPO paper for Atari games (Schulman et al. 2017). Second, PEFT is applied to virtually any modern LLM application.

Q3. Thank you. I agree that few-shot environment generalization clearly describes what the experiment is evaluating, while non-stationarity is misleading.

Q4. Ok.

Q5 I apologize for missing the earlier code submission. Thank you.

Q6 It's strange to see so much overfitting with only five training epochs if dropout is used during SFT. Did you observe changes in performance for different dropout values? Looking at the code, I see a dropout of 0.05, which should be enough. But from a very quick inspection of the code, I don't see model.train() or model.eval() being called. I wonder if dropout is implemented at all. I would expect less overfitting with higher dropout, potentially at the expense of lower overall performance. But there should be a dropout value for which almost no overfitting takes place. Do you have an alternative explanation of why dropout is not working here?

Q7 Ok.

Q8. Ok.

Q9. For $n$ to be a valid regularizer in Eq. 2, it needs to be a differentiable function of $\theta$. Can you explain how this is the case? Otherwise, it is not a regularizer; it is something else.

References

  1. Schulman et al. 2017. Proximal Policy Optimization Algorithms.
Author Comment

Thank you for your continued support in reviewing our response. Please refer to the following responses to your further questions.


Q2. About parameter efficiency.

A2. The parameter efficiency of GLIDER contains three aspects. The most attractive aspect is reflected in our hierarchical structure. We propose to share the models for the high- and low-level policies, and differentiate them using a hierarchy prompt that specifies the level of current inputs. This design benefits from harnessing LLMs’ powerful capability to perform in-context learning, i.e., tackling a series of complex tasks by feeding lightweight prompts to a single foundation model. In contrast, traditional hierarchical RL methods [1-3] usually train independent models at each level, resulting in a multiplication of model parameters. Then, we share the backbone between the actor and critic, and use PEFT as two additional techniques to further improve parameter efficiency. All these designs work as a whole to provide an integrated parameter-efficient architecture that addresses the significant challenge of LLMs tackling long-horizon interactive tasks.

Further, we compare GLIDER to three additional ablations to verify the effectiveness of our parameter-efficient architecture. GLIDER can: i) save 50% of model parameters with a 1% performance loss compared to using separate models for the two levels; ii) save 50% of model parameters with a 3% performance loss compared to using separate models for the actor and critic; and iii) save 99% of trainable model parameters with a 3% performance loss compared to full parameter fine-tuning.

Method                         | Trainable parameters | Total parameters | Percentage | Score
GLIDER                         | 0.18 GB              | 30.22 GB         | 0.58%      | 68.34
Decouple high and low policies | 0.36 GB              | 60.44 GB         | 0.58%      | 69.05
Decouple actor and critic      | 0.23 GB              | 60.19 GB         | 0.38%      | 70.47
Full parameter fine-tuning     | 30.04 GB             | 30.22 GB         | 99.40%     | 70.50

[1] Data-efficient hierarchical reinforcement learning, NeurIPS 2018.

[2] Learning multi-level hierarchies with hindsight, ICLR 2019.

[3] Sub-policy adaptation for hierarchical reinforcement learning, ICLR 2020.


Q6. About overfitting and dropout in SFT.

A6. Thank you for your insightful comments on overfitting and dropout. We have added a hyperparameter analysis experiment to observe the relationship between the dropout value and overfitting. The new result is consistent with your insight: a higher dropout value indeed leads to less overfitting, or delays overfitting to later training epochs. We will add more hyperparameter analysis of the SFT stage in the appendix. In GLIDER, we use SFT to construct a base agent, and our focus is steering LLMs toward complex interactive tasks using the proposed offline hierarchical RL framework.

LoRA dropout | epoch 2 | epoch 3 | epoch 4 | epoch 5 | epoch 6 | epoch 7 | epoch 8
0.05         | 34.14   | 42.02   | 45.11   | 50.17   | 40.16   | 37.43   | /
0.1          | 31.90   | 39.23   | 44.10   | 49.10   | 49.72   | 47.24   | /
0.15         | 33.04   | 40.29   | 44.98   | 48.13   | 49.00   | 50.13   | 48.77
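For reference, a minimal sketch of how LoRA dropout is usually wired up with the peft library, together with the train/eval toggling the reviewer asked about (dropout is only active in training mode); the model name, rank, and dropout values here are illustrative, not the paper's settings:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.15,   # illustrative values
                    target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, config)

model.train()   # dropout active during SFT updates
# ... optimizer steps ...
model.eval()    # dropout disabled for evaluation and data collection
```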

Q9. About the regularizer of sentence length in SFT.

A9. We regularize the model to output relatively shorter sentences by re-weighting, where a smaller length $n$ induces a larger weight for the original SFT loss. More precisely, we achieve the effect of regularization by re-weighting the loss, rather than by including an additional loss term that is differentiable w.r.t. the model parameters. We will clarify this more precisely.

Further, we conduct an intuitive hyperparameter analysis experiment to observe the relationship between the length regularization ratio $\lambda$ in Eq. (3) and the average length of the output sentences. The new result shows that the model tends to output shorter sentences with a larger $\lambda$, verifying that our re-weighting successfully achieves the effect of regularization.

λ               | 0.5  | 1.0  | 1.5
Sentence length | 23.0 | 22.2 | 16.8
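For illustration, one plausible way to implement this re-weighting (an assumption on our part, since Eq. (3) itself is not reproduced here) is to scale each sample's negative log-likelihood by lam / n, so that shorter target sequences carry relatively more weight:

```python
import torch
import torch.nn.functional as F

def length_weighted_sft_loss(logits, targets, lengths, lam=1.0):
    """Sketch of a length re-weighted SFT loss (illustrative, not the paper's exact Eq. (3)).
    logits: (B, T, V); targets: (B, T) with -100 padding; lengths: (B,) float tensor."""
    per_token = F.cross_entropy(logits.transpose(1, 2), targets,
                                ignore_index=-100, reduction="none")   # (B, T)
    per_sample = per_token.sum(dim=1)                                  # -log pi_theta(a | s)
    return (lam * per_sample / lengths).mean()                         # weight 1/n per sample
```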

We were wondering if our responses have resolved your concerns. If you have any additional questions or suggestions, we would be happy to have further discussions.

Best regards,

The Authors

Official Review
Rating: 3

This paper integrates hierarchical reinforcement learning with LLMs, leveraging a high-level planner to decompose tasks and a low-level executor to perform actions. Experimental results on ScienceWorld and ALFWorld demonstrate significant performance gains over existing baselines, with strong adaptability to unseen tasks.

Questions for Authors

  1. How does the proposed method compare with GPT models using reflection or ReAct-based methods in terms of performance?

  2. Regarding the sub-task decomposition and intrinsic reward design, the paper primarily relies on environmental observations to determine whether a sub-task is completed. Is this design applicable when extended to other, possibly more complex, environments?

  3. During the prompt design process, did you observe any effects of the quality of high-level sub-tasks generated on the execution performance of the low-level actions?

Claims and Evidence

Yes

Methods and Evaluation Criteria

The benchmarks used—ScienceWorld and ALFWorld—have already demonstrated strong performance from advanced LLMs, such as the GPT-series models, leveraging prompt-based methods. These tasks are highly structured. The authors employ small-scale LLMs and apply a multi-stage training approach, yielding improvements over the base models; however, this outcome is somewhat expected. To more effectively evaluate the method’s impact, the authors could introduce GPT-based prompting baselines to compare whether small-scale LLMs with hierarchical training can match or even outperform larger models. Furthermore, testing on more complex environments, such as WebShop, would provide stronger evidence of the proposed approach’s scalability.

Theoretical Claims

The paper does not appear to provide formal mathematical proofs for theoretical claims, as it primarily focuses on empirical validation.

Experimental Design and Analysis

While the benchmark selection may be somewhat limited in scope, the experimental design and baseline comparisons are thorough and well-executed.

Supplementary Material

Yes. I reviewed the Appendix.

Relation to Prior Work

The hierarchical approach of first leveraging an LLM to generate high-level plans before executing them is not particularly novel. However, the paper’s parameter-efficient design is well-executed. The use of a shared LLM backbone for both high- and low-level policies, combined with lightweight fine-tuning techniques, enhances efficiency without significantly increasing computational costs.

Missing Essential References

  • RL-GPT: Integrating Reinforcement Learning and Code-as-policy
  • Thought Cloning: Learning to Think while Acting by Imitating Human Thinking

Other Strengths and Weaknesses

None

Other Comments or Suggestions

None

Author Response

Q1. The authors could introduce GPT-based prompting baselines (Reflexion or ReAct) to compare whether small-scale LLMs with hierarchical training can match or even outperform larger models.

A1. Following your advice, we compare GLIDER using a small-scale LLM (Llama-3-8B) to ReAct using a large-scale LLM (GPT-4). The results in the following table demonstrate that our small-scale LLM with hierarchical training can match prompting baselines built on much larger LLMs, with a 5%~98% performance improvement.

Method               | ScienceWorld (seen/unseen) | AlfWorld (seen/unseen)
GLIDER (Llama-3-8B)  | 77.43/68.34                | 71.56/75.38
ReAct (GPT-4)        | 67.32/65.09                | 44.29/38.05

Q2. The novelty of the hierarchical approach in LLM to generate high-level plans.

A2. We innovatively combine the strengths of LLMs with the structural advantages of hierarchical RL to address the significant challenge of tackling long-horizon interactive tasks. While building upon existing research in LLMs and hierarchical RL, we propose a novel integration of these areas, design a parameter-efficient and generally applicable hierarchy, and enable fast offline-to-online adaptation.


Q3. The paper relies on environmental observations to indicate sub-task completion. Is this design applicable when extended to other, possibly more complex, environments?

A3. This design alleviates the necessity for any manual or task-specific design and is naturally applicable in more complex environments. For example, in Minecraft, a broad task like "build a house" could be decomposed into simpler sub-tasks like "gather wood materials". The completion of that sub-task can be easily monitored through the "inventory changes" attribute of the environment observation. Our design remains robust at any complexity level, as long as the environment provides clear completion signals.
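As a toy illustration of such an observation-based completion check (the field names and sub-goal strings are hypothetical, not the paper's implementation):

```python
def intrinsic_reward(prev_obs: dict, obs: dict, subgoal: str) -> float:
    """Toy sub-task completion check read off the environment observation (illustrative only)."""
    if subgoal == "gather wood materials":
        # e.g. in Minecraft, success shows up as an inventory change
        before = prev_obs.get("inventory", {}).get("wood", 0)
        after = obs.get("inventory", {}).get("wood", 0)
        return 1.0 if after > before else 0.0
    # In text-only environments, success can be matched against the observation text,
    # e.g. "You move to the foundry." for a navigation sub-goal.
    target = subgoal.lower().replace("navigate to ", "")
    return 1.0 if target in obs.get("text", "").lower() else 0.0
```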


Q4. Did you observe any effects of the quality of high-level sub-tasks generated on the execution performance of the low-level actions?

A4. The quality of generated sub-tasks can affect the execution performance of the low-level policy. Intuitively, if the generated sub-task is too hard (e.g., close to the original task), the low-level policy can fail to accomplish it. That is why we need the hierarchical decomposition to break down the original task into simpler sub-tasks. By trial-and-error, the high-level policy adapts its behavior using environment-provided rewards, and is reinforced to generate increasingly fine-grained sub-tasks.

This “reinforcement” mechanism can be illustrated by a task evolution example: when the high-level policy generates a sub-task “heat substance” in the kitchen, the low-level policy attempts to use the stove but receives feedback that it's broken. After this failed attempt, the high-level policy adapts by generating a more specific sub-task “navigate to foundry to find blast furnace”.


Q5. Testing on more complex environments, such as WebShop.

A5. We will add a new benchmark, WebShop, to further validate GLIDER’s performance. Due to very limited time, we have only completed the offline dataset collection, and we are rushing to update them before the rebuttal deadline.


Q6. More extensive literature like RL-GPT and Thought Cloning.

A6. Thanks for your advice, and we will cite more relevant references.


Summary Response

We appreciate your time in reviewing our manuscript and your valuable comments. We have made a number of changes and justifications to address your concerns, and we are more than delighted to have further discussions to improve our manuscript. If our response has addressed your concerns, we would be grateful if you could re-evaluate our work.

Best regards,

Authors

Official Review
Rating: 3

The work focuses on long-term decision-making problems with LLMs. They propose GLIDER, which uses concepts of hierarchical reinforcement learning, namely decomposing a complicated task into small sub-tasks. The low-level controller can be goal-agnostic and finish its low-level task efficiently as directed by the high-level policy. The key selling points for the algorithm are:

  • enhanced exploration in long context tasks
  • fast online adaptation to non-stationary environments
  • strong transferability due to a task-agnostic low-level controller
  • parameter efficiency (achieved by sharing weights between the value and actor networks, and also across the high- and low-level policies)

The authors demonstrate performance on the ScienceWorld and ALFWorld benchmarks.

Questions for Authors

See above

Update after rebuttal

Thanks to the authors for addressing my questions, especially for running the experiment quantifying the performance loss. Please include the promised changes in the final version. I will maintain my inclination towards acceptance.

Claims and Evidence

Yes, most of the claims are supported by empirical evidence and ablation studies. I do not find empirical evidence on "fast online adaptation to non-stationary environments". Also, they claim parameter efficiency, which they achieve by weight sharing, but I think this should also be empirically validated by the loss in performance due to sharing weights compared to using separate weights.

Methods and Evaluation Criteria

Yes, the method is tried on long-context tasks. However, it would be good to apply it to multiple benchmarks; it is currently evaluated on a limited number of datasets.

Theoretical Claims

The paper is experimental (no theoretical claims).

Experimental Design and Analysis

Yes, the presented empirical evidence for the hierarchical structure is strong.

Supplementary Material

yes

Relation to Prior Work

I am not aware of similar work in the context of LLMs. The paper connects existing ideas from offline RL, hierarchical RL, SFT, etc.

Missing Essential References

no

Other Strengths and Weaknesses

Strengths:

  • Proposed an overall framework with offline hierarchical RL for LLMs that is parameter-efficient, generalizable to an extent, and scalable.
  • Presented strong evidence for the hierarchical breakup of long-context tasks.

Weakness: The work combines existing ideas and demonstrates their use in solving long-context tasks for LLMs. However, there is no striking novelty. Also, it is not clear to me if they really do need a value function here or if it can be surpassed. Also see the claims and evidence above for other weaknesses.

Other Comments or Suggestions

line 80: two LLM policies: which two? low and high level?

89: hierarchical token level not clear

remove 2nd bullet point of contribution or merge it, it's a feature of your algorithm (not a contribution)

3.1: add details about your state and action space

180: what is \sigma_N

213: around eqn 3: Do n_h, n_l belong inside the expectation? I don't understand why this would affect the gradient of the loss.

241: What is explicit modelling of the behaviour policy?

line 247-259: You first train low level and then fix it and later train only high level?

270: what is temporal abstraction knowledge?

305-311: Add more details about both tasks; "elementary science experiments" doesn't say much, some examples would be nice.

406: what is online fine-tuning? I am wondering why it is framed as generalization. Ideally, for generalization you would fine-tune on all tasks except test-conductivity, find-animal, and boil, and then check the trained model's performance on these tasks?

Author Response

Q1. About online fine-tuning, generalization, empirical evidence on fast online adaptation.

A1. Online fine-tuning refers to adapting a pretrained policy to unseen tasks using a small number of fine-tuning steps, i.e., few-shot generalization to new tasks. We train a policy using offline datasets on all tasks except test-conductivity, find-animal, and boil. Then, we fine-tune the trained policy by interacting with these tasks online. The evidence is in Sec. 4.4, Generalization Analysis via Online Fine-tuning (L394-L429), and is also shown in the following table. GLIDER obtains a 65%-226% performance improvement over the baselines.

Method  | test-conductivity | find-animal | boil
AC      | 0.30 ± 0.10       | 0.45 ± 0.05 | 0.40 ± 0.15
AWAC    | 0.35 ± 0.15       | 0.60 ± 0.10 | 0.45 ± 0.10
GLIDER  | 0.98 ± 0.02       | 0.99 ± 0.01 | 0.95 ± 0.05

Q2. Performance comparison between sharing weights and using different weights.

A2. We compare the full GLIDER to two more ablations: using different weights for the actor and critic, and using different weights for the high-level and low-level policies. The following table shows the performance on unseen tasks in ScienceWorld with the Llama-3-8B backbone. We save 50% of the model parameters with a <3% performance loss.

Method                         | Total parameters | Score
GLIDER                         | 30.22 GB         | 68.34
Decouple actor and critic      | 60.19 GB         | 70.47
Decouple high and low policies | 60.44 GB         | 69.05

Q3. The novelty of combining existing ideas and demonstrates its use case in solving long context tasks for LLMs.

A3. We innovatively combine the strengths of LLMs with the structural advantages of hierarchical RL to address the significant challenge of tackling long-horizon interactive tasks. We propose a novel integration of these areas, design a parameter-efficient and generally applicable hierarchy, and enable fast offline-to-online adaptation.


Q4. It's not clear to me if they really do need a value function here or if it can be surpassed.

A4. Classical RL algorithms learn a value function to backpropagate expected optimal returns using dynamic programming updates. GLIDER learns a value function to estimate the action advantage, and regress the advantage-weighted policy. The existence of a value function is a crucial property of RL, which enables the agent to adapt behaviors through trial-and-error rather than only imitating demonstrations. The ablation study (L354-L369) shows a 13.9%~33.5% performance improvement of pure ORL over SFT.

Backbone   | SFT   | ORL
Mistral-7B | 45.11 | 60.23
Gemma-7B   | 40.54 | 49.48
Llama-3-8B | 50.17 | 57.12

Q5. L241: What is explicit modeling of the behavior policy?

A5. GLIDER maximizes an advantage-weighted likelihood function, $\max E[\exp(A(s,a))\cdot\log\pi(a|s)]$. We learn a value function to estimate the action advantage and regress the advantage-weighted policy. The maximization of the likelihood $E[\log\pi(a|s)]$ corresponds to an explicit modeling of the behavior policy.


Q6. L213: How $n_h/n_l$ affects the gradient of loss.

A6. Here $n$ is the length of the sentence generated by the policy $\pi_{\theta}$. We divide the original SFT loss by the sentence length, as $-E[\log\frac{\pi_{\theta}(a|s)}{n}]=-E[\log\pi_{\theta}(a|s)] + \log n$. A smaller length $n$ induces a larger gradient of the loss, thus regularizing the model to output relatively shorter sentences.


Q7. L247-259: You first train low level and then fix it and later train only high level?

A7. During the SFT and ORL stages, we simultaneously train the high- and low-level policies. Then, at the offline-to-online stage, we fix the low-level skills and only fine-tune the high-level policy.


Q8. L270: what is temporal abstraction knowledge?

A8. Temporal abstraction refers to abstracting a sequence of temporal actions into a skill, gaining the potential to greatly speed planning and learning on large problems.


Q9. L89: hierarchical token level not clear.

A9. “The hierarchical token-level actors and sentence-level critics” means that we have a token-level actor and a sentence-level critic at both levels.


Q10. It would be good to apply it on multiple datasets.

A10. We will add a new benchmark WebShop. Due to very limited time, we have only completed the offline dataset collection, and we are rushing to update them before the rebuttal deadline.


Summary Response

We appreciate your time in reviewing our manuscript and your valuable comments. We have made a number of changes and justifications to address your concerns, and we are more than delighted to have further discussions to improve our manuscript. If our response has addressed your concerns, we would be grateful if you could re-evaluate our work.

Best regards,

Authors

Official Review
Rating: 4

This paper introduces GLIDER, a hierarchical reinforcement learning framework designed to enhance the decision-making capabilities of large language models (LLMs) in long-horizon tasks. The authors propose a two-layer structure where a high-level policy decomposes complex tasks into sub-goals, which a low-level controller executes using reinforcement learning.

This hierarchical decomposition allows for better exploration and improved long-term credit assignment, addressing the challenges faced by LLMs in sparse-reward scenarios. The method is designed to be parameter-efficient, leveraging shared LLM parameters between the high and low levels to reduce computational overhead.

The framework is evaluated on ScienceWorld and ALFWorld benchmarks, demonstrating significant performance improvements over prompt-based methods (ReAct, Reflexion) and fine-tuning baselines (NAT, ETO).

Questions for Authors

I would like to see the authors' thoughts on the scope of environments and the complexity of the training pipeline.

Claims and Evidence

In this paper, the authors provide thorough experimental comparisons and ablation studies on both ScienceWorld and ALFWorld. The results show that (1) hierarchical learning improves performance over strong baselines, (2) offline-to-online adaptation is faster and more stable, and (3) the approach is parameter-efficient across different model scales.

Each of these claims has corresponding quantitative evidence (e.g., success rates, adaptation curves, ablation metrics). No obvious discrepancies are found.

Methods and Evaluation Criteria

The authors picked well-known benchmarks (ScienceWorld, ALFWorld) that require text-based reasoning and long-horizon planning, which makes sense for testing a hierarchical RL approach. They also used standard offline RL and supervised fine-tuning methods, which makes it easier to judge how well their model handles efficiency, generalization, and adapting to new tasks.

Theoretical Claims

No formal proofs or heavy theoretical claims are made.

Experimental Design and Analysis

The training pipeline is well-documented and seems logically consistent. The authors show clear comparisons with multiple baselines and ablation studies.

Supplementary Material

Not quite. The code is shared in the supplementary material, but due to the emergency review there was not sufficient time to read through or reproduce it.

Relation to Prior Work

Not obvious.

Missing Essential References

No.

Other Strengths and Weaknesses

Strengths

  1. It nicely combines hierarchical RL with LLMs in an original way, showing strong performance on complex text-based benchmarks.
  2. The paper is pretty thorough, with lots of experiments and ablation studies that support their claims.
  3. The idea of reusing learned low-level skills for fast online adaptation is practical and smart.

Weaknesses

  1. The training pipeline is a bit complicated (three stages), which might be tough for others to reproduce. However, I think this could be alleviated by the code shared by the authors.
  2. The paper doesn’t get deeply into real-world deployments or more physically grounded tasks, so scope could be potentially limited.

Other Comments or Suggestions

No

Author Response

We deeply appreciate your positive and constructive comments for improving our paper. We would be incredibly grateful for your continued support in reviewing our response and offering further encouragement.


Q1. The training pipeline is a bit complicated (three stages), which might be tough for others to reproduce.

A1. We have provided code with the submission to ensure reproducibility of our main paper. Detailed tutorials and datasets will be released later.

While GLIDER involves a three-stage pipeline, each stage serves a distinct purpose. As shown in the following table, which compares the performance of baseline, ORL-only, and complete GLIDER implementations on unseen ScienceWorld tasks, the offline RL phase alone can still outperform baseline approaches. Adding SFT improves the results further by efficiently learning valid interactions, though at additional training cost. The optional online phase, enabled by AWAC's smooth offline-to-online transition, makes the overall process more robust and adaptable.

Method            | Mistral-7B | Gemma-7B | Llama-3-8B
NAT               | 50.79      | 44.98    | 48.76
ETO               | 51.85      | 47.84    | 52.33
GLIDER (ORL)      | 60.23      | 49.48    | 57.12
GLIDER (SFT+ORL)  | 65.14      | 58.50    | 68.34

Q2. The paper doesn’t get deeply into real-world deployments or more physically grounded tasks, so scope could be potentially limited.

A2. This work demonstrates significant algorithmic value through a hierarchical approach that decomposes complex tasks into high-level planning and low-level execution skills. Its effectiveness has been validated on sophisticated benchmarks featuring dynamic, sparse-reward environments and diverse task scenarios. These applications in environments with varied tasks and changing states demonstrate the robustness of the design, showing potential for embodied domains and enabling more sophisticated skill composition in robotic manipulation, game strategy learning, and autonomous navigation, where multi-level control and temporal abstraction are essential for handling real-world complexity.


Best regards,

Authors

Official Review
Rating: 4

This paper addresses the challenges of using Large Language Models (LLMs) for long-horizon decision-making tasks, specifically their difficulties with exploration and long-term credit assignment, especially in sparse-reward settings. To mitigate these challenges, the authors propose a framework called GLIDER (Grounding Language Models as Efficient Decision-Making Agents via Offline HiErarchical Reinforcement Learning). GLIDER introduces a parameter-efficient hierarchical structure to LLM policies. The framework employs a scheme where a low-level controller is supervised by abstract, step-by-step plans that are learned and instructed by a high-level policy. This hierarchical design aims to decompose complex problems into a sequence of chain-of-thought reasoning sub-tasks, thereby providing temporal abstraction to enhance exploration and learning in long-horizon tasks.

给作者的问题

Could you provide a more detailed comparison of GLIDER with other LLM-based approaches to decision-making, if any exist? Specifically, what are the key differences and advantages of GLIDER compared to methods that directly use LLMs for planning without the hierarchical RL structure?

In the experimental evaluation, could you provide a more detailed analysis of the parameter efficiency of GLIDER? Please include a comparison of the number of parameters used by GLIDER and the baseline methods.

Claims and Evidence

Claim: LLMs struggle with long-horizon decision-making tasks due to deficient exploration and long-term credit assignment, especially in sparse-reward scenarios.
Evidence: The paper cites the challenges of exploration and credit assignment as motivation for their approach. The introduction sets up this problem as a key issue in applying LLMs to decision-making. The experiments likely demonstrate improved performance in long-horizon, sparse-reward tasks compared to baseline methods, further supporting this claim.

Claim: The GLIDER framework, which introduces a parameter-efficient hierarchical structure to LLM policies, mitigates these challenges.
Evidence: The paper details the GLIDER framework, including the hierarchical structure with high-level and low-level policies. The experimental results, particularly comparisons with baseline methods, demonstrate the effectiveness of GLIDER in improving performance on long-horizon tasks. Ablation studies, if present, would further strengthen this claim by showing the contribution of the hierarchical structure.

Claim: The hierarchical design decomposes complex problems into a series of coherent chain-of-thought reasoning sub-tasks, providing flexible temporal abstraction to enhance exploration and learning.
Evidence: The paper explains how GLIDER learns and uses abstract, step-by-step plans to supervise the low-level controller. The results should show improved performance and exploration, suggesting that the decomposition into sub-tasks is beneficial. Visualizations or analyses of the learned sub-tasks and how they contribute to the overall solution would provide further support. Figure 9, for example, illustrates the hierarchical decomposition.

Methods and Evaluation Criteria

Methods:

  • Hierarchical Reinforcement Learning: The decomposition of the problem into high-level planning and low-level control is a logical approach for long-horizon tasks. It allows the high-level policy to focus on strategic planning while the low-level policy executes the plan, which aligns with the "divide-and-conquer" principle mentioned in the paper.
  • Language Models for Decision-Making: Utilizing LLMs to generate and interpret plans is novel and leverages the reasoning capabilities of LLMs. The integration of LLMs within the reinforcement learning framework seems appropriate for tasks that require complex reasoning and planning.
  • Parameter-Efficient Hierarchy: The emphasis on parameter efficiency is important for scaling the method and making it practical.

Evaluation Criteria:

The evaluation criteria and benchmark datasets seem reasonable for assessing the effectiveness of GLIDER.

  • Long-Horizon Decision-Making Tasks: Evaluating the method on tasks that require long-term planning and execution is essential for demonstrating its ability to handle the challenges it aims to address. The paper should define these tasks clearly and explain why they are appropriate benchmarks.
  • Sparse-Reward Scenarios: Evaluating the method in sparse-reward settings is critical, as this is where exploration and credit assignment are most challenging. The evaluation should demonstrate that GLIDER outperforms other methods in these scenarios.
  • Comparison with Baseline Methods: Comparing GLIDER with appropriate baseline methods, including other reinforcement learning algorithms and potentially LLM-based approaches, is crucial for demonstrating its superiority. The choice of baseline methods should be justified.
  • Metrics: The paper should employ appropriate metrics to evaluate performance, such as success rate, return, or other task-specific metrics. These metrics should be clearly defined and relevant to the evaluation goals.
  • Ablation Studies: If included, ablation studies can provide valuable insights into the contribution of different components of the GLIDER framework. This helps to understand the importance of the hierarchical structure, the planning mechanism, and other design choices.

Theoretical Claims

The paper focuses primarily on an empirical approach, presenting a novel framework (GLIDER) and demonstrating its effectiveness through experiments. While the core contribution is algorithmic and empirical, there are some implicit theoretical claims.

The paper does not contain formal mathematical proofs for theoretical claims. The emphasis is on the empirical validation of the GLIDER framework. This is not necessarily a weakness, as many papers in machine learning focus on empirical contributions. However, it's important to recognize that the theoretical underpinnings are largely based on established principles and demonstrated through experiments rather than formal proofs.

Areas for Potential Improvement:

While not strictly required, the paper could benefit from a more detailed discussion of the theoretical implications of using LLMs in hierarchical RL. This could involve relating the approach to existing theoretical results in hierarchical RL or discussing the limitations and potential failure modes of the approach from a theoretical perspective.

Experimental Design and Analysis

The experimental design appears to be sound and generally supports the claims made in the paper. The authors have attempted to evaluate GLIDER in a comprehensive manner, considering various aspects of its performance.

Supplementary Material

The detailed example provided in the document serves the purpose of supplementary material by offering a more in-depth view of a key aspect of the research. It enhances the reader's understanding and supports the claims made in the paper.

Relation to Prior Work

The paper innovatively combines the strengths of LLMs (reasoning, language understanding) with the structural advantages of hierarchical reinforcement learning to improve long-horizon decision-making. It builds upon existing research in LLMs, hierarchical RL, and planning, but it proposes a novel integration of these areas.

Missing Essential References

By discussing these related works, the paper can provide a more comprehensive context for its contributions and better position GLIDER within the broader scientific literature.

Other Strengths and Weaknesses

Strengths:

  • Originality: The paper presents a novel approach by effectively combining LLMs and hierarchical reinforcement learning for long-horizon decision-making. The way it leverages LLMs to generate and follow abstract plans within a hierarchical framework is a creative combination of existing ideas.
  • Significance: The paper addresses a significant challenge in applying LLMs to complex, sequential tasks. Overcoming the limitations of LLMs in exploration and credit assignment has the potential to broaden the applicability of LLMs in various domains, including robotics, game playing, and automation.
  • Clarity: The paper is generally well-written and easy to follow. The authors clearly explain the GLIDER framework and its components. The figures and examples, such as Figure 9, aid in understanding the proposed approach.

Weaknesses:

  • Limited Theoretical Analysis: As mentioned earlier, the paper lacks strong theoretical underpinnings. While the empirical results are promising, a more in-depth theoretical analysis of the proposed approach would strengthen the paper.
  • Scope of Evaluation: While the paper evaluates GLIDER on relevant tasks, expanding the evaluation to a broader range of complex, long-horizon tasks would further demonstrate its generalizability and robustness.
  • Analysis of Limitations: The paper could benefit from a more thorough analysis of the limitations of GLIDER. Discussing potential failure cases, scenarios where GLIDER might struggle, and the computational costs associated with the approach would provide a more balanced perspective.

Other Comments or Suggestions

  • Typos and Grammar: A careful pass for typos and grammatical errors is recommended to improve the overall polish of the paper.
  • Clarification of Terminology: While generally clear, some terminology could be further clarified. For example, explicitly defining what constitutes a "long-horizon" task in the context of the paper would be helpful.

Ethics Review Concerns

N/A

Author Response

We deeply appreciate your constructive comments for improving our paper. We would be incredibly grateful for your continued support in reviewing our response and offering further encouragement.


Q1. What are the advantages of GLIDER compared to direct LLM planning approaches?

A1. We have conducted thorough ablation studies on the hierarchical structure. As shown in the following table, hierarchical models consistently outperform their non-hierarchical counterparts across all LLM backbones, demonstrating the necessity of our approach.

Backbone   | w/o Hier | w/ Hier
Mistral-7B | 47.30    | 65.14
Gemma-7B   | 50.16    | 58.50
Llama-3-8B | 53.94    | 68.34

Q2. What is the parameter efficiency of GLIDER and how does it compare with baseline methods in terms of parameter count?

A2. Based on our additional experimental evaluation on ScienceWorld using Llama-3-8B as the base model, GLIDER demonstrates parameter efficiency in two key aspects: (1) with LoRA, it requires only 0.18 GB of trainable parameters while achieving superior performance over the baselines ETO and NAT; (2) GLIDER with LoRA (0.58% of parameters) achieves performance comparable to full fine-tuning (99.40% of parameters), showcasing effective parameter utilization through prompting-based hierarchical control.

Method        | Trainable parameters | Total parameters | Percentage | Performance
GLIDER (LoRA) | 0.18 GB              | 30.22 GB         | 0.58%      | 68.34
GLIDER (Full) | 30.04 GB             | 30.22 GB         | 99.40%     | 70.50
ETO           | 0.05 GB              | 29.97 GB         | 0.17%      | 52.33
NAT           | 0.05 GB              | 29.97 GB         | 0.17%      | 48.76

Q3. Why are the benchmarks appropriate?

A3. ScienceWorld and ALFWorld are ideal benchmarks for evaluating GLIDER, featuring dense and sparse reward structures respectively. The complex, non-deterministic nature of both environments requires the agent to continuously adjust its strategy rather than pre-computing a complete solution path, validating GLIDER's real-time planning abilities.


Q4. How does GLIDER handle sparse-reward scenarios?

A4. GLIDER addresses sparse-reward challenges through its hierarchical structure. High-level planning breaks down complex goals into manageable subgoals, while low-level execution ensures efficient exploration and clearer credit assignment for each subtask. This is validated by our strong performance compared to the baselines in AlfWorld, a sparse-reward benchmark.

Method  | AlfWorld (seen)  | AlfWorld (unseen)
NAT     | 60.71            | 59.70
ETO     | 64.29            | 64.18
GLIDER  | 71.56 (↑11.31%)  | 75.38 (↑17.45%)

Q5. Grammar and Clarification of Terminology

A5. We appreciate the reviewer's suggestions and will thoroughly proofread the manuscript. Regarding the "long-horizon" task, it refers to tasks that cannot be fully planned at the beginning but require continuous planning throughout a long sequence of steps. For example, when "boiling water", if the sink breaks, the agent must replan to find alternative water sources.


Best regards,

Authors

Final Decision

The paper addresses the agent planning problem with a novel hierarchical RL algorithm using LLM policies. After the rebuttal, all 6 reviewers are satisfied with the writing quality and the thorough experiments. A solid contribution. Hence, accept.