Skill Reinforcement Learning and Planning for Open-World Long-Horizon Tasks
Abstract
We address the problem of learning diverse long-horizon tasks in open-world environments via reinforcement learning and planning over basic skills.
Reviews and Discussion
This paper studies the problem of skill learning and planning for solving long-horizon tasks. The proposed method first learns basic skills via RL with intrinsic rewards, and leverages LLM to construct a skill dependency graph, which is then used to search the skill sequence for completing the task. Experiments are performed on Minecraft tasks to validate the idea.
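For readers unfamiliar with the setup, the skill dependency graph produced by the LLM can be pictured as a mapping from each skill to the skills (and quantities) it consumes; the sketch below uses hypothetical skill names and quantities, not the graph actually generated in the paper.

```python
# Minimal sketch of a skill dependency graph as described above: each skill
# maps to the prerequisite skills (and counts) it consumes. Skill names and
# quantities are illustrative placeholders, not taken from the paper.
skill_graph = {
    "craft_wooden_pickaxe": {"get_planks": 3, "get_stick": 2, "place_crafting_table": 1},
    "get_stick":            {"get_planks": 2},
    "place_crafting_table": {"get_planks": 4},
    "get_planks":           {"get_log": 1},
    "get_log":              {},  # basic skill: find and chop a tree
}
```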
Strengths
The proposed pipeline that leverages skill reuse for solving long-horizon tasks is straightforward and reasonable;
The method achieves good results in the evaluation;
The paper is well-organized and easy to read.
Weaknesses
The method relies on pre-defined basic skills, which require human understanding of the task decomposition;
Leveraging a pre-trained LLM to generate skill relationships seems not very reliable; the quality of the skill dependency graph depends heavily on the quality of the user-defined prompts.
Only a few baselines are compared in the experiments.
Questions
How can the completeness of the pre-defined skill set be ensured?
How much human intervention is needed to generate the skill dependencies during preprocessing? And how can the correctness of the LLM-generated skill graph be ensured?
Since this work also requires expert knowledge when writing the prompts for the LLM, it would be interesting to see how the proposed method compares to skill-planning baselines that use symbolic reasoning for planning [1][2][3].
Some relevant references, please consider discussing and citing:
[1] LEAGUE: Guided Skill Learning and Abstraction for Long-Horizon Manipulation, RA-L 2023;
[2] Symbolic Plans as High-Level Instructions for Reinforcement Learning, ICAPS 2020;
[3] PEORL: Integrating Symbolic Planning and Hierarchical Reinforcement Learning for Robust Decision-Making, IJCAI 2018.
This paper aims to accomplish long-horizon tasks in the Minecraft environment. Instead of trying to learn the tasks directly with RL algorithms, the paper decomposes the long-horizon tasks into atomic subtasks. It then uses a planning module to execute the atomic subtasks sequentially.
This paper indicates that learning the subtasks with vanilla RL can be hard and sample-inefficient. To overcome this problem, they use a semi-rule-based Finding-skill to position the agent close to the target objects/mobs, which makes the subtasks much easier to accomplish. Additionally, the paper uses intrinsic rewards generated from a pretrained MineCLIP model.
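To make the intrinsic-reward step concrete, here is a minimal sketch of a CLIP-style similarity reward, assuming pretrained video and text encoders are available; the function names and reward shaping are illustrative assumptions, not the paper's actual MineCLIP implementation.

```python
import torch
import torch.nn.functional as F

def clip_intrinsic_reward(video_encoder, text_encoder, frames, task_prompt):
    """Illustrative CLIP-style intrinsic reward: the cosine similarity between
    an embedding of the agent's recent frames and an embedding of the task
    prompt. The encoders stand in for a pretrained video-text model such as
    MineCLIP; they are assumed here, not the paper's actual interfaces."""
    with torch.no_grad():
        v = F.normalize(video_encoder(frames), dim=-1)      # (1, d) video embedding
        t = F.normalize(text_encoder(task_prompt), dim=-1)  # (1, d) text embedding
    return (v * t).sum(dim=-1).item()  # dense reward added to the environment reward
```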
After learning the atomic subtasks, the proposed model uses depth-first search on the subtask dependency graph generated by the LLM to determine the sequence of subtasks to be executed.
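As a rough illustration of how such a depth-first search could turn the dependency graph into an executable subtask sequence (the skill names, quantities, and inventory handling below are simplifying assumptions, not the paper's algorithm):

```python
def plan_skill_sequence(skill_graph, target, inventory=None):
    """Post-order DFS over a skill dependency graph: prerequisites are planned
    before the skill that consumes them, and skills already covered by the
    agent's inventory are skipped. Purely illustrative."""
    inventory = dict(inventory or {})
    plan = []

    def visit(skill, count):
        have = inventory.get(skill, 0)
        need = count - have
        inventory[skill] = max(have - count, 0)
        for _ in range(max(need, 0)):
            for dep, dep_count in skill_graph.get(skill, {}).items():
                visit(dep, dep_count)
            plan.append(skill)

    visit(target, 1)
    return plan

skill_graph = {
    "craft_stick": {"get_planks": 2},
    "get_planks":  {"get_log": 1},
    "get_log":     {},
}
print(plan_skill_sequence(skill_graph, "craft_stick"))
# -> ['get_log', 'get_planks', 'get_log', 'get_planks', 'craft_stick']
```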
Experiments demonstrate improved success rates on 40 Minecraft tasks.
Strengths
The main merit of this paper is the decomposition of long-horizon tasks into atomic subtasks, which are much easier and more sample-efficient to learn compared to the original task. Although the idea of task decomposition is not novel in the Minecraft environment, this paper uses a novel way to perform the decomposition via its "Finding-skill".
Weaknesses
My main concern with the proposed Plan4MC model is its generalizability to other environments. The model relies heavily on the Finding-skill, which navigates the agent to initial positions from which the atomic subtasks are relatively easy to accomplish. Specifically, the Finding-skill divides the Minecraft terrain into grids and uses a learned policy to navigate the agent to different grids. Within each grid, the agent uses lidar information to move close to the target mob. This lidar information is not available to humans when playing Minecraft and is not used in many other related works. Although the authors claim that the lidar could easily be replaced with a CV model, this contradicts the paper's motivation that Minecraft data is hard to acquire, since such a CV model would need to be trained on a large amount of Minecraft data.
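To clarify the structure being criticized, the two-level Finding-skill described above could be sketched roughly as follows; every function and interface name here is a placeholder (in particular the lidar query), not the paper's actual code.

```python
def finding_skill(env, high_level_policy, goto, target, max_steps=500):
    """Illustrative two-level Finding-skill: a learned high-level policy picks
    the next grid cell to explore, a low-level controller walks there, and a
    privileged lidar query (the environment-specific part at issue) steers the
    agent toward the target once it is sensed. All names are placeholders."""
    obs = env.reset()
    for _ in range(max_steps):
        detection = env.lidar_scan(target)        # hypothetical privileged sensing
        if detection is not None:
            return goto(env, detection.position)  # approach, then hand over to the next skill
        grid_goal = high_level_policy(obs)        # explore: choose the next grid cell
        obs = goto(env, grid_goal)
    return obs
```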
Empirically, although the ablation studies show that the Finding-skill helps improve the overall success rate of Plan4MC, it would be nice for the authors to compare it with other models such as DreamerV3 and STEVE-1 [1], which have demonstrated very good performance on most tasks reported in the paper, including cutting trees and mining stone.
Another main concern is that, although the paper argues that Minecraft data is hard to acquire and that sample-efficient RL approaches are therefore desirable, the proposed method itself still requires extensive Minecraft data. Specifically, when training the atomic subtasks, it uses intrinsic rewards from MineCLIP, a CLIP model trained on Minecraft videos paired with captions. Therefore, Plan4MC actually "uses" a large amount of Minecraft demonstration video under the hood.
The proposed model uses LLMs to acquire task-related knowledge. However, the paper does not make clear how it relates to models such as Voyager [2], which uses LLMs to extract knowledge in Minecraft for lifelong learning.
[1] Lifshitz, Shalev, Keiran Paster, Harris Chan, Jimmy Ba, and Sheila McIlraith. "STEVE-1: A Generative Model for Text-to-Behavior in Minecraft." arXiv preprint arXiv:2306.00937 (2023).
[2] Wang, Guanzhi, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. "Voyager: An open-ended embodied agent with large language models." arXiv preprint arXiv:2305.16291 (2023).
Questions
Is it possible that the Finding-skill can be made more general for other domains?
How does the proposed atomic subtask model compare to other methods such as STEVE-1?
The paper focuses on creating a generalized agent capable of addressing multiple long-horizon tasks in open-world settings, specifically in the Minecraft game environment.
To do so, the authors employ a hierarchical decision-making approach that integrates two main schemes: (1) an LLM-based high-level skill-sequence planning scheme that exploits the prior knowledge of the LLM and a skill dependency graph, and (2) a fine-grained basic-skill learning scheme, including the Finding-skill, which provides a common navigation technique in Minecraft's open-world setting.
These schemes are implemented as a single framework, Plan4MC, and tested across 40 different Minecraft tasks. The skill planning and learning schemes are designed specifically for long-horizon open-world Minecraft tasks and can be seen as the main contributions of the paper. The paper also includes a performance evaluation and ablations of the schemes.
Strengths
Within the Minecraft game context, Plan4MC's scheme of learning fine-grained basic skills, including the general Finding-skill, seems promising for improved open-world exploration.
Regarding performance, Plan4MC outperforms the MineAgent (MineDojo) baseline and the Plan4MC variant without the Finding-skill for learning basic skills. It also provides more robust performance than the interactive LLM-based skill planning baseline.
Weaknesses
The paper primarily demonstrates Plan4MC's capabilities within the Minecraft environment, offering limited insight into the broader applicability of the two key schemes: LLM-based skill planning with a skill graph and the general Finding-skill exploration. While Minecraft indeed offers a diverse range of long-horizon tasks in open-world settings, Plan4MC does not sufficiently establish that the two schemes can be generalized beyond this specific context.
The two schemes appear tailored for the Minecraft environment settings (see the questions); thus, it would enhance the paper's contributions if it included experiments and analyses that demonstrate the broader applicability of these proposed schemes beyond the Minecraft context. If conducting experiments outside of Minecraft is considered too labor-intensive, a more in-depth discussion on potential generalization, beyond what's presented in Appendix I, would be beneficial.
Questions
When formulating and implementing the Finding-skill to provide a good initialization for other skills, it is assumed that target items are uniformly distributed across the world surface. Is this assumption valid in general open-world environments? If the distribution of target items is influenced by the surrounding environment (i.e., not uniformly distributed), wouldn't the proposed RL approach (based on a 10x10 grid and state-counting exploration) be inefficient? It seems that leveraging intrinsic reward techniques [1, 2, 3] might be more effective for efficient exploration (a minimal sketch of such a bonus is given after the references below). Could the authors clarify the contribution and generalization of the Finding-skill not just for the Minecraft environment but for general open-world tasks?
[1] Burda, Yuri, et al. "Exploration by random network distillation." arXiv preprint arXiv:1810.12894 (2018).
[2] Badia, Adrià Puigdomènech, et al. "Never give up: Learning directed exploration strategies." arXiv preprint arXiv:2002.06038 (2020).
[3] Seo, Younggyo, et al. "State entropy maximization with random encoders for efficient exploration." International Conference on Machine Learning. PMLR, 2021.
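For concreteness, the kind of intrinsic reward suggested above (e.g., random network distillation [1]) can be sketched as follows; the network sizes and learning rate are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    """Illustrative RND bonus: the prediction error of a trained predictor
    against a fixed, randomly initialized target network is large for novel
    observations, giving a dense exploration reward without discretizing the
    world into grid cells. Sizes and hyperparameters are arbitrary."""
    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)  # the target network is never trained
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=1e-4)

    def bonus(self, obs):
        err = (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)
        self.opt.zero_grad()
        err.mean().backward()        # train the predictor on visited states
        self.opt.step()
        return err.detach()          # per-state intrinsic reward
```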
The paper leverages ChatGPT's prior knowledge to construct a skill graph tailored for Minecraft tasks. In scenarios where the target environment is not well covered by the LLM's training data (unlike Minecraft), how can this LLM-based skill planning scheme be applied? Could the authors clarify the originality and superiority of Plan4MC's LLM-based skill graph generation compared to similar techniques such as [4, 5, 6], where an LLM is leveraged for high-level skill planning?
[4] Ahn, Michael, et al. "Do as i can, not as i say: Grounding language in robotic affordances." arXiv preprint arXiv:2204.01691 (2022).
[5] Huang, Wenlong, et al. "Language models as zero-shot planners: Extracting actionable knowledge for embodied agents." International Conference on Machine Learning. PMLR, 2022.
[6] Song, Chan Hee, et al. "LLM-Planner: Few-shot grounded planning for embodied agents with large language models." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
Minor typo: on page 23, in Table 9, the skill icons for stone sword, stone axe, and iron trapdoor are shown with the wrong images.
The authors use hierarchical policies to solve tasks in the Minecraft game. Minecraft is a huge game and therefore has large exploration bottlenecks. To tackle this, they break tasks down into subtasks using ChatGPT and then construct a DAG over the skills. When presented with a task, they search over the DAG using DFS to construct a plan. As the agent's state changes, they re-search the DAG to construct a new plan and the next goal.
They pretrain a "finding" policy with state-count-based exploration for efficient navigation. The finding policy is a combination of a high-level policy that emits goals for a fixed, pretrained low-level policy. They then train policies for manipulation and crafting tasks, which can use the finding policy to locate relevant blocks or mobs.
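As a rough sketch of the state-count-based exploration bonus mentioned above, assuming the agent's position is discretized into grid cells (the cell size and bonus form are illustrative assumptions, not the paper's exact choices):

```python
from collections import defaultdict

class CountBonus:
    """Illustrative count-based exploration bonus: discretize the agent's
    (x, z) position into grid cells and reward 1/sqrt(visit count), so rarely
    visited cells yield larger intrinsic rewards."""
    def __init__(self, cell_size=10.0):
        self.cell_size = cell_size
        self.counts = defaultdict(int)

    def __call__(self, x, z):
        cell = (int(x // self.cell_size), int(z // self.cell_size))
        self.counts[cell] += 1
        return 1.0 / (self.counts[cell] ** 0.5)
```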
They find that their method outperforms other methods at similar levels of sample efficiency, and that it can even craft iron tools.
Strengths
- Using LLMs to construct a DAG is interesting and original
- Their method seems fairly sample-efficient, and it seems promising to extract domain-specific knowledge from LLMs for the purpose of planning.
Weaknesses
- The method seems very cobbled together. It is hard to tell what would generalize to other environments because so much environment-specific information is used. For example, choices like using DQN to train the low-level policy but PPO to train the high-level policy are odd and unmotivated.
- The paper's organization and writing seem very poor to me. Only after reading one-half to two-thirds of the paper did I understand most of the technical details (and I am still unclear on many of them). I think the authors could spend a lot of time revamping their abstract, introduction, method diagram, etc. to be clearer.
- I found the language used in the abstract and the introduction hard to follow. The authors go into some detail, but not enough to fully understand the method, yet too much, such that I ended up confused by the end of those sections. I would recommend either (a) keeping the method description more high-level in these sections or (b) describing the method in enough detail that one isn't left confused.
- In the abstract, it is unclear what "propose a Finding-skill to improve the sample efficiency for training all the skills" means. It is still unclear when mentioned in the introduction.
- There are many mentions of a "skill search algorithm" without simply saying that it is DFS.
- There is a long history of hierarchical RL. I recommend the authors include a discussion of this related work.
- It is unclear to me that one could easily pre-generate a DAG in more realistic settings. It would be nice to have some discussion around this.
Questions
- It was unclear to me in Section 3.1 what the goals of the high-level finding policy are. What do you mean by decomposing each 10x10 area into grids?
- Section 3.2. What is the goal distribution?
- Did the LLM skill graph do anything you couldn't have done just by using structured knowledge of the skill tree in Minecraft?
- For the Interactive LLM planner and other ablations, do you include the finding skill?
- What FPS does the game operate at?
- Do you use automated crafting actions or the game interface?
The authors propose a combination of techniques for learning to play Minecraft, based on prompting ChatGPT and an intrinsic exploration bonus. While the combined method accomplishes a number of simple tasks, its performance lags behind standard RL baselines that learn without external knowledge, the comparison to baselines is lacking, and the reviewers agree that the scientific contribution of the presented ideas is too limited and too specific to the Minecraft environment. Generalizing the method to show compelling results on other RL environments and comparing to Minecraft baselines from the literature could strengthen the submission.
Why not a higher score
Lack of baselines from the literature, low empirical performance, method not general enough
Why not a lower score
N/A
Reject