MCU: An Evaluation Framework for Open-Ended Game Agents
Abstract
Reviews and Discussion
This paper presents Minecraft Universe (MCU), a novel evaluation framework designed to benchmark open-ended AI agents in Minecraft. The authors develop a system with three main innovations: a large-scale collection of atomic tasks spanning diverse categories and subcategories; an LLM-based task configuration generator that creates diverse task initialization conditions; and a VLM-based automatic evaluation system that rates agent performance across six dimensions. Their experiments with state-of-the-art Minecraft agents (including GROOT, STEVE-I, and VPT variants) reveal significant limitations in current models when evaluated on MCU.
Questions for Authors
Please discuss how MCU enables researchers to assess open-ended capabilities that are not present in other, non-Minecraft benchmarks.
Claims and Evidence
I find that the claims in this submission are well-supported, clearly written and easy to follow.
Methods and Evaluation Criteria
Yes, they do
Theoretical Claims
There are no theoretical claims.
Experimental Design and Analyses
I examined the experimental designs and analyses in this paper, particularly focusing on their automatic evaluation (AutoEval) system and the agent benchmarking experiments.
The validation of AutoEval appears methodologically sound. The authors compare their VLM-based evaluation approach against human annotations using appropriate metrics (F1 scores for comparative evaluations and correlation coefficients for absolute ratings). They collect a reasonably sized dataset of 500 trajectories across 60 tasks with human annotations from 20 expert Minecraft players. The correlations only drop for more subjective metrics like creativity, which is intuitive.
For the agent evaluation experiments, they test four foundation agents (GROOT, STEVE-I, VPT(BC), and VPT(RL)) on a diverse subset of tasks with multiple random seeds, which is a reasonable approach. They evaluate both inter-task generalization (across different task categories) and intra-task generalization (across difficulty levels), which addresses important dimensions of agent capabilities.
The main limitations I noticed were: 1) LLM diversity in task generation may be limited; 2) it is not clear how creativity is measured, e.g., is it more creative if I paint 50% of the house a different colour vs. 20%? Is it the steps taken? Do you measure whether the agents take any kind of new steps in the trajectory of building a house, or whether they take creative steps that, while possibly problematic, could make sense for how to build a house?
Supplementary Material
I went through the related work, the environment setting, task generation, and the prompts.
Relation to Broader Scientific Literature
I missed connecting this work to the Formal Methods + RL literature. Task definition and composition in MCU has strong connections to this body of work, much of which employed the Minecraft-inspired environment from [1]. I would strongly encourage the authors to include a paragraph linking this paper to that line of research, since it feels natural for future FM + RL work to transition to MCU as a default benchmark.
The paper does include references to previous works on Minecraft as an environment, open-ended agents, and LLM-as-a-judge methods. However, I also missed connections to other open-ended benchmarks like Habitat-Lab or NetHack, and what aspects of intelligence MCU measures that those other benchmarks do not.
[1] Andreas, Jacob, Dan Klein, and Sergey Levine. "Modular multitask reinforcement learning with policy sketches." International conference on machine learning. PMLR, 2017.
Essential References Not Discussed
None that I am aware of
Other Strengths and Weaknesses
--
Other Comments or Suggestions
If possible, I would include a brief summary of the literature overview in the introduction. Personally, I am not a big fan of relegating the entire related-work discussion to the appendix.
LLM diversity on task generation may be limited
Thank you for this thoughtful observation. To mitigate prompt-induced bias and encourage diversity in LLM-generated configurations, we explicitly design our prompts to promote variability in initialization elements such as biome, weather, and player state (Lines 49–54).
We validate our current prompt through:
- Case study: For the task “craft a crafting table”, we ran 10 generations. The results exhibited wide variation in commands (e.g., /time set day, /give oak_log, /setblock blue_bed), including different wood types like birch, oak, and spruce.
- Quantitative analysis: For 5 randomly selected tasks, we conducted 100 generations each. The item-level overlap with few-shot examples was only 2%, and on average, 89% of the commands in each task were unique.
We will also keep improving our prompt to achieve better configuration generation diversity.
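For completeness, the kind of overlap statistics reported in the quantitative analysis above can be computed in a few lines of code. The sketch below is illustrative only: the configuration format and helper names (`command_stats`, `few_shot_items`) are hypothetical and are not taken from the MCU codebase.

```python
# Illustrative only: command_stats and few_shot_items are hypothetical names,
# and the config format is assumed, not taken from the MCU codebase.
def command_stats(configs, few_shot_items):
    """configs: list of command lists, e.g. [['/time set day', '/give @p oak_log 4'], ...]."""
    all_cmds = [cmd for cfg in configs for cmd in cfg]
    unique_ratio = len(set(all_cmds)) / max(len(all_cmds), 1)

    # Item-level overlap: fraction of /give items that also appear in the
    # few-shot examples embedded in the prompt.
    items = [cmd.split()[2] for cmd in all_cmds
             if cmd.startswith("/give") and len(cmd.split()) > 2]
    overlap = sum(1 for it in items if it in few_shot_items) / max(len(items), 1)
    return unique_ratio, overlap

# Example with two hypothetical generations for "craft a crafting table":
configs = [
    ["/time set day", "/give @p oak_log 4"],
    ["/weather rain", "/give @p birch_log 6"],
]
print(command_stats(configs, few_shot_items={"oak_log"}))
```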
It is not clear how creativity is measured, e.g. is it more creative if I paint 50% of the house of a different colour vs 20%? Are these steps taken? Do you measure if agents take any kind of new steps in the trajectory of building a house or if they take creative steps that, while might be problematic, could make sense of how to build a house?
Creativity is not simply a matter of percentages. Painting 50% or 20% of the house a different color doesn’t inherently determine creativity. What truly matters is the intention and impact behind the choice. If painting a portion of the house in a different color introduces a unique aesthetic, challenges norms, or enhances the overall design, then it can be considered creative, regardless of the percentage. Even if some steps are risky, they can still be creative if they contribute to the goal and add diversity.
Here are the specific criteria for "building a house" generated by the criteria prompt in Appendix G.3:
Creative Attempts: creative attempts exhibited by the agent during doing task
- e.g. using different materials to create a visually appealing design
- e.g. unique themes or styles applied to the house, use of diverse materials for decoration, innovative lighting techniques
Also note that the VLM is guided to consider, but not be limited to, the examples provided.
Please discuss how MCU enables researchers to assess open-ended capabilities that are not present in other non-Minecraft benchmarks
In the Introduction and Section 2.1 of our paper, we explain Minecraft’s unique suitability as a benchmark for open-ended capabilities and our motivation for selecting it. In brief:
1. Vast State Space
Minecraft’s state space is extraordinarily large—reputedly surpassing the number of atoms in the universe—allowing for virtually unlimited configurations of biomes, blocks, and entities. In contrast, many alternative benchmarks (e.g., Habitat-Lab) rely on pre-scanned real-world data, which makes memorization rather than genuine generalization feasible.
2. High Task Diversity
MCU defines 3,452 atomic tasks that can be combined into a staggering number of compositional tasks covering navigation, crafting, combat, and more. This level of intra- and inter-task diversity is uncommon in other platforms, making it difficult to replicate the breadth and depth of challenges found in Minecraft. Other open-ended benchmarks that primarily focus on a single final goal, such as finding the Oracle in NetHack, are limited in inter-task diversity.
3. Open-Endedness
Minecraft naturally accommodates complex, multi-step tasks (e.g., obtaining diamonds) that require agents to plan and coordinate over extended horizons, remember terrain and resource locations, and adapt to dynamic objectives. Such open-ended exploration is central to fostering agents with genuinely flexible and robust capabilities—traits that are difficult to assess using more constrained benchmarks.
About references to other open-ended benchmarks and RL literature
Thank you for your valuable suggestions. We will incorporate these references and discussions in the next version of the paper.
MCU proposes a scalable benchmark for open-ended game agents in Minecraft. It introduces 3,452 atomic tasks, spanning 11 categories and 41 subcategories, that can be dynamically composed into complex challenges. Using an LLM-based task configuration generator, the framework creates diverse, realistic scenarios, while a VLM–powered AutoEval system automatically scores agent performance with over 90% alignment to human judgments. Experimental results show that even state-of-the-art agents (e.g., VPT variants, STEVE-I, and GROOT) struggle with task diversity and complexity, highlighting the need for further advances in generalization and creativity in open-world environments.
Questions for Authors
- Given that LLMs can produce similar outputs when repeatedly prompted with the same instructions, how do you ensure sufficient diversity in the generated configurations?
- Have you considered integrating alternative, possibly open-source, VLMs for automatic evaluation to reduce costs and increase flexibility, and how might these alternatives compare in performance?
- MCU uses manually defined soft constraints in the configuration prompts. How extensive can these constraints be, and how do you address scalability concerns for complex, multi-step tasks where capturing all nuances might be difficult and require manual effort?
- Currently, all evaluation criteria are addressed using a single prompt. Have you tested using separate prompts for each criterion, and if so, do these yield improved performance by allowing the VLM to focus on one aspect at a time?
- What was the rationale behind selecting these six specific evaluation criteria?
- Will the dataset of evaluated trajectories be open-sourced to support further research and the development of alternative evaluation methods?
- Section 3.1 mentions 500 trajectories, while Appendix D.3 indicates 600. Could you clarify which is correct?
- Why do the annotation website's individual video questions include “which agent” for the “Task Progress” and “Action” principles?
- What is the cost of running a full evaluation using MCU, given the reliance on VLMs like GPT-4v? Is cost scalability a concern for potential users?
- How are Hard mode tasks created? Do they require task-specific heuristics, or is there a general method to scale task difficulty across different categories and subcategories?
- What criteria or time limits determine when an episode or task is terminated?
- Can you provide a detailed breakdown of how many tasks originate from each source (e.g., Minecraft Wiki, MineDojo, SkillForge, in-game data, and original designs by the authors)?
- Why did you choose not to include tasks with additional constraints or extra criteria, and would their inclusion benefit the benchmark?
- How should the evaluation system assess error correction when an agent makes no errors?
- Is sampling one frame every 30 frames sufficient to capture all critical details for evaluation? Have you explored how different sampling frequencies might affect the alignment between AutoEval and human judgments?
- How can task progression be evaluated for an open-ended task such as “decorate_the_cave”?
Claims and Evidence
- The authors elaborate on why Minecraft serves as a good evaluation basis with scalable complexity and open-endedness by calculating its vast state space.
- They assert strong task diversity by aggregating 3,452 atomic tasks across multiple categories, enabling scalable composite task creation, and by including an LLM-based configuration generator that requires some manual effort; they compare MCU to MineDojo to demonstrate improved task solvability and diversity.
- The authors prove the effectiveness of AutoEval by crowdsourcing human-evaluated data and reporting a 91.5% alignment between their VLM-based method and the annotated data.
- The authors reveal the current limitations of open-ended agents designed for prior Minecraft environments, experimentally showing that SOTA agents (e.g., VPT, STEVE-I, GROOT) struggle as task complexity and diversity increases.
- The authors claim that their six evaluation criteria comprehensively capture the challenges of real-world tasks, although this claim remains debatable due to the reasons found in the following sections.
Methods and Evaluation Criteria
- MCU effectively captures the unpredictable nature of open-world gameplay by blending intra-task diversity, through LLM-generated variations in biomes, weather, and player states, with a broad inter-task diversity, as seen in its 3,452 atomic tasks. These tasks span challenges from precise control to complex reasoning and knowledge application, drawing from sources such as the Minecraft Wiki, MineDojo, SkillForge, in-game data, and original contributions from the authors. Moreover, the tasks vary in difficulty, creating a dynamic testing environment to assess agent generalizability in conditions mirroring the complexity of real gameplay.
- MCU defines atomic tasks solely by their goal, independent of the method, tools, or specific conditions. This approach isolates the fundamental capability the agent must master. For instance, a task like "mine stone" can be instantiated under varying initial conditions, ensuring that the agent develops a robust policy rather than overfitting to a single scenario. Moreover, the ability to combine these tasks using logical operators enables the creation of progressively complex challenges that mirror the intricacies of real-world task descriptions. The authors should explicitly quantify how many tasks originate from each source to properly credit them and determine the authors’ original contribution. MineDojo may have repetitive and unsolvable tasks, but it is not clear from the paper to what extent. It would also be beneficial to incorporate task descriptions with constraints.
- The authors utilize an LLM-based configuration generator combined with a self-verification loop, leveraging feedback from the Minecraft simulator to ensure task validity. To further enhance this mechanism, they introduce manually defined soft constraints in the prompt, guiding the LLM toward generating feasible tasks. However, specifying these soft constraints requires human expertise and detailed prior knowledge of Minecraft, introducing substantial manual effort. This reliance on human-defined constraints limits scalability and task variety.
- Prompting GPT to supply surplus resources to ensure solvability can inadvertently lower task difficulty, as agents can exploit the abundance rather than managing resources efficiently. Moreover, surplus resources can mask configuration inaccuracies and complicate evaluation, as agents might complete tasks in unintended ways. For example, if the same task is generated under different biomes or weather conditions but one configuration provides significantly more resources, it may falsely appear easier, leading to erroneous conclusions about the relative difficulty of the biome or weather conditions.
- The evaluation pipeline’s reliance on the GPT-4v API may create cost barriers for users, limiting accessibility and scalability. Since other alternatives are not tested, it is not clear whether they would be compatible. While many state-of-the-art LLM APIs, including GPT-4v, incur costs, some open-source alternatives exist, though they may not match GPT-4v’s performance.
- Some evaluation criteria may not translate well across all atomic task categories. First, evaluating creativity or material usage for a cut-and-dry task like find_pink_tulip seems misplaced because the task is inherently straightforward, with little room for creative problem-solving. Second, the material usage metric does not seem very relevant for the “Motion” and “Find & Navigation” categories, where materials are rarely required unless the agent needs to craft a pickaxe for mining or a boat for exploring. Third, if an agent performs a task flawlessly with no errors, measuring error correction becomes moot since no corrections are necessary. Fourth, it is unclear how to measure task progress for an open-ended task such as decorate_the_cave. Finally, the presence of trade-offs between evaluation criteria is noteworthy. For instance, higher creativity scores might inherently require the agent to use materials less efficiently or to sacrifice task efficiency for completion. However, certain evaluation metrics exhibit significant overlap. Avoiding unrelated or unnecessary actions and adequately using material naturally correlates with task efficiency. It is not clear how these criteria should be distinguished.
Theoretical Claims
There are no formal proofs in this work; the focus is on empirical validation and system design. As such, theoretical claims aren’t a central aspect of the paper.
Experimental Design and Analyses
- The evaluation pipeline samples every 30th frame from the agent’s trajectory, but this approach may miss important details. For example, an agent could perform unnecessary actions for 29 frames and then behave correctly on the 30th, which could misrepresent its overall performance. It would be useful for the authors to justify the choice of 30 frames and to show that a denser sampling rate does not degrade alignment with human evaluations.
- The authors developed a dedicated website for crowdsourcing human evaluations. The annotators’ competency is preemptively checked, and the trajectory comparison is well-designed and intuitive.
- AutoEval’s reliability depends heavily on the quality of the underlying VLM. Although I am skeptical about using VLMs to evaluate RL agents in open-ended settings because these models may not fully capture the nuances of gameplay, the authors report a 91.5% average agreement rate with human assessments across the evaluation criteria. Future improvements of VLMs could further improve this alignment.
Supplementary Material
I reviewed the full appendix in great detail.
Relation to Broader Scientific Literature
MCU builds directly on earlier evaluation suites like MineDojo and SkillForge, while incorporating recent advances in LLM and VLM technology. It extends prior work by addressing scalability and task diversity, thereby making a significant contribution to the literature on open-ended game agents. The authors discuss the main existing Minecraft-based environments and open-ended agents that have been evaluated in these environments.
Essential References Not Discussed
The authors could reference a recent work [1] on Open-World RL on Minecraft.
[1] Li, Jiajian, et al. "Open-World Reinforcement Learning over Long Short-Term Imagination." arXiv preprint arXiv:2410.03618 (2024).
Other Strengths and Weaknesses
I have incorporated all the strengths and weaknesses in the sections above.
Other Comments or Suggestions
- Typo line 134 “based on a vision-language model”
- Some of the results in Table 1 appear to be incorrectly bolded.
- When evaluating composite tasks, it would be informative to examine which subtask the agent chooses in OR compositions, as well as the sequence of subtasks completed in AND compositions.
- The titles of the right columns in Figures 8 and 9 don’t fit on the page.
Due to character limitations, we regret that we can only provide a simplified version of the response below:
What was the rationale behind selecting these six specific evaluation criteria?
Some evaluation criteria may not translate well across all atomic task categories.
It is not clear how these criteria should be distinguished.
The selection of the six evaluation criteria in the MCU framework was guided by the goal of capturing a broad yet practical spectrum of agent competencies needed in open-ended, real-world environments.
Regarding your specific points:
- Creativity of simple tasks (e.g., find_pink_tulip) can involve creative strategies (e.g., climbing for better visibility).
- Error correction is most informative when agents do make mistakes, but agents that complete tasks without errors are awarded the highest possible score in this category.
- Task progress in open-ended tasks like decorate_the_cave can be quantified by tracking the extent and coherence of changes made relative to the environment’s initial state.
- Material usage in "find" tasks may involve the use of navigation items such as boats or compasses.
- On the point of overlap between criteria, such as between task efficiency and action control, we agree that correlations exist. However, they emphasize different aspects: task efficiency focuses on outcomes (e.g., time to completion), while action control emphasizes the process (e.g., avoiding redundant or counterproductive actions).
Configuration Diversity?
Please refer to the response to Reviewer TkoE in Question 1.
How extensive can the soft constraints be? How do you address scalability concerns for complex tasks?
- Soft Constraints: Defined by “bad-case analysis”, generally applicable across tasks, ensuring feasibility within Minecraft’s rules. Currently, 8 soft constraints are used.
- Scalability: The task complexity concern is related to current LLM capabilities. As LLMs improve, more sophisticated tasks will become feasible. Generating task specifications is easier than solving them, so we expect scalability to improve naturally.
Separate prompts per criterion?
- A test on 20 random samples showed F1 = 91.2% with separate prompts vs. 90.6% with a combined prompt.
- However, cost rose nearly sixfold since reprocessing was required for each criterion.
- Thus, a combined approach balances efficiency and accuracy.
Evaluation Cost & Open-Source Models
- We do not plan to evaluate 3,000 tasks at once; we are selecting a representative subset.
- For the 35 tasks in the paper, each run (10 trials per task) costs 13.2 USD per agent using GPT-4o.
- We also tested MiniCPM-V-2_6 (8B, Aug. 2024) and JarvisVLA (7B, Mar. 2025). While open-source VLMs still lag, they are catching up.
| Method | Survive | Build | Craft | Mine | Explore | Average |
|---|---|---|---|---|---|---|
| MineClip | 11.0 | 45.0 | 44.0 | 73.0 | 0.0 | 34.6 |
| Ours(MiniCPM) | 65.0 | 43.0 | 80.0 | 59.0 | 53.0 | 60.0 |
| Ours(JarvisVLA) | 73.0 | 62.0 | 73.0 | 84.0 | 65.0 | 71.4 |
| Ours(GPT-4o) | 100.0 | 85.0 | 62.0 | 71.0 | 100.0 | 84.0 |
How are hard mode tasks created?
Hard-mode tasks are created by prompting the LLM to add complexity and constraints (e.g., obstacles, random disturbances). For example, in “mine_iron_ore,” visually similar ores (gold/coal) are placed nearby to increase ambiguity.
What criteria or time limits determine when an episode or task is terminated?
Following GROOT settings: 600 steps for atomic tasks, 12,000 steps for compositional tasks.
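For illustration, a minimal rollout loop under these step budgets might look like the sketch below, assuming a Gym-style Minecraft environment with a "pov" observation and an agent exposing `get_action`; these interface names are assumptions, not MCU's exact API.

```python
# Sketch only: env/agent interfaces are assumed (Gym-style), not MCU's exact API.
MAX_STEPS = {"atomic": 600, "compositional": 12_000}

def rollout(env, agent, task_type="atomic"):
    obs = env.reset()
    frames = []
    for _ in range(MAX_STEPS[task_type]):
        action = agent.get_action(obs)
        obs, reward, done, info = env.step(action)
        frames.append(obs["pov"])   # keep first-person frames for later VLM scoring
        if done:                    # e.g., the agent dies or the simulator ends the episode
            break
    return frames
```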
Can you provide a detailed breakdown of how many tasks originate from each source?
| MineDojo | SkillForge | Minecraft Wiki | In-game data | LLM & expert brainstorming |
|---|---|---|---|---|
| 5.2% | 0.9% | 12.0% | 79.2% | 2.7% |
Why did you choose not to include tasks with additional constraints or extra criteria?
- Additional constraints would break the “atomic” nature (Section 2.3).
- Atomic tasks are core test units; overlapping tasks dilute evaluation efficiency.
Is sampling 1 frame every 30 enough?
| Interval | 20 | 25 | 30 | 40 |
|---|---|---|---|---|
| F1 score | 0.75 | 0.87 | 0.90 | 0.68 |
- Too many frames (interval=20) approach GPT-4o’s limit (50 images), risking overload.
- Too few (interval=40) may miss vital details.
- Interval=30 strikes the best balance (F1=0.90).
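As an illustration of this sampling step, a minimal sketch is shown below; the function name is ours, and the 50-image cap reflects the GPT-4o limit mentioned above rather than the exact implementation.

```python
# Sketch only: keeps one frame every `interval` steps, capped at the VLM's image limit.
def subsample_frames(frames, interval=30, max_images=50):
    sampled = frames[::interval]
    return sampled[:max_images]
```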
Evaluating progress in “decorate_the_cave”?
The criteria generated by the LLM include, e.g.: agents select a suitable cave to decorate, add decorative elements inside the cave, and ensure the cave is well lit and visually appealing.
Regarding open-sourced datasets, typos, formatting, references etc.
Thank you for your meticulous attention to detail. We will fix these issues.
- The examples and explanations of the six specific evaluation criteria make them more apparent.
- The high percentage of unique commands in the item-level overlap analysis among LLM-generated tasks satisfies my concern about task diversity.
- My concern with soft constraints is that as agents improve and we introduce more complex, multi-step tasks, the constraints themselves must also become more nuanced. This adds manual overhead since soft constraints need to be carefully adjusted to match the increasing complexity of the tasks. If they are indeed universal enough to foresee any issues, then there is no issue.
- Thank you for evaluating the separate prompt setting. It would be useful to include this result in the paper to justify the use of combined prompts.
- I think the high cost of GPT-4o’s API might limit the usability of MCU for the time being, but since VLMs are bound to become cheaper over time, I don’t see this as a major issue.
- I appreciate that MCU includes difficulty variation not only across tasks, but also within tasks by providing a hard version. It would be helpful to clarify how many and which of the atomic tasks have a hard counterpart. Currently, I only see the six listed in Table 5. If the number isn’t too large, it would also be helpful to list them in the appendix along with a brief description of the modifications.
- 12,000 steps seems like a long horizon. Are there any early termination conditions aside from the agent dying? If the agent ends up in an unrecoverable state, such as crafting the wrong item or falling into a pit it can’t escape from, then the remainder of the episode is unlikely to yield useful transitions.
- Thank you for clarifying the sources of the tasks. It would be helpful to include this distribution in the paper.
- Regarding “Additional constraints would break the ‘atomic’ nature (Section 2.3)”: I meant to not list them as atomic tasks, but introduce a separate category, e.g., constrained tasks, similar to composite tasks. I suppose this is more in the realm of future work.
- Evaluating different sampling rates justifies the selected value of 30.
- I understand that the character limitation prevented replying to all my comments, and you focused on the most impactful ones. However, I am still curious about the following:
- What proportion of tasks in MineDojo are unsolvable or repetitive?
- If the LLM inconsistently allocates surplus resources, favoring, for example, one biome type over another, it could artificially simplify certain tasks. Could this lead to false conclusions about their relative difficulty?
I thank the authors for their work. My core concerns have been resolved. I have increased the score.
Thank you for recognizing our work! Below are our responses addressing your remaining comments and suggestions:
Thank you for evaluating the separate prompt setting. It would be useful to include this result in the paper to justify the use of combined prompts.
Thank you for this suggestion. We will incorporate this evaluation into our next revision.
I appreciate that MCU includes difficulty variation not only across tasks, but also within tasks by providing a hard version. It would be helpful to clarify how many and which atomic tasks have a hard counterpart. Currently, I only see the six listed in Table 5. If the number isn’t too large, it would also be helpful to list them in the appendix along with a brief description of the modifications.
We appreciate your point and would like to clarify that every atomic task in our list has a corresponding hard version. Specifically, we designed two distinct prompt templates for each task: one for the simple configuration and another for the hard configuration. Currently, Appendix G.1 only presents the prompt for simple configurations. We will add the hard-mode prompts in our next revision. Additionally, we have conducted extensive evaluations on 90 atomic tasks under hard-mode conditions, as detailed in Appendix F.
12,000 steps seems like a long horizon. Are there any early termination conditions aside from the agent dying? If the agent ends up in an unrecoverable state, such as crafting the wrong item or falling into a pit it can’t escape from, then the remainder of the episode is unlikely to yield useful transitions.
Thank you for highlighting this issue. Currently, we do not implement early termination conditions. However, we acknowledge that this could be optimized further. We are exploring integrating an open-source Vision-Language Model (VLM) to facilitate early termination when the visual progress remains static for several consecutive frames, thus enhancing evaluation efficiency.
Thank you for clarifying the sources of the tasks. It would be helpful to include this distribution in the paper.
Thank you for your suggestion. We will include this task distribution explicitly in our next version.
Additional constraints would break the “atomic” nature (Section 2.3).
I meant to not list them as atomic tasks but introduce a separate category, e.g., constrained tasks, similar to composite tasks. I suppose this is more in the realm of future work.
This is a valuable suggestion. We will carefully consider introducing a "constrained tasks" category in MCU as part of future work.
What proportion of tasks in MineDojo are unsolvable or repetitive?
As described previously, we used GPT-4o to filter the MineDojo creative task list, removing tasks considered repetitive or unsolvable by intermediate-level human players (e.g., "Build the Sydney Opera House"). This filtering left us with 521 tasks, representing approximately 33.40% of the original MineDojo creative task set (which means 66.6% are unsolvable or repetitive).
If the LLM inconsistently allocates surplus resources, favoring, for example, one biome type over another, it could artificially simplify certain tasks. Could this lead to false conclusions about their relative difficulty?
We apologize for the confusion here. Surplus resource allocation applies only to the easy-mode configurations. In hard-mode configurations, we purposefully challenge the agent’s resource-management abilities by providing only the minimal necessary resources. In contrast, easy mode intentionally includes ample resources, since efficient resource utilization is considered an advanced skill. Furthermore, hard mode introduces additional complexity through irrelevant items (testing selective usage), rare biomes (testing generalization), and minimal viable resources (e.g., a wooden sword versus a diamond sword for cow combat in easy mode).
We will clarify this in our revised paper.
Minecraft Universe (MCU) introduces an advanced evaluation framework for AI agents in Minecraft. The authors build on a history of environments and datasets for Minecraft agents (e.g. MineStudio, MineDojo), to provide a polished evaluation framework with a large diversity of high-quality tasks and a novel automatic evaluation procedure.
Their experiments include an analysis of their auto-evaluation procedure using human annotation and evaluation of existing agents in MCU.
Questions for Authors
None. Thank you for your work!
Claims and Evidence
The main claim of having produced a new and high-quality Minecraft evaluation framework decomposes into two claims:
- Improvement of agent tasks: compared to existing task suites (MineDojo), the authors observe issues in existing task datasets and build improvements upon them. While not quantitatively supported, previous issues are highlighted clearly and are improved upon by design with their new filtered task dataset and configuration framework.
- Improvement of automated evaluation: The authors design a new automatic evaluation framework that uses VLMs to rate agent trajectories on a predefined rubric of criteria. Section 3.1 provides convincing evidence that their method meaningfully improves upon prior methods (MineCLIP).
Separately, I want to address the claim of the enduring difficulty of MCU. The authors state "Enabling the composition of atomic tasks into more intricate tasks. This approach exponentially increases both the number and complexity of tasks"
- I am not convinced that composing multiple atomic tasks increases the complexity of the challenge significantly. In general, I would expect that if an agent can perform each of the atomic tasks robustly, composing the tasks together does not make them much harder than simply requiring more time to perform (complexity stacks ~additively rather than in some more complex way).
Methods and Evaluation Criteria
Evaluation of the AutoEval methods:
- The authors use comparisons of pairwise preferences and absolute ratings against human annotated trajectories to show that their method has a higher agreement with humans.
- While the evidence does support that their method outperforms competing methods, the setup does have limitations
- Pairwise preference datasets are likely to be generally "easy" unless the trajectory pairs are carefully chosen to test "close-calls" (and I don't think this was done).
- On the other hand, the correlation metrics on absolute ratings are hard to interpret in isolation (how do I interpret a correlation of 0.71?), unless e.g. you show that the correlation between the auto-eval and human ratings is close to the correlation between humans and other humans.
- Overall, I expect that this rubric-based method is a useful and decent judge, but it is likely to have weaknesses (LLMs tend to be overly optimistic judges) and may be vulnerable to spec gaming / Goodhart's law at the edges.
Evaluation of existing agents:
- The setup for evaluating agents is mostly just MCU's setup, which is sound.
- The choice of agent baselines and tasks appears suitable.
Theoretical Claims
No key theoretical claims.
Experimental Design and Analyses
Nothing to add beyond what I've already mentioned.
Supplementary Material
I have skimmed the attached code and appendix, which appear comprehensive.
Relation to Broader Scientific Literature
MCU is the latest addition in a long history of Minecraft-based model/agent evaluation/training environments (MineRL, MineDojo, MineStudio). It brings welcome improvements to the space, in particular creating a streamlined and effective evaluation environment which is suitable and practical for evaluating present-day agents.
Essential References Not Discussed
None that come to mind.
Other Strengths and Weaknesses
Strengths
- I think MCU is a welcome addition to the agentic benchmarks space, and I will be interested to see how modern agents (e.g. systems like general-purpose Computer-Use agents) perform on it.
- MCU has a broad diversity of tasks and solid auto-evaluation framework, making it interesting and useful.
Weaknesses
- I would recommend establishing a "canonical" setup for users wishing to benchmark agents on MCU: how many tasks, which tasks, and a single final scalar metric that aggregates performance. The current setup is flexible but leaves too many variables up to the users which can make it difficult to make comparisons.
- The ceiling of difficulty for MCU is somewhat limited given the straightforward instruction-following nature of the tasks. This is sufficient for the type of agents currently tested in this work, but I expect general agents to quickly become adept at this (in the same way that current LLMs are great at instruction-following), and then this benchmark may be saturated soon.
- I am not confident about this claim, so if I am wrong, it would be useful to see some discussion about the difficulty ceiling of the tasks, e.g. how many hours it would take a human to complete the most difficult challenges, and what kinds of reasoning/planning complexities are present.
Other Comments or Suggestions
None.
Ethical Review Issues
N/A
The correlation metrics on absolute ratings are hard to interpret in isolation (how do I interpret a correlation of 0.71?), unless e.g. you show that the correlation between the auto-eval and human ratings is close to the correlation between humans and other humans.
We appreciate this point and would like to clarify that we already report inter-rater agreement in Lines 380–384: “we compute the *inter-rater agreement* for scoring the same trajectory, revealing a higher Pearson correlation for task progress (0.83) and a lower correlation for creativity (0.69).” This provides a meaningful reference for interpreting the correlation between auto-eval and human ratings in context.
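For readers who want to reproduce this comparison, a minimal sketch is shown below; the score arrays are illustrative placeholders, not data from the paper.

```python
# Sketch only: interpret the AutoEval-human correlation against the human
# inter-rater correlation. Scores here are made-up placeholders.
import numpy as np

def pearson(x, y):
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

human_a  = [4, 3, 5, 2, 4]   # rater A's creativity scores (hypothetical)
human_b  = [5, 3, 4, 2, 3]   # rater B's creativity scores (hypothetical)
autoeval = [4, 2, 5, 3, 4]   # VLM judge's scores (hypothetical)

ceiling = pearson(human_a, human_b)   # human-human agreement
score   = pearson(autoeval, human_a)  # AutoEval-human agreement
print(f"AutoEval r = {score:.2f} vs. human ceiling r = {ceiling:.2f}")
```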
I am not convinced that composing multiple atomic tasks increases the complexity of the challenge significantly. The ceiling of difficulty for MCU is somewhat limited given the straightforward instruction-following nature of the tasks.
Our core motivation for composing multiple atomic tasks is to introduce long-horizon dependencies that go beyond basic instruction following. These composite tasks require capabilities such as high-level task planning (e.g., determining the optimal execution order based on task dependencies), memory management (given the limited context length of policy models), and error correction (recovering from early mistakes that may impact downstream steps).
For instance, in the composed task “mine iron, craft an iron pickaxe, and mine diamond”, the agent must break this into subtasks with strict dependencies. If it fails to mine enough iron, it cannot proceed to craft the pickaxe, and thus cannot complete the final step. Success in such tasks requires more than executing steps sequentially—it demands adaptive reasoning, context-aware decision making, and robustness to cascading errors. These qualities represent a meaningful increase in complexity and present new research challenges.
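To make the dependency structure concrete, the sketch below shows one possible way such an AND-composition could be represented; the schema and helper function are illustrative assumptions, not MCU's actual task format.

```python
# Illustrative schema only, not MCU's actual task format.
composite_task = {
    "goal": "obtain a diamond",
    "op": "AND",
    "subtasks": [
        {"name": "mine_iron_ore",      "requires": []},
        {"name": "craft_iron_pickaxe", "requires": ["mine_iron_ore"]},
        {"name": "mine_diamond",       "requires": ["craft_iron_pickaxe"]},
    ],
}

def ready_subtasks(task, completed):
    """Subtasks whose prerequisites are all satisfied (a simple planning check)."""
    return [s["name"] for s in task["subtasks"]
            if s["name"] not in completed
            and all(r in completed for r in s["requires"])]

print(ready_subtasks(composite_task, completed={"mine_iron_ore"}))
# -> ['craft_iron_pickaxe']
```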
How many hours would it take a human to complete the most difficult challenges in Minecraft?
As a reference, completing the “Ender Dragon” challenge—a commonly recognized long-horizon goal—typically takes a human player between 10 and 30 hours. This includes time spent exploring, gathering resources, crafting appropriate gear, locating the stronghold, and finally, fighting the dragon (which itself takes 10–30 minutes). The overall process involves multi-stage planning, resource optimization, and consistent execution across diverse subtasks.
I would recommend establishing a “canonical” setup for users wishing to benchmark agents on MCU: how many tasks, which tasks, and a single final scalar metric that aggregates performance. The current setup is flexible but leaves too many variables up to the users.
Thank you for the valuable suggestion. In response, we are working on defining a standardized benchmark configuration for MCU. Specifically, we plan to select 10 primary categories (excluding the “others” category) and curate 8 representative tasks from each, resulting in a total of 80 tasks. Each task will include both simple and hard manually verified configurations. For evaluation, we will report the average performance across all tasks and dimensions as a single scalar metric. This canonical setup will be included in the next version of the benchmark to promote consistency and comparability across future work.
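A minimal sketch of how such a single scalar metric could be aggregated is given below; the data structure and function name are assumptions, not the final benchmark specification.

```python
# Sketch only: average per-criterion scores within each task, then average over tasks.
import numpy as np

def mcu_score(results):
    """results: {task_name: {criterion: score}} over the canonical 80-task set."""
    per_task = [np.mean(list(scores.values())) for scores in results.values()]
    return float(np.mean(per_task))
```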
The paper introduces Minecraft Universe (MCU), a framework that improves evaluation for agents playing Minecraft. MCU includes over 3K composable atomic tasks, an LLM-based generator that creates complex tasks by composing the atomic tasks, and an automatic evaluation method based on a VLM. The paper shows the advantage of MCU tasks over previous works and a high correlation between human ratings and the automatic evaluation.
Questions for Authors
- Could you please share more details on the brainstorming step (Lines 258-259). How did this step improve task generation? Does it apply to all atomic tasks?
- Doesn’t the few-shot config prompt (Lines 257-261) bias generations to examples similar to those in the prompt? I believe the paper would benefit from additional discussion regarding the generated tasks.
- For the comparative evaluation (Lines 348-351), why are there a “tie” and “both are bad” classes? What about cases where both generations are good?
Claims and Evidence
My main concern regards the quality of the generated data. The verification step (Lines 279–289) describes a re-generation process when errors are detected, but I did not see any analysis describing the quality of the generated examples. As atomic tasks are composed automatically with an LLM, many issues can arise, including hallucinations or low diversity. I believe the paper could benefit from examining the diversity of the generated tasks, the verification process in Sec. 2.4 (How many errors are detected? How many errors go undetected?), and quantifying the advantages in Fig. 2 (proportion of open-ended tasks, distribution of difficulty levels, etc.).
While this may be out of scope, the evaluation framework (Sec. 2.5) is only evaluated on a single game, Minecraft. Evaluating the generalizability of the method on additional multimodal tasks (e.g., web browsing, multimodal code generation, other gaming environments) could be helpful for future research.
I am happy to consider raising my scores if these issues are addressed.
Methods and Evaluation Criteria
The proposed methods make sense for the problem. The only issue I see is that the evaluation framework is not shown to generalize to new environments (see Claims and Evidence).
Theoretical Claims
None
Experimental Design and Analyses
I checked the soundness of the experiments in Sec.3.
Supplementary Material
I skimmed the appendix. It includes many details that are not referenced from the main paper; the paper could benefit from referring to the main parts of the appendix from the main text.
Relation to Broader Scientific Literature
Developing agents that autonomously plan in complex, "realistic" games such as Minecraft is a major challenge. The paper improves over prior work (e.g., MineDojo) by introducing new tasks and a new evaluation framework.
Essential References Not Discussed
None
Other Strengths and Weaknesses
None
Other Comments or Suggestions
None
My main concern regards the quality of the generated data. (How many errors are detected?) quantifying the advantages in Fig.2 (proportion of open-ended tasks, etc.).
Thank you for raising this important point.
To evaluate the quality of the generated data, we randomly sampled 100 atomic tasks and performed configuration generation. We observed a 9.8% error rate, primarily due to issues such as “unknown block_type or item”. In addition, we identified a 3.7% rate of undetected errors (e.g., omission of necessary items). After verification, over 95% of the configurations were confirmed to be accurate.
To quantify the advantages shown in Fig. 2, we processed the MineDojo creative task list using GPT-4o and filtered out tasks that are not solvable even by humans. From this, 521 tasks remained in MineDojo (note: these tasks are without runnable configurations). We further filtered programmatic tasks according to the definition of atomic tasks, resulting in 268 atomic tasks in MineDojo compared to 2,190 in MCU. Notably, every task in MCU is associated with both simple and hard variants, while MineDojo officially releases difficulty levels for only 64 tasks.
| | Creative tasks (filtered: solvable & non-repetitive) | Programmatic tasks (filtered: atomic) | Tasks with Difficulty Levels |
|---|---|---|---|
| MineDojo | 521 | 268 | 64 |
| MCU | 1,262 | 2,190 | 345 |
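For illustration, the solvability filtering described above could be implemented with an LLM judge along the lines of the sketch below; the prompt wording is hypothetical, and the call uses the standard OpenAI chat-completions API rather than the authors' exact pipeline.

```python
# Sketch only: hypothetical prompt; standard OpenAI chat-completions call.
from openai import OpenAI

client = OpenAI()

def is_solvable(task_description: str) -> bool:
    prompt = (
        "Could an intermediate-level human Minecraft player complete this task "
        f"in a reasonable amount of time? Answer yes or no.\nTask: {task_description}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

tasks = ["Build the Sydney Opera House", "Craft a wooden pickaxe"]
kept = [t for t in tasks if is_solvable(t)]
```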
Could you please share more details on the brainstorming step (Lines 258–259)? How did this step improve task generation? Does it apply to all atomic tasks?
As mentioned in Lines 953–957, we incorporate tasks generated through both expert brainstorming and LLM collaboration. This process is useful for generating creative tasks. More specifically:
- Expert-LLM collaboration: Experts iteratively prompt LLMs to produce creative task ideas. For example, they begin with prompts such as “Let’s brainstorm some creative tasks that intermediate-level Minecraft players could accomplish. {few examples}” and refine outputs across rounds with feedback like “Creating a Mona Lisa statue is too complex—can you offer simpler alternatives?”
- Expert proposal: We work closely with a university Minecraft club, where experienced players propose engaging and imaginative tasks. This process contributed tasks such as “Prepare a birthday for your neighbor”.
While brainstorming significantly enriches task diversity, not all tasks are derived from this process. Appendix C.1 details the four task sources where tasks are collected.
Doesn’t the few-shot config prompt (Lines 257–261) bias generations to examples similar to those in the prompt?
Thank you for this thoughtful observation. To mitigate prompt-induced bias and encourage diversity in LLM-generated configurations, we explicitly design our prompts to promote variability in initialization elements such as biome, weather, and player state (Lines 49–54).
We validate our current prompt through:
- Case study: For the task “craft a crafting table”, we ran 10 generations. The results exhibited wide variation in commands (e.g., /time set day, /give oak_log, /setblock blue_bed), including different wood types like birch, oak, and spruce.
- Quantitative analysis: For 5 randomly selected tasks, we conducted 100 generations each. The item-level overlap with few-shot examples was only 2%, and on average, 89% of the commands in each task were unique.
We will also keep improving our prompt to achieve better configuration generation diversity.
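Two hypothetical generated configurations for "craft a crafting table" are sketched below to illustrate the kind of variation described in the case study above; the field names and values are illustrative, not MCU's exact configuration schema.

```python
# Illustrative only: field names/values are assumed, not MCU's exact schema.
config_a = {
    "biome": "forest",
    "weather": "clear",
    "init_commands": ["/time set day", "/give @p oak_log 4"],
}
config_b = {
    "biome": "taiga",
    "weather": "rain",
    "init_commands": ["/time set night", "/give @p spruce_log 6"],
}
```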
For the comparative evaluation (Lines 348–351), why are there “tie” and “both are bad” classes? What about cases where both generations are good?
When both generated outputs are of similarly high quality and indistinguishable in effectiveness, we classify the comparison as a “tie”; conversely, when both are of poor quality, we mark them as “both are bad.” These cases are recorded solely for annotation clarity and are excluded from the final comparison metrics. We only compute metrics on pairs where a clear winner can be determined (Lines 351–353).
While this may be out of scope, the evaluation framework (Sec. 2.5) is only evaluated on a single game, Minecraft. Evaluating the generalizability of the method to additional multimodal tasks (e.g., web browsing, multimodal code generation, other gaming environments), can be helpful for future research.
Due to the constraints of the rebuttal period, extending AutoEval to additional multimodal domains is currently beyond our scope. However, we agree that evaluating AutoEval across different domains would strengthen the contribution, and we plan to explore this in future work.
In this work, the authors introduce Minecraft Universe (MCU), a framework that builds tasks in the Minecraft game to evaluate AI agents' task-solving capabilities. There are three key components in MCU: 1) a diverse and large set of atomic tasks that can be composed into more difficult and longer-horizon tasks; 2) a task composition mechanism that composes sub-tasks into bigger tasks in a meaningful way; and 3) an evaluation framework that scores an AI agent's task-solving video; the authors show the automatically generated scores have high alignment with human ratings.
This is clearly a strong paper where all reviewers agree that the presented data can benefit the ICML community; the procedure for collecting the data can be even more useful.
The reviewers' concerns were mostly clarification questions on data collection, the properties of the collected tasks, and how the composition mechanism works. The authors did a great job providing additional details and clarifications, including some additional experimental results (evaluation with open-source models). Reviewers agree the rebuttal has made the work stronger.
AC's additional comments:
- Please make sure to include the Related Work section in the main body in your future submissions; this is a very risky move. I wouldn't be surprised if another AC decided to desk-reject such a submission, because it is unfair to other submission authors, who have less space to present their work. I decided to let it go this time because all reviewers really like the work, and I agree the work can be valuable to the community.
- I could not find whether the authors plan to open-source the work, including the data, the data generation pipeline, and the automatic evaluation pipeline. I would encourage the authors to open-source these; otherwise the work's value will drop significantly.
- I feel it's appropriate to cite the "OG" work on Minecraft environment/framework: The Malmo Platform for Artificial Intelligence Experimentation. Johnson et al., 2016. Many existing Minecraft frameworks/benchmarks are built on top of Malmo.
- I agree with Reviewer Huw3 that establishing a "canonical" setup for users wishing to benchmark agents on MCU is super important. It may be useful to make a leaderboard website where researchers can submit their scores (with paper supporting their numbers), this could accelerate this line of research a lot.
- Minor typo: L328 G.1 -> G.3