MLGym: A New Framework and Benchmark for Advancing AI Research Agents
MLGym introduces a framework and benchmark suite for evaluating and developing large language model agents on diverse AI research tasks.
Abstract
Reviews and Discussion
This paper presents MLGym, a new environment framework to measure the research capabilities of models and a corresponding benchmark to compare models. The benchmark includes multiple focus areas, such as game theory questions and computer vision.
The authors explain how the framework works and how to continue extending it, the various tasks in the benchmark, and the final metric calculated by the benchmark. Then, they compare a few models on the benchmark and present a Pareto curve for the models presented.
Overall, it's a competent paper and valuable framework+benchmark. My main concern is that the benchmark seems close to being solved already, and the models presented in the paper are already last-gen. However, the framework is still valuable in itself, so I still lean towards acceptance.
Reasons to Accept
This is a good framework that will become increasingly important. We need to be able to train and measure the agentic capabilities of language models in new and varying ways. This gym framework for AI research tasks fills an important tooling need and could become a solid standard in the field.
The default tasks in the MLGym-bench are interesting and give a solid way of comparing models. I have some concerns about the benchmark, but these are addressable.
Reasons to Reject
- I understand and appreciate the value of the framework outside of the benchmark, but no training experiments are provided.
- This field is moving insanely fast, but the models here are unfortunately out of date. R1, Gemini 2.5 pro, o3, Grok, and Sonnet 3.7 all deserve consideration.
- I'm worried these new models may already max out the benchmark. O1 seems close based on figure 2.
- There isn't much consideration given to vision tasks. They are included, but they could be integrated much more and in greater difficulty.
Questions for Authors
- Can you add some or all of the above models (Gemini 2.5 pro, o3, Grok, and Sonnet 3.7) to the benchmark?
- Have you used the framework for any model training? What were the results?
- small error, missing ref at the end of section 3.1
- how much headroom is left in the benchmark, what is the max score and how hard will it be to achieve? Do you expect it to be solved soon, or still useful for a long time to come?
We thank the reviewer for the thoughtful and detailed review. Please find our comments below:
Q1: Additional results
We have added seven frontier models (DeepSeek-R1, Claude-3.7-Sonnet, o3-mini, Gemini-2.0-Flash, Gemini-2.5-Pro, Llama-4-Scout, and Llama-4-Maverick) to the evaluation pool. All analysis, cost curves, and performance-profile plots will be updated.
We were not able to add results for o3 and Grok due to time and computational limitations. However, we open-source the framework and benchmark to foster open development and reproducibility.
| Model | Best AUP | Last AUP |
|---|---|---|
| GPT-4o | 1.288 | 1.317 |
| *o3-mini* | 1.214 | 1.236 |
| o1-preview | 1.423 | 1.444 |
| *DeepSeek-R1* | 1.249 | 1.267 |
| Llama-3.1-405B | 1.330 | 1.353 |
| *Llama4-Scout* | 1.267 | 1.288 |
| *Llama4-Maverick* | 1.303 | 1.330 |
| Claude-3.5-Sonnet | **1.426** | 1.419 |
| *Claude-3.7-Sonnet* | 1.350 | 1.378 |
| Gemini-1.5-Pro | 1.420 | 1.405 |
| *Gemini-2.0-flash* | 1.374 | 1.385 |
| *Gemini-2.5-Pro* | 1.419 | **1.445** |
| baseline | 1.220 | 1.243 |
In the table above, new models are shown in italics, and the best scores are highlighted in bold. As the table shows, the benchmark remains unsaturated, even with the release of new frontier models that push the boundaries of reasoning and agentic workflows. A detailed discussion on benchmark saturation can be found in the response to Q3.
Q2: Training results
We appreciate this question. We have not conducted training runs using the framework. Implementing the framework and the benchmark already requires considerable effort, and we would like to limit the scope of this work to the framework and benchmark release.
Creating such a framework is the first and foremost step in enabling research in subsequent directions, such as exploring training algorithms. As such, we have focused on laying the groundwork for that exploration.
Moreover, such exploration warrants its own comprehensive study, as it is a relatively new field, and it would significantly increase the scope and computational resource requirements of the work.
We hope this clarifies the contributions and the reason why we decided to limit the scope of this work.
Q3: Benchmark Saturation
We would like to clarify the metrics used in this work.
1. Performance Profiles and AUP
We use Performance Profiles and the Area Under the Performance-profile curve (AUP). Both metrics measure the comparative performance of the selected models; they are not intended to provide an absolute score and have no fixed upper limit. The achievable upper limit changes depending on the selection of models or agents.
For example, in Figure 2, o1-preview reaching 1 in terms of the performance profile does not signify the ceiling of the benchmark. Instead, it states that at this value of τ, o1-preview outperforms all the other models in the pool; it does not imply that o1-preview achieves the best possible performance on each task in the benchmark.
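To make the relative nature of this metric concrete, here is a minimal sketch (ours, not the paper's code) of how a performance profile and its AUP can be computed from raw per-task scores, adapting the standard Dolan-Moré formulation to higher-is-better, positive metrics; the paper's exact adaptation (e.g., handling of lower-is-better metrics or the choice of τ range) may differ, and the scores below are made up purely for illustration.

```python
import numpy as np

def performance_profiles(scores, taus):
    """scores: {model: {task: score}}, assuming higher-is-better, positive scores.
    Returns {model: rho_m(tau)}, where rho_m(tau) is the fraction of tasks on
    which model m is within a factor tau of the best model in the pool."""
    models = list(scores)
    tasks = list(next(iter(scores.values())))
    best = {t: max(scores[m][t] for m in models) for t in tasks}
    profiles = {}
    for m in models:
        ratios = np.array([best[t] / scores[m][t] for t in tasks])  # always >= 1
        profiles[m] = np.array([(ratios <= tau).mean() for tau in taus])
    return profiles

def aup(profile, taus):
    """Area under the performance-profile curve (trapezoidal rule)."""
    return np.trapz(profile, taus)

# Toy example with made-up scores: AUP is only meaningful relative to the pool,
# and a profile reaching 1 at some tau does not mean any task-level ceiling is hit.
scores = {
    "model_a": {"task_1": 0.90, "task_2": 0.70, "task_3": 0.55},
    "model_b": {"task_1": 0.85, "task_2": 0.80, "task_3": 0.40},
}
taus = np.linspace(1.0, 2.0, 101)
for model, profile in performance_profiles(scores, taus).items():
    print(model, round(aup(profile, taus), 3))
```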
2. Why a comparative metric?
We have selected a comparative metric for this work due to the open-ended nature of the research tasks. Since we are comparing models on open-ended tasks, we want a metric that is robust for comparison and will not saturate. The AUP score can be seen as similar to an Elo rating, which is used for measuring comparative performance between models on LMArena. However, Elo has its own challenges, and AUP has been shown to integrate performance on diverse tasks with different metrics in the AutoML literature.
We have also reported raw performance metrics for each task and model in Tables 5 and 6 in Appendix D.1. Below, we show a snapshot of the raw performance scores from Table 5 (Best Attempt@4): the score of Claude-3.5-Sonnet (the overall best model) alongside the best score achieved by any model in the evaluation pool on each task, together with the model that achieved it.
| Task | Claude-3.5-Sonnet | Best Performance | Best Model |
|---|---|---|---|
| House Price | 0.921 | 0.931 | o1-preview |
| 3-SAT | 15.728 | 11.894 | Llama4-Scout |
| CIFAR-10 | 0.895 | 0.895 | Claude-3.5-Sonnet |
| Fashion MNIST | 0.945 | 0.945 | Claude-3.5-Sonnet |
| MS-COCO | 0.298 | 0.473 | Llama4-Maverick |
| Language Modeling | 4.476 | 3.966 | o1-preview |
| MNLI | 83.036 | 84.249 | Gemini-2.5-Pro |
| Battle of Sexes | 1.442 | 1.445 | Gemini-2.0-flash |
| Prisoners Dilemma | 2.567 | 2.634 | Llama4-Scout |
| Blotto | 0.576 | 0.576 | Claude-3.5-Sonnet |
| Breakout | 35.017 | 71.389 | Gemini-1.5-Pro |
| Meta Maze | 48.562 | 48.562 | Claude-3.5-Sonnet |
| Mountain Car Continuous | - | 74.737 | Gemini-1.5-Pro |
As shown in this table, no single model consistently achieves the best performance across all tasks. The AUP score is intended to assess the overall consistency of the models amongst the evaluation pool.
3. Flexibility of the Benchmark
The saturation of this benchmark depends on the saturation of the underlying tasks themselves. For this reason, we decided to include a diverse set of tasks that are collectively not saturated. For example, some of the game theory tasks have no theoretically correct solution, and the proper solution depends on the opponent's strategy; thus, it is easy to increase or decrease the difficulty of the task. Similarly, human researchers are still actively working on the language modeling problem.
Moreover, we would also like to point out that MLGym-Bench is designed to be a live benchmark; that is, if a dataset used in a task becomes saturated and the agent reaches the best solution, then (1) that would signify a step change in LLM agents' capabilities (controlling for data leakage), and (2) the dataset or the task can easily be replaced with a similar but more challenging variant, thus maintaining the difficulty of the benchmark.
Q4: Vision tasks
We acknowledge that the current CV task set is modest. However, this benchmark is not intended to be a comprehensive set of tasks for all fields; rather, it is meant to be a small selection of relevant, diverse tasks spanning a wide range of fields/problems. We intentionally want to keep the benchmark small and well-curated to democratize its use, i.e., anyone can easily evaluate it without needing a lot of compute.
As discussed in Q3.3, MLGym-Bench is supposed to be a live benchmark, and with community involvement, we can increase the difficulty and breadth of the benchmark as research progresses. Moreover, as discussed in Appendix A, MLGym-Bench currently focuses on the Level 1: Baseline Improvement categorization of tasks. However, with the framework, it is easy to simply remove the baseline code and let the agent start with just an evaluation script, thereby increasing difficulty with a simple switch.
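To illustrate what this difficulty switch could look like, below is a hypothetical sketch; the field names are ours for illustration only and are not the actual MLGym task-config schema, which is defined in the released code.

```python
# Hypothetical illustration (field names are ours, not the actual MLGym config
# schema): the difficulty switch amounts to withholding the starter/baseline
# code while keeping the evaluation script and task description unchanged.

level_1_task = {
    "name": "image_classification_cifar10",
    "description": "Improve validation accuracy on CIFAR-10.",
    "evaluation_script": "evaluate.py",  # always provided to the agent
    "starter_code": "baseline.py",       # Level 1: improve a given baseline
}

# Harder variant: the agent starts from only the evaluation script and the
# task description, with no baseline implementation to build on.
from_scratch_task = {**level_1_task, "starter_code": None}
```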
Due to the reasons listed above, we believe this benchmark will remain useful for a long time to come and can be easily extended to support a much wider variety of tasks, while also increasing the difficulty.
We thank the reviewer for their thoughtful review and questions, and we hope that we have adequately addressed them. We have also noted the reference mistake pointed out by the reviewer and will update the manuscript to fix it. We are always available to answer any further questions and address any comments the reviewer may have during the discussion period.
Thanks so much for adding the other frontier models! I believe this adds much more strength to the paper.
Apologies for misunderstanding the comparative scores. I have re-read the work and I originally misunderstood how it was calculated - Figure 2 led me to think that the benchmark had been saturated.
While I believe adding the other frontier models does strengthen the work, I also agree with the criticism from the reviewer about including a short forward-looking section on multi-agent settings. It would be unfortunate if this benchmark is not extendable in any way to multi-agent settings, even if that isn't the focus of the work.
Overall, I will keep my score the same - this is not my main field of expertise, so I feel comfortable keeping at a 6 with 3 confidence.
We want to thank the reviewer for their insightful comments, which helped us strengthen the paper.
We would like to add a clarification regarding the multi-agent criticism: the framework is certainly extensible to multi-agent settings. The decoupling of the Agent and Environment modules allows for the efficient implementation of multi-agent systems. We are happy to add a future-work section on multi-agent settings using MLGym and for AI research agents in general. We will also add an example of creating a simple multi-agent system to the source code documentation.
The future work section of the paper will discuss various avenues of research for building a multi-agent system for AI Research Agents, and the documentation will outline the implementation details of such a system using the MLGym framework.
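As a rough illustration of why the decoupling makes this straightforward, here is a minimal sketch of a round-robin multi-agent loop over a shared Gym-style environment; the class and method names below are hypothetical stand-ins, not the MLGym API.

```python
# Hypothetical sketch, not the MLGym API: two agents taking turns in a single
# shared Gym-style environment. Because the agent and environment are decoupled,
# the loop does not need to know how many agents are acting.

class ToyEnv:
    """Stand-in environment with a Gym-style reset/step interface."""
    def reset(self):
        return "initial observation"

    def step(self, action):
        observation, reward, done, info = f"obs after {action}", 0.0, False, {}
        return observation, reward, done, info

class ToyAgent:
    """Stand-in agent; in practice this would wrap an LLM policy."""
    def __init__(self, name):
        self.name = name

    def act(self, observation):
        return f"{self.name}: noop"  # placeholder command

def run_multi_agent(env, agents, max_steps=6):
    obs = env.reset()
    for step in range(max_steps):
        agent = agents[step % len(agents)]  # round-robin turn taking
        obs, reward, done, info = env.step(agent.act(obs))
        if done:
            break
    return obs

run_multi_agent(ToyEnv(), [ToyAgent("planner"), ToyAgent("coder")])
```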
We hope you will consider raising the score since we have addressed your main and additional concerns.
Thank you for your detailed feedback during the discussion period. We appreciate the time you took to review the submission.
The paper presents MLGym and the accompanying MLGym‑Bench: the first Gym‑style environment aimed at studying large‑language‑model (LLM) agents that perform full AI‑research workflows. Thirteen open‑ended tasks—spanning computer vision, natural‑language processing, reinforcement learning, and game theory—require agents to formulate hypotheses, generate or process data, implement code, train models, and iterate. By embracing the standard Gym API, the framework is positioned to support reinforcement learning, curriculum learning, and other training paradigms, while also evaluating a wide range of artifacts (CSV predictions, model checkpoints, RL policies, strategy code). Baseline experiments with several frontier LLMs show modest gains over provided baselines (mainly via improved hyper‑parameter selection) but little evidence of creative algorithmic discovery.
Reasons to Accept
MLGym meaningfully broadens the scope of “agentic” evaluation beyond short‑form reasoning or code‑completion benchmarks, forcing models to engage in longer, iterative research loops. The Gym abstraction lowers the barrier for researchers who already build on widely used RL libraries, and the open sourcing of both framework and tasks is a welcome commitment to community adoption. The inclusion of tasks from four distinct sub‑fields, and the ability to grade heterogeneous artifacts, make the benchmark conceptually richer than prior narrowly scoped suites.
Reasons to Reject
- The experimental section is currently too thin to validate the benchmark’s breadth. Only a handful of mostly closed‑source LLMs are evaluated, and the lone open‑source entrant is a 405 B‑parameter giant, leaving open questions about how mid‑sized models (e.g., DeepSeek‑67 B, Qwen‑72 B) or strong instruction‑tuned 7–34 B checkpoints would fare.
- The study considers solely single‑agent settings, despite multi‑agent collaboration being common in real research workflows and a natural fit for Gym. Evaluation focuses on average‑utility‑per‑task (AUP), which the authors did not introduce and which captures only overall task score; richer metrics (sample efficiency, compute cost, and qualitative novelty) are left unexplored, even though the tasks purportedly span data science, game theory, CV, NLP, and RL. Finally, because the agent architectures are largely unmodified, the work reads more as a dataset release than as a methodological advance.
We thank the reviewer for the review. Please find our response below:
C1: “Experimental section is thin; mid-sized models are missing”
1. Additional Models
The revised draft will report results for seven additional models: DeepSeek-R1, Claude-3.7-Sonnet, o3-mini, Gemini-2.0-Flash, Gemini-2.5-Pro, Llama-4-Scout, and Llama-4-Maverick. Moreover, Llama-3.1-405B, DeepSeek-R1, Llama-4-Scout, and Llama-4-Maverick represent the current frontier of available open-weight models.
| Model | Best AUP | Last AUP |
|---|---|---|
| GPT-4o | 1.288 | 1.317 |
| *o3-mini* | 1.214 | 1.236 |
| o1-preview | 1.423 | 1.444 |
| *DeepSeek-R1* | 1.249 | 1.267 |
| Llama-3.1-405B | 1.330 | 1.353 |
| *Llama4-Scout* | 1.267 | 1.288 |
| *Llama4-Maverick* | 1.303 | 1.330 |
| Claude-3.5-Sonnet | **1.426** | 1.419 |
| *Claude-3.7-Sonnet* | 1.350 | 1.378 |
| Gemini-1.5-Pro | 1.420 | 1.405 |
| *Gemini-2.0-flash* | 1.374 | 1.385 |
| *Gemini-2.5-Pro* | 1.419 | **1.445** |
| baseline | 1.220 | 1.243 |
In the table above, new models are shown in italics, and the best scores are highlighted in bold. As the table shows, the benchmark remains unsaturated, even with the release of new frontier models that push the boundaries of reasoning and agentic workflows.
2. Model choice reflects the frontier
Our primary goal is to release a benchmark and framework, not a full leaderboard. We therefore ran a small set of frontier models to demonstrate that MLGym-Bench remains challenging even at today’s performance ceiling. Each full run is resource-intensive, so we selected the strongest publicly available models at the time of the experiment.
C2: Focus on single-agent setting
We agree that multi-agent research is an exciting direction; however, it falls outside the scope of this paper. Our goal here is to introduce a framework and benchmark that make it easy for others to evaluate new AI-research agents on open-ended tasks. A single-agent setting is a necessary first step toward reliable, reproducible baselines.
C3: Evaluation
1. “Evaluation focuses on average-utility-per-task (AUP), which the authors did not introduce and which captures only overall task score.” We respectfully disagree with both of these claims. We have provided a clear description of the AUP metric and conducted extensive analyses, focusing on model performance for each task in the benchmark.
- Meaning of AUP: In Section 6, we define AUP as Area Under the Performance-profile curve, not “average-utility-per-task.” Equations and discussion are provided in that section, and Sections 6 and 7 both reference the definition explicitly.
- Raw task metrics: Tables 5 and 6 in Appendix D.1 list the native metric for each task.
If additional clarification is helpful, we welcome concrete suggestions from the reviewer.
C4: Only Performance Profiles and AUP, no rich metrics
We believe the reviewer may have missed some essential sections and figures in the paper, where we provide cost analyses, computational requirements, token usage per model, agent actions per task/model, and termination error analyses, which provide insight into failure modes. Please find the references to each of these experiments below.
1. Compute cost and sample efficiency
- Figure 3 plots API cost vs. AUP.
- Appendix D.1 reports GPU hours and wall-clock time per task.
- Appendix D.2 gives average token usage and API spend per model.
- Appendix D.3 provides agent action and termination error analyses.
We will add additional analyses of the average number of steps/actions for each task and model.
2. Qualitative novelty
We would like to remind the reviewer that this work introduces a new framework and benchmark. The work does not focus on agent design for increased performance. However, we will add full agent trajectories for every task in the updated manuscript to provide qualitative examples.
C5: “Reads more as a dataset release than a methodological advance.”
The paper’s chief contribution is indeed the modular simulation framework and benchmark; it is intended as infrastructure, not a new agent architecture. This aligns with the Call for Papers category “All about evaluation: benchmarks, simulation environments, scalable oversight, evaluation protocols and metrics.” Based on our understanding of COLM’s policy, the fact that our paper focuses on a benchmark and framework rather than a new method should not be grounds for rejection.
If any wording suggested otherwise, we will clarify it in the revision.
Thank you for the detailed author response. Below is concise, point-by-point feedback to help strengthen the revision.
1. Model Coverage & Timely Update
• Benchmark risk: Even if the goal is not to release a full leaderboard, the absence of regularly updated mid-sized models (e.g., 7-34 B checkpoints) can mislead LLM developers about the true difficulty ceiling.
• Actionable suggestion: Add at least a representative slice of open-weight 7–34 B models, or publish a lightweight protocol (task subset / shorter episodes) so the community can contribute results without prohibitive cost.
2. Evaluation Clarity (Sec. 6 vs. Sec. 3.4)
I apologize for any misunderstanding: my concern is not the AUP formula itself but reader confusion caused by two different “Evaluation” discussions:
• Sec. 3.4 (task design) already spends several lines on evaluation scripts.
• Sec. 6 later re-introduces evaluation, somewhat duplicating the narrative.
• Minor suggestion: Merge or cross-reference these sections to avoid the impression that “evaluation = CSV prediction” (Kaggle style) when many tasks require script-based grading.
3. Granularity of Task Descriptions
Because you claim coverage of CV, NLP, RL, game theory, and tabular data, task-level evaluation details matter. Please consider:
• Include a concise per-task table in the main text summarizing input modality, artifact type, grading metric, and compute budget, with clear pointers to full evaluation scripts in the repo.
• Alternatively, at minimum, provide qualitative case studies with full agent trajectories for each task in the updated manuscript to offer concrete qualitative examples.
4. Training Facilities & Extensibility
Current design showcases 13 tasks, but it is unclear whether the framework easily scales to dozens more. Concrete ways to strengthen this point:
• Provide a template generator (or wizard) that auto-creates config files, conda envs, and stub evaluation scripts.
• Include at least one “blank” scaffold example (no starter code) to illustrate curriculum flexibility.
5. Multi-Agent Road-Map (Optional but Valued)
While I accept that multi-agent is out of scope, a one-paragraph road map or tiny two-agent demo would reassure readers that the API will not need a fundamental redesign later.
Summary Recommendation
The authors have addressed many issues, but gaps remain in model spectrum coverage, evaluation narrative cohesion, and framework extensibility. Addressing the bullet points above would significantly improve clarity and strengthen community confidence in MLGym-Bench. However, I am reasonably willing to raise my score if the authors provide clear explanations to resolve the remaining confusion.
C1: Model Coverage and Timely Update
Benchmark risk: difficulty ceiling and representative models
We respectfully disagree with the claim that the current selection of models does not provide the true difficulty ceiling. Existing 7-34B models don't perform well on agentic tasks unless fine-tuned on the task. Our selection covers the last and current generations of open-weight frontier models. We cannot conduct new experiments during the discussion period due to budget and time constraints; however, we will include Qwen3-8B and QwQ models in our camera-ready submission.
Lightweight protocol
Our task selection, computational resources, and API cost requirements are minimal to foster community benchmarking. All tasks in MLGym-Bench require 0-2 GPUs and a $4 limit on API cost per agent trajectory.
Furthermore, related works on coding and ML agents typically select only 2-3 models for experimentation [1, 2, 3, 4]. In contrast, we go beyond these concurrent works by providing benchmark metrics, cost, and action analyses using 13 diverse models to give a fair view of the benchmark’s ceiling.
C2: Evaluation Clarity (Sec. 6 vs. Sec. 3.4)
The original review stated that we do not define the AUP metric and mislabeled its definition; we want to emphasize that the concern has since shifted. Nevertheless, we clarify the two different evaluation sections here.
Task Evaluation (Sec 3.4): Task evaluation refers to the evaluation function/script used for a single AI task in the benchmark.
Benchmark Evaluation (Sec 6): Section 6 describes the metric used for aggregation of scores on each task.
For further clarity, we will rename Section 6 to 'Benchmark Evaluation'.
Impression of Evaluation = CSV prediction
We respectfully disagree with this claim. One of the core contributions of this framework is the ability to evaluate flexible artifacts using script-based evaluation, as mentioned by the reviewer in their original review. This is also highlighted at multiple points in the paper: Abstract, Sections 2, 3.4, 3.5, 4, and Appendix C. Example text from Section 3.4:
To ensure extensibility to such open-ended tasks, the task definition is expected to provide an evaluation script and submission artifact instructions.
C3: Granularity of Task Descriptions
Coverage of tabular data
This is incorrect; we never claim coverage of tabular data.
Table for Task-Level Details
Please refer to C4.1 in the original response, where we specifically address the tables and figures in the submitted manuscript, which already provide the necessary details. We will open-source the codebase and add specific pointers to the evaluation script in the paper. Due to page limitations at the conference, we were unable to include this table in the main text; however, we have referenced it in the paper to draw attention to it.
Qualitative analysis
See C4.2 in the original response.
C4: Training Facilities & Extensibility
Scalability
We believe this criticism is too broad and would be happy to address any specific scalability issues that the reviewer can foresee.
Provide a template generator (or wizard) that auto-creates config files, conda environments, and stub evaluation scripts.
We do not claim that creating task config files can be automated. The automatic creation of task configuration files is not within the scope of this work and constitutes a separate research direction. We believe the release of the MLGym framework can foster research in this direction.
Include at least one “blank” scaffold example (no starter code) to illustrate curriculum flexibility.
We will include a blank scaffold example in the code upon release. However, it is unclear how a single blank example can illustrate curriculum flexibility without conducting a complete experimentation cycle for each model and task.
C5: Multi-Agent Road-Map
Due to time and budget constraints, we cannot include out-of-scope work in this paper. Multi-agent scenarios, while helpful, are not the central claim of this work.
[1] Yamada, Y., Lange, R. T., Lu, C., Hu, S., Lu, C., Foerster, J., Clune, J., & Ha, D. (2025). The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search
[2] Wijk, H., Lin, T., Becker, J., Jawhar, S., Parikh, N., Broadley, T., Chan, L., Chen, M., Clymer, J., Dhyani, J., Ericheva, E., Garcia, K., Goodrich, B., Jurkovic, N., Kinniment, M., Lajko, A., Nix, S., Sato, L., Saunders, W., … Barnes, E. (n.d.). RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts.
[3] Chan, J. S., Chowdhury, N., Jaffe, O., Aung, J., Sherburn, D., Mays, E., Starace, G., Liu, K., Maksin, L., Patwardhan, T., Weng, L., & Mądry, A. (2024, October 9). MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
[4] Huang, Q., Vora, J., Liang, P., & Leskovec, J. (2024). MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation
Thanks for your detailed feedback to address my concern. I will raise my score.
We appreciate the time you took to review this submission and engage in a fruitful discussion with us. We are delighted that we were able to address your concerns and that you agreed to raise the score.
We noticed that while you mentioned you would raise your score following our discussion, the score in the system appears to remain unchanged from your original review. We want to remind you gently to update your score, as this will ensure that your revised score is accurately reflected in the review process.
Furthermore, we would like to clarify that we will also include a discussion on multi-agent systems in both the paper and the source code documentation. We believe adding an example of creating a multi-agent system is better suited in the code documentation, and the paper will discuss different avenues of research in multi-agent systems for open-ended AI research.
Thank you for your review and the detailed discussion.
This paper presents MLGym, a Gym-style framework for evaluating LLM agents on machine learning tasks. The system supports modular environments, flexible artifact evaluation, and integration with agentic workflows. The authors benchmark several frontier LLMs using adapted performance profile curves and analyze both performance and cost-effectiveness. The goal is to foster the development of autonomous AI research agents.
Reasons to Accept
- Interesting direction toward supporting AI research via LLM agents.
- The use of AUP scores based on performance profile curves is a thoughtful choice that could be inspiring.
Reasons to Reject
- Misalignment between task design and "AI research" framing
While the paper frames its tasks as requiring “real-world AI research skills,” most tasks provided resemble machine learning engineering problems. As a result, these LLM agents mostly improve performance through hyperparameter tuning rather than proposing novel hypotheses or algorithms as claimed in the paper. This overstates the benchmark's alignment with actual open-ended research.
- Overly broad and vague definition of AI Research Agents
The paper positions MLGym as a benchmark for “AI Research Agents,” but this term is loosely defined. Many established subfields (e.g., Neural Architecture Search) already focus on specific, automatable parts of AI research. In contrast, MLGym is mostly suited for a narrow slice of engineering-focused tasks, and it’s unclear how it would scale to real-world AI research challenges that demand large-scale experimentation and significant resource investments.
We thank the reviewer for the review. Please find our response below:
C1: Misalignment between task design and “AI Research” Framing
Thank you for the feedback. We may be misunderstanding the concern, so any clarification on what the reviewer considers “open-ended research” would help. Below, we restate our position in a more concise form.
1. Research via Benchmarks and Experiments
Because idea quality is subjective, ML research is typically validated through improved results on standard benchmarks. MLGym-Bench follows this tradition: agents receive a broad goal (“improve validation loss”, “reduce solve time”) without a fixed list of knobs, letting us evaluate idea generation, implementation, and experimentation in one loop. Related work, such as RE-Bench [1], adopts a similar approach.
2. Real-World Tasks, Not Engineering Recipes
All benchmark problems correspond to questions researchers have pursued for decades (e.g., lowering perplexity in language modelling), validated through performance on peer-reviewed and well-adopted benchmarks and datasets. Over-specifying each task would test only implementation skill; our open-ended framing probes whether an agent can move from a vague goal to a concrete, validated solution. Taking the example of language modelling, a lot of research on scaling laws and optimizers relies on comparing validation loss/perplexity across new algorithmic changes. The NanoGPT speedrunning benchmark [2] has resulted in great new optimizers like Muon [3], which are being used for large-scale LLM training as well [4]. A testbed like this enables AI agents to develop novel architectures, optimizers, and training setups, and downstream perplexity measures can validate these changes.
3. Where MLGym-Bench Sits in the Capability Hierarchy
Appendix A defines six levels of research proficiency. MLGym-Bench targets Level 1 – Baseline Improvement, a critical first step that already demands original ideas. We will surface this taxonomy in the main text for clarity.
4. Breadth of Domains = Breadth of Skills
The suite spans Game Theory, 3-SAT, CV, NLP, and RL. Here are some examples of how the default agent approaches certain problems without performing hyperparameter tuning.
- 3-SAT: No hyperparameters exist; an agent cuts solve-time by merging MOMS with a Look-Ahead heuristic.
- Image Classification: Starting from a two-layer CNN, the agent introduced rotation and crop augmentation, and replaced the network with a ResNet, thereby increasing accuracy within a 30-minute training time budget.
These trajectories show genuine algorithmic reasoning. Hyperparameter sweeps are primarily observed in challenging RL tasks, reflecting the current agent's limits rather than the benchmark design. We will add the qualitative examples and highlight these points in the revised manuscript.
5. MLGym Framework goes beyond Level 1 capabilities
While MLGym-Bench is designed to test LLM agents' capabilities on Level 1 tasks, the MLGym framework is intended to be modular and adaptable to the ever-changing needs of AI Research Agents' evaluation. Moreover, the tasks in the benchmark are defined by simple configs and scripts; the codebase and benchmark will be open-sourced and thus will evolve with community engagement.
C2 – “Overly broad and vague definition of AI Research Agents”
1. Definition and taxonomy
We use AI Research Agent to refer to any agent that cycles through idea generation, implementation, and experimentation within a sandboxed ML environment, with the goal of making progress on an AI research problem, such as language modeling. Recent works employ nearly identical language; e.g., “AI Scientist” [5] and “AI Co-Scientist” [6]. Appendix A gives a six-level capability taxonomy (from reproducing research papers to pursuing a long-term research agenda). We will surface this definition and table in the main text.
2. Difference from NAS / AutoML
NAS requires a handcrafted search space, is typically tailored to each domain, and its solutions do not generalize well in a cross-domain setting [7]; it also does not address open-ended problem formulation. By contrast, an AI Research Agent runs unchanged across 3-SAT, CV, NLP, RL, and game-theory tasks, operating in the same environment with a highly expressive search space consisting of any idea that can be expressed in natural language, any algorithm that can be implemented in a programming language, and any experiment that can be run via a bash command. An LLM agent is also much more autonomous, as it needs to validate for itself whether an idea or algorithm is better than the baseline (or the best so far). In contrast, NAS / AutoML typically provides an automatic way of evaluating methods and selecting the best one.
3. Scalability to “real-world” research
MLGym’s modules (Agent | Environment | Task | Dataset) are container-isolated; users can swap in larger clusters or longer budgets without touching core code. Thus, nothing prevents large-scale or resource-intensive experiments. Comparing this with existing ML Engineering/Research agent frameworks, we see AIDE [8] uses a search algorithm but only works on single files, in contrast to MLGym, where the agent abstraction is designed to work on large codebases.
If the reviewer has a concrete task that seems incompatible, we would welcome the specifics so we can address it directly.
[1] Wijk, H., Lin, T., Becker, J., Jawhar, S., Parikh, N., Broadley, T., Chan, L., Chen, M., Clymer, J., Dhyani, J., Ericheva, E., Garcia, K., Goodrich, B., Jurkovic, N., Kinniment, M., Lajko, A., Nix, S., Sato, L., Saunders, W., … Barnes, E. (n.d.). RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts.
[2] Jordan, K., Bernstein, J., Rappazzo, B., et al. (2024). modded-nanogpt: Speedrunning the NanoGPT baseline. https://github.com/KellerJordan/modded-nanogpt
[3] Jordan, K. (2024). Muon: An optimizer for hidden layers in neural networks. https://kellerjordan.github.io/posts/muon/
[4] Liu, J., Su, J., Yao, X., Jiang, Z., Lai, G., Du, Y., Qin, Y., Xu, W., Lu, E., Yan, J., Chen, Y., Zheng, H., Liu, Y., Liu, S., Yin, B., He, W., Zhu, H., Wang, Y., Wang, J., … Yang, Z. (2025). Muon is Scalable for LLM Training (No. arXiv:2502.16982). arXiv. https://doi.org/10.48550/arXiv.2502.16982
[5] Yamada, Y., Lange, R. T., Lu, C., Hu, S., Lu, C., Foerster, J., Clune, J., & Ha, D. (2025). The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search (No. arXiv:2504.08066). arXiv. https://doi.org/10.48550/arXiv.2504.08066
[6] Gottweis, J., Weng, W.-H., Daryin, A., Tu, T., Palepu, A., Sirkovic, P., Myaskovsky, A., Weissenberger, F., Rong, K., Tanno, R., Saab, K., Popovici, D., Blum, J., Zhang, F., Chou, K., Hassidim, A., Gokturk, B., Vahdat, A., Kohli, P., … Natarajan, V. (n.d.). Towards an AI co-scientist.
[7] Tu, R., Roberts, N., Khodak, M., Shen, J., Sala, F., & Talwalkar, A. (2022). NAS-bench-360: Benchmarking neural architecture search on diverse tasks. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Advances in neural information processing systems (Vol. 35, pp. 12380–12394). Curran Associates, Inc.
[8] Jiang, Z., Schmidt, D., Srikanth, D., Xu, D., Kaplan, I., Jacenko, D., & Wu, Y. (2025). AIDE: AI-Driven Exploration in the Space of Code (No. arXiv:2502.13138). arXiv. https://doi.org/10.48550/arXiv.2502.13138
We appreciate the review and the time you took to consider this submission. We would like to gently remind you to acknowledge the rebuttal. If we have not addressed your concerns, we would be happy to address any questions you may have and are available to answer any queries.
Since the discussion period ends tomorrow, we would appreciate it if the reviewer could post any additional questions today or acknowledge the rebuttal by then. This will give us enough time to address any further queries you may have regarding the work.
Thank you
Thanks for the responses! These additions improve the paper’s clarity, and surfacing the capability taxonomy in the main text would help anchor expectations.
However, my core concerns remain. The tasks primarily reflect narrow engineering skills rather than open-ended research, and the framing around “AI Research Agents” still feels overstated. While the framework is promising, the benchmark does not convincingly support the broader claims made.
I’ve updated my score to reflect the improved clarity, but I’m still leaning towards rejection, as the paper’s framing remains misaligned with the actual scope and contributions.
Dear Reviewer 33qt,
Thank you again for reviewing our work and for acknowledging the interesting direction and the metrics used in this work. As the discussion period ends today, we would like to kindly remind you to acknowledge whether the questions raised in your original review have been addressed, particularly with the additional model results, examples, and clarifications regarding the task design and the definition of AI Research Agents. If so, we hope you can raise the rating for our submission. We are eager to address any other questions you may have.
Thank you
We thank all the reviewers for their thoughtful and detailed feedback. During the discussion period, we have added additional experiments and clarified the details around evaluation and benchmarking. Below, we summarize the key criticisms and our responses:
Expanded Model Coverage (Reviewers xDsa and 5pzF)
Criticism: Limited model coverage, missing recent frontier models, and mid-sized models.
Response: We have added results for seven additional frontier models: DeepSeek-R1, Claude-3.7-Sonnet, o3-mini, Gemini-2.0-Flash, Gemini-2.5-Pro, Llama-4-Scout, and Llama-4-Maverick. The updated results show that the benchmark remains challenging even for the newest frontier models, with no single model achieving the best performance across all tasks. We will include results for two mid-sized models, Qwen3-8B and QwQ, in the final manuscript.
Framework Positioning and Scope (Reviewer 33qt)
Criticism: Misalignment between task design and "AI research" framing; broad definition of AI Research Agents.
Response:
- Clear taxonomy: We explicitly define our focus on Level 1: Baseline Improvement tasks (from our 6-level hierarchy in Appendix A), which will be surfaced in the main text.
- Definition of AI Research Agents: We have added a concrete definition: an AI Research Agent is any agent that cycles through idea generation, implementation, and experimentation within a sandboxed ML environment, with the goal of making progress on an AI research problem, such as language modeling. We will add this clarification to the final manuscript.
- Concrete examples: We provide specific examples showing genuine algorithmic reasoning beyond hyperparameter tuning (e.g., 3-SAT task where agents implement MOMS with Look-Ahead heuristic, CV task where agents add data augmentation and architectural improvements).
Evaluation Concerns (Reviewer xDsa)
Criticism: Confusion about AUP metric definition and lack of detailed analyses.
Response: We have clarified several misconceptions:
- AUP Definition: AUP stands for Area Under Performance-profile curve, not "average-utility-per-task" as claimed. This is explicitly defined in Sections 6 and 7.
- Comprehensive analyses provided: Contrary to claims, we include extensive cost analysis (Figure 3), computational requirements (Appendix D.1), token usage (Appendix D.2), and agent action analyses (Appendix D.3).
- Raw metrics available: Tables 5 and 6 provide native task metrics for all models and tasks.
Framework Scalability (Reviewers 33qt, xDsa, 5pzF)
Criticism: Questions about framework scalability
Response: The modular design, with container isolation, allows for easy scaling to larger compute resources without requiring core code changes. Moreover, MLGym provides an abstraction for defining any AI research task and can thus be extended to support all research tasks without requiring code changes.
Multi-Agent Settings (Reviewers xDsa and 5pzF)
While multi-agent scenarios are valuable for future work, they fall outside the scope of this paper. The decoupled design makes future multi-agent extensions feasible. We will include a roadmap for potential future work in the final manuscript and provide an example of extending the system for multi-agent settings in the source code documentation.
Training Experiments (Reviewer 5pzF)
We deliberately limited the scope to the framework and benchmark introduction. Training LLM agents using this framework represents an important but separate research direction that warrants a dedicated study, and it would significantly expand the computational requirements and scope of the paper. The framework enables future research of this kind.
Comparison to NAS (Reviewer 33qt)
- Cross-domain generalization: NAS search spaces are carefully hand-crafted for each problem, whereas AI research agents operate unchanged across CV, NLP, RL, and game theory tasks.
- Open-ended search space: Agents can try any idea that is expressible in natural language and implementable in code, which broadens the scope to arbitrary open-ended ideas.
Benchmark Saturation and Future-Proofing (Reviewer 5pzF)
Our comparative AUP metric has no fixed ceiling; it measures relative performance among evaluated models. Raw performance data shows no single model dominates all tasks, indicating substantial room for improvement.
Conclusion
We believe these clarifications and additions significantly strengthen the paper. The expanded model results demonstrate benchmark robustness, while our clarifications address scope concerns and correct factual misconceptions. MLGym provides essential infrastructure for the growing field of AI research agents and a foundation for future extensions. We will add all results and clarifications to the final manuscript. We appreciate the reviewers' constructive feedback and hope the revised manuscript addresses all concerns raised.
This paper presents MLGym, a framework and benchmark for evaluating LLM agents on 13 diverse machine learning tasks spanning computer vision, natural language processing, game theory, and reinforcement learning. The benchmark requires agents to engage in complete research cycles including hypothesis generation, implementation, experimentation, and iterative refinement to improve performance metrics.
Strength: The proposed benchmark is useful for fostering research and challenging LLM agents going beyond simple reasoning or code generation to tasks that require more deliberate and long-horizon planning. This benchmark fills an important niche in LLM agent evaluation and provides valuable infrastructure for the research community to build on top of it.
There are some concerns raised by reviewers, most of which were adequately addressed during the rebuttal, either by clarification or by additional results, and could be incorporated into the next version.
Areas for improvements:
- Extension to multi-agent settings: as pointed out by reviewers, the framework's design naturally supports multi-agent research scenarios. While this falls outside the scope of the paper, the authors promised to include a clear roadmap and potential future work in this direction.
- Reframing the “AI Research tasks” claim to be more reflective of the engineering-focused and experimental-design nature of the tasks.
- Provide a discussion on how “level 1 capabilities” can enable higher-level research, and create standardized templates/guidelines that researchers can easily adapt to expand the benchmark.