PaperHub
Average Rating: 5.5 / 10
Decision: Rejected (4 reviewers)
Individual Ratings: 5, 6, 6, 5 (lowest 5, highest 6, std 0.5)
Confidence: 3.8
Soundness: 3.3
Contribution: 2.8
Presentation: 3.3
Venue: ICLR 2025

AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05
TL;DR

This paper introduces AutoML-Agent, a novel multi-agent framework for building automated machine learning pipelines from natural language instructions, with new planning and verification strategies.

Abstract

Keywords
Automated Machine Learning, Multi-Agent Framework, Large Language Models

Reviews and Discussion

Review
Rating: 5

This paper introduces AutoML-Agent, a multi-LLM-agent framework that aims to automate the entire pipeline of automated machine learning. The framework employs Retrieval-Augmented Planning (RAP), which uses retrieved knowledge to devise multiple end-to-end plans, aiming to enhance exploration and search over more plans. The framework is claimed to be user-friendly to non-expert users, allowing them to build data-driven solutions solely through natural language instructions. The experiments seem to be extensive, investigating the effectiveness of AutoML-Agent on 7 downstream tasks and demonstrating its superiority over existing models.

Strengths

[+] Integrates many techniques from LLM agents into the AutoML-Agent framework, including Retrieval-Augmented Planning, task decomposition, plan proposal, planning verification, etc.

[+] The experiments seem to be extensive and the results look good.

Weaknesses

[-] The core contribution of this paper is the multi-agent framework with RAP for processing AutoML tasks in an end-to-end manner. While the ambition of AutoML is significant and the framework is claimed to be powerful, the benchmarks and tasks presented are very simple. I believe these tasks would not require such a complex framework and should be manageable for most ML practitioners.

[-] Additionally, I question the motivation and necessity behind devising this AutoML-agent framework. Do non-expert users really exist and need this level of assistance? Although the limitation section acknowledges the issues of hallucination and unreliability within LLMs, I want to emphasize that systems or frameworks with bugs can be prohibitive for non-expert users, making it extremely difficult for them to debug and resolve problems. This is not necessarily a flaw of this paper, but it seems premature to propose frameworks that may not be practical at this time.

Questions

  1. Why does RAP propose multiple plans for a task? In particular, this is not inference-time compute (thoughts instead of execution); it actually executes all the plans and compares the results at the end, which incurs significant costs.
  2. How do you judge the difference between an LLM-agent framework for AutoML and a general LLM agent? It seems to me that a general LLM-agent framework can cater to AutoML tasks, since they are just a small subset of real complex tasks.
Comment

Why does RAP propose multiple plans for a task? In particular, this is not inference-time compute (thoughts instead of execution); it actually executes all the plans and compares the results at the end, which incurs significant costs.

R3. As clarified in Section 3.5 of the paper, the multiple-plan approach in RAP is designed not to impose significant computational overhead but to balance exploration with efficiency. We achieve this balance through the integration of Pseudo Data Analysis and Training-Free Model Search and Hyperparameter Optimization, both of which rely solely on LLM inference. This eliminates the need for executing all candidate plans fully during the exploration phase.

To further minimize runtime costs, only plans that pass our rigorous multi-stage verification process, as described in Section 3.6, are selected for actual runtime execution. This ensures that computational resources are spent only on the most promising solutions. As shown in Table R3 (also Section 4.3), the total search time to derive the code for a deployable model is approximately 10 minutes, which constitutes a negligible fraction of the overall model training and optimization time. Thus, the proposed framework ensures a computationally efficient implementation while preserving the robustness and adaptability of the AutoML process.
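To make this exploration-then-verification flow concrete (before the timing breakdown in Table R3 below), here is a minimal, hypothetical sketch. It is not the authors' code: the helpers `llm_complete`, `extract_score`, `passes_verification`, and `execute_pipeline` are placeholders for LLM calls and for the paper's verification and execution stages.

```python
# Hypothetical sketch of training-free exploration followed by selective execution.
# All helper names are placeholders, not the authors' actual implementation.

def llm_complete(prompt: str) -> str:
    """Placeholder for a single LLM inference call."""
    raise NotImplementedError

def propose_plans(task: str, n_plans: int = 3) -> list[str]:
    # Exploration: each candidate plan costs one LLM call, no model training.
    return [llm_complete(f"Devise end-to-end AutoML plan #{i} for: {task}") for i in range(n_plans)]

def extract_score(summary: str) -> float:
    """Placeholder: parse a numeric score out of the LLM's textual summary."""
    raise NotImplementedError

def simulate_plan(plan: str) -> dict:
    # Pseudo data analysis + training-free model search/HPO: still LLM inference only.
    summary = llm_complete(f"Predict top-k models, hyperparameters, and scores for:\n{plan}")
    return {"plan": plan, "summary": summary, "score": extract_score(summary)}

def passes_verification(candidate: dict, constraints: dict) -> bool:
    """Placeholder for the multi-stage verification against user constraints."""
    raise NotImplementedError

def execute_pipeline(candidate: dict):
    """Placeholder: generate and run code for the selected plan."""
    raise NotImplementedError

def run(task: str, constraints: dict):
    candidates = [simulate_plan(p) for p in propose_plans(task)]
    verified = [c for c in candidates if passes_verification(c, constraints)]
    best = max(verified, key=lambda c: c["score"])
    return execute_pipeline(best)  # only one plan ever incurs real runtime cost
```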

Table R3. Time costs averaged across different tasks and datasets for a single run under the constraint-free and constraint-aware settings.

| Cost | Prompt Parsing | Request Verification | Retrieval & Planning | Plan Execution | Execution Verification | Selection and Summarization | Code Generation | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Constraint-Free, Time (s) | 10.78 | 1.91 | 187.71 | 136.34 | 1.04 | 17.88 | 182.60 | 538.25 |
| Constraint-Aware, Time (s) | 14.21 | 3.63 | 182.38 | 98.62 | 1.37 | 20.25 | 191.90 | 512.35 |

We hope this clarifies our approach and addresses the reviewer’s concerns regarding computational costs.

How do you judge the difference between an LLM-agent framework for AutoML and a general LLM agent? It seems to me that a general LLM-agent framework can cater to AutoML tasks, since they are just a small subset of real complex tasks.

R4. We appreciate the reviewer’s observation regarding the applicability of general-purpose LLM-agent frameworks for handling AutoML tasks. While it is true that generic LLM-agents are versatile and can theoretically execute certain AutoML tasks, they lack critical domain-specific capabilities essential for achieving robust and reliable performance across the structured and interdependent processes of a full AutoML pipeline. Generic LLM frameworks are typically designed to support broad problem domains and rely heavily on user-defined instructions or augmentations, often leading to inefficiencies, suboptimal performance, and increased failure rates in complex tasks like end-to-end AutoML. In contrast, AutoML-Agent is purpose-built with explicit mechanisms, such as retrieval-augmented planning, specialized agents for sub-tasks, and multi-stage verification, all tailored to address the unique challenges of AutoML.

Furthermore, as highlighted in Table R4, even when a general-purpose LLM leverages the same backbone model, they consistently underperform compared to specialized methods like DS-Agent or AutoML-Agent in both success rates and downstream task metrics. AutoML-Agent not only outperforms these frameworks but also uniquely integrates the entire pipeline into a seamless, modular workflow, significantly reducing errors and enhancing usability for diverse downstream tasks. This distinction underscores the importance of designing specialized frameworks like AutoML-Agent, which prioritize task-specific optimizations over generality, ensuring both efficiency and reliability in automating full AutoML pipelines.

Table R4. Average performance scores (from Tables 5,6, and 7) between different LLM frameworks.

| Method | Success Rate | Normalized Performance Score | Comprehensive Score |
| --- | --- | --- | --- |
| Single-Agent (GPT-4o) | 0.268 | 0.229 | 0.249 |
| DS-Agent (GPT-4o) | 0.363 | 0.298 | 0.330 |
| AutoML-Agent (GPT-4o) | 0.936 | 0.807 | 0.871 |
Comment

I appreciate the thorough responses from the authors and I've read them carefully.

Comment

Dear Reviewer N9Xd,

We sincerely appreciate your constructive feedback, which has helped us clarify and strengthen the key aspects of our work. We hope that the additional details we provided have addressed the concerns you raised. Please let us know if there are any points in our previous responses that remain unclear or if there are additional concerns that we have not fully addressed. We would be happy to provide further clarification.

Thank you again for taking the time to carefully read and consider our responses.

Best regards,

Submission 7839 Authors

Comment

Dear Reviewer N9Xd,

We would like to again express our sincere gratitude for your thoughtful feedback on our manuscript. We believe that we have carefully addressed all of your concerns and have made every effort to improve the quality of the work based on your valuable suggestions. If there are any remaining concerns that still need clarification, we would be more than happy to address them.

If there is no remaining concern, we kindly ask if you could consider updating your review to reflect the revisions we’ve made. Thank you once again for your time and consideration.

Warm regards, The Authors

Comment

Dear Reviewer N9Xd,

Thank you for your thoughtful feedback and for taking the time to carefully read and engage with our responses. We greatly appreciate your acknowledgment of the thoroughness of our rebuttal.

As the discussion phase nears its conclusion (approximately one day remaining), we understand that you have chosen to maintain your current score. To ensure that we fully address any remaining concerns, could you kindly clarify if there are specific aspects of our rebuttal or the paper that you feel were not adequately resolved? If there are particular points that need further elaboration or clarification to merit a higher score, we would be more than willing to provide additional information or analysis before the rebuttal period ends.

We also hope to emphasize the positive aspects highlighted by the other reviewers, as well as the constructive dialogue we've had during this process. Your support could significantly enhance the chances of this work being shared with the broader community.

Thank you once again for your valuable feedback and for helping us improve our work.

Sincerely,

The Authors of Submission 7839

Comment

Additionally, I question the motivation and necessity behind devising this AutoML-agent framework. Do non-expert users really exist and need this level of assistance? Although the limitation section acknowledges the issues of hallucination and unreliability within LLMs, I want to emphasize that systems or frameworks with bugs can be prohibitive for non-expert users, making it extremely difficult for them to debug and resolve problems. This is not necessarily a flaw of this paper, but it seems premature to propose frameworks that may not be practical at this time.

R2. We appreciate the reviewer’s thoughtful feedback and the opportunity to clarify the motivation and necessity of the AutoML-Agent framework, as well as its practicality for non-expert users.

  • Motivation and Necessity: In the era of generative AI, many companies are adopting AI technologies. However, these companies often lack AI experts, leading to software engineers—who are non-experts in AI—attempting to implement such solutions and facing significant challenges. Researchers from various domains outside of AI (e.g., economics, chemistry, healthcare) are increasingly seeking to apply AI models in their work but struggle due to their lack of specialized AI expertise. Besides, the existence of non-expert users in the ML landscape is well-documented in relevant studies [a-d]. For example, an early AutoML survey paper [a] highlights the growing demand for ML tools among stakeholders across various domains. It emphasizes that AutoML tools aim to make ML accessible to non-experts, improve efficiency, and accelerate research, addressing the pressing need for user-friendly AI solutions. Similarly, as thoroughly illustrated in Section 3 of a recent study [c], the use of LLMs as interfaces for AutoML and as components of AutoML systems offers significant opportunities for non-ML experts seeking to apply off-the-shelf data-driven solutions to their problems.
  • Example Scenarios: To further illustrate the necessity and practicality of AutoML-Agent, consider an academic researcher aiming to evaluate several ML models for a novel dataset within a constrained timeline. Traditionally, this involves significant manual effort in setting up pipelines, selecting models, and debugging code—steps that can be both error-prone and time-consuming. Similarly, in industry settings, ML engineers often need to rapidly prototype models for specific business requirements, such as creating a lightweight and efficient spam detection system for mobile applications. AutoML-Agent enables such users to focus on high-level problem formulation by generating deployment-ready models that adhere to specified constraints, like latency or accuracy, directly from natural language task descriptions. This capability reduces setup time and errors, enabling users to focus on innovation rather than implementation logistics.
  • Practicality and Error Mitigation: While we acknowledge the inherent limitations of LLMs, such as potential hallucination, these challenges are mitigated in our framework through:
    • Retrieval-Augmented Planning: Ensures robustness by grounding plans in external knowledge.
    • Multi-Stage Verification: Detects and corrects errors at each stage, ensuring reliable outputs.
    • Structured Prompt Parsing: Aligns the framework's capabilities with precise user requirements, reducing ambiguities.

We hope this clarification adequately addresses your concerns and demonstrates the value and timeliness of our work.

References

  • [a] Karmaker, Shubhra Kanti, et al. "Automl to date and beyond: Challenges and opportunities." ACM Computing Surveys (CSUR) 54.8 (2021): 1-36.
  • [b] Sun, Yuan, et al. "AutoML in The Wild: Obstacles, Workarounds, and Expectations." Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 2023.
  • [c] Tornede, Alexander, et al. "AutoML in the Age of Large Language Models: Current Challenges, Future Opportunities and Risks." TMLR. 2024.
  • [d] Davenport, T., and R. Bean. "Five key trends in AI and data science for 2024." MIT Sloan Management Review (2024).
Comment

The core contribution of this paper is the multi-agent framework with RAP for processing AutoML tasks in an end-to-end manner. While the ambition of AutoML is significant and the framework is claimed to be powerful, the benchmarks and tasks presented are very simple. I believe these tasks would not require such a complex framework and should be manageable for most ML practitioners.

R1. We sincerely appreciate the reviewer’s acknowledgment of the ambition and core contributions of our work. We respectfully address the concerns regarding the simplicity of the benchmarks and tasks presented.

  • Benchmark Selection: The benchmarks were chosen to enable fair comparisons with existing approaches, as they have been adopted in recent studies [a, b]. While these tasks may appear straightforward to skilled ML practitioners, they pose significant challenges for automated frameworks, especially in achieving seamless integration of data handling, model selection, hyperparameter tuning, and deployment. Our approach aims to automate the entire AutoML pipeline end-to-end, significantly reducing the reliance on manual effort and extensive domain expertise. Notably, we demonstrated that a current state-of-the-art framework, DS-Agent [a], still struggles with even these benchmarks, underscoring the value of our contribution and the inherent difficulties of full-pipeline code generation. That is, the simplicity of the benchmarks does not negate the complexity of automating the entire process; instead, these tasks serve to validate the framework's capacity for generalization across diverse data modalities and application domains.
  • Comparison with ML Practitioners: We would like to emphasize that AutoML-Agent significantly enhances usability for non-expert users, addressing a key gap in current AutoML tools. For users with limited expertise, the framework lowers the barrier to deploying robust models, which is crucial in democratizing access to AI technologies. For ML-expert users, our framework provides value by offering improvements in efficiency and performance, even exceeding state-of-the-art models built by human experts, such as TabPFN [c] and iTransformer [d].

References

  • [a] Guo, Siyuan, et al. "DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning." ICML 2024.
  • [b] Liu, Siyi, Chen Gao, and Yong Li. "Large Language Model Agent for Hyper-Parameter Optimization." arXiv preprint arXiv:2402.01881 (2024).
  • [c] Hollmann, Noah, et al. "TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second." ICLR 2023.
  • [d] Liu, Yong, et al. "iTransformer: Inverted Transformers Are Effective for Time Series Forecasting." ICLR 2024.
Review
Rating: 6

This paper introduces AutoML-Agent, an innovative multi-agent large language model (LLM) framework designed to achieve full automation in machine learning (AutoML) from data retrieval to model deployment. The framework receives task descriptions from users and coordinates multiple specialized LLM agents to execute tasks in parallel, thereby improving search efficiency. It also proposes a retrieval-enhanced planning strategy to enhance exploration and optimize planning. Furthermore, AutoML-Agent integrates structured prompt parsing and multi-stage validation to ensure the quality of the generated solutions and instructions.

Strengths

  • The AutoML-Agent framework proposed in this paper is an innovative multi-agent system that leverages the capabilities of large language models (LLMs) to achieve full automation in the workflow from data retrieval to model deployment.
  • The experimental results demonstrate the effectiveness of the proposed method.

Weaknesses

  • The AutoML-Agent framework proposed in this paper relies on the collaborative work of multiple agents, which may lead to significant computational resource consumption. In environments with limited computing power, implementing this framework using open-source large language models (LLMs) might face challenges.

Questions

  • The method proposed in this paper has only been experimentally validated on current SOTA models, Mixtral-8x7B and GPT-4o. Given that the performance improvements observed in downstream tasks may primarily be attributed to the inherent advantages of the large language models (LLMs) themselves rather than the proposed method, I am uncertain about the generalizability of the experimental conclusions to smaller-scale or less powerful models. To further demonstrate the robustness of the proposed method, it is suggested that the authors supplement the paper with experimental results on such models.
Comment

The AutoML-Agent framework proposed in this paper relies on the collaborative work of multiple agents, which may lead to significant computational resource consumption. In environments with limited computing power, implementing this framework using open-source large language models (LLMs) might face challenges.

R1. We thank the reviewer for highlighting potential challenges in resource-constrained environments. Below, we address these concerns.

  • Resource Consumption: To reduce computation costs in multi-agent systems, it is possible to process tasks sequentially by loading only a single agent at a time, or to reduce the number of agents, thereby adjusting the degree of parallelism (see the sketch after this list). The AutoML-Agent framework is designed with modularity and flexibility at its core, enabling adaptation to diverse resource settings. Specifically, while the framework supports parallel multi-agent execution to optimize task completion time, it also allows for sequential task execution by aligning the agent roles with the system's state. This configuration significantly reduces memory and processing demands, making the framework compatible with resource-constrained environments.
  • Open LLMs: The backbone model in our framework does not necessarily have to be an open-source LLM, which typically incurs significant infrastructure costs. Instead, commercial LLMs offering more cost-effective APIs can be used. As presented below, given the relatively low cost of less than $1 (please see Section 4.3 or Table R1 below) for generating the code of a deployable model using GPT-4o, we believe this low operational cost obviates the immediate need for open-source LLMs. Moreover, open-source LLMs typically demand fine-tuning or additional engineering to match GPT-4o's performance, further complicating their deployment in constrained environments. Thus, we believe that our choice balances efficiency, cost-effectiveness, and practicality for a wide range of users.
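As referenced in the first bullet above, the following is a minimal sketch (our illustration, not the authors' code) of dispatching the same set of sub-tasks either in parallel or one agent at a time; `run_subtask` is a hypothetical stand-in for a specialized agent handling one sub-task.

```python
# Illustrative sketch of parallel vs. sequential agent execution (hypothetical helpers).
from concurrent.futures import ThreadPoolExecutor

def run_subtask(agent: str, subtask: str) -> str:
    """Placeholder: one specialized agent (e.g., a Data or Model Agent) handling one sub-task."""
    return f"{agent} finished: {subtask}"

def execute_plan(subtasks: dict[str, str], parallel: bool = True) -> list[str]:
    if parallel:
        # Default mode: sub-tasks run concurrently to minimize task completion time.
        with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
            futures = [pool.submit(run_subtask, a, t) for a, t in subtasks.items()]
            return [f.result() for f in futures]
    # Resource-constrained mode: load and run one agent at a time, trading latency for memory.
    return [run_subtask(a, t) for a, t in subtasks.items()]

# Example usage under an assumed sub-task split:
results = execute_plan({"data_agent": "analyze the dataset",
                        "model_agent": "search models and hyperparameters"},
                       parallel=False)
```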

Table R1. Time and monetary costs averaged across different tasks and datasets for a single run under the constraint-free and constraint-aware settings.

| Cost | Prompt Parsing | Request Verification | Retrieval & Planning | Plan Execution | Execution Verification | Selection and Summarization | Code Generation | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Constraint-Free, Money ($) | N/A | 0.00 | 0.02 | 0.14 | 0.00 | 0.06 | 0.04 | 0.27 |
| Constraint-Free, Time (s) | 10.78 | 1.91 | 187.71 | 136.34 | 1.04 | 17.88 | 182.60 | 538.25 |
| Constraint-Aware, Money ($) | N/A | 0.00 | 0.00 | 0.11 | 0.00 | 0.15 | 0.06 | 0.32 |
| Constraint-Aware, Time (s) | 14.21 | 3.63 | 182.38 | 98.62 | 1.37 | 20.25 | 191.90 | 512.35 |
Comment

The method proposed in this paper has only been experimentally validated on current SOTA models, Mixtral-8x7B and GPT-4o. Given that the performance improvements observed in downstream tasks may primarily be attributed to the inherent advantages of the large language models (LLMs) themselves rather than the proposed method, I am uncertain about the generalizability of the experimental conclusions to smaller-scale or less powerful models. To further demonstrate the robustness of the proposed method, it is suggested that the authors supplement the paper with experimental results on such models.

R2. We thank the reviewer for their thoughtful comments on the generalizability of our proposed method. Below, we provide clarifications to address these concerns.

  • Attribution of Performance Gains: The observed improvements are primarily due to the methodological contributions of AutoML-Agent, as evidenced by comprehensive ablation studies (Section 4.4 and Table R2.1 below). Key components, such as plan decomposition and multi-stage verification, consistently enhance downstream task performance. These results emphasize that the gains are not solely due to the capabilities of LLMs.

Table R2.1. Downstream task performance (i.e., normalized performance score) from ablation study on different variations.

| RAP | Plan Decomposition | Multi-Step Verification | Image Classification | Text Classification | Tabular Classification | Time-Series Forecasting | Node Classification | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  |  | 0.929 | 0.000 | 0.000 | 0.000 | 0.734 | 0.333 |
|  |  |  | 0.928 | 0.982 | 0.975 | 0.000 | 0.748 | 0.727 |
|  |  |  | 0.936 | 0.971 | 1.000 | 0.991 | 0.812 | 0.942 |
  • Challenges with Smaller Models: While we acknowledge the importance of validating our framework on smaller-scale models, these models exhibit systemic limitations that hinder their utility for complex tasks requiring code execution and extensive planning, which we briefly discussed in Section 5 of the revised paper. During the early stage of development, we found that small language models, including LLaMA-2-7b, Qwen1.5-14B, gemma-7b-it, WizardLM-70B [a], and even code-specialized models like CodeQwen1.5-7B, failed to generate executable code for tasks requiring extensive planning or interdependent processes (e.g., the full-pipeline skeleton script presented in Appendix A). These smaller models commonly exhibit similar issues, such as truncating the given code, changing comments without completing the code, producing partial completions, and returning the code exactly as given. As shown in Table R2.2 below, even vanilla GPT-3.5 and GPT-4o struggle with generating executable code for complex tasks involving extensive planning or interdependent processes. Similar findings have been corroborated by prior studies [b, c], suggesting that these challenges are systemic to smaller language models and not unique to our framework.

Table R2.2. Average performance scores (from Tables 5,6, and 7) between different LLM frameworks.

| Method | Success Rate | Normalized Performance Score | Comprehensive Score |
| --- | --- | --- | --- |
| Single-Agent (GPT-3.5) | 0.048 | 0.098 | 0.073 |
| Single-Agent (GPT-4o) | 0.268 | 0.229 | 0.249 |
| DS-Agent (GPT-4o) | 0.363 | 0.298 | 0.330 |
| AutoML-Agent (GPT-4o) | 0.936 | 0.807 | 0.871 |

We hope this explanation clarifies the design, evaluation rationale, and constraints in testing with smaller models. We are committed to extending this work to address these challenges in future iterations. Thank you again for the opportunity to clarify these points.

References

  • [a] Xu, Can, et al. "WizardLM: Empowering large pre-trained language models to follow complex instructions." ICLR. 2024.
  • [b] Guo, Siyuan, et al. "DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning." ICML 2024.
  • [c] Hong, Sirui, et al. "Data interpreter: An llm agent for data science." arXiv preprint arXiv:2402.18679 (2024).
Comment

Thank you to the authors for their detailed response, which has thoroughly addressed all my concerns and validated these answers with new experimental results. Based on this, I will update my rating for this paper.

Comment

Dear Reviewer Sn5B,

Thank you for your positive feedback and for taking the time to review our response in detail. We are pleased to hear that our clarifications have effectively addressed all of your concerns. Should you have any remaining questions or feel that further clarification or additional details would enhance your understanding or confidence in our work, please do not hesitate to let us know. We are committed to ensuring the highest quality of our contribution.

Best regards,

Submission 7839 Authors

Review
Rating: 6

This paper presents AutoML-Agent, a novel multi-agent framework designed for full-pipeline Automated Machine Learning (AutoML), addressing the limitations of existing AutoML systems that require technical expertise and are time-consuming. By leveraging large language models (LLMs) and facilitating collaboration between specialized agents, AutoML-Agent automates tasks from data retrieval to model deployment. The paper introduces a retrieval-augmented planning strategy to enhance exploration for optimal plans and decomposes each plan into sub-tasks handled by specialized agents. Extensive experiments demonstrate AutoML-Agent's higher success rate in automating the full AutoML process and yielding good performance across diverse domains, making it a significant contribution to the field.

Strengths

(1) The paper proposes a novel multi-agent framework tailored for full-pipeline AutoML, leveraging large language models (LLMs) in a task-agnostic manner from data retrieval to model deployment. This represents a groundbreaking attempt in the field of AutoML, offering a comprehensive solution to automate the entire AI development pipeline.

(2) By introducing retrieval-augmented planning with role-specific plan decomposition and prompt-based plan execution, the paper addresses the complexity of planning in full-pipeline AutoML. This approach enhances the flexibility and efficiency of the search process, enabling the framework to handle diverse and complex tasks more effectively.

(3) This paper integrates structure-based prompt parsing and multi-stage verification to ensure the quality of resulting solutions and instructions before actual code implementation. This approach improves the accuracy of the full-pipeline implementation, demonstrating the superiority of the proposed AutoML-Agent framework through extensive experiments across multiple application domains.

Weaknesses

(1) While the paper presents an innovative framework, it would greatly benefit the research community if the associated code were openly available on platforms like GitHub.

(2) The paper could benefit from a more in-depth discussion of the framework's limitations and potential areas for improvement.

Questions

None.

Details of Ethics Concerns

None.

Comment

Dear Reviewer XBqa,

Thank you for taking the time to thoroughly review our responses and for acknowledging that we have addressed all of your concerns satisfactorily. We greatly appreciate your recognition of our efforts and your thoughtful review. If there are any additional points or areas that require further clarification or discussion to strengthen our paper, we would be happy to address them.

Sincerely,

Submission 7839 Authors

Review
Rating: 5

This paper proposes an SOP for solving AutoML tasks with multiple LLM agents. The user's prompt, i.e., the ML task, is first passed to a prompt agent, which re-organizes the instruction into an ML task with specific information, e.g., the problem itself, the data, and the model. Then the ML task is passed to a manager agent to generate a plan with RAG. The manager agent generates multiple plans, and each plan is decomposed into executable sub-tasks that are assigned to the corresponding agent to perform. The manager selects the best executable plan from the generated plans to generate code. The prompt parser agent is fine-tuned from an open-source LLM, and the rest of the agents are GPT-based models. The proposed method is evaluated on different ML tasks, such as image classification, NLP, and graph mining problems, and compared with various baselines.

Strengths

  1. The paper is well-organized and written.
  2. The detailed prompt for each agent is given, which significantly improves the reproducibility of the work.
  3. The performance of the proposed model is promising against various baselines.
  4. I appreciate that the resource costs are given.

Weaknesses

  1. The evaluation is only performed on synthetic datasets; no public benchmarks are tested. Public datasets like IMDB, Cora, and Citeseer are used for AutoML tasks. The synthetic dataset I mentioned is the one for testing the performance of AutoML algorithms. In Section 4.1, there is no evidence that public benchmarks for evaluating AutoML algorithms are used. Please specify any benchmarks if you do use some.

Questions

  1. I am unclear about how the manager agent verifies the plans if no plans are executed and tested with generated code. Can you explain a bit?
  2. The generated plans seem homogeneous, similar, and not diverse enough, making it easy for all to fail if one fails (Appendix C2). Therefore, I am not sure about the effectiveness of the plan selection, given that the manager selects the plan based on the "simulated" result from the data and model agent.
  3. Using an LLM agent to solve AutoML is not a new research problem. Specifically, what is the difference from the paper https://arxiv.org/abs/2410.17238? Both papers use reflection, while the paper above uses a more sophisticated mechanism, e.g., MCTS, which yields strong performance. Why don't you consider using MCTS instead? Is this work compatible with MCTS?
  4. What is the reason for fine-tuning Mixtral for the prompt parser?
Comment

Using an LLM agent to solve AutoML is not a new research problem. Specifically, what is the difference from the paper https://arxiv.org/abs/2410.17238? Both papers use reflection, while the paper above uses a more sophisticated mechanism, e.g., MCTS, which yields strong performance. Why don't you consider using MCTS instead? Is this work compatible with MCTS?

R4. Thank you for referencing the MCTS-based work. We acknowledge the reviewer’s concerns and appreciate the opportunity to clarify our framework’s novelty. We also agree with the reviewer that using LLM agents to solve AutoML tasks is not a novel research problem, as we discussed in the Introduction and Related Work sections. However, existing methods focus primarily on either data or modeling aspects, whereas our framework is a full-pipeline AutoML platform, which aims to output deployable model APIs given task descriptions.

Regarding the SELA paper, we would like to make the following clarifications.

  • Novelty and Differences: We believe that our work complements rather than competes with MCTS-based methods. The novelty of our framework lies in RAP's integration with LLMs within an end-to-end AutoML context. RAP leverages LLMs’ inherent language understanding and API retrieval capabilities to adaptively define the search space without significant computational overhead, an area where MCTS-based approaches may face challenges. While SELA focuses on iterative pipeline optimization within a tree structure, AutoML-Agent prioritizes modularity, flexibility, and real-time retrieval capabilities for full-pipeline development. These differences highlight our framework’s contribution to advancing AutoML in distinct yet complementary directions.
  • Rationale for Not Using MCTS: The decision not to adopt MCTS is rooted in our framework’s design principles. In this work, we aim for full-pipeline automation and end-to-end model deployment, requiring flexibility across a wide range of tasks. Additionally, one of our key motivations was to reduce the computational overhead associated with the search process; employing MCTS would have increased search costs. Instead, we address the problem by facilitating communication between agents (relying solely on LLM inferences) and enhancing planning quality through retrieval-augmented knowledge.
  • Compatibility with MCTS: The structure of AutoML-Agent could theoretically integrate MCTS into the planning and execution stages. While theoretically feasible, integrating MCTS in AutoML-Agent would introduce computational overhead in the search process.

We would also like to point out that the paper you suggested became available approximately three weeks (on October 22) after the ICLR submission deadline (on October 2). Nevertheless, we have included a comparison with SELA in Table 1 of the revised version of the paper. Therefore, we believe that our framework's novelty and contributions remain valid as presented.

What is the reason for fine-tuning Mixtral for the prompt parser?

R5. The reason for fine-tuning is to ensure that the Prompt Agent generates a precise JSON format required for subsequent processes, which necessitates instruction tuning of an LLM. While the GPT-4 family is overly complex for this task, models like Llama-2, Mistral, and Mixtral perform effectively, with Mixtral yielding the best results. As discussed in Section 3.3, without instruction tuning, the LLM outputs JSON key-value pairs that, while semantically valid, are less predictable. The JSON schema details are available at /prompt_agent/WizardLAMP/template_schema.json. Fortunately, due to recent updates in OpenAI’s GPT models, we can now also use the proposed JSON schema with OpenAI APIs for more seamless integration.
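To illustrate why a fixed schema matters for the downstream processes, the following is a hedged sketch of validating the Prompt Agent's output against a JSON schema with the `jsonschema` package. The field names below are hypothetical and do not reproduce the actual template_schema.json.

```python
# Hedged sketch: enforcing a JSON schema on the Prompt Agent's output.
# The schema fields below are hypothetical, not the paper's actual template_schema.json.
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

TASK_SCHEMA = {
    "type": "object",
    "properties": {
        "problem": {"type": "string"},
        "dataset": {"type": "string"},
        "model_preferences": {"type": "array", "items": {"type": "string"}},
        "constraints": {"type": "object"},
    },
    "required": ["problem", "dataset"],
}

def parse_user_prompt(raw_llm_output: str):
    """Return the parsed task description, or None if it is malformed (triggering a retry)."""
    try:
        parsed = json.loads(raw_llm_output)
        validate(instance=parsed, schema=TASK_SCHEMA)
        return parsed
    except (json.JSONDecodeError, ValidationError):
        return None  # ask the Prompt Agent to regenerate the JSON

task = parse_user_prompt('{"problem": "image classification", "dataset": "butterfly photos"}')
```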

Comment

We appreciate the reviewers' detailed feedback and are pleased to address your questions and comments as follows.

The evaluation is only performed on synthetic datasets; no public benchmarks are tested. Public datasets like IMDB, Cora, and Citeseer are used for AutoML tasks. The synthetic dataset I mentioned is the one for testing the performance of AutoML algorithms. In Section 4.1, there is no evidence that public benchmarks for evaluating AutoML algorithms are used. Please specify any benchmarks if you do use some.

R1. Thank you for this suggestion. However, this seems like a misunderstanding, since no synthetic datasets were used in our experiments. As presented in Table 2 and Section A.2, all datasets used for the experiments are public datasets collected from various sources. Additionally, Cora and Citeseer were also used in the experiments for node classification tasks.

I am unclear about how the manager agent verifies the plans if no plans are executed and tested with generated code. Can you explain a bit?

R2. Throughout the planning and search process, all agents interact solely through message passing to achieve their goals. The Agent Manager does not verify the plans, as it is responsible for creating them. Instead, it verifies the executed results provided by the Model Agent(s) against any constraints specified by the user. For each Model Agent to execute a given plan, it receives the plan, relevant instructions, knowledge from the Agent Manager, and anticipated data analysis results from the Data Agent. Based on this information, the Model Agent performs model search and hyperparameter optimization (HPO). The outcome is a textual summary predicting the top-k models with associated optimal hyperparameters and performance scores, which is executed only after verification.
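For concreteness, here is a minimal sketch (our illustration, with hypothetical field names) of the verification step described above: the Agent Manager filters the Model Agents' predicted outcomes against the user's constraints before any plan is actually executed.

```python
# Hedged sketch of the Agent Manager checking Model Agent outcomes against user constraints.
# Field names (predicted_score, estimated_latency_ms) are hypothetical.
from dataclasses import dataclass

@dataclass
class CandidateResult:
    plan_id: int
    model_name: str
    predicted_score: float
    estimated_latency_ms: float

def manager_verify(results: list[CandidateResult], constraints: dict) -> list[CandidateResult]:
    """Keep only candidates whose predicted outcome satisfies the user-specified constraints."""
    max_latency = constraints.get("max_latency_ms", float("inf"))
    min_score = constraints.get("min_score", 0.0)
    return [r for r in results
            if r.estimated_latency_ms <= max_latency and r.predicted_score >= min_score]

# Example: two predicted outcomes; the second violates a latency constraint and is dropped.
candidates = [CandidateResult(0, "resnet18", 0.91, 12.0),
              CandidateResult(1, "vit_large", 0.93, 85.0)]
survivors = manager_verify(candidates, {"max_latency_ms": 50})
```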

The generated plans seem homogeneous, similar, and not diverse enough, making it easy for all to fail if one fails (Appendix C2). Therefore, I am not sure about the effectiveness of the plan selection, given that the manager selects the plan based on the "simulated" result from the data and model agent.

R3. As described above, the Agent Manager does not directly verify the plans but instead evaluates the executed results, which can vary even with similar plans depending on how the Data and Model agents perform on each plan. The diversity of the plans can be easily controlled by adjusting the model hyperparameter temperature. We have provided additional planning results in the anonymous repository at /example_plans. Furthermore, we propose a multi-stage verification process that accommodates cases where all plans fail by using recorded feedback through reflection. This component enhances the success rate of any revised plans.
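A hedged sketch of the two mechanisms mentioned above: plan diversity via the sampling temperature, and retrying with recorded feedback (reflection) when every plan fails verification. The helpers `llm_complete` and `verify` are placeholders rather than the authors' implementation.

```python
# Hypothetical sketch: temperature-controlled plan diversity plus reflection-based retries.

def llm_complete(prompt: str, temperature: float = 0.7) -> str:
    """Placeholder for an LLM inference call with a sampling temperature."""
    raise NotImplementedError

def propose_diverse_plans(task: str, n_plans: int = 3, temperature: float = 1.0) -> list[str]:
    # Higher temperature yields more varied plans; lower temperature yields near-duplicates.
    return [llm_complete(f"Plan #{i} for: {task}", temperature=temperature) for i in range(n_plans)]

def plan_with_reflection(task: str, verify, max_rounds: int = 3) -> list[str]:
    feedback: list[str] = []
    for round_id in range(max_rounds):
        prompt = task if not feedback else task + "\nAvoid these recorded failures:\n" + "\n".join(feedback)
        plans = propose_diverse_plans(prompt)
        passing = [p for p in plans if verify(p)]
        if passing:
            return passing
        # All plans failed: record feedback so the next round can revise its proposals.
        feedback.append(f"Round {round_id}: all {len(plans)} plans failed verification.")
    return []
```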

Comment

Dear Reviewer fRNo,

As the discussion period for ICLR ends on November 26th (AoE), we would greatly appreciate it if you could review the rebuttal we submitted, which we hope addresses all the concerns you raised, before the discussion period closes.

Thank you again for your time and valuable insights. Please don’t hesitate to let us know if any additional clarifications are needed.

Best regards,

Submission 7839 Authors

Comment

I sincerely thank the authors for the detailed responses, especially those clarifying my misunderstanding of the synthetic datasets. I also thank the authors for pointing out that the paper I quoted, https://arxiv.org/abs/2410.17238, was released only after the ICLR submission and the experiments.

However, attributing the decision not to use MCTS to computational overhead is a bit less convincing. I suggest quantitatively comparing the performance gain and computational cost trade-offs. So, I'll keep my scores.

Comment

Dear Reviewer fRNo,

Thank you for your thoughtful feedback and for acknowledging our clarifications in the initial rebuttal. We greatly appreciate your suggestions, particularly regarding the trade-off analysis between computational cost and performance gains. We have further elaborated on this point below.


Potential Performance Gains: Based on the reported results in SELA (Table 2), the performance improvement of MCTS over random search is less than 4%. Given that our tailored search design already achieves strong performance, we expect the incremental gains from MCTS to be even more marginal. Notably, on the same dataset (software-defects), our framework (Best 66.4) outperforms SELA (Best 53.3) by 24% without MCTS. Thus, we believe the potential gains from MCTS are unlikely to justify its computational expense.

Elaboration on Computational Complexity: Implementing MCTS, as in SELA, requires extensive runtime model training during the search phase, with computational costs scaling significantly with model complexity and the search tree depth.

  • Specifically, SELA achieves reasonable performance by using 10 rollouts, each involving full model training. This implies a total search cost of 10x the model training time.
  • In contrast, our approach focuses on training-free search, relying solely on lightweight LLM inference, with model training reserved only for the final verification of the most promising model.

To illustrate, if model training takes 1 hour, MCTS-based search would require at least 10 hours, compared to the 10-minute search time in our framework, i.e., a 60x higher search cost. Hence, the added computational burden of MCTS would undermine the practicality and scalability of our framework.

The cost scales with the complexity of the models, leading to search times that can span several hours or even days. SELA's focus is restricted to tabular classification and regression tasks, where models are comparatively simpler. In contrast, our AutoML-Agent is designed to support a broader range of tasks, including image classification, text classification, time-series analysis, and graph mining. These inherently more complex models, with longer training times, exacerbate the computational burden.

Our approach mitigates these challenges by proposing retrieval-augmented planning and prompt-based plan execution, enabling training-free methods during the search phase. This design ensures efficient search time (i.e., under 10 minutes), regardless of model complexity, with training times for the final model potentially spanning hours.

Alignment with Paper Contributions: We would also like to emphasize that integrating MCTS into our framework represents an orthogonal contribution rather than an enhancement of our existing contributions. As demonstrated by SELA (e.g., tree design, UCT modification, state-saving, etc.), integrating MCTS is non-trivial and would likely detract from the core novelty of our framework. SELA’s reliance on MCTS is feasible due to its focus on tabular data, whereas our framework addresses diverse tasks requiring a more efficient search methodology. Our work emphasizes a customized, efficient search strategy tailored to specific challenges in full-pipeline AutoML.

Incorporating a standard search technique like MCTS, while noteworthy, would shift the focus from our novel contributions (e.g., retrieval-augmented planning and prompt-based execution) and could dilute the originality of the work presented. Consequently, we regard MCTS as a potential avenue for future exploration rather than a critical addition to the current scope. We believe our approach strikes a balance between scalability, computational efficiency, and performance, aligning with the broader goals of AutoML research.


We hope these additional clarifications effectively address your concerns and illustrate the rationale behind our design choices. We believe these points highlight the validity and significance of our contributions, irrespective of the inclusion of MCTS. We respectfully request you to consider revisiting your evaluation in light of these clarifications.

Thank you again for your time and constructive feedback.

Best regards,

Submission 7839 Authors

Comment

Dear Reviewer fRNo,

We greatly appreciate your thoughtful feedback and the opportunity to address your concern about the computational overhead and performance trade-off of not using MCTS.

Quantitative Comparison: To substantiate our design choice, we conducted additional experiments comparing our AutoML-Agent's default setting (number of plans = 3) with SELA's MCTS-based setting (number of rollouts = 10). Due to time constraints, we selected two datasets from each downstream task: one representing the most difficult and the other the easiest dataset based on the performance scores reported in Table 6 of the SELA paper. This approach allows us to observe the highest and lowest potential gains.

Key Findings

  • Search Time Efficiency: Our AutoML-Agent demonstrates a significant reduction in computational cost, achieving ~8x faster search times compared to SELA's MCTS. This substantial efficiency gain directly supports our design choice and aligns with our method's design goal of prioritizing scalability and practical applicability.
  • Performance Trade-Off: Despite the reduced computational cost, our method achieves comparable or superior performance. The average normalized performance score for our approach (0.612) slightly exceeds that of SELA (0.599), with only minor performance deficits on a few datasets.

Table A. Comparison of search time and normalized performance score between MCTS-based SELA and our AutoML-Agent.

| Dataset | Search Time (s), SELA (MCTS) | Search Time (s), AutoML-Agent (Ours) | Normalized Performance Score, SELA (MCTS) | Normalized Performance Score, AutoML-Agent (Ours) |
| --- | --- | --- | --- | --- |
| Binary Classification |  |  |  |  |
| smoker-status | 2736.78 | 206.48 | 0.785 | 0.762 |
| click-prediction-small | 2227.85 | 256.24 | 0.238 | 0.352 |
| Multi-Class Classification |  |  |  |  |
| mfeat-factors | 1304.40 | 219.45 | 0.957 | 0.940 |
| wine-quality-white | 2372.77 | 206.41 | 0.650 | 0.652 |
| Regression |  |  |  |  |
| colleges | 674.29 | 232.03 | 0.876 | 0.878 |
| house-prices | 2906.95 | 378.85 | 0.090 | 0.090 |
| Average | 2037.18 | 249.91 | 0.599 | 0.612 |

Discussion of Trade-Off: As shown in Table A, while the MCTS-based SELA occasionally outperforms our AutoML-Agent on specific datasets (e.g., smoker-status and mfeat-factors), the performance gains are marginal and come at the expense of significantly higher computational overhead. The trade-off between computational efficiency and performance strongly supports our approach.

We believe these new experiments provide compelling evidence substantiating our design choice to forego MCTS due to its high computational cost relative to the minimal performance gains. We hope this addition fully addresses the reviewer's concerns regarding quantitative evidence and strengthens the justification for our method. In light of these results, we respectfully request you to consider revisiting your evaluation.

Thank you once again for your valuable time and insights.

Best regards,

Submission 7839 Authors

Comment

Based on the reviewers' feedback, we have highlighted the revised sections of the paper in brown. The updates include:

  • An edited comparison table (Table 1) to include the concurrent SELA paper. (Reviewer fRNo)
  • Added discussion on limitations and future work in Section 5. (Reviewer XBqa)
AC Meta-Review

The paper "AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML" introduces a novel multi-agent framework, AutoML-Agent, designed to automate the entire machine learning pipeline using large language models (LLMs). The framework leverages natural language instructions to facilitate collaboration among specialized LLM agents, covering tasks from data retrieval to model deployment. Key innovations include a retrieval-augmented planning (RAP) strategy and a multi-stage verification process, which enhance the search for optimal plans and ensure the reliability of generated solutions. Extensive experiments across seven downstream tasks using fourteen datasets demonstrate the framework's effectiveness, achieving higher success rates and strong performance compared to existing methods.

Contributions The primary contributions of the paper are:

  1. Full-Pipeline AutoML Framework: AutoML-Agent is the first multi-agent framework to automate the entire AutoML pipeline, from data retrieval to model deployment, using LLMs and natural language instructions.
  2. Retrieval-Augmented Planning (RAP): A novel strategy that enhances exploration and optimizes plan generation by leveraging external knowledge, improving the efficiency and adaptability of the AutoML process.
  3. Multi-Stage Verification: A robust verification process that ensures the reliability of generated solutions, addressing potential errors and improving the quality of the final model.

Weaknesses

  1. Comparative Analysis and Novelty: Reviewer fRNo raised concerns about the novelty of the framework, particularly in comparison to a recently published paper using Monte Carlo Tree Search (MCTS). The reviewer suggested that MCTS could yield stronger performance and questioned the decision not to integrate it into AutoML-Agent.

  2. Reproducibility and Code Availability: Reviewer XBqa emphasized the importance of open-source code availability for reproducibility and community engagement.

  3. Generalizability to Smaller Models: Reviewer Sn5B questioned whether the framework's performance improvements were attributable to the inherent capabilities of state-of-the-art LLMs rather than the proposed method, suggesting additional experiments with smaller models.

  4. Motivation and Practicality: Reviewer N9Xd questioned the motivation and necessity of AutoML-Agent for non-expert users, citing potential issues with system reliability and debugging.

Additional Comments on Reviewer Discussion

  1. Novelty and MCTS Comparison (fRNo):

    • Concern: The reviewer compared AutoML-Agent to a post-submission MCTS-based paper, suggesting that MCTS could enhance performance and questioning the decision not to integrate it.
    • Response: The authors clarified that the MCTS paper was published after the submission deadline and argued that their RAP strategy aligns better with full-pipeline AutoML goals. They provided additional experiments to quantify the trade-offs, demonstrating that AutoML-Agent achieves strong performance with significantly lower computational costs.
    • Impact: The reviewer acknowledged the clarification but maintained their score, citing the lack of a quantitative comparison as a lingering concern. The authors' additional experiments addressed this but did not lead to a score adjustment.
  2. Reproducibility and Limitations (XBqa):

    • Concern: The reviewer emphasized the need for open-source code and a deeper discussion of limitations.
    • Response: The authors provided an anonymous repository link and committed to public release upon acceptance. They also integrated discussions of limitations and future work into the main paper.
    • Impact: The reviewer maintained their original score, indicating satisfaction with the responses.
  3. Generalizability to Smaller Models (Sn5B):

    • Concern: The reviewer questioned whether the performance gains were due to the capabilities of state-of-the-art LLMs rather than the proposed method.
    • Response: The authors provided ablation studies and explained the challenges of using smaller models for complex AutoML tasks, attributing performance gains to methodological innovations.
    • Impact: The reviewer updated their score positively, acknowledging the thorough responses and new experimental results.
  4. Motivation and Practicality (N9Xd):

    • Concern: The reviewer questioned the necessity of AutoML-Agent for non-expert users and raised concerns about reliability.
    • Response: The authors argued for the framework's relevance in real-world scenarios and highlighted error mitigation strategies.
    • Impact: The reviewer acknowledged the thorough responses but did not adjust their score, possibly due to lingering doubts about reliability for non-expert users.
Final Decision

Reject