PaperHub
Score: 6.8 / 10
Poster · 4 reviewers
Ratings: 4, 4, 4, 5 (min 4, max 5, std 0.4)
Confidence: 3.5
Novelty: 2.5 · Quality: 2.3 · Clarity: 3.0 · Significance: 2.3
NeurIPS 2025

MM-Agent: LLM as Agents for Real-world Mathematical Modeling Problem

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

We introduce MM-Agent and MM-Bench to enable and evaluate LLM-powered end-to-end mathematical modeling, achieving superior performance on real-world problems and a Finalist Award in MCM/ICM 2025.

Abstract

Mathematical modeling is a cornerstone of scientific discovery and engineering practice, enabling the translation of real-world problems into formal systems across domains such as physics, biology, and economics. Unlike mathematical reasoning, which assumes a predefined formulation, modeling requires open-ended problem analysis, abstraction, and principled formalization. While Large Language Models (LLMs) have shown strong reasoning capabilities, they fall short in rigorous model construction, limiting their utility in real-world problem-solving. To this end, we formalize the task of LLM-powered real-world mathematical modeling, where agents must analyze problems, construct domain-appropriate formulations, and generate complete end-to-end solutions. We introduce MM-Bench, a curated benchmark of 111 problems from the Mathematical Contest in Modeling (MCM/ICM), spanning the years 2000 to 2025 and covering ten diverse domains such as physics, biology, and economics. To tackle this task, we propose MM-Agent, an expert-inspired framework that decomposes mathematical modeling into four stages: open-ended problem analysis, structured model formulation, computational problem solving, and report generation. Experiments on MM-Bench show that MM-Agent significantly outperforms baseline agents, achieving an 11.88% improvement over human expert solutions while requiring only 15 minutes and $0.88 per task using GPT-4o. Furthermore, under official MCM/ICM protocols, MM-Agent assisted two undergraduate teams in winning the Finalist Award (top 2.0% among 27,456 teams) in MCM/ICM 2025, demonstrating its practical effectiveness as a modeling copilot.
Keywords
Mathematical Modeling Agent, LLM Agent, LLM

Reviews and Discussion

Review
Rating: 4

This paper introduces MM-Agent, a framework leveraging LLMs to automate real-world mathematical modeling tasks. The authors formalize the task of LLM-powered mathematical modeling and propose MM-Bench, a benchmark comprising 111 diverse problems from the Mathematical Contest in Modeling (MCM/ICM) spanning 2000–2025. MM-Agent decomposes the modeling process into four stages: problem analysis, structured model formulation, computational problem solving, and report generation. It incorporates the Hierarchical Mathematical Modeling Library (HMML) to facilitate method retrieval and an actor-critic mechanism for iterative optimization. Experiments show that MM-Agent outperforms baseline agents, achieving an 11.88% improvement over human expert solutions while maintaining cost efficiency.

Strengths and Weaknesses

Strengths

  • The authors submitted the code and a detailed appendix.
  • This work provides a system for solving problems and also designs a new benchmark.

Weaknesses

  • The paper claims that it "helped two undergraduate teams win the Finalist Award," but I didn't see more details in the main text or appendix.
  • Regarding the overall presentation of the paper, my impression after reading it is that there are a lot of design details, but the conveyed insights are quite limited. Personally, I don't particularly appreciate the writing style of this paper.
  • I have some points that I don't understand, as detailed in the following "Questions".

Questions

  1. Observing Figure 9, there are few charts included in the results generated by the LLM. Does the author have any comments on this?
  2. In my personal opinion, due to the limited competition time, the length and presentation quality of modeling competition papers may significantly influence human evaluations. What are the advantages of content generated by LLMs or content assisted by LLMs in these two aspects?
  3. When describing the proposed method, I suggest that more comparisons and discussions be conducted with existing methods to clearly define the incremental contribution of this work. From the existing text, I do not clearly understand the relationship between this work and ResearchAgent or Agent Laboratory.
  4. How would the entries from modeling competitions that have won various awards be scored in the rating system of this paper? I want to know how the model performs without human intervention compared to human contestants.

Limitations

It seems there is no dedicated paragraph discussing limitations.

Formatting Concerns

No

Author Response

Thanks to Reviewer 3X2K and global response

We sincerely thank Reviewer 3X2K for your constructive feedback and thoughtful questions. We greatly appreciate your recognition that our work provides both a practical system (MM-Agent) and a new benchmark (MM-Bench) for advancing research in LLM-powered mathematical modeling. Your comments have helped us better articulate our motivation, design contributions, and empirical validation.

Please find our point-by-point responses below.

W1: "The paper claims that it "helped two undergraduate teams win the Finalist Award," but I didn't see more details in the main text or appendix."

Response: We thank the reviewer for pointing this out and apologize for the lack of details in the original submission. Due to privacy concerns and competition rules, we are unable to disclose identifying information about the two participating teams. We would like to kindly clarify that under official MCM/ICM guidelines, AI agents are not permitted to participate independently; however, human teams are allowed to use LLMs as assistants. In this context, our MM-Agent was deployed as a copilot, working collaboratively with two undergraduate teams during the competition.

To better understand how MM-Agent supported their efforts, we conducted follow-up interviews and questionnaires after the competition. Both finalist teams reported using MM-Agent for over 8 hours across the 3-day contest, particularly for tasks such as problem analysis, task clarification, modeling guidance, workflow planning, and data visualization. They explicitly acknowledged that MM-Agent played a substantial role in structuring their modeling process and accelerating exploratory analysis, ultimately contributing to their success.

W2: "Regarding the overall presentation of the paper, my impression after reading it is that there are a lot of design details, but the conveyed insights are quite limited. Personally, I don't particularly appreciate the writing style of this paper."

Response: We sincerely appreciate the reviewer’s comments on the writing and overall presentation. In the next revision, we will refine the writing style to improve readability and ensure that the core insights of our work are conveyed more clearly.

Q1: "Observing Figure 9, there are few chartss included in the results generated by LLM. Does the author have any comments on this?"

Response: We appreciate the reviewer’s observation regarding the use of charts in the generated outputs (e.g., Figure 9). As our MM-Agent is designed as an end-to-end autonomous agent, we observed that in problems not strongly data-driven, the agent tends to focus more on analytical reasoning, modeling abstraction, and textual explanation rather than generating visual outputs such as charts. This behavior aligns with the agent's prioritization of problem structuring and formal reasoning in the absence of explicit numerical datasets.

When MM-Agent is used in a human-in-the-loop copilot setting, users can explicitly prompt the agent to generate charts or visual summaries, which substantially improves the visual component of the solution. In fact, in the MCM/ICM deployment, participating teams reported actively leveraging MM-Agent for data visualization when appropriate. We consider enhancing chart-awareness and integrating adaptive visualization generation as a promising direction for future agent refinement.

Q2: "In my personal opinion, due to the limited competition time, the length and presentation quality of modeling competition papers may significantly influence human evaluations. What are the advantages of content generated by LLMs or content assisted by LLMs in these two aspects?"

Response: We agree with the reviewer that the length and presentation quality of modeling competition papers may influence human evaluations. However, in real-world competitions such as MCM/ICM, the maximum paper length is strictly limited to 25 pages, which helps mitigate the influence of paper length on evaluation outcomes.

Regarding presentation quality, LLM-generated content may indeed demonstrate advantages in aspects such as grammar and fluency. That said, we would like to clarify that, according to the official rules of these competitions, LLMs are not allowed to participate independently, and any use of AI tools must be explicitly disclosed. We only provide LLM-generated solutions as a reference for participating teams. These teams are required to adapt and disclose their use of LLMs as per the competition guidelines. Therefore, while LLMs may contribute to improved presentation quality, their actual impact remains limited under current competition regulations.

Q3: "When describing the proposed method, I suggest that more comparisons and discussions be conducted with existing methods to clearly define the incremental contribution of this work. From the existing text, I do not clearly understand the relationship between this work and ResearchAgent or Agent Laboratory."

Response: We thank the reviewer for the insightful comments. To the best of our knowledge, MM-Agent is the first agentic framework systematically applied to real-world mathematical modeling tasks, a domain requiring structured abstraction, formal method selection, and computational reasoning. We appreciate the opportunity to clarify its distinction from existing systems like ResearchAgent and Agent Laboratory. While ResearchAgent and Agent Laboratory primarily focus on research ideation or general scientific workflows, MM-Agent is specifically designed for end-to-end mathematical modeling and computational problem solving. It uniquely supports structured problem analysis and hierarchical modeling method selection via our HMML, refined through a critic-actor mechanism. Importantly, MM-Agent introduces a formal modeling scheme selection mechanism that maps sub-problems to domain-specific mathematical methods (e.g., linear programming, quadratic programming). These capabilities, particularly the formal modeling scheme selection and the execution of structured optimization workflows, are not addressed in the above works or in prior agent systems.
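For illustration, once a sub-problem has been mapped to a scheme such as linear programming, the solving stage can reduce to a standard solver call. The snippet below is a toy sketch with invented data, not MM-Agent's actual solving code:

```python
# Toy resource-allocation sub-problem cast as a linear program and solved with
# SciPy. All coefficients are invented for illustration.
from scipy.optimize import linprog

# Maximize 3*x1 + 2*x2 (linprog minimizes, so negate the objective).
c = [-3.0, -2.0]

# Subject to: x1 + x2 <= 4, 2*x1 + x2 <= 5, with x1, x2 >= 0.
A_ub = [[1.0, 1.0],
        [2.0, 1.0]]
b_ub = [4.0, 5.0]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)], method="highs")
print("optimal allocation:", res.x, "objective value:", -res.fun)
```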

Q4: "How would the entries from modeling competitions that have won various awards be scored in the rating system of this paper? I want to know how the model performs without human intervention compared to human contestants."

Response: Thank you for the insightful question. We would like to kindly clarify that we included real-world award-winning solutions (Honorable Mention or above) from past MCM/ICM competitions as a baseline in our evaluation (see Human Team in Table 1, Lines 324–343). These entries serve as strong human references for assessing model performance. Results show that while human teams remain strong competitors, often outperforming most LLM-based agents, MM-Agent achieves near-human performance, and even surpasses human teams on certain metrics. This underscores both the inherent difficulty of the task and the effectiveness of MM-Agent in open-ended modeling.

Limitations: "It seems there is no dedicated paragraph discussing limitations."

We thank the reviewer for the suggestion. The limitations of our work are discussed in the Broader Impacts section (see lines 979–1020). We will make this discussion more explicit in the next revision for clarity.

Comment
  1. As commented by Reviewer badD, if there were more details about using MM-Agent as an assistant in this competition, the paper could be more solid.
  2. I have understood the contribution of this paper; however, I would like the authors to briefly summarize the insights gained during the development of MM-Agent.
Comment

Thank you for your thoughtful comment and for recognizing the contribution of our work. We are happy to provide further clarification:

1. Clarification on the use of MM-Agent in the MCM/ICM competition

We appreciate the opportunity to further elaborate. Prior to the contest, we released an interactive modeling copilot system based on MM-Agent [1] and conducted pre-contest training for the participating teams. MM-Agent supports both full-pipeline use and modular interaction, covering four key stages: problem analysis, task decomposition, problem modeling, and problem solving.
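For illustration only, a minimal sketch of how such a four-stage pipeline can be orchestrated around an LLM call; the `call_llm` helper and all prompts are hypothetical placeholders, not the released MM-Agent code:

```python
# Skeleton of a four-stage modeling pipeline:
# problem analysis -> task decomposition -> problem modeling -> problem solving.
from typing import Callable, Dict, List


def run_pipeline(problem_statement: str, call_llm: Callable[[str], str]) -> Dict[str, object]:
    # Stage 1: problem analysis - identify variables, constraints, and goals.
    analysis = call_llm(f"Analyze this modeling problem:\n{problem_statement}")

    # Stage 2: task decomposition - split the problem into sub-tasks.
    subtasks: List[str] = call_llm(
        f"Decompose the problem into sub-tasks, one per line:\n{analysis}"
    ).splitlines()

    # Stage 3: problem modeling - propose a mathematical formulation per sub-task.
    models = [call_llm(f"Propose a mathematical model for:\n{t}") for t in subtasks]

    # Stage 4: problem solving - outline a computational method for each model.
    solutions = [call_llm(f"Outline a computational solution for:\n{m}") for m in models]

    return {"analysis": analysis, "subtasks": subtasks, "models": models, "solutions": solutions}
```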

  • Team 1 used the full MM-Agent pipeline, engaging with all four modules in their workflow.
  • Team 2 selectively used the first three modules and integrated outputs into their own implementation.

Both teams confirmed that MM-Agent significantly improved early-stage modeling efficiency, clarified problem understanding, and supported structured reasoning, contributing to their finalist-level submissions.

To further illustrate MM-Agent’s role, we provide the following case study from Team 1:

Team 1 used DeepSeek-R1 as the base model within the MM-Agent framework and engaged with all four key modules:

  • Problem Analysis: They submitted domain-specific queries such as “How to process noise in gray images?” and “What non-destructive methods can be used to analyze material components?”, relying on AI to interpret and refine real-world problem statements.
  • Task Decomposition: The system provided structured breakdowns of solution strategies (e.g., spatial vs. frequency domain filters, spectroscopic vs. imaging-based methods), aligning with MM-Agent’s decomposition module.
  • Problem Modeling: MM-Agent recommended concrete modeling approaches such as Gaussian filtering, wavelet transforms, BM3D, FTIR, and XRF, enabling domain-appropriate model formulation.
  • Problem Solving: It also offered comparative analysis of different methods, including trade-offs and applicability, supporting decision-making during implementation.

These interactions demonstrate a substantial and practical application of MM-Agent’s core functionalities, either explicitly or via the general-purpose interface provided during training.

[1] Hugging Face: MathematicalModelingAgent

2. Key insights from the development of MM-Agent

Through the development and evaluation of MM-Agent, we gained several key insights:

  • (1) Evaluating open-ended mathematical modeling tasks is inherently challenging, as it requires expert-level domain knowledge and experience to judge the rigor, appropriateness, and completeness of the modeling process, far beyond typical correctness-based benchmarks.
  • (2) Simply applying foundation models (e.g., GPT-4o or DeepSeek-R1) without structured agent-level orchestration results in significantly weaker performance, particularly in modeling formulation and analysis stages. This underscores the necessity of modular, expert-inspired workflows.
  • (3) While MM-Agent shows promising results and outperforms strong baselines, there remains considerable room for improvement. Current LLM agents still struggle with modeling abstraction, highlighting important directions for future work.

We hope these clarifications help solidify the contributions of our work. Thank you again for your insightful feedback.

Review
Rating: 4

This paper presents MM-Agent, an expert-inspired framework designed to address real-world mathematical modeling problems by leveraging Large Language Models (LLMs). The authors formalize the task of LLM-powered mathematical modeling, which involves analyzing complex real-world problems, constructing domain-appropriate formulations, and generating end-to-end solutions. They introduce MM-Bench, a benchmark comprising 111 problems from the Mathematical Contest in Modeling (MCM/ICM) spanning 2000 to 2025, covering ten diverse domains such as physics, biology, and economics. The paper also proposes the Hierarchical Mathematical Modeling Library (HMML), a three-tiered knowledge hierarchy that encodes 98 high-level modeling schemas to support structured method selection and abstraction. MM-Agent decomposes the modeling process into four stages: problem analysis, model formulation, computational problem-solving, and report generation. Experimental results demonstrate that MM-Agent significantly outperforms baseline agents, achieving an 11.88% improvement over human expert solutions while requiring only 15 minutes and $0.88 per task using GPT-4o. Furthermore, MM-Agent assisted two undergraduate teams in winning the Finalist Award (top 2.0% among 27,456 teams) in the MCM/ICM 2025 competition.

Strengths and Weaknesses

Strengths

  • The authors have curated a comprehensive benchmark of 111 real-world mathematical modeling problems from the MCM/ICM competition.
  • The experimental results demonstrate that MM-Agent significantly outperforms baseline agents and achieves competitive performance compared to human expert solutions.

Weaknesses

  • Yes, I agree the motivation is clear and the paper is well written, but my concern is this: to handle real-world mathematical modeling problems, the pipeline of unstructured problem descriptions, structured mathematical models, solutions, and the generation of analytical reports is not novel. The novelty is limited; this is not the first work to address mathematical modeling in this domain.

Questions

Only one question: the evaluation of agent results is not clear in lines 149-163; we cannot tell whether the scores for Analysis Evaluation, Modeling Rigorousness, Practicality and Scientificity, and Result and Bias Analysis are reliable. For me, it is hard to judge whether a higher score actually means a better result; it might just appear to have a good effect. For such problems, what seems good may be quite a distance from being truly correct.

Limitations

The main concern is:

  • Novelty
  • The reliability of the result evaluation

Final Justification

My concerns are mostly addressed.

Formatting Concerns

No

Author Response

Thanks to Reviewer badD and global response

We sincerely thank Reviewer badD for their thoughtful comments and constructive suggestions. We appreciate your recognition of the significance of our work, especially the comprehensive design of MM-Bench and the strong empirical results achieved by MM-Agent. Mathematical modeling, unlike traditional mathematical reasoning, requires open-ended abstraction, assumption design, and domain-specific formulation to translate complex real-world problems into formal systems, making it uniquely challenging and largely underexplored in LLM research. To address this, we introduce MM-Bench, the first benchmark built from 25 years of MCM/ICM problems, and MM-Agent, a purpose-built agentic framework tailored to modeling tasks. As acknowledged by Reviewers geD1 and 3X2K, our contributions lie in both novel methodology and strong empirical validation, including MM-Agent’s real-world success in MCM/ICM competitions.

We again thank the reviewer for their feedback. Please find our detailed point-by-point responses below.

W1: "To handle real-world mathematical modeling problems, the pipeline of unstructured problem descriptions, structured mathematical models, solutions, and the generation of analytical reports is not novel. The novelty is limited, it's not the first work to address mathematical modeling in this domain."

Response: We appreciate the reviewer’s observation and would like to clarify a key distinction. While prior work may be labeled under "mathematical modeling," they fundamentally focus on solving well-defined mathematical problems, such as optimization tasks or differential equations, where the problem formulation is already given. For example, Huang et al. [1] and Zhang and Luo [2] demonstrate the use of LLMs to solve ODEs or linear programs, assuming the mathematical structure is pre-specified.

In contrast, our work tackles open-ended and real-world mathematical modeling, where the central challenge lies not in solving a predefined problem, but in formulating the problem itself. This involves interpreting ambiguous natural language, abstracting key variables, and selecting appropriate mathematical representations, steps that precede any reasoning or solution process. As acknowledged by Reviewers geD1 and 3X2K, our contribution lies in both the carefully designed benchmarking with strong empirical performance (e.g., in MCM/ICM competitions), and the development of a novel LLM-based agent framework. MM-Agent bridges the gap between unstructured problem descriptions and formal mathematical models, an underexplored and fundamentally different challenge from conventional mathematical reasoning tasks.

[1] LLMs for Mathematical Modeling: Towards Bridging the Gap between Natural and Mathematical Languages

[2] Or-llm-agent: Automating modeling and solving of operations research optimization problem with reasoning large language model

Q1: "The evaluation for agent results is not clear in lines 149-163, we cannot know the score of Analysis Evaluation, Modeling Rigorousness, Practicality and Scientificity, Result and Bias Analysis is reliable or not? For me, it's hard to judge that a higher score will mean a better result, It might just seem to have a good effect. For such problems, what seems good may be quite a distance from being truly correct."

Response: We strongly agree with the reviewer that evaluating open-ended problems is inherently challenging, even for human experts, precisely because these tasks, unlike well-formulated mathematical problems, lack clear ground-truth answers or objective correctness criteria. Therefore, to address this issue, we follow a dual evaluation strategy which is widely used in AI for scientific discovery [3,4]: (1) LLM-as-judge evaluation using GPT-4o, and (2) Human expert evaluation grounded in domain knowledge. Under both evaluation settings, MM-Agent consistently outperformed all baselines, demonstrating its superior capability in handling open-ended modeling tasks. Furthermore, under official MCM/ICM evaluation protocols, MM-Agent assisted two undergraduate teams in winning the Finalist Award (top 2.0% out of 27,456 teams) in MCM/ICM 2025, providing compelling evidence of its practical effectiveness as a modeling copilot.

For the LLM-as-judge evaluation, we carefully followed the official ICM/MCM scoring rubrics and assessed agent responses across four key dimensions: Analysis Evaluation (AE), Modeling Rigorousness (MR), Practicality and Scientificity (PS), and Result and Bias Analysis (RBA). The specific evaluation prompts for each dimension are provided in Figures 7–10 of Appendix F, ensuring transparency and reproducibility.
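For illustration, a minimal sketch of rubric-based LLM-as-judge scoring over these four dimensions; the prompt wording, JSON output convention, and 0–10 scale below are placeholders rather than the actual prompts from Appendix F:

```python
# Score a candidate solution on four rubric dimensions with an LLM judge.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

DIMENSIONS = {
    "AE": "Analysis Evaluation",
    "MR": "Modeling Rigorousness",
    "PS": "Practicality and Scientificity",
    "RBA": "Result and Bias Analysis",
}


def judge(solution_text: str) -> dict:
    scores = {}
    for key, name in DIMENSIONS.items():
        prompt = (
            f"You are an MCM/ICM judge. Score the following solution on "
            f'"{name}" from 0 to 10. Reply with JSON like {{"score": 7}}.\n\n'
            f"{solution_text}"
        )
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        scores[key] = json.loads(resp.choices[0].message.content)["score"]
    return scores
```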

For the human evaluation, we recruited three expert annotators with prior MCM/ICM awards or substantial mathematical modeling experience. To ensure reliability, we consulted with an official ICM/MCM judge to establish standardized scoring rubrics and conducted training. All model outputs were anonymized and evaluated independently, as illustrated in Figure 6 of Appendix E.3.

To further assess the consistency and reliability of our evaluation protocol, we computed agreement scores between human annotators as well as between human and model evaluations.

Table: Agreement between Human and Model Evaluations

| Category | Human vs. Human | Model vs. Human |
| --- | --- | --- |
| AE (Analysis Evaluation) | 0.7475 | 0.5068 |
| MR (Modeling Rigorousness) | 0.4813 | 0.7130 |
| PS (Practicality and Scientificity) | 0.7890 | 0.7860 |
| RBA (Result and Bias Analysis) | 0.7625 | 0.5692 |

As shown in Table 6 of Appendix E.3, we observe moderate to high agreement, especially in core modeling and problem-solving aspects. Categories such as AE and RBA exhibit slightly lower agreement due to their inherently subjective and explanation-based nature, as discussed in prior work [3].
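As an aside, agreement numbers like these can in principle be recomputed from paired per-solution scores. The sketch below uses Pearson correlation as the agreement statistic and hypothetical score vectors; the statistic actually used is not specified here, so that choice is an assumption for illustration:

```python
# One way to quantify rater agreement: Pearson correlation between paired scores.
from scipy.stats import pearsonr

human_a = [8.5, 7.0, 9.0, 6.5, 8.0]  # hypothetical scores from annotator A
human_b = [8.0, 7.5, 9.0, 6.0, 8.5]  # hypothetical scores from annotator B
model   = [9.0, 7.0, 8.5, 7.0, 8.0]  # hypothetical GPT-4o judge scores

r_hh, _ = pearsonr(human_a, human_b)  # human vs. human agreement
r_mh, _ = pearsonr(model, human_a)    # model vs. human agreement
print(f"human-human: {r_hh:.4f}, model-human: {r_mh:.4f}")
```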

We acknowledge that designing robust and interpretable evaluation frameworks remains an open challenge in AI-driven scientific discovery, particularly for complex modeling tasks. We consider this a crucial direction for future research and are actively exploring more rigorous, hybrid evaluation protocols to improve assessment fidelity.

[3] ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models

[4] Agent Laboratory: Using LLM Agents as Research Assistants

Comment

Dear Reviewer badD,

Thank you for your valuable time and insightful comments. We believe our clarifications address your concerns, and we would greatly appreciate it if you could let us know whether any points remain unclear. Regarding the evaluation, we agree that assessing open-ended modeling tasks is inherently challenging and therefore adopted a dual protocol: GPT-4o “LLM-as-judge” scoring (per MCM/ICM rubrics) and blinded human expert review. Both confirm MM-Agent’s superiority over baselines, which is further evidenced by its role in helping two undergraduate teams win the 2025 MCM/ICM Finalist Award (top 2.0% of 27,456 teams).

We look forward to your feedback.

Best regards,

Authors

Comment

Thanks for the clarification! I think my concerns are mostly addressed, so I will raise my score to 4.

Review
Rating: 4

This paper investigates the capabilities of large language models (LLMs) in tackling real-world mathematical modeling problems. Unlike traditional mathematical reasoning, which assumes a well-defined problem formulation, mathematical modeling involves: (1) translating complex scenarios described in natural language into precise mathematical formulations, and (2) solving these formulations and producing detailed technical reports.

This paper introduces MM-Bench, a curated benchmark comprising 111 problems from the Mathematical Contest in Modeling (MCM/ICM) spanning diverse domains in science, engineering, and economics. Furthermore, the paper proposes MM-Agent, a multi-agent LLM framework inspired by expert workflows. Experimental results demonstrate that MM-Agent outperforms baseline agents.

Strengths and Weaknesses

Strengths

  1. Relevance and scope: Mathematical modeling is essential in applied domains like biology, urban planning, and economics. The paper addresses a critical, underexplored challenge for LLMs: translating realistic, open-ended problems into formal models and executable solutions.

Weaknesses

  1. Benchmark formulation limits novelty. The inclusion of a method-retrieval step from HMML transforms what should be an open-ended modeling challenge into a more constrained retrieval problem. This design may overestimate the LLM’s genuine modeling skill and reduce the benchmark’s ability to assess true modeling creativity and abstraction.

  2. Limited architectural innovation. The core MM-Agent architecture largely follows the established LLM-agent architecture. While well-engineered, it lacks unique algorithmic or architectural features that push beyond current agent paradigms.

  3. Claim. Although MM-Agent’s assistance in MCM/ICM success is impressive, the manuscript does not clearly explain how much of this outcome resulted from the agent's contribution versus human creativity. Clarifying the balance of influence would strengthen the claim.

Questions

The authors are encouraged to address the limitations above.

Limitations

N/A

Final Justification

I will raise the score to 4 as my concerns are mostly addressed.

Formatting Concerns

N/A

Author Response

Thanks to Reviewer 1dcb and global response

We sincerely thank Reviewer 1dcb for their thoughtful and constructive review. We appreciate your recognition of the importance and relevance of mathematical modeling as a real-world LLM challenge, as well as your acknowledgment of the scope and potential of our MM-Bench benchmark and MM-Agent framework.

Please find our detailed point-by-point responses below.

W1: "Benchmark formulation limits novelty. The inclusion of a method-retrieval step from HMML transforms what should be an open-ended modeling challenge into a more constrained retrieval problem. This design may overestimate the LLM’s genuine modeling skill and reduce the benchmark’s ability to assess true modeling creativity and abstraction."

Response: Thank you for the thoughtful question. We would like to kindly clarify a potential misunderstanding: MM-Bench and MM-Agent are two distinct contributions. MM-Bench poses open-ended mathematical modeling tasks that require full end-to-end solutions, not constrained retrieval problems, and the HMML module is part of MM-Agent, not the benchmark.

MM-Bench is constructed from 25 years of real-world MCM/ICM problems, which are widely recognized for evaluating open-ended mathematical modeling skills. Each task in MM-Bench requires solving an ill-posed, real-world problem through a complete end-to-end modeling pipeline, including problem analysis, model formulation, computational solving, and result interpretation. It does not reduce the task to a constrained retrieval problem, but instead preserves the full scope and complexity of open-ended modeling challenges.

To address these open-ended challenges, we propose MM-Agent, an expert-inspired framework that decomposes mathematical modeling into four stages: open-ended problem analysis, structured model formulation, computational problem solving, and report generation. The HMML module is a component of MM-Agent; it provides abstract modeling schemas to guide the agent's modeling. Within this pipeline, the modeling stage is especially challenging, requiring the abstraction of complex real-world problems into mathematically coherent formulations. To address this, MM-Agent integrates the Hierarchical Mathematical Modeling Library (HMML), a tri-level schema of 98 abstract modeling strategies, which supports problem- and solution-aware retrieval, principled method selection, and refinement via a critic-actor mechanism.
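For illustration, a minimal sketch of what hierarchy-guided, problem-aware retrieval over modeling schemas could look like; the tri-level contents, the embedding model, and the ranking logic are invented placeholders rather than the released HMML:

```python
# Rank leaf schemas of a small (domain -> method family -> schema) hierarchy
# by semantic similarity to a sub-problem description.
from sentence_transformers import SentenceTransformer, util

HMML_SKETCH = {
    "Optimization": {
        "Mathematical Programming": ["Linear programming", "Quadratic programming"],
        "Heuristic Search": ["Genetic algorithm", "Simulated annealing"],
    },
    "Dynamics": {
        "Differential Equations": ["SIR epidemic model", "Logistic growth model"],
    },
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model


def retrieve_schemas(subproblem: str, top_k: int = 3):
    # Flatten the hierarchy into human-readable paths, then rank by cosine similarity.
    paths = [
        f"{domain} / {family} / {schema}"
        for domain, families in HMML_SKETCH.items()
        for family, schemas in families.items()
        for schema in schemas
    ]
    query_emb = encoder.encode(subproblem, convert_to_tensor=True)
    path_embs = encoder.encode(paths, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, path_embs)[0]
    ranked = sorted(zip(paths, scores.tolist()), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]
```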

W2: "Limited architectural innovation. The core MM-Agent architecture largely follows the established LLM-agent architecture. While well-engineered, it lacks unique algorithmic or architectural features that push beyond current agent paradigms."

Response: We would like to kindly clarify that MM-Agent is not derived from existing LLM-agent paradigms, but is instead a purpose-built framework specifically designed to tackle the unique challenges of real-world mathematical modeling. To the best of our knowledge, it is the first LLM-agent system systematically developed for this domain, which demands structured abstraction, rigorous formulation, and context-grounded reasoning.

MM-Agent adopts a four-stage expert-inspired pipeline, spanning problem analysis, model formulation, computational solving, and report generation. Central to this framework is the Hierarchical Mathematical Modeling Library (HMML), a tri-level knowledge hierarchy that enables problem-aware and solution-aware retrieval of abstract modeling schemas. This design supports principled abstraction, constraint reasoning, and methodological alignment, distinguishing MM-Agent from prior agent architectures.

As recognized by Reviewers geD1 and 3X2K, MM-Agent demonstrates both strong empirical performance in MCM/ICM competitions and architectural innovation, contributing a novel and practically useful agentic system for complex, open-ended modeling tasks.

W3: "CIaim. Although MM-Agent’s assistance in MCM/ICM success is impressive, the manuscript does not clearly explain how much of this outcome resulted from the agent's contribution versus human creativity. Clarifying the balance of influence would strengthen the claim."

Response: Thank you for the thoughtful question. Due to privacy requests from the participating teams, we did not disclose detailed submission content in the paper. During the contest, MM-Agent was deployed strictly as a copilot, collaborating with human participants in compliance with MCM/ICM rules (which prohibit AI-only participation but allow LLM-assisted modeling).

Following the competition, we conducted interviews and surveys with the two finalist teams. Both reported using MM-Agent for over 8 hours during the 3-day contest, with particularly high engagement in problem analysis, task clarification, modeling guidance, workflow planning, and data visualization. They explicitly affirmed that MM-Agent played a substantial role in structuring their modeling pipeline and accelerating exploratory analysis, underscoring its meaningful contribution to their success.

Comment

Dear Reviewer 1dcb,

Thank you for your constructive comments. We have responded to the key points you raised, including clarifying the distinction between MM-Bench and MM-Agent (W1), detailing the architectural novelty of MM-Agent (W2), and providing further evidence of its contribution to MCM/ICM successes (W3).

As the discussion deadline approaches, please feel free to let us know if you have any additional questions or concerns. Thank you again for your valuable feedback and time.

We look forward to your feedback.

Best regards,

Authors

Comment

Thank you for the response! I have carefully read the response and the corresponding sections in the paper. My concerns about W1 and W2 are effectively addressed. But I am still concerned about W3. It is still unclear how MM-Agent assisted the two finalist teams. The response to W3 is somewhat vague. I prefer to keep my initial evaluation.

Comment

Thank you for your follow-up comment. We apologize for the earlier vagueness and would like to provide a more concrete clarification regarding how MM-Agent assisted the two finalist teams in the MCM/ICM contest.

Prior to the contest, we developed and released an interactive modeling copilot system based on MM-Agent [1]. We conducted a pre-contest training session to help participating teams understand how to use the system effectively. Importantly, MM-Agent supports both full-pipeline usage and modular invocation, allowing users to either follow the end-to-end workflow or selectively apply specific components based on their needs. The four key modules are:

  • Problem Analysis: Identifying key variables, constraints, and goals from open-ended questions
  • Task Decomposition: Structuring the problem into logically coherent subcomponents
  • Problem Modeling: Formulating mathematical models guided by abstract schemas
  • Problem Solving: Proposing and comparing computational methods for implementation

During the contest, the two finalist teams used MM-Agent as a modeling copilot:

  • Team 1 followed the full MM-Agent pipeline, using all four modules throughout their workflow.
  • Team 2 selectively used the modules for problem analysis, decomposition, and modeling, integrating the outputs into their own implementation.

Both teams confirmed that MM-Agent helped them structure their approach, accelerate early-stage modeling, and clarify complex problem requirements, contributing meaningfully to their finalist-level performance.

To further illustrate MM-Agent’s role, we provide the following case study from Team 1:

Team 1 used DeepSeek-R1 as the base model within the MM-Agent framework and engaged with all four key modules:

  • Problem Analysis: They submitted domain-specific queries such as “How to process noise in gray images?” and “What non-destructive methods can be used to analyze material components?”, relying on the agent to interpret and refine real-world problem statements.
  • Task Decomposition: The system provided structured breakdowns of solution strategies (e.g., spatial vs. frequency domain filters, spectroscopic vs. imaging-based methods), aligning with MM-Agent’s decomposition module.
  • Problem Modeling: MM-Agent recommended concrete modeling approaches such as Gaussian filtering, wavelet transforms, BM3D, FTIR, and XRF, enabling domain-appropriate model formulation (a toy sketch of the filtering step follows this list).
  • Problem Solving: It also offered comparative analysis of different methods, including trade-offs and applicability, supporting decision-making during implementation.
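For illustration, a toy version of the Gaussian-filtering step referenced in the "Problem Modeling" bullet above, applied to a synthetic noisy image; this is not code from the teams' actual submissions:

```python
# Denoise a synthetic image with a Gaussian filter and compare reconstruction error.
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
clean = np.zeros((64, 64))
clean[16:48, 16:48] = 1.0                                # a simple bright square
noisy = clean + 0.3 * rng.standard_normal(clean.shape)   # add Gaussian noise

denoised = gaussian_filter(noisy, sigma=1.5)             # smooth out the noise

print("MSE noisy:   ", float(np.mean((noisy - clean) ** 2)))
print("MSE denoised:", float(np.mean((denoised - clean) ** 2)))
```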

These interactions demonstrate a substantial and practical application of MM-Agent’s core functionalities, either explicitly or via the general-purpose interface provided during training.

We hope this clarification addresses your concern and clearly demonstrates how MM-Agent contributed to the finalist teams’ success.

[1] Hugging Face: MathematicalModelingAgent

Comment

Thanks for the clarification! I think my concerns are mostly addressed, so I will raise my score to 4. One suggestion is that it would be good to do a controlled experiment for the user study, e.g., two teams using DeepSeek or ChatGPT vs. two teams using MM-Agent powered by DeepSeek or ChatGPT. But I understand such a controlled experiment could be expensive, and this is not a reason for rejection.

Comment

Dear Reviewer 1dcb,

Thank you very much for your thoughtful follow-up and for raising your score. We sincerely appreciate your careful reading and engagement throughout the review process.

We will carefully consider your valuable suggestion regarding a controlled user study in future work to further strengthen the empirical validation of MM-Agent.

Thank you again for your constructive feedback and support.

Best regards,

The Authors

Review
Rating: 5

In this paper, the authors introduce an agentic system for open-ended mathematical modeling. Rather than rely on zero-shot LLM capabilities, they break down modeling problems into problem framing, model formulation, solution development, and interpreting and preparing results. By using temporal splits, LLM-as-judge and human expert review, and experiments on optimization problems with ground truth answers, they show that their system outperforms other agentic systems for mathematical modeling.

Strengths and Weaknesses

Strengths

The approach and research objectives are clearly laid out. The token consumption results are an interesting evaluation, showing that MM-Agent is not simply consuming more tokens than other methods. The “real-world” applicability of MM-agent as used in the MCM/ICM competitions could be compelling evidence of practical utility.

Weaknesses

A few points could benefit from increased clarity: the usefulness and rigor of the temporal split as an approximately “OOD” evaluation, the delta in performance between top-performing human experts and teams using MM-agent as a modeling copilot, and the exact use cases and utility of MM-agent to the finalist teams. HMML is included as a main contribution, but it seems like a simple tree structure representing hierarchies of modeling approaches. It’s unclear how useful or necessary this is to MM-Agent’s performance.

问题

  • The authors rely on the temporal split (2021-2024) vs (2025) as evidence that the underlying LLMs are performing “genuine modeling” rather than memorization. Are there any metrics of problem similarity between the splits from year to year that could be used to indicate how different the problem statements are?
  • Can the authors expand on the results in Table 1 - which metrics are from automated scoring from GPT-4o vs human expert review? How many expert reviewers were consulted and how was expert review performed?
  • MM-agent helped two teams achieve Finalist awards in competition - which specific subtasks, problems, or evaluations did the top performing teams outperform the MM-agent augmented teams in? Is there a clear path to highlight and address these limitations in future work?
  • To what degree did the two finalist teams rely on MM-agent?
  • Are there problem types and domains that are not currently considered in the MM-Bench evaluation that the framework could be extended to?

Limitations

Yes

Final Justification

The authors have clarified all questions and comments raised during the review, and I will increase my score accordingly.

Formatting Concerns

No

Author Response

Thanks to Reviewer geD1 and global response

We sincerely thank Reviewer geD1 for their thoughtful and constructive feedback. We appreciate your recognition of our agentic framework, the token efficiency analysis, and the real-world applicability of MM-Agent in MCM/ICM competitions. Thank you also for raising insightful questions on evaluation clarity, temporal split design, HMML’s role, and MM-Agent’s impact in competition settings. These points have helped us further strengthen and clarify our work.

[W1 & Q1] W1:“ A few points could benefit from increased clarity: the usefulness and rigor of the temporal split as an approximately “OOD” evaluation, the delta in performance between top-performing human experts and teams using MM-agent as a modeling copilot, and the exact use cases and utility of MM-agent to the finalist team” and Q1: ”Similarity Across Temporal Splits”

Response: Thank you for the thoughtful question. We would like to kindly clarify that our benchmark problems are drawn from the real-world MCM/ICM competition, the world’s largest mathematical modeling contest. Its strict annual format ensures fresh and diverse topics each year, eliminating content repetition by design.

To further support this, we compute semantic similarity between each 2025 problem and those from 2010–2024 using the widely used mGTE embedding model [1]. As shown in the table below, most similarity scores fall between 0.5–0.6, confirming that the 2025 problems are substantively distinct from prior years. This supports our claim that the model’s success is not due to memorization, but rather genuine generalization to novel problem statements.
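For illustration, a minimal sketch of this pairwise similarity computation; the embedding model id below is a generic stand-in for mGTE and the problem texts are placeholders (the actual numbers appear in the table that follows):

```python
# Cosine similarity between one 2025 problem statement and past problems.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the mGTE model

problem_2025 = "..."                                # full text of one 2025 problem
past_problems = {2010: "...", 2011: "..."}          # year -> full problem text

emb_new = encoder.encode(problem_2025, convert_to_tensor=True)
emb_old = encoder.encode(list(past_problems.values()), convert_to_tensor=True)
similarities = util.cos_sim(emb_new, emb_old)[0]

for year, sim in zip(past_problems, similarities.tolist()):
    print(f"2025 vs {year}: {sim:.2f}")
```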

Table: Similarity between each problem from 2025 and problems from 2010 to 2024.

| Problem | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2025A | 0.46 | 0.50 | 0.56 | 0.56 | 0.53 | 0.46 | 0.57 | 0.55 | 0.54 | 0.62 | 0.48 | 0.57 | 0.59 | 0.60 | 0.52 |
| 2025B | 0.51 | 0.52 | 0.53 | 0.62 | 0.51 | 0.55 | 0.58 | 0.50 | 0.61 | 0.61 | 0.55 | 0.54 | 0.60 | 0.62 | 0.59 |
| 2025C | 0.59 | 0.56 | 0.52 | 0.61 | 0.55 | 0.51 | 0.61 | 0.58 | 0.62 | 0.57 | 0.57 | 0.54 | 0.61 | 0.59 | 0.58 |
| 2025D | - | - | - | - | - | 0.61 | 0.63 | 0.63 | 0.63 | 0.62 | 0.59 | 0.55 | 0.66 | 0.64 | 0.63 |
| 2025E | - | - | - | - | - | - | 0.58 | 0.58 | 0.52 | 0.65 | 0.60 | 0.69 | 0.59 | 0.56 | 0.54 |
| 2025F | - | - | - | - | - | - | 0.66 | 0.58 | 0.69 | 0.55 | 0.61 | 0.68 | 0.61 | 0.57 | 0.62 |

[1] mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval

Q2: "Expert Review and Evaluation Details"

Response: Thank you for the insightful question. The metrics reported in Table 1 reflect automated evaluation using GPT-4o as an LLM-as-judge. To complement this, we conducted expert human evaluation on a representative subset of model outputs, with results presented in Appendix E.3 (Figure 6). Due to the high cost of human annotation, only a subset was reviewed, but the results demonstrate that MM-Agent consistently achieves the highest performance, particularly on the AE and RBA dimensions.

For the review process, we recruited three expert evaluators with prior MCM/ICM awards or extensive experience in mathematical modeling. To ensure evaluation quality, we consulted with an official MCM/ICM judge to provide training and establish scoring rubrics aligned with the contest’s official criteria. All outputs were anonymized to avoid bias, and each expert conducted the evaluations independently.

To assess reliability, we measured agreement scores between human annotators and between human and model evaluations (see Table below). The results indicate moderate to high consistency across most categories. While AE and RBA show slightly lower agreement, reflecting the inherent subjectivity in explanation-based evaluation as discussed in prior work [2], this does not undermine the observed superiority of MM-Agent.

Table: Agreement between Human and Model Evaluations

| Category | Human vs. Human | Model vs. Human |
| --- | --- | --- |
| AE (Analysis Evaluation) | 0.7475 | 0.5068 |
| MR (Modeling Rigorousness) | 0.4813 | 0.7130 |
| PS (Practicality and Scientificity) | 0.7890 | 0.7860 |
| RBA (Result and Bias Analysis) | 0.7625 | 0.5692 |

In future work, we plan to develop more standardized and robust evaluation protocols to further improve the reliability of assessing open-ended modeling outputs.

[2] ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models

Q3: "MM-agent helped two teams achieve Finalist awards in competition - which specific subtasks, problems, or evaluations did the top performing teams outperform the MM-agent augmented teams in? Is there a clear path to highlight and address these limitations in future work?"

Response: Thank you for the thoughtful question. We would like to kindly clarify that, due to MCM/ICM rules, AI agents are not allowed to participate independently, but human teams are permitted to use LLMs as assistance. Therefore, our MM-Agent was deployed as a copilot, collaborating with human participants during the competition. Given this setting, a direct comparison with Outstanding Award teams is not strictly objective, as final rankings also reflect subjective judging criteria such as creativity and presentation quality. We acknowledge a performance gap, particularly in modeling originality, explanation rigor, and task structuring. To address these limitations, we plan to enhance MM-Agent’s reasoning and abstraction capabilities (e.g., via reinforcement learning), and further improve its task decomposition strategies for complex, open-ended modeling.

Q4: "Degree of MM-Agent Usage by Finalist Teams"

Response: To better understand how the two finalist teams leveraged MM-Agent, we conducted follow-up interviews and questionnaire surveys. Both teams reported using MM-Agent as a copilot for over 8 hours during the 3-day competition, with particularly high engagement in tasks such as problem analysis, task clarification, modeling guidance, workflow planning, and data visualization. They explicitly affirmed that MM-Agent played a substantial role in structuring their modeling process and accelerating exploratory analysis.

[W2 & Q5] W2:”HMML is included as a main contribution, but it seems like a simple tree structure representing hierarchies of modeling approaches. It’s unclear how useful or necessary this is to MM-Agent’s performance. And ”Q5: "HMML is included as a main contribution, but it seems like a simple tree structure representing hierarchies of modeling approaches. It’s unclear how useful or necessary this is to MM-Agent’s performance."

Response: We appreciate the reviewer’s question. We emphasize that HMML is a heterogeneous hierarchical design that fundamentally differs from flat modeling libraries. It supports both problem-aware and solution-aware retrieval of modeling strategies, enabling more accurate abstraction and principled method selection across diverse modeling scenarios. To rigorously evaluate its effectiveness, we conducted an ablation study by replacing HMML with a flat retrieval baseline lacking hierarchical structure (as detailed in Figure 4 and lines 350–373). Removing HMML led to consistent performance drops across all evaluation metrics, particularly in Analysis Evaluation (AE) and Modeling Rigorousness (MR), demonstrating that HMML plays a critical role in guiding structural reasoning and improving modeling quality.

Table: Ablation Study

| Method | AE | MR | PS | RBA |
| --- | --- | --- | --- | --- |
| w/o HMML (GPT-4o) | 8.34 | 6.47 | 8.97 | 7.75 |
| MM-Agent (GPT-4o) | 9.09 | 7.26 | 9.00 | 8.44 |
| w/o HMML (R1) | 8.26 | 7.70 | 9.00 | 8.36 |
| MM-Agent (R1) | 9.53 | 8.27 | 9.10 | 8.55 |

Q6: "Are there problem types and domains that are not currently considered in the MM-Bench evaluation that the framework could be extended to?"

Response: Thank you for the thoughtful question. We would like to kindly clarify that MM-Bench is entirely constructed from real-world MCM/ICM problems spanning the past 25 years, ensuring authentic and diverse coverage of open-ended modeling tasks across a wide range of domains. While we believe the current benchmark offers reasonably comprehensive representation, we acknowledge that certain domains, such as geometry, education, and energy, are currently underrepresented. We view these areas as valuable extensions and plan to incorporate them in future versions of MM-Bench to further enhance its domain coverage and generalizability.

Comment

The authors have clarified all questions and comments raised during the review, and I will increase my score accordingly.

Comment

Dear Reviewer geD1,

We sincerely appreciate your thoughtful and constructive review. Thank you for acknowledging the clarifications provided in our rebuttal and for your decision to increase the score. We will further improve clarity in the final version, especially on temporal splits, expert review, and MM-Agent’s real-world usage.

Best regards,

Authors

Final Decision

This paper discusses the development of a system to perform mathematical modelling. The paper has garnered excitement and appreciation from the reviewers. In a long iterative back-and-forth, I believe the authors have responded to most of the comments and concerns the reviewers had. I looked at the paper to find better validation of the statement in the abstract: "Furthermore, under official MCM/ICM protocols, MM-Agent assisted two undergraduate teams in winning the Finalist Award (top 2.0% among 27,456 teams) in MCM/ICM 2025, demonstrating its practical effectiveness as a modeling copilot." and couldn't quite find this validation, i.e., to what extent and how the system helped the teams. It is a strong claim, so if it is indeed the case, it would be good to substantiate it.