PaperHub
Overall rating: 7.3/10
Decision: Poster · 4 reviewers
Ratings: 5, 4, 5, 4 (min 4, max 5, std 0.5)
Average confidence: 3.3
Novelty: 3.3 · Quality: 3.3 · Clarity: 3.0 · Significance: 3.0
NeurIPS 2025

REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving

OpenReview | PDF
Submitted: 2025-05-12 | Updated: 2025-12-02

Abstract

Keywords
Compiler Optimization, LLM, LLMs for Code Reasoning and Optimization

Reviews and Discussion

Official Review (Rating: 5)

This paper presents the REASONING COMPILER, a novel compiler optimization framework that integrates large language model (LLM) reasoning with Monte Carlo Tree Search (MCTS) to improve the sample efficiency and performance of code optimization for model serving workloads. Recognizing that modern neural compilers face challenges due to the vast and interdependent space of transformation sequences, the authors cast compilation as a sequential decision-making process and use LLMs to propose context-aware transformations. These LLM-guided proposals are integrated into MCTS, which balances exploration and exploitation while leveraging a learned cost model to guide search without incurring real hardware evaluations.
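For concreteness, the exploration/exploitation balance referred to here is conventionally implemented with a UCT-style selection rule; the sketch below is the standard formulation, not the paper's exact criterion or constant.

```latex
% Standard UCT selection score for a child v' of node v (illustrative only;
% the paper's exact selection criterion and constant c are not restated here).
\mathrm{UCT}(v') = \frac{Q(v')}{N(v')} + c\,\sqrt{\frac{\ln N(v)}{N(v')}}
```

Here $Q(v')$ accumulates the cost-model-estimated value of the subtree rooted at $v'$, $N(\cdot)$ counts visits, and $c$ controls how strongly under-visited transformation sequences are explored.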

The authors evaluate their method across four diverse neural network operators from LLaMA3, DeepSeek, and Stable Diffusion workloads. Results show that REASONING COMPILER significantly outperforms traditional black-box autotuners like Evolutionary Search, achieving up to 15–16× improvements in sample efficiency. Ablation studies further validate the contributions of the LLM proposal engine, the effect of model scale, and the utility of historical transformation context in prompts.

Strengths and Weaknesses

Strengths

  • Demonstrates a novel combination of LLM reasoning with MCTS for compiler optimization (Section 3).
  • Clearly articulates the motivation for contextual, history-aware transformation proposals, with concrete examples of transformation dependencies (e.g., Section 3.1, Figure 1).
  • Empirical results show strong improvements in sample efficiency and performance across four diverse benchmarks, with up to 7–10× speedups using fewer than 40 samples (Section 4.2, Figure 3).
  • Ablation studies substantiate the importance of LLM quality, instruction tuning, and prompt history depth (Section 4.3, Figure 5).
  • Open-sourced implementation enables reproducibility and adoption by the broader community (Section 4.1).
  • Clarity in problem formalization, with a structured MDP and integration of LLMs into each MCTS stage (Section 2, Section 3.2).

Weaknesses

  • The paper does not perform end-to-end evaluations on full model inference pipelines—only individual operators are benchmarked (Section 4.1).
  • LLM proposal quality is only evaluated in terms of downstream performance; there is no analysis of proposal diversity, failure modes, or qualitative reasoning quality.
  • The cost of querying commercial LLM APIs (e.g., OpenAI) is not discussed, which could affect deployment feasibility (Section 4.1).
  • Claims around open-source LLMs being competitive could benefit from more head-to-head comparisons on the same operator and hardware budgets (Section 4.3.1).
  • The simulation step in MCTS uses a learned surrogate cost model, but the generalization accuracy and training procedure of this model are underdescribed (Section 3.2, Section 4.1).

Questions

  • Can the authors report on the cost (e.g., time or monetary) of LLM queries per program? Would open-source models suffice in settings where commercial API use is impractical?
  • How well does the learned surrogate cost model generalize across unseen programs or hardware architectures? Is it retrained per workload?
  • Have the authors evaluated performance on full end-to-end inference pipelines, beyond isolated operators? If not, what are the key barriers?
  • Do LLMs sometimes propose transformations that are syntactically valid but semantically redundant or harmful? How often is fallback to random search triggered?

Limitations

N/A

Final Justification

I’m maintaining an accept recommendation because the paper cleanly integrates LLM-guided proposals with MCTS for compiler optimization, with a clear MDP setup and strong sample-efficiency gains over black-box autotuners. The rebuttal addressed my main concerns by adding cross-hardware results (five platforms) and end-to-end Llama-3-8B runs that show higher speedups with fewer trials than TVM. The authors clarified that the surrogate cost model is TVM’s standard XGBoost model used only for ranking candidates, which matches common autotuning practice. They also reported LLM usage costs with per-experiment tables and showed low fallback rates, which reduces risk from poor proposals. Some presentation issues remain (e.g., figure clarity) and the main paper still emphasizes kernel-level evaluations, but the new results and planned edits help. I encourage the authors to tighten figure captions and expand the end-to-end evaluations in the final version. On balance, the novelty, evidence, and practical relevance for model serving outweigh the remaining limitations, so I keep my positive rating.

Formatting Concerns

  • Figure legends could be slightly larger for better readability.
  • Minor spacing issues in math environments (e.g., Equation (1)) could be cleaned up.
Author Response

We sincerely thank the reviewer for the thoughtful and detailed feedback. We are glad the novelty, empirical rigor, and potential impact of our work came through clearly. Below, we address the specific comments and questions.


Q1: Cost of LLMs

We appreciate this important practical consideration. Among the OpenAI models, we only used the cheapest, not the most capable, model available at the time of submission (GPT-4o mini) for our main results. For open-source models, because of time constraints during the submission period, we used the Hugging Face APIs through the Nscale hyperscaler provider.

As the results in Section 4.3.1 show, the open-source models (especially Llama3.3-Instruct (70B) and DeepSeek-Distill (32B)) provide competitive sample efficiency and speedup gains in comparison with GPT-4o mini. Their cost can be minimized by local deployment.

And yes, the competitive results suggest that the open-source models can suffice when commercial API use is not practical.

Nonetheless, the following table reports the cost of every experiment for each benchmark. We were not particularly cost-conscious when running the experiments and let them run for a high number of samples to understand the boundary of performance improvements and allow the algorithm to saturate.

Overall, we have spent about 350 USD with OpenAI models which includes all the experiments, development, and debugging runs for the submission and rebuttal. With Hugging Face, we have spent about 90 USD that also includes all the experiments, development, and debugging.

Cost per Entire Experiment | GPT-4o mini | OpenAI o1-mini | LLaMA3.3-Instruct (70B) | DeepSeek-Distill (32B) | LLaMA3.1-Instruct (8B) | DeepSeek-Distill (7B)
Llama-3-8B – Attention Layer | $0.89 | $6.56 | $2.07 | $1.55 | $0.31 | $2.07
DeepSeek-R1 – MoE Layer | $0.90 | $6.63 | $2.09 | $1.57 | $0.31 | $2.09
FLUX – Attention Layer | $0.88 | $6.47 | $2.03 | $1.52 | $0.30 | $2.03
FLUX – Convolution | $1.12 | $8.25 | $2.67 | $2.00 | $0.40 | $2.67
End-to-End Llama-3-8B | $1.59 | – | – | – | – | –
Llama-4-Scout-17B-16E-Instruct – MLP Layer | $0.90 | – | – | – | – | –

Q2: Surrogate cost model generalization

We appreciate the reviewer’s attention to the cost model. The cost model generalizes to many platforms, but it is not our work or contribution; it is a native part of TVM and one of TVM's important contributions.

We use the default model provided by the TVM repository, which is based on XGBoost. We did not modify this model in our implementation. It has been widely adopted in both academic and commercial settings.

As the AutoTVM paper (Chen et al., NeurIPS'18) discusses, the use of XGBoost allows the cost model to learn efficiently from relatively few samples—making it a practical choice for rapid autotuning. Additionally, Sections 4 and 6.2 of their paper explicitly explore transfer learning, where knowledge from previously tuned workloads is reused to accelerate optimization on new workloads. While XGBoost itself does not support transfer learning natively, the AutoTVM framework applies transfer across similar tasks by initializing the model with prior data and retraining—demonstrating strong generalization in practice. The XGBoost model is trained on the fly, although it benefits from transfer learning across runs. The available documentation suggests that it is not necessary to train the model per workload.

While surrogate modeling is indeed a powerful and broadly useful tool in this domain, it is not a novel or central contribution of our work. We will clarify this better and include more details.

Our use of their unmodified cost model is consistent with the extensive literature on compiler optimizations and autotuning (Gibson and Cano, PACT’22; Ahn et al., DAC’22; Zhang et al., ICLR’21).

  1. We provide new results for two benchmarks (Llama-3-8B Attention Layer and Deepseek-R1 MoE Layer) on five platforms from four different hardware vendors (Apple M2 Pro, Amazon Graviton2, AMD EPYC 7R13, Intel Core i9, and Intel Xeon E3) here and will include them in the paper with discussions. We have to acknowledge TVM, as it provides the hardware-agnostic implementation to build optimization innovations. We inherit the hardware agnosticism from its implementation, although we fundamentally change the optimization algorithm.

  2. Moreover, we report end-to-end execution latency improvement for Llama3‑8B, covering all five platforms from four hardware vendors.

  3. We also provide results of the MLP layer from Llama‑4‑Scout‑17B‑16E‑Instruct, showcasing performance improvements in representative large-model layers and support for layers beyond what was originally included in the paper on all these five different platforms.

Three tables for these new results are reported in response to “Reviewer secC.”

In the tables, we measure sample efficiency using $\frac{\text{Speedup}}{\text{No. of Samples}}$ and report the relative improvements of our Reasoning Compiler compared to TVM. The reduction in the absolute number of samples is also reported.

As the results in all three tables show, the Reasoning Compiler consistently achieves higher speedups in substantially fewer samples across five different platforms, for all three sets of experiments mentioned above.

For the Llama-3-8B Attention Layer and Deepseek-R1 MoE Layer across the five platforms, our Reasoning Compiler achieves a 6.1× average speedup while TVM achieves 3.3×. Our compiler uses 6.5× fewer samples and exhibits 11.9× higher sample efficiency, as the first table shows.


Q3: End-to-end evaluation and barriers

Our approach and TVM can be used for end-to-end optimizations but engineering challenges make experimentation limited. Converting PyTorch models to TVM IR is difficult because several of the operators are not supported. Other challenges include inter-operator dependencies that may affect scheduling, more complex memory reuse patterns and layout constraints, and increased runtime variability due to kernel fusion or asynchronous execution.

For end-to-end execution of Llama3-8B across the five platforms, the sample efficiency improvement compared with TVM ranges from 3.2× on AMD EPYC to 11.8× on Intel Core i9. The end-to-end speedups range from 2.2× on AMD EPYC to 5.1× on Amazon Graviton2. The Reasoning Compiler consistently achieves significantly higher speedup (4.0× geomean vs. 2.8× geomean) with an average of 3.9× fewer samples.


Q4: Redundant transformations and fallback

Thank you for this insightful question. Yes, LLM proposals can occasionally be syntactically valid but semantically redundant or performance-regressive. During one expansion step, if all the transformations proposed by the LLM are invalid, the process falls back to MCTS with a default (non-LLM) policy, effectively continuing the search. However, if only some of the transformations are invalid, those are skipped, the rest are still applied, and no fallback is triggered. To prevent downstream harm from poor but valid transformations, the cost model evaluates all proposed transformations before they are added to the tree, so semantically poor proposals are naturally pruned due to a poor estimated value. Our transformation space is constrained to a known set of valid transformations, so correctness is primarily a matter of naming compliance and context usage, both of which the LLM handles well in most cases.
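To make this concrete, the following minimal Python sketch illustrates the expansion logic described above; the helper names (propose_transformations, is_valid, default_policy_proposals, apply_transformation) and the cost model interface are illustrative placeholders, not our actual implementation or TVM's API.

```python
def expand(node, llm, cost_model):
    """Illustrative sketch of one MCTS expansion step with LLM proposals.

    Invalid proposals are skipped; the default (non-LLM) policy is used only
    when every proposal in the batch is invalid, mirroring the fallback
    behavior described above.
    """
    proposals = llm.propose_transformations(node.program, node.history)
    valid = [t for t in proposals if is_valid(t, node.program)]

    if not valid:
        # All LLM proposals were invalid: fall back to a default MCTS policy.
        valid = default_policy_proposals(node.program)

    children = []
    for t in valid:
        child_program = apply_transformation(node.program, t)
        # The cost model screens even valid-but-poor proposals before they
        # enter the tree, so harmful transformations are pruned by a low
        # estimated value rather than by hard rejection.
        estimated_value = cost_model.predict(child_program)
        children.append(node.add_child(child_program, t, estimated_value))
    return children
```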

Model | Fallback Rate
GPT-4o mini | 0%
OpenAI o1-mini | 0%
LLaMA3.3-Instruct (70B) | 0.08%
DeepSeek-Distill (32B) | 0.17%
LLaMA3.1-Instruct (8B) | 10.50%
DeepSeek-Distill (7B) | 17.20%

The above table summarizes the average fallback rates for all the LLMs used with the Reasoning Compiler. As the table shows, the OpenAI models (GPT-4o mini and o1-mini) and the open-source models with a larger number of parameters (Llama3.3-Instruct (70B) and DeepSeek-Distill (32B)) fare similarly, with 0% for the commercial models and 0.08% and 0.17% for the open-source models. On the other hand, smaller models have higher fallback rates, with Llama3.1-Instruct (8B) at 10.50% and DeepSeek-Distill (7B) at 17.20%.

These results are consistent with our comparative study of different LLMs for Reasoning Compiler in Section 4.3.1 and Supplementary Material Section D.1, Table 2.


W4: Open-source models vs proprietary models

We thank the reviewer for suggesting more head-to-head comparisons between open-source models and proprietary ones. We agree, and to that end, besides the study in Section 4.3.1 and Figure 5(a) of the submission, we have added Supplementary Material Section D.1, Table 2, which reports the gains on the rest of the benchmarks using all the open-source LLMs. These additional results reconfirm that larger open-source models are competitive with proprietary alternatives, further supporting the accessibility and reproducibility of our Reasoning Compiler.


Minor notes and formatting

We appreciate the feedback on figure legends and spacing. We will improve font size in Figure 1 and Figures 3–5, and adjust math formatting in Equation (1) accordingly.


Once again, we thank the reviewer for their encouraging review and constructive questions, which have helped us improve both the scope and clarity of our work.

Comment

Reviewer jGo8, can you please check whether the author's rebuttal addresses your concerns?

Comment

Thank you so much for the clarifications in your rebuttal! I'm glad the feedback was helpful.

This definitely clears up my remaining concerns. Please let me know if any of my comments were unclear. Looking forward to seeing the final version of the paper!

Comment

We are grateful for your insightful comments and questions and we are glad the responses were satisfactory. All the comments were clear. Cheers!

Official Review (Rating: 4)

This paper proposes the Reasoning Compiler, a framework that optimizes compilation with LLMs and Monte Carlo tree search. Using the Reasoning Compiler, the authors perform compiler optimization as a sequential search process with contextual awareness. Experimental results reveal that the proposed method achieves comparable or superior performance compared to the previous evolutionary search method with significantly smaller sample sizes.

Strengths and Weaknesses

Strengths:

  1. Clarity: The paper is well written and easy to follow. The method is described clearly and concisely, and results are clearly presented.

  2. Significance: The topic is important. It is important to optimize compiler performance for better downstream machine learning task performance.

  3. Originality: The proposed method is novel.

  4. Quality: The authors conduct carefully designed, comprehensive experiments to show that the proposed reasoning compiler works better than baseline methods.

Weaknesses:

  1. Lack of varied hardware environments: The experiment is fixed to one hardware environment, namely "Intel Core i9 CPU using Apache TVM v0.20.0" (lines 245-246). Is the advantage of the proposed method agnostic to hardware environments?

  2. Lack of discussion about downstream implications: Can the authors discuss in more detail the implications of this work to the downstream applications of machine learning model tuning or inference?

Questions

Please see the weaknesses above. On top of the two weaknesses, can the authors also discuss whether it is possible to use the generated optimization traces to further fine-tune models for better optimization performance?

Limitations

yes

Final Justification

My concerns are resolved.

Formatting Concerns

I have no formatting concerns.

Author Response

We thank the reviewer for the positive and constructive comments, and we appreciate the recognition of our work’s clarity, originality, and significance. Below, we address the two primary comments and the additional question regarding fine-tuning.


W1: Hardware agnosticism

We appreciate your comment and we emphasize that our approach is hardware agnostic. We have to acknowledge TVM, as it provides the hardware-agnostic implementation to build optimization innovations. We inherit the hardware agnosticism from its implementation, although we fundamentally change the optimization algorithm.

We use the default surrogate cost model provided by the TVM repository, which is based on XGBoost. We did not modify this model in our implementation. It has been widely adopted in both academic and commercial settings.

As the AutoTVM paper (Chen et al., NeurIPS'18) discusses, the use of XGBoost allows the cost model to learn efficiently from relatively few samples—making it a practical choice for rapid autotuning. Additionally, Sections 4 and 6.2 of their paper explicitly explore transfer learning, where knowledge from previously tuned workloads is reused to accelerate optimization on new workloads. While XGBoost itself does not support transfer learning natively, the AutoTVM framework applies transfer across similar tasks by initializing the model with prior data and retraining—demonstrating strong generalization in practice.

While surrogate modeling is indeed a powerful and broadly useful tool in this domain, it is not a novel or central contribution of our work. We will clarify this better and include more details.

Our use of their unmodified cost model is consistent with the extensive literature on compiler optimizations and autotuning (Gibson and Cano, PACT’22; Ahn et al., DAC’22; Zhang et al., ICLR’21).

  1. We provide new results for two benchmarks (Llama-3-8B Attention Layer and Deepseek-R1 MoE Layer) on five platforms from four different hardware vendors (Apple M2 Pro, Amazon Graviton2, AMD EPYC 7R13, Intel Core i9, and Intel Xeon E3) here and will include them in the paper with discussions.

  2. Moreover, we report end-to-end execution latency improvement for Llama3‑8B, covering all five platforms from four hardware vendors.

  3. We also provide results of the MLP layer from Llama‑4‑Scout‑17B‑16E‑Instruct, showcasing performance improvements in representative large-model layers and support for layers beyond what was originally included in the paper on all these five different platforms.

Hardware Platform | Benchmark | TVM # Samples | TVM Speedup | Ours # Samples | Ours Speedup | Reduction in # Samples | Improvement in Sample Efficiency
Apple M2 Pro | Llama-3-8B Attention Layer | 1010 | 3.3 | 190 | 9.7 | 5.3× | 15.6×
Apple M2 Pro | Deepseek-R1 MoE Layer | 1040 | 2.8 | 230 | 4.8 | 4.5× | 7.8×
Amazon Graviton2 | Llama-3-8B Attention Layer | 510 | 3.9 | 60 | 5.1 | 8.5× | 11.1×
Amazon Graviton2 | Deepseek-R1 MoE Layer | 980 | 2.67 | 150 | 5.9 | 6.5× | 14.4×
AMD EPYC 7R13 | Llama-3-8B Attention Layer | 1400 | 2.1 | 200 | 12.1 | 7.0× | 40.3×
AMD EPYC 7R13 | Deepseek-R1 MoE Layer | 2290 | 1.7 | 330 | 2.3 | 6.9× | 9.4×
Intel Xeon E3 | Llama-3-8B Attention Layer | 2760 | 3.9 | 320 | 5.8 | 8.6× | 12.8×
Intel Xeon E3 | Deepseek-R1 MoE Layer | 1000 | 3.7 | 180 | 4.4 | 5.6× | 6.6×
Intel Core i9 | Llama-3-8B Attention Layer | 920 | 10.5 | 130 | 11 | 7.1× | 7.4×
Intel Core i9 | Deepseek-R1 MoE Layer | 1632 | 9.1 | 192 | 9.1 | 8.5× | 8.5×
Geomean | | | 3.3× | | 6.1× | 6.5× | 11.9×
Hardware Platform | Benchmark | TVM # Samples | TVM Speedup | Ours # Samples | Ours Speedup | Reduction in # of Samples | Improvement in Sample Efficiency
Apple M2 Pro | End-to-End Llama-3-8B | 4820 | 2.2× | 1770 | 3.9× | 2.7× | 4.8×
Amazon Graviton2 | End-to-End Llama-3-8B | 4560 | 3.7× | 1440 | 5.1× | 3.2× | 4.4×
AMD EPYC 7R13 | End-to-End Llama-3-8B | 410 | 2.0× | 140 | 2.2× | 2.9× | 3.2×
Intel Xeon E3 | End-to-End Llama-3-8B | 4640 | 5.0× | 670 | 5.0× | 6.9× | 6.9×
Intel Core i9 | End-to-End Llama-3-8B | 3800 | 2.2× | 720 | 4.9× | 5.3× | 11.8×
Geomean | | | 2.8× | | 4.0× | 3.9× | 5.6×

The table for the MLP layer from Llama‑4‑Scout‑17B‑16E‑Instruct is reported in response to “Reviewer secC.”

In the tables, we measure sample efficiency using $\frac{\text{Speedup}}{\text{No. of Samples}}$ and report the relative improvements of our Reasoning Compiler compared to TVM. The reduction in the absolute number of samples is also reported.
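As a concrete example of this metric, take the first row of the first table above (Apple M2 Pro, Llama-3-8B Attention Layer); the improvement in sample efficiency and the reduction in samples follow directly from the reported numbers:

```latex
% Apple M2 Pro, Llama-3-8B Attention Layer (first row of the first table above).
\frac{9.7 / 190}{3.3 / 1010}
  = \frac{9.7 \times 1010}{3.3 \times 190}
  \approx 15.6\times,
\qquad
\frac{1010}{190} \approx 5.3\times
```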

As the results in all three tables show, the Reasoning Compiler consistently achieves higher speedups in substantially fewer samples across five different platforms, for all three sets of experiments mentioned above.

For the Llama-3-8B Attention Layer and Deepseek-R1 MoE Layer across the five platforms, our Reasoning Compiler achieves a 6.1× average speedup while TVM achieves 3.3×. Our compiler uses 6.5× fewer samples and exhibits 11.9× higher sample efficiency, as the first table shows.

For end-to-end execution of Llama3-8B across the five platforms, the sample efficiency improvement compared with TVM ranges from 3.2× on AMD EPYC to 11.8× on Intel Core i9. The end-to-end speedups range from 2.2× on AMD EPYC to 5.1× on Amazon Graviton2. The Reasoning Compiler consistently achieves significantly higher speedup (4.0× geomean vs. 2.8× geomean) with an average of 3.9× fewer samples.

For the large MLP layer from Llama-4-Scout-17B-16E-Instruct, on Intel i9, the Reasoning Compiler achieves a 12.7× speedup in just 20 samples, while TVM reaches only 5.6× after 230 samples. This corresponds to an 11.5× reduction in the number of samples and a 26.1× improvement in sample efficiency compared with TVM. Similar trends are observed across all five platforms. For AMD EPYC, our compiler achieves a 10.2× speedup in 100 samples while TVM achieves 6.4× in 510 samples.


W2: Downstream implications

Cost efficiency: These improvements directly translate into meaningful reductions in the operational cost of LLM inference—measured not only in dollars but also in energy consumption and environmental impact. Because our Reasoning Compiler consistently delivers multi-fold reductions in required samples and end-to-end runtime across diverse hardware platforms, the compute workload—and consequently, the billing for cloud or on-premise usage—is reduced by a similar factor. This is highly impactful, especially as LLM inference costs now constitute a significant portion of infrastructure budgets for startups, enterprises, and research labs alike.

Energy and environmental impact: Inference runtime is directly tied to energy usage. Every millisecond reduction in execution time lowers the energy consumed per query, thereby contributing to a smaller carbon footprint. Given the growing scale and adoption of LLMs, this efficiency gain scales to material environmental benefits when deployed at production volume.

Responsiveness and usability: Our method also improves system responsiveness, which is critical for latency-sensitive applications like real-time chat, AI assistants, and interactive agents. Faster model execution enhances user experience and expands the viability of deploying large models in constrained environments such as edge devices or mobile apps.

Agile deployment and finetuning: By significantly reducing the number of samples required for optimization, our approach enables more agile model deployment, faster iteration cycles, and lower cost per tuning run in practical machine learning workflows.

Self-optimizing inference. One particularly exciting direction is the possibility of bootstrapping model inference itself. For instance, when the LLM guiding compilation is optimizing inference for its own architecture, a virtuous cycle is created: faster inference enables lower-latency querying of the model, which in turn accelerates the compilation process. This feedback loop opens up new possibilities for self-improving systems and continual performance refinement.

We thank the reviewer for the thoughtful prompt that opened the door to this discussion, and we will include it.


Q1: Possibility of using the generated optimization traces to fine-tune models for better optimization

This is a valuable question. While we did not fine-tune the LLM in our current work, we believe that fine-tuning using optimization traces is a promising direction. Our current system demonstrates strong performance without any fine-tuning, enabling plug-and-play integration into compiler infrastructures without requiring retraining or labeled data. This makes our method broadly applicable and easy to adopt across different hardware and workloads. That said, fine-tuning could likely yield even stronger optimization capabilities, particularly when targeting specific hardware or workload distributions. However, this introduces a tradeoff: fine-tuning on one hardware target may reduce portability and generalization to others — a limitation we deliberately avoided to preserve flexibility. Nonetheless, it is intuitive to fine-tune the model using optimization traces generated by MCTS. These traces can be filtered to produce high-quality training data, enabling a form of self-improvement where the model learns from its own refined experience.


Thank you again for your insightful questions and feedback. We will revise the paper to clarify these points, incorporate the additional cross-hardware results, and expand the discussion of downstream implications and future work.

Comment

Thank the authors for the comment. I will keep my score.

Comment

Thank you very much for your valuable comments and questions, we appreciate your feedback and for letting us know. Cheers!

Official Review (Rating: 5)

The paper introduces reasoning compiler, a novel framework that integrates large language model (LLM) reasoning with Monte Carlo Tree Search (MCTS) to guide compiler optimization for accelerating model inference. Traditional neural compiler techniques struggle with the vast and interdependent transformation space and are often sample-inefficient.

To address the problem, the Reasoning Compiler treats optimization as a sequential, context-aware decision-making process. It uses LLMs to propose transformations based on program state and performance history, while MCTS enables structured exploration of optimization paths. Empirical results show that this hybrid approach achieves significant speedups (e.g., up to 7×–10×) over pre-optimized code using orders of magnitude fewer samples than black-box methods like Evolutionary Search, highlighting its superior sample efficiency and practical value for compiler tuning.

Strengths and Weaknesses

Strengths:

  • Compiler optimization for neural network inference is of high practical importance
  • The proposed approach is technically sound and well-implemented. It integrates large language model (LLM) reasoning with Monte Carlo Tree Search (MCTS) for compiler optimization. The problem is clearly formalized as a Markov Decision Process, and the empirical evaluation is rigorous.
  • The core idea of casting compiler optimization as a context-aware reasoning task is interesting. It shows that the semantic understanding of LLMs can help navigate the non-linear optimization space.

Weakness:

  • While the high-level flow is clearly described in the work, the details of the interaction between LLM proposals and MCTS exploration are unclear to me. Figure 2 is not that helpful. In Figure 2(b), what is the difference between Prog. i+1, ..., Temp i+1,1, ..., Prog Sim.? It seems that the work uses both v_i and p_i to represent a program? What is a terminal program, and how is one obtained, such that it can help estimate local cost without hardware execution? What is the cost model?
  • Although sample efficiency improvements are impressive, the final absolute performance gains relative to baseline methods (e.g., Evolutionary Search) are sometimes modest. The evaluation is limited to four kernels and does not explore how well the method generalizes to diverse hardware or larger compilation tasks.

Questions

  1. Clarification on the cost model: The framework heavily relies on a learned hardware-aware cost model to simulate downstream performance. However, the paper doesn't explain how this model is trained and validated across different hardware platforms. Can the authors elaborate on how the cost model is built, whether it is specific to one hardware target or transferable across platforms, and the accuracy of the surrogate model compared to actual runtime?

  2. Can the authors discuss how well the approach generalizes to other layers of LLMs, such as MLP layers, and also report the end-to-end inference latency improvement?

The anonymized repository link doesn't seem to work for me.

Limitations

yes

Final Justification

The proposed approach is technically sound and well-implemented. It integrates large language model (LLM) reasoning with Monte Carlo Tree Search (MCTS) for compiler optimization. The problem is clearly formalized as a Markov Decision Process, and the empirical evaluation is rigorous. I remain positive of the work.

Formatting Concerns

no

Author Response

We thank the reviewer for the thoughtful review and for recognizing the novelty, technical soundness, and practical significance of our approach. We address the comments in detail.


Q1: Clarification on the cost model

As correctly noted, the theoretical framework relies on a cost model. This has become standard in compiler autotuning.

We use the default model provided by the TVM repository, which is based on XGBoost. We did not modify this model in our implementation. It has been widely adopted in both academic and commercial settings.

As the AutoTVM paper (Chen et al., NeurIPS'18) discusses, the use of XGBoost allows the cost model to learn efficiently from relatively few samples—making it a practical choice for rapid autotuning. Additionally, Sections 4 and 6.2 of their paper explicitly explore transfer learning, where knowledge from previously tuned workloads is reused to accelerate optimization on new workloads. While XGBoost itself does not support transfer learning natively, the AutoTVM framework applies transfer across similar tasks by initializing the model with prior data and retraining—demonstrating strong generalization in practice.

While surrogate modeling is indeed a powerful and broadly useful tool in this domain, it is not a novel or central contribution of our work. We will clarify this better and include more details.

Our use of their unmodified cost model is consistent with the extensive literature on compiler optimizations and autotuning (Gibson and Cano, PACT’22; Ahn et al., DAC’22; Zhang et al., ICLR’21).

We evaluate all candidate schedules using this model, and only the final selected schedules are measured with actual hardware execution. While we do not claim new contributions in cost model design, we note that the model has been validated in prior work to provide reliable performance estimates for guiding the search.

While the original AutoTVM paper does not report explicit accuracy metrics (e.g., R^2 or MAE) for the cost model, this seems intentional. In autotuning, the goal of the cost model is not to predict absolute runtime precisely, but rather to rank program candidates effectively so that search can focus on promising regions of the space. Thus, the utility of the cost model is typically validated through its impact on search efficiency and final performance.
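For illustration only, the following generic sketch shows how such an XGBoost surrogate can be trained incrementally and used purely for ranking candidates; it is not TVM's actual cost model, and featurize is a hypothetical stand-in for the framework's schedule feature extractor.

```python
import numpy as np
import xgboost as xgb


def featurize(schedule):
    # Hypothetical stand-in for a schedule feature extractor: assume each
    # schedule is already a numeric vector (tile sizes, unroll factors, ...).
    return np.asarray(schedule, dtype=np.float32)


class SurrogateCostModel:
    """Generic XGBoost surrogate trained on measured schedules and used only
    to rank unmeasured candidates; absolute latency accuracy is secondary."""

    def __init__(self):
        self.model = xgb.XGBRegressor(n_estimators=200, max_depth=6,
                                      learning_rate=0.1)

    def update(self, schedules, measured_latencies):
        # Retrain on the fly as new hardware measurements arrive.
        X = np.stack([featurize(s) for s in schedules])
        y = np.asarray(measured_latencies, dtype=np.float32)
        self.model.fit(X, y)

    def rank(self, candidates):
        # Lower predicted latency ranks first; only top-ranked candidates
        # would be compiled and measured on real hardware.
        X = np.stack([featurize(c) for c in candidates])
        predicted = self.model.predict(X)
        return sorted(range(len(candidates)), key=lambda i: predicted[i])
```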

  1. We provide new results for two benchmarks (Llama-3-8B Attention Layer and Deepseek-R1 MoE Layer) on five platforms from four different hardware vendors (Apple M2 Pro, Amazon Graviton2, AMD EPYC 7R13, Intel Core i9, and Intel Xeon E3) here and will include them in the paper with discussions. We have to acknowledge TVM, as it provides the hardware-agnostic implementation to build optimization innovations. We inherit the hardware agnosticism from its implementation, although we fundamentally change the optimization algorithm.

  2. Moreover, we report end-to-end execution latency improvement for Llama3‑8B, covering all five platforms from four hardware vendors.

  3. We also provide results of the MLP layer from Llama‑4‑Scout‑17B‑16E‑Instruct, showcasing performance improvements in representative large-model layers and support for layers beyond what was originally included in the paper on all these five different platforms.

Hardware Platform | Benchmark | TVM # Samples | TVM Speedup | Ours # Samples | Ours Speedup | Reduction in # Samples | Improvement in Sample Efficiency
Apple M2 Pro | Llama-3-8B Attention Layer | 1010 | 3.3 | 190 | 9.7 | 5.3× | 15.6×
Apple M2 Pro | Deepseek-R1 MoE Layer | 1040 | 2.8 | 230 | 4.8 | 4.5× | 7.8×
Amazon Graviton2 | Llama-3-8B Attention Layer | 510 | 3.9 | 60 | 5.1 | 8.5× | 11.1×
Amazon Graviton2 | Deepseek-R1 MoE Layer | 980 | 2.67 | 150 | 5.9 | 6.5× | 14.4×
AMD EPYC 7R13 | Llama-3-8B Attention Layer | 1400 | 2.1 | 200 | 12.1 | 7.0× | 40.3×
AMD EPYC 7R13 | Deepseek-R1 MoE Layer | 2290 | 1.7 | 330 | 2.3 | 6.9× | 9.4×
Intel Xeon E3 | Llama-3-8B Attention Layer | 2760 | 3.9 | 320 | 5.8 | 8.6× | 12.8×
Intel Xeon E3 | Deepseek-R1 MoE Layer | 1000 | 3.7 | 180 | 4.4 | 5.6× | 6.6×
Intel Core i9 | Llama-3-8B Attention Layer | 920 | 10.5 | 130 | 11 | 7.1× | 7.4×
Intel Core i9 | Deepseek-R1 MoE Layer | 1632 | 9.1 | 192 | 9.1 | 8.5× | 8.5×
Geomean | | | 3.3× | | 6.1× | 6.5× | 11.9×
Hardware Platform | Benchmark | TVM # Samples | TVM Speedup | Ours # Samples | Ours Speedup | Reduction in # of Samples | Improvement in Sample Efficiency
Apple M2 Pro | End-to-End Llama-3-8B | 4820 | 2.2× | 1770 | 3.9× | 2.7× | 4.8×
Amazon Graviton2 | End-to-End Llama-3-8B | 4560 | 3.7× | 1440 | 5.1× | 3.2× | 4.4×
AMD EPYC 7R13 | End-to-End Llama-3-8B | 410 | 2.0× | 140 | 2.2× | 2.9× | 3.2×
Intel Xeon E3 | End-to-End Llama-3-8B | 4640 | 5.0× | 670 | 5.0× | 6.9× | 6.9×
Intel Core i9 | End-to-End Llama-3-8B | 3800 | 2.2× | 720 | 4.9× | 5.3× | 11.8×
Geomean | | | 2.8× | | 4.0× | 3.9× | 5.6×
Hardware Platform | Benchmark | TVM # Samples | TVM Speedup | Ours # Samples | Ours Speedup | Reduction in # of Samples | Improvement in Sample Efficiency
Apple M2 Pro | MLP Layer, Llama-4-Scout-17B-16E-Instruct | 2460 | 2.2× | 440 | 3.4× | 5.6× | 8.6×
Amazon Graviton2 | MLP Layer, Llama-4-Scout-17B-16E-Instruct | 1630 | 1.7× | 500 | 4.0× | 3.3× | 7.7×
AMD EPYC 7R13 | MLP Layer, Llama-4-Scout-17B-16E-Instruct | 510 | 6.4× | 100 | 10.2× | 5.1× | 8.1×
Intel Xeon E3 | MLP Layer, Llama-4-Scout-17B-16E-Instruct | 1200 | 2.0× | 300 | 6.1× | 4.0× | 12.2×
Intel Core i9 | MLP Layer, Llama-4-Scout-17B-16E-Instruct | 230 | 5.6× | 20 | 12.7× | 11.5× | 26.1×
Geomean | | | 3.1× | | 6.4× | 5.3× | 11.1×

Q2: Generalization to other layers and end-to-end results

We appreciate the reviewer’s insightful question. Our technique is not limited to any specific layer type. As long as a layer is supported by TVM, our method can be applied without modification. This flexibility stems from our implementation being built directly on top of TVM's abstraction layers, enabling us to target a wide variety of architectures and operators.

In the tables, we measure sample efficiency using $\frac{\text{Speedup}}{\text{No. of Samples}}$ and report the relative improvements of our Reasoning Compiler compared to TVM. The reduction in the absolute number of samples is also reported.

As the results in all three tables show, the Reasoning Compiler consistently achieves higher speedups in substantially fewer samples across five different platforms, for all three sets of experiments mentioned above.

For the Llama-3-8B Attention Layer and Deepseek-R1 MoE Layer across the five platforms, our Reasoning Compiler achieves a 6.1× average speedup while TVM achieves 3.3×. Our compiler uses 6.5× fewer samples and exhibits 11.9× higher sample efficiency, as the first table shows.

For end-to-end execution of Llama3-8B across the five platforms, the sample efficiency improvement compared with TVM ranges from 3.2× on AMD EPYC to 11.8× on Intel Core i9. The end-to-end speedups range from 2.2× on AMD EPYC to 5.1× on Amazon Graviton2. The Reasoning Compiler consistently achieves significantly higher speedup (4.0× geomean vs. 2.8× geomean) with an average of 3.9× fewer samples.

For the large MLP layer from Llama-4-Scout-17B-16E-Instruct, on Intel i9, the Reasoning Compiler achieves a 12.7× speedup in just 20 samples, while TVM reaches only 5.6× after 230 samples. This corresponds to an 11.5× reduction in the number of samples and a 26.1× improvement in sample efficiency compared with TVM. Similar trends are observed across all five platforms. For AMD EPYC, our compiler achieves a 10.2× speedup in 100 samples while TVM achieves 6.4× in 510 samples.

These results support the broad applicability and practical benefits of our method across layers, models, and hardware.


W1: Clarification on the interaction between LLM and MCTS (Figure 2):

Thank you for pointing this out. We agree with your comments and will revise Figure 2.

We use $v_i$ to denote a node in the tree, and $p_i$ to denote the program stored in node $v_i$. In Figure 2, we used notation such as "Prog. i" instead of $p_i$ for better visuals, which seems to have caused confusion instead. We apologize.

In Figure 2(a), the compiler applies a sequence of LLM-suggested transformations to "Prog. i"; this sequence is the plan resulting from LLM reasoning for transforming "Prog. i" into "Prog. i+1". The compiler applies the transformations in this sequence one by one, generating temporary intermediate programs denoted by "Temp i+1,1", "Temp i+1,2", etc. After all of these temporary states, the compiler reaches "Prog. i+1". This is the MCTS expansion stage.

The next stage (Figure 2(b)) is the MCTS simulation stage, where the algorithm tries to evaluate the value of "Prog. i+1" while considering its potential future transformations. To do so, the MCTS algorithm generates a sequence of random transformations to simulate and assess how valuable "Prog. i+1" is. The terminal program ("Prog. Sim.") is attained after the compiler applies a sequence of random transformations one by one to "Prog. i+1" until no more valid transformations can be applied. It was an oversight to denote the temporary programs as "Temp i+1,1", the same as in Figure 2(a); they should have been "Dump i+1,1".
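As a complementary illustration of these two stages, the following Python sketch mirrors the description above; the helpers (llm_plan, apply, random_valid_transformations) and the cost model interface are illustrative placeholders rather than our actual code.

```python
import random


def expansion(v_i, llm):
    """Figure 2(a): apply the LLM-planned transformation sequence to p_i one
    by one, passing through temporary programs (Temp i+1,1; Temp i+1,2; ...)
    until Prog. i+1 is reached and added as a child node."""
    plan = llm_plan(v_i.program, v_i.history)  # sequence of transformations
    temp = v_i.program
    for t in plan:
        temp = apply(temp, t)                  # temporary intermediate program
    return v_i.add_child(program=temp, transformations=plan)


def simulation(v_next, cost_model):
    """Figure 2(b): roll out random transformations from Prog. i+1 until no
    valid transformation remains, then score the terminal program (Prog. Sim.)
    with the surrogate cost model instead of real hardware."""
    prog = v_next.program
    while True:
        options = random_valid_transformations(prog)
        if not options:
            break
        prog = apply(prog, random.choice(options))
    return cost_model.predict(prog)            # estimated value backed up the tree
```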


Repository

Hovering your mouse over the word "anonymized" shows the link in the bottom corner of the browser.


Thank you for your insightful inquiries to our work.

Comment

Reviewer secC, can you please check whether the author's rebuttal addresses your concerns?

Comment

Thanks for the response to my questions. I remain positive of the work.

Comment

We appreciate your thoughtful comments and feedback. Cheers!

Official Review (Rating: 4)

This paper introduces Reasoning Compiler, which focuses on neural optimization, and does so via Monte Carlo tree search (MCTS) and reasoning. The paper shows how fewer samples can be used to guide the optimization direction, obtaining results that are better than an unoptimized baseline compiled via evolutionary search.

Strengths and Weaknesses

  • Important domain, runtime performance driven
  • Strong evaluation and results
  • Good explanation of methodology and approach.
  • Related work is not entirely well explained. Is reasoning necessary?
  • Ablation study is more a discussion section, than an ablation on where the improvements arise.

Overall, I like this work, and appreciate the idea of using reasoning guided MCTS for compiler performance enhancement. The authors do a good job at explaining their approach in detail, how they run their experiments, and motivating the importance of their solution.

My main weaknesses from the work have to do with related work, and a subnote about the surrogate model for prediction. Regarding related work: the authors focused extensively on ML-guided approaches, but it was not clear that an ML-guided approach is truly needed (or at least, it was not fully explained to this reader). The use of TVM and the discussion of challenging search space exploration make it all the more interesting whether reasoning really is required for MCTS to succeed. The authors themselves highlight that an LLM is not the centerpiece; this leads me to wonder if a rule-based MCTS-guided approach would have done just as well without the large overheads involved with an LLM?

Another related work has to do with superoptimization as a field in general. Can the authors discuss how their approach compares/contrasts to compiler superoptimization? Is Reasoning Compiler just a unique subset, or does it actually have more to offer?

Finally, this reviewer was very intrigued by the surrogate model used at the end of Section 3, but it did not get enough real estate in the paper! Such a model seems super useful in general, and could potentially be its own contribution. Do the authors think so, or is this a simple model, and hence not a big component?

Questions

  • Where is reasoning really needed? Can a rule-based MCTS with a bit of engineering effort replace the reasoning portion of the tool?
  • Comparison with super-optimization.
  • Any additional details of the surrogate model?

Limitations

Yes.

Final Justification

The rebuttal was decent, and more results are added/promised for the final paper.

Formatting Concerns

N/A

Author Response

We thank the reviewer for the thoughtful and constructive feedback. We’re glad you found our methodology and experimental results compelling, and we appreciate the opportunity to clarify several points.


Q1: Where is reasoning really needed

Thank you for this insightful question. To clarify, our statement that “the LLM is not the centerpiece of our contribution, but a necessary enabler of effective search” was meant to emphasize that the key novelty lies in combining structured search (via MCTS) with contextual, history-aware reasoning—not in developing a new language model for compilation or optimization. However, we do view the reasoning component as essential and a main enabler.

Your question rightly asks whether a rule-based MCTS could be sufficient. In our view, however, contextual reasoning is a must for compiler optimization due to the deep interdependence of transformations. Rule-based systems—even if embedded within MCTS—have historically struggled in this space, particularly because they lack the flexibility to generalize across workloads and transformation paths.

This challenge is evident in the superoptimization literature, where early approaches (e.g., Massalin’s Superoptimizer, ASPLOS 1987) focused on exhaustive enumeration of transformation sequences. Even such exhaustive enumeration was found to be limited in scalability, as the “Stochastic Superoptimization” paper (Schkufza et al., ASPLOS 2013) notes, which is why that work shifts toward randomized stochastic search (MCMC).

Related works such as Stochastic Superoptimization and TVM are a testament to the ineffectiveness of rule-based optimization: the ASPLOS 2013 paper is the work that makes the leap to random search for compiler optimization, and TVM follows suit with evolutionary (genetic) search and simulated annealing algorithms.

For example, the caption of Figure 4 in the “Stochastic Superoptimization” (ASPLOS 2013) paper states that “O0 and O3 optimized codes occupy a densely connected part of the space which is easily traversed. Expert code occupies an entirely different region of the space which is reachable only by way of an extremely low probability path.”

This resonates with our experience in compiler optimization: expert-designed heuristics are brittle, and the performance-optimal regions are often isolated and hard to discover without global context.

Although these works overcome the limitation of rule-based approaches, they suffer from sample inefficiency that limits their exploration in practice and results in sub-optimal solutions. As such we go even further beyond stochastic search and contribute LLM-guided semantic reasoning to propose structured, context-sensitive search of program transformations that incorporates

  • transformation history,
  • hierarchical code structure,
  • hardware cost feedback.

This enables the compiler to reason about interactions that rule-based or myopic policies would miss. Could this be replicated by a handcrafted rule-based MCTS? Possibly, but it would require significant engineering effort, domain-specific tuning, and still lack the adaptability and generality of a learned reasoning model.

In short, the LLM and its reasoning are not merely a convenience: they provide flexible, context-aware decision-making that is critical to achieving high sample efficiency and generalization, particularly in settings where manual rules fall short—as the stochastic superoptimization literature has already demonstrated.


Q2: Comparison to superoptimization

We appreciate the reviewer’s suggestion to situate our work more explicitly relative to superoptimization.

While our high-level goal of discovering highly efficient program variants shares motivation with superoptimization (e.g., Massalin ASPLOS 1987, Bansal and Aiken, ASPLOS 2006, Schkufza et al., ASPLOS 2013), the formulation and tractability of our problem differ substantially.

Superoptimization typically aims to find the globally optimal instruction-level program, often via exhaustive or stochastic search over low-level assembly variants. In contrast, our system operates over a constrained space of high-level, legality-preserving transformations (e.g., tiling, fusion, unrolling). That is, we do not attempt to synthesize arbitrary instruction sequences or perform unconstrained equivalence-preserving rewrites.

Instead, our problem is best understood as sequencing legal high-level transformations over structured intermediate representations—closer in nature to a planning problem amenable to contextual reasoning, whereas superoptimization is closer to low-level program synthesis.

Our approach is framed as a sequential decision-making process over a defined space of high-level transformations (e.g., tiling, fusion, vectorization), applied to structured intermediate representations such as TVM’s IRModule. This formulation—explicitly cast as a Markov Decision Process (MDP) in Section 2—lends itself naturally to structured search methods like MCTS and supports contextual reasoning over transformation sequences. In contrast, superoptimization aims to synthesize semantically equivalent low-level (often loop-free) assembly programs, typically using program synthesis techniques such as enumeration, symbolic reasoning, or stochastic search (as seen in Massalin, ASPLOS 1987; Bansal and Aiken, ASPLOS 2006; Schkufza et al., ASPLOS 2013). These fundamental differences in abstraction level and search space shape the respective trade-offs between generality, scalability, and reasoning complexity.
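For concreteness, this sequential formulation can be summarized with a generic MDP tuple; the precise state, action, and reward definitions are those given in Section 2 of the paper and are not restated here.

```latex
% Generic MDP view of transformation sequencing (illustrative; the exact
% state, action, and reward definitions are those given in Section 2).
\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, R), \qquad
\mathcal{S}: \text{structured IR programs (e.g., TVM IRModules)}, \qquad
\mathcal{A}: \text{legal high-level transformations (tiling, fusion, vectorization, ...)}
```

Here $T$ applies a chosen transformation to the current program, and $R$ rewards the resulting performance gain as estimated by the surrogate cost model.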

While our formulation sacrifices the generality of instruction-level search and optimization across all possible domains and programs, it significantly improves scalability and practical applicability to emerging neural workloads.

Due to these foundational differences, we did not explicitly cite superoptimization literature in the original draft. We agree that a discussion contrasting objectives and search constraints would improve contextualization, and we will include it.


Q3: Surrogate model details

Thank you for highlighting interest in the surrogate model. In our system, the surrogate (or cost) model is used to estimate hardware performance without requiring each candidate program to be compiled and executed—a well-known bottleneck in practical compiler optimization.

We use the default model provided by the TVM repository, which is based on XGBoost. We did not modify this model in our implementation. It has been widely adopted in both academic and commercial settings.

As the AutoTVM paper (Chen et al., NeurIPS'18) discusses, the use of XGBoost allows the cost model to learn efficiently from relatively few samples—making it a practical choice for rapid autotuning. Additionally, Sections 4 and 6.2 of their paper explicitly explore transfer learning, where knowledge from previously tuned workloads is reused to accelerate optimization on new workloads. While XGBoost itself does not support transfer learning natively, the AutoTVM framework applies transfer across similar tasks by initializing the model with prior data and retraining—demonstrating strong generalization in practice.

While surrogate modeling is indeed a powerful and broadly useful tool in this domain, it is not a novel or central contribution of our work. We will clarify this better and include more details.

Our use of their unmodified cost model is consistent with the extensive literature on compiler optimizations and autotuning (Gibson and Cano, PACT’22; Ahn et al., DAC’22; Zhang et al., ICLR’21).

  1. We provide new results for two benchmarks (Llama-3-8B Attention Layer and Deepseek-R1 MoE Layer) on five platforms from four different hardware vendors (Apple M2 Pro, Amazon Graviton2, AMD EPYC 7R13, Intel Core i9, and Intel Xeon E3) here and will include them in the paper with discussions. We have to acknowledge TVM, as it provides the hardware-agnostic implementation to build optimization innovations. We inherit the hardware agnosticism from its implementation, although we fundamentally change the optimization algorithm.

  2. Moreover, we report end-to-end execution latency improvement for Llama3‑8B, covering all five platforms from four hardware vendors.

  3. We also provide results of the MLP layer from Llama‑4‑Scout‑17B‑16E‑Instruct, showcasing performance improvements in representative large-model layers and support for layers beyond what was originally included in the paper on all these five different platforms.

Three tables for these new results are reported in response to “Reviewer secC.”

In the tables, we measure sample efficiency using $\frac{\text{Speedup}}{\text{No. of Samples}}$ and report the relative improvements of our Reasoning Compiler compared to TVM. The reduction in the absolute number of samples is also reported.

As the results in all three tables show, the Reasoning Compiler consistently achieves higher speedups in substantially fewer samples across five different platforms, for all three sets of experiments mentioned above.

For the Llama-3-8B Attention Layer and Deepseek-R1 MoE Layer across the five platforms, our Reasoning Compiler achieves a 6.1× average speedup while TVM achieves 3.3×. Our compiler uses 6.5× fewer samples and exhibits 11.9× higher sample efficiency, as the first table shows.

For end-to-end execution of Llama3-8B across the five platforms, the sample efficiency improvement compared with TVM ranges from 3.2× on AMD EPYC to 11.8× on Intel Core i9. The end-to-end speedups range from 2.2× on AMD EPYC to 5.1× on Amazon Graviton2. The Reasoning Compiler consistently achieves significantly higher speedup (4.0× geomean vs. 2.8× geomean) with an average of 3.9× fewer samples.


Once again, thank you for your detailed and encouraging feedback.

Comment

Reviewer 4gxN, can you please check whether the author's rebuttal addresses your concerns?

Comment

Thank you for the thorough rebuttal. I maintain my positive score for the paper.

Comment

We thank you for your deep and stimulating comments. Cheers!

Final Decision

This paper introduces a novel framework for compiler optimization that combines LLMs with MCTS. Reviewers were consistently positive about the technical soundness and sample efficiency. They raised valid concerns about the limited evaluation on a single hardware platform and the lack of end-to-end benchmarks. The authors provided a very thorough rebuttal, presenting significant new results across multiple hardware platforms and for end-to-end inference. This rebuttal effectively resolved the reviewers' primary concerns and solidified the paper's contributions. The resulting work presents a significant and practical advance in compiler optimization. The paper should be accepted.