We sincerely thank the reviewer for their insightful and constructive feedback. We are particularly encouraged that the reviewer recognized and appreciated the clarity of our writing, the comprehensiveness of our experiments, and the core finding that high-quality data is more impactful than sheer quantity. These acknowledgments affirm the central contributions of our work.

We are grateful for the opportunity to address the remaining concerns and further clarify our contributions:

W1: Limited Scale of Hint-Engineering Data, Scalability, and Necessity of Small Data

We would like to clarify that achieving significant improvements with a small, manually curated dataset is not a limitation, but rather a key strength and central finding of our work, demonstrating the high sample efficiency of our CoRT framework.

A High-Quality Seed for a Scalable Pipeline: The 30-sample dataset serves as a strategic "seed" to efficiently elicit the base model's latent capability to interact effectively with a Code Interpreter. The manual effort is minimal (<2 hours), far more efficient than the extensive annotation required in works like InstructGPT. Our approach then scales through automated data synthesis via Rejection Fine-Tuning (RFT), where the model generates large volumes of candidate solutions which are then filtered for quality, creating a virtuous self-improvement cycle.
Necessity of Seed Data: To demonstrate that this small-scale data is essential, we conducted zero-shot experiments on the DeepSeek-R1-32B. The results below show that prompting alone cannot achieve the same fundamental shift in reasoning behavior that our method enables.

Model	Dataset	Avg. Accuracy	Avg. Tokens	Avg. Code Blocks
TIR Prompt-hint on DeepSeek-R1-32B	AIME24	67.9%	9,382	0.89
	AIME25	50.4%	11,902	0.68
Hint-Engineering-RFT-32B	AIME24	76.3%	7,260 `(-22.6%)`	2.10
	AIME25	65.2%	8,532 `(-28.3%)`	2.64

Our Hint-Engineering model achieves significantly higher accuracy while using 23-28% fewer tokens. Critically, the "Avg. Code Blocks" metric shows a dramatic behavioral shift: our model uses 2-4 times more code blocks, indicating a learned, efficient strategy of proactive computation. This aligns with recent findings [1] on the power of small, high-quality data.

[1] "The Challenge of Teaching Reasoning to LLMs Without RL or Distillation."

W1: Generalization to Broader Settings

To directly address transferability, we conducted a comprehensive out-of-distribution (OOD) evaluation on Chemistry problems from GPQA. This stringent test assesses if the model can generalize its reasoning and tool-use skills to a completely new domain, including discovering and using a domain-specific library (RDKit) that was never seen during training.

Model	Accuracy	Avg. Tokens	RDKit Usage Rate
DeepSeek-R1-32B	40.6%	5,947	0%
Hint-Engineering RFT-32B	47.5%	4,220	81.3%
Improvement	+6.9 pts	-29.0%	+81.3 pts

The results show robust cross-domain transfer: our model achieves a 6.9-point accuracy gain, uses 29% fewer tokens, and spontaneously discovers and utilizes the novel RDKit tool in 81.3% of cases. This provides strong evidence that our framework teaches generalizable reasoning capabilities, not domain-specific memorization.

W2: Lack of Statistical Significance Testing

We thank the reviewer for this constructive suggestion.

We must clarify that our results are statistically meaningful as each reported accuracy is a stable estimate averaged over multiple stochastic samples (4 or 16 per problem), not a single-shot result, thus significantly reducing the impact of randomness as described in caption of table 1. Then we have now conducted pairwise Wilcoxon signed-rank tests by pooling results across all problems from the benchmarks to assess statistical significance.

Model Scale	Model A	Model B	Acc. Improv.	p-value	Token Reduction	p-value
32B	Hint-Eng-RFT	R1-distill-32B	+3.80%	0.055	36.3%	< 0.001
1.5B	Hint-Eng-RL	R1-distill-1.5B	+7.41%	0.013	59.1%	< 0.001

The analysis confirms that our efficiency gains are highly significant (p < 0.001). Accuracy improvements are also statistically significant for the 1.5B model (p=0.013) and approach significance for the 32B model (p=0.055).

W3: Inference Time Analysis Missing

We thank the reviewer for this critical question. To confirm that token efficiency translates to real-world speed, we conducted a comprehensive latency analysis on AIME24, decomposing wall-clock time into Generation Time and Execution Time.

Model / Method	Avg. Wall-Clock Time (s)	Avg. Generation Time (s)	Avg. Execution Time (s)	Generation Time as % of Total
Prompt-Hint-SFT-32B	187.35 ± 107.74	182.39	4.96	97.3%
Hint-Engineering-RFT-32B	101.17 ± 98.03	97.84	3.32	96.7%
Improvement	46.0% Faster	46.3% Faster	33.1% Faster	-

The results are definitive:

Generation time is the bottleneck, accounting for over 96% of total latency.
Our method achieves a 46.0% reduction in total wall-clock time, a substantial real-world speedup. This confirms that token efficiency is the primary driver of performance gains in this setting.

W4 & Q1: Alternative Baselines

Length-based Penalty: We agree this is an intuitive baseline. However, a simple length penalty is a "black box" method that cannot distinguish between efficient conciseness and logical incompleteness. Our experiments show that simply finetuning on the shortest responses (prompt-hint-Shortest) leads to worse performance than random sampling, while our Hint-engineering approach achieves the highest performance with the shortest generation length, demonstrating it learns true efficiency.

Model Size	Method	Performance on AIME24 & 25	Avg. Generation Length
1.5B	prompt-hint-Shortest	25.6	18290
	Hint-engineering	28.8	16263
7B	prompt-hint-Shortest	40.6	14206
	Hint-engineering	42.2	11043
14B	prompt-hint-Shortest	63.2	11770
	Hint-engineering	65.4	8702

ToolLLM with Function Calling: We agree that function-calling is effective for simple tasks, but argue that for complex mathematical reasoning, code generation offers indispensable advantages in flexibility, expressiveness, and emergent problem-solving.
Flexibility & Composability: Pre-defined APIs are rigid. They cannot handle novel combinations of functions or tasks requiring intermediate logic (e.g., loops, conditionals). In contrast, code generation empowers the model to dynamically orchestrate tools using loops, variables, and conditional logic, acting as a powerful "computational glue" for complex, multi-step reasoning. This is essential for problems like iterative approximation that have no single, pre-defined tool.
Expressiveness & Generalization: An API toolkit is inevitably incomplete. It cannot cover the "long tail" of thousands of functions in libraries like SymPy. Code generation provides universal access to the entire library ecosystem. Our Out-of-Distribution (OOD) experiment provides the strongest evidence: the model spontaneously discovered and used the unseen RDKit library to solve chemistry problems, a feat impossible for a system with a fixed API set.

Q2: Impact of R1-style Base Model

Our goal was to augment a SOTA LRM, leveraging its sophisticated reasoning patterns (e.g., self-reflection, hypothesis testing) and combining them with computational precision. This hybrid approach, also seen in proprietary models like OpenAI's o3, is a key research gap in the open community. Our work aims to fill this gap. To validate this design choice, we compared our LRM-based models with a strong code-centric baseline.

Model (1.5B scale)	Base Model Style	Avg. Accuracy
Qwen-2.5-Math-1.5B-RL	Short-CoT, Code-Centric	47.3%
Prompt-Hint-1.5B-RL	LRM (R1-style), Hybrid	58.3%
Hint-Engineering-1.5B-RL	LRM (R1-style), Hybrid	56.4%

Our LRM-based models substantially outperform the code-centric baseline by ~10 absolute percentage points. This suggests that the "heavy text reasoning" from the LRM base provides strategic advantages like problem structuring, strategic tool deployment, and self-verification, which are crucial for top-tier performance on complex tasks.

Q3: Prompting to Skip Verification

This is an astute suggestion. We experimented on D_{Hint-engineering-SFT} and found that explicitly prompting the model to skip verification does reduce token usage, but our Hint-Engineering method is far more effective.

Method	Average Response Length
Prompt-Hint	7867
Prompt-Hint (w/ skip verification prompt)	6978 (`-11.3%`)
Hint-Engineering	4620 (`-41%`)

Q4: Overhead of Code Execution for Small LLMs

We clarify that our work focuses on the fundamental challenge of integrating systematic reasoning with natural language reasoning, a paradigm applicable across model scales. We note that commercial products like ChatGPT successfully integrate code execution without prohibitive overhead, and we believe deployment challenges are addressable through engineering advances.

Q5: Formatting in Line 119

'''python is a special token that starts with the code block, not a typo, we will explain it further.