PaperHub
Overall rating: 3.7 / 10 · Decision: Rejected · 3 reviewers
Ratings: 3, 5, 3 (min 3, max 5, std 0.9)
Confidence: 4.3 · Correctness: 2.7 · Contribution: 2.7 · Presentation: 3.0
ICLR 2025

HessianGrad: Optimizing AI Systems with Hessian-Aware Textual Gradients

Submitted: 2024-09-27 · Updated: 2025-02-05

Abstract

Keywords
LLM · Prompt Optimization · Gradient Descent

Reviews and Discussion

Official Review
Rating: 3

This paper presents a new LLM-based textual optimization method that takes the evolution of LLM system responses across iterations into account. Improvements are achieved on multiple textual optimization tasks.

Strengths

  1. The paper writing is clear and easy to understand.
  2. The proposed optimization method achieves considerable improvements on a variety of tasks.

Weaknesses

  1. Although the authors have written about the difference between momentum-based methods and HessianGrad, the novelty of transferring the focus from feedback similarity to response similarity is somewhat weak. The authors should include more convincing ablation experiments to verify that tracking the dynamics of responses is more effective.

  2. The second-order Hessian formulation cannot provide sufficient theoretical support for the optimization framework. The relationship between tracking feedback and tracking responses is not comparable to that between first-order and second-order optimization.

Questions

Please respond to the concerns in the "Weaknesses" part.

Q1. The motivation (line 223~226) of introducing the similarity function does not match well with the second-order optimization theory. How can the similarity of the responses provide second-order information?

Q2. The concrete definition of the similarity function on line 244 is meant to connect with the formulation of second-order optimization. However, this definition is contrary to the motivation in line 226 ("more gradual and thoughtful evolution of the response over multiple iterations"), since larger similarity means larger fluctuation between successive steps, according to this definition. Moreover, this definition is actually focusing on changes in feedback ($\mathcal{L}(r(p_t))$) instead of response ($r(p_t)$), and this is a point that contradicts the motivation of this paper. Please clarify how the definition aligns with the stated motivation.

Comment

3. The motivation (line 223~226) of introducing the similarity function does not match well with the second-order optimization theory. How can the similarity of the responses provide second-order information?

We appreciate the opportunity to clarify the role of the similarity function. The function is not intended to mathematically replicate second-order derivatives but to capture trends in response evolution across iterations. These trends, we believe, serve a similar purpose to second-order information by reflecting the optimization trajectory’s progression, conceptually akin to how second-order derivatives capture curvature and directionality in numerical optimization.

By tracking response similarity, the function identifies stagnation (low similarity changes) or instability (large fluctuations), enabling HessianGrad to make thoughtful adjustments. We think this richer signal helps prevent premature convergence or erratic updates, stabilizing the optimization process. Empirical results demonstrate smoother trajectories, the ability to escape local optima, and improved task performance, validating the practical utility of this approach even if its connection to second-order theory is conceptual rather than strict.
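
To make the stagnation/instability distinction concrete, below is a minimal numeric sketch. It is illustrative only: the thresholds, the scalar loss proxy, and the function name are assumptions, since HessianGrad performs these comparisons in text through the LLM rather than numerically.

```python
# Illustrative numeric proxy only; HessianGrad itself compares responses in text
# via the LLM. Thresholds, score scale, and the function name are hypothetical.

def trajectory_signal(loss_history, stall_tol=0.01, jump_tol=0.25):
    """Classify the recent trajectory from successive textual-loss proxy scores.

    loss_history: one scalar score per iteration (lower is better).
    Returns "stagnating", "unstable", or "progressing".
    """
    if len(loss_history) < 2:
        return "progressing"  # not enough history to judge yet

    # Magnitude of change between successive iterations (an L1-style difference).
    deltas = [abs(b - a) for a, b in zip(loss_history[:-1], loss_history[1:])]

    if max(deltas[-2:]) < stall_tol:   # barely moving -> encourage exploration
        return "stagnating"
    if deltas[-1] > jump_tol:          # large swing -> encourage stabilization
        return "unstable"
    return "progressing"


print(trajectory_signal([0.80, 0.795, 0.793]))  # stagnating
print(trajectory_signal([0.80, 0.55, 0.90]))    # unstable
print(trajectory_signal([0.80, 0.70, 0.62]))    # progressing
```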

4. The concrete definition of the similarity function on line 244 is meant to connect with the formulation of second-order optimization. However, this definition is contrary to the motivation in line 226 ("more gradual and thoughtful evolution of the response over multiple iterations"), since larger similarity means larger fluctuation between successive steps, according to this definition.

The similarity function measures the degree of change between successive responses, where larger values indicate greater fluctuations. We believe this design does not contradict the motivation of promoting gradual and thoughtful evolution. Instead, it enables adaptive updates: larger similarity values signal instability, prompting stabilization, while smaller values indicate stagnation, encouraging further refinement.

In our view, this approach aligns conceptually with second-order principles by capturing cumulative trends in response evolution, similar to how second-order derivatives evaluate curvature and directional changes. This allows for controlled, iterative refinement of responses throughout the optimization process.

5. Moreover, this definition is actually focusing on changes in feedback ($\mathcal{L}(r(p_t))$) instead of response ($r(p_t)$), and this is a point that contradicts the motivation of this paper. Please clarify how the definition aligns with the stated motivation.

The similarity function evaluates the textual loss $\mathcal{L}(r(p_t))$, which measures the quality of the response $r(p_t)$ in aligning with task objectives. Since $\mathcal{L}(r(p_t))$ is directly dependent on $r(p_t)$, changes in $\mathcal{L}(r(p_t))$ reflect changes in the response. We believe this ensures the similarity function inherently tracks response evolution over iterations, supporting thoughtful and meaningful adjustments in line with the paper’s motivation.

In contrast, feedback refers to the gradient or adjustment direction ($\frac{\partial \mathcal{L}(r(p_t))}{\partial p_t}$) provided by the evaluator LLM. Feedback describes how prompts should change but does not directly measure how responses evolve. For instance, feedback may remain constant even if responses improve, or responses could stagnate while feedback fluctuates due to minor prompt adjustments.

In our view, by focusing on response evolution rather than feedback, the similarity function aligns more closely with the goal of gradual and thoughtful optimization, which we consider central to our framework.

Comment

Thank you for your careful reading and thoughtful reviews. Let us address your comments below.

1. Although the authors have written about the difference between momentum-based methods and HessianGrad, the novelty of transferring the focus from feedback similarity to response similarity is somewhat weak. The authors should include more convincing ablation experiments to verify that tracking the dynamics of responses is more effective.

Thank you for your thoughtful comments. Below, we clarify how HessianGrad differs from momentum-based methods like M-TextGrad and why tracking response dynamics is more effective.

HessianGrad fundamentally focuses on response dynamics rather than feedback similarity. Momentum-based methods adjust step sizes based on feedback similarity, which can lead to instability in complex response landscapes. In contrast, HessianGrad tracks response evolution over iterations, enabling stable and controlled updates without relying on abrupt step-size adjustments.

Tracking response dynamics provides a richer signal by capturing subtle changes in responses, even when feedback appears stagnant. This approach allows HessianGrad to achieve smoother optimization trajectories, escape local optima, and perform robustly in challenging tasks.

Moreover, we have added an empirical analysis of loss curves on solution optimization tasks (Google-proof QA, MMLU-Machine Learning, and MMLU-College Physics) to further validate this approach. Please refer to the blue part in Section 4.2 for details (the text shown in blue color in the revised manuscript is our revision). With six iterations, the textual loss curves (using test accuracy as a proxy) reveal:

  • Escaping Local Optima: As shown in Figure 2 (b), HessianGrad surpasses performance plateaus through cumulative response dynamics.
  • Stabilizing Updates: Unlike baselines that exhibit oscillations and instability (e.g., M-TextGrad in Figure 2 (c)), HessianGrad achieves smoother optimization trajectories.
  • Improved Performance in Challenging Scenarios: Proxy loss curves in all figures highlight HessianGrad’s ability to navigate meaningful adjustments, achieving higher test accuracy over iterative refinements.

These results demonstrate that tracking response dynamics offers significant advantages over feedback-driven methods, addressing complex optimization challenges effectively.

2. The second-order Hessian formulation cannot provide sufficient theoretical support for the optimization framework. The relationship between tracking feedback and tracking responses is not comparable to that between first-order and second-order optimization.

We appreciate the reviewer’s comments and would like to clarify our perspective. HessianGrad does not aim to replicate numerical second-order derivatives. Instead, we believe it is inspired by second-order optimization principles and operationalizes their practical effects—such as escaping local optima and stabilizing updates—within the context of textual optimization.

We think the relationship between feedback and response tracking in HessianGrad is conceptual rather than strict. Feedback tracking reacts to immediate gradients, which is similar to first-order methods, while response tracking captures broader trends in optimization, akin to the cumulative information associated with second-order derivatives. This richer signal, in our view, allows HessianGrad to guide updates that are both precise and stable.

We believe the empirical results support this approach. Across tasks, HessianGrad achieves smoother optimization trajectories and outperforms momentum-based methods, demonstrating that tracking response dynamics is practically effective and aligned with second-order-inspired principles, even if not strictly theoretical.

Official Review
Rating: 5

This paper proposes a new method for optimizing system prompts through gradient descent. The authors outline the limitations of the classical method "TextGrad", and present solutions to mitigate these issues. The second-order derivative (i.e., HessianGrad) is introduced into "TextGrad", thereby reducing the likelihood of the system prompt getting trapped in local minima. The authors conduct empirical experiments on three tasks: prompt optimization, solution optimization, and code optimization. The conclusion is that the proposed new method outperforms the naive TextGrad method and the Momentum-Enhanced TextGrad method across four mainstream models.

Strengths

This paper is well written, with a clear presentation of the new algorithm, and the experiments are rigorous and reproducible.

Weaknesses

Overall, this paper may have an algorithmic contribution, but supplemental details are required on both theoretical and experimental aspects.

  1. Theoretical aspect:
  • When defining the similarity $\mathcal{S}$ of the prompt before and after each iteration, you do not multiply it by any hyperparameters and directly add it to the original TextGrad to obtain the HessianGrad. Consider adding two hyperparameters, $\beta_1$ and $\beta_2$, to the two terms of HessianGrad, similar to AdamW (see the sketch after this list). Could you conduct an ablation study on the effect of adding these hyperparameters, or provide justification for why they were not included in the current formulation?
  • You use the $L_1$ norm to define $\mathcal{S}$; however, when measuring semantic similarity, cosine similarity is more commonly used, as it signifies that the loss $\mathcal{L}$ before and after iteration is closer in direction, while $L_1$ primarily signifies that $\mathcal{L}$ is closer in numerical value. It is recommended to provide a rationale for choosing the $L_1$ norm as a similarity metric. (Simple reasons are acceptable, such as "the $L_1$ norm is easier to compute".) Could you compare the performance of your method using the $L_1$ norm versus cosine similarity, or provide empirical justification for why the $L_1$ norm was chosen over other common similarity metrics?
  2. Experimental aspect:
  • Please provide the loss curves for HessianGrad, M-TextGrad, TextGrad, and CoT of their iterative processes for a representative example from each of the three tasks (prompt optimization, solution optimization, and code optimization) to demonstrate the effect of "HessianGrad Escaping Local Optima" as shown in Figure 1.
  • Calculating HessianGrad typically requires more computational resources. Could you provide a detailed comparison of computational resources (GPU memory, runtime) for HessianGrad versus the baseline methods across all three tasks?
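
For concreteness, the AdamW-style weighting suggested in the first theoretical point could be written as

$$
\tilde{g}_t = \beta_1 \, \frac{\partial \mathcal{L}(r(p_t))}{\partial p_t} + \beta_2 \, \mathcal{S}\bigl(r(p_t), r(p_{t-1})\bigr),
$$

where $\tilde{g}_t$ denotes the combined textual update and $\beta_1 = \beta_2 = 1$ recovers the unweighted combination described above. This is one reading of the suggestion, not the paper's formulation.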

Questions

  1. In general computational frameworks, HessianGrad is computed by adding a small quantity in the iteration direction and recalculating the gradient once, followed by taking the finite difference of the gradients. From a practical standpoint, can the direct finite difference version of similarity $\mathcal{S}(r(p_t), r(p_{t-1})) = \frac{||\mathcal{L}(r(p_t)) - \mathcal{L}(r(p_{t-1}))||}{||p_t - p_{t-1}||}$ save computational costs (since $\mathcal{L}(r(p_{t-1}))$ and $p_{t-1}$ are both values computed from the previous iteration) and achieve similar effects?
  2. There is a typo in Equation (3) in Section 3. The second-order partial derivative should be denoted as $\frac{\overset{\sim}{\partial}^2\mathcal{L}}{\overset{\sim}{\partial}p_t^2}$ rather than $\frac{\overset{\sim}{\partial}^2\mathcal{L}}{\overset{\sim}{\partial}^2p_t}$.

Comment

4. Calculating HessianGrad typically requires more computational resources. Could you provide a detailed comparison of computational resources (GPU memory, runtime) for HessianGrad versus the baseline methods across all three tasks.

We appreciate the reviewer’s suggestion. We have added the detailed comparison of computational resources for HessianGrad versus baseline methods across all three tasks. Please refer to Section 4.5 for the full result table. (The text shown in blue color in the revised manuscript highlights our revisions.)

We observe that while HessianGrad involves slightly higher per-iteration runtime due to its second-order optimization-inspired design, it converges in fewer iterations, resulting in significant overall savings. The detailed findings are as follows:

  • On the Object Counting dataset, HessianGrad reduces total runtime by 50% compared to TextGrad by converging in fewer iterations despite slightly higher per-iteration costs.
  • For the solution optimization task, HessianGrad achieves 26.14% lower total runtime than TextGrad, while M-TextGrad incurs 77.65% higher runtime due to instability.
  • For the code optimization task, HessianGrad reduces total runtime by 16.67% compared to baselines.
  • For GPU memory usage, HessianGrad demonstrates similar requirements to baseline methods, indicating no significant increase in computational resources.

These findings demonstrate that, despite higher per-iteration runtime, HessianGrad’s faster convergence ensures practical efficiency and stability, making it computationally advantageous across tasks.

5. In general computational frameworks, HessianGrad is computed by adding a small quantity in the iteration direction and recalculating the gradient once, followed by taking the finite difference of the gradients. From a practical standpoint, can the direct finite difference version of similarity $\mathcal{S}$ save computational costs (since $\mathcal{L}(r(p_{t-1}))$ and $p_{t-1}$ are both values computed from the previous iteration) and achieve similar effects?

We thank the reviewer for this insightful suggestion. We would like to clarify that recalculating gradients using finite differences is not applicable within HessianGrad’s framework. HessianGrad operates without explicit numerical gradients or directional updates, relying instead on the LLM’s implicit processing capabilities to handle updates.

This design choice ensures computational efficiency by avoiding additional recalculations while still achieving second-order optimization-inspired effects. Empirical results validate the effectiveness of this approach, demonstrating that HessianGrad delivers robust optimization outcomes without requiring finite difference methods.
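
For context, the construction described in the question — perturbing along the update direction and differencing the gradients — is the standard finite-difference approximation of a Hessian-vector product in numerical optimization:

$$
H(p_t)\, v \approx \frac{\nabla \mathcal{L}(p_t + \epsilon v) - \nabla \mathcal{L}(p_t)}{\epsilon}, \qquad 0 < \epsilon \ll 1.
$$

HessianGrad does not form this quantity: the iterates $p_t$ are prompts in text, so neither $\nabla \mathcal{L}$ nor the perturbation $p_t + \epsilon v$ has a direct numerical counterpart in the framework.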

6. There is a typo in Equation (3) in Section 3.

We thank the reviewer for the careful reading. We have updated it in the revision.

Comment

Thank you for your careful reading and thoughtful reviews. Let us address your comments below.

1. When defining the similarity function of the prompt before and after each iteration, you do not multiply it by any hyperparameters and directly add it to the original TextGrad to obtain the HessianGrad. Consider adding two hyperparameters, $\beta_1$ and $\beta_2$, to the two terms of HessianGrad, similar to AdamW. Could you conduct an ablation study on the effect of adding these hyperparameters, or provide justification for why they were not included in the current formulation?

We thank the reviewer for the insightful suggestion regarding the inclusion of hyperparameters $\beta_1$ and $\beta_2$ in HessianGrad. We would like to clarify that HessianGrad operates entirely within the textual optimization domain, where updates are implicitly handled by the LLM. Unlike numerical optimizers such as AdamW, HessianGrad does not involve explicit numerical calculations or parameterized updates. Introducing hyperparameters like $\beta_1$ and $\beta_2$ would require determining their magnitudes and fine-tuning them, which could introduce subjectivity and complexity to the framework.

Our approach avoids explicit hyperparameters because HessianGrad leverages the LLM’s contextual understanding to implicitly balance feedback (TextGrad) with response trajectory tracking. Empirical results demonstrate that HessianGrad achieves effective optimization in its current formulation, validating the framework’s robustness without requiring additional hyperparameters.

2. You use the $L_1$ norm to define $\mathcal{S}$; however, when measuring semantic similarity, cosine similarity is more commonly used, as it signifies that the loss $\mathcal{L}$ before and after iteration is closer in direction, while $L_1$ primarily signifies that $\mathcal{L}$ is closer in numerical value. It is recommended to provide a rationale for choosing the $L_1$ norm as a similarity metric. (Simple reasons are acceptable, such as "the $L_1$ norm is easier to compute".) Could you compare the performance of your method using the $L_1$ norm versus cosine similarity, or provide empirical justification for why the $L_1$ norm was chosen over other common similarity metrics?

We appreciate the reviewer’s question regarding the use of the $L_1$ norm versus cosine similarity. HessianGrad does not explicitly compute numerical similarity metrics, such as $L_1$ or cosine similarity, as all comparisons and updates are implicitly handled within the LLM’s textual framework. This design eliminates the need for explicit numerical calculations, and as a result, a direct performance comparison between $L_1$ and cosine similarity is not applicable to our implementation.

In the theoretical framework, we chose the $L_1$ norm to describe $\mathcal{S}$ for its conceptual clarity and simplicity. The $L_1$ norm intuitively represents the magnitude of change in textual loss across iterations, which aligns with the goal of tracking response evolution. On the other hand, cosine similarity focuses on directional alignment, which is less relevant in the context of textual optimization tasks, where maintaining semantic and contextual coherence is a higher priority.
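
To illustrate the distinction between the two metrics, the following sketch contrasts an $L_1$-style magnitude of change with cosine similarity. The numbers, helper names, and the use of scalar scores and embeddings are hypothetical; the framework itself does not compute either quantity explicitly.

```python
import numpy as np

# Hypothetical numeric stand-ins; HessianGrad performs comparisons in text via the LLM.

def l1_change(loss_prev: float, loss_curr: float) -> float:
    """L1-style magnitude of change in a (proxy) textual-loss score between iterations."""
    return abs(loss_curr - loss_prev)

def cosine_similarity(e_prev: np.ndarray, e_curr: np.ndarray) -> float:
    """Directional agreement between (hypothetical) embeddings of successive evaluations."""
    return float(e_prev @ e_curr / (np.linalg.norm(e_prev) * np.linalg.norm(e_curr)))

# The two metrics answer different questions:
# - l1_change: "how much did the evaluation change?" (magnitude of evolution)
# - cosine_similarity: "did it change in a similar direction?" (alignment)
print(l1_change(0.62, 0.48))                                          # ~0.14 -> noticeable movement
print(cosine_similarity(np.array([1.0, 0.2]), np.array([0.9, 0.4])))  # close to 1 -> aligned
```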

We hope this explanation clarifies our rationale and are happy to provide further details if needed.

3. Please provide the loss curves for HessianGrad, M-TextGrad, TextGrad, and CoT of their iterative processes for a representative example from each of the three tasks (prompt optimization, solution optimization, and code optimization) to demonstrate the effect of "HessianGrad Escaping Local Optima" as shown in Figure 1.

We appreciate the reviewer's suggestion to provide the loss curves to demonstrate the simulation of second-order effects in HessianGrad. We have added an empirical analysis of loss curves on solution optimization tasks across three datasets—Google-proof QA, MMLU-Machine Learning, and MMLU-College Physics. Please refer to the blue part in Section 4.2 for details (the text shown in blue color in the revised manuscript is our revision).

We observed the following key effects:

  • Escaping Local Optima: As shown in Figure 2 (b), HessianGrad surpasses performance plateaus through cumulative response dynamics.
  • Stabilizing Updates: Unlike baselines that exhibit oscillations and instability (e.g., M-TextGrad in Figure 2 (c)), HessianGrad achieves smoother optimization trajectories.
  • Improved Performance in Challenging Scenarios: Proxy loss curves in all figures highlight HessianGrad’s ability to navigate meaningful adjustments, achieving higher test accuracy over iterative refinements.

We believe these results validate that the second-order effects are simulated and operationalized by the framework’s design, demonstrating the effect of "HessianGrad Escaping Local Optima" as shown in Figure 1.

Official Review
Rating: 3

Existing automatic optimization methods focus only on immediate feedback, which can easily become trapped in local optima. HessianGrad is analogous to second-order derivative methods, taking the feedback over multiple iterations into account. Experimental results show consistent gains for HessianGrad in prompt optimization, solution refinement, and code optimization.

Strengths

  1. The core idea addresses a critical limitation of iterative/reflective methods that only focus on immediate feedback.

  2. This work covers a wide array of recent literature, and the presentation is clear.

Weaknesses

  1. Besides the analogy provided in Eq. 2 and Eq. 3, the core implementation of HessianGrad is to include a meta-prompt that encourages LLM to reflect over multiple turns as shown in Pg. 12. However, it is arguable whether this is truly analogous to the real Hessian matrix that the work is trying to deliver. Moreover, whether LLM is able to capture the second-order phenomenon is also questionable and lacks justification in this work.

  2. The actual technical contribution of this work is to provide a more refined meta-prompt to the original TextGrad’s meta-prompt. First, the contribution of the refined meta-prompt appears to be limited. Second, the OPRO work [1] has also included similar meta-prompts to reflect over multiple turns by feeding the iterative optimization trajectory into the context of LLM. Therefore, the novelty of this work is limited and appears more as an incremental improvement on TextGrad.

  3. The selected baselines in the main experiments are also questionable. Most baselines are TextGrad and M-TextGrad. However, for each task, competitive baselines, e.g., ProTeGi [2] in prompt optimization, are not compared.

Questions

  1. Could you provide justifications for the difference between your optimizer prompt and OPRO’s meta-prompt [1] for prompt optimization?

[1] Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q. V., Zhou, D., & Chen, X. (2023). Large Language Models as Optimizers (No. arXiv:2309.03409). arXiv. http://arxiv.org/abs/2309.03409

[2] Pryzant, R., Iter, D., Li, J., Lee, Y. T., Zhu, C., & Zeng, M. (2023). Automatic Prompt Optimization with “Gradient Descent” and Beam Search (No. arXiv:2305.03495). arXiv. http://arxiv.org/abs/2305.03495

Comment

4. Second, the OPRO work [1] has also included similar meta-prompts to reflect over multiple turns by feeding the iterative optimization trajectory into the context of LLM. Could you provide justifications for the difference between your optimizer prompt and OPRO’s meta-prompt [1] for prompt optimization?

We appreciate the reviewer’s request for clarification. Below, we detail the differences between HessianGrad’s optimizer prompt and OPRO’s meta-prompt:

  1. Task Specificity

OPRO’s meta-prompt is task-specific, designed exclusively for prompt optimization. For instance, it instructs the model to generate instructions with higher scores by feeding solution-score pairs into the LLM:

Generate an instruction that is different from all the instructions <INS> above and has a higher score than all the instructions <INS> above.

In contrast, HessianGrad’s optimizer prompt is task-agnostic and explicitly designed to iteratively refine any variable (not limited to prompt optimization) by modeling response evolution. For example:

Focus on adjusting the variable in a way that each step introduces thoughtful, measured changes based on past iterations, rather than drastic shifts.

This task-agnostic design enables HessianGrad to handle a broader range of textual optimization tasks, such as solution optimization and code generation, beyond just prompt optimization.

  2. Explicit Modeling of Iterative Dynamics

While OPRO’s meta-prompt incorporates solution-score pairs into the context, reflecting an iterative process, it does not explicitly analyze response evolution dynamics across iterations. Instead, OPRO relies on mechanisms like temperature tuning to balance exploration and refinement, which introduces stochasticity and potential instability in updates.

HessianGrad, on the other hand, explicitly tracks how responses evolve across iterations using a <PAST_ITERATIONS> section:

Reflect on how the responses to this variable have evolved across iterations: <PAST_ITERATIONS>{past_values}</PAST_ITERATIONS>.

This explicit modeling allows HessianGrad to refine updates based on cumulative trends, stabilizing optimization or escaping stagnation as needed. By integrating response trajectory analysis and textual gradients, HessianGrad achieves a deterministic balance between refinement and exploration, ensuring stability and coherence.
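
To make the mechanism concrete, here is a minimal sketch of how such an optimizer prompt could be assembled. Only the two quoted instructions come from the excerpts above; the surrounding template, the <VARIABLE> and <FEEDBACK> tags, and the function name are illustrative assumptions, not the paper's actual meta-prompt (given on Pg. 12 of the paper).

```python
# Sketch only: the template layout, extra tags, and function name are assumptions.

def build_optimizer_prompt(variable: str, feedback: str, past_responses: list) -> str:
    """Assemble an optimizer prompt that exposes the response trajectory to the LLM."""
    past_values = "\n---\n".join(
        f"Iteration {i}: {resp}" for i, resp in enumerate(past_responses, start=1)
    )
    return (
        f"Here is the current variable to optimize:\n<VARIABLE>{variable}</VARIABLE>\n\n"
        f"Here is the feedback on the latest response:\n<FEEDBACK>{feedback}</FEEDBACK>\n\n"
        "Reflect on how the responses to this variable have evolved across iterations: "
        f"<PAST_ITERATIONS>{past_values}</PAST_ITERATIONS>\n\n"
        "Focus on adjusting the variable in a way that each step introduces thoughtful, "
        "measured changes based on past iterations, rather than drastic shifts."
    )

# Example call (contents are placeholders):
print(build_optimizer_prompt(
    variable="Count the objects and answer with a number.",
    feedback="The answer skipped two items; enumerate each object explicitly.",
    past_responses=["There are 7 objects.", "There are 9 objects: ..."],
))
```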

  3. Fine-Grained Output Examples

In prompt optimization tasks, OPRO generates concise instructions such as:

Let’s solve this problem step-by-step.

While effective for basic tasks, these outputs lack the detailed, structured reasoning necessary for more complex scenarios. In contrast, HessianGrad generates fine-grained, structured prompts tailored to the task’s complexity. For example, in BBH_Object Counting, HessianGrad produces:

You will answer a reasoning question. If the task is complex or unclear, begin by restating it to confirm understanding. Explicitly list each item and its quantity in a structured format like .... Ensure clarity by stating the number of each type of object before summing them up. Perform calculations in ..., explaining the addition process ..., Verify your results by ..., and consider using ... for confirmation. Provide the final answer in .... Avoid unnecessary details and ensure ... relevant to the question.

This level of detail showcases HessianGrad’s ability to refine prompts iteratively, ensuring higher-quality outputs that go beyond the capabilities of OPRO’s meta-prompt.

5. The selected baselines in the main experiments are also questionable. Most baselines are TextGrad and M-TextGrad. However, for each task, competitive baselines, e.g., ProTeGi [2] in prompt optimization, are not compared.

We appreciate the reviewer’s feedback regarding the inclusion of additional competitive baselines. ProTeGi primarily focuses on prompt optimization and does not explicitly address optimizing prompts for smaller models using feedback from larger models. To adapt ProTeGi for comparison, we followed this approach:

  • Optimization Engine: We used GPT-4o as the optimization engine to perform inference on the training set.
  • Few-Shot Example Selection: High-quality inference examples from GPT-4o were selected as few-shot examples for the smaller model, Llama-3.1-8B.
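
A minimal sketch of this adaptation is shown below; the model clients, data schema, and the simple correctness filter standing in for "high-quality" selection are assumptions, not the code used in the experiments.

```python
# Sketch only: strong_model / weak_model clients and the example schema are hypothetical.

def adapt_protegi(train_set, strong_model, weak_model, k=4):
    """Use a strong model's inferences to build few-shot examples for a weaker model."""
    # 1. Optimization engine: run the strong model (e.g., GPT-4o) on the training set.
    candidates = []
    for example in train_set:
        answer = strong_model.generate(example["question"])
        if answer.strip() == example["gold"].strip():   # keep only correct inferences
            candidates.append((example["question"], answer))

    # 2. Few-shot selection: use k kept pairs as in-context demonstrations
    #    for the smaller model (e.g., Llama-3.1-8B).
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in candidates[:k])

    def solve(question):
        return weak_model.generate(f"{shots}\n\nQ: {question}\nA:")

    return solve
```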

We incorporated ProTeGi into the prompt optimization task. The results are shown in the following table:

Accuracy % (Improv. over TextGrad)

Dataset            CoT     ProTeGi   TextGrad   M-TextGrad   HessianGrad
Object Counting    65.0    68.0      77.0       80.0         83.0
GSM8k              84.6    84.6      84.6       84.6         84.6

We observe that on GSM8k with Llama-3.1, both ProTeGi and HessianGrad show limited performance gains, likely due to model saturation. However, on the Object Counting dataset, HessianGrad outperforms ProTeGi by effectively leveraging feedback from stronger models like GPT-4o, demonstrating superior optimization for weaker, cost-effective models such as GPT-3.5-turbo-0125.

Comment

Thank you for your careful reading and thoughtful reviews. Let us address your comments below.

1. Besides the analogy provided in Eq. 2 and Eq. 3, the core implementation of HessianGrad is to include a meta-prompt that encourages LLM to reflect over multiple turns as shown in Pg. 12. However, it is arguable whether this is truly analogous to the real Hessian matrix that the work is trying to deliver.

Thanks for noticing this. First of all, it is important to clarify that the HessianGrad framework introduced in our work is not a numerical computation of the Hessian matrix, as seen in traditional optimization. Instead, it represents a textual analogy within the framework of textual optimization. Our goal is not to compute second-order derivatives in the mathematical sense but to simulate second-order effects in iterative response optimization, such as capturing cumulative response dynamics and improving stability.

Moreover, we believe the characterization of HessianGrad as merely a meta-prompt for reflection oversimplifies its core implementation. HessianGrad explicitly integrates response trajectory tracking via a similarity function, combining the structured first-order feedback (TextGrad) with trajectory insights. This enables it to identify stagnation or instability and refine responses effectively, distinguishing it from basic prompting strategies.

2. Moreover, whether LLM is able to capture the second-order phenomenon is also questionable and lacks justification in this work.

We appreciate the opportunity to clarify the role of HessianGrad in achieving second-order effects. HessianGrad does not rely on the intrinsic ability of LLMs to numerically compute second-order derivatives. Instead, we believe these effects are achieved through the framework’s design, which integrates structured feedback and trajectory tracking to simulate second-order-inspired behavior in textual optimization. This design enables HessianGrad to capture cumulative response dynamics and guide stable, thoughtful updates.

To address the concern about justification, we think empirical evidence supports the practical realization of second-order-inspired behavior. We have added an empirical analysis of loss curves on solution optimization tasks across three datasets—Google-proof QA, MMLU-Machine Learning, and MMLU-College Physics. Please refer to the blue part in Section 4.2 for details (the text shown in blue color in the revised manuscript is our revision).

We observed the following key effects:

  • Escaping Local Optima: As shown in Figure 2 (b), HessianGrad surpasses performance plateaus through cumulative response dynamics.
  • Stabilizing Updates: Unlike baselines that exhibit oscillations and instability (e.g., M-TextGrad in Figure 2 (c)), HessianGrad achieves smoother optimization trajectories.
  • Improved Performance in Challenging Scenarios: Proxy loss curves in all figures highlight HessianGrad’s ability to navigate meaningful adjustments, achieving higher test accuracy over iterative refinements.

We believe these results validate that the second-order effects are simulated and operationalized by the framework’s design, rather than being dependent on the LLM’s intrinsic numerical second-order capabilities.

We welcome further discussion and feedback to improve this understanding.

3. The actual technical contribution of this work is to provide a more refined meta-prompt to the original TextGrad’s meta-prompt. First, the contribution of the refined meta-prompt appears to be limited. The novelty of this work is limited and appears more as an incremental improvement on TextGrad.

We appreciate the reviewer’s perspective and would like to clarify that HessianGrad goes beyond being a refined meta-prompt for TextGrad. In our view, the primary technical contribution of HessianGrad lies in its explicit focus on tracking response dynamics across iterations. While TextGrad relies primarily on immediate feedback to guide updates, HessianGrad models response evolution trends over time, enabling thoughtful adjustments that are essential for escaping stagnation and ensuring stability during optimization.

Although the framework is inspired by second-order principles, we believe the novelty of HessianGrad lies in leveraging response trajectory analysis as a practical and scalable alternative to conventional second-order techniques. This approach is specifically tailored for textual optimization tasks and addresses challenges that TextGrad alone may not fully resolve.

AC Meta-Review

The paper introduces HessianGrad, a new optimization method inspired by second-order derivatives, to improve textual optimization in LLMs. It focuses on leveraging response trajectory tracking to refine outputs iteratively. While the idea is promising and shows empirical improvements over baselines, the reviewers raised significant concerns. These include limited novelty, as the method appears to refine existing meta-prompt strategies without substantial innovation. The theoretical justification for using response similarity as a proxy for second-order optimization is weak and lacks rigor. Additionally, the experimental evaluation is narrow, with insufficient benchmarks and limited exploration of computational trade-offs, such as runtime and scalability. I recommend rejection.

Additional Comments from the Reviewer Discussion

During the rebuttal, the authors clarified that HessianGrad is not a numerical second-order derivative but a response tracking framework, addressing concerns about novelty and theoretical rigor. They provided additional experiments, including comparisons with ProTeGi, and demonstrated computational efficiency through faster convergence. These responses partially address the reviewers' concerns.

Final Decision

Reject