PaperHub

Overall score: 6.8/10
Poster · 4 reviewers (ratings: 5, 4, 4, 4; min 4, max 5, std 0.4)
Average confidence: 3.3
Novelty: 2.5 · Quality: 2.8 · Clarity: 3.0 · Significance: 2.8

NeurIPS 2025

Where and How to Perturb: On the Design of Perturbation Guidance in Diffusion and Flow Models

OpenReview | PDF
Submitted: 2025-05-07 · Updated: 2025-10-29
TL;DR

We analyze head-level attention perturbation guidance for improving image quality and style in diffusion and flow models, and additionally introduce a generalized attention perturbation framework.

Abstract

Keywords
diffusion models, flow matching models, diffusion guidance, generative models

Reviews and Discussion

Review (Rating: 5)

This work aims to enhance the performance of image diffusion models in terms of overall fidelity, diversity, prompt alignment, and user-specified criteria through attention perturbation guidance. It builds on two key observations: (1) how to perturb—there is a trade-off in selecting the optimal perturbation strength; and (2) where to perturb—even within the same attention layer, different attention heads vary in their effectiveness for improving specific image generation criteria. To address these challenges, the authors propose SoftPAG and HeadHunter, two methods that improve image generation performance across various metrics by scaling inference-time computation to search for optimal perturbation scales and attention heads. The proposed methods demonstrate effectiveness when applied to SD3 and FLUX models.

Strengths and Weaknesses

Strengths:

  • The observation that different MM-DiT attention heads correspond to distinct geometric or appearance-related attributes is both novel and insightful. This could potentially benefit downstream tasks such as zero-shot style mixing and transfer.
  • The paper conducts a thorough analysis of the effects of varying perturbation strengths and attention heads, evaluated across a wide range of metrics, offering a comprehensive understanding of the method’s impact.
  • The proposed method demonstrates strong generalizability across different MM-DiT-based models, showcasing its robustness and adaptability.

Weaknesses:

  • Although searching over attention heads and the interpolation parameter improves performance compared to attention-layer-level search, it introduces significant computational overhead, particularly given the scope of current SOTA image generators (e.g., FLUX1-dev contains 1,368 attention heads). The authors are encouraged to quantify the number of function evaluations (NFEs) and compare them with those required by PAG and SEG, or alternatively, benchmark performance under equal inference-time budgets.
  • In Figures 1 and 2, a larger $u$ appears to correspond to weaker attention-weight perturbations, which is inconsistent with the equations provided in the paper, i.e., $f_I(\mathbf{A}, u)=u \mathbf{A}+(1-u) \mathbf{I}$ and $f_U(\mathbf{A}, u)=u \mathbf{A}+(1-u) \mathbf{U}$. The authors should clarify this apparent discrepancy.
  • As the core contribution centers on improving image generation quality through increased inference-time computation, it is important to more clearly articulate how this approach relates to, and differs from, existing inference-time scaling methods, particularly the mainstream methods that involve search over the initial noise space [1].

[1] Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps.

Questions

It is impressive that the FID scores improve monotonically with increased inference-time computation during attention head search, as shown in Figure 7(b). However, a potential concern is that the specific verifier used for guiding the search may be biased toward optimizing a single criterion, potentially at the expense of other important dimensions. To address this, it would be helpful to examine whether the observed performance gains in FID are also reflected in other metrics, such as Inception Score (IS), textual alignment or aesthetic score.

Limitations

Yes.

Final Justification

Most of my concerns have been addressed. I agree that the ability to generalize to new prompts differentiates the proposed method from previous test-time alignment methods. Thus, I raise my score to accept.

Formatting Issues

I did not notice any major formatting issues.

Author Response

We sincerely appreciate your time and effort.

[W1] The authors are encouraged to quantify the number of function evaluations (NFEs) and compare them with those required by PAG and SEG, or alternatively, benchmark performance under equal inference-time budgets.

[A1] Thank you for raising this important point. We agree that reporting the number of function evaluations (NFEs) is crucial for understanding the efficiency of the proposed method. However, we would like to clarify our interpretation of the question. It is unclear whether the reviewer refers to the NFEs required to generate a single sample, or to the NFEs involved in searching for heads under a specific objective. We interpret it as the latter and respond accordingly. Please kindly correct us if we misunderstood.

The NFEs for sampling a single image are identical to those of existing layer-level methods employed in PAG and SEG. The difference lies in the search phase, where we identify which heads to perturb for a given objective.

As noted in Appendix D.3.2, we consider the following parameters to compute the NFEs:

  • $T$: total number of search iterations
  • $M$: number of prompt-seed pairs
  • $N$: total number of attention heads in the model
  • $S$: number of denoising steps

In total, our framework requires $T \cdot M \cdot N \cdot (2S+1)$ NFEs, accounting for two forward passes per denoising step for guidance and one NFE for the verifier. In our general quality improvement setting, we use $M = 20$, $T = 1$, and in the style-oriented setting, $M = 5$, $T = 5$. Both use $N = 24 \times 24$ heads and $S = 20$ steps. On an 8×H100 GPU node, each iteration takes roughly 36 minutes. We emphasize that this one-time cost is amortized due to the strong generalizability of the selected heads across new prompts. Please refer to response [A2] to reviewer X5tq for further discussion on generalization to unseen prompts.
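For concreteness, plugging in these values gives $1 \cdot 20 \cdot 576 \cdot 41 = 472{,}320$ NFEs for the general-quality search and $5 \cdot 5 \cdot 576 \cdot 41 = 590{,}400$ NFEs for the style-oriented search (our arithmetic, with $N = 576$ and $2S+1 = 41$).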

Regarding comparison with prior methods such as PAG and SEG, we note that these methods do not involve any search procedure. They apply perturbation to pre-selected layers, typically chosen heuristically (e.g., the bottleneck layer of the U-Net). Thus, there is no defined NFE cost for perturbation-target selection in those methods, making a direct comparison with HeadHunter infeasible.

In contrast, one of our core contributions is to introduce a principled search framework for selecting perturbation targets based on arbitrary objectives. While we could implement a version of our framework that searches over layers instead of heads, this would still be a comparison between our own baselines, not against the original PAG/SEG methods.

We note that searching over layers would be approximately $H$ times faster per iteration than searching over heads, where $H$ is the total number of attention heads. However, because layers are a much coarser perturbation unit, the resulting quality is typically suboptimal. We present results comparing layer-level and head-level search below to illustrate this trade-off. For a fair comparison, we set $u = 0.125$ to match the relative influence of the added perturbation unit, as each head-level iteration adds $k = 3$ heads (approximately 1/8 of the heads in a single layer). We also set $T$ to the maximum possible value, equal to the total number of layers.

Style-Oriented Quality Improvement across Different Perturbation Targets:

| Search level / metric | HPSv2 | PickScore | ImageReward | AES | CLIPScore |
| --- | --- | --- | --- | --- | --- |
| Layer-level search | 0.2787 | 21.59 | 1.10 | 5.93 | 25.93 |
| Head-level search | 0.3540 | 24.55 | 1.74 | 6.65 | 32.27 |

The results show that head-level perturbation guidance outperforms our baseline (PAG with principled search), suggesting that individual heads serve as distinct and fine-grained units contributing to specific visual attributes. We will include these results and the discussion on NFEs in the revised manuscript.

[W2] discrepancy between equations and figures.

[A2] We’re sorry for the confusion. We will revise the equations to $f_I(\mathbf{A},u) = (1-u)\mathbf{A} + u\mathbf{I}$ and $f_U(\mathbf{A},u) = (1-u)\mathbf{A} + u\mathbf{U}$ to make them consistent with the figures.
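For clarity, a minimal sketch of this interpolated perturbation as it would be applied inside an attention layer (illustrative tensor shapes and names, not our actual implementation):

```python
import torch

def perturbed_attention(q, k, v, u=0.5, mode="identity"):
    """Interpolated attention perturbation (sketch).

    q, k, v: (batch, heads, tokens, dim). u=0 keeps the original map A;
    u=1 fully replaces it with the identity map (PAG-style) or the
    uniform map (SEG-style limit)."""
    scale = q.shape[-1] ** -0.5
    A = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)  # row-stochastic map

    n = A.shape[-1]
    if mode == "identity":
        target = torch.eye(n, device=A.device).expand_as(A)  # lowest-entropy map I
    else:
        target = torch.full_like(A, 1.0 / n)                  # uniform map U

    A_pert = (1 - u) * A + u * target   # f_I / f_U from the corrected equations
    return A_pert @ v
```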

[W3] It is important to more clearly articulate how this approach relates to, and differs from, existing inference-time scaling methods, particularly the mainstream methods that involve search over the initial noise space [1].

[A3] Thank you for raising this important point. While both our method and noise-space inference-time scaling share the goal of improving image quality without retraining, there is a key distinction in generalization.

Noise search aims to find an optimal initial noise that yields the best result for a specific prompt under a given objective (e.g., verifier score). However, this search must be repeated from scratch for every new prompt. In contrast, our method identifies a set of perturbation heads that maximize a given objective across a small set of prompts, and crucially, this configuration generalizes well to new prompts. As shown in Appendix D.3.2 (Tab. 2 and Fig. 26), the retrieved head sets improve image quality for unseen prompts (in the general-improvement setting) as well as for unseen content prompts (in the style-oriented setting).

This generalization makes our method more practical. For example, model providers or users can share head sets optimized for general quality, preference tuning or specific styles (e.g., "anime" or "line art"), similar to how LoRA weights are shared in the community. Users can then directly apply these shared configurations without any additional optimization or cost.

Moreover, our method is orthogonal to noise search and can be combined with it. Users may first apply HeadHunter to retrieve meaningful heads and then perform inference-time noise search for further improvements.

Although our method is optimized using the average objective across prompts (to enable generalization), and noise search is optimized per prompt, we present a comparison below for completeness. For a fairer comparison, one could also optimize our method on a single prompt, similar to [1], to obtain prompt-specific optimal performance. However, this would deviate from our goal of learning generalizable guidance configurations, so we do not adopt that setting. This comparison uses the same number of NFEs for both methods, evaluated under the style-oriented quality improvement setting on styles like "sunlit" and "spring". PickScore is used as the main objective for both.

HeadHunter:

| Round | HPSv2 | PickScore | ImageReward | AES | CLIPScore |
| --- | --- | --- | --- | --- | --- |
| 0 | 0.2868 | 22.51 | 1.51 | 5.42 | 25.29 |
| 1 | 0.3309 | 23.59 | 1.83 | 5.98 | 30.38 |
| 2 | 0.3380 | 24.44 | 1.98 | 6.48 | 30.72 |
| 3 | 0.3404 | 24.24 | 1.86 | 6.74 | 32.23 |
| 4 | 0.3587 | 24.70 | 1.99 | 6.61 | 31.43 |
| 5 | 0.3540 | 24.55 | 1.74 | 6.65 | 32.27 |

Initial noise search (random search algorithm):

| Round (same NFEs) | HPSv2 | PickScore | ImageReward | AES | CLIPScore |
| --- | --- | --- | --- | --- | --- |
| 0 | - | - | - | - | - |
| 1 | 0.3444 | 25.56 | 1.98 | 6.38 | 33.32 |
| 2 | 0.3181 | 25.51 | 1.94 | 6.25 | 30.34 |
| 3 | 0.3494 | 25.61 | 2.15 | 6.09 | 31.77 |
| 4 | 0.3541 | 25.41 | 1.84 | 6.25 | 31.59 |
| 5 | 0.3466 | 25.62 | 2.29 | 6.31 | 31.45 |

While noise search achieves higher PickScore on a per-prompt basis, our method achieves comparable or better performance in other metrics (e.g., HPSv2, ImageReward, AES, CLIP Score), and crucially, generalizes to new prompts. This demonstrates the practicality and robustness of our approach. We will include this comparison and discussion in the revised manuscript.

[Q1] A potential concern is that the specific verifier used for guiding the search may be biased toward optimizing a single criterion, potentially at the expense of other important dimensions. To address this, it would be helpful to examine whether the observed performance gains in FID are also reflected in other metrics, such as Inception Score (IS), textual alignment or aesthetic score.

[A4] Thank you for raising this important point. Verifier hacking, or overfitting to a single objective at the expense of other desirable qualities, is a known challenge when optimizing with a single verifier. Previous work on inference-time scaling in noise space has addressed this by using an ensemble of verifiers. Our framework is fully compatible with such ensembles and can easily incorporate multiple objectives if needed.

Unlike gradient-based optimization methods that are often more susceptible to reward hacking, our method does not rely on gradients. Instead, it performs a combinatorial search over the model’s internal structure, which helps it remain more robust against bias in the objective.

Importantly, although we optimize for a single objective such as PickScore, we observe consistent improvements in other metrics, including Inception Score (IS), CLIPScore, and ImageReward. This suggests that our method leads to general image quality improvement rather than overfitting to a specific metric.

General quality improvement (Fig. 7):

| Metric / $k$ | 3 | 6 | 9 | 12 | 15 | 18 | 21 | 24 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FID | 58.79 | 55.02 | 53.04 | 53.08 | 52.52 | 51.34 | 51.90 | 52.52 |
| IS | 19.48 | 20.47 | 22.14 | 21.62 | 22.69 | 23.40 | 23.93 | 24.09 |
| PickScore | 20.91 | 21.21 | 21.45 | 21.55 | 21.67 | 21.94 | 22.00 | 22.07 |
| CLIPScore | 24.40 | 24.78 | 25.03 | 25.08 | 25.15 | 25.29 | 25.26 | 25.32 |
| ImageReward | -0.09 | 0.11 | 0.26 | 0.30 | 0.33 | 0.51 | 0.54 | 0.59 |

Style-Oriented Quality Improvement (Sec. 4.2.2): Fig. 9 demonstrates that optimizing only PickScore still leads to better performance across other metrics, showing our method’s robustness.

Comment

Thank you very much for taking the time to provide a thorough and thoughtful review. Your feedback helps us a lot in improving our work!

We have tried our best to address your concerns in our response, but please don’t hesitate to let us know if you have any further questions. We’d be happy to discuss them during the discussion period. We look forward to hearing from you :)

Comment

I appreciate the authors' efforts in the rebuttal. Most of my concerns have been addressed, and I am raising my rating to accept.

Comment

We're glad to hear that your concerns have been addressed. Your comments and feedback have been very helpful in improving our paper, especially in clarifying HeadHunter's practical usability given its computational cost, and in articulating its connection to inference-time scaling along the noise axis.

If you have any further questions, please feel free to let us know. Thank you!

Review (Rating: 4)

This paper proposes a method for implementing finer-grained perturbations in modern large-scale text-to-image diffusion and flow models. Whereas existing methods typically perturb at the layer level, this paper targets the perturbation precisely at individual attention heads (HeadHunter). The experiments show that precise control of specific visual attributes (e.g., concept, style, image quality) in the generated image can be achieved by perturbing specific attention heads. This approach overcomes side effects such as over-smoothing or over-saturation that may result from existing perturbations, and provides a plug-and-play solution that enhances existing inference pipelines without requiring retraining.

Strengths and Weaknesses

Strength:

  1. The experimental designs and analysis are good.
  2. The method enhances existing inference pipelines without requiring retraining, making it highly practical and scalable; it can be applied directly to existing pre-trained large diffusion models (e.g., Stable Diffusion 3 and FLUX).
  3. Head-level guidance allows the user to perform highly precise and targeted manipulation of specific visual attributes of the resulting image.

Weaknesses: The work's claim of providing a simple approach to finer-grained targeting of styles or concepts without training is, in general, innovative, and there are sufficient visual results to support it. However, the explanations of why some strategies were chosen are relatively weak, and the parameter selection is unclear. For example, the interpolation parameter u provides intuitive control, but its theoretical optimality is not sufficiently analyzed. A “sweet spot” region is found experimentally, but this is more of an empirical discovery than a result of theoretical derivation.

(More specified in the “Question”).

Questions

  1. This paper proposes to view attention weights as probability distributions and to perform interpolation, which indeed offers a more unified and intuitive perspective for understanding PAG and SEG. However, this "unification" primarily refers to linear interpolation. The paper mentions the existence of more complex paths between probability distributions (e.g., Fisher-Rao or optimal transport geodesics); could these more complex geometric structures lead to superior perturbation strategies? Or why does simple linear interpolation prove to be highly effective in terms of mathematical principles?
  2. The basis for selecting the parameters "number of heads selected per round n" and "maximum number of iterations T" is not specified. How do these parameters affect search efficiency and the final results, and are there smarter search strategies (e.g., based on gradients or interpretability scores) that could replace or complement this trial-and-error generation-evaluation cycle?
  3. This paper notes that "adding more perturbed heads results in over-perturbation", which limits the number of heads that can be combined. HeadHunter, while finding good combinations, does not explicitly give a strategy for automatically determining the optimal number of combinations (i.e., when to stop adding heads), and still relies on observing performance curves. Does this rely too much on manual evaluation and weaken its generalization?
  4. In the appendix, each concept is associated with different heads in different layers (e.g., Dark and White), and in practice, when users want to optimize certain styles, they first need to run a search to find the corresponding layers and heads. However, for user applications that require rapid iteration, this strategy of iteratively generating and evaluating each individual head can become computationally prohibitive. What about the computational efficiency and scalability?

Limitations

Yes

Final Justification

The proposed method is innovative, and the authors' rebuttal successfully justifies their parameter choices and the methodology's rationale. However, as noted previously, the paper's theoretical underpinnings remain weak. Additionally, it is debatable whether the slight performance improvement in certain scenarios justifies the increased complexity from the large number of heads. For these reasons, I will maintain my original score as a borderline accept.

Formatting Issues

N/A

Author Response

We sincerely appreciate your careful evaluation of our manuscript, which helps us a lot to improve our work. We address your questions as follows.

[W1] The interpolated parameter u provides intuitive control, but its theoretical optimality is not sufficiently analyzed. A “sweet spot” region is found experimentally, but this is more of an empirical discovery than based on theoretical derivation.

[A1] We’d like to offer a theoretical explanation for why a “sweet spot” emerges, drawing on the framework introduced in autoguidance [1]. The autoguidance paper argues that weakening the model results in a smoother sample distribution that includes outlier or erroneous regions. By guiding away from the weaker model, Autoguidance avoids these undesirable regions and helps correct the model’s errors.

Perturbation guidance operates in a similar manner. By perturbing the original model’s attention distribution, we create a degraded variant that captures the erroneous regions. The difference between the original and perturbed outputs provides a meaningful guiding signal that can steer the sampling process away from those regions.

In this framework:

  • The perturbation method defines how the sample distribution is perturbed (i.e., which regions of the distribution are smoothed) and thus determines the direction of the guidance signal.
  • The perturbation strength defines how much the distribution is perturbed (i.e., the degree of difference between the original and perturbed scores) and also affects the direction.
  • The guidance scale $w$, in contrast, controls how far the sample moves in that direction and affects only the magnitude of the signal, not its direction.

Therefore, the perturbation method/strength plays a crucial role in shifting the sampling distribution to cover outliers/erroneous samples while avoiding overly-smoothed distribution, which would lead to ineffective guidance. This intuition highlights the importance of selecting an appropriate perturbation strength, as also observed in autoguidance, where slightly degraded models (e.g., S vs. XS) yield the best performance.
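To make the distinct roles of $u$ and $w$ concrete, below is a minimal sketch of a perturbation-guided prediction (illustrative function names and interface; in practice the update may additionally be combined with CFG):

```python
def guided_prediction(model, x_t, t, heads, u=0.5, w=3.0):
    """One guided prediction (sketch). `model` returns the predicted
    noise/velocity; `perturb_heads` is an illustrative hook that applies
    f(A, u) = (1 - u) A + u I to the selected attention heads."""
    pred = model(x_t, t)                                   # original prediction
    pred_pert = model(x_t, t, perturb_heads=heads, u=u)    # degraded prediction

    # The chosen heads and u shape the *direction* (pred - pred_pert);
    # the guidance scale w only stretches the step along that direction.
    return pred + w * (pred - pred_pert)
```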

In our setting, the selection of attention heads defines the perturbation method (which visual attributes to perturb), and the interpolation coefficient $u$ controls the strength of the perturbation. Together, they determine the final guidance signal. This mechanism is fundamentally different from adjusting the guidance scale $w$, which only affects the signal's magnitude. This insight highlights the importance of controlling $u$. As shown below, tuning $u$ consistently yields better results:

PickScore comparison

| $w$ / $u$ | 0.25 | 0.5 | 0.75 | 1.0 |
| --- | --- | --- | --- | --- |
| 2.0 | 22.07 | 22.28 | 22.19 | 21.92 |
| 4.0 | 22.31 | 22.29 | 22.08 | 21.66 |
| 6.0 | 22.29 | 22.12 | 21.78 | 21.27 |

While we provide theoretical reasoning for the existence of such a sweet spot, we acknowledge that the precise optimal point is determined empirically. This is consistent with other methods: for example, AutoGuidance does not analytically derive an optimal weakening strength (e.g., via model size, EMA length, or training time), and the original CFG paper [2] offers a theoretical motivation for combining conditional and unconditional distributions, but identifies the optimal scale $w$ through empirical tuning.

We thank the reviewer for this important point and will include this discussion in the revised manuscript.

[Q1] The paper mentions the existence of more complex paths between probability distributions (e.g., Fisher-Rao or optimal transport geodesics); could these more complex geometric structures lead to superior perturbation strategies? Or why does simple linear interpolation prove to be highly effective in terms of mathematical principles?

[A2] Thank you for the question. The table below reports PickScore results on 100 prompts from the COCO validation set, comparing linear and Fisher-Rao interpolation paths. (In addition, we initially attempted to implement optimal transport (OT) geodesic interpolation, but found it difficult to implement efficiently)

| Interpolation method / $u$ | 0 | 0.2 | 0.4 | 0.6 | 0.8 | 1 |
| --- | --- | --- | --- | --- | --- | --- |
| Linear | 20.65 | 22.15 | 22.32 | 22.32 | 22.21 | 21.90 |
| Fisher-Rao | 20.65 | 21.80 | 22.34 | 22.30 | 22.03 | 21.90 |

We observe that Fisher-Rao interpolation provides slightly higher scores at certain $u$ values. However, it also complicates the implementation and may introduce numerical instability. Given its simplicity, stability, and competitive performance, we adopt linear interpolation in our main experiments. Nonetheless, users may still consider using more complex paths if slightly better quality is desired. We will include these results and discussion in the revised manuscript.
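For reference, the Fisher-Rao variant can be implemented roughly as follows (a sketch that treats each row of the attention map as a categorical distribution; this is one possible implementation, not necessarily the exact code we used):

```python
import torch

def fisher_rao_interpolate(A, target, u, eps=1e-8):
    """Fisher-Rao geodesic between row-stochastic maps A and target (sketch).

    Each row is mapped to the unit sphere via square roots, interpolated
    along the great circle (slerp), then squared back to a distribution."""
    a = A.clamp_min(eps).sqrt()
    b = target.clamp_min(eps).sqrt()
    cos = (a * b).sum(dim=-1, keepdim=True).clamp(-1 + eps, 1 - eps)
    theta = torch.acos(cos)                       # geodesic distance per row
    sin_theta = torch.sin(theta).clamp_min(eps)
    mixed = (torch.sin((1 - u) * theta) * a + torch.sin(u * theta) * b) / sin_theta
    return mixed.pow(2)
```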

[Q2, W3] About the parameter choices of HeadHunter

[A3] Thank you for raising this important point. For the number of heads selected per round, $k$ (we use $k$ consistently), smaller values allow a more accurate assessment of individual additive contributions. We adopt a greedy strategy that selects the head that improves the objective most when added. While $k=1$ provides the most precise search, it increases the number of rounds and the total cost. In practice, we found that $k=3$ and $T=5$ offer a good trade-off between efficiency and quality.

The total number of selected heads is controlled by $k \times T$. As shown in Fig. 9, the objective typically saturates within a few rounds, and adding too many heads can degrade quality. We empirically observed that selecting around 15–24 heads (approximately the number in a single layer) gives good results.

To avoid relying on manual tuning of $T$, we also implement an automatic stopping criterion based on the improvement of the objective between rounds. This is a common strategy in deep learning and worked robustly across prompts and models (SD3, FLUX), as shown in Section 4.2 and Appendix D.3.1, supporting its generalizability.
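For readers who prefer pseudocode, the selection loop can be summarized as follows (a schematic sketch with placeholder `sample_fn` and `verifier`; the full procedure is specified in Appendix D):

```python
def headhunter(all_heads, sample_fn, verifier, k=3, T=5, tol=1e-3):
    """Greedy head search (sketch). sample_fn(heads) generates images with the
    given heads perturbed; verifier scores them (e.g., PickScore)."""
    selected = []
    best = verifier(sample_fn(selected))
    for _ in range(T):
        # Score each remaining head when added to the current set, then keep
        # the k heads with the largest individual improvement this round.
        gains = {h: verifier(sample_fn(selected + [h]))
                 for h in all_heads if h not in selected}
        top_k = sorted(gains, key=gains.get, reverse=True)[:k]
        new_best = verifier(sample_fn(selected + top_k))
        if new_best - best < tol:   # automatic stopping criterion
            break
        selected += top_k
        best = new_best
    return selected
```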

[Q2-2] Are there smarter search strategies (e.g., based on gradients or interpretability scores) that can replace or complement this trial-and-error generation-evaluation cycle?

[A4] Thank you for suggesting potential improvements to search efficiency. It is worth noting that exhaustively searching all possible head combinations has exponential complexity $\mathcal{O}(2^H)$, where $H$ is the total number of attention heads. In contrast, our method already reduces this to $\mathcal{O}(H)$ by leveraging the compositionality of heads, which represents a significant improvement in efficiency.

To further improve efficiency, we propose a practical strategy based on our observation of head selection patterns across rounds (Fig. 22, Appendix D.3.2). We found that in early rounds, heads related to structure and general image quality are selected, while in later rounds, heads contributing to style-specific effects are retrieved. This likely occurs because style heads alone do not produce high-quality outputs and thus rely on general-quality heads to be effective.

Based on this insight, once the general-quality head set is obtained (e.g., from a one-time search), we can reuse it as a starting point for style-specific searches. This allows us to skip the early rounds and save up to 40% of the computational cost in our experiments (2 out of 5 rounds). Below, we provide a comparison showing this strategy for the “sunlit” and “spring” styles:

Full HeadHunter Search (“sunlit” and “spring” styles):

| Round | HPSv2 | PickScore | ImageReward | AES | CLIPScore |
| --- | --- | --- | --- | --- | --- |
| 0 | 0.2166 | 19.96 | 0.00 | 5.27 | 24.22 |
| 1 | 0.3094 | 21.66 | 1.48 | 5.77 | 26.35 |
| 2 | 0.3173 | 22.13 | 1.74 | 5.88 | 27.63 |
| 3 | 0.3115 | 22.19 | 1.72 | 5.90 | 28.62 |
| 4 | 0.3133 | 22.09 | 1.78 | 5.92 | 27.95 |
| 5 | 0.3144 | 22.17 | 1.63 | 5.96 | 28.57 |

HeadHunter Search Initialized with General-Quality Heads (“sunlit” and “spring” styles):

| Round | HPSv2 | PickScore | ImageReward | AES | CLIPScore |
| --- | --- | --- | --- | --- | --- |
| 2 | 0.3040 | 21.77 | 1.42 | 5.71 | 25.89 |
| 3 | 0.3102 | 22.01 | 1.67 | 5.98 | 27.06 |
| 4 | 0.3228 | 22.30 | 1.65 | 5.77 | 27.14 |
| 5 | 0.3172 | 22.17 | 1.74 | 5.92 | 27.46 |

More advanced head search methods, such as gradient-based or RL approaches, are promising and will be explored in future work. However, our main contribution lies in showing that attention heads exhibit interpretable, disentangled effects under perturbation, and in proposing a simple yet effective baseline to leverage this. We leave efficiency improvements to future work and thank the reviewer for the valuable suggestion.

[Q4] For user applications that require rapid iteration, this strategy of iteratively generating and evaluating each individual head can become computationally prohibitive. What about the computational efficiency and scalability?

[A5] Thank you for the question. Due to space limitations, we kindly refer the reviewer to responses [A1] and [A2] provided to reviewer X5tq, which address this point in detail.


[1] Guiding a diffusion model with a bad version of itself, NeurIPS 2024

[2] Classifier-free Diffusion Guidance, arXiv 2022

Comment

Thank you very much for taking the time to provide a thorough and thoughtful review. Your feedback helps us a lot in improving our work!

We have tried our best to address your concerns in our response, but please don’t hesitate to let us know if you have any further questions. We’d be happy to discuss them during the discussion period. We look forward to hearing from you :)

Comment

The proposed method is innovative, and the authors' rebuttal successfully justifies their parameter choices and the methodology's rationale. However, as noted previously, the paper's theoretical underpinnings remain weak. Additionally, it is debatable whether the slight performance improvement in certain scenarios justifies the increased complexity from the large number of heads. For these reasons, I will maintain my original score as a borderline accept.

Comment

Thank you for your time and effort during the review and discussion period. Your comments and feedback have greatly improved our work.

We would like to respectfully emphasize that a key contribution of our work lies in identifying distinct behaviors of individual heads in perturbation guidance, and in proposing a strong baseline for selecting heads, which can inspire future research on head-aware approaches such as head-aware PEFT, unlearning, and so on.

While the head search algorithm introduces some complexity, we would like to highlight that, once the search is completed, the resulting configuration can be shared with the community and reused by others without any additional search cost. We believe this significantly improves the practicality and accessibility of our method. We acknowledge, however, that this perspective may not be universally shared.

Once again, we sincerely appreciate your thoughtful feedback. Please feel free to reach out if you have any further questions :)

Review (Rating: 4)

Based on current guidance methods for diffusion models, the authors propose a unified attention perturbation guidance model, which interpolates between the original attention map and a perturbed attention map. Specifically, the authors interpret the attention maps as a set of probability distributions, where the uniform distribution represents the scenario with the highest entropy, and the identity matrix can be interpreted as the lowest entropy case. Interpolating between the extremes is investigated experimentally in the first half of the paper, and it is concluded that there exists a certain trade-off. In the second half of the paper, the authors investigate the impact of different layers and attention heads on the guidance methods and propose an iterative framework, called Headhunter, for identifying attention heads that are interpretable and composable.

Strengths and Weaknesses

Strengths

  • Interesting results and thorough investigation: The paper goes into great detail investigating the effects of attention-based perturbation guidance on a layer and attention-head level. For the generalized framework of attention perturbation guidance, the authors carefully explore the variety of hyperparameters such as the interpolation coefficient, the guidance strength, or the target layers or attention heads. It is interesting to see that selectively guiding certain attention heads impacts specific properties and can be used for controllability purposes.

  • Automated framework appears promising: The proposed HeadHunter framework allows for a principled and automated approach to find relevant attention heads that improve the generation quality of the image under a predetermined objective function. By composing multiple of these heads, the overall performance can be improved over the unconditional case.

Weaknesses

  • Probability interpretation is mathematically questionable/not well-motivated: The interpretation of the attention maps as a set of probability distributions is not well defined in the paper and might be confusing for some readers, as it might imply some stochastic modelling of the attention values. Interpreting the values of the attention map as a set of probability distributions appears to only matter for the argument of minimal/maximal entropy scenarios, which itself does not provide much additional insight into the method. For the minimal entropy scenario, there exists more than just the identity matrix option, which is referred to as 'max guidance' in the Supplementary Material. Technically, the interpolation could be performed on the entire probability simplex, which offers a large number of options.

  • Unclear figures and parameter settings: Throughout the paper, it remains unclear what value is used for the interpolation parameter u or the guidance scale w in the figures for a lot of scenarios. This makes it hard to understand the exact effects of each parameter and the generalizability of the method.

  • Perturbation mechanism remains unclear, mostly empirical observations: While the paper provides a wide variety of different experimental results, it lacks an answer to the deeper mechanisms of the perturbation mechanism. For classifier-free guidance, the intuition behind weighing the conditional signal more strongly stems from sharpening the estimated conditional distribution, but in the case of perturbation guidance, no intuitive interpretation is given.

Questions

  • How does perturbation guidance perform compared to regular classifier-free guidance or auto-guidance?
  • How should perturbation guidance be interpreted? How does it affect the estimated probability density path?
  • Are the effects observed for each attention head only valid for attention perturbation, or could they also be used independently? If so, how does that perform compared to the proposed method?

Limitations

Yes (in the Supplementary Material)

Final Justification

After the discussion with the authors, most of my questions have been answered and my concerns were addressed. While the findings presented in the paper are interesting and noteworthy, the unclear description and settings of the method hinder a deeper understanding, to the point where reproducibility becomes a concern. As a result, I decided to keep my borderline acceptance rating.

Formatting Issues

No formatting concerns.

Author Response

We sincerely appreciate your careful evaluation of our manuscript, which helps us a lot to improve our work. We address your questions as follows.

[W1] Mathematical soundness/motivation for interpreting attention maps as a set of probability distributions

[A1] Thank you for the insightful comments. The original attention mechanism paper [1] interprets attention weights as probabilities of alignment between target and source tokens, and subsequent works have adopted a similar probabilistic interpretation. We extend this view to attention maps to unify existing guidance methods and to enable a natural formulation of interpolation between distributions.

Although interpolating between two distributions is an intuitive and natural operation in the space of probability distributions, this idea has been largely overlooked in prior work, and the probability-space perspective is what allowed us to introduce interpolation between the original and perturbed attention maps.

While the probabilistic interpretation supports a unified view of existing guidance methods and motivates the introduction of interpolation, we acknowledge that entropy-based explanations may not provide substantial additional insight and could cause confusion. Therefore, we will minimize the discussion of probabilistic, especially entropy-based views in the revised manuscript, and instead place greater emphasis on interpolation, which plays a key role in controlling perturbation strength.

[W1] Interpolation on other paths.

[A2] As the reviewer pointed out, alternative interpolation paths in the probability space, such as Fisher-Rao interpolation, can lead to slightly better image quality. We present the results in the table below:

| Interpolation method / $u$ | 0 | 0.2 | 0.4 | 0.6 | 0.8 | 1 |
| --- | --- | --- | --- | --- | --- | --- |
| Linear | 20.65 | 22.15 | 22.32 | 22.32 | 22.21 | 21.90 |
| Fisher-Rao | 20.65 | 21.80 | 22.34 | 22.30 | 22.03 | 21.90 |

We can see that Fisher-Rao interpolation provides slightly higher scores than linear interpolation at certain $u$ values. However, it also complicates the implementation and may introduce numerical instability. In addition, we initially attempted to implement optimal transport (OT) geodesic interpolation, but found it difficult to implement efficiently. Given its simplicity, stability, and competitive performance, we adopt linear interpolation in our main experiments. Nonetheless, users may still consider using more complex paths if slightly better quality is desired. We will include these results and the discussion in the revised manuscript.

[Q1] How does perturbation guidance perform compared to regular classifier-free guidance or auto-guidance?

[A3] We kindly refer the reviewer to [A4] for a conceptual comparison between perturbation guidance, CFG, and Autoguidance. In this response, we interpret the question as a request for a performance comparison.

Our goal is to improve attention perturbation guidance through fine-grained control and principled objective-driven head selection, rather than to outperform CFG or autoguidance. Nonetheless, for reference, we conducted experiments on 1K prompts from the MS-COCO validation set and report the results below. For perturbation guidance, we use the top-k heads as described in Sec. 4.2.1.

| Method | FID | IS | Precision | Recall | Density | Coverage | PickScore | ImageReward |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 64.08 | 18.61 | 0.44 | 0.56 | 0.53 | 0.04 | 20.61 | -0.31 |
| CFG ($w$=4.0) | 54.84 | 25.45 | 0.71 | 0.48 | 1.23 | 0.08 | 22.38 | 0.91 |
| Autoguidance ($w$=4.0, $D_1$: dropout with p=0.001) | 68.49 | 18.94 | 0.43 | 0.40 | 0.42 | 0.03 | 20.86 | -0.09 |
| Ours (top-$k$=21, $w$=4.0, $u$=0.5) | 56.85 | 24.18 | 0.65 | 0.44 | 1.02 | 0.06 | 22.32 | 0.73 |
| Layer 9 ($w$=4.0, $u$=0.8) | 59.23 | 24.32 | 0.63 | 0.43 | 0.97 | 0.06 | 22.28 | 0.71 |

We note that prior perturbation guidance methods (PAG and SEG) were also not designed to outperform CFG. In general, perturbation guidance offers unique advantages, including compatibility with unconditional generation and better preservation of sample diversity. Moreover, it can be combined with CFG to leverage the strengths of both approaches. We demonstrate this in Tab. 3 and Fig. 27 in the Appendix.

In this work, we focused on improving perturbation guidance by introducing head-level control. As shown in the last two rows of the table, our best-performing head-level configuration outperforms the best-performing layer-level setting (which corresponds to PAG with principled layer search) across nearly all metrics. This suggests that head-level perturbation offers more effective and fine-grained control compared to layer-level approaches. The same trend holds for style-oriented quality improvement, as shown in the table provided in response [A1] to reviewer xdij.

[W3] Deeper mechanisms of the perturbation mechanism. Intuitive interpretation like CFG. [Q2] How should perturbation guidance be interpreted?

[A4] Thank you for raising this important question. We adopt the interpretation proposed in autoguidance [2] to help explain how perturbation guidance works. [2] argues that weakening the model results in a smoother sample distribution that includes outlier or erroneous regions. By guiding away from the weaker model, Autoguidance avoids these undesirable regions and helps correct the model’s errors.

Perturbation guidance operates in a similar manner. By perturbing the original model’s attention distribution, we create a degraded variant that captures the erroneous regions. The difference between the original and perturbed outputs provides a meaningful guiding signal that can steer the sampling process away from those regions.

In this framework:

  • The perturbation method defines how the sample distribution is perturbed (i.e., which regions of the distribution are smoothed) and thus determines the direction of the guidance signal.
  • The perturbation strength defines how much the distribution is perturbed (i.e., the degree of difference between the original and perturbed scores) and also affects the direction.
  • The guidance scale $w$, in contrast, controls how far the sample moves in that direction and affects only the magnitude of the signal, not its direction.

Therefore, the perturbation method/strength plays a crucial role in shifting the sampling distribution to cover outliers/erroneous samples while avoiding overly-smoothed distribution, which would lead to ineffective guidance. This intuition highlights the importance of selecting an appropriate perturbation strength, as also observed in Autoguidance, where slightly degraded models (e.g., S vs. XS) yield the best performance.

Similarly, in our setting, the perturbed attention heads determine the shape of the smoothed distribution by selecting which visual attributes to perturb (i.e., the direction of perturbation), and the interpolation parameter $u$ controls its strength. Together, they determine the direction of the guidance signal. This is fundamentally different from tuning the guidance scale $w$, which only adjusts the magnitude. This insight highlights the importance of controlling $u$.

Indeed, we conducted experiments where we reduced the guidance scale. The best results were consistently achieved by tuning $u$, as shown in the table below:

PickScore comparison

| $w$ / $u$ | 0.25 | 0.5 | 0.75 | 1.0 |
| --- | --- | --- | --- | --- |
| 2.0 | 22.07 | 22.28 | 22.19 | 21.92 |
| 4.0 | 22.31 | 22.29 | 22.08 | 21.66 |
| 6.0 | 22.29 | 22.12 | 21.78 | 21.27 |

[Q3] Can the effects observed for each attention head be used independently? If so, how does that perform compared to the proposed method?

[A5] Thank you for the thoughtful question. We refer the reviewer to response [A3] to Reviewer X5tq for our preliminary exploration of parameter-efficient fine-tuning (PEFT) applied to specific heads retrieved by HeadHunter. We observed that fine-tuning individual heads can lead to different visual outcomes, suggesting that attention heads may encode distinct visual semantics.

However, we did not observe a clear correlation between the heads selected by HeadHunter and improvements in PEFT efficiency or style fidelity. Our hypothesis is that heads already specialized for certain styles may not benefit from further training, whereas other heads that are less aligned with the target style may still have capacity to learn, resulting in greater improvement when trained.

Additionally, in this response, we explore an alternative approach where we directly scale the attention maps of selected heads using a fixed multiplier, inspired by Prompt-to-Prompt [3]. We apply this to interpretable heads we found (e.g., “sunlit,” “shadow,” “blue”). Interestingly, this method does enhance certain visual concepts consistent with those observed under perturbation guidance, for instance, increasing shadow intensity or emphasizing specific colors.

However, we also found that this approach frequently introduces noticeable artifacts and tends to degrade overall image quality, making it less practical in real applications.
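For completeness, the reweighting we tried takes roughly the following form (a sketch; the multiplier value and the absence of row renormalization are illustrative assumptions rather than a fixed recipe):

```python
def amplify_heads(A, head_ids, scale=1.5):
    """Scale the attention maps of selected heads by a fixed multiplier (sketch).

    A: (batch, heads, tokens, tokens) row-stochastic attention maps. Without
    renormalization, scaling a head's map scales that head's contribution
    to the layer output (A @ V)."""
    A_amp = A.clone()
    A_amp[:, head_ids] = A_amp[:, head_ids] * scale
    return A_amp
```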

In general, it is considerably easier to degrade model performance than to improve it. This aligns with findings from AutoGuidance [2] and supports our claim that HeadHunter’s negative guidance strategy provides a more stable and effective direction for controllable generation.

[W2] Clarification on parameter choice.

[A6] We apologize for the confusion. In Sec. 3 and Sec. 4, we used $w = 5.0$ and $u = 1.0$. For HeadHunter, we used $w = 3.0$ and $u = 1.0$, except in Fig. 7(b), where $u = 0.25$. This follows standard practice in diffusion literature [4–7], which uses different settings for quantitative and qualitative comparisons. We will clearly state all parameter values in the manuscript.


Comment

References:

[1] Neural machine translation by jointly learning to align and translate, ICLR 2015

[2] Guiding a diffusion model with a bad version of itself, NeurIPS 2024

[3] Prompt-to-Prompt Image Editing with Cross Attention Control, arXiv 2022

[4] Scalable Diffusion Models with Transformers, ICCV 2023

[5] SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers, ECCV 2024

[6] High-Resolution Image Synthesis with Latent Diffusion Models, CVPR 2022

[7] Classifier-free Diffusion Guidance, arXiv 2022

Thank you so much for helping us improve the paper. We appreciate your open and thoughtful discussions! The discussion on the intuitive interpretation of our method, the independent use of attention-head effects, and the clarification of details helped us improve the paper. Please let us know if you have any further questions.

Comment

Thank you for your detailed response. I have a question regarding using $u=1$. Following the definition of $u$ between L. 138 and L. 139 (it would be helpful to add numbers to the equations), choosing $u=1$ would just give us the regular attention map, and there would be no perturbation, which seems odd. Is my understanding correct? I also realized that $u$ is defined differently in Fig. 10 of the appendix. In that case, $u=1$ would be equal to perturbing with the identity matrix, which is equal to PAG.

Comment

We apologize for the confusion. Your understanding is correct. There was a typo in the equations shown in L.138 and L.153. As you mentioned, the version in Fig. 10 in the appendix reflects the correct formulation: $f_I(\mathbf{A},u) = (1-u)\mathbf{A} + u\mathbf{I}$ and $f_U(\mathbf{A},u) = (1-u)\mathbf{A} + u\mathbf{U}$, where $u=0$ corresponds to the regular attention map and $u=1$ corresponds to the fully perturbed attention map. We will revise the equations in L.138 and L.153 to be consistent with this definition. The notation used in the figures of the current manuscript and our rebuttal follows this version.

Also, as you suggested, we will add equation numbers to improve clarity. Thank you again for your helpful feedback and suggestions to improve our paper.

Review (Rating: 4)

This paper introduces a unified framework for self-attention score guidance. Basically, by ablating certain attention heads (interpolating between the original map and the identity or a flat map) and using them as a negative control, one can do classifier-free-guidance-style sampling with the model itself and achieve better quality or style. The authors systematically study the effect of perturbing different attention heads and combinations of them. Finally, they develop a pipeline to systematically search for head combinations that optimize a certain objective, such as style.

Strengths and Weaknesses

Strengths

  • The method is simple to implement and quite principled. The introduction of this method as a conceptual unification of previous guidance methods (PAG, SEG) is nice.
  • The systematic study and description of the perturbation effects of different heads and combinations of them is quite interesting (Figs. 4, 5). It is thought-provoking for interpretability studies of diffusion models.
  • Localizing the perturbation to heads is interesting and illuminating; from Fig. 6 it seems to be a strong and targeted effect.

Weaknesses

  • Lack of a theoretical framework and further understanding of the observed effect. For example, the authors could discuss:
    • whether the effect of perturbing a head is equivalent to perturbing a layer with a lower strength $w$ (see questions);
    • or whether perturbing/ablating the attention map in the negative run is equivalent to enhancing the map in the positive run.
  • The proposed HeadHunter iterative method is a great proof of concept but it seems a bit ad hoc, and expensive… not sure if it’s actually practical. (see questions)
  • Minor: The main text may benefit from a slightly more detailed description of the HeadHunter method; currently all the relevant details are buried in Appendix D.

Questions

  • What is the complexity of the HeadHunter method? It seems to need to run through all the layer × head combinations and all the prompts once per iteration, that is, hundreds of heads and runs. So in a sense it is very expensive to run HeadHunter once. Then the question becomes how useful and generalizable it is. Can the authors show that they can run it once for some prompts and then generalize to many more new prompts/contexts?
  • As a follow-up: if so, isn't the better thing to do to finetune the model or perform some PEFT specific to those heads? How does HeadHunter compare to just finetuning?
  • The paper introduces the additional $u$ parameter, but could it be equivalent to tuning the CFG guidance scale parameter $w$ for previous methods? This is especially concerning for the Figure 6 head-vs-layer comparison; it seems likely that the layer result looks bad because the guidance is too strong… and it is intuitive that if all attention heads in a layer are perturbed, then the difference between the two runs will be greater, so the guidance strength should be tuned down for it… The authors also noted themselves, at L216, that the more heads are perturbed, the more likely it is to oversaturate. Is there an experiment showing that tuning the $u$ parameter offers more than tuning the guidance scale?
  • In Section 3, you introduce an additional interpolation parameter $u$; on the bright side, we have another knob to tune, but on the dark side, we have yet another hyperparameter to tune. Is there any optimal choice of this value that could be transferred to new experiments, or does it vary so much from head to head that it needs to be tuned separately? Do you need to tune $u$ for each head in HeadHunter?
  • (Minor) L131: Why can't 2D Gaussian blur be applied to modern MMDiT architectures?

Limitations

Yes in appendix H.

Final Justification

The authors have addressed all the concerns raised in our review! We decided to raise the score to 4. The main reasons for increasing the score are the explanations and new experiments on 1) method efficiency, 2) the generalizability of the method, and 3) the comparison with PEFT, which make the method more impactful! This evidence convinces me that the proposed method has some potential for practical use.

The new results of tuning the CFG scale for layer vs. single-head perturbation are very interesting. I feel the initial concern (the possible equivalence between disrupting one head and disrupting a layer with a smaller CFG scale) is sometimes valid, but definitely not for every layer. The authors are very transparent about it, and we appreciate their honesty.

Formatting Issues

The authors seem to have forgotten to remove the guidelines from the checklist.

Author Response

We are grateful to the reviewer for their time and insightful feedback, which helps us a lot to improve our work. We address the questions:

[Q1] What is the complexity of the head hunter method?

[A1] Thank you for the question. As noted in Appendix D.3.2, we consider the following parameters to compute the NFEs:

  • $T$: total number of search iterations
  • $M$: number of prompt-seed pairs
  • $N$: total number of attention heads in the model
  • $S$: number of denoising steps

In total, our framework requires $T \cdot M \cdot N \cdot (2S+1)$ NFEs, accounting for two forward passes per denoising step for guidance and one NFE for the verifier.

In our general quality improvement setting, we use $M = 20$, $T = 1$, and in the style-oriented setting, $M = 5$, $T = 5$. Both use $N = 24 \times 24$ heads and $S = 20$ steps. On an 8×H100 GPU node, each iteration takes roughly 36 minutes.

Importantly, this is a one-time cost, which can be amortized over many prompts thanks to the strong generalizability of the selected heads. We elaborate on this in [A2].

[W2 Q1] How useful and generalizable it is? Can the authors show that they can run it once for some prompts and then generalize to many more new prompts / context?

[A2] In Fig. 7(b), we show that heads selected for improving general image quality using only 20 prompts also enhance the quality of samples from unseen prompts in the MS COCO validation set.

For style-oriented improvement, we demonstrate in Appendix D.3.2 (“Generalization across content prompts”) that head sets selected from just 5 content prompts can improve style fidelity for 50 novel prompts generated by GPT-4o. Fig. 26 and Tab. 2 report both qualitative and quantitative results, confirming HeadHunter's strong generalization to new prompts.

[Q2] Isn’t the better thing to do is to finetune the model or perform some PEFT specific to those heads? How does the HeadHunter compare to just finetuning?

[A3] Thank you for the insightful question. To investigate this, we ran preliminary experiments fine-tuning LoRA modules on the interpretable heads identified in Fig. 5 and Fig. 22 (e.g., “darkness,” “shadow,” “blue,” “sunlit”). For each head, we generated samples using perturbation guidance with related prompts and then trained LoRA on the query/key/value projections of that head, both with and without the output projection. We compared the results with those from LoRA fine-tuning of full attention layers.

Interestingly, we found that fine-tuning on the retrieved heads does not consistently lead to improved style alignment. In fact, training on HeadHunter-retrieved heads resulted in weaker alignment to the intended style than training on full layers (see Fig. 8). We hypothesize that this is because the retrieved heads already encode the desired style, so further tuning provides little benefit. In contrast, unrelated heads have more room to adapt, leading to stronger optimization.

[W2-2] Whether perturbing/ablating the attention map in the negative run is equivalent to enhancing the map in the positive run.

[A4] Thank you for the comment. As noted in autoguidance [1], "it is easy to get a worse version of the model, but much harder to make it better," and, as far as we know, no principled way to enhance the map in the positive run exists.

However, we explored an alternative perspective for enhancing the effects of attention perturbation in the positive run. One possible way is the method of Prompt-to-Prompt [2], where the attention maps are amplified by multiplying them with a scalar greater than one. We conducted such experiments using the interpretable heads identified in Fig. 5 and Fig. 22 (e.g., the “darkness”, “shadow”, “blue”, and “sunlit” heads), both with and without prompts containing the relevant concepts.

Our results showed that boosting the corresponding heads can emphasize their associated concepts to some extent. For example, enhancing the “darkness” head produced thicker and more pronounced shadows; the “sunlit” head added yellow highlights near human contours and the “blue” head introduced blue elements in headphones or backgrounds. However, we also observed that this often came at the cost of greatly reduced image quality with notable artifacts.

These findings suggest that while concept-specific heads can indeed be identified and amplified, directly enhancing attention maps can be hard to achieve and unstable. In this context, using negative guidance offers a more practical alternative for emphasizing desired styles or attributes.

[W1] It's a bit unclear if the effect of perturbing a head is equivalent to perturbing a layer with less strength $w$. (see questions) [Q3] Could it be equivalent to tuning the CFG guidance scale parameter $w$ for previous methods? … Is there an experiment showing that tuning the $u$ parameter offers more than tuning the guidance scale?

[A5] Thank you for raising this important point. We would like to clarify that perturbing a single head is not equivalent to perturbing an entire layer with a lower guidance scale $w$. As a layer is equivalent to multiple heads (24 heads in SD3), let us consider the single-head vs. two-head case. In Fig. 5, we showed that applying head-level guidance to each head results in a distinct visual attribute, and the two-head case results in the composition of the concepts implied by each head. If the reviewer's assumption were true (layer perturbation with a lower guidance scale is equivalent to fewer head perturbations), then reducing the strength across multiple heads should reproduce the effect of one of the individual heads. To check this, we conducted several experiments using the attention heads in Fig. 5(b). Applying head-level guidance with the “blue” and “dark” heads at a lower guidance scale did not correspond to either the “blue” head perturbation alone or the “dark” head perturbation alone, but instead exhibited a diluted “dark blue” concept; the results were similar for other head combinations.

Also, due to the difference in the number of heads, we agree that the guidance strength in Fig. 6(b) may have been too strong, leading to an unfair comparison. As suggested, we re-evaluated with lower guidance scales and report the quantitative comparison using PickScore below. (Due to the rebuttal constraints, we only provide quantitative results here; we will include sample images in the revised manuscript.)

For L3 and L8, your comment was correct. We confirmed that reducing guidance at the layer level yielded performance comparable to head-level guidance. As these are not suitable examples, we will revise the manuscript accordingly, and we are grateful for your valuable feedback in improving the paper.

However, we would like to respectfully draw the reviewer's attention to L12 and L13. Despite using a very small guidance scale, L12 still fails to surpass the performance of head-level guidance, and L13 even consistently underperformed compared to the baseline (unguided) across all guidance scales. These results still support our analysis that even for underperforming layers, selecting the right heads can yield strong results.

PickScore comparison

| Layer / $w$ | Baseline (unguided) | 1 | 2 | 3 | 4 | 5 | Ours (head-level guidance, $w$=5.0) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 21.08 | 22.07 | 21.87 | 21.43 | 17.67 | 18.36 | 21.37 |
| 8 | 21.08 | 22.31 | 22.43 | 20.94 | 20.36 | 21.41 | 22.23 |
| 12 | 21.08 | 21.77 | 21.51 | 20.7 | 20.44 | 17.81 | 22.52 |
| 13 | 21.08 | 19.62 | 19.12 | 17.39 | 16.55 | 16.75 | 22.47 |

[Q4] Choice of hyperparameter $u$

[A6] Thank you for the great question. We found that using a fixed guidance scale of $w=3.0$ and interpolation parameter $u=1.0$ works well across different style prompts in the style-oriented quality improvement setting. This fixed configuration was used throughout our main experiments (Fig. 8 and Fig. 9) across 23 style prompts, and also in the generalization test (Fig. 26, Tab. 2), demonstrating strong qualitative and quantitative performance.

As the reviewer suggested, it is indeed possible to tune $u$ individually for each head to potentially improve quality further. However, we found that applying the same $u$ to all selected heads yields sufficiently strong results in practice. In our HeadHunter search phase, we set $u=1.0$ to maximize the influence of each head in expressing a specific style. For inference-time usage, users may optionally reduce $u$ if the perturbation appears too strong.

Importantly, our method is not restricted to a specific value or search strategy for $u$. The choice of $u=1.0$ was based on empirical effectiveness and simplicity, but HeadHunter remains compatible with alternative tuning schemes if desired.
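As an illustration of how $w$ and $u$ enter at inference time, here is a hedged sketch of one guided prediction step; `perturb_heads` and `perturb_u` are hypothetical keyword arguments standing in for a forward pass whose selected heads are perturbed as sketched above:

```python
def guided_prediction(model, x, t, cond, head_ids, w=3.0, u=1.0):
    """One guidance step: steer the prediction away from the head-perturbed branch.

    Sketch only; `perturb_heads` / `perturb_u` are hypothetical arguments for a
    forward pass that applies the head-level perturbation with strength u.
    """
    pred      = model(x, t, cond)                                       # original branch
    pred_pert = model(x, t, cond, perturb_heads=head_ids, perturb_u=u)  # perturbed branch
    return pred + w * (pred - pred_pert)
```

In this form, $u$ controls how far the perturbed branch deviates from the original model, while $w$ controls how strongly the sample is pushed away from it.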

[Q5] (Minor) L131: Why can the 2D Gaussian blur not be applied to the modern MMDiT architecture?

[A7] Thank you for the opportunity to clarify. In MMDiT architectures, attention maps span both 2D image tokens and 1D text tokens. Therefore, applying the original paper’s 2D Gaussian blur to the full attention map is not feasible. While we experimented with applying the blur only to the image-image attention submatrix, this led to degraded performance and unintended artifacts. Moreover, such partial perturbations do not generalize well to other multi-modal models (e.g., video), where attention spans additional dimensions such as time and would require heuristic designs that we believe are not scalable.
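To illustrate the point, the sketch below (with illustrative token counts, not the models' actual settings) shows why only part of an MM-DiT attention map has the 2D spatial structure that a Gaussian blur assumes:

```python
# Illustrative layout of a single MM-DiT attention head over concatenated tokens.
T, H, W = 77, 64, 64            # hypothetical text length and latent grid size
S = T + H * W                   # total sequence length
# attn has shape (batch, num_heads, S, S) after softmax:
#   attn[..., T:, T:]  image-image block, reshapeable to (H, W, H, W) -> 2D blur is defined
#   attn[..., :T, :]   text-query rows   -> no 2D spatial structure
#   attn[..., :, :T]   text-key columns  -> no 2D spatial structure
```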

[W3] Minor: The main text may benefit from a slightly more detailed description of the HeadHunter method; currently, all the relevant details are buried in Appendix D.

[A8] Thank you for suggesting this. We will add a more detailed description and the algorithm of the HeadHunter method to the manuscript so that readers can grasp the idea more clearly.
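For readers who want the gist before the revised main text, below is a hedged sketch of an objective-driven greedy head search. It is not the exact algorithm from Appendix D; `score_fn` is a hypothetical callable assumed to generate images with head-level guidance on the given head set and return the user-chosen verifier score.

```python
def headhunter_greedy(candidate_heads, score_fn, k=5):
    """Greedily grow a head set by adding the head that most improves the objective."""
    selected = []
    for _ in range(k):
        best_head, best_score = None, float("-inf")
        for head in candidate_heads:
            if head in selected:
                continue
            score = score_fn(selected + [head])  # e.g., aesthetic or style verifier
            if score > best_score:
                best_head, best_score = head, score
        if best_head is None:
            break
        selected.append(best_head)
    return selected
```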


[1] Guiding a diffusion model with a bad version of itself, NeurIPS 2024

[2] Prompt-to-Prompt Image Editing with Cross Attention Control, arXiv 2022

Comment

Thank you very much for taking the time to provide a thorough and thoughtful review. Your feedback helps us a lot in improving our work!

We have tried our best to address your concerns in our response, but please don’t hesitate to let us know if you have any further questions. We’d be happy to discuss them during the discussion period. We look forward to hearing from you :)

Comment

Thank you for your thorough revision and for addressing all of our concerns. The new experiments on generalizability and the comparison with PEFT significantly strengthen the impact of your work (otherwise, there might not be a very strong reason to use the method), and I recommend highlighting these findings in the final version.

I also found the results on tuning CFG scale—comparing weaker layer-level intervention versus stronger single‐head interventions—particularly insightful. While the equivalence between disrupting one head and an entire layer holds in some cases (when heads are homogeneous), it clearly breaks down in others (where heterogeneity makes additional disruptions detrimental). A deeper investigation into each layer’s functional role would be a compelling direction for future work. Your transparent discussion of this nuance will be a valuable addition to the manuscript.

Finally, your explanation of the MMDiT attention map and the asymmetry between disrupting negative runs versus enhancing positive runs clarified that point beautifully. We are pleased with these extensive revisions and will raise our score; we lean toward accepting the paper. Thank you for your outstanding work!

Comment

Thank you very much for your kind words and for suggesting things further to improve our work!

Final Decision

Summary.

This paper introduces a generalized framework for attention perturbation guidance in text-to-image diffusion and flow models. The authors make two primary claims: 1) Their framework, which introduces an interpolation parameter between the original and perturbed attention distributions, unifies and improves upon prior methods like PAG and SEG. 2) Specific attention heads correspond to distinct and interpretable visual attributes (e.g., structure, lighting, texture). Building on this, they propose "HeadHunter," a systematic search method to identify and combine the most effective heads to achieve user-defined objectives, such as enhancing aesthetic quality or reinforcing a specific style, without needing to retrain the model. The findings are validated on large-scale models like Stable Diffusion 3 and FLUX.1, demonstrating improved image quality and style control.

Strengths

  • Novel interpretability and controllability: The key insight that individual attention heads correspond to disentangled visual attributes is a significant finding. This enables a new level of fine-grained control over the generation process (acknowledged by all reviewers).
  • Thorough experimental analysis: The paper is supported by a comprehensive set of experiments that carefully investigate the effects of different perturbation strategies, layers, and individual heads across a wide range of metrics (acknowledged by ZmWS, xdij).
  • Practicality and scalability: The proposed method is a plug-and-play solution that can be applied to large, pre-trained models without any retraining. The authors demonstrate that the discovered head-sets are generalizable to unseen prompts, making the one-time search cost practical as these configurations can be shared and reused (acknowledged by ZVaq, xdij).
  • Principled and automated head selection: The HeadHunter framework provides a systematic, objective-driven approach to designing perturbation strategies, moving beyond the heuristic-based layer selection of previous methods (acknowledged by ZmWS).

Weaknesses

  • Lack of theoretical depth: Several reviewers (X5tq, ZmWS, ZVaq) pointed out that the paper lacks a deep theoretical motivation for some of its observations. For instance, the reason a "sweet spot" for the interpolation parameter exists and the precise mechanism behind perturbation guidance are explained through intuition and analogy to prior work rather than a formal framework.
  • Clarity on hyperparameters and reproducibility: Initially, reviewers (ZmWS, xdij) found it difficult to track the specific hyperparameter values (e.g., guidance scale w and interpolation u) used in different experiments, which could hinder reproducibility. The authors clarified this during the rebuttal.
  • Computational cost of HeadHunter: A primary concern was the computational expense of the HeadHunter search process. Reviewers (X5tq, xdij, ZVaq) questioned its practicality, especially for models with a very large number of heads.

Reasons for the decision.

I am recommending this paper for acceptance, which is also the consensus among reviewers. The core contribution, demonstrating that attention heads can be manipulated as interpretable, fine-grained units of control, is a novel and valuable insight for the field of generative models. This finding has the potential to inspire future work in model interpretability, editing, and control. The authors effectively mitigated the weaknesses during the rebuttal. Two reviewers (X5tq, xdij) raised their scores, and there were no negative ratings after the rebuttal.

Summary of the discussion.

The rebuttal period was highly effective. The following points were discussed.

  • Computational Cost and Practicality: Reviewers X5tq, xdij, and ZVaq raised major concerns about the practicality of the HeadHunter search due to its high computational cost. The authors clarified the exact number of function evaluations (NFEs), framed it as a one-time cost, and provided new experiments showing that the discovered head-sets generalize effectively to unseen prompts.
  • Comparison to other inference-time methods: Reviewer xdij asked for a clearer distinction from inference-time methods that search over the initial noise space. The authors explained that noise search is per-prompt, whereas their method learns a generalizable configuration. They provided a new experiment with an equal NFE budget, showing their method achieved comparable or better performance on several metrics while having the key advantage of generalization. This successfully addressed the reviewer's concern.
  • Theoretical Intuition and Hyperparameter Choices: Reviewers X5tq, ZmWS, and ZVaq requested a better explanation for the method's mechanisms and hyperparameter choices. The authors provided a more detailed intuition by linking perturbation strength and guidance scale to the direction and magnitude of guidance, drawing an analogy to the Autoguidance framework. They also clarified all hyperparameter settings used in the paper.
  • Clarity and Notational Errors: Reviewers xdij and ZmWS noted inconsistencies in the equations for the interpolation parameter. The authors acknowledged the typos and corrected the equations to be consistent with the figures and their implementation.