PaperHub
Score: 6.4/10 · Poster · 3 reviewers (min 3, max 5, std dev 0.8)
Ratings: 4, 3, 5 · Confidence: 3.3
Novelty: 2.7 · Quality: 3.0 · Clarity: 2.7 · Significance: 2.7
NeurIPS 2025

Localizing Knowledge in Diffusion Transformers

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29

Abstract

Understanding how knowledge is distributed across the layers of generative models is crucial for improving interpretability, controllability, and adaptation. While prior work has explored knowledge localization in UNet-based architectures, Diffusion Transformer (DiT)-based models remain underexplored in this context. In this paper, we propose a model- and knowledge-agnostic method to localize where specific types of knowledge are encoded within the DiT blocks. We evaluate our method on state-of-the-art DiT-based models, including PixArt-$\alpha$, FLUX, and SANA, across six diverse knowledge categories. We show that the identified blocks are both interpretable and causally linked to the expression of knowledge in generated outputs. Building on these insights, we apply our localization framework to two key applications: *model personalization* and *knowledge unlearning*. In both settings, our localized fine-tuning approach enables efficient and targeted updates, reducing computational cost, improving task-specific performance, and better preserving general model behavior with minimal interference to unrelated or surrounding content. Overall, our findings offer new insights into the internal structure of DiTs and introduce a practical pathway for more interpretable, efficient, and controllable model editing.
Keywords
Localization · Interpretability · Diffusion Models · Diffusion Transformers

Reviews and Discussion

Review
Rating: 4

The paper proposes a model-agnostic and knowledge-agnostic method to localize where specific types of knowledge are encoded within Diffusion Transformer blocks, enabling efficient and targeted model editing.

Strengths and Weaknesses

Strengths:

  1. The proposed method in this paper enables targeted fine-tuning that is both faster and more memory-efficient, while maintaining high generation quality and consistency.
  2. The authors conduct comprehensive experiments to validate the effectiveness of the proposed method.

Weaknesses:

  1. Figures 1 and 4 demonstrate the differences in the knowledge distribution patterns among various models. What causes these differences, and are they due to variations in model architectures?
  2. While the paper adopts K=9 for model personalization and K=5 for concept unlearning, providing more details on the selection process would enhance the understanding of the method.
  3. Why does an increase in K lead to a decrease in the CLIP score, and what is the reason behind this phenomenon?

Questions

  1. Does the final performance in the experiment using the LLaVA model get affected by the preferences of that model?
  2. The paper refers to knowledge-agnostic prompts in Section 3.1. Could you clarify their specific formulation and provide the template used in implementation?
  3. Figure 4 shows that FLUX experiences the steepest performance decline across multiple metrics as K increases. What specific characteristics of FLUX might explain its heightened sensitivity to larger K values?

Limitations

Yes

Final Justification

The authors addressed most of the concerns.

Paper Formatting Concerns

There are no paper formatting concerns.

Author Response

We sincerely thank the reviewer for their constructive feedback. We are pleased that the reviewer found our experimental results comprehensive. Below, we address the reviewer’s concerns in detail:

Figures 1 and 4 demonstrate the differences in the knowledge distribution patterns among various models. What causes these differences, and are they due to variations in model architectures?

The observed differences in knowledge distribution patterns across models can be attributed to several factors, including variations in model architecture, differences in training procedures, the diversity and composition of training datasets, and the inherent complexity of the high-dimensional optimization landscape. These elements jointly influence how and where knowledge is encoded within each model, leading to the distinct patterns observed. Prior work [1] also highlights this phenomenon, showing that knowledge is distributed differently across architectures.

While the paper adopts K=9 for model personalization and K=5 for concept unlearning, providing more details on the selection process would enhance the understanding of the method.

We conducted a small-scale hyperparameter sweep over values of $K \in \{3, 5, 7, 9\}$ and empirically found that $K = 5$ yields the best results for concept unlearning, while $K = 9$ performs best for model personalization. These values provided a good balance between edit effectiveness and minimal side effects. We will clarify this selection process in the final version of the paper.

Why does an increase in K lead to a decrease in the CLIP score, and what is the reason behind this phenomenon?

The CLIP score used in our localization setup measures how strongly the generated images reflect the presence of the target knowledge. As $K$ increases, a greater number of blocks identified as encoding this knowledge are intervened on. Specifically, instead of providing these blocks with a knowledge-specific prompt (one containing the target knowledge), we supply a knowledge-agnostic prompt, so that the target information is no longer fetched by these blocks during generation. As a result, the model becomes increasingly unable to express the target knowledge in its outputs, leading to a corresponding drop in the CLIP score.
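For readers interested in the mechanics, here is a minimal, self-contained sketch of this per-block prompt intervention. The toy block and all names below are hypothetical stand-ins (not our implementation); the sketch only illustrates routing the knowledge-agnostic embedding to the selected blocks while every other block still receives the knowledge-specific prompt:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_block(img_tokens, txt_embed):
    # Hypothetical stand-in for a DiT block: image tokens are (very loosely)
    # conditioned on the pooled text embedding.
    return img_tokens + 0.1 * txt_embed.mean(axis=0, keepdims=True)

def generate_with_intervention(blocks, img_tokens, txt_specific, txt_agnostic, intervened_ids):
    # Intervened blocks receive the knowledge-agnostic embedding; all others
    # still see the knowledge-specific prompt embedding.
    h = img_tokens
    for i, block in enumerate(blocks):
        txt = txt_agnostic if i in intervened_ids else txt_specific
        h = block(h, txt)
    return h

num_blocks = 28
blocks = [toy_block] * num_blocks
img = rng.normal(size=(256, 64))        # 256 image tokens, dim 64
txt_spec = rng.normal(size=(16, 64))    # e.g. "the Batman walking through a desert"
txt_agn = rng.normal(size=(16, 64))     # e.g. "a character walking through a desert"

# In the real setup the intervened set is the localized top-K blocks; here we
# simply pick the first K indices for illustration.
out = generate_with_intervention(blocks, img, txt_spec, txt_agn, intervened_ids=set(range(5)))
print(out.shape)  # (256, 64)
```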

Does the final performance in the experiment using the LLaVA model get affected by the preferences of that model?

We understand the reviewer's concern about potential biases in using LLaVA as an evaluator. While LLaVA is a commonly used VLM in the community for such evaluations [2, 3], we complemented it with additional evaluation metrics, such as CLIP and CSD score, to improve the robustness of our results. To further address the reviewer's concern directly, we extended our evaluation by incorporating two additional VLMs, Qwen and Instruct-BLIP, to reduce model-specific biases. Below, we include a partial view of these results for the copyright, celebrity, and place categories using the FLUX model across different values of $K$. The complete results will be included in the final version of the paper.

| Category | $K$ | CLIP | LLaVA | Qwen | BLIP |
|---|---|---|---|---|---|
| Copyright | 0% | 0.2947 | 0.9530 | 0.7105 | 0.9097 |
| Copyright | 20% | 0.2240 | 0.4703 | 0.1598 | 0.3068 |
| Copyright | 40% | 0.1934 | 0.1827 | 0.0009 | 0.0490 |
| Celebrity | 0% | 0.2778 | 0.5225 | 0.3435 | 0.7240 |
| Celebrity | 20% | 0.2223 | 0.1635 | 0.0575 | 0.1894 |
| Celebrity | 40% | 0.1956 | 0.0464 | 0.0017 | 0.0180 |
| Place | 0% | 0.3046 | 0.9850 | 0.8778 | 0.9742 |
| Place | 20% | 0.2341 | 0.4506 | 0.2833 | 0.4225 |
| Place | 40% | 0.1776 | 0.0322 | 0.0100 | 0.0172 |

The paper refers to knowledge-agnostic prompts in Section 3.1. Could you clarify their specific formulation and provide the template used in implementation?

As discussed in Section 3.1, knowledge-agnostic prompts are constructed by taking a knowledge-specific prompt and replacing the target knowledge with a semantically related but generic placeholder. For example, for the knowledge “the Batman”, the specific prompt “the Batman walking through a desert” becomes “a character walking through a desert” in the agnostic version. Table 2 in the appendix includes several examples of knowledge-specific prompts along with their anchors. The knowledge-agnostic prompts are created by replacing the knowledge with the corresponding anchor in those examples.
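In code form, the substitution is just a string replacement of the knowledge phrase by its anchor; the helper below is a hypothetical illustration, not the paper's implementation:

```python
def make_agnostic(prompt: str, knowledge: str, anchor: str) -> str:
    # Replace the target knowledge with its semantically related, generic anchor.
    return prompt.replace(knowledge, anchor)

print(make_agnostic("the Batman walking through a desert", "the Batman", "a character"))
# -> "a character walking through a desert"
```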

Figure 4 shows that FLUX experiences the steepest performance decline across multiple metrics as K increases. What specific characteristics of FLUX might explain its heightened sensitivity to larger K values?

As noted earlier, differences in model architecture, training data, and optimization dynamics lead to varying localization patterns across models. In the case of FLUX, our intuition for its steeper drop in localization metrics as $K$ increases lies in its architecture. Specifically, FLUX uses the MMDiT architecture, which concatenates text and image tokens and processes them jointly. Rather than injecting textual information only via cross-attention (as seen in models like PixArt), FLUX's design allows bidirectional, layer-wise interaction between text and visual streams. This architectural design means that intervening on some blocks can change the flow of information more broadly, potentially affecting downstream layers and leading to a sharper drop as more blocks are intervened on.


We hope these additional clarifications address the reviewer’s concerns, and we thank them again for their valuable insights.


[1] Basu, Samyadeep, et al. "On mechanistic knowledge localization in text-to-image generative models." ICML, 2024.

[2] Han, Evans Xu, et al. "Progressive compositionality in text-to-image generative models." arXiv preprint arXiv:2410.16719 (2024).

[3] Sun, Kaiyue, et al. "T2v-compbench: A comprehensive benchmark for compositional text-to-video generation." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.

Comment

We thank you again for your valuable feedback and comments which have helped to strengthen our paper. As the discussion period is ending soon, we would really appreciate if you could let us know if our responses have addressed your concerns. We will be happy to answer any further questions and address any remaining concerns to the best of our abilities in the remaining time!

Review
Rating: 3

Prior research has studied the knowledge localization problem in UNet-based diffusion models; however, that understanding might not be transferable to recent transformer-based models. This paper aims to diminish that gap by proposing a method to locate where knowledge is encoded in transformer blocks. They create a probing dataset, LOCK, covering a wide range of concepts and knowledge. Their localization method computes the attention contribution of concept text tokens in the prompt to image tokens, and identifies the layers that yield the highest contribution. They validate the role of important layers by performing causal tracing with the knowledge-agnostic prompts. This paper also proposes to utilize important layers to edit knowledge more effectively. In particular, when performing personalization or unlearning, we can fine-tune those layers only. Experimental results show that this approach improves the performance of knowledge editing methods, likely due to the knowledge isolation in the model.

Strengths and Weaknesses

Strengths

  • This paper explores the knowledge localization problem in transformer-based models, which hasn't been studied before.
  • The causal tracing experiments in Sec. 3 validate the role of important layers.
  • The experiments show that finetuning important layers could be more effective than full finetuning, while requiring fewer resources.

Weaknesses

  • The term "block" in the paper is not clearly defined. It could be layers, modules, or attention heads, while this paper only studies layers. From the formulation, it seems we can also find important attention heads, which could help localize knowledge better since they are subcomponents of layers. Furthermore, this approach identifies important layers from attention contribution, while a transformer layer contains other modules as well, such as MLP and normalization layers. However, this approach does not take into account the effect of these modules.
  • The motivation for a separate work for transformer-based models is not clear. Previous works [1,2] have already localized important layers in diffusion models; this paper lacks a discussion on the challenge of those methods on transformer-based models. It'd be more convincing if you could provide the difference between the important layers identified by this framework and other methods.
  • This work fine-tunes only important layers and observes an improvement over full finetuning. However, it's not conclusive enough to connect the role of knowledge localization to finetuning methods. The experiments should also include other settings where we fine-tune random layers or the least important layers and show that the performance in those cases is worse than full-finetuning.

[1] Basu, Samyadeep, et al. "Localizing and editing knowledge in text-to-image generative models." The Twelfth International Conference on Learning Representations. 2023.

[2] Basu, Samyadeep, et al. "On mechanistic knowledge localization in text-to-image generative models." Forty-first International Conference on Machine Learning. 2024.

Questions

  • Can we apply this framework to identify important attention heads? What if we intervene on those heads only, for example, causal tracing or finetuning?
  • What is the performance when we fine-tune random or unimportant layers?

Limitations

yes

Final Justification

The paper proposes a new approach to localizing knowledge by using attention contribution; however, it could be improved by explaining why this method shows a stronger signal than causal intervention in prior works. The practicality of this paper is also questionable: it's unclear whether the proposed method indeed offers more efficiency than standard finetuning methods. Also, the experiments could be strengthened by comparing with more recent and efficient methods.

Paper Formatting Concerns

N/A

Author Response

We genuinely appreciate the reviewer’s insightful and helpful feedback. Below, we provide detailed responses to each of the concerns raised.

The term "block" in the paper is not clearly defined. It could be layers, modules, or attention heads, while this paper only studies layers. From the formulation, it seems we can also find important attention heads, which could help localize knowledge better since they are subcomponents of layers…

Can we apply this framework to identify important attention heads? What if we intervene on those heads only, for example, causal tracing or finetuning?

We appreciate the reviewer's insightful comments regarding the potential benefits of analyzing attention heads individually. To address this concern and further strengthen our work, we extended our experiments to perform localization at the attention head level within each attention block. Specifically, we applied our localization framework to identify the top $K'$ important heads (out of 16) within the top $K=6$ attention blocks of the PixArt model. We used attention contribution to rank heads and evaluated the effectiveness of these heads using prompt intervention at the head level.

| $K'$ | CLIP Score ($\downarrow$) | LLaVA ($\downarrow$) |
|---|---|---|
| 0 | 0.2855 | 0.9478 |
| 4 | 0.2579 | 0.8063 |
| 8 | 0.2506 | 0.7174 |
| 12 | 0.2487 | 0.6797 |
| 16 | 0.2435 | 0.6378 |

where $K'=0$ represents the baseline (no intervention) and $K'=16$ represents the block-level localization and intervention. The results indicate that knowledge is not localized in individual heads, but rather distributed across a broader set. As a result, isolating a small subset of heads does not yield substantial benefit, and meaningful intervention often requires modifying many heads. This supports our focus on block-level localization, which provides a more practical and effective abstraction for identifying and intervening on knowledge in diffusion transformers.
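To illustrate how such a ranking can be computed at either granularity, below is a small sketch that scores blocks (and heads within a block) by a simple attention-mass proxy on the target text tokens. This is an illustrative approximation with made-up array names, not the exact attention-contribution metric from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
num_blocks, heads, n_img, n_txt = 28, 16, 256, 32

# Stand-in post-softmax attention maps: (blocks, heads, image_tokens, text_tokens),
# where each row over text tokens sums to 1.
attn_maps = rng.dirichlet(np.ones(n_txt), size=(num_blocks, heads, n_img))
target_ids = [3, 4]  # positions of the target-knowledge tokens in the prompt

def block_score(attn, token_ids):
    # Mean attention mass that image tokens place on the target text tokens.
    return attn[:, :, token_ids].sum(axis=-1).mean()

scores = np.array([block_score(a, target_ids) for a in attn_maps])
top_blocks = np.argsort(scores)[::-1][:6]
print("top-K blocks:", top_blocks)

# The same proxy applied one level down, per head inside the best block:
head_scores = attn_maps[top_blocks[0]][:, :, target_ids].sum(axis=-1).mean(axis=-1)
print("top heads:", np.argsort(head_scores)[::-1][:4])
```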

… Furthermore, this approach identifies important layers from attention contribution, while a transformer layer contains other modules as well, such as MLP and normalization layers. However, this approach does not take into account the effect of these modules.

We also appreciate the reviewer's suggestion to investigate the role of other modules within the transformer block. We fully understand this concern and agree that it is important to evaluate the contribution of components beyond cross-attention. We would first like to emphasize that in models like PixArt, the cross-attention module is the primary and only mechanism through which image tokens receive information from the text tokens. This makes it a natural and impactful target for intervention and localization in these models. Nonetheless, to more thoroughly address the reviewer's point, we conducted an ablation study examining the individual roles of other modules within the transformer block. Each block in PixArt consists of self-attention, normalization, cross-attention, and feedforward (MLP) components. For this analysis, we first localized the top $K$ most important blocks. Then, for each module within these blocks, we performed isolated interventions by running two forward passes at each denoising step—one with a knowledge-specific prompt (containing the target information) and one with a knowledge-agnostic prompt. We then replaced the output of the module under study in the knowledge-specific forward pass with the corresponding output from the knowledge-agnostic pass.

| Intervention Block Type | CLIP Score ($\downarrow$) | LLaVA ($\downarrow$) |
|---|---|---|
| Baseline (No Intervention) | 0.2854 | 0.9588 |
| Normalization | 0.2830 | 0.9537 |
| FeedForward | 0.2802 | 0.9437 |
| Self-Attention | 0.2770 | 0.9301 |
| Cross-Attention | 0.2628 | 0.8636 |

As shown in the results, interventions on the cross-attention module yield the strongest localization signal, reinforcing our original design choice. We thank the reviewer for this valuable suggestion, which has helped us strengthen the empirical foundation of our work. We will include the full results and experimental details in the final version of the paper.
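As a simplified, single-block illustration of the output-swap intervention described above (toy stand-in modules with hypothetical names, and, for simplicity, identical hidden states fed to the block in both passes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for two submodules of one block.
def cross_attention(h, txt):
    return h + 0.1 * txt.mean(axis=0, keepdims=True)

def feed_forward(h, txt):  # txt unused; kept for a uniform interface
    return h + 0.05 * np.tanh(h)

def block_forward(h, txt, patch=None):
    # patch maps a module name to a pre-computed replacement output.
    for name, module in [("cross_attention", cross_attention), ("feed_forward", feed_forward)]:
        out = module(h, txt)
        if patch and name in patch:
            out = patch[name]  # splice in the knowledge-agnostic output
        h = out
    return h

h0 = rng.normal(size=(256, 64))
txt_spec, txt_agn = rng.normal(size=(16, 64)), rng.normal(size=(16, 64))

# Pass 1: knowledge-agnostic prompt, record the output of the module under study.
agn_cross_out = cross_attention(h0, txt_agn)
# Pass 2: knowledge-specific prompt, with that module's output replaced.
h_patched = block_forward(h0, txt_spec, patch={"cross_attention": agn_cross_out})
print(h_patched.shape)  # (256, 64)
```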

The motivation for a separate work for transformer-based models is not clear. Previous works [1,2] have already localized important layers in diffusion models; this paper lacks a discussion on the challenge of those methods on transformer-based models. It'd be more convincing if you could provide the difference between the important layers identified by this framework and other methods.

The reviewer raises a valid and important point. Both [1] and [2] employ "causal tracing" to identify localized layers useful for downstream tasks such as unlearning. While such a technique can theoretically be applied to DiTs, [2] explicitly notes in Section M of its Appendix the challenges of applying these methods to diffusion models with a T5-based text encoder: the authors were unable to modify/edit layers amenable to such analysis in T5-based models, which would be useful for downstream applications such as unlearning. Since the modern DiT architecture predominantly relies on T5, we chose to explore alternative approaches, as in our paper, that scale better to this setting. In fact, our initial attempts to apply causal tracing to DiTs did not yield promising results, and we were not able to identify localized layers corresponding to different visual concepts. This observation further motivated our proposed method, which—unlike previous approaches that rely on computationally expensive brute-force searches for localization—leverages efficient attention-based attribution signals. Our method scales effectively to large architectures and avoids the need for exhaustive layer-wise searches. We will add this note in the final camera-ready version of the paper to provide a more comprehensive picture.

This work fine-tunes only important layers and observes an improvement over full finetuning. However, it's not conclusive enough to connect the role of knowledge localization to finetuning methods. The experiments should also include other settings where we fine-tune random layers or the least important layers and show that the performance in those cases is worse than full-finetuning.

What is the performance when we fine-tune random or unimportant layers?

We greatly appreciate the reviewer’s suggestion to incorporate additional baselines such as random block selection and selection of the least important blocks. This helps evaluate both the specificity of our localization method and its impact on downstream applications like unlearning.

To begin, we conducted experiments to assess the localization performance of these alternative block selection strategies. While our main method selects the top-$K$ most important blocks based on our localization metric, we compared it against two baselines: (1) randomly selected $K$ blocks, and (2) the $K$ least important blocks (referred to as Bottom-K in the tables below). The table below reports prompt intervention results for each of these selection strategies.

| Category | Block Selection Policy | CLIP ($\downarrow$) | LLaVA ($\downarrow$) |
|---|---|---|---|
| Copyright | Baseline (No Intervention) | 0.2854 | 0.9588 |
| Copyright | Bottom-K | 0.2845 | 0.9585 |
| Copyright | Random | 0.2756 | 0.9117 |
| Copyright | Top-K (Ours) | **0.2337** | **0.5463** |
| Celebrity | Baseline (No Intervention) | 0.2808 | 0.7607 |
| Celebrity | Bottom-K | 0.2801 | 0.7602 |
| Celebrity | Random | 0.2753 | 0.6857 |
| Celebrity | Top-K (Ours) | **0.2420** | **0.2743** |

As shown, the Top-$K$ selection consistently outperforms both the random and bottom-$K$ baselines by a large margin, demonstrating the effectiveness of our localization method in identifying knowledge-bearing blocks.
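For reference, the three policies differ only in how block indices are drawn from the per-block importance scores; a minimal sketch with toy scores (hypothetical names, not the actual evaluation code):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(28)   # stand-in per-block importance (e.g. attention contribution)
K = 6

top_k = np.argsort(scores)[::-1][:K]      # most important blocks (ours)
bottom_k = np.argsort(scores)[:K]         # least important blocks
random_k = rng.choice(len(scores), size=K, replace=False)

print(sorted(top_k), sorted(bottom_k), sorted(random_k))
```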

Further, to evaluate the effect of these selection strategies on finetuning-based unlearning, we applied localized unlearning using the same three block selection methods. Specifically, we fine-tuned only the selected blocks using each of the three selection strategies. The results are shown below.

| Block Selection Policy | CLIP Score ($\downarrow$) | CLIP Acc. ($\downarrow$) |
|---|---|---|
| Baseline (No Finetuning) | 0.2735 | 0.91 |
| Bottom-K | 0.2636 | 0.82 |
| Random | 0.2526 | 0.73 |
| Top-K (Ours) | **0.2293** | **0.55** |

These results show that finetuning the Top-$K$ most important blocks leads to significantly better unlearning performance compared to random or bottom-$K$ selections, reinforcing the relevance of our localization strategy. We thank the reviewer for this valuable suggestion, which has helped us strengthen the empirical rigor of our work. We will include the full set of results and implementation details in the final camera-ready version of the paper.


We hope these clarifications address the reviewer’s concerns, and we thank them again for their valuable insights.

Comment

We thank you again for your valuable feedback and comments which have helped to strengthen our paper. As the discussion period is ending soon, we would really appreciate if you could let us know if our responses have addressed your concerns. We will be happy to answer any further questions and address any remaining concerns to the best of our abilities in the remaining time!

Comment

Thank you for your response.

  • I believe that Table 1 should be interpreted as "not all attention heads contribute equally to knowledge". Specifically, 8 of 16 heads account for $\frac{0.2855 - 0.2506}{0.2855 - 0.2435} = 83.09\%$ of the change in CLIP score and $\frac{94.78 - 71.74}{94.78 - 63.78} = 74.32\%$ of the change in LLaVA score. Furthermore, the effect of localized modules on finetuning is also questioned in prior work [1].

  • Regarding the motivation, if I understand correctly, Section M in [2] argues that their method does not work well due to the T5 text encoder, not the transformer architecture as studied in this paper. It's still unclear how the attention signal in your approach helps tackle that problem. Additionally, they claim that their method fails to edit knowledge, rather than pointing to the efficiency problem mentioned in the comment.

  • It'd be helpful if you could compare with the performance when finetuning the whole model. The main purpose of this approach, as mentioned in the paper, is to reduce the computational cost of full finetuning while improving the task performance. I am not sure if localized finetuning actually helps or if finetuning an arbitrary subset of modules could do the same. Furthermore, I assume that the results in the last table are similar to the setting in Figure 9? However, it seems the base model achieves a CLIP score of around 0.22, not 0.2735. Could you explain that discrepancy? Finally, it'd be better if you could provide the computational cost of full finetuning and localized finetuning to support your claim.

[1] Hase, Peter, et al. "Does localization inform editing? surprising differences in causality-based localization vs. knowledge editing in language models." NeurIPS 2023

[2] Basu, Samyadeep, et al. "On mechanistic knowledge localization in text-to-image generative models." Forty-first International Conference on Machine Learning. 2024.

Comment

We greatly appreciate the reviewer’s insightful comments and detailed response, which have guided us in enhancing our work. Below, we provide detailed responses to each of the concerns raised.

  • Thank you for the valuable insight that "not all attention heads contribute equally to knowledge". We will incorporate this perspective into the final version of our paper. We would like to emphasize two key points: (1) our method is generalizable to subcomponents such as individual attention heads, allowing for head-level localization; and (2) effective interventions should be performed either at the layer level or across a sufficiently large set of attention heads. As shown in our experiments, targeting only a small subset (e.g., $K=4$) of attention heads does not capture the full extent of the relevant knowledge.

    In prior work, [1] has shown that for language models, localization does not lead to effective model editing, as certain layers produce similar editing efficacy to the localized ones. However, in contrast, we find that this observation does not hold for diffusion models. Specifically, our findings indicate that fine-tuning localized layers for unlearning results in better performance compared to layers that are not localized (or random layers). We will ensure to refer to [1] in the final version of the paper and highlight the discrepancy between our findings in text-to-image models and those observed in LLMs.

    Additionally, [2] reports that editing non-causal layers in diffusion models within the CLIP transformer architecture does not lead to any significant improvements in editing efficacy. Our observation aligns with this in some ways: specifically, we observe that fine-tuning localized layers leads to significant improvements in editing efficacy over fine-tuning non-localized layers.

  • The primary motivation of our paper is to localize knowledge in Diffusion Transformers, enabling targeted interventions that can improve applications like unlearning and personalization. By focusing interventions on specific areas, we can often achieve better task performance while preserving other capabilities of the model (also shown in our paper).

    In [2], the authors apply the causal tracing framework to the UNet and text-encoder (both CLIP and T5). Specifically, for modern diffusion models like DeepFloyd, which employ a T5 text-encoder, they were unable to identify any localization signals in either the UNet or the T5 encoder (see Fig. 2 and Fig. 3).

    In our early experiments with PixArt-alpha, we applied causal tracing to localize knowledge but obtained results similar to DeepFloyd, as shown in Fig. 2 of [2]. We were unable to detect strong localization signals in either the noise transformer or the T5 encoder. We plan to include plots similar to Fig. 2 of [2] in our final camera-ready version and highlight the main takeaway from applying causal tracing to the DiT stack.

    Given the pessimistic conclusions regarding the adaptation of causal tracing to modern diffusion model stacks (as noted in Sec. M of [2] regarding the negative results for unlearning applications), along with our own early results for PixArt-alpha, we decided not to pursue causal tracing further. Instead, we focused on developing newer methods for localizing knowledge in recent diffusion models.

    We also note that the localization method proposed in [2] is essentially a brute-force approach, which is computationally expensive. For instance, even if the target knowledge resides in just one block, a model like PixArt with 28 blocks would require evaluating all 28 blocks individually, resulting in $28\times$ the work. In contrast, our method only requires $1\times$ the effort. As the number of relevant blocks increases from 1 to $k$, the computational cost becomes even more significant. Specifically, brute-force localization requires evaluating all $\binom{N}{k}$ combinations for a model with $N$ blocks, whereas our approach remains constant at $1\times$ regardless of $k$.

    For these two reasons, we designed newer localization methods for DiTs, which lead to improved and more efficient localization results and also improve downstream tasks, as shown in our paper.
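As a rough arithmetic illustration of the brute-force search cost discussed above (counting candidate block subsets only, not measured runtimes):

```python
from math import comb

N = 28  # blocks in PixArt
for k in (1, 3, 5):
    # Brute-force localization must consider every size-k subset of blocks.
    print(k, comb(N, k))  # 28, 3276, 98280
# Attention-based localization instead ranks all blocks from a single pass,
# so its cost does not grow with k.
```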

Comment
  • The performance of full finetuning in the above experiments yields a CLIP score of $0.2278$ and a CLIP accuracy of $0.52$. As discussed in Section 4.3 of our paper, while localized finetuning leads to improved task performance in personalization settings, in the case of model unlearning it achieves performance comparable to full finetuning—with the added benefit of improved training efficiency, better preservation of surrounding concepts and knowledge (as shown by a higher CLIP score for surrounding knowledge with localized finetuning: 26.15 vs. 24.31 for full finetuning in the experiments above), and preservation of the model's prior performance (reflected in a lower FID score for localized finetuning, as shown in Figure 9).

    Furthermore, in comparison to arbitrary block selection (as shown in the row labeled as random in the table), our localized finetuning approach performs significantly better (~2x drop in the CLIP score and accuracy), demonstrating that our method is both targeted and effective.

    Regarding the apparent discrepancy in CLIP scores: Figure 9 presents results for the style category, whereas the experiments referenced above are conducted on the copyright category. We will ensure that these distinctions are clearly reported in the final camera-ready version.

    In terms of training efficiency, under the same setup using an RTX6000 GPU, full finetuning takes approximately 877 seconds, while localized finetuning completes in about 707 seconds—yielding a ~19% speedup. Regarding memory usage, full finetuning consumes 41.4 GB of GPU memory compared to 30.2 GB for localized finetuning, resulting in a ~27% reduction. It's also important to highlight that these results are obtained using an optimized training setup that includes techniques such as gradient checkpointing. Without such optimizations, the memory savings from localized finetuning become even more pronounced—reaching up to ~70%.

Comment

Thank you for your detailed comment. I believe that the analysis of attention heads would complement the paper.

Regarding the motivation, it's still unclear to me how the attention contribution could localize knowledge in diffusion models that use T5 encoder while [2] couldn't. Theoretically, [2] performs intervention on the model and therefore should pinpoint the causal effect and locate important layers. From the current state, it seems that the paper only empirically observes that attention contribution is a stronger signal; a deeper explanation in Section 3.2 would strengthen the paper. However, I agree with the author that the brute-force approach in [2] could be a challenge, and the proposed method helps mitigate that.

Regarding the experimental results of random-, full-, and top-k finetuning, it'd be better if the author could use a consistent setup on diverse knowledge, i.e., different animals, styles, and objects.

  • I am also concerned about the practicality of this work. It seems that full-finetuning is only slower by less than 3 minutes, which is not too significant. Also, we need to take into account the computation needed to localize knowledge and the hyperparameter search process to find $K$. The results in the paper also give minimal practical insights; I am not sure which points in Figures 7 and 9 I should take when personalizing or unlearning diffusion models. Additionally, the experiments only use one personalization method (DreamBooth) and one unlearning method (Concept Ablating), which could be outdated compared to more recent and efficient methods such as InstantBooth or RECE.

For those reasons, I decided to keep the original rating.

Shi, Jing, et al. "Instantbooth: Personalized text-to-image generation without test-time finetuning." CVPR 2024

Gong, Chao, et al. "Reliable and efficient concept erasure of text-to-image diffusion models." ECCV 2024

Comment

We thank the reviewer for their time and constructive feedback. We respond to each of your concerns below:

Regarding the motivation, it's still unclear to me how the attention contribution could localize knowledge in diffusion models that use T5 encoder while [2] couldn't. Theoretically, [2] performs intervention on the model and therefore should pinpoint the causal effect and locate important layers.

We appreciate the opportunity to clarify this. There seems to be a misunderstanding here: the authors of [R1] initially attempted to apply causal tracing from [R2] to newer architectures at the time and reported that the method failed to yield any meaningful signal, which aligns with our own early observations with causal tracing. To address this limitation, they later introduced an intervention-based brute-force approach, which is indeed applicable to diffusion models like ours. However, the key drawback is its high computational cost: for a model with $N$ layers, the method requires evaluating all $\binom{N}{k}$ combinations for each value of $k$, making it impractical for real-world use. In contrast, our method offers a principled and scalable alternative by leveraging attention contribution signals. As shown in Appendix B.2, our approach achieves comparable localization performance to a brute-force-like method while being significantly more efficient. We believe this efficiency is a key contribution of our work.

Regarding the experimental results of random-, full-, and top-k finetuning, it'd be better if the author could use a consistent setup on diverse knowledge, i.e., different animals, styles, and objects.

We agree that evaluating on a diverse set of concepts is important. During the rebuttal, we included an additional case to demonstrate generalization beyond the original “style” example. We plan to extend this analysis to other categories in the final version. Based on our findings so far, we expect the results to remain consistent across these diverse concept types.

It seems that full-finetuning is only slower by less than 3 minutes, which is not too significant.

We note that the ~3-minute speedup refers to a single concept. In practice, users often unlearn hundreds or even thousands of concepts. In such settings, our method can save days of computation and significantly reduce GPU memory usage. Moreover, these savings are achieved without compromising the effectiveness of the downstream task while also better preserving the surrounding knowledge.
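As a back-of-the-envelope illustration using the single-concept timings reported earlier in this thread (877 s vs. 707 s), and assuming the per-concept saving scales roughly linearly with the number of concepts:

```python
saved_per_concept_s = 877 - 707   # per-concept wall-clock saving from the timings above
n_concepts = 1000                 # hypothetical large-scale unlearning workload
print(saved_per_concept_s * n_concepts / 3600)  # ~47.2 GPU-hours, i.e. roughly two days
```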

Additionally, the experiments only use one personalization method (DreamBooth) and unlearning method (Concept Ablating), which could be outdated compared to more recent and efficient methods such as InstantBooth or RECE.

This is a valuable suggestion. We wish this concern had been raised earlier during the rebuttal period, as we would have been happy to run these additional experiments. We plan to include comparisons to newer methods in the final version of the paper. That said, we expect our method to complement recent approaches in the same way it does with DreamBooth and Concept Ablating.


We are grateful for your feedback, which helped us expand our analysis and improve the paper. During the rebuttal, we addressed the key points you raised initially, including:

  • Extending our localization method to attention heads
  • Analyzing various submodules within DiT blocks, and finding that cross-attention shows the strongest localization signal
  • Demonstrating significant improvements in computational efficiency compared to brute-force localization approaches
  • Showing that our top-k localization consistently outperforms both random-k and least-k alternatives in finetuning

We hope these additional results and clarifications convey the practical value and conceptual contribution of our method more clearly. We truly appreciate your time and constructive review, and we hope you’ll consider reflecting this in your final score.


[R1] Basu, Samyadeep, et al. "On mechanistic knowledge localization in text-to-image generative models." ICML, 2024.

[R2] Basu, Samyadeep, et al. "Localizing and editing knowledge in text-to-image generative models." (2024).

Comment

Thank you for your comment.

  • I acknowledge that efficiency is the main advantage of the proposed framework compared to [R1]. That said, as mentioned earlier, I find that the current version lacks an in-depth discussion and justification of why this approach offers more accurate localization. This is a bit counterintuitive to me, since causal intervention should work theoretically. I believe that this discussion would bring beneficial insights to the mechanistic interpretability community.

  • If I understand correctly, the localization will be different for different knowledge concepts. Therefore, if we need to unlearn thousands of concepts, we also need to run the localization method thousands of times.

  • I do not expect the author to provide additional results due to the limited discussion time. However, I am concerned about the practicality of localized finetuning, since the efficiency advantage seems trivial, and it would be even more insignificant on recent advanced efficient methods.

Review
Rating: 5

This paper tackles the problem of locating transformer blocks that are responsible for, or critical to, generating certain knowledge, such as a certain object or a certain style. The authors analyze cross-attention strength at each layer and use prompt ablation for a given layer to evaluate its importance for the goal. They then apply this knowledge to efficient finetuning (DreamBooth personalization and unlearning), where they focus on tuning the most critical blocks; this leads to more efficient finetuning and smaller off-target effects.

Strengths and Weaknesses

Strengths

  • They tackle an open question of interpretability in diffusion models, especially transformers, and they show the generality of their method on a wide range of recent open-source models.
  • The method was used to locate the most relevant blocks, and this knowledge was used for localized finetuning, which is effective, efficient, and easy to use.
  • The method is simple to use and principled.

Weaknesses

  • Even though the paper talks about knowledge, what the authors mean by knowledge is not totally clear. Maybe a definition could be added somewhere. Do you mean knowledge as a word-to-visual-content mapping, or does knowledge just mean a certain word / meaning?
  • A deeper understanding of where knowledge resides inside each block was not explored (cf. localizing knowledge in language models). So from a mechanistic interpretability perspective it's relatively weak.

Questions

  • Conceptually, how do the authors think about where the "knowledge" is located inside each block?
  • Maybe it's out of scope, but as the authors discussed in the limitations, have you observed any structural / intrinsic property of attention that correlates with the importance of the block to the final perturbation outcome?
  • Even though fewer blocks were finetuned, the speedup is not that dramatic; is it because it still needs the full backward pass to compute gradients for each target block?
  • It seems the attention contribution method is not limited to DiT models, no? UNet-based Stable Diffusion 1 also uses cross-attention to fetch textual information to guide generation, so the method should be applicable there too?

Limitations

The authors clearly discussed limitations in the supplementary material.

Final Justification

The new experiments address my concern about the depth of the paper. With the clearer definition of "knowledge," the within-block fine-grained localization experiments, and the strong correlation between output norm and block importance, the paper's findings are both deeper and more impactful.

Paper Formatting Concerns

No.

Author Response

We sincerely thank the reviewer for their thoughtful and constructive feedback. Below, we address the reviewer’s concerns in detail:

Even though the paper talks about knowledge, what the authors mean by knowledge is not totally clear. Maybe a definition could be added somewhere. Do you mean knowledge as a word-to-visual-content mapping, or does knowledge just mean a certain word / meaning?

In our paper, we use the term knowledge to refer to various forms of visual concepts, such as stylistic elements, safety considerations, and popular locations. In particular, these are represented as textual tokens that guide the generation of specific visual content—for example, a particular set of tokens may correspond to the generation of Van Gogh’s artistic style, as noted by the reviewer. We provide detailed descriptions of these categories in Appendix Sec. B.1. Notably, such definitions of visual concepts have been explored in prior work on diffusion model interpretability [1, 2, 3]. In the final version of the paper, we will include a formal definition of knowledge and explicitly relate it to these prior works, as highlighted in this rebuttal.

A deeper understanding of where knowledge resides inside each block was not explored (cf. localizing knowledge in language models). So from a mechanistic interpretability perspective it's relatively weak.

Conceptually, how do the authors think about where the "knowledge" is located inside each block?

The reviewer raises an insightful question regarding the localization of knowledge within individual blocks. In response, we extended our framework to explore whether knowledge can be further localized to finer-grained components, such as individual attention heads. Specifically, we applied our localization method to identify the top $K'$ most influential heads (out of 16) within the top $K=6$ attention blocks of the PixArt model. Heads were ranked based on their attention contribution scores, and their impact was assessed by intervening on the attention heads.

| $K'$ | CLIP Score ($\downarrow$) | LLaVA ($\downarrow$) |
|---|---|---|
| 0 | 0.2855 | 0.9478 |
| 4 | 0.2579 | 0.8063 |
| 8 | 0.2506 | 0.7174 |
| 12 | 0.2487 | 0.6797 |
| 16 | 0.2435 | 0.6378 |

where $K'=0$ represents the baseline (no intervention) and $K'=16$ represents the block-level localization and intervention. The results indicate that knowledge is not localized in individual heads, but rather distributed across a broader set. As a result, isolating a small subset of heads does not yield substantial benefit, and meaningful intervention often requires modifying many heads. This supports our focus on block-level localization, which provides a more practical and effective abstraction for identifying and intervening on knowledge in diffusion transformers.

Furthermore, we conducted an ablation study examining the individual roles of other modules within the transformer block. Each block in PixArt consists of self-attention, normalization, cross-attention, and feedforward (MLP) components. For this analysis, we first localized the top $K$ most important blocks. Then, for each module within these blocks, we performed isolated interventions by running two forward passes at each denoising step—one with a knowledge-specific prompt (containing the target information) and one with a knowledge-agnostic prompt. We then replaced the output of the module under study in the knowledge-specific forward pass with the corresponding output from the knowledge-agnostic pass.

| Intervention Block Type | CLIP Score ($\downarrow$) | LLaVA ($\downarrow$) |
|---|---|---|
| Baseline (No Intervention) | 0.2854 | 0.9588 |
| Normalization | 0.2830 | 0.9537 |
| FeedForward | 0.2802 | 0.9437 |
| Self-Attention | 0.2770 | 0.9301 |
| Cross-Attention | 0.2628 | 0.8636 |

As shown in the results, interventions on the cross-attention module yield the strongest localization signal, reinforcing our original design choice. We thank the reviewer for this valuable suggestion, which has helped us strengthen the empirical foundation of our work. We will include the full results and experimental details in the final camera-ready version of our paper.

Maybe it's out of scope, but as the authors discussed in the limitations, have you observed any structural / intrinsic property of attention that correlates with the importance of the block to the final perturbation outcome?

We thank the reviewer for the insightful question. While our main focus was not on characterizing the intrinsic properties of attention that correlate with block importance, we agree this is a valuable direction. To further investigate and more directly answer the reviewer’s concern, we conducted a series of additional ablation experiments during the rebuttal phase to explore this connection more systematically.

We analyzed several structural properties of attention blocks and measured their correlation with the attention contribution metric (used in our paper as a proxy for block importance). Specifically, we considered the following metrics:

  • Attention Map Aggregated Norm: For each block, we extracted the attention maps corresponding to the text tokens of interest attending to image tokens, across all heads, and computed the average $\ell_2$ norm of these attention weights.
  • Attention Entropy: Using the attention scores (after softmax), we calculated the entropy of each head's distribution via $-\sum_j M_{i,j} \log M_{i,j}$ (computed for each image token $i$), then aggregated the mean and variance of these entropies across heads and image tokens.
  • Output Norm: We computed the $\ell_2$ norm of the attention block's output for image tokens and aggregated the result across tokens.
  • Output/Input Norm Ratio: To assess the extent of transformation in each block, we measured the ratio between the output norm and the input norm for image tokens, giving an estimate of how much the representation is altered.
  • Attention Contribution (as discussed in the main paper)

We performed these analyses over $M$ generations using diverse knowledge-specific prompts, and for a model with $N$ blocks (the results below are for the PixArt model with 28 blocks), we obtained $M \times N$ values per metric. We then computed Pearson and Spearman correlations between each metric and the attention contribution scores across all generations.

| Property of Attention | Pearson Correlation | Spearman Correlation |
|---|---|---|
| Attention Map Agg. Norm | 0.1477 | 0.3444 |
| Mean Entropy | -0.1690 | -0.3489 |
| Variance Entropy | 0.0162 | 0.2237 |
| Output Norm | 0.9722 | 0.9245 |
| Output/Input Norm Ratio | 0.8365 | 0.8256 |

As shown in the table, some metrics, such as the output norm, exhibit a meaningful correlation with block importance. These findings provide preliminary evidence that certain internal signals in the attention mechanism do reflect the role of the block in knowledge localization.
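For completeness, a compact sketch of how such per-block statistics and their Pearson/Spearman correlations can be computed; the tensors below are random stand-ins (so the printed values are meaningless), the code is illustrative rather than our analysis script, and scipy is assumed available for the correlation functions:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
M, N, heads, n_img, n_txt, d = 10, 28, 16, 64, 32, 128  # toy sizes

attn = rng.dirichlet(np.ones(n_txt), size=(M, N, heads, n_img))  # post-softmax maps
x_in = rng.normal(size=(M, N, n_img, d))                          # block inputs (image tokens)
x_out = x_in + 0.1 * rng.normal(size=(M, N, n_img, d))            # block outputs
contrib = rng.random((M, N))                                      # attention-contribution scores

mean_entropy = -(attn * np.log(attn + 1e-12)).sum(-1).mean((-1, -2))  # (M, N)
output_norm = np.linalg.norm(x_out, axis=-1).mean(-1)                 # (M, N)
norm_ratio = output_norm / np.linalg.norm(x_in, axis=-1).mean(-1)

for name, metric in [("mean entropy", mean_entropy),
                     ("output norm", output_norm),
                     ("out/in norm ratio", norm_ratio)]:
    r, _ = pearsonr(metric.ravel(), contrib.ravel())
    rho, _ = spearmanr(metric.ravel(), contrib.ravel())
    print(f"{name}: pearson={r:.3f}, spearman={rho:.3f}")
```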

We are grateful to the reviewer for this suggestion, which helped us further deepen our analysis. We will include the results and detailed methodology of these experiments in the final camera-ready version of the paper.

Even though fewer blocks were finetuned, the speedup is not that dramatic; is it because it still needs the full backward pass to compute gradients for each target block?

Yes, that’s exactly right. To update the earlier, localized blocks, gradients must still be computed through the later, unlocalized blocks so that the backward signal can propagate all the way back. As a result, while we do observe a noticeable speed-up by restricting updates to fewer blocks, the improvement is not dramatic due to the full backward pass still being required.

It seems the attention contribution method is not limited to DiT models, no? UNet-based Stable Diffusion 1 also uses cross-attention to fetch textual information to guide generation, so the method should be applicable there too?

While in principle the attention contribution method could be applied to UNet-based architectures, in practice it's more challenging due to the architectural differences. In UNet, the spatial resolution and feature dimensions change across layers (e.g., in the encoder, features become deeper and the spatial size shrinks). This variability affects the structure and semantics of the attention, making the contribution signals less interpretable and harder to align across blocks. As a result, we don’t obtain a straightforward or consistent signal as we do in DiT-style models.


We hope these clarifications address the reviewer’s concerns, and we thank them again for their valuable insights.


[1] Gandikota, Rohit, et al. "Erasing concepts from diffusion models." Proceedings of the IEEE/CVF international conference on computer vision. 2023.

[2] Basu, Samyadeep, et al. "Localizing and editing knowledge in text-to-image generative models." (2024).

[3] Basu, Samyadeep, et al. "On mechanistic knowledge localization in text-to-image generative models." ICML, 2024.

Comment

We thank you again for your valuable feedback and comments which have helped to strengthen our paper. As the discussion period is ending soon, we would really appreciate if you could let us know if our responses have addressed your concerns. We will be happy to answer any further questions and address any remaining concerns to the best of our abilities in the remaining time!

Comment

We applaud the authors’ extensive new experiments and their thorough response to our concerns. With the clearer definition of “knowledge,” the within block fine-grained localization experiments, and the strong correlation between output norm and block importance, the paper’s findings are both deeper and more impactful. Accordingly, we will raise our evaluation to 5.

To confirm our understanding: knowledge appears to reside within the cross-attention blocks as a whole, rather than in any individual heads, and the output norm of each cross-attention block reliably predicts its true contribution. Is that correct?

Comment

We sincerely thank the reviewers for their thoughtful and constructive feedback, which has helped us strengthen our work. Regarding your question: yes, that interpretation is correct. While our method can indeed be extended to localize individual attention heads, we found that effective interventions are best performed either at the layer level or across a sufficiently large set of heads. As a result, we have primarily focused on layer-level interventions in our experiments. Furthermore, we confirm that the output norm of each cross-attention block reliably predicts its contribution. We will include all corresponding ablations in the final version of the paper.

Final Decision

This paper introduces a method to localize knowledge within the blocks of Diffusion Transformer (DiT) architectures and applies it to tasks such as personalization and unlearning. The reviewers generally agree on the technical soundness and relevance of the proposed framework. Reviewer vyfy supports acceptance, citing the generality and utility of the method, and notes that concerns about clarity and depth were addressed through additional experiments and discussion. Reviewer gcMX is moderately supportive (borderline accept), appreciating the thorough experiments but requesting better justification for some empirical choices and a clearer explanation of architectural differences across models. Reviewer AmLP, however, remains more skeptical (borderline reject), acknowledging that many initial concerns were addressed but still questioning the theoretical grounding of attention-based localization versus causal tracing, the practicality of the proposed efficiency improvements, and the limited use of only older baseline methods. Although AmLP raises late-stage concerns regarding baseline choice, the authors correctly note these were not raised during the initial review and should not weigh heavily at this stage. Given the novelty of studying knowledge localization specifically in DiT architectures, the empirical effectiveness of the proposed approach, and the extensive rebuttal efforts that clarified the methodology and addressed concerns, the AC recommends acceptance and hopes the authors will adopt the reviewers' suggestions in the final version.