CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up
We propose a convolution-like linearization strategy that accelerates pre-trained diffusion transformers for ultra-resolution image generation.
Abstract
Reviews and Discussion
This paper proposes a convolution-like local attention mechanism called CLEAR, designed to linearize existing DiT models. By restricting each query token to interact only with tokens within its local neighborhood during attention computation, CLEAR significantly accelerates high-resolution image generation.
Strengths and Weaknesses
Strengths
- The paper thoroughly analyzes various efficient attention mechanisms and conducts comprehensive evaluations and validations on image generation performance at a resolution of 1024×1024.
- While maintaining generation quality, CLEAR achieves a 99.5% reduction in attention computation and a 6.3× speed-up, demonstrating particularly significant advantages in 8K image generation tasks.
Weaknesses
- In Table 2, are the different efficient attention methods also evaluated based on distilled models from FLUX-1.dev with their attention layers replaced accordingly? Additionally, since both Swin attention and CLEAR adopt local attention, differing mainly in window shape, is it theoretically possible to adjust Swin’s window size or shape to approximate CLEAR’s performance?
- Is the model’s ability to generate 8K-resolution images primarily distilled from FLUX-1.dev, or is it achieved through additional data-specific fine-tuning? Moreover, how does the method address the common issue of repeated content or artifacts in high-resolution image generation?
- The ablation study appears insufficient, as the individual contributions of the two additional loss terms introduced (prediction loss and attention loss) are not separately verified. Also, does each choice of attention radius 𝑟 require training a separate model, or can one model generalize across different domain sizes?
Questions
My major concerns are the weakness. I would like the authors to directly address these concerns in their response. My final rating is subject to change based on the quality and clarity of the authors’ feedback.
Limitations
The authors are encouraged to answer the questions and address the weakness above.
Format Issues
No format issues.
We deeply thank Reviewer BEVg for the valuable comments and are glad that the reviewer finds our analysis thorough, evaluations comprehensive, and methods effective. We would like to address the concerns as below.
- (W1A: Whether all the methods are evaluated on distilled models from FLUX) In Table 2, are the different efficient attention methods also evaluated based on distilled models from FLUX-1.dev with their attention layers replaced accordingly?
Thanks for the good question. Yes. All the attention methods are implemented and compared on FLUX-1.dev. We will clarify this in the revision.
- (W1B: Approximation based on Swin Transformer) Additionally, since both Swin attention and CLEAR adopt local attention, differing mainly in window shape, is it theoretically possible to adjust Swin’s window size or shape to approximate CLEAR’s performance?
Thanks for the insightful question. In our initial exploration, we indeed tried adjusting the window size of the Swin Transformer. Unfortunately, we found that it still results in the "grid" artifact, similar to Fig. 3. We speculate that this is caused by the non-overlapping window partition used in each attention layer.
Nevertheless, we do find an approximate connection between CLEAR and the Swin Transformer. The key is to apply an overlapping window partition instead. Specifically, CLEAR can be viewed as a form of local window attention with stride=1, and we show in the following study that stride=2 achieves comparable performance.
| | Aesthetic | Prompt Alignment | Overall | Win Rate vs Other | GenEval |
|---|---|---|---|---|---|
| Stride=2 | 89.41 | 91.19 | 88.37 | 0.54 | 0.665 |
| Stride=1 | 89.62 | 92.13 | 88.52 | - | 0.674 |
This property could be useful for hardware optimization. Thanks again to the reviewer for introducing a promising direction.
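For concreteness, below is a minimal sketch (our own illustrative code, not the paper's implementation) of how such an overlapped-window mask can be built in 1D: with stride=1 every query gets its own centered window, recovering CLEAR-style neighborhood attention, while stride=2 makes two adjacent queries share one window.

```python
import torch

def overlapped_window_mask(num_tokens: int, radius: int, stride: int) -> torch.Tensor:
    """Boolean [num_tokens, num_tokens] mask for 1D local-window attention.

    Window centers are snapped to multiples of `stride`, so stride=1 yields a
    per-query sliding window (CLEAR-like), while stride>1 shares each window
    among `stride` neighboring queries (overlapped window partition).
    """
    q_idx = torch.arange(num_tokens)
    k_idx = torch.arange(num_tokens)
    centers = (q_idx // stride) * stride                 # snap queries to window centers
    dist = (centers[:, None] - k_idx[None, :]).abs()     # distance from center to each key
    return dist <= radius                                # True = attention allowed

# Example: 16 tokens, radius 4; compare the two mask densities.
m1 = overlapped_window_mask(16, radius=4, stride=1)
m2 = overlapped_window_mask(16, radius=4, stride=2)
print(m1.float().mean().item(), m2.float().mean().item())
```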
- (W2A: The source of the high-resolution capacity) Is the model’s ability to generate 8K-resolution images primarily distilled from FLUX-1.dev, or is it achieved through additional data-specific fine-tuning?
Thanks for the valuable question. The model’s ability to generate 8K-resolution images comes from the ability of training-free resolution extrapolation of FLUX. In fact, it requires two steps to achieve efficient high-resolution generation. The first one is to adapt the original DiT to make it effective at higher scales. The second is to make it more efficient. CLEAR mainly addresses the second step.
- (W2B: The common issue of repeated content or artifacts) Moreover, how does the method address the common issue of repeated content or artifacts in high-resolution image generation?
Thanks for the insightful question. In fact, as shown in Line 311, we apply the SDEdit algorithm for high-resolution generation. Specifically, we first generate an image at the native resolution scale of the diffusion model. Then, we resize it to a larger size and add a certain amount of noise to it. A noise scale of 0.7 is adopted empirically in our experiments. Starting from this point, the model conducts the remaining denoising steps. In this way, the original image structures are preserved and low-level details are refined. In other words, the issue of repeated content is suppressed by skipping the initial denoising steps that determine the overall image layout. We include the following study to verify its effectiveness.
| | Aesthetic (2K*2K) | Prompt Alignment (2K*2K) | Overall (2K*2K) | Win Rate vs Other (2K*2K) | Aesthetic (4K*4K) | Prompt Alignment (4K*4K) | Overall (4K*4K) | Win Rate vs Other (4K*4K) |
|---|---|---|---|---|---|---|---|---|
| w/o SDEdit | 86.32 | 89.08 | 85.37 | 0.87 | 84.15 | 87.22 | 82.36 | 0.92 |
| Ours | 90.22 | 91.94 | 88.71 | - | 90.09 | 92.29 | 88.81 | - |
Moreover, as shown in Appendix E and Fig. 13, it is convenient to build CLEAR upon more advanced methods of training-free high-resolution generation, since our method is orthogonal to them. We will enhance the clarity of this part in the revision.
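For reference, here is a minimal sketch of the SDEdit-style coarse-to-fine procedure described above, assuming generic diffusers-style text-to-image and image-to-image pipelines; the function names and the mapping of the 0.7 noise scale to the img2img `strength` argument are our own illustrative assumptions, not the released code.

```python
from PIL import Image

def sdedit_upscale(pipe_t2i, pipe_i2i, prompt: str,
                   base_res: int = 1024, target_res: int = 4096,
                   noise_scale: float = 0.7, steps: int = 20) -> Image.Image:
    """Coarse-to-fine high-resolution generation via SDEdit.

    1) Generate at the model's native resolution to fix the global layout.
    2) Upsample the result to the target resolution.
    3) Re-noise to `noise_scale` and run only the remaining denoising steps
       at the target resolution to refine low-level details.
    """
    # Step 1: native-resolution generation (hypothetical diffusers-style call).
    base = pipe_t2i(prompt, height=base_res, width=base_res,
                    num_inference_steps=steps).images[0]

    # Step 2: naive upsampling; structure is preserved, details remain blurry.
    upscaled = base.resize((target_res, target_res), Image.LANCZOS)

    # Step 3: img2img refinement; `strength` plays the role of the noise scale,
    # so roughly the last 70% of the denoising trajectory runs at high resolution.
    refined = pipe_i2i(prompt, image=upscaled, strength=noise_scale,
                       height=target_res, width=target_res,
                       num_inference_steps=steps).images[0]
    return refined
```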
- (W3A: Ablation studies on the individual contributions of the two additional loss terms) The ablation study appears insufficient, as the individual contributions of the two additional loss terms introduced (prediction loss and attention loss) are not separately verified.
Thanks for pointing this out. In fact, as shown in Lines 254~255, the two loss terms and the corresponding loss weights adopt the default setups in [30] and [37], which also focus on architectural distillation of diffusion models. Nevertheless, we are willing to add the corresponding studies below.
| FID (Against Original) / CLIP-T | α=0 | α=0.05 | α=0.5 | α=5 |
|---|---|---|---|---|
| β=0 | 14.27 / 30.90 | 13.98 / 30.77 | 13.78 / 30.66 | 13.70 / 30.56 |
| β=0.05 | 13.90 / 30.81 | 13.88 / 30.81 | 13.81 / 30.72 | 13.68 / 30.64 |
| β=0.5 | 13.83 / 30.68 | 13.82 / 30.68 | 13.72 / 30.65 | 13.47 / 30.59 |
| β=5 | 13.77 / 30.65 | 13.69 / 30.66 | 13.45 / 30.62 | 13.44 / 30.58 |
We find that the performance is overall insensitive to these hyperparameters. Values that are too small may weaken the effect of the loss terms, while values that are too large do not bring further benefits either.
- (W3B: Generalizability across different 𝑟) Also, does each choice of attention radius 𝑟 require training a separate model, or can one model generalize across different domain sizes?
Thanks for pointing out an interesting experiment. Qualitatively, we find that a model trained with a smaller 𝑟 can be used for inference with a larger 𝑟, but not the other way around. We speculate that this is because a larger inference 𝑟 provides the model with complete information plus additional context, whereas a smaller inference 𝑟 removes information the model relied on during training.
We supplement the quantitative results of GPT evaluation below.
| | Aesthetic | Prompt Alignment | Overall |
|---|---|---|---|
| Train 𝑟=8, Eval 𝑟=8 | 89.19 | 88.38 | 87.96 |
| Train 𝑟=8, Eval 𝑟=16 | 89.49 | 88.83 | 88.31 |
| Train 𝑟=16, Eval 𝑟=16 | 89.62 | 92.13 | 88.52 |
| Train 𝑟=16, Eval 𝑟=8 | 86.88 | 79.08 | 82.95 |
Thank you very much for the authors’ detailed response.
W3A. While the authors provided a grid search over the loss hyperparameters, the individual contributions of each loss term remain unclear. Specifically, it would be helpful to see a more detailed analysis regarding which loss is more critical and whether the two losses are complementary or potentially redundant. In addition, the authors adopt the default hyperparameters from references [30] and [37], but it is not discussed whether these values are well-suited to the current task.
W3B. It also remains unclear whether, in practical deployment, the model needs to be retrained for each target resolution, or if a single trained model can generalize effectively across different resolutions. Further clarification on this point would strengthen the applicability claims of the proposed method.
Dear Reviewer BEVg,
We would like to genuinely thank the reviewer for the time and attention in reviewing and acknowledging our rebuttal.
We sincerely hope to know if there are any further questions or points requiring clarification, and we would be more than happy to provide any additional information or engage in further discussion. The time and effort the reviewer has dedicated to our submission are greatly appreciated.
Best regards, Authors of Submission 3487
We would like to thank Reviewer BEVg sincerely for the active engagement in the author-reviewer discussion and the further questions. Our responses are as follows:
- W3A. While the authors provided a grid search over the loss hyperparameters, the individual contributions of each loss term remain unclear. Specifically, it would be helpful to see a more detailed analysis regarding which loss is more critical and whether the two losses are complementary or potentially redundant. In addition, the authors adopt the default hyperparameters from references [30] and [37], but it is not discussed whether these values are well-suited to the current task.
Thanks for the valuable follow-up questions. In general, both loss terms contribute to the consistency between the distilled and original models. Nevertheless, we indeed find they are complementary:
- One hyperparameter weights the prediction loss, regulating the consistency between the final output results of the distilled and original models. According to the grid search above, this term appears relatively more effective than the attention loss; for example, increasing its weight alone generally yields better FID scores than increasing the attention-loss weight alone.
- The other hyperparameter weights the attention loss, regulating the consistency between the intermediate attention results. According to our experiments, it mainly serves as an auxiliary measure that facilitates training convergence. To verify this, we present the values of the (moving-averaged) flow matching loss at various steps of the training progress:

| Training Steps | 10 | 100 | 1000 | 2000 | 10000 |
|---|---|---|---|---|---|
| w/o attention loss | 0.512 | 0.494 | 0.487 | 0.438 | 0.358 |
| w/ attention loss | 0.396 | 0.371 | 0.362 | 0.351 | 0.347 |

The results indicate that the attention loss facilitates training by offering informative intermediate supervision signals, beyond the relatively late supervision provided by the prediction loss at the final outputs. We will include an illustrative presentation in the revision.
Overall, applying both terms achieves the best performance, and the performance is insensitive to the specific values of the hyperparameters according to the grid search. There can be a subtle trade-off between the FID and CLIP-T metrics, where the default values achieve a good balance.
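To make the roles of the two weights concrete, below is a schematic sketch of how the overall distillation objective could combine the three terms. The MSE forms and, in particular, which of α and β multiplies which term are our simplifying assumptions for illustration; the exact formulation in the paper follows the default setups of [30] and [37].

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, flow_target,
                      student_attns, teacher_attns,
                      alpha: float = 0.5, beta: float = 0.5) -> torch.Tensor:
    """Flow-matching loss plus two consistency terms against the frozen teacher.

    - prediction loss (here weighted by alpha): consistency of the final outputs;
    - attention loss  (here weighted by beta): consistency of intermediate attention
      outputs, mainly acting as an auxiliary signal that speeds up convergence.
    """
    loss_fm = F.mse_loss(student_out, flow_target)              # standard flow-matching term
    loss_pred = F.mse_loss(student_out, teacher_out.detach())   # output-level distillation
    loss_attn = torch.stack([F.mse_loss(s, t.detach())          # layer-wise attention distillation
                             for s, t in zip(student_attns, teacher_attns)]).mean()
    return loss_fm + alpha * loss_pred + beta * loss_attn
```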
- W3B. It also remains unclear whether, in practical deployment, the model needs to be retrained for each target resolution, or if a single trained model can generalize effectively across different resolutions. Further clarification on this point would strengthen the applicability claims of the proposed method.
Thanks for the good question! The model does not need to be retrained for each resolution. We only conduct training at the native resolution of FLUX, i.e., 1024×1024. As mentioned in our response to W2A, the cross-resolution generalization of FLUX is achieved by existing techniques of training-free resolution extrapolation. Our models, as the corresponding distilled models, inherit this capacity and can be deployed at various resolutions.
We will definitely enhance the clarity regarding these questions in the revision. We sincerely hope that our response alleviates the reviewer's concerns, and we look forward to any further feedback.
Thank you for the authors’ response, which has addressed my initial concerns. However, I remain cautious about the claimed novelty and practical effectiveness of the CLEAR local attention mechanism.
As such, I will retain my current rating. Good luck.
We would like to thank Reviewer BEVg for the follow-up comments. We are glad to know that our previous response has fully addressed the initial concerns.
Here, we would like to respectfully reiterate the novelty and practical effectiveness of our work:
- First to linearize the complexity of pre-trained DiT. To the best of our knowledge, we are the first to explicitly focus on and tackle the problem of linearizing the complexity of already pre-trained Diffusion Transformers (DiTs).
- Systematic study of efficient attention. We conduct a systematic survey of various efficient attention mechanisms, and—through extensive experimentation—identify four necessary conditions that a feasible linearization method should satisfy.
- Non-trivial properties of linearized DiT. We further explore practically useful benefits of the linearized DiT, including:
- acceleration in high-resolution image generation,
- cross-model and cross-component generalization, and
- enhanced parallelization efficiency.
While our final design shares certain similarities with Neighborhood Attention, it emerges from a long process of trial-and-error exploration over numerous efficient attention variants. This search process leads us to distill useful guidelines that can be informative for future works. We sincerely hope this aspect could be taken into account. Moreover, as detailed in both the main paper and rebuttal, we have compared our method with Neighborhood Attention, showing that ours preserves performance while delivering higher efficiency and the additional benefit of rotation robustness.
Regarding practical effectiveness, we would like to clarify that the showcases in the paper are not cherry-picked. Instead, the test cases are randomly generated by GPT. Our method is indeed effective in practice and delivers substantial acceleration in high-resolution scenarios—this is quantitatively demonstrated across all experimental tables in the paper. If there are any specific concerns about practical effectiveness, we would be more than happy to address them during the remaining author–reviewer discussion phase, as we remain fully committed to resolving any such issues.
We truly appreciate the time and consideration of Reviewer BEVg, and we look forward to the feedback.
This paper targets an attention mechanism that reduces the complexity of pre-trained DiTs to linear. It identifies four specific features that are essential for successfully linearizing pre-trained DiTs: 1) Locality, 2) Formulation Consistency, 3) High-Rank Attention Maps, and 4) Feature Integrity. Based on these four features, the paper proposes CLEAR, built on MM-DiT architectures: each text query still gathers features from all text and image key-value tokens, while each image query interacts with all text tokens and only the image key-value tokens falling in a local neighborhood around it. Unlike Neighborhood Attention and standard 2D convolution, which use a square sliding window, CLEAR employs circular windows.
The student model is trained by fine-tuning the attention layers on 10K self-generated samples for 10K iterations. Experiments show that the distilled local attention layers are also compatible with different variants of the teacher model, e.g., FLUX.1-dev and FLUX.1-schnell, and with various pre-trained plugins like ControlNet, without requiring any adaptation. Another side benefit of locality is multi-GPU patch-wise parallel inference.
Strengths and Weaknesses
Strengths:
- The proposed method significantly reduces the required generation computation and time for large images.
- The analysis of the 4 features is nice, except for locality.
Weakness:
- The analysis of the importance of locality is questionable. Perturbing an image of "A small blue plane sitting on top of a field" may not lead to the conclusion that local features are more important, because this specific prompt and image do not involve long-range dependency. A good example to demonstrate locality would be the one in Figure 4, which shows strong long-range dependency. If locality also proves more important there, the conclusion can be drawn. In particular, in Figure 4, local attention does not seem to make any better sense than global attention.
- The improvement compared with Neighborhood Attention seems questionable. Although the GFLOPS decrease by 12.5%, according to Figure 2, there is only a very limited improvement in real wall-time speed.
- The proposed method seems to underperform the Swin Transformer given the same GFLOPs.
Questions
- 1. Compared with Neighborhood Attention, what is the wall-time speed improvement? 2. Apart from the GFLOPs, are there any other advantages over Neighborhood Attention? Can Neighborhood Attention also be used for DiT Plugins and Multi-GPU Parallel Inference?
- In Table 2, why does increasing the radius significantly improve the FID against the original FLUX-1.dev, while having almost no impact on the FID against real images? What are the vital changes between them?
Limitations
yes
Final Rating Justification
The rebuttal has resolved most of my concerns. I have increased my rating accordingly.
Format Issues
n/a
We appreciate Reviewer a9dc's thoughtful comments and are glad that the efficiency of our work for large images and the analysis of the 4 features, except for locality, are recognized. We would like to address the concerns as below.
- (W1: Local attention may not be effective in dealing with long dependencies in images) The analysis of the importance of locality is questionable. Perturbing an image of "A small blue plane sitting on top of a field" may not lead to the conclusion that local features are more important, because this specific prompt and image do not involve long-range dependency. A good example to demonstrate locality would be the one in Figure 4, which shows strong long-range dependency. If locality also proves more important there, the conclusion can be drawn. In particular, in Figure 4, local attention does not seem to make any better sense than global attention.
Thanks for the insightful review. We would like to rephrase the reviewer's concern as "local attention may not be effective in dealing with long dependencies in images". Regarding this, we have the following opinions:
- No matter whether there is long-range dependency or not, we find that locality is a necessary condition. For the case without long-range dependency, we provide examples in Figs. 3 and 5. For the case with long-range dependency, we supplement the GPT and GenEval scores below, which focus on evaluating the capability of handling the holistic layouts of generated images.

| | Aesthetic | Prompt Alignment | Overall | Win Rate vs Other | GenEval |
|---|---|---|---|---|---|
| Strided Attn. | 67.20 | 78.85 | 68.81 | 0.98 | 0.433 |
| CLEAR Attn. | 89.62 | 92.13 | 88.52 | - | 0.674 |
The results suggest the importance of locality in the case of long dependency.
- Since images are locally continuous by nature, local attention is crucial for preserving this property. For long-range dependencies, the overall receptive field of the network is enlarged through the multi-layer structure of the DiT, similar to a CNN (a rough numeric check follows this list).
- We fully agree with the reviewer that using an image with long-range dependency would strengthen the conclusion. We will use such a sample to reflect this in the revision.
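As a rough numeric check of the receptive-field point above (our own estimate, based on publicly documented FLUX dimensions rather than figures from the paper): a 1024×1024 image corresponds to a 64×64 grid of latent tokens, and stacking N local-attention layers of radius 𝑟 grows the receptive field to roughly N·𝑟 tokens per axis. With 𝑟=16, information can therefore traverse the full 64-token extent after only about four of FLUX.1's several dozen attention blocks, so the global layout can still be coordinated even though each individual layer is local.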
- (W2: Circular window vs square window) The improvement compared with Neighborhood Attention seems questionable. Although the GFLOPS decrease by 12.5%, according to Figure 2, there is only a very limited improvement in real wall-time speed.
Thanks for the good point. As mentioned in Lines 321~325, the standard Neighborhood Attention and the proposed method with circular windows yield comparable performance. Given that the FLOPs can be largely reduced, we apply the latter form. Furthermore, as shown in the following study, it is indeed more efficient in terms of wall-clock time, especially at higher resolutions.
| Time (s) / 20 Steps | 1K*1K | 2K*2K | 4K*4K | 8K*8K |
|---|---|---|---|---|
| Square (𝑟=8) | 4.63 | 16.40 | 72.70 | 310.50 |
| Circular (𝑟=8) | 4.38 | 15.67 | 69.41 | 293.50 |
| Square (𝑟=16) | 4.84 | 18.77 | 87.09 | 382.07 |
| Circular (𝑟=16) | 4.56 | 17.19 | 83.13 | 360.83 |
- (W3: Improvement over Swin Transformer) The proposed method seems underperform Swin Transformer given the same GFLOPS.
Thanks for the comment. In fact, as shown in Fig. 3, the Swin Transformer's strategy, which lacks high-rank attention maps, results in "grid" artifacts and fails to generate visually coherent results. Also, as shown in Tab. 2, our method outperforms the Swin Transformer on most metrics with fewer GFLOPs. We further validate the superiority of the proposed solution over the Swin Transformer via the following study on the GPT scores and the GenEval benchmark:
| | Aesthetic | Prompt Alignment | Overall | Win Rate vs Other | GenEval |
|---|---|---|---|---|---|
| Swin Attn. | 51.33 | 68.77 | 53.26 | 0.95 | 0.483 |
| CLEAR Attn. | 89.62 | 92.13 | 88.52 | - | 0.674 |
- (Q1: Advantages over Neighborhood Attention) 1. Compared with Neighborhood Attention, what is the wall-time speed improvement? 2. Apart from the GFLOPs, are there any other advantages over Neighborhood Attention? Can Neighborhood Attention also be used for DiT Plugins and Multi-GPU Parallel Inference?
Thanks for the good questions. We kindly refer the reviewer to our response to W2 above for the wall-clock time comparison. Indeed, Neighborhood Attention is in principle equipped with the features explored in the manuscript. We would like to emphasize that the main contribution of the manuscript is to identify the four key factors for successful linearization of pre-trained DiTs, listed in Tab. 1, rather than to introduce a brand-new attention strategy. The initial motivation for circular windows is that they reduce the computational overhead without hurting performance, which is a free-lunch benefit.
Moreover, we indeed find that circular windows offer an additional advantage: they are potentially more robust to image rotation, given that circles are rotationally invariant, which is not the case for square windows. To validate this property, we feed the original images to the network and obtain the outputs. Then, we rotate the original images by 60 degrees, feed them to the network, and rotate the outputs back. The average SSIM between the two sets of outputs is shown below.
| | Square | Circular |
|---|---|---|
| SSIM | 0.70 | 0.75 |
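For reproducibility, a minimal sketch of this rotation-consistency check (our own illustrative code; `generate` stands for a full denoising run of the linearized model and is hypothetical). Border effects from rotating without padding are ignored here for simplicity.

```python
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity as ssim

def rotation_consistency(generate, image: Image.Image, angle: float = 60.0) -> float:
    """Compare generate(x) against rot^-1(generate(rot(x))) via SSIM."""
    out_plain = generate(image)                                    # output on the original image
    rotated_in = image.rotate(angle, resample=Image.BICUBIC)       # rotate the input
    out_rot = generate(rotated_in).rotate(-angle, resample=Image.BICUBIC)  # undo rotation on the output
    a = np.asarray(out_plain.convert("L"), dtype=np.float32)
    b = np.asarray(out_rot.convert("L"), dtype=np.float32)
    return ssim(a, b, data_range=255.0)
```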
- (Q2: FID against real images in Table 2) In Table 2, why does increasing the radius significantly improve the FID against the original FLUX-1.dev, while having almost no impact on the FID against real images? What are the vital changes between them?
Thanks for the insightful question. The reason behind this is that the model is distilled from the original FLUX. Increasing the radius results in a smaller architectural gap with the original model, which is reflected in the FID scores against the original FLUX-1.dev. For the scores against real images, as shown in Tab. 2, FLUX-1.dev itself has a relatively large FID, indicating that images generated by the teacher model do not strictly follow the distribution of COCO images, which also affects the distilled models.
We would like to thank Reviewer a9dc again for the in-depth reviews. We would definitely love to further interact with the reviewer if there are any further questions.
Dear Reviewer a9dc,
We sincerely appreciate the time and effort the reviewer has dedicated to reviewing our work.
If there’s anything else we can clarify—regarding either the manuscript or our responses—we would be more than happy to help.
Your feedback is truly important to us, and we are eager to ensure that all of the concerns have been thoroughly addressed.
Thanks again for the time and consideration.
Best regards, Authors of Submission 3487
Dear Reviewer a9dc,
Thanks for taking the time to provide such thoughtful and insightful questions. We greatly appreciate the engagement with our work.
If possible, we sincerely hope that the reviewer could help us by specifying which of the original concerns remain unresolved after our responses. This would be very helpful for us to address any remaining issues.
Thanks again for the valuable time and feedback.
Best regards,
Authors of Submission 3487
This paper investigates efficient attention mechanisms for Diffusion Transformers (DiTs), with a focus on adapting them for high-resolution image generation. The authors begin with a systematic analysis of existing efficient attention variants, identifying locality as the most promising design principle for this context. Building on this insight, they propose a circular local attention mechanism that uses a fixed-radius mask to approximate convolution-like inductive bias while retaining transformer flexibility. The method enables a linear-time attention variant and is further enhanced via knowledge distillation, achieving strong efficiency while minimizing performance degradation compared to the teacher model.
Strengths and Weaknesses
Strengths:
- The paper provides a detailed discussion of existing efficient attention mechanisms and analyzes their suitability in diffusion models. This comparative analysis offers valuable insight into the design space of efficient attention for image generation and strengthens the motivation for the proposed method.
- The experimental section is comprehensive and well-structured, with comparisons across multiple baselines and resolutions. The reported results are convincing and support the core claims.
- The paper includes thorough implementation details, making the proposed method highly reproducible and potentially practical for real-world deployment.
Weaknesses:
- The proposed circular local attention mask offers limited novelty compared to existing square-shaped attention masks. Empirically, the observed improvements are marginal, which raises questions about the necessity of the circular design.
- While the paper discusses extending the method to ultra-high-resolution image generation (e.g., 2K and 4K) using SDEdit, NTK rotary embeddings, and disabling resolution-aware dynamic shifting, the technical details of this pipeline are insufficiently explained. Moreover, the paper lacks ablation studies isolating the impact of these components. Given that CLEAR's claimed speedup at ultra-high resolutions is one of its major selling points, the underexplored treatment of this aspect undermines the strength of the argument.
Questions
- Could the authors further elaborate on the advantages of the proposed circular attention mask? While the design introduces a novel geometric locality, its empirical benefits over conventional square masks (e.g., those used in Neighborhood Attention) appear marginal based on current results. A clearer justification—whether theoretical (e.g., isotropy, boundary continuity) or task-specific—would help clarify the necessity and value of this design.
- Can the authors provide more implementation details and design considerations regarding the application of their method to ultra-high-resolution generation? The paper mentions several techniques (e.g., SDEdit, NTK rotary embedding, disabling resolution-aware dynamic shifting) used to enable high-resolution generation, but these are not described in sufficient depth. A more detailed technical discussion would improve the clarity and reproducibility of the proposed pipeline.
- Could the authors include ablation studies specifically focused on ultra-high-resolution settings? Since CLEAR’s improved efficiency at 2K–4K resolutions is one of its key claims, it would be valuable to see component-wise evaluations (e.g., NTK rotary embedding, disabling resolution-aware dynamic shifting) at these resolutions to understand which factors contribute most to quality and speed in this context.
Limitations
Yes
Final Rating Justification
The authors have provided a clear and convincing explanation of the experimental methodology, which addressed my earlier concerns. The additional results and clarifications strengthened the credibility of the work. I also appreciate the authors' perspective in analyzing the problem and the insights presented in the paper. Based on these considerations, I have decided to raise my score to an accept.
Format Issues
None
We sincerely thank Reviewer JWRk for the valuable feedback on the manuscript and are very excited that the reviewer finds the comparative analysis valuable, the experiments comprehensive and convincing, and the method highly reproducible and potentially practical. The questions are addressed below.
- (W1: Circular window vs square window) The proposed circular local attention mask offers limited novelty compared to existing square-shaped attention masks. Empirically, the observed improvements are marginal, which raises questions about the necessity of the circular design.
- (Q1: Circular window vs square window) Could the authors further elaborate on the advantages of the proposed circular attention mask? While the design introduces a novel geometric locality, its empirical benefits over conventional square masks (e.g., those used in Neighborhood Attention) appear marginal based on current results. A clearer justification—whether theoretical (e.g., isotropy, boundary continuity) or task-specific—would help clarify the necessity and value of this design.
Thanks for the good points. We would like to clarify that the main contribution of the manuscript is to identify the four key factors for successful linearization of pre-trained DiTs, listed in Tab. 1, instead of introducing a brand-new attention strategy. Local attention based on circular windows is one feasible strategy following the four factors, and it reduces the computational overhead of square windows without hurting the performance, which is like a free lunch. That is the reason we adopt this form.
Moreover, we indeed find that circular windows offer an additional advantage: they are potentially more robust to image rotation, given that circles are rotationally invariant, which is not the case for square windows. To validate this property, we feed the original images to the network and obtain the outputs. Then, we rotate the original images by 60 degrees, feed them to the network, and rotate the outputs back. The average SSIM between the two sets of outputs is shown below.
| | Square | Circular |
|---|---|---|
| SSIM | 0.70 | 0.75 |
- (W2: Technical details of high-resolution generation) While the paper discusses extending the method to ultra-high-resolution image generation (e.g., 2K and 4K) using SDEdit, NTK rotary embeddings, and disabling resolution-aware dynamic shifting, the technical details of this pipeline are insufficiently explained. Moreover, the paper lacks ablation studies isolating the impact of these components. Given that CLEAR's claimed speedup at ultra-high resolutions is one of its major selling points, the underexplored treatment of this aspect undermines the strength of the argument.
- (Q2: Technical details of high-resolution generation) Can the authors provide more implementation details and design considerations regarding the application of their method to ultra-high-resolution generation? The paper mentions several techniques (e.g., SDEdit, NTK rotary embedding, disabling resolution-aware dynamic shifting) used to enable high-resolution generation, but these are not described in sufficient depth. A more detailed technical discussion would improve the clarity and reproducibility of the proposed pipeline.
Thanks for the insightful comments. In fact, it requires two steps to achieve efficient high-resolution generation. The first one is to adapt the original DiT to make it effective at higher scales. The second is to make it more efficient. CLEAR mainly addresses the second step. For the first step, the techniques are mainly adapted from previous works, e.g., SDEdit and NTK rotary embedding. That is why we do not include so many details about them.
Nevertheless, we fully agree with the reviewer that it would be helpful to supplement the related details. We include some key information here and will further elaborate on it in the revision.
- SDEdit is a simple yet effective training-free method for image editing based solely on a pre-trained text-to-image diffusion model, which can be used for high-resolution generation in a coarse-to-fine manner. Specifically, we first generate an image at the native resolution scale of the diffusion model. Then, we resize it to a larger size and add a certain amount of noise to it. A noise scale of 0.7 is adopted empirically in our experiments. Starting from this point, the model conducts the remaining denoising steps. In this way, the original image structures are preserved and low-level details are refined.
- NTK rotary embedding applies a scaling factor to the base used for rotary positional embedding: the rotary base b is rescaled according to the ratio s = L / L0, where L is the current sequence length and L0 is the native length seen during training (in the common NTK-aware form, b' = b · s^(d/(d-2)), with d the rotary embedding dimension). A small numeric sketch is given after this list.
- Resolution-aware dynamic shifting applies a resolution-dependent shift factor to the time-step projection function of the flow-matching scheduler, where the factor is positively related to the image resolution. When it is enabled at a high resolution, the projection function becomes too "skewed", i.e., too few denoising steps are allocated to the high-noise stage.
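Below is a small numeric sketch of the NTK-style base rescaling from the second bullet above; the exponent d/(d-2) is the common NTK-aware RoPE heuristic, and whether the paper uses exactly this form (and this 1D sequence-length framing rather than per-axis scaling) is our assumption.

```python
import numpy as np

def ntk_scaled_freqs(head_dim: int, base: float, seq_len: int, train_len: int) -> np.ndarray:
    """Rotary frequencies with an NTK-aware rescaled base for longer sequences."""
    s = max(seq_len / train_len, 1.0)                        # extrapolation ratio
    scaled_base = base * s ** (head_dim / (head_dim - 2))    # common NTK-aware heuristic
    return scaled_base ** (-np.arange(0, head_dim, 2) / head_dim)

# Example: from the 64x64 token grid of a 1024x1024 image (4096 tokens)
# to the 256x256 grid of a 4096x4096 image (65536 tokens).
print(ntk_scaled_freqs(head_dim=128, base=10_000.0, seq_len=65_536, train_len=4_096)[:4])
```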
- (Q3: Ablation studies on high-resolution generation settings) Could the authors include ablation studies specifically focused on ultra-high-resolution settings? Since CLEAR’s improved efficiency at 2K–4K resolutions is one of its key claims, it would be valuable to see component-wise evaluations (e.g., NTK rotary embedding, disabling resolution-aware dynamic shifting) at these resolutions to understand which factors contribute most to quality and speed in this context.
Thanks for the good questions. Regarding the three factors demonstrated in our response to the previous question, we conduct the following GPT evaluation.
| | Aesthetic (2K*2K) | Prompt Alignment (2K*2K) | Overall (2K*2K) | Win Rate vs Other (2K*2K) | Aesthetic (4K*4K) | Prompt Alignment (4K*4K) | Overall (4K*4K) | Win Rate vs Other (4K*4K) |
|---|---|---|---|---|---|---|---|---|
| w/o SDEdit | 86.32 | 89.08 | 85.37 | 0.87 | 84.15 | 87.22 | 82.36 | 0.92 |
| w/o NTK | 87.73 | 91.32 | 87.10 | 0.75 | 86.54 | 91.35 | 85.93 | 0.78 |
| w Dynamic Shifting | 88.44 | 91.68 | 86.59 | 0.70 | 84.02 | 91.40 | 84.29 | 0.73 |
| Ours | 90.22 | 91.94 | 88.71 | - | 90.09 | 92.29 | 88.81 | - |
Through the results, we can observe that SDEdit mainly helps the robustness of the overall content, while NTK rotary embedding and disabling dynamic shifting mainly benefit details and aesthetics, especially for 4K images. We will include qualitative comparisons in the revision.
I appreciate the authors’ detailed response, which has clarified my concerns. Accordingly, I have decided to increase my score.
Thanks for the update — we feel encouraged to hear that the concerns have been clarified. All the clarifications and experiments discussed in the rebuttal will be carefully incorporated into the revised version. We sincerely appreciate the thoughtful feedback from Reviewer JWRk.
In the paper, the authors identify four key factors for the linearization of pre-trained DiTs: locality, formulation consistency, high-rank attention maps, and feature integrity. They introduce a convolution-like local attention strategy that limits feature interactions to a local window around each query token and achieves linear complexity. In experiments, it reduces attention computations by 99.5% and accelerates generation by 6.3× for 8K-resolution images.
Strengths and Weaknesses
Strengths:
- The authors identify four key factors crucial for successful linearization of pre-trained DiTs: locality, formulation consistency, high-rank attention maps, and feature integrity.
- They introduce a convolution-like local attention strategy, which limits feature interactions to a local window around each query token.
Weaknesses:
- The convolution-like local attention strategy employs circular windows instead of square ones, which does not seem indispensable. Although the FLOPs of a circular neighborhood are ∼π/4× those of a square neighborhood, more data operations are required, and the two achieve comparable performance.
- Among the four key elements, high-rank attention maps and feature integrity do not seem prominent in the convolution-like local attention strategy.
- The experiment on the relation between window size and performance is not presented.
Questions
- The meaning of 'convolution-like' is not clear. What parts are convolution-like?
- What size of the window is optimal? Isn't a larger window size always better?
Limitations
yes
Final Rating Justification
I appreciate the authors’ detailed response, which has clarified part of my concerns. However, I remain somewhat cautious about the novelty and effectiveness. I have decided to increase my score to Borderline accept.
Format Issues
no
We would like to express our sincere gratitude to Reviewer SrwR for the constructive comments and are happy that the reviewer finds the four key factors for successful linearization of pre-trained DiTs and the introduced convolution-like local attention strategy effective. We would like to address the concerns and questions reflected in the review below.
- (W1: Circular window vs square window) The convolution-like local attention strategy employs circular windows instead of square ones, which does not seem indispensable. Although the FLOPs of a circular neighborhood are ∼π/4× those of a square neighborhood, more data operations are required, and the two achieve comparable performance.
Thanks for the valuable point. Indeed, circular vs. square windows do not fundamentally affect the performance or the linearization properties shown in the manuscript. The main conclusion of the manuscript concerns the four key factors for successful linearization of pre-trained DiTs, listed in Tab. 1. The standard neighborhood attention, which follows these four principles, is definitely effective, and so is the introduced method based on circular windows; indeed, they yield comparable performance, as mentioned in Lines 321~325. Given that the FLOPs can be largely reduced while preserving the performance, which is a free-lunch benefit, we apply the latter form. As shown in the following study, it is indeed more efficient in terms of wall-clock time, especially at higher resolutions.
| Time (s) / 20 Steps | 1K*1K | 2K*2K | 4K*4K | 8K*8K |
|---|---|---|---|---|
| Square (𝑟=8) | 4.63 | 16.40 | 72.70 | 310.50 |
| Circular (𝑟=8) | 4.38 | 15.67 | 69.41 | 293.50 |
| Square (𝑟=16) | 4.84 | 18.77 | 87.09 | 382.07 |
| Circular (𝑟=16) | 4.56 | 17.19 | 83.13 | 360.83 |
Meanwhile, we find that circular windows can potentially be more robust to image rotation, given that circles are rotationally invariant, which is not the case for square windows. To validate this property, we feed the original images to the network and obtain the outputs. Then, we rotate the original images by 60 degrees, feed them to the network, and rotate the outputs back. The average SSIM between the two sets of outputs is shown below.
| | Square | Circular |
|---|---|---|
| SSIM | 0.70 | 0.75 |
- (W2: The high-rank attention maps and feature integrity) Among the four key elements, high-rank attention maps and feature integrity do not seem prominent in the convolution-like local attention strategy.
Thanks for the comment. In fact, the proposed convolution-like local attention strategy adheres to these two principles:
- On the one hand, each query token has a different set of key and value tokens, which makes the overall attention masks exhibit a high-rank pattern.
- On the other hand, the proposed method does not compress or recompute key and value features in any form, which ensures their integrity.
These two factors contribute significantly to the image quality, as shown in Fig. 3:
- The Swin Transformer's strategy, which lacks high-rank attention maps, results in "grid" artifacts.
- If we zoom in on the images, we can see that the PixArt-Sigma strategy, which lacks feature integrity, yields distorted local details.
As a result, they fail to produce high-quality images. We further introduce the following GPT and GenEval scores to verify the conclusions:
| | Aesthetic | Prompt Alignment | Overall | Win Rate vs Other | GenEval |
|---|---|---|---|---|---|
| Swin Attn. | 51.33 | 68.77 | 53.26 | 0.95 | 0.483 |
| PixArt-Sigma Attn. | 70.28 | 79.20 | 71.32 | 0.97 | 0.428 |
| CLEAR Attn. | 89.62 | 92.13 | 88.52 | - | 0.674 |
- (W3: Performance vs window size) The experiment on the relation between window size and performance is not presented.
Thanks for the valuable point, which indicates an important study. The related experiments can be found in Tab. 2, Tab. 3, and Fig. 8 with various values of the radius 𝑟. We will further highlight them in the revision.
- (Q1: Definition of "convolution-like") The meaning of 'convolution-like' is not clear. What parts are convolution-like?
Thanks for the good question. For each query token, only a few nearby key and value tokens are involved in the attention interaction, which is similar to the behavior of a convolution with a fixed-size kernel. We provide an illustration of this paradigm in Fig. 6.
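To make this pattern concrete, here is a minimal sketch (our own code, not the released kernels) that builds the image-to-image part of the attention mask over a 2D token grid, comparing a square neighborhood with the circular one used by CLEAR; in the actual MM-DiT setting, image queries would additionally attend to all text tokens.

```python
import torch

def local_neighborhood_mask(height: int, width: int, radius: float,
                            circular: bool = True) -> torch.Tensor:
    """Boolean [H*W, H*W] mask: query i may attend to key j iff j lies in i's neighborhood."""
    ys, xs = torch.meshgrid(torch.arange(height), torch.arange(width), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()   # [N, 2] token positions
    diff = coords[:, None, :] - coords[None, :, :]                       # pairwise spatial offsets
    if circular:   # CLEAR: Euclidean ball of radius r (~pi/4 of the square window's area)
        return diff.norm(dim=-1) <= radius
    # Neighborhood-Attention-style square window (Chebyshev distance)
    return diff.abs().amax(dim=-1) <= radius

m_circ = local_neighborhood_mask(32, 32, radius=8, circular=True)
m_sq = local_neighborhood_mask(32, 32, radius=8, circular=False)
print((m_circ.float().mean() / m_sq.float().mean()).item())   # roughly pi/4, up to boundary effects
```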
- (Q2: Optimal window size) What size of the window is optimal? Isn't a larger window size always better?
Thanks for the good questions. The reviewer's intuition that a larger window size yields better performance is correct. However, it requires more computational overhead, as shown in Fig. 2 and Tab. 2. We empirically find that 𝑟=16 yields the best trade-off between performance and efficiency and thus adopt this setup by default. We will enhance the clarity of this part in the revision.
We would like to thank Reviewer SrwR again for the valuable feedback. Hope our responses alleviate the reviewer's concerns, and we are happy to answer additional questions if there are.
Regarding the high-rank attention maps, how is the rank computed?
We appreciate Reviewer SrwR for the good question regarding "high-rank attention maps".
For sparse attention with an attention mask, it refers to the rank of this mask matrix, as discussed in Appendix A (Swin Transformer).
- For the Swin Transformer, as mentioned in Eq. 16 of the appendix, all tokens within the same window share the same set of key and value tokens. This results in many duplicate rows in the mask matrix, significantly reducing its rank.
- By contrast, as shown in Eq. 17, each query token in CLEAR and neighborhood attention has a distinct key and value token set, so each row of the mask matrix is linearly independent of the others (see the small numeric sketch below).
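A small 1D numeric sketch of this rank argument (our own illustration, not the appendix derivation): a Swin-style non-overlapping window mask contains many identical rows and hence has low rank, whereas a per-query sliding-window mask is near full rank.

```python
import numpy as np

def swin_mask(n: int, window: int) -> np.ndarray:
    """Non-overlapping windows: all queries in a window share the same key set."""
    blocks = np.arange(n) // window
    return (blocks[:, None] == blocks[None, :]).astype(float)

def sliding_mask(n: int, radius: int) -> np.ndarray:
    """Per-query centered window, as in CLEAR / neighborhood attention."""
    idx = np.arange(n)
    return (np.abs(idx[:, None] - idx[None, :]) <= radius).astype(float)

n = 64
print(np.linalg.matrix_rank(swin_mask(n, window=8)))     # 8: one independent row per window
print(np.linalg.matrix_rank(sliding_mask(n, radius=4)))  # close to n: near full rank
```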
We hope our response clarifies things better. Thanks again for the engagement in the discussion. We remain fully committed to answering any further questions throughout the author-reviewer discussion phase.
Dear Reviewer SrwR
We would be grateful for any indication of whether our responses have satisfactorily addressed the concerns or if further clarification is needed.
We remain committed to improving our manuscript based on your expertise.
Thank you for the time and consideration.
Best regards,
Authors of Submission 3487
Dear Reviewer SrwR,
We would like to thank the reviewer again for the thoughtful review of our submission.
Please feel free to let us know if there are any remaining questions or concerns about the manuscript or our rebuttal. We would be glad to provide any further clarification if helpful.
Best regards, Authors of Submission 3487
The paper proposes CLEAR, a convolution-like local attention for linearizing pre-trained DiTs, enabling efficient high-resolution image generation with 99.5% reduced attention computations and a 6.3x speedup at 8K. All reviewers (SrwR, JWRk, BEVg, a9dc) reached consensus on acceptance post-rebuttal, with numerous concerns addressed, including the locality analysis, high-resolution details, loss contributions and robustness, and the comparison to Neighborhood Attention. Overall, this work confirms that attention transfer also works well across architectures for generation tasks, and it adds value to the community. Please incorporate the promised revisions in the camera-ready version.