PaperHub
Overall rating: 5.5 / 10
Poster · 4 reviewers
Ratings: 7 / 5 / 5 / 5 (min 5, max 7, std. dev. 0.9)
Confidence: 3.8 · Correctness: 2.8 · Contribution: 2.5 · Presentation: 3.0
NeurIPS 2024

Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models and Time-Dependent Layer Normalization

OpenReview · PDF
Submitted: 2024-05-11 · Updated: 2024-11-06

Abstract

This paper presents innovative enhancements to diffusion models by integrating a novel multi-resolution network and time-dependent layer normalization. Diffusion models have gained prominence for their effectiveness in high-fidelity image generation. While conventional approaches rely on convolutional U-Net architectures, recent Transformer-based designs have demonstrated superior performance and scalability. However, Transformer architectures, which tokenize input data (via "patchification"), face a trade-off between visual fidelity and computational complexity due to the quadratic nature of self-attention operations concerning token length. While larger patch sizes enable attention computation efficiency, they struggle to capture fine-grained visual details, leading to image distortions. To address this challenge, we propose augmenting the **Di**ffusion model with the **M**ulti-**R**esolution network (DiMR), a framework that refines features across multiple resolutions, progressively enhancing detail from low to high resolution. Additionally, we introduce Time-Dependent Layer Normalization (TD-LN), a parameter-efficient approach that incorporates time-dependent parameters into layer normalization to inject time information and achieve superior performance. Our method's efficacy is demonstrated on the class-conditional ImageNet generation benchmark, where DiMR-XL variants surpass previous diffusion models, achieving FID scores of 1.70 on ImageNet $256 \times 256$ and 2.89 on ImageNet $512 \times 512$. Our best variant, DiMR-G, further establishes a state-of-the-art 1.63 FID on ImageNet $256 \times 256$.
Keywords
Image Generation

Reviews and Discussion

Review
Rating: 7

The paper proposes a multi-resolution network for diffusion models that emphasizes learning features across resolutions. The method further introduces a time-dependent layer norm to boost model performance with fewer parameters than AdaLN in DiT. The proposed network demonstrates state-of-the-art performance, achieving FID scores of 1.70 and 2.89 on ImageNet 256 and 512, respectively.

Strengths

  • The paper addresses image distortion caused by varying patch sizes by integrating a multi-resolution structure into the network. The choice of patch sizes involves a trade-off between computational complexity and model performance: larger patch sizes reduce computational complexity but degrade performance.
  • The method finds a balance between compute complexity and performance by introducing a multi-branch architecture and multi-resolution loss to decompose the learning process from low to high resolution. The idea is a substantial contribution. Through quantitative and qualitative results, the method can alleviate distortion of image generation and achieve SoTA FID scores.
  • The time-dependent layer norm is a simplified form of AdaLN with fewer parameters, obtained by removing the cumbersome MLP layer and rearranging the class and time embeddings for parameter-efficient time conditioning.
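For illustration only: a minimal plain-Python sketch of the TD-LN idea as described above. The linear-in-t parameterization and all names here are our assumptions, not the paper's exact Eq. (5); the point is that the only extra parameters are two (gamma, beta) pairs per layer, with no per-block MLP as in AdaLN.

```python
import math

def layer_norm(x, eps=1e-5):
    # Standard layer normalization over the feature dimension.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def td_ln(x, t, gamma1, gamma2, beta1, beta2):
    # Time-Dependent LayerNorm (hedged sketch): scale and shift are a
    # linear interpolation in the scalar timestep t in [0, 1], so time
    # conditioning costs only four extra vectors per layer, versus the
    # d -> 6d MLP projection used by AdaLN-Zero in DiT.
    xn = layer_norm(x)
    return [
        (t * g1 + (1 - t) * g2) * v + (t * b1 + (1 - t) * b2)
        for v, g1, g2, b1, b2 in zip(xn, gamma1, gamma2, beta1, beta2)
    ]
```

With identity scales and zero shifts, `td_ln` reduces to a plain layer norm, which makes the sketch easy to sanity-check.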

Weaknesses

  • A discussion comparing the method with cascaded approaches such as "Cascaded diffusion models" and "Matryoshka diffusion models" should be included to highlight the advantages of the method.
  • Table 1 should be updated with the latest SoTA methods, such as MDTv2 (best FID 1.58) and PaGoDA (best FID 1.56) on ImageNet 256x256: not to compete with them, but to give an overall picture of the latest methods.
  • The proposed multi-scale diffusion loss is also introduced in SimpleDiffusion. So, the authors need to mention the difference in the paper.
  • Equation 2: have the authors tried concatenation instead of directly adding the upsampled features to the larger-resolution features of the next branch?
  • The current method only uses the output features of the previous branch and injects them into the start of the next branch. However, this design lacks interconnections between blocks from the low-resolution and high-resolution branches. Ideally, there should be skip-connections across branches. What is the motivation behind this? If authors have not tested this, it is encouraged to do so.
  • In table 2, what is the reason why using AdaLN-Zero with Multi-branch (row 3) causes a bad result?
  • The root problem of distortion boils down to the use of patch size, which is the main motivation of the work. However, in the design of the multi-scale network, the authors also patchify the input image to the corresponding resolution for each branch. I wonder why the authors do not use different-resolution noisy images as the input to each branch, as in Matryoshka Diffusion Models. I think this would be more effective at completely removing the distortion problem. As shown in Figure 5, the method still exhibits a certain level of distortion.
  • Sampling speed: it would be valuable to include a comparison of sampling time with baselines.
  • Figure 3: What does the red line mean?

Misc: L37, L130: Figure 7 of the DiT paper is incorrectly linked.

Ref:

  • Gao, Shanghua, et al. "MDTv2: Masked Diffusion Transformer is a Strong Image Synthesizer." arXiv preprint arXiv:2303.14389 (2023).
  • Kim, Dongjun, et al. "PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion Teacher." arXiv preprint arXiv:2405.14822 (2024).
  • Hoogeboom, Emiel, Jonathan Heek, and Tim Salimans. "simple diffusion: End-to-end diffusion for high resolution images." International Conference on Machine Learning. PMLR, 2023.

Questions

N/A

Limitations

As pointed out in the Weaknesses, the method still has a certain distortion rate due to the use of strided convolution (a form of patchification) at the start of each branch. I think the authors should include this in the revised manuscript.

Author Response

We thank the reviewer for the constructive comments, and we carefully address the concerns below.

 

W1: The discussion between the method and cascaded methods like "Cascaded diffusion models" and "Matryoshka diffusion models" should be included to highlight the advantages of the method.

Thank you. We will add the following paragraph to the related work section.

Cascaded diffusion models. To generate high-resolution images while managing the large computational complexity, it is common to use cascaded diffusion models [a,b,c], where a first diffusion model generates images at a lower resolution, and then one or more diffusion models gradually generate super-resolution versions of the initial generation. Recently, Gu et al. [d] proposed using a single diffusion model to generate images at different resolutions simultaneously via a Nested U-Net and a progressive training recipe. However, the key insight of their multi-resolution design is that high-resolution images can be generated with smaller, affordable models if the low-resolution generation is used as conditioning. In contrast, we observe that different architectures (e.g., transformer and convolution layers) behave differently (in performance, speed, etc.) at different resolutions. Therefore, we propose a multi-resolution network design in the feature space, and use different architectures to handle features at different resolutions. This enables us to exploit the advantages of different architectures within a single diffusion model, achieving the best performance while maintaining model efficiency.

  • [a] Ho, Jonathan, et al. "Cascaded diffusion models for high fidelity image generation."
  • [b] Ramesh, Aditya, et al. "Hierarchical text-conditional image generation with clip latents."
  • [c] Saharia, Chitwan, et al. "Photorealistic text-to-image diffusion models with deep language understanding."
  • [d] Gu, Jiatao, et al. "Matryoshka diffusion models."

 

W2: In table 1, update with some latest SoTa methods.

Thanks for the suggestion. We will add them to the full table. However, it is noteworthy that these methods propose new strategies for training diffusion models, such as mask latent modeling or progressive training, which are orthogonal to the main focus of this work (a multi-branch network design). We believe these strategies could also be used to train our model and achieve better performance. Additionally, PaGoDA appeared on arXiv (May 23, 2024) after our submission to NeurIPS.

 

W3: The proposed multi-scale diffusion loss is also introduced in SimpleDiffusion. So, the authors need to mention the difference in the paper.

Thanks for the suggestion. We already cited SimpleDiffusion in L230, and will discuss more in the revision. In short, the multi-scale training loss in SimpleDiffusion is used to balance the loss on different levels of details when training diffusion models for high resolutions. The multi-scale training loss used in our DiMR is to supervise different branches at different resolutions. Another crucial difference is the underlying denoising backbone, where SimpleDiffusion uses U-Net, and DiMR uses a multi-branch network.
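To make the distinction concrete, here is a minimal plain-Python sketch of a per-branch multi-resolution objective. The function names and the uniform weighting are illustrative assumptions, not the paper's exact loss; each branch prediction is assumed to be paired with a target already at that branch's resolution.

```python
def multi_resolution_diffusion_loss(branch_preds, branch_targets, weights=None):
    # Hedged sketch: unlike a multi-scale loss that re-weights detail
    # levels of a single network output, each branch here is supervised
    # against the target at its own resolution, and the per-branch MSE
    # terms are summed with (assumed) per-branch weights.
    weights = weights or [1.0] * len(branch_preds)
    total = 0.0
    for w, pred, tgt in zip(weights, branch_preds, branch_targets):
        mse = sum((p - t) ** 2 for p, t in zip(pred, tgt)) / len(pred)
        total += w * mse
    return total
```

Each (pred, target) pair stands in for one resolution branch; flattening the spatial dimensions keeps the sketch framework-free.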

 

W4: Have the authors used concat instead of adding the upsampling features directly to the larger-resolution features of the next branch?

We actually considered both strategies at an early stage of this project. We found that concatenation and direct addition of the upsampled features make little difference experimentally: they achieve very similar performance, as shown in Table J.

Table J: Comparison between upsampling and concatenation on ImageNet-256

| | Epoch | #Params. | Gflops | FID-50K w/o CFG | FID-50K w/ CFG |
|---|---|---|---|---|---|
| DiMR-XL/2R w/ upsampling | 400 | 505M | 160 | 4.87 | 1.77 |
| DiMR-XL/2R w/ concatenation | 400 | 506M | 161 | 5.01 | 2.06 |

 

W5: Ideally, there should be skip-connections across branches. What is the motivation behind this? If authors have not tested this, it is encouraged to do so.

Thanks for the interesting suggestion. We have experimented with skip-connections, as detailed in Table K. Specifically, we connect the input of each layer in the latter branch with the output of the corresponding layer in the former branch using a skip-connection block. This block consists of two layer normalization layers with a pixel shuffle upsampling layer in between. Our results indicate that skip-connections have a marginal, or slightly negative, influence on our DiMR model.

Table K: Ablation on skip connections

| | skip-connections | Epoch | #Params. | Gflops | FID-50K w/o CFG | FID-50K w/ CFG |
|---|---|---|---|---|---|---|
| DiMR-XL/2R | No | 400 | 505M | 160 | 4.87 | 1.77 |
| DiMR-XL/2R | Yes | 400 | 542M | 165 | 4.45 | 1.96 |
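For reference, a plain-Python sketch of the pixel-shuffle upsampling used inside the skip-connection block described above (the block itself being LN, then pixel shuffle, then LN). The list-of-lists tensor layout is illustrative; the rearrangement follows the standard channels-to-space convention.

```python
def pixel_shuffle(x, r):
    # Rearranges x of shape [C*r*r][H][W] (nested lists) into
    # [C][H*r][W*r]: each group of r*r channels encodes one r-by-r
    # sub-pixel neighborhood of the upsampled output.
    cr2, H, W = len(x), len(x[0]), len(x[0][0])
    C = cr2 // (r * r)
    out = [[[0.0] * (W * r) for _ in range(H * r)] for _ in range(C)]
    for c in range(C):
        for i in range(H * r):
            for j in range(W * r):
                # The channel index encodes the sub-pixel offset (i % r, j % r).
                out[c][i][j] = x[c * r * r + (i % r) * r + (j % r)][i // r][j // r]
    return out
```

With r = 2, four 1x1 channels become one 2x2 map, which makes the channel-to-position mapping easy to verify by hand.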

 

W6: In table 2, what is the reason why using AdaLN-Zero with Multi-branch (row 3) causes a bad result?

Our observation is that AdaLN-Zero may not perform well on convolution layers, probably because the MLP treats the data as 1D, which may not handle the 2D structure of convolutional features very well. Also, our multi-branch design (with transformer and convolution layers) may require more careful tuning of AdaLN-Zero, which was originally designed for DiT (single-branch, pure transformer layers).

Additionally, some other papers (e.g., see Sec. 3.1 and Fig. 2b in U-ViT paper [e]) also find that AdaLN does not perform well.

[e] Bao, Fan, et al. "All are Worth Words: A ViT Backbone for Diffusion Models."

 

We address Weaknesses 7-10 and Limitation 1 in "Part 2 of our rebuttal to Reviewer Wwps".

Comment

W7: I wonder why the authors do not use different-resolution noisy images as the input to each branch, as in Matryoshka Diffusion Models. I think this would be more effective at completely removing the distortion problem.

We thank the reviewer for the suggestion. In DiMR, one specific branch uses the original resolution without patchification, and is designed to capture the finer visual details. Using different-resolution noisy images is a very good idea, but we think that resizing the input images to a smaller resolution may also cause a loss of details. Additionally, the FID score of 6.62 reported by Matryoshka Diffusion Models (vs. DiMR's 1.70) may also suggest that this would provide only marginal improvements. Nevertheless, we agree that a method without any form of patchification would further reduce the distortion rate. We leave it for future work.

 

W8: Sampling speed: it would be valuable to include a comparison of sampling time with baselines.

Thanks for the suggestion! We provide the comparison of sampling time in the following table. We test both methods with batch size 1, and compare the sampling time to generate one image on an A100. We can see that our model also has a very high sampling speed.

Table L: Comparison of sampling speed. We report Sampling Speed (second/sample) here.

| | ImageNet-256 | ImageNet-512 |
|---|---|---|
| DiT-XL | 3.3 | 6.3 |
| DiMR-XL (Ours) | 2.4 | 2.6 |

 

W9: Figure 3: What does the red line mean?

Thanks for pointing this out. The red line marks the 0.1 threshold. We will make this clear in the revision.

 

W10: L37, L130: Figure 7 of the DiT paper is incorrectly linked.

Thank you. We did intend to refer to Figure 7 of the DiT paper, which qualitatively shows that decreasing the patch size reduces distortion. We note that these references are not hyperlinked.

 

L1: The method still has a certain distortion rate due to the use of strided convolution (a form of patchification) at the start of each branch. I think the authors should include this in the revised manuscript.

Thanks for the suggestion. We will include this in the limitation section. It is noteworthy that completely solving distortion in image/video generation is an extremely challenging problem. In this work, we have already significantly reduced the distortion rate compared to U-ViT and DiT, which is quite a step towards the goal.

Comment

I appreciate the authors' effort in addressing all of my concerns and I am happy with these answers. I will raise my final score and vote for acceptance, as the contributions are significant for advancing the growth of diffusion models.

Comment

Thank you very much for your insightful feedback and for considering our responses. We're glad we could address all your concerns satisfactorily. Your support in recommending our paper for acceptance is greatly appreciated. If you have any further questions, please feel free to let us know.

Review
Rating: 5

This paper proposes to replace the original pure transformer blocks in DiT with a multi-resolution network. Specifically, transformer blocks are employed at low resolution and conv blocks are used at the remaining higher resolutions. An additional time-conditioning block is designed for the conv blocks. The effectiveness of this method is validated with numerical experiments.

Strengths

  1. This paper proposes a multi-resolution network, consisting of both transformer and conv blocks, to boost the performance of image generation. Such a design is rational. To fit the conv blocks, the paper also introduces an additional time-conditioning block.
  2. The method achieves better performance than DiT.

Weaknesses

  1. It is unclear where the performance gain comes from. Intuitively, the multi-resolution design alone could already bring better performance, yet this paper proposes a combination of transformer and conv blocks. I wonder whether the transformer at low resolution is really necessary.
  2. The scalability of this network may be limited by the introduction of conv blocks, for example, for networks with more than 1B parameters.
  3. This paper mainly claims to solve the large computational complexity induced by long token lengths. But in the genuinely long-token case (512x512 images), the performance gain of this method over DiT is marginal. This raises concerns about its real performance on larger images, e.g., >512x512.

Questions

  1. It is suggested to identify the real core contribution behind the performance gain of this paper. An ablation study replacing the transformer block with a conv block is suggested.
  2. The effectiveness of this method on larger image sizes and larger networks is a concern. The authors are encouraged to discuss these points further if they wish to fully surpass the original DiT.

Limitations

The authors discuss the limitations of this paper well.

Author Response

We thank the reviewer for the constructive comments, and we carefully address the concerns below.

 

W1: It is unclear what the performance gain comes from. Is the design of transformer in low resolution really necessary?

We thank the reviewer for the suggestion. Our experiments in Table G below demonstrate that using transformer blocks at the lowest resolution is critical for achieving good performance while maintaining efficiency. Replacing the transformer blocks in the lowest-resolution branch yields a DiMR variant with a multi-resolution design of pure convolution blocks; however, it performs worse than combining transformer blocks with convolution blocks at different resolutions.

Table G: Ablation of different architectures on ImageNet-256.

| | 1st branch (lowest resolution) | 2nd branch | Epoch | FID-50K w/o CFG | FID-50K w/ CFG |
|---|---|---|---|---|---|
| DiMR-XL/2R (pure conv) | ConvNeXt | ConvNeXt | 400 | 5.75 | 2.09 |
| DiMR-XL/2R (transformer + conv) | Transformer | ConvNeXt | 400 | 4.87 | 1.77 |

 

W2: The scalibility of this network may be limited due to the introduction of conv blocks. For example, for network parameters larger than 1B.

We thank the reviewer for the suggestion. To verify this, we trained a 1B variant of DiMR (i.e., DiMR-H/3R) on ImageNet-512 during the rebuttal period. As shown in Table H below, performance significantly improves when increasing the model size from 525M to 1.03B, demonstrating the scalability of DiMR. We note that DiMR-H/3R has been trained for only 300 epochs due to limited GPUs and the short rebuttal period; however, we still observe improvement as training progresses. We emphasize that this experiment was conducted only once due to time and GPU constraints. Therefore, better performance may be achieved with a more careful network design and optimized training hyperparameters.

Table H: Scaling up DiMR on ImageNet-512

| | Epoch | #Params. | Gflops | FID-50K w/o CFG | FID-50K w/ CFG |
|---|---|---|---|---|---|
| DiMR-XL/3R | 400 | 525M | 206.1 | 8.56 | 3.23 |
| DiMR-H/3R | 300 | 1.03B | 399.6 | 7.74 | 2.86 |

 

W3: In the genuinely long-token case (512x512 images and beyond, e.g., >512x512), the performance gains of this method are marginal compared to DiT.

We thank the reviewer for the comment. To clarify, on ImageNet-512, we adopt a three-branch design, DiMR-XL/3R, where the transformer branch has a patch size of 4. This design strikes a better balance between accuracy and speed, following U-ViT-H/4. However, the design may lead to inferior performance to a model using a patch size of 2, like DiT-XL/2. As a result, we think U-ViT-H/4 is a more proper baseline for DiMR-XL/3R (both use the same patch size for transformer branch), as the patch size significantly affects generation accuracy. It is also evidenced by the Table 4 of DiT paper, where DiT models with a patch size of 4 are 1.47 to 2.21 times worse than the same models with a patch size of 2 (e.g., DiT-XL/4 43.01 vs. DiT-XL/2 19.47 FID-50k w/o CFG). Therefore, DiT-XL/2 needs to handle heavy computational burdens to achieve good results. Specifically, on ImageNet-512, DiT-XL/2 is 2.55 times slower than our DiMR-XL/3R per forward pass (525 Gflops vs. 206 Gflops under similar model parameters). Nevertheless, DiMR-XL/3R still outperforms the best and heaviest DiT-XL/2 by 0.15 FID, and surpasses the proper baseline U-ViT-H/4 by 1.16 FID.

In addition, our model can also be improved by reducing the patch size at the cost of increased computational complexity. During the rebuttal period, we also tried 512 generation with 2 branches (which equals a patch size of 2). We report the FID-10K score during training in Table I. We observe that performance further improves, but computational complexity also increases.

Table I: ImageNet-512 generation with 2 branches (DiMR-XL/2R) will further improve performance

| | #Params. | Gflops | FID-10K w/o CFG at 80 epochs | FID-10K w/o CFG at 160 epochs | FID-10K w/o CFG at 240 epochs |
|---|---|---|---|---|---|
| DiMR-XL/3R | 525M | 206 | 18.74 | 14.84 | 13.90 |
| DiMR-XL/2R | 515M | 619 | 17.56 | 13.68 | 12.95 |
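As a back-of-the-envelope aid to the patch-size discussion above: the token count after patchification scales as (size/p)^2, and self-attention cost is quadratic in token count, so halving the patch size multiplies the attention cost by roughly 16x. A minimal sketch (illustrative only; real Gflops figures also include MLP and convolution terms):

```python
def num_tokens(image_size, patch):
    # Tokens after patchification of a square input: (size/p) * (size/p).
    return (image_size // patch) ** 2

def attention_cost_ratio(image_size, patch_a, patch_b):
    # Self-attention cost grows quadratically with token count, so the
    # relative attention cost of patch_a vs. patch_b is (tokens_a / tokens_b)^2.
    return (num_tokens(image_size, patch_a) / num_tokens(image_size, patch_b)) ** 2
```

For a 32x32 latent (a common latent size for 256x256 images), moving from patch 4 to patch 2 quadruples the token count and raises attention cost 16x.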

 

Q1: Identify the real core contribution for performance gain of this paper. The ablation study of replacing the transformer block with conv block is suggested.

Please refer to W1 for the detailed discussion.

 

Q2: The effectiveness of this method on larger image size and networks are concerned.

Please refer to W3 for the detailed discussion.

Comment

Thanks for the response. I decide to raise the score to 5.

Comment

Thank you once more for your valuable suggestions and for considering our responses! If you need any further information or clarification, please feel free to contact us!

Review
Rating: 5

This paper introduces a diffusion model named DiMR, which incorporates a Multi-Resolution Network and Time-Dependent Layer Normalization to enhance image generation quality. Traditional diffusion models, often limited by a trade-off between computational efficiency and visual fidelity, struggle with image distortion due to the coarse resolution of input data processing in Transformer-based designs. DiMR addresses this by refining image details progressively across multiple resolutions, reducing distortions significantly. The new light-weight Time-Dependent Layer Normalization technique introduced in the model embeds time-based parameters into the normalization process efficiently. Demonstrated through benchmarks on ImageNet 256x256 and ImageNet 512x512, DiMR outperforms competing models.

Strengths

  • The paper is well-written and easy to follow, featuring clear diagrams and detailed captions that enhance understanding.
  • The introduction of Time-Dependent Layer Normalization (TD-LN) is particularly interesting. The use of PCA analysis to justify this approach provides a strong motivation and highlights its innovativeness.
  • The experimental section of the paper effectively demonstrates the effectiveness of the proposed method.

Weaknesses

  • The model has a higher FLOPs count compared to a DiT of similar parameter size, raising concerns about slower training speeds.
  • It is unclear whether this cascaded approach can support multi-resolution training.

Questions

  • This work appears to be a cascade diffusion model based on DiT. How does it perform compared to traditional UNet-based cascade diffusion models?
  • What are the training and convergence speeds of this model, particularly in comparison to the baselines?
  • The proposed method adopted the class token similar to that used in U-ViT for injecting class conditions. Is it possible to replace this with an approach similar to Time-Dependent Layer Normalization (TD-LN)?

Limitations

The authors have adequately addressed the limitations and potential negative societal impact of their work.

Author Response

We thank the reviewer for the constructive comments, and we carefully address the concerns below.

 

W1: The model has a higher FLOPs count compared to a DiT of similar parameter size, raising concerns about slower training speeds.

Thanks for the comment. To address it, we provide a training speed analysis in Table D. We note that on ImageNet-256, when comparing with DiT-XL/2, our DiMR-XL/2R only increases the computation by 41 Gflops, which is relatively negligible. Additionally, we also experiment with a much faster DiMR variant (i.e., DiMR-XL/3R) on ImageNet-256. As shown in the following table, DiMR-XL/3R, even with much smaller Gflops, still surpasses DiT-XL/2 by a large margin.

Table D: Comparison of training speed on ImageNet-256. For training speed, we test all models with batch size of 256 on 8 A100s.

| | #Params. | Gflops | FID-50K w/o CFG | FID-50K w/ CFG | Total training time |
|---|---|---|---|---|---|
| DiT-XL/2 | 675M | 119 | 9.62 | 2.27 | 24.8 days |
| Large-DiT-3B | 3B | 928 | - | 2.10 | - |
| DiMR-XL/2R (ours) | 505M | 160 | 4.87 | 1.77 | 11.8 days |
| DiMR-XL/3R (ours) | 502M | 48 | 5.58 | 1.98 | 10.0 days |

Furthermore, on ImageNet-512, our DiMR-XL/3R is more compute-efficient than DiT-XL/2. Specifically, our DiMR-XL/3R has 206 Gflops while DiT-XL/2 has 525 Gflops, indicating that we are more than 2x faster at each forward pass, making ours much more training efficient for high-resolution image generations.

 

W2: Can this cascaded approach support multi-resolution training?

We thank the reviewer for the interesting idea. Our DiMR can actually perform multi-resolution training and enable multi-resolution generation with a single model. Due to the limited time of the rebuttal period, we only tried two different resolutions: 256x256 and 512x512 multi-resolution training, and we report the results in Table E. We observe that the same DiMR-XL/2R model can generate both 256x256 and 512x512 images while achieving good FID scores. For reference, we also reported the best DiT model at each resolution. Our model is comparable (even superior) to the best resolution-specific DiT models.

Table E: Multi-resolution generation. For DiMR-XL/2R, we use the same model to generate 50K images for each resolution and compute the FID-50K score. For DiT-XL, we report their best model trained on each specific resolution.

| | #Params. | Gflops | FID-50K on 256x256 | FID-50K on 512x512 |
|---|---|---|---|---|
| DiMR-XL/2R (multi-resolution generation) | 505M | 160 | 1.79 | 3.18 |
| DiT-XL/2 (single 256-resolution generation) | 675M | 119 | 2.27 | x |
| DiT-XL/2 (single 512-resolution generation) | 675M | 525 | x | 3.04 |

 

Q1: How does DiMR perform compared to traditional UNet-based cascade diffusion models?

As discussed in the paper, DiMR adopts a framework of feature cascade, instead of image cascade (like traditional UNet-based cascade diffusion models). Specifically, the image cascade approaches (e.g., Cascaded Diffusion Models) first generate low-resolution images by a base diffusion model, and then improve the details in the subsequent super-resolution diffusion models. By contrast, DiMR generates feature maps progressively from low-resolution to high-resolution, all within a single model.
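For illustration, the feature-cascade idea can be sketched in a few lines of plain Python. All names here are hypothetical; the branch modules, per-branch input preparation (patchification), and upsampling are abstracted as callables, and the addition-based fusion follows the description of Eq. 2 in the discussion above.

```python
def feature_cascade_forward(x_noisy, branches, get_input, upsample):
    # Hedged sketch of a feature cascade: the lowest-resolution branch
    # runs first, and each later branch receives its own (patchified)
    # input plus the upsampled features of the previous branch. Every
    # branch output can then be supervised at its own resolution.
    feat = None
    outputs = []
    for level, branch in enumerate(branches):  # ordered low -> high resolution
        x_in = get_input(x_noisy, level)
        if feat is not None:
            x_in = [a + b for a, b in zip(x_in, upsample(feat, level))]
        feat = branch(x_in)
        outputs.append(feat)
    return outputs
```

Unlike an image cascade, no intermediate image is decoded between stages; only feature maps flow from branch to branch inside a single model.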

Below, Table F provides the results compared to the cascade diffusion models, i.e., Cascaded Diffusion Model (CDM) and Matryoshka Diffusion Model (MDM), a special type of cascaded diffusion model where multi-resolution images are generated at the same time via a Nested U-Net. Note that they only report results on ImageNet-256 generations. As shown in the table, our model surpasses them by a large margin.

Table F: Comparison with cascade diffusion models on ImageNet-256.

| | FID-50K without CFG | FID-50K with CFG |
|---|---|---|
| CDM | - | 4.88 |
| MDM | 8.92 | 6.62 |
| DiMR-XL/2R (ours) | 4.50 | 1.70 |

 

Q2: What are the training and convergence speeds of this model, particularly in comparison to the baselines?

Please refer to W1 for the detailed discussion.

 

Q3: The proposed method adopted the class token similar to that used in U-ViT for injecting class conditions. Is it possible to replace this with an approach similar to Time-Dependent Layer Normalization (TD-LN)?

We thank the reviewer for the interesting question. We experimented with this idea, but it did not work. In the formulation of TD-LN (Equ (5) in the paper), the time token is a scalar, rather than an embedding vector as in adaLN (Equ (3) in the paper). The diffusion (or denoising) process is a monotonic function of time, allowing us to easily encode time as a scalar. The class token, on the other hand, contains more complex information, so simply encoding it as a scalar in Equ (5) does not work. In the end, we resort to U-ViT's simple yet effective strategy of feeding the class tokens into the transformer branch. A more careful exploration (e.g., adding a lightweight MLP for the class tokens) may make it work, and we leave it for future work.

Review
Rating: 5

This paper works on efficient diffusion model backbone architecture by using transformer and ConvNeXt architecture respectively on small and large resolutions of the same inputs, to leverage the strength of both architectures, and alleviate the distortion problem.

Strengths

  1. The writing is easy to read and follow.

  2. The idea is simple and effective.

  3. They conduct rich experiments to support the idea.

Weaknesses

  1. This work utilizes two standard architectures, incorporating only minor design elements such as the TD-LN. It would be more advantageous to propose a new, general, and elegant architecture that effectively combines the strengths of the attention layer and the convolution layer.

Questions

Are there any ablation studies on the number of different resolutions used? For instance, what happens if we use one small resolution with DiT and one normal resolution with a larger conv net?

Limitations

See the weakness section.

Author Response

We thank the reviewer for the constructive comments, and we carefully address the concerns below.

 

W1: This work utilizes two standard architectures; more advantageous to propose a new, general, and elegant architecture that combines both.

We thank the reviewer for the suggestion, which we fully agree with. Additionally, we believe that our proposed DiMR is one step towards the goal, by effectively combining the strengths of the attention and convolution layers. Specifically,

  1. DiMR is clean and elegant: As shown in Fig. 2 of the paper, DiMR only uses the standard multi-head attention layers, depthwise convolutions, and MLP layers, along with the proposed TD-LN layers. No other complex designs or operations are involved. As a result, DiMR showcases a straightforward and effective network design.

  2. DiMR is general: Since only standard operations are employed in DiMR, users can design their own DiMR variant by simply changing the types, numbers, or orders of layers. Additionally, as demonstrated in Table A, DiMR also supports the variant of pure convolutions (i.e., the lowest-resolution branch also uses ConvNeXt blocks).

Table A: DiMR supports arbitrary combination of different architectures at different branches. Experiments on ImageNet-256. We explain the performance gap between 2R and 3R in Q1 (Table B).

| | 1st branch (lowest resolution) | 2nd branch | 3rd branch | FID-50K w/o CFG | FID-50K w/ CFG |
|---|---|---|---|---|---|
| DiMR-XL/2R (pure conv) | ConvNeXt | ConvNeXt | - | 5.75 | 2.09 |
| DiMR-XL/2R | Transformer | ConvNeXt | - | 4.87 | 1.77 |
| DiMR-XL/3R | Transformer | ConvNeXt | ConvNeXt | 5.58 | 1.98 |
  3. DiMR is novel and effective: To the best of our knowledge, DiMR is the first work that successfully combines transformer and convolution architectures into a single multi-resolution diffusion model. DiMR demonstrates the effectiveness of combining both architectures, resulting in state-of-the-art performance without any bells and whistles (even surpassing Large-DiT with 3B parameters by a large margin: 1.70 vs. 2.10 on the ImageNet-256 benchmark, while DiMR uses only 505M parameters).

  4. TD-LN is crucial, not minor: As demonstrated by the systematic analysis in the paper, TD-LN is a parameter-efficient approach that effectively injects time information into the model. We emphasize that other reviewers praise the proposed TD-LN: Reviewer MPTv acknowledges that TD-LN is particularly interesting, well-motivated, and innovative, while Reviewer Wwps also finds TD-LN more efficient and effective than AdaLN-Zero, due to the removal of the cumbersome MLP layer.

 

Q1: Are there any ablation studies on the number of different resolutions used?

We thank the reviewer for the suggestion. It is noteworthy that the number of resolutions affects the trade-off between generation performance and model speed (both inference and training time). As a result, we use two resolutions (denoted 2R in the paper) for the ImageNet-256 benchmark and three resolutions (3R) for ImageNet-512.

Below, we present a careful ablation study on the number of resolutions. Note that for ImageNet-512 generation, we could not complete the full experiment within the short rebuttal period. Therefore, we followed U-ViT and report FID-10K at every 100K iterations (i.e., 80 training epochs). Our findings indicate that one transformer branch combined with one convolution branch (i.e., 2R) usually achieves the best results, but the Gflops also increase (since the transformer branch operates on a higher resolution than in the 3R counterpart). Conversely, employing more blocks at smaller resolutions (i.e., 3R) reduces the computational burden but slightly degrades performance. We will include these experiments in the revised version.

Table B: Ablation of different numbers of resolutions on ImageNet-256

| | Epoch | #Params. | Gflops | FID-50K w/o CFG | FID-50K w/ CFG |
|---|---|---|---|---|---|
| DiMR-XL/2R | 400 | 505M | 160 | 4.87 | 1.77 |
| DiMR-XL/3R | 400 | 502M | 48 | 5.58 | 1.98 |

Table C: Ablation of different numbers of resolutions on ImageNet-512

| | #Params. | Gflops | FID-10K w/o CFG at 80 epochs | FID-10K w/o CFG at 160 epochs | FID-10K w/o CFG at 240 epochs |
|---|---|---|---|---|---|
| DiMR-XL/2R | 515M | 619 | 17.56 | 13.68 | 12.95 |
| DiMR-XL/3R | 525M | 206 | 18.74 | 14.84 | 13.90 |
Author Response

Dear reviewers and ACs,

 

We thank all reviewers for their valuable comments and feedback, mentioning that our method is "simple and effective" (Reviewer 1gRs and MPTv), which "alleviates distortion of image generation and achieves SoTA FID scores" (Reviewer Wwps). Additionally, we are glad that they find the proposed TD-LN is "particularly interesting with strong motivation and innovativeness" (Reviewer MPTv) and "simple with less intensive parameters" (Reviewer Wwps).

 

We address all the comments and questions from the reviewers in the rebuttals below, and provide detailed explanations and experimental results.

 

For quick reference, we list new experiments provided in this rebuttal below:

  • Table A shows that DiMR supports arbitrary combination of different architectures at different branches.
  • Table B/C ablate different numbers of resolutions on ImageNet-256/512, respectively.
  • Table D demonstrates the advantage in training speed compared to the baselines.
  • Table E shows that DiMR is capable of multi-resolution generation and is comparable to (or even superior to) the best resolution-specific DiT models.
  • Table F compares DiMR with cascade diffusion models.
  • Table G ablates different architectures (pure convolution vs. transformer + convolution).
  • Table H shows the results of the 1B variant of DiMR on ImageNet-512, demonstrating the scalability of DiMR.
  • Table I shows that increasing the Gflops of DiMR further improves performance on ImageNet-512.
  • Table J compares upsampling (addition) and concatenation.
  • Table K ablates the skip connections.
  • Table L demonstrates the advantage in sampling speed compared to the baselines.

 

We thank the reviewers for the constructive comments, and we will include all the new results and discussions in the final version.

 

Best,

Authors

Final Decision

The paper presents a multiresolution architecture that combines Transformer-based layers on low-resolution branches and ConvNeXt-based layers on high-resolution branches. The paper also makes an interesting observation on adaLN layers in diffusion models and proposes a simple parameter-efficient time-dependent adaLN. Overall the reviewers have found the paper well-written and acknowledged that the exhaustive experimental studies in the paper provide compelling evidence on the efficacy of the proposed architecture. The rebuttal has addressed most remaining concerns well, and thus we are happy to recommend acceptance at NeurIPS.

We strongly recommend incorporating reviewers' suggestions and the new experimental results in the final camera-ready version (or its appendices). Please also update your Table 1 with results from MDTv2 and DiffiT.