AccCtr: Accelerating Training-Free Control For Text-to-Image Diffusion Models
Reviews and Discussion
This paper addresses training-free conditional generation with pre-trained diffusion models. It first introduces an alternative-maximization view to investigate why training-free conditional generation always needs more sampling steps, and argues that the cause is that the feature extractor cannot offer a precise guidance direction. To address this, the paper proposes a way to fine-tune the feature extractor, which both reduces the number of sampling steps and enhances generation quality.
Strengths
- Introducing alternative maximization to explain why training-free conditional generation is slow is interesting. Training-free conditional generation resembles gradient-based optimization (a sketch follows this list), so it is reasonable to take the maximization view.
- The experimental results show that AccCtr achieves SOTA results on specific tasks.
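To make the gradient-optimization analogy above concrete, here is a minimal, self-contained sketch (ours, not the paper's code) of one generic training-free guidance step: an unconditional denoising update alternated with a gradient step that pulls the posterior-mean estimate toward the condition. The noise schedule, noise predictor, feature extractor, and step size are placeholder assumptions.

```python
import torch

T = 50
alpha_bar = torch.linspace(0.999, 0.02, T)            # toy noise schedule (assumption)
eps_model = lambda x, t: torch.zeros_like(x)          # placeholder noise predictor
feat      = lambda x: x.mean(dim=(2, 3))              # placeholder feature extractor
y         = torch.zeros(1, 3)                          # target condition features
step_size = 1.0                                        # guidance step size

x = torch.randn(1, 3, 64, 64)
for i in range(T - 1):
    a, a_next = alpha_bar[i], alpha_bar[i + 1]
    eps = eps_model(x, i)
    # Tweedie / posterior-mean estimate of the clean sample
    x0_hat = (x - (1 - a).sqrt() * eps) / a.sqrt()

    # "maximization" step on the condition: one gradient step on the matching loss
    x0_hat = x0_hat.detach().requires_grad_(True)
    loss = (feat(x0_hat) - y).pow(2).sum()
    grad, = torch.autograd.grad(loss, x0_hat)
    x0_hat = (x0_hat - step_size * grad).detach()

    # deterministic DDIM-style move to the next (lower) noise level
    x = a_next.sqrt() * x0_hat + (1 - a_next).sqrt() * eps
```

Each iteration alternates between returning toward the data manifold (the denoising step) and ascending the condition-matching objective (the gradient step), which is the sense in which the maximization view fits.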
Weaknesses
- The writing of this paper should be further improved. 1) Redundant and confusing math equations. For example, the main content of Sec. 3.1 is DDPM background that is never used again in the rest of the paper, and the notation around line 139 appears to be wrong. This also makes Algorithm 1 hard to follow: the time-travel step is given in lines 9-13 of Algorithm 1, and under a simple assumed setting the equation in line 9 appears incorrect, since the resulting noise-scheduler parameter mismatches the one used in the outer loop. Finally, what does the symbol in Fig. 2 mean? 2) The storyline is disconnected: there seems to be no connection between Sec. 4 and Sec. 5. 3) Typos. For example, in Table 1 the best CLIP score belongs to ControlNet rather than the bolded entry.
At a minimum, the authors should briefly explain how the DDPM background relates to their method, clarify the notation used in Algorithm 1, and explain the meaning of the symbol in the caption of Fig. 2.
- Part of the contribution seems limited or unreasonable. The paper has two contributions: 1) introducing alternative maximization, and 2) proposing a fine-tuning strategy for the feature extractor based on two loss functions.
Focusing on contribution 2, one of the two losses comes directly from MPGD [1]. Concretely, MPGD observed that mapping to a low-dimensional latent space helps training-free conditional generation and therefore trained an auxiliary autoencoder; this loss is the same as MPGD's with the autoencoder replaced by a feature extractor, which weakens contribution 2. Meanwhile, the other loss is unreasonable: taken as written, optimizing it requires a Hessian, because we must first compute a gradient and then differentiate that gradient again (a short sketch of this double-backward cost appears at the end of this Weaknesses section). The resulting computation cost is too high, which makes the loss impractical.
At a minimum, the authors should clarify how AccCtr differs from or improves upon MPGD, particularly regarding this loss term.
- The experimental results are insufficient. 1) As discussed above, the fine-tuning strategy is similar to training an autoencoder in MPGD, so MPGD-z (MPGD with the autoencoder) should be included as a baseline. 2) An ablation study on the key hyperparameter is missing. This hyperparameter is very important for training-free conditional generation since it directly determines generation quality; in this paper it also appears as a parameter of the proposed loss, so its influence on generation quality must be discussed.
[1] Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim, Wei-Hsiang Liao, Yuki Mitsufuji, J. Zico Kolter, Ruslan Salakhutdinov, and Stefano Ermon. Manifold Preserving Guided Diffusion. ICLR 2024.
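To illustrate the double-backward cost raised in the second weakness: the short PyTorch sketch below (our illustration with placeholder shapes, not the paper's loss) shows that when a training loss contains a term that is itself a gradient of the feature extractor's output, backpropagating through it requires create_graph=True, i.e., second-order (Hessian-vector-product) computation through the extractor.

```python
import torch
import torch.nn as nn

extractor = nn.Sequential(nn.Flatten(), nn.Linear(64, 8))   # stand-in feature extractor
x0_hat = torch.randn(1, 64, requires_grad=True)             # stand-in posterior-mean estimate
y = torch.randn(1, 8)                                        # stand-in target condition features

# inner condition-matching loss and its gradient, kept in the graph
inner = (extractor(x0_hat) - y).pow(2).sum()
g, = torch.autograd.grad(inner, x0_hat, create_graph=True)

# the outer loss depends on that gradient; its backward pass must differentiate g,
# i.e., it needs second-order derivatives through the extractor
outer = (x0_hat - 0.1 * g).pow(2).sum()
outer.backward()
print(extractor[1].weight.grad.shape)   # torch.Size([8, 64]): second-order signal reaches the extractor
```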
Questions
- What is the connection between Fig. 1 and the text-to-image tasks? Unlike the tasks in Fig. 1, text-to-image training-free conditional tasks face an additional challenge from the prompt: we must handle guidance from the text as well as guidance from additional conditions such as depth maps. The tasks in Fig. 1 therefore lack text guidance, which makes the explanation of slow generation in Sec. 5 unconvincing. The authors should clarify this point carefully.
- Does AccCtr work on the style-transfer task? The motivation for this question is that previous works such as MPGD and FreeDoM handle style transfer, where a style image is given as the condition to guide Stable Diffusion to generate images that obey both the style and the provided prompt.
- Is there any way to decrease the fine-tuning cost? Currently it reaches 60 hours, and unlike MPGD, which still works well after dropping its autoencoder, the proposed method cannot drop its feature extractor. Under these conditions, the computation cost is hard to accept for a training-free method.
To sum up, all my concerns are listed in Weaknesses and Questions. The advantage of this paper is that it introduces alternative maximization, but the paper should be improved further. Under these conditions, I rate it as "reject." If the authors can address these concerns, I am willing to raise my rating.
To accelerate sampling in text-to-image controllable generation, this paper retrains the condition extraction network to refine the guidance provided by the loss, and denotes the result the AccCtr framework. The motivation is an analysis showing that manifold deviation is a key factor behind slow sampling, requiring more iterations to match both the target conditions and the data manifold. In practice, AccCtr can be seamlessly plugged into current training-free conditional diffusion models with negligible sampling overhead. Experiments with the optimized condition extraction network demonstrate both effectiveness and high efficiency.
Strengths
- Slow sampling speed in diffusion models is a well-known issue that has attracted considerable attention from researchers. Consequently, this paper investigates an intriguing problem and presents meaningful analyses.
- The analysis of manifold deviation is reasonable and serves as a primary motivation for the work.
- The proposed alternative maximization is theoretically guaranteed, which provides some inspiration for computer vision tasks.
- The presentation is clear, the figures are visually appealing, and the writing is well done.
Weaknesses
- The connection between the alternative maximization and the optimization of the condition extraction network does not seem strong.
- In my humble opinion, the performance improvements shown in Table 2 are marginal. Since using an additional network is a common way to enhance generation, it may not be worthwhile to incur extra training cost for only a slight improvement.
- The method seems to trade off image quality against control strength, as its FID cannot match the other baselines.
- There is a lack of ablations on the choice of the unconditional diffusion step count and the conditional correction step count.
- It is suggested that the reference format be made uniform, as there are discrepancies between the introduction and other sections.
Questions
- Could you provide more experiments in the ablation study with quantitative evaluations, rather than relying solely on visual analyses?
- Why not include the sampling overhead in Table 2 for a more comprehensive comparison?
- What is the detailed architecture of the condition extraction network?
- In Table 1, why not also report quality metrics alongside the sampling-time comparison? This would make the analysis more convincing.
If all of my concerns are addressed, I will consider improving the score.
The paper proposes an alternative maximization problem for conditional diffusion models. The key idea seems to be to run an unconditional diffusion chain first and then follow up with a conditional correction using a retrained condition extraction network. The paper claims improvements in both sample quality and sampling time w.r.t. prior art. The approach is validated using multiple input modalities (e.g., depth, canny edges, or segmentation masks) and shows promising results when compared to prior models.
Strengths
The topic is of interest to the ICLR community.
The figures are easy to read and understandable.
Alg. 1 aids understanding of the method.
Weaknesses
The writing of the paper is not easy to follow; it takes multiple readings of the abstract and introduction to partially grasp the content. The presentation of the methodology section could be reorganized and simplified. The results section is missing some details, e.g., the discussion of Table 2 is very short.
The validation is limited to only one generative model, and the quantitative results are limited. Adding more generative models (including flow-matching ones) and more benchmark comparisons would make the paper stronger.
Given that the paper claims improvements in sampling efficiency, it would be nice to discuss connections to the efficient-sampling literature, e.g., https://arxiv.org/abs/2211.13449 and https://arxiv.org/abs/2310.19075.
Given the limited validation and the lack of positioning relative to methods that sample efficiently from diffusion models, the impact of the introduced solution is unclear.
Questions
Please strengthen the presentation by improving the positioning with respect to related work.
Adding more generative models and quantitative results would strengthen the paper.
In Table 2 the reported differences across methods are not too high. Would it be possible to add confidence intervals to the numbers?
The use of \cite and \citep in the introduction is a bit confusing; e.g., in the second paragraph all the citations should use \citep.
AccCtr trains a condition extraction network on the latent space of an LDM such that its gradient can guide the posterior mean toward the clean image paired with the given condition. By applying the trained condition extraction network during the diffusion sampling process, AccCtr achieves conditional sampling.
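A minimal sketch of one possible reading of this summary (placeholder shapes and names, not the paper's actual objective): the condition extraction network is trained so that a single gradient step on the condition-matching loss moves a posterior-mean estimate toward the paired clean latent.

```python
import torch
import torch.nn as nn

extractor = nn.Linear(16, 4)                       # stand-in condition extraction network
opt = torch.optim.Adam(extractor.parameters(), lr=1e-3)
eta = 0.1                                          # assumed guidance step size

for _ in range(100):
    z0 = torch.randn(8, 16)                        # paired clean latents
    c  = torch.randn(8, 4)                         # their conditions (e.g. depth features)
    a  = torch.rand(8, 1) * 0.9 + 0.05             # random signal levels (alpha_bar)
    zt = a.sqrt() * z0 + (1 - a).sqrt() * torch.randn_like(z0)
    z0_hat = zt / a.sqrt()                         # crude posterior-mean stand-in (no denoiser here)

    z0_hat = z0_hat.detach().requires_grad_(True)
    match = (extractor(z0_hat) - c).pow(2).sum()
    g, = torch.autograd.grad(match, z0_hat, create_graph=True)

    # train the extractor so that one guided step lands near the paired clean latent
    loss = (z0_hat - eta * g - z0).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Whether AccCtr's objective actually takes this form is one of the points the weaknesses and questions below ask the authors to clarify.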
Strengths
- Training the condition extraction network with the proposed objective function provides promising results without repeating multiple updates on the posterior mean.
Weaknesses
- First of all, the proposed method actually "trains" the condition network from scratch using a paired set of images and generated conditions (equation 14), while the title claims this is a method for "training-free" conditional generation. This could confuse readers and must be fixed.
- Significant error in the theorem: equation (12) claims an equality (stated with simplified notation) that does not hold for a general function. Even if the function is assumed to be linear, the relevant term is not a constant but still depends on its argument; in other words, the conditional score function does not equal the claimed expression. Thus, the statement in lines 193-199 that "equations (10)-(12) alternately maximize the two objectives" is invalid, and the subsequent analysis in Sections 4-5 is broken.
- In Section 3.1, the authors claim that the posterior-mean estimate is the projection of the current sample onto the "image" manifold. However, the posterior mean is conditioned on the time step (i.e., the noise level). As the time step goes to 0, the manifold of posterior means does approach the image manifold, but this is not true for large time steps. This could easily be shown by visualizing the posterior mean during diffusion sampling (a small sketch of this check follows at the end of this list).
- Regardless of the theorem, related methods that optimize the Tweedie estimate for conditional sampling have already been explored in [1, 2], but there is no discussion of or comparison with them. The proposed method differs only in applying the optimization updates to the denoised estimate multiple times.
- Missing comparisons with training-free methods [3, 4] and adapter-tuning methods [5, 6].
- The manuscript should be proofread.
- The same notation denotes a pixel-space diffusion sample in Sections 1-4 and in Algorithm 1, but represents a latent code in lines 347-350. This could confuse readers.
- In equation (9), the notation is inconsistent and should be corrected to match equation (11).
- In Section 3.2, a different notation would be more appropriate for the "conditional" score, not just by definition but also to be consistent with the literature [7, 8, 9]; the stated equalities would still hold.
- The style of the manuscript does not follow ICLR2025.sty.
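As a follow-up to the third weakness above, here is a small sketch (ours, with a placeholder noise predictor) of the suggested check: record the posterior-mean estimate at several noise levels and inspect how far it is from a clean image at large versus small time steps.

```python
import torch

alpha_bar = torch.linspace(0.999, 0.02, 1000)      # toy noise schedule (assumption)
eps_model = lambda x, t: torch.zeros_like(x)       # placeholder noise predictor
x0 = torch.rand(1, 3, 64, 64)                      # a clean reference image

snapshots = {}
for t in (900, 500, 100, 10):                      # large -> small noise level
    a = alpha_bar[t]
    xt = a.sqrt() * x0 + (1 - a).sqrt() * torch.randn_like(x0)
    x0_hat = (xt - (1 - a).sqrt() * eps_model(xt, t)) / a.sqrt()
    snapshots[t] = x0_hat                          # with a real model, visualize these side by side
    print(t, (x0_hat - x0).abs().mean().item())    # with this weak predictor the gap grows with t
```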
Reference
[1] Decomposed Diffusion Sampler for Accelerating Large-Scale Inverse Problems (ICLR'24)
[2] DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation (ECCV'24)
[3] FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition (CVPR'24)
[4] MasaCtrl: Tuning-free Mutual Self-Attention Control for Consistent Image Synthesis and Editing (ICCV'23)
[5] ControlNet: Adding Conditional Control to Text-to-Image Diffusion Models (ICCV'23)
[6] Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model (arXiv'24)
[7] Score-Based Generative Modeling through Stochastic Differential Equations (ICLR'21)
[8] Diffusion Models Beat GANs on Image Synthesis (NeurIPS'21)
[9] High-Resolution Image Synthesis with Latent Diffusion Models (CVPR'22) - section 3.3
Questions
- In Section 5, the submission claims that the gradient may not provide accurate estimates for "large steps". Could the authors clarify the meaning of "large steps"? Does it refer to a large step size for gradient descent?
- What diffusion time step is used for Figure 2?
- In Section 5.2, what is the meaning of "retrain"? It would be great if the authors clarified whether it is fine-tuning or training from scratch. More specifically, is the condition extraction network trained with a two-stage approach using the two losses? If not, where do the authors obtain the pre-trained condition extraction network (line 334) that extracts conditions from the latent code of the LDM (line 352)?