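Average rating: 3.7 / 10

Status: withdrawn · 3 reviewers

Lowest rating: 3 · Highest rating: 5 · Standard deviation: 0.9

Average confidence: 3.7

Soundness: 2.0

Contribution: 2.0

Presentation: 1.7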

ICLR 2025

APCtrl: Adding Conditional Control to Diffusion Models by Alternative Projection

Longquan Dai,He Wang,Jinhui Tang


Submitted: 2024-09-27 · Updated: 2024-11-14

Abstract

Enhancing the versatility of pretrained diffusion models through advanced conditioning techniques is crucial for improving their applicability. We present APCtrl, a novel conditional image generation approach that formulates the latent $\mathbf{z}_t$ at timestep $t$ as the projection $\mathbf{z}_t = \text{Proj}_{\mathfrak{D}_t}(\mathbf{z}_{t+1})$ onto the denoising set $\mathfrak{D}_t$. For conditional control, APCtrl integrates the condition set $\mathfrak{C}_t$, defined by a latent control network $\mathcal{A}_{\boldsymbol{\theta}}(\cdot, \cdot)$. Our method simplifies conditional sampling to recursive projections $\mathbf{z}_t = \text{Proj}_{\mathfrak{I}_t} \circ \text{Proj}_{\mathfrak{D}_t}(\mathbf{z}_{t+1})$, where each projection step integrates both the diffusion and condition priors. By employing Alternative Projection, our approach offers several key advantages: 1. Multi-Condition Generation: easily expandable with additional conditional sets; 2. Model and Sampling Agnosticism: works with any model or sampling method; 3. Unified Control Loss: simplifies the management of diverse control applications; 4. Efficiency: delivers comparable control with reduced training and sampling times. Extensive experiments demonstrate the superior performance of our method.

Keywords

Diffusion Models, Condition Diffusion, Alternative Projection, Control-on-Training, Control-on-Sampling

Reviews and Discussion

Review

Rating: 5 · Confidence: 3 · 2024-10-29

In this paper, the authors introduce a method to include user control in diffusion models using control-on-sampling techniques. They present a framework that reformulates the denoising diffusion step as a projection step. By introducing another operator, which projects the current noisy sample onto a set that matches some condition with respect to a reference image, they can then apply the alternative projection paradigm to sample under the given reference-image condition. They perform extensive experiments and comparisons on text-to-image generation with Stable Diffusion v1.5 as the backbone.

Strengths

I appreciate the extensive experimental setup. The authors have performed comparisons with a large number of control methods for diffusion models across a large number of possible conditionings (depth, canny, HED, M-LSD, segmentation…). These comparisons are both quantitative and qualitative. They tried a variety of samplers and diffusion backbones. They also perform efficiency comparisons for the training of the network in APCtrl.
The paper is nicely structured and easy to follow. I have some concerns regarding the mathematical rigor of certain statements (see the section on Weaknesses below) but overall I am satisfied with the quality of the writing.

Weaknesses

My main concern is the limited novelty of the approach when it comes to the alternate projection method. Limited novelty is not a problem per se, as the experiments are strong, but I think that extremely relevant works have been omitted from the related work and could also be included in the experimental comparison. For instance, there is a line of work [1,2,3] (see [4] for a review) that also operates under constraints. Although the goal of these methods differs from the refined user control investigated in the current paper, their goal of sampling from a posterior distribution is more general. I think revisiting the algorithm and the contributions of the authors in light of these papers is necessary.
The contributions are oversold. The paper asserts that the method is a Control-on-Sampling approach; see, for instance, the caption of Figure 5: “our method is a Control-on-Sampling approach”. I would like to challenge this idea, since the current method requires training a neural network due to the latent conditioning. This is in stark contrast with other Control-on-Sampling methods such as FreeDom, DSG or UniGuid (see Table 3). Second, the paper claims that among Control-on-Sampling methods their approach is the only one that is “Controlled by Multi-Condition”. However, the Universal Guidance paper states: “We propose an algorithm that enables universal guidance for diffusion models. Our proposed sampler evaluates the guidance models only on denoised images, rather than noisy latent states. By doing so, we close the domain gap that has plagued standard guidance methods. This strategy provides the end-user with the flexibility to work with a wide range of guidance modalities and even multiple modalities simultaneously. The underlying diffusion model remains fixed and no finetuning of any kind is necessary.” Can the authors elaborate on this? The corresponding column in Table 1 does not seem like a fair assessment of competing contributions. Finally, I would argue that Universal Guidance [5] also satisfies “Sampling Agnosticism”. Indeed, its authors only propose an update of the $\varepsilon$-prediction (see Equation 4 in [5]). Since the $x_0$-prediction and all other predictions are related to the $\varepsilon$-prediction by a one-to-one mapping, that method is also sampling agnostic.
I find it quite suspicious that a large number $N$ of iterations hurts the generation quality, as depicted in Figure 8. If anything, following the explanations of the authors, the image should be continually refined. Of course there should be a cost to using large values of $N$, but I am having trouble understanding why the quality also decreases. It would be good (and I think necessary) to explain this phenomenon.

[1] Chung et al. – Diffusion Posterior Sampling for General Noisy Inverse Problems

[2] Chung et al. – Decomposed Diffusion Sampler for Accelerating Large-Scale Inverse Problems

[3] Chung et al. – Improving Diffusion Models for Inverse Problems using Manifold Constraints

[4] Peng et al. – Improving Diffusion Models for Inverse Problems Using Optimal Posterior Covariance

[5] Bansal et al. – Universal Guidance for Diffusion Models
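
The one-to-one mapping between $\varepsilon$-prediction and $x_0$-prediction invoked in the sampling-agnosticism argument above can be illustrated with a minimal sketch. It assumes the standard DDPM forward relation $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon$; the function names and the value of `alpha_bar_t` are hypothetical, and nothing here is taken from the submission or from [5].

```python
import math
import random

def x0_from_eps(x_t, eps, alpha_bar_t):
    """Clean-sample estimate implied by a noise prediction (assumed DDPM relation)."""
    a, s = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(x - s * e) / a for x, e in zip(x_t, eps)]

def eps_from_x0(x_t, x0, alpha_bar_t):
    """Noise prediction implied by a clean-sample estimate (inverse of the map above)."""
    a, s = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(x - a * z) / s for x, z in zip(x_t, x0)]

random.seed(0)
alpha_bar_t = 0.3  # hypothetical cumulative schedule value, must lie strictly in (0, 1)
x0 = [random.gauss(0.0, 1.0) for _ in range(4)]
eps = [random.gauss(0.0, 1.0) for _ in range(4)]

# Build the noisy sample with the assumed forward relation.
a, s = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
x_t = [a * z + s * e for z, e in zip(x0, eps)]

# Going eps -> x0 -> eps recovers both quantities, so the map is invertible for fixed x_t.
x0_hat = x0_from_eps(x_t, eps, alpha_bar_t)
eps_hat = eps_from_x0(x_t, x0_hat, alpha_bar_t)
round_trip_ok = all(
    math.isclose(u, v, abs_tol=1e-9) for u, v in zip(x0_hat + eps_hat, x0 + eps)
)
```

Because the conversion is affine in the prediction for a fixed $x_t$ and timestep, a guidance update written for one parameterization can be re-expressed in the other, which is the sense in which the argument above treats such an update as independent of the sampler's parameterization.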

Questions

In the abstract, the projection operator on line 17 is not defined. It is hard to understand what is meant here by the update of $z_t$.
Calling the denoising step a denoising projection is a stretch. I think this might be misleading as it is not performing a projection in the mathematical sense. If there is truly a mathematical statement behind this assertion I would like to see a proof and a proper statement.
In Section 3.2, the title is “From latent control to latent control”. I think it should be “From pixel control to latent control”.
No limitation is stated in the paper.
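
For reference, the mathematical notion of projection that the second question above appeals to is usually stated as follows. This is the textbook Euclidean definition for a generic nonempty closed set $S$, given here only as a point of comparison; $S$ is a placeholder, not the paper's $\mathfrak{D}_t$, and whether the denoising step satisfies any of these properties is exactly what the question asks the authors to establish.

```latex
\[
\operatorname{Proj}_{S}(\mathbf{z}) \in \arg\min_{\mathbf{y} \in S} \|\mathbf{z} - \mathbf{y}\|_2,
\qquad
\operatorname{Proj}_{S}(\mathbf{y}) = \mathbf{y} \ \text{ for all } \mathbf{y} \in S,
\qquad
\operatorname{Proj}_{S} \circ \operatorname{Proj}_{S} = \operatorname{Proj}_{S}.
\]
% If S is additionally convex, the minimizer is unique and the map is nonexpansive:
\[
\|\operatorname{Proj}_{S}(\mathbf{x}) - \operatorname{Proj}_{S}(\mathbf{z})\|_2
\le \|\mathbf{x} - \mathbf{z}\|_2 .
\]
```

A proof of the kind requested would need to identify the set being projected onto and show at least the first property (distance minimization) for the denoising update.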

Review

Rating: 3 · Confidence: 4 · 2024-11-01

This paper integrates concepts derived from alternative projection techniques and conditional control with the aim of enhancing diffusion models. It provides practical insights regarding recursive projection for controllable generation, although the theoretical/mathematical insights are somewhat limited. Moreover, it showcases examples that effectively demonstrate the efficiency of condition handling.

Strengths

The topic of enhancing the versatility of pretrained diffusion models is of great interest, and this work makes some effort in that direction.

Weaknesses

It seems to me that the novelty and significance of this work are insufficient for ICLR. This is especially the case considering that Latent Control is a method that has been introduced previously. Moreover, the training approach described entails retraining the latent network whenever a new model or dataset is employed.
In many cases, the quantitative results presented in Table 2 fail to demonstrate the advantages of the proposed APCtrl approach in comparison with competing methods like ControlNet and ControlNet++.
In Section 2, I was confused by Figure 1. The direction of the arrows doesn't match the logic of the algorithm. A more detailed explanation is required to clarify its connection with Algorithm 1. Alternatively, a more intuitive illustration would be beneficial.
There are several instances where definitions appear somewhat unconventional. I would recommend revisiting these sections to ensure clarity and adherence to standard notation:

The definition of $\mathfrak{J}_t := \mathfrak{D}_t \cap \mathfrak{C}_t$ appears rather strange. The four operations defined on Page 16 for $\mathfrak{J}_t$ consist of two noise-adding and two denoising steps. However, it seems that these operations do not actually bring about a denoising effect.
While the definition of $\text{Proj}_{\mathfrak{C}_t} (\mathbf{z}_{t+1})$ in Eq. (7) is comprehensible, the variable within the parentheses does not show up in the expression itself but only in the initialization. This makes the definition rather unconventional.
In Section 3, there seems to be a conflict within the definitions. Specifically, the "Up Projection" $\mathrm{Proj}_{\mathfrak{D}_{t+1}}(\mathbf{z}_t)$ in Eq. (8) and the denoising projection $\mathrm{Proj}_{\mathfrak{D}_{t}}(\mathbf{z}_{t+1})$ in Eq. (6) are in conflict, especially when considering $\mathrm{Proj}_{\mathfrak{D}_{t+1}}(\mathbf{z}_{t+2})$.

According to the FORMATTING INSTRUCTIONS FOR ICLR2025 CONFERENCE SUBMISSIONS: "The text must be confined within a rectangle 5.5 inches (33 picas) wide and 9 inches (54 picas) long. The left margin is 1.5 inch (9 picas). Use 10 point type with a vertical spacing of 11 points. Times New Roman is the preferred typeface throughout", and "Do not change any aspects of the formatting parameters in the style files. In particular, do not modify the width or length of the rectangle the text should fit into, and do not change font sizes (except perhaps in the References section)". However, this submission seems to employ an unusual typeface (and the mathematical notation is rather unusual as well), which makes it somewhat uncomfortable for me to read. Moreover, the text within the tables appears to be set in a smaller font size. Therefore, I am uncertain whether this submission should be desk-rejected.
Some minor problems:

There seems to be a typo in the title of Section 3.2, “From Latent Control to Latent Control”, as this part discusses the differences between Pixel Control and Latent Control.
I found that the layout of the figures and tables in the paper causes inconvenience for the reader. The issues include the following: Table 2 is presented in Section 4.2, yet it actually pertains to Section 4.4. Figure 4 is shown in Section 4.2, but it belongs to Section 4.3. Additionally, other elements such as Table 3 and Figure 3 also contribute to this inconvenience.

Questions

N/A

Review

Rating: 3 · Confidence: 4 · 2024-11-04

The paper presents APCtrl, a method for conditional diffusion generation. Leveraging the idea of alternating projection, the authors propose to project the latent variable $\mathbf{z}_t$ of the latent diffusion model at each timestep onto a denoising set and a condition set, aiming to locate a point within the intersection of two sets and simplify the conditional sampling process to recursive projections.

Strengths

The authors perform extensive experiments, presenting results across multiple scenarios. They also demonstrate the efficiency of their method, particularly in comparison with control-on-training methods which require refinement of the latent space and incur retraining costs for the diffusion models.

Weaknesses

The method’s soundness is questionable.
- The main formula in Eq. 9 appears arbitrarily constructed. While classical alternating projections [1] update iteratively between two convex sets, Eqn. 9 employs four composite functions $(\mathcal{D}_t, \mathcal{D}_{t+1}, \mathcal{C}_t, \mathcal{D}_{t+1})$ (i.e., projection onto $\mathcal{D}_{t+1}$ is applied twice), adding two extra projections onto the denoising sets after alternating between the condition set and the denoising sets. The authors do not give a clear rationale or justification for this modification, which adds potentially unnecessary complexity to the calculations.
- The results are concerning. Despite offering various visual and quantitative comparisons, the empirical improvements appear limited. In Table 1, the primary metric (FID score) is worse than that of comparison methods, suggesting that the modified latent states $\mathbf{z}_t$ may not adequately lie within the ideal intersection, thus reducing generation quality. Additionally, the visualizations in Figure 8 indicate that further iterations of the method may produce artificial-looking images compared to more natural scenes.
Clarity: There are notable issues with notation consistency and clarity throughout the paper.
- Lines 196-197: the vector $\mathbf{x}$ is never introduced in the paper.
- Line 203: The denoiser $\mathcal{Z}$ appears unexplained and is absent from the preceding formula.
- Line 217: $\mathbf{z}_{0|t}$ is not defined and it’s hard to understand why the approximation holds especially when $\mathbf{z}_t$ indicates a noisy state.
- Algorithms: (i) At step 4, defining $\mathcal{J}_t$ is unnecessary as the method does not explicitly calculate the intersection set. (ii) In the highlighted blue block, there is a loop over $n$ while $n$ is neither defined in the draft nor included in the algorithm’s input. (iii) It’s unclear why $\mathcal{C}_t$ serves as an algorithm input, as it appears to be the target solved iteratively in Eqn. 5.
- In Section 3.2, the authors review a few concepts and methods to compute the latent control. However, they fail to cite reference papers and do not give sufficient background information. For example, in line 213, they mention an encoder $\mathcal{E}$ for processing the control image; however, it is not clear what the role of this encoder is or whether it needs to be pre-trained.

[1] Relaxed alternating methods, Cegielski and Suchocka, 2008 SIAM
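
The classical two-set scheme that the soundness comment above contrasts with Eq. 9 can be sketched on a toy problem. The sets used here (the unit disc and the vertical line $x = 0.5$ in the plane) are assumptions chosen for illustration because both are convex and have closed-form projections; they are not the paper's $\mathcal{D}_t$ or $\mathcal{C}_t$, which are defined through networks and are not known to be convex.

```python
import math

def proj_disc(p, radius=1.0):
    """Euclidean projection onto the closed disc of the given radius centred at the origin."""
    x, y = p
    norm = math.hypot(x, y)
    if norm <= radius:
        return (x, y)  # already inside the set: projection is the identity
    return (radius * x / norm, radius * y / norm)

def proj_line(p, x_value=0.5):
    """Euclidean projection onto the vertical line x = x_value."""
    return (x_value, p[1])

def alternating_projection(p, n_iters=50):
    """Classical alternating projections: apply the two projections in turn."""
    for _ in range(n_iters):
        p = proj_line(proj_disc(p))
    return p

# Start far from both sets; the iterates approach a point in their intersection.
point = alternating_projection((3.0, 40.0))
in_disc = math.hypot(*point) <= 1.0 + 1e-9
on_line = abs(point[0] - 0.5) < 1e-12
```

Each iteration uses exactly one projection per set, which is the baseline against which the four-operator composition in Eq. 9 is being questioned.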

Questions

How is the inverse problem in Eqn. 7 being addressed? Assuming you are leveraging the gradient information, which algorithm are you using?
Regarding the experiments on compatibility with various diffusion backbones (lines 409-411), could you clarify the statement “APCtrl supplies Condition 1, complemented by Condition 2”? How are these conditions integrated into the sampling algorithm?
Have the authors compared the performance of unplugging versus plugging in the control step? The authors should consider reporting the performance of their backbone as a baseline, especially when the modification to the latent states might harm the generation process.

Details of Ethics Concerns

n/a

Withdrawal Notice

2024-11-14

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.