PaperHub
Overall rating: 6.6 / 10 (Poster; 5 reviewers, min 5, max 8, std 1.2)
Individual ratings: 6, 8, 8, 5, 6
Confidence: 3.4 · Correctness: 3.2 · Contribution: 2.8 · Presentation: 2.8
ICLR 2025

A Simple Approach to Unifying Diffusion-based Conditional Generation

Submitted: 2024-09-26 · Updated: 2025-04-27
TL;DR

We propose a simple architecture and training scheme that produces a single diffusion model that can be used for flexible conditional generation, estimation, and joint diffusion.

Abstract

Keywords
image generation, controllability, estimation

Reviews and Discussion

Official Review
Rating: 6

This paper proposes a method that generates diverse output modalities (images, depth maps, edge maps, poses) from diverse input modalities using a single generative model. The proposed UniCon model processes image and condition inputs “concurrently in two parallel branches, with injected joint cross-attention modules where features from two branches attend to each other.” LoRA weights are used to adapt the model from the pre-trained Stable Diffusion model. Experiments show that the approach performs on par with single-purpose generative approaches (ControlNet, Readout Guidance, etc.) in terms of FID and depth, pose, or edge estimation metrics, and outperforms JointNet, a related multi-task generative approach, in terms of results and speed.
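
For concreteness, a minimal PyTorch-style sketch of such a joint cross-attention block is shown below; the class and argument names are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch (assumed names, not the authors' code): two parallel
# branches exchange information by attending to each other's features.
import torch
import torch.nn as nn

class JointCrossAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.img_attends_cond = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cond_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, h_img: torch.Tensor, h_cond: torch.Tensor):
        # h_img, h_cond: (batch, tokens, dim) features from the two U-Net branches
        img_update, _ = self.img_attends_cond(h_img, h_cond, h_cond)   # image queries condition
        cond_update, _ = self.cond_attends_img(h_cond, h_img, h_img)   # condition queries image
        return h_img + img_update, h_cond + cond_update                # residual fusion
```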

Strengths

  • The paper is well written and illustrated.
  • The approach is well motivated, unifying abilities of multiple generative models into a single one.
  • The approach is less computationally intensive than many other diffusion-based approaches.
  • The results are convincing.

Weaknesses

  • No open sourcing of the code / models is mentioned.
  • There is no human study comparing the effective quality of the approach compared to the others. There are nice qualitative examples in Fig 3, but a human study would be more convincing to demonstrate these examples are representative.
  • One of the most concrete applications of generative image models would be serving downstream tasks, such as classification or depth estimation as demonstrated in this work. However, the estimation performance is far from SOTA performance.
  • There is some confusion regarding the U-Net architecture; at minimum, the sentence at line 242 should be rephrased: the original U-Net [2015] [...] does not contain transformer blocks, as it is currently written. The U-Nets employed in diffusion models do, but the sentence with this reference is confusing as it stands.

Questions

1/ Do you plan to release the source code to ease reproducibility of the approach?

2/ Could you do a human study?

3/ In Figure 3, third column, the fact that both the LooseControl and UniCon approaches lead to a snowman on the beach would suggest their latent spaces are really similar, as both approaches build on Stable Diffusion weights. The ControlNet result (first column) is also very similar to the UniCon one. This calls into question the generalization ability of the approach.

4/ Why is there no quantitative comparison with the Loose control results?

5/ How is the weight w_r set in Eq. (7)?

Comment

We greatly appreciate your thoughtful review and your recognition of our work's strengths. In the following, we address some concerns and provide additional clarifications to enhance our contributions.

“No open sourcing of the code/models is mentioned.”

Question 1: ”Do you plan to release the source code to ease reproducibility of the approach?”

We will open source all code (training and inference) and weights of models used in the paper upon acceptance.

“There is no human study comparing the effective quality of the approach compared to the others. There are nice qualitative examples in Fig 3, but a human study would be more convincing to demonstrate these examples are representative.”

Question 2: “Could you do a human study?”

Thanks for suggesting a human study. We conduct a user study on 4 models:

| Method | Depth | Edge | Pose |
|---|---|---|---|
| Readout-Guidance | 0.03 | 0.03 | 0.22 |
| ControlNet | 0.49 | 0.43 | 0.27 |
| UniCon (Ours) | 0.48 | 0.54 | 0.51 |

| Method | ID |
|---|---|
| IP-Adapter-Face | 0.47 |
| UniCon (Ours) | 0.53 |

Our method demonstrates results comparable to specialized conditional generation methods. We do not include the Appearance model because a suitable baseline for comparison is currently unavailable.

“One of the most concrete applications of generative image models would be serving downstream tasks, such as classification or depth estimation as demonstrated in this work. However, the estimation performance is far from SOTA performance.”

We provide a qualitative comparison of our method against Marigold (SOTA diffusion-based depth estimation method) in Fig. 11. Though not achieving SOTA on depth estimation, our method generally performs better than Marigold on diverse test cases, especially for images with ambiguous structure (Fig. 11 Line 2,5) or non-realistic style (Fig. 11 Line 1,7,8).

Meanwhile, existing benchmarks often lack scene diversity, as noted in Depth Anything V2. Therefore, we further evaluate UniCon-Depth on the relative depth estimation benchmark DA-2K, introduced in Depth Anything V2, which comprises diverse high-resolution test images.

| Method | Marigold | Geowizard | DepthFM | Depth Anything V1 | ZoeDepth | UniCon (Ours) | Depth Anything V2 (ViT-L) |
|---|---|---|---|---|---|---|---|
| Accuracy (%) | 86.8 | 88.1 | 85.8 | 88.5 | 89.1 | 90.5 | 97.1 |

On DA-2K, our UniCon-Depth model is superior to all other models except Depth Anything V2 (which serves as our annotator), demonstrating that our model generalizes better to diverse scenes.

There is some confusion regarding the U-Net architecture; at minimum, the sentence at line 242 should be rephrased...

Thank you for pointing this out. We understand the confusion caused by the imprecise phrasing. In the revised version, we will rephrase the sentence to differentiate between the original U-Net architecture and the U-Nets commonly adapted in diffusion models.

In Figure 3, third column, the fact that both the LooseControl and UniCon approaches lead to a snowman on the beach would suggest their latent spaces are really similar, as both approaches build on Stable Diffusion weights. The ControlNet result (first column) is also very similar to the UniCon one. This calls into question the generalization ability of the approach.

We agree that UniCon, LooseControl, and ControlNet exhibit similarities in latent space, as all are built upon the same base model, Stable Diffusion. In Fig. 3, the resemblance in the ControlNet result is primarily due to the strict conditioning on depth and the highly specific text prompt (“golden retriever in a field of dandelions”), which significantly constrains image appearance. We provide more qualitative comparison results in Fig. 12 to demonstrate the generalization ability of our approach.

Why is there no quantitative comparison with the Loose control results?

This is due to the lack of established metrics for evaluating loose-control performance:

  • Since there is no ground-truth (GT) image, it is not possible to compute metrics like FID.
  • Condition fidelity cannot be reliably measured against the boxified depth condition used in LooseControl.

Moreover, the LooseControl paper itself does not present quantitative results, which suggests the inherent difficulty in defining standardized metrics for this approach.

How is the weight w_r set in Eq. (7)?

For guidance, we adopt an optimizer (e.g., AdamW) to compute the gradient and determine the weighting factor w_r in Eq. (7), instead of manually setting a fixed weight factor, as suggested by Readout-Guidance. We explain the details in the Appendix (Line 754).
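
As a rough sketch of this idea (our illustration under assumed names, not the authors' code), the guidance gradient is handed to an AdamW optimizer over the latents, whose adaptive per-element scaling plays the role of w_r:

```python
# Sketch: optimizer-determined guidance weighting. readout_loss_fn is a
# hypothetical callable scoring the current latents against the target readout.
import torch

def guided_update(latents: torch.Tensor, readout_loss_fn, lr: float = 1e-2) -> torch.Tensor:
    latents = latents.detach().requires_grad_(True)
    opt = torch.optim.AdamW([latents], lr=lr)      # Adam's adaptive scaling replaces a fixed w_r
    opt.zero_grad()
    readout_loss_fn(latents).backward()            # gradient of the guidance loss w.r.t. latents
    opt.step()                                     # adaptively weighted latent update
    return latents.detach()
```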

Comment

Thank you for your comments on our paper. We have submitted a detailed response and a revised paper on OpenReview. With the discussion phase ending on November 26, we would like to confirm whether we have effectively addressed all concerns, particularly whether the additional qualitative results (Fig. 12, Fig. 11) and human study provide a more comprehensive understanding of UniCon’s performance compared to other specialized methods.

We also hope you might consider raising the scores based on these updates and the ongoing discussion.

Thank you

Comment

I would like to thank the authors, who addressed most of my concerns (reproducibility with code open-sourcing, diversity of results shown in an additional figure, performance via the human study). I'm still in doubt regarding the practical interest of this work for potential downstream tasks, but that's an issue I have with most image generation papers.

Official Review
Rating: 8

The paper introduces UniCon, an efficient approach to unify various conditional generation adaptation paradigms by modeling a single joint distribution across multiple modalities. This is done by having two U-Net branches that interact via a joint cross-attention. By also having different noise levels per branch, as previously done in Diffusion Forcing, both joint generation and conditional generation in both directions are possible during inference. To improve parameter efficiency, non-RGB branches are initialized to the RGB branch's weights and adapted using LoRAs instead of training them from scratch.
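
In symbols (our paraphrase of this setup; the notation is assumed rather than taken from the paper), training uses independent noise levels for the two branches, and the different inference modes are recovered as special cases:

```latex
% Paraphrase of the joint-diffusion training/inference setup; notation is illustrative.
\mathcal{L} \;=\; \mathbb{E}_{(x_0,y_0),\,t,\,s,\,\epsilon}\,
   \bigl\| \epsilon_\theta(x_t,\, y_s,\, t,\, s) - \epsilon \bigr\|^2,
   \qquad t,\ s \ \text{drawn independently.}
% Inference modes as special cases of the joint model:
%   conditional generation p(x|y): keep s ~ 0 (clean condition), denoise x_t over the full schedule
%   estimation p(y|x):             keep t ~ 0 (clean image),     denoise y_s
%   joint generation p(x,y):       set  t = s and denoise both branches together
```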

Strengths

The proposed method enables more flexible conditional generation than many previous methods (Cond -> RGB, RGB -> Cond, RGB & Cond joint generation) by posing conditional generation as a special case of a more general joint generation setup and combining it with different timesteps per sample as pioneered by Diffusion Forcing. Compared to the approaches prevalent in conditional diffusion adaptation methods, this enables more flexible generation scenarios. The quantitative and qualitative evaluations show competitive performance of the proposed method compared to more specialized approaches. The paper is easily understandable and presents the method clearly.

Weaknesses

The method, as presented, is limited to only accepting conditioning in the form of 2D images/feature maps. As it has now become common to support those combined with feature vector conditioning, as done in, e.g., Uni-ControlNet [Zhao et al., NeurIPS 2023], Composer [Huang et al., ICML 2023], or CTRLorALTer [Stracke et al., ECCV 2024].

The paper would benefit from additional qualitative examples in the appendix, as is commonplace in similar works.

Questions

Regarding the weaknesses, I'd recommend the authors incorporate more related work into their quantitative comparisons and related work. Especially Uni-ControlNet seems important to compare against, as it is very closely related (except for the joint generation part). For additional qualitative comparisons, I recommend following the example set by IP-Adapter, which provides a great overview of the qualitative performance of a wide range of methods.

The ablation study only focuses on RGB generation in a single setting. Have you also ablated effects in other sampling settings?

Comment

Thank you for your detailed review and for recognizing the flexibility and generality of our proposed method in conditional generation. Below, we address some concerns and provide further clarifications to strengthen our contributions.

The method, as presented, is limited to only accepting conditioning in the form of 2D images/feature maps. As it has now become common to support those combined with feature vector conditioning, as done in, e.g., Uni-ControlNet [Zhao et al., NeurIPS 2023], Composer [Huang et al., ICML 2023], or CTRLorALTer [Stracke et al., ECCV 2024].

Thanks for pointing out these related works. To clarify, there is a fundamental conceptual difference between UniCon and the mentioned works (e.g., Uni-ControlNet). Uni-ControlNet focuses on unifying multiple control signals within a single model to achieve controllable generation. In contrast, UniCon aims to unify multiple generation tasks within a single model, for a specific image condition.

Specifically, the mentioned works primarily address unifying various conditioning signals for controllable image generation. However, they cannot handle estimation, joint generation, or rough conditioning. Our method, on the other hand, learns a joint image distribution, emphasizing symmetry between the image and its conditioning. While it does not support conditions represented as feature vectors, it facilitates a broader range of generative behaviors beyond controllable generation.

The paper would benefit from additional qualitative examples in the appendix, as is commonplace in similar works. Regarding the weaknesses, I'd recommend the authors incorporate more related work into their quantitative comparisons and related work. Especially Uni-ControlNet seems important to compare against, as it is very closely related (except for the joint generation part). For additional qualitative comparisons, I recommend following the example set by IP-Adapter, which provides a great overview of the qualitative performance of a wide range of methods.

Thanks for suggesting additional comparisons. In Fig. 12, we provide a qualitative comparison of UniCon against ControlNet and Uni-ControlNet on conditional image generation, following the example set by IP-Adapter.

We also provide a qualitative comparison of our method against Marigold on depth estimation in Fig. 11.

Question: “The ablation study only focuses on RGB generation in a single setting. Have you also ablated effects in other sampling settings?”

Currently, we have only tested the ablation models on depth-to-image generation. Evaluating the models on additional sampling settings, such as depth estimation, would certainly enhance the reliability and comprehensiveness of the ablation study. We plan to include results on depth estimation once sufficient computational resources become available to test all ablated models.

Comment

Thank you for the response.

Regarding your first response, I first want to apologize for the seemingly cut-off comment in the initial review. I don't know how that happened. Given your response, at least the point I was trying to make still came across. For reference, the point I was trying to make was that (jointly) conditioning on non-image-like conditions has become an important aspect that is not covered by the presented work. This still holds for the updated work. However, I agree with the authors that implementing what we would typically implement as a conditional distribution as a joint distribution instead is an orthogonal task.

Regarding the second, I appreciate the authors providing additional qualitative results. Regarding the depth examples, while they do demonstrate the impressive strengths of the presented method in out-of-distribution situations that the authors also mentioned in other comments, I would appreciate the authors also including some more normal photo examples without extreme reflections or depth of field in a potential camera-ready version to make the comparison more representative of many real-world use cases. I also share xQEt's concerns about this performance potentially being more of a consequence of teacher and dataset choice than the underlying approach being superior for depth estimation, which is reflected in the model quantitatively underperforming on datasets typically considered in-distribution while overperforming on samples typically considered out-of-distribution.

Regarding the third, I would appreciate it if the authors added that.

While I still think that the paper would benefit greatly from further improvements, given the authors' response to my main point of criticism and the additional clarifications and results provided in response to other reviews, I will consider raising my score to reflect the improvements to the paper.

Comment

Thank you for your comments on our paper. We have submitted a detailed response and a revised paper on OpenReview. With the discussion phase ending on November 26, we would like to confirm whether we have effectively addressed all concerns, particularly whether the additional qualitative comparison on conditional generation (Fig. 12) and depth estimation (Fig. 11) provides a good overview of the performance of UniCon against other methods. We also hope you might consider raising the scores based on these updates and the ongoing discussion.

Thank you

Comment

Thanks for your reply!

This figure shows depth estimation samples on normal photos. We will include both normal photos and out-of-distribution samples in our final version. We are also training a UniCon-Depth model using ZoeDepth as the annotator to address the concern you share with xQEt, and will report the result when it is done.

We include depth estimation results in the ablation study; the accuracy on the DA-2K dataset is reported below:

| Method | FID-6K | AbsRel (%) | Accuracy (DA-2K) |
|---|---|---|---|
| UniCon-Depth | 13.21 | 9.26 | 90.5 |
| - Encoder | 13.64 | 10.16 | 90.4 |
| + Data (200K) | 13.10 | 8.66 | 91.0 |

The depth estimation performance is generally consistent with the conditional generation performance. The ablation that removes the depth loss (annotated as "- Depth loss") does not support depth estimation and is thus not included.

Thanks again for helping us improve our work!

Official Review
Rating: 8

This paper proposes a framework for unifying different conditional generation tasks involving image-condition correlation in diffusion models. The proposed method learns the joint probability distribution of images and conditions during training, which is then used for conditional generation during inference. The method mainly consists of parallel denoising paths for each condition, interacting with the image denoising path via cross-attention blocks. To maintain parameter efficiency, the base model for image generation is kept frozen, and Low-Rank Adaptation (LoRA) is used to adapt the base model to each condition. The proposed method also allows for generation based on multiple conditions. The authors evaluate their method on different conditional generation tasks involving conditions such as depth, edges, images, human poses, and identities.

Strengths

  • The paper is well-written and easy to follow.

  • The proposed method provides a parameter-efficient way to model the joint image-condition distribution, which is more versatile for different conditioning tasks compared to specialized conditional methods.

  • The authors provide sufficient experiments and comparisons for their method.

  • Based on the provided results, the proposed method seems effective in modeling the joint image-condition distributions and performing conditional generation.

Weaknesses

  • Conditional generation using the proposed method requires performing multiple denoising paths, which makes inference computationally intensive compared to direct conditioning, especially for multiple conditions.

Questions

  • It would be great if the authors could provide an analysis of the computational cost of inference using their method.

  • In Section 4.2, line 233, the authors mention "When the conditional image differs from a natural image, we add a LoRA to the y-branch." How is the branch trained for conditioning on natural images, considering that the base image generation U-Net is frozen and no LoRA is used?

Comment

We greatly appreciate your thorough review and your recognition of the strengths of our work. In the following, we address your concerns and provide further insights to strengthen our work.

Weakness: Conditional generation using the proposed method requires performing multiple denoising paths, which makes inference computationally intensive compared to direct conditioning, especially for multiple conditions.

Question 1: It would be great if authors could provide an analysis on the computational cost of inference using their method.

Thank you for highlighting the concern regarding inference cost. Before presenting our analysis, we want to clarify our model implementation. Though illustrated as two branches (image and condition), our approach uses one diffusion model and processes the two inputs in a single feed-forward pass. This is realized by selectively applying our condition LoRA to the condition sample. The detailed implementation is provided in Appendix A, and we will further clarify the architecture in the revised version.
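
A minimal sketch of what this selective application could look like (illustrative names and assumed mechanics, not the released code): image and condition latents are batched together, and the LoRA update is applied only to the condition rows.

```python
# Sketch (not the authors' implementation): a linear layer whose LoRA branch is
# applied only to rows flagged as condition samples within one batched forward pass.
import torch
import torch.nn as nn

class SelectiveLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base, self.scale = base, scale
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # LoRA starts as a no-op on top of the frozen base

    def forward(self, h: torch.Tensor, is_cond: torch.Tensor):
        # h: (batch, tokens, dim); is_cond: (batch,) with 1.0 for condition rows, 0.0 for image rows
        out = self.base(h)                               # frozen pretrained path for all rows
        lora = self.up(self.down(h)) * self.scale        # low-rank update
        return out + lora * is_cond.view(-1, 1, 1)       # applied only to condition rows

# usage sketch: h = torch.cat([img_latents, cond_latents]);
#               is_cond = torch.tensor([0.0] * n_img + [1.0] * n_cond)
```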

Below, we provide an analysis of inference speed.

Table A: Inference Speed Comparison (Iterations per Second)

| Method | ControlNet | UniCon (Remove Joint Modules) | UniCon |
|---|---|---|---|
| iterations/s | 10.82 | 7.65 | 5.02 |

Table A compares UniCon with ControlNet for controllable image generation. UniCon incurs additional inference overhead because the LoRA layers (condition LoRA and joint cross-attention LoRA) are computed separately rather than fused into the pretrained weights. This overhead can potentially be mitigated by saving multiple attention weights (with and without LoRA) directly in the model.

Table B: Latency Comparison for Depth Estimation (on A800 GPU)

| Method | Marigold | Marigold (LCM) | ZoeDepth | Depth Anything V2 | UniCon (Ours) |
|---|---|---|---|---|---|
| Latency (A800) | 2.3 s | 372 ms | 289 ms | 91 ms | 4.0 s |

Table B compares UniCon with other depth estimation methods. Diffusion-based methods (e.g., UniCon and Marigold) generally exhibit higher latency than traditional approaches. However, UniCon can leverage techniques like Latent Consistency Models (LCM), as demonstrated by Marigold, to significantly improve inference speed.

In summary, while UniCon currently has higher inference costs, its computational efficiency can be enhanced through optimizations like LoRA weight fusion and LCM integration. We plan to explore these directions in future work.

Question 2: "In Section 4.2, line 233, the authors mention 'When the conditional image differs from a natural image, we add a LoRA to the y-branch.' How is the branch trained for conditioning on natural images, considering that the base image generation U-Net is frozen and no LoRA is used?"

For models including a natural image condition (e.g., ID, Appearance), we do not apply LoRA to the conditional branch and only train the joint cross-attention modules to learn the correlation between the two images (e.g., two images of one person, two frames from one video). Note that in this case the two branches are symmetric, and the model behaves the same as the base model if all joint cross-attention modules are removed.

Comment

Thank you for your comments on our paper. We have submitted a detailed response and a revised paper on OpenReview. With the discussion phase ending on November 26, we would like to confirm whether we have effectively addressed all concerns. We also hope you might consider raising the scores based on these updates and the ongoing discussion.

Thank you

Comment

I appreciate the authors' response to my questions.

  • Could the authors explain the reason for reporting different metrics (iterations per second and latency) on the two tasks in their computational analysis? Additionally, if there is time, it would be interesting to see the memory consumption as well.

  • To fully understand the details of the method: do the additional parameters added to the model with each conditioning come from both the joint cross-attention modules and the optional LoRA weights? In other words, does each conditioning require its own task-specific joint cross-attention modules?

Comment

Thanks for your reply.

Could the authors explain the reason for reporting different metrics (iterations per second and latency) on the two tasks in their computational analysis?

For diffusion-based conditional generation, latency changes with the number of sampling iterations, so iterations per second is a better efficiency metric. For depth estimation, there are also non-diffusion methods such as ZoeDepth and Depth Anything V2, which have no denoising iterations, so we simply report the latency for all methods.

Additionally, if there is time, it would be interesting to see the memory consumption as well.

Here we provide a table of max memory consumption for methods we have compared:

| Method | ZoeDepth | Depth Anything V2 | Marigold | ControlNet | UniCon (Ours) |
|---|---|---|---|---|---|
| Max GPU memory allocated | 2233.19 MB | 1914.42 MB | 4043.09 MB | 2908.33 MB | 4297.67 MB |

does each conditioning require its own task-specific joint cross-attention modules?

Yes, each conditioning requires its own condition-specific joint cross-attention modules, which support the diverse tasks shown in the paper. In other words, one UniCon model targets one specific image-condition pair, such as image-depth for UniCon-Depth. Each model has its own parameters, including the joint cross-attention modules and the condition LoRA. We are also exploring whether different conditions can be unified in one model, i.e., training shared joint cross-attention modules and condition-specific LoRAs.

Thanks again for helping us improve our work!

Comment

I appreciate the authors' response and clarification. Overall, I find the proposed method interesting, and I do not have any major concerns. Therefore, I change my rating to Accept.

Official Review
Rating: 5

This paper introduces UniCon, a unified framework for handling diverse conditional generation tasks by learning a joint distribution over correlated image pairs using diffusion models. UniCon employs a simple architecture that adds minimal learned parameters (15% of base model) and uses a single efficient training stage while maintaining the standard model input. Through extensive experiments, the authors demonstrate that their single model can produce comparable or better results than specialized methods and prior unified approaches, while also showing that multiple models can be effectively combined for multi-signal conditional generation.

Strengths

  • Unified framework for handling diverse conditional generation tasks through a joint distribution approach
  • Lightweight adaptation of existing diffusion models with minimal parameter overhead
  • Clear empirical demonstration of the proposed framework

Weaknesses

  1. Missing Comparisons/References
  • The paper lacks comparisons with several important recent methods in depth estimation, and this limits our understanding of where the method stands in relation to the current state-of-the-art in depth estimation.

    • ZoeDepth
    • Depth Anything
    • Depth Anything v2 (which is only used as an annotator in this work)
  • In addition, it will be helpful to discuss "DAG: Depth-Aware Guidance with Denoising Diffusion Probabilistic Models" [Kim et al.], which appears to be a relevant line of work for conditional generation with depth information.

  2. Technical Ambiguities
  • Section 4.2 should explicitly state the model architecture, e.g., whether the approach uses two pre-trained diffusion models and requires multiple feed-forward passes.
  • This paper does not provide sufficient analysis of the inference cost. Based on current understanding, while it uses LoRA, it requires twice as many feed-forward passes, making the inference step complex and resource-intensive.
  3. Joint Distribution Modeling
  • While the paper presents joint distribution modeling as a key contribution, it should better justify why this particular formulation is advantageous over alternatives. The theoretical benefits of learning the joint distribution p(x,y) versus other approaches could be more thoroughly explained and experimented. For instance, their joint modeling might mutually enhance depth-aware image generation through improved representation.

Questions

  1. What is the inference cost of depth prediction compared to other traditional and recent approaches?

  2. How does your method compare with ZoeDepth, Depth Anything, and Depth Anything v2?

  3. What are the specific advantages of joint distribution modeling over other approaches discussed in the paper, particularly in terms of inference, accuracy, and other metrics?

Comment

Thank you for your constructive feedback and for highlighting the strengths of our work. Below, we address some concerns and provide additional details to further clarify and enhance our contributions.

Weakness 1: Missing Comparisons/References

Question 2: "How does your method compare with ZoeDepth, Depth Anything, and Depth Anything v2?"

We show our model's performance on zero-shot affine-invariant depth estimation compared to ZoeDepth and Depth Anything V2.

| Method | AbsRel (NYUv2) | $\delta_1$ (NYUv2) | AbsRel (ScanNet) | $\delta_1$ (ScanNet) |
|---|---|---|---|---|
| UniCon (Ours) | 7.85 | 93.94 | 9.22 | 91.87 |
| ZoeDepth | 4.93 | 97.53 | 6.12 | 96.33 |
| Depth Anything V2 | 4.87 | 97.14 | 4.82 | 97.31 |

Although our UniCon model still lags behind state-of-the-art (SOTA) depth estimators on NYUv2 and ScanNet, existing benchmarks often lack scene diversity, as noted in Depth Anything V2. To address this limitation, we further evaluate UniCon-Depth on the relative depth estimation benchmark DA-2K, introduced in Depth Anything V2, which comprises diverse high-resolution test images.

Table A:

| Method | Marigold* | Geowizard* | DepthFM* | Depth Anything V1* | ZoeDepth | UniCon-Depth (Ours) | Depth Anything V2 (ViT-L)* |
|---|---|---|---|---|---|---|---|
| Accuracy (%) | 86.8 | 88.1 | 85.8 | 88.5 | 89.1 | 90.5 | 97.1 |

*: results from Depth Anything V2

On DA-2K, our UniCon-Depth model is superior to all other models except Depth Anything V2 (which is used as our annotator), demonstrating that our model generalizes better to diverse scenes. We also add a qualitative depth estimation comparison against Marigold in Fig. 11. Note that our method is trained on 16k images without GT depth annotations.

"In addition, it will be helpful to discuss "DAG: Depth-Aware Guidance with Denoising Diffusion Probabilistic Models" [Kim et al.], which appears to be a relevant line of work for conditional generation with depth information.”

DAG guides a diffusion model to generate geometrically plausible images using depth prior. We will add a discussion about DAG in Sec. 2.

Weakness 2: Technical Ambiguities

Though illustrated as two branches (image and condition), our approach uses one diffusion model and processes the two inputs in a single feed-forward pass. This is realized by selectively applying our condition LoRA to the condition input. We explain the detailed implementation in Appendix A and will clarify the model architecture in the revised version.

We provide an analysis of inference speed for controllable image generation.

| Method | ControlNet | UniCon (Remove Joint Modules) | UniCon |
|---|---|---|---|
| iterations/s | 10.82 | 7.65 | 5.02 |

The table reports denoising iterations per second for each method. Our method incurs more inference overhead than ControlNet. Currently, we do not fuse the LoRA weights into the pretrained weights, for either the condition LoRA or the LoRAs in the joint cross-attention modules; this separate LoRA computation incurs heavy overhead. The problem can potentially be solved by saving multiple attention weights (with/without LoRA) in the model. We plan to improve inference efficiency in future work.
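
As an illustration of the mitigation mentioned above (a standard LoRA merge, sketched under assumed variable names), the low-rank update can be folded into a second copy of the attention weight, so that image tokens use the original weight, condition tokens use the fused copy, and the extra low-rank matmuls disappear at inference:

```python
# Sketch: fold a LoRA update into a copy of the frozen base weight.
# base_weight: (out, in), down: (rank, in), up: (out, rank); names are assumed.
import torch

@torch.no_grad()
def fuse_lora(base_weight: torch.Tensor, down: torch.Tensor, up: torch.Tensor,
              scale: float = 1.0) -> torch.Tensor:
    # The fused copy would serve condition tokens; the untouched base serves image tokens.
    return base_weight + scale * (up @ down)
```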

Question 1: What is the inference cost of depth prediction compared to other traditional and recent approaches?

We show the inference latency of depth estimators in the following table:

| Method | Marigold | Marigold (LCM) | ZoeDepth | Depth Anything V2 | UniCon (Ours) |
|---|---|---|---|---|---|
| Latency (A800) | 2.3 s | 372 ms | 289 ms | 91 ms | 4.0 s |

Diffusion-based estimation methods (UniCon, Marigold) generally incur higher inference costs than the others. Similar to Marigold, UniCon can improve its inference speed by incorporating a Latent Consistency Model (LCM).

Comment

Weakness 3: Joint Distribution Modeling

Question 3: "What are the specific advantages of joint distribution modeling over other approaches discussed in the paper, particularly in terms of inference, accuracy, and other metrics?"

Thanks for pointing this out! The advantages of joint distribution modeling are central to our work. Below, we outline its benefits compared to three specialized methods. We provide qualitative results on depth estimation (Fig. 11) and conditional image generation (Fig. 12).

  • Vs. Marigold for depth estimation. As shown in the quantitative results on DA-2K (Table A) and the qualitative results (Fig. 11), our method generally performs better than Marigold on diverse test cases, especially for images with ambiguous structure (Fig. 11, Lines 2, 5) or non-realistic style (Fig. 11, Lines 1, 7, 8). Our joint modeling helps maintain SD's strong prior on natural images and helps our model generalize to OOD test cases never seen during training. Marigold fine-tunes the whole model to enable conditional generation p(depth|image), losing the prior on diverse natural images.
  • Vs. ControlNet for controllable generation. ControlNet sometimes generates over-saturated images (SoftEdge 2, 4 in Fig. 12) or images not aligned with the condition (Pose 4 in Fig. 12), while this rarely happens with our method. We attribute this to joint modeling encouraging our model to fully capture the bidirectional correlation between image and condition, whereas conditional modeling may learn to rely on minor cues in the condition image. Note that UniCon outperforms ControlNet in condition fidelity (Table 1).
  • Vs. LooseControl for rough depth-to-image. Our joint modeling enables UniCon to take noisy conditions as input. Therefore, our model can accept various rough depth conditions, including the boxified depth condition used by LooseControl. In comparison, LooseControl requires fine-tuning on the new depth format to achieve this.

Comment

Thank you for your comments on our paper. We have submitted a detailed response and a revised paper on OpenReview. With the discussion phase ending on November 26, we would like to confirm whether we have effectively addressed all concerns, particularly:

  • Whether our evaluation on DA-2K and the qualitative results sufficiently demonstrate UniCon's advantages in depth estimation.
  • Whether our analysis against three specialized methods (Marigold, ControlNet, LooseControl) demonstrates the benefits of joint modeling.

We also hope you might consider raising the scores based on these updates and the ongoing discussion.

Thank you

Comment

I appreciate the authors' detailed response. It would be great to see other discussions and results that are not yet included incorporated into the revision.

I think that significantly underperforming compared to the depth prediction annotator, Depth Anything V2, makes this paper less appealing. Despite Stable Diffusion having a rich image prior, it does not benefit the depth prediction. Thus, I still believe the authors need to address the source of this limitation (e.g., whether the pretraining stage of Stable Diffusion makes the representations unsuitable for depth estimation) and develop methods to reduce this gap with the teacher model.

Additionally, in this context, I feel that estimating accurate depth maps from diverse test cases is merely an effect of using the Depth Anything V2 model as the annotator, rather than this paper's real contribution. While I agree that the architecture of UniCon is suitable for preserving Stable Diffusion's rich image prior, a more rigorous ablation study is needed to differentiate between the effects of the training data and its architecture on depth estimation for diverse test images.

I will reconsider the rating if these concerns are resolved.

Comment

Thanks for your reply. We have incorporated the discussions and results into our revised version. Most results can be found in Appendix E.

I still believe the authors need to address the source of this limitation (e.g., whether the pretraining stage of Stable Diffusion makes the representations unsuitable for depth estimation) and develop methods to reduce this gap with the teacher model.

We acknowledge the gap between our model and Depth Anything V2, and it is currently hard to close. UniCon is a lightweight model, adding 15% parameter overhead and trained on 16k images. We believe larger-scale training and a larger parameter count would improve performance within the same framework.

Additionally, we find that our UniCon-Depth model can refine the results of other depth estimators. Taking the estimation from ZoeDepth as the depth input, we re-noise it and generate a new depth map using UniCon-Depth. Below, we show the result on DA-2K, denoted as UniCon-Depth (refine ZoeDepth).

| Method | ZoeDepth | UniCon-Depth | UniCon-Depth (refine ZoeDepth) | Depth Anything V2 (ViT-S) | Depth Anything V2 (ViT-L) |
|---|---|---|---|---|---|
| Accuracy (%) | 89.1 | 90.5 | 92.1 | 95.3 | 97.1 |

The performance is better than ZoeDepth or UniCon-Depth alone and closer to Depth Anything V2. The estimated depth from ZoeDepth serves as a good initialization, complementing the rich image prior of SD. Such editing ability is an advantage of our diffusion-based method over traditional depth estimators.
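
A rough sketch of this refinement loop (pseudocode with assumed helper names and an illustrative re-noising level, following a diffusers-style scheduler interface rather than any released API):

```python
# Sketch: refine an external depth estimate with the joint diffusion prior.
# encode_depth, unicon_step and scheduler are assumed helpers, not released APIs.
import torch

def refine_depth(image_latent, zoe_depth, encode_depth, unicon_step, scheduler,
                 start_t: int = 400):
    d = encode_depth(zoe_depth)                                    # depth map -> latent
    d_t = scheduler.add_noise(d, torch.randn_like(d),
                              torch.tensor([start_t]))             # re-noise to a mid-schedule level
    for t in scheduler.timesteps[scheduler.timesteps <= start_t]:
        # one joint denoising step: the image branch stays clean (t_x = 0),
        # only the depth branch is denoised from its current level t
        d_t = unicon_step(x=image_latent, y=d_t, t_x=0, t_y=t)
    return d_t
```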

Additionally, in this context, I feel that estimating accurate depth maps from diverse test cases is merely an effect of using the Depth Anything V2 model as the annotator, rather than this paper's real contribution.

We train our depth model on PascalVOC, which consists of 16K real photos (some examples). Therefore, our model could not generalize to diverse out-of-distribution test cases without the image prior from SD. We agree that the depth annotator influences the estimation performance: the estimation quality decreases when we switch the annotator to ZoeDepth, as shown here. In conclusion, we believe that both the depth annotator and our model structure contribute to the final performance on diverse test cases. A more rigorous ablation study (e.g., training our model on a real depth dataset) would demonstrate this more clearly. We leave it to future work.

We would appreciate it if you could reconsider the rating. Thanks again for helping us improve the paper.

Official Review
Rating: 6

The paper presents a diffusion-model-based conditional image generation method as an add-on to a current latent diffusion model (e.g., SD1.5). The model consists of joint cross-attention layers and a LoRA for the conditional image branch, jointly denoising a concatenation of the generated and conditional images. The model is trained with disentangled noise levels for the two branches, inspired by Diffusion Forcing, allowing flexible conditional generation from different sampling schedules. The final model is compared to various problem-specific conditional models as well as ControlNet, which solves a similar problem with a larger add-on network.

Strengths

The main strengths are the lightweight configuration, good reported performance, and the novelty of using independent timestep scheduling.

  • The writing of the paper is clear with comprehensive evaluations supporting the superiority of the proposed method.
  • The model is overall lightweight in terms of the size and the training time, compared to previous image conditional add-ons, e.g., ControlNets. This makes the method application-friendly.
  • Using disentangled noise level scheduling from Diffusion Forcing in paired generation for controllable image synthesis is quite novel. The demonstration in Figure 5 justifies this extension.

Weaknesses

Although I believe the current version of the manuscript is above the acceptance threshold, there are some limitations that prevent me from recommending it for higher honors (e.g., Highlight/Oral).

  • Although five image-conditional models are trained using the proposed framework, it seems that only three types (Depth, SoftEdge, Pose) are compared quantitatively. This asymmetry between the quantitative and qualitative demonstrations makes the manuscript feel incomplete.
  • Moreover, there are other metrics for comparing conditional generation. For example, for depth-conditional generation there is the $\delta_1$ metric used throughout the literature (please see Table 1 of Marigold and Table 2 of Depth Anything), which compares methods on at least five datasets. For pose-conditional generation, there are multiple metrics such as CLIP Identity Observance and Pose Control Accuracy (as in Table 1 of FineControlNet). A user preference study can also be used for tasks that lack quantitative measures (such as identity preservation and appearance preservation). There is also the CLIP score to report the quality of the generation as well as its text fidelity. I am not suggesting that all of the metrics I have mentioned be shown (these are just examples), but I believe a complete table of quantitative metrics is required at least for the models presented.
  • It would be better to have a quantitative comparison of images generated with synchronous and asynchronous noise-level scheduling at inference time. The demonstration in the manuscript shows that the method works, but it does not show that asynchronous generation leads to diversity and quality comparable to synchronous generation.

Questions

These questions are from pure curiosity and not counted in my final scores.

  • Is the presented method applicable/robust to inpainting/editing tasks? Supporting this feature would make the proposed method far more persuasive.
  • This is a very minor concern, but do the authors want to extend this model beyond SD1.5, which seems to be deprecated in the generative AI community?

Comment

Thank you for your thoughtful and detailed review. We appreciate your recognition of our method's lightweight design, strong performance, and the novelty of our independent timestep scheduling. In the following, we address some concerns and provide clarifications to further strengthen our work.

Weakness 1: Although there are five image conditional models trained using the proposed framework, it seems that only three (Depth, SoftEdge, Pose) types are compared quantitatively. This asymmetry in the quantitative/qualitative demonstration makes the manuscript incomplete.

Weakness 2: Moreover, there are also other metrics that compares the conditional generation. ... I do not want to recommend showing all those metrics that I have mentioned (these are for examples), but I believe it is required to have a complete table of quantitative metrics at least for models presented.

Thanks for your detailed suggestions on our experiments and metrics! We have now included additional metrics and experiments (user study, CLIP Score, $\delta_1$) to address this.

We add CLIP Score and $\delta_1$ to Table 1 for depth-conditional generation:

| Method | FID | CLIP Score | AbsRel | $\delta_1$ |
|---|---|---|---|---|
| Readout-Guidance | 18.72 | 32.07 | 23.19 | 68.68 |
| ControlNet | 13.68 | 31.97 | 9.85 | 90.15 |
| UniCon (Ours) | 13.21 | 32.11 | 9.26 | 91.02 |

Regarding the pose metrics used in FineControlNet, we found that CLIP Identity Observance requires instance descriptions that are absent from our validation dataset, and Pose Control Accuracy is similar to the PCK metric we use.

We conduct a user study on 4 models:

| Method | Depth | Edge | Pose |
|---|---|---|---|
| Readout-Guidance | 0.03 | 0.03 | 0.22 |
| ControlNet | 0.49 | 0.43 | 0.27 |
| UniCon (Ours) | 0.48 | 0.54 | 0.51 |

| Method | ID |
|---|---|
| IP-Adapter-Face | 0.47 |
| UniCon (Ours) | 0.53 |

For the Appearance model, evaluating it independently is challenging due to the following:

  • Applying it in isolation often produces outputs that closely resemble the input image. We must adjust the condition weight or text prompt for specific use cases. Its effectiveness is most apparent when combined with other models. Please check Figs. 1, 4, 7, 10 for its use cases.
  • A suitable baseline for comparison is currently unavailable.

Weakness 3: It would be better to have a quantitative comparison in generated images from synchronous and asynchronous noise level scheduling in the inference time. The demonstration in the manuscript shows that the method works, but it does not show that this asynchronous generation leads to diversity and quality comparable to synchronous one.

Below is a comparison of depth conditional generation under synchronous and asynchronous noise level scheduling:

| Method | FID | CLIP Score | AbsRel | $\delta_1$ |
|---|---|---|---|---|
| UniCon-Sync | 14.78 | 32.45 | 17.43 | 77.00 |
| UniCon-Async | 13.21 | 32.11 | 9.26 | 91.02 |

In the synchronous setting, we denoise the image and depth together while replacing the depth input with noisy GT depth at each step. The results demonstrate that asynchronous noise-level scheduling offers clear advantages for depth-conditional generation.
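
For clarity, a sketch of the two sampling regimes compared here (our paraphrase, with assumed helper names; `denoise_step` runs one joint denoising step and `add_noise` re-noises a clean sample to level t):

```python
# Sketch contrasting the two schedules used in the comparison above.

def sample_sync(x_T, depth_gt, timesteps, denoise_step, add_noise):
    # synchronous: both branches share the schedule; the depth input is replaced
    # with noisy ground-truth depth at every step
    x = x_T
    for t in timesteps:
        y_t = add_noise(depth_gt, t)
        x, _ = denoise_step(x, y_t, t_x=t, t_y=t)
    return x

def sample_async(x_T, depth_gt, timesteps, denoise_step):
    # asynchronous: the condition branch stays clean (t_y = 0) while the image
    # branch walks the full schedule
    x = x_T
    for t in timesteps:
        x, _ = denoise_step(x, depth_gt, t_x=t, t_y=0)
    return x
```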

Is the presented method applicable/robust to inpainting/editing tasks? Supporting this feature would make the proposed method far more persuasive.

Thank you for raising this point. We add Fig. 9 to showcase our model's performance on inpainting and/or editing of the image and/or condition. Figs. 5 and 10 also show related results.

Empirically, we observe that our model is robust for editing tasks. However, its inpainting performance is less stable, which is a limitation of the base model (Stable Diffusion). A straightforward solution is to apply our method to an inpainting-specific diffusion model, such as SD-Inpainting.

This is a very minor concern, but do the authors want to extend this model beyond SD1.5, which seems to be deprecated in the generative AI community?

Yes, we plan to train on more advanced base models, such as SD-XL, as computational resources allow.

Comment

Thank you for your comments on our paper. We have submitted a response and a revised paper on OpenReview. With the discussion phase ending on November 26, we would like to confirm whether we have effectively addressed all concerns, particularly whether the additional experiments and metrics adequately resolve the limitations you highlighted regarding a comprehensive manuscript. We also hope you might consider raising the scores based on these updates and the ongoing discussion.

Thank you

Comment

Dear Authors,

Thank you for the detailed responses to my concerns. I believe updates to the manuscript can be uploaded during the discussion phase. May I ask that these additional tables be added to the manuscript? Thank you.

One more question:

  • Is there a way to compare the performance of the "appearance" model? Lines 405-410 seem to advertise the contribution of this fifth model as video stylization or generation (such as AnimateDiff, maybe?). Can this be compared with other models using metrics from related works? I believe this "appearance" model is superfluous or inadequate in this version of the manuscript unless related quantitative measures and descriptions are provided. What are these for, and what are the expected inputs/outputs?

Other concerns I am not mentioning can be regarded as resolved. Please update the tables in the manuscript. Thank you!

Best, Reviewer zDcz

Comment

Thanks for your reply! Currently, we include all additional results (tables and figures) in Appendix E. We will move some key results to the Results section and rearrange the contents in a potential camera-ready version.

We acknowledge that the current version is superfluous regarding the Appearance model due to its lack of quantitative evaluation. We have therefore decided to lower the emphasis on it and treat it as an auxiliary model that helps transfer image appearance when used together with other models (Figs. 1, 7). We have removed Lines 405-410 and the Appearance model samples in Fig. 4.

In the following, we provide our understanding of the Appearance model for your information. Unlike AnimateDiff, which is trained on consecutive frames, our Appearance model is trained on random frames from the same video. Its goal is to generate frames that "seem to come from the same video" as the input image, ensuring consistent appearance without temporal awareness.

This figure illustrates how the model can be used to obtain output images with different levels of similarity. While applying the Appearance model independently is limited in scope, combining it with other control signals (e.g., depth) offers meaningful outputs, such as here. Video models like AnimateDiff cannot generate such discrete frames, so comparing the Appearance model against them is difficult. While outside the scope of this paper, the Appearance model could be extended to video editing/stylization using a dedicated pipeline (which is explored in a separate under-review paper).

Thank you again for helping us improve our work!

Comment

Thank you for the quick response. I really appreciate the hard work the authors have done during this discussion period. All my concerns are resolved. I believe this is a nice work extending diffusion models with additional control handles. However, I also believe that the current version of the manuscript is not perfectly polished. For example, in lines 360-370 the authors propose five variants in an equal tone, yet Figure 4 demonstrates four models qualitatively, and Table 1 compares only three models quantitatively. I understand there can be technical difficulties in comparing the variants symmetrically, but I believe this type of confusion, which undermines the completeness of the paper, can be effectively handled through writing. I am still positive about accepting the paper, but I cannot give a higher score than my original 6. Therefore, I will keep my score.

Comment

Thanks for your appreciation and suggestions. We have submitted another revised version to address the problem you mentioned. The revised sections are marked with red text for ease of review. We believe the current version is clearer and more consistent in the evaluation of 4 main models.

Thanks again for the constructive discussion. Your feedback has significantly strengthened our work.

Comment

Dear Authors, I appreciate your quick response to my comment. The changes you have made will be taken into consideration during the discussion with the other reviewers after this rebuttal period. Thank you for actively responding to my review; it really helped me understand your work. Best, Reviewer zDcz.

AC Meta-Review

This paper introduces a framework for handling diverse conditional generation tasks by learning a joint distribution over correlated image pairs with diffusion models. Parallel model branches employ joint cross-attention blocks such that branches can attend to each other and enable concurrent processing of both image and condition inputs. Low Rank Adaptation allows for efficient adaptation of the base model to conditions. Model evaluation under various conditional generation tasks forms the bulk of the experimental work.

After reviewing the paper, rebuttal, and resulting discussions, the AC believes that the overall strengths outweigh the weaknesses and recommends acceptance. Framing the problem under a joint distribution is interesting and novel. For the camera-ready version, the authors should incorporate all key rebuttal results and discussion points.

Additional Comments from the Reviewer Discussion

The paper received five reviews in which multiple reviewers recognised positive aspects relating to method efficiency, clear presentation, the framing of the problem under a joint distribution, and quantitative and qualitative performance.

The authors performed well in their rebuttal, addressing the majority of concerns by presenting additional evaluation results and explanations that further strengthen the work. Post-rebuttal, a majority of reviewers offer positive scores: two clear accepts, two borderline accepts, and one borderline reject. The remaining concerns relate to downstream task performance, in particular depth estimation, and can be considered more minor.

Final Decision

Accept (Poster)