PaperHub
5.5/10
Poster · 4 reviewers
Ratings: 3, 3, 3, 3 (min 3, max 3, std 0.0)
ICML 2025

Improving Compositional Generation with Diffusion Models Using Lift Scores

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We introduce a training-free resampling criterion for compositional generation with diffusion models, which is computationally efficient and requires no additional modules.

Abstract

Keywords

Diffusion Models · Training-free · Rejection Sampling

Reviews and Discussion

Review (Rating: 3)

This paper aims to improve compositional generation at inference time via rejection sampling using Lift scores on each condition to be composed.

Update after rebuttal: I appreciate the additional data (which I found convincing) and clarifications provided by the authors during the rebuttal. I was previously unaware of the prior work CAS mentioned by reviewer kyUu and agree that it may impact the novelty, but I also found the authors' clarifications about their focus specifically on compositions and systematic evaluation fairly convincing. Overall, I will keep my score at 3.

Questions For Authors

Please see the questions in Methods And Evaluation Criteria.

A few others: Are the samples in Figures 4 and 7 cherry-picked? If so, I feel this should be acknowledged in the captions, and possibly some not-so-great examples included in the appendix.

Answers to these questions would help me find the samples and metrics more convincing.

Claims And Evidence

Yes

Methods And Evaluation Criteria

Quantitative metrics are important for this study. I feel that this was done thoroughly and mostly reasonable choices were made, although I have a few questions.

2D synthetic: (Q) Why did you choose Chamfer Distance (as opposed to other metrics, e.g., KL, Wasserstein, etc.)?

CLEVR: SAM2 and the pretrained classifier from Liu et al. both make sense; it is nice that they were both tried and compared.

SDXL: CLIP and ImageReward seem reasonable here.
(Q) Could Segment Anything potentially be used in this context? (why not?). (Q) Would it be feasible to evaluate with the TIFA score as in Karthik et al (https://arxiv.org/pdf/2305.13308)? (Q) (L407) For assessing alignment via CLIP do you use the entire prompt (including all conditions) or assess the CLIP alignment on each condition individually? I believe it’s the former but do you think they latter might work better? (do you have any evidence on this?) (Q) Do you have any way to assess whether CLIP vs ImageReward is a more appropriate metric, and how well each of them actually checks whether all the conditions are present in the composition (e.g. for AND).

Theoretical Claims

N/A

Experimental Designs And Analyses

Please see Methods And Evaluation Criteria

Supplementary Material

No

Relation To Broader Scientific Literature

This approach builds on various existing approaches in diffusion, composition, rejection sampling, Lift scores, etc. I don't feel that it represents a big conceptual leap beyond these existing methods, however I think it cites prior work appropriately and focuses mainly on getting a conceptually simple idea to work well empirically with suitable validation.

Essential References Not Discussed

N/A

Other Strengths And Weaknesses

Strengths: I appreciate the focus on the missing objects problem. I find the samples and metrics fairly convincing (though please see Questions).

Weaknesses: A lot of the decisions feel a bit ad hoc (e.g., how many activated pixels "count", e.g., $\tau = 250$ on L245, and the choice to replace $\epsilon$ by $\epsilon_\theta(x, c_{\text{compose}})$ in Fig. 6), although I don't consider this a deal-breaker for a methods paper.

Other Comments Or Suggestions

I would appreciate a clearer discussion of the final probability you are actually optimizing for (or the energy you are minimizing) after your rejection-sampling procedure, if there is anything you can say theoretically about it. I did not find the variance-reduction interpretation of replacing $\epsilon$ by $\epsilon_\theta(x, c_{\text{compose}})$ very clear. I also wonder if there are any connections with CFG that you know of?

Author Response

Thank you for your valuable comments. We address your concerns as follows:

Chamfer Distance

We chose Chamfer Distance because it (1) applies to uniform distributions and (2) is sensitive to out-of-distribution samples. KL divergence is inapplicable because the density ratio is undefined for out-of-distribution samples in uniform-distribution settings. Wasserstein Distance is more robust to outliers, which does not meet our requirement of sensitively capturing unaligned samples.
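For concreteness, a minimal sketch of the symmetric nearest-neighbor Chamfer Distance between two 2D sample sets; the exact normalization used in the paper may differ:

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer Distance between point sets a (N, d) and b (M, d)."""
    # Pairwise squared Euclidean distances, shape (N, M).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    # Average nearest-neighbor distance in both directions.
    return float(np.sqrt(d2.min(axis=1)).mean() + np.sqrt(d2.min(axis=0)).mean())

# Toy usage: compare generated 2D samples against target-distribution samples.
rng = np.random.default_rng(0)
generated = rng.uniform(-1.0, 1.0, size=(500, 2))
reference = rng.uniform(-1.0, 1.0, size=(500, 2))
print(chamfer_distance(generated, reference))
```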

Segment Anything for t2i

Vanilla Segment Anything lacks text-prompt support; Grounded Segment Anything [1] would be a good candidate. We used CLIP/ImageReward following the standard practice in previous works [2,3], but we will add a discussion in the Conclusion.

TIFA score [2, 4]

New TIFA experiments show improvements across categories:

| Method | Animals (TIFA ↑) | Object&Animal (TIFA ↑) | Objects (TIFA ↑) |
| --- | --- | --- | --- |
| Stable Diffusion 1.4 | 0.692 | 0.822 | 0.629 |
| SD 1.4 + Cached CompLift | 0.750 | 0.886 | 0.685 |
| SD 1.4 + CompLift | 0.794 | 0.902 | 0.682 |
| Stable Diffusion 2.1 | 0.833 | 0.873 | 0.668 |
| SD 2.1 + Cached CompLift | 0.905 | 0.911 | 0.731 |
| SD 2.1 + CompLift | 0.927 | 0.912 | 0.726 |
| Stable Diffusion XL | 0.913 | 0.964 | 0.755 |
| SD XL + Cached CompLift | 0.949 | 0.972 | 0.790 |
| SD XL + CompLift | 0.946 | 0.974 | 0.782 |

entire prompt vs individual condition

We used the entire prompt. We have added a new experiment with minCLIP, the minimum CLIP score across the individual subjects (see the sketch below the table):

| Method | Animals (minCLIP ↑) | Object&Animal (minCLIP ↑) | Objects (minCLIP ↑) |
| --- | --- | --- | --- |
| Stable Diffusion 1.4 | 0.218 | 0.248 | 0.237 |
| SD 1.4 + Cached CompLift | 0.225 | 0.260 | 0.249 |
| SD 1.4 + CompLift | 0.228 | 0.263 | 0.252 |
| Stable Diffusion 2.1 | 0.237 | 0.258 | 0.247 |
| SD 2.1 + Cached CompLift | 0.248 | 0.265 | 0.260 |
| SD 2.1 + CompLift | 0.249 | 0.265 | 0.261 |
| Stable Diffusion XL | 0.243 | 0.269 | 0.264 |
| SD XL + Cached CompLift | 0.248 | 0.271 | 0.269 |
| SD XL + CompLift | 0.250 | 0.271 | 0.271 |

Similar performance gains were observed, indicating that CompLift primarily improves the weaker condition (typically the missing object).
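A sketch of how such a minCLIP metric could be computed with the open-source CLIP package; the model choice and per-subject prompts here are illustrative assumptions, not the authors' exact setup:

```python
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def min_clip_score(image: Image.Image, subjects: list[str]) -> float:
    """Minimum cosine similarity between the image and each subject prompt."""
    img = preprocess(image).unsqueeze(0).to(device)
    txt = clip.tokenize(subjects).to(device)
    img_f = model.encode_image(img)
    txt_f = model.encode_text(txt)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    return (img_f @ txt_f.T).squeeze(0).min().item()

# e.g. min_clip_score(Image.open("sample.png"), ["a black car", "a white clock"])
```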

CLIP vs ImageReward

We manually labeled 100 samples from Fig. 10 for the presence of both the black car and the white clock. With 62 positive and 38 negative samples, we calculated the metric performance:

| Metric | CLIP | ImageReward | TIFA |
| --- | --- | --- | --- |
| ROC AUC | 0.949 | 0.955 | 0.857 |
| PR AUC | 0.972 | 0.968 | 0.901 |

CLIP and ImageReward perform similarly, both better than TIFA. CLIP is slightly preferred due to the imbalanced data.
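AUC numbers of this kind can be computed from per-sample metric scores and binary labels with scikit-learn; a minimal sketch with made-up data (PR AUC computed here as average precision, which can differ slightly from a trapezoidal PR AUC):

```python
from sklearn.metrics import roc_auc_score, average_precision_score

# labels[i] = 1 if both objects are present in sample i, else 0 (manual labels).
labels = [1, 1, 0, 1, 0, 1]                    # 100 labels in the actual study
scores = [0.34, 0.33, 0.28, 0.35, 0.29, 0.32]  # e.g. CLIP score per sample

print("ROC AUC:", roc_auc_score(labels, scores))
print("PR  AUC:", average_precision_score(labels, scores))
```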

decisions feel a bit ad-hoc

We chose $\tau = 250$ as the median activated-pixel count among all images. Tests at the 25th and 75th percentiles showed the median works best: a lower $\tau$ reduces accuracy due to estimation variance, while a higher $\tau$ increases rejection rates.

Regarding $\epsilon_\theta(x_t, c_{\text{compose}})$: this design comes from empirical observations. Intuitively, if object $c$ exists in image $x$, then for most noisy images $x_t$, $\epsilon_\theta(x_t, c)$ should be closer to $\epsilon_\theta(x_t, c_{\text{compose}})$ than the unconditional $\epsilon_\theta(x_t, \varnothing)$ at the corresponding pixels. We'll add this explanation to the paper.
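As a hedged illustration of this intuition (not the paper's exact implementation), a per-pixel lift indicator can compare how well the conditional vs. unconditional predictions match the composed prediction used as the variance-reduction target:

```python
import torch

@torch.no_grad()
def pixelwise_lift(eps_cond: torch.Tensor,
                   eps_uncond: torch.Tensor,
                   eps_compose: torch.Tensor) -> torch.Tensor:
    """Positive where the condition's prediction explains the composed
    prediction better than the unconditional one does (sketch)."""
    err_cond = (eps_cond - eps_compose) ** 2      # conditional squared error
    err_uncond = (eps_uncond - eps_compose) ** 2  # unconditional squared error
    return err_uncond - err_cond  # > 0: pixel "activated" for this condition
```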

connections with CFG

Our paper uses the constrained distribution:

$$
x_0 \sim p_{\text{generator}}(x_0), \quad \text{s.t. } \log p(x_0 \mid c_i) - \log p(x_0) > 0 \quad \forall c_i.
$$

Now we show that [5,6] try to satisfy the constraint using soft regularization. Using Lagrangian relaxation with multipliers $\lambda_i \geq 0$, the objective can be transformed into:

$$
\mathcal{L}(x_0, \lambda) = \log p_{\text{generator}}(x_0) + \sum_{c_i} \lambda_i \Bigl( \log p(x_0 \mid c_i) - \log p(x_0) \Bigr), \quad \lambda_i \geq 0.
$$

Since $\nabla_{x_t}\log p_\theta(x_0) \propto \epsilon_\theta(x_t, t)$, and [5, 6] assume an unconditional generator, the derivative matches Equation 11 in [6]:

$$
\nabla_{x_t}\mathcal{L}(x_0, \lambda) \propto \epsilon_\theta(x_t, t) + \sum_{c_i} \lambda_i \Bigl( \epsilon_\theta(x_t, t \mid c_i) - \epsilon_\theta(x_t, t) \Bigr), \quad \lambda_i \geq 0.
$$

CFG [5,6] uses a fixed $\lambda_i = w$, which does not guarantee constraint satisfaction.
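For intuition, a minimal sketch of the resulting composed CFG update, with a fixed weight w standing in for every $\lambda_i$ (hypothetical helper; CompLift instead leaves sampling untouched and enforces the constraint post hoc via rejection):

```python
import torch

def cfg_composed_noise(eps_uncond: torch.Tensor,
                       eps_conds: list[torch.Tensor],
                       w: float = 7.5) -> torch.Tensor:
    """Composable-CFG noise estimate: every condition gets the same fixed
    weight w, i.e. the soft-regularization view with lambda_i = w (sketch)."""
    guided = eps_uncond.clone()
    for eps_c in eps_conds:
        guided = guided + w * (eps_c - eps_uncond)
    return guided
```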

Are Figure 4, 7, samples cherrypicked?

Yes: Figure 4 shows the samples with the most-improved CLIP scores, and Figure 7 uses random prompts with clear pixel separation and aesthetic quality. We will add more samples to the Appendix and update the captions to make the selection criteria clear.

[1] https://arxiv.org/abs/2401.14159

[2] https://arxiv.org/abs/2305.13308

[3] https://arxiv.org/abs/2301.13826

[4] https://arxiv.org/abs/2303.11897

[5] https://arxiv.org/abs/2207.12598

[6] https://arxiv.org/abs/2206.01714

Review (Rating: 3)
  • The paper introduces a novel criterion CompLift for rejecting samples of conditional diffusion models based on lift scores.
  • For compositional generation, i.e., cases in which the condition for sampling (e.g., a text prompt) can be described as a composition of conditions (like desired individual objects in the image), CompLift intuitively evaluates whether final samples are more likely given each individual condition than without it, and therefore whether the conditions have been properly considered throughout the generation process.
  • Formally, this criterion can be described in terms of lift scores, an existing concept in data mining, for which the authors introduce an approximation using the same conditional diffusion model as for sampling.
  • In an exploration of the design space, the authors evaluate the effect of noise and timestep sampling for the approximation and propose a more efficient algorithm that caches intermediate results from the generation process for later evaluation of the rejection criterion.
  • An evaluation on synthetic data, a toy image dataset, as well as text-to-image generation shows improved alignment with the conditions for compositional generation.

Questions For Authors

I do not have any particular questions to the authors.

Claims And Evidence

Most claims in the submission are supported by clear and convincing evidence except for:

  • The paper claims "significantly improve[d] compositional generation" (lines 24 ff., left column), while the quantitative results on the CLEVR position dataset mainly show accuracy improvements with 4 and 5 constraints, for which the FID, however, is worse than the Composable Diffusion baseline, as also mentioned by the authors (lines 424 ff., left column).
    • If the rejection of samples reduces sample diversity as hypothesized by the authors, the kind of improvements for compositional generation (condition / prompt alignment) should be specified to avoid misunderstandings.
  • The paper compares quantitatively on the 2D synthetic dataset against additional baselines (EBM [1]), but limits qualitative comparisons to the Composable Diffusion baseline only.
  • The main paper compares against baselines (EBM [1]) only on the 2D synthetic dataset, but fails to do so on the CLEVR and text-to-image setups. The appendix provides comparisons on CLEVR.

[1] Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC. ICML 2023

Methods And Evaluation Criteria

The proposed methods and evaluation criteria make sense except for:

  • For the text-to-image compositional task, the number of trials is different for the vanilla version and the cached version. While I understand that the number of trials for the cached version is equal to the number of sampling steps, a fair comparison of both versions in terms of number of trials would be interesting to see for this task in order to evaluate the effect of caching.
  • While providing results on the 2D synthetic dataset makes sense, there is a severe lack of clarity regarding this benchmark:
    • Figure 1, showing results on this dataset, is already on page 2 but first referenced on page 7. Its caption also does not describe the experimental setup. I found this figure unclear given that the dataset had not yet been introduced.
    • As a result, the Compose function (lines 119 ff., right column) together with Table 1 lack intuition. Describing the synthetic 2D dataset earlier, with an example of different compositions of conditions (or possibly a text-to-image example in the introduction using different compose functions), would be helpful.
    • The introduction of the dataset and metrics in Section 6, after the design space exploration / ablation (Section 4) with quantitative results on this benchmark and Figures 2 and 3, also raises questions about the dataset and how accuracy is measured while reading the paper.
    • The description of the 2D synthetic dataset at the beginning of Section 6.1 lacks information about how the dataset is generated.

Theoretical Claims

There are no theoretical claims that require proofs.

Experimental Designs And Analyses

All experimental designs and analyses seem to be valid.

Supplementary Material

I reviewed the complete supplementary material, but did not check the pseudo-code algorithms 3 to 6 in detail.

Relation To Broader Scientific Literature

Given complex conditions such as text prompts, diffusion models are known to hallucinate samples not following the correct conditional distribution. If conditions can be decomposed into smaller conditions, prior work like Composable Diffusion [1] has shown that this compositionality can be effectively leveraged to generate samples with better alignment to complex prompts. This paper proposes a method orthogonal to prior work by introducing a rejection / resampling criterion to ensure that the final sample is positively correlated with the conditions.

Essential References Not Discussed

I am not aware of any essential missing references.

Other Strengths And Weaknesses

Strengths:

  • The paper is mostly well-written and easy to follow. Abstract, introduction, and related work provide a good motivation and introduction into the topic of compositional generation.
  • The method section includes formal derivations of the lift score approximation as well as the intuition behind the equations, which I found very helpful.
  • The qualitative results are convincing:
    • Once the 2D synthetic dataset and task is understood, the qualitative results illustrate the effect of CompLift and the different Compose functions well.
    • The pixel-wise scores for text-to-image generation clearly decompose the image according to the individual conditions.
  • The quantitative results show consistent improvements over the Composable Diffusion baseline.

Weaknesses:

  • As already indicated in review section "Methods And Evaluation Criteria", the structure of the paper w.r.t. the 2D synthetic dataset, figure 1, the different compose functions with table 1, and section 4 consisting of ablations before the dataset description is suboptimal and results in a lack of clarity.
    • And the dataset description itself also lacks information about how it was generated.
  • More lack of clarity:
    • In lines 154 ff., right column, the explanation for the small performance loss using noise sharing strategies in negation tasks is unclear to me.
    • I find the description of the cached CompLift version in Section 4.3. and Algorithm 2 difficult to understand, even though the idea itself is quite simple and intuitive.
    • The paper never introduces the abbreviations for the baselines from EBM, nor EBM itself.

Other Comments Or Suggestions

  • In line 194 f., right column, you reference section 4.3. in section 4.3. itself.
  • In line 317, left column, the $z$ should be a $z_t$, if I am not mistaken.
Author Response

Thank you for your valuable feedback. We address your concerns below.

Overstated improvement claim

We will modify the abstract to state the contribution accurately as "significantly improved the condition alignment for compositional generation".

Limited comparisons: no T=50 for vanilla CompLift; missing EBM baselines on CLEVR/text-to-image

We have conducted additional experiments to address both concerns:

  1. We compared against EBM+ULA [1] on text-to-image generation (ULA is the default text-to-image sampler in their repo).

  2. We ran vanilla CompLift with T=50 to enable fair comparison.

The consolidated results are in the table below. CompLift outperforms EBM+ULA across all model variants, and the cached version performs similarly to vanilla CompLift at the same T=50: CLIP scores are very close, while ImageReward scores are slightly lower for the cached version.

We wish to clarify that our main focus is demonstrating vertical improvement: how CompLift boosts the base method's performance. The horizontal comparison to other baselines serves as supporting evidence that this boost helps achieve state-of-the-art performance.

| Method | Animals CLIP ↑ | Animals IR ↑ | Object&Animal CLIP ↑ | Object&Animal IR ↑ | Objects CLIP ↑ | Objects IR ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Stable Diffusion 1.4 | 0.310 | -0.191 | 0.343 | 0.432 | 0.333 | -0.684 |
| SD 1.4 + EBM (ULA) | 0.311 | 0.026 | 0.342 | 0.387 | 0.344 | -0.380 |
| SD 1.4 + Cached CompLift | 0.319 | 0.128 | 0.356 | 0.990 | 0.344 | -0.131 |
| SD 1.4 + CompLift (T=50) | 0.320 | 0.241 | 0.355 | 0.987 | 0.344 | -0.154 |
| SD 1.4 + CompLift (T=200) | 0.322 | 0.293 | 0.358 | 1.093 | 0.347 | -0.050 |
| Stable Diffusion 2.1 | 0.330 | 0.532 | 0.354 | 0.924 | 0.342 | -0.112 |
| SD 2.1 + EBM (ULA) | 0.330 | 0.829 | 0.357 | 0.981 | 0.348 | 0.218 |
| SD 2.1 + Cached CompLift | 0.339 | 0.880 | 0.361 | 1.252 | 0.354 | 0.353 |
| SD 2.1 + CompLift (T=50) | 0.340 | 0.992 | 0.361 | 1.263 | 0.354 | 0.454 |
| SD 2.1 + CompLift (T=200) | 0.340 | 0.975 | 0.362 | 1.283 | 0.355 | 0.489 |
| Stable Diffusion XL | 0.338 | 1.025 | 0.363 | 1.621 | 0.359 | 0.662 |
| SD XL + EBM (ULA) | 0.335 | 0.913 | 0.362 | 1.676 | 0.361 | 0.872 |
| SD XL + Cached CompLift | 0.341 | 1.244 | 0.364 | 1.687 | 0.365 | 0.896 |
| SD XL + CompLift (T=50) | 0.342 | 1.222 | 0.364 | 1.700 | 0.365 | 0.842 |
| SD XL + CompLift (T=200) | 0.342 | 1.216 | 0.364 | 1.706 | 0.367 | 0.890 |

Structure issues

Thank you for the constructive feedback. We will try our best to make the concept more clear. In particular, we will make the following modifications:

  1. Self-inclusive caption - summarize the experimental setup in the caption of Figure 1, including the component distribution, the algebra, the data generation, and the training.
  2. Early introduction of the 2D dataset - briefly introduce the 2D synthetic dataset in Section 3, including descriptions of the data generation, the component distribution, the algebra, and the accuracy metric. We will add a sentence referring readers to Section 6 and Appendix D for more details.
  3. Intuitive explanation - add more explanation of the Compose function, using examples from the 2D dataset.

2D dataset missing generation details

We will add more text to Section 6.1 about the dataset generation. In short, the distributions follow the generation procedure in [1]: they are either Gaussian mixtures or uniform distributions. We sample 8000 data points randomly for each component distribution and train one diffusion model per distribution. We will include more parameters of the distributions in Appendix D.
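A minimal sketch of one such component distribution, assuming illustrative Gaussian-mixture parameters (the paper's actual parameters are in its Appendix D):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_component(n: int = 8000) -> np.ndarray:
    """Draw n points from a 2-mode Gaussian mixture in 2D (illustrative)."""
    means = np.array([[-0.5, 0.0], [0.5, 0.0]])
    idx = rng.integers(0, 2, size=n)  # pick a mixture component per point
    return means[idx] + 0.1 * rng.standard_normal((n, 2))

data = sample_component()  # one diffusion model is trained per component
```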

Unclear noise sharing performance loss in negation

Our hypothesis is that sharing the same noise introduces some bias into the estimation, causing CompLift to over-reject samples in a conservative manner. With more trials, the bias of the estimation is amplified, so more samples are over-rejected. After taking a deeper look at Figure 2, we also observe a similar shared-noise regression for Product and Mixture, though the regression is very slight for those two algebras. We will revise the explanation in the paper to make this hypothesis clearer.

Cached CompLift description unclear

We provide more details about the algorithm in Appendix C; Algorithms 5 and 6 are in pseudo-code style and might be easier to parse. We will add a sentence in Section 4.3 referring readers to Appendix C for more context. Please let us know if the issue remains; we will keep making the paper easier to read.

EBM-related abbreviations

Thanks. We'll add explanations in Section 6.1 for all abbreviations (EBM, ULA, U-HMC, MALA, HMC).

Typos

Thanks. We'll remove the self-reference and change $z$ to $z_t$ at line 317.

[1] https://arxiv.org/abs/2302.11552

Reviewer Comment

I appreciate the rebuttal from the authors that addresses all my concerns from my review. I do not have any follow-up questions.

Review (Rating: 3)

This work proposes CompLift, a resampling criterion based on the concept of lift scores, used to improve the compositional generation capabilities of pretrained diffusion models. CompLift approximates the lift scores with the diffusion model's noise estimation, without requiring any external reward modules to measure alignment with the given condition. The authors additionally propose a caching technique for CompLift, achieving a computationally efficient pipeline. Through evaluations on both simple synthetic generation and more complex text-to-image generation, the paper shows that CompLift leads to accurate compositional generation without additional training.

Questions For Authors

Crucial questions are included in the "Claims" section.

Claims And Evidence

The overall writing of the paper is well-structured, with a clear problem definition and a simple but effective solution. The idea of adopting the concept of lift scores for improving diffusion model's compositional generation is interesting.

However, the paper lacks reference and discussions on an important related work, as detailed below:

CompLift seems to resemble the closely related work CAS [1], in which the authors define a novel "condition alignment score" (CAS) as $\log p(x_0 \mid c) - \log p(x_0)$ (Fig. 3 of that paper). The main argument of CAS is that this term can effectively measure the alignment between the generated output $x_0$ and the given condition $c$, and that it can therefore be used as an alignment metric without the need for external modules. This claim is similar to the main contribution of this work.

In this regard, the proposed formulation of lift scores in Eq. (2)-(4) of this work seems quite similar to CAS, and the authors will need to provide a discussion of the differences between the two approaches in order to claim the novelty of CompLift.

[1] CAS: A Probability-Based Approach for Universal Condition Alignment Score, Hong et al., ICLR 2024

Methods And Evaluation Criteria

  1. For the justification of the novelty of CompLift, please refer to the section above.

  2. Evaluations of the effect of CompLift on text-to-image generation are not very convincing, as they do not include comparisons with the baselines. While the authors evaluate using the benchmark from Attend-and-Excite [1], I couldn't find comparisons with Attend-and-Excite itself. Since Attend-and-Excite (or its follow-ups) also does not require external modules for measuring alignment with the given condition, I believe it should be a valid baseline for comparison. Otherwise, as stated in the introduction, it would be nice if the authors showed that CompLift can indeed be applied together with such methods, yielding additional improvements.

[1] Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., SIGGRAPH 2023

Theoretical Claims

This paper doesn't provide theoretical claims or proofs, focusing instead on empirical evidence across numerous types of data.

Experimental Designs And Analyses

  1. The comparison of running times in Fig. 5 clearly shows the advantage of CompLift over the MCMC-based approaches in terms of efficiency.

  2. The idea of "counting the activated pixels" in Section 5.1 seems quite confusing. I'm curious whether this design choice can accommodate a variety of objects. For instance, some objects are likely to take up a large area of the image, while others might be generated at smaller sizes. If the same pixel-count threshold is applied for checking existence, can it handle both cases? How did you set the threshold $\tau$?

Supplementary Material

The results of the 2D toy experiments in Figs. 11-13 are quite intuitive and clearly show that CompLift has an advantage over Composable Diffusion. In addition, I'm also curious whether the same trend holds for the more recent work "Reduce, Reuse, Recycle" [1], which is based on MCMC.

[1] Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC

Relation To Broader Scientific Literature

As also mentioned in the paper, the idea of using the diffusion model itself as a source of a conditional reward function could be useful for inference-time scaling methods for diffusion models aiming for better condition alignment.

Essential References Not Discussed

A critical related work CAS [1] is missing, as mentioned in the "Claims" section.

[1] CAS: A Probability-Based Approach for Universal Condition Alignment Score, Hong et al., ICLR 2024

Other Strengths And Weaknesses

I agree with the fundamental goal of the paper that "diffusion models should be able to assess their own alignment to the given condition", and the method is quite simple and intuitive. However, for now it is hard to give a positive score, as the paper fails to address a critical previous work that proposed a similar claim and solution.

Other Comments Or Suggestions

Typo: page 8, line 408: ImageResizer -> ImageReward

Author Response

Thank you for your thoughtful and constructive feedback. We address your concerns as follows.

Relationship to CAS [1]

Thank you for pointing out this important related work, which we had previously overlooked. We will add a reference to this valuable work and incorporate a discussion of CAS in the "Related Work" section of the revised version. A summary of the relationship follows:

Our work can be seen as an extension of CAS: we investigate the potential of using CAS as a compositional criterion, decomposing the alignment requirements of a complex prompt into multiple acceptance criteria. To approximate CAS, we employ ELBO estimation to reduce computational cost, as an alternative to the Skilling-Hutchinson estimator used in the original CAS paper. It would be interesting to explore how Skilling-Hutchinson-based estimation performs in a compositional setting; it may yield higher accuracy at the expense of greater computational overhead, which we plan to investigate in future work.

Comparison with Attend-and-Excite [2] on text-to-image generation

Thank you for the great question. We conducted a new experiment using Attend-and-Excite [2], focusing on the additional improvement achieved by incorporating CompLift. We observed consistent performance gains with both SD 1.4 and SD 2.1. Note that SD XL is not included due to the lack of support in the original Attend-and-Excite code.

| Method | Animals CLIP ↑ | Animals IR ↑ | Object&Animal CLIP ↑ | Object&Animal IR ↑ | Objects CLIP ↑ | Objects IR ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| A&E (SD 1.4) | 0.330 | 0.831 | 0.357 | 1.339 | 0.357 | 0.815 |
| A&E (SD 1.4) + Cached CompLift | 0.338 | 1.156 | 0.361 | 1.469 | 0.362 | 0.934 |
| A&E (SD 1.4) + CompLift | 0.337 | 1.160 | 0.361 | 1.458 | 0.361 | 0.990 |
| A&E (SD 2.1) | 0.342 | 1.225 | 0.360 | 1.471 | 0.366 | 1.219 |
| A&E (SD 2.1) + Cached CompLift | 0.344 | 1.298 | 0.364 | 1.488 | 0.371 | 1.245 |
| A&E (SD 2.1) + CompLift | 0.346 | 1.337 | 0.365 | 1.516 | 0.370 | 1.246 |

The idea of "counting the activated pixels" in Section 5.1 seems quite confusing... How did you set the threshold $\tau$?

Thank you for raising this thoughtful concern. We agree that performance could be further improved by making $\tau$ an object-specific hyperparameter. For simplicity, we currently set $\tau$ as a uniform threshold.

We chose $\tau = 250$ as the median number of activated pixels across all images. We also experimented with the 25th and 75th percentiles and found that the median performed best in practice. A lower $\tau$ leads to less accurate rejection due to ELBO variance, while a higher $\tau$ increases the rejection rate.

We will include more details on the derivation of $\tau$ in the Appendix. Intuitively, $\tau = 250$ corresponds to ~1.5% of the total number of latent pixels in SDXL (128x128 latent space) and ~6.1% in SD 1.4/2.1 (64x64). We find this small threshold sufficient, since our focus is on identifying "missing object" issues: if an object is missing, it tends to result in almost no activated pixels. Additional discussion will be added to the Appendix.
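A sketch of the resulting acceptance test for one condition, assuming a precomputed per-pixel lift map (hypothetical helper name, not the paper's exact code):

```python
import torch

def accept_condition(lift_map: torch.Tensor, tau: int = 250) -> bool:
    """Accept a sample for a condition if at least tau latent pixels have a
    positive lift score; tau = 250 is the median activated-pixel count
    reported above."""
    return int((lift_map > 0).sum().item()) >= tau
```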

Does the same trend hold for the recent work "Reduce, Reuse, Recycle" [3] based on MCMC?

Thank you for this suggestion. We also observed a clear overall advantage of our method over MCMC-based methods like "Reduce, Reuse, Recycle" [3]. While certain MCMC variants such as U-HMC and MALA perform comparably in specific scenarios (e.g., the first test case in Product and the second in Mixture), they often generate samples outside the target distribution in other settings, unlike our method.

We will update Appendix D to include visualizations comparing results from the MCMC-based methods.

Typo: page 8, line 408: ImageResizer -> ImageReward

Thank you for catching this typo. We will correct it in the revised version.

[1] https://openreview.net/forum?id=E78OaH2s3f

[2] https://arxiv.org/abs/2301.13826

[3] https://arxiv.org/abs/2302.11552

Reviewer Comment

I appreciate the authors' rebuttal and their efforts in answering all the raised questions. However, I still have concerns regarding the core technical contribution of LiftScore over CAS, outlined below:

While the authors distinguish between conditional generation (in CAS) and compositional generation (in LiftScore), it is unclear whether these tasks are fundamentally independent. The tasks in LiftScore (e.g., text-to-image generation, the position task) could be framed as conditional generation tasks, which means they could also be addressed by CAS.

Could the authors clarify why LiftScore could have an advantage over CAS specifically in the compositional setting?

If the key difference between the two methods is the choice of the approximation, I am concerned whether this choice can be justified for the specific tasks.

Author Comment

Thank you for your acknowledgement of our rebuttal effort. We address your question as follows:

advantage of compositional criteria

We acknowledge that the compositional acceptance/rejection task can also be framed with a single criterion that works directly on the whole prompt, as addressed by CAS. To test how a CAS-like variant performs on prompts containing multiple objects, we conducted a new ablation study.

Here, the CAS variant means that we use the single criterion $\log p(z \mid c_{\text{compose}}) - \log p(z \mid \varnothing)$ as the latent lift score, replacing the criterion composed from multiple individual lift scores in CompLift. Note that this is a controlled experiment to check the advantage of compositional criteria; thus, we keep the same ELBO-based estimation method.

The table below shows the results. We observe only modest improvement when using the CAS variant. We hypothesize that the CAS variant might face a similar problem as the original diffusion model on multi-object prompts: the attention to the missing object is relatively weak in the attention layers. Similar discussions can be found in previous works such as Attend-and-Excite [1], where the diffusion model $\epsilon_\theta(x, c_{\text{compose}})$ sometimes has weak alignment with some condition $c_i$ in $c_{\text{compose}}$.

We will add the new table and the related discussion to the Appendix.

| Method | Animals CLIP ↑ | Animals IR ↑ | Object&Animal CLIP ↑ | Object&Animal IR ↑ | Objects CLIP ↑ | Objects IR ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| SD 1.4 | 0.310 | -0.191 | 0.343 | 0.432 | 0.333 | -0.684 |
| SD 1.4 + CAS Variant | 0.312 | -0.153 | 0.348 | 0.708 | 0.337 | -0.373 |
| SD 1.4 + CompLift | 0.322 | 0.292 | 0.358 | 1.094 | 0.347 | -0.050 |
| SD 2.1 | 0.330 | 0.532 | 0.354 | 0.924 | 0.342 | -0.112 |
| SD 2.1 + CAS Variant | 0.333 | 0.626 | 0.355 | 1.080 | 0.347 | 0.144 |
| SD 2.1 + CompLift | 0.340 | 0.975 | 0.362 | 1.283 | 0.355 | 0.489 |
| SD XL | 0.338 | 1.025 | 0.363 | 1.621 | 0.359 | 0.662 |
| SD XL + CAS Variant | 0.338 | 1.064 | 0.363 | 1.628 | 0.362 | 0.702 |
| SD XL + CompLift | 0.342 | 1.216 | 0.364 | 1.706 | 0.367 | 0.890 |

side note: why compositional in general?

One cause of the missing-object issue might be a training-inference mismatch: similar combinations of objects in $c_{\text{compose}}$ are rare in the training set. As more objects of interest are involved, this problem becomes more significant as the whole composed condition grows more complex (e.g., the CLEVR experiment). Similarly, sampling/criteria based solely on $\epsilon_\theta(x, c_{\text{compose}})$ might not be as reliable as approaches that incorporate information from the individual $\epsilon_\theta(x, c_i)$, as the table above shows.

Compositional generation is one approach to generalizing to more complex prompts. For example, the CLEVR model is trained with only one object position per prompt. With Composable Diffusion + CompLift, we can extend it to combinations of 5 object positions with high accuracy, even though such combinations are rarely seen in training.

Core contribution of our work

We would like to emphasize that our key contribution is a systematic exploration and application of LiftScore / CAS, specifically for compositional generation challenges. Our contribution is not the invention of LiftScore, since it is already an existing concept in data mining and is theoretically equivalent to the CAS concept.

Our work can indeed be viewed as a complementary extension of LiftScore / CAS into the compositional generation domain - much like how science builds upon previous discoveries, we too stand on the shoulders of giants.

The CAS paper provided valuable insights on condition alignment with a single condition, which we acknowledge. Our contribution extends this foundation by:

  1. Developing the mathematical framework to apply these scores to compositions of multiple conditions, including algebras like Product, Mixture, and Summation.
  2. Introducing novel engineering solutions (like caching and variance reduction) that make compositional evaluation practical.
  3. Systematic evaluation on 2D, CLEVR, and text-to-image datasets.

We hope this clarification addresses your concerns about the relationship between our work and CAS. Our intention is to contribute meaningful extensions to this line of research by adapting and enhancing these techniques specifically for compositional generation tasks.

Thank you again for your insightful feedback, which has helped us better articulate the positioning of our work. If our responses have addressed your concerns adequately, we would be sincerely grateful if you would consider raising your score accordingly as a recognition of our work and this rebuttal effort. Thank you once again for your time and support.

[1] https://arxiv.org/abs/2301.13826

Review (Rating: 3)

This paper proposes a training-free post-processing approach, CompLift, to select images with specified concepts from diffusion-model-generated candidates. The main idea is to use the lift score, which is equivalent to point-wise mutual information, to evaluate whether conditioning on $c$ reduces the uncertainty of variable $x$. As a post-processing approach, the performance of CompLift hinges on the generative model (i.e., the composable diffusion model) it is based on: if the composable diffusion model cannot generate accurate images at all, then CompLift cannot make any improvement. Experimental results show improved generation accuracy using the proposed approach.

Most of my concerns are addressed. I maintain the rating.

Questions For Authors

It is somewhat unclear whether the comparison with the Composable Diffusion Model is fair. Suppose the Composable Diffusion Model generates 5 images, one of them is accurate, and CompLift selects the accurate one using the lift score. How do we then determine whether the Composable Diffusion Model or CompLift is more accurate?

Claims And Evidence

The paper claims the method is "a novel resampling criterion using lift scores for compositional generation, requiring minimal computational overhead". This assertion seems somewhat overstated, and the term "minimal" is ambiguous without a clear criterion. While in some cases the cached strategy results in no additional computational overhead, this is not universally true. For text-to-image generation, when replacing $\epsilon$ with $\epsilon_\theta(z, c_{\text{compose}})$, an additional computational overhead of (n + 2) · T forward passes is involved.

Methods And Evaluation Criteria

The proposed method makes sense for the application.

Theoretical Claims

No theoretical claims are provided in the paper.

Experimental Designs And Analyses

The paper states in Fig. 5 that the overhead introduced by the cached CompLift is negligible relative to the Composable Diffusion baseline (Liu et al., 2022). However, this experiment only shows running time without an accompanying accuracy evaluation, so it is not clear whether accuracy is being traded for running time. It might be helpful to show both in a single figure.

Supplementary Material

Did not review the supplementary material.

Relation To Broader Scientific Literature

The proposed approach uses the lift score as a criterion to evaluate whether a concept appears in an image. The estimated lift score in Equation (4) is actually equivalent to the point-wise mutual information discussed in Equation (5) of [1]. The proposed approach can thus be seen as an application of the point-wise mutual information in [1].

[1] Kong, X., Liu, O., Li, H., Yogatama, D., and Steeg, G. V. Interpretable diffusion via information decomposition. arXiv preprint arXiv:2310.07972, 2023.

Essential References Not Discussed

N/A

Other Strengths And Weaknesses

Strengths

  1. The proposed approach is training-free and, thanks to its cache design, requires little to no additional computational resources.

  2. Extensive experiments are conducted and show improvements over baselines.

  3. The writing is smooth and easy to follow.

Weaknesses

  1. Involving no extra training or guidance at inference time is not only an advantage but also a limitation of the proposed model: it cannot correct the generated images, only select among them. As a result, the performance of CompLift largely hinges on the generative model (like Composable Diffusion) it builds upon; if Composable Diffusion does not generate accurate images, CompLift cannot select accurate images from the generated candidates.

  2. Though CompLift shows significant improvement over Composable Stable Diffusion on synthetic datasets, the improvement on real-world text-to-image generative models is trivial, as shown in Table 3. Again, this hinges on the performance of Composable Stable Diffusion, which has limited ability to generate accurate multi-object images.

  3. Though called CompLift, the proposed approach itself is not compositional, because it evaluates individual concepts separately; it does not evaluate the joint appearance of all concepts. For text-to-image generation, only the AND operation is considered, by evaluating individual concepts.

  4. For text-to-image generation, when replacing $\epsilon$ with $\epsilon_\theta(z, c_{\text{compose}})$, an additional computational overhead of (n + 2) · T forward passes is involved. This contradicts the claimed minimal computational overhead and should be discussed to avoid overclaiming.

Other Comments Or Suggestions

N/A

Author Response

Thank you for your valuable feedback and questions. We address your concerns below:

On CompLift's dependence on underlying generative model

We agree and will add this theoretical limitation to our Conclusion. While theoretically CompLift cannot improve if the base method produces no accurate images, in practice even weak generators often improve with ≤5 candidate images.

On "minimal computational overhead" claim

We'll remove the ambiguous term "minimal" and claim only "requiring no additional training". For text-to-image generation, the (n + 2) · T additional forward passes can be parallelized to reduce latency. Currently, generation takes ~15s and lift score calculation ~30s on a 4090 GPU, with GPU memory as the bottleneck. Ideally, given enough GPU memory, we could parallelize down to the latency of a single forward pass.

On cached CompLift's accuracy-speed tradeoff

Every column in Fig. 5 has a corresponding accuracy row in Table 2 (Cached CompLift uses T=50 by default); we'll add a footnote clarifying this. In practice, we observe a small accuracy regression on the Mixture and Negation tasks when switching from vanilla CompLift to Cached CompLift. However, the accuracy remains significantly higher than the other baselines. The tradeoff exists, but it seems mild and acceptable given the substantial speed improvement.

On lift score equivalence to point-wise mutual information [1]

We'll update the Related Work section to reflect this equivalence. Our paper applies point-wise mutual information (PMI) as an acceptance/rejection criterion, focusing on missing-object cases and composing PMI across multiple objects.

On real-world text-to-image improvement

The seemingly trivial CLIP improvement is due to the low magnitude of CLIP scores. Here, we provide another perspective for interpreting the numbers. We compare the CompLift selector to the perfect best-of-n selector, which has direct access to the metric function. The percentage gain is calculated as (CompLift metric - baseline metric) / (perfect selector metric - baseline metric); see the sketch after the table below. On average, the gains are ~40% for vanilla CompLift and ~30% for cached CompLift. We will add more explanation to the Appendix.

| Method | Animals CLIP gain% ↑ | Animals IR gain% ↑ | Object&Animal CLIP gain% ↑ | Object&Animal IR gain% ↑ | Objects CLIP gain% ↑ | Objects IR gain% ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| SD 1.4 + Cached CompLift | 31.58 | 30.58 | 42.62 | 62.74 | 37.04 | 54.99 |
| SD 1.4 + CompLift | 42.11 | 46.40 | 49.18 | 74.32 | 47.14 | 63.05 |
| SD 2.1 + Cached CompLift | 35.71 | 44.13 | 28.34 | 55.07 | 37.97 | 57.82 |
| SD 2.1 + CompLift | 39.68 | 56.18 | 32.39 | 60.28 | 41.14 | 74.73 |
| SD XL + Cached CompLift | 16.95 | 55.88 | 26.13 | 46.15 | 24.49 | 46.06 |
| SD XL + CompLift | 22.60 | 48.74 | 26.13 | 59.44 | 32.65 | 44.88 |
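A minimal sketch of the gain computation. The perfect-selector score is not reported in the table, so the value below is back-solved from the reported 42.11% gain for SD 1.4 on Animals, purely as an illustrative assumption:

```python
def relative_gain(method: float, baseline: float, perfect: float) -> float:
    """Fraction of the best-of-n headroom recovered by the selector, in %."""
    return 100.0 * (method - baseline) / (perfect - baseline)

# SD 1.4, Animals, CLIP: baseline 0.310, CompLift 0.322; a 42.11% gain
# implies a perfect best-of-n CLIP score of about 0.3385.
print(relative_gain(0.322, 0.310, 0.3385))  # ~42.1
```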

On CompLift's compositionality

Our approach is to (1) evaluate individual concepts separately, and (2) compose an acceptance/rejection criterion from these multiple individual criteria, as in Algorithm 3. Thus, the CompLift criterion is compositional from our perspective; see the sketch below. We wish to mention Composable Diffusion [2] as an example that elucidates this perspective: essentially, that approach (1) computes an individual score for each concept, and (2) composes the final score from these multiple individual scores. Such a factorize-and-compose approach helps reduce the difficulty of complying with a complex prompt. As shown in our experiments, it improves performance on complex multi-object generation by evaluating individual conceptual alignment before making a composed decision.
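A sketch of this factorize-and-compose acceptance for the AND algebra, assuming precomputed per-condition lift maps (hypothetical structure mirroring Algorithm 3, not the paper's exact code):

```python
import torch

def complift_accept(lift_maps: dict[str, torch.Tensor], tau: int = 250) -> bool:
    """AND-composition: accept a sample only if every individual condition
    has at least tau activated (positive-lift) pixels (sketch)."""
    return all(int((m > 0).sum().item()) >= tau for m in lift_maps.values())

# e.g. complift_accept({"a black car": lift_car, "a white clock": lift_clock})
```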

On limited algebraic operations for text-to-image

We acknowledge this limitation. While we tested all algebras on the 2D dataset, we found no existing mature benchmark for OR/NOT algebras in text-to-image generation. We'll note this in our Conclusion as future work.

On fair comparison with Composable Diffusion

We recognize the challenge in comparing these approaches. Composable Diffusion generates candidates with no internal selection mechanism, while CompLift is a post-hoc filter. They serve different purposes and are not mutually exclusive: CompLift can enhance generation models by leveraging semantic alignment to filter results. Our main goal in the experiments is to show such an enhancement of, rather than a direct replacement for, other baselines such as Composable Diffusion.

[1] https://arxiv.org/abs/2310.07972

[2] https://arxiv.org/abs/2302.11552

Final Decision

The reviewers find the paper well written, and the evaluation demonstrates that the proposed method consistently outperforms the baseline, particularly on the synthetic dataset. However, as noted by Reviewer kyUu, the literature review is incomplete, with a highly relevant work omitted, which impacts the novelty and contribution of the paper. Additionally, the reviewers find the evaluation on real-world data less convincing. These concerns were addressed in the rebuttal, leading to improved overall recommendations. The Area Chair agrees with the reviewers and considers the paper above the acceptance threshold, provided the authors appropriately incorporate the content and clarifications presented in the rebuttal into the final revision.