Increasing the Utility of Synthetic Images through Chamfer Guidance
Abstract
Reviews and Discussion
This paper uses the average closest-point distance between two sets (Chamfer distance) to guide the diffusion process. It leverages a set of previously generated images and a set of real images to enhance both fidelity and diversity. The authors present experiments on geo-diversity benchmarks using LDM diffusion models (1.5M and 3.5M parameters).
Strengths and Weaknesses
Strengths:
- The paper tackles an important question in the literature.
- A key strength of this work compared to c-VSG is the reduced computational complexity of using the Chamfer distance.
Weaknesses:
- The primary concern is the limited novelty of the proposed approach.
- The method assumes access to real datasets, which may not be feasible in many real-world settings or for natural prompts.
- It also requires that the generated image set corresponds exactly to the same prompt, which limits flexibility and scalability.
- The paper misses important baselines such as Particle Guidance [1] and Interval Guidance [2], which are directly relevant for promoting diversity and distributional alignment. This exclusion narrows the scope of the evaluation.
[1] Corso et al. "Particle Guidance: non-I.I.D. Diverse Sampling with Diffusion Models", In ICLR 2024
[2] Kynkäänniemi et al, "Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models", In NeurIPS 2024
Questions
- Why is Chamfer distance preferred over more distribution-aware metrics such as FID or KID? A brief justification would clarify the rationale behind this design choice.
- How do diversity metrics (e.g., Density/Coverage, Vendi Score, CLIPScore) behave across different classifier-free guidance (CFG) scales? Specifically, does the proposed method demonstrate improved diversity at higher CFG values compared to existing baselines?
- Can the method improve diversity in the absence of a reference dataset, as in open-ended text-to-image generation tasks using models like FLUX or SDXL?
- Is the method applicable across different prompts? In other words, can it generalize to multi-prompt scenarios while enhancing diversity?
- Does the method scale to large diffusion models (e.g., SDXL), and does it yield measurable improvements in diversity both quantitatively and qualitatively?
Limitations
yes
Final Justification
Below are my remaining concerns:
Access to real data, few-shot adaptation, and personalization:
I believe that access to real data, especially for text-conditioned models, is difficult to obtain and represents a special case, as mentioned by Reviewer tTXY: "scenarios where one truly has no real data, Chamfer Guidance cannot be applied". The authors have justified this setting with few-shot adaptation and personalization. To strengthen their argument, I believe the manuscript should include experiments and evidence demonstrating that the method performs well on these tasks, which is currently lacking.
Extending our method to text-to-image is a promising future direction
I respectfully believe that this is essential for the submission and should be included in the manuscript rather than relegated to a future direction.
Early experiment using FAISS for fast retrieval
The results look promising, and extending them could definitely help the manuscript.
Multi-prompt applications would be an interesting future direction
The authors frame multi-prompt generation as "an interesting future direction." However, I believe this is necessary for the generalizability and applicability of their method. The authors mentioned that their method "would suit this task because our guidance operates in the feature space of the exemplars," which is not obvious to me and requires experimental backing.
Overhead of feature extraction and reliance on feature extractors:
As noted by Reviewer tTXY, I also agree that reliance on these feature extractors incurs considerable overhead in terms of GPU usage and may introduce biases.
In summary, most of my concerns remain unresolved, and I believe there is room for improvement.
Formatting Issues
No issue
We would like to clarify the models used in our experiments:
- LDM1.5 refers to Stable Diffusion 1.5, a widely-used, U-Net-based implementation with 860 million parameters.
- LDM3.5M refers to Stable Diffusion 3.5 Medium, a flow-based Multimodal Diffusion Transformer with 2.5 billion parameters, representing the state of the art among openly available generative models.
We thank the reviewer for finding that our work tackles an important question in the literature, and for noting that a key strength compared to c-VSG is the reduced computational complexity of using the Chamfer distance.
[W1] Novelty
We believe our work presents a significant contribution, as highlighted by other reviewers (a novel training-free guidance method that effectively enhances synthetic image utility, and that the idea of using Chamfer distance as guidance reward is conceptually elegant).
Our primary contribution is a simple, effective, and computationally efficient In-Context Learning (ICL) framework for generative models. While ICL has transformed language modeling, its application to steer generative vision models is significantly under-explored. We are among the first to propose a practical method for this purpose.
The novelty lies in creatively adapting the Chamfer distance, traditionally used for point clouds, as a guidance signal within a diffusion model feature space. This introduces a new way to control generation based on few-shot examples.
We demonstrate state-of-the-art results in the downstream usefulness of generated synthetic data with minimal data. These constitute a substantial contribution.
[W2] Access to real data
We argue that our work addresses a well-established and increasingly critical problem in generative modeling.
The problem of adapting a generative model to a target distribution given a few reference examples is a significant area of research, which is foundational to several applications:
- Few-shot Adaptation: This is a common setting for tasks like data synthesis for imbalanced classification, as explored in prior work [1, 2], which we also include as baselines.
- Personalization: The goal of personalizing large text-to-image models for subject-driven generation, as seen in impactful methods such as Textual Inversion [3], DreamBooth [4], and SVDiff [5].
More broadly, our work is an instance of In-Context Learning (ICL) for diffusion models. ICL is a cornerstone of modern LLMs [6], and its application to generative vision models is a critical emerging direction [7].
We are among the first to successfully adapt large diffusion models via ICL without requiring finetuning.
[W3,Q3] Scalability and extension to text-to-image
Our work focuses on object-centric and class-conditional generation, aiming to improve the downstream utility of synthetic data (e.g., for classification and geographic representation). This is a well-established research area, distinct from open-ended text-to-image synthesis, which lies outside our scope. Our contribution is a novel guidance mechanism that excels within this paradigm.
However, extending our method to text-to-image is a promising future direction. One approach could be: Step 1. Offline retrieval database: Embed a large text-image dataset into semantic vectors based on their captions. Step 2. Inference-time Retrieval: For a user's text query, retrieve the top-k relevant image-text pairs and use their images as exemplars for our guidance.
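To make these two steps concrete, below is a minimal retrieval sketch. It is an illustration under our assumptions (cosine similarity over a flat FAISS index, caption embeddings produced upstream by a CLIP text encoder), not the exact implementation used in our experiment.

```python
# Illustrative sketch of Steps 1-2 above, not the implementation used in our experiment.
# Caption embeddings are assumed to come from a CLIP text encoder; retrieved ids map
# back to the paired training images, which then serve as exemplars for Chamfer Guidance.
import numpy as np
import faiss

def build_caption_index(caption_embeddings: np.ndarray) -> faiss.IndexFlatIP:
    """Step 1: index L2-normalized caption embeddings for cosine-similarity search."""
    embs = caption_embeddings / np.linalg.norm(caption_embeddings, axis=1, keepdims=True)
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(embs.astype(np.float32))
    return index

def retrieve_exemplar_ids(index: faiss.IndexFlatIP, query_embedding: np.ndarray, k: int = 2) -> np.ndarray:
    """Step 2: return the ids of the top-k training captions closest to the user's query."""
    q = query_embedding / np.linalg.norm(query_embedding)
    _, ids = index.search(q.astype(np.float32)[None, :], k)
    return ids[0]
```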
We perform an early experiment by implementing this using COCO. For Step 1, we use the image-text pairs from the COCO training dataset; we embed the captions with CLIP and use FAISS for fast retrieval. For Step 2, we subsample 5000 captions from the validation set and retrieve the top-k closest images to serve as exemplars for the Chamfer distance. In this setting, we compare LDM1.5 with default sampling against our Chamfer Guidance. Results are reported in the table below; our method brings substantial improvement in this setting as well.
| Method | F1 | Precision | Coverage | Density | Recall | FDD | FID |
|---|---|---|---|---|---|---|---|
| LDM1.5 | 0.785 | 0.851 | 0.729 | 0.716 | 0.742 | 305.67 | 28.69 |
| Ours (k=2) | 0.937 | 0.937 | 0.938 | 1.294 | 0.726 | 202.65 | 21.96 |
[W4] Additional baselines
We initially omitted them as they address orthogonal research goals. Those methods increase ungrounded diversity without specific targets, similarly to CADS, which we employed as a baseline, while our work focuses on grounded generation.
We have implemented these methods. As shown below, while they may increase ungrounded diversity (recall), they don't perform well on our grounded diversity task requiring fidelity to exemplars.
| Method | F1 | Precision | Coverage | Density | Recall | FDD | FID |
|---|---|---|---|---|---|---|---|
| CADS | 0.718 | 0.850 | 0.621 | 0.743 | 0.546 | 217.96 | 13.434 |
| Limited Interval | 0.708 | 0.837 | 0.613 | 0.686 | 0.631 | 219.17 | 11.405 |
| Particle Guidance | 0.719 | 0.846 | 0.625 | 0.744 | 0.544 | 222.26 | 14.52 |
| Chamfer Guidance (ours) k=2 | 0.886 | 0.947 | 0.833 | 1.108 | 0.480 | 156.18 | 13.67 |
| Chamfer Guidance (ours) k=32 | 0.931 | 0.950 | 0.912 | 1.213 | 0.649 | 113.30 | 8.94 |
[Q1] Why Chamfer distance
Our decision was based on the combination of theoretical and practical advantages that the Chamfer distance offers for our specific task.
Chamfer distance is a well-established and principled metric for comparing point clouds, which is how we represent features in the embedding space. Its two components naturally map to the desirable properties of fidelity and diversity in generative modeling. It is also computationally efficient, as it relies on a simple 1-Nearest Neighbor (1-NN) search. This efficiency is critical for its usage as a guidance signal in the diffusion sampling loop.
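For concreteness, here is a minimal sketch of the two-sided Chamfer distance in feature space. It assumes features (e.g., from DINOv2) are extracted upstream; the exact weighting and normalization in our implementation may differ.

```python
# Minimal sketch of the Chamfer distance between generated and exemplar features.
# Assumes both sets are already embedded (e.g., with DINOv2); weighting and
# normalization details may differ from the actual implementation.
import torch

def chamfer_distance(gen_feats: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
    """gen_feats: (N, D) generated-image features; ref_feats: (K, D) exemplar features."""
    dists = torch.cdist(gen_feats, ref_feats)      # (N, K) pairwise distances
    fidelity = dists.min(dim=1).values.mean()      # each generation stays close to some exemplar
    coverage = dists.min(dim=0).values.mean()      # each exemplar is covered by some generation
    return fidelity + coverage                     # differentiable w.r.t. gen_feats
```

The two 1-NN terms map directly to the fidelity and diversity components discussed above.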
We also considered other candidate optimization targets but found them unsuitable:
- Fréchet Inception Distance (FID): It was not feasible. It requires large distributions of both real and generated data to be meaningful. Moreover, its computational cost would make it prohibitive to optimize at each sampling step.
- Optimizing Recall/Coverage: These metrics also require access to the full data distributions for accurate estimation. More importantly, because they involve counting, they are non-differentiable, posing a significant optimization challenge.
[Q2] Different CFG scales
This analysis was included in our original submission (Table 4 and Table 10 in Appendix). Results demonstrate our method's improvements are consistent across CFG scales. For the reviewer's convenience, we are reporting these key results below.
| ω | k | F1 | Precision | Coverage | Density | Recall | FDD | FID |
|---|---|---|---|---|---|---|---|---|
| 1.0 | -- | 0.507 | 0.723 | 0.391 | 0.551 | 0.656 | 431.2 | 31.3 |
| 1.0 | 2 | 0.849 | 0.890 | 0.811 | 0.904 | 0.736 | 150.748 | 13.217 |
| 1.0 | 32 | 0.899 | 0.923 | 0.876 | 1.086 | 0.735 | 117.834 | 9.759 |
| 2.0 | -- | 0.673 | 0.802 | 0.580 | 0.648 | 0.684 | 226.2 | 10.8 |
| 2.0 | 2 | 0.881 | 0.932 | 0.835 | 1.051 | 0.637 | 124.191 | 8.840 |
| 2.0 | 32 | 0.931 | 0.950 | 0.912 | 1.213 | 0.649 | 113.301 | 8.935 |
| 7.5 | -- | 0.709 | 0.862 | 0.603 | 0.775 | 0.415 | 248.7 | 16.1 |
| 7.5 | 2 | 0.886 | 0.947 | 0.833 | 1.108 | 0.480 | 156.179 | 13.670 |
| 7.5 | 32 | 0.925 | 0.957 | 0.894 | 1.238 | 0.498 | 153.111 | 14.388 |
[Q4] Multi-prompt scenario
While our current focus is on object-centric generation, tackling text-to-image generation, for which we showed preliminary results in [Q3], and in particular multi-prompt applications, would be an interesting future direction. We interpret "multi-prompt" as using a single set of visual exemplars to guide generation for several different text prompts.
Our approach would suit this task because our guidance operates in the feature space of the exemplars. This effectively decouples the visual concept (the exemplars) from the semantic context (the prompt).
Our guidance would ensure the generated concept is consistent with the exemplars, while the context and style are dictated by the specific prompt. Evaluating such a use case would require suitable data and parameter tuning.
[Q5] Large diffusion models
Our experiments include LDM3.5M, the 2.5 billion parameter Stable Diffusion 3.5 Medium model. This is a state-of-the-art Multimodal Diffusion Transformer, competitive with other large models like SDXL and FLUX. Thus, our method has been validated on current, large-scale diffusion models.
[1] Hemmat, R. A., et al. Feedback-guided Data Synthesis for Imbalanced Classification. TMLR 2024
[2] Hemmat, R. A., et al. Improving geo-diversity of generated images with contextualized vendi score guidance. ECCV 2024
[3] Gal, R., et al. An image is worth one word: Personalizing text-to-image generation using textual inversion. ICLR 2023
[4] Ruiz, N., et al. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. CVPR 2023
[5] Han, L., et al. SVDiff: Compact parameter space for diffusion fine-tuning. CVPR 2023
[6] Dong, Q., et al. A Survey on In-context Learning. EMNLP 2024
[7] Wang, Z., et al. In-context learning unlocked for diffusion models. NeurIPS 2023
I thank the authors for their responses. Below are my remaining concerns:
Access to real data, few-shot adaptation, and personalization:
I believe that access to real data, especially for text-conditioned models, is difficult to obtain and represents a special case, as mentioned by Reviewer tTXY: "scenarios where one truly has no real data, Chamfer Guidance cannot be applied". The authors have justified this setting with few-shot adaptation and personalization. To strengthen their argument, I believe the manuscript should include experiments and evidence demonstrating that the method performs well on these tasks, which is currently lacking.
Extending our method to text-to-image is a promising future direction
I respectfully believe that this is essential for the submission and should be included in the manuscript rather than relegated to a future direction.
Early experiment using FAISS for fast retrieval
The results look promising, and extending them could definitely help the manuscript.
Multi-prompt applications would be an interesting future direction
The authors frame multi-prompt generation as "an interesting future direction." However, I believe this is necessary for the generalizability and applicability of their method. The authors mentioned that their method "would suit this task because our guidance operates in the feature space of the exemplars," which is not obvious to me and requires experimental backing.
Overhead of feature extraction and reliance on feature extractors:
As noted by Reviewer tTXY, I also agree that reliance on these feature extractors incurs considerable overhead in terms of GPU usage and may introduce biases.
In summary, most of my concerns remain unresolved, and I maintain my initial evaluation of the manuscript.
We thank Reviewer wmJJ for their thoughtful comments, which give us the opportunity to elaborate on key aspects of our work. We are glad to see that we fully addressed 2 out of the initial 4 weaknesses and 4 out of 5 questions. We will clarify the remaining concerns in the current response.
Data access and adaptation.
Firstly, on the topic of zero-shot extension, we would like to highlight that our clarification was found to be satisfactory by Reviewer tTXY.
The goal of our manuscript is to provide a methodology to adapt the distribution of a pre-trained image generative model in a training-free fashion by leveraging few-shot image examples. To show the effectiveness of our approach, we benchmark it on 3 datasets against several existing few-shot adaptation techniques, i.e., FG CLIP [1], c-VSG [2], and Textual Inversion [3]. In the manuscript, we report results using these models in Table 1 for the object-centric ImageNet-1k scenario, and in Table 2 for the Geographic Diversity. In these settings, we characterize utility as the model's ability to generate a distribution given few shots from the distribution itself. In addition to that, we also included image classification results showing that the use of Chamfer guidance in combination with two strong image generative models leads to improved classification performance on ImageNet-100 in Table 3.
These settings have different goals than personalization, where the goal is to match the appearance or style of a specific object.
To further strengthen the validation, we have extended our few-shot downstream experiments on ImageNet-100 to ImageNet-1k and out-of-distribution variants (IN-v2, IN-Sk, IN-R, and IN-A), and we are including these results below. Results show that training classifiers on limited real data is ineffective (Table a).
When training classifiers on synthetic data only, Chamfer guidance boosts the accuracy by up to 15.83 points over vanilla LDM generation (Tables b and c). When training classifiers on a combination of real and synthetic data, Chamfer guidance reaches 63.81% accuracy (+4 accuracy points over vanilla sampling) using just 32 real images per class. Moreover, we observe that Chamfer guidance reduces the performance gap on the task of image classification between samples coming from LDM 1.5 and LDM 3.5M, making newer models competitive sources of synthetic data. For out-of-distribution (OOD) generalization, models trained with our Chamfer-guided data consistently outperform models trained on real data alone (given the same number of real images). Notably, on one dataset, ImageNet-Sketch, our method on LDM 3.5M surpasses a classifier trained on the full 1.3M dataset.
These additional results demonstrate that our method is very effective in few-shot adapting image generative models to produce high-utility synthetic data.
a) Real data only.
| # of real images | IN1k | IN-v2 | IN-Sk | IN-R | IN-A |
|---|---|---|---|---|---|
| 2k | 5.01 | 3.94 | 0.63 | 0.89 | 0.21 |
| 32k | 34.05 | 25.38 | 4.17 | 5.04 | 0.53 |
| 1.3M | 82.60 | 70.90 | 32.50 | 44.60 | 29.40 |
b) LDM 1.5
| # of real images | # of synthetic images | Guidance | IN1k | IN-v2 | IN-Sk | IN-R | IN-A |
|---|---|---|---|---|---|---|---|
| 0 | 1.3M | ω = 2 | 47.67 | 40.33 | 20.49 | 17.49 | 1.45 |
| 0 | 1.3M | Chamfer k=2 | 52.88 | 45.37 | 28.07 | 19.60 | 1.71 |
| 0 | 1.3M | Chamfer k=32 | 54.91 | 46.43 | 28.08 | 19.78 | 5.11 |
| 2k | 1.3M | ω = 2 | 48.47 | 41.07 | 21.21 | 16.96 | 1.57 |
| 2k | 1.3M | Chamfer k=2 | 53.57 | 46.42 | 29.48 | 21.25 | 1.65 |
| 32k | 1.3M | ω = 2 | 59.07 | 49.77 | 25.04 | 20.10 | 2.44 |
| 32k | 1.3M | Chamfer k=32 | 63.81 | 53.84 | 32.34 | 22.40 | 2.72 |
c) LDM 3.5M
| # of real images | # of synthetic images | Guidance | IN1k | IN-v2 | IN-Sk | IN-R | IN-A |
|---|---|---|---|---|---|---|---|
| 0 | 1.3M | ω = 2 | 37.83 | 34.07 | 17.60 | 11.53 | 0.88 |
| 0 | 1.3M | Chamfer k=2 | 52.14 | 44.27 | 33.47 | 20.26 | 1.93 |
| 0 | 1.3M | Chamfer k=32 | 53.66 | 45.46 | 34.44 | 20.67 | 5.28 |
| 2k | 1.3M | ω = 2 | 40.89 | 31.72 | 20.03 | 12.31 | 1.32 |
| 2k | 1.3M | Chamfer k=2 | 52.95 | 45.26 | 33.52 | 20.51 | 1.99 |
| 32k | 1.3M | ω = 2 | 55.65 | 45.65 | 21.64 | 14.97 | 1.54 |
| 32k | 1.3M | Chamfer k=32 | 62.61 | 52.58 | 34.49 | 21.85 | 2.36 |
[1] Hemmat, R. A., Pezeshki, M., Bordes, F., Drozdzal, M., & Romero-Soriano, A. Feedback-guided Data Synthesis for Imbalanced Classification. TMLR 2024
[2] Hemmat, R. A., Hall, M., Sun, A., Ross, C., Drozdzal, M., & Romero-Soriano, A. Improving geo-diversity of generated images with contextualized vendi score guidance. ECCV 2024
[3] Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., & Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion. ICLR 2023
Dear Reviewer wmJJ,
I notice that the authors have submitted a rebuttal. Could you please let me know if the rebuttal addresses your concerns? Your engagement is crucial to this work.
Thanks for your contribution to our community.
Your AC
Dear Reviewer wmJJ,
We are following up on our rebuttal for the paper, as the discussion period ends this week. We would appreciate it if you could take a moment to look at our response.
To address your feedback, we provided:
- A clarification on the recent, state-of-the-art models used in our work.
- A pilot study on extending Chamfer guidance to the text-to-image setting.
- Additional baselines you requested for comparison.
- A summary of our parameter ablation, which was already in the paper but is now highlighted for your convenience.
We are looking forward to hearing your thoughts on these points and are happy to clarify anything that remains unclear.
Submission 20967 Authors
Extensibility to Text-to-Image Generation:
We appreciate that the reviewer found our initial text-to-image results on COCO to be promising. Our primary objective for the rebuttal was to empirically demonstrate the feasibility and potential of extending our framework to this domain. We will include the current text-to-image results in the camera-ready Appendix with qualitative examples.
Multi-prompt applications
While we acknowledge the reviewer's point on multi-prompt applications, we respectfully disagree on its role as a key measure of generalizability and applicability. In our manuscript, we showed generalizability and applicability of our approach by reporting the following results: 1) object-centric representation, 2) mitigating representational biases in the "Geographic Diversity" setting, and 3) improving the downstream performance of a classification model trained on our generated data. In the rebuttal, we added results for COCO to prove the point that our method is applicable to more complex text conditionings, and provided further validation on the few-shot adaptation of the generative model in the context of increasing the utility of synthetic data. Chamfer guidance consistently outperformed prior art, proving its generalizability to multiple validation scenarios.
We would also like to clarify that our work already includes experiments that function as a multi-prompt evaluation. Specifically, our geographic diversity experiments on GeoDE and DollarStreet operate on this principle. In that setup (following cVSG), we use a fixed set of visual exemplars for a single object concept while varying the context within the text prompt (e.g., generating "a {object} in {region}"). This effectively tests the model's ability to apply a consistent visual concept across different textual contexts, which is the core challenge of the multi-prompt scenario the reviewer describes. The results, presented in Table 2 of our manuscript, demonstrate the success of our method in this task. Our Chamfer guidance achieves superior distributional coverage and better prompt adherence (measured by CLIPScore) when compared to all baselines, especially when using CLIP as the feature extractor.
Efficiency and feature extractor bias:
We wish to clarify the concerns regarding computational overhead and feature extractor bias.
Computational Efficiency: The overhead of our method is accounted for. On top of the efficiency analysis in Appendix A, we would like to highlight that our Chamfer Guidance is applied sparsely (once every 5 steps, as noted on L217), following the implementation of c-VSG. This design choice drastically reduces the computational load compared to applying guidance at every step, leading to a 31% reduction in FLOPs compared to CFG for LDM3.5M.
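To illustrate this design, the sketch below shows a generic loss-guided sampling loop in which the Chamfer gradient is computed only every few steps. It is an assumed structure, not our released code: `decode`, `extract_features`, and `chamfer_distance` are placeholders for the VAE decoder, the DINOv2 feature extractor, and the distance described earlier, and the sign/scaling of the guidance term follows common loss-guidance conventions, which may differ from our implementation.

```python
# Illustrative sketch (assumed structure, not the released code) of sparse loss guidance:
# the Chamfer gradient is computed only every `guidance_every` denoising steps.
# `decode`, `extract_features`, and `chamfer_distance` are placeholders.
import torch

def sample_with_sparse_guidance(model, scheduler, latents, ref_feats,
                                guidance_every: int = 5, gamma: float = 0.05):
    for i, t in enumerate(scheduler.timesteps):
        noise_pred = model(latents, t)                      # single conditional forward pass
        if i % guidance_every == 0:
            with torch.enable_grad():
                x = latents.detach().requires_grad_(True)
                gen_feats = extract_features(decode(x))     # placeholder feature pipeline
                loss = chamfer_distance(gen_feats, ref_feats)
                grad = torch.autograd.grad(loss, x)[0]
            noise_pred = noise_pred + gamma * grad          # steer the sample towards the exemplars
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```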
Bias: Our method improves representativeness of regions as our experiments on GeoDE and DollarStreet show (Tables 2, 5, 6, 7, and 8). We are consistently improving the worst-region performance across all metrics, and we achieve this with both DINOv2 and CLIP as feature extractors, thus reducing representational bias.
This paper proposes a novel inference-time method to guide image generation using a small set of real exemplar images. They compute the Chamfer distance between features of generated images and a few real reference images, and use the gradient of this distance as a guidance signal during the diffusion sampling process. Empirical results on ImageNet-1k and GeoDE/DollarStreet demonstrate strong performance.
Strengths and Weaknesses
[Strengths] The proposed method delivers a clear advance in generation utility, yielding simultaneously higher diversity and fidelity than strong baselines.
The idea of using Chamfer distance as a guidance reward is a conceptually elegant alternative to previous approaches like c-VSG that required maintaining a memory bank and carefully tuning two competing diversity metrics.
The paper validates the method on both object-class generation and geographic diversity, and uses extensive metrics; Precision, Recall, Density, Coverage, F1, FID, FDD, to assess both quality and diversity.
[Weaknesses] The paper does not deeply analyze sensitivity to which exemplars are used, so we don’t know how robust the method is to different choices or outliers in the exemplar set.
Given the stochastic nature of diffusion sampling, it’s unclear how robust the reported gains are across multiple runs or different random seeds.
The method is tailored to scenarios where a target distribution is available as reference, which is a slightly specialized use case in practice.
The proposed method adds overhead by requiring feature extraction and backpropagation through a vision transformer multiple times during sampling.
The approach fundamentally relies on pretrained feature extractors like DINO to measure quality and diversity, so it inherits the biases or limitations of those features.
In zero-shot generation scenarios or when one truly has no real data, Chamfer Guidance cannot be applied.
Questions
Why did the authors choose Chamfer distance specifically as the utility metric? What were the other choices they considered but decided not to pursue?
Did the authors study how results vary with different random selections of exemplars?
If the exemplars themselves are from a distribution somewhat shifted from the model’s training distribution, how robust is the guidance?
The results in Table 1 use a guidance scale that differs for LDM1.5 and LDM3.5M. How was this guidance weight selected?
Limitations
yes
Final Justification
The authors have sufficiently addressed my initial concerns in their rebuttal. I am updating my score accordingly.
Formatting Issues
The font in Figure 1 is quite small.
We thank the reviewer for highlighting that ours is a novel inference-time method to guide image generation using a small set of real exemplar images, that delivers a clear advance in generation utility, yielding simultaneously higher diversity and fidelity than strong baselines, and that the idea of using Chamfer distance as a guidance reward is a conceptually elegant alternative to previous approaches.
[W1,2 Q2] Sample selection and stochastic process
Regarding the sample selection, the reference samples were chosen at random from the dataset without a specific sampling strategy. This approach is standard and consistent with prior works like c-VSG [2]. While exploring advanced sampling techniques is interesting for future research, our focus is on the novel generation method itself. Our experiments confirm that even with random sampling, our method produces high-quality, useful images. To provide empirical evidence, we ran an additional experiment on the GeoDE dataset using two different random seeds to select exemplar images. As the table below demonstrates, the standard deviation across runs is negligible.
Concerning the stochastic nature of the diffusion process, we mitigate this through the scale of our experiments. By generating large numbers of images (e.g., 50,000 for ImageNet-1k), we ensure robust metrics that represent stable averages.
| k | F1 | Precision | Coverage | Density | Recall | FDD | FID |
|---|---|---|---|---|---|---|---|
| 2 | 0.4399 ± 0.0263 | 0.6782 ± 0.0285 | 0.3256 ± 0.0222 | 0.5017 ± 0.0676 | 0.3562 ± 0.0221 | 405.36 ± 9.49 | 19.65 ± 0.13 |
| 4 | 0.5421 ± 0.0189 | 0.8342 ± 0.0086 | 0.4016 ± 0.0191 | 0.9229 ± 0.0514 | 0.1490 ± 0.0057 | 362.08 ± 10.74 | 18.90 ± 0.31 |
| 8 | 0.6361 ± 0.0191 | 0.9065 ± 0.0095 | 0.4900 ± 0.0199 | 1.3736 ± 0.0279 | 0.0709 ± 0.0054 | 349.47 ± 7.61 | 18.76 ± 0.09 |
| 16 | 0.7598 ± 0.0099 | 0.9481 ± 0.0028 | 0.6339 ± 0.0126 | 2.0409 ± 0.0142 | 0.0592 ± 0.0064 | 326.16 ± 5.32 | 18.46 ± 0.08 |
[W3] Specialized use case
We respectfully disagree that our work is a "specialized use case." Our work addresses a well-established and critical problem in generative modeling.
The problem of adapting a generative model to a target distribution given few reference examples is a significant problem. This setting is foundational to:
- Few-shot Adaptation: Common for tasks like data synthesis for imbalanced classification [1, 2]
- Personalization: Personalizing large text-to-image models for subject-driven generation, as in Textual Inversion [3], DreamBooth [4], and SVDiff [5]
More broadly, our work is an instance of In-Context Learning (ICL) for diffusion models. ICL is a cornerstone of modern LLMs [6], and its application to generative vision models is a critical emerging direction [7].
Our primary contribution is an approach that is both effective and highly efficient. We are among the first to successfully adapt large diffusion models via ICL without requiring any model finetuning, achieving state-of-the-art results with few reference samples.
[W4] Computational overhead
Our Chamfer Guidance can be significantly more computationally efficient than standard sampling with Classifier-Free Guidance (CFG), as analyzed in Appendix A.
Standard CFG sampling requires a double forward pass through the model at each diffusion step, doubling the floating-point operations (FLOPs). When applied to conditional-only models (CFG scale = 1), our guidance mechanism is more efficient. With the LDM3.5M model, our method achieves up to a 30% reduction in FLOPs while reaching state-of-the-art results.
[W5] Feature extractor
For the “Geographic Diversity” setting (Table 2, and Table 6 and 7 in the Appendix), we explored DINOv2 and CLIP as feature spaces for our guidance. Both lead to state-of-the-art results over prior works, with DINOv2 leading to better results than CLIP.
We then chose DINOv2 also for the other settings because of its self-supervised learning pre-training, which learns rich, instance-level visual features more suitable for our guidance task than features from classification models (Inception) or image-text alignment models (CLIP).
We validated this choice empirically by testing DINOv2 and CLIP on the GeoDE dataset. Our findings confirm that DINOv2 provides superior performance, with CLIP bringing improvements w.r.t. baselines as well.
| Feature Extractor | k | F1 | Precision | Coverage | Density | Recall | FDD | FID |
|---|---|---|---|---|---|---|---|---|
| LDM 1.5 | -- | 0.2334 | 0.4363 | 0.1593 | 0.1731 | 0.6135 | 684.81 | 35.27 |
| LDM 1.5 | -- | 0.3277 | 0.5433 | 0.2346 | 0.2647 | 0.4859 | 524.59 | 24.04 |
| LDM 1.5 | -- | 0.2960 | 0.6222 | 0.1942 | 0.3629 | 0.2025 | 693.31 | 42.60 |
| DINOv2 | 2 | 0.4242 | 0.6354 | 0.3184 | 0.4027 | 0.4614 | 410.84 | 19.73 |
| DINOv2 | 4 | 0.5313 | 0.8296 | 0.3907 | 0.8933 | 0.1501 | 368.27 | 19.08 |
| DINOv2 | 8 | 0.6251 | 0.9010 | 0.4785 | 1.3575 | 0.0708 | 353.80 | 18.81 |
| DINOv2 | 16 | 0.7527 | 0.9461 | 0.6250 | 2.0309 | 0.0547 | 323.11 | 18.44 |
| CLIP | 2 | 0.3888 | 0.6367 | 0.2798 | 0.3987 | 0.3604 | 446.77 | 19.66 |
| CLIP | 4 | 0.4063 | 0.5951 | 0.3084 | 0.3501 | 0.4827 | 421.20 | 19.01 |
| CLIP | 8 | 0.4163 | 0.5974 | 0.3195 | 0.3567 | 0.5003 | 416.75 | 18.25 |
| CLIP | 16 | 0.4088 | 0.5757 | 0.3169 | 0.3180 | 0.5735 | 415.02 | 18.22 |
[W6] Zero-shot scenario
Our method is designed for a few-shot setting, where reference data guides generation, related to In-Context Learning (ICL). However, we consider extending it to a data-free pipeline an interesting direction for future work. We could envision a self-bootstrapping technique (a selection sketch follows the list below):
- Generate initial candidate images for a given class
- Automatically select a diverse subset maximizing "diameter" (coverage) in a robust feature space
- Use this synthetically-generated set as guidance exemplars
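The sketch below illustrates the selection step of this hypothetical pipeline as greedy farthest-point sampling in feature space. It is a sketch under our assumptions, not an implemented or validated component of the paper.

```python
# Hedged sketch of the selection step in the bootstrapping idea above: greedy
# farthest-point selection of a diverse exemplar subset from self-generated candidates,
# operating on pre-extracted features. Hypothetical, not part of the paper.
import torch

def select_diverse_subset(candidate_feats: torch.Tensor, k: int) -> list:
    """candidate_feats: (N, D) features of candidate generations; returns k indices."""
    dists = torch.cdist(candidate_feats, candidate_feats)    # (N, N) pairwise distances
    selected = [int(dists.sum(dim=1).argmax())]              # seed with the most spread-out point
    for _ in range(k - 1):
        min_dist = dists[:, selected].min(dim=1).values      # distance to the current subset
        min_dist[selected] = -1.0                            # exclude already-selected points
        selected.append(int(min_dist.argmax()))              # pick the farthest remaining point
    return selected
```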
[Q1] Chamfer distance choice
Our decision was based on theoretical and practical advantages. The Chamfer distance is well-established for comparing point clouds, which is how we decided to represent images, with components naturally mapping to fidelity and diversity in generative modeling. It is computationally efficient, relying on simple 1-NN search, critical for guidance in the diffusion sampling loop.
We considered other targets but found them unsuitable:
- FID: Not feasible as it requires the full distribution to be stable, and it is computationally prohibitive to backpropagate through.
- Recall/Coverage: They require full data distributions and are non-differentiable.
[Q3] Shifted distribution
Our geographic diversity study used GeoDE and DollarStreet datasets, known to be challenging with distributions shifted from generative model training data. Base models struggle with these specialized datasets [2, 8].
Our results demonstrate that Chamfer Guidance is robust and effective in this setting, successfully steering generation to align with few-shot exemplars from shifted domains, proving practical utility in real-world scenarios.
[Q4] Guidance scale
The CFG scale for each model was chosen by performing a sweep and selecting the value maximizing the F1 score on a validation set. For the large LDM3.5M model, the selected scale also helps speed up generation, given the size of the model.
[1] Hemmat, R. A., Pezeshki, M., Bordes, F., Drozdzal, M., & Romero-Soriano, A. Feedback-guided Data Synthesis for Imbalanced Classification. TMLR 2024
[2] Hemmat, R. A., Hall, M., Sun, A., Ross, C., Drozdzal, M., & Romero-Soriano, A. Improving geo-diversity of generated images with contextualized vendi score guidance. ECCV 2024
[3] Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., & Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion. ICLR 2023
[4] Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., & Aberman, K. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. CVPR 2023
[5] Han, L., Li, Y., Zhang, H., Milanfar, P., Metaxas, D., & Yang, F. Svdiff: Compact parameter space for diffusion fine-tuning. CVPR 2023
[6] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. A Survey on In-context Learning. EMNLP 2024
[7] Wang, Z., Jiang, Y., Lu, Y., He, P., Chen, W., Wang, Z., & Zhou, M. In-context learning unlocked for diffusion models. NeurIPS 2023
[8] Hall, M., Ross, C., Williams, A., Carion, N., Drozdzal, M., & Soriano, A. R. Dig in: Evaluating disparities in image generations with indicators for geographic diversity. TMLR 2024
Dear Reviewer tTXY,
With the discussion period winding down, we wanted to follow up on our rebuttal.
We believe we have addressed the points brought up in your review, and we would appreciate it if you could let us know whether our rebuttal has addressed your concerns. If you have any remaining questions or would like us to clarify anything further, we are ready and happy to do so.
Thank you for your time and valuable feedback.
Submission 20967 Authors
The authors have addressed my initial concerns in their rebuttal. I am updating my score accordingly.
The paper introduces Chamfer Guidance, a novel training-free method to improve the utility of synthetic images generated by conditional diffusion models. By leveraging a small set of real exemplar images, it uses the Chamfer distance to guide the generation process toward better fidelity and diversity with respect to real data distributions. The method achieves state-of-the-art performance across object-centric and geo-diversity benchmarks and significantly boosts downstream classifier performance using only synthetic training data. Additionally, Chamfer Guidance avoids the computational overhead of classifier-free guidance, offering clear FLOPs reduction while maintaining or improving image quality and diversity.
Strengths and Weaknesses
Strengths: The paper presents a novel training-free guidance method that effectively enhances the utility of synthetic images by balancing fidelity and diversity through Chamfer distance. It achieves state-of-the-art performance across both object-centric and geo-diversity benchmarks while significantly reducing computational cost compared to classifier-free guidance. The method scales well with the number of exemplar images and improves downstream classification accuracy using only synthetic data. Its theoretical motivation, practical utility, and clean experimental design make it a strong contribution to synthetic data generation.
Weaknesses:
- DINOv2 serves as the main embedding space, but the paper does not compare different feature extractors (e.g., CLIP, Inception) to quantify how sensitive Chamfer Guidance is to the underlying embedding space.
- No Human Evaluation: All evaluations rely on automated metrics, which may not fully capture perceptual quality or semantic coherence.
- There is limited discussion on the sensitivity of Chamfer Guidance strength and how tuning affects trade-offs between diversity and fidelity.
Questions
Include a comparative analysis of multiple embedding spaces, although DINOv2 is well-motivated due to its self-supervised learning and semantic alignment.
Including a small-scale user study or human preference ranking would help validate whether the improvements measured by Chamfer Guidance align with human judgment.
The paper includes ablations on CFG scale, but does not fully study the effect of the Chamfer guidance strength. A sensitivity analysis across a range of γ values could demonstrate whether the method needs careful tuning per task or dataset, or if it is relatively stable.
Limitations
Limitations are briefly discussed.
Final Justification
The rebuttal resolves most of my concerns. I upgrade my rating accordingly.
Formatting Issues
N/A
We thank the reviewer for finding our work to be a novel training-free guidance method that effectively enhances the utility of synthetic images, achieves state-of-the-art performance across both object-centric and geo-diversity benchmarks while significantly reducing computational cost, and whose theoretical motivation, practical utility, and clean experimental design make it a strong contribution to synthetic data generation.
[Q1] Embedding space
For the “Geographic Diversity” setting (Table 2, and Tables 6 and 7 in the Appendix), we explored DINOv2 and CLIP as feature spaces for our guidance, as DINOv2 and CLIP feature extractors correlate better with human judgment of similarity than Inceptionv3 [1]. Both led to state-of-the-art results over prior works, with DINOv2 leading to better results than CLIP. To further validate our choice of DINOv2, we conducted a preliminary empirical study comparing its performance as a feature extractor for Chamfer guidance against CLIP on the GeoDE dataset in an "object-centric" setting, e.g., with prompts like “a photo of a car”. These experiments were run using LDM1.5.
Our findings were as follows:
- DINOv2 consistently yielded the best performance, showing substantial gains in both diversity and fidelity. Similarity in DINOv2 space also correlates to human-perceived similarity slightly more than CLIP, as reported in [1].
- CLIP improved upon the baselines; in particular, coverage scales with the number of guiding samples. With higher k we observe reduced marginal improvements, which we attribute to the model tending to converge towards generating an average representation of the object. We hypothesize this is because CLIP's pre-training is "concept-centric" (aligning images to general text concepts), whereas DINOv2's is "instance-centric" due to its self-supervised training, making it better at preserving the unique features of a specific reference image.
We include a summary of these results in the table below. This analysis confirmed that DINOv2 was the most effective choice for our method, with CLIP being a viable alternative. We will include these results in the final version of the manuscript.
| Feature Extractor | k | F1 | Precision | Coverage | Density | Recall | FDD | FID |
|---|---|---|---|---|---|---|---|---|
| LDM 1.5 | -- | 0.2334 | 0.4363 | 0.1593 | 0.1731 | 0.6135 | 684.81 | 35.27 |
| LDM 1.5 | -- | 0.3277 | 0.5433 | 0.2346 | 0.2647 | 0.4859 | 524.59 | 24.04 |
| LDM 1.5 | -- | 0.2960 | 0.6222 | 0.1942 | 0.3629 | 0.2025 | 693.31 | 42.60 |
| DINOv2 | 2 | 0.4242 | 0.6354 | 0.3184 | 0.4027 | 0.4614 | 410.84 | 19.73 |
| DINOv2 | 4 | 0.5313 | 0.8296 | 0.3907 | 0.8933 | 0.1501 | 368.27 | 19.08 |
| DINOv2 | 8 | 0.6251 | 0.9010 | 0.4785 | 1.3575 | 0.0708 | 353.80 | 18.81 |
| DINOv2 | 16 | 0.7527 | 0.9461 | 0.6250 | 2.0309 | 0.0547 | 323.11 | 18.44 |
| CLIP | 2 | 0.3888 | 0.6367 | 0.2798 | 0.3987 | 0.3604 | 446.77 | 19.66 |
| CLIP | 4 | 0.4063 | 0.5951 | 0.3084 | 0.3501 | 0.4827 | 421.20 | 19.01 |
| CLIP | 8 | 0.4163 | 0.5974 | 0.3195 | 0.3567 | 0.5003 | 416.75 | 18.25 |
| CLIP | 16 | 0.4088 | 0.5757 | 0.3169 | 0.3180 | 0.5735 | 415.02 | 18.22 |
[Q2] User Study
While the primary focus of our study is the utility of the generated data for representation and as a training source for downstream tasks, which is confirmed by our extensive experimental evaluation with downstream classifier training, we agree that human evaluation provides valuable insights.
To that end, we conducted a small-scale user study to complement our quantitative findings. We collected 965 data points from more than twenty annotators. In this study, users were presented with samples generated from prompts based on the ImageNet dataset. Their task was to choose their preferred generation in a side-by-side comparison between images from the base LDM3.5M model and images generated with our Chamfer Guidance applied to the same model. Users were also presented with real images coming from the dataset to ground their evaluation in real-world quality and coherence.
The results showed a strong preference for our method: images generated with our Chamfer Guidance were preferred in 92% ± 2% of the cases. This suggests that the automatic evaluation of quantitative improvements in downstream utility also correlates with enhanced human-perceived quality and fidelity to the target concept distribution. We will include these results and details of the study in the revised manuscript.
[Q3] Guidance strength ablation
We would like to direct the reviewer to our detailed ablation study in Appendix E, where we analyze this point on the ImageNet-1k dataset with LDM1.5. Our results show that our method is robust across different values of γ. In fact, for all tested strengths, our method surpasses the LDM1.5 baseline in terms of F1 score, demonstrating its consistent effectiveness.
Our analysis reveals a trade-off between guidance strength and perceptual quality as measured by FID:
- Optimal Performance: The configuration with γ=0.07 and k=32 achieves the best overall results, yielding the highest F1 score (0.931) and precision (0.950). This indicates that stronger guidance produces samples with increased fidelity and diversity.
- Performance Trade-off: While stronger guidance boosts precision and coverage, it can slightly degrade the FID. For example, increasing γ from 0.05 (FID: 8.840) to 0.07 (FID: 13.670) shows this effect. The moderate guidance strength maintains a better balance between precision/coverage and overall image quality.
We hypothesize that this increase in FID with stronger guidance might be partially due to a metric mismatch. The FID metric is computed using features from the Inception network, which is optimized for classification. Our guidance, however, operates in the DINOv2 feature space, which is richer in instance-level semantics. This discrepancy may lead the Inception-based FID to penalize valid, high-fidelity generations that are perfectly aligned in the DINO space.
We report the table below.
| γ | k | F1 | Precision | Coverage | Density | Recall | FDD | FID |
|---|---|---|---|---|---|---|---|---|
| 0.02 | 2 | 0.837 | 0.896 | 0.786 | 0.910 | 0.702 | 145.390 | 9.268 |
| 0.02 | 32 | 0.872 | 0.912 | 0.835 | 0.994 | 0.699 | 138.917 | 9.301 |
| 0.05 | 2 | 0.881 | 0.932 | 0.835 | 1.051 | 0.637 | 124.191 | 8.840 |
| 0.05 | 32 | 0.914 | 0.946 | 0.884 | 1.162 | 0.650 | 114.847 | 8.906 |
| 0.07 | 2 | 0.886 | 0.947 | 0.833 | 1.108 | 0.480 | 156.179 | 13.670 |
| 0.07 | 32 | 0.931 | 0.950 | 0.912 | 1.213 | 0.649 | 113.301 | 8.935 |
[1] Hall M., et al. Towards Geographic Inclusion in the Evaluation of Text-to-Image Models. In ACM FAccT 2024.
Thanks for your explanation and clarification! I think the rebuttal resolves most of my concerns. I would upgrade my rating accordingly.
Dear Reviewer mRsF,
I notice that the authors have submitted a rebuttal. Could you please let me know if the rebuttal addresses your concerns? Your engagement is crucial to this work.
Thanks for your contribution to our community.
Your AC
Dear Reviewer mRsF,
We would like to kindly follow up on our rebuttal for the paper. The discussion period is ending this week, and we would appreciate your feedback.
In our rebuttal, we provided:
- a novel embedding space ablation on top of the comparison already present in the paper
- a user study on the quality of our Chamfer guidance
- an ablation of the guidance-strength hyperparameter γ, which was present in the paper and discussed again in the rebuttal.
We are happy to discuss these points further or clarify any other aspects of the paper.
Submission 20967 Authors
Dear reviewer mRsF,
Thank you for your response and for taking the time to read our rebuttal. We are glad that our explanation has addressed your concerns. We appreciate you upgrading your rating.
We would be grateful if you could let us know which specific points of concern remain, as we would be happy to discuss them further during this period. Please let us know if there is anything else we can clarify.
Submission 20967 Authors
The submission received divergent ratings: 2 positive and 1 negative. A mutual concern raised by Reviewer wmJJ and Reviewer tTXY is the applicability of the proposed method in the zero-shot setting. The authors provided a rough idea of extending the current method. While Reviewer tTXY is satisfied with this response, Reviewer wmJJ is not convinced, as there is no experiment verification. The ratings of Reviewer tTXY and Reviewer wmJJ are opposite, accept vs. weak reject. Reviewer mRsF kept the positive rating after the rebuttal.
The AC has checked the review, the rebuttal, the manuscript, and the discussion. The AC found that the focus of this paper is to advance the utility of synthetic data, which is underpinned by adequate evaluations. Extending this method to other domains/settings is worth exploring, but can be follow-up work. The AC decided to support this work on the condition that: i) the manuscript is revised based on the reviews; and ii) the limitations and possible extensions of the current method are elaborated in Sec. 5.
The final recommendation is acceptance. The decision has been discussed with and confirmed by the SAC.