PaperHub
7.3 / 10
Poster · 4 reviewers
Ratings: 4, 4, 5, 5 (min 4, max 5, std 0.5)
Confidence: 4.0
Novelty: 2.8 · Quality: 3.0 · Clarity: 2.8 · Significance: 2.8
NeurIPS 2025

Universal Few-shot Spatial Control for Diffusion Models

OpenReview · PDF
Submitted: 2025-05-09 · Updated: 2025-10-29
TL;DR

Few-shot spatial control for Text-to-Image Diffusion models by leveraging the analogy between query and support spatial conditions to construct task-specific control features.

Abstract

Keywords
diffusion models · spatial control image generation · few-shot learning

Reviews and Discussion

Review
Rating: 4

This paper works on universal few-shot spatially controlled image generation. The authors extend ControlNet to a few-shot generation model by using two additional encoders, for the layout images and the RGB images in the support set, respectively. To unify the control signal across different layout representations, the authors use a matching module so that an interpolation of the RGB image features serves as the control signal. During training, they begin with task-agnostic meta-learning, followed by finetuning a small set of task-specific parameters. The experiments show that the proposed method outperforms existing few-shot diffusion models.

Strengths and Weaknesses

Strengths:

  1. The problem studied in this paper is valuable and important. Developing a universal model for different types of control signals is important progress in this area.
  2. The matching module is a good method to unify different representations of label conditions.
  3. Lots of demonstrations are provided to show the advantage of the proposed model.

Weaknesses:

  1. The authors claim this work is the first method for few-shot spatial control. I think the connection between layout control and the model design is weak. Technically, this architecture should also be applicable to general image-to-image generation. I don’t see how it is motivated by layout control.
  2. The core of this method is the matching module. It’s very similar to Chameleon, which also uses interpolation of images from the support sets for few-shot image-to-image translation. Can the authors describe how the matching module differs from Chameleon’s design?
  3. Some details are not clear: (1) Is $\theta_\tau$ a portion of the parameters in the encoder, or are they additional parameters added to the UNet encoder? If the former, how are these parameters picked? (2) The two condition encoders work in a very similar way. Why not use one encoder for condition encoding? (3) This paper doesn’t describe the required computing resources and hyperparameters for test-time finetuning.
  4. In the experiments, PromptDiffusion overfits to normal, pose and densepose after finetuning, while its performance improves for the other layout types. Can the authors explain why this happens?
  5. The first ablation study is not well-designed. In this setting, both the matching module and the support sets are removed. The authors should keep the support sets and use a straightforward way (like linear projection) to map all features (from the three encoders) to the space of the UNet, and then add these features to the decoder.
  6. In Supp Fig.5, I don’t see obvious similarity between the query patch and reference patches. For instance, when the query patch is the elbow, the selected patches include all different joints. Can the authors further explain it?
  7. The model does not saturate with 150 samples. More samples should be used to find the saturation point as an upper bound for the model.

Questions

I suggest the authors first clarify how the spatial-control setting motivates the model design. The other questions in the weaknesses are very clear; the authors just need to respond to them in the rebuttal. I’m glad to raise my rating if all the concerns are addressed.

Limitations

Yes.

Final Rating Justification

The authors responded with thorough experiments and explanations. My concerns have been addressed. I think the current version should be further polished and the explanations from the rebuttal should be added to the final version. Overall, this is a good research work.

Formatting Issues

There's no formatting issue.

Author Response

We gratefully appreciate your review, acknowledging “the matching module is a good method to unify different representations of label conditions”. Here, we address the raised concerns and questions.

Weaknesses

W1. Motivation for Layout Control-Specific Design

A1. We appreciate the reviewer’s comment. While spatial control is a subset of image-to-image generation, few-shot adaptation for spatial control poses unique challenges that require specialized designs. Many image-to-image generation tasks, such as image restoration, image editing, and style transfer, involve a (near) natural image as the condition. Since distribution shifts in the condition are generally not significant in these tasks, large-scale pre-training has been observed to be sufficient for reasonable generalization across them. However, spatial control involves spatial guidance signals as conditions, such as edge/depth maps, which are generally non-natural images with distributions that are unique to each task. As shown in our experiments (Tables 1 and 2) and in Table G of our response to reviewer G2yU, this causes existing image-to-image generation models to fail to generalize to unseen spatial conditions in most cases. UFC is designed specifically to handle the large discrepancy between different spatial condition types: it encodes (unseen) control conditions by interpolating image features through matching and uses a parameter-efficient adaptation mechanism equipped with meta-training, each of which turned out to be crucial for few-shot adaptation to unseen control tasks (Table 3 and our response to weakness 1 raised by reviewer d7Cj). We will clarify these motivations in the paper.

W2. Differences in the matching module between UFC and Chameleon

A2. While we agree with the reviewer that UFC and Chameleon share a high‑level design—both use a matching module to draw task‑relevant information from a support set for few‑shot adaptation—their objectives and guiding intuitions are fundamentally different.

UFC employs patch‑wise matching to unify diverse condition types by extracting task‑agnostic features from support images. Chameleon [1], in contrast, uses matching to harvest task‑specific features from support labels (which correspond to our support conditions).

Because the goals diverge, the roles of the attention components are effectively reversed. In UFC, condition features from a task‑specific encoder serve as the Query and Key, weighting task-agnostic image features from a shared encoder that play the Value role; this yields unified control features. In Chameleon, image features from a task‑specific encoder act as the Query and Key to express task‑specific similarity, while label features from a shared encoder act as the Value, producing a task‑specific map that a shared label decoder then processes.

Finally, UFC conditions its matching module on the diffusion timestep embedding, allowing control features to adapt at every denoising step, whereas Chameleon does not include timestep conditioning.
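To make the role reversal concrete, below is a minimal illustrative sketch of the matching step as described above. It is not the authors' implementation: the module and tensor names, the shapes, and the additive timestep conditioning are assumptions.

```python
# Minimal sketch of the matching described above (not the authors' code).
# Tensor shapes, module names, and the additive timestep conditioning are
# illustrative assumptions.
import torch
import torch.nn as nn


class MatchingSketch(nn.Module):
    def __init__(self, dim: int, t_dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)      # query: query-condition features
        self.to_k = nn.Linear(dim, dim)      # key:   support-condition features
        self.to_v = nn.Linear(dim, dim)      # value: task-agnostic support-image features
        self.t_proj = nn.Linear(t_dim, dim)  # diffusion timestep conditioning

    def forward(self, q_cond, s_cond, s_img, t_emb):
        # q_cond: (B, Nq, D) query-condition tokens from the condition encoder
        # s_cond: (B, Ns, D) support-condition tokens (same encoder, shots flattened)
        # s_img:  (B, Ns, D) support-image tokens from the shared image encoder
        # t_emb:  (B, t_dim) timestep embedding of the current denoising step
        q = self.to_q(q_cond + self.t_proj(t_emb).unsqueeze(1))
        k = self.to_k(s_cond)
        v = self.to_v(s_img)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v  # (B, Nq, D) unified control features for the U-Net decoder
```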

W3. Task-specific parameter selection; clarification on condition encoders; computing resources.

A3.

(1) (Task-specific parameters $\theta_\tau$ selection): While any parameter-efficient fine-tuning method can be applied to implement the task-specific parameters $\theta_\tau$, we adopt bias-tuning (as described in L188-189), as it has been shown to be effective in the few-shot dense prediction literature [1,2]. In this implementation, the bias parameters to be tuned are a portion of the parameters in the encoder.

(2) (clarification on condition encoders): We use a single, shared condition encoder for encoding the conditions (both the support and query), as stated in Eq.(3). We apologize for the misleading figure and will modify the condition encoder part to make it clearer.

(3) (computing resources): We report the computing resources and hyperparameters for test-time finetuning in Tables A and B, respectively.

Table A. Computing Resources

| Stage | GPU(s) | Batch size / GPU | Mem / GPU | Time |
|---|---|---|---|---|
| Training | 8 RTX 3090 | 6 | 16 GB | 12 hours |
| Fine-tuning | 1 RTX 3090 | 10 | 21 GB | <= 1 hour |
| Inference | 1 RTX 3090 | 8 | 11.5 GB | 3.06 s / image |

Table B. Hyperparameters

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 1e-5 |
| Weight decay | 0.01 |
| Batch size | 10 |
| Number of iterations | <= 600 |

W4. Why fine-tuning PromptDiffusion overfits on certain conditions

A4. To clarify, PromptDiffusion also slightly overfits to Depth. Therefore, PromptDiffusion overfits on the high-level conditions, such as depth, normal, pose, and densepose, while not on the low-level conditions, such as Canny and HED. We conjecture that this is due to our 3-fold evaluation protocol, where we split the tasks into three folds and use the other folds for meta-training when evaluating each fold. Since low-level labels can be relatively easily inferred from high-level labels, while the opposite is hard, we suspect that PromptDiffusion shows better generalization on low-level tasks by transferring its knowledge from high-level tasks contained in the meta-training dataset.

On the other hand, this behavior is not observed in UFC, which highlights the effectiveness of the robust patch-wise matching and parameter-efficient fine-tuning for few-shot adaptation to unseen tasks.
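For concreteness, the cross-task protocol referenced above can be sketched as follows; the exact fold assignment shown is an assumption for illustration, and the two helper callables are hypothetical.

```python
# Illustrative sketch of the 3-fold cross-task evaluation protocol mentioned above.
# The fold assignment below is an assumption, not taken from the paper; the two
# helper callables are hypothetical.
FOLDS = [["canny", "hed"], ["depth", "normal"], ["pose", "densepose"]]


def cross_task_evaluation(meta_train, finetune_and_eval):
    results = {}
    for held_out in FOLDS:
        # meta-train on the other two folds, then adapt to each held-out task
        train_tasks = [t for fold in FOLDS if fold is not held_out for t in fold]
        model = meta_train(train_tasks)
        for task in held_out:
            results[task] = finetune_and_eval(model, task)  # few-shot fine-tuning + eval
    return results
```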

W5. About the ablation study on the matching module

A5. We want to clarify that our ablation study tests the matching module’s role in adaptively extracting knowledge from the support sets, rather than the matching layer itself. A simple linear projection cannot adaptively extract the information relevant to each query condition from the support set, and cannot cope with different support sizes (e.g., an increased number of shots after fine-tuning). Furthermore, adding all features obtained after the linear projection can destroy the structural information encoded in the query-condition feature and potentially amplify over-fitting.

To validate our claim, we conducted an additional experiment following the reviewer’s suggestion. Tables C and D summarize the results. As expected, we observe that the w/o matching, w/ support variant (V1) fails to yield meaningful controllability and also substantially degrades image quality. This indicates that the ablation variant in our paper (V2) serves as a more reasonable baseline for ablating the effect of matching.

Table C. Controllability Measurement on COCO 2017 Validation

| Controllability | Canny (↑) | HED (↑) | Depth (↓) | Normal (↓) | Pose (↑) | Densepose (↑) |
|---|---|---|---|---|---|---|
| w/o matching, w/ support (V1) | 0.2302 | 0.2954 | 99.72 | 20.36 | 0.000 | 0.1021 |
| w/o matching, w/o support (V2) | 0.2984 | 0.4972 | 96.17 | 16.06 | 0.150 | 0.3995 |
| UFC | 0.3239 | 0.5121 | 94.38 | 15.09 | 0.229 | 0.4340 |

Table D. FID (↓) Evaluation on COCO 2017 Validation

| FID | Canny | HED | Depth | Normal | Pose | Densepose |
|---|---|---|---|---|---|---|
| w/o matching, w/ support (V1) | 25.45 | 28.68 | 27.16 | 27.24 | 49.86 | 46.04 |
| w/o matching, w/o support (V2) | 20.43 | 20.43 | 21.10 | 23.70 | 48.47 | 40.84 |
| UFC | 19.24 | 20.58 | 21.04 | 21.60 | 47.91 | 37.79 |

W6. More explanation on the relevance between query/reference patches in Supp. Fig.5

A6. We would like to clarify that query patches attend to relevant support patches, not necessarily semantically identical ones. Since there is no explicit training objective for the matching module to model semantic similarity, it is not surprising that the “knee” patch (we believe the reviewer may have written “elbow” by mistake) attends to patches of other joints. We conjecture that the condition encoder learns to produce condition features that capture general “human joints”; the matching on these features then composes a skeleton-like layout of features, which appears to be sufficient to guide the diffusion model. Similarly, for Canny/Depth, patches are matched by spatial layout rather than semantics. We hope our discussion of the attention maps resolves the reviewer’s concern.

W7. Performance saturation analysis with increased sample size

A7. We provide the performance of UFC with more samples (up to 300) in Tables E and F. The results show the same trend as our previous discussion in Section 5.3 of the paper (L296 to L299): with more support data, the model achieves better controllability. Controllability appears to saturate for most tasks, except for Pose. We hypothesize that Pose requires semantic understanding of human joints given sparse labels, making it more challenging to learn than the other tasks and requiring much more data to reach the saturation point.

Table E. Controllability Evaluation on COCO 2017 Validation Set

| Shot | Canny (↑) | HED (↑) | Depth (↓) | Normal (↓) | Pose (↑) | Densepose (↑) |
|---|---|---|---|---|---|---|
| 90 | 0.3401 | 0.5343 | 92.89 | 14.81 | 0.264 | 0.4514 |
| 150 | 0.3460 | 0.5417 | 91.99 | 14.33 | 0.265 | 0.4573 |
| 300 | 0.3567 | 0.5430 | 91.88 | 14.53 | 0.318 | 0.4606 |

Table F. FID (↓) Evaluation on COCO 2017 Validation Set

| Shot | Canny | HED | Depth | Normal | Pose | Densepose |
|---|---|---|---|---|---|---|
| 90 | 18.93 | 20.07 | 21.42 | 20.62 | 43.57 | 37.72 |
| 150 | 18.89 | 20.60 | 21.70 | 21.22 | 44.95 | 36.17 |
| 300 | 19.14 | 19.41 | 20.23 | 20.08 | 43.49 | 35.91 |

[1] Kim, Donggyun, et al. "Chameleon: A data-efficient generalist for dense visual prediction in the wild." ECCV 2024.

[2] Kim, Donggyun, et al. "Universal Few-shot Learning of Dense Prediction Tasks with Visual Token Matching." ICLR 2023.

Comment

Thank you for the response. The response is thorough and detailed. After reading all reviews, I still think the link between the motivation and the method is very weak. In the response, the authors explain the difference between general image-to-image synthesis and layout-controlled generation. I want to know which part of the proposed method addresses the specific challenge of this layout-control generation task, to distinguish it from general I2I generation.

Comment

We appreciate the reviewer’s continued engagement. Our research focuses on few-shot spatial control generation, where the key challenge—compared to general image-to-image synthesis—lies in handling highly heterogeneous and unseen spatial conditions such as edges, depth, or segmentation maps. These conditions exhibit divergent visual structure distributions and differ significantly from both natural images and the training conditions. With this perspective, UFC has two key methodological designs that target the specific challenge of our problem: (1) the matching formulation and (2) the training procedure.

First, unlike general image-to-image synthesis, spatial control tasks assume that the output is always a natural image, while the input can be a general spatial condition. This leads us to formulate the matching to construct unified control features by composing the task-agnostic visual features extracted from the support images, while using the conditions to compute the weights. Thanks to the unified control features produced by the matching module, UFC can universally handle heterogeneous conditions and efficiently generalize to unseen condition types with a few image-condition pairs by sharing most of the network parameters.

Second, we design our two-stage training pipeline (meta-training followed by fine-tuning) to directly target our universal few-shot control problem. Specifically, we first employ an episodic meta-training protocol to simulate the few-shot learning scenarios encountered at test time, which directly targets the few-shot control problem. Then, we perform parameter-efficient fine-tuning by choosing specific parameters to update, i.e., the bias parameters of the condition encoder, the matching modules, and the projection layers for injection (see the sketch below). This minimal choice of learnable parameters at fine-tuning is also specific to our problem: we can freeze the image encoder thanks to the matching formulation with task-agnostic visual features, which helps UFC avoid over-fitting to the few-shot support of unseen control tasks.
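As a rough illustration of this parameter selection (attribute-name prefixes such as `cond_encoder`, `matching`, and `proj` are assumptions, not the actual implementation), a PyTorch-style sketch might look like the following; the optimizer settings follow the fine-tuning hyperparameters reported in the rebuttal.

```python
# Hedged sketch of selecting the fine-tuned parameters described above.
# Module name prefixes are illustrative assumptions; lr / weight decay follow
# the hyperparameters reported in the rebuttal above.
import torch


def trainable_parameters(ufc: torch.nn.Module):
    params = []
    for name, p in ufc.named_parameters():
        p.requires_grad = False
        if name.startswith("cond_encoder") and name.endswith(".bias"):
            p.requires_grad = True   # bias-tuning of the condition encoder
        elif name.startswith(("matching", "proj")):
            p.requires_grad = True   # matching modules and injection projections
        if p.requires_grad:
            params.append(p)
    return params


# optimizer = torch.optim.AdamW(trainable_parameters(model), lr=1e-5, weight_decay=0.01)
```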

We hope that our response clarifies the reviewer’s concern. We also gently remind the reviewer that the impact of each component on our problem is demonstrated by our ablation study (Table 3 in our paper and our response to weakness 1 raised by reviewer d7Cj). We appreciate further discussions about any remaining concerns.

Comment

Thank you for the detailed explanation to address my concern. I will update my score accordingly.

Review
Rating: 4

This paper, "Universal Few-Shot Spatial Control for Diffusion Models (UFC)", addresses the challenge of adapting text-to-image diffusion models to novel spatial control conditions with limited data. Existing methods typically require extensive datasets and retraining for each new spatial control task, leading to high computational costs and limited adaptability.

The core contribution of this work is the Universal Few-Shot Control (UFC) framework. UFC proposes a versatile few-shot control adapter that can generalize to novel spatial conditions using only a small number of labeled examples at test time. The key innovation lies in representing novel spatial conditions by adapting the interpolation of spatial structures derived from task-agnostic visual features of images within a small support set, rather than directly encoding task-specific labels. This is achieved through a matching mechanism that computes weights based on the similarity between query and support condition embeddings.

Strengths and Weaknesses

Strengths:

  1. The paper clearly identifies a significant limitation in current spatial conditioning methods for diffusion models, which is their high data requirements and limited adaptability to novel conditions.

  2. The proposed Universal Few-Shot Control (UFC) is a novel framework that addresses the identified limitations by leveraging task-agnostic visual features and an interpolation-based approach for constructing task-specific control features.

  3. The experiments cover diverse spatial control modalities (six types) and include comparisons against both fully supervised and few-shot baselines. The evaluation protocol includes both image generation quality (FID) and controllability using task-specific metrics.

Weaknesses:

  1. This paper does not clearly explain the support conditions and support images used during training, nor their computational cost at training and inference time.

  2. The chosen baselines, such as Uni-ControlNet and Prompt Diffusion, are indeed established works in controllable diffusion models. However, given the rapid advancements in this field, they may not represent the cutting edge of performance in few-shot or general controllable image generation. Including results from more recent and highly competitive baselines, particularly those that have shown strong performance in adapting to new conditions or offer some form of zero-shot transfer, would provide a more robust evaluation of UFC's current standing and highlight its unique advantages more effectively.

  3. Moreover, these conditions can easily be obtained by processing the image. Can the authors find some real-world scenarios that are genuinely few-shot and verify the performance in these scenarios?

Questions

Please refer to the weaknesses part.

Limitations

Yes.

Final Rating Justification

I appreciate the authors’ thorough rebuttal. After carefully reviewing their responses alongside the feedback from other reviewers, I find that most of my concerns have been adequately addressed. But as other reviewers mentioned, the writing of this paper can be improved. Consequently, I keep my original score.

Formatting Issues

N/A

Author Response

We sincerely appreciate your review and recognition of our “novel framework” that addresses “high data requirements and limited adaptability” of current spatial-conditioning diffusion models. We address the raised concern below.

Weaknesses

W1. Additional explanations on support set and computing resources

A1. The support set contains several condition-image pairs; for each pair, we refer to its condition and image as the support condition and support image, respectively. During meta-training, we sample the support set randomly for each task at every iteration to simulate different few-shot episodes, while the support set is given and fixed during fine-tuning and inference (Section 4.3). To save computation, we employ three-shot supports in each training episode and five-shot supports during inference. Meta-training takes 12 hours using 8 RTX 3090 GPUs, while fine-tuning takes less than an hour using a single RTX 3090 GPU. When conditioned on a five-shot support set, inference takes about 3 seconds per image using a single RTX 3090 GPU.
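A minimal sketch of this episodic sampling, under our own naming assumptions (`task_datasets` as a mapping from task name to a list of condition-image-caption triples), is shown below.

```python
# Illustrative sketch of episodic support sampling during meta-training.
# Dataset structure and names are assumptions; at fine-tuning/inference time
# the support set is fixed instead of re-sampled.
import random


def sample_episode(task_datasets, n_shot=3):
    """Sample one meta-training episode: a task, an n-shot support set, and a query."""
    task = random.choice(list(task_datasets))
    pairs = task_datasets[task]                      # list of (condition, image, caption)
    idx = random.sample(range(len(pairs)), n_shot + 1)
    support = [pairs[i] for i in idx[:n_shot]]       # 3-shot support during meta-training
    query = pairs[idx[-1]]                           # disjoint query example
    return task, support, query
```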

W2. Comparison with more recent controllable generation methods

A2. To the best of our knowledge, there have been recent advances in image-to-image generation, but it was hard to find comparable baselines. These methods fall into several categories, such as architectural improvements for spatial control [1-5] or unifying architectures for image-to-image generation tasks [6-11]. However, none of these claim strong few-shot adaptation or zero-shot transfer for spatial control. For instance, OmniGen [6] demonstrates few-shot learning on unseen classes for segmentation, but also acknowledges that it fails to handle unseen input types (normal prediction). As a demonstration, we also conducted experiments to evaluate the zero-shot and one-shot capability of OmniGen and the zero-shot capability of Flux.1 Kontext [10] (please refer to Table G in our response to reviewer G2yU) and found that these models fail or show limited performance in spatial control on unseen tasks.

In that regard, we would appreciate it if the reviewer could suggest any baseline so that we can compare.

W3. Additional performance verification on real-world few-shot scenarios

A3. We appreciate the reviewer for the suggestion. We would like to kindly remind the reviewer that we have the results for the requested experiments in Figure 10 and Section F of the supplementary file. In this experiment, we employed three unseen spatial control tasks from FreeControl (point clouds, meshes, and wireframes), which are inherently data-scarce and for which model-based annotations are difficult to obtain, as suggested by the reviewer. Similar to FreeControl, we were able to report only qualitative results due to the absence of pre-trained models to measure quantitative controllability.

As shown in Figure 10, our method can produce convincing results with only a few-shot support (30 examples), although these unseen spatial conditions are significantly different from the ones in the meta-training dataset (Canny, HED, Depth, Normal, Pose, Densepose). We appreciate the reviewer’s suggestion and will include this result in the main paper.


[1] Tan, Zhenxiong, et al. "Ominicontrol: Minimal and universal control for diffusion transformer." ICCV 2025.

[2] Tan, Zhenxiong, et al. "Ominicontrol2: Efficient conditioning for diffusion transformers." arXiv preprint arXiv:2503.08280 (2025).

[3] Yu, Fanghua, et al. "UniCon: Unidirectional Information Flow for Effective Control of Large-Scale Diffusion Models." ICLR 2025.

[4] Zavadski, Denis, et al. "Controlnet-xs: Rethinking the control of text-to-image diffusion models as feedback-control systems." ECCV, 2024.

[5] Li, Ming, et al. "Controlnet++: Improving conditional controls with efficient consistency feedback” ECCV 2024.

[6] Xiao, Shitao, et al. "Omnigen: Unified image generation." CVPR 2025.

[7] Han, Zhen, et al. "Ace: All-round creator and editor following instructions via diffusion transformer." arXiv preprint arXiv:2410.00086 (2024).

[8] Lin, Yijing, et al. "RealGeneral: Unifying visual generation via temporal in-context learning with video models." ICCV 2025.

[9] Lin, Weifeng, et al. "Pixwizard: Versatile image-to-image visual assistant with open-language instructions." ICLR 2025.

[10] Black Forest Labs. "FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space." arXiv preprint arXiv:2506.15742 (2025).

[11] Wang, Haoxuan, et al. "Unicombine: Unified multi-conditional combination with diffusion transformer." arXiv preprint arXiv:2503.09277 (2025).

Comment

Thank you for reviewing our rebuttal. We are happy to provide further clarification or discuss any remaining concerns you may have.

Review
Rating: 5

This paper proposes Universal Few-shot Control (UFC), a few-shot control adapter designed to generalize to novel spatial conditions. UFC leverages the analogy between query and support conditions to construct task-specific control features through a matching mechanism and a small number of task-specific parameter updates. The authors conduct comprehensive experiments across various spatial control tasks to demonstrate the method's effectiveness.

Strengths and Weaknesses

Strengths:

  1. The paper is well written and easy to follow.

  2. Comprehensive experiments are conducted on diverse spatial control tasks, showing strong and consistent performance.

  3. The motivation for introducing a universal few-shot control adapter is clear and addresses an important challenge in controllable generation.

Weaknesses:

  1. The training process of UFC involves two stages: meta-training on 300K text-image pairs with various spatial conditions and few-shot fine-tuning. The authors frequently mention “30-shot” learning, but this refers to fine-tuning a model that has already been meta-trained. Can the model generalize to new conditions if only the 30 examples are available without meta-training?

  2. In the comparisons with baseline methods, it is not clear whether all models are given access to the same data during training. For example, UFC appears to benefit from meta-training on datasets with diverse spatial conditions, whereas fully supervised baselines like ControlNet may only see a single-condition dataset. This raises fairness concerns in the comparison, and the authors should clarify the data access for all methods.

Questions

During the few-shot fine-tuning stage, are the 30 examples selected randomly from the new-domain dataset? Would using a data selection strategy—e.g., ensuring sample diversity—lead to better generalization?

Limitations

While the proposed UFC framework shows strong generalization to new spatial conditions, several limitations should be addressed. First, the method relies on a two-stage training process involving extensive meta-training on a large dataset (300K text-image pairs), which may limit its applicability in truly low-resource or data-scarce scenarios. The paper does not clarify whether the method remains effective when only a few examples are available without prior meta-training. Second, there is limited discussion on the fairness of comparisons—UFC benefits from diverse multi-condition data during meta-training, whereas baseline methods like ControlNet may only be trained on single-condition datasets. This discrepancy could overstate UFC’s advantage.

Final Rating Justification

I am satisfied with the authors’ response and have adjusted my score accordingly.

Formatting Issues

N/A

Author Response

We greatly appreciate your review and your recognition that our proposed framework addresses “an important challenge in controllable generation”. Here, we address your concerns and questions.

Weaknesses

W1. Performance of UFC without the meta-training stage

A1. We would like to clarify that, in the few‑shot learning literature [1-3], it is common to call a setting K‑shot when only K image-label pairs are available for each unseen task, even though auxiliary data may be used for pre‑training or meta‑training. Our references to 30‑shot follow this convention. Because our goal is few‑shot generalization to unseen condition types, meta‑training is needed to give the universal control adapter prior knowledge of diverse conditions.

To verify this, we skipped meta‑training and adapted UFC directly to the unseen conditions, still using 30 support examples. Results for three representative tasks are shown below. As expected, omitting meta‑training degrades performance, confirming that this stage is crucial for universal few‑shot control.

Table A. Controllability measurement on COCO 2017 validation

| Method | Canny (↑) | Normal (↓) | Densepose (↑) |
|---|---|---|---|
| w/o meta-training | 0.2250 | 20.10 | 0.3180 |
| Ours | 0.3239 | 15.09 | 0.4340 |

Table B. FID (↓) evaluation on COCO 2017 validation

| Method | Canny | Normal | Densepose |
|---|---|---|---|
| w/o meta-training | 24.95 | 27.77 | 39.82 |
| Ours | 19.24 | 21.60 | 37.79 |

W2. Data fairness of the comparison across baselines

A2. We would like to clarify that all few-shot baselines in Tables 1 and 2 (Prompt Diffusion, Uni-ControlNet+FT, Prompt Diffusion+FT) are pre-trained on exactly the same meta-training dataset as our method and fine-tuned on the same few-shot data, as briefly discussed in Ln 221-225 of our paper. For the fully supervised methods, ControlNet was trained for each condition separately, following previous practice [4, 5, 6], since it is designed to work on a single condition and hence cannot be pre-trained on other tasks. Considering that ControlNet still leverages full supervision of 150k examples for each task, whereas UFC leverages only 30 of those examples to learn each new spatial control task, we believe ControlNet serves as a strong upper bound across tasks, as reported in previous works. We also note that we considered the multi-task variant of ControlNet, Uni-ControlNet, which leverages full supervision of the 6 evaluation tasks jointly for training. Compared to Uni-ControlNet, UFC was meta-trained on 4 tasks of the Uni-ControlNet training data and fine-tuned with only 30 examples for each new task that is disjoint from the meta-training tasks. Therefore, UFC has no advantage over either the few-shot or the fully supervised baselines in terms of data access. We appreciate the comment and will make this description clearer in the paper.


Questions

Q1. About diversity and selection strategy for support set

A1.

During the few‑shot fine‑tuning stage reported in our main paper, we randomly select 30 support examples for Canny, HED, Depth, and Normal. For Pose and DensePose, however, we manually curate 30 examples to ensure sufficient diversity for these higher‑level semantic tasks.

Because our matching strategy hinges on condition-specific similarity, the notion of “diversity” depends on the task. For edge inputs such as Canny and HED, the support set must span a wide range of edge-map densities and orientations. For Pose and DensePose, it needs to cover varied human scales, occlusions, and pose deformations. In that regard, for most of the tasks in our experiments (except Pose and Densepose), the object class (e.g., cats or dogs) is largely irrelevant to ensuring the diversity of the support set. For instance, our model could generalize to the COCO 2017 validation set with only 30 support examples, which cover only a small set of objects.

To measure how support choice influences performance, we repeated fine-tuning three times, each time drawing a new random set of 30 supports and carrying out few-shot generation. The results are summarized in the tables below.

For all tasks except Pose and DensePose, performance shows no significant sensitivity to support randomness. Pose and DensePose, however, performed worse than the results in Tables 1 and 2 of the main paper, which used curated supports maximising diversity in scale, deformation, crowding, and occlusion. These results confirm that the importance of support-set diversity is itself task-dependent, and having a data selection strategy to ensure sample diversity can potentially benefit the performance (as shown in Pose and Densepose).

Table C. Controllability evaluation with different support sets on COCO 2017

| Seed | Canny (↑) | HED (↑) | Depth (↓) | Normal (↓) | Pose (↑) | Densepose (↑) |
|---|---|---|---|---|---|---|
| #1 | 0.3239 | 0.5121 | 94.38 | 15.09 | 0.229 | 0.4340 |
| #2 | 0.3295 | 0.4957 | 93.47 | 15.82 | 0.062 | 0.3994 |
| #3 | 0.3273 | 0.5128 | 94.84 | 15.93 | 0.171 | 0.3913 |
| #4 | 0.3290 | 0.4942 | 94.50 | 15.56 | 0.155 | 0.4180 |
| Mean | 0.3274 | 0.5037 | 94.30 | 15.60 | 0.1543 | 0.4107 |
| Std | 0.0025 | 0.0101 | 0.5851 | 0.3737 | 0.0692 | 0.0192 |

Table D. FID (↓) evaluation with different support sets on COCO 2017

| Seed | Canny | HED | Depth | Normal | Pose | Densepose |
|---|---|---|---|---|---|---|
| #1 | 19.24 | 20.58 | 21.04 | 21.60 | 47.91 | 37.79 |
| #2 | 21.06 | 21.70 | 20.51 | 20.85 | 46.38 | 37.22 |
| #3 | 20.90 | 23.84 | 22.21 | 22.56 | 45.23 | 35.29 |
| #4 | 20.41 | 20.78 | 21.80 | 21.89 | 48.71 | 38.88 |
| Mean | 20.40 | 21.73 | 21.39 | 21.73 | 47.06 | 37.30 |
| Std | 0.82 | 1.49 | 0.76 | 0.71 | 1.56 | 1.50 |

[1] Vinyals, Oriol, et al. "Matching networks for one shot learning." NIPS 2016.

[2] Finn, Chelsea, et al. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017.

[3] Snell, Jake, et al. "Prototypical networks for few-shot learning." NIPS 2017.

[4] Zhao, Shihao, et al. "Uni-controlnet: All-in-one control to text-to-image diffusion models." NeurIPS 2023.

[5] Wang, Zhendong, et al. "In-context learning unlocked for diffusion models." NeurIPS 2023.

[6] Qin, Can, et al. "Unicontrol: A unified diffusion model for controllable visual generation in the wild." NeurIPS 2023.

Comment

Thank you for the authors' detailed and thoughtful response. In my opinion, the authors have addressed most of my concerns and clarified the key points I raised. I would like to follow the ongoing discussion between the authors and other reviewers before making my final decision.

Comment

Thank you for taking the time to review our rebuttal. If any concerns remain, we are open to further clarification and discussion.

Review
Rating: 5

The authors propose UFC, a framework/adapter for few-shot spatial control of T2I diffusion models. In particular, when finetuned on a small set of data pairs (30 pairs) of a condition type not seen during training of the adapter, the model achieves spatial control for any new input of this finetuned condition type. This is achieved via a matching module which, given encoded tokens of the small set of data pairs (called support conditions and images), uses them as keys and values that are queried by a spatial condition input unseen during training or finetuning.

Strengths and Weaknesses

Strengths

  1. The method shows good results on dense(r) conditions like canny and depth, with performance exceeding existing few-shot methods and just shy of fully supervised methods.
  2. The proposed architecture is clear and makes intuitive sense.
  3. The quantitative results are presented very clearly and nicely show the strengths and weaknesses of the proposed method.
  4. The ablations for the matching modules and finetuning show that each component of the framework is important to its final performance.

Weaknesses

  1. The setup of the paper is a bit confusing at first. “Meta-learning/training” [L50, L179] is mentioned twice before the “Experiments” section, but until then it is unclear that the adapter is trained on 150K image-condition pairs of condition types that will not be seen during fine-tuning and inference. Clarifying this earlier (in the abstract and introduction) will greatly help clarity from the start.
  2. The authors show ablation on support set sizes, but they only show down to 30. It would be great for the authors to show how fewer shots (15, 10, 5, etc.) performs both quantitatively and qualitatively, even if the performance is worse than the paper’s default 30-shot setting.
  3. The qualitative evaluation/samples shown are currently limited to annotatable conditions, that is, they can be annotated by off-the-shelf models. However, in the real world this is not super useful, as few-shot methods are really beneficial for settings where it is difficult to obtain model-based annotations. Thus, it would be nice to see how UFC performs when meta-trained on all the model-annotated pairs and applied to challenging control conditions like those shown in FreeControl (e.g., point clouds, meshes, wireframes).
  4. There are no user studies comparing UFC to the baselines, especially since generative model benchmarks/metrics are often not 100% representative of human-perceived quality.

Questions

  1. To clarify, is there a separate matching module for each U-Net decoder resolution/level, or is there a single matching module whose output is fed into a linear projection which up/downsamples it to match the decoder resolutions/levels?
  2. (Echoing point 2 of Weaknesses, and a bit more on the fine-tuning stage) How does UFC perform on fewer shots? How does the amount of time/iterations spent fine-tuning affect the final performance? Do the authors observe overfitting easily during the fine-tuning stage or is it usually cut short due to time? Lastly, how does the diversity of the support set affect the generalizability of UFC, for example, what if we only give a support set of cats and do inference on dogs/humans/objects/buildings?
  3. (Echoing point 3 of Weaknesses) How does UFC generalize to difficult-to-obtain/in-the-wild conditions (conditions which cannot be easily annotated by an off-the-shelf model) like FreeControl and Ctrl-X tests on?
  4. The authors address the “why use a few-shot approach when training-free methods exist” question in the appendix and use FreeControl to illustrate the advantages of UFC—why this is not included in the main paper is not explained (since training-free, a.k.a. zero-shot methods “cover” the problem space of few-shot spatial control), though it could be due to FreeControl’s long inference time, so I would like some clarification from the author on this. However, Ctrl-X [1] is another training-free method which is guidance-free, unlike the three baselines the authors listed (FreeControl, FreeDoM, and Universal Guidance), so it has short enough inference time which could potentially be a good baseline. Also, T2I-Adapter [2] is also a good potential baseline for fully supervised methods.
  5. With the emergence of open-source language-driven/interactive image generation models such as OmniGen [3] and Flux.1-Kontext [4], how do the authors see the usefulness and applications for UFC within this landscape? This is especially considering that Flux.1-Kontext can natively do instruction-based image-to-image which (sometimes) works on spatial conditions, and OmniGen showcases few-shot spatial control too.

Addressing questions 2 to 3 well (with examples), a bit more examination of zero-shot methods as in addressing question 4, and a good discussion on question 5 can potentially push me to raise my score. Also, addressing weakness 4 (user study) can push me to raise my score too, though I am not expecting the authors to be able to conduct a user study in the short time frame of the rebuttal.

[1] Kuan Heng Lin, Sicheng Mo, Ben Klingher, Fangzhou Mu, Bolei Zhou. “Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance”. NeurIPS 2024, https://arxiv.org/abs/2406.07540.

[2] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhonggang Qi, Ying Shan, Xiaohu Qie. “T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models.” AAAI 2024, https://arxiv.org/abs/2302.08453.

[3] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, Zheng Liu. “OmniGen: Unified Image Generation.” CVPR 2025, https://arxiv.org/abs/2409.11340.

[4] Black Forest Labs. “Flux.1-Kontext.” https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev.

Limitations

The authors constructed the meta-learning training dataset from the LAION-400M dataset, which has had widely publicized reports showing that it includes exploitative images of children and which was temporarily taken down by LAION. Re-LAION-5B [5] has since been released, which LAION claims is much more thoroughly cleaned. It would be great if the authors discussed potential negative societal impacts of using LAION-400M (as opposed to Re-LAION-5B or some other dataset), along with considering re-training UFC on safer datasets if/when they open-source the code and weights of UFC.

[5] LAION. “Releasing Re-LAION 5B: Transparent iteration on LAION-5B with additional safety fixes.” https://laion.ai/blog/relaion-5b/.

Final Rating Justification

The authors have addressed most of my questions and concerns, and the responses are very thorough. I especially appreciate the authors running additional experiments and baselines (Ctrl-X, OmniGen, Flux.1 Kontext, along with experiments other reviewers requested) to showcase the effectiveness of UFC, and I also want to thank the authors for pointing me to the appendix on experiments I missed. I thus raise my final rating to Accept.

Formatting Issues

N/A

Author Response

We sincerely appreciate the review and your acknowledgement of our “clear”, “intuitive” architecture and our “good results”. We address the raised concerns and questions below.

Weaknesses

W1. Clarification on meta-learning and few-shot finetuning setting

A1. We thank the reviewer for the recommendation. We will clarify this in the abstract and introduction.

W4. User studies

We thank the reviewer for the recommendation. Due to the time limit, we will conduct user studies and incorporate the results into the paper after the rebuttal period ends.


Questions

Q1. Clarification on the matching module

A1. There is a separate matching module for each layer of the U-Net encoder, which operates on the features at each level, and its output is incorporated into the corresponding U-Net decoder layer. We will clarify this in the method section.
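A rough, self-contained sketch of this per-level wiring follows; the class and parameter names are our assumptions, and `make_matcher` stands for a constructor of a matching module such as the one sketched earlier in this discussion.

```python
# Illustrative sketch of one matching module per U-Net level, as described above.
# Names are assumptions, not the actual implementation.
import torch.nn as nn


class PerLevelControl(nn.Module):
    def __init__(self, level_dims, make_matcher):
        super().__init__()
        # one matching module and one injection projection per encoder resolution
        self.matchers = nn.ModuleList(make_matcher(d) for d in level_dims)
        self.proj = nn.ModuleList(nn.Linear(d, d) for d in level_dims)

    def forward(self, q_cond_feats, s_cond_feats, s_img_feats, t_emb):
        # inputs: per-level lists of token features, coarsest to finest
        controls = []
        for match, proj, q, sc, si in zip(self.matchers, self.proj,
                                          q_cond_feats, s_cond_feats, s_img_feats):
            controls.append(proj(match(q, sc, si, t_emb)))
        return controls  # each entry is added to the corresponding U-Net decoder level
```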

W2, Q2. Performance with fewer shots; fine-tuning iterations and overfitting; Effect of support set’s diversity on model performance

A2.

(1) (Performance with fewer shots): Following the reviewer’s suggestion, we performed a fewer‑shot ablation; the results are in the tables below. Consistent with Figure 4, controllability and quality drop as the support shrinks. However, the results also show that UFC achieves a certain level of controllability on unseen control conditions even with a very limited support set (5 and 15 shots), showcasing the generalizability of our method.

Table A. Controllability evaluation on COCO 2017

| Shot | Canny (↑) | HED (↑) | Depth (↓) | Normal (↓) | Pose (↑) | Densepose (↑) |
|---|---|---|---|---|---|---|
| 5 | 0.3188 | 0.4687 | 97.21 | 17.50 | 0.085 | 0.2984 |
| 15 | 0.3217 | 0.4991 | 95.52 | 16.69 | 0.144 | 0.3523 |
| 30 | 0.3239 | 0.5121 | 94.38 | 15.09 | 0.229 | 0.4340 |

Table B. FID (↓) evaluation on COCO 2017

| Shot | Canny | HED | Depth | Normal | Pose | Densepose |
|---|---|---|---|---|---|---|
| 5 | 21.82 | 21.86 | 23.20 | 22.05 | 50.45 | 38.25 |
| 15 | 20.49 | 20.75 | 21.58 | 21.46 | 50.45 | 37.95 |
| 30 | 19.24 | 20.58 | 21.04 | 21.60 | 47.91 | 37.79 |

(2) (Fine-tuning iterations and overfitting): With 30 supports, fewer than 600 steps (< 1 hour on an RTX 3090) suffice for all six tasks. Continuing fine-tuning for many more iterations eventually leads to overfitting, but this is easily avoided by early stopping, using a single held-out validation example to monitor the denoising loss (a sketch follows below). We found that this strategy leads to similar early-stopping iterations across tasks, with low-level conditions (e.g., edges) tending to converge faster than high-level conditions (e.g., pose).
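A hedged sketch of this early-stopping loop is given below; the helper callables, the iterable of support batches, and the patience value are assumptions, while the 600-step cap and the single held-out validation example follow the description above.

```python
# Illustrative early-stopping fine-tuning loop as described above; helpers and
# the patience value are assumptions, not the authors' implementation.
import torch


def finetune_with_early_stopping(model, optimizer, support_batches, val_example,
                                 denoise_loss, max_steps=600, patience=50):
    best, since_best = float("inf"), 0
    for step, batch in enumerate(support_batches):   # cycles over the 30-shot support
        if step >= max_steps or since_best >= patience:
            break
        optimizer.zero_grad()
        denoise_loss(model, batch).backward()
        optimizer.step()
        with torch.no_grad():                        # monitor one held-out example
            val = denoise_loss(model, val_example).item()
        best, since_best = (val, 0) if val < best else (best, since_best + 1)
    return model
```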

(3) (Effect of support sets’ diversity on model performance): Because our matching strategy hinges on condition-specific similarity, the notion of “diversity” depends on the task. For edge inputs such as Canny and HED, the support set must span a wide range of edge-map densities and orientations. For Pose and DensePose, it needs to cover varied human scales, occlusions, and pose deformations. In that regard, for most of the tasks in our experiments (except Pose and Densepose), the object class (e.g., cats or dogs) is largely irrelevant to ensuring the diversity of the support set. For instance, our model could generalize to the COCO 2017 validation set with only 30 support examples, which cover only a small set of objects.

To measure how support choice influences performance, we repeated fine-tuning three times, each time drawing a new random set of 30 supports and carrying out few-shot generation. The results are summarized in the tables below.

Table C. Controllability evaluation with different support sets on COCO 2017

| Seed | Canny (↑) | HED (↑) | Depth (↓) | Normal (↓) | Pose (↑) | Densepose (↑) |
|---|---|---|---|---|---|---|
| #1 | 0.3239 | 0.5121 | 94.38 | 15.09 | 0.229 | 0.4340 |
| #2 | 0.3295 | 0.4957 | 93.47 | 15.82 | 0.062 | 0.3994 |
| #3 | 0.3273 | 0.5128 | 94.84 | 15.93 | 0.171 | 0.3913 |
| #4 | 0.3290 | 0.4942 | 94.50 | 15.56 | 0.155 | 0.4180 |
| Mean | 0.3274 | 0.5037 | 94.30 | 15.60 | 0.1543 | 0.4107 |
| Std | 0.0025 | 0.0101 | 0.5851 | 0.3737 | 0.0692 | 0.0192 |

Table D. FID (↓) evaluation with different support sets on COCO 2017

| Seed | Canny | HED | Depth | Normal | Pose | Densepose |
|---|---|---|---|---|---|---|
| #1 | 19.24 | 20.58 | 21.04 | 21.60 | 47.91 | 37.79 |
| #2 | 21.06 | 21.70 | 20.51 | 20.85 | 46.38 | 37.22 |
| #3 | 20.90 | 23.84 | 22.21 | 22.56 | 45.23 | 35.29 |
| #4 | 20.41 | 20.78 | 21.80 | 21.89 | 48.71 | 38.88 |
| Mean | 20.40 | 21.73 | 21.39 | 21.73 | 47.06 | 37.30 |
| Std | 0.82 | 1.49 | 0.76 | 0.71 | 1.56 | 1.50 |

For all tasks except Pose and DensePose, performance shows little sensitivity to support randomness. For Pose and DensePose, however, the model performed worse than the results in Tables 1 and 2 of the main paper, which used curated supports maximizing diversity in scale, deformation, crowding, and occlusion. These results confirm that the importance of support-set diversity is itself task-dependent.

W3, Q3. Generalization on difficult-to-obtain/in-the-wild conditions

A3. We would like to kindly remind the reviewer that we have the results for the requested experiments in Figure 10 and Section F of the Appendix. In this experiment, we employed three unseen spatial control tasks from FreeControl (point clouds, meshes, and wireframes), which are inherently data-scarce and for which model-based annotations are difficult to obtain, as suggested by the reviewer. Similar to FreeControl, we were able to report only qualitative results due to the absence of pre-trained models to measure controllability.

As shown in Figure 10, our method can produce convincing results with only a few-shot support (30 examples), although these unseen spatial conditions are significantly different from the ones in the meta-training dataset. We appreciate the reviewer’s suggestion and will include this result in the main paper.

Q4. Comparison with training-free, fully-supervised baselines

A4. We would like to kindly remind the reviewer that Appendix E already explains why FreeControl is excluded: running it on the full COCO-2017 validation set (5,000 images) is prohibitively slow. The original FreeControl paper reports results on only 30 images from 10 classes, which is insufficient for a fair comparison. We appreciate the suggestion to add Ctrl-X as a fast, training-free baseline. Accordingly, we evaluated Ctrl-X on COCO-2017 using the same Stable Diffusion 1.5 backbone as UFC. As shown in Tables E and F, Ctrl-X consistently underperforms UFC in both controllability and FID across all conditions.

Table E. Controllability measurement on COCO 2017 validation

| Method | Canny (↑) | HED (↑) | Depth (↓) | Normal (↓) | Pose (↑) | Densepose (↑) |
|---|---|---|---|---|---|---|
| Ctrl-X | 0.2901 | 0.3002 | 98.10 | 19.38 | 0.005 | 0.1352 |
| Ours | 0.3239 | 0.5121 | 94.38 | 15.09 | 0.229 | 0.4340 |

Table F. FID (↓) evaluation on COCO 2017 validation

| Method | Canny | HED | Depth | Normal | Pose | Densepose |
|---|---|---|---|---|---|---|
| Ctrl-X | 29.83 | 30.18 | 30.32 | 30.35 | 56.60 | 45.88 |
| Ours | 19.24 | 20.58 | 21.04 | 21.60 | 47.91 | 37.79 |

We agree that T2I-Adapter would be a valuable fully supervised reference. However, due to our resource limitations, we selected the ControlNet and Uni-ControlNet as the main baselines. Because our goal is to demonstrate few-shot capability rather than to surpass fully supervised methods, we believe the current baseline set is sufficient to support our claims.

Q5. Usefulness and applications for UFC within language-driven/interactive image generation models.

A5. Open-source, language-driven image-generation systems such as Flux.1 Kontext and OmniGen can handle multiple conditioning signals, but only those seen during training. OmniGen’s authors themselves note that “previously unseen image types (e.g., surface-normal maps) can hardly be processed as expected,” and Flux.1 Kontext likewise makes no claim of zero-shot adaptability—the model is trained on natural images rather than abstract control inputs. UFC, in contrast, is designed to adapt to new condition types from just a handful of image–condition pairs. To demonstrate this, we evaluate each model on 30 images from the ImageNet-R-TI2I dataset [1]. Since OmniGen was trained on Canny, HED, Depth, Pose, and Segmentation, we evaluate it on the unseen Normal conditions in both zero-shot and one-shot. Similarly, we zero-shot evaluate Flux.1 Kontext on unseen HED and Normal conditions.

Table G. Controllability evaluation on ImageNet-R-TI2I

| Model | HED (↑) | Normal (↓) |
|---|---|---|
| OmniGen - zero-shot | - | 21.28 |
| OmniGen - one-shot | - | 20.25 |
| Flux.1 Kontext | 0.4379 | 21.64 |
| Ours | 0.5718 | 17.75 |

Despite OmniGen’s larger and more diverse training corpus, it lags behind UFC, and Flux.1 Kontext performs poorly on both tasks; upon inspecting the generated images, we find that it often simply reconstructs the input condition. These results underline the limited few-shot capacity of current unified generators and the advantage of UFC’s universal few-shot control adapter, which is independent of the diffusion backbone. Because UFC is orthogonal to recent unified image-to-image frameworks, it can be integrated with them to extend their versatility to truly unseen control domains.

[1] Tumanyan, Narek, et al. "Plug-and-play diffusion features for text-driven image-to-image translation." CVPR 2023.


Limitation

L1. Societal concern about the training dataset

A1. We acknowledge the societal concern about using LAION‑400M and appreciate the recommendation of the cleaned RE‑LAION‑5B dataset. We were not aware of these issues during the submission period while seeking an open-source dataset. We will add this discussion and consider re‑training UFC on safer data before releasing code and checkpoints.

Comment

We appreciate the reviewer for carefully considering our rebuttal. If there are any remaining concerns, we would be willing to engage in further discussion.

Final Decision

This work was carefully reviewed and received quite positive feedback (ratings 5, 5, 4, 4).

All reviewers appreciate the importance of the identified problem (unseen control signals, with little data) in conditional generation. The proposed solution with few-shot (as few as 30 examples) finetuning shows its effectiveness via extensive experiments, as pointed out by the reviewers. All of this makes the work stand out among existing conditional generation papers and positions it to push this direction toward more generalized (or universal) controllable generation. Through the rebuttal, the authors successfully addressed most concerns with detailed results and explanations, which the reviewers acknowledged. The AC agrees with the reviewers that the writing needs to be improved, but this should not outweigh the contribution of the work. Considering all the comments and discussions, the AC decided to accept the paper. The authors are strongly encouraged to include the missing results raised by the reviewers and the necessary explanations for a strong and complete final version. Please also ask more experienced colleagues to proofread the paper to make sure each point is clearly presented.