PaperHub
Rating: 6.1/10
Poster · 4 reviewers
Lowest 3, highest 4, standard deviation 0.4
Individual ratings: 4, 3, 3, 3
ICML 2025

FlexControl: Computation-Aware Conditional Control with Differentiable Router for Text-to-Image Generation

Submitted: 2025-01-19 · Updated: 2025-08-14

Keywords
Diffusion model, controllable image generation, dynamic route, data-driven, efficient inference

Reviews and Discussion

Review
Rating: 4

This paper proposes FlexControl, a framework that introduces a novel gating mechanism for dynamically selecting which blocks to activate in the control network, reducing computational overhead while preserving or improving image quality. The authors conduct experiments on both UNet-based (SD1.5) and DiT-based (SD3.0) architectures across three tasks (depth, canny, seg.), demonstrating the effectiveness of the proposed method.

Questions for Authors

  1. How are the FLOPs in the cost loss $\mathcal{L}_{\mathbf{C}}$ computed such that gradients can be back-propagated to the parameters of the network?
  2. During inference, is it possible to manually adjust the number of activated blocks to achieve an efficiency-performance trade-off?

Claims and Evidence

The claims in the submission are well-supported by the experiments.

Methods and Evaluation Criteria

Yes, the proposed methods make sense for the problem. The evaluation metrics (FID, CLIP score, depth RMSE, canny SSIM, seg mIoU) are all common criteria in the area of controllable image generation.

Theoretical Claims

This paper does not contain any theoretical claims or proofs.

Experimental Design and Analyses

Yes, I've checked all the experiments. Some issues are listed as follows:

  1. Quantitative comparison: Since one of the main contributions of this paper is to reduce computational overhead, it should compare image quality (Table 1) and controllability (Table 2) against the efficient control models mentioned in the Related Work section, such as ControlNeXt [1].
  2. Ablation study: Similarly, the paper lacks a computational complexity comparison (Table 3) with efficient methods. Adding these comparisons would help readers better understand the computational efficiency of the proposed method.
  3. The paper lacks an explanation or ablation study on how to determine the hyperparameter $\lambda_{\mathbf{C}}$ in Equation (18).

[1] Peng, Bohao, et al. "Controlnext: Powerful and efficient control for image and video generation." arXiv preprint arXiv:2408.06070 (2024).

Supplementary Material

Yes, I've reviewed all parts of the supplementary material.

Relation to Existing Literature

This paper falls into the area of controllable image generation. It addresses a key problem in this area that previous methods heavily rely on heuristic network design, and proposes a novel dynamic gating mechanism to solve this problem. This paper is also related to efficient control models, proposing a novel cost loss that controls the sparsity of the network.

Essential References Not Discussed

The essential related works are well-discussed and cited.

Other Strengths and Weaknesses

Strengths:

  1. The dynamic gating mechanism is novel in the area of controllable image generation.
  2. The paper is well-written, the presentation is clear and easy to follow.

Weaknesses:

  1. The desired sparsity $\gamma$ needs to be specified before training. It would be better if a single model could handle all possible values of $\gamma$, further increasing the flexibility of the proposed method.
  2. Compared to other efficient control models, FlexControl only reduces computational overhead but does not decrease the number of parameters (actually doubles the parameters of the original ControlNet), which increases the burden of distributing and deploying the model.

Other Comments or Suggestions

I do not have other comments or suggestions.

Author Response

We sincerely thank the reviewer for acknowledging the novelty and performance of our paper. We hope the following answers address your questions.

Quantitative comparison and ablation on ControlNeXt

We appreciate the reviewer’s concern regarding the need for additional comparisons with efficient control models. In response, we have conducted further experiments on ControlNeXt and OminiControl [1], as described in our reply to Reviewer eHec. It is important to highlight that while optimizing control block efficiency is valuable, our focus is different: we propose a dynamic routing strategy that adaptively determines the most efficient control strategy across different time steps and samples. This approach complements existing efficient control methods rather than solely aiming to reduce the cost of control blocks. Our experiments confirm that integrating our method with these efficient control models further enhances their performance while improving efficiency.

-[1] Tan, Z., Liu, S., Yang, X., Xue, Q. and Wang, X., 2024. OminiControl: Minimal and Universal Control for Diffusion Transformer. arXiv e-prints, pp.arXiv-2411.

The paper lacks explanation or ablation study on how to determine the hyperparameter $\lambda_C$

We appreciate the reviewer’s request for further clarification on determining the hyperparameter $\lambda_C$ in Equation (18). $\lambda_C$ serves as a scaling factor that balances two objectives: the diffusion loss optimizes image quality, while $\mathcal{L}_C$ regulates and enforces control block sparsity. To ensure the trained gating mechanism achieves the desired sparsity while maintaining generation quality, $\lambda_C$ needs to be tuned empirically. While the optimal value may vary slightly across models, our experiments indicate that setting $\lambda_C = 0.5$ provides the best trade-off in practice. To compute $\mathcal{L}_C$, we use a precomputed block-wise FLOPs lookup table, following methodologies from prior work ([a], [b], [c]). This approach provides an efficient and structured way to regulate computational cost while preserving performance.

-[a] Meng, L., Li, H., Chen, B. C., Lan, S., Wu, Z., Jiang, Y. G., & Lim, S. N. (2022). Adavit: Adaptive vision transformers for efficient image recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12309-12318).

-[b] Rao, Y., Liu, Z., Zhao, W., Zhou, J., & Lu, J. (2023). Dynamic spatial sparsification for efficient vision transformers and convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9), 10883-10897.

-[c] Han, Y., Liu, Z., Yuan, Z., Pu, Y., Wang, C., Song, S., & Huang, G. (2024). Latency-aware unified dynamic networks for efficient image recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
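
The lookup-table idea can be illustrated with a minimal sketch in plain Python. It assumes, hypothetically, that the cost loss is the squared gap between the gate-weighted FLOPs fraction and the target ratio; the thread does not state the actual form of Eq. (17), and the per-block FLOPs values and function names below are made up. What it shows is that the table entries are constants, so the loss is differentiable with respect to the gate values alone.

```python
# Hypothetical sketch: FLOPs-aware cost loss driven by a precomputed lookup table.
# The squared-error form and all numbers are illustrative assumptions,
# not the paper's Eq. (17).

def cost_loss(gates, block_flops, gamma):
    """Squared gap between the gate-weighted FLOPs fraction and the target gamma."""
    total = sum(block_flops)
    used = sum(g * f for g, f in zip(gates, block_flops))
    return (used / total - gamma) ** 2

def cost_loss_grad(gates, block_flops, gamma):
    """Analytic gradient of cost_loss with respect to each gate value."""
    total = sum(block_flops)
    used = sum(g * f for g, f in zip(gates, block_flops))
    gap = used / total - gamma
    # The FLOPs entries are constants, so d(loss)/d(gate_i) = 2 * gap * f_i / total.
    return [2.0 * gap * f / total for f in block_flops]

block_flops = [10.0, 20.0, 30.0, 40.0]  # made-up per-block FLOPs (the lookup table)
gates = [1.0, 1.0, 0.5, 0.0]            # soft gate outputs in [0, 1]
loss = cost_loss(gates, block_flops, gamma=0.5)
grad = cost_loss_grad(gates, block_flops, gamma=0.5)
```

In an autograd framework the same expression would be written with tensors, and the gradient would flow through the gates into the router parameters. How hard 0/1 decisions are relaxed during training is a separate question that this sketch does not cover.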

The desired sparsity γ needs to be specified before training. During inference, is it possible to manually adjust the number of activated blocks to achieve an efficiency-performance trade-off?

We appreciate the reviewer’s suggestion regarding an inference-time scaling mechanism. To explore this, we conducted additional experiments in which a ControlNet-Large model was trained with all blocks activated on a segmentation mask control task. During inference, we dynamically adjusted the gating threshold to control the number of activated blocks. Our results confirm that scaling the number of activated blocks at inference is feasible, leading to better performance than the ControlNet baseline while maintaining comparable FLOPs and inference speed. However, the performance does not fully match that of the $\gamma$-aware trained version proposed in the paper, indicating that explicit training with sparsity constraints remains crucial for achieving the optimal efficiency-performance trade-off. Detailed results are presented below.

| Method | Base Model | FID | CLIP_score | mIoU | Speed |
|---|---|---|---|---|---|
| FlexControl (γ=0.5) | SD1.5 | 14.80 | 0.2842 | 0.3751 | 5.21±0.12 it/s |
| FlexControl (w.o. training) | SD1.5 | 19.86 | 0.2732 | 0.3295 | 5.24±0.11 it/s |
| FlexControl (γ=0.7) | SD1.5 | 14.71 | 0.2840 | 0.3775 | 4.94±0.07 it/s |
| FlexControl (w.o. training) | SD1.5 | 16.56 | 0.2778 | 0.3665 | 4.86±0.09 it/s |
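
Adjusting the gating threshold at inference, as described above, could look roughly like the following sketch. The router scores, function names, and the rule for picking a threshold from a target fraction are hypothetical; the thread does not specify how the threshold was chosen.

```python
# Hypothetical sketch: choosing which control blocks to run at inference time
# by thresholding router scores. Scores and names are illustrative only.

def active_blocks(scores, threshold):
    """Indices of blocks whose router score meets or exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]

def threshold_for_budget(scores, keep_fraction):
    """Threshold that keeps roughly keep_fraction of the blocks (at least one)."""
    k = max(1, round(keep_fraction * len(scores)))
    return sorted(scores, reverse=True)[k - 1]

scores = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1]  # made-up router outputs for 6 blocks
t = threshold_for_budget(scores, keep_fraction=0.5)
selected = active_blocks(scores, t)
```

Lowering the threshold activates more blocks and raises cost. With tied scores the selection can exceed the requested fraction, which is one reason a budget-aware trained router may behave differently from post hoc thresholding.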

Issue of parameters

We acknowledge the reviewer’s concern about parameter count. Our goal is not just to reduce control block cost but to develop a dynamic routing strategy that adapts efficiency across time steps and samples. As shown in our experiments and discussed in our response to Reviewer eHec, integrating our method with efficient control models further improves both performance and efficiency, reinforcing the broad applicability of our method.

Review
Rating: 3

This paper proposes FlexControl, a novel method aimed at improving the computational efficiency of ControlNet, an important model for adding controllability in text-to-image generation tasks. Unlike the original ControlNet, which utilizes half of the diffusion architecture as its encoder, FlexControl introduces an additional, fully trainable encoder as a separate copy of the entire diffusion architecture. A differentiable router is trained alongside this encoder to dynamically activate only the necessary blocks required for each specific task. To train this router, authors propose a computation-aware loss function that regularizes the model by matching a predetermined target ratio for reducing Floating Point Operations (FLOPs). The chosen ratio significantly influences both the performance and efficiency of FlexControl. The proposed method demonstrates improved results in terms of both performance and computational efficiency across various conditions, including depth maps, Canny edges, and segmentation masks.

Update after rebuttal

Through the rebuttal and discussion, I’ve come to understand that FlexControl can indeed achieve improved computational efficiency compared to baselines when gamma is properly selected. However, the method still requires careful hyperparameter tuning, which may present practical challenges.

I am raising my score to a weak accept, though I still believe this paper sits on the borderline and could reasonably be rejected.

Questions for Authors

Regarding the SD3 adaptation, which part of the dual-stream block was specifically trained using a trainable copy in FlexControl? Parameters of both modalities? Or only the image modality?

Claims and Evidence

The claims presented in the paper regarding computational efficiency and performance gains through FlexControl require additional evidence. Specifically, a comparative analysis with the open-source community's LoRA baseline [A], which modifies only the input layer's channel count and trains only LoRA parameters without additional models, should be included. Such a baseline may be more efficient in parameter count, FLOPs, and inference speed. Moreover, comparisons with ControlNeXt are essential.

The authors note that FlexControl with Gamma values of 0.5 and 0.7 performs well and maintains similar speed to ControlNet at Gamma=0.5. However, Table 4 reveals that at Gamma=0.3, FlexControl underperforms relative to ControlNet, indicating that performance gains only occur when computational efficiency is equivalent to or less than ControlNet. This raises questions regarding the actual advantage of FlexControl over ControlNet, particularly when improved efficiency corresponds to reduced performance. Further exploration of the lower bound of the gamma value and its impact on model performance is also necessary for a comprehensive evaluation.

[A] Black Forest Labs, https://github.com/black-forest-labs/flux/blob/main/docs/structural-conditioning.md

Methods and Evaluation Criteria

The evaluations used in this paper appropriately assess quality and fidelity across various tasks.

Theoretical Claims

There are no theoretical claims.

Experimental Design and Analyses

A deeper exploration into the effects of varying gamma values, especially their lower limits, would strengthen the experiments.

Supplementary Material

The supplementary material includes the implementation details and the distribution of activated control blocks. The implementation details provided are sufficient to allow reproducibility of the experiments. The distribution figures offer valuable insights to readers, showing the interesting observation that most blocks, except for the initial ones, tend to be predominantly activated during later inference steps.

Relation to Existing Literature

Adding controllability to text-to-image generation models is a critical research topic within visual generative modeling. In this context, computational efficiency emerges as an essential aspect.

Essential References Not Discussed

No issues found.

Other Strengths and Weaknesses

A notable strength of FlexControl is its ability to maintain strong performance, comparable to ControlNet, particularly at gamma values around 0.5. However, its key weakness is evident when aiming for higher computational efficiency (gamma=0.3), where its performance significantly drops below ControlNet's baseline.

Other Comments or Suggestions

No issues found.

Author Response

We sincerely thank the reviewer for the detailed and constructive feedback.

ControlNeXt ... LoRA-based ...

We appreciate the reviewer’s feedback regarding the comparison with other methods. While a direct comparison is not applicable (as our work focuses on control block integration rather than parameter fine-tuning), we have instead integrated our method with two representative approaches: ControlNeXt and OminiControl [1] (a recent popular LoRA-based control method). Specifically, instead of following Flux’s approach of concatenating control tokens into new tokens, OminiControl appends condition image tokens to the noisy image tokens as a longer sequence and leverages LoRA to process them jointly.

Our results show that, unlike methods that control all blocks by default, our approach achieves superior performance with fewer activated blocks, demonstrating its adaptability and broader applicability. Detailed comparisons follow below.

[1] Tan, Z., et al. OminiControl: Minimal and Universal Control for Diffusion Transformer. arXiv e-prints, pp.arXiv-2411.

On segmentation mask

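| Method | Base Model | FID | CLIP_score | mIoU | FLOPs | Speed |
|---|---|---|---|---|---|---|
| ControlNeXt | SD1.5 | 24.16 | 0.2659 | 0.2825 | 51.72 G | 5.34±0.02 it/s |
| FlexControlNeXt (γ=0.3) | SD1.5 | 25.22 | 0.2531 | 0.2644 | / | / |
| FlexControlNeXt (γ=0.5) | SD1.5 | 23.74 | 0.2664 | 0.2819 | / | / |
| FlexControlNeXt (γ=0.7) | SD1.5 | 23.71 | 0.2674 | 0.2841 | / | / |
| FlexControlNeXt (γ=0.8) | SD1.5 | 23.84 | 0.2662 | 0.2841 | / | / |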

On Canny

| Method | Base Model | FID | CLIP_score | SSIM | FLOPs | Speed |
|---|---|---|---|---|---|---|
| OminiControl | FLUX.1 | 22.84 | 0.2830 | 0.4125 | 16.89 T | 2.36±0.00 it/s |
| FlexOminiControl (γ=0.2) | FLUX.1 | 36.62 | 0.2712 | 0.3122 | 10.76 T | 3.42±0.09 it/s |
| FlexOminiControl (γ=0.3) | FLUX.1 | 26.65 | 0.2791 | 0.3668 | 11.45 T | 3.28±0.07 it/s |
| FlexOminiControl (γ=0.5) | FLUX.1 | 22.61 | 0.2886 | 0.4123 | 13.08 T | 3.08±0.10 it/s |
| FlexOminiControl (γ=0.7) | FLUX.1 | 22.39 | 0.2855 | 0.4146 | 14.57 T | 2.80±0.09 it/s |
| FlexOminiControl (γ=0.8) | FLUX.1 | 22.27 | 0.2861 | 0.4153 | 15.40 T | 2.69±0.07 it/s |

Table 4 ... Gamma=0.3 ... Further exploration ... gamma value ...

We appreciate the reviewer’s observations on FlexControl’s performance across $\gamma$ values. Even at $\gamma = 0.3$, our method surpasses standard ControlNet while being significantly more efficient. Though slightly behind the more computationally expensive ControlNet-Large, it achieves over three times the efficiency, highlighting its effectiveness. To provide further insights, we conducted additional ablation studies on segmentation and Canny tasks, analyzing $\gamma$ values from 0.2 to 0.8. The results, detailed below, illustrate the trade-offs between efficiency and performance:

On segmentation mask

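| Method | Base Model | FID | CLIP_score | mIoU | FLOPs | Speed |
|---|---|---|---|---|---|---|
| ControlNet | SD1.5 | 21.33 | 0.2531 | 0.2764 | 233 G | 5.23±0.07 it/s |
| FlexControl (γ=0.2) | SD1.5 | 21.52 | 0.2584 | 0.2995 | 112 G | 5.98±0.09 it/s |
| FlexControl (γ=0.3) | SD1.5 | 17.21 | 0.2713 | 0.3572 | 168 G | 5.64±0.12 it/s |
| FlexControl (γ=0.8) | SD1.5 | 15.59 | 0.2804 | 0.3695 | 448 G | 4.82±0.06 it/s |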

On Canny

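| Method | Base Model | FID | CLIP_score | SSIM | FLOPs | Speed |
|---|---|---|---|---|---|---|
| ControlNet | SD3.0 | 27.21 | 0.2512 | 0.3749 | 3.25 T | 48.34±1.78 s/it |
| FlexControl (γ=0.2) | SD3.0 | 28.11 | 0.2524 | 0.3577 | 1.25 T | 38.21±2.97 s/it |
| FlexControl (γ=0.3) | SD3.0 | 23.39 | 0.2581 | 0.4286 | 1.86 T | 40.83±3.09 s/it |
| FlexControl (γ=0.8) | SD3.0 | 20.72 | 0.2719 | 0.4816 | 4.97 T | 54.05±2.53 s/it |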

We apologize for any confusion caused by our text and table presentation that may have led to misunderstandings. We will refine our wording to ensure greater clarity in the camera-ready version.

SD3 adaptation

In SD3 tasks, we select all transformer blocks as candidates and use our dynamic routing strategy to flexibly decide which transformer block to add control to. We copy all parameters in a block for both modalities.

Reviewer Comment

I appreciate the authors’ detailed rebuttal and the additional results provided.

I’m curious why the ControlNeXt table does not report FLOPs or speed metrics, and why the SD3.0 table reports speed in seconds per iteration (s/it), which differs from other tables.

Author Comment

We appreciate the reviewer's thoughtful consideration of our detailed rebuttal and additional experimental results. Below, we clarify the points raised concerning the reporting of FLOPs and speed metrics:

  1. Regarding the absence of FLOPs and speed metrics in the ControlNeXt table:

ControlNeXt processes control features using a lightweight module, subsequently normalizing these features and applying them across each block. Our method introduces router units solely to determine the applicability of features to these blocks, without incorporating mechanisms for skipping control blocks. Consequently, our proposed approach does not alter inference speed or FLOPs relative to the baseline ControlNeXt model. Therefore, we have reported only the performance metrics for ControlNeXt, omitting unchanged FLOPs and speed metrics for clarity and conciseness.

  2. Rationale for reporting SD3.0 speed metrics in seconds per iteration (s/it):

As detailed in our manuscript (line 377), inference experiments for both SD1.5 and SD3.0 models were conducted on a single RTX2080Ti GPU (22GB memory). However, the computational complexity varies significantly between the two models: SD1.5 FLOPs range approximately from 168G to 561G, whereas SD3.0 FLOPs span from 1.86T to 3.25T. Consequently, inference for SD3.0 is substantially slower than for SD1.5. To illustrate speed differences across the SD3.0 configurations clearly, we reported inference speeds in seconds per iteration (s/it), diverging from the unit (it/s) used in the other tables. We had believed this choice would make the substantial differences in computational cost easier to read. We thank the reviewer for pointing out that it might cause confusion. We have now updated the table format as follows:

| Method | Base Model | Param. | FLOPs | Speed |
|---|---|---|---|---|
| ControlNet | SD1.5 | 0.36 G | 233 G | 5.23±0.07 it/s |
| ControlNet-Large | SD1.5 | 0.72 G | 561 G | 4.02±0.05 it/s |
| FlexControl (γ=0.7) | SD1.5 | 0.73 G | 393 G | 4.94±0.07 it/s |
| FlexControl (γ=0.5) | SD1.5 | 0.73 G | 280 G | 5.21±0.12 it/s |
| FlexControl (γ=0.3) | SD1.5 | 0.73 G | 168 G | 5.64±0.12 it/s |
| ControlNet | SD3.0 | 1.06 G | 3.25 T | (20.68±0.56)E-3 it/s |
| ControlNet-Large | SD3.0 | 2.02 G | 6.22 T | (16.82±0.51)E-3 it/s |
| FlexControl (γ=0.7) | SD3.0 | 2.03 G | 4.35 T | (19.18±0.78)E-3 it/s |
| FlexControl (γ=0.5) | SD3.0 | 2.03 G | 3.11 T | (21.86±0.86)E-3 it/s |
| FlexControl (γ=0.3) | SD3.0 | 2.03 G | 1.86 T | (24.49±0.82)E-3 it/s |

We will apply this update in the next version of the manuscript.
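
As a quick sanity check on the unit change, the mean speeds in the updated table are close to the reciprocals of the s/it means reported earlier in this thread. The sketch below assumes both tables come from the same measurements, which the thread does not confirm. The ± values are not reproduced, because a standard deviation does not convert by taking a reciprocal.

```python
# Convert a mean speed from seconds-per-iteration to iterations-per-second.
# Inputs are the SD3.0 Canny means quoted earlier in this thread.

def s_per_it_to_it_per_s(seconds_per_iteration):
    """Reciprocal conversion; valid for the mean only, not for the +/- spread."""
    return 1.0 / seconds_per_iteration

controlnet_sd3 = s_per_it_to_it_per_s(48.34)  # ControlNet, SD3.0
flex_03_sd3 = s_per_it_to_it_per_s(40.83)     # FlexControl (gamma = 0.3), SD3.0
```

Both results land within rounding distance of the table entries (about 20.7E-3 and 24.5E-3 it/s respectively).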

Review
Rating: 3

This paper studies the Computation-Aware ControlNet by proposing a dynamic routing strategy which dynamically selects blocks to activate at each denoising step. It aims at adjusting control blocks based on timestep and conditional information while maintaining (or even improving) generation quality. The experimental results show its effectiveness (higher score).

update after rebuttal

The response partially addressed my concerns. I would like to increase my initial rating, but I am also not against rejection, as the other reviewers and I still have some concerns about the performance (such as FLOPs, computational cost, and the impact of hyperparameters).

Questions for Authors

See Weaknesses.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

N/A

Experimental Design and Analyses

No issues.

Supplementary Material

All parts.

Relation to Existing Literature

It is helpful for designing an efficient ControlNet for the image generation community.

Essential References Not Discussed

This paper proposes a dynamic routing strategy, a topic that has been widely studied in the computer vision community, e.g. [1,2,3], and even in text-to-image generation [4], yet the paper does not discuss these works.

[1] Cai et al., Dynamic Routing Networks

[2] Wang et al., SkipNet: Learning Dynamic Routing in Convolutional Networks

[3] Ma et al., DiT: Efficient Vision Transformers with Dynamic Token Routing

[4] Xue et al., RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths

Other Strengths and Weaknesses

Strengths:

The proposed method is efficient, achieving better performance with lower FLOPs.

Weaknesses:

  • The proposed method requires more parameters than the typical ControlNet.

  • In Tab. 3, why is there no comparison with ControlNet++? ControlNeXt should also be compared with respect to performance, parameters, and FLOPs.

  • In Tab. 5, what is the control signal? Why are the relevant results not provided in Tab. 1 for a clear comparison?

  • How is the cost loss (Eq. 17) optimized?

  • There are many previous works on dynamic routing strategies. The authors do not review them, and the proposed implementation does not contain anything new.

Other Comments or Suggestions

Author Response

We hope the answers below resolve all the clarity issues.

The proposed method requires more parameters than the typical ControlNet.

Yes, our method requires more parameters than standard ControlNet. However, compared to ControlNet-Large, it achieves better generation quality and controllability while halving inference FLOPs and improving inference speed. As discussed in Remark (page 4), the additional parameters have a negligible impact on GPU memory and inference performance. Importantly, our focus is not on parameter efficiency within control blocks but on dynamic routing for adaptive efficiency. Our approach is also compatible with recent parameter-efficient control methods, such as ControlNeXt and OminiControl, as noted in our response to Reviewer eHec.

No comparison with ControlNet++ and ControlNeXt on FLOPs and parameters

We would like to clarify that our research does not focus on designing more efficient control block structures but rather on investigating how to integrate control blocks effectively and efficiently into pre-trained diffusion models. In this sense, our work is orthogonal to ControlNeXt, which is why we initially excluded it from Table 1 and Table 2. However, our approach can potentially be adopted by ControlNeXt to further enhance its performance.

We appreciate the reviewer's suggestion for additional experiments. To evaluate our method in the context of ControlNeXt, we refer the reviewer to our response to Reviewer eHec. Briefly, since ControlNeXt applies control to all blocks by default, our method achieves comparable performance to ControlNeXt at $\gamma = 0.3$ and outperforms it at $\gamma = 0.5$.

Regarding ControlNet++, it primarily focuses on improving training strategies while maintaining the same inference cost as the standard ControlNet. In contrast, our work explicitly targets inference efficiency. Thus, a direct comparison is not applicable. Nevertheless, we acknowledge the relevance of these methods and will consider discussing them in future versions of our paper.

In Tab.5, what is the control signal? Why do not provide …

Table 5 presents an ablation study on different control strategies under the "Canny edge" control signal. We agree with the reviewer's suggestion and will reorganise the table in the camera-ready version to improve clarity.

How to optimize the cost loss (Eq. 17).

The FLOPs loss is computed by referencing pre-computed FLOPs values from a lookup table and is then combined with the diffusion loss using a scaling factor. This approach aligns with prior research [a,b,c] (cited in our response to Reviewer eKoB) on dynamic and efficient neural network architectures that optimize computational cost while maintaining performance. We will add these works to the related work section.

There are many previous works for dynamic routing strategy…

We appreciate the reviewer’s suggestion to discuss prior work on dynamic routing strategies. While our method shares some conceptual similarities with existing approaches, its goal and implementation are fundamentally different. Below, we clarify these distinctions concisely:

  • [1] (Dynamic Routing Networks): Introduces a model with multiple branches, where a learned router selects the best path for each input to improve efficiency. Their method is trained from scratch for classification and focuses on reducing FLOPs. In contrast, our approach dynamically adjusts the influence of a fine-tuned ControlNet within a pre-trained diffusion model, aiming for controlled generation rather than computational savings.

  • [2] (SkipNet): Uses a gating mechanism to decide whether to skip certain convolutional layers, reducing computation for easier inputs. Unlike SkipNet, which skips layers within a single network, our method balances contributions between a fixed diffusion backbone and a fine-tuned control module. We modulate control strength from ControlNet to a pre-trained diffusion model for better adaptation.

  • [3] (DiT: Dynamic Token Routing): Dynamically routes image tokens within a Vision Transformer, deciding which tokens to process at each layer for efficiency. While DiT optimizes computation by selectively processing tokens, our method adjusts how much the ControlNet's control blocks influence the final output, without altering token flow within the transformer.

  • [4] (RAPHAEL): A large-scale diffusion model using a mixture-of-experts (MoE) to assign different paths for different styles or concepts. RAPHAEL is trained from scratch on massive datasets, while our approach efficiently adapts a pre-trained diffusion model using dynamic routing, making it suitable for low-data settings.

These prior works focus on optimizing efficiency or designing new architectures, while our method adapts an existing pre-trained model for more flexible and controlled generation. We appreciate the reviewer’s suggestion and will incorporate this discussion into the final version of our paper.

Review
Rating: 3

The paper addresses the limitations of existing ControlNet implementations in diffusion-based generative models, which often rely on ad-hoc heuristics for selecting control blocks. The authors employ a trainable gating mechanism to dynamically select which blocks to activate at each denoising step.

Questions for Authors

  1. In Table 3, why does the speed decrease when $\lambda$ changes from 0.3 to 0.5 with SD1.5?
  2. How should $\lambda_C$ be chosen, and does its value affect performance?

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

No theoretical claims.

Experimental Design and Analyses

Yes, the paper uses various metrics and baselines to support the effectiveness of the proposed method.

Supplementary Material

I have checked the appendix, and there is no other supplementary material.

Relation to Existing Literature

The introduction of a computation-aware training loss aligns with prior research on optimizing computational efficiency in generative models.

Essential References Not Discussed

No

Other Strengths and Weaknesses

Strengths:

  1. The paper is well-written and organized.
  2. The paper introduces a novel dynamic control mechanism that enhances the adaptability of diffusion models, moving away from static, heuristic methods.
  3. The paper includes extensive experiments across multiple architectures (UNet and DiT) and various tasks, providing robust evidence of FlexControl's effectiveness.

Weaknesses:

  1. The ablation study presented in the paper lacks rigor and comprehensiveness. The authors should investigate how the performance of FlexControl is affected by replacing the proposed gating mechanism with simpler alternatives, such as random selection of control blocks.

  2. At the optimal performance setting ($\lambda = 0.5$), both the number of parameters and the FLOPs (Floating Point Operations) are worse than those of ControlNet. This raises concerns about the validity of the claims regarding efficiency.

  3. Also, no code is provided, which makes it harder to evaluate the method.

Other Comments or Suggestions

None

Author Response

We thank the reviewer for the positive feedback and support of our work. We hope to have answered all of your questions satisfactorily below. Please let us know if you see any further issues in the paper that must be clarified or addressed.

The ablation study … with simpler alternatives, such as random selection of control blocks…

We sincerely thank the reviewer for the suggestion to conduct ablation studies that replace our gating mechanism with simpler alternatives, such as random selection of control blocks. In response, we have considered the following alternative sampling strategies.

As suggested, we first evaluate the simplest strategy—uniform sampling—where 50% of the control blocks are randomly selected, denoted as Uniform.

The experimental results corresponding to these sampling strategies are presented below.

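| Method | Base Model | FID | CLIP_score | mIoU | FLOPs | Speed |
|---|---|---|---|---|---|---|
| Uniform | SD1.5 | 19.14 | 0.2600 | 0.3024 | 323 G | 4.95±0.07 it/s |
| FlexControl (γ=0.3) | SD1.5 | 17.21 | 0.2713 | 0.3572 | 168 G | 5.64±0.12 it/s |
| FlexControl (γ=0.5) | SD1.5 | 14.80 | 0.2842 | 0.3751 | 280 G | 5.21±0.12 it/s |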

Notably, compared to all random sampling strategies, our approach with $\gamma = 0.3$ achieved superior performance and higher inference speed. Moreover, the FID score improves from 17.21 to 14.80 when $\gamma$ is increased from 0.3 to 0.5. This ablation study further demonstrates the effectiveness of our method, and we will incorporate these findings into the camera-ready version of our paper.
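
For reference, the Uniform baseline described above (randomly activating 50% of the control blocks) can be sketched as follows. The block count of 12 and the function name are hypothetical, and the thread does not say whether the random mask is redrawn at every denoising step or fixed per sample.

```python
import random

# Hypothetical sketch of a uniform random block-selection baseline.

def uniform_block_mask(num_blocks, keep_fraction, rng):
    """Randomly activate a fixed fraction of control blocks."""
    k = round(keep_fraction * num_blocks)
    chosen = set(rng.sample(range(num_blocks), k))
    return [i in chosen for i in range(num_blocks)]

rng = random.Random(0)  # seeded so the draw is reproducible
mask = uniform_block_mask(num_blocks=12, keep_fraction=0.5, rng=rng)
```

Unlike a learned router, this mask ignores the timestep and the condition, so it controls only how many blocks run, not which ones matter.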

At the optimal performance setting ($\lambda = 0.5$) ... regarding efficiency ...

We would like to clarify that our method with $\gamma = 0.3$ already outperforms the ControlNet baseline in both SD1.5 and SD3.0 experiments, as demonstrated in Table 4 and Table 5 of the original paper, while also achieving significantly lower FLOPs, as shown in Table 3.

Our choice of $\gamma = 0.5$ for comparison in Table 1 and Table 2 is not because it represents the optimal value, but because it provides a more direct comparison with the ControlNet baseline in terms of computational cost (280G FLOPs vs. 233G FLOPs). In contrast, when $\gamma = 0.3$, the FLOPs are substantially lower at just 168G. We will add the results for all $\gamma$ values to Table 1 and Table 2 in the camera-ready version to improve clarity.

Code links

We provide an anonymous link for comparison and reproduction: https://github.com/Anonym916/Anonymity

In Table 3, why does the speed decrease from 0.3 to 0.5?

Since $\gamma$ represents the expected overall fraction of activated blocks, increasing $\gamma$ results in a higher number of active blocks, leading to increased computational cost (FLOPs) and consequently lower inference speed. This explains the observed decrease in speed from $\gamma = 0.3$ to $\gamma = 0.5$ in Table 3. We will refine our text in the camera-ready version to improve clarity.
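
The relationship between the target ratio and cost can be checked against the SD1.5 numbers reported in this thread. The sketch below tests a simple linear model in which FLOPs equal the target ratio times the ControlNet-Large cost (561 G). This model is only an observation about the reported figures, not a formula taken from the paper.

```python
# Check whether the reported FLOPs scale roughly as gamma * FLOPs(ControlNet-Large).
# Numbers are copied from the SD1.5 tables in this thread.

CONTROLNET_LARGE_GFLOPS = 561.0

# gamma -> reported FlexControl FLOPs in GFLOPs
reported_gflops = {0.2: 112.0, 0.3: 168.0, 0.5: 280.0, 0.7: 393.0, 0.8: 448.0}

def predicted_gflops(gamma):
    """Linear estimate: target ratio times the all-blocks-active cost."""
    return gamma * CONTROLNET_LARGE_GFLOPS

# Largest absolute deviation between the linear estimate and the reported value.
max_gap = max(abs(predicted_gflops(g) - f) for g, f in reported_gflops.items())
```

Every reported value falls within about 1 GFLOP of the linear estimate, which is consistent with cost growing in proportion to the target ratio and with the slower speeds at larger ratios.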

How should $\lambda_C$ be chosen, and does its value affect performance?

$\lambda_C$ is a scaling factor that balances the diffusion objective and the FLOPs constraint objective. Since these two loss functions operate on different scales, $\lambda_C$ is necessary to ensure proper weighting between them. We tune this value to achieve the target activation percentage while maintaining overall performance. Proper selection of $\lambda_C$ ensures that the model activates the desired number of blocks without significantly degrading generation quality. We use $\lambda_C = 0.5$ in all our experiments as an empirical value.

Final Decision

After reading the authors’ rebuttal and discussing intensively, all reviewers come to the consensus of accepting this paper. The AC agrees with the reviewers that the new idea proposed in this paper to improve the computational efficiency of ControlNet is nice and this paper did make some valuable contributions to the community.