T-Stitch: Accelerating Sampling in Pre-Trained Diffusion Models with Trajectory Stitching
We propose Trajectory Stitching, a simple but effective technique that leverages small pretrained diffusion models to accelerate sampling in large pretrained diffusion models without training.
Abstract
Reviews and Discussion
This paper introduces a training-free acceleration technique named T-Stitch for diffusion models. The core idea of the approach is to employ a compact model for early timesteps and a more substantial model for later stages. The authors have provided empirical evidence that model performance remains unaffected even when the lighter model is employed for the initial 40% of steps. While the proposed method is simple and efficacious, parts of its evaluation appear to rely heavily on empirical evidence, and in my opinion, it falls into a trap typical of this kind of paper: not including further studies of its limitations.
Strengths
Generally, I believe this is a good paper backed by solid experimentation. It has extensive comparative analysis involving various timesteps and samplers, and also compares itself against other methods, including those that are training-based, training-free, and search-based.
The paper is also well written and clearly motivated.
Weaknesses
In my view, theoretical insights and empirical studies hold equal value; the simplicity of an idea does not detract from its merit if it is proven effective, especially in a topic like Efficient AI. However, my primary issue with such papers lies in my preference for a clear explanation of the method's limitations, supported by dedicated sets of experiments.
First, the authors of the T-Stitch paper state that 40% is an appropriate cutoff for switching models, a decision purely grounded in empirical evidence. This raises the question of how well-founded this choice is. If I were to apply this switching method to a different pair of diffusion models, would the 40% value still be relevant? Intuitively, the cutoff point likely hinges on the performance disparity between the more and less powerful models. From that perspective, if you put the model difference (strong-model FLOPs minus weak-model FLOPs) on the x-axis and the best cut-off point on the y-axis, would you simply expect a flat line at a 40% cut-off?
Second, although the authors did claim that the method can go beyond pair-wise, and have demonstrated how switching (I would maybe actually call this switching rather than stitching) can happen across 3 models, the limitations of this remain unclear. Clearly, an increased number of models would complicate the decision-making on where to switch, and potentially turn this method into a search-based one. More importantly, there must be certain limitations on this switching, especially when one has limited diffusion time steps. Suppose you have N models to stitch/switch across M time steps: as N becomes larger or M becomes smaller, the return of this optimization should inevitably diminish.
Also something minor: in Figure 5, the bottoms of the images are cropped.
Questions
Please see the two points about limitations that I have raised in my Weaknesses section.
Details of Ethics Concerns
I do not think I have any concerns here.
Thanks for your valuable feedback. We would like to address your concerns below.
Q1 - Concerns on the ideal cut-off for switching models.
Please refer to our general response.
Q2 - Limitations when T-Stitch goes beyond pair-wise.
Applying T-Stitch beyond the pairwise setting naturally increases the number of potential allocation strategies, and we agree with the reviewer that this introduces additional challenges. To address this, our initial manuscript provides a practical guideline (Section A.2) for such scenarios by framing T-Stitch as a compute allocation problem, which aims to find a few configuration sets from a lookup table that satisfy a computational budget and then apply them to generation tasks, as illustrated in the sketch below.
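For concreteness, here is a toy sketch of this lookup-table framing. All names, entries, and numbers are hypothetical placeholders for illustration, not measured results or the exact procedure of Section A.2; lower "quality" means better (FID-like):

```python
# Hypothetical lookup table: each configuration records the fraction of
# steps assigned to each denoiser (S/B/XL), its relative sampling cost,
# and a quality score where lower is better (FID-like). Numbers are
# illustrative placeholders only.
lookup_table = [
    {"fractions": {"S": 1.0, "B": 0.0, "XL": 0.0}, "cost": 0.25, "quality": 3.0},
    {"fractions": {"S": 0.4, "B": 0.2, "XL": 0.4}, "cost": 0.60, "quality": 1.2},
    {"fractions": {"S": 0.0, "B": 0.0, "XL": 1.0}, "cost": 1.00, "quality": 1.0},
]

def best_under_budget(table, budget):
    """Return the best-quality configuration whose relative cost fits the budget."""
    feasible = [cfg for cfg in table if cfg["cost"] <= budget]
    return min(feasible, key=lambda cfg: cfg["quality"]) if feasible else None

print(best_under_budget(lookup_table, budget=0.7))  # -> the S/B/XL mix
```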
Besides, with more models becoming available, as mentioned by the reviewer, our experiments in Table 3 indicate that simply adopting a small model to speed up the largest model performs favorably compared to the searched combination of 4 denoisers by DDSM. Notably, as mentioned in Lines 446–447, we only adopt the smallest network in their model family to accelerate the largest network. This suggests that, in practice, we may not need many denoisers along the sampling steps to achieve a speedup. As our current study has demonstrated broad effectiveness in many practical scenarios, we believe there is potential for future work to extend this idea more intelligently beyond the pair-wise setting.
I do not think my concerns are addressed. Neither deferring to future work nor explaining the trade-offs adequately addresses the requested study of limitations. Consequently, my score remains unchanged.
We thank the reviewer for engaging in the rebuttal discussion with us. We appreciate the opportunity to further clarify a few things below.
1. Should we always expect a flat line at the 40% cutoff in T-Stitch?
As initially discussed in our general response, the 40% threshold might not hold for all use cases. We provide further explanations with two examples in Section A.23 of the revised submission.
CFG scale. In Figure 36, we show that different classifier-free guidance (CFG) scales may affect this optimal threshold in our experiments with DiT models. By default, we set the CFG scale to 1.5 as it is the default value in DiT evaluation. However, under a CFG scale of 2.0, we observe that the optimal threshold occurs at around 60%. Still, we should not assume a larger CFG scale would always help T-Stitch, since FID sometimes cannot reflect the desired image quality, as mentioned in the SDXL report. Our aim is to demonstrate that the optimal threshold can be affected by the CFG scale, which is one of the limitations of T-Stitch.
Different models. It is intuitive that different models behave differently when using T-Stitch. For example,
- a) In Figure 37, the best cut-off for DiT at CFG scale of 1.5 is around 40%.
- b) On the other hand, the experiment on BK-SDM Tiny and SD v1.4 indicates that the optimal cut-off exists before the 40% estimate.
- c) In our initial submission, Lines 348-350 and Table 1, we have already shown that LDM-S can replace ~50% of the early steps of the baseline LDM with comparable or even better FID.
Overall, the optimal switching point can vary depending on the specific characteristics of the model pairs being used and the hyperparameters during image generation, while 40% serves as a reasonable general guideline as demonstrated in our experiments.
In practice, determining the optimal switching point in T-Stitch can be done very efficiently, as discussed in our general response. For optimal results, we recommend conducting these efficient preliminary experiments to determine the ideal switching point for specific configurations.
2. T-Stitch goes beyond pairwise, …When N becomes larger or M becomes smaller, the return of this optimization should inevitably diminish.
Exploring this scenario through experiments is quite challenging for us since we do not have access to such a number of pretrained models, e.g., switching among 20 models while sampling only 10 timesteps. Thus, the requirement for pretrained models naturally becomes one limitation of T-Stitch beyond the pairwise setting, since it relies on publicly available model weights.
Additionally, different models may have different optimal CFG scales. This means that combining multiple models along the same sampling trajectory could result in a much larger search space compared to pairwise combinations, making comprehensive evaluation challenging.
We thank the reviewer again and have included the above discussion in our limitations at Section A.23. Due to the limited rebuttal period, we leave more in-depth experiments for future work.
Best regards,
Authors of Submission 8597
The paper introduces a new method to speed up the sampling efficiency of diffusion models. By using smaller models for the initial denoising steps and larger models in later stages, this approach greatly boosts sampling speed while keeping quality on par. Notably, this method is distinct from existing training-based or training-free techniques aimed at improving sampling efficiency. The authors demonstrate the effectiveness of their approach through extensive experiments, highlighting improved quality-efficiency trade-offs with a clear Pareto frontier.
Strengths
Overall this is a well-written paper which presents a simple but effective approach for accelerating sampling speed of large diffusion models. The authors convey their ideas clearly and support their approach through extensive experiments. I guess the key significance of this stitching approach is that it is orthogonal to other techniques, like model distillation or improved ODE solvers, allowing it to be easily combined with other methods to further reduce inference time.
Weaknesses
Even if we ignore the inference time needed for running the small models, the time to generate samples of comparable quality can still be reduced by 30-40% at most. It is hard to call this method a groundbreaking technique for improving sampling efficiency in diffusion models. While the paper presents a comparison of the Pareto frontiers between T-stitching and M-stitching, it might be more insightful to compare it with methods like progressive distillation, which can be much faster and does not need to store multiple models.
Additionally, the approach uses models of different scales along the same denoising trajectory, which necessitates that both the small and large models predict score functions in the same latent space. This requirement may limit its applicability.
Questions
The work primarily relies on the observation of the alignment of noise predictions between models of different scales during the early stages of the denoising process (Fig. 3). While this is an intriguing phenomenon, the paper does not provide sufficient explanation for why this occurs. Furthermore, the magnitude of the latent vectors is also important. Does the L2-distance exhibit a similar pattern as shown in Fig. 3?
I believe that the requirement for a shared latent space is a strict condition for this method. It is unclear whether this method is also robust for models trained with different configurations, such as varying noise schedules (like variance-preserving versus cosine) and different diffusion steps (e.g., T=100 versus T=500).
Is it possible that small models trained only over larger T steps (i.e., the early denoising steps) yield better results?
Thanks for your constructive comments. We would like to address your additional concerns as follows.
Q1 - Concerns on the reduced time cost from T-Stitch, and comparison with progressive distillation.
Essentially, T-Stitch provides a free lunch for accelerating the large diffusion model sampling by directly adopting a small model. This speedup not only demonstrates favorable efficiency gain (“meaningful computational savings at little to no cost to the quality of generated images”, as recognized by Reviewer 4dLD), but is also broadly applicable ("Compatible with existing acceleration techniques", as recognized by Reviewer Aq9T).
Relation with step-distillation based methods. In fact, T-Stitch allocates different compute budgets at different steps, which is complementary to reducing sampling steps rather than competing with it. In Figure 33, we have shown that T-Stitch works with step-distilled models as a complementary acceleration technique. Furthermore, we would like to note that T-Stitch is training-free and thus may not be directly comparable to training-based approaches. However, in Figure 15, we additionally provide an ablation study with a direct comparison for comprehensive reference.
Q2 - Shared latent space may limit the applicability of T-Stitch.
We have never assumed that all different models share the same latent space, as that would be impractical. Also, not all models in the public model zoo can achieve the same effectiveness in T-Stitch.
In fact, we designed T-Stitch starting from the insight that different diffusion models can share similar sampling trajectories if trained on the same data distribution (Lines 95-98), which points out that we can directly allocate an efficient small model to the early steps to accelerate large-model sampling. Indeed, we are able to build upon those small models to validate our idea.
Besides, we have demonstrated that T-Stitch is generally applicable to many scenarios, as shown in our experiments and highly recognized by Reviewer 4dLD ("various backbone architectures, with various samplers") and Reviewer Aq9T ("Broader Impact & Applications"). Therefore, the shared latent space does not hinder the broad applicability of T-Stitch in practice.
Q3 - Further explanations on Figure 3, and compared to L2 distance.
Initially, we explained this phenomenon in Lines 94-101 with two insights: 1) different diffusion models trained on similar datasets would capture the same data distribution; 2) moreover, recent works have shown that, compared to large diffusion models, small diffusion models are able to capture relatively good low frequencies at the early steps, indicating the power of small diffusion models to generate good global image structures given the same condition.
We further evidence the second insight in Figure 17 and show that it actually happens: applying DiT-S at the early steps only minorly affects the global structures (tiger, table, dog, etc.), while more details are gradually lost as the fraction of DiT-S timesteps increases. This experiment suggests that we can exploit the advantage of small models in capturing good low frequencies to achieve a speedup.
"Does the L2-distance exhibit a similar pattern as shown in Fig. 3?"
Yes. In Section A.21 of our revised submission, we show that the L2 distances between different DiT models at the early denoising steps are much smaller than those of later steps, similar to the observation in Figure 3.
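For reference, a minimal sketch of how such per-step agreement can be measured, assuming both models are plain noise predictors `eps = model(x, t)` in the same latent space and a diffusers-style scheduler; this is not necessarily the exact protocol behind Figure 3 or Section A.21:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def prediction_agreement(model_a, model_b, scheduler, x, timesteps):
    """Compare two denoisers' noise predictions along a single trajectory
    (driven by model_a). Returns per-step cosine similarity and L2 distance."""
    cos, l2 = [], []
    for t in timesteps:  # ordered from high noise to low noise
        eps_a, eps_b = model_a(x, t), model_b(x, t)
        cos.append(F.cosine_similarity(eps_a.flatten(1), eps_b.flatten(1)).mean().item())
        l2.append((eps_a - eps_b).flatten(1).norm(dim=1).mean().item())
        x = scheduler.step(eps_a, t, x).prev_sample  # continue the trajectory with model_a
    return cos, l2
```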
Q4 - A shared latent space is a strict condition, T-Stitch for models trained with different configurations such as noise schedules and different sampling steps.
It is quite challenging for us to find pretrained diffusion models with different noise schedules or diffusion steps in one code repository, as many of them are trained by different authors and contain very different configurations. Intuitively, models trained with different configurations (e.g. linear noise schedule and cosine noise schedule) differ more significantly at the intermediate timesteps, thus applying T-Stitch in this case possibly hurts the generation quality.
Besides, we would like to mention that our main goal is to accelerate the sampling process of a large pretrained diffusion model. In this case, our contribution is to show that a small model sharing a latent space with the large model provides a free lunch for accelerating diffusion model sampling in many practical scenarios. At this stage, our work has demonstrated broad effectiveness and has been recognized by most reviewers. Thus, we will leave those explorations for future work.
Q5 - Is it possible that small models trained only over larger T steps (aka, early denoising steps) yield better results?
According to our experiments in Figure 18, DiT-S checkpoints at different training steps (400K to 5000K) perform similarly at the early sampling steps in T-Stitch, while differing more significantly at the later sampling steps. Thus, we speculate that training DiT-S only over the early steps (i.e., making it a better expert in this range) could perform similarly in terms of FID when applying T-Stitch.
The author's rebuttal has addressed my concerns, and I have raised my score to 6.
Thank you for your prompt reply and for raising the score. We are pleased that your concerns have been addressed.
Best regards,
Authors
The paper proposes a method to accelerate sampling in diffusion models by using a sequence of two denoising networks, a smaller one followed by a larger one (instead of using the same network for all sampling steps as is traditionally done). In their experiments, they show their method can lead to meaningful computational savings at little to no cost to the quality of generated images.
Strengths
- Their main idea of leveraging models of various sizes throughout the diffusion sampling process is simple, yet it is shown to be effective. The simplicity is an added benefit in my opinion, as it makes the method more reproducible and more likely to be adopted
- I also believe their idea to be novel (though I am not fully up to date with the diffusion literature due to its high pace)
- The experiments are very comprehensive, they try out their trajectory-stitching approach on various backbone architectures (DiT, UNet), with various samplers, for unconditional/conditional cases etc.
- Also, I like how instead of proposing yet another new efficient diffusion model (and thus contributing to the model zoo), the authors find a smart way to combine/reuse the existing models via their trajectory-stitching approach
Weaknesses
- I think the writing can be improved. For the camera-ready it would make sense to move the background/preliminaries to the main text and perhaps to move some of the experiments to the appendix. Also, I find Section 3 quite chaotic (it talks about too many different things, from motivation to model design and connections to other efficiency techniques like speculative decoding)
- It is not clear how to select the switching point/threshold between the small and large model (r1). While I understand that by varying it you can get a Pareto frontier, however, that still requires running/evaluating a large number of candidate thresholds.
Questions
- Your idea reminds me a bit of works on early-exit diffusion models [1,2] where the depth of the denoising network is made adaptive based on the estimated difficulty of the sampling step. It could be interesting to draw further parallels between early-exit and your stitching approach
[1] AdaDiff: Accelerating Diffusion Models through Step-Wise Adaptive Computation
[2] DuoDiff: Accelerating Diffusion Models with a Dual-Backbone Approach
Thanks for your very positive reviews and constructive comments. We would like to address your additional questions below.
Q1 - Paper polishing.
Thanks for your great advice, we will polish our manuscript based on those comments.
Q2 - Selecting the switching point between the small and large model.
Please refer to our general response.
Q3 - Further discussion with early-exit works.
In general, we aim to explore compute budget allocation for diffusion model sampling, which is orthogonal to individual model acceleration techniques such as early-exiting or model compression that specifically focus on one model, as discussed in Section A.8.
Compared to recent early-exit works, such as DeeDiff [A], AdaDiff [B], and DuoDiff [C], T-Stitch does not require training since we directly drop in the small model at the early denoising steps. Furthermore, compared to Adaptive Score Estimation (ASE) [D], which heuristically designs block-exiting strategies based on different architectures and then finetunes the target diffusion model with substantial training cost, we found our speed-quality trade-offs to be clearly better under an equal experimental setting, as shown below.
| Name | FID-5K | Acceleration |
|---|---|---|
| DiT-XL ([D] implementation) | 9.10 | - |
| D1-DiT | 8.89 | 14.38% |
| D3-DiT | 8.99 | 20.99% |
| D4-DiT | 9.19 | 28.70% |
| D6-DiT | 11.41 | 36.80% |
The above results are adopted from Table 3 of [D]. FID-5K is evaluated based on ImageNet-256 and DDIM sampler. “Acceleration” refers to the acceleration in sampling speed. “n” in “Dn-DiT” represents the acceleration scale. Details for different settings can be found in Table 2 of ASE [D].
| Name | FID-5K | Acceleration |
|---|---|---|
| DiT-XL (our implementation) | 9.20 | - |
| T-Stitch (10%) | 9.17 | 7.84% |
| T-Stitch (20%) | 8.99 | 18.71% |
| T-Stitch (30%) | 9.03 | 32.00% |
| T-Stitch (40%) | 9.95 | 50.00% |
| T-Stitch (50%) | 10.06 | 75.53% |
The above results are from our Figure 1, which is based on the same experimental setting: ImageNet-256, DDIM sampler, and FID-5K. Note that due to different implementations, our DiT-XL baseline performance can be slightly different. We have included this comparison in Section A.22 of the revised submission.
[A] Tang, Shengkun, et al. "Deediff: Dynamic uncertainty-aware early exiting for accelerating diffusion model generation." arXiv preprint arXiv:2309.17074 (2023).
[B] Zhang, Hui, et al. "Adadiff: Adaptive step selection for fast diffusion." arXiv preprint arXiv:2311.14768 (2023).
[C] Fernández, Daniel Gallo, et al. "DuoDiff: Accelerating Diffusion Models with a Dual-Backbone Approach." arXiv preprint arXiv:2410.09633 (2024).
[D] Moon, Taehong, et al. "Early Exiting for Accelerated Inference in Diffusion Models." ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling. 2023.
Thanks for your answers. I acknowledge I have read the rebuttal as well as other reviews. I find the rebuttal convincing enough so I am keeping my recommendation for paper's acceptance.
This paper introduces T-Stitch, a training-free approach to accelerate sampling in diffusion models by strategically utilizing different-sized models across the denoising trajectory. The key insight is that small and large models trained on the same data distribution learn similar encodings, particularly in early steps where low-frequency components dominate. By leveraging this property, T-Stitch uses smaller models for early steps (global structure) and larger models for later steps (fine details), achieving significant speedup without quality degradation. The method demonstrates broad applicability across various architectures (DiT, U-Net, Stable Diffusion) and shows interesting benefits for stylized models' prompt alignment. Extensive experiments validate the effectiveness across different settings, samplers, and guidance scales.
Strengths
Novel & Foundational Insight
- Deep understanding of diffusion models' behavior across timesteps
- Thorough empirical validation of latent space similarity between models
- Clear frequency analysis supporting the theoretical foundation
- Novel perspective on leveraging model size differences temporally
Practicality
- Training-free nature enables immediate deployment
- Compatible with existing acceleration techniques
- Works across various architectures and model families
- Clear implementation guidelines and deployment considerations
Comprehensive Empirical Validation
- Extensive experiments across multiple architectures
- Thorough ablation studies covering various aspects
- Clear demonstration of speedup-quality tradeoffs
Broader Impact & Applications
- Unexpected benefits in prompt alignment for stylized models
- Natural interpolation between style and content
- Practical applications in Stable Diffusion ecosystem
- Potential implications for efficient model deployment
Weaknesses
Critical Absence of Limitations Analysis
- Paper lacks a dedicated section for discussing limitations
- No systematic analysis of failure cases
- Insufficient discussion of edge cases and potential risks
- Missing critical self-reflection on method boundaries
Theoretical Gaps
- No mathematical justification for the 40% threshold
- Lack of theoretical guarantees for quality preservation
- Missing analysis of optimal model size ratios
- Incomplete understanding of feature compatibility requirements
Architectural Considerations
- Limited analysis of cross-architecture compatibility
- No clear guidelines for multi-model (>2) scenarios
- Insufficient investigation of feature space alignment
- Missing discussion of architecture-specific optimization
Practical Implementation Challenges
- Memory overhead management not thoroughly addressed
- Pipeline complexity implications understated
- Limited guidance for scenarios without suitable small models
- Deployment considerations in resource-constrained environments lacking
Additionally, the absence of a dedicated limitations section limits the paper's completeness.
Questions
- How does the method perform when architectural differences between small and large models are more significant? Are there specific architectural compatibility requirements?
- The improved prompt alignment for stylized models is intriguing. Could you provide more analysis of why this occurs and how generally applicable this finding is?
- What are the primary failure modes of T-Stitch? Are there specific scenarios where the method consistently underperforms?
Thanks for your very positive and comprehensive reviews. Below, we would like to address your additional questions.
Q1 - Overall concerns.
Thanks for providing those valuable comments. We would like to briefly address them below.
Limitations Analysis. We have added a brief discussion on limitations in Section A.23 of our revised submission.
Theoretical Gaps. Please refer to our response to Reviewer 1Rg8 Q1.
Architectural Considerations. In general, T-Stitch works well as long as the small and large models share a similar latent space and the same spatial dimension of latents, as shown in Table 8. T-Stitch is also orthogonal to individual model optimization, as discussed in Section A.8. As our current study has demonstrated broad effectiveness, we would like to leave other interesting explorations (multi-model scenarios, feature space alignment) for future work.
Practical Implementation Challenges. Overall, the implementation of T-Stitch is simple: let the small model perform the denoising sampling first, then switch to the large model for the subsequent timesteps (see the sketch below). With minor memory overhead (Table 4), T-Stitch is general and is "more likely to be adopted" (Reviewer 4dLD) in many practical scenarios, such as ControlNet (Figure 30) and text-to-video generation (Figure 34).
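For reference, a minimal sketch of this switching loop, assuming plain `eps = model(x, t)` noise predictors and a diffusers-style scheduler (a real pipeline would also handle conditioning and CFG):

```python
import torch

@torch.no_grad()
def t_stitch_sample(small_model, large_model, scheduler, x, timesteps, fraction=0.4):
    """Denoise with the small model for the first `fraction` of steps, then
    hand the trajectory over to the large model. Assumes both models operate
    in the same latent space with identical spatial dimensions."""
    switch_at = int(len(timesteps) * fraction)
    for i, t in enumerate(timesteps):  # ordered from high noise to low noise
        model = small_model if i < switch_at else large_model
        eps = model(x, t)
        x = scheduler.step(eps, t, x).prev_sample  # diffusers-style sampler update
    return x
```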
Q2 - T-Stitch for diffusion models of very different architectures.
The fundamental insight of T-Stitch is that different diffusion models trained on similar datasets can share a similar latent space. As demonstrated in Table 8 of the Appendix, our experiments show that T-Stitch performs very well when applied to very different model families, such as U-ViT and DiT.
"Are there specific architectural compatibility requirements?"
Yes. T-Stitch requires the latent noise from both models to have the same spatial dimensions to allow seamless switching during denoising sampling. This design is inspired by the observation that widely adopted models (e.g., DiTs, SD, and fine-tuned SDs) typically share the same latent shape (Lines 208–211). We leave the development of more challenging stitching strategies for future work.
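As a trivial illustration of this requirement (a hypothetical helper, not part of the paper's code):

```python
import torch

@torch.no_grad()
def can_stitch(small_model, large_model, x, t):
    """The two denoisers can share one trajectory only if their noise
    predictions have the same latent shape at the switching step."""
    return small_model(x, t).shape == large_model(x, t).shape
```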
Q3 - More analysis of the improved prompt alignment?
We speculate that stylized SD models, such as those trained with DreamBooth, are more prone to overfitting and catastrophic forgetting [A, B] due to being trained on very few images. On the other hand, the initial SD model, trained on large-scale text-to-image datasets, may help complement the forgotten knowledge. By adopting the small SD model during the early steps, it can provide general priors at the beginning [C], thus compensating for the missing concepts in the prompts for the overfitted stylized SD models.
In our experiments, we found this approach generally applicable to both standard SD models and fine-tuned/stylized SD models (Figures 26, 27, and 28), as well as to other diffusion model acceleration techniques such as DeepCache (Figure 21) and token merging (Figure 22).
[A] Zhang, Lvmin, Anyi Rao, and Maneesh Agrawala. "Adding conditional control to text-to-image diffusion models." ICCV. 2023.
[B] Ruiz, Nataniel, et al. "Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation." CVPR. 2023.
[C] Graikos, Alexandros, et al. "Diffusion models as plug-and-play priors." NeurIPS (2022).
Q4 - Primary failure modes of T-Stitch.
Note that our primary goal in this work is to accelerate diffusion model sampling while preserving its generation quality as much as possible. Considering the fact that T-Stitch adopts a small model to accelerate the target large model, the primary failure mode lies in the weakness of the chosen small model.
For example, if the small model is not significantly faster than the large model while its generation quality is substantially worse, T-Stitch inevitably results in a poor speed-quality trade-off. We expect future works to train or distill an efficient small diffusion model that specializes in the early sampling steps for better speed-quality trade-offs. Besides, T-Stitch will also underperform if the chosen small model has a very different latent space from the large model. For these reasons, we provided a principle for model selection in our initial submission (Lines 227-232).
The authors' responses have adequately addressed my concerns. I already considered this to be a strong paper, and since my concerns were not significant enough to warrant a score adjustment, I will maintain my current rating.
We sincerely thank all the reviewers for their thoughtful comments and would like to briefly summarize the reviews and our paper revision as follows.
1. Summary of reviews
In general, T-Stitch is highly recognized by the reviewers:
- “Novel & Foundational Insight … Practicality … Comprehensive Empirical Validation … Broader Impact & Applications” (Reviewer Aq9T)
- “simplicity is an added benefit … more reproducible and more likely to be adopted … a smart way to combine/reuse the existing models” (Reviewer 4dLD)
- “simple but effective … orthogonal to other techniques … be easily combined with other methods to further reduce inference time” (Reviewer GeS8)
- “good paper backed by solid experimentation … extensive comparative analysis … well written and clearly motivated” (Reviewer 1Rg8)
Besides, we have provided responses in the rebuttal for each reviewer and hope they address their further concerns.
2. The Optimal Threshold for T-Stitch
In our observation, stitching at the early ~40% of steps only minorly affects the generation quality, while at larger fractions, T-Stitch provides a clear trade-off between the small and large model. This phenomenon has been observed across various architectures and samplers. Although this 40% threshold might not hold for all use cases, it is worth noting that
- Compared to the time-consuming FID evaluation (not commonly used by downstream users), practical usage in the community suggests that users are more likely to iteratively refine their prompts in order to get their desired image quality.
- Fortunately, determining the optimal switching point in T-Stitch can be done very efficiently by directly generating images for a prompt at different fractions. For example, in Figure 7, sequentially generating 11 images from 0% to 100% fraction of the small model only requires a few seconds.
Thus, we can directly observe the trade-off for each prompt in a short time for each model, without costly searching for a schedule under different time budgets. This advantage underscores the practical value of T-Stitch, especially given the existence of thousands of models in the public model zoo. As our current study has demonstrated broad effectiveness across many scenarios, which has also been recognized by most reviewers, we would like to leave further explorations of this topic for future work.
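As a concrete (hypothetical) version of this procedure, reusing the `t_stitch_sample` sketch from earlier; `small_model`, `large_model`, `scheduler`, `x_T`, and `timesteps` are assumed to be defined as before:

```python
# Sweep the small-model fraction for a fixed prompt/seed, then visually pick
# the largest fraction before quality noticeably degrades.
fractions = [i / 10 for i in range(11)]  # 0%, 10%, ..., 100%
samples = [
    (f, t_stitch_sample(small_model, large_model, scheduler,
                        x_T.clone(), timesteps, fraction=f))
    for f in fractions
]
```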
3. Summary of Paper Revision
According to the feedback from the reviewers, we have included the following sections and results,
- In Section A.21, we provide the L2-distance comparison of latent embeddings between DiT models at different denoising steps, complementing the comparison based on the cosine similarity in Figure 3.
- In Section A.22, we discuss the relation between T-Stitch and early-exit works.
- In Section A.23, we briefly summarize the limitations of T-Stitch.
We thank the reviewers and ACs again for their efforts in reviewing our paper, and sincerely welcome further discussions.
Best regards,
Authors of Submission 8597
Dear reviewers,
If you haven’t done so already, please engage in the discussion as soon as possible. Specifically, please acknowledge that you have thoroughly reviewed the authors' rebuttal and indicate whether your concerns have been adequately addressed. Your input during this critical phase is essential—not only for the authors but also for your fellow reviewers and the Area Chair—to ensure a fair evaluation.

Best wishes,
AC
This paper introduces a novel and training-free approach to accelerating the sampling process of diffusion models by leveraging small diffusion models. The proposed method is both simple and effective, demonstrating effectiveness across various tasks, including large-scale text-to-image diffusion models. The experimental results are thorough and convincingly validate the approach, showcasing its practicality and relevance. All the reviewers have expressed strong support for the significance of the contribution, highlighting its potential impact on the field. The AC concurs with the reviewers' positive assessment, commending the quality and rigor of the work.
Additional Comments on Reviewer Discussion
There were some initial concerns raised by the reviewers, but most of them were mainly for clarification, so they were mostly resolved during the rebuttal period. The consensus for acceptance remained unchanged during the reviewer discussion period.
Accept (Poster)