PaperHub
Overall rating: 6.3 / 10
Poster · 4 reviewers
Scores: min 5, max 7, std 0.8
Individual ratings: 6, 5, 7, 7
Confidence: 3.3
Correctness: 3.0
Contribution: 2.8
Presentation: 3.3
NeurIPS 2024

PeRFlow: Piecewise Rectified Flow as Universal Plug-and-Play Accelerator

Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

Keywords
Flow Model, Diffusion Model, Generative Model, Image Generation

Reviews and Discussion

Review
Rating: 6

This paper proposes piecewise rectified flow (PeRFlow) for accelerating pre-trained diffusion models. To overcome the requirement of synthetic data generation in rectified flow, the authors propose to prepare the training data by dividing the entire ODE trajectory into multiple time windows. The sampling trajectories within each time window are then straightened by the reflow operation. The proposed method is adapted to multiple diffusion models with different parameterizations. Experiments on text-to-image (SD-v1.5, SD-v2.1, SDXL) and text-to-video (AnimateDiff) models demonstrate the effectiveness of the proposed method.
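The data-preparation scheme summarized above can be sketched minimally. This is our illustrative reading of the method; the function names, the linear noise schedule, and the teacher-solver interface are assumptions, not the paper's code:

```python
import torch

def perflow_pair(x0, noise, teacher_solve, alpha_bar, K=4):
    """Illustrative sketch of PeRFlow's per-window training pair.

    Rather than simulating the full ODE from t=1 (noise) to t=0 (data),
    pick one of K time windows, jump to its start point via the forward
    diffusion of a real sample, then run the teacher solver only within
    that window. The straight line between the two endpoints defines
    the piecewise reflow (velocity) target.
    """
    ts = torch.linspace(1.0, 0.0, K + 1)              # window boundaries
    k = int(torch.randint(0, K, (1,)))                # pick a random window
    t_start, t_end = ts[k], ts[k + 1]
    a = alpha_bar(t_start)                            # noise schedule at t_start
    x_start = a.sqrt() * x0 + (1 - a).sqrt() * noise  # forward-diffuse to t_start
    x_end = teacher_solve(x_start, t_start, t_end)    # a few DDIM steps in-window
    v_target = (x_start - x_end) / (t_start - t_end)  # straight-line velocity
    return x_start, x_end, v_target
```

The student flow model would then be regressed onto `v_target` at times sampled inside the chosen window, so each iteration only pays for the short in-window simulation.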

Strengths

  • This paper addresses a major performance bottleneck in rectified flow, namely the synthetic data generation stage, which requires costly simulation with higher numerical errors. The proposed solution allows online simulation of ODE trajectories and thus more efficient training.
  • The proposed PeRFlow is extensively tested on multiple diffusion models for text-to-image and text-to-video generation, and the comparative results with the previous state-of-the-art few-step diffusion baselines are impressive.
  • Code is provided for both training and inference.

Weaknesses

  • The contribution of this work is weakened by its similarity to Sequential Reflow [1], and the additional design to be compatible with different parameterization strategies is somewhat incremental.
  • The main motivation to improve training efficiency (line 47) is not reflected in the experiments. The authors should provide a more comprehensive comparison with rectified flow in terms of performance/training computation tradeoff.
  • The statement in line 55 that PeRFlow "has a lower numerical error than integrating the entire trajectories" should be more carefully validated, e.g. by quantitatively comparing straightness [2] or curvature [3] within each time window. The authors could also apply their method to the commonly used 2D checkerboard data to provide a more intuitive visualization of the learned probability path.
  • There is a lack of ablation studies for several design choices. The authors should analyze the sensitivity of PeRFlow's performance to the number of time windows and sampling steps (more densely). It is also unclear how adding "one extra step in [t_K, t_{K-1}]" (line 223) contributes to the final results.

  1. Yoon, et al. Sequential Flow Straightening for Generative Modeling. arXiv 2024.
  2. Liu, et al. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. ICLR 2023.
  3. Lee, et al. Minimizing Trajectory Curvature of ODE-based Generative Models. ICML 2023.
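The straightness measure from [2] that this point asks for can be estimated empirically along a discretized trajectory. A minimal sketch, where the discretization and function signature are our own choices:

```python
import torch

def straightness(xs, ts):
    """Empirical trajectory straightness in the spirit of Liu et al. [2]:
    the time-averaged squared deviation of the local velocity from the
    constant straight-line velocity between the endpoints. It is zero
    iff the discretized trajectory is exactly a straight line, so it
    can be computed per time window to compare curvature across windows.

    xs: (T, ...) states along one trajectory; ts: (T,) matching times.
    """
    straight_v = (xs[-1] - xs[0]) / (ts[-1] - ts[0])  # endpoint velocity
    total, s = ts[-1] - ts[0], 0.0
    for i in range(len(ts) - 1):
        dt = ts[i + 1] - ts[i]
        v = (xs[i + 1] - xs[i]) / dt                  # local velocity
        s = s + dt * ((straight_v - v) ** 2).mean()
    return s / total
```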

Questions

  • The proposed method seems to be applicable to model training in addition to acceleration. Have the authors considered training their models from scratch on CIFAR-10 or ImageNet? This would allow a direct comparison of the performance/efficiency tradeoff with a broader family of flow matching algorithms.
  • What does "one-step" mean in Figure 8, when the number of sampling steps should be lower bounded by the number of time windows, which is 4?

Limitations

Yes. The authors have discussed their limitations in Section 5.

Author Response

Thank you very much for the comments and advice.

  1. The contribution of this work is weakened by its similarity to Sequential Reflow [1], and the additional design to be compatible with different parameterization strategies is somewhat incremental.
  • Technically, PeRFlow shares the idea of dividing the whole time range into multiple segments. Beyond this, PeRFlow designs dedicated parameterizations for each type of pretrained diffusion model, which facilitate fast convergence of the acceleration; it discusses the effect of the teacher model's CFG during distillation; and it demonstrates the plug-and-play property of flow-based acceleration methods.
  1. The main motivation to improve training efficiency (line 47) is not reflected in the experiments. The authors should provide a more comprehensive comparison with rectified flow in terms of performance/training computation tradeoff.
  • Yes, we agree that a more comprehensive comparison of performance/computation tradeoff should be included.
  • In each training iteration, the computational cost of PeRFlow for synthesizing the training target is 1/K of that of InstaFlow, where K is the number of time windows. This is why we make the claim.
  • For example, in each iteration, InstaFlow samples a noise and uses 32-step DDIM to solve the target (solving from t=1 to t=0). If PeRFlow divides the whole time range into 4 segments, it only requires 8 DDIM steps to solve a sub-window.
  1. The statement in line 55 that PeRFlow "has a lower numerical error than integrating the entire trajectories" should be more carefully validated, e.g. by quantitatively comparing straightness [2] or curvature [3] within each time window. The authors could also apply their method to the commonly used 2D checkerboard data to provide a more intuitive visualization of the learned probability path.
  • Thanks for the advice; we will add such visualizations. The claim should hold in most cases, because the accumulated error of numerical integration grows with the length of the integration interval.
  1. There is a lack of ablation studies for several design choices. The authors should analyze the sensitivity of PeRFlow's performance to the number of time windows and sampling steps (more densely). It is also unclear how adding "one extra step in [t_K, t_{K-1}]" (line 223) contributes to the final results.
  • Thanks for the advice. We will add more analysis in the final version.
  • Number of time windows and sampling steps:
    • The number of training segments depends on the minimum number of inference steps we expect. If the minimum number of inference steps is N, the number of training segments K should be less than or equal to N. In our experiments we evaluate 4-step, 6-step, and 8-step generation, so we set the number of training segments to four. The reason is that the velocity of a time window cannot be approximated by the velocity of its previous window, so each window must be allocated at least a one-step computation budget.
    • In some special cases (e.g., Wonder3D in Appendix Figure 8), after 4-piece PeRFlow acceleration the trajectory across the whole time range is almost linear, and we can generate multi-view results in one step. In most cases, however, the number of inference steps should be greater than or equal to the number of training segments.
  • We explain the inference budget allocation strategy in lines 215-225.
    • We agree it should be clarified with a more formal statement; thanks for this suggestion. Suppose we have K time windows and N >= K inference steps. From the noisy to the clean state, the time windows are indexed K, K-1, ..., 1.
    • If N is divisible by K, each time window is solved with N//K steps. Otherwise, we allocate N//K + 1 steps to the windows whose index i satisfies K - i < N mod K, and the remaining windows are allocated N//K steps. In other words, we try to allocate the budget equally; if N is not divisible by K, the extra budget goes to the time windows in the noisy region, because the important layout synthesis happens in these regions.
  1. The proposed method seems to be applicable to model training in addition to acceleration. Have the authors considered training their models from scratch on CIFAR-10 or ImageNet? This would allow a direct comparison of the performance/efficiency tradeoff with a broader family of flow matching algorithms.
  • In this work, we focus on the acceleration of pretrained diffusion models. We leave the study of pretraining (training flow models from scratch via PeRFlow) into future work.
  • Although Stable Diffusion 3 demonstrates that training a rectified flow (a single-piece linear flow) can generate high-quality images, several open questions remain.
    • Piecewise linear flow generalizes rectified flow; does there exist a proper window-division scheme for training a better generative model?
    • If so, what properties would it have compared to the vanilla rectified flow? Faster convergence? Fewer steps required for inference sampling?
  1. What does "one-step" mean in Figure 8 when the sampling step should be lower bounded by the number of time windows of 4?
  • Please refer to the answer to weakness 4.
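The budget-allocation rule stated in the response to weakness 4 can be written out concretely. A small sketch, where the function name and the dict return format are ours:

```python
def allocate_inference_steps(N: int, K: int) -> dict:
    """Distribute N inference steps over K time windows.

    Windows are indexed K, K-1, ..., 1 from the noisy end to the clean
    end. Every window gets N // K steps; the N mod K leftover steps go
    to the noisiest windows (those with K - i < N mod K), where the
    important layout synthesis happens.
    """
    assert N >= K, "need at least one step per window"
    base, extra = divmod(N, K)
    return {i: base + (1 if K - i < extra else 0) for i in range(K, 0, -1)}
```

For example, with K = 4 windows, 8 steps split evenly as 2/2/2/2, while 6 steps give the two noisiest windows 2 steps each and the remaining windows 1 step each.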
Comment

Thank you for the detailed responses to my review, which addressed many concerns. I understand that some of the requested experiments cannot be performed due to limited time for rebuttal, and I hope that these experiments will be included in the final version.

Review
Rating: 5

This paper presents a novel approach to accelerating diffusion models by introducing the Piecewise Rectified Flow (PeRFlow). This method significantly enhances the efficiency of generating high-quality generative samples by dividing the flow trajectories of diffusion models into several time windows and straightening them using a reflow operation.

Key contributions of the paper include:

  • Superior Performance in Few-Step Generation: PeRFlow reduces the number of inference steps required while maintaining or improving the quality of generative samples.
  • Fast Training and Transfer Ability: The models adapt quickly due to inherited parameters from pre-trained diffusion models, demonstrating good transferability across different models.
  • Universal Plug-and-Play Capability: PeRFlow models serve as accelerators compatible with various pre-trained diffusion models, facilitating seamless integration into existing workflows.

Strengths

Overall I find that the writing is clear, concise, and well-structured, making it easy for readers to follow the arguments and understand the key points. I like the idea of multi-step or piecewise generative models since it is natural to extend InstaFlow into a multi-step fashion, which offers flexibility between speed and quality.

Weaknesses

  • I think the multi-step consistency model [1] should be discussed since it has a strong correlation with this paper. In the experiments section, you only compare PeRFlow with LCM and InstaFlow, both of which are relatively early works. There are plenty of distillation methods in this field that are worth mentioning and comparing, including HyperSD [2], CTM [3], and DMD [4].
  • The most important hyper-parameter N, i.e., the number of segments, lacks analysis. How do you choose its value? What’s the relationship between the number of segments used in training and the number of sampling steps used in inference?
  • I like the idea of a "plug-and-play" accelerator that extracts the delta weights to speed up other diffusion models. However, the implementation details and analysis in the paper are quite limited, with just a few demos. Besides, I think this is a general method that can be applied to any accelerated diffusion model, such as LCM?
  • The paper claims that “the computational cost is significantly reduced for each training iteration compared to InstaFlow”. However, do you have any quantitative evaluation, including the comparison with other methods?

[1] Heek, Jonathan, Emiel Hoogeboom, and Tim Salimans. "Multistep consistency models." ICML 2024.

[2] Ren, Yuxi, et al. "Hyper-sd: Trajectory segmented consistency model for efficient image synthesis." arXiv preprint arXiv:2404.13686 (2024).

[3] Kim, Dongjun, et al. "Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion." NeurIPS 2023.

[4] Yin, Tianwei, et al. "One-step diffusion with distribution matching distillation.” CVPR 2024.

Questions

See the weakness above.

Limitations

The authors have adequately addressed the limitations and potential negative societal impact of their work.

Author Response

Thank you very much for the comments and advice.

  1. I think the multi-step consistency model [1] should be discussed since it has a strong correlation with this paper. In the experiments section, you only compare PeRFlow with LCM and InstaFlow, both of which are relatively early works. There are plenty of distillation methods in this field that are worth mentioning and comparing, including HyperSD [2], CTM [3], and DMD [4].
  • Thanks for providing these related works. Yes, we will add a discussion of [1], which shares a similar idea of dividing the whole time range into multiple segments. The difference is that [1] trains a consistency model for each segment, while PeRFlow trains a linear flow.
  • In Table 1, we compared PeRFlow with LCM, InstaFlow, and SDXL-Lightning, which is a state-of-the-art few-step text-to-image generator. Yes, we agree that the other distillation methods should also be discussed.
  1. The most important hyper-parameter N, i.e., the number of segments, lacks analysis. How do you choose its value? What’s the relationship between the number of segments used in training and the number of sampling steps used in inference?
  • The number of training segments depends on the minimum number of inference steps we expect.
  • If the minimum number of inference steps is N, the number of training segments K should be less than or equal to N. In our experiments we evaluate 4-step, 6-step, and 8-step generation, so we set the number of training segments to four.
  • The reason is that the velocity of a time window cannot be approximated by the velocity of its previous window, so each window must be allocated at least a one-step computation budget.
  • In some special cases (e.g., Wonder3D in Appendix Figure 8), after 4-piece PeRFlow acceleration the trajectory across the whole time range is almost linear, and we can generate multi-view results in one step. In most cases, however, the number of inference steps should be greater than or equal to the number of training segments.
  1. I like the idea of “plug-and-play” accelerator by extracting the delta weight to speed up other diffusion models. However, the implementation details and analysis in the paper are really limited with just a few demos. Besides, I think this is a general method that can be applied to any accelerated diffusion model, such as LCM?
  • Yes, plug-and-play is a general property; as with LCM, we can add the delta weights to any other diffusion pipeline for inference acceleration, such as ControlNet, image-to-image, and IP-conditioned image generation.
  • We will add more implementation details. The delta weights equal the weights after PeRFlow acceleration minus the initial pretrained weights.
  • We provide a short analysis in lines 196-204, where we observe that the delta weights of PeRFlow better preserve the properties of the original diffusion models compared to LCM, including a smaller domain shift.
  1. The paper claims that “the computational cost is significantly reduced for each training iteration compared to InstaFlow”. However, do you have any quantitative evaluation, including the comparison with other methods?
  • In each training iteration, the computational cost of PeRFlow for synthesizing the training target is 1/K of that of InstaFlow, where K is the number of time windows. This is why we make the claim.
  • For example, in each iteration, InstaFlow samples a noise and uses 32-step DDIM to solve the target (solving from t=1 to t=0). If PeRFlow divides the whole time range into 4 segments, it only requires 8 DDIM steps to solve a sub-window.
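The delta-weight mechanism described in the response above admits a very small sketch. The dict-of-tensors representation and function names are our own illustration, not the released code:

```python
import torch

def extract_delta(accelerated, pretrained):
    """Delta weights: the PeRFlow-accelerated weights minus the initial
    pretrained weights, computed per parameter tensor."""
    return {name: accelerated[name] - pretrained[name] for name in pretrained}

def plug_into(pipeline_weights, delta):
    """Plug-and-play acceleration: add the delta weights onto another
    pipeline that shares the same base model (e.g. one equipped with
    ControlNet or image-to-image conditioning)."""
    return {name: w + delta.get(name, torch.zeros_like(w))
            for name, w in pipeline_weights.items()}
```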
Comment

Thanks for your reply. I would like to maintain my score.

Review
Rating: 7

The paper introduces a new flow-based method designed to accelerate diffusion models by dividing the sampling process into several time windows. The sampling path within each window is straightened by the reflow operator. This approach allows for fast training convergence, transferability, and compatibility with various pretrained diffusion model workflows.

Strengths

  • The approach’s motivation is clear. Theoretical arguments support the proposal well.
  • The empirical results are promising, and better than other existing baselines.

Weaknesses

  • When dividing the sampling process into several time windows, the error of the previous windows immensely affects the later ones, potentially increasing the cumulative error of the whole sampling process.

Questions

  • Can PeRFlow's approach be generalized to other types of generative models beyond diffusion models such as GAN-based or VAE-based?
  • Can you provide more detailed insights into the parameterization techniques used and their impact on the training convergence and final model performance?
  • What are the observed benefits of using synchronized versus fixed CFG modes?

Limitations

N/A

Author Response

Thank you very much for the comments and advice.

  1. When dividing the sampling process into several time windows, the error of the previous windows immensely affects the later ones, potentially increasing the cumulative error of the whole sampling process.
  • In practice, we observe that increasing the number of time windows reduces the difficulty of training, because a shorter time window is easier to straighten and its error is better controlled.
  • We agree with this concern. The potential issue could arise if the error in each window did not shrink with the window's length; then more windows could lead to a larger total error. In our experiments, however, we find that the per-window error decreases noticeably as the time window is shortened.
  1. Can PeRFlow's approach be generalized to other types of generative models beyond diffusion models such as GAN-based or VAE-based?
  • GAN-based and VAE-based methods are naturally one-step generators. PeRFlow works well for iterative generators like diffusion/flow methods.
  1. Can you provide more detailed insights into the parameterization techniques used and their impact on the training convergence and final model performance?
  • Pretrained diffusion models learn rich information from large-scale training data. Parameterizing the target few-step model in the same way as the pretrained diffusion model helps inherit this information, such as how to interact with the conditioning texts. The acceleration algorithm can then train on a relatively small dataset and converge quickly.
  1. What are the observed benefits of using synchronized versus fixed CFG modes?
  • As discussed in Section 2 (lines 152-158), the CFG-sync mode better preserves the sampling diversity and the compatibility of the original diffusion models, at the cost of occasional failures in generating complex structures, while the CFG-fixed mode trades off these properties in exchange for fewer failure cases.
Comment

Thank you to the authors for their responses. I would like to keep my original score.

Review
Rating: 7

In this paper, the authors propose a new sampling paradigm, Piecewise Rectified Flow (PeRFlow), that applies the reflow operation to diffusion models, straightening the trajectories of the original PF-ODEs and achieving better performance in few-step generation. Specifically, PeRFlow divides the sampling process (the ODE trajectory) into multiple time windows and performs the reflow operation within each window to straighten its trajectories. Compared to applying reflow to the original full-trajectory diffusion model, this significantly reduces the time needed to synthesize training data for reflow and narrows the numerical errors of solving the ODEs, yielding a higher-quality synthetic training dataset. It also requires only a few inference steps to solve the endpoint of each time window, resulting in a diffusion acceleration method with faster training convergence, more linear trajectories, and better performance.

Strengths

  • The paper is well organized and clearly structured, making it very easy to follow.

  • Many fully detailed mathematical derivations are provided, making it easier to understand the details of the proposed method.

  • The figure illustrating the proposed method is well designed; its effect and rough structure can be understood at a glance without reading the text.

  • The paper uses a sufficiently large image dataset containing rich image-text pairs and enough SOTA acceleration methods to evaluate the proposed approach.

Weaknesses

  • The evaluation metrics for most generative models include FID and IS, yet this paper adopts only FID. Although the paper aims to accelerate diffusion models while maintaining quality, it would be better if the authors also evaluated the diversity of generated images using IS, so readers can know whether the method affects generation diversity.

  • It would be better if the authors directly indicated in the tables that lower FID values are better; readers outside the generative modeling field may not be familiar with FID.

Questions

Please refer to the weakness part.

Limitations

N/A

Author Response

Thank you very much for the comments and advice.

  1. The evaluation metrics for most generative models include FID and IS, yet this paper adopts only FID. Although the paper aims to accelerate diffusion models while maintaining quality, it would be better if the authors also evaluated the diversity of generated images using IS, so readers can know whether the method affects generation diversity.
  • Thanks for your suggestion. We will add IS values in the final version. Due to the limited rebuttal time and GPU resources, we could not generate enough images to compute IS during this round of discussion.
  1. It would be better if the authors directly indicated in the tables that lower FID values are better; readers outside the generative modeling field may not be familiar with FID.
  • Thanks for your suggestion. We will highlight this in the final version.
Final Decision

The paper presents a piecewise extension of rectified flows for accelerated sampling from diffusion models. The reviewers have acknowledged the quality of the presentation and the efficacy of the method on several datasets. However, they have also raised concerns regarding the novelty of the method in comparison to sequential flows and multistep consistency models. Given the new perspectives on different parameterizations and strong results, we are happy to recommend the paper for acceptance at NeurIPS.