PaperHub
6.0 / 10 · Rejected · 4 reviewers
Scores: 6, 6, 6, 6 (min 6, max 6, std 0.0)
Confidence: 3.5
ICLR 2024

Improve Temporal Consistency In Diffusion Models through Noise Correlations

Submitted: 2023-09-21 · Updated: 2024-02-11

Abstract

Keywords: diffusion model · temporal consistency · sequential data generation

Reviews and Discussion

Review (Rating: 6)

This paper proposes an autoregressive way to manipulate the noise within diffusion models. It uses predefined hyperparameters to control the Gaussian covariance matrix when sampling noise. Experiments on audio and motion datasets demonstrate the effectiveness of the proposed method.

Strengths

  1. By incorporating temporal correlation when sampling noise, the experiments show improved performance compared with the baselines.
  2. The proposed technique introduces little extra computation at both training and inference time.

Weaknesses

  1. The authors conduct experiments on two baseline methods, DiffWave and DiffWave-SaShiMi, and evaluate the proposed method on two tasks: neural vocoding and unconditional generation. The proposed method improves both baselines on the neural vocoding task in Table 1(a), but obtains inferior performance on the unconditional generation task under some metrics (like FID and IS). It is preferred that the authors clearly explain the possible reason for this phenomenon.
  2. In Table 2, the performance improvement in human motion generation is not substantial compared to audio generation. In many cases, the performance even deteriorates. Does that mean the proposed method may not be general enough for different tasks? Or are the hyperparameters used to produce the Gaussian covariance matrix not tuned well?
  3. My major concern is that only changing the noise distribution in diffusion models may not be enough to address the temporal consistency problem encountered in various tasks. The authors may need to conduct experiments on more tasks like video generation.

Questions

  1. For video generation, the data is very high-dimensional. What is the complexity of producing the covariance matrix? Is it still affordable for a diffusion process?
  2. Is there any visualization of the generated content to demonstrate the improvement in temporal consistency after using the proposed method?
Comment

Thank you sincerely for providing valuable suggestions. We greatly appreciate the time you dedicated to reviewing our paper. Below, we provide point-by-point responses addressing each of your comments.

W1. The authors conduct experiments on two baseline methods, DiffWave and DiffWave-SaShiMi, and evaluate the proposed method on two tasks: neural vocoding and unconditional generation. The proposed method improves both baselines on the neural vocoding task in Table 1(a), but obtains inferior performance on the unconditional generation task under some metrics (like FID and IS). It is preferred that the authors clearly explain the possible reason for this phenomenon.

We postulate there are two possible reasons that account for our inferior performance assessed by FID and IS in the unconditional audio generation task.

Firstly, we have to admit that our hyper-parameter tuning for the decay rate $\boldsymbol{\lambda}$, the coefficient $c$ of the AR(1) structure in the local-window design, and the window size $w$ is incomplete due to the limited computational resources. Note that the number of frames $F=16{,}000$ in this task is considerably large, posing a challenge for us to exhaustively search over all possible combinations of these hyper-parameters. In the meantime, inspired by another reviewer, we may improve the hyper-parameter selection in the future with the guidance of dataset-specific information. For example, we may use our proposed Temporal Consistency Metric (TCM) to measure the temporal dependency of a dataset. If TCM yields a small value, it indicates strong temporal dependencies in the training data, and we can prioritize exploring larger values of $\boldsymbol{\lambda}$ and $c$; and vice versa.

Secondly, the inconsistencies observed in our performance, as measured by FID, FAD, and IS, may be partially attributed to the nature of these metrics.

  • [1] found that it is important to evaluate IS on a large enough number of samples (e.g., 50k), as part of this metric measures the diversity of generated samples. This sample-size requirement is much larger than the number of samples we generate, which is 1,024 due to limited computational resources. Therefore, the reported IS results might be unsatisfactory.
  • In addition, FAD [2] is designed based on FID [3], and both of them improve on IS by taking the real data distribution into account. Notably, FAD is tailor-made for evaluating audio data: its embeddings are obtained from the VGGish model, which is trained as an audio classifier on a large dataset of YouTube videos. In contrast, the embeddings of FID are derived from the Inception model, which is trained on ImageNet and specializes in capturing vision-relevant features. Hence we postulate that FAD is more accurate than FID for evaluating generated audio, and our ARTDiff shows substantial improvements in terms of FAD.

Following your suggestion, we have added these points to the interpretation of Table 1(b) in Section 3.1 of the revised manuscript.

Comment

W3. My major concern is that only changing the noise distribution in diffusion models may not be enough to address the temporal consistency problem encountered in various tasks. The authors may need to conduct experiments on more tasks like video generation.

Thanks for your constructive feedback. Following your suggestion, we have newly added an experiment on text-to-video generation to evaluate ARTDiff, where Tune-A-Video [4] is adopted as the baseline model. We summarize the task and our implementation below; please refer to Section 3.3 in the revised manuscript and [4] for more details.

One-shot text-to-video generation: Given a video $\mathcal{V}=\{v_i \mid i \in [1,m]\}$ that has $m$ frames and its corresponding prompt $\mathcal{P}$, the goal is to generate a novel video $\mathcal{V}^*$ by an edited text prompt $\mathcal{P}^*$. For example, given a video and a prompt describing "a man is skiing", we may wish to generate a new video corresponding to the edited prompt "Spider Man is skiing".

Tune-A-Video: The Tune-A-Video method consists of two stages: fine-tuning and inference. The first stage fine-tunes a pre-trained text-to-image diffusion model with the attention blocks in the U-Net modified for spatio-temporal modeling. During inference, it first obtains a latent noise of the source video $\mathcal{V}$ via DDIM inversion (see its definition and notation in Section 3.1 of [5]):

$ z_{t+1} = \sqrt{\frac{\alpha_{t+1}}{\alpha_{t}}}z_t + \Big( \sqrt{\frac{1}{\alpha_{t+1}}-1}- \sqrt{\frac{1}{\alpha_{t}}-1}\Big)\cdot \epsilon_{\theta}(z_t, t, C) ; $

and then samples a novel video from the noise, guided by the edited prompt $\mathcal{P}^*$, via the deterministic DDIM sampling [6]:

$ z_{t-1} = \sqrt{\alpha_{t-1}} \Big(\frac{z_{t}-\sqrt{1-\alpha_{t}}\cdot \epsilon_{\theta}(z_t, t, C)}{\sqrt{\alpha_{t}}} \Big)+\sqrt{1-\alpha_{t-1}-\sigma_t^2} \cdot \epsilon_{\theta}(z_t, t, C) + \sigma_t \epsilon_t, $

with $\sigma_t=0$.

Modifications to the baseline: Similar to our previous experiments, during the fine-tuning stage we sample the noise in the forward diffusion from $N(\boldsymbol{0}, \boldsymbol{\Sigma}(\boldsymbol{\lambda}, m))$ instead of $N(\boldsymbol{0}, \boldsymbol{I})$. Note that the inference stage of Tune-A-Video originally involves no random noise but only the noise predicted by the U-Net, as it uses deterministic DDIM inversion and sampling. To obtain a temporally consistent initial noise, we replace $\epsilon_{\theta}(z_t, t, C)$ in DDIM inversion with $\epsilon'=(1-\tau)\cdot\epsilon_{\theta}(z_t, t, C) + \tau \cdot \tilde{\epsilon}$, where $\tilde{\epsilon}\sim N(\boldsymbol{0}, \boldsymbol{\Sigma}(\boldsymbol{\lambda}, m))$ and $\tau \in (0,1)$ is a tuning hyper-parameter. Besides, in DDIM sampling, we allow $\sigma_t > 0$ and let $\epsilon_t\sim N(\boldsymbol{0}, \boldsymbol{\Sigma}(\boldsymbol{\lambda}, m))$. All the other implementations follow [4].
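To make the noise-mixing step concrete, below is a minimal PyTorch sketch of how the replacement $\epsilon'=(1-\tau)\cdot\epsilon_{\theta} + \tau \cdot \tilde{\epsilon}$ could be implemented. It assumes the latent noise tensor is laid out as (batch, frames, channels, height, width) and that a Cholesky factor of $\boldsymbol{\Sigma}(\boldsymbol{\lambda}, m)$ has been precomputed; the function name, tensor layout, and default $\tau$ are illustrative assumptions, not taken from the Tune-A-Video or ARTDiff code.

```python
import torch

def mix_ddim_inversion_noise(eps_pred, sigma_chol, tau=0.1):
    """Sketch of eps' = (1 - tau) * eps_pred + tau * eps_tilde, where eps_tilde
    is correlated across the frame axis via N(0, Sigma(lambda, m)).

    eps_pred:   UNet-predicted noise, assumed shape (batch, m, C, H, W).
    sigma_chol: (m, m) Cholesky factor of Sigma(lambda, m).
    """
    b, m, c, h, w = eps_pred.shape
    white = torch.randn(b, m, c * h * w, device=eps_pred.device, dtype=eps_pred.dtype)
    # Correlate the frame axis: eps_tilde[:, i] = sum_j sigma_chol[i, j] * white[:, j]
    eps_tilde = torch.einsum("ij,bjd->bid", sigma_chol, white).reshape(eps_pred.shape)
    return (1 - tau) * eps_pred + tau * eps_tilde
```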

Results

Quantitatively, we utilize three metrics to assess the quality of generated videos:

  • CLIP score: a larger value indicates better alignment between the video and the text prompt;
  • CLIP similarity: the average cosine similarity between the CLIP image embeddings of adjacent frames (see the sketch after this list);
  • Pixel TCM: a smaller value indicates stronger temporal consistency at the pixel level.
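As a small illustration, the CLIP similarity defined above can be computed from precomputed per-frame CLIP image embeddings as in the sketch below; the helper name is ours, and the CLIP score and Pixel TCM are not reproduced here.

```python
import numpy as np

def clip_similarity(frame_embeddings):
    """Average cosine similarity between CLIP image embeddings of adjacent frames.

    frame_embeddings: (m, d) array of per-frame CLIP image embeddings,
    assumed to be obtained beforehand from any CLIP image encoder.
    """
    e = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    return float(np.mean(np.sum(e[:-1] * e[1:], axis=1)))
```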

We present the results in Table 3 of the revised manuscript and include them here for your easy reference. Our ARTDiff improves all the metrics, indicating its contribution to temporal consistency and textual alignment.

Table: Reported metrics for Tune-A-Video vs. Tune-A-Video + ARTDiff in the one-shot video generation task.

| Model | CLIP score | CLIP similarity | Pixel TCM |
| --- | --- | --- | --- |
| Tune-A-Video | 33.056 | 0.972 | 0.048 |
| + ARTDiff | 33.362 | 0.974 | 0.040 |

Qualitatively, we showcase the generated video samples in Figure 3 of the manuscript and on an anonymous GitHub page (link: https://anonymous4peerreview.github.io/artdiff.github.io/), along with the corresponding text prompts. It can be clearly observed that both the subject and the background in our ARTDiff-generated videos exhibit more temporally consistent behavior. We are still generating more video samples with different text prompts and source videos and will keep you updated once they are available.

We hope this additional experiment can further support the effectiveness and generality of ARTDiff.

Comment

W2. In Table 2, the performance improvement in human motion generation is not substantial compared to audio generation. In many cases, the performance even deteriorates. Does that mean the proposed method may not be general enough for different tasks? Or are the hyperparameters used to produce the Gaussian covariance matrix not tuned well?

Thanks for the careful reading. When evaluated with quantitative metrics, our ARTDiff indeed performs similarly to the baseline model in terms of the relevancy of the generated motions to the input prompts (Accuracy), the similarity between the generated and ground-truth distributions (FID), the variability of the resulting motion distribution (Diversity), and the average variance given a single text prompt (MultiModality). We agree with you that the relative difference in these metrics is not substantial. As noted in the first point of our response to your W1, we admit that our hyper-parameter tuning is insufficient because of computational constraints.

However, ARTDiff does generate more temporally consistent motion samples than the baseline model, as indicated by the smaller TCM values. To further support its effectiveness, we have visualized the action-to-motion generation from ARTDiff (on the right) and the baseline MDM model (on the left) in Figure 2 of the revised paper and on an anonymous GitHub page (link: https://anonymous4peerreview.github.io/artdiff.github.io/), which should be more self-evident.

Specifically, we present visualizations of various human action prompts, including "eat", "drink", and "run". While MDM has shown promising results in this task, it still generates a non-negligible subset of motions that lack temporal consistency. In contrast, ARTDiff is capable of generating human motions with significantly higher levels of temporal consistency most of the time. For instance:

  • Given the "drink" prompt, the motion generated by the baseline model exhibits noticeable inconsistencies, particularly in the movement of the left arm. In contrast, motion generated by ARTDiff showcases smooth and consistent movements.
  • Given the "eat" prompt, the human motion generated by ARTDiff smoothly picks up food while the baseline model generates several frames that appear irrelevant to nearby frames.
  • Given the "run" prompt, the baseline model's generated motion and posture undergo frequent changes, particularly in the last second when the person abruptly turns back and appears suddenly at a distant location from their previous position. Conversely, ours adjusts its posture and moves smoothly.

These examples are provided for illustrative purposes, and there are more samples with different prompts that further demonstrate ARTDiff's strong performance in both temporal consistency and textual alignment. We hope these visualizations can effectively highlight the capabilities of ARTDiff.

Comment

Q1. For video generation, the data is very high-dimensional. What is the complexity of producing the covariance matrix? Is it still affordable for a diffusion process?

We appreciate your great suggestion. To validate ARTDiff's computational time efficiency in video generation, we measure the fine-tuning and inference time in one-shot video generation using the Tune-A-Video backbone and visualize the result in Section 3.4 of the revised manuscript. For your easy reference, we present numerical values in the table below. The additional time incurred by applying ARTDiff is negligible, suggesting its efficiency.

Table: Time comparison of Tune-A-Video and ARTDiff for fine-tuning and inference in text-to-video generation task. The fine-tuning time is recorded for 500 steps and the inference time includes DDIM inversion and generating four videos.

| Model | Fine-tuning Time (s) | Inference Time (s) |
| --- | --- | --- |
| Tune-A-Video | 817 | 174.4 |
| + ARTDiff | 812 | 175.8 |

Q2. Is there any visualization of the generated content to demonstrate the improvement in temporal consistency after using the proposed method?

We totally agree that visualization of the generated content is very important and helpful for demonstrating the improvement in temporal consistency brought by our ARTDiff. Following your advice, we have visualized the action-to-motion generation and the generated videos from our newly added one-shot video generation task. Please refer to our responses to your W2 and W3 for a detailed comparison with the baseline model. We have also added these visualizations in Figures 2 and 3 of the revised manuscript and presented playable video samples on the anonymous GitHub page (link: https://anonymous4peerreview.github.io/artdiff.github.io/).

We hope they address your concerns effectively.

References

[1] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved Techniques for Training GANs. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems (Vol. 29).

[2] Kilgour, K., Zuluaga, M., Roblek, D., & Sharifi, M. (2018). Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms. https://doi.org/10.48550/arxiv.1812.08466

[3] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2018). GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. ArXiv.Org. https://doi.org/10.48550/arxiv.1706.08500

[4] Wu, J. Z., Ge, Y., Wang, X., Weixian Lei, Gu, Y., Shi, Y., Hsu, W., Shan, Y., Qie, X., & Mike Zheng Shou. (2023). Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation. ArXiv.Org. https://doi.org/10.48550/arxiv.2212.11565

[5] Mokady, R., Hertz, A., Aberman, K., Pritch, Y., & Cohen-Or, D. (2022). Null-text Inversion for Editing Real Images using Guided Diffusion Models. ArXiv.Org. https://doi.org/10.48550/arxiv.2211.09794

[6] Jiaming Song, Chenlin Meng, & Stefano Ermon (2021). Denoising Diffusion Implicit Models. In International Conference on Learning Representations.

Comment

Dear Reviewer zRzD,

As the rebuttal period is drawing to a close, we would like to know if you have any remaining concerns regarding our revised manuscript or the additional experiments. Your feedback is highly valuable to us, and we want to ensure that all relevant points are addressed before the final evaluation. Thanks for your time and effort again!

Comment

Thanks for the detailed response from the authors. I have carefully read the rebuttal and think the authors have addressed most of my concerns. I thus upgraded my rating.

Comment

Dear Reviewer zRzD,

We deeply appreciate your valuable comments and feedback, as they have played a crucial role in improving the quality of our work. Thank you very much for your acknowledgment of our work's quality and for your time and effort in reviewing our paper.

Review (Rating: 6)

This paper presents the AutoRegressive Temporal Diffusion (ARTDiff) method to tackle the temporal consistency challenge in diffusion models. ARTDiff efficiently introduces a Gaussian noise distribution that accounts for time-based correlations, which strengthens the temporal connections and boosts consistency in the generated sequences. When tested on audio and motion tasks, ARTDiff outperformed standard diffusion models in sample fidelity and realism, making it a practical solution for diffusion-based generative models.

Strengths

  1. The paper is clearly written. The motivation is well presented and is also reasonable: setting up correlations between frames is indeed a meaningful method for improving temporal consistency.

  2. The paper is verified on a vast number of tasks, making it solid.

Weaknesses

  1. Currently, the method seems to use a general correlation structure for all tasks. Will the performance be better if some dataset/task-specific statistics are taken into consideration?

  2. Some visualizations and videos would make the proposal stronger.

  3. Will the method work on fine-tuning a large generative model?

Questions

See weakness part.

Comment

References

[1] Wu, J. Z., Ge, Y., Wang, X., Weixian Lei, Gu, Y., Shi, Y., Hsu, W., Shan, Y., Qie, X., & Mike Zheng Shou. (2023). Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation. ArXiv.Org. https://doi.org/10.48550/arxiv.2212.11565

[2] Mokady, R., Hertz, A., Aberman, K., Pritch, Y., & Cohen-Or, D. (2022). Null-text Inversion for Editing Real Images using Guided Diffusion Models. ArXiv.Org. https://doi.org/10.48550/arxiv.2211.09794

[3] Jiaming Song, Chenlin Meng, & Stefano Ermon (2021). Denoising Diffusion Implicit Models. In International Conference on Learning Representations.

Comment

W3. Will the method work on fine-tuning a large generative model?

Following your suggestion, we have newly added an experiment on text-to-video generation via one-shot fine-tuning of text-to-image diffusion models. Specifically, Tune-A-Video [1] is adopted as the baseline model to validate our methodology. We summarize the task and our implementation below; please refer to Section 3.3 in the revised manuscript and [1] for more details.

One-shot text-to-video generation: Given a video $\mathcal{V}=\{v_i \mid i \in [1,m]\}$ that has $m$ frames and its corresponding prompt $\mathcal{P}$, the goal is to generate a novel video $\mathcal{V}^*$ by an edited text prompt $\mathcal{P}^*$. For example, given a video and a prompt describing "a man is skiing", we may wish to generate a new video corresponding to the edited prompt "Spider Man is skiing".

Tune-A-Video: The Tune-A-Video method consists of two stages: fine-tuning and inference. The first stage fine-tunes a pre-trained text-to-image diffusion model with the attention blocks in the U-Net modified for spatio-temporal modeling. During inference, it first obtains a latent noise of the source video $\mathcal{V}$ via DDIM inversion (see its definition and notation in Section 3.1 of [2]):

$ z_{t+1} = \sqrt{\frac{\alpha_{t+1}}{\alpha_{t}}}z_t + \Big( \sqrt{\frac{1}{\alpha_{t+1}}-1}- \sqrt{\frac{1}{\alpha_{t}}-1}\Big)\cdot \epsilon_{\theta}(z_t, t, C) ; $

and then samples a novel video from the noise, guided by the edited prompt $\mathcal{P}^*$, via the deterministic DDIM sampling [3]:

$ z_{t-1} = \sqrt{\alpha_{t-1}} \Big(\frac{z_{t}-\sqrt{1-\alpha_{t}}\cdot \epsilon_{\theta}(z_t, t, C)}{\sqrt{\alpha_{t}}} \Big)+\sqrt{1-\alpha_{t-1}-\sigma_t^2} \cdot \epsilon_{\theta}(z_t, t, C) + \sigma_t \epsilon_t, $

with $\sigma_t=0$.

Modifications to the baseline: Similar to our previous experiments, during the fine-tuning stage we sample the noise in the forward diffusion from $N(\boldsymbol{0}, \boldsymbol{\Sigma}(\boldsymbol{\lambda}, m))$ instead of $N(\boldsymbol{0}, \boldsymbol{I})$. Note that the inference stage of Tune-A-Video originally involves no random noise but only the noise predicted by the U-Net, as it uses deterministic DDIM inversion and sampling. To obtain a temporally consistent initial noise, we replace $\epsilon_{\theta}(z_t, t, C)$ in DDIM inversion with $\epsilon'=(1-\tau)\cdot\epsilon_{\theta}(z_t, t, C) + \tau \cdot \tilde{\epsilon}$, where $\tilde{\epsilon}\sim N(\boldsymbol{0}, \boldsymbol{\Sigma}(\boldsymbol{\lambda}, m))$ and $\tau \in (0,1)$ is a tuning hyper-parameter. Besides, in DDIM sampling, we allow $\sigma_t > 0$ and let $\epsilon_t\sim N(\boldsymbol{0}, \boldsymbol{\Sigma}(\boldsymbol{\lambda}, m))$. All the other implementations follow [1].

Results

Quantitatively, we utilize three metrics to assess the quality of generated videos:

  • CLIP score: a larger value indicates better alignment between the video and the text prompt;
  • CLIP similarity: the average cosine similarity between the CLIP image embeddings of adjacent frames;
  • Pixel TCM: a smaller value indicates stronger temporal consistency at the pixel level.

We present the results in Table 3 of the revised manuscript and include them here for your easy reference. Our ARTDiff improves all the metrics, indicating its contribution to temporal consistency and textual alignment.

Table: Reported metrics for Tune-A-Video vs. Tune-A-Video + ARTDiff in the one-shot video generation task.

| Model | CLIP score | CLIP similarity | Pixel TCM |
| --- | --- | --- | --- |
| Tune-A-Video | 33.056 | 0.972 | 0.048 |
| + ARTDiff | 33.362 | 0.974 | 0.040 |

Qualitatively, we showcase the generated video samples in Figure 3 of the manuscript and on an anonymous GitHub page (link: https://anonymous4peerreview.github.io/artdiff.github.io/), along with the corresponding text prompts. It can be clearly observed that both the subject and the background in our ARTDiff-generated videos exhibit more temporally consistent behavior. We are still generating more video samples with different text prompts and source videos and will keep you updated once they are available.

We hope this additional experiment can further support the effectiveness and generality of ARTDiff.

Comment

Thanks to the authors for their reply. I have read carefully and gone through all the newly added demonstrations on the webpage. I decided to keep my original score.

Comment

Dear Reviewer jkr1,

Your comments and feedback have been instrumental in improving the quality of our work. Thank you again for your time and effort in reviewing our paper!

Comment

Thank you very much for your acknowledgment of our work's quality and for your valuable suggestions. We sincerely appreciate your time in reading the paper, and our point-by-point responses to your comments are given below.

W1. Currently, the method seems to use a general correlation structure for all tasks. Will the performance be better if some dataset/task-specific statistics are taken into consideration?

We totally agree that taking dataset/task-specific statistics into consideration would be beneficial, and we had already included this as possible future work in the last section of our previous manuscript.

Currently, the selection of hyper-parameters (decay rate $\boldsymbol{\lambda}$, coefficient $c$ for the AR(1) model across windows, and window size $w$) relies on grid search. We may leverage dataset-specific information as a guide to improve the search efficiency. For example, we can use our proposed Temporal Consistency Metric (TCM) to evaluate the temporal dependencies of the training dataset. If TCM yields a small value, it indicates strong temporal dependencies in the training data, and we can prioritize exploring larger values of $\boldsymbol{\lambda}$ and $c$; and vice versa.

On the other hand, we can integrate more dataset/task-specific statistics when constructing the parametric form of the noise's covariance matrix $\boldsymbol{\Sigma}$, and even render it tailor-made for each particular dataset/task. We believe this would better align with the underlying data distribution, although the efficiency of such a design needs to be carefully addressed. As you mentioned, different datasets may require different forms of statistics due to the distinct information available. For example, in conditional sound generation we may use statistics from the Mel-spectrogram, whereas in text-to-video generation statistics extracted from text prompts can be useful. Moreover, the derivation of these statistics must be computationally affordable. Given the significant value and challenges of this direction, we decided to leave it as future work.

W2. Some visualizations and videos would make the proposal stronger.

Thanks for your constructive feedback. We have added visualizations for our action-to-motion generation in Figure 2 of the revised paper and on an anonymous GitHub page (link: https://anonymous4peerreview.github.io/artdiff.github.io/).

Specifically, we adopt the MDM model as the backbone and present visualizations of various human action prompts, including "eat", "drink", and "run". While MDM has shown promising results in this task, it still generates a non-negligible subset of motions that lack temporal consistency. In contrast, ARTDiff is capable of generating human motions with significantly higher levels of temporal consistency most of the time. For instance:

  • Given the "drink" prompt, the motion generated by the baseline model exhibits noticeable inconsistencies, particularly in the movement of the left arm. In contrast, motion generated by ARTDiff showcases smooth and consistent movements.
  • Given the "eat" prompt, the human motion generated by ARTDiff smoothly picks up food while the baseline model generates several frames that appear irrelevant to nearby frames.
  • Given the "run" prompt, the baseline model's generated motion and posture undergo frequent changes, particularly in the last second when the person abruptly turns back and appears suddenly at a distant location from their previous position. Conversely, ours adjusts its posture and moves smoothly.

These examples are provided for illustrative purposes, and there are more samples with different prompts that further demonstrate ARTDiff's strong performance in both temporal consistency and textual alignment. We hope these visualizations can effectively highlight the capabilities of ARTDiff.

Review (Rating: 6)

The proposed ARTDiff method capitalizes on the autoregressive dependence structure inherent in temporal data, employing a Gaussian noise distribution with correlations between time frames defined by a function of their temporal difference. This approach intentionally captures temporal dependencies, bolstering consistency across generated sequences. When applied to tasks involving audio and motion generation, ARTDiff showcases its effectiveness, producing samples with markedly enhanced fidelity and realism compared to those generated by baseline diffusion models. The method's straightforwardness and efficiency highlight its practicality, positioning ARTDiff as a viable option for instilling temporal consistency in diffusion-based generative models.

Strengths

The proposed method is well-motivated.

The paper is well-structured and clearly elucidates its core ideas.

The conducted experiments adequately showcase the efficacy of the method being proposed.

Weaknesses

The noise correlation in this method bears a resemblance to that in [1]. Could you elucidate the principal distinctions?

[1] Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models

What strategies are employed to maintain temporal consistency across various clips?

In scenarios of extended long-term generation involving F frames and a given window size w, what is the length of the input sequence during the training phase?

For complex data like video, will the proposed correlation work? It seems to be a great way to generate long videos.

Questions

Please see the weakness.

Comment

References

[1] Ge, S., Nah, S., Liu, G., Poon, T., Tao, A., Catanzaro, B., Jacobs, D., Huang, J.B., Liu, M.Y., & Balaji, Y. (2023). Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (pp. 22930-22941).

[2] Feiqing Huang, Kexin Lu, Yuxi Cai, Zhen Qin, Yanwen Fang, Guangjian Tian, & Guodong Li (2023). Encoding Recurrence into Transformers. In The Eleventh International Conference on Learning Representations.

[3] Albert Gu, Karan Goel, & Christopher Re (2022). Efficiently Modeling Long Sequences with Structured State Spaces. In International Conference on Learning Representations.

[4] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, & Bryan Catanzaro (2021). DiffWave: A Versatile Diffusion Model for Audio Synthesis. In International Conference on Learning Representations.

[5] Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., & Zhang, W. (2021). Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. ArXiv.Org. https://doi.org/10.48550/arxiv.2012.07436

[6] Wu, J. Z., Ge, Y., Wang, X., Weixian Lei, Gu, Y., Shi, Y., Hsu, W., Shan, Y., Qie, X., & Mike Zheng Shou. (2023). Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation. ArXiv.Org. https://doi.org/10.48550/arxiv.2212.11565

[7] Mokady, R., Hertz, A., Aberman, K., Pritch, Y., & Cohen-Or, D. (2022). Null-text Inversion for Editing Real Images using Guided Diffusion Models. ArXiv.Org. https://doi.org/10.48550/arxiv.2211.09794

[8] Jiaming Song, Chenlin Meng, & Stefano Ermon (2021). Denoising Diffusion Implicit Models. In International Conference on Learning Representations.

[9] Wayne A. Fuller. Introduction to Statistical Time Series. Wiley Series in Probability and Statistics. J. Wiley, New York, 2nd edition, 1996. ISBN 1-282-30767-3.

[10] J.D. Cryer and K.S. Chan. Time Series Analysis: With Applications in R. Springer Texts in Statistics. Springer New York, 2008. ISBN 9780387759593. URL https://books.google.com.hk/books?id=bHke2k-QYP4C.

Comment

W4. For complex data like video, will the proposed correlation work? It seems to be a great way to generate long videos.

Thanks for your constructive feedback. Following your suggestion, we have newly added an experiment on text-to-video generation to evaluate ARTDiff, where Tune-A-Video [6] is adopted as the baseline model. We summarize the task and our implementation below; please refer to Section 3.3 in the revised manuscript and [6] for more details.

One-shot text-to-video generation: Given a video $\mathcal{V}=\{v_i \mid i \in [1,m]\}$ that has $m$ frames and its corresponding prompt $\mathcal{P}$, the goal is to generate a novel video $\mathcal{V}^*$ by an edited text prompt $\mathcal{P}^*$. For example, given a video and a prompt describing "a man is skiing", we may wish to generate a new video corresponding to the edited prompt "Spider Man is skiing".

Tune-A-Video: The Tune-A-Video method consists of two stages: fine-tuning and inference. The first stage fine-tunes a pre-trained text-to-image diffusion model with the attention blocks in the U-Net modified for spatio-temporal modeling. During inference, it first obtains a latent noise of the source video $\mathcal{V}$ via DDIM inversion (see its definition and notation in Section 3.1 of [7]):

$ z_{t+1} = \sqrt{\frac{\alpha_{t+1}}{\alpha_{t}}}z_t + \Big( \sqrt{\frac{1}{\alpha_{t+1}}-1}- \sqrt{\frac{1}{\alpha_{t}}-1}\Big)\cdot \epsilon_{\theta}(z_t, t, C) ; $

and then samples a novel video from the noise, guided by the edited prompt $\mathcal{P}^*$, via the deterministic DDIM sampling [8]:

$ z_{t-1} = \sqrt{\alpha_{t-1}} \Big(\frac{z_{t}-\sqrt{1-\alpha_{t}}\cdot \epsilon_{\theta}(z_t, t, C)}{\sqrt{\alpha_{t}}} \Big)+\sqrt{1-\alpha_{t-1}-\sigma_t^2} \cdot \epsilon_{\theta}(z_t, t, C) + \sigma_t \epsilon_t, $

with $\sigma_t=0$.

Modifications to the baseline: Similar to our previous experiments, during the fine-tuning stage we sample the noise in the forward diffusion from $N(\boldsymbol{0}, \boldsymbol{\Sigma}(\boldsymbol{\lambda}, m))$ instead of $N(\boldsymbol{0}, \boldsymbol{I})$. Note that the inference stage of Tune-A-Video originally involves no random noise but only the noise predicted by the U-Net, as it uses deterministic DDIM inversion and sampling. To obtain a temporally consistent initial noise, we replace $\epsilon_{\theta}(z_t, t, C)$ in DDIM inversion with $\epsilon'=(1-\tau)\cdot\epsilon_{\theta}(z_t, t, C) + \tau \cdot \tilde{\epsilon}$, where $\tilde{\epsilon}\sim N(\boldsymbol{0}, \boldsymbol{\Sigma}(\boldsymbol{\lambda}, m))$ and $\tau \in (0,1)$ is a tuning hyper-parameter. Besides, in DDIM sampling, we allow $\sigma_t > 0$ and let $\epsilon_t\sim N(\boldsymbol{0}, \boldsymbol{\Sigma}(\boldsymbol{\lambda}, m))$. All the other implementations follow [6].

Results

Quantitatively, we utilize three metrics to assess the quality of generated videos:

  • CLIP score: a larger value indicates better alignment between the video and the text prompt;
  • CLIP similarity: the average cosine similarity between the CLIP image embeddings of adjacent frames;
  • Pixel TCM: a smaller value indicates stronger temporal consistency at the pixel level.

We present the results in Table 3 of the revised manuscript and include them here for your easy reference. Our ARTDiff improves all the metrics, indicating its contribution to temporal consistency and textual alignment.

Table: Reported metrics for Tune-A-Video vs. Tune-A-Video + ARTDiff in the one-shot video generation task.

| Model | CLIP score | CLIP similarity | Pixel TCM |
| --- | --- | --- | --- |
| Tune-A-Video | 33.056 | 0.972 | 0.048 |
| + ARTDiff | 33.362 | 0.974 | 0.040 |

Qualitatively, we showcase the generated video samples in Figure 3 of the manuscript and on an anonymous GitHub page (link: https://anonymous4peerreview.github.io/artdiff.github.io/), along with the corresponding text prompts. It can be clearly observed that both the subject and the background in our ARTDiff-generated videos exhibit more temporally consistent behavior. We are still generating more video samples with different text prompts and source videos and will keep you updated once they are available.

We hope this additional experiment can further support the effectiveness and generality of ARTDiff.

Comment

W2. What strategies are employed to maintain temporal consistency across various clips?

Firstly, we would like to clarify that the diffusion models our paper targets, as well as all the backbone models (DiffWave and MDM) in our experiments, are non-autoregressive; i.e., in the example of video generation, the entire video is generated at once instead of clip by clip in an autoregressive way.

The proposed design for the covariance matrix of noises strengthens the temporal dependencies among frames without sabotaging the non-autoregressive generation framework of diffusion models. The correlation structure can be imposed at once by Eq.(1) of the paper. In the scenario of generating a sample $\boldsymbol{x} \in \mathbb{R}^{F}$ whose temporal dimension $F$ is large (such as $F=16\text{k}$ in our neural vocoding task), we introduce a localized design for correlations by letting $\boldsymbol{\epsilon} = (\boldsymbol{\epsilon}_1^\top,\boldsymbol{\epsilon}_2^\top,\cdots,\boldsymbol{\epsilon}_n^\top)^\top \in \mathbb{R}^F$ and

$ \boldsymbol{\epsilon}_{i+1} = \sqrt{c}\, \boldsymbol{\epsilon}_i + \sqrt{1-c}\, \tilde{\boldsymbol{\epsilon}} \quad \text{with} \quad \boldsymbol{\epsilon}_1, \tilde{\boldsymbol{\epsilon}} \sim \mathcal{N}(\boldsymbol{0},\boldsymbol{\Sigma}(\boldsymbol{\lambda},w)), $

where $\boldsymbol{\epsilon}_i \in \mathbb{R}^w$, $n$ is the number of local windows, $w$ is the window size, and $F = nw$. The noise $\boldsymbol{\epsilon}$ can be generated by consecutively sampling the components $\boldsymbol{\epsilon}_i$ or by directly drawing from a multivariate normal distribution whose covariance matrix has a closed form, as in Eq.(3) of the paper. Importantly, the generation of $\boldsymbol{x}$ remains non-autoregressive, as we only modify the noise sampling mechanism.
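For concreteness, here is a minimal NumPy sketch of this local-window sampler. It assumes $F = nw$, the exponential-decay form for $\boldsymbol{\Sigma}(\boldsymbol{\lambda}, w)$, and a fresh $\tilde{\boldsymbol{\epsilon}}$ drawn for every window; the function name and default values are illustrative assumptions rather than the paper's actual implementation.

```python
import numpy as np

def sample_local_window_noise(F, w, lam=0.9, c=0.5, seed=0):
    """Sample correlated noise of length F = n * w via the local-window recursion
    epsilon_{i+1} = sqrt(c) * epsilon_i + sqrt(1 - c) * eps_tilde."""
    rng = np.random.default_rng(seed)
    n = F // w
    d = np.abs(np.subtract.outer(np.arange(w), np.arange(w)))  # |i - j| within a window
    chol = np.linalg.cholesky(lam ** d)           # Cholesky factor of Sigma(lambda, w)
    draw = lambda: chol @ rng.standard_normal(w)  # one correlated window-sized draw
    eps = [draw()]                                # epsilon_1
    for _ in range(n - 1):
        eps.append(np.sqrt(c) * eps[-1] + np.sqrt(1 - c) * draw())
    return np.concatenate(eps)
```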

Although it is not the focus of this paper, we may extend our autoregressive-structured noise design to accommodate diffusion models whose generation is autoregressive. This involves designing a correlation structure for the noises used in the training of consecutive clips. For instance, suppose we aim to generate a sample $\boldsymbol{x}\in \mathbb{R}^F$ while each training input for an autoregressive diffusion model is $\boldsymbol{x}^{(i)}=\boldsymbol{x}_{i:i+F_0-1}\in \mathbb{R}^{F_0}$. In this case, we can first sample the noise $\boldsymbol{\epsilon}$ from a multivariate normal distribution $N(\boldsymbol{0},\boldsymbol{\Sigma}(\boldsymbol{\lambda}, F))$, where the covariance matrix may take the form of Eq.(1) in the paper. Then, for each input $\boldsymbol{x}^{(i)}$, we may extract the corresponding noise $\boldsymbol{\epsilon}^{(i)}$ from $\boldsymbol{\epsilon}$ based on the temporal location $i$. In this way, the correlations across various clips can be maintained, enabling temporally consistent generation.

W3. In scenarios of extended long-term generation involving F frames and a given window size w, what is the length of the input sequence during the training phase?

Since the diffusion models that are of interest to our paper and all the backbone models (DiffWave and MDM) are non-autoregressive, the length of the input sequence during the training phase is the same as the length of the generation target ($F$), irrespective of the given window size ($w$), in all of our experiments.

The only exception lies in the neural vocoding task of audio generation, following the original DiffWave paper [4]. For this conditional generation task, the input sequence length during training is $F_{\text{train}}=16\text{k}$ for each batch, which is only part of the complete audio clip; during inference, the length is extended to that of a complete audio clip at a sampling rate of 22.05 kHz, making $F_{\text{inference}}$ range from ${\sim}20\text{k}$ to ${\sim}200\text{k}$. In other words, the diffusion model trained with only part of the complete audio clip is used to generate the full-length audio.

It is noteworthy that the window size $w$ is irrelevant to determining the input sequence length of the diffusion models; it only decides the correlation structure of the noises, as shown in Eq.(3) of the paper. The concept of a window in this context differs from that in time series forecasting with a rolling window, as discussed in [5].

Comment

Thank you very much for your acknowledgment of our work's quality and for your valuable suggestions. We sincerely appreciate your time in reading the paper, and our point-by-point responses to your comments are given below.

W1. The noise correlation in this method bears a resemblance to that in [1]. Could you elucidate the principal distinctions? [1] Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models

Firstly, we apologize for not being aware of [1] at the time of our submission to ICLR, as the publication date of their paper was very close to the conference deadline. We have now rectified this oversight by incorporating [1] as a related work in Section 1.1 of the revised manuscript.

In short, while both [1] and our work explore the noise design in diffusion models, our ARTDiff can provide richer patterns for temporal correlations through a local-window design for large temporal dimensions and the newly added exponentially damped cosine/sine wave behavior. Besides, in terms of the design principle, we emphasize offering a light yet flexible solution to ensure temporal consistency. Thus, our approach leans towards seamlessly integrating the autoregressive (AR) models' temporal patterns into the parametric form of $\boldsymbol{\Sigma}$, which allows a more versatile and efficient design.

Richer temporal correlation patterns

  1. Our design of the parametric covariance matrix $\boldsymbol{\Sigma}$ allows more autoregressive correlation patterns among the frames compared to Eq.(4) in [1]. [1] provides only a single autoregressive structure, in which the influence of the previous frame on the current frame follows a strictly exponential decay pattern as their temporal distance increases; this is essentially an AR(1) structure in time series. In contrast, there are three types of temporal decay patterns in the literature [2, 3, 9, 10]. Therefore, we offer three choices for the element-wise parametric form $f_{|i-j|}(\boldsymbol{\lambda})$ for $\boldsymbol{\Sigma}_{i,j}$, the $(i,j)$-th element of $\boldsymbol{\Sigma}$. The first option is to let $f_{|i-j|}(\boldsymbol{\lambda})=f_{|i-j|}(\lambda)=\lambda^{|i-j|}$ with $|\lambda|\in(0,1)$; the other two options set $\boldsymbol{\lambda}=(\gamma, \theta)$ with $\gamma \in (0,1)$ and $\theta \in (-\pi/2, \pi/2)$, and let $f_{|i-j|}(\boldsymbol{\lambda})=f_{|i-j|}(\gamma, \theta)=\gamma^{|i-j|}\cos(|i-j|\theta)$ or $\gamma^{|i-j|}\sin(|i-j|\theta)$ (see the sketch after this list). The latter choices are newly added, and they introduce correlations that exhibit a damped cosine or sine wave behavior, which the noise form proposed in [1] cannot achieve.
  2. To accommodate the generation of samples with a very large temporal dimension, we have also provided a localized approach that [1] did not consider. Our approach first introduces correlations only within local windows and then bridges different windows with an AR(1) structure. Though appearing in similar forms, Eq.(2) of our paper is identical to Eq.(4) of [1] only when the window size $w=1$, which is highly unlikely as it deviates from our aim of employing such a localized approach. When $w>1$, the covariance matrix of the combined noise $\boldsymbol{\epsilon}$ follows the parametric form shown in Eq.(3). The correlations within each window resemble $\boldsymbol{\Sigma}(\boldsymbol{\lambda}, w)$, with the overall correlations across windows controlled by the hyper-parameter $c$. This design carefully handles both short- and long-term temporal dependencies, which a simple AR(1) structure as in [1] is incapable of.
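To illustrate item 1 above, a minimal NumPy sketch of assembling $\boldsymbol{\Sigma}(\boldsymbol{\lambda}, m)$ under the exponential-decay and damped-cosine choices of $f_{|i-j|}$ is given below; the function name and default values are our assumptions, and the damped-sine variant is omitted since its exact diagonal treatment follows the paper's Eq.(1), which is not reproduced here.

```python
import numpy as np

def build_covariance(m, form="exp", lam=0.9, gamma=0.9, theta=0.3):
    """Assemble Sigma(lambda, m) element-wise from f_{|i-j|}."""
    d = np.abs(np.subtract.outer(np.arange(m), np.arange(m)))  # |i - j|
    if form == "exp":   # f_{|i-j|}(lambda) = lambda ** |i - j|
        return lam ** d
    if form == "cos":   # f_{|i-j|}(gamma, theta) = gamma**|i-j| * cos(|i-j| * theta)
        return (gamma ** d) * np.cos(d * theta)
    raise ValueError("unknown form")

# Example: one temporally correlated noise vector for m = 16 frames.
sigma = build_covariance(16, form="exp", lam=0.9)
eps = np.linalg.cholesky(sigma) @ np.random.default_rng(0).standard_normal(16)
```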

A light yet flexible solution to ensure temporal consistency

From the perspective of the design principle, our objective is to offer a light yet flexible way to enhance temporal consistency for diffusion models. Therefore, rather than directly applying AR models to the noises, we prefer to seamlessly integrate their correlation patterns into the parametric form of $\boldsymbol{\Sigma}$. Note that in the context of sequence modeling, [2] and [3] respectively find that the temporal dynamics of recurrent neural networks and state space models can both be characterized by two fundamental patterns: exponential decays and damped sine or cosine waves. This inspires us to propose different correlation structures for $\boldsymbol{\Sigma}$ when the temporal dimension is moderate. Such a design can offer greater versatility and efficiency than [1].

Comment

Dear Reviewer uW7w,

As the rebuttal period is drawing to a close, we would like to know if you have any remaining concerns regarding our revised manuscript or the additional experiments. Your feedback is highly valuable to us, and we want to ensure that all relevant points are addressed before the final evaluation. Thanks for your time and effort again!

Comment

Appreciate the response. All my concerns have been addressed and I lean to accept.

Comment

Dear Reviewer uW7w,

We sincerely appreciate your valuable comments and feedback, which have significantly contributed to enhancing the quality of our work. Thank you once again for your time and effort in reviewing our paper.

Review (Rating: 6)

Diffusion models have been applied to sequential data, including audio, video, and motion. Temporal consistency is crucial in sequential data. This paper introduces the AutoRegressive Temporal Diffusion (ARTDiff) method to address temporal consistency in diffusion models. By incorporating a Gaussian distribution with temporal correlations based on time differences, ARTDiff captures temporal dependencies effectively. The effectiveness of ARTDiff is evaluated in audio and motion generation tasks.

Strengths

  • Clear and easy-to-follow writing.
  • The paper addresses an important topic as temporal consistency is crucial in the rapidly evolving field of diffusion models applied to audio and video data.
  • The method's effectiveness is demonstrated in audio and motion generation tasks.

Weaknesses

  • The proposed method is relatively simple and lacks novelty.
  • The experimental section is limited, comparing the method only on some uncommon tasks. Tasks like Text2Video might be more practically relevant.

Questions

As indicated above, despite some limitations in experiments and novelty, the paper has a good foundation and satisfactory completeness. While I'm not deeply familiar with audio tasks, I currently lean toward a weak acceptance considering the paper's decent starting point and overall quality.

Comment

Thank you very much for your acknowledgment of our work's quality and for your valuable suggestions. We sincerely appreciate your time in reading the paper, and our point-by-point responses to your comments are given below.

W1. The experimental section is limited, comparing the method only on some uncommon tasks. Tasks like Text2Video might be more practically relevant.

Thanks for your constructive feedback. Following your suggestion, we have newly added an experiment on text-to-video generation to evaluate ARTDiff, where Tune-A-Video [1] is adopted as the baseline model. We summarize the task and our implementation below; please refer to Section 3.3 in the revised manuscript and [1] for more details.

One-shot text-to-video generation: Given a video $\mathcal{V}=\{v_i \mid i \in [1,m]\}$ that has $m$ frames and its corresponding prompt $\mathcal{P}$, the goal is to generate a novel video $\mathcal{V}^*$ by an edited text prompt $\mathcal{P}^*$. For example, given a video and a prompt describing "a man is skiing", we may wish to generate a new video corresponding to the edited prompt "Spider Man is skiing".

Tune-A-Video: The Tune-A-Video method consists of two stages: fine-tuning and inference. The first stage fine-tunes a pre-trained text-to-image diffusion model with the attention blocks in the U-Net modified for spatio-temporal modeling. During inference, it first obtains a latent noise of the source video $\mathcal{V}$ via DDIM inversion (see its definition and notation in Section 3.1 of [2]):

$ z_{t+1} = \sqrt{\frac{\alpha_{t+1}}{\alpha_{t}}}z_t + \Big( \sqrt{\frac{1}{\alpha_{t+1}}-1}- \sqrt{\frac{1}{\alpha_{t}}-1}\Big)\cdot \epsilon_{\theta}(z_t, t, C) ; $

and then samples a novel video from the noise, guided by the edited prompt $\mathcal{P}^*$, via the deterministic DDIM sampling [3]:

$ z_{t-1} = \sqrt{\alpha_{t-1}} \Big(\frac{z_{t}-\sqrt{1-\alpha_{t}}\cdot \epsilon_{\theta}(z_t, t, C)}{\sqrt{\alpha_{t}}} \Big)+\sqrt{1-\alpha_{t-1}-\sigma_t^2} \cdot \epsilon_{\theta}(z_t, t, C) + \sigma_t \epsilon_t, $

with $\sigma_t=0$.

Modifications to the baseline: Similar to our previous experiments, during the fine-tuning stage we sample the noise in the forward diffusion from $N(\boldsymbol{0}, \boldsymbol{\Sigma}(\boldsymbol{\lambda}, m))$ instead of $N(\boldsymbol{0}, \boldsymbol{I})$. Note that the inference stage of Tune-A-Video originally involves no random noise but only the noise predicted by the U-Net, as it uses deterministic DDIM inversion and sampling. To obtain a temporally consistent initial noise, we replace $\epsilon_{\theta}(z_t, t, C)$ in DDIM inversion with $\epsilon'=(1-\tau)\cdot\epsilon_{\theta}(z_t, t, C) + \tau \cdot \tilde{\epsilon}$, where $\tilde{\epsilon}\sim N(\boldsymbol{0}, \boldsymbol{\Sigma}(\boldsymbol{\lambda}, m))$ and $\tau \in (0,1)$ is a tuning hyper-parameter. Besides, in DDIM sampling, we allow $\sigma_t > 0$ and let $\epsilon_t\sim N(\boldsymbol{0}, \boldsymbol{\Sigma}(\boldsymbol{\lambda}, m))$. All the other implementations follow [1].

Results

Quantitatively, we utilize three metrics to assess the quality of generated videos:

  • CLIP score: a larger value indicates better alignment between the video and the text prompt;
  • CLIP similarity: the average cosine similarity between the CLIP image embeddings of adjacent frames;
  • Pixel TCM: a smaller value indicates stronger temporal consistency at the pixel level.

We present the results in Table 3 of the revised manuscript and include them here for your easy reference. Our ARTDiff improves all the metrics, indicating its contribution to temporal consistency and textual alignment.

Table: Reported metrics for Tune-A-Video vs. Tune-A-Video + ARTDiff in the one-shot video generation task.

| Model | CLIP score | CLIP similarity | Pixel TCM |
| --- | --- | --- | --- |
| Tune-A-Video | 33.056 | 0.972 | 0.048 |
| + ARTDiff | 33.362 | 0.974 | 0.040 |

Qualitatively, we showcase the generated video samples in Figure 3 of the manuscript and on an anonymous GitHub page (link: https://anonymous4peerreview.github.io/artdiff.github.io/), along with the corresponding text prompts. It can be clearly observed that both the subject and the background in our ARTDiff-generated videos exhibit more temporally consistent behavior. We are still generating more video samples with different text prompts and source videos and will keep you updated once they are available.

We hope this additional experiment can further support the effectiveness and generality of ARTDiff.

Comment

References

[1] Wu, J. Z., Ge, Y., Wang, X., Weixian Lei, Gu, Y., Shi, Y., Hsu, W., Shan, Y., Qie, X., & Mike Zheng Shou. (2023). Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation. ArXiv.Org. https://doi.org/10.48550/arxiv.2212.11565

[2] Mokady, R., Hertz, A., Aberman, K., Pritch, Y., & Cohen-Or, D. (2022). Null-text Inversion for Editing Real Images using Guided Diffusion Models. ArXiv.Org. https://doi.org/10.48550/arxiv.2211.09794

[3] Jiaming Song, Chenlin Meng, & Stefano Ermon (2021). Denoising Diffusion Implicit Models. In International Conference on Learning Representations.

Comment

Dear Reviewer kucM,

As the rebuttal period is drawing to a close, we would like to know if you have any remaining concerns regarding our revised manuscript or the additional experiments. Your feedback is highly valuable to us, and we want to ensure that all relevant points are addressed before the final evaluation. Thanks for your time and effort again!

AC Meta-Review

This paper proposes Autoregressive Temporal Diffusion (ARTDiff), a technique for modeling temporal consistency in videos. The idea is to use a Gaussian noise prior with temporal correlations so as to induce temporal consistency in the diffusion process. Such an idea is intuitive and interesting to study.

One of my biggest concerns is that the experimental section is very weak. The authors only have DiffWave vs. DiffWave + ARTDiff in the audio generation results, MDM vs. MDM + ARTDiff in motion generation, and Tune-A-Video vs. Tune-A-Video + ARTDiff in video generation. No other baseline comparisons are included. The quantitative results on motion generation are very incremental (as pointed out by some reviewers), and the results on video generation are extremely incremental as well. In the rebuttal, the authors say that even though the quantitative results are not good, the results are qualitatively good, and they provide an anonymous link with three results for motion generation and four results for video generation. With so few results, it is really not possible to tell whether the method improves things or not.

The authors also mention in the rebuttal that "our hyper-parameter tuning is insufficient because of the computational constraints". This might be one of the reasons why the quantitative performance is poor. This really makes the paper incomplete in its current state. I would encourage the authors to properly perform the experiments and refine the paper.

For a top-tier conference like ICLR, I think the experimental results need to be strong. Proper analysis should be performed to understand different components of the method. Since the experimental section is too weak for this paper, I vote for rejecting the paper.

Why not a higher score

The experimental section is really weak. The performance improvement shown in motion generation and video generation is very incremental.

Why not a lower score

N/A

Final Decision

Reject