PaperHub
NeurIPS 2024 · Poster · 4 reviewers
Overall rating: 6.0 / 10 (individual ratings 7, 7, 5, 5; min 5, max 7, std 1.0)
Confidence: 3.8 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 3.0

Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2025-01-05
TL;DR

We design a video-to-audio generation model with higher quality and fewer sampling steps.

Abstract

Keywords

video-to-audio generation · rectified flow model · efficient generation

Reviews and Discussion

Review (Rating: 7)

They propose FRIEREN, an efficient video-to-audio generation model based on rectified flow matching that obtains state-of-the-art results.

Strengths

  • They successfully apply rectified flow matching to video-to-audio generation, an important problem in the current generative AI landscape, where most video generative models produce video without audio.
  • They run a perceptual study.
  • The paper (especially Section 3) is very well written and clear.
  • I appreciate the examples on the demo webpage, as well as the selection of examples, which does not feel cherry-picked.

Weaknesses

The introduction lacks scientific rigor.

  • line 36: "leave room for further advancement". This is a general statement, can you be more specific?
  • line 36: "autoregressive models lack the ability to align the generated audio with the video explicitly". This is not true, because AudioLM and MusicLM are autoregressive models that use explicit semantic tokens to capture structure similar to the conditioning in video-to-audio.

I could not find how you compute alignment accuracy.

Minor comment related to scientific writing:

  • line 18: "revolutionary enhancements". It feels like marketing and this is a scientific paper.

Questions

  • Do you plan to release the code?
  • How do you compute alignment accuracy?
  • Why not use CLIP for the visual representation?

Limitations

  • 16kHz and short-form videos.
  • No code/weights provided.
Author Response

We highly appreciate your positive appraisal of our work and would like to discuss the issues you raised here.

[Computing alignment accuracy]

As stated in the Metrics part of Section 4.1, we adopt the alignment classifier provided by Diff-Foley [1] to calculate alignment accuracy. Specifically, we convert the generated audio to a 128-bin mel-spectrogram and feed it to the classifier along with the CAVP features. We will emphasize these details in the revised version.
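For illustration, a minimal sketch of this computation in Python; the classifier call signature, feature shapes, and mel parameters other than the 128 bins are assumptions for the sketch, not the actual Diff-Foley interface:

```python
import torchaudio

def alignment_accuracy(waveforms, cavp_feats, classifier, sr=16000):
    """Fraction of generated clips that the alignment classifier labels as
    synchronized with their conditioning video (illustrative sketch)."""
    mel_fn = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_mels=128)
    n_correct = 0
    for wav, cavp in zip(waveforms, cavp_feats):
        mel = mel_fn(wav).clamp(min=1e-5).log()                   # (128, T) log-mel
        logits = classifier(mel.unsqueeze(0), cavp.unsqueeze(0))  # assumed signature
        n_correct += int(logits.argmax(-1).item() == 1)           # class 1 = "aligned"
    return n_correct / len(waveforms)
```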

[Alignment learning ability of autoregressive models]

Thanks for pointing this out. We agree that recent large-scale autoregressive models like AudioLM and MusicLM have shown a strong ability to learn alignment between different modalities with self-attention and in-context learning. However, the good performance of these models often relies on large-scale transformer decoders and substantial amounts of training data. Early autoregressive baselines on VGGSound perform poorly in terms of temporal synchrony (see Table 1 in our paper), which may be due to limited model capacity and training data volume. We will modify our statements in the revised version of our paper.

[Other wording and scientific rigor issues]

Thank you for pointing out the issues in our writing; we will revise our wording accordingly. For example, we may change line 18 to "... significantly enhanced the quality and diversity ..." and line 36 to "... but still has a gap in quality compared to state-of-the-art text-to-audio models and real-world audio."

[Selection of visual features]

In previous work [1], CLIP has been shown to be insufficiently effective for generating temporally aligned audio. In this work, we mainly use CAVP for a fair comparison with [1] and also attempt to find a better visual representation for video-to-audio generation (taking MAViL as an example).

[Sampling rate and duration issues]

Following most previous audio generation models, we adopt 16kHz audio. This could be improved by using a spectrogram representation, VAE, and vocoder with a higher sampling rate, which should not be a major obstacle. On the other hand, most publicly available off-the-shelf video-audio datasets, such as VGGSound and AudioSet, consist of short video clips, which restricts extending the generation length. We may delve into these issues in future work.

[Plan for releasing code and weights]

Thanks for your interest. We plan to release our code and weights on GitHub in the coming weeks, regardless of whether the paper is accepted.


Once again, thank you for your effort in reviewing our work and your acknowledgment. We welcome further discussion with you.


[1] Simian Luo, Chuanhao Yan, Chenxu Hu, and Hang Zhao. Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models. Advances in Neural Information Processing Systems, 36, 2024.

Review (Rating: 7)

This paper presents a new model for video-to-audio generation. The proposed model is based on the rectified flow formulation and adopts a Transformer-based architecture. The conditioning video is fed into the model via channel-level concatenation with the audio tokens after being processed by a length regulator. After training, the model is further fine-tuned with reflow and distillation, which are conducted with synthetic data generated by the initially trained model using classifier-free guidance. The experimental results demonstrate that the proposed model outperforms existing models by a large margin both quantitatively and qualitatively.
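As a reading aid, here is a minimal sketch of the conditioning path described above; the shapes, frame rates, and the nearest-neighbor length regulator are illustrative assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def length_regulator(video_feat: torch.Tensor, target_len: int) -> torch.Tensor:
    """Stretch per-frame video features (B, T_v, C_v) to target_len audio
    frames by nearest-neighbor interpolation along the time axis."""
    x = video_feat.transpose(1, 2)                       # (B, C_v, T_v)
    x = F.interpolate(x, size=target_len, mode="nearest")
    return x.transpose(1, 2)                             # (B, target_len, C_v)

# Hypothetical sizes: an 8 s clip with 4 FPS visual features and a 500-frame
# latent audio sequence.
video_feat = torch.randn(2, 32, 768)    # (B, T_v, C_v), per-frame video features
audio_latent = torch.randn(2, 500, 20)  # (B, T_a, C_a), noisy audio latent
cond = length_regulator(video_feat, audio_latent.size(1))
model_input = torch.cat([audio_latent, cond], dim=-1)    # channel-level concat
print(model_input.shape)                                 # torch.Size([2, 500, 788])
```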

Strengths

  • The design of the proposed model is simple and reasonable. The proposed model is based on a Transformer, and the video condition is fed into the model via channel-level concatenation with the audio tokens after adjusting the number of tokens. This design should be beneficial for boosting temporal alignment, as it explicitly utilizes the temporal correspondence between the conditioning video and the generated audio.

  • In the experiments, the proposed method outperforms the other existing methods by a large margin. I have checked the generated examples on the website, and they are really amazing.

  • The proposed model is quite lightweight, and it is great that it can be trained with only two GPUs. In addition, inference is substantially faster thanks to the reflow and distillation as well as the lightweight design.

  • The manuscript is well-written and easy to follow.

Weaknesses

  • The experiments have only been conducted with one dataset, which is VGGSound. Training or zero-shot evaluation with other datasets (such as the Landscape dataset) would be beneficial to validate the generalization capability of the proposed method.

  • The empirical analysis on why the proposed method performs well seems insufficient. According to the results shown in Table 2, DDPM with the proposed model architecture already achieves substantially better performance than the existing methods. Thus, it appears that the model architecture, rather than the usage of the rectified flow, is the key to the impressive performance. As far as I understand, its major differences from the standard Transformer are two-fold: channel-level concatenation for the conditional inputs instead of sequence-level concatenation (or a cross-attention mechanism as in [23]), and the usage of 1D-conv instead of 2D-conv. It would be great if this paper could provide an empirical analysis on which component actually boosts the performance for video-to-audio generation. The current manuscript places significant emphasis on the rectified flow aspect, which is not particularly novel as the proposed model largely follows the settings of previous works.

Questions

  • Is there any particular challenge (and its solution) when applying rectified flows for audio generation?

  • Minor questions:

    • Is CFG also applied for the reflowed models? I understand that it is applied when generating the training data for the reflow process but cannot find how it is set during the inference phase.

<After the rebuttal>

I updated my rating from 5 to 7.

Limitations

Limitations have been discussed in the appendix.

Author Response

We are highly grateful for your positive appraisal of our work, and we would like to discuss the issues you raised here.

[Generalization experiments on the Landscape dataset]

Following your advice, we conduct experiments on the Landscape dataset to investigate the generalization capability of our model. We compare our zero-shot performance with Diff-Foley. We also fine-tune our model on Landscape for about 4k steps (about 268 epochs). The results are shown in the following table.

| Model      | Mode      | FD↓   | IS↑  | KL↓  | FAD↓ | KID ×10⁻³ |
|------------|-----------|-------|------|------|------|-----------|
| Diff-Foley | zero-shot | 76.98 | 2.96 | 4.16 | 9.70 | 41.50     |
| Frieren    | zero-shot | 34.87 | 4.15 | 4.12 | 2.64 | 12.29     |
| Frieren    | finetuned | 30.38 | 3.74 | 6.12 | 1.94 | 12.28     |

It can be seen that Frieren in the zero-shot setting significantly outperforms Diff-Foley on multiple metrics. On the other hand, we observe that fine-tuning improves FD and FAD by 4.49 and 0.70, respectively. However, it also leads to degradations of 0.41 and 2.00 in IS and KL, respectively. Due to the limited size of the Landscape dataset, there may be a distribution gap between the training and testing splits. Fine-tuning could lead to a degree of overfitting on the training set, resulting in a decline in certain metrics.

[More empirical analysis on model performance]

First, we would like to emphasize that both the transformer architecture and the rectified flow (RF) modeling method contribute to the model performance (please refer to our response to reviewer MSJt for details and additional results). Our rectified flow model brings improvements in IS, FAD, and Acc while enabling generation with fewer or even a single step. Besides, adopting a better ODE solver can further improve the performance of RF and increase its performance gap with DDPM.
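To make the sampling side concrete, here is a minimal sketch of few-step Euler integration of a rectified flow vector-field estimator with classifier-free guidance; the model interface and the guidance scale are illustrative assumptions, not the paper's exact settings:

```python
import torch

@torch.no_grad()
def rf_euler_sample(v_model, cond, shape, steps=25, cfg_scale=4.0):
    """Integrate dz/dt = v(z, t, cond) from t=0 (noise) to t=1 (data) with a
    uniform Euler scheme; steps=1 corresponds to one-step generation."""
    z = torch.randn(shape)                       # z_0 ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        v_cond = v_model(z, t, cond)             # conditional vector field
        v_uncond = v_model(z, t, None)           # unconditional vector field
        v = v_uncond + cfg_scale * (v_cond - v_uncond)   # CFG-corrected field
        z = z + dt * v                           # Euler update
    return z                                     # approximate sample z_1
```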

Second, we have conducted experiments with sequence-level concatenation for the conditional inputs (similar to cross-attention essentially, as the alignment is learned by attention). However, this model fails to generate meaningful audio, and the metrics are unacceptably bad as shown in the table below.

| Cond Mechanism        | FD↓   | IS↑   | KL↓   | FAD↓  | KID ×10⁻³ | ACC↑  |
|-----------------------|-------|-------|-------|-------|-----------|-------|
| Channel-Level Concat  | 12.25 | 12.42 | 2.73  | 1.32  | 2.49      | 97.22 |
| Sequence-Level Concat | 83.92 | 1.62  | 22.16 | 12.31 | 41.63     | 28.91 |

We also provide two pairs of results of sequence-level and channel-level concatenation in the PDF. It can be seen that the sequence-level model tends to generate flat, monotonous, and meaningless audio. However, the frequency bands where energy is concentrated are similar in the results of the two models, with the bright lines on the spectrograms being of similar height. We speculate that this indicates that the sequence-level model can extract semantic information from the conditional inputs, but fails to learn temporal alignment through attention. Adopting different positional embeddings does not help. This is somewhat surprising, as cross-attention shows a basic alignment learning ability in the baseline diffusion model, but it simply does not work in our architecture; this may be related to model size and capacity. In any case, these results illustrate the necessity of channel-level feature fusion.
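To make the two fusion strategies concrete, a small shape-level sketch (all dimensions are hypothetical):

```python
import torch

B, T, C_a, C_v = 2, 500, 20, 64
audio = torch.randn(B, T, C_a)   # noisy audio latent frames
video = torch.randn(B, T, C_v)   # video features, already length-regulated

# Channel-level concat: frame i of the condition is fused with frame i of the
# audio, so the temporal correspondence is hard-wired into the input.
x_channel = torch.cat([audio, video], dim=-1)          # (B, T, C_a + C_v)

# Sequence-level concat: condition tokens are appended along the time axis and
# the model must recover the alignment purely through attention.
proj = torch.nn.Linear(C_v, C_a)
x_sequence = torch.cat([audio, proj(video)], dim=1)    # (B, 2T, C_a)
```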

Last, the design of our transformer block with 1D convolution derives from a diffusion-based text-to-audio generation model [1], where it is shown to perform better than a 2D-convolution-based U-Net on audio generation and to generalize better to audio of longer and variable lengths. Due to the limited response time, we have not yet managed to implement and train a 2D version of the model to examine the performance gap. We plan to add these results in the final revised version of our paper.

[Challenges in applying rectified flows for audio generation]

Compared to other tasks, we think that applying RF in video-to-audio generation faces the following challenges:

  1. Our method shares some similarities with RF-based TTS models like VoiceFlow [2]. However, compared to the strong and highly deterministic content condition (text) in TTS, V2A has a weaker condition and its performance relies more on guidance. Unlike RF TTS models, we found that using the CFG-corrected vector field as the regression target during the reflow stage is crucial for model performance, rather than using the same vector field as in previous RF models (see Section 3.5).

  2. As stated above, compared to text-conditioned generation such as T2I and T2A, it turns out that the attention-based conditioning mechanism fails to provide precise semantic and temporal alignment information for V2A. Hence, we propose channel-level feature fusion used with the feed-forward transformer architecture.

[CFG in reflow]

Yes. As shown in equations (8) and (9), we use the CFG-corrected vector field as the regression target during reflow and distillation, where the CFG scale is the same as that used for generating the reflow data. We also use the same CFG scale when sampling with the reflowed model.


Once again, thank you for your effort in reviewing our work and your acknowledgment. We hope our clarifications address your concerns, and we always welcome further discussion with you.


[1] Jiawei Huang, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, Chen Zhang, Jinglin Liu, Xiang Yin, Zejun Ma, and Zhou Zhao. Make-an-audio 2: Temporal-enhanced text-to-audio generation. arXiv preprint arXiv:2305.18474, 2023.

[2] Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen, and Kai Yu. Voiceflow: Efficient text-to-speech with rectified flow matching. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 11121–11125. IEEE, 2024.

Comment

Thanks for the response and additional experimental results. I have read them as well as the other reviews. The additional experimental results clarify the advantage of the proposed method as well as which component contributes to the performance gain. As my concerns have been properly addressed in the rebuttal, I would like to update my rating from 5 to 7.

Review (Rating: 5)

The following work proposes a video-to-audio generation model. The model architecture closely follows that of the prior work Diff-Foley, which operates on 4 frames per second, fits a temporally aligned latent space between audio and video content, and then trains a latent diffusion model to map from this latent space to audio. This work proposes to replace the latent diffusion model architecture with a transformer-based rectified flow model, and opts to use the MAViL audio-video joint latent representation instead of the one from Diff-Foley (CAVP). Results are qualitatively much better than those of Diff-Foley, and also faster to sample from due to the rectified flow matching formulation.

优点

  • Qualitative results significantly improve over prior work
  • Decent ablation studies over critical architectural design choices, such as CAVP vs. MAViL and loss reweighting for training flow matching models.

Weaknesses

  • Despite improvements over prior works such as Diff-Foley, the contributions of this work remain limited. The time-aligned audio generation appears to stem from architectural choices made in Diff-Foley.
  • Furthermore, conditional-optimal-transport flow-matching generative models have been applied to audio models with similar conclusions. The specific application to the video-to-audio task is, in my opinion, not sufficiently different from prior applications in audio for the findings in this work to be particularly new. Specifically, it should be considered very closely related to other temporally aligned conditional generation tasks such as text-to-speech.
  • It's also worth noting that Diff-Foley uses a very simple Griffin-Lim to map predicted spectrograms to audio waveforms, whereas this work makes use of the much more effective BigVGAN model. This makes it very difficult to pinpoint the qualitative improvements of the proposed work compared to prior methods.

Questions

  • I'm curious how the authors were able to try MAViL given that the code for this project does not appear to be publicly available?

Limitations

Yes

Author Response

Thanks for your valuable comments. We would like to offer some clarification and discussion of the issues you raised.

[Difference with Diff-Foley in architecture]

We'd like to clarify that our model significantly differs from Diff-Foley in both architecture and alignment mechanisms. Diff-Foley adopts a U-Net denoiser with a cross-attention-based conditional mechanism. As stated in sections 1 and 4.2, cross-attention alone struggles to achieve precise audio-visual alignment, and therefore Diff-Foley relies on an additional classifier for guidance, which is complicated and unstable with fewer steps. In contrast, we adopt a transformer vector field estimator with channel-level cross-modal feature fusion, achieving higher synchrony and robustness with simpler architecture.

[Difference with other flow-matching-based audio models and novelty issues]

We agree that our model shares some similarities with previous flow-matching-based audio models. Nevertheless, we have delved deeper into certain aspects compared to previous speech models like VoiceFlow [1].

  1. Compared to the strong and highly deterministic content condition (text) in TTS, V2A has a weaker condition and its performance relies more on guidance. We found that using the CFG-corrected vector field as the regression target during the reflow stage is crucial for model performance, rather than using the same vector field in both initial training and reflow as in previous rectified flow models (see section 3.5).

  2. On top of reflow, we further investigate one-step distillation, which further improves single-step performance and which previous flow-matching-based audio models did not address. We also investigate techniques like objective reweighting for further performance improvement.

[Effect of vocoder on qualitative results]

We agree that the vocoder significantly impacts audio fidelity and objective metrics. Our goal is to build an integrated system for video-to-audio generation with higher quality and generation efficiency, and we therefore replace the slow and low-quality Griffin-Lim with BigVGAN. For reference, we provide the results of Frieren with Griffin-Lim as the vocoder in the following table. The number of Griffin-Lim iterations is the same as in Diff-Foley.

It can be seen that despite the performance drop, Frieren still surpasses Diff-Foley in KL, FAD, and ACC, with FAD showing a significant advantage while maintaining competitive FD and IS values.

| Model                 | FD↓   | IS↑   | KL↓  | FAD↓ | ACC↑  |
|-----------------------|-------|-------|------|------|-------|
| Diff-Foley (w/ CG)    | 23.94 | 11.11 | 3.28 | 4.72 | 95.03 |
| Diff-Foley (w/o CG)   | 24.97 | 11.69 | 3.23 | 7.10 | 92.53 |
| Frieren (Griffin-Lim) | 28.29 | 10.67 | 3.17 | 3.70 | 95.22 |

Moreover, due to the limitations of objective metrics, we highly recommend you refer to our demo page (https://frieren-v2a.github.io/). It can be observed that in addition to audio fidelity, the samples from our model exhibit better semantic content and more precise temporal alignment compared to Diff-Foley, demonstrating the qualitative advantages of our rectified-flow-based model.

It's also worth mentioning that beyond audio quality, Frieren achieves a generation speed 7.3× that of Diff-Foley (see Table 8 in the paper, considering only spectrogram generation), demonstrating a significant advantage in terms of generation efficiency.

[Source of MAViL]

We adopt the MAViL implementation and checkpoints from a publicly available audio-visual representation benchmark project (AV-SUPERB, https://github.com/roger-tseng/av-superb). We made a slight modification to the model input so that it takes 4 FPS video rather than 2 FPS. To avoid any concern about violating anonymity policies by including this link in our rebuttal, we state that there is no overlap or connection between the authors of that project and the authors of this paper.


We hope our clarifications address your concerns and we are looking forward to your re-assessment of our work. We also welcome further discussion with you. Thank you again for your efforts.


[1] Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen, and Kai Yu. Voiceflow: Efficient text-to-speech with rectified flow matching. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 11121–11125. IEEE, 2024.

Comment

We'd like to offer an additional perspective to assess the impact of vocoders on model performance. We use BigVGAN, rather than Griffin-Lim, as the vocoder for both Diff-Foley and Frieren. The output from Diff-Foley is converted into an 80-bin mel-spectrogram and then fed into BigVGAN. The results are shown in the following table.

| Model              | Vocoder | FD↓   | IS↑   | KL↓  | FAD↓ | KID ×10⁻³ |
|--------------------|---------|-------|-------|------|------|-----------|
| Diff-Foley (w/ CG) | BigVGAN | 18.02 | 10.89 | 2.88 | 6.32 | 5.32      |
| Frieren            | BigVGAN | 12.25 | 12.42 | 2.73 | 1.32 | 2.49      |

First, using BigVGAN for Diff-Foley improves its FD, KL, and KID, indicating that BigVGAN effectively improves its audio quality. On this basis, Frieren outperforms Diff-Foley across all metrics, with a greater margin than when using Griffin-Lim for both. This further demonstrates that our model is significantly superior to Diff-Foley. In contrast, Griffin-Lim is too weak, forming a performance bottleneck that narrows the gap between Frieren and Diff-Foley.

Comment

Dear Reviewer,

As the end of the discussion period approaches, we are eager to get your feedback. We have tried our best to resolve your concerns and clarify misunderstandings. We would be grateful to hear your feedback regarding our answers to the reviews.

Best Regards, Authors

Comment

Dear Authors, I appreciate the additional information regarding guided reflow matching and the additional vocoder ablations. I have also previously gone through the qualitative samples and don't really have any doubts regarding the qualitative improvements from this work. I'm leaning towards a higher rating but would prefer to discuss with other reviewers during the final discussion phase first.

Review (Rating: 5)

This paper proposes a diffusion-style model based on rectified flow matching. In addition, to improve audio quality, the authors propose a re-weighting objective. The method achieves state-of-the-art results on the V2A benchmark.

Strengths

  • The proposed method is the first to leverage rectified flow matching on video-to-audio generation tasks.

  • The quantitative and qualitative results demonstrate its superiority compared with existing baselines.

Weaknesses

  • Although FRIEREN shows impressive results, the competing methods (i.e., Diff-Foley), based on U-Net-style diffusion models, are relatively weak. The performance gain seems to come mostly from the transformer architecture.

  • Following the previous point, DDPM shows fairly similar results as the number of steps increases, which makes the proposed method less compelling. Thus, it would be great to show results with more steps.

  • The proposed reflow procedure does not consistently improve FAD across different numbers of steps, which seems unreasonable.

Overall, the results are good. If the authors can address the questions and provide more insight (compared to simply adapting reflow to V2A as in speech models), that would make the paper more convincing.

Questions

  • Do the authors use any pretrained initialization for the transformer?

  • The design in Fig. 2b is very similar to a standard ViT block. Are there any intuitions behind, or differences between, the two?

  • In Fig. 2b, is any pooling applied to the condition latents (the video features)?

Limitations

See weakness.

Author Response

Thank you for your valuable comments on our work. We would like to discuss the issues you raised here.

[Effect of transformer architecture and rectified flow]

We believe that both the transformer architecture and the rectified flow (RF) modeling method contribute to the model performance. We will elucidate the role of RF from the following perspectives.

  1. According to Table 1 in the paper, our RF model demonstrates an advantage in IS, KL, FAD, and Acc, with differences of 2.33, 0.13, 0.45, and 1.89, as well as a higher MOS. These differences are actually quite significant. When the number of sampling steps increases to 40, these advantages remain consistent (see the table below). We believe these results demonstrate the significant positive effect of RF on model performance.

    | Model   | Step | FD↓   | IS↑   | KL↓  | FAD↓ | KID ×10⁻³ | ACC↑  |
    |---------|------|-------|-------|------|------|-----------|-------|
    | DDPM    | 40   | 11.63 | 10.28 | 2.87 | 1.72 | 2.18      | 95.26 |
    | Frieren | 40   | 11.87 | 12.63 | 2.74 | 1.31 | 2.39      | 97.19 |
  2. Differences in the sampler may potentially reduce the performance gap between RF and DDPM. We adopt an advanced solver, DPM-Solver, for DDPM, in contrast to the simplest Euler solver for RF. Due to the differences in the models, it is difficult to eliminate this effect through an entirely consistent sampler. However, we can further unlock RF's potential by employing a more advanced sampler for RF. The following table shows the results of Frieren with the Dormand–Prince method (dopri5). We can see that the RF model holds an advantage in almost all metrics, especially in IS, FAD, and Acc, with only a slight disadvantage in KID. This further indicates the advantage of RF.

    | Model   | Sampler    | Step | FD↓   | IS↑   | KL↓  | FAD↓ | KID ×10⁻³ | ACC↑  |
    |---------|------------|------|-------|-------|------|------|-----------|-------|
    | DDPM    | DPM-Solver | 25   | 11.79 | 10.09 | 2.86 | 1.77 | 2.36      | 95.33 |
    | Frieren | dopri5     | 25   | 11.63 | 12.76 | 2.75 | 1.37 | 2.39      | 96.87 |
  3. Lastly, RF not only enhances the quality of generated audio but also reduces sampling steps through reflow and distillation, significantly improving the model's generation efficiency, which DDPM cannot achieve.

[Effect of reflow under different steps]

We briefly discussed this issue at the end of Section 4.2, and we would like to provide a possible explanation in more detail here. Theoretically, reflow should not alter the model's marginal distribution. Yet in practice, the limited number of steps for reflow data generation (25 steps in our experiments) can affect the data quality, introducing errors into the regression targets during the reflow process. While reflow can straighten trajectories and improve generation quality with a few steps, such errors in the regression targets can degrade the model's generation quality with more than 25 steps, leading to reductions in metrics such as FAD and IS. This might be mitigated by increasing the number of sampling steps during reflow data generation. Moreover, increasing the number of iterations of generating reflow data and conducting reflow can lead to the accumulation of more errors, which is why we generate data only once for both reflow and distillation (as discussed in Section 3.5).
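For concreteness, a minimal sketch of the reflow data generation described here, using a few-step CFG-guided Euler sampler with the pretrained vector-field estimator; the 25-step setting matches the text above, while the model interface and guidance scale are illustrative assumptions:

```python
import torch

@torch.no_grad()
def make_reflow_pairs(v_model, conds, shape, steps=25, cfg_scale=4.0):
    """Generate (z_0, z_1, cond) triplets for reflow: z_1 is produced from z_0
    by few-step CFG-guided Euler sampling with the pretrained model. Fewer
    steps mean cheaper data generation but noisier regression targets."""
    pairs = []
    dt = 1.0 / steps
    for cond in conds:
        z0 = torch.randn(shape)
        z = z0.clone()
        for i in range(steps):
            t = torch.full((shape[0],), i * dt)
            v_c = v_model(z, t, cond)                     # conditional field
            v_u = v_model(z, t, None)                     # unconditional field
            z = z + dt * (v_u + cfg_scale * (v_c - v_u))  # guided Euler step
        pairs.append((z0, z, cond))  # coupling used to retrain on straighter paths
    return pairs
```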

[Design of transformer block]

The design of our transformer block derives from Make-an-Audio 2 [1], which has been shown to be effective for audio generation, although it does not necessarily outperform a standard ViT block significantly. In other words, it is not necessarily the best choice, but it is certainly a good one.

[Model initialization]

Yes. We load the weights of the diffusion denoiser from Make-an-Audio 2 [1], a text-to-audio DDPM model trained with more data. Although we did not observe significant improvement in metrics or convergence rate, it appears to slightly improve subjective perceptual quality. We will add these details in the revised version of our paper.

[Pooling on condition latent]

No. No pooling is conducted on the condition sequence, as we want to keep the temporal information for generating synchronized audio.

[Difference from RF-based speech models and other insights]

We agree that our method has some similarities with RF-based speech models such as VoiceFlow [2]. However, we believe our model delves more deeply into certain aspects.

  1. Compared to the strong and highly deterministic content condition (text) in TTS, V2A has a weaker condition and its performance relies more on guidance. Unlike RF TTS models, we found that using the CFG-corrected vector field as the regression target during the reflow stage is crucial for model performance, rather than using the same vector field as in previous RF models (see Section 3.5).

  2. On top of reflow, we further investigate one-step distillation, which further improves single-step performance and which previous speech models did not address.


We hope our clarifications address your concerns. If you find our response helpful, we would greatly appreciate it if you would consider raising your evaluation of our work. We always welcome further discussion with you. Thank you again for your efforts.


[1] Jiawei Huang, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, Chen Zhang, Jinglin Liu, Xiang Yin, Zejun Ma, and Zhou Zhao. Make-an-audio 2: Temporal-enhanced text-to-audio generation. arXiv preprint arXiv:2305.18474, 2023.

[2] Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen, and Kai Yu. Voiceflow: Efficient text-to-speech with rectified flow matching. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 11121–11125. IEEE, 2024.

Author Response

To all reviewers, ACs, and PCs:

We thank all reviewers for their valuable suggestions and for their time and effort. Your comments have improved our work. We have responded individually to each reviewer's comments and concerns; please refer to each response for details.

To better illustrate the effect of the channel-level fusion conditioning mechanism in response to reviewer hYKZ, we provide a PDF showing spectrograms generated by our model with channel-level and sequence-level concatenation.

We sincerely hope that our responses have addressed the concerns raised by the reviewers and welcome further discussions. Once again, thank you for your time and efforts.

Best regards, Authors.

Final Decision

Audio generation from videos/images has been studied for some time now, but most models suffer from architectural complexity (inefficient for real-time use) or a lack of perceptual correctness. The authors propose a simple transformer-based flow architecture with a tweak to the diffusion loss, enabling synthesis with a small number of sampling steps. The authors show significant improvement over the state of the art in both objective and perceptual studies.

There is general agreement among the reviewers regarding the importance and sufficiently broad impact of the work. The authors are encouraged to incorporate the comments and replies into the final paper.