PaperHub

Rating: 6.0 / 10 (Rejected; 4 reviewers; scores 8, 5, 6, 5; min 5, max 8, std 1.2)
Confidence: 4.5 · Correctness: 3.3 · Contribution: 2.8 · Presentation: 3.5
ICLR 2025

Elucidating the Design Space of Text-to-Audio Models

Submitted: 2024-09-26 · Updated: 2025-02-05
TL;DR

We elucidate the design space of text-to-audio and present ETTA with state-of-the-art result and improved abilities to generate creative audio.

Abstract

Keywords

audio generation, text-to-audio, synthetic data, diffusion, flow matching

Reviews and Discussion

Review (Rating: 8)

This paper presents a 1.35M high-quality audio-caption dataset utilizing the state-of-the-art (SoTA) audio-language model, Audio Flamingo, referred to as AF-Synthetic. Additionally, the paper introduces a text-to-audio system based on the Latent Diffusion Model (LDM) with a Diffusion Transformer (DiT) backbone. Experimental results demonstrate that the proposed Elucidated Text-To-Audio (ETTA) system achieves SoTA performance across multiple metrics on both the AudioCaps and MusicCaps datasets.

Strengths

The paper introduces a novel filtering pipeline to select the best captions using the CLAP score, developing one of the largest high-quality audio-language datasets. The proposed ETTA generation model is trained across a wide range of diverse audio datasets and achieves competitive scores on both AudioCaps and MusicCaps, demonstrating state-of-the-art performance in both text-to-audio and text-to-music tasks.

Extensive ablation studies investigate the system's performance across different model sizes and training/sampling strategies. Subjective evaluations highlight significant improvements of the proposed ETTA system over baseline models, demonstrating its enhanced ability to generate audio from complex and imaginative captions.

Overall, this paper presents interesting work in the field of audio generation. The authors first introduce a large-scale dataset and then present a state-of-the-art generation system trained using this dataset. Various experiments demonstrate the effectiveness of different methods or modules within the system, concluding with an analysis of the ETTA system's limitations.

Weaknesses

  1. The authors claim the effectiveness of the proposed dataset by demonstrating the SoTA performance of the ETTA system trained on AF-Synthetic. However, the paper lacks sufficient experiments directly comparing the dataset's contribution, such as showing how models like Tango or AudioLDM perform when trained with AF-Synthetic.
  2. All the metrics used for evaluating the text-to-audio system are objective, which is generally sufficient but may not always reflect real-world performance. Therefore, including subjective evaluations like the Mean Opinion Score (MOS) would be beneficial, particularly for comparing models that achieve top performance across some metrics.
  3. The authors attempt to compare the performance between different baseline models. While ablation studies already demonstrate the effectiveness of the proposed ETTA strategies, the paper lacks experiments where the proposed model is trained on the same dataset (e.g., AudioCaps) to clearly illustrate the improvements contributed solely by the system itself.
  4. The ETTA model mainly builds on the backbone of the Stable Audio system. Beyond the training and sampling strategies (Flow Matching and ODE solvers, which are standard improvements in current LDM systems), the key enhancement appears to be the use of Adaptive Layer Normalization (AdaLN) within the DiT structure and the T5-base model for text embedding. These techniques are already implemented in the original DiT paper [1] and other baseline models. As a result, the model seems more like an engineering application with limited novel contributions.

Overall, this is an interesting paper with great effort. I am willing to change the score if the authors can address the questions and complete the following experiments:

  1. Train baseline models like Tango or AudioLDM on AF-Synthetic and compare to their original performance.
  2. Train ETTA on other datasets like AudioCaps or TangoPromptBank and compare performance to when trained on AF-Synthetic. These experiments would help separate the contributions of the model architecture from the dataset.
  3. Develop human evaluations on a subset of generated samples, comparing ETTA to top baseline models, such as Mean Opinion Score (MOS) or other equivalent subjective ratings on audio quality, text relevance and so on.
  4. More clearly articulate the novel aspects of their model architecture compared to existing work like DiT and Stable Audio.

[1]. Peebles, William, and Saining Xie. "Scalable diffusion models with transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

Questions

  1. For the comparison of the proposed AF-Synthetic dataset, the paper indicates that both AF-AudioSet and Sound-VECaps have around 161K samples remaining after filtering with a CLAP score of 0.45. Since both datasets are primarily developed from AudioSet, is there a significant overlap in the subset that achieves a higher CLAP score? If so, could this raise concerns regarding the reliability of using the CLAP score as a filtering metric?
  2. What are the key differences between AF-AudioSet and the proposed AF-Synthetic, apart from the fact that the latter generates ten captions per sample and selects the one with the highest CLAP score?
  3. What distinguishes the AdaLN layer in the proposed ETTA-DiT structure from the one used in the original DiT model mentioned in a previous section?
  4. How does the ETTA model perform with improvements limited to the ETTA-DiT component (e.g., when trained solely on AudioCaps)?

Ethics Concerns

No ethics concerns

Comment

Dear reviewer 5Eah,

We appreciate your supportive review and constructive feedback. We clarify your questions as below:

Q: “the paper lacks sufficient experiments directly comparing the dataset's contribution, such as showing how models like Tango or AudioLDM perform when trained with AF-Synthetic”

This is a good point. We found some training instability issues when re-training these two models with AF-Synthetic, and therefore we could not finish the experiments by the rebuttal deadline. We will do our best to identify the issues and include their results into our paper.

We do directly measure the dataset’s contribution with the Stable Audio Open architecture. The results are shown in Table 4 & 5 on page 8 (1st vs 2nd row). To further remove the effect of classifier free guidance on the evaluation results, we turn off classifier free guidance and report additional results in Table 26 & 27 on page 27 (1st vs 2nd row). The results validate the improvements from AF-Synthetic.

Specifically, we note that AF-Synthetic contains no studio-grade music recordings annotated by musicians, yet Table 5 indicates that Stable Audio Open trained with AF-Synthetic has better objective metrics and music-text relevance than their original model trained with large-scale professional music data (FMA-large).

Q: “The paper lacks experiments where the proposed model is trained on the same dataset (e.g., AudioCaps) to clearly illustrate the improvements contributed solely by the system itself.” … “Train ETTA on other datasets like AudioCaps or TangoPromptBank and compare performance to when trained on AF-Synthetic.”

Table 6 & 7 on page 8 include results of ETTA trained on AudioCaps and TangoPromptBank. For example, Table 6 (1st row) indicates ETTA trained only on AudioCaps achieves better $FD_O$, $FD_P$, $KL_S$, $CLAP_L$, and $CLAP_M$ than non-proprietary models in Table 2. We will add this row to Table 2 for better presentation.

To remove the effect of classifier free guidance on evaluation results, we additionally run evaluations with no classifier free guidance in Table 28 & 29 on page 27. The improvements of each component of ETTA are more consistent and obvious.

Q: “including subjective evaluations like the Mean Opinion Score (MOS) would be beneficial, particularly for comparing models that achieve top performance across some metrics.” … “Develop human evaluations on a subset of generated samples, comparing ETTA to top baseline models, such as Mean Opinion Score (MOS) or other equivalent subjective ratings on audio quality, text relevance and so on.”

This is a very good point. Per your request we added MOS evaluation in our general response (part 1): Subjective evaluations for benchmark datasets. The results are consistent with the objective metrics overall.

Q: “More clearly articulate the novel aspects of their model architecture.” …”What distinguishes the AdaLN layer in the proposed ETTA-DiT structure from the one used in the original DiT model mentioned in a previous section?”

We will describe the differences more clearly in the paper. In summary: (1) ETTA-DiT applies AdaLN-zero to the cross-attention input with unbounded gating; (2) ETTA-DiT's AdaLN operates in full numerical precision (FP32) for both training and inference; (3) ETTA-DiT zero-initializes the final projection layer and uses AdaLN with identity initialization; (4) we use GELU activation with tanh approximation; and (5) ETTA-DiT's RoPE embeddings operate in FP32.
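As an illustration of the AdaLN(-zero) conditioning described in points (1)-(3), a minimal PyTorch-style sketch is shown below; the class and argument names are illustrative placeholders and this is not the exact ETTA-DiT implementation.

```python
# Hypothetical sketch of AdaLN-zero modulation as described above; names
# (AdaLNZero, cond_dim, hidden_dim) are illustrative, not ETTA's actual code.
import torch
import torch.nn as nn

class AdaLNZero(nn.Module):
    """LayerNorm whose shift/scale/gate are predicted from a conditioning vector."""
    def __init__(self, hidden_dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Predicts (shift, scale, gate); zero init => the gated branch starts at zero.
        self.to_mod = nn.Linear(cond_dim, 3 * hidden_dim)
        nn.init.zeros_(self.to_mod.weight)
        nn.init.zeros_(self.to_mod.bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Run the modulation in FP32 for numerical stability (point 2 above).
        shift, scale, gate = self.to_mod(cond.float()).chunk(3, dim=-1)
        x_mod = self.norm(x.float()) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        # "Unbounded gating": the gate is a raw linear output, not squashed.
        return (gate.unsqueeze(1) * x_mod).to(x.dtype)

x = torch.randn(2, 128, 512)   # (batch, tokens, hidden)
cond = torch.randn(2, 256)     # e.g. a timestep/conditioning embedding
print(AdaLNZero(512, 256)(x, cond).shape)  # torch.Size([2, 128, 512])
```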

Q. AF-AudioSet and Sound-VECaps have around 161K samples remaining after filtering with a CLAP score of 0.45. Since both datasets are primarily developed from AudioSet, is there a significant overlap in the subset that achieves a higher CLAP score? If so, could this raise concerns regarding the reliability of using the CLAP score as a filtering metric?

This is a very interesting question. We computed the overlap and found there are only 14,133 overlapping samples between AF-AudioSet and Sound-VECaps, which is less than 10% of their sizes (both around 160K). This is likely because their captioning methods are very different.

Comment

Q: What are the key differences between AF-AudioSet and the proposed AF-Synthetic, apart from the fact that the latter generates ten captions per sample and selects the one with the highest CLAP score?

The main difference is the scale between these two datasets and the additional efforts to tackle the challenges caused by the scale. AudioSet is already a well curated 10-second dataset (it has no bad audio, contents are balanced, and so on).

However, for audio in the wild (e.g. FreeSound, WavCaps), we need to carefully remove useless audio (e.g. noisy or silent clips), down-weight long and homogeneous audio (e.g. looping background sound lasting over an hour), and re-weight different content types (e.g. music vs. speech conversation vs. street sound). We engineered a series of filtering and weighting methods for these, which finally yielded 1.35M captions (as reported) out of 6M raw captions.

In addition, we changed the inference hyper-parameters of the captioning model (Audio Flamingo) to encourage more diverse generations.
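As an illustration of the generate-then-filter step, a minimal sketch is shown below; `generate_captions` and `clap_similarity` are placeholders for the captioning model (Audio Flamingo) and a CLAP audio-text scorer, and the full AF-Synthetic pipeline includes the additional filtering and re-weighting steps described above.

```python
# Hypothetical sketch of a generate-then-filter captioning loop; the two
# callables are placeholders, not the actual AF-Synthetic pipeline.
from typing import Callable, List, Optional

def best_caption_for_audio(
    audio_path: str,
    generate_captions: Callable[[str, int], List[str]],
    clap_similarity: Callable[[str, str], float],
    num_candidates: int = 10,
    min_clap: float = 0.45,
) -> Optional[str]:
    """Sample several candidate captions and keep the best one above a threshold."""
    candidates = generate_captions(audio_path, num_candidates)
    scored = [(clap_similarity(audio_path, c), c) for c in candidates]
    score, caption = max(scored)              # highest audio-text similarity
    return caption if score >= min_clap else None  # drop weakly matched audio
```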

Q: How does the ETTA model perform with improvements limited to the ETTA-DiT component?

The improvements from ETTA-DiT are shown in Table 4 & 5 on page 8 (2nd vs 3rd row). We additionally run evaluations with no classifier free guidance, and the results are in Table 26 & 27 on page 27 (2nd vs 3rd row). The results, especially with no classifier free guidance, show that ETTA-DiT leads to consistent improvements in most metrics.

Remark

We conduct additional evaluations on the SongDescriber dataset. See our general response (part 4): Additional music evaluation on SongDescriber dataset.

Comment

Dear reviewer 5Eah,

Thank you for reviewing our manuscript. We have addressed your concerns and conducted the additional requested experiments.

As the rebuttal window nears its end, we kindly request your review of our responses and welcome any further feedback.

We look forward to hearing from you.

Authors

Comment

Thanks for your replies with additional supplementary materials and explanations, which addressed most of my concerns. I have increased the rating; good luck with the paper.

Review (Rating: 5)

The paper focuses on pushing the performance of diffusion-based text-to-audio models by introducing a new synthetic dataset and revisiting design choices.

  • New synthetic dataset (AF-Synthetic): The authors improve the pipeline from AF-AudioSet to generate 1.35M captions using audio samples from multiple datasets. The result is a large dataset with high CLAP scores, while existing datasets are either small or have simple or no filtering methods. The authors also point out that AF-Synthetic captions are different from those of existing datasets.
  • The model is built upon Audio-VAE (from stable-audio-tools) and the Diffusion Transformer (Peebles & Xie, 2023), with some changes to the architecture.
  • The authors compare across datasets and metrics to discover the optimal choice for the number of function evaluations and classifier-free guidance.

Strengths

The work aims to push the performance of diffusion-based text-to-audio models and shows better results than open-source models. The improvements come from a better dataset and an extensive evaluation of design/hyperparameter choices built on top of SOTA models. I appreciate the amount of work put into creating the dataset and trying many different hyperparameters and design choices, and the commitment to open-source the code, which should help push the field ahead.

Weaknesses

Since this is a purely empirical paper, my main concerns are mostly about evaluation and results.

  • Since the dataset is curated using CLAP scores, I do not find the CLAP score a reliable indicator; meanwhile, FAD and KL use old models to extract features.
  • There are no subjective scores in the results except for Table 9 which compares the final model with others.
  • The improvements in ETTA-DiT, OT-CFM and t-sampling are not consistent across metrics.
  • Using AF-AudioSet and AF-Synthetic yields similar results in AudioCaps and MusicCaps despite much larger size (Table 6-7)

Overall, I find that a few conclusions in the paper are not very helpful (e.g. increasing model size improves the performance), especially when we have inconsistent metrics and a lack of subjective scores. While it's good to see many hyperparameters being assessed, the contribution lacks novelty since it does not propose new techniques or reveal surprising findings. The lack of reliable metrics, which the authors also admit in the final section, also weakens the claims and conclusions.

Questions

  • What could be the reason that mode-collapsed models consistently produce good scores across multiple metrics, even though the backbones for these metrics are different?
Comment

Dear reviewer rAqM,

Q: Since the dataset is curated using CLAP scores, I find the CLAP score not a reliable indicator, while FAD and KL use old models to extract features.

We appreciate that you raised this important point. We address your concern below.

  • First, it is a common practice to use CLIP to filter and curate training data for text-to-image models, for example [1] [2]. We follow this approach and apply it to the audio generation domain.

  • We used $CLAP_L$ (CLAP trained by Laion) to curate our training data. In evaluations, we additionally report $CLAP_M$, a completely separate CLAP model trained on different datasets by Microsoft. We believe this metric is not biased in our setting.

  • We report FAD and KL in order to conduct a thorough comparison with baselines. Many baseline models, especially those not open-sourced, report these metrics in their papers. While we agree that VGGish used in FAD is old, PANNs and PaSST used in KL are among the most representative audio classifier backbones currently available.

  • We also report newer metrics, especially $FD_O$ using the OpenL3 feature extractor (which can evaluate 44.1kHz stereo audio), and find our model significantly outperforms baselines on this metric.

  • Our intention was to report as many well-established evaluation metrics as possible to provide the most comprehensive evaluation in the text-to-audio community. However, we do see your point and we will improve our paper to include detailed analysis of each metric including their limitations.

[1] Schuhmann, Christoph, et al. "Laion-400m: Open dataset of clip-filtered 400 million image-text pairs." arXiv 2021.

[2] Hong, Rachel, et al. "Who's in and who's out? A case study of multimodal CLIP-filtering in DataComp." ACM EAAMO 2024.

Q: There are no subjective scores in the results except for Table 9 which compares the final model with others.

We address your question in our general response (part 1): Subjective evaluations for benchmark datasets.

Q: The improvements in ETTA-DiT, OT-CFM and t-sampling are not consistent across metrics.

We appreciate your pointing out the metric disagreement.

  • Indeed, we find these metrics often disagree with each other. For example, optimizing FAD, especially to the extreme, often leads to worse $FD_P$ and vice versa. This happens not only to ETTA but also to most baseline models (Tango-series, AudioLDM-series, etc.). However, we would like to be transparent about our results and have decided to report all of them, and we hope to call the community's attention to study this contradiction in depth.

  • We also think that classifier-free guidance potentially caused the inconsistency. We additionally turn off classifier-free guidance and report evaluation results in Table 26 & 27 on page 27. The improvements of each component are clear and consistent in most metrics, including FD, KL, and CLAP. The only exceptions are FAD (which is old and not preferred, as mentioned) and IS (which measures sharpness and is most affected by classifier-free guidance).
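For concreteness, a minimal sketch of the classifier-free guidance combination at sampling time is shown below; `model` and the embedding arguments are placeholders rather than ETTA's actual interface, and $w=1$ corresponds to the no-CFG evaluations mentioned above.

```python
# Minimal sketch of classifier-free guidance at sampling time; `model` is a
# placeholder denoiser/velocity network, not ETTA's actual interface.
import torch

def cfg_prediction(model, x_t: torch.Tensor, t: torch.Tensor,
                   text_emb: torch.Tensor, null_emb: torch.Tensor,
                   w: float = 3.0) -> torch.Tensor:
    """Combine conditional and unconditional predictions with guidance weight w.

    w = 1 recovers the purely conditional prediction ("CFG off"); larger w
    sharpens adherence to the text at the cost of diversity.
    """
    pred_cond = model(x_t, t, text_emb)
    pred_uncond = model(x_t, t, null_emb)
    return pred_uncond + w * (pred_cond - pred_uncond)
```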

Q: Using AF-AudioSet and AF-Synthetic yields similar results in AudioCaps and MusicCaps despite much larger size (Table 6-7)

First, we find that AF-Synthetic leads to more stable training than AF-AudioSet towards longer (>1M) training steps.

We also hypothesize that the reason why the AudioCaps and MusicCaps results are similar is that both are subsets of AudioSet, so AF-AudioSet already covers their distribution. To validate this hypothesis, we additionally evaluate on SongDescriber, a high-quality music dataset (44.1kHz stereo) and an OOD test set in our setting. In contrast, MusicCaps has varied sample rates and channel counts.

The main results are in Table 36 on page 28, and the comparison between AF-AudioSet and AF-Synthetic is in Table 37 on page 28. The results indicate that pretraining with AF-Synthetic outperforms AF-AudioSet in most metrics, showing that our AF-Synthetic pretraining set offers stronger generalization ability to the model.

Q: the contribution lacks novelty since it does not propose new techniques or reveal surprising findings.

We summarize our novelty and contribution in our general response (part 2): Novelty and Contribution.

Q: “What could be the reason that mode-collapsed models consistently produce good scores across multiple metrics, even though the backbones for these metrics are different?”

We think the reason is related to small, homogeneous fine-tuning and test sets. We include further analysis in our general response (part 3): Clarification on potential mode-collapsing issue of ETTA-FT-MC.

Comment

Dear reviewer rAqM,

Thank you for reviewing our manuscript. We have addressed your concerns and conducted the additional requested experiments.

As the rebuttal window nears its end, we kindly request your review of our responses and welcome any further feedback.

We look forward to hearing from you.

Authors

Comment

Thank you for your response.

I increased the score to 5 due to the extensive and transparent results reported, but some of my points remain a concern, so I am still inclined to recommend rejection.

Since objective scores are not reliable and maybe biased (e.g., for the case of CLAP), I find them suitable only for development, but any significant claims made require subjective eval scores. The authors provided the scores for the final model, but not for any analysis in the paper. Since the important contribution of the paper is to compare between design choices, I recommend reporting those for at least Table 4 and 5. My apology for not pointing this out earlier to allow enough time for eval (so my score has already partially exclude this weakness). Since subjective scores are the most important metrics, we can rely on that to compare and evaluate the significance, rather than having to look at many inconsistent metrics (btw, they look quite consistent in Table 8 as expected, raising the question of how to evaluate when there's less agreement). I must also say this is also specific to the text-to-audio task since even the data used to train those models for objective scores may also be noisy; audio-caption data is less reliable than for example speech-transcript or audio-audio similarity. The authors have pointed out problems of the metrics; however that doesn't change the fact that the results are not enough to show the improvement. Looking at Table 26 and 27, I can't tell if the last two rows are better than ETTA-DiT.

"Scalability with Model Size" and "Choice of Sampler and its Impact on Metrics" in Page 9 do not provide any new discovery. The results should improve with larger model sizes and both samplers show similar results, also FD_V is a less reliable metric as authors pointed out (another metric reported in the appendix can be used here). Also if the goal is to compare between solvers, there are more candidates (e.g. DDIM) and this is also not something new with no evidently winning candidate.

Overall, I think the biggest contribution of the paper is the dataset and engineering efforts to train a good model on the dataset. However, the technical contributions are limited, with some unconvincing results and discoveries. The paper may still be valuable for its many results and observations.

Comment

Dear reviewer rAqM,

Thank you for your response and for sharing additional feedback. We appreciate your acknowledgement that our extensive and transparent experimental results and observations are valuable to the community. We would like to respond to your additional concerns to the best of our ability:

  • While we agree that subjective scores are important in assessing models, applying them to all experiments incurs significant costs, both in time and financial resources, especially for ablation studies and analyses. By reporting all objective metrics widely respected by the audio generation community, which are also more reproducible than human evaluation, we believe our ablation study and analysis collectively provide insights.

A similar approach is demonstrated in [1] within text-to-image domain, where it presented comprehensive ablation experiments using objective metrics (CLIP, FID, GenEval, etc.) to improve the design and analysis, despite some inconsistencies among these metrics (Table 2, 4, and 5 in [1], for example).

  • Scaling up the model size, as another example, is not as straightforward as one may assume and one should be strategic in allocating the model capacity to achieve optimal results. In our experiments, we showed that 1) while increasing width or depth of attention layers is beneficial, increasing the receptive field of convFF is not, and 2) applying larger and/or more text encoders may not necessarily improve results, contrary to common belief. We believe these are non-trivial findings that provide practical insights into how we should approach model scaling.

  • Scaling data towards better results is indeed the expected goal, but successfully executing it requires significant, non-trivial effort. Constructing AF-Synthetic, for example, is not about simply applying an off-the-shelf audio language model; it requires a series of pipelines to generate and then filter synthetic captions to curate a high-quality synthetic dataset.

Finally, with all the efforts combined, this work also demonstrates a more challenging task, creative audio generation, which is made possible by this non-trivial technical effort in scaling up. The improvement of ETTA in creating novel audio is substantial, which we strongly believe is an impactful discovery: scaling the model with a synthetic dataset can enable the ability to generate entirely novel sounds, unlike previous work. We welcome the readers to listen to the demo: https://anonymous.4open.science/r/etta_demo-72C4/index.md

[1] Esser, Patrick, et al. "Scaling rectified flow transformers for high-resolution image synthesis." Forty-first International Conference on Machine Learning. 2024.

Review (Rating: 6)

The primary focus of this paper is not to explore novel model designs, but rather to provide a comprehensive understanding of the current paradigms in TTA models. It seeks to identify critical factors that contribute to performance improvements and to evaluate scalability concerning data and model size. Additionally, the paper introduces AF-Synthetic, the first million-size synthetic caption dataset with strong audio correlations.

Strengths

  1. The contribution of the dataset: Experiments demonstrate that AF-Synthetic significantly improves performance.
  2. It provides a practical guide for hyperparameter tuning in the field of TTA.
  3. Each experimental conclusion is highlighted with a purple-bordered box, which makes the paper very reader-friendly and pleasant to navigate.

Weaknesses

  1. As the authors themselves mentioned, the paper doesn't propose a novel method. Moreover, as a paper based on extensive experiments, while it offers valuable conclusions, it lacks an innovative insight or discovery that stands out. Therefore, I believe this paper might be a better fit for the Dataset & Benchmark track.
  2. I would also suggest that the authors include the following three points:
    • Could the authors further analyze the "potentially mode-collapsed mode" part? In theory, the FD metric should be able to measure sample diversity.
    • Since the authors have already tried logit-normal t-sampling, could they also test the Min-SNR Weighting Strategy?
    • Regarding the Auto-guidance section, it might be worth experimenting with removing CFG (Classifier-Free Guidance) for the first 40% of steps and then adding CFG for the remaining 60% of steps to see if this improves the FD score.

Questions

Please refer to Weaknesses#2.

Comment

Dear Reviewer gSe7,

Thank you for your thoughtful review and valuable feedback on our paper. We are pleased that you found our contributions of the AF-Synthetic dataset, practical guidance for design choices of TTA, and good presentation to be strengths of our work. We address your questions below.

Q: “As the authors themselves mentioned, the paper doesn’t propose a novel method. Moreover, as a paper based on extensive experiments, while it offers valuable conclusions, it lacks an innovative insight or discovery that stands out. Therefore, I believe this paper might be a better fit for the Dataset & Benchmark track.”

We address your concern in our general response (part 2): Novelty and Contribution.

Q: “Could the authors further analyze the ‘potentially mode-collapsed mode’ part? In theory, the FD metric should be able to measure sample diversity.”

We answer your question in our general response (part 3): Clarification on potential mode-collapsing issue of ETTA-FT-MC.

Generally, FD can measure quality and diversity for unconditional generation. However, it may not measure diversity perfectly in text-conditional generation, especially when the test set is small and homogeneous. Diversity in the latter case refers to generating different audio for a slightly different caption, or for a different random seed with the same caption. We ran an additional human evaluation for this model and got the following results: OVL = $3.77 \pm 0.10$ (GT = $3.88 \pm 0.10$), REL = $3.73 \pm 0.09$ (GT = $3.90 \pm 0.10$), which also show very good quality. Therefore, we would like to leave our finding as an open problem and call the research community's attention to it.

Q: “Since the authors have already tried logit-normal t-sampling, could they also test the Min-SNR Weighting Strategy?” …“Regarding the Auto-guidance section, it might be worth experimenting with removing CFG (Classifier-Free Guidance) for the first 40% of steps and then adding CFG for the remaining 60% of steps to see if this improves the FD score.”

These are really good points. We explored these training (Min-SNR weighting [1]) and sampling strategies (CFG on limited interval [2]) per your suggestion. The results are below.

  • Min-SNR: We use $\gamma=5$ per convention, and trained ETTA-DiT for 250k steps with the v-diffusion objective. Results can be found in Table 32 & 33 on page 28 (+ Min-SNR-$\gamma$ ($\gamma=5$)). Most metrics became worse, but $FD_O$ and $IS_P$ are slightly better in music generation.

  • CFG on a limited interval: We remove CFG for the initial 40% of diffusion steps and then apply CFG ($w_{cfg}=3$) for the remaining 60% of steps. We run inference and evaluate both with and without AutoGuidance. Results can be found in Table 34 & 35 on page 28 (+ CFG @ [0, 0.6]). Most metrics became worse, but $FD_P$ is slightly better in audio generation using the small (XS) model.

[1] Hang, Tiankai, et al. "Efficient diffusion training via min-snr weighting strategy." ICCV 2023.

[2] Kynkäänniemi, Tuomas, et al. "Applying guidance in a limited interval improves sample and distribution quality in diffusion models." arXiv 2024.
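For reference, a minimal sketch of the two strategies as described in [1] and [2] is shown below; the weighting form and the guidance interval are illustrative and do not reproduce the exact ETTA training and sampling code.

```python
# Sketch of Min-SNR-gamma loss weighting [1] and CFG restricted to a time
# interval [2]. Illustrative only; not the exact settings or code used for ETTA.
import torch

def min_snr_weight(snr: torch.Tensor, gamma: float = 5.0,
                   v_prediction: bool = True) -> torch.Tensor:
    """Per-timestep loss weight: min(SNR, gamma)/SNR for eps-prediction,
    min(SNR, gamma)/(SNR + 1) for v-prediction (Hang et al., 2023)."""
    clipped = torch.clamp(snr, max=gamma)
    return clipped / (snr + 1.0) if v_prediction else clipped / snr

def guidance_scale(t: float, w: float = 3.0, interval=(0.0, 0.6)) -> float:
    """Use CFG weight w only when t lies inside `interval`; otherwise fall back
    to the purely conditional prediction (w = 1)."""
    lo, hi = interval
    return w if lo <= t <= hi else 1.0
```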

Remark

We conduct additional evaluations on the SongDescriber dataset. See our general response (part 4): Additional music evaluation on SongDescriber dataset.

Comment

The authors replied very actively, and the supplementary experiments were also very comprehensive, so I increased the score.

Comment

Dear gSe7,

Thank you very much for acknowledging our response. We are glad that our response has been comprehensive and appreciated. We are happy to discuss any remaining questions you may have.

Authors

Review (Rating: 5)

This paper empirically studies the way to improve the current text-to-audio generation system, including creating a new synthetic audio-caption paired dataset, an improved architecture, and other related settings.

Strengths

  1. The author proposed a synthetic data generation pipeline and created the first million-size dataset with high audio-text correlation.
  2. The author has implemented a comprehensive list of objective evaluation metrics and shows interesting results during comparison.
  3. The author mentioned they will open-source their code for reproducibility.

Weaknesses

  1. A large number of evaluation metrics are used, so it would be helpful to explain what each evaluation metric focuses on. For example, there are three types of Frechet Distance - what are the differences?
  2. It is not clear how ETTA performs without pretraining on the large-scale datasets. For example, what would the results look like if the ETTA model is trained on AudioCaps from scratch (I assume this is a common setup)?
  3. Lack of subjective evaluation. As the author mentioned, there is not yet an effective objective evaluation metric for the TTA task. Out of the seven main conclusions in the paper, only one is backed by the subjective evaluation. This can make the conclusion less convincing.
  4. The architectural improvement seems a bit marginal. The improvement of ETTA seems to come from the new synthetic dataset mostly.

Questions

  1. “We initialize the final projection layer of DiT to output zeros.” Do you mean zero-initialize all the weight and bias in the projection layer? Or do you use gated operation? More explanations are welcome.
  2. Please consider adding the result of Make-an-Audio and AudioLDM in Table 2 and Table 3, as they are also very important baselines to compare with.
  3. In Table 6, the ETTA trained on AudioCaps shows 3.00 FAD on the AudioCaps evaluation set. This is significantly worse than the current state-of-the-art. It would be helpful if the authors could explain this result.
  4. Line 510: Could you say more about the loss divergence issue? It seems contradictory, as the evaluation metrics still look normal. More explanations would be helpful.
Comment

Dear Reviewer dkey,

Thank you for your thoughtful review and valuable feedback on our paper. We appreciate your recognition of our contributions, including the creation of a large-scale synthetic audio-caption dataset and the implementation of comprehensive evaluation metrics. We address your concerns and questions below.

Q: “It would be helpful to explain what each evaluation metrics focus on. For example, there are three types of Frechet Distance - what are the differences?”

Thanks for your suggestion. We will add a detailed explanation for each metric, in addition to the high-level descriptions presented in Section 4.3. The three types of Frechet Distance differ in their feature extraction networks, which come from different classifiers (VGGish, PANNs, and OpenL3). VGGish is the oldest and the least accurate, and therefore $FD_V$ is less preferred (as pointed out by AudioLDM and Stable Audio). PANNs and OpenL3 are newer and have better quality, so $FD_P$ and $FD_O$ are more preferred. We report as many evaluation metrics used by the community as possible in order to compare with baselines thoroughly.
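For concreteness, all three variants compute the same Frechet distance between Gaussian statistics of embeddings and differ only in the embedding network; a minimal numpy sketch (embedding extraction omitted) is shown below.

```python
# Minimal sketch of the Frechet distance between two sets of audio embeddings.
# Only the embedding network differs between FD_V (VGGish), FD_P (PANNs), and
# FD_O (OpenL3); the distance itself is computed the same way.
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FD between Gaussians fitted to real and generated embeddings (N x D)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2 * covmean))
```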

Q: “It is not clear how ETTA perform without pretraining on the large-scale datasets. For example, what would the result look like if the ETTA model is trained on AudioCaps from scratch (I assume this is a common setup)?”

The first row of Table 6 includes results of ETTA trained only on AudioCaps. This model achieves better $FD_O$, $FD_P$, $KL_S$, $CLAP_L$, and $CLAP_M$ than non-proprietary models in Table 2. We will add this row to Table 2 for better presentation.

Q: “Out of the seven main conclusions in the paper, only one is backed by the subjective evaluation. This can make the conclusion less convincing.”

Per your suggestion, we report MOS results in our general response (part 1): Subjective evaluations for benchmark datasets. The results are consistent with objective evaluations overall. The results indicate ETTA has high quality and relevance to captions compared to baselines.

We respectfully disagree with the statement that our conclusions are less convincing due to the lack of subjective evaluations. We included all standard objective metrics in the literature. Similar to metrics in image generation (FID, bits/dim, CLIP, …), while we think a single metric is not comprehensive enough, these metrics combined are very informative about the model's quality.

Q: “The architectural improvement seems a bit marginal. The improvement of ETTA seems to come from the new synthetic dataset mostly.”

We disagree that the architectural improvements are marginal. However, we highly appreciate that you raised this issue based on our existing results. To better understand the improvements from each component, we turn off classifier-free guidance (CFG) and re-run all evaluations in Table 4 & 5. The results without CFG are shown in Table 26 & 27, page 27 in the updated pdf. The results show that the ETTA-DiT architecture leads to significant and consistent improvements on most metrics.

Q: ”‘We initialize the final projection layer of DiT to output zeros.’ Do you mean zero-initialize all the weight and bias in the projection layer? Or do you use gated operation? More explanations are welcome.”

We use both. We zero-initialize the weights and biases of the final projection layer, and we also employ AdaLN with identity initialization and the gating mechanism.

Q: “Please consider adding the result of Make-an-Audio and AudioLDM in Table 2 and Table 3, as they are also very important baselines to compare with.”

Thank you for this suggestion. We have updated Table 2 & 3 to include these results. We also note that we have already included their successor works (Make-an-Audio 2 & AudioLDM2), which have better quality.

Q: “In Table 6, the ETTA trained on AudioCaps shows 3.00 FAD on the audiocaps evaluation set. This is significantly lower than the current state-of-the-art. It would be helpful if the author could explain this result.”

FAD (i.e. $FD_V$) uses the VGGish network to extract audio features. It is older and less accurate than PANNs ($FD_P$) and OpenL3 ($FD_O$), as pointed out by the authors of AudioLDM and Stable Audio. Therefore, $FD_P$ and $FD_O$ are preferred over FAD. In addition, $FD_V$ is constrained to 16kHz mono audio, whereas $FD_O$ measures up to 48kHz stereo, compatible with our model's audio quality (44.1kHz stereo). We will include a more detailed analysis of these metrics in the paper.

We also find that FAD often contradicts $FD_P$. By this we mean that optimizing FAD, especially to the extreme, often leads to worse $FD_P$ and vice versa. This happens not only to ETTA but also to most baseline models (Tango-series, AudioLDM-series, etc.). However, we would like to be transparent about our results and have decided to report all of them, and we hope to call the community's attention to study this contradiction in depth.

Comment

Q: “Line 510: Please say more about the loss divergence issue? It seems contradictory as the evaluation metrics still look normal. More explanations would be helpful.”

Thank you for pointing out this confusion. The loss divergence refers to training instability after fine-tuning on MusicCaps for too long, possibly because the fine-tuning set is too small and homogeneous. We select the fine-tuned checkpoint from before the instability happens.

We include further analysis on the evaluation results in our general response (part 3): Clarification on potential mode-collapsing issue of ETTA-FT-MC.

Remark

We conduct additional evaluations on the SongDescriber dataset. See our general response (part 4): Additional music evaluation on SongDescriber dataset.

Comment

Dear Reviewer dkey,

Thank you for reviewing our manuscript. We have addressed your concerns and conducted the additional requested experiments.

As the rebuttal window nears its end, we kindly request your review of our responses and welcome any further feedback.

We look forward to hearing from you.

Authors

Comment

(1) Subjective evaluations for benchmark datasets

As several reviewers have asked for subjective evaluation results on the main benchmarks, we additionally conducted MOS evaluations (OVL = overall audio quality, where raters do not see captions; REL = audio-text relevance). The results are shown below (also in Table 30 & 31, page 27 in the updated pdf).

On AudioCaps, ETTA has significantly better quality and relevance. On MusicCaps, while Stable Audio Open has better music quality (likely because it is trained with massive music data including FMA), ETTA has better music-text relevance, indicating better caption-following ability.

AudioCaps

| Model | GT | AudioLDM2-large | TANGO2 | Stable Audio Open | ETTA | ETTA-FT-AC-100k |
|-------|----|-----------------|--------|-------------------|------|-----------------|
| OVL | 3.43 ± 0.11 | 3.00 ± 0.11 | 3.08 ± 0.10 | 3.29 ± 0.11 | 3.43 ± 0.11 | 3.26 ± 0.10 |
| REL | 3.62 ± 0.10 | 3.11 ± 0.10 | 3.66 ± 0.09 | 3.15 ± 0.11 | 3.68 ± 0.10 | 3.77 ± 0.10 |

MusicCaps

| Model | GT | AudioLDM2-large | TANGO-AF | Stable Audio Open | ETTA |
|-------|----|-----------------|----------|-------------------|------|
| OVL | 3.88 ± 0.10 | 3.25 ± 0.10 | 3.38 ± 0.09 | 3.92 ± 0.10 | 3.53 ± 0.10 |
| REL | 3.90 ± 0.10 | 3.15 ± 0.10 | 3.31 ± 0.10 | 3.35 ± 0.11 | 3.57 ± 0.10 |

(2) Novelty and Contribution

We believe our work offers significant contributions to advance the field of text-to-audio generation. While studying novel model architecture is not the focus of our work, we do present an improved DiT implementation that leads to better quality. Here is the summary of our contributions:

  • We create the first million-size synthetic dataset with strong audio-text correlations.

  • For the first time, we show it is possible to achieve SOTA generation results on multiple benchmarks (AudioCaps, MusicCaps, SongDescriber) using only synthetic captions and at scale.

  • We systematically investigate the impact of data and model scaling, architectural choices, and training and sampling strategies, providing valuable insights to guide future research in this area.

  • This work represents the most comprehensive suite of exploration and evaluation in the text-to-audio literature.

  • We not only achieve SOTA generation results on standard audio and music generation benchmarks, but also show an advanced ability to synthesize novel audio that does not exist in the real world.

see our demo website at: https://anonymous.4open.science/r/etta_demo-72C4/index.md

In summary, we believe that elucidating the existing paradigm to push the practical boundaries in this domain is as equally important as exploring novel architecture, and our contributions collectively offer valuable insights to the research community.

(3) Clarification on potential mode-collapsing issue of ETTA-FT-MC

Several reviewers asked questions about our discussion in Section 5. We thank the reviewers for highlighting this point as an interesting observation worth further explanation. We aim to describe our observation and analysis to the best of our ability.

Context: The discussion only applies to the model fine-tuned on about half of MusicCaps samples and evaluated on the other half, corresponding to the setting in Tango-AF. It does not apply to our main audio and music generation results in Section 4 as well as the Appendix.

Observation: We listened to ETTA-FT-MC generated samples with different random seeds as well as slightly different captions. We found the model has reduced diversity, similar to but not as severe as the typical mode collapse phenomenon seen in GANs. We do not observe this in our pretrained ETTA model or in ETTA-FT-AC, which is fine-tuned on AudioCaps.

Analysis: We believe it is a new bias that is undefined in the literature and likely cannot be identified by the existing quality metrics, and it is related to small and homogeneous fine-tuning sets. It is possible that Frechet Distance does not measure diversity well for small and homogeneous test sets. We also conducted MOS evaluation on quality and relevance, and the scores are high and consistent with the objective evaluations.

  • OVL = $3.77 \pm 0.10$ (GT = $3.88 \pm 0.10$); REL = $3.73 \pm 0.09$ (GT = $3.90 \pm 0.10$).

Why we discuss these results: We aim to call the community’s attention to further study this bias that happens to small and homogeneous fine-tuning sets and motivate better evaluation metrics tailored for this situation.

Comment

(4) Additional music evaluation on SongDescriber dataset

As several reviewers had questions regarding music evaluation, we additionally conduct evaluations on SongDescriber, a high-quality music dataset (44.1kHz stereo). In contrast, MusicCaps has varied sample rates and channel counts. The results are in Table 36 on page 28.

ETTA outperforms baselines in most metrics, especially $FD_O$, which can measure the quality of 44.1kHz stereo outputs. We are also preparing a subjective human evaluation (OVL/REL) on this dataset.

AC Meta-Review

The paper focuses on pushing the performance of diffusion-based text-to-audio models by introducing a new synthetic dataset and revisiting design choices.

  • New synthetic dataset (AF-Synthetic): The authors improve the pipeline from AF-AudioSet to generate 1.35M captions using audio samples from multiple datasets. The result is a large dataset with high CLAP scores, while existing datasets are either small or have simple or no filtering methods. The authors also point out that AF-Synthetic captions are different from those of existing datasets.
  • The model is built upon Audio-VAE (from stable-audio-tools) and the Diffusion Transformer (Peebles & Xie, 2023), with some changes to the architecture.
  • The authors compare across datasets and metrics to discover the optimal choice for the number of function evaluations and classifier-free guidance.

The objective evaluations only give a partial view: the dataset is curated using CLAP scores, yet the CLAP score is not a reliable indicator, while FAD and KL use old models to extract features.

The rebuttal period was productive and 3 out of 4 reviewers increased their scores. Notably, the authors added subjective evaluations. Weaknesses that were addressed in the rebuttal: no subjective evaluations, and disentangling more clearly (in the paper) the effect of the methods vs. that of the dataset. This is overall a good paper. However, those changes make the paper (which has two contributions: data and method) harder to follow. While reviewers increased their scores, the paper falls a bit short of the required quality to be published at ICLR.

Additional Comments from Reviewer Discussion

The rebuttal period was productive and 3 out of 4 reviewers increased their scores. Notably, the authors added subjective evaluations and many updates; see e.g. https://openreview.net/forum?id=xmgvF0sLIn&noteId=bZtIhF5Iav. The paper still averages 6.00 after 3 score increases.

Final Decision

Reject