PaperHub
6.6 / 10
Poster · 4 reviewers
Ratings: 3, 3, 4, 4 (min 3, max 4, std. dev. 0.5)
ICML 2025

IMPACT: Iterative Mask-based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling

OpenReview · PDF
Submitted: 2025-01-17 · Updated: 2025-07-24
TL;DR

IMPACT is a text-to-audio generation framework that combines iterative mask-based parallel decoding with continuous representations driven by latent diffusion models to achieve high-performing audio quality and fidelity while ensuring fast inference.

Abstract

Keywords
Text-to-audio · Diffusion models · Iterative parallel decoding · Mask-based generative modeling

Reviews and Discussion

Review
Rating: 3

This paper proposes a framework for high-quality and high-fidelity audio synthesis in text-to-audio generation that combines iterative mask-based parallel decoding with a continuous latent diffusion model while maintaining efficient inference speed.

Specifically, it applies iterative mask-based parallel decoding to a continuous latent space for the first time, overcoming the fidelity limitations of discrete token-based methods. In addition, an unconditional pre-training strategy is introduced: an unsupervised pre-training phase precedes text-conditional training to improve the model's basic audio generation ability.

Questions for Authors

See Other Strengths and Weaknesses.

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

The paper does not involve theoretical proofs; it relies mainly on experimental validation.

Experimental Design and Analysis

The experimental design is reasonable. Comparing variants with and without unconditional pre-training, with different amounts of text-conditional data, and with other modifications effectively verifies the contribution of each key module. The analysis of the number of decoding iterations and the number of diffusion steps reveals the trade-offs between efficiency and performance.

In terms of results, the comparison against mainstream models such as Tango, AudioLDM2, and MAGNET covers both discrete and continuous representations as well as autoregressive and non-autoregressive generation paradigms.

Supplementary Material

The supplementary material provides audio comparisons across different systems, corresponding to the state-of-the-art performance reported in the paper on the key metrics FD and FAD on the AudioCaps evaluation set.

Relation to Broader Scientific Literature

This paper inherits the mask-decoding idea of MAGNET [1] and the continuous-space generation of MAR [2]. It also compares against recent text-to-audio models such as the Tango [3] series and AudioLDM [4].

However, it does not mention some recent similar work; for example, in terms of training strategy there are strong similarities with E2 TTS [5] and Seed-TTS (DiT) [6].

[1] Ziv, A., Gat, I., Lan, G. L., Remez, T., Kreuk, F., Copet, J., Défossez, A., Synnaeve, G., and Adi, Y. Masked audio generation using a single non-autoregressive transformer. In The Twelfth International Conference on Learning Representations, 2024.

[2] Li, T., Tian, Y., Li, H., Deng, M., and He, K. Autoregressive image generation without vector quantization. arXiv preprint arXiv:2406.11838, 2024.

[3] Majumder, N., Hung, C.-Y., Ghosal, D., Hsu, W.-N., Mihalcea, R., and Poria, S. Tango 2: Aligning diffusion-based text-to-audio generative models through direct preference optimization. In ACM Multimedia 2024, 2024.

[4] Liu, H., Yuan, Y., Liu, X., Mei, X., Kong, Q., Tian, Q., Wang, Y., Wang, W., Wang, Y., and Plumbley, M. D. Audioldm 2: Learning holistic audio generation with self-supervised pretraining. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024.

[5] Eskimez, S. E., Wang, X., Thakker, M., et al. E2 TTS: Embarrassingly easy fully non-autoregressive zero-shot TTS. In 2024 IEEE Spoken Language Technology Workshop (SLT), pp. 682–689. IEEE, 2024.

[6] Anastassiou, P., Chen, J., Chen, J., et al. Seed-TTS: A family of high-quality versatile speech generation models. arXiv preprint arXiv:2406.02430, 2024.

Essential References Not Discussed

See Relation to Broader Scientific Literature.

Other Strengths and Weaknesses

This paper extends iterative mask-based parallel decoding (derived from the discrete-token model MAGNET) to a continuous latent space for the first time and combines it with a latent diffusion model (LDM), breaking through the fidelity limitations of discrete representations. This design achieves an effective balance between generation efficiency and quality.

An unconditional pre-training strategy is introduced: an unsupervised pre-training stage before text-conditional training makes full use of unlabeled audio data to improve the model's basic generation ability, providing a scalable solution for data-scarce scenarios.

Experimental results show that on the AudioCaps evaluation set, IMPACT outperforms baseline models such as Tango and AudioLDM2, and subjective evaluation also shows better text relevance.

However, the unconditional pre-training strategy and iterative mask decoding in continuous space do not seem novel. Moreover, replacing the traditional attention mechanism with lightweight diffusion heads (e.g., MLPs) to address the slow sampling speed of diffusion models is also a mainstream approach to joint AR-diffusion modeling.

In addition, there is no mathematical proof or convergence analysis for the cosine decay strategy of the iterative masks.

Nevertheless, given the above strengths, I am still inclined to consider this work acceptable.

Other Comments or Suggestions

No

Author Response

Response to reviewer UxfA

  • Missing references: Thank you for bringing this to our attention. We will include additional references to relevant TTS literature in the final version, placing our work more firmly in the broader context of speech and audio generation.
  • Methodological contribution compared with MAR: Thank you for noting that, while IMPACT introduces unconditional pre-training and iterative mask decoding to an audio modality, it maintains a methodology aligned with MAR. As suggested by the rebuttal guidelines, we have combined and thoroughly addressed this point in our response to Reviewer RpS3 (bullet index 2).
  • Regarding the lack of mathematical proof or convergence analysis for the cosine attenuation strategy of iterative masks: We assume the reviewer is referring to a mathematical proof of training convergence. We respectfully clarify that the cosine attenuation strategy is applied only during inference, as the masking schedule for iterative decoding. The cosine masking schedule ensures that more latents are gradually revealed throughout the decoding process: in early decoding stages, without sufficient context, only a few latents are generated, while in later stages the model can rely on already-generated content as context for generating new latents (a minimal sketch of this schedule follows this list). Our cosine-based masking schedule is empirically grounded in prior work on iterative parallel mask-based generation, including MaskGIT (Chang et al., 2022), MAGE (Li et al., 2023), and MAGNET (Ziv et al., 2024), which have demonstrated strong empirical success with similar strategies. Section 4.4 and Table 3 of MaskGIT present ablation studies on various masking schedules, comparing the cosine schedule with alternatives such as linear, square, and cubic. The results show that the cosine schedule consistently yields superior image generation quality, achieving an FID of 6.06 and outperforming the linear (FID 7.51), square (FID 6.35), and cubic (FID 7.26) schedules. These findings justify the choice of the cosine mask schedule for optimal performance.
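
A minimal sketch of this cosine masking schedule, assuming the MaskGIT-style formulation in which the fraction of still-masked positions decays as cos(π/2 · t/T); the function name and the 256-latent / 32-iteration configuration are illustrative:

```python
import math

def num_masked(step: int, total_steps: int, seq_len: int) -> int:
    """Number of positions still masked after decoding iteration `step`
    (0-indexed), under the cosine schedule of MaskGIT (Chang et al., 2022)."""
    ratio = math.cos(math.pi / 2 * (step + 1) / total_steps)  # decays from ~1 to 0
    return int(seq_len * ratio)

# With 32 iterations over 256 latents, early iterations reveal only a handful
# of latents (little context exists yet), while later iterations reveal many.
seq_len, T = 256, 32
masked = [num_masked(t, T, seq_len) for t in range(T)]
revealed = [seq_len - masked[0]] + [masked[t - 1] - masked[t] for t in range(1, T)]
print(sum(revealed[:4]))  # 5 latents decoded in the first 4 iterations
```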
Reviewer Comment

Thanks to the authors for the reply. I acknowledge the effectiveness of this study's findings, including the unconditional pre-training strategy and iterative mask decoding in continuous space. However, this study still has weaknesses in terms of evaluation and innovativeness. Therefore, I will keep my original score.

Author Comment

Response to reviewer UxfA

Evaluation

We appreciate the reviewer's concern regarding evaluation; however, we would like to respectfully clarify that reviewer UxfA's initial comments did not indicate weaknesses in evaluation. Nevertheless, we are glad to comprehensively address this point as follows:

  • Objective Metrics: We conducted thorough evaluations of our models using multiple objective metrics widely recognized as standard in the audio generation research community. Our comparisons include a variety of state-of-the-art models, demonstrating the effectiveness and superiority of our proposed IMPACT framework.
  • Subjective Human Evaluations: In addition to objective measures, we included comprehensive subjective human evaluations, directly assessing human perception of audio quality and text relevance. Although our initial submission did not include standard errors and confidence intervals, these were subsequently provided in response to reviewers JkUy and cuSy (a minimal sketch of the computation follows this list). Specifically, following reviewer cuSy's recommendation, we expanded the human evaluation to 100 generated samples to provide a detailed comparison between our IMPACT model (c) and Tango 2, currently the strongest baseline. The expanded evaluation exhibited small standard errors, and the confidence intervals for our IMPACT model (c) did not overlap with those of the baseline models, indicating statistically significant differences in performance.
  • Latency Evaluation: We carefully measured inference time for batches of audio samples, aligning with standard practices in the field. This evaluation provides critical insights into the practical applicability of our models, especially in scenarios where inference speed is essential. Notably, this latency measurement approach is consistent with the methodology used in MAGNET, a recent and relevant baseline.
  • Ablation studies: We conducted detailed ablation studies to systematically investigate key factors affecting the IMPACT model’s performance. These studies include:
    • Decoding Iterations: Clearly showing the trade-off between objective performance and inference speed.
    • Diffusion Steps: We empirically identified optimal settings balancing high-quality audio generation with efficient inference.
    • Unconditional Pre-training: Demonstrated its positive effect on performance, providing valuable insights into the training methodology.
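
As referenced above, a minimal sketch of how standard errors and 95% confidence intervals can be computed from per-sample ratings; the normal approximation and the example ratings are our own illustrative assumptions, not the paper's data:

```python
import math

def mean_se_ci(ratings, z=1.96):
    """Mean, standard error, and normal-approximation 95% CI for a list of
    subjective ratings (e.g., REL or OVL scores on a 1-5 scale)."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
    se = math.sqrt(var / n)
    return mean, se, (mean - z * se, mean + z * se)

mean, se, (lo, hi) = mean_se_ci([5, 4, 4, 5, 4, 3, 5, 4, 4, 5])
print(f"{mean:.2f} ({se:.2f}), 95% CI [{lo:.2f}, {hi:.2f}]")
```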

Taken together, our comprehensive evaluations consisting of objective metrics, enhanced subjective assessments supported by statistical rigor, latency analysis, and detailed ablation studies form a robust and rigorous evaluation framework. This should thoroughly address and effectively mitigate any potential concerns related to the evaluation methodology.

Review
Rating: 3

The authors adapt the recently proposed Masked Autoregressive Models (MARs) from [Li et al. 2024] to text-to-audio generation. This architecture is essentially a MaskGIT model with a lightweight diffusion head to enable generating continuous data from an audio autoencoder instead of discrete tokens. The authors utilize a two stage training procedure and first pre-train the generative model on a large set of unlabeled audio data before fine-tuning it with paired text-audio data. The authors compare against a range of recent models using a variety of objective and subjective metrics and report strong performance. They also demonstrate that their approach achieves a strong performance/latency tradeoff.

Questions for Authors

  1. What is the impact of the CFG schedule presented in Appendix A? Various guidance schedules have been introduced, but the most common choice remains a constant schedule. Is the presented schedule necessary or beneficial? This should be ablated.

Claims and Evidence

The authors present comprehensive objective and subjective metrics to evaluate the performance of their approach. They demonstrate that their proposed system achieves strong performance with good latency compared to competing methods. For subjective metrics with relatively limited sample sizes, it is good practice to report the standard error to validate that the observed differences are meaningful. Without quantifying the variance, the results are somewhat hard to interpret.

Methods and Evaluation Criteria

The proposed method is primarily an application of a recently proposed image generation model to audio generation. Adapting MaskGIT-style models to generate continuous data with a lightweight diffusion head lets one take advantage of both the speedup of MaskGIT-style models and the high quality of continuous VAE latents. These advantages extend beyond visual data to audio. As a result, the proposed method is reasonable for the application at hand. The authors utilize standard benchmarks for training and evaluation of text-to-audio models.

Theoretical Claims

The authors do not present any theoretical claims.

Experimental Design and Analysis

From the perspective of analyzing the effectiveness of the MAR paradigm (i.e., mask-based parallel decoding directly on continuous representations), it would have been more informative to perform apples-to-apples experiments against MaskGIT-style models. While the experiments are reasonable, they do not control for factors like training data, so it is difficult to isolate the contribution of the modeling paradigm from other choices such as unsupervised pre-training. Such a controlled comparison might have produced more insight into the trade-offs between these methods.

Supplementary Material

I reviewed the supplementary material. In Appendix A, the notation for the CFG scale \alpha is overloaded, because \alpha is used earlier in the work when discussing diffusion models (as is standard notation). Alternative notation should be used for the guidance scale.

Relation to Broader Scientific Literature

This work adapts the recently proposed MAR model (Li et al. 2024) for image generation to the setting of text-to-audio generation. Given the success of Mask-GIT style models, which were also originally developed for images, in audio generation, the extension of MAR models is reasonable. This work is, as far as I am aware, the first demonstration of their success for audio generation.

Essential References Not Discussed

The discussion of related work is comprehensive.

Other Strengths and Weaknesses

Strengths:

  1. The application of the MAR class of models to audio generation is well-motivated. The same concerns that motivated their application to image generation are relevant for audio generation.
  2. The authors present strong results, especially when considering the latency of their method.
  3. The authors ablate a number of their choices such as the unsupervised pre-training. These ablations can help guide future work in the area.

Weaknesses:

  1. This work is primarily an adaptation of a recently proposed image generation method to the audio domain. While effective, the results are not surprising and there is limited novelty.
  2. The experimental setting is not very scientific. This work compares against previously existing models instead of comprehensively studying the effectiveness of MAR vs. MaskGIT models in audio generation. As a result, I think the work provides less insight than it otherwise could. It is hard to disentangle the improvement of the MAR paradigm over MaskGIT for audio generation. An apples-to-apples comparison against a MaskGIT baseline would be a more valuable scientific contribution.
  3. In general, the presentation of the results throughout the paper is not very clear. The tables and plots are very crowded which makes them harder to interpret. The presentation of different configurations in the table (a, b, c, etc.) is not very clear. The configurations should be denoted more clearly. The plots should not have the metrics annotated in text for every point.

Other Comments or Suggestions

For figure 3, the color scheme should be consistent across plots.

Author Response

Response to reviewer RpS3

Thank you for the suggestions. Here are our responses.

  1. Standard error for subjective evaluation: Following the rebuttal guidelines, we merged the concerns about missing standard error values and confidence intervals in the subjective evaluation into our response to Reviewer JkUy (bullet index 1). In conclusion, IMPACT models (c) and (e) clearly outperform the baseline models Tango2, AudioLDM2, and MAGNET-S, as evidenced by their higher average scores and non-overlapping confidence intervals. Furthermore, we will add standard errors to the main table in the final version of the paper.

  2. Results are not surprising and there is limited novelty: We respectfully point out that, despite adopting a method similar to MAR, IMPACT is the first attempt to combine iterative parallel decoding with diffusion modeling on continuous representations in the audio domain. In the following, we highlight the distinct and substantial contributions of our work.

    • Regarding performance:

      • IMPACT achieves state-of-the-art on FD and FAD on the AudioCaps evaluation set. IMPACT outperforms all baseline models in terms of inference speed (Figure 2), making IMPACT currently the fastest model for text-to-audio generation with good fidelity.
    • Regarding experimental analyses:

      • We compared iteratively decoding models with single-pass generation models (Table 2).
      • We provided extensive comparisons between baseline models and IMPACT under varying decoding steps (Tables 1, 4 and Figures 2, 3).
      • We analyzed the effects of iterations for iterative decoding and diffusion sampling on objective metrics and how these two factors affect inference speed (Figures 2, 4, 6, 7, 8, 9 and Tables 3, 6).
    • In general, we demonstrated that iteratively decoding continuous representations works extremely well beyond image modalities. As mentioned by the reviewer, this is "the first demonstration of their (MAR) success for audio generation". Detailed analyses of the inference-speed and objective-performance trade-off show that IMPACT can generate high-fidelity audio efficiently, and we believe this should be viewed as a non-trivial contribution.

  3. Controlled comparison with MaskGIT-style models (MAGNET): MAGNET is an audio generation model that decodes discrete tokens based on confidence scores, very similar to MaskGIT. Comparing IMPACT to MAGNET, which resembles MaskGIT in the audio domain, is a key demonstration of our model's advantages, namely IMPACT's fast inference speed (despite performing iterative decoding with a diffusion head) and its superior performance relative to MAGNET, as shown in Table 4. We further emphasize that our model exhibits a clear advantage over MAGNET while using fewer parameters: our 193M-parameter IMPACT model (b) already outperforms the 300M-parameter MAGNET-S, even when trained on fewer hours of data (Table 1). Due to the limited time of the rebuttal period, we could not conduct a fully matched apples-to-apples comparison by retraining MAGNET from scratch with our data configuration or by scaling up IMPACT's text-conditional training to the same amount of data (4,000 hours) that MAGNET used. However, the fact that our IMPACT model is smaller and trained on less data further supports the conclusion that diffusion modeling with iterative parallel decoding outperforms MaskGIT-style approaches.

  4. Impact of the CFG schedule: Our IMPACT models employ a cosine CFG scheduler, a decreasing schedule illustrated in Figure 5. Early decoding iterations particularly benefit from stronger guidance (a higher CFG scale), as fewer latents are available to serve as content for conditional generation. The IMPACT model (d), using the cosine CFG scheduler, achieves an FAD score of 1.38. Switching to a constant CFG scheduler significantly deteriorates performance, resulting in an FAD score of 1.68, demonstrating the clear benefit of the cosine CFG scheduler adopted in our work (a minimal sketch of such a schedule follows this list).

  5. Clarity of presentation: We will revise the manuscript to more clearly distinguish IMPACT models (a, b, c, etc.) in the main results table and enhance the clarity of Figures 1, 2 and 3.
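
As mentioned in item 4, a minimal sketch of a decreasing cosine CFG schedule combined with the standard classifier-free guidance update; the endpoint scales w_max and w_min are illustrative assumptions, not the paper's values:

```python
import math

def cosine_cfg_scale(step: int, total_steps: int, w_max: float = 4.0, w_min: float = 1.0) -> float:
    """Guidance scale at decoding iteration `step`: strong guidance early,
    when few latents exist as conditioning context, decaying to w_min."""
    ratio = math.cos(math.pi / 2 * step / max(total_steps - 1, 1))  # 1 -> 0
    return w_min + (w_max - w_min) * ratio

def apply_cfg(pred_cond, pred_uncond, w):
    """Standard CFG: extrapolate from the unconditional toward the conditional prediction."""
    return pred_uncond + w * (pred_cond - pred_uncond)
```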

Reviewer Comment

I thank the authors for their additional results, including the confidence intervals for the subjective evaluation and the ablation of the CFG schedule.

I still maintain that a controlled apples-to-apples comparison with MaskGIT–style models would have strengthened the scientific contribution of this work, but I acknowledge that the MAGNET baseline does provide a similar, although not apples-to-apples, comparison.

I also acknowledge the authors point that absolute performance of their method is quite strong, especially when considering the inference-time.

Given these points, I am raising my score to a 3.

Review
Rating: 4

This paper proposes IMPACT, a text-to-audio generation model that balances quality and speed via a hybrid mask-based decoding diffusion architecture. During inference, IMPACT utilizes a masking scheduler to iteratively generate latent embeddings, where each embedding is generated via diffusion modeling.
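
To make this mechanism concrete, below is a minimal sketch of DDPM-style ancestral sampling from a lightweight diffusion head conditioned on a transformer output vector, in the spirit of MAR (Li et al., 2024); the `head` interface, latent dimension, and linear beta schedule are illustrative assumptions, not IMPACT's actual configuration:

```python
import torch

@torch.no_grad()
def sample_latent(head, z_cond: torch.Tensor, dim: int = 16, n_steps: int = 100) -> torch.Tensor:
    """Ancestrally sample one continuous latent per masked position from a
    small noise-prediction network head(x_t, t, z_cond), conditioned on the
    masked transformer's output z_cond of shape (batch, cond_dim)."""
    betas = torch.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    abar = torch.cumprod(alphas, dim=0)
    x = torch.randn(z_cond.shape[0], dim)  # start from pure Gaussian noise
    for t in reversed(range(n_steps)):
        eps = head(x, torch.full((x.shape[0],), t), z_cond)  # predicted noise
        x = (x - betas[t] / torch.sqrt(1.0 - abar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # posterior noise
    return x
```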

Questions for Authors

  • Is there any chance that IMPACT can be open-sourced (not a requirement, just curious)?
  • Since the embedding sequence is generated iteratively, later-generated embeddings have more information to rely on. Does this mean that earlier-generated embeddings could be of worse quality? Do you think performance would improve if we "revisited" already-generated embeddings by masking them and generating them again?

Claims and Evidence

Yes in general. There is one minor thing:

  • The authors mention that the unconditional pre-training phase is "indispensable." In my opinion, the reason the pre-training phase is important may not be its absence of conditions but rather its data size (AS vs. AC).

Methods and Evaluation Criteria

Yes in general, but I have some minor questions:

  • Why do we need both CLAP and FLAN-T5 text encoders? Could using only one of them work as well? Is text encoding computationally significant in the inference process?
  • How long are the generations? Are they 10 seconds following AudioLDM? Are they stereo or mono? What is the sampling rate?
  • Has it been considered to use faster diffusion samplers like Heun? The results of IMPACT are already quite impressive, but faster samplers may further accelerate inference without additional training (a minimal Heun-step sketch follows this list).
  • Human subjective evaluation is done on 30 generated audio examples. Are these cherry-picked? The sample size seems quite small here. It's great that at least 10 participants evaluated each sample, but I would trade this for more audio examples. I suggest the authors include more examples here. To reduce the workload, I think it's acceptable to include fewer IMPACT models and baseline models in this test.
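
For reference on the Heun suggestion above, a minimal sketch of one second-order Heun step of the diffusion probability-flow ODE, following Karras et al. (2022); the `d_fn` interface is an assumed stand-in for the model's denoiser, not part of this paper:

```python
def heun_step(x, sigma, sigma_next, d_fn):
    """One 2nd-order Heun step from noise level sigma down to sigma_next,
    where d_fn(x, sigma) = (x - denoise(x, sigma)) / sigma is the ODE derivative."""
    d = d_fn(x, sigma)
    x_euler = x + (sigma_next - sigma) * d  # Euler predictor
    if sigma_next == 0:
        return x_euler                      # final step: no correction
    d_next = d_fn(x_euler, sigma_next)
    return x + (sigma_next - sigma) * 0.5 * (d + d_next)  # trapezoidal corrector
```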

Theoretical Claims

N/A.

Experimental Design and Analysis

Yes. They make sense.

Supplementary Material

I briefly went through it.

Relation to Broader Scientific Literature

This paper proposes a new hybrid approach to text-to-audio generation, which I think has a lot of potential.

Essential References Not Discussed

Not that I am aware of.

Other Strengths and Weaknesses

See the aforementioned points.

Overall, I find this paper solid; it offers a lot of potential and is worth accepting.

Things I recommend improving are the human evaluation scale and some additional discussions here and there. If they can be addressed, I will consider increasing my rating.

Other Comments or Suggestions

N/A.

Author Response
  • Human evaluation
    • Are the samples cherry-picked?: No. In the original subjective evaluation presented in the paper, we randomly selected text descriptions from the test split. To maintain diversity and avoid redundancy, we excluded text prompts that were highly similar or repetitive, resulting in 50 samples. These 50 samples were randomly distributed to 15 human annotators for the subjective evaluation. Eventually, about 30 samples received at least 10 ratings for both the baseline models and the IMPACT models.
    • Scale up human evaluation: We conducted additional human evaluations on 100 generated samples each for IMPACT model (c) and Tango 2.
    • Our model IMPACT (c) achieves a REL score of 4.26, significantly outperforming Tango2 (4.11) with non-overlapping confidence intervals [4.17, 4.35] vs. [4.07, 4.15], indicating a statistically significant improvement in perceived relevance to text prompts.
    • For OVL, IMPACT (c) achieves 3.48, also higher than Tango2 (3.37), with almost non-overlapping confidence intervals [3.43, 3.53] vs. [3.31, 3.43], again confirming statistical significance in overall audio quality.

| Model        | REL         | CI (REL)     | OVL         | CI (OVL)     |
|--------------|-------------|--------------|-------------|--------------|
| Ground truth | 4.48 (0.04) | [4.40, 4.56] | 3.56 (0.03) | [3.50, 3.62] |
| Tango2       | 4.11 (0.02) | [4.07, 4.15] | 3.37 (0.03) | [3.31, 3.43] |
| IMPACT (c)   | 4.26 (0.04) | [4.17, 4.35] | 3.48 (0.03) | [3.43, 3.53] |
  • Quality of early-generated latents: Latents generated in the initial decoding steps have limited contextual information and can be less robust than those generated later. To address this, we incrementally increase the number of unmasked elements with each iteration, starting with a small number of latents in the early steps and gradually increasing it throughout the decoding process (mentioned in Section 5.3.1, line 319). This figure shows a 32-iteration decoding process, where the latents generated at each decoding iteration are compared with the ground-truth latents via mean squared error (MSE). Latents generated in the early decoding steps exhibit higher MSE, suggesting they differ significantly from the ground truth, whereas those produced in later iterations more closely resemble the ground-truth latents.
  • "Revisit" already-generated latents by masking it and generating it again: Given that early generated latents tend to be of lower quality, we attempted to improve overall performance by regenerating the latents produced during the first 4 decoding iterations after completing the generation of the full sequence of latents. However, as shown in the table below, this had little impact on the results. This is likely because only a small number of latents are generated in the first few iterations, for instance, in a 32-iteration decoding process, just 5 latents are decoded during the first 4 iterations out of a total sequence length of 256. As a result, regenerating these few latents has minimal influence on the overall output. | Model | FD | FAD | KL | IS | CLAP | |-------------------------|-------|------|------|-------|--------| | IMPACT (b’) | 14.90 | 1.07 | 1.05 | 10.06 | 0.364 | | IMPACT (b’) + revisit | 14.98 | 1.10 | 1.05 | 10.08 | 0.360 |
  • Reason for the pre-training phase to be important may not be its absence of conditions, but rather its data size: We respectfully disagree. Both IMPACT (a) and IMPACT (b) are trained on the same dataset (AC+WC 1200 hr). IMPACT (b), which includes an unconditional pre-training phase, outperforms IMPACT (a), which does not. This shows that the performance gain stems from the pre-training strategy itself rather than the quantity of data.
  • CLAP and FLAN-T5 Encoders:
    • Regarding inference time: for the base configurations of the IMPACT models, it takes 22.2 seconds to generate a batch of 8 audio clips. When measuring inference time, we include the text-encoder overhead, approximately 0.05 seconds for a batch size of 8.
    • Regarding the role of the text encoders: when we remove the CLAP text embedding from the conditional input, the FAD increases from 1.38 to 1.49 for the IMPACT (c) model, demonstrating the crucial role of CLAP in guiding generation that is better aligned with real data.
  • Can IMPACT be open-sourced?: We will initiate the necessary processes to open-source the model upon acceptance of the paper, aiming to contribute more broadly to the scientific community.
  • Audio duration and sampling rate: Each generated audio clip is 10 seconds long, in mono, and sampled at 16,000 Hz, aligning with the AudioLDM setup for fair comparison.
  • "Heun diffusion sampler": We believe adopting the Heun sampler would be beneficial to our work by improving inference speed. We leave this as future work.
Reviewer Comment

Thanks to the authors for replying. Since my questions have been mostly resolved, I have increased the rating to 4 as promised.

As mentioned in my review, I have not worked on auto-regressive generation and may not be familiar with related approaches for other modalities (e.g., text-to-image). Hence, my reviews focus on the information presented in this paper.

Review
Rating: 4

The paper introduces IMPACT, a text-to-audio model combining masked generative modeling with diffusion models. The main result is the computational efficiency of the proposed method. IMPACT has a significantly lower latency compared to prior work, while being on-par in terms of objective quality and better in terms of subjective quality. The main methodological contribution of the paper includes an implementation of a parallel decoding, masked generative model, operating on continuous VAE representation through light-weight diffusion processes.

Questions for Authors

Fixed masking rate during text-to-audio training: please provide an ablation comparing this design choice to the varying-masking-rate alternative.

Claims and Evidence

Most of the claims made in the submission are well supported by convincing evidence.

Nevertheless, the random position selection strategy used during inference (Section 3.2.1) should be compared to confidence-based alternatives. The authors claim that such alternatives are infeasible, though they could be implemented via several techniques, such as:

  • Latent representation clustering and per-cluster probability estimation.
  • Latent likelihood estimation, following existing methodologies for likelihood evaluation of diffusion models and transformers.

For better soundness of the paper, I suggest adding either an empirical comparison to such an alternative or an empirical analysis demonstrating the computational efficiency gained by the design choice of not generating all positions in each decoding iteration, as done in prior work.

In addition, the methodology of unconditional pre-training, as well as the observation that more text conditional data leads to better performance, should not be claimed to be novel contributions.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria make sense for evaluating text-to-audio models along different axes of performance, including computational efficiency, quality, and text adherence.

Theoretical Claims

n/a

Experimental Design and Analysis

I checked the soundness of the experimental designs and analyses. This includes the following sections:

  • The main experiment on AudioCaps, comparing IMPACT to several baseline models suggested by prior work, with both objective and subjective metrics, in addition to comparing performance of the main IMPACT configurations.
  • The extensive ablation study on parallel decoding steps, diffusion denoising steps.
  • Graphical analysis of FAD, KL, IS, and CLAP scores as a function of latency, comparing IMPACT to prior work.

I assess that these analyses are valid.

Nevertheless, I would like to highlight the lack of confidence intervals in the subjective evaluation results. It is therefore unclear whether the superiority of IMPACT (c) over (b) in subjective metrics is significant. This reduces my confidence in the human studies.

Supplementary Material

I listened to a large number of audio samples on the anonymous website provided in the supplementary material.

Relation to Broader Scientific Literature

The authors clearly connect the paper’s contributions to prior work on masked generative modeling and latent diffusion models for audio generation. Compared to MAGNeT, IMPACT maintains the iterative parallel decoding methodology while improving quality significantly, and reducing latency at the same time. Compared to MAR, the modality is different, audio as opposed to images, but the methodology is similar. The paper lacks a clear comparison of the methodological contribution compared to MAR.

Essential References Not Discussed

Missing a reference to Discrete Flow Matching [Gat et al., 2024, https://arxiv.org/abs/2407.15595]. Though this is not an audio generation work, it is closely related, as a model combining masked generative techniques with flow matching, a generalization of diffusion models.

Other Strengths and Weaknesses

Strengths:

  • Low latency that scales better with batch size than MAGNET.
  • The Figure 4 heat-map clearly presents the effect of the number of MGM/diffusion steps on latency.
  • The authors used a wide set of baselines for empirical evaluation.

Weaknesses:

  • No outstanding novelty. Specifically, it is unclear what are the methodological novelties compared to MAR.

Other Comments or Suggestions

  • Section 5.2.1 lacks a reference to the corresponding results table.
  • Table 1 is a mix of baseline comparison and an ablation on the unconditional pre-training component: I suggest splitting it for readability.
Author Response

Response to reviewer JkUy

Thank you for your insightful comments and suggestions. Our responses to specific concerns are detailed below:

  1. Confidence intervals in subjective evaluation results: The 95% confidence intervals (CI) for the subjective evaluation results are as follows (values in parentheses are standard errors):

| Model          | REL         | CI (REL)     | OVL         | CI (OVL)     |
|----------------|-------------|--------------|-------------|--------------|
| Ground Truth   | 4.43 (0.01) | [4.41, 4.45] | 3.57 (0.03) | [3.52, 3.62] |
| Tango 2        | 4.13 (0.03) | [4.08, 4.18] | 3.37 (0.02) | [3.33, 3.41] |
| MAGNET-S       | 3.83 (0.04) | [3.74, 3.92] | 2.84 (0.05) | [2.74, 2.94] |
| AudioLDM2-full | 3.74 (0.05) | [3.63, 3.85] | 3.19 (0.04) | [3.11, 3.27] |
| IMPACT (b)     | 4.15 (0.03) | [4.09, 4.21] | 3.45 (0.02) | [3.41, 3.49] |
| IMPACT (c)     | 4.31 (0.03) | [4.25, 4.37] | 3.51 (0.02) | [3.49, 3.53] |
| IMPACT (e)     | 4.39 (0.04) | [4.32, 4.46] | 3.47 (0.02) | [3.43, 3.51] |

These intervals further support our conclusions regarding the performance of IMPACT models (b) and (c). Non-overlapping confidence intervals for text relevance and slightly overlapping confidence intervals for overall quality indicate that the superiority of IMPACT model (c) over (b) in subjective metrics is significant, leading to the conclusion that more unconditional training data benefits audio generation in terms of text relevance and audio quality as perceived by humans.

  2. Masking ratio during training: In our methodology, the "masking percentage factor" refers to the minimum masking rate. For example, a value of 0.7 means that we select a random number for the masking percentage in [0.7, 1], so at each training step at least 70% of the latents are masked. We will revise the manuscript to clearly reflect this clarification (a minimal sketch of this sampling follows this list). Applying a fixed masking ratio during training degrades the results severely, likely because each decoding step at inference uses a different number of masks; a model that cannot handle varying masking rates is not suitable for iterative decoding.

| Mask Percentage | FD    | FAD   | KL   | IS    | CLAP  |
|-----------------|-------|-------|------|-------|-------|
| [0.7, 1]        | 15.36 | 1.13  | 1.04 | 10.37 | 0.361 |
| 0.7 (fixed)     | 60.16 | 11.06 | 3.37 | 3.64  | 0.08  |

  3. Computational efficiency gained by the design choice of avoiding generation of all positions in each decoding iteration: With iterative parallel decoding as used in the IMPACT models, all positions are computed in parallel during each iteration. Consequently, whether we update every position or only a subset at each step, the overall inference time remains nearly identical.

  4. Random position selection strategy vs. confidence-based alternatives: Following the reviewer's suggestion, we trained a VAE-latent clustering model using k-means with 1024 clusters and computed confidence scores by applying a softmax over the inverse distances to the cluster centers. We then performed decoding by selecting positions in the style of MaskGIT. However, this approach performed significantly worse, with a FAD of around 11, higher than that of any IMPACT model proposed in the paper. One possible reason is that k-means struggles to effectively cluster the high-dimensional VAE latents, leading to unreliable confidence scores and degraded decoding performance.

  5. Methodological contribution compared with MAR: While IMPACT targets a different modality, it maintains a methodology aligned with MAR. As suggested by the rebuttal guidelines, we have combined and thoroughly addressed this point in our response to Reviewer RpS3 (bullet index 2).

  5. Clarity & Readability: Thank you for your valuable suggestions. We will revise the paper to address the following points: (1) clarify that our findings regarding improved performance through unconditional pre-training or additional text-conditional data are not a primary contribution of this work, (2) include citations of Discrete Flow Matching (Gat et al., 2024) in the final version, and (3) restructure Table 1 to distinctly separate baseline comparisons from ablations related to unconditional pre-training, ensuring improved clarity, readability, and appropriate referencing.

Reviewer Comment

I thank the authors for clearly addressing the main concerns reflected in my review. I raised the score from 3 to 4.

Final Decision

This paper introduces a framework for high-quality and high-fidelity audio synthesis in text-to-audio generation by integrating iterative mask parallel decoding with a continuous latent diffusion model, while ensuring efficient inference. Reviewers unanimously found the approach compelling, with well-executed experiments and performance that surpasses the evaluated baselines.

During the discussion phase, the authors provided additional results that addressed the reviewers' concerns. (I strongly encourage the authors to include these results in the final version, as they significantly enhance the paper.) Some reviewers raised questions regarding the method's novelty; I recommend that the authors clarify this aspect in the paper's presentation.