PaperHub
4.8 / 10
Poster | 3 reviewers
Scores: 2, 3, 3 (lowest 2, highest 3, standard deviation 0.5)
ICML 2025

Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction

OpenReview | PDF
Submitted: 2025-01-23 | Updated: 2025-07-24
TL;DR

We replace discrete tokens with continuous latents for audio language modeling, enhanced by a novel masked next-token prediction task.

Abstract

Keywords

text-to-audio, audio language modeling, continuous token, diffusion model, next-token prediction

Reviews and Discussion

Review
Rating: 2

The paper proposes a text-to-audio model, named AudioMNTP, that leverages diffusion-based designs and causal language models. Specifically, the model applies a Transformer-based decoder to perform next-token prediction on latent-space features, which are then forwarded into a diffusion-based module, followed by a VAE decoder and a GAN-based vocoder to reconstruct the waveform. In addition, a token masking strategy is proposed to improve model performance. Experiments illustrate that the proposed model achieves state-of-the-art performance with a significantly smaller model size.

Questions for the Authors

Overall, the model provides enhanced performance with a smaller system size; however, the writing of the paper needs to be improved, as the current version is hard to follow. I would be happy to increase the score if the authors could answer the following questions and also explain the overall pipeline of the model.

Question 1: What is the main proposed model, why introduce both AudioNTP and AudioMNTP? Are both necessary?

Question 2: Why use MLP architecture for the diffusion-based section? Any consideration for other architectures?

Question 3: What is the overall structure of the model? Is the proposed model mainly replacing the previous LDM/Transformer-based feature generation model with the Transformer-based decoder?

Question 4: Can we provide some demos of the result?

Question 5: What is the inference time of the proposed model?

Question 6: Why is the evaluation in Table 2 categorized into non-speech and speech? Does that mean the model can successfully generate real speech content?

Claims and Evidence

The authors claim that previous diffusion-based audio language models present bottlenecks and limitations. However, evidence supporting the effectiveness and superiority of the proposed AudioMNTP models over simpler baselines is not adequately provided. A suggestion is to provide some demos to support such a claim and also to present the enhancement brought by the proposed methods. Additionally, the introduction of both the Transformer-based decoder and the overall structure is not very understandable.

Methods and Evaluation Criteria

The proposed system mainly applies AudioCaps and WavCaps for training. However, the feasibility of scaling up the dataset with numerous other audio-language datasets remains unclear. Although the authors claim that scaling up the model is limited by computational constraints, the model is trained using 104 V100 GPUs for 5 days, and such resources are already enough for training a large model. The authors provide various evaluation metrics to illustrate and compare the performance; however, the results covering speech and non-speech categories are hard to understand. Does this mean that the proposed model can correctly generate speech content?

Theoretical Claims

All the theoretical claims are discussed in the paper; however, the written format of the paper could be improved, as the current structure is hard to follow and understand. Why does the paper provide the details of two models, AudioNTP and AudioMNTP? The introduction of the inference section for both models is also ambiguous. It is hard to understand how the decoder works during inference: are all the tokens general noise at the beginning?

Experimental Design and Analysis

The paper lacks most of the essential details of the experiments, such as specific training/inference parameters of each model, e.g., the inference time. After reading the whole paper, it is still hard to understand the pipeline of the system and how the model actually operates.

Supplementary Material

As a generative model, the paper should provide demo audio to illustrate the performance of the system, especially demos comparing the effectiveness of each proposed module/strategy.

Relation to Prior Literature

The paper discusses a new idea of applying the advantage of causal language models for generative models. In addition, MAE-based training strategies are applied to improve performance.

Missing Essential References

There are actually more text-to-audio models that have achieved better performance, such as AudioBox and Re-AudioLDM. Although it might be hard to run these models, the paper could be improved by discussing them.

Other Strengths and Weaknesses

The overall structure, especially the two figures, is excessively complex and hard to understand or follow. The formatting of the paper should also be improved: some sections overlap, and the ordering of some parts should be revised. The paper introduces many techniques but lacks details of the proposed model. Currently, even the overall pipeline of the model and how the Transformer-based decoder with the diffusion head works are unclear. The paper also does not provide any demos of the model.

Other Comments or Suggestions

None

Author Response

Thank you for the constructive feedback! Below, we address all comments and outline planned improvements.


1. Presentation Improvement

We agree the current version can be clearer and will revise the structure, improve the diagrams, and expand the explanations in the final version.

If you have specific instructions on the organization, please let us know!

Relation between AudioNTP and AudioMNTP. Why show both AudioNTP and AudioMNTP? Are they both necessary? What is the final pipeline?

AudioNTP introduces continuous-valued tokens, and AudioMNTP extends it with MNTP training. Both are our novel contributions, so introducing both is necessary. Table 1 shows both are key to achieving SOTA audio generation with LMs. AudioMNTP—integrating both—is our final proposal. For clarity, we illustrate the full AudioMNTP pipeline below, covering training and inference.

Training

Please see Figure A for the training pipeline.

Waveforms are encoded into latents, masking is applied, and the remaining tokens form a new sequence. A Transformer decoder performs next-token prediction on this masked sequence, guided by target positional embeddings. The LM outputs (z) go into a small MLP diffusion head, trained with a diffusion loss to predict the next unmasked token.

See our response to Reviewer v7YS for more on target positional embeddings.
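To make the training step concrete, here is a minimal PyTorch-style sketch of how MNTP training could look. The module interfaces (vae_encoder, lm_decoder, diffusion_head) and tensor shapes are our own illustrative assumptions based on the description above, not the actual implementation.

```python
import torch

def mntp_training_step(wave, text_emb, vae_encoder, lm_decoder, diffusion_head):
    # Encode the waveform into a sequence of continuous latent tokens: (B, T, D)
    x = vae_encoder(wave)
    B, T, D = x.shape

    # Sample a mask ratio in [0, 1]; keep at least one token ("dropping" scheme)
    ratio = torch.rand(()).item()
    n_keep = max(1, int(T * (1 - ratio)))
    keep = torch.rand(B, T).argsort(dim=1)[:, :n_keep].sort(dim=1).values  # (B, n_keep)
    x_kept = torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, D))

    # One causal Transformer pass over the kept tokens; the (hypothetical)
    # `positions` argument lets the decoder add target positional embeddings
    # indicating which unmasked token each step should predict.
    z = lm_decoder(x_kept, positions=keep, text=text_emb)   # (B, n_keep, D_z)

    # Diffusion loss: the small MLP head learns to denoise the *next* kept
    # token given z (shift by one = next-token prediction on the kept sequence).
    # (BOS handling and the ratio == 1 edge case are omitted for brevity.)
    loss = diffusion_head.loss(cond=z[:, :-1], target=x_kept[:, 1:])
    return loss
```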

Inference

See Figure B and Figure C for step-by-step inference.

Given the BOS token and text embedding, the model generates the latent token z^0 using one Transformer pass. Conditioned on z^0, the audio token x^0 is sampled from noise via diffusion. Then x^0 is used to generate x^1 autoregressively. After generating a chunk of tokens, they are decoded into a waveform using a VAE decoder and a HiFi-GAN vocoder.

Are all the tokens general noise in the beginning?

Yes — each token is initialized with Gaussian noise and refined via the MLP diffusion head.
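Below is a hedged sketch of the corresponding inference loop, again with hypothetical interfaces (lm_decoder, diffusion_head.sample, vae_decoder, vocoder): the Transformer runs once per token, and the small MLP head turns Gaussian noise into the next continuous token.

```python
import torch

@torch.no_grad()
def generate(text_emb, n_tokens, bos, lm_decoder, diffusion_head, vae_decoder, vocoder):
    tokens = [bos]                                   # bos: (B, D) learned BOS embedding
    for _ in range(n_tokens):
        seq = torch.stack(tokens, dim=1)             # (B, len, D)
        # One Transformer pass per new token: conditioning vector z for this step
        z = lm_decoder(seq, text=text_emb)[:, -1]    # (B, D_z)
        # Small MLP diffusion head: start from Gaussian noise, denoise into x
        x_next = diffusion_head.sample(z)            # (B, D)
        tokens.append(x_next)
    latents = torch.stack(tokens[1:], dim=1)         # drop BOS: (B, n_tokens, D)
    return vocoder(vae_decoder(latents))             # VAE decoder + vocoder -> waveform
```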

Why use MLP for diffusion?

A simple MLP keeps latency low and enables real-time generation, which current TTA diffusion models don’t support. If the diffusion head is too slow, denoising becomes a bottleneck. See our response to Reviewer v7YS for latency discussion.

Does the model mainly replace LDMs with a Transformer?

No. Our Transformer runs once per token to produce a conditioning input for the small MLP, which does iterative diffusion. LDMs run the full model at each diffusion step, causing high latency.

2. Does not compare to Diffusion-Based Audio LM Baselines

To our knowledge, this is the first use of continuous-valued tokens in LMs for TTA. If we missed related work, we'd appreciate pointers. The most relevant work cascades a discrete audio LM and a diffusion model rather than using a single diffusion-based audio LM. Also, it is not open-sourced.

We propose AudioNTP (the first diffusion-based Audio LM) and improve it with AudioMNTP. Table 1 compares our methods to LDMs and discrete LMs, supporting two key claims:

  1. Continuous audio LMs outperform discrete ones.
  2. MNTP improves training over next-token prediction.

3. Demos

Thanks for the suggestion! Please see our demo page. Our method outperforms AudioGen Large and AudioLDM2-Full-Large and matches Tango 2. Due to internal download limits, we can’t include more samples for the ablation. But Tables 3 & 4 support our claims.

4. Why not scale up the datasets?

Our method already achieves SOTA on standard datasets. We are scaling up and will include results in the final version if accepted (see reply to Reviewer 8Ca7). Thanks for the suggestion!

5. Why split into speech vs. non-speech?

Speech is harder to generate and not a common case in TTA. Most TTA systems — including ours — can’t generate intelligible speech. Since ~50% of AudioCaps involves speech, it can obscure performance on sound. So, we split evaluation into pure sound and sound + speech (see Appendix G). We agree current naming can be confusing and will revise them. The rationale will also be added to the main text.

6. Missing experimental details?

Most details are in Appendix F and G due to page limits. Let us know if anything should be moved to the main text or is missing.

7. AudioBox and Re-AudioLDM

These models are not public, so direct comparison is difficult. We will include discussion in the final version.

8. Inference time

Latency details will be added in the final version. Please see our response to Reviewer v7YS for results and discussion.


We appreciate your thoughtful feedback! Your comments helped us improve clarity and depth. If you feel your concerns are addressed, we’d be grateful if you could consider raising your score. Let us know if anything can be further improved. Thank you!

Reviewer Comment

Thanks for the reply from the authors. However, I have decided to keep my score, as the authors do not seem to have provided most of the requested items (excluding those claimed to be added in the final version).

Author Comment

We thank the reviewer for the comments. However, we believe that we have addressed most of the reviewer's requests. For clarity, we summarize them point-by-point below:

Main requests


Main Weakness: Presentation and Demo page

We address the request in "1. Presentation Improvement". Specifically, we provided 3 new figures and explanations to illustrate the training and inference pipeline step-by-step. Furthermore, we also mentioned the additional clarification on the target positional embedding in "Response to Reviewer v7YS". Finally, we provided the demo page. We would be grateful if the reviewer could provide details on the reading difficulty so we can improve the organization!

Question 1: What is the main proposed model, why introduce both AudioNTP and AudioMNTP? Are both necessary?

We address the request in "1. Presentation Improvement". Specifically, we clarify that AudioMNTP is our main proposed model. However, introducing AudioNTP is also necessary, as there was no existing diffusion-based audio language model baseline in the literature. Therefore, we propose AudioNTP as the first such baseline and improve upon it with AudioMNTP.

Question 2: Why use MLP architecture for the diffusion-based section? Any consideration for other architectures?

We address the requests in "1. Presentation Improvement" and "8. Inference Time". "1. Presentation Improvement" clarifies the step-by-step inference pipeline and explains why using a simple MLP is critical for inference (for real-time usage). "8. Inference Time" provides inference time results and compares them to several baselines. Our model achieves real-time performance, whereas conventional diffusion models fall short of real-time capabilities. Furthermore, Table 4 in our paper shows that even a modest increase in the diffusion head’s complexity leads to a notable increase in overall latency. Therefore, we do not consider more sophisticated architectures such as Transformers.

Question 3: What is the overall structure of the model? Is the proposed model mainly replacing the previous LDM/Transformer-based feature generation model with the Transformer-based decoder?

We address the request in "1. Presentation Improvement". We provide three new figures to illustrate the overall structure of the model and explain the training and inference pipelines step by step. We also clarify the differences between our proposed approach and simply replacing the previous LDM/Transformer-based feature generation model with a Transformer-based decoder.

Question 4: Can we provide some demos of the result?

We address the request in "3. Demos". Our demo page provides pairwise comparisons between AudioMNTP and several baselines, demonstrating the effectiveness of our method.

Question 5: What is the inference time of the proposed model?

We address the request in "8. Inference time". The inference results suggest that our proposed model achieves real-time performance and is significantly faster than conventional diffusion models. Also, our model is comparably fast to discrete Audio LMs but provides much better generation quality.

Question 6: Why is the evaluation in Table 2 categorized into non-speech and speech? Does that mean the model can successfully generate real speech content?

We address the request in "5. Why split into speech vs. non-speech?". Specifically, we explain the rationale behind this division and propose renaming the categories to pure sound and sound + speech to avoid confusion. We also clarify that most TTA systems — including ours — cannot generate intelligible speech.

Additional requests we infer from the comments


Additional baselines for diffusion-based Audio LM

The authors claim that previous diffusion-based audio language models present bottlenecks and limitations. However, evidence supporting the effectiveness and superiority of the proposed AudioMNTP models over simpler baselines is not adequately provided.

We address the request in "2. Does not compare to Diffusion-Based Audio LM Baselines". Specifically, we point out that there is no prior diffusion-based audio LM for TTA. As a result, we propose the first such model, AudioNTP, as a strong baseline. Our AudioMNTP further improves upon AudioNTP and several existing baselines, including discrete LMs and diffusion models. The effectiveness and superiority of our method are supported by both objective metrics and the newly provided demo page.


We thank the reviewer again for the valuable feedback. As shown in the above point-by-point rebuttal, we believe our original response has addressed your requests. If there are any remaining concerns, we would greatly appreciate it if you could point them out in more detail. We are happy to further clarify or make additional improvements!

Review
Rating: 3

This paper presents a novel approach for generative audio language modeling using continuous-valued tokens instead of discrete tokens. The key contributions include:

  1. Following previous works such as masked autoregressive modeling (MAR), introducing continuous-valued audio tokens to replace discrete ones, which improves generative modeling by preserving more information.

  2. Proposing a novel Masked Next-Token Prediction (MNTP) learning task that enhances next-token prediction by incorporating masked token prediction.

  3. Demonstrating that their approach achieves significant improvements over AudioGen and UniAudio.

  4. Achieving results comparable to state-of-the-art diffusion models while maintaining a more efficient and streamable transformer-based causal language modeling framework.

  5. Validating the approach with extensive quantitative and qualitative evaluations, including human evaluation on speech and non-speech audio generation.

Questions for the Authors

Can you give more explanation about the design of the target positional embedding? From Figure 3, I can only find that it starts from the BOS token. I cannot fully understand its value.

Claims and Evidence

The claims made in the submission are supported by clear and convincing evidence.

  1. The claim that continuous-valued tokens lead to better generation quality is validated by improvements in FAD and KL divergence compared to discrete-token-based models like AudioGen.

  2. The effectiveness of MNTP is demonstrated through significant improvements over their baseline AudioNTP model, suggesting that masked token prediction enhances next-token prediction.

Methods and Evaluation Criteria

The proposed methods, including the use of continuous-valued tokens and MNTP, are well-motivated and clearly explained.

The choice of evaluation metrics (FAD, KL divergence, IS, CLAP score, and subjective human ratings) is appropriate and widely used in audio generation tasks.

Theoretical Claims

This paper is experiment-driven research without any theoretical claims.

Experimental Design and Analysis

The experimental designs effectively validate the proposed method in the appropriate benchmarks and comparisons.

Supplementary Material

I carefully reviewed all of the supplementary parts.

Relation to Prior Literature

This study follows the continuous LM training strategies in image/audio generation. Although the idea is not novel, this paper validates the effectiveness of such a training strategy in the audio generation domain. Furthermore, the authors also introduce the masked next-token prediction (MNTP) strategy.

Missing Essential References

No.

Other Strengths and Weaknesses

Weaknesses:

  1. The paper does not include a detailed comparison of inference latency between their model and diffusion-based methods, which would be critical for real-time applications.

  2. The proposed MNTP schedule needs more discussion:

    (1) The authors point out that "Recent works show that dropping instead of masking yields similar performances while greatly reduce the training cost". Did the authors conduct experiments to support this claim?

    (2) What is the max mask ratio for MNTP?

    (3) Which type of positional encoding is used?

Other Comments or Suggestions

Refer to Weaknesses parts.

Author Response

1. Inference Latency Comparison with Diffusion Models

We appreciate the reviewer’s comment and will include a latency discussion in the final version. We present the latency comparison in Table A.

Table A. Latency comparison of TTA models. Measured with batch size = 1 on a single NVIDIA A100 using 10-second clips. RTF = inference time / 10 sec. Bold: best; Italic: second best. AudioGen Base has no reported latency.

Model | Type | Param | Latency (s) ↓ | RTF ↓ | FD ↓ | FAD ↓ | KL ↓ | IS ↑ | CLAP ↑
AudioLDM 2 | Diffusion | 346M | 32.03 | 3.203 | 32.14 | 2.17 | 1.62 | 6.92 | 0.273
AudioLDM 2 Large | Diffusion | 712M | 65.23 | 6.523 | 33.18 | 2.12 | 1.54 | 8.29 | 0.281
Tango 2 | Diffusion | 866M | 60.77 | 6.077 | 20.66 | 2.69 | 1.12 | 9.09 | 0.375
AudioGen Base | Discrete LM | 285M | - | - | - | 2.84 | 2.14 | - | -
AudioGen Large | Discrete LM | 1B | 12.4 | 1.24 | - | 1.82 | 1.69 | - | -
AudioMNTP Base | Continuous LM | 193M | 13.43 | 1.343 | 14.81 | 1.68 | 1.16 | 9.67 | 0.336
AudioMNTP Large | Continuous LM | 462M | 15.77 | 1.577 | 14.3 | 1.22 | 1.17 | 9.81 | 0.341

AudioMNTP vs. Diffusion Models

AudioMNTP Base and Large achieve much lower latency and better quality than diffusion models. This is due to using a small MLP for diffusion steps, instead of repeatedly applying the full model. Over 90% of the parameters run only once per token.

The token sequence is 256 long—much shorter than the typical 1000+ denoising steps used in diffusion models.
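As a rough back-of-the-envelope comparison (our own notation, not figures from the paper): with C_LM the cost of one Transformer pass, C_MLP the cost of one diffusion-head step, and K denoising steps per token, generation costs roughly 256 × (C_LM + K × C_MLP), whereas a latent diffusion model pays its full-model cost at every one of its N ≥ 1000 denoising steps, i.e., about N × C_full. Keeping C_MLP much smaller than C_LM is what makes per-token diffusion affordable.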

AudioMNTP vs. AudioGen

Compared to AudioGen Large, AudioMNTP gives much better quality with only a small latency increase. Both use a single-pass Transformer, but AudioMNTP replaces the single-pass categorical head with a multi-pass diffusion MLP. The added diffusion steps greatly boost generation quality with a slightly higher latency.

Real-Time Applications

Only AudioGen and AudioMNTP reach near real-time generation (RTF ≈ 1). On an H100 GPU, real-time is easily achievable. Diffusion models remain too slow for such use cases.

2. Discussion on MNTP Schedule

Masking vs. Dropping

Table 3 (rows C–E) shows similar results for zero masking, Gaussian masking, and dropping:

Masking Scheme | FD ↓ | FAD ↓ | KL ↓ | IS ↑ | CLAP ↑
Zero Masking | 16.78 | 1.97 | 1.28 | 9.33 | 0.333
Gaussian Masking | 15.15 | 1.82 | 1.18 | 9.22 | 0.324
Dropping | 16.62 | 1.77 | 1.32 | 9.25 | 0.315

We chose dropping due to its lower compute cost, which improves efficiency and scalability.
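For illustration, a small sketch of how the three schemes differ (our own naming and shapes, not the authors' code): zero and Gaussian masking keep the sequence length, while dropping shortens the sequence and therefore reduces training compute.

```python
import torch

def apply_masking(x, mask, scheme):
    """x: (B, T, D) latent tokens; mask: (B, T) boolean, True = masked."""
    if scheme == "zero":
        # Zero masking: overwrite masked tokens with zeros (length unchanged)
        return x.masked_fill(mask.unsqueeze(-1), 0.0)
    if scheme == "gaussian":
        # Gaussian masking: overwrite masked tokens with noise (length unchanged)
        return torch.where(mask.unsqueeze(-1), torch.randn_like(x), x)
    if scheme == "drop":
        # Dropping: remove masked tokens; shorter sequences -> cheaper training
        return [xb[~mb] for xb, mb in zip(x, mask)]  # list of (T_kept, D) tensors
    raise ValueError(f"unknown masking scheme: {scheme}")
```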

Max Mask Ratio

We sample mask ratios from [0, 1]. When the ratio hits 1, we retain one random token and use the BOS token and positional embedding to predict it. We’ll clarify this—thank you for the observation.

Positional Encoding

We use absolute positional embeddings, as both training and inference operate on 10s clips. Extending this to longer sequences using RoPE is a promising direction for future work.

Can you give more explanation about the design of target positional embedding?

Certainly! Traditional prediction always targets the immediate next token. MNTP, in contrast, predicts tokens at variable distances. For instance, if the sequence is 0, 3, 6, 9, token 3 predicts 6, not 4. Since the offset varies, the model can't guess the target from context alone.

To help, we add a target positional embedding that tells the model which position to predict.

In Figure 3, each token predicts the next unmasked token at a variable offset. The embedding p_t encodes this offset (e.g., x^0 + p_t^3 predicts x^3, and x^3 + p_t^5 predicts x^5). We use absolute embeddings for p_t.
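A minimal sketch of how such a target positional embedding could be injected, under our own assumptions about shapes and with an absolute embedding table indexed by the target position:

```python
import torch
import torch.nn as nn

class TargetPositionalEmbedding(nn.Module):
    """Illustrative sketch (our own assumptions, not the paper's exact module)."""

    def __init__(self, max_len, dim):
        super().__init__()
        self.table = nn.Embedding(max_len, dim)   # absolute positional table

    def forward(self, x_kept, target_pos):
        # x_kept: (B, T_kept, D) tokens remaining after dropping
        # target_pos: (B, T_kept) absolute index of the next unmasked token
        # e.g. x^0 receives p_t^3 so that it predicts x^3; x^3 receives p_t^5
        return x_kept + self.table(target_pos)
```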

Thanks again—we agree the paper’s explanation was brief and will expand it in the revision.

3. Demo

We didn’t include the demo in the original submission. It’s now live at https://audiomntp.github.io/. Please have a look!


We truly appreciate your feedback. It helped improve the paper in concrete ways. If you feel your concerns are resolved, we’d be grateful if you could consider a higher score. Let us know if there's anything else we can improve. Thank you again!

Reviewer Comment

Thank you for your thorough and constructive response. My concerns have now been fully addressed. I maintain my initial overall assessment leaning towards acceptance.

Author Comment

We sincerely thank the reviewer for the kind words! If all concerns are addressed, would you consider a higher score?

Review
Rating: 3

This paper investigates generative audio language modeling with continuous-valued tokens. It begins with a next-token prediction approach in which each latent embedding (i.e., token) from an autoencoder is iteratively produced via a token-wise diffusion process. Building on this, the authors propose Masked Next-Token Prediction (MNTP), a strategy that predicts future tokens rather than strictly the next one. Experimental results demonstrate that MNTP not only improves both next-token and masked-token prediction but also matches or surpasses strong audio generation baselines. Furthermore, an extensive ablation study covering architecture design, training objectives, and token masking techniques, highlights the impact of each component on the model’s overall effectiveness.

update after rebuttal

I appreciate the authors' response to my concerns. Although the rebuttal did not present extensive new experimental evidence on those concerns, the work’s potential impact on its field remains noteworthy, albeit not groundbreaking. Consequently, I maintain my initial overall assessment leaning towards acceptance.

Questions for the Authors

I don't have any other questions.

Claims and Evidence

The proposed methods and evaluation criteria are relevant.

Methods and Evaluation Criteria

Proposed methods and/or evaluation criteria are reasonable and aligned with the problem/application.

Theoretical Claims

I reviewed the proposed methods and found them sound.

Experimental Design and Analysis

I reviewed the experimental design and analyses and found them sound.

Supplementary Material

I reviewed "Ablating the components of MNTP" and "Masking Schedules".

Relation to Prior Literature

This work contributes to improving language modeling in the audio domain and introduces innovations to further enhance generative performance, which could also be applicable or relevant to other generative modeling domains, such as image generation or text-to-speech.

Missing Essential References

Essential references are well discussed.

Other Strengths and Weaknesses

Strengths:

  • The proposed Masked Next-Token Prediction (MNTP) approach is both conceptually simple and effective. It delivers stronger results than standard next-token or masked-token prediction techniques.
  • The authors' explanations throughout the manuscript, including insightful footnotes, effectively highlight the contributions of this work and clearly distinguish it from other methods. Their detailed discussion enriches the understanding of both the proposed method and experimental analysis.
  • The authors conducted extensive and rigorous experiments, providing comprehensive results. The detailed ablation study, covering architecture variations, training objectives, and token masking strategies, clearly illustrates how each individual component contributes to the overall performance and effectiveness of the method.

Weaknesses:

  • Only two model size variants (base and large) are considered, with an ablation study limited to the size of the MLP-based diffusion module. Including additional experiments focused on scalability, both in terms of model size and dataset size, would offer deeper insights into the proposed approach’s potential at larger scales.
  • The method is currently evaluated exclusively on language modeling involving continuous-valued tokens, leaving uncertainty regarding its applicability to discrete-valued tokens. Exploring the effectiveness of MNTP when applied to token sequences generated by vector-quantized VAEs, for example, could broaden its scope and demonstrate greater versatility.

Other Comments or Suggestions

Page 6, line 326: Typo – “(2) our model is significantly smaller, i.e., 866M Tango 2.”

Author Response

We are sincerely grateful for your kind words and thoughtful comments. Regarding the weaknesses, we address them in the following:

1. Larger scale in both model size and dataset

We fully agree that exploring the scaling behavior of our proposed methods would provide valuable insights. In this current submission, our primary focus is on introducing continuous-valued tokens and the MNTP method. With smaller-scale models and datasets, we achieve SOTA performance, justifying the effectiveness of our proposal. Nonetheless, we acknowledge the importance of examining whether these methods maintain their effectiveness at larger scales.

We have preliminary evidence of scaling effectiveness, where increasing model parameters from Base to Large by approximately 140% led to notable improvements in overall performance metrics. This positive trend is noteworthy, especially given that such improvements are not universally observed, as exemplified by the comparative performance of AudioLDM-full vs. AudioLDM-full Large and Magnet small vs. Magnet large.

We agree that scaling analyses are broadly relevant in the context of large language models. We would be eager to further investigate scalability; however, the substantial increase in convergence time and resource requirements associated with training larger models and datasets means we cannot provide these results during the rebuttal period. Should our submission be accepted, we commit to including detailed scalability analyses in the final manuscript.

Note. As a practical consideration, training the Large model on AudioSet, similar to the approach adopted by AudioGen, requires approximately one month of computational resources utilizing around 100 V100 GPUs. Even scaling down to the Base model demands around two weeks of training. These estimates do not account for the additional resources necessary to explore even larger model configurations.

2. Generalizability of MNTP to discrete tokens

Thanks again for the helpful comments! In this work, our primary claim emphasizes that both continuous-valued tokens and MNTP are crucial for achieving SOTA audio generation using language models, supporting MNTP's efficacy specifically for continuous data. However, extending MNTP methodologies to discrete tokens is indeed a compelling new direction and will be part of our future research efforts. We appreciate your suggestion!

3. Typo

We will fix the typo. Thanks for the comment!

4. Demo

We omitted the demo from our initial submission. Here is our demo page. Please take a look!


We appreciate your valuable comments! If you feel we have adequately addressed your concerns, please consider increasing the score. If you have any further suggestions for improving our paper, please do not hesitate to let us know. Thank you!

Reviewer Comment

I appreciate the authors' response to my concerns. Although the rebuttal did not present extensive new experimental evidence on those concerns, the work’s potential impact on its field remains noteworthy, albeit not groundbreaking. Consequently, I maintain my initial overall assessment leaning towards acceptance.

To further highlight and broaden contributions of this work, it would be beneficial to include additional experimental results. Specifically, providing evidence regarding the scalability of this approach could effectively highlight its advantages over competitive methods, aligning with the points raised by Reviewer eX1D regarding works like Audiobox. This aspect is particularly relevant given the rapid advancements in text-to-audio generation, exemplified by recent work like TangoFlux (I mention this purely for context regarding the field's progress, not necessarily requiring discussion or citation). Additionally, showing the MNTP module's potential applicability to discrete tokenization, which is widely used in other domains such as text-to-speech or image generation, could significantly broaden the perceived impact of this work.

Please let me know if I have misunderstood the authors’ response, if my comments extend beyond the initial concerns, or if the authors believe that they have already adequately addressed most of the reviewers’ requests.

Author Comment

We thank the reviewer for the kind words! We believe we are aligned with the reviewer's points:

  1. The large-scale results would further highlight the advantage of our approach, given the evidence shown by LLMs.
  2. The applicability of MNTP to discrete tokens would further broaden the impact of our proposal, given that discrete LMs are the mainstream.

We fully agree with the above comments, and we summarize our responses into key messages:

  1. We wish to highlight that our method, using a small-scale model and data, successfully outperforms several established baselines, all of which use large-scale models and data—serving as strong evidence of our method's effectiveness. Our scalability results, from the small (190M) model to the medium (460M) model, also verify scalability within a range we can afford. What we lack is the large-scale result (i.e., 1B), which is highly resource-demanding. The reviewer highlights the benefit of this further study and still accepts our paper. We plan to incorporate the large-scale results but cannot provide numbers at this point due to time constraints.
  2. Our scope in this paper is primarily focused on introducing the first continuous-valued audio language model (LM) for TTA, and on further improving it. Therefore, studying MNTP's effectiveness on discrete tokens may be somewhat out of scope, as MNTP remains critical for continuous LMs regardless of its effectiveness on discrete LMs. However, we fully agree with the reviewer's point that such a study could significantly broaden MNTP's applicability. As such, we believe it would be more appropriate to explore this topic in a separate paper specifically focused on the discrete use case.

We thank the reviewer for all the responses. We believe there is no misunderstanding. Thank you!

Final Decision

This paper explores generative audio language modeling using continuous-valued tokens. It starts with a next-token prediction framework, where each latent embedding from an autoencoder is generated through a token-wise diffusion process. Building on this, the authors introduce Masked Next-Token Prediction (MNTP)—a strategy aimed at predicting upcoming tokens beyond the immediate next one.

While reviewers initially noted missing results and comparisons, the authors addressed most of these concerns during the discussion phase. Despite the authors’ responsiveness and willingness to provide additional results and clarifications, the novelty and results of the proposed approach are not convincing enough for an ICML publication. I encourage the authors to better justify their architectural and design choices, and include results from additional speech / audio domains. This will greatly improve the quality of the submission.