PaperHub
Score: 5.0/10 (Rejected, 4 reviewers)
Ratings: 2, 3, 4, 4 (min 2, max 4, std. dev. 0.8); average confidence: 4.0
Novelty: 2.3, Quality: 2.5, Clarity: 2.8, Significance: 3.0
NeurIPS 2025

Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling

Submitted: 2025-05-09, Updated: 2025-10-29
TL;DR

Streaming speech-to-text and text-to-speech that scale to unbounded sequence lengths with constant memory, by modeling two streams across modalities that are delayed with respect to one another.

Abstract

Keywords
automatic speech recognition, text to speech, speech language models

Reviews and Discussion

Review (Rating: 2)

The paper introduces DSM, a delayed streaming modeling approach applied to decoder-only architectures. Building upon the proposed multistream architecture (Moshi) by Défossez et al. (2024), the paper introduces a controllable delay for the streaming processing. The DSM approach is evaluated on automatic speech recognition (ASR) and text-to-speech (TTS) tasks. The model is trained on English and French speech (automatically transcribed with Whisper) and tested on short- and long-form benchmarks.

Strengths and Weaknesses

Strengths:

  • The paper applies the recent multistream architecture (Moshi) to streaming tasks, introducing a controllable delay.
  • The model achieves good results in terms of quality and is trained in a large-scale setting.

Weaknesses:

  • Translation and, in general, crosslingual tasks are at the core of sequence-to-sequence modeling, and have historically been the more challenging tasks when shifting from attention-based encoder-decoder architectures to alternatives such as transducers [1,2] for streaming processing. However, such tasks are not addressed in the paper, which covers only monolingual and monotonic tasks such as ASR and TTS, strongly limiting its impact and preventing a sound validation of the approach.
  • The paper is not well written in general, and often the final message (or the purpose) of the paragraphs is not clear. For instance, the introduction seems very fragmented, starting by describing the streaming sequence-to-sequence task, then moving to offline architectures, then decoder-only architectures, then neural compression algorithms, without a clear flow. Meanwhile, many relevant works on streaming processing in ASR [3], MT [4], and ST [5] are not mentioned.
  • The paper points out that current models operate by regularly sampling the audio or video modality, preventing their application to continuous transcription or translation (lines 40-42), but then introduces a method operating at constant framerates (lines 48-49), and it is not clear how this solves the problem. Moreover, it is stated that "Yet, a major limitation of these decoder-only is their incompatibility with streaming. First, their prefix-based formulation requires access to the full input sequence before generation, which prevents real-time inference", while previous work [6] showed, instead, its feasibility for streaming applications.
  • The multistream architecture adopted in the paper is not a novel contribution of this paper, as it was proposed by Défossez et al. (2024). Therefore, the novelty lies in introducing the delay component, which is rather limited. Indeed, "delay conditioning" is not a novel concept, as most streaming and simultaneous models allow for partial or complete control over the delay of the model (e.g., the simplistic wait-k policy [7], which waits for k words before starting the output emission, where k is controllable by the user). Moreover, as already mentioned, the feasibility of adopting decoder-only architectures has already been addressed in previous work [6].
  • The Related Works section, and, in particular, the Streaming Sequence-to-Sequence Learning paragraph, contains mostly incorrect information. First, the SeamlessM4T family is not a family of streaming models (they are, instead, offline models), yet they are cited (multiple times) in the paragraph. SeamlessStreaming, a subsequent work focusing on streaming processing, would have been more appropriate. Moreover, other works [8] successfully used Seamless with a simultaneous policy, and there is no reason to claim that they are "incompatible with batching", which, moreover, does not make sense as we are dealing with streaming applications, i.e., the task of processing a continuous stream of information, without any batching. Lastly, there exist streaming models that are not modality specific (speech-to-text only or text-to-speech only), contrary to what the authors state (lines 66-67), such as SeamlessStreaming and StreamSpeech, already cited by the authors.
  • The performance of the proposed architecture is only compared against one system for ASR (WhisperStreaming), which is rather insufficient considering the works mentioned in the related works. Moreover, there is no detail about WhisperStreaming, which is the only relevant streaming ASR baseline considered in the paper, and the arbitrary choice of 2.5s as a hyperparameter is not motivated. Likewise, the choice of the TTS baselines is not motivated, and those baselines are not streaming, as latency is only reported for the proposed method (Table 5). Therefore, the comparisons are not sound for a streaming processing paper, whose results discussion is also particularly short.

[1] C. Yeh, J. Mahadeokar, K. Kalgaonkar, Y. Wang, D. Le, M. Jain, K. Schubert, C. Fuegen, and M. L Seltzer, “Transformer-transducer: End-to-end speech recognition with self-attention,” 2019.

[2] Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott, S. Koo, and S. Kumar, “Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss,” in Proc. ICASSP, 2020, pp. 7829–7833.

[3] Dominik Macháček, Raj Dabre, and Ondřej Bojar. 2023. Turning Whisper into Real-Time Transcription System. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations, pages 17–24, Bali, Indonesia. Association for Computational Linguistics.

[4] Javier Iranzo-Sánchez, Jorge Civera, and Alfons Juan. 2022. From simultaneous to streaming machine translation by leveraging streaming history. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6972–6985, Dublin, Ireland. Association for Computational Linguistics.

[5] Sara Papi, Marco Gaido, Matteo Negri, and Luisa Bentivogli. 2024. StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3692–3707, Bangkok, Thailand. Association for Computational Linguistics.

[6] J. Xue, P. Wang, J. Li, and E. Sun. 2023. A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability. In Proc. ASRU, pp. 1–7. doi: 10.1109/ASRU57964.2023.10389799.

[7] Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy. Association for Computational Linguistics.

[8] Sara Papi, Marco Gaido, Matteo Negri, and Luisa Bentivogli. 2024. SimulSeamless: FBK at IWSLT 2024 Simultaneous Speech Translation. In Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024), pages 72–79, Bangkok, Thailand (in-person and online). Association for Computational Linguistics.

Questions

  • Why do the authors cite Vaswani et al. (2017) for decoder-only Transformer architectures (Introduction)?
  • Why do the authors only use automatically transcribed speech as training data while gold-labeled data exists (at least for high-resource languages such as English)?
  • What is the purpose of reporting the results on both short- and long-form ASR if no latency measure is presented?
  • Why is batching a desired feature for streaming processing?
  • Why have you chosen monotonic tasks such as ASR and TTS for validating this framework?

Limitations

The limitations mentioned in the paper are rather short and do not address any technical limitations of the work, such as the languages analyzed, the choice of certain baselines, and, most importantly, the decision to restrict the paper's scope to ASR and TTS.

Justification for Final Rating

I decided to keep my scores for the following reasons:

  • The claimed novelty about the delay conditioning is not novel at all, as it is the main focus of all research around streaming and, especially, simultaneous speech processing. I mentioned some works in my responses (e.g., SimulSeamless and StreamSpeech), but they were just two among many others.
  • The claimed novelty about being the first work to prove that decoder-only architectures can be leveraged for streaming applications is not novel at all, as it was already introduced by Guo et al. (2024) and followed by other works.
  • The paper claims a novel sequence-to-sequence multistream framework for streaming, but i) the multistream was already proposed by Defossez et al. (2024) and only applied to streaming here (as acknowledged by the authors' response), ii) the seq2seq framework is validated only on ASR and TTS, which are monotonic and rather simple tasks compared to, for instance, translation and question answering. Therefore, I find the seq2seq claim not sufficiently empirically validated, and a framing of the work on ASR and TTS would have been more appropriate (and, by the way, also sufficient). Lastly, the authors mentioned "domain-specific contributions," but it is not clear what they should be.
  • The comparison with only one streaming ASR model (WhisperStreaming, for which the choice of the hyperparameters has not been motivated, even in the rebuttal) is not sufficient at all to validate the effectiveness of the approach on ASR, while other SOTA models exist, as already mentioned in the Related Works of the paper. Additional results promised by the authors were provided only on SeamlessStreaming without, however, including comparable latency measures (as AL/LAAL is a very different metric compared to the one used by the authors in this paper), making it impossible to understand if the comparison was fair (i.e., the systems are working with a similar latency).
  • The writing of the paper is, in general, not good and lacks almost all the related literature in the streaming field, which I already extensively mentioned in my original response.

Overall, almost none of the mentioned weaknesses were addressed during the rebuttal, and the responses given to my review were often incomplete (the authors even ignored many of the raised weaknesses in their first response) and partly unrelated (see the provided results about DSM-TTS, while I was not raising any point about that, only about the choice of the baselines).

Formatting Issues

No

Author Response

We thank the reviewer for their detailed feedback, and we hope to clarify some important aspects of the paper while taking the feedback into account. As we have significantly improved the results of our TTS model since the submission, we also include updated results for it in the last two sections.

About the interest of batching and streaming

Being able to batch and stream at the same time means that the models we developed can be served at scale, for instance through an API, while keeping inference cost low. We demonstrate a ratio of generated audio duration to computation time of 100 for TTS, and a ratio of transcribed audio duration to computation time of more than 400 for ASR, on a single H100, opening the way for cheap, energy-efficient, and scalable commercial applications.

About the related work

We thank the reviewer for the useful references. However, we believe the reviewer is mistaken about a few of them. We already cite [3] and compare to it; we have updated the related work to mention it earlier. Regarding the issue with the version of Seamless we cite, namely the version available on arXiv with the id 2312.05187, this version does cover the streaming case (Section 5, "Streaming Seamless"). Regarding the ability to batch Seamless, we maintain that the high number of components operating at different rates makes it non-trivial to batch compared with a Transformer decoder and, to the best of our knowledge, no batched implementation exists for it. The work cited in [8] does not solve that issue. As outlined in the previous section, our contribution is to provide a method that is both easy to stream and batch at the same time. We will add a citation to [7] for the fixed delay between modalities.

Regarding the mention in the introduction that existing decoder-only methods are limited by the different frame rates between modalities, we maintain that this is a limitation of a number of TTS models: they use a text prefix operating at its own frame rate, then switch to the frame rate of the audio. This means that they cannot stream with respect to both, or only at a coarse granularity, like CSM, which alternates text chunks and audio chunks. Our claim is that first aligning the modality frame rates is essential for simplifying the subsequent model design and reaching the best inference performance, both in terms of accuracy and throughput.

About the citation to Vaswani et al. (2017), we extended the citation to also refer to the work of Radford et al. (2018).

References:

Improving Language Understanding by Generative Pre-Training, Radford et al. 2018.

About the novelty

We state that our approach is derived from previous work done in the context of conversational speech-to-speech (Defossez et al., 2024), or speech-to-speech simultaneous translation (Labiausse et al. 2025). Nonetheless, this approach has never been proven to be competitive for two fundamental tasks in speech, namely ASR and TTS. We further provide a number of domain-specific contributions that allow this approach to be state of the art on both tasks. The proposed TTS is the first to be streaming both with respect to the output audio, and the input text. Our ASR generalizes to long form transcription of up to 2 hours, with a constant memory usage and no need for explicit cuts and stitches.

Comparison with transducer methods

We thank the reviewer for pointing to this relevant line of work [6]. However, those methods are not decoder-only, as they require at least two sequential models, namely the encoder and the prediction network. Those approaches are still relevant to our work and we added them to the discussion in the related work. However, we could not find an open-source implementation or results on a public benchmark that we could compare to.

TTS baselines

We compared our models to a large number of streaming and non-streaming TTS baselines, namely F5-TTS (Appendix, Section H), Orpheus, DIA, and CSM. F5-TTS is non-streaming because it uses diffusion, while the three others use a text prefix, so that the latency depends on the length of the prefix and the time to process it. Besides, to the best of our knowledge, Orpheus and DIA use a non-causal codec. Only our method provides a fixed, guaranteed latency, as it is the first method to be streaming both with respect to the output audio and the input text, with improved accuracy. This should not be considered a limitation of our work but rather proof of the importance of our contribution.

ASR baselines

To the best of our knowledge, WhisperStreaming is the only publicly available streaming method for ASR. We also compare to a number of non-streaming methods, achieving competitive results.

About the use of gold standard data

Existing gold standard data does not cover all use cases: existing datasets are often segmented into short utterances, and mostly cover mono-speaker data. The use of real-world data allows us to cover long-form modeling for both ASR and TTS, as well as richer dialog interactions. In Section H, we do present a TTS model trained only on publicly available datasets with real transcripts. See the section below, as we have significantly improved the results since the submission and extended the benchmarks. For the ASR model, we use a mixed strategy, with pretraining on real-world, long-form data with pseudo-labels, and fine-tuning on more specific datasets with true transcripts.

ASR, long-form and latency

What is the purpose of reporting the results on both short- and long-form ASR if no latency measure is presented?

DSM-ASR is a streaming model, hence it is naturally adapted to transcribing long-to-infinite sequences of incoming audio. Handling those requires no changes with respect to short sequences. In contrast, many standard approaches, such as Distil-Whisper, rely on chunking and stitching (https://arxiv.org/abs/2311.00430, S3), which requires handling the stitching and can harm their transcription quality. Hence, we believe it is interesting to compare performance on long-form data, too.

As for the latency, due to relying on local attention with a limited span (equivalent to 30 seconds in the final model), DSM-ASR's latency is largely independent of the length of the inputs. In order to verify that, we measured the average step time when processing batches of 400 audio sequences. First, we used long-form sequences extracted from the Rev16 dataset (16 utterances up to 2h long) and got an average step time of ~77ms, which meets the requirements for real-time processing (< 80ms). Then we ran the same measurement on sequences of at most 30 seconds and got a similar value of ~77ms.

We will add those observations to the text.
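For illustration, here is a minimal sketch of this kind of per-step timing measurement. The step function, batch handling, and warm-up count are placeholders for this sketch and not our actual benchmark code.

```python
import time

FRAME_MS = 80.0  # one model step covers 80 ms of audio (12.5 Hz frame rate)

def average_step_time_ms(step_fn, frames, warmup=10):
    """Average wall-clock time per streaming step, excluding warm-up steps."""
    for frame in frames[:warmup]:          # warm-up: exclude one-off costs
        step_fn(frame)
    start = time.perf_counter()
    for frame in frames[warmup:]:
        step_fn(frame)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / max(1, len(frames) - warmup)

# Toy usage with a dummy step function standing in for the batched model step.
dummy_step = lambda frame: frame
avg_ms = average_step_time_ms(dummy_step, list(range(1000)))
meets_real_time = avg_ms < FRAME_MS        # must stay under 80 ms per step
```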

Improved results for DSM-TTS

We improved the results for the DSM-TTS model in several ways:

  1. We updated the Transformer over the Q dimension to use partial weight sharing, following Labiausse et al. (2025). This reduces the size of the DSM-TTS model from 3.7B parameters to 1.8B parameters.
  2. We reduced the delay between the text and audio from 2 seconds to 1.28 seconds (or 16 steps), reducing the latency with a batch size of 1 from 185 ms to 150 ms, and for a batch size of 64, from 708ms to 403ms.
  3. Training for 750k updates instead of 250k updates.
  4. Previous work uses a larger loss weight on the semantic tokens than on the acoustic ones (e.g. Defossez et al. (2024)). We observed this was detrimental when training a TTS model and reduced the weight from 100 to 10.

Those changes improved accuracy while reducing its size and latency. We report updated results hereafter (including a speaker similarity metric following F5-TTS methodology). We will update subjective evaluations in the camera ready.

| Model | # Params. | WER English | WER French | Spk. Sim. English | Spk. Sim. French |
|---|---|---|---|---|---|
| DSM-TTS submission | 3.9B | 3.6% | 6.4% | 0.70 | 0.70 |
| DSM-TTS rebuttal (250k updates) | 1.8B | 2.0% | 3.2% | 0.72 | 0.73 |
| DSM-TTS rebuttal (750k updates) | 1.8B | 1.6% | 3.0% | 0.74 | 0.75 |

Improved results for DSM-TTS trained on public datasets

We applied point (4) from the previous section, to the model introduced in the Appendix, Section H. We also trained a 900M parameter model with Q=32 codebook levels. We report the results hereafter.

| Model | # Params. | WER (LibriSpeech) | Speaker Sim. (LibriSpeech) |
|---|---|---|---|
| F5-TTS | 336M | 2.42% | 0.66 |
| DSM-TTS Q=16 (at submission) | 750M | 2.12% | 0.56 |
| DSM-TTS Q=16 (at rebuttal) | 750M | 1.95% | 0.67 |
| DSM-TTS Q=32 (at rebuttal) | 900M | 1.68% | 0.71 |

Finally, following Du et al. (2025), e.g. CosyVoice 3, we added evaluations on the SEED test-en dataset. Compared with F5-TTS (non-streaming), we achieve better WER but worse speaker similarity. Our 750M model gets close to the performance of CosyVoice 3-1.5B RL, while being half its size and not requiring a reinforcement-learning-based fine-tuning stage.

| Model | # Params. | WER (Seed-EN) | Speaker Sim. (Seed-EN) |
|---|---|---|---|
| F5-TTS | 336M | 2.25% | 0.76 |
| CosyVoice 3-1.5B RL | 1.5B | 1.45% | 0.70 |
| DSM-TTS Q=16 | 750M | 1.58% | 0.70 |
| DSM-TTS Q=32 | 900M | 1.71% | 0.73 |

References:

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models, Anastassiou et al. 2024.

CosyVoice 3: Towards in-the-wild speech generation via scaling-up and post-training, Du et al. 2025.

Comment

I thank the authors for the response, and I provide my answers below:

Batching

I still do not understand why the batching optimization should be a desirable feature for research work on streaming processing, promoted as a fundamental contribution. As the authors mentioned, this can impact large-scale requests through APIs, which is rather an engineering contribution for specific use cases that do not necessarily generalize. Moreover, I believe it has nothing to do with the core research on streaming processing, and nothing prevents applying batching to existing works, as I mentioned in my original review.

Related work

Unfortunately, the authors have not understood my comment, and the usage of Seamless (without Streaming) is very misleading, as it recalls the family of the offline Seamless M4T models, while the only model that works on streaming is Seamless Streaming (https://huggingface.co/facebook/seamless-streaming). So, I suggest that the authors accurately revise the content of the related works in the paper (both related works and introduction), which, as I said, is not clear and almost neglects the existing literature on streaming processing (which I have extensively cited in my original review).

Novelty + Transducer models

The response of the authors doesn't address the novelty concern, as it refers to the application of previous methods (Defossez et al., 2024; Labiausse et al. 2025) to the streaming ASR and TTS scenarios and does not clarify what the "domain-specific contributions" are. Moreover, my original concern was about selling the "delay conditioning" as a novel contribution, while extensive literature has worked exactly in this direction, as mentioned in my review. This aspect has not been addressed in the response. Lastly, the mention of [6] was to testify that the feasibility of using decoder-only architectures for streaming purposes has already been proved by related work (and followed by many others), and it is not, therefore, a novel contribution of this work, as has been claimed ("Yet, a major limitation of these decoder-only is their incompatibility with streaming").

ASR Baseline "WhisperStreaming is the only publicly available streaming method for ASR" + TTS Baselines + all additional results

WhisperStreaming is not the only available streaming method for ASR, as extensive research has been done in this field. As mentioned above, SeamlessStreaming (already cited by the authors) is one of the many publicly available models supporting streaming ASR (also StreamSpeech, among others). Moreover, my concern was about the lack of motivation for the baseline choice (which still has not been motivated in the rebuttal), besides raising serious concerns about the only comparison done for ASR, which still holds true. Lastly, I do not see how the results presented in the last part of the rebuttal address the weaknesses or questions presented in my review.

As an important point, I would like to point out that the first weakness raised in my review (about the choice of monotonic tasks and neglecting translation for validating the framework) has not even been mentioned in the rebuttal. Lastly, the response given for the modality sampling (which was included in the related work response, but is a weakness per se) does not clarify at all how the proposed solution solves the problem that "current models operate by regularly sampling the audio or video modality, preventing their applications to continuous transcription or translation" (lines 40-42). In general, the rebuttal is difficult to follow as it does not specifically address each of my concerns and questions (for instance, it starts with the batching, which is briefly mentioned in the 5th weakness and 4th question), and it is very hard to map the weaknesses to the related responses (if any) given by the authors.

Comment

We thank the reviewer for coming back to us with some of the points that we couldn’t clarify. We will keep our reply short and centered around the points raised.

Batching

It is not straightforward to engineer a solution if a model wasn’t designed from the ground up with a given objective in mind. For instance, the encoder-decoder design of Whisper meant that StreamingWhisper is inefficient in terms of compute, as we show in our measurements. In the case of the Transducer architecture that is used in [6], each batch entry will need to run the Predictor model at different times rather than batched together.

Unfortunately, the authors have not understood my comment, and the usage of Seamless (without Streaming) is very misleading,

In the current discussion we agree we should have used the name SeamlessStreaming. Any reference to Seamless in our previous reply should be understood as SeamlessStreaming. We ensured that we are citing the proper version in the manuscript, as advised on the SeamlessStreaming model card.

The response of the authors doesn't address the novelty concern

Our most important contribution is as follows: when working on the ASR and TTS tasks, we show that pre-aligning the text and audio domains to the same frame rate allows us to obtain a decoder-only model that is fast, simple, streaming and batchable, while achieving better accuracy than the existing state of the art.

Our domain-specific contributions are as follows: we show that the distillation of Whisper-medium on a large audio corpus into DSM-ASR already yields large WER gains (we added this evaluation for Reviewer XjrX; the teacher model reaches 8.1% WER on the leaderboard, against 6.4% for our method). We propose to use dynamic time warping to replace errors in the Whisper transcripts while retaining the word timestamps, in order to fine-tune on gold-standard data. This improves the WER to the final 6.3%. We introduce the action and lookahead streams for DSM-TTS. We provided Reviewer vNWT with ablations on the importance of those two contributions.

Moreover, my original concern was about selling the "delay conditioning" as a novelty

We confirm that we will credit [8] for supporting variable delay, as well as to StreamSpeech, which also supports selecting the delay.

[6] was to testify that the feasibility of using decoder-only architectures

Could you confirm that the intended citation is indeed "A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability"? After reading this paper again, we could not find a decoder-only model presented there, but instead a Transducer model, with two Transformers, one for the "Predictor" and one for the "Encoder", following the terminology from [6].

WhisperStreaming is not the only available streaming method for ASR

We apologize for having missed that SeamlessStreaming also supported ASR without translation. We are working to add comparisons on the ASR leaderboard for it, along with StreamSpeech.

about the choice of monotonic tasks and neglecting translation for validating the framework)

We selected the tasks of ASR and TTS, which are the two most competitive and mainstream tasks in speech processing. We agree that we could have covered more ground, but this is left for future work, as we consider that developing state-of-the-art methods for ASR and TTS is of wide enough interest to the NeurIPS community. Finally, we also bring to the attention of the reviewer that non-monotonic tasks can be reduced to monotonic tasks through appropriate alignment during data preprocessing, as shown by Labiausse et al. (2025) for speech translation.

Lastly, I do not see how the results presented in the last part of the rebuttal address the weaknesses or questions presented in my review.

We provided those updated results to all reviewers, as they represent significant improvements over our initial results and the state of the art for TTS.

We thank again the reviewer for engaging with us, and hope that we provided more definitive answers.

Comment

It is not straightforward to engineer a solution if a model wasn’t designed from the ground up with a given objective in mind. For instance, the encoder-decoder design of Whisper meant that StreamingWhisper is inefficient in terms of compute, as we show in our measurements. In the case of the Transducer architecture that is used in [6], each batch entry will need to run the Predictor model at different times rather than batched together.

I do agree that it's a valuable effort and allows for large-scale adoption, but, anyway, the batching was neither the core of the paper (as the title says, it focuses on streaming with delay control) nor the core of the raised weaknesses in my review (see original comment).

We confirm that we will credit [8] for supporting variable delay, as well as to StreamSpeech, which also supports selecting the delay.

The delay conditioning is not only contained in these works, but it's the focus of simultaneous and streaming research (see all cited papers in my original review), starting from approaches like wait-k. Therefore, it does not represent a novelty of the current work, as also acknowledged now by the authors. In addition, the usage of decoder-only architectures for streaming applications is also not a novelty of this work, as it was proposed for ASR since 2023 with the work Wu, Jian, et al. "On decoder-only architecture for speech-to-text and large language model integration." 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023. and followed by many subsequent works (see the 154 papers citing this work), including streaming ones (e.g., Shoutao Guo, Shaolei Zhang, and Yang Feng. 2024. Decoder-only Streaming Transformer for Simultaneous Translation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8851–8864, Bangkok, Thailand. Association for Computational Linguistics).

We apologize for having missed that SeamlessStreaming also supported ASR without translation. We are working to add comparisons on the ASR leaderboard for it, along with StreamSpeech.

I appreciate the authors' effort, and I look forward to the results, as I think a single comparison with one model is not enough to demonstrate the effectiveness of the proposed approach.

We selected the tasks of ASR and TTS which are the two most competitive and mainstream tasks in speech processing.

However, these tasks are both monotonic, as I mentioned in my review, and, unfortunately, this is not sufficient to claim that this is a novel "framework for streaming sequence-to-sequence learning across modalities" as it neglects important tasks such as translation, but also other types of tasks. I agree that ASR and TTS alone are of enough interest, but this does not support the claim of a generic seq2seq framework, as it is done in the paper, and the framing of the paper would have been more appropriate if proposed for ASR and TTS tasks, for which experiments are actually presented in the paper.

Comment

Thanks a lot for your response.

The delay conditioning is not only contained in these works, but it's the focus of simultaneous and streaming research (see all cited papers in my original review), starting from approaches like wait-k.

As far as we understand, wait-k was implemented for same-modality sequence-to-sequence tasks, e.g. speech-to-speech or text-to-text translation. When considering cross-modality tasks such as ASR and TTS, one can either use a fixed rate, e.g. output 1 second of audio every 10 words, but this would likely fail on real-world data that is not carefully curated to ensure a constant speech rate with no pauses. Alternatively, one needs some alignment between the streams ahead of training, which is precisely what we propose to do. We show in our work that this assumption vastly simplifies cross-domain sequence-to-sequence modeling, as the reviewer noted, by enabling the simple wait-k policy with a decoder-only model.
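To make this concrete, here is a small sketch of what "aligning the streams ahead of training" can look like: word-level timestamps are mapped onto a fixed frame grid and one stream is shifted by a constant delay. The 12.5 Hz frame rate and the padding token are assumptions for illustration, not the exact DSM preprocessing.

```python
FRAME_RATE = 12.5   # frames per second, i.e. one step every 80 ms (assumed)
PAD = "<pad>"       # placeholder token for frames that carry no word

def align_text_to_frames(words_with_times, num_frames):
    """Place each word on the frame containing its start time; pad elsewhere."""
    stream = [PAD] * num_frames
    for word, start_sec in words_with_times:
        idx = int(start_sec * FRAME_RATE)
        if idx < num_frames:
            stream[idx] = word
    return stream

def delay_stream(stream, delay_frames):
    """Shift a stream a fixed number of frames into the future."""
    return [PAD] * delay_frames + stream[: len(stream) - delay_frames]

# Toy example: three word timestamps over a 2-second (25-frame) window,
# with the text stream delayed by 6 frames (~0.48 s) relative to the audio.
words = [("hello", 0.10), ("streaming", 0.60), ("world", 1.40)]
text_stream = align_text_to_frames(words, num_frames=25)
delayed_text_stream = delay_stream(text_stream, delay_frames=6)
```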

In addition, the usage of decoder-only architectures for streaming applications is also not a novelty of this work, as it was proposed for ASR since 2023 with the work Wu, Jian, et al.

We agree that in the non-streaming case, the decoder-only with prefix approach works great (the setting that Wu et al. work on). We cite a number of prefix-based decoder-only methods, in particular for TTS. We will include this work, too.

including streaming ones (e.g., Shoutao Guo, Shaolei Zhang, and Yang Feng. 2024. Decoder-only Streaming Transformer for Simultaneous Translation.

We thank the reviewer for pointing to this work. In comparison with the method proposed by Guo et al. (2024), we do not try to learn the alignment policy during training. The method from Guo et al. requires a specific attention module with an O(T^3) complexity to train (with T the sequence length), which would make it impractical for long-form, and requires twice as many forward passes in the backbone as our design.

We will include those considerations in the text.

however, these tasks are both monotonic, as I mentioned in my review, and, unfortunately, this is not sufficient to claim that this is a novel "framework for streaming sequence-to-sequence learning across modalities"

We understand that it might look like we are over-stating our contributions, in particular compared to some of the work the reviewer cited, e.g. Guo et al. (2024). It was not our intention to claim that we can learn a general policy for aligning modalities.

We will refine our abstract as follows:

We introduce Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning. Sequence-to-sequence generation is often cast in an offline manner, where the model consumes the complete input sequence before generating the first output timestep. Alternatively, streaming sequence-to-sequence approaches rely on learning a policy for choosing when to advance on the input stream or write to the output stream. DSM instead models already time-aligned streams with a decoder-only language model. By moving the alignment to a pre-processing step, DSM provides streaming inference of arbitrary output sequences, from any input combination, making it applicable to many sequence-to-sequence problems.

Comment

We ran an evaluation of the SeamlessStreaming model using the official script. We also noticed that SeamlessStreaming occasionally produces tags like "#ah" for various vocalizations, which we removed. [...]

I thank the authors for the additional results, even if only on SeamlessStreaming. However, the latency measurement, which is equally important to quality metrics in streaming applications, has not been provided. If the systems operate at very different latency regimes, the comparison is not really fair.

As far as we understand, wait-k was implemented for same-modality sequence-to-sequence tasks, e.g. speech-to-speech or text-to-text translation. When considering cross-modality tasks such as ASR and TTS, one can either use a fixed rate, e.g. output 1 second of audio every 10 words, but this would likely fail on real-world data that is not carefully curated to ensure a constant speech rate with no pauses.

No, this is not correct. Indeed, it has been successfully applied to speech-to-text since 2020 [1,2], but, in any case, this was just one instance (the simplest yet most popular) of the possible policies applied for "delay conditioning" in the literature, which is not a novelty of this work. See also recent surveys on the topic to find all the other examples [3,4].

We thank the reviewer for pointing to this work. In comparison with the method proposed by Guo et al. (2024), we do not try to learn the alignment policy during training. The method from Guo et al. requires a specific attention module with an O(T^3) complexity to train (with T the sequence length), which would make it impractical for long-form, and requires twice as many forward passes in the backbone as our design.

Again, this was only one of the papers successfully employing decoder-only architectures for streaming applications (see [5] for another instance, but many more exist), which is clearly not a novelty of this work, and is the point of one of the raised weaknesses.

We understand that it might look like we are over-stating our contributions, in particular compared to some of the work the reviewer cited, e.g. Guo et al. (2024). It was not our intention to claim that we can learn a general policy for aligning modalities.

Throughout the paper, it is often claimed that the framework is "applicable to many sequence-to-sequence problems", but then it is applied to ASR and TTS only. As I mentioned before, this is not a problem in itself, but the overall claims of the original paper should be toned down, and this would require a thorough revision of the paper's writing, which, as also suggested by Reviewer XjrX, would also benefit from a more complete literature review and a new framing of the actual contributions of the work given the above considerations.

[1] Ma, Xutai, Juan Pino, and Philipp Koehn. "SimulMT to SimulST: Adapting Simultaneous Text Translation to End-to-End Simultaneous Speech Translation." AACL 2020.

[2] Xingshan Zeng, Liangyou Li, and Qun Liu. 2021. RealTranS: End-to-End Simultaneous Speech Translation with Convolutional Weighted-Shrinking Transformer. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2461–2474, Online. Association for Computational Linguistics.

[3] Prabhavalkar, Rohit, et al. 2023. "End-to-end speech recognition: A survey." IEEE/ACM Transactions on Audio, Speech, and Language Processing 32 (2023): 325-351.

[4] Sara Papi, Peter Polák, Dominik Macháček, Ondřej Bojar; How “Real” is Your Real-Time Simultaneous Speech-to-Text Translation System?. Transactions of the Association for Computational Linguistics 2025; 13 281–313.

[5] Chen, P., Sun, S., Shan, C., Yang, Q., Xie, L. (2024) Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study. Proc. Interspeech 2024, 4468-4472, doi: 10.21437/Interspeech.2024-1853

Comment

We ran an evaluation of the SeamlessStreaming model using the official script. We also noticed that SeamlessStreaming occasionally produces tags like "#ah" for various vocalizations, which we removed.

We get the following results:

| Model | Ami | Earnings22 | GigaSpeech | LS clean | LS other | SPGISpeech | TED-LIUM | VoxPopuli |
|---|---|---|---|---|---|---|---|---|
| SeamlessStreaming | 45.0 | 31.8 | 21.6 | 6.8 | 10.6 | 15.4 | 12.4 | 13.9 |
| DSM-ASR | 11.7 | 10.6 | 9.7 | 1.7 | 4.3 | 2.0 | 3.4 | 6.8 |

From this table, we can conclude that DSM-ASR has considerably lower WER than SeamlessStreaming, thus outperforming this second streaming baseline, in addition to the WhisperStreaming family we used initially.

Comment

However, the latency measurement, which is equally important to quality metrics in streaming applications, has not been provided. If the systems operate at very different latency regimes, the comparison is not really fair.

For this comparison, it would be easier to focus on a single dataset, LibriSpeech test-clean. SeamlessStreaming obtains WER of 6.8%. According to the results produced by their evaluation script, the Length-Adaptive Average Lagging (LAAL) metric is 1.2 seconds.

Keeping in mind that DSM-ASR is a fixed-delay model, we invite the reader to look at Figure 4 (left) in our paper. We see that conditioned and unconditioned DSM-ASR, at a 1.2s delay, have a WER of roughly 1.9-2.1%. At a 0.8s delay, our model would have a WER under 2.5%.

Hence we conclude that, based on these measurements, DSM-ASR Pareto-dominates SeamlessStreaming on the WER-latency front.

Moreover, upon analyzing Table 29 in the Seamless report, it seems that the AL/LAAL and WER metrics are relatively inflexible with respect to the decision threshold used (although there the evaluation is done on 90 languages).

Review (Rating: 3)

The paper presents an innovative Delayed Streams Modeling (DSM) method and demonstrates its application in the realms of Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) tasks. This work offers a valuable streaming solution for a variety of sequence-to-sequence problems by effectively leveraging the temporal delays between input and output streams.

Strengths and Weaknesses

Strengths

  1. The paper is well-structured and clearly written, supported by comprehensive experiments and comparisons. The authors' intention to open-source the code is commendable and will facilitate reproducibility and further research.
  2. The proposed idea is innovative and represents a novel approach to the problem.
  3. The ASR performance is strong, achieving results that are even comparable to those of non-streaming models.

Weaknesses

  1. The details of the DSM-TTS implementation, particularly regarding the lookahead and action streams (Sec. 3.3), could benefit from further clarification. a. For instance, Figure 3 lacks clear indications of which streams correspond to which functions. Additionally, the speaker conditioning approach appears somewhat confusing. b. The reason for using a fine-tuned Mimi codec for each speaker is not well explained. If this approach is indeed used, it may imply that the model is not designed to handle unseen speakers. c. The implementation of dialogue generation and the provision of multi-speaker embeddings are not clearly explained in this section.
  2. The Real-Time Factor (RTF) of 2.7 is relatively high, especially for a TTS model intended for streaming applications. This may limit its suitability for real-time practical applications. The evaluation of F5-TTS in the appendix reports an RTF of 14.7, which seems inconsistent with typical values observed in the literature (usually less than 0.5). It is possible that there may be some misunderstanding or misinterpretation of the RTF calculation.
  3. Some references, particularly in the dialogue generation section, are missing.

[1] Zhang, L., Qian, Y., Zhou, L., Liu, S., Wang, D., Wang, X., ... & Zeng, M. (2024). CoVoMix: Advancing zero-shot speech generation for human-like multi-talker conversations. Advances in Neural Information Processing Systems, 37, 100291-100317.

[2] Lu, H., Cheng, G., Luo, L., Zhang, L., Qian, Y., & Zhang, P. (2025). SLIDE: Integrating Speech Language Model with LLM for Spontaneous Spoken Dialogue Generation. arXiv preprint arXiv:2501.00805.

[3] Ju, Z., Yang, D., Yu, J., Shen, K., Leng, Y., Wang, Z., ... & Li, X. (2025). MoonCast: High-quality zero-shot podcast generation. arXiv preprint arXiv:2503.14345.

Questions

In addition to the points raised in the weakness section, I still have several further questions:

  1. The pre-training uses "whisper-transcribed" data, yet the ASR performance exceeds that of Whisper-large v3. What factors contribute to this improvement, and why does the model outperform Whisper-large v3 despite using its transcriptions for pre-training?

  2. The DSM-TTS is reported to have a delay of 2 seconds (as mentioned in the paper on line 185), which may be considered lengthy for real-time human-interactive applications. However, the latency reported in Appendix Table 5 is 185 milliseconds. Could you clarify the distinction between the latency mentioned here and the actual latency experienced in real-world applications?

  3. The throughput of the model is notably high. What specific factors or optimizations contribute to this high throughput?

Limitations

yes

Justification for Final Rating

There are still several questions and concerns that need further clarification.

Specific details, such as Mimi’s training loss and strategies, need to be explained more clearly to enhance the comprehensive understanding of the research. The statement “this pipeline is not used for training” needs clarification, especially regarding the rationale for its inclusion in the article. The paper could benefit from additional references and more clearly articulated innovative points.

Therefore, the current score will be maintained

Formatting Issues

No

Author Response

We thank the reviewer for their careful review of our work and their suggestions. We clarify and account for them in the following sections. We also report improved results for the TTS in the last two sections.

Clarifications about DSM-TTS

We updated Figure 3 to indicate the action and audio streams in the legend.

We want to clarify that the speaker embedding encoder is not fine-tuned per speaker. We clarified Section 3.3 as follows:

Each speaker audio extract is encoded with a speaker encoder and results in a speaker embedding with a fixed dimension. We concatenate the speaker embeddings from the different speakers, sum them with an absolute positional embedding, and feed them through cross-attention layers to the backbone. The speaker encoder has the same architecture as the encoder of the Mimi codec, and is initialized with its weights. We keep the weights of the convolutional layers frozen for stability, but let its Transformer layers be fine-tuned in an end-to-end fashion with the language model conditioned on it.

About dialog generation

Regarding the question about dialogs, we want to clarify:

  1. How we allow the model to handle more than one speaker at a time, with precise control over what is said by a given speaker.
  2. How we generate synthetic dialog data only for the purpose of evaluation.

(1) Our internal dataset is made of real-world audio with any number of participants and no diarization annotations, which we estimate with Pyannote (Bredin, 2023). A training sample is 2 min long, taken from a longer audio file. We extract speaker embeddings for up to 5 speakers (in order of first appearance) using utterances that fall outside of the given 2 min window. We use a special token in the text stream to indicate the start of a turn of the first speaker, denoted MAIN, and another one, OTHER, to indicate the start of a turn of any other speaker. We only use 2 tokens because at inference time we are only interested in generating dialogs. At inference time, we pass either one (monologue) or two (dialog) speaker embeddings and use the MAIN and OTHER special tokens to trigger a change of speaker.
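As an illustration of the two-token scheme described above, here is a hypothetical sketch of how turn changes could be marked in a frame-aligned text stream; the token names follow the MAIN/OTHER convention, while the data layout and frame rate are assumptions for illustration only.

```python
MAIN, OTHER, PAD = "<main>", "<other>", "<pad>"
FRAME_RATE = 12.5  # assumed frames per second

def build_dialog_text_stream(turns, total_frames):
    """`turns`: list of (speaker_id, turn_start_sec, [(word, word_start_sec), ...]).

    The first speaker to appear is tagged MAIN and every other speaker OTHER,
    mirroring the two special tokens used at inference time for dialogs.
    """
    stream = [PAD] * total_frames
    main_speaker = turns[0][0] if turns else None
    for speaker, turn_start, words in turns:
        tag = MAIN if speaker == main_speaker else OTHER
        idx = int(turn_start * FRAME_RATE)
        if idx < total_frames:
            stream[idx] = tag
        for word, start in words:
            widx = int(start * FRAME_RATE)
            if widx < total_frames and stream[widx] == PAD:
                stream[widx] = word
    return stream
```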

(2) The pipeline for dialog generation is described in Appendix B, and the evaluation dataset will be open-sourced. Note that this pipeline is not used for training. We thank the reviewer for the relevant references regarding dialog generation, which we added to the manuscript. Unlike [2, 3], the main goal of our work is not to generate the most realistic dialogs, but to provide diverse evaluation data points for our TTS system (e.g. covering daily life, technical topics, and conversations with many spoken numbers). The dialog TTS from [1] is relevant to our work, but we couldn't compare to it (no public implementation or results on a public benchmark). We note that their method is not streaming, as the acoustic model is based on diffusion, and it is limited to 8kHz audio. Their evaluation of statistics over changes of turn would be of interest for future work. We added this discussion to the manuscript.

About the real time factor

We take the convention that the real-time factor is RTF = generated_duration / computation_duration, i.e. higher is better. We understand that both conventions exist in the literature, but we take care to explain the convention we use, e.g. in Table 6 ("RTF is higher than 1 if the model can produce audio in real time"). In particular, for a batch size of 1, the throughput and RTF are the same. For F5-TTS, we recomputed the RTF on an H100 for fairness using the code provided by the authors. The authors of F5-TTS report an RTF of 6.7 with our convention, with the difference being explained by the better performance of an H100.
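A two-line illustration of this convention, with made-up numbers for the example:

```python
def rtf(generated_audio_seconds, computation_seconds):
    # Higher is better: RTF > 1 means audio is produced faster than real time.
    return generated_audio_seconds / computation_seconds

print(rtf(10.0, 0.68))  # ~14.7, i.e. 10 s of audio generated in 0.68 s of compute
```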

We want to highlight one more time that all the proposed models are faster than real time even with large batch sizes.

About the throughput

The high throughput comes from our model being easily batchable and streaming: the ASR is a decoder-only Transformer, so that only minimal work is done when a new frame of audio is available. Unlike other methods such as StreamingWhisper that repeatedly process the whole context with a small shift, only a single one-timestep forward pass every 80ms is performed in the backbone. The use of attention with a finite context allows it to operate with constant memory even for long input audio. For the TTS, there are two extra components, the speaker embedding and the Transformer over the Q-dimension for predicting the audio tokens. When a new TTS request comes in, we recompute the cross-attention speaker embeddings in less than 1ms on an H100. The Transformer over the Q-dimension has no state across timesteps, taking only the current latent output of the main Transformer, and is thus also easily batched. The difference in throughput between the ASR and TTS is mostly explained by the extra computation required for predicting the audio tokens. Note that we will open-source both the weights and the inference code, allowing these numbers to be reproduced under real-world conditions.
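A minimal sketch of the kind of per-frame decoding loop described above: one single-timestep forward per 80 ms frame, with a bounded rolling context so memory stays constant for arbitrarily long audio. The interfaces and context size are placeholders, not the released inference code.

```python
from collections import deque

class StreamingASRLoop:
    """Toy decoder-only streaming loop with a finite attention span."""

    def __init__(self, backbone_step, max_context_frames=375):  # 375 frames ~ 30 s
        self.backbone_step = backbone_step       # callable: (frame, context) -> token
        self.context = deque(maxlen=max_context_frames)  # rolling cache, constant memory

    def push_frame(self, audio_frame):
        """Called once per incoming 80 ms frame; returns the emitted text token."""
        token = self.backbone_step(audio_frame, list(self.context))
        self.context.append(audio_frame)
        return token

# Usage with a dummy backbone step standing in for the single-timestep forward.
loop = StreamingASRLoop(backbone_step=lambda frame, ctx: f"token_{len(ctx)}")
for frame in range(5):
    loop.push_frame(frame)
```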

Delay and time to first token for the TTS

The audio stream is shifted two seconds into the future, i.e. 25 time steps. However, during the first 25 steps, no audio is needed, as this part corresponds to special padding tokens at train time. This means that only the main backbone Transformer and the linear layer providing the action stream need to be computed. On an H100, this takes around 5ms (for a batch size of 1). This means that those 25 steps can be processed in less than 125ms, with the only condition that enough text to cover those 2 seconds is provided by the user, e.g. a few words. Note that all existing models require the user to provide the entire text at once instead of a few words. The rest of the time is spent generating the audio tokens through the full RQ-Transformer and decoding them. When batching out-of-sync requests, not all users might be in the initial stage at the same time. This explains the degradation of the latency in that case, as we need to interleave backbone-only steps with full RQ-Transformer steps to allow users in the initial phase to catch up, while generating enough audio for the others. This is detailed in the Appendix, Section G.
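The arithmetic above can be summarized in a few lines; the 5 ms backbone-only step time is the figure quoted for an H100 at batch size 1, and is an assumption of this sketch rather than a guaranteed bound.

```python
FRAME_RATE = 12.5            # model steps per second (one step every 80 ms)
DELAY_SECONDS = 2.0          # audio stream shifted 2 s into the future
BACKBONE_ONLY_STEP_MS = 5.0  # backbone + action-stream head only, no audio tokens yet

delay_steps = int(DELAY_SECONDS * FRAME_RATE)                 # 25 steps
time_to_first_audio_ms = delay_steps * BACKBONE_ONLY_STEP_MS  # 125 ms upper bound
# After these initial steps, full RQ-Transformer steps generate and decode audio tokens.
```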

Achieving better performance than Whisper

The improvements over Whisper come from two sources:

  1. After the pretraining stage, DSM-ASR achieves a WER of 6.4%. Here, Whisper-Medium (WER 8.1% on the official OpenASR leaderboard) plays the role of a teacher in a pseudo-labelling scenario. We hypothesize that the observed improvement over the teacher comes from (a) smoothing across a larger and more diverse set of real-world audio, implicitly leading to domain adaptation, and (b) low-temperature sampling that eliminates non-systematic errors of the teacher model. Note that such improvements through teacher/student distillation have been observed before, e.g. on ImageNet classification (Yalniz et al. 2019). We also perform augmentations such as codebook dropout.
  2. At the fine-tuning stage, DSM-ASR gets a WER of 6.3%. We train on ground-truth transcripts that come with the standard ASR datasets (see Appendix A.1). At this stage, Whisper is only used to derive the timestamps.

Reference:

Billion-scale semi-supervised learning for image classification, Yalniz et al. 2019.

Improved results for DSM-TTS

We improved the results for the DSM-TTS model in several ways:

  1. We updated the Transformer over the Q dimension to use partial weight sharing, following Labiausse et al. (2025). This reduces the size of the DSM-TTS model from 3.7B parameters to 1.8B parameters.
  2. We reduced the delay between the text and audio from 2 seconds to 1.28 seconds (or 16 steps), reducing the latency with a batch size of 1 from 185 ms to 150 ms, and for a batch size of 64, from 708ms to 403ms.
  3. Training for 750k updates instead of 250k updates.
  4. Previous work uses a larger loss weight on the semantic tokens than on the acoustic ones (e.g. Defossez et al. (2024)). We observed this was detrimental when training a TTS model and reduced the weight from 100 to 10.

Those changes improved accuracy while reducing its size and latency. We report updated results hereafter (including a speaker similarity metric following F5-TTS methodology). We will update subjective evaluations in the camera ready.

| Model | # Params. | WER English | WER French | Spk. Sim. English | Spk. Sim. French |
|---|---|---|---|---|---|
| DSM-TTS submission | 3.9B | 3.6% | 6.4% | 0.70 | 0.70 |
| DSM-TTS rebuttal (250k updates) | 1.8B | 2.0% | 3.2% | 0.72 | 0.73 |
| DSM-TTS rebuttal (750k updates) | 1.8B | 1.6% | 3.0% | 0.74 | 0.75 |

Improved results for DSM-TTS trained on public datasets

We applied point (4) from the previous section, to the model introduced in the Appendix, Section H. We also trained a 900M parameter model with Q=32 codebook levels. We report the results hereafter.

| Model | # Params. | WER (LibriSpeech) | Speaker Sim. (LibriSpeech) |
|---|---|---|---|
| F5-TTS | 336M | 2.42% | 0.66 |
| DSM-TTS Q=16 (at submission) | 750M | 2.12% | 0.56 |
| DSM-TTS Q=16 (at rebuttal) | 750M | 1.95% | 0.67 |
| DSM-TTS Q=32 (at rebuttal) | 900M | 1.68% | 0.71 |

Finally, following Du et al. (2025), e.g. CosyVoice 3, we added evaluations on the SEED test-en dataset. Compared with F5-TTS (non-streaming), we achieve better WER but worse speaker similarity. Our 750M model gets close to the performance of CosyVoice 3-1.5B RL, while being half its size and not requiring a reinforcement-learning-based fine-tuning stage.

| Model | # Params. | WER (Seed-EN) | Speaker Sim. (Seed-EN) |
|---|---|---|---|
| F5-TTS | 336M | 2.25% | 0.76 |
| CosyVoice 3-1.5B RL | 1.5B | 1.45% | 0.70 |
| DSM-TTS Q=16 | 750M | 1.58% | 0.70 |
| DSM-TTS Q=32 | 900M | 1.71% | 0.73 |

References:

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models, Anastassiou et al. 2024.

CosyVoice 3: Towards in-the-wild speech generation via scaling-up and post-training, Du et al. 2025.

Comment

Thank you for your reply. However, I still have a few questions that I hope you could kindly clarify for me.

Firstly, how did the authors fine-tune the Mimi codec to achieve the speaker encoder function? I am curious about the specific techniques or methodologies they employed. Additionally, why did they opt for using speaker embeddings instead of other potential features? I am concerned that this choice might lead to a decrease in speaker similarity.

I also noticed that the authors mentioned that the model supports up to five people. Could you explain how this is achieved? Specifically, I am interested in understanding the implementation details beyond just the main and other speakers.

Furthermore, I am intrigued by the use of dialogue only in the data generation and testing phases. Could you shed some light on why this particular approach was chosen? What are the potential applications for the generated dialogues?

Lastly, I would like to understand more about the Real-Time Factor (RTF) and delay to first token issues. While it is true that many current models, including TTS and ASR, have already surpassed real-time capabilities, I am interested in understanding how the streaming model's advantages can be demonstrated.

Comment

Thank you so much for your interest. Please find our answers below.

Firstly, how did the authors fine-tune the Mimi codec to achieve the speaker encoder function? I am curious about the specific techniques or methodologies they employed. Additionally, why did they opt for using speaker embeddings instead of other potential features? I am concerned that this choice might lead to a decrease in speaker similarity.

We chose to rely on the Mimi codec because we know from its reconstruction metrics that its latent space preserves speaker information, while operating at a low framerate. This makes these embeddings a good representation for the generative model to use through cross-attention. Finetuning this encoder when training the TTS furthermore improves the performance as it allows the encoder of Mimi to focus solely on speaker-relevant features and provide a more informative conditioning to the decoder.

I also noticed that the authors mentioned that the model supports up to five people. Could you explain how this is achieved? Specifically, I am interested in understanding the implementation details beyond just the main and other speakers.

The number of possible speakers is a hyperparameter that is upper-bounded by the maximum number of speakers per sequence in the training data and the reliability of diarization on such data. Given a training window, sampled from a longer file, Pyannote provides segments labeled by who is speaking. For each speaker, we extract a 10s sample outside of the window but inside the same file where only this speaker is speaking. The 5 representative samples are passed independently through the encoder, and the resulting embeddings are concatenated along the time axis for the main architecture to attend to. When fewer than 5 speakers are present in a sequence, the missing speaker embeddings are replaced by a learnable padding embedding.
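A hedged sketch of the fixed five-slot conditioning described above; the embedding width, tensor shapes, and encoder are placeholders for illustration, not the actual implementation.

```python
import torch

MAX_SPEAKERS = 5
EMBED_DIM = 512  # assumed embedding width

# Learnable padding embedding used for missing speaker slots.
pad_embedding = torch.nn.Parameter(torch.zeros(1, EMBED_DIM))

def build_speaker_conditioning(speaker_embeddings):
    """`speaker_embeddings`: list of (T_i, EMBED_DIM) tensors, one per detected speaker.

    Slots beyond the detected speakers are filled with the learnable padding
    embedding; the result is concatenated along time for cross-attention.
    """
    slots = []
    for i in range(MAX_SPEAKERS):
        if i < len(speaker_embeddings):
            slots.append(speaker_embeddings[i])
        else:
            slots.append(pad_embedding)
    return torch.cat(slots, dim=0)

# Example: two detected speakers, each represented by a 12-frame embedding.
cond = build_speaker_conditioning([torch.randn(12, EMBED_DIM), torch.randn(12, EMBED_DIM)])
```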

Furthermore, I am intrigued by the use of dialogue only in the data generation and testing phases. Could you shed some light on why this particular approach was chosen? What are the potential applications for the generated dialogues?

We are interested in dialogue as it challenges the common assumption of most TTS systems that a sequence is a monologue (only one speaker speaks). Thus, when generating dialogues (e.g. a podcast), typical systems concatenate single-speaker sequences generated independently. We hypothesize that this reduces the naturalness of generated dialogues, where each turn depends on the previous one, and that a “dialogue-aware” TTS like DSM-TTS can generate more realistic dynamics between turns.

Lastly, I would like to understand more about the Real-Time Factor (RTF) and delay to first token issues. While it is true that many current models, including TTS and ASR, have already surpassed real-time capabilities, I am interested in understanding how the streaming model's advantages can be demonstrated.

We identify several key factors for real-time inference: 1) can the model stream its output (e.g., we can listen to the TTS output as it is generated), 2) can the model stream its input (e.g., we can listen to the TTS output while the input text is being ingested), 3) does the streaming inference support batching. Autoregressive models typically provide 1) but not 2): passing a pre-generated long text as input results in almost instant audio playback, but when the text is generated on the fly, the model needs to wait for the end of text generation before starting audio generation. This significantly increases the latency of conversational systems such as the one shown on our sample webpage. Finally, supporting batching is a key factor for scaling streaming inference, as it allows processing many concurrent requests in real time.
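
As a toy illustration of points 1) and 2), the loop below consumes text tokens as they arrive (e.g. from an LLM being decoded) and starts emitting audio frames after a fixed delay, instead of waiting for the full text. `model.step` is a hypothetical one-frame-at-a-time interface, and the loop deliberately ignores the action stream and padding logic of the real system.

```python
from collections import deque

def stream_tts(text_tokens, model, delay_steps: int = 16):
    """Yield audio frames while the text stream is still being produced (toy sketch)."""
    pending = deque()
    for step, token in enumerate(text_tokens):
        pending.append(token)
        if step >= delay_steps:                  # after the configured delay...
            yield model.step(pending.popleft())  # ...each new text token yields one audio frame
    while pending:                               # flush the delayed tail once the text ends
        yield model.step(pending.popleft())
```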

We hope our clarifications help!

Comment

Thank you for the authors' thoughtful reply. However, it seems that there may have been a slight misunderstanding regarding my initial concerns. I still have several questions and concerns that I hope we can address further.

In the paper, the authors mention the strategies and results related to dialogue generation, which is quite interesting. However, I believe that some specific details, such as the training loss and strategies for Mimi, as well as more details of DSM-TTS, could be explained more clearly. These details are essential for a comprehensive understanding of the research, and I feel they are currently missing. Additionally, I am still somewhat unclear about the statement “this pipeline is not used for training.” If this pipeline is not used for training, I am curious about the rationale behind including it in the paper; the Appendix B that the authors mentioned is purely for evaluation. Could the authors please provide some clarification on this point?

Moreover, through the discussions with reviewer B8AT, I have also noticed that the paper could benefit from additional references and more clearly articulated innovative points. I believe that addressing these aspects would significantly enhance the overall quality of the work.

Given these considerations, I will maintain my current score, but I am hopeful that the author will find my feedback constructive and will be able to address these concerns in future revisions.

Comment

Dear authors,

I encourage you to answer the remaining questions of the reviewer during the extended discussion period.

Thank you!

Review
4

This paper introduces Delayed Streams Modeling (DSM), a framework for streaming speech-text sequence-to-sequence learning. DSM builds on an autoregressive backbone to process streaming input and generate output sequences, both operating at a constant frame rate to enable efficient batching. The output sequence is delayed relative to the input, allowing access to additional context. On ASR, DSM achieves a word error rate (WER) close to the best offline model on the Open ASR Leaderboard with just a 2.5-second delay. On TTS, it outperforms baseline non-commercial models in synthesis quality.

Strengths and Weaknesses

Strengths

  1. DSM achieves a WER within 0.2% of the best offline model on the Open ASR Leaderboard, while maintaining a 2.5-second delay in a streaming ASR setup.
  2. The architecture naturally supports efficient batching during streaming inference, as demonstrated in Table 4 and Figure 5.

Weaknesses

  1. Table 2 omits models like Parakeet-TDT-v2, which also support long-form ASR over several hours, limiting the completeness of comparison.
  2. The paper lacks a clear differentiation from prior architectures such as Moshi and Hibiki, which also use streaming input and delayed output. A more thorough discussion of DSM’s unique contributions is needed.
  3. The paper does not analyze how computational latency scales with batch size in the ASR setting.
  4. DSM has not been evaluated on streaming speech translation, a more challenging task involving word reordering and more complex reasoning.

Questions

  1. Have you considered initializing the model with a pretrained LLM backbone instead of training from scratch?
  2. In Equation (6), the second equation should use $\tilde{Y}_t$ instead of $\tilde{Y}_1$.

Limitations

yes

Final Justification

I'll keep my positive score of 4.

Reasons

  1. The authors added the baseline Parakeet model and the RTF under different batch sizes that I asked for.
  2. The empirical performance of the model is amazing, reaching offline-level accuracy with a streaming model.
  3. However, I also agree with the concerns raised by reviewer B8AT that the novelty is weak and the citations are problematic, so I won't raise my score.

Formatting Issues

None.

Author Response

We thank the reviewer for their suggestions to improve the paper, corrections and comments. We reply to them in the following sections.

Discussion about the related work and contribution

As discussed in the related work of our submission, our method extends the approach used by Moshi and Hibiki to two fundamental speech tasks: ASR and TTS. The contribution of our work is to show that these methods can be competitive with, and even outperform, existing methods while scaling better: our ASR model generalizes to audio up to 2 hours long with no need for cutting and merging over chunks, and our TTS is, as far as we know, the only model able to synthesize audio while streaming over the text input. To the best of our knowledge, this line of methods has never been shown to work so competitively on these two tasks. Besides, we contribute a number of task-specific improvements. Given that similar methods were already successfully applied to speech-to-speech tasks such as conversational models or simultaneous tasks, we did not cover those in the present paper.

About Parakeet

Thanks for bringing our attention to Parakeet-TDT-v2. The reason we did not include this model initially is that its Hugging Face page mentions that it can handle up to 24 minutes of audio [*]. In contrast, the datasets we consider in the long-form evaluation have segments up to 2 hours long. Following your suggestion, we are working to include it in Table 2.

[*] “enabling efficient transcription of audio segments up to 24 minutes in a single pass” (https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2)

Scaling of the ASR with the batch size

In the following table, you will find the real-time factor (RTF) and throughput across various batch sizes for the 3B model, on a single Nvidia H100 GPU.

| Batch size | RTF | Throughput |
|---|---|---|
| 1 | 6.9 | 6.9 |
| 32 | 4.4 | 141.4 |
| 64 | 3.5 | 224.0 |
| 256 | 1.49 | 380.1 |
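
Assuming the throughput column counts audio-seconds transcribed per wall-clock second aggregated over the whole batch (our reading, not stated above), the two columns are consistent with

$$\text{throughput} \approx \text{batch size} \times \text{RTF},$$

e.g. $256 \times 1.49 \approx 381$, close to the reported 380.1.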

Initializing with a pre-trained model

During preliminary experiments, we tested a warm start from a pretrained text model, or from a pretrained unsupervised text-audio model (similar to the pre-training stage of Moshi, Defossez et al. (2024)), but after sufficient training we observed no benefit for the tasks of ASR and TTS, unlike what has been observed for speech-to-speech models.

Improved results for DSM-TTS

We improved the results for the DSM-TTS model in several ways:

  1. We updated the Transformer over the Q dimension to use partial weight sharing to make it more compact, following Labiausse et al. (2025). This reduces the size of the DSM-TTS model from 3.7B parameters to 1.8B parameters.
  2. We reduced the delay between the text and audio from 2 seconds to 1.28 seconds (or 16 steps), reducing the latency with a batch size of 1 from 185 ms to 150 ms, and with a batch size of 64 from 708 ms to 403 ms.
  3. We trained for 750k updates instead of 250k updates.
  4. Previous work uses a larger loss weight for the cross-entropy over the semantic tokens than over the acoustic ones (e.g. Defossez et al. (2024)). While this was shown to improve unconditional generation, we observed it was detrimental when training a TTS model. We reduced the semantic token cross-entropy weight from 100 to 10 (see the sketch after this list).
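
For concreteness, a possible form of the per-codebook loss weighting from point 4 is sketched below; the tensor layout and the assumption that level 0 is the semantic codebook are our own, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def codebook_weighted_ce(logits: torch.Tensor, targets: torch.Tensor,
                         semantic_weight: float = 10.0) -> torch.Tensor:
    """Cross-entropy over Q codebook streams with an up-weighted semantic level.

    logits:  [B, Q, T, V] per-level token logits
    targets: [B, Q, T]    per-level target token ids
    Level 0 is assumed to be the semantic codebook; the rest are acoustic.
    """
    B, Q, T, V = logits.shape
    weights = torch.ones(Q, device=logits.device)
    weights[0] = semantic_weight  # 100 in prior work, reduced to 10 here
    per_level = F.cross_entropy(
        logits.reshape(B * Q * T, V), targets.reshape(B * Q * T), reduction="none"
    ).reshape(B, Q, T)
    return (per_level * weights[None, :, None]).mean()
```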

The combination of these changes allowed us to improve our model's performance while reducing its size and latency. We report in the following table the previous and new results (including a speaker similarity metric, computed in the same way as in the Appendix, Section H, i.e., following F5-TTS). We will include the updated subjective evaluations in the camera ready.

| Model | # Params. | WER Eng. | WER Fra. | Speaker Sim. Eng. | Speaker Sim. Fra. |
|---|---|---|---|---|---|
| DSM-TTS submission | 3.9B | 3.6% | 6.4% | 0.70 | 0.70 |
| DSM-TTS updated (250k updates) | 1.8B | 2.0% | 3.2% | 0.72 | 0.73 |
| DSM-TTS updated (750k updates) | 1.8B | 1.6% | 3.0% | 0.74 | 0.75 |

Improved results for DSM-TTS trained on public datasets

We applied point (4) from the previous section, i.e., reducing the cross-entropy weight on the semantic tokens, to the model introduced in the Appendix, Section H. We also trained a 900M-parameter model with Q=32 codebook levels. We report the results hereafter.

| Model | # Params. | WER (LibriSpeech) | Speaker Sim. (LibriSpeech) |
|---|---|---|---|
| F5-TTS | 336M | 2.42% | 0.66 |
| DSM-TTS Q=16 (at submission) | 750M | 2.12% | 0.56 |
| DSM-TTS Q=16 (at rebuttal) | 750M | 1.95% | 0.67 |
| DSM-TTS Q=32 (at rebuttal) | 900M | 1.68% | 0.71 |

Finally, following Du et al. (2025), i.e., CosyVoice 3, we further added evaluations on the SEED test-en dataset. Compared with F5-TTS (non-streaming), we achieve a better WER but worse speaker similarity on this dataset. Our 750M model comes close to the performance of CosyVoice 3-1.5B RL, while being half its size and not requiring a reinforcement-learning-based fine-tuning stage.

| Model | # Params. | WER (Seed-EN) | Speaker Sim. (Seed-EN) |
|---|---|---|---|
| F5-TTS | 336M | 2.25% | 0.76 |
| CosyVoice 3-1.5B RL | 1.5B | 1.45% | 0.70 |
| DSM-TTS Q=16 | 750M | 1.58% | 0.70 |
| DSM-TTS Q=32 | 900M | 1.71% | 0.73 |

References:

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models, Anastassiou et al. 2024.

CosyVoice 3: Towards in-the-wild speech generation via scaling-up and post-training, Du et al. 2025.

Comment

Thank you for the response.

parakeet-tdt-v2

They do have a script for long-form ASR up to 3 hours here: https://huggingface.co/spaces/nvidia/parakeet-tdt-0.6b-v2

while our TTS is as far as we know the only model able to synthesize audio while being streaming over the text input.

CosyVoice does support streaming text input, though the latency is a bit large.

Comment

Thanks a lot for pointing this page out! We re-ran the evaluations for Parakeet using the script linked from that page, and below is the updated version of Table 2 (we will include these changes in the paper).

We see that Parakeet-tdt-0.6b-v2 leads on the WER scores (mean of 7.1), outperforming all models in the evaluation. However, we want to emphasize that, unlike DSM-ASR, Parakeet is not a streaming model.

| Model | Avg. | TED-LIUM | Meanwhile | Rev16 | Earnings21 |
|---|---|---|---|---|---|
| *Non-streaming* | | | | | |
| DISTIL-LARGE-V2 | 8.7 | 3.7 | 7.8 | 12.2 | 11.2 |
| WHISPER-LARGE-V2 | 9.0 | 4.4 | 6.3 | 13.6 | 11.8 |
| Parakeet 0.6 | 7.1 | 3.0 | 4.9 | 10.6 | 10.2 |
| *Streaming* | | | | | |
| WHISPER MEDIUM.EN | 9.0 | 3.9 | 6.7 | 13.0 | 12.5 |
| WHISPER LARGE-V3 | 8.1 | 3.4 | 6.1 | 11.4 | 11.4 |
| DSM-ASR | 7.9 | 2.9 | 5.7 | 12.3 | 10.6 |
Comment

Thank you for the response. I don't have other questions.

Review
4

This paper introduces Delayed Streams Modeling (DSM), a flexible framework for streaming multimodal sequence-to-sequence learning. DSM employs a decoder-only language model to process time-aligned token streams across modalities, with controlled delays between streams to balance quality and latency. It unifies tasks like automatic speech recognition (ASR) and text-to-speech (TTS) under a single architecture, enabling bidirectional generation.

Extensive experiments show DSM achieves state-of-the-art performance on both ASR and TTS, competing with offline baselines while supporting streaming inference, long-form sequences, and real-time latency control via delay conditioning. The authors provide thorough implementation details, experimental results across diverse datasets, and plans to release code, models, and evaluation data.

Strengths and Weaknesses

Strengths

  1. Novel Streaming Formulation: DSM addresses a critical gap in streaming seq2seq learning by introducing controllable delays between aligned streams, enabling real-time inference without sacrificing performance. This contrasts with existing streaming models that often trade off batching, symmetry, or quality.
  2. Unified Architecture: By leveraging discrete token representations for audio (via neural codecs) and text, DSM unifies ASR and TTS into a single decoder-only framework, supporting bidirectional generation—a significant advance over modality-specific architectures (e.g., Tacotron for TTS vs. LAS for ASR).
  3. Strong Empirical Performance: Across short- and long-form ASR tasks, DSM outperforms streaming baselines (e.g., Whisper-Streaming) and matches top offline models (e.g., Parakeet-TDT-V2) with an average WER of 6.3%. For TTS, it achieves the lowest WER among open-source models, with competitive speaker similarity to commercial systems (e.g., ElevenLabs).
  4. Rigorous Experimentation: The authors evaluate across diverse datasets (e.g., LibriSpeech, TED-LIUM, custom long-form dialogs) and provide comprehensive metrics (WER, latency, subjective quality), enhancing result credibility.

Weaknesses

  1. Similar methods: For Delayed Streams Modeling (DSM), it seems that there have been similar explorations in sequence modeling before [1][2], which to some extent limits the originality of this article.
  2. Model Size: DSM-ASR uses a 3B-parameter backbone, and DSM-TTS is 3.7B parameters, which may restrict deployment on resource-constrained devices—an important consideration for streaming applications.

[1] Gao, Heting, et al. "LUCY: Linguistic Understanding and Control Yielding Early Stage of Her." arXiv preprint arXiv:2501.16327 (2025).

[2] Ding, Ding, et al. "Kimi-audio technical report." arXiv preprint arXiv:2504.18425 (2025).

Questions

  1. The TTS architecture uses an action stream and a lookahead text stream. Were there ablation studies comparing these components to simpler alternatives (e.g., fixed pauses)? How do they impact generation fluency?
  2. The author should conduct a more comprehensive investigation of existing work to demonstrate the differences and advantages of the method proposed in this article compared to existing methods.
  3. The paper mentions using Mimi codec for audio tokenization. How sensitive is DSM’s performance to the choice of codec (e.g., compared to SoundStream)?
  4. Have the authors conducted ablation experiments on the model size to verify its usability with a small number of parameters?

Limitations

yes

Final Justification

Although similar methods have been explored by other researchers in [1] and [2], the authors' exploration is more detailed and convincing, with sufficient experiments.

Formatting Issues

N/A

Author Response

We thank the reviewer for their time, comments and suggestions for improving our submission. We reply to the different points in the following sections.

Contributions

We thank the reviewer for the reference to relevant work. Note that [2] was published at the end of April 2025 and would thus be considered concurrent work under the NeurIPS guidelines. Regarding [1], it focuses solely on speech-to-speech conversational models. While we agree that the modeling of parallel streams has been used successfully, even before [1,2], for speech-to-speech (such as Moshi (Defossez et al., 2024) or Hibiki (Labiausse et al., 2025)) or audio generative modeling (Copet et al. (2022), cited in [2]), this class of methods has so far never been used for two fundamental tasks in speech, namely TTS and ASR, with state-of-the-art accuracy and in a fully streaming and batchable fashion. In particular, no existing TTS method is streamable with respect to the text, and we show that our ASR is able to generalize to audio of many hours out of the box. We will update the paper with this discussion and citations to the relevant works.

Model sizes

ASR

We agree that the 3B ASR model as-is might not be suitable for edge applications. To address this concern, we have also pretrained and fine-tuned, using the data described in the paper, a model with a 300M-parameter backbone. When evaluated on short-form ASR, we get the following WER metrics (to be inserted in Table 1):

  • AMI: 16.36%
  • Earnings22: 13.87%
  • GigaSpeech: 11.14%
  • LibriSpeech clean: 2.17%
  • LibriSpeech other: 6.75%
  • SPGISpeech: 2.71%
  • TED-LIUM: 4.24%
  • VoxPopuli: 8.36%

The across-dataset average is 8.2%, which is comparable with Whisper-medium (8.1%, 765M parameters). Note that for a fair comparison with Whisper-medium, our parameter count should also include the encoder from Mimi, which is 48M parameters, i.e., a total of 348M parameters.

TTS

See the section "Improved results for DSM-TTS" below, where we describe how we decreased the model size from 3.9B to 1.8B parameters while improving the quality and accuracy of the model.

Ablations on lookahead stream and action stream vs. fixed padding

We provide hereafter aggregated results for speaker similarity and WER, evaluated as in Table 3, when using a model with no lookahead, or when using a fixed padding, i.e., ignoring the output of the action stream. For the fixed padding, we either force a fixed amount of padding after the start of the word, or a fixed amount of padding after its last text token. We notice that not using the lookahead has limited impact on the speaker similarity, but deteriorates the WER. On the other hand, using a fixed padding pattern has a clear impact on the speaker similarity, likely due to the fixed prosody. While the current NeurIPS rules prevent us from sharing new audio samples, the resulting audio sounds off, with unnatural spacing between words.

| Model | WER English | WER French | Spk. Sim. English | Spk. Sim. French |
|---|---|---|---|---|
| DSM-TTS baseline | 1.60% | 3.02% | 0.743 | 0.745 |
| No lookahead | 3.51% | 3.25% | 0.743 | 0.746 |
| Fixed padding, 4 from start | 2.86% | 3.60% | 0.694 | 0.690 |
| Fixed padding, 5 from start | 2.69% | 3.48% | 0.691 | 0.700 |
| Fixed padding, 2 from end | 2.32% | 3.13% | 0.715 | 0.698 |

Ablation on the codec

We thank the reviewer for the idea of comparing our approach with different codecs. While SoundStream is not open source, we can instead use Encodec (Defossez et al. 2022). Encodec operates at 75 Hz. In order to keep a roughly similar number of tokens generated per second of audio, we decided to use 8 codebook levels, i.e., 600 tokens per second, vs. 400 when modeling Q=32 codebook levels with Mimi. We kept slightly more tokens per second for Encodec because (1) its token cardinality is 1024 instead of 2048, and (2) being older, it benefited less from recent advances in architectures and losses, so that using fewer than 600 tokens per second would lead to limited quality.
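
The token-rate arithmetic behind this choice is simply the frame rate times the number of codebook levels, as in this small helper (our own illustration, not code from the paper):

```python
def tokens_per_second(frame_rate_hz: float, codebook_levels: int) -> float:
    """Tokens generated per second of audio for an RVQ-style codec."""
    return frame_rate_hz * codebook_levels

assert tokens_per_second(12.5, 32) == 400.0  # Mimi with Q=32 levels
assert tokens_per_second(75.0, 8) == 600.0   # Encodec with 8 levels
```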

The model was trained following the same procedure as described in the Appendix, Section H, i.e., on a mix of publicly available datasets. We report results on the LibriSpeech clean test set, following the methodology introduced by F5-TTS (Chen et al. 2024). We observe that while the results are worse than with Mimi, most likely due to the lower acoustic quality of Encodec as well as the absence of a semantic token, they are still on par with state-of-the-art baselines on this dataset such as F5-TTS, showing the generality of our method across codecs.

| Model | RVQ levels | Frame rate | Tokens per sec. | Bandwidth | WER (LibriSpeech) | Speaker Sim. (LibriSpeech) |
|---|---|---|---|---|---|---|
| F5-TTS | - | - | - | - | 2.42% | 0.66 |
| DSM-TTS with Mimi | 32 | 12.5 Hz | 400 | 4.4 kbps | 1.68% | 0.71 |
| DSM-TTS with Encodec | 8 | 75 Hz | 600 | 6 kbps | 2.45% | 0.68 |

References:

High Fidelity Neural Audio Compression, Defossez et al. 2022.

Improved results for DSM-TTS

We improved the results for the DSM-TTS model in several ways:

  1. We updated the Transformer over the Q dimension to use partial weight sharing to make it more compact, following Labiausse et al. (2025). This reduces the size of the DSM-TTS model from 3.7B parameters to 1.8B parameters.
  2. We reduced the delay between the text and audio from 2 seconds to 1.28 seconds (or 16 steps), reducing the latency with a batch size of 1 from 185 ms to 150 ms, and with a batch size of 64 from 708 ms to 403 ms.
  3. We trained for 750k updates instead of 250k updates.
  4. Previous work uses a larger loss weight for the cross-entropy over the semantic tokens than over the acoustic ones (e.g. Defossez et al. (2024)). While this was shown to improve unconditional generation, we observed it was detrimental when training a TTS model. We reduced the semantic token cross-entropy weight from 100 to 10.

The combination of these changes allowed us to improve our model's performance while reducing its size and latency. We report in the following table the previous and new results (including a speaker similarity metric, computed in the same way as in the Appendix, Section H, i.e., following F5-TTS). We will include the updated subjective evaluations in the camera ready.

| Model | # Params. | WER English | WER French | Spk. Sim. English | Spk. Sim. French |
|---|---|---|---|---|---|
| DSM-TTS submission | 3.9B | 3.6% | 6.4% | 0.70 | 0.70 |
| DSM-TTS updated (250k updates) | 1.8B | 2.0% | 3.2% | 0.72 | 0.73 |
| DSM-TTS updated (750k updates) | 1.8B | 1.6% | 3.0% | 0.74 | 0.75 |

Improved results for DSM-TTS trained on public datasets

We applied point (4) from the previous section, i.e., reducing the cross-entropy weight on the semantic tokens, to the model introduced in the Appendix, Section H. We also trained a 900M-parameter model with Q=32 codebook levels. We report the results hereafter.

| Model | # Params. | WER (LibriSpeech) | Speaker Sim. (LibriSpeech) |
|---|---|---|---|
| F5-TTS | 336M | 2.42% | 0.66 |
| DSM-TTS Q=16 (at submission) | 750M | 2.12% | 0.56 |
| DSM-TTS Q=16 (at rebuttal) | 750M | 1.95% | 0.67 |
| DSM-TTS Q=32 (at rebuttal) | 900M | 1.68% | 0.71 |

Finally, following Du et al. (2025), i.e., CosyVoice 3, we further added evaluations on the SEED test-en dataset. Compared with F5-TTS (non-streaming), we achieve a better WER but worse speaker similarity on this dataset. Our 750M model comes close to the performance of CosyVoice 3-1.5B RL, while being half its size and not requiring a reinforcement-learning-based fine-tuning stage.

| Model | # Params. | WER (Seed-EN) | Speaker Sim. (Seed-EN) |
|---|---|---|---|
| F5-TTS | 336M | 2.25% | 0.76 |
| CosyVoice 3-1.5B RL | 1.5B | 1.45% | 0.70 |
| DSM-TTS Q=16 | 750M | 1.58% | 0.70 |
| DSM-TTS Q=32 | 900M | 1.71% | 0.73 |

References:

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models, Anastassiou et al. 2024.

CosyVoice 3: Towards in-the-wild speech generation via scaling-up and post-training, Du et al. 2025.

Comment

Contributions

Although similar methods have been explored by other researchers in [1] and [2], the authors' exploration is more detailed.

Others

Thank you to the authors for their responses to the other questions.

Final Decision

The paper introduces Delayed Streams Modeling (DSM), a streaming method for decoder-only architectures that extends prior work (Moshi) by adding a controllable inter-stream delay. This mechanism enables flexible, real-time cross-modal processing. DSM is evaluated on automatic speech recognition (ASR) and text-to-speech (TTS), trained on English and French speech automatically transcribed with Whisper, and tested on both short- and long-form benchmarks.

The discussion period was active and constructive. However, I cannot recommend acceptance for several reasons. First, incorporating the numerous reviewer comments—many of which address substantive issues—would require rewriting significant portions of the paper, resulting in a substantially different manuscript that would merit a full re-review. Second, the contribution is largely engineering-oriented, which would be acceptable if it advanced the state of the art through rigorous experimental design. Instead, reviewers identified important flaws in the experimental protocol that would require major revisions. Third, while the paper claims to address “Streaming Sequence-to-Sequence Learning,” the experiments are limited to monotonic tasks. The authors’ rebuttal argues that non-monotonic tasks can be reduced to monotonic ones via alignment during preprocessing, but this relies on an external model for alignment—an engineering workaround rather than a principled solution. This is particularly problematic given the generality implied in the paper’s title.

Overall, I believe a thoroughly revised version addressing all reviewer feedback should be resubmitted. Given the incremental nature of the work and the heavy experimental focus relative to prior studies in this line, a journal submission may be more appropriate.