PaperHub
Overall score: 6.4/10
NeurIPS 2025 Poster, 4 reviewers
Ratings: 4, 4, 4, 4 (mean 4.0; min 4, max 4, std 0.0)
Sub-scores: Novelty 2.8, Quality 2.5, Clarity 2.5, Significance 2.5

CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching

Submitted: 2025-05-03, Updated: 2025-10-29

Keywords: text-to-speech, dialogue generation, flow-matching

Reviews and Discussion

Review (Rating: 4)

The paper introduces CodiFM, a non-autoregressive framework that generates realistic multi-speaker dialogues directly from transcriptions using a flow-matching-based generative model. It employs techniques such as speaker disentanglement, sentence-level alignment, and prompt-level random masking to improve the naturalness of dialogue synthesis. Additionally, it supports controllable speech overlap and precise timing. The performance of CodiFM is compared against strong baselines like MoonCast and Sesame in terms of speech quality, speaker consistency, and inference speed.

Strengths and Weaknesses

Quality

  • The authors reasonably implemented a model and data setup for generating natural-sounding, multi-speaker dialogues and conducted appropriate experiments to support their claims. They also clearly outlined the limitations of their proposed methodology.

Clarity

  • The paper is generally well-structured and clearly articulates its contributions, making most explanations intuitive and accessible. However, several clarifications are needed.
  • Figure 1 contains substantial information; adding detailed explanations for each element directly to the caption would significantly aid comprehension, as currently, its full meaning becomes clear only after reading the entire paper.
  • In the "Related Work" section (line 92), citations for the CSS methodology appear incomplete; references [1, 2] should likely be included.
  • In Figure 2, the meaning of the term "Prompt Candidate" (top-left) requires clarification. Specifically, line 151 mentions available monologue segments—is this exactly what "Prompt Candidate" represents, and does the model select prompts from these candidates?
  • In Section 3.1,
    • It's unclear how the text streams corresponding to the speaker sequences are exactly fed into the model. Clarifying explicitly if each speaker’s utterance undergoes separate text encoding (or embedding) followed by concatenation, and clearly differentiating how this process differs from a single stream approach, would better highlight the strengths of the proposed modeling method. Currently, even considering the description around line 176 ("~ the construction of the input text streams z."), the precise method remains ambiguous.
    • Further, since [Spk1] and [Spk2] tokens appear in both text streams, the structural reason that allows the model to distinguish between individual speakers despite this overlapping information needs clearer articulation.
    • The approach used for identifying sentence-level alignment within overlapping speech segments should be explicitly described.
    • It is unclear how the text streams' lengths are aligned precisely with the corresponding audio durations. Specifically, are the t_start and t_end indices based on audio timestamps, and how exactly are timings synchronized between streams z1 and z2? Clarifying these alignment details is essential to ensure the method's reproducibility.
  • In Section 3.3,
    • Regarding the phrase in line 149, "~ to ensure robust and diverse speaker conditioning," the authors should explicitly define "robustness" and "diversity." Is robustness implying consistent model performance even with unseen speakers, and diversity indicating coverage of a broader range of speaker characteristics?
    • Concerning the context embedding m_{ctx} mentioned in line 153 and visualized in Figure 1, during training, m_{ctx} appears to be derived from partial real dialogue segments, some of which overlap with z. However, during inference, m_{ctx} could come from entirely unrelated speech segments, suggesting potential train-inference mismatch. This could theoretically lead to performance degradation. To mitigate this, have the authors considered strategies such as using randomized segments or shuffling within the selected prompt, potentially eliminating the need for loss masking altogether?
  • In Section 4.1,
    • At line 197, the phrase "a market-leading speech enhancement API" requires a proper citation or explicit naming of the exact API used, as it is currently unspecified.
    • For the "simulated dialogue-style data" described in lines 203 and 207, it is understood that the authors concatenate speech segments from two different speakers (datasets). However, it would help if the authors clarified how they maintain coherent context between these segments. Explicitly addressing how context continuity is ensured or justified in these simulated dialogues is crucial for understanding the reliability and validity of their data augmentation approach.
  • In Section 4.3,
    • Have you compared your approach with Moshi [3] as a baseline? If not, explicitly stating the rationale for excluding Moshi from baseline comparisons would strengthen the evaluation's completeness.
    • Regarding the 15 professional linguistic experts employed for CMOS evaluations, please clarify the hiring or recruitment process, such as whether a platform like Amazon MTurk was used or if other methods were applied.
  • In Section 5.1,
    • At line 275, the statement "containing prompts and prior audio from both speakers" appears applicable to both Sesame and CodiFM based on Figure 2. Clarifying explicitly how the input formulation for CodiFM differs from Sesame would better highlight the methodological advantages of the proposed approach.
    • At line 278, the claimed "stable and reliable output quality" contrasts somewhat with the provided supplementary samples, where CodiFM outputs seem slightly less expressive and more flat compared to the baselines. It would be beneficial if the authors discussed this aspect clearly.
    • Based on personal observation, individual utterances from Sesame sound more natural, but it appears less consistent in maintaining speaker identity compared to CodiFM. Given this:
      • In Table 7, speaker consistency and fluency are evaluated together. Wouldn't evaluating these two criteria separately yield more accurate insights?
      • Have the authors measured the actual error rates related to speaker inconsistency for each model? Such measurement is crucial because high inconsistency could over-penalize models that otherwise generate natural-sounding speech.
      • Additionally, performance differences related to prompt length should be explicitly evaluated or discussed to better contextualize and interpret results.
  • In Section 5.2,
    • Evaluating performance based on a single selected long dialogue sample may limit the statistical validity of the analysis. Could you please share the samples of the selected dialogue?
    • Regarding the interpretation at line 293 ("additional conditioning does not improve speaker accuracy"), rather than stating no improvement, it seems more appropriate to conclude that additional conditioning reduces variance in speaker similarity scores.
      • This interpretation is supported by the observation that CodiFM and Sesame exhibit similar trends without per-utterance prompts, while MoonCast significantly deviates in Figure 4 (a).
      • Given that per-sentence prompts naturally reduce variability in similarity scores, this might adversely impact naturalness in dialogue synthesis. Supplementary audio samples indeed suggest MoonCast maintains a stylistically consistent but somewhat unnatural and disconnected delivery style. Clarifying this aspect would strengthen the analytical insights of the paper.
  • In Section 6,
    • At line 321, stating that "simulated dialogue data may introduce noise" implies ambiguity regarding its usefulness. Clarifying explicitly whether including this data (despite noise) remains beneficial specifically due to the necessity of handling overlapping speech would prevent confusion.
    • Separately, poorer performance with simulated data might be due to extreme overlap (100% overlap scenarios) or context discontinuity between concatenated segments. Have the authors conducted a detailed analysis on specific problematic samples to identify underlying issues? Providing concrete observations or examples from such analyses would significantly enhance understanding of limitations and potential improvements.
  • In Table 4, the authors should clarify whether WER calculations for single utterances and overlapping segments were performed identically. The noticeable discrepancy—maintained similarity and UTMOS but worsened WER when moving from condition 2 to 3—suggests potential issues specifically in overlapping speech. Explicitly addressing how overlaps were handled in WER calculations would resolve this confusion.

Significance

  • The analysis of the two-stage training and the impact of data characteristics at each stage is insightful and impactful. Readers looking for substantial real-time factor (RTF) reduction without sacrificing significant quality will likely find the proposed method useful.
  • However, based on sample listening, the improvements are not groundbreaking across all dimensions, including speech accuracy and expressiveness, beyond the primarily emphasized speaker consistency. Additionally, evaluations appear somewhat narrow and potentially biased. Expanding evaluation methodologies and addressing possible biases could significantly enhance the impact of this work.

Originality

  • The paper clearly differentiates itself from prior work, presenting unique solutions to address identified problems and experimentally validating efficiency improvements.
  • However, the related work survey has gaps, such as missing key citations, which somewhat weaken the comprehensiveness of the literature review. Addressing these omissions would further strengthen the paper's originality and academic thoroughness.

[1] Guo, H., Zhang, S., Soong, F. K., He, L., & Xie, L. (2021, January). Conversational end-to-end tts for voice agents. In 2021 IEEE Spoken Language Technology Workshop (SLT) (pp. 403-409). IEEE.

[2] Lee, K., Park, K., & Kim, D. (2023, June). Dailytalk: Spoken dialogue dataset for conversational text-to-speech. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE.

[3] Défossez, A., Mazaré, L., Orsini, M., Royer, A., Pérez, P., Jégou, H., ... & Zeghidour, N. (2024). Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037.

Questions

  • Regarding the intermediate representations mentioned at line 111, it's clear that avoiding audio tokens offers efficiency gains, but does it also yield performance improvements? Specifically, as reported in [4], using a timely compact codec as an intermediate representation improved WER compared to directly modeling mel-spectrograms. Conducting explicit performance comparisons would clarify this trade-off, enabling informed decisions in the research community. Have the authors explored or conducted comparative experiments using such intermediate representations and if so, what were the findings?
  • In Section 3.4 (line 173), have the authors empirically compared the performance of simultaneous unconditional conditioning of prompt speech and text versus separate unconditional conditioning? Additionally, approaches like dual classifier-free guidance [5] might be relevant; has this been explored experimentally?
  • In Table 5, why did UTMOS scores decrease after fine-tuning? Could the authors provide the specific audio samples used for evaluating each case to facilitate detailed analysis?
  • Regarding line 195 ("3,000 hours of English podcast data"), details about data acquisition are missing. Was this dataset collected via web crawling, and how were copyright concerns addressed? An explicit ethical review discussion on this matter seems necessary.
  • Concerning line 209, setting the overlap rate to 100% may lead to unnatural dialogue scenarios, as complete overlap is rare in natural speech. Could the authors share some training samples exhibiting this full overlap scenario to clarify the nature of such data and its potential influence on model performance?
  • The official Sesame demo on Hugging Face (https://huggingface.co/spaces/sesame/csm-1b) uses significantly longer prompts compared to the prompts provided in the supplementary samples of this paper. Typically, shorter prompts can degrade the quality of the synthesized speech. Have the authors explicitly investigated how the length of prompts affects performance, particularly in a direct comparison with Sesame using longer prompts?
  • The proposed methodology naturally raises interesting considerations when extending to real-user interactive applications. In such a scenario, the following questions become relevant, and the authors' intuitions would be valuable:
    • How might the proposed non-autoregressive approach be practically extended to interactive scenarios? For instance, introducing autoregressive modeling might seem necessary to handle real-time responsiveness, but could this dilute the efficiency and speed advantages central to CodiFM?
    • Additionally, while simultaneous generation is highlighted as a benefit (line 103), it could pose challenges for real-user interactions, where incremental responsiveness might be required. Have the authors considered how simultaneous generation might complicate or limit real-world interactive applications, and how could this be effectively addressed?

[4] Lee, K., Kim, D. W., Kim, J., Chung, S., & Cho, J. (2025). DiTTo-TTS: Diffusion transformers for scalable text-to-speech without domain-specific factors. In The Thirteenth International Conference on Learning Representations.

[5] Lee, Y., Yeon, I., Nam, J., & Chung, J. S. (2024, April). Voiceldm: Text-to-speech with environmental context. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 12566-12571). IEEE.

Limitations

yes

Final Justification

After the first round: I've raised my score from 2 to 3, acknowledging the authors' detailed clarifications and additional experiments:

Resolved Issues:

  • Methodological details regarding model input construction and conditioning strategies are largely clarified.
  • Concerns around the evaluation procedures and data characteristics were significantly addressed.

Remaining Issues:

  • The procedure for generating simulated dialogue data remains unclear, especially regarding how context coherence is ensured.
  • The evaluation of baselines (e.g., Sesame) with sufficiently long prompts was not comprehensively addressed, potentially biasing the comparative analysis.
  • Questions related to training-inference mismatch, specifically prompt-context conditioning, remain unresolved.

Given these critical remaining concerns, especially regarding simulated data creation and prompt-length effects, the submission still falls slightly short of the acceptance threshold. I encourage the authors to thoroughly address these points in future revisions.

After the last round: Based on the authors’ detailed rebuttal and clarifications, many of my initial concerns regarding methodology, data construction, and evaluation have been satisfactorily addressed. In particular, the explanation about prompt exclusion from loss calculation (masking), the simulated dialogue data format, and the challenges related to overlapping speech reconstruction were clear and convincing.

However, some issues remain, including the need for explicit incorporation of prompt length performance analysis in the main paper and addressing a few missing citations. These points, while important, do not undermine the overall contribution.

Given the substantial progress made in clarifying key points and the promising results demonstrated, I recommend increasing the score to 4. I encourage the authors to carefully address the remaining minor issues in the final revision to strengthen the paper further.

Formatting Concerns

No concerns.

Author Response

Dear Reviewer 49qE,

Thank you for your comprehensive review and valuable feedback. We use C, O, S, and Q to abbreviate Clarity, Originality, Significance, and Question in our responses.

C1-C3, O2: We will revise Figure 1 with detailed explanations and add the references [1-5] you pointed out in the corresponding sections.

C4: During training, we extract a random segment from each dialogue sample to serve as the speaker prompt. This segment must consist of consecutive frames spoken exclusively by a single speaker. We refer to such segments as "candidates."

C5: Our method processes each speaker's text as a two-stream sequence (z = [z_1, z_2]); the two streams are constructed as described below, then concatenated and encoded. Each stream includes a prompt part and a synthesized part. The prompt part uses [spk1] and [spk2] as indicators (not text tokens) to replace the transcription of the prompt, distinguishing which segments belong to each speaker. The synthesized part contains the actual dialogue, aligned with mel-spectrogram frames at the frame level. For example, in a 6-second dialogue with 100 frames per second, if spk1 speaks 30 characters from 1-3s and spk2 speaks 20 characters from 4-5s, z_1 holds the real characters at frames 100-130 and continuation tokens [P] at frames 130-300, while z_2 holds the real characters at frames 400-420 and [P] tokens at frames 420-500. All other positions in z are silence tokens.

For real-world dialogue, we use ASR and diarization tools to obtain the sentence-level timestamps t_{start} and t_{end}. These timestamps are not ground truth and may contain errors. We also use simulated LibriTTS data to provide accurate ground-truth timestamps, even for overlapping speech segments. This approach contrasts with single-stream methods by enabling temporal alignment and features such as specific silence insertion and overlapping, as shown in our demo.
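For illustration only, here is a minimal sketch of how such frame-aligned text streams could be built from sentence-level timestamps. This is not the authors' code: the token names ([SIL], [P]), the frame-rate constant, and the helper function are assumptions based on the description above.

```python
# Hypothetical sketch of building frame-aligned text streams z1/z2 from
# sentence-level timestamps (not the authors' implementation; token names assumed).
FPS = 100                    # mel frames per second (assumed)
SIL, PAD = "[SIL]", "[P]"    # silence token and continuation token (assumed names)

def build_stream(segments, total_frames):
    """segments: list of (start_sec, end_sec, text) for ONE speaker."""
    stream = [SIL] * total_frames
    for start, end, text in segments:
        s, e = int(start * FPS), int(end * FPS)
        chars = list(text)
        for i in range(s, e):
            # real characters first, then continuation tokens until the segment ends
            stream[i] = chars[i - s] if i - s < len(chars) else PAD
    return stream

total = 6 * FPS   # 6-second dialogue -> 600 frames
z1 = build_stream([(1.0, 3.0, "text spoken by speaker one here")], total)
z2 = build_stream([(4.0, 5.0, "speaker two's reply")], total)
# z = [z1, z2] is embedded and fed to the flow-matching model together with
# the prompt mel-spectrograms placed at the beginning of the sequence.
```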

C6.1, C6.2: Robustness and diversity mean that the model remains robust when given prompts recorded under challenging and noisy acoustic conditions. We achieve this by eliminating the dependence on the prompt's transcription, whereas other models face serious performance degradation if the ASR transcription is wrong, as demonstrated in Table R1 in our reply to reviewer W9cW. Additionally, the random selection of m_{ctx} enables in-context learning by allowing the model to extract speaker prompts directly from training dialogue samples. This design can even improve speaker similarity because the prompts and the generated speech share the same acoustic environment.

C7.1 We did not mention the name of the speech enhancement API due to the double-blind review policy. We will cite it in the camera-ready paper if the manuscript is accepted.

C7.2: Our current simulated data does not take contextual coherence into consideration due to its complexity, and our goal is to model overlapped speech generation. We recognize the value of contextual coherence and will revisit this consideration in future work.

C8.1 The open-source version of Moshi does not support multi-speaker dialogue generation. Moshi's dialogue capability is primarily focused on human-user and agent interactions, not the generation of speech for multiple speakers. While Moshi's paper mentions a multi-stream TTS model for dialogue synthesis, there are no publicly available codes or data to enable reproduction of their described model.

C8.2: The 15 professional linguistic experts we selected were specifically trained in linguistics and experienced in subjective testing, and they were crowd-sourced for this evaluation. Due to the double-blind review policy, we cannot explicitly name the company platform used but will disclose it upon paper acceptance.

C9.1: The main differences between our input format and the baselines lie in two key perspectives (1) CodiFM employs multiple text streams to distinguish "who speaks when and what," whereas Sesame utilizes only a single text stream with only the content and the speaker id, which can more readily lead to speaker confusion. (2) CodiFM places the prompt at the very beginning of the input sequence and generates the entire dialogue in parallel. Sesame, being an AR model, generates each utterance sequentially. While its prompts are also at the beginning, this can cause confusion because the previously generated audio is often more proximate to the segment being synthesized than the initial prompt.

C9.2: The primary focus of this paper is on generating synthesized dialogue with accurate pronunciation by the correct speaker and appropriate turn-taking, including silences and overlaps. The "stable and reliable output quality" mentioned on Line 275 refers to the enhanced stability of our model in terms of speaker accuracy and content fidelity, minimizing instances where Speaker A utters content intended for Speaker B. In this work, we did not specifically design or optimize for speech expressiveness. As a zero-shot model, the expressiveness should be guided by the given prompts in addition to speaker timbre.

C9.3 and Q6: Thank you for your valuable suggestion. We conducted an additional fluency test using CMOS scores as per your request, and the results are presented in Table C93-1.

Table C93-1: Fluency CMOS

Model      Fluency-CMOS
MoonCast   -0.644
Sesame     -0.317
CodiFM      0.000

To assess speaker change accuracy, we introduced two metrics: Diff-SC (the difference between the correct and generated numbers of speaker changes, ideally zero) and Speaker Change Error Rate (SCER) (the proportion of dialogues with an incorrect number of speaker changes). Table C93-2 shows that CodiFM outperforms the other systems on both metrics, demonstrating its superior accuracy in handling speaker changes.

Table C93-2: Evaluation for Speaker Change

Model      Diff-SC        SCER
MoonCast   -0.78 ± 2.14   47.57%
Sesame     -0.66 ± 1.87   21.98%
CodiFM     -0.16 ± 0.57    9.50%
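As a reading aid, here is a rough sketch of how these two speaker-change metrics could be computed from reference and generated speaker-turn sequences. It is our interpretation of the definitions above, and the helper names and sign convention are assumptions, not the authors' evaluation scripts.

```python
# Hypothetical sketch of Diff-SC and SCER (our interpretation of the definitions above).

def count_speaker_changes(speakers):
    """speakers: ordered list of speaker labels, one per utterance in a dialogue."""
    return sum(1 for a, b in zip(speakers, speakers[1:]) if a != b)

def diff_sc_and_scer(ref_dialogues, gen_dialogues):
    # Sign convention assumed from the wording above: correct minus generated changes.
    diffs = [count_speaker_changes(ref) - count_speaker_changes(gen)
             for ref, gen in zip(ref_dialogues, gen_dialogues)]
    diff_sc = sum(diffs) / len(diffs)                # mean difference, ideally zero
    scer = sum(d != 0 for d in diffs) / len(diffs)   # fraction of dialogues with a wrong count
    return diff_sc, scer
```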

Table C93-3 demonstrates that prompt length affects Sesame's performance, with shorter prompts leading to poorer results. In contrast, our CodiFM model is robust to varying prompt lengths, showing minimal impact on performance. MoonCast, however, exhibits less stable performance.

Table C93-3: SA-WER/SA-SIM evaluation of different speaker prompt length

Model      1-3s           3-6s           6-9s          9-12s         12-15s
MoonCast   21.526/0.297   34.218/0.215   7.723/0.380   46.02/0.276   3.72/0.480
Sesame     7.425/0.373    5.584/0.364    6.114/0.447   5.889/0.421   4.865/0.446
CodiFM     6.95/0.556     6.002/0.543    6.002/0.535   6.002/0.543   6.002/0.542

C10.1 Thank you for highlighting the potential for misunderstanding in Figure 4. We acknowledge that Figure 4 was initially generated using statistics from a single dialogue, which may have led to misinterpretations. After averaging across multiple dialogues, our model consistently maintains its leading performance.

C10.2 We will revise our description accordingly.

C11.1, C11.2 and C12: Adding simulated dialogue data can hurt model performance, likely due to noise from concatenating utterances recorded in different acoustic environments. However, simulated LibriTTS data is usable because it is clean. The performance degradation noted in Table 4 is not related to overlapping speech, as the test set does not include overlapping speech, and WER was calculated after segmenting sentences with the Deepgram ASR+Diarization API.

S2 We acknowledge that preferences vary and baseline models may excel in expressiveness due to their larger parameters and datasets. We've addressed concerns about evaluation bias by introducing a new, more diverse real-world test set with varied acoustic conditions and conversational prosody (Table R2 for reviewer qZtn), and we've expanded our analysis to include longer dialogue durations (Table R3 for reviewer qZtn).

Q1: Using a compact codec as an intermediate representation does not always yield performance improvements for the dialogue generation task. [9] demonstrated that using semantic tokens (HuBERT) as intermediates can cause words to be omitted or duplicated in synthesized dialogues.

Q2: We drop both the text and audio conditions with probability p_{uncond}. In the remaining cases, we drop the audio condition with probability p_{dropx}. We will revise the paper accordingly.

Q3: Fine-tuning with dialogue data led to a slight, but not significant, decrease in UTMOS scores, dropping from 3.48±0.19 (pre-train only) to 3.35±0.17 (pre-train and fine-tune). This is because the dialogue data is not clean even after speech enhancement.

Q4 The 3,000-hour dataset used in this study is internal and consists of two-person conversations. Due to confidentiality constraints, we are unable to disclose specific details regarding data acquisition. However, the dataset has been reviewed by our internal legal department to ensure compliance with copyright and ethical standards.

Q5 Thank you for pointing that out. Indeed, full overlap data is uncommon in real life. Currently, we’ve used the LibriTTS dataset to simulate overlapping data (0%–100%), similar to speech separation methods. In the future, we’ll integrate real-world overlap statistics to improve our simulations.
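For concreteness, here is a minimal sketch of a generic LibriMix-style overlap simulation under our own assumptions; it is not necessarily the authors' exact pipeline, and the function name and sampling rate are illustrative.

```python
import numpy as np

def mix_with_overlap(wav_a: np.ndarray, wav_b: np.ndarray,
                     overlap: float, sr: int = 16000):
    """Concatenate two single-speaker utterances so that a fraction `overlap`
    (0.0-1.0) of the shorter one overlaps the end of the first (generic sketch)."""
    ov = int(overlap * min(len(wav_a), len(wav_b)))
    mix = np.zeros(len(wav_a) + len(wav_b) - ov, dtype=np.float32)
    mix[:len(wav_a)] += wav_a
    mix[len(wav_a) - ov:] += wav_b            # speaker B starts before A finishes
    b_start_sec = (len(wav_a) - ov) / sr      # ground-truth timestamp for the second stream
    return mix, b_start_sec
```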

Q7: Thank you for your interest in extending CodiFM to real-user interactive applications. While this lies beyond the scope of our current study, which focuses on use cases such as video dubbing and two-host podcast generation, we would like to share a few initial thoughts. A flow-matching model could potentially support streaming through a chunk-aware, causal flow-matching variant [1]. Additionally, techniques such as consistency distillation or mean flows could enable one-step inference. Therefore, the approach also has potential for real-user interactions.

[1] Z. Du, et al, CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Comment

Thanks to the detailed replies and additional experimental results provided by the authors, most ambiguities, including those concerning the scope of the research, have been addressed effectively. However, several key questions remain:

  • Regarding C6.1 and C6.2, I understood that the randomly selected m_{ctx} was drawn solely from the current dialogue being trained. My original intent was to inquire whether choosing prompts from the actual speaker’s prior speech, rather than only within the current dialogue, could help reduce training-inference mismatch. Do the authors believe selecting prompts strictly within the given dialogue is sufficient, or is it practically too complex to retrieve prompts specifically tied to a speaker?
  • In C7.2, I still find it difficult to understand how simulated dialogues can be effectively created without considering "contextual coherence." While I agree that achieving perfect dialogue flow and context alignment is beyond the current scope, simply concatenating two speakers' utterances discussing entirely unrelated topics seems insufficient to form meaningful dialogue data. My expectation was that, since Librispeech-style datasets typically allow matching utterances to original texts, the authors could reconstruct dialogues using original metadata or similar methods. Could the authors provide some example transcripts of such simulated dialogues?
  • For C8.1, assuming my understanding is correct, wouldn't it also be possible to simulate Moshi’s method by replacing user utterances with a randomly selected speaker’s utterances from the dataset?
  • Regarding C7.1 and C8.2, while it's understandable if some reproducibility-related information is confidential due to direct author affiliation, I find the reasoning of withholding such information due to double-blind review constraints unconvincing, unless explicitly author-identifying.
  • If my understanding is accurate, all baseline results reported in the main paper and rebuttal (excluding Table C93-3) were obtained using relatively short prompts. However, according to Table C93-3, Sesame outperforms the proposed method when longer prompts (12–15 seconds) are used. Thus, it seems important to validate whether the proposed method consistently demonstrates improved performance even under evaluation with longer prompts.
  • Concerning Q1, it's clear that relying solely on HuBERT (which primarily emphasizes semantic information) would pose problems. My original question aimed at intermediate representations combining semantic and acoustic information (e.g., audio codecs). Should we interpret the authors’ choice to directly model mel-spectrograms as simply following prior work, without explicitly testing these combined representation options?
  • In relation to Q2, as the authors clarified, the unconditional dropping of text and speaker information occurs separately during training. My original intent was to inquire if the authors had explicitly evaluated performance differences when applying separate classifier-free guidance for text and speaker during inference.
  • (To Ethical Reviewer) Regarding the authors’ response to Q4, it would be beneficial if ethical concerns about data collection and copyright management were explicitly addressed in the ethics review process.

Given that the authors' clarifications resolved most confusion, I will raise my score accordingly. However, several critical points remain open, and I look forward to these being addressed in the upcoming discussions.

Comment

Further Question 4

C7.1: The speech enhancement API is an internal API.

C8.2: The linguistic experts were hired through an internal crowdsourcing platform.

We apologize for not disclosing the relevant information, as we were concerned that doing so might risk desk rejection by violating the double-blind review policy.

Further Question 5

To demonstrate that our model maintains a stable advantage even with very long prompts, we conducted an additional experiment. We selected 20 speech clips, each longer than 20 seconds, from 20 different speakers in the LibriSpeech dataset. We then trimmed these clips to lengths of 10, 15, and 20 seconds.

For transcription, we used Whisper-large-v3, and for the text component, we used 100 dialogues from the Dailydialog dataset. As shown in Table FQ5, our model's SA-WER and SA-SIM performance remains consistently strong across all prompt lengths. This further confirms that our model performs reliably well, regardless of how long the input prompt is.

Table FQ5: SA-WER/SA-SIM of the long-prompt scenario

Model      10s           15s           20s
MoonCast   23.05/0.397   25.08/0.439   18.19/0.435
Sesame     15.80/0.519   10.93/0.510   10.49/0.544
CodiFM     6.15/0.556    6.48/0.558    6.10/0.559

Further Question 6

We chose not to use an audio codec for the following key reasons, which primarily concern the training process and model performance:

  1. Limitations of Single-Speaker Codecs for Multi-Speaker Support: While models like SoundStorm, MoonCast, and Sesame use audio codecs, their codecs are not open-source. The open-source codecs that are available are typically trained on single-speaker data; these are poorly equipped to handle overlapping speech and can produce unnatural transitions between speakers. While multi-speaker codecs are a promising area of research, they require extensive semantic information for annotation, making them challenging to train, and this is beyond the scope of this paper. Therefore, we chose to model the mel-spectrogram directly, which provides greater flexibility and is better suited for multi-speaker extensions.

  2. Mitigating Error Propagation: Our architecture is a single, end-to-end model. This design choice is deliberate, as it eliminates the error propagation problems common in two-stage models, where errors from the first stage can be amplified in the second.

  3. Efficiency and Reliability: Based on the negative effects and suboptimal results observed in previous works, we decided against training a new audio codec from scratch. This allows us to focus our efforts on developing a robust and reliable end-to-end solution.

Further Question 7

We found it beneficial to apply separate classifier-free guidance dropout for the audio prompt during training. We drop both the text and audio conditions with probability p_{uncond}. In the remaining cases, we drop the audio condition with probability p_{dropx}.

During inference, we compute the unconditional result by dropping all conditions simultaneously rather than dropping each component separately. Table FQ7 shows that adding p_{dropx} is beneficial compared with using p_{uncond} alone.

Table FQ7: SA-WER/SA-SIM for different training classifier-free guidance configuration.

p_{uncond}   p_{dropx}   SA-WER   SA-SIM
20%          0%          7.031    0.557
20%          20%         6.309    0.563
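To make the two probabilities concrete, here is a hedged sketch of the condition dropping described above, using the values from the second row of Table FQ7. The interpretation of "remaining cases," the guidance scale, and all names are assumptions, not the authors' implementation.

```python
import torch

# Dropout probabilities taken from Table FQ7, second row (interpretation: applied per example).
P_UNCOND, P_DROPX = 0.20, 0.20

def drop_conditions(text_cond, audio_prompt, null_text, null_audio):
    """Training-time condition dropping for classifier-free guidance (sketch)."""
    r = torch.rand(()).item()
    if r < P_UNCOND:                    # drop BOTH text and audio prompt
        return null_text, null_audio
    if r < P_UNCOND + P_DROPX:          # otherwise drop only the audio prompt
        return text_cond, null_audio
    return text_cond, audio_prompt

def guided_velocity(model, x_t, t, text_cond, audio_prompt,
                    null_text, null_audio, w=2.0):
    """Inference: one conditional pass plus one fully unconditional pass
    (all conditions dropped together), combined with an assumed guidance scale w."""
    v_cond = model(x_t, t, text_cond, audio_prompt)
    v_uncond = model(x_t, t, null_text, null_audio)
    return v_uncond + w * (v_cond - v_uncond)
```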

Further Question 8

Please see our responses to the Ethical Reviewer.

Finally, we would like to express our gratitude again for your time and effort in reviewing and discussion. Please do not hesitate to let us know if you have any further concerns or comments. We would be more than happy to address them.

[1] CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

[2] Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

[3] Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

[4] MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

[5] FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

[6] LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

[7] Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context

Comment

Dear Reviewer 49qE,

We appreciate your detailed feedback and would like to clarify some key points regarding our methodology, results and your concern regarding the legality and licensing of the training dataset. The detailed responses are listed below.

Further Question 1

Thank you for the opportunity to clarify our choice of speaker prompts. We believe our approach of selecting a prompt from the current speech is more beneficial for several key reasons, and we will revise the manuscript to articulate this more clearly.

First, this method aligns with the principles of audio continuation [1,2] or In-context-learning [3,4] demonstrated in recent monologue TTS work. By sampling a prompt from the same dialogue (either at the beginning or randomly chosen), we ensure strong continuity between the prompt’s speaker characteristics (e.g., style, timbre, acoustic environment) and the segment to be generated.

Second, our method avoids this potential train-inference mismatch. Our training procedure, which uses a mask for the prompt segment as shown in Figure 1, ensures that the prompt and the target text are distinct, which mirrors the inference process. In contrast, using a prompt from prior speech is feasible but could introduce an acoustic mismatch due to differences in the recording environment.

We acknowledge that in some traditional works, such as FastSpeech2 [5], a separate speech is used as a prompt to prevent information leakage. However, this is primarily a concern when the prompt and target share the same text transcription. In our work, the prompt's transcription is intentionally different from the dialogue text to be synthesized, thus mitigating this risk.

Finally, from a practical standpoint, retrieving a suitable monologue segment from a separate dialogue is computationally inefficient, especially with large-scale datasets. This process would require significant overhead for identifying the target speaker's dialogue and isolating a suitable monologue segment to use as a prompt.

Further Question 2

To simulate dialogue from the LibriTTS datasets, we used the following method, and we will open-source this data-simulation code.

  1. We begin by randomly selecting two distinct speakers.

  2. Next, we retrieve a list of all available dialogues for each of these two speakers.

  3. We then generate a new dialogue with a predetermined maximum duration, such as 30 seconds. We interleave speech segments from Speaker A and Speaker B into this new dialogue until it reaches the maximum duration.

  4. Once the dialogue reaches the maximum duration, we save the generated dialogue and begin simulating a new one using any remaining speech segments.

Below is an example of the generated results. Each line follows the format spkid|content|start_time(ms)|end_time(ms):

1789|We wot not, said the fishers, but he keepeth it no counsel but that he is a knight of King Arthur's, and by the mighty lord of this isle he setteth nought.|0|8430
1160|I had my share of it; for, as soon as I got back to my seat in the Assembly, I was put on every committee for answering his speeches and messages, and by the committees always desired to make the drafts.|4430|18150 
1789|Then the lady prayed the fishers to bring him to her place.|12150|15170 

It is important to note that our method does not guarantee contextual consistency between the two speakers. Nevertheless, within each individual speaker’s speech segments, we ensure that the topic remains consistent. We appreciate the suggestion to reconstruct dialogues using original metadata or similar approaches. However, identifying speech segments that are not only contextually aligned but also involve utterances from two speakers is a challenging task.
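For illustration, here is a minimal sketch of the interleaving procedure in steps 1-4 above. This is our paraphrase: the 30-second cap comes from step 3, overlap insertion is omitted for brevity, and the function and speaker names are hypothetical.

```python
MAX_DUR_MS = 30_000   # maximum simulated-dialogue duration (step 3; assumed 30 s)

def simulate_dialogue(utts_a, utts_b):
    """utts_a / utts_b: lists of (text, duration_ms) for the two selected speakers.
    Returns lines in the 'spkid|content|start_time(ms)|end_time(ms)' format."""
    pools = {"A": list(utts_a), "B": list(utts_b)}
    lines, t, turn = [], 0, 0
    while t < MAX_DUR_MS and (pools["A"] or pools["B"]):
        spk = "A" if turn % 2 == 0 else "B"
        turn += 1
        if not pools[spk]:
            continue                      # this speaker has no segments left
        text, dur = pools[spk].pop(0)
        lines.append(f"{spk}|{text}|{t}|{t + dur}")
        t += dur                          # strictly sequential; overlap insertion omitted
    return lines
```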

Further Question 3

Our primary task is dialogue generation, which involves taking two speaker's voices (timbre) and producing a multi-turn conversation. We determined that the open-source Moshi model is not suitable for this purpose for the following reasons:

  1. Turn-by-Turn Generation and Uncontrollable Silences: Even if we replace the user's text prompt with a monologue from another speaker, Moshi generates the dialogue on a turn-by-turn basis. This process introduces an uncontrollable silence between each turn, which is not ideal for natural dialogue flow.

  2. Inconsistent Speech Synthesis: The "pseudo-user" speech in Moshi is synthesized using a standard Text-to-Speech (TTS) model. This approach is not comparable to our work, as it does not capture the natural back-and-forth of a real dialogue.

Furthermore, the baseline Sesame model we use shares a similar architecture to Moshi but provides superior performance, making it a more robust benchmark for our work.

Comment

I appreciate the authors' detailed response. It allowed me to clearly understand your perspectives regarding my questions, as well as pinpoint exactly where we agree and where our opinions diverge. Below are my follow-up comments.

  • FQ1: I acknowledge and agree with the authors' explanation regarding the standard usage of prompts in prior in-context learning and continuity scenarios. However, under the current approach—where the prompt is essentially a subset of the same dialogue being synthesized—I still find it difficult to accept the claim that "the prompt and the target text are distinct." More specifically, I might have misunderstood the statement that "the prompt's transcription is intentionally different from the dialogue text to be synthesized." In my view, there still seems to be a training-inference mismatch; fully addressing this issue would require selecting prompts from audio clips separate from the dialogue currently being modeled. I agree this would be practically inefficient, but wouldn't introducing "acoustic mismatch due to differences in the recording environment" actually result in a simulation that is closer to inference-time conditions? Although this issue slightly deviates from the main scope of your paper, I am bringing it up again to fully clarify your perspective.

  • FQ2: Thank you for sharing the samples and providing a detailed explanation of your data construction process. Your explanation clearly addresses how the simulated dialogue dataset was constructed, and I found it somewhat surprising that the model performs well even with this kind of data. While your argument—that maintaining speaker consistency enables effective dialogue simulation—is convincing, I still suspect that my initial impression about the reduced naturalness of inter-speaker dialogue might stem partly from this data construction method. Specifically, while individual speakers' utterances are coherent and accurate, the overall turn-taking felt somewhat less expressive and natural.

    Additionally, you mentioned the dataset involved two speakers, but the provided examples contain multiple speakers. Could this be because the examples exceed 30 seconds, resulting in additional speakers appearing within the sample?

  • FQ3: Thank you for the clarification. Your explanation effectively addressed why Moshi was inappropriate as a baseline and reinforced my understanding of why Sesame was a more suitable choice.

  • FQ4: I understand the internal dependency constraints clearly now. Thank you for the clarification.

  • FQ5: Thank you for sharing the additional experimental results. They clearly demonstrate that your method maintains superior performance over baselines even with longer prompts (20s). In my view, including these results explicitly in the paper would significantly highlight the strength of your proposed approach.

  • FQ6: Regarding point #1, my understanding is that Sesame utilizes Moshi's mimi, an open-source, multi-speaker codec specifically developed for Moshi. Thus, it seems reasonable to expect that mimi can naturally handle overlapping speech scenarios. Concerning points #2 and #3, I agree there are inherent trade-offs involved; while directly predicting mel-spectrograms may yield better reconstruction and eliminate the need for additional codecs, employing compact latent representations might offer advantages like improved alignment. In my opinion, empirical validation would be necessary to clarify these trade-offs fully.

  • FQ7: It is encouraging to see that separately dropping inputs during training leads to performance improvements. Additionally, applying guidance separately during inference also seems promising; I recommend considering this approach in future work.

  • FQ8: Thank you for your response. I hope the ethical reviewers carefully assess this aspect.

Comment

Dear Reviewer 49qE,

Thank you for your follow-up comments and valuable advice. We will incorporate the analysis of prompt length into our paper and explore separate classifier-free guidance in our future work.

Regarding your remaining questions about the relationship between the prompt and the training sample, the simulated dialogue, and the use of speech codecs, we have provided detailed responses below.

FR1: about the difference between prompt and the training dialogue sample (FQ1)

To clarify how our prompts are distinct from the target segments, even when both come from the same dialogue, we'll use a specific example.

Let's imagine a 30-second training sample where Speaker 1 talks from 1-5s, 10-15s, and 20-25s, and Speaker 2 fills the rest of the time. We randomly select two prompts from this dialogue: Prompt 1 is the segment from 2-4s, and Prompt 2 is the segment from 26-29s. We then extract the mel-spectrograms for these two prompts and place them at the beginning of the input sequence.

Crucially, when we calculate the loss, we apply a mask to exclude the 2-4s and 26-29s segments. This means the model's performance is only evaluated on the remaining parts of the dialogue (0-2s, 4-26s, and 29-30s). Therefore, while the prompt and target segments originate from the same conversation, they are kept distinct during the training process.

This strategy is similar to the in-context learning approach used for monologues. For example, within a 10-second monologue, we might randomly select the 3-5s segment to serve as the prompt. In this case, the loss would only be calculated on the remaining segments, specifically the 0-3s and 5-10s portions.
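A small sketch of the prompt-exclusion mask illustrated by this example, assuming 100 mel frames per second; the variable names and the loss form are illustrative, not the authors' code.

```python
import torch

FPS = 100                                   # mel frames per second (assumed)
T = 30 * FPS                                # 30-second training sample -> 3000 frames
prompt_spans = [(2.0, 4.0), (26.0, 29.0)]   # prompt segments (seconds), as in the example

loss_mask = torch.ones(T)                   # 1 = frame contributes to the training loss
for start, end in prompt_spans:
    loss_mask[int(start * FPS):int(end * FPS)] = 0.0   # exclude prompt frames

def masked_loss(per_frame_loss):
    """Average a per-frame loss (shape (T,)) over non-prompt frames only."""
    return (per_frame_loss * loss_mask).sum() / loss_mask.sum()
```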

FR2: about the simulated dialogue (FQ2)

We agree that simulated dialogue, while useful, often lacks the expressive turn-taking found in natural conversation. When we relied solely on this data, we observed a similar unnatural quality in the inferred samples. This highlights the need for careful use of simulated data, and we recognize that simulated data should not be used alone.

To clarify, the example we provided involved only two speakers, identified as 1789 and 1160, across three turns. Speaker 1789 contributed two sentences, and speaker 1160 contributed one. All the data we used for this study contained either one or two speakers.

FR3: about Mimi Codec on overlapping speech (FQ6)

Thank you for the suggestion. We think that it's still challenging to expect Mimi and similar codecs to effectively handle overlapping speech within a single-channel waveform.

Although the Mimi codec was trained on data containing multi-speaker dialogues, we believe it is still an open question how well a speech codec models overlapping speakers, especially at a low frame rate of 12.5 Hz. In a simple experiment, we found that the mel-spectrogram reconstruction L2 loss for two-speaker overlapping speech was 2.03, compared to 0.71 for a monologue. This suggests that the reconstruction quality for overlapping speech is not yet satisfactory.
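For reference, here is a rough sketch of how such a mel-spectrogram reconstruction L2 could be measured by comparing log-mels of the original and codec-reconstructed waveforms. This is our assumed setup using standard librosa calls, not the authors' exact measurement; the sampling rate and mel parameters are illustrative.

```python
import numpy as np
import librosa

def mel_l2(ref_wav: np.ndarray, recon_wav: np.ndarray,
           sr: int = 24000, n_mels: int = 80) -> float:
    """L2 distance between log-mel spectrograms of an original waveform and its
    codec reconstruction (generic sketch; parameters are illustrative)."""
    def logmel(y):
        m = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        return np.log(m + 1e-5)
    a, b = logmel(ref_wav), logmel(recon_wav)
    n = min(a.shape[1], b.shape[1])          # trim to the shorter spectrogram
    return float(np.mean((a[:, :n] - b[:, :n]) ** 2))
```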

We appreciate your advice and we will certainly take this into consideration for our future work. We plan to investigate the performance of speech tokens on handling overlapping segments and we will explore the feasibility of using these tokens in dialogue generation work. We will provide a more detailed comparison in a future analysis.

We would like to express our gratitude again for your time and effort in reviewing and discussion. Please do not hesitate to let us know if you have any further concerns or comments. We would be happy to address them.

Comment

Thank you for the detailed responses.

  • FR1: I now understand that the prompts are excluded from the loss calculation, as illustrated in Figure 1. Previously, I had focused solely on whether the text was included and overlooked this detail. Your explanation clarifies that this approach is effectively equivalent to the span masking technique used in prior literature.
  • FR2: Thank you for the additional explanation. Reviewing the ‘spkid|content|start_time(ms)|end_time(ms)’ format helped me fully understand your approach.
  • FR3: I appreciate you sharing the L2 loss values. I agree with your reasoning regarding the challenges of codec reconstruction for overlapping speech.

Most of my initial concerns and follow-up questions have been satisfactorily addressed. Therefore, I am inclined to raise my score. However, to fully deserve this score, the final revision should incorporate the discussed clarifications and address the missing citations.

Overall, I recommend increasing the score with the expectation that these improvements will be reflected in the final version.

Review (Rating: 4)

The paper introduces CodiFM, a fully non-autoregressive (NAR) framework for generating high-quality, zero-shot multi-speaker dialogues. Unlike previous approaches that rely on autoregressive decoding or intermediate representations (like audio tokens), CodiFM directly predicts mel-spectrograms from disentangled, multi-stream transcriptions using a flow-matching-based generative model. Key innovations include transcription-level speaker disentanglement, sentence-level alignment, and prompt-level masking, enabling precise control over speaker timing, overlap, and identity. Trained with a two-stage curriculum on large-scale monologue and simulated dialogue data, CodiFM achieves good performance in speech quality, speaker consistency, and inference speed, outperforming strong baselines like MoonCast and Sesame. It also supports fine-grained timing control, overlapping speech, and cross-lingual voice cloning, making it highly suitable for real-world applications such as podcast generation and virtual agents.

Strengths and Weaknesses

  • Strengths:

    • The proposed approach is well-motivated, and the experimental results demonstrate its feasibility and effectiveness.
    • The writing of the paper is clear and easy to follow.
  • Weaknesses:

    • While the paper is technically sound, the level of novelty is somewhat limited, as some of the key design choices — such as using fully non-autoregressive methods for speech synthesis and avoiding explicit duration modeling — have already been established in prior TTS literature.
    • The generalizability of the proposed method to real-world applications remains unconvincing. The training data used in this work is substantially smaller than that of several baseline methods (the training data processing pipeline does not clearly demonstrate which steps contribute to generating higher-quality data.), and the prompts used during evaluation are drawn from the LibriSpeech test-clean set. Given that these prompts differ significantly from real-world user speech in terms of content, style, and acoustic conditions, it is difficult to assess how well the model would perform when conditioned on more realistic prompts.

Questions

  • How long can the synthesized dialogue be with the current model?
  • The conclusion drawn in Section 5.2 seems somewhat inconsistent with subfigure (a) of Figure 4. If I understand correctly, MoonCast appears to exhibit overall higher similarity than Sesame, and its performance continues to improve with more dialogue turns. Of course, this might be due to the analysis being based on a single data example. Providing average results over a batch of data would make the conclusion clearer and more convincing.

Limitations

Yes

Final Justification

The authors have provided comparative experiments under realistic and long-duration scenarios in their rebuttal, which has alleviated some of my concerns. However, I still feel that the novelty of the proposed method is somewhat limited, as it appears to resemble training an E2TTS model on conversational data. Nevertheless, considering that the field of dialogue generation is still at a relatively early stage, I believe this work can be regarded as having a certain level of contribution at this point. This is the reason why I ultimately assigned a score of 4.

Formatting Concerns

None

Author Response

Dear Reviewer qZtn,

We sincerely appreciate your efforts in reviewing our paper and providing us with valuable, constructive feedback. We have carefully considered all your comments and suggestions, which have significantly helped us improve the manuscript. We have conducted additional experiments on realistic scenarios and analyzed the model's performance across different durations, and these results will be incorporated into the revised version of our paper. The detailed responses are listed below.

R1: About similarity to existing works (Weakness 1)

To achieve state-of-the-art (SOTA) performance in dialogue speech synthesis, we have fully leveraged existing SOTA technologies in our model building. While this may give the impression that CodiFM is a mere combination of existing methods, these individual methods, on their own, do not fully address the critical challenges inherent in our specific scenario: generating synthesized dialogue with accurate pronunciation by the correct speaker and appropriate turn-taking, including silences and overlaps.

To clarify our distinct contributions, we have provided a comparative analysis with existing works in Table R1. Our key contributions are as follows:

  1. Our work enables the generation of entire dialogues with accurate pronunciation by the accurate speaker, effectively resolving the speaker confusion problem prevalent in prior AR-based models.

  2. We employ a novel multi-stream Transcription-Level Speaker Disentanglement strategy to enable the model to precisely understand "who speaks what and when."

  3. We propose the Prompt-Level Random Masking strategy to enhance in-context learning capabilities specifically within dialogue scenarios. This strategy eliminates the need for prompt transcription, thereby removing the dependency on ASR tools and simplifying the generation process.

Table R1: System comparison

System     Arch   Duration    Monologue   Dialogue   Controllable Speed   Prompt's transcription   Silence/Overlap
VoiceBox   NAR    Phoneme     Yes         No         No                   Yes                      No
E2TTS      NAR    Utterance   Yes         No         Yes                  Yes                      No
MoonCast   AR     /           Yes         Yes        No                   Yes                      No
Sesame     AR     /           Yes         Yes        No                   Yes                      No
CodiFM     NAR    Utterance   Yes         Yes        Yes                  No                       Yes

R2: Generalizability to real-world application (Weakness 2)

We appreciate your concern regarding the generalizability of using LibriSpeech for evaluating real conversation styles. We agree that assessing performance in real-world scenarios is important.

To address this, we have designed an additional test set. The speaker prompts for this new set are derived from 10 distinct real-life dialogues from the NCSSD dataset-CEN, as proposed in Reference [31]. This dataset is collected from the internet and features prompts with natural conversational prosody, including various acoustic scenarios such as background noise. The corresponding text content for this test set is sourced from the DailyDialog dataset [37].

As presented in Table R2, our model, CodiFM, consistently demonstrates superior performance compared to baseline models even under these challenging real-world conditions.

Table R2: System performance comparison under real-world scenarios

Model      SA-WER   SA-SIM   UTMOS
MoonCast   26.26    0.276    2.09
Sesame     25.19    0.250    1.80
CodiFM     9.32     0.322    2.89

R3: About CodiFM’s maximum inference duration (Question 1)

Our training data is limited to durations under 30 seconds due to resource constraints. However, our model is capable of inferring dialogues longer than 30 seconds, and our existing test set includes numerous dialogues exceeding one minute in length. We conducted evaluations on four manually designed dialogues of varying durations, assessing SA-WER and SA-SIM as presented in Table R3. We found that CodiFM's performance begins to degrade after 90 seconds, and this degradation is also observed with MoonCast. Sesame fails to generate dialogues longer than 120 seconds. Furthermore, Sesame exhibited unstable results even under 30 seconds due to speaker confusion issues.

Table R3: SA-WER/SA-SIM evaluation of different duration of synthesized data

Model      30s           60s          90s           120s
MoonCast   5.81/0.647    4.57/0.576   13.95/0.487   19.75/0.516
Sesame     24.41/0.677   5.22/0.585   3.98/0.263    Failed
CodiFM     6.97/0.686    5.88/0.648   14.74/0.618   14.64/0.627

It is important to clarify that MoonCast, although designed for long-form generation, employs a chunk-wise Flow-matching detokenizer. For each turn in a long dialogue, it calls the Flow-matching detokenizer, requiring the audio of each speaker prompt to be provided every time. Therefore, it can't be considered a purely long-form generation model because the audio detokenizer still generates and concatenates multiple audio segments. Similarly, Sesame generates each utterance separately, relying on previous history, which means it is not a truly long-form generation model and is consequently less impacted by increased duration.

R4: About the mis-leading visualization of Figure 4a (Question 2)

Thank you for pointing out the potential for misunderstanding in Figure 4. We acknowledge that Figure 4a was indeed generated using statistics from a single dialogue, which may have led to misinterpretations.

After averaging across multiple dialogues, our model consistently maintains its leading performance. In contrast, Sesame and MoonCast exhibit unstable generation, leading to inconsistent results across different scenarios. For instance, in Table R2, we observe that MoonCast's speaker similarity on this realistic test set is slightly higher than Sesame's. Given that both models frequently encounter speaker confusion problems, their performance shows high variance, making it challenging to compare them when both are unstable.

We will update this Figure 4a in the revised manuscript with a new visualization that incorporates multiple speech samples, thereby providing a more representative and clearer illustration. Additionally, we will update these new samples on our revised demo page.

Finally, we would like to express our gratitude again for your time and effort in reviewing our paper. Considering this is the first attempt at non-autoregressive multi-talker dialogue generation and that we have incorporated extensive comparisons under realistic and long duration scenarios, we would appreciate it if you could consider increasing your score. Please do not hesitate to let us know if you have any further concerns or comments. We would be happy to address them.

Comment

Dear Reviewer qZtn,

We hope we have addressed your questions. Please let us know if you have any further concerns, as the discussion between the reviewers and authors will end soon. Thanks!

Best regards,

Authors

Comment

I appreciate the authors’ response, which has alleviated some of my concerns. Therefore, I will raise my score.

Review (Rating: 4)

CodiFM is presented as a fully non-autoregressive, flow-matching model that performs zero-shot dialogue generation. The method adapts the E2TTS recipe to multi-speaker scenarios by first splitting a dialogue transcript into separate text streams for each speaker, padding each sentence so its length matches the target mel duration, and inserting special silence tokens wherever a given speaker is not speaking. During training only one short reference utterance per speaker is provided and its loss region is masked to prevent the model from simply copying the prompt. Training proceeds in two stages: the model is first pretrained as a monologue zero-shot TTS system, and is then fine-tuned on several thousand hours of dialogue audio. Experiments show that, relative to strong baselines such as Mooncast and Sesame, CodiFM retains higher pronunciation accuracy and speaker-similarity scores with lower real-time factor (RTF).

Strengths and Weaknesses

Strengths

  1. The paper shows that the E2TTS recipe can be adapted to multi-speaker settings with a few thousand hours of dialogue data.
  2. CodiFM outperforms Mooncast and Sesame on pronunciation accuracy, SA-SIM, and real-time factor.
  3. The authors dissect the effect of data composition, the two-stage training schedule, and the use of silence tokens.

Weaknesses

  1. CodiFM requires exact start/end times for every utterance, whereas Mooncast and Sesame learn timing implicitly. The comparison is therefore not apples to apples, and the requirement itself is unrealistic in many settings.
  2. The baselines support long, natural conversations with non-verbal sounds. CodiFM’s ability to handle non-verbal events or very long samples is not demonstrated.
  3. Speaker-similarity metric may overstate quality. SA-SIM compares each generated sentence to its single reference prompt, potentially penalizing desirable prosodic variation over a longer dialogue.

Questions

  1. Can CodiFM learn timing from text-level turns only?

A fair comparison would train your model without ground-truth sentence timings (analogous to how Mooncast/Sesame operate) and evaluate whether it can infer durations.

  2. Non-verbal events and long-form generation

Mooncast generates laughter, sighs, etc., and scales to lengthy dialogues. Can CodiFM do the same?

  3. Prosodic variability

In multi-sentence dialogues, later utterances may naturally deviate in emotion or prosody from the initial prompt. How does CodiFM balance speaker similarity with such variability, and is SA-SIM the right metric?

Limitations

The method assumes external knowledge of sentence-level timing for each speaker. This timing information limits usability in real-world dialogue synthesis, where only turn-level transcripts are typically available.

Final Justification

While I initially had concerns regarding the lack of duration modeling in the NAR approach compared to the AR baseline, the demo convincingly demonstrated the necessity of the proposed model for real-world applications such as video dubbing, where dialogue must fit within specific time constraints. Moreover, the model showed no disadvantage against the baseline in challenging cases like long sentences, and the remaining concerns were well addressed. Therefore, I raise my score from borderline reject to borderline accept.

Formatting Issues

.

Author Response

Dear reviewer zySF,

We sincerely appreciate your efforts in reviewing our paper and providing us with valuable, constructive feedback. We have clarified some unclear writing that caused misunderstandings and added extra evaluations, which will appear in the revised version. The detailed responses are listed below.

R1: Importance of introducing duration (Weakness1 and Question1)

We acknowledge your query regarding the importance of introducing duration control. Currently, there are no existing Non-Autoregressive (NAR) dialogue generation models, compelling us to compare our work primarily with Autoregressive (AR) models.

A key distinction of our approach, in contrast to conventional NAR models such as VoiceBox and FastSpeech, which necessitate highly accurate duration prediction, is its reduced dependency on precise durations: only utterance-level duration information is required. During training, durations are extracted using ASR and diarization tools. We recognize that these tools may introduce inaccuracies, meaning our extracted durations are not exact ground truth. Crucially, during inference, our model does not rely on ground-truth durations. For instance, for a single dialogue, diverse durations can be employed to generate a variety of dialogues.

We emphasize that the ability to control speaker duration addresses a critical gap in existing AR models. Many real-world scenarios, such as video dubbing or advertisements with strict time limits (e.g., 15-second spots), necessitate precise duration control for each speaker. Furthermore, as illustrated in the first sample on our demo page, our model can generate crucial silences—for example, a waiter instructing someone to wait—which AR-based baseline models are unable to achieve.

It is important to note that CodiFM does not strictly require users to provide duration information. We offer an optional functionality where duration can be roughly estimated by counting syllables in the text.
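As a rough illustration of this optional estimate, a heuristic along the following lines can be used; the vowel-cluster syllable counter and the assumed speaking rate of four syllables per second are illustrative choices, not the exact tool we use.

```python
# Rough, illustrative utterance-duration estimate from a syllable count.
# The vowel-cluster heuristic and the 4-syllables-per-second rate are assumptions.
import re

SYLLABLES_PER_SECOND = 4.0  # assumed average speaking rate


def estimate_duration(text: str) -> float:
    """Approximate utterance duration in seconds by counting vowel clusters."""
    syllables = len(re.findall(r"[aeiouy]+", text.lower()))
    return max(syllables, 1) / SYLLABLES_PER_SECOND


print(estimate_duration("Could you please wait a moment?"))  # -> 2.0
```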

R2: The non-verbal and long dialogue generation capability (Weakness2 and Question2)

Regarding non-verbal capabilities such as laughter, our model's generation of such behaviors is data-driven, consistent with findings in prior work [9]. While we did not explicitly annotate these behaviors in our dataset, their presence allows our model to generate them, albeit in a non-controllable manner. We observe that Mooncast also exhibits non-controllable laughter generation, sometimes at inappropriate moments, as demonstrated in audio sample 2. We will include additional samples to showcase our model's non-verbal behavior generation capabilities in the revised demo page. We plan to annotate future datasets to enable controllable generation of such functionalities.

Due to resource constraints, our training data currently consists of segments less than 30 seconds in duration. Nevertheless, our model demonstrates the ability to infer dialogues longer than 30 seconds, and our test set includes numerous dialogues exceeding one minute. To further investigate long dialogue generation, we manually designed four dialogues with varying durations. We observed that both CodiFM and Mooncast exhibit degraded performance after 90 seconds, while Sesame fails to generate dialogues longer than 120 seconds. Additionally, Sesame showed unstable results even for dialogues under 30 seconds, primarily due to speaker confusion issues.

Table R2 presents the SA-WER/SA-SIM evaluations for synthesized data of different durations. None of the compared systems support streaming. For long-form generation, AR models typically require significant waiting times. In contrast, our CodiFM can generate each segment in parallel with a Real-Time Factor (RTF) of less than 0.3, significantly reducing waiting times.
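For reference, RTF here is the usual ratio of wall-clock synthesis time to the duration of the generated audio; the sketch below uses made-up timings purely to illustrate the definition, not measured numbers.

```python
# RTF = synthesis wall-clock time / generated audio duration.
# The timing numbers below are made up for illustration, not measurements.

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds


# e.g. producing 60 s of dialogue in 18 s of compute corresponds to RTF = 0.3
print(real_time_factor(18.0, 60.0))  # -> 0.3
```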

While Mooncast is specifically designed for long-form generation, it actually utilizes a chunk-wise Flow-matching detokenizer. For each turn in a long dialogue, it calls the Flow-matching detokenizer, requiring the audio of each speaker prompt to be provided every time. Therefore, it can't be considered a purely long-form generation model because the audio detokenizer still generates and concatenates multiple audio segments. Similarly, Sesame generates each utterance separately, relying on previous history, which means it is not a truly long-form generation model and is consequently less impacted by increased duration.

Table R2: SA-WER/SA-SIM evaluation of synthesized data at different durations

| Model | 30s | 60s | 90s | 120s |
| --- | --- | --- | --- | --- |
| Mooncast | 5.81/0.647 | 4.57/0.576 | 13.95/0.487 | 19.75/0.516 |
| Sesame | 24.41/0.677 | 5.22/0.585 | 3.98/0.263 | Failed |
| CodiFM | 6.97/0.686 | 5.88/0.648 | 14.74/0.618 | 14.64/0.627 |

R3: About SA-SIM metric (Weakness3 and Question 3)

We would like to clarify that our proposed SA-SIM metric is designed to measure speaker identity similarity relative to the speaker prompt, rather than prosodic similarity. Even with variations in prosody, the speaker similarity for the same speaker consistently remains high.

We did, however, account for the potential for prosodic variations in speakers over different time intervals. To address this, we separately calculated speaker consistency, as detailed in Figure 4b of Section 5.2. This was achieved by comparing the similarity of multiple segments from the same speaker within the same conversation. Our results demonstrate that CodiFM can better maintain speaker identity regardless of prosodic changes. Our experiments further revealed that Mooncast and Sesame exhibit lower SA-SIM scores because certain speakers' voices undergo complete identity changes, not merely prosodic shifts.
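As an illustration of how such a within-dialogue speaker-consistency score can be computed, the sketch below averages pairwise cosine similarities of speaker embeddings extracted from segments of the same speaker; the random vectors stand in for a real speaker-verification model, and this is not the exact implementation behind SA-SIM or Figure 4b.

```python
# Illustrative speaker-consistency score: average pairwise cosine similarity
# between embeddings of segments spoken by the same speaker in one dialogue.
# The random vectors stand in for embeddings from a speaker-verification model.
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def speaker_consistency(segment_embeddings: list) -> float:
    n = len(segment_embeddings)
    sims = [
        cosine(segment_embeddings[i], segment_embeddings[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return float(np.mean(sims)) if sims else 1.0


rng = np.random.default_rng(0)
segments = [rng.normal(size=192) for _ in range(4)]  # 4 segments, 192-dim embeddings
print(speaker_consistency(segments))
```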

R4: About the choice of NAR instead of AR model

The core of several questions revolves around the fundamental distinctions between NAR and AR models. While all previous research in dialogue generation has focused on AR models, there is a lack of purely NAR dialogue models. Although AR models have demonstrated effectiveness in dialogue generation, they possess several drawbacks:

  1. Speaker Confusion: As evidenced by the SA-SIM results, some sentences may be attributed to an incorrect speaker.

  2. Slow Inference Speed: AR models typically suffer from slow inference.

  3. Uncontrollable Generation: They lack control over crucial aspects such as duration, speaking speed, and silence.

NAR models, on the other hand, can effectively mitigate these limitations. Their efficacy has been established in monologue generation, and they demonstrate strong potential in Text-to-Speech (TTS) generation tasks. This motivated our investigation into a purely NAR model for dialogue generation.

Finally, we would like to express our gratitude again for your time and effort in reviewing our paper. Considering this is the first attempt at non-autoregressive multi-talker dialogue generation and that we have added a comparison with previous work, we would appreciate it if you could consider increasing your score. Please do not hesitate to let us know if you have any further concerns or comments. We would be happy to address them.

Comment

Dear Reviewer zySF,

We hope we have addressed your questions. Please let us know if you have any further concerns, as the discussion between the reviewers and authors will end soon. Thanks!

Best regards,

Authors

Review
4

This paper systematically tackles the long-overlooked challenge of multi-speaker zero-shot speech dialogue generation: ensuring speaker consistency, controllable speech overlap, and fast inference without relying on intermediate token representations. The authors introduce CodiFM, an end-to-end, fully non-autoregressive framework based on flow matching, and propose three mechanisms to overcome current limitations. To validate the approach, the authors compile a mixed training corpus and diverse simulated overlapping-speech samples. Extensive evaluations against state-of-the-art baselines show that CodiFM delivers significantly better speaker consistency and overlap naturalness, achieves 4–7× faster inference, and earns the highest human ratings for fluency and interactivity.

Strengths and Weaknesses

Strengths

  • The paper presents a technically sound framework, combining transcription-level speaker disentanglement, sentence-level alignment, and prompt-level masking within a fully non-autoregressive (NAR) flow-matching architecture. The proposed training curriculum and data mixing strategies are practical and well-explained.
  • The paper is generally well-written and logically structured, with clear motivation and modular presentation of model components.
  • The work addresses an important and underexplored problem: zero-shot multi-talker dialogue synthesis with overlapping speech and fine-grained timing control.
  • The integration of flow matching with multi-stream transcription input is novel within the domain of zero-shot multi-speaker TTS.

Weaknesses

  • Despite the claim that CodiFM works without prompt transcriptions, the advantages of this property are not convincingly quantified or analyzed in ablation studies.
  • The experiments are too simple. The scalability and robustness of the model with more than two speakers, or with longer and more complex dialogues, are not demonstrated.
  • Most technical components (e.g., flow-matching, classifier-free guidance, silence token modeling) are adapted from existing literature. The novelty primarily lies in how these pieces are combined.
  • Typo: there are two "Step 3" labels in Figure 1.

Questions

  • Could the authors clarify what fundamental modeling challenges are newly addressed in the multi-speaker dialogue generation context? What makes CodiFM more than a direct combination of known techniques?
  • What are the difficulties in expanding beyond two-person conversations or to other languages?

Limitations

Yes

Final Justification

The authors address the issues I was concerned about. I find the idea of this paper interesting, and I keep my score.

Formatting Issues

No

Author Response

Dear Reviewer W9cW,

We sincerely appreciate your efforts in reviewing our paper and providing us with valuable, constructive feedback. We have addressed the raised concerns by clarifying certain aspects of our methodology and incorporating additional analyses regarding the influence of prompt transcription and our system's performance in extended and complex scenarios. These revisions will be reflected in the updated version of the paper. Our detailed responses are listed below.

R1: Benefits of not using prompt’s transcription (Weakness 1)

Our decision to not utilize prompt transcription is motivated by two primary factors:

First, this design enhances the model's robustness across diverse speaker prompts. Conventional TTS models typically rely on ASR models or user-provided transcriptions. However, when prompts contain noise or exhibit unconventional speaking styles (e.g., whispering or accented speech), ASR performance can degrade significantly, leading to inaccurate transcriptions (e.g., WER exceeding 50%). As demonstrated in Table R1, an inaccurate prompt transcription can severely compromise the quality of the generated speech in TTS systems. Therefore, our approach is not aimed at improving baseline performance but rather at mitigating performance degradation when confronted with challenging prompts and at eliminating the dependency on ASR models.

Second, this approach facilitates in-context learning by enabling the extraction of speaker prompts from the same dialogue samples used during training. Obtaining accurate transcriptions for arbitrary segments of an utterance is challenging. By removing the requirement for prompt transcription, our model can utilize any monologue segment from a training dialogue sample as a speaker prompt during the training phase.
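For illustration, the sketch below shows one way such prompt-level masking can be applied to a frame-level training objective; the tensor shapes and the simple MSE loss are assumptions chosen for readability, not our exact flow-matching loss.

```python
# Illustrative prompt-level loss masking: frames belonging to the speaker prompt
# are excluded from the training loss, so the model is not rewarded for copying
# the prompt. Shapes and the MSE objective are placeholders for illustration.
import torch


def masked_frame_loss(pred: torch.Tensor, target: torch.Tensor,
                      prompt_frames: int) -> torch.Tensor:
    """MSE averaged over non-prompt frames only.

    pred, target: (batch, frames, mel_dim); the first `prompt_frames` frames of
    each item are the randomly chosen speaker prompt and contribute no loss.
    """
    mask = torch.ones(pred.shape[:2], device=pred.device)
    mask[:, :prompt_frames] = 0.0  # zero out the prompt region
    per_frame = ((pred - target) ** 2).mean(dim=-1)
    return (per_frame * mask).sum() / mask.sum().clamp(min=1.0)


# Example: 2 dialogues, 500 mel frames, 80 mel bins, 120 prompt frames masked.
pred, target = torch.randn(2, 500, 80), torch.randn(2, 500, 80)
print(masked_frame_loss(pred, target, prompt_frames=120))
```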

Table R1: WER (%) evaluation for monologue and dialogue models under different prompt-transcription conditions

| Scenario | Model | Correct trans | 50% wrong trans | 100% wrong trans |
| --- | --- | --- | --- | --- |
| Monologue | CosyVoice2 | 6.44 | 28.40 | 64.12 |
| Monologue | F5TTS | 4.41 | 33.96 | 67.14 |
| Monologue | MaskGCT | 2.62 | 37.70 | 40.81 |
| Dialogue | MoonCast | 2.83 | 6.16 | 25.37 |
| Dialogue | Sesame | 2.06 | 2.12 | 8.68 |

R2: Scalability and Robustness of Model (Weakness 2, Question 2)

Our current experiments are constrained by resource and data limitations, specifically a dataset containing only two speakers. Despite these limitations, our proposed framework is language-independent and can be readily extended to accommodate multiple speakers. Future work will focus on developing a multi-speaker, multilingual system with three or more speakers.

Nonetheless, our model exhibits robust performance during inference, even for dialogues three times longer than those in our training data (less than 30s). We conducted evaluations on four manually designed dialogues of varying durations, assessing SA-WER and SA-SIM as presented in Table R2. We found that CodiFM's performance begins to degrade after 90 seconds, a degradation also observed with MoonCast. Sesame fails to generate dialogues longer than 120 seconds. Furthermore, Sesame exhibited unstable results even under 30 seconds due to speaker confusion issues.

It is important to clarify that MoonCast, although designed for long-form generation, employs a chunk-wise Flow-matching detokenizer. For each turn in a long dialogue, it calls the Flow-matching detokenizer, requiring the audio of each speaker prompt to be provided every time. Therefore, it can't be considered a purely long-form generation model because the audio detokenizer still generates and concatenates multiple audio segments. Similarly, Sesame generates each utterance separately, relying on previous history, which means it is not a truly long-form generation model and is consequently less impacted by increased duration.

Table R2: SA-WER/SA-SIM evaluation of synthesized data at different durations

| Model | 30s | 60s | 90s | 120s |
| --- | --- | --- | --- | --- |
| Mooncast | 5.81/0.647 | 4.57/0.576 | 13.95/0.487 | 19.75/0.516 |
| Sesame | 24.41/0.677 | 5.22/0.585 | 3.98/0.263 | Failed |
| CodiFM | 6.97/0.686 | 5.88/0.648 | 14.74/0.618 | 14.64/0.627 |

Moreover, we have evaluated our model under extreme multi-turn dialogue generation scenarios. While our training data consists of segments shorter than 30 seconds, with 81% having fewer than two speaker changes and only 0.5% exceeding six speaker changes (maximum of 12), CodiFM successfully generated dialogues with 20 or more speaker changes, in which each speaker speaks only a single word, without errors. In contrast, baseline models failed in these challenging conditions due to severe speaker confusion. We will incorporate these challenging samples into the revised demo page.

R3: About similarity to existing works (Weakness 3 and Question 1)

To achieve state-of-the-art (SOTA) performance in dialogue speech synthesis, we have fully leveraged existing SOTA technologies in our model building. While this may give the impression that CodiFM is a mere combination of existing methods, these individual methods, on their own, do not fully address the critical challenges inherent in our specific scenario: generating a synthesized dialogue with accurate pronunciation by the correct speaker and appropriate turn-taking, including silences and overlaps.

Let’s clarify our contributions compared to existing work, which can be further illustrated in Table R3:

  1. Our work enables the generation of entire dialogues with accurate pronunciation by the correct speaker, effectively resolving the speaker confusion problem prevalent in prior AR-based models.

  2. We employ a multi-stream Transcription-Level Speaker Disentanglement strategy to enable the model to precisely understand "who speaks what and when".

  3. We propose the Prompt-Level Random Masking strategy to enhance in-context learning capabilities in dialogue scenarios. This strategy eliminates the need for prompt transcription, thereby removing the dependency on ASR tools.

Table R3: System comparison

| System | Arch | Duration | Monologue | Dialogue | Controllable Speed | Prompt's transcription | Silence/Overlap |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VoiceBox | NAR | Phoneme | Yes | No | No | Yes | No |
| E2TTS | NAR | Utterance | Yes | No | Yes | Yes | No |
| MoonCast | AR | / | Yes | Yes | No | Yes | No |
| Sesame | AR | / | Yes | Yes | No | Yes | No |
| CodiFM | NAR | Utterance | Yes | Yes | Yes | No | Yes |

R4: About the duplicated "Step 3" in Figure 1 (Weakness 4)

We appreciate you bringing this point to our attention. You're right to highlight the potential for confusion with "Step 3" in Figure 1. Our intention was to depict the training stage and the inference stage as separate processes, each with three distinct steps. We'll revise Figure 1 to clearly differentiate between the steps of the training process and the steps of the inference process, ensuring there's no ambiguity.

Finally, we would like to express our gratitude again for your time and effort in reviewing our paper. Considering this is the first attempt at non-autoregressive multi-talker dialogue generation and that we have incorporated extensive comparisons under challenging scenarios with previous works, we would appreciate it if you could consider increasing your score. Please do not hesitate to let us know if you have any further concerns or comments. We would be happy to address them.

Comment

Dear Reviewer W9cW,

We hope we have addressed your questions. Please let us know if you have any further concerns, as the discussion between the reviewers and authors will end soon. Thanks!

Best regards,

Authors

Final Decision

The submission introduces CodiFM, a fully non-autoregressive framework for zero-shot multi-speaker dialogue generation using flow matching. The paper tackles the problem of speaker consistency, overlapping speech, and efficient inference, while avoiding reliance on intermediate token representations. The rebuttal provided additional experiments showing robustness across prompt lengths, realistic noisy conditions, and extended dialogue durations. Major concerns of reviewers were addressed. Ethical concerns regarding dataset licensing and human evaluation standards were also addressed.

Overall, the strengths outweigh the weaknesses, and I recommend accepting this paper.