PaperHub

Rating: 5.3 / 10 (Poster, 3 reviewers; min 5, max 6, std 0.5)
Individual ratings: 5, 5, 6
Confidence: 4.3
Correctness: 3.0 · Contribution: 2.7 · Presentation: 3.0

NeurIPS 2024

CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

Submitted: 2024-05-12 · Updated: 2024-11-06
TL;DR

We introduce CoVoMix: Conversational Voice Mixture Generation, a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation

Abstract

Keywords

text-to-speech, dialogue generation, zero-shot

Reviews and Discussion

Review
Rating: 5

The authors introduce the CoVoMix framework for human-like monologue and dialogue generation. They point out the shortcomings of previous multi-speaker dialogue systems: the area is under-explored, and high-quality, spontaneous conversational datasets are lacking. The proposed CoVoMix achieves high naturalness and zero-shot speaker similarity in both monologue and dialogue generation, with proficiency in fluent dialogue turn-taking and spontaneous behavior generation.

Strengths

  1. The paper overall is easy to follow.
  2. It is the first attempt to propose zero-shot, human-like, multi-talker conversational mixed speech generation with a simultaneous multi-stream semantic token prediction and a multi-talker flow-matching-based acoustic model.
  3. The paper introduces dialogue evaluation metrics: Turn-taking Statistics, Para-linguistic Behaviors, and Speech Consistency.
  4. The example demo video extensively shows the generated conversation with naturalness and intelligibility.

Weaknesses

  1. While the paper shows objective and subjective evaluation results for monologue and dialogue generation, it does not explicitly compare performance against previous literature. The authors may compare the proposed model with previous dialogue papers.
  2. The authors state that they employ the Fisher dataset, which is curated for robust speech recognition. To better demonstrate generalization, I would suggest the authors utilize one or two more datasets to verify the effectiveness of the proposed model.

Questions

Please refer to the Weakness section.

Limitations

The authors have adequately addressed the limitations, noting that they occasionally observed instances of words being omitted or duplicated in synthesized speech.

Author Response

We sincerely appreciate your efforts in reviewing our paper and providing valuable and constructive feedback. We have implemented the Soundstorm model for dialogue synthesis as a previous-work baseline to compare with our CoVoMix model, which will be included in the revised version. The detailed responses are listed below.

R1: About comparison with previous work (Weakness 1)

We completely agree with your comments regarding the need to compare our proposed model with previous dialogue papers. We have also dedicated a significant amount of time to surveying this area. However, to the best of our knowledge, there are no suitable baselines available for comparison. Our work generates the entire dialogue at once, incorporating natural turn-taking behaviors such as overlap. In contrast, previous studies generate each utterance separately and then concatenate them, or generate each utterance sequentially without considering overlapping speech during speaker turn changes. The latter type of model, such as Soundstorm, is not officially open-sourced. Public implementations, however, were only evaluated on LibriSpeech, and there is no implementation for dialogue synthesis, which requires a special design for speaker turn changes.

To address your concerns, we implemented a SoundStorm-style baseline using the Fisher dataset based on our understanding of the paper. We utilized the EnCodec model for acoustic tokens (since Soundstream is not publicly available) and HuBERT for semantic tokens (the same as in CoVoMix). Soundstorm lacks a mechanism to handle two-channel speech for training and overlapping speech from two speakers. Therefore, we had to re-prepare the training and test datasets to ensure the speech was mono-channel with no overlapping parts. To isolate other effects, we used oracle semantic tokens for the model comparison, with results shown in Table R1.

Table R1: Objective evaluation results in non-overlapped dialogue test set across models

Model | SIM | MCD | NISQA
GroundTruth | 0.59 | / | 2.85
GroundTruth-EnCodec | 0.53 | 2.39 | 2.56
CoVoSingle | 0.47 | 4.41 | 2.88
SoundStorm | 0.25 | 5.60 | 2.49
CoVoMix | 0.46 | 4.98 | 2.88
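
For reference, here is a hypothetical sketch of the token-extraction step described above (discrete semantic tokens from HuBERT plus k-means, and acoustic tokens from EnCodec as a stand-in for SoundStream). Checkpoint names, the cluster count, and file paths are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch only: semantic tokens (HuBERT + k-means) and acoustic tokens (EnCodec).
# Checkpoint names, cluster count, and file paths are placeholders, not the authors' setup.
import torch
import torchaudio
from sklearn.cluster import KMeans
from transformers import HubertModel, Wav2Vec2FeatureExtractor
from encodec import EncodecModel
from encodec.utils import convert_audio

wav, sr = torchaudio.load("fisher_dialogue.wav")        # hypothetical mono-channel file
wav16 = torchaudio.functional.resample(wav, sr, 16000)

# Semantic tokens: HuBERT hidden states quantized with k-means (normally fitted on a large
# corpus; fitted on this single utterance here only to keep the sketch self-contained).
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
inputs = extractor(wav16.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    feats = hubert(**inputs).last_hidden_state.squeeze(0)  # [T, 768] at a 50 Hz frame rate
kmeans = KMeans(n_clusters=500, n_init=10).fit(feats.numpy())
semantic_tokens = kmeans.predict(feats.numpy())             # [T] discrete token IDs

# Acoustic tokens: residual vector-quantized codes from EnCodec.
encodec = EncodecModel.encodec_model_24khz().eval()
wav24 = convert_audio(wav, sr, encodec.sample_rate, encodec.channels)
with torch.no_grad():
    frames = encodec.encode(wav24.unsqueeze(0))
acoustic_tokens = torch.cat([codes for codes, _ in frames], dim=-1)  # [1, n_codebooks, T']
print(semantic_tokens.shape, acoustic_tokens.shape)
```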

R2: About extra dataset (Weakness 2)

We are eager to utilize one or two more datasets to verify the effectiveness of the proposed model. However, to the best of our knowledge, there are few multi-speaker conversational datasets available for training. One of our novel contributions is the proposal of a data-processing pipeline to leverage speech recognition datasets, such as the Fisher Dataset. We propose a comprehensive strategy for processing ASR datasets, including both training and evaluation for monologue and dialogue speech generation.

It is beneficial to utilize more datasets for training to improve model generalization. According to our observations, CoVoMix demonstrates a certain generalization capability across different domains. For example, the transcripts used for generating example demo videos are derived from DailyDialog[1], a dialogue dataset. We will consider simulating conversational datasets for training purposes in future work.

Ref: [1] Li, Yanran, et al. "Dailydialog: A manually labelled multi-turn dialogue dataset." arXiv preprint arXiv:1710.03957 (2017).

Finally, we would like to express our gratitude again for your time and effort in reviewing our paper. Considering this is the first attempt at multi-talker conversational mixed speech generation and that we have added a comparison with previous work, we would appreciate it if you could consider increasing your score. Please do not hesitate to let us know if you have any further concerns or comments. We would be happy to address them.

Comment

I agree with your statement that there are not enough datasets for evaluating or comparing the models' performance. I would also like to know what the major difference is between the proposed model and Voicebox [1] or Audiobox [2] in terms of model architecture. I believe those two models also utilize the flow-matching technique for generating speech.

[1] Le, Matthew, et al. "Voicebox: Text-guided multilingual universal speech generation at scale." Advances in neural information processing systems 36 (2024).

[2] Vyas, Apoorv, et al. "Audiobox: Unified audio generation with natural language prompts." arXiv preprint arXiv:2312.15821 (2023).

Comment

Yes, you are right. For speech generation, CoVoMix, Voicebox, and AudioBox all utilize the flow-matching technique. The model architecture for both CoVoMix and Voicebox is a transformer-based encoder. AudioBox, however, includes an additional transformer-based voice prompt encoder alongside the transformer encoder backbone. A detailed comparison of the model architectures can be found in Figure 6 of the appendix in our paper, with Figure 6a representing the Voicebox-style model and Figure 6d representing CoVoMix.

Our CoVoMix acoustic model differs from Voicebox in two main ways:

  1. The input features differ: CoVoMix utilizes semantic token sequences predicted by an autoregressive text-to-semantic model, while Voicebox receives a phoneme sequence whose phone duplication is predicted by a non-autoregressive duration predictor (AudioBox leverages raw character sequences).
  2. CoVoMix can receive multi-stream prompts and multi-stream semantic token sequences, with each stream representing one speaker. This allows it to generate overlapping speech, i.e., two people speaking simultaneously in one channel. Neither Voicebox nor AudioBox has a mechanism to handle this (a shape-level sketch of this multi-stream conditioning follows below).
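
For illustration, here is a rough, shape-level sketch of such multi-stream conditioning, assuming a PyTorch transformer encoder that takes a noisy mixed mel-spectrogram, two temporally aligned per-speaker semantic-token streams, and the flow-matching time step, and predicts the vector field for the mixed spectrogram. All module names, dimensions, and the omission of speaker prompts and masking are our own simplifications, not the authors' implementation.

```python
# Hypothetical shape-level sketch of multi-stream conditioning (not the authors' code):
# two per-speaker token streams are embedded, concatenated along the channel axis, and a
# transformer backbone predicts the flow-matching vector field for ONE mixed mel-spectrogram.
import torch
import torch.nn as nn

class MultiStreamAcousticSketch(nn.Module):
    def __init__(self, n_tokens=500, d_tok=128, n_mels=80, d_model=512):
        super().__init__()
        self.tok_emb = nn.Embedding(n_tokens, d_tok)
        # input per frame: noisy mel x_t + two embedded token streams + scalar flow time t
        self.in_proj = nn.Linear(n_mels + 2 * d_tok + 1, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.out_proj = nn.Linear(d_model, n_mels)  # predicted vector field over mel bins

    def forward(self, x_t, tokens_a, tokens_b, t):
        # x_t: [B, T, n_mels] noisy mixed mel; tokens_a/b: [B, T] aligned streams; t: [B]
        cond = torch.cat([self.tok_emb(tokens_a), self.tok_emb(tokens_b)], dim=-1)
        t_feat = t[:, None, None].expand(-1, x_t.size(1), 1)
        h = self.in_proj(torch.cat([x_t, cond, t_feat], dim=-1))
        return self.out_proj(self.backbone(h))

# usage: vector field at flow time t for a two-speaker mixture
B, T = 2, 200
model = MultiStreamAcousticSketch()
v = model(torch.randn(B, T, 80), torch.randint(0, 500, (B, T)),
          torch.randint(0, 500, (B, T)), torch.rand(B))
print(v.shape)  # torch.Size([2, 200, 80])
```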

Therefore, Voicebox alone cannot fully address the critical issues in our scenario. Human dialogues are naturally characterized by turn-taking behaviors, such as overlaps and hesitations, as well as non-verbal behaviors like laughter and coughing. The main reason is that Voicebox requires accurate phoneme-level alignment to train the duration predictor and acoustic model. However, achieving precise forced-alignment using conventional tools is challenging, especially for speech with spontaneous behavior, noisy backgrounds, and overlapping speech. These alignment inaccuracies can lead to significant performance degradation.

Thank you again for your time and effort in reviewing.

Comment

I appreciate that you pointed out the differences between the proposed model and the previous literature in terms of model architecture. In light of the responses to my questions and the potential for further development of dialogue generation, I will raise my rating to borderline accept.

Comment

Dear reviewer LEoq,

We appreciate your efforts in increasing the rating for our paper. Further suggestions and concerns are welcome until the end of the reviewer and author discussion period.

Thanks!

Authors

Review
Rating: 5

This study proposes a personalized speech synthesis model capable of generating monologues or dialogues. The study achieves this goal through the development of a text-to-semantic token model and the conversion of semantic tokens into mel-spectrograms. By utilizing the Fisher dataset, which contains natural speech characteristics, the proposed model is able to produce naturalistic utterances that include paralinguistic components such as laughter and coughing.

Strengths

  1. Dialogue generation, which involves converting given conversations into speech, has not been sufficiently explored. This study implements a dialogue generation model that not only addresses this gap but also incorporates personalization features.

  2. Various metrics that facilitate evaluation have been devised.

  3. The included demo provides a convenient way to distinguish the strengths of their model, and the various figures in the paper help facilitate understanding of the text.

Weaknesses

  1. The task in this study is similar to existing dialogue-to-speech tasks, with the addition of personalization features. However, the method of adding personalization does not appear to be novel or specifically tailored for dialogue. Instead, it seems to be a combination of existing approaches.
  • The text-to-semantic token approach has already been widely adopted in many previous studies, particularly with recent advancements in personalized speech synthesis. The use of an autoregressive method, as employed in this study, has been extensively researched. The authors' contribution appears to extend this to dialogue data. However, the approach of using an autoregressive text-to-semantic prediction model for dialogue has already been utilized in Soundstorm.
  • The acoustic model handles the personalization aspect and shows a structure almost identical to Voicebox. Similarly, apart from an increase in channels compared to the original Voicebox, there seem to be no additional distinguishing features.
  2. The paper mentions that Soundstorm generates in a sequential manner, but it should be noted that Soundstorm also uses an autoregressive model for text-to-semantic conversion and a non-autoregressive model based on MaskGIT for the acoustic model, similar to the approach in your model. Therefore, the statement in the paper that "generated in a sequential manner and thus sounds less realistic" could equally apply to CoVoMix. Concerns are raised that issues such as spontaneous behaviors might stem from the differences in the data used for training rather than the proposed method itself. Given that Soundstorm is one of the few spoken dialogue generation models that support personalization, it seems necessary to compare models trained on the same data.

Questions

  1. I am curious if the HuBERT tokens used in this study contain any speaker information at all. Specifically, when performing personalization in the acoustic model, could any speaker information potentially present in the HuBERT tokens degrade the personalization performance?

  2. To model laughter, it seems that the semantic tokens must include information about laughter. I am curious whether the self-supervised model that incorporates such laughter information in the semantic tokens is limited to the Fisher dataset, or if the speech tokenizer is designed to work robustly with other datasets as well.

  3. Unofficial Implementations of Soundstorm have been made publicly available on the web. For example, https://github.com/ZhangXInFD/soundstorm-speechtokenizer offers an acoustic model capable of personalized speech synthesis using semantic tokens, similar to your approach. I'd like to ask if you have considered comparing your model with Soundstorm using publicly available implementations.

Limitations

They addressed these aspects in the conclusion, limitations, future work, and broader impacts sections.

Author Response

We sincerely appreciate your efforts in reviewing our paper and providing us with valuable, constructive feedback. We added a table to highlight the unique aspects of our model in comparison with previous works. Additionally, we implemented the Soundstorm-style baseline for dialogue synthesis to compare with our CoVoMix model. The detailed responses are listed below.

R1: About similarity to existing works (Weakness 1)

To achieve state-of-the-art (SOTA) performance in dialogue speech synthesis, we have fully leveraged existing SOTA technologies in our model building. This is why CoVoMix appears to be a combination of existing methods. However, these methods alone could not completely address the critical issues in our scenario: human dialogues are naturally characterized by turn-taking behaviors, such as overlaps and hesitations, as well as non-verbal behaviors, such as laughter and coughing.

Let us clarify our contributions compared to existing work, which are further illustrated in Table R1:

  1. Our work generates the entire dialogue at once. In contrast, previous studies generate each utterance separately and then concatenate them, or generate each utterance sequentially without considering overlapping speech during speaker turn changes, such as in Soundstorm.
  2. We employ simultaneous multi-stream semantic token prediction from dialogue text, with each stream representing an individual talker. We use a multi-talker flow-matching-based acoustic model to generate a mixed mono mel-spectrogram given multiple contexts.
  3. We propose a comprehensive strategy for processing ASR datasets, including both training and evaluation for monologue and dialogue speech generation.

Table R1: System comparison

System | Semantic | Acoustic | Data | Monologue | Dialogue | Non-verbal behaviors | Turn-taking | Overlapping speech
VoiceBox | Phoneme | Mel-spectrogram (flow-matching) | High-quality monologue | Yes | No | No | No | No
SoundStorm | Semantic tokens | Codec tokens (NAR) | High-quality monologue and dialogue (internal) | Yes | Yes | No | Yes | No
CoVoMix | Multi-stream semantic tokens | Mel-spectrogram (multi-stream flow-matching) | ASR dialogue | Yes | Yes | Yes | Yes | Yes

R2: About comparison with SoundStorm (Weakness 2, Question3)

The main difference between Soundstorm and CoVoMix lies in the generation process. Although Soundstorm is a non-autoregressive model, its dialogue generation pipeline relies on an autoregressive text-to-semantic model, which generates speaker A's and B's content in a sequential ABABAB manner and thus fails to produce overlapping content. In contrast, CoVoMix generates multiple streams of semantic tokens (including silence tokens) in parallel, with each stream corresponding to one speaker. Since these streams are temporally aligned, multiple speakers may speak simultaneously, resulting in a more natural dialogue with overlapping speech.
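
To make this concrete, here is a toy, made-up illustration of two temporally aligned token streams (words stand in for the actual discrete HuBERT cluster IDs, which are integers at a fixed frame rate); frames where both streams carry non-silence tokens correspond to overlapping speech.

```python
# Illustrative, made-up example of two temporally aligned semantic-token streams.
# "sil" marks silence tokens; real streams contain integer cluster IDs, not words.
stream_a = ["hi", "hi", "how", "are", "you", "sil", "sil", "sil", "yeah", "yeah"]
stream_b = ["sil", "sil", "sil", "sil", "mm", "mm", "good", "good", "good", "sil"]
overlap = [i for i, (a, b) in enumerate(zip(stream_a, stream_b))
           if a != "sil" and b != "sil"]
print(overlap)  # [4, 8]: frames where both speakers are active at the same time
```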

Regarding your concerns that spontaneous behaviors might arise from differences in the training data, this is one of the critical issues we have addressed. We have proposed a comprehensive strategy for processing ASR datasets, including both training and evaluation for monologue and dialogue generation.

Since Soundstorm is not officially open-sourced, we have considered comparing CoVoMix with publicly available implementations of Soundstorm. However, these implementations were only evaluated on LibriSpeech, and thus there is no implementation for dialogue synthesis, which requires special design for speaker turn changes. Instead, we implemented a Soundstorm-style baseline using the Fisher dataset based on our understanding of the paper. We utilized the EnCodec model for acoustic tokens (since Soundstream is not publicly available) and HuBERT for semantic tokens (the same as in CoVoMix). Since Soundstorm is unable to handle two-channel or overlapping speech for training, we had to re-prepare the training and test datasets to ensure the speech was mono-channel with no overlapping. To isolate other effects, we used oracle semantic tokens for the model comparison shown in Table R2.

Table R2: Objective evaluation results in non-overlapped dialogue test set across models

Model | SIM | MCD | NISQA
GroundTruth | 0.59 | / | 2.85
GroundTruth-EnCodec | 0.53 | 2.39 | 2.56
CoVoSingle | 0.47 | 4.41 | 2.88
SoundStorm | 0.25 | 5.60 | 2.49
CoVoMix | 0.46 | 4.98 | 2.88

R3: About speaker information in HuBERT tokens (Question 1)

We did notice that HuBERT tokens contain speaker information. As shown in Appendix B.2, there is a larger speaker similarity gap between models using predicted and oracle semantic tokens than phoneme sequences. We have attempted to solve this problem using various methods, such as extracting semantic tokens from voice-converted utterances to remove the original speaker identity during the training stage. However, the performance has not met our expectations and is therefore not reported in the paper. We will continue investigating this issue in the future.

R4: Robustness of Laughter in Speech Tokenizer (Question 2)

We believe that the HuBERT speech tokenizer trained on the Fisher dataset contains tokens representing laughter. Laughter can be generated by providing a tag in the text or produced automatically based on contextual information. In addition, we observed that CoVoMix shows a certain generalization capability across different domains. For example, the transcripts used for generating demo videos are derived from DailyDialog (a dialogue dataset) with manually annotated positions of laughter.
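
For illustration only, a hypothetical tagged transcript of the kind described above; the "[laughter]" tag and the (speaker, text) format are our own assumptions, and the actual annotation format used in the paper may differ.

```python
# Hypothetical dialogue transcript with a manually annotated laughter position.
# The "[laughter]" tag and the (speaker, text) tuple format are illustrative assumptions.
dialogue_transcript = [
    ("A", "I can't believe you actually tried that [laughter]"),
    ("B", "Well, someone had to go first."),
]
```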

However, according to reference [8], HuBERT trained on Fisher shows degraded performance when generalized to LibriSpeech due to domain mismatch.

Finally, we hope we have addressed all your concerns. We would appreciate it if you could consider increasing your score.

Comment

Thank you for your kind response. I have one question. In the official SoundStorm demo, the first video seems to contain overlapping speech, and the second spoken dialogue sample seems to contain some nonverbal cues such as laughter. This appears to be slightly different from what you mentioned. Could you please provide an explanation regarding this?

Comment

Thank you for bringing our attention to the demos from Soundstorm. As you mentioned, we also noticed overlapping speech and laughter in their demos. Here is our explanation:

According to the description of dialogue synthesis in the Soundstorm paper, there is no special mechanism to handle overlapping speech and nonverbal behaviors such as laughter. However, they do use a symbol to indicate speaker turns. We believe the generated overlapping speech and laughter were unintentionally learned from their training corpus, which contains 100,000 hours of dialogues, including samples of overlapping speech and laughter. This corpus is almost 100 times larger than what we used. Additionally, both the semantic token model and the codec model were trained using the same corpus. Therefore, there might be semantic tokens that can represent overlapping speech/laughter, and codec tokens that can render overlapping speech/laughter. However, this generation is based on context and model capability, meaning users cannot control it.

In contrast, our CoVoMix leverages a multi-stream semantic model and a multi-stream flow-matching acoustic model, allowing users to control the length of overlapping speech and the position of laughter, in addition to automatic generation for a given context.

Thank you again. We have revised the corresponding part in the paper accordingly.

Comment

Dear Reviewer yp1s,

We hope we have addressed your questions. Please let us know if you have any further concerns, as the discussion between the reviewers and authors will end soon. Thanks!

Best regards,

Authors

Comment

I have understood your comments, and I appreciate your kind response. Accordingly, I have slightly adjusted the score upward.

Comment

Dear reviewer yp1s,

We appreciate your efforts in increasing the rating for our paper. Further suggestions and concerns are welcome until the end of the reviewer and author discussion period.

Thanks!

Authors

Review
Rating: 6

This paper proposes CoVoMix, a zero-shot TTS model for multi-speaker conversations. CoVoMix consists of a multi-stream text-to-units model, a flow-matching acoustic model for mixed-speaker spectrogram generation, and a HiFi-GAN vocoder for waveform generation. The major contribution of CoVoMix is that it is one of the first attempts to generate natural multi-talker conversations in a zero-shot manner, and according to experimental results and demos, CoVoMix is able to synthesize natural pauses, overlap, and laughter in conversation.

Strengths

  1. CoVoMix is able to generate natural two-speaker dialogue from text in a zero-shot manner without additional input on spontaneous behaviors like laughter. Objective and subjective evaluations further substantiate this.
  2. This line of work is a crucial step towards natural human-machine dialogue. There are prior TTS works like CHATS that also support natural turn-taking, and CoVoMix further extends this to support zero-shot generation, so that users can designate the speaker's voice.

Weaknesses

  1. Unclear writing in the method section. In lines 149-152, the authors say "we divide the semantic embedding into two distinct segments in the final linear layer of the decoder"; does this mean dividing each embedding into two, or dividing the embedding sequence into two halves?
  2. Insufficient baseline comparison in the experiments. CoVoMix is only compared to CoVoSingle and a baseline sharing a similar architecture. Though many of the prior works are not public, there are open-sourced TTS models like SoundStorm, and public but not open-sourced models like GPT-4o. More baseline comparisons would provide readers with a better sense of CoVoMix's performance.
  3. The authors use speaker diarization for turn-taking statistics, but not for speaker similarity in dialogue, so we basically do not know whether it can recover the target speaker's voice in dialogue generation.
  4. Speech consistency is only evaluated based on speaker similarity, rather than the flow of speech, whether laughter is appropriate, etc.

Questions

  1. Clarify lines 149-152.
  2. The authors mention that "the potential errors in speaker diarization could impact the fairness of the comparison". Are there any examples or numbers that can be shared so we can better understand the situation? Also, why is it hard for human evaluation?
  3. In speech consistency, CoVoMix shows better consistency across different utterances than CoVoSingle. But CoVoSingle synthesizes each sentence given the same target speaker prompt, right? How can this be? Or is this speaker similarity measuring characteristics beyond speaker identity?

Limitations

  1. CoVoMix is only applied to the two-talker scenario rather than more speakers.
  2. CoVoMix lacks the option for personalizing turn-taking. The turn-taking phenomenon itself varies from person to person and even the same person in different moods.

Author Response

We sincerely appreciate your efforts in reviewing our paper and providing us with valuable, constructive feedback. We have clarified some unclear writing that caused misunderstandings and added a baseline system for comparison, as well as a speaker similarity subjective test for dialogue, which will appear in the revised version. The detailed responses are listed below.

R1: About unclear presentation of Line 149-152 (Weakness 1, Question 1)

Lines 149-152 indicate that we need to divide the semantic embedding into two halves. For instance, if there is a semantic embedding of shape [B,T,D], we divide it into [B,T,:D/2] and [B,T,D/2:] to obtain these two embeddings. We have revised the paper to clarify this.
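
For clarity, a minimal PyTorch illustration of the split described above; the tensor shapes are illustrative placeholders.

```python
# Minimal illustration of splitting a [B, T, D] semantic embedding along the feature
# dimension into one half per speaker stream; B, T, D values are illustrative.
import torch

B, T, D = 4, 100, 512
semantic_embedding = torch.randn(B, T, D)
emb_speaker_a = semantic_embedding[:, :, : D // 2]  # [B, T, D/2]
emb_speaker_b = semantic_embedding[:, :, D // 2 :]  # [B, T, D/2]
# equivalently: emb_speaker_a, emb_speaker_b = semantic_embedding.chunk(2, dim=-1)
```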

R2: About insufficient baseline comparisons (Weakness 2)

We agree that providing more baseline comparisons would give a better understanding of CoVoMix’s performance. We have also spent a significant amount of time surveying this. However, to the best of our knowledge, there are no suitable baselines available for comparison. For instance, Soundstorm models and codes are not officially open-sourced. The publicly available implementation was only evaluated using LibriSpeech. Therefore, there is no Soundstorm implementation for dialogue synthesis, which requires a special design for speaker turn changes. The original paper also lacks sufficient details on this aspect. Furthermore, GPT-4o was not accessible during our research period.

We have tried to reproduce the dialogue model of Soundstorm based on our understanding of the paper, but the results were not as reported, so we did not include those results in the paper. To address your concerns, we continued implementing a SoundStorm-style baseline using the Fisher dataset. We utilized EnCodec for acoustic tokens (since Soundstream is not publicly available) and HuBERT for semantic tokens (the same as in CoVoMix). Soundstorm lacks a mechanism to handle two-channel speech for training, and consequently, it does not model overlapping speech from two speakers effectively. Therefore, we had to re-prepare the training and test datasets to ensure the speech was mono-channel with no overlapping parts. To isolate other effects, we used oracle semantic tokens for the model comparison shown in Table R1.

Table R1: Objective evaluation results in non-overlapped dialogue test set across models

Model | SIM | MCD | NISQA
GroundTruth | 0.59 | / | 2.85
GroundTruth-EnCodec | 0.53 | 2.39 | 2.56
CoVoSingle | 0.47 | 4.41 | 2.88
SoundStorm | 0.25 | 5.60 | 2.49
CoVoMix | 0.46 | 4.98 | 2.88

R3: About speaker similarity in dialogues (Weakness 3)

To address your concerns, we use oracle diarization results for the similarity evaluation shown in Table R1. We also added a subjective evaluation of speaker similarity in dialogue, shown in Table R3. Ten randomly selected dialogues were manually segmented into multiple single-speaker utterances to avoid speaker diarization errors. (For turn-taking statistics, we had to leverage an automatic diarization system due to the large size of the whole test set.) We had 15 linguistic experts evaluate the speaker similarity of these utterances compared to the prompt speaker.

Table R3: Speaker similarity across models for dialogue generation

Model | SMOS
CoVoSingle | 0.00
CoVoMix | 0.60

R4: About speech consistency and flow of speech (Weakness 4)

We agree that speech consistency should also consider the flow of speech and appropriate laughter, in addition to speaker similarity, which is predominantly used in related papers. In our dialogue naturalness subjective test, we provided the following specific guidelines (please refer to Fig. 12 in the appendix):

...evaluating how closely the dialogue resembles a natural conversation in terms of fluency, the rhythm, the intonation ... Consider how seamlessly the conversation flows from one speaker to the other, the appropriateness of pauses, and how these transitions contribute to a realistic conversational experience

Consequently, the flow of speech and appropriate laughter have already been taken into account in the subjective test scores.

Moreover, we added Table R4 to show the distribution of F0, demonstrating that CoVoMix produces speech more similar to the Ground Truth in multi-turn dialogues.

Table R4: F0 distribution across models for dialogue generation

Model | F0
GroundTruth | 253 ± 80
CoVoSingle | 229 ± 87
CoVoMix | 255 ± 59
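
As a side note, here is a hedged sketch of how such F0 statistics could be computed with an off-the-shelf pitch tracker (librosa's pYIN); the file name is a placeholder, and this is not necessarily the authors' exact procedure.

```python
# Hypothetical F0 mean/std computation over voiced frames using librosa's pYIN tracker.
# The audio file name is a placeholder; this may differ from the authors' pipeline.
import librosa
import numpy as np

y, sr = librosa.load("generated_dialogue.wav", sr=16000)
f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
f0_voiced = f0[voiced_flag & ~np.isnan(f0)]
print(f"F0: {f0_voiced.mean():.0f} \u00b1 {f0_voiced.std():.0f} Hz")
```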

R5: About the potential errors in speaker diarization (Question 2)

We use an open-source diarization tool, pyannote. It achieves an 11.9% Diarization Error Rate in its original domain (AMI dataset), but it may be less accurate on the Fisher Dataset due to domain mismatch.
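
For reference, a minimal usage sketch of a pyannote diarization pipeline; the checkpoint name, version, and access token are assumptions and may differ from the authors' setup.

```python
# Hypothetical pyannote diarization usage; pipeline name and token are placeholders.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="HF_TOKEN"  # hypothetical token
)
diarization = pipeline("generated_dialogue.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.2f}s - {turn.end:.2f}s")
```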

The challenge for human evaluation lies in the co-appearance of multiple talkers in a dialogue. An interfering speaker can influence speaker perception during human evaluation. Moreover, although humans can easily distinguish between two speakers of different genders, more than 60% of the utterances in the Fisher dataset involve speakers of the same gender. To address this issue, we manually segmented the test utterances into several single-speaker segments and had linguistic experts evaluate the speaker similarity. The results are shown in Table R3.

R6: About speech consistency of CoVoSingle (Question 3)

Speech consistency involves comparing the speaker similarity among different segments within the same dialogue. Although CoVoSingle demonstrates high zero-shot capability, the stochastic sampling of the flow-matching model makes it unlikely to produce identical speaker characteristics across multiple utterances generated from the same prompt. In contrast, CoVoMix addresses this consistency issue by generating the entire dialogue at once, rather than through multiple generations and concatenation.

Finally, we would like to express our gratitude once again for your time and effort in reviewing our paper. We would greatly appreciate it if you could consider increasing your score.

Comment

Most of my concerns are answered and I've increased my score.

Comment

Dear reviewer XEst,

We appreciate your efforts in increasing the rating for our paper. Further suggestions and concerns are welcome until the end of the reviewer and author discussion period.

Thanks!

Authors

Author Response

Dear Reviewers,

Thank you for your efforts in reviewing our paper. We greatly appreciate your acknowledgment of our contributions, including our first attempt at zero-shot, human-like, multi-talker conversational mixed speech generation, the various metrics to facilitate evaluation, and the good demos and figures for demonstration.

Regarding the main concern about the comparison with previous work, such as Soundstorm, we have dedicated a significant amount of time to surveying this area. However, to the best of our knowledge, there are no suitable baselines available for comparison. Although there are publicly available implementations of Soundstorm, these implementations were only evaluated on LibriSpeech and thus do not include dialogue synthesis, which requires special design for speaker turn changes. We have implemented a Soundstorm-style baseline using the Fisher dataset based on our understanding of the paper. We did not use it for comparison in the paper because the results were not as expected. We will continue to improve it and add the results to address your concerns.

Additionally, we will include new results of the speaker similarity subjective test for dialogue generation in the revised version.

Please check the details in the responses to the individual reviewers.

Thanks again!

Authors

Final Decision

CoVoMix represents a novel approach to synthesizing multi-talker (though specifically two-speaker) conversations. Zero-shot speech generation has received a substantial amount of attention. The novelty in this approach is the use of a multi-stream model to natively synthesize overlapping speech (a necessary component of natural human-human conversation, as backchanneling and smooth turn-taking involve overlapping speech).

The major weakness of this work is that the comparison to other similar single-stream approaches is limited. This is a very fair criticism, and one that the authors should take seriously. The discussion around the overlapping speech in the SoundStorm demos suggests that the model (benefiting from a lot of data) is, in fact, able to generate overlapping speech using a single-stream model. Since the semantic tokens used in SoundStorm represent overlapping speech, it is not obvious that this is not controllable, only that the semantic token prediction needs access to both speakers' content. Of course, this version of SoundStorm is not publicly available, making this investigation particularly challenging.