PaperHub
7.8 / 10
Poster · 4 reviewers
Scores: 3, 3, 5, 5 (min 3, max 5, std 1.0)
ICML 2025

Aligning Spoken Dialogue Models from User Interactions

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

Aligning a real-time multistream spoken dialogue model with user interaction data and AI feedback

Abstract

Keywords
Speech Alignment · Audio Language Model · Conversational Model

Reviews and Discussion

Review
Rating: 3

This paper introduces an alignment framework for full-duplex spoken dialogue models (like Moshi). The authors construct preference data from real user interaction data and then fine-tune a spoken dialogue model (Moshi) using direct preference optimization (DPO). Preference pairs are constructed by using a model to first find a wrong answer and then revise it into a correct answer. The proposed alignment method improves the model's performance on question answering tasks and model safety. The authors also conduct a human evaluation to assess the model's coherence, engagement and helpfulness.

Questions for Authors

In Equation (2), the context includes the speech generated by the model, while the numerator of the probability corresponds to the text portion of the output. Why is it designed this way? It seems somewhat counterintuitive, as the speech and text generated by the model should be parallel sequences without a sequential relationship.

Claims and Evidence

The proposed method improves the model's performance on question answering tasks and model safety. The authors conduct comprehensive experiments on their aligned models, including benchmark results and human evaluations. The authors also conduct detailed ablation studies to explore the effectiveness of different implementation settings.

Methods and Evaluation Criteria

  1. The Methods section introduces too much background knowledge, such as the details of Moshi and DPO.
  2. The proposed method has limited innovation, with its main contribution focused on how to construct preference data pairs.
  3. Although the paper is about aligning full-duplex spoken dialogue models, the evaluation mainly focuses on question answering, safety and multi-turn ability (like consistency and engagement), ignoring the features of full-duplex models.

Theoretical Claims

The article contains relatively little theoretical proof.

Experimental Design and Analysis

Overall, the experimental design is relatively solid. The writing in the experimental section is somewhat disorganized. For example, Section 4.1 mentions several data ratios, but I couldn't find the corresponding experiments in the results section.

Supplementary Material

There is no supplementary material.

Relation to Prior Work

The paper primarily relies on Moshi as the base model and DPO as the preference learning algorithm. It specifically designs a method for constructing preference data and related human evaluation methods.

Missing Important References

No.

Other Strengths and Weaknesses

See Above

Other Comments or Suggestions

Line 246: No corresponding content in Appendix C. Equation 1 and Equation 3 are the same equation.

Author Response

We are thankful for your time and careful feedback. We answer specific points below:

Question

In Equation (2), [...], a sequential relationship.

The model’s text and audio must have some dependency pattern, otherwise the two streams would quickly have different content.

Other patterns have been explored in the literature [1], such as first generating all text tokens, then all audio tokens, however it limits the possibility to interrupt at any time (if the model gets interrupted in the middle of generating the audio, we would need to backtrack to erase the text that was never voiced), and increases latency.

Co-generating both audio and text tokens in an interleaved and auto-regressive fashion has no such limitations. While we only penalize the text contribution to the likelihood of a trajectory in the DPO loss, the model is trained to be fed with both the text and audio part of a trajectory, and would be largely out of domain if only fed with text.
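
To make the masking concrete, here is a minimal sketch (not the authors' implementation) of a DPO loss in PyTorch where the trajectory log-likelihood is summed over text-stream positions only, while audio tokens remain part of the model's input; the tensor layout and the names (`policy_logps`, `text_mask`, etc.) are assumptions made for illustration.

```python
import torch.nn.functional as F


def multistream_dpo_loss(policy_logps, ref_logps, text_mask, beta=0.1):
    """DPO loss restricted to the text stream of a multistream trajectory.

    policy_logps, ref_logps: dicts with keys "chosen" and "rejected", each a
    (batch, seq_len) tensor of per-token log-probabilities over the
    interleaved text/audio sequence.
    text_mask: dict with the same keys, holding boolean (batch, seq_len)
    masks that select text-stream positions only.
    """
    def text_logp(logps, mask):
        # Sum log-probs over text positions; audio tokens are still fed to
        # the model as context but do not enter the preference objective.
        return (logps * mask.float()).sum(dim=-1)

    pi_chosen = text_logp(policy_logps["chosen"], text_mask["chosen"])
    pi_rejected = text_logp(policy_logps["rejected"], text_mask["rejected"])
    ref_chosen = text_logp(ref_logps["chosen"], text_mask["chosen"])
    ref_rejected = text_logp(ref_logps["rejected"], text_mask["rejected"])

    # Standard DPO objective applied to the text-only likelihood ratios.
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()
```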

Concerns on Methods & Evaluation

The Methods section introduces [...] Moshi and DPO.

We will move the content on Moshi to a different section, and reorganize the DPO part to focus more on our adaptation to the multistream setting. We hope this revision could make the methods section clearer, while keeping the paper self-contained.

The proposed method has limited innovation [...] preference data pairs.

We believe that the key observations are more nuanced:

  • To the best of our knowledge, this is the first work studying how to enhance full-duplex spoken dialogue models with large-scale interaction data.
  • How to construct and mix preference pairs from multi-turn dialogues itself is not straightforward, especially given that spoken conversations present unique characteristics (e.g. overlaps and silences), have a different distribution than the written text (e.g., more concise and simple sentences), with a much higher number of turns. We believe that our designs and reported results can be interesting for the community and practitioners.
  • Through extensive experiments, we show that using synthetic voice data can still significantly improve model performance. This offers several practical advantages, such as preserving privacy and mitigating the challenges of collecting extensive human speech corrections in real-world applications (much harder than for text).

Although the paper is about aligning full-duplex [...] models.

We agree that full-duplex spoken dialogue models exhibit a wide range of interesting phenomena, from the semantic content to the temporal dynamics, at turn-level and dialogue-level.

As assessing conversational dialogue models is a complex task [2], our work mainly aims to improve the quality of the content (in particular, the turn-level factual correctness, safety, the high-frequency timing problems, and the multi-turn dialogue-level quality) in the spoken dialogue, that we disentangled from other specific phenomena that require more fine-grained and targeted assessments.

Evaluation of full-duplex models is also a very nascent area. We are aware of some very recent work on evaluating the turn-taking behaviour [3, 4], but they are not fully open-sourced yet. We agree that the evaluation and improvement of temporal dynamics for conversational models are an important and very timely direction for future work.

Experimental Designs & Other Comments

Experimental Designs Or Analyses

Thank you for the feedback. We will refine the flow of the experimental section to better link with the results. For the data ratios, we will clearly update them in Table 2.

In particular:

  • Type-A corresponds to the 20% with content-only issues.
  • Type B+C corresponds to the 57% with timing-only issues.
    • Within the timing-only issues:
      • Type-B: 18% is the model cutting the user.
      • Type-C: 72% is the model not answering within an appropriate time.

Line 246: No corresponding content in Appendix C

The reference refers to Appendix C of the cited paper [5] for details on TTS (on page 56). We will revise to make the point clearer.

Equation 1 and Equation 3

Thank you for the feedback. We wanted to re-emphasize the equation after changing the notation of the policy, but we agree that it is redundant and will remove it.

Please let us know if the answers addressed your questions, and if we can address further questions you might have. Thanks again!

[1] Nachmani, E., et al. "Spoken question answering and speech continuation using spectrogram-powered llm." ICLR, 2024.

[2] See, A., et al. "What makes a good conversation? how controllable attributes affect human judgments." ACL, 2019.

[3] Arora, S., et al. "Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics." ICLR, 2025.

[4] Lin, G., et al. "Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities." arXiv, 2025.

[5] Défossez, A., et al. "Moshi: a speech-text foundation model for real-time dialogue." arXiv, 2024.

Review
Rating: 3

This paper integrates DPO and its variants into Moshi, a full-duplex voice interaction model, to enhance several aspects such as content and timing. To accomplish this, the authors collected a dataset and used it for training and evaluation. Notably, aside from concurrent research, all published studies on end-to-end pipeline-based voice interaction models have relied on supervised fine-tuning, underscoring the significance of this approach.

Update after rebuttal

I have considered the authors' rebuttal. The additional evidence and clarifications provided resolved the questions I previously raised. Consequently, my positive opinion on this submission is unchanged.

Questions for Authors

1. Loss of context-based prosody

  • In the data generation process (synthesizing speech from the user and the LLM's answer), don't the prosody and naturalness that depend on the conversation context get lost? Doesn't this weaken Moshi's strength? My understanding is that Moshi synthesizes entire spoken dialogs for data augmentation, not single utterances, which helps maintain the naturalness of a spoken dialog between two speakers. If this paper is meant to demonstrate or verify the possibility of applying DPO, then I believe it should test one or two additional backbone models to demonstrate the method's general applicability. On the other hand, if the goal is to propose a new model, then there should be experiments or demos focusing on acoustic aspects, which are one of the backbone model's strengths and a key element of a voice interaction model.

2. Effect of reinforcement learning on text-to-speech

  • If reinforcement learning is only applied to the text, does it affect text-to-speech performance at all? For example, does pronunciation accuracy get worse? I’m curious about any difference in pronunciation accuracy before and after DPO.

3. Methodological novelty

  • Apart from applying the DPO family of methods to a voice interaction model, there does not seem to be much additional methodological novelty—though I recognize it may be the first approach of its kind.
  • The motivation for collecting training data directly from people seems a bit weak. In the end, the data collected from humans is resynthesized through TTS (for privacy). Recently, many works have built spoken dialog datasets using TTS models. Is the main benefit of your approach that you can preserve the timing of human interventions in conversations? Could we simply use an LLM to create written dialog data (including the timing for interventions), and then apply TTS to produce a synthetic dataset?

4. Plans for open release

  • I’m curious if there are plans to release the data and model.

5. Lack of acoustic metrics and demos

  • I would like to see an audio demo, or at least some metrics that evaluate the acoustic aspects. I’m curious to see how the reinforcement learning process affects the acoustic quality.

Claims and Evidence

Based on what is presented in the paper, there do not appear to be any overclaims or similar issues. However, there are some additional questions I would like to raise, which are listed below.

Methods and Evaluation Criteria

From what I can see, there are no issues in evaluating the aspects they aim to measure—such as safety, context, and timing.

Theoretical Claims

They directly applied formulas from the standard DPO methodology and its variants without any apparent problems.

Experimental Design and Analysis

The experimental design and analysis seem sound.

Supplementary Material

I checked the Appendix to see how their data was constructed.

Relation to Prior Work

Aside from a single study released around the same time, this is the first instance I’ve seen of applying an RLHF-like approach to the content of a voice interaction model, which I believe is novel. Otherwise, there don’t appear to be any further concerns.

Missing Important References

The authors have cited previous work appropriately.

Other Strengths and Weaknesses

Strengths

  • Their trial on data creation, the design of each evaluation, and their first-time incorporation of RLHF into the model are definite strengths.
  • Additionally, their explanation of Moshi, the backbone of their model, is clear and easy to follow.

Any weaknesses have been noted in the “Questions For Authors.”

Other Comments or Suggestions

Listed in Questions for Authors

Author Response

We thank the reviewer for the thorough feedback, and for acknowledging that our work presents a novel contribution for the alignment of voice interaction models.

Questions & Concerns

Loss of context-based prosody

We agree that discarding the original audio (for privacy) loses some audio information, however:

  • we want to clarify that the TTS is done for the user's audio, and for Moshi we used its own generated tokens (except for the generated preferred response, also synthesized);
  • as the resynthesized speech retains the original timestamps, some aspects of the prosody are kept, such as the rhythm and speech rate;
  • for the axes we focused on in the work, the original audio is not absolutely necessary.

Also the original Moshi paper [1] uses synthetic generation for the instruction stage.

Given that in a number of regions in the world, strong protection laws prevent or limit the recording or storage of private and biometric attributes such as a person's voice, our method presents important applications for improving speech systems in a privacy-preserving manner, while respecting local legislations.

The model choice is because Moshi was the only available open-source full-duplex speech-to-speech model when we conducted the project, and for cost/feasibility reasons. We agree that more work is required to further verify the extension to other models, and we will discuss this point in the limitations. We note that the method for building the dataset mixes from multi-turn dialogues is model-independent.

Effect of reinforcement learning on text-to-speech

We computed the WER (lower is better) between each model’s text tokens and Whisper’s transcriptions, before and after alignment, on our human evaluation dataset.

For Moshi (the "matched" voice), after alignment, we observed a slight improvement. For M-Alt-Vox (the "mismatched" voice, i.e. the model with a voice different from our synthesized preference data), the WER rose a bit, suggesting that adapting to a voice with different characteristics may show mixed effects.

Model                 WER (%)
Moshi-Instruct        5.70
Moshi-Aligned         4.89
M-Alt-Vox-Instruct    3.78
M-Alt-Vox-Aligned     5.88

Note that, because of how the conversations are conducted, they are not constrained to be the same, so the numbers indicate an aggregated trend, which confirms that pronunciation accuracy is overall preserved during the alignment process.
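
As a rough illustration of how such a check can be run with off-the-shelf tools, the sketch below compares a model's emitted text stream against Whisper's transcription of the generated audio, using the openai-whisper and jiwer packages; the data layout and helper name are assumptions, not the authors' exact pipeline.

```python
import jiwer
import whisper


def pronunciation_wer(samples, whisper_size="large-v3"):
    """Proxy for pronunciation accuracy: WER between the text tokens a model
    emitted and Whisper's transcription of the audio it generated.

    samples: list of dicts with "model_text" (decoded text stream) and
    "audio_path" (path to the corresponding generated speech).
    """
    asr = whisper.load_model(whisper_size)
    references, hypotheses = [], []
    for sample in samples:
        references.append(sample["model_text"])
        hypotheses.append(asr.transcribe(sample["audio_path"])["text"])
    # Corpus-level WER aggregates substitutions, insertions and deletions
    # over all samples before normalizing.
    return 100.0 * jiwer.wer(references, hypotheses)
```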

Methodological novelty

For the method, we want to clarify a few aspects:

  • How to construct and mix preference pairs from multi-turn dialogues is not a straightforward adaptation of the textual approach, given that spoken conversations present unique characteristics (e.g. overlaps and silences, more concise, more turns). We believe that our designs and results can be interesting for the community and practitioners.
  • Our approach shows that using synthetic voices can be effective, offering practical advantages such as preserving privacy and mitigating the challenge of scarcity for speech (feedback) data.

There are several motivations for leveraging live interaction data, instead of directly synthesizing from LLM-written dialog scripts:

  • While synthetic data could approximate "typical" timing, real interactions yield more diverse and realistic phenomena: e.g. mid-sentence clarifications and stoppings, hesitations, silences when thinking, abrupt topic shifts, etc. In practice, we observed that LLMs struggle at providing or assessing realistic timings, e.g., for overlapping speech. By preserving this information, we could keep some aspects of the prosody, speech rate and rhythm.
  • We observe that current LLMs are better at generating written text than content resembling spoken conversations, which is more concise, with more turns, and potentially overlapping. This echoes observations in recent work [2]. We observe that it is possible to prompt the model to imitate this to some extent, but it is harder to generate long, coherent, diverse and speech-like dialogues.
  • Crafting discussion topics and timings can induce biases from the researchers and/or the LLM used. Also, available LLMs will usually refuse to generate synthetic training scripts that would cover adversarial or unsafe topics.

We note that the original Moshi model was trained with LLM-written dialog data synthesized to speech [1]; however, our organically collected usage data shows a number of failure points. We show that leveraging this data for alignment provides additional gains over the original synthetic instruct approach.

Release plan

Because of privacy concerns, we don't plan to release the materials at this point.

Acoustic aspects

We provide demo samples in an anonymous link for the acoustic aspects.

Please let us know if we addressed your questions. Thanks!

[1] Défossez, A., et al. "Moshi: a speech-text foundation model for real-time dialogue." arXiv, 2024.

[2] Cho, H., et al. "Speechworthy instruction-tuned language models." EMNLP, 2024.

Review
Rating: 5

This work introduces a framework for aligning real-time, full-duplex spoken dialogue systems using user spoken interactions (building on the Moshi system from Kyutai). Unlike existing preference learning methods focused on text-based models, this approach addresses the complexities of dialog speech, such as interruptions, etc. The authors create a large dataset of 150,000+ preference pairs from multi-turn speech conversations, annotated with AI feedback. Using offline alignment methods (DPO & co, adapted to their multimodal case), they fine-tune an autoregressive speech-to-speech model (Moshi, actually). Experiments show that their approach improves factual accuracy, safety, and alignment of spoken dialogue systems.

Questions for Authors

-section 3.3: you clearly explain how you annotate problematic Moshi replies, but how do you derive preference data from this? Specifically, from a given context, you identify problematic replies—but where does the preferred (better) response come from? This part remains unclear to me.

-tab.1: incorporating audio tokens for DPO does not really help alignment; this could be commented on/discussed more

-section 5. About RQ #3 « As it is expensive to acquire new preference data, can we leverage data from off-policy model to optimize models with different voices? » Why would this be problematic? Especially if incorporating audio tokens for DPO does not really help alignment => there is something I don't quite get here; please explain more why using a voice with significantly different characteristics may cause transfer to be problematic

Claims and Evidence

Yes there are.

Methods and Evaluation Criteria

Sure they make sense.

Theoretical Claims

This is mainly an experimental paper, no strong theoretical claims here.

Experimental Design and Analysis

Yes, I checked and they sound very reasonable to me.

Supplementary Material

No, I did not go over the supplementary material tbh

Relation to Prior Work

Paper is very well positioned related to the literature

Missing Important References

no

Other Strengths and Weaknesses

Reasons to accept:

-this work might be the first to enhance speech-to-speech dialogue models using large-scale live interaction data.

-the dataset-building methodology also has some value: the authors build on the Moshi spoken language model, focusing on generating preference data from raw dialogues. They Whisper-transcribed audio interactions and used LLMs to annotate the data, flagging problematic replies (20% content-related, 57% timing-related, 23% both)

Reason to reject:

-Potential reasons for rejection could include a lack of clarity on user participation and data collection. Specifically, it is unclear who the users were—whether they were general Moshi users or specifically recruited participants—and whether they knew their conversations were being recorded (I will however not flag the paper for Ethical Review but would like to hear from the authors about that during the rebuttal). Additionally, questions remain about the distribution of the preference dataset (283,740 pairs with overlapping contexts) and whether it will be made publicly available.

Other Comments or Suggestions

no

Ethics Review Issues

N/A

Author Response

We thank the reviewer for the positive assessment and appreciate the insightful and careful feedback. We respond to the points below.

Questions

section 3.3: [...] how do you derive preference data from this? [...] unclear to me.

Thank you for the question. For content-related problems: we feed the conversation history context, the problematic reply, the LLM judge's feedback and the instructions for proposing a response into Mistral Large 2, which generates the improved reply and becomes the preferred response.

The conversation history context starts from the beginning of the conversation, up to and including the user's last response before the model's problematic reply. The LLM judge's feedback includes identified issues along the axes specified in Subsection 3.3, paragraph "Problematic reply identification".

For timing-related issues: if the problem is the model interrupting the user, then the preferred response will be given after the user finishes their utterance. If the semantic content of the initial response is adequate, we keep the same response; otherwise, the model needs to propose a response. If the problem is the model not answering the user, we similarly generate a proper reply, which we put directly after the user’s request.

We will add more clarifications on this part in the paper.
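
To make the content-related case concrete, here is a minimal sketch of the pair construction described above; the prompt wording and the generic `llm` callable (standing in for the Mistral Large 2 rewriter) are hypothetical.

```python
def build_content_preference_pair(context_turns, flagged_reply, judge_feedback, llm):
    """Build a (chosen, rejected) pair for a content-related issue.

    context_turns: list of (speaker, text, start_time_s) tuples covering the
    conversation from the beginning up to and including the user's last turn
    before the problematic reply.
    llm: any prompt -> text callable (e.g. a Mistral Large 2 client).
    """
    history = "\n".join(
        f"[{start:.1f}s] {speaker}: {text}" for speaker, text, start in context_turns
    )
    prompt = (
        "You are revising a spoken assistant's reply.\n\n"
        f"Conversation so far:\n{history}\n\n"
        f"Problematic reply: {flagged_reply}\n"
        f"Judge feedback: {judge_feedback}\n\n"
        "Write an improved reply that addresses the feedback while keeping the "
        "concise, spoken style of the conversation."
    )
    improved_reply = llm(prompt)
    # The flagged reply becomes the rejected response; the revision is preferred.
    return {"context": context_turns, "chosen": improved_reply, "rejected": flagged_reply}
```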

tab.1. incorporating audio tokens for DPO does not really help [...] discussed

The preferences we curated mainly concern semantic aspects and the model's temporal behaviour (e.g., not answering to the user). The actual acoustic contents are not labeled as "preferred/dispreferred". We conjecture that forcing the alignment objective to consider the full audio token probability could introduce noise, harming the model's performance. Focusing on text tokens alone was more stable.

section 5. About RQ #3 [...] Why would this be problematic? Especially if incorporating audio tokens for DPO does not really help alignment [...] to be problematic

The preference data uses TTS to resynthesize the user's voice, and for the model side, we use the audio tokens of Moshi itself (except for the preferred response, because we have no existing audio tokens, we also resynthesized using the model's voice). We use this same dataset and didn't resynthesize the audio data when aligning M-Alt-Vox.

During the alignment stage, even if the final objective optimizes the text tokens, the model is still fed with the audio tokens for context. A different voice would create a shift in the context distribution.

In practice, if we fully resynthesize the data with M-Alt-Vox's voice, then we conjecture that we won't see this issue. But this implies a heavier pipeline, and the transfer experiment was to test the extent to which we can reuse the already synthesized data and reduce the cost of transferring across voices.

Concerns

Potential reasons for rejection [...] publicly available.

We thank the reviewer for inquiring about data collection.

We collected aggregated data in a privacy-preserving and minimalist manner from organic users of our deployed system (not from specially recruited participants) over a two-week period, following a standard user agreement. This is similar to protocols in past work [1, 2]. This approach yields more authentic, diverse data that can better reflect users' interests and feedback.

To protect privacy, we asked users not to share personal data, and in practice, no personally identifying details were retained. Sensitive voice information was never accessed, and there's no human listening to the recorded conversations. Researchers don't have access to any sensitive and personally identifiable information, including vocal attributes.

We promptly discarded the original audio after transcribing it into text and allowing a download period for the user. We may lose paralinguistic information, but for the purpose of this work, we chose to keep only the minimal amount of information needed. Our data practices are, for instance, GDPR-compliant and adhere to principles of minimal, privacy-preserving collection.

Note that the examples in the Appendix are from our human evaluation study, explicitly consented.

As for the preference pairs, they include multiple flagged points from the original multi-turn dialogues. We detect problematic responses in the dialogue, keep the initial problematic response and sample from the other problematic responses if there are any. We will add more information about the distribution in the Appendix (e.g. number of turns).

Because of privacy concerns, we don't plan to release the data at this point.

Please let us know if the answers addressed your questions, and if we can address further questions you might have. Thanks again!

[1] Ram, A., et al. "Conversational ai: The science behind the alexa prize." arXiv, 2018.

[2] Shuster, K., et al. "Blenderbot 3: a deployed conversational agent that continually learns to responsibly engage." arXiv, 2022.

Reviewer Comment

I thank the authors for having addressed my questions and concerns. I have nothing to add here and overall this confirms my positive feedback on the paper.

Review
Rating: 5

The paper describes an approach to aligning the Moshi spoken dialog model to preference data automatically derived from human-model interactions. The preferences are elicited from transcripts of the spoken interactions via a textual LLM (Mistral Large 2). Context and responses are (re)synthesized via TTS and used for aligning the spoken language model. The effectiveness of alignment is evaluated with regard to factuality, safety and perceived quality of interaction.

Questions for Authors

376: It wasn't clear to me why the preference data doesn't fully transfer between different voices, since the data is transcribed and resynthesized. It would be good to explain which part of the data collection and processing pipeline leads to this problem.

论据与证据

Generally the claims are supported by the experiments carried out. A minor issue is that the evidence is limited to a single model (Moshi) while the framing refers to spoken dialog models in general.

Methods and Evaluation Criteria

The methods are appropriate for the research questions.

Theoretical Claims

NA

Experimental Design and Analysis

Broadly, the experimental design is solid. The main points of weakness concern the extensive reliance on transcriptions and TTS output rather than genuine spoken data, in the following ways:

  • The authentic user interaction data is automatically transcribed and discarded. The paper claims this is due to unspecified privacy issues, but this is unconvincing without further elaboration. At minimum, the impact of such a drastic transformation of the interaction data on the outcome should be assessed in some way.
  • Due to the above, context and preference data for alignment need to be re-synthesized via TTS. This likewise likely degrades the usefulness of this data.

If I understand correctly, human evaluation also seems to rely entirely on transcriptions rather than actual spoken interactions. This is especially problematic, as many important features of spoken dialog are very hard to decode from a textual transcription. There is no good reason for this design.

Additionally, the preference data is automatically generated using an LLM, which is a practical choice, but it would be good to validate this procedure with regard to quality.

It is important to note that despite the above limitations, the paper nevertheless makes a valuable contribution.

Supplementary Material

I skimmed the complete supplementary material but did not review it in detail.

Relation to Prior Work

The key contribution is the application of offline alignment from interaction data to a full duplex spoken language model.

Missing Important References

None identified.

Other Strengths and Weaknesses

NA

Other Comments or Suggestions

NA

Author Response

We appreciate that the reviewer finds our work valuable, and thank the reviewer for the constructive remarks. We answer specific points below.

Question

376: It wasn't clear to me why the preference data doesn't fully transfer between different voices [...] this problem.

The preference data uses TTS to resynthesize the user's voice, and for the model side, we use the audio tokens of Moshi itself (except for the preferred response, because we have no existing audio tokens, we also resynthesized using the model's voice). We use this same dataset and didn't resynthesize the audio data when aligning M-Alt-Vox.

During the alignment stage, even if the final objective optimizes the text tokens, the model is still fed with the audio tokens for context. A different voice would create a shift in the context distribution.

In practice, if we fully resynthesize the data with M-Alt-Vox's voice, then we conjecture that we won't see this issue. But this implies a heavier pipeline, and the transfer experiment was to test the extent to which we can reuse the already synthesized data and reduce the cost of transferring across voices.

Concerns

A minor issue is that the evidence is limited to a single model (Moshi) [...] in general.

The model choice is because Moshi was the only available open-source full-duplex speech-to-speech model when we conducted the project, and for cost/feasibility reasons. We agree that more work is required to further verify the extension to other models, and we will discuss this point in the limitations. We note that the method for building the dataset mixes from multi-turn dialogues is model-independent.

The main points of weakness concern the extensive reliance on transcriptions and TTS [...] This likewise likely degrades the usefulness of this data.

We agree that discarding the original audio loses some audio information, however:

  • we want to clarify that the TTS is done for the user's audio, and for Moshi we used its own generated tokens (except for the generated preferred response, also synthesized);
  • as the resynthesized speech retains the original timestamps, some aspects of the prosody are kept, such as the rhythm and speech rate;
  • for the axes we are focused on in this work, the original audio is not absolutely necessary.

Given that in a number of regions in the world, strong protection laws prevent or limit the recording or storage of private and biometric attributes such as a person's voice, our method presents important applications for improving speech systems in a privacy-preserving manner, while respecting local legislations. For more details on data protection, we kindly refer the reviewer to our reply to reviewer 2ifX.

We also want to note that according to the original Moshi paper [1], the instruction stage also uses synthetic generation.

If I understand correctly, human evaluation seems to also rely entirely on transcriptions [...] for this design.

We agree that relying on transcriptions with timestamps for the human evaluation has limitations, and this was mostly a logistical choice, for:

  • (1) disentangling and reducing the cognitive load of the annotators;
  • (2) focusing on the dialogue-level conversation content.

For (1), having evaluators or speakers score conversations after listening to the full audio (that can be up to 2min+) can introduce confounding factors, such as the memory ability to correctly recall details after the fact. By using transcriptions, we ensured annotators could consistently review entire multi-turn dialogues after the conversation concluded.

For (2), our primary alignment goal was to improve the dialogue content, but the automatic metrics only provide assessment for single-turn conversations. With human evaluation, we wanted to evaluate the dialogue-level quality across multiple turns, including aspects such as consistency and transitions across turns.

We will emphasize this as a limitation, and agree that enriching human evaluation with, for instance, direct listening tests or hybrid assessments could capture a broader spectrum of spoken dialogue characteristics.

Additionally, the preference data is automatically generated using an LLM, which is a practical choice, [...] regard to quality.

For the validation of the LLM-generated preference data, we manually curated a small held-out validation set over the axes we defined to assess the model's responses (e.g., helpfulness, safety, factuality), and inspected the automatically generated preference responses. As a practical choice, the design followed a heuristic approach trying to cover diverse failure modes. We agree that it would be interesting for future work to conduct more fine-grained investigation.

Please let us know if the answers addressed your questions, and if we can address further questions you might have. Thanks again!

[1] Défossez, A., et al. "Moshi: a speech-text foundation model for real-time dialogue." arXiv, 2024.

Reviewer Comment

Given that in a number of regions in the world, strong protection laws prevent or limit the recording or storage of private and biometric attributes such as a person's voice, our method presents important applications for improving speech systems in a privacy-preserving manner, while respecting local legislations.

I would find this aspect more useful and convincing if the paper made an effort to quantify the impact of discarding the audio and replacing it with transcriptions and re-synthesized audio. As it is, it's not clear to the reader how serious of a problem this approach is, and if the potential advantage from the point of view of privacy is worth the tradeoff.

Author Comment

I would find this aspect more useful and convincing if the paper made an effort to quantify the impact of discarding the audio and replacing it with transcriptions and re-synthesized audio. As it is, it's not clear to the reader how serious of a problem this approach is, and if the potential advantage from the point of view of privacy is worth the tradeoff.

We thank the reviewer for emphasizing the importance of quantifying the impact of synthetic vs. real audio.

Because we don't have access to the raw audio of the training data used, but only a much smaller set used for human evaluation, it's difficult to estimate the influence using the same training pipeline.

Instead, we discarded the "user's voice" from the human evaluation audio, resynthesized the user stream using the TTS pipeline, and computed the corpus WER between both transcriptions with Whisper for the user side. This is more of a proxy, as we don't have ground-truth transcriptions.

The WER (lower is better) suggests that there can be a moderate discrepancy introduced by synthesizing the audio. We also listened to 20 audios and manually checked the errors introduced. The most significant difference is the altered voice attributes. WER errors can include backchannel transcriptions (e.g. "Can you tell me, um, can you give me some recommendations of books to read?" vs. "Can you tell me, can you give me some recommendations of books to read?").

Model                 Corpus WER (%)
Moshi-Instruct        6.27
M-Alt-Vox-Instruct    6.75
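
A minimal sketch of this proxy, assuming a generic (hypothetical) `resynthesize_user_stream` TTS helper and Whisper as the shared ASR model, could look as follows; it illustrates the idea rather than reproducing the exact pipeline.

```python
import jiwer
import whisper


def resynthesis_proxy_wer(user_segments, resynthesize_user_stream, whisper_size="large-v3"):
    """Estimate the discrepancy introduced by replacing the user's real audio
    with TTS: transcribe both versions with the same ASR model and compare.

    user_segments: list of dicts with "original_audio" (path to the real user
    recording) plus "transcript" and "timestamps" used to drive the TTS.
    resynthesize_user_stream: hypothetical TTS callable returning an audio path.
    """
    asr = whisper.load_model(whisper_size)
    original, resynthesized = [], []
    for segment in user_segments:
        original.append(asr.transcribe(segment["original_audio"])["text"])
        synthetic_path = resynthesize_user_stream(segment["transcript"], segment["timestamps"])
        resynthesized.append(asr.transcribe(synthetic_path)["text"])
    # No ground-truth transcript is available, so the ASR output on the real
    # recording serves as the reference for a corpus-level WER.
    return 100.0 * jiwer.wer(original, resynthesized)
```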

Given that certain regions have strict data protection laws regarding biometric privacy, our approach shows that it's already possible to improve alignment with the minimal degree of information needed, using synthesized user audio. When the trade-off is present, the choice might be more case-dependent, as the original audio could keep some more diverse characteristics, but also bring data biases. We agree that for future work, a more systematic and quantitative comparison of the influence of real vs. synthetic audio can help further clarify the necessity and impact of different vocal attributes.

Final Decision

The paper presents an interesting idea of preference training of an end-to-end, speech-to-speech dialogue response generation model based on AI feedback. The feedback is accumulated to cover preferences including linguistic content, factuality, safety and so on, and the model is trained using offline methods. It is a paper worthy of presentation at ICML; however, please note that one of the reviewers suggests extending the evaluation, even after the rebuttal discussion. These suggestions could be considered for the presentation.