A Full-duplex Speech Dialogue Scheme Based On Large Language Model
This work formalizes the problem of full-duplex voice conversation with LLMs and presents a method toward this goal.
Abstract
Reviews and Discussion
This paper provides a new approach for using language model assistants, specifically enabling real-time interactions with the language model in a full-duplex setting.
In detail, the authors achieve this by further training the Llama-3-Instruct model and integrating it with ASR and TTS modules, creating a complete pipeline system. The demonstration system exhibits voice-dialogue performance that surpasses commercial models.
I really like the demo implemented by the authors, and I even opened the code to try running it (although I lacked the LLM, I understood its principles). I support accepting this work!
Strengths
This work provides technical details on achieving full-duplex interaction with language models, offering unprecedented technical specifics (since GPT-4o has not been open-sourced).
This work presents a quick response method to handle user queries. It even shows a clear advantage over GPT-4o in terms of appropriate responsiveness to user interruptions.
Weaknesses
Novelty of Motivation. Considering that this paper uses the Llama-3-Instruct model and makes extensive comparisons with GPT-4o, it can be inferred that this paper was completed after the release date of GPT-4o. GPT-4o was the first language model to achieve full-duplex communication, so the motivation of this paper largely draws from GPT-4o's ideas. The main contribution lies in providing the technical methods to replicate GPT-4o's full-duplex communication.
Performance Degradation. From Table 3, we can see that the method provided by the authors comes at the cost of reduced inference performance of the model, which is a drawback.
Poor Code Readability. I hope the authors can completely refactor their code and present it in the Hugging Face format. In my opinion, the main contribution of this work is providing a demonstration system, but the code is hard to read and hard to run. If the authors used a more concise style for the entire model, I believe this would greatly increase the community's interest in this work. For example, in the Llama style:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
Questions
A. Can you honestly tell me when you came up with this idea? Was it shortly after the release of GPT-4o?
B. Have you considered building an end-to-end model? Integrating ASR and TTS into the LLM as well to achieve faster response times?
C. Have you considered the limitations of this model?
D. I'm curious, is 20 steps of training really enough for the Llama-3-Instruct 8B model?
Limitations
None
We thank the reviewer for the careful review and positive feedback. We are glad the reviewer likes our demo.
Performance Degradation.
In this experiment, we used only the full-duplex conversation data for SFT, which is why other capabilities were compromised. However, if full-duplex conversation data were mixed with other types of data for SFT, this detrimental effect would be mitigated. Essentially, the full-duplex conversation data is merely performing a specific type of alignment. Nowadays, there are numerous alignment training techniques that can prevent adverse impacts.
Poor Code Readability.
Thank you for your suggestion, we will work on further improving the code readability.
A. Can you honestly tell me when you came up with this idea? Was it shortly after the release of GPT-4o?
We started this project around February 2024. The first demo was showcased in March 2024, followed by continuous iterations and updates since then. We can provide the git log for further clarification if necessary.
B. Have you considered building an end-to-end model? Integrating ASR and TTS into the LLM as well to achieve faster response times?
Please refer to the overall comments.
C. Have you considered the limitations of this model?
Yes. For instance, the FSM state can change frequently due to varying user speech rates, which necessitates robust SFT alignment to strengthen the model's handling of such scenarios. Furthermore, whether relying on ASR or on a multi-modal model's speech tokenizer to process audio, slicing the speech stream into chunks can introduce errors.
D. I'm curious, is 20 steps of training really enough for the Llama-3-Instruct 8B model?
We found that for LLMs, learning the neural FSM task is relatively straightforward, requiring only a small amount of data and incurring minimal training overhead to achieve the desired results.
I am very grateful to the authors for addressing my concerns. I am willing to support the acceptance of this paper and even its highlighting at the conference. I believe this is a highly promising research direction! Furthermore, as the authors have pioneered this direction, I am more than happy to raise my evaluation score to 8.
Thank you for reconsidering our work and for your positive reevaluation.
This work presents a generative dialogue system capable of operating in a full-duplex manner. The model is based on a Large Language Model trained with the same next-token prediction loss, with an adapted input sequence to account for possible speaker turns or interruptions. The model is designed to work with external ASR and TTS systems. This work also proposes a few evaluation metrics for generated real-time dialogue and shows that the proposed method is better than other methods.
Strengths
This paper addresses a relatively new area, in which spoken dialogue needs to be generated in real time while mimicking human behavior. The paper proposes a new method based on the same loss function that standard LLMs are trained on, so it can leverage existing pre-trained LLMs. This work also proposes using an LLM as an evaluator to automatically assess the quality of the generated dialogues.
Weaknesses
The proposed method is purely based on text tokens and needs to integrate with external ASR/TTS models, which makes it difficult to coordinate with these systems when it comes to implementation; converting everything to text in the ASR system may result in loss of information, e.g., the user's emotion and tone are lost in this process.
The evaluation in this work is limited, though understandably the quality evaluation of generated spoken dialogue can be quite subjective. In the paper, only latency- and interruption-related metrics are measured, while it is less clear whether the generated machine response is helpful or relevant. This could be different from the regression experiment presented in Table 3, where the main goal is to evaluate whether the machine interrupting the user is reasonable or not, or whether a particular backchannel or confirmation sounds reasonable.
Questions
In the paper, the authors provide text-based prompts asking GPT-4 to generate text-based dialogues, but it is unclear to me how these text-based dialogues can be converted to spoken ones while mimicking human behavior: for example, for a serialized dialogue like "What a bad weather <usr> yeah", it is unclear to me how to synthesize this dialogue such that the gap between speaker turns sounds natural.
In section 3.2, it states that "it (ASR system) outputs one token chunk prefixed by the tokens <usr> with textual content if the user speaks. When the neural FSM's state is SPEAK, only chunks with textual content will be sent to the LLM and appended to the LLM's generated token cache. When the neural FSM is in LISTEN state, a contentless chunk will also be transmitted to the LLM to represent a moment of silence". Does this mean that at inference time, the ASR output for every chunk is inserted into the LLM input token sequence every 640 ms (plus the ASR inference time)?
Limitations
The proposed system is not an end-to-end system and relies on ASR/TTS systems to understand and generate responses. All the information that the ASR system cannot transcribe is lost in the process.
Thank you for the thoughtful review of our work! Please allow us to address your concerns and answer the questions.
The proposed method is purely based on text tokens...
Please refer to the overall comments.
The evaluation in this work is limited...
In the system prompt we used during our evaluation with GPT-4, we included the following requirements:
- "If the timing of the interruption was appropriate, further evaluate the relevance and coherence of the interjection with the previous conversation;"
- "First, provide an analysis of the appropriateness of the interruption timing and the content of the interjection, then list two scores: a. The first score evaluates the timing of the interruption,where 0 represents inappropriate, and 1 represents appropriate; b. The second score evaluates the content of the interjection, where 0 represents inappropriate, and 1 represents appropriate."
In reality, for current large language models, the helpfulness and relevance of the generated content can be guaranteed, and the responses in the data we constructed also meet the criteria of being helpful and relevant.
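As a rough sketch (not the exact script used for the paper's evaluation), such an LLM-as-judge protocol could be wired up as follows; the client call follows the current openai Python SDK, and the naive score parsing and function names are illustrative assumptions:

import re
from openai import OpenAI

client = OpenAI()

JUDGE_SYSTEM_PROMPT = (
    "If the timing of the interruption was appropriate, further evaluate the relevance "
    "and coherence of the interjection with the previous conversation. First, provide an "
    "analysis, then list two scores: a. timing of the interruption (0 = inappropriate, "
    "1 = appropriate); b. content of the interjection (0 = inappropriate, 1 = appropriate)."
)

def judge_interruption(dialogue_transcript: str) -> tuple[int, int]:
    # Ask the judge model to score interruption timing and interjection content.
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": dialogue_transcript},
        ],
    )
    text = completion.choices[0].message.content
    # Naive parsing: take the last two standalone 0/1 digits as the two scores.
    timing_score, content_score = (int(s) for s in re.findall(r"\b[01]\b", text)[-2:])
    return timing_score, content_score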
In the paper, the author provides text-based prompts...
When generating conversational data using GPT-4, we emphasized in our instructions that this was a dataset of conversations between humans and a voice interface, and that the human tone should be as concise and colloquial as possible: "Language Expression: The user’s language should be as colloquial and human-like as possible." Thanks to GPT-4's powerful linguistic capabilities, and after numerous trials and manual sampling reviews, we were able to generate data that closely resembled everyday conversation. We provide the prompt used to construct the data in Appendix A.
During training, tokens representing state transitions are inserted into the text, e.g., "Hi, could you [S.LISTEN] tell me [C.LISTEN] the result of 2+3 [S.SPEAK] Sure, the result of 2 + 3 is 5.". In practical inference, gaps between speakers are represented by empty ASR chunks, and the model produces state transitions in the neural FSM to accommodate these gaps properly. For instance, if a certain number of consecutive empty chunks are received, the LLM will output the [S.SPEAK] token next. This behavior can be produced during the transcript generation process.
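To make the serialization concrete, here is a hedged sketch of how such a training transcript could be assembled from user ASR chunks; the control tokens match the example above, while the helper name and chunking granularity are illustrative assumptions:

S_LISTEN, C_LISTEN, S_SPEAK = "[S.LISTEN]", "[C.LISTEN]", "[S.SPEAK]"

def serialize_turn(user_chunks: list[str], assistant_reply: str) -> str:
    # Interleave incoming user ASR chunks with FSM control tokens, then append
    # the assistant reply once the model transitions to SPEAK.
    parts = []
    for i, chunk in enumerate(user_chunks):
        parts.append(chunk)
        if i == len(user_chunks) - 1:
            parts.append(S_SPEAK)    # user finished; model takes the turn
        elif i == 0:
            parts.append(S_LISTEN)   # first chunk: transition into LISTEN
        else:
            parts.append(C_LISTEN)   # intermediate chunks: keep listening
    parts.append(assistant_reply)
    return " ".join(parts)

print(serialize_turn(["Hi, could you", "tell me", "the result of 2+3"],
                     "Sure, the result of 2 + 3 is 5."))
# -> Hi, could you [S.LISTEN] tell me [C.LISTEN] the result of 2+3 [S.SPEAK] Sure, the result of 2 + 3 is 5.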
In section 3.2, it states that "it (ASR system) outputs one token chunk...
The ASR will always be listening, much like a person would, but if the FSM's state is SPEAK, empty audio chunks are not passed to the LLM. Therefore, there are three scenarios in total:
- If the FSM's state is LISTEN, all chunks, regardless of whether they are empty or not, are sent to the LLM until the FSM decides to change the state;
- If the FSM's state is SPEAK and the audio received by the ASR is empty, it is not passed to the LLM;
- If the FSM's state is SPEAK and the audio received by the ASR is non-empty, it is immediately passed to the LLM.
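The routing rule behind these three scenarios can be summarized in a few lines; this is only a sketch of the described behavior, and the class and function names are illustrative assumptions:

from enum import Enum

class FSMState(Enum):
    LISTEN = "listen"
    SPEAK = "speak"

def should_forward_chunk(state: FSMState, chunk_text: str) -> bool:
    # Decide whether an ASR chunk (possibly empty, i.e. silence) is appended
    # to the LLM's input token stream.
    if state is FSMState.LISTEN:
        # In LISTEN, every chunk is forwarded, including empty ones, so the
        # model can observe silence and decide when to take the turn.
        return True
    # In SPEAK, silence is dropped; non-empty user speech is forwarded
    # immediately so the model can decide whether to stop and yield the turn.
    return bool(chunk_text.strip())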
The work introduces an engineering effort that makes a generative system capable of speaking and listening without waiting for the end of the entire sequence. The main contribution is combining one instruction-tuned LLM and a careful prompt with external ASR and TTS models. The system demonstrates significant improvements in conversation response latency and interruption precision compared to existing systems.
Strengths
S1. The paper surpasses commercial state-of-the-art software on voice-based dialogue.
S2. The unification of multiple tasks into one single task for the LLM (next-token prediction) is an original idea that keeps the training and finetuning time as low as possible.
Weaknesses
W1. It is not clear to me why we need to rely on three components, which should increase latency compared to one big unified model.
W2. The evaluation metrics adopted may not capture the complexity of real-life conversations.
W3. There is little comparison with already existing work in this area (i.e., https://github.com/AbrahamSanders/realtime-chatbot).
W4. There is little experimentation on whether the model's capability has been retained on real-world examples. Table 3 gives an academic comparison, but it would be interesting to see real-world capability.
Questions
Q1. Did the authors evaluate the GPT-4 instruction-tuning transcripts? Can this introduce potential bias and weaken safety measures?
Q2. Did the authors try to distill the Llama3 8B model to reduce the model footprint and maybe improve latency as well?
Limitations
Yes
We appreciate that the reviewer understands and recognizes the contributions of this work. We address the main concerns as follows.
W1. It's not clear to me...
Please refer to the overall comments.
W2. The evaluation metrics...
In practical scenarios, more issues indeed need to be considered (such as excessively long conversation histories, poor ASR recognition results, frequent interruptions by noise, etc.), but it is challenging to construct a benchmark that covers all these complex situations comprehensively. Our test set endeavors to cover the different states of human-machine dialogue at the text level, primarily to validate the effectiveness of the proposed neural FSM. Based on this framework, when deploying the model in real-world applications, one can construct a more diverse dataset reflective of real scenarios to enhance the model's capabilities.
W3. There is little comparison...
As of the time of submitting this paper, no similar open-source implementation, data, or model for LLM + full-duplex functionality could be found. No paper is available for the linked GitHub repository, and the system presented there does not measure conversation quality in terms of response time and interruption accuracy.
W4. There is little experimentation...
To validate the feasibility of the LLM + full-duplex solution, this paper exclusively used full-duplex conversation data for SFT, which compromised other capabilities. If full-duplex conversation data were mixed with other types of data for SFT, this detrimental effect could be avoided.
Q1. Did the authors evaluate the GPT4...
The volume of constructed data is not large, and it mainly consists of daily conversations. We manually reviewed the dialogue scripts to ensure they do not contain topics that might pose safety risks. Adding this data for SFT essentially serves as a form of alignment, and its impact can be balanced by employing other alignment techniques.
Q2. Did the authors try to distill...
Due to the reuse of the KV cache, the response latency demonstrated by our solution is already very low, reaching levels comparable to human-to-human conversation. Using a smaller model would indeed result in lower response latency, but at the cost of reduced understanding and reasoning capabilities. Considering the actual speaking rate of humans, the model's decoding speed only needs to exceed a certain threshold, and does not need to be exceptionally fast. We lean towards utilizing larger-scale models in practical applications to ensure a high-quality conversational experience.
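As a back-of-the-envelope illustration of why a moderate decoding speed suffices (the numbers below are rough assumptions, not measurements from the paper):

words_per_minute = 150   # assumed conversational speaking rate
tokens_per_word = 1.3    # assumed subword-to-word ratio

speech_tokens_per_second = words_per_minute / 60 * tokens_per_word
print(f"speech consumes roughly {speech_tokens_per_second:.1f} tokens/s")
# ~3.3 tokens/s, so any decoder sustaining well above that rate keeps the
# TTS audio buffer from underrunning, even for an 8B model.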
Thanks to the authors for addressing some of my concerns. I still believe there is more work to do and additional comparisons needed before claiming the system is a new SOTA, but I am happy to raise my evaluation score to 6.
We are grateful for your re-evaluation and for taking the time to reassess our research.
We thank the reviewers for their careful reading, valuable comments, kind words and recognition of the contributions. We first answer a question raised by multiple reviewers.
One big unified model instead of integrating with some external ASR/TTS models
Training an end-to-end unified multimodal LLM requires substantial computational resources, and no open-source model with language modeling capability similar to LLaMA-8B is currently available. This paper presents a solution that achieves full-duplex dialogue using only existing open-source models, with very low latency.
We also note that the concept of the neural FSM is not limited to pipelines composed of separate models. Interaction with the physical world requires the system to decide when to produce output and when to wait. Large Language Models, once equipped with the capability to understand and produce audio tokens, can still utilize a neural FSM to make such decisions. Our paper introduces a mechanism enabling the model to learn how to interact with the physical world through three abstract modules. An end-to-end multimodal large language model can be seen as changing the implementation of these abstract modules, where the perception module becomes an audio tokenizer and the motor-function module turns into an audio token decoder. We anticipate that the neural FSM approach will remain compatible with this type of model. However, due to the lack of open-source LLMs equipped with an audio modality, we are unable to test this hypothesis.
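The abstract-module view can be stated as interfaces; the sketch below is only illustrative (class and method names are assumptions), showing how the pipeline and an end-to-end model would differ only in their implementations:

from abc import ABC, abstractmethod

class PerceptionModule(ABC):
    @abstractmethod
    def to_tokens(self, audio_chunk: bytes) -> list[int]:
        """Turn incoming audio into tokens for the LLM: ASR text tokens in the
        pipeline, or audio tokens from a speech tokenizer in an end-to-end model."""

class MotorFunctionModule(ABC):
    @abstractmethod
    def to_audio(self, tokens: list[int]) -> bytes:
        """Turn LLM output tokens into audio: TTS in the pipeline, or an audio
        token decoder in an end-to-end model."""

# The LLM with its neural FSM sits between the two modules and decides, via
# the state tokens, when to consume perception input and when to emit output.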
Another interesting question is whether this idea is derived from GPT-4o. We sincerely state that this work was developed independently of GPT-4o, which is a closed-source commercial product released on May 13, 2024. We started this project around February 2024. The first internal demo was showcased in March 2024, followed by continuous iterations and updates since then. We are willing to provide git commit histories as proof of the development dates if the program committee deems it necessary.
This work presents a full-duplex generative dialog system that enables real-time interactions with a Large Language Model, i.e., the authors introduce a neural finite-state-machine and model the tasks of responding to, waiting for, or interrupting the user as next token prediction on a serialized view of the dialog in real-time. The paper relies on using existing open-sourced models. The authors report improvements over baselines in conversation response latency and interruption precision with a smaller model.
One weakness that the reviewers raised was the need to use ASR and TTS models instead of a unified model, and how errors could propagate or information may be lost due to this design choice. Although a valid point, the overall approach of the paper may be applied to such a unified model. A second concern was raised regarding whether the metrics reported in the paper (latency and interruption precision) are sufficient to evaluate dialog (e.g., whether the response is helpful or relevant). This is a valid question, and I disagree with the authors' response that "in reality, for current large language models, the helpfulness and relevance of the generated content can be guaranteed", as evidenced by the works that attempt to improve the quality of LLM responses. The authors should make this clear as a limitation of their experimental work in the final draft. Furthermore, the authors should avoid making hypothetical statements in the paper that are not backed by evidence (e.g., lines 326-328 of the paper).
Given the focus of the paper on supporting natural multi-turn dialog and considering the limitations in evaluation discussed above, I recommend accepting the paper for a poster presentation.