SpeechAlign: Aligning Speech Generation to Human Preferences
Abstract
Reviews and Discussion
This paper proposes to apply preference optimization techniques (which have proven useful in aligning text language models' outputs to users' preferences) to speech language models that generate sequences of discrete audio representations and then speech. The particularity of the preference dataset is that it does not rely on true human preferences but is simply made up of AR tokens obtained from natural speech and from synthetic speech (those coming from natural speech being the preferred ones).
Different preference optimization (PO) strategies are investigated: RLHF-PPO, DPO, etc., and human evaluation is conducted to assess the synthetic speech obtained. DPO appears to be the most performant PO method. An iterative process, where the updated speech generation model is used to create a new and more challenging preference dataset (with synthetic/natural speech pairs), further improves speech generation quality.
Strengths
The integration of human feedback to align speech outputs to human preferences is a new topic addressed in this paper with convincing results on speech generation.
Weaknesses
- The preliminary analysis of the 'distribution gap' is interesting in itself, but it is not clear how it really relates to SpeechAlign (in other words, why SpeechAlign solves the distribution gap observed earlier).
- The way Best-of-N Sampling (BoN) is presented is confusing: it is placed at the same level as the PO methods, but its description reads more like a decoding approach than a model alignment approach.
Questions
- 4.1 Experimental setup: how are the training parameters (lr, bs) chosen?
- Little is said about the vocoder that generates speech from the acoustic tokens.
- Will the PO datasets be shared?
Limitations
- I don't see many limitations to this paper.
We sincerely appreciate the time and effort you have dedicated to reviewing our paper. Your insights and comments have been invaluable to refining our research. Our responses are as follows:
Q1: How does the distribution gap relate to SpeechAlign?
There is a distribution gap between golden AR tokens and synthetic AR tokens, which adversely impacts the performance of the TTS model. SpeechAlign calibrates the output of codec language models to the golden codec distribution and bridges the distribution gap by preference optimization, which brings performance improvements.
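To make this construction concrete, here is a minimal sketch of how such a preference pair could be assembled: the golden AR tokens encoded from natural speech serve as the chosen sample, and the model's own synthetic AR tokens serve as the rejected one. All function and attribute names (`encode_text`, `encode_speech`, `generate`) are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PreferencePair:
    prompt_ids: List[int]    # text / phoneme prompt tokens
    chosen_ids: List[int]    # "golden" AR tokens encoded from natural speech
    rejected_ids: List[int]  # synthetic AR tokens generated by the current model

def build_preference_dataset(texts, waveforms, tokenizer, tts_model):
    # Pair each utterance's golden tokens with the model's own output for the
    # same text; no human annotation is required.
    pairs = []
    for text, wav in zip(texts, waveforms):
        prompt_ids = tokenizer.encode_text(text)        # assumed helper
        golden_ids = tokenizer.encode_speech(wav)       # codec AR tokens from real speech
        synthetic_ids = tts_model.generate(prompt_ids)  # model-generated AR tokens
        pairs.append(PreferencePair(prompt_ids, golden_ids, synthetic_ids))
    return pairs
```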
Q2: Why is Best-of-N Sampling (BoN) presented at the same level as the PO methods?
Here, our BoN selects the sampled result with the highest reward model score as the output. The aim is to utilize a reward model trained on preference data to align large language models with human preferences, thereby enhancing output quality. Essentially, it is a form of decoding-time alignment, which is why we have placed it in this section.
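For illustration, a minimal sketch of such reward-guided Best-of-N decoding is shown below; `generate` and `score` are assumed interfaces of the TTS model and reward model (returning a token sequence and a scalar score, respectively), not the paper's actual code.

```python
import torch

def best_of_n(prompt_ids, tts_model, reward_model, n=8):
    # Decoding-time alignment: sample N candidate AR token sequences and
    # return the one the reward model scores highest; the policy itself is
    # never updated.
    candidates = [tts_model.generate(prompt_ids, do_sample=True) for _ in range(n)]
    with torch.no_grad():
        scores = torch.tensor([reward_model.score(prompt_ids, c) for c in candidates])
    return candidates[int(scores.argmax())]
```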
Q3: How are the training parameters (lr, bs) chosen?
We set the approximate magnitude of the learning rate for each model based on concurrent RLHF works and our own training experience, and then ran experiments with different learning rates to ensure that the results reported in this paper are optimal. The batch size is chosen to be as large as possible to accelerate training, subject to not exhausting GPU memory.
Q4: The vocoder that generates speech from the acoustic tokens.
Sorry for the confusion. The vocoder we use is the pre-trained decoder of the SpeechTokenizer [1] model. We will add this to our paper.
Q5: Will the PO datasets be shared?
Yes, we will open-source all these datasets for further research.
[1] Zhang, Xin, et al. "SpeechTokenizer: Unified speech tokenizer for speech large language models." arXiv preprint arXiv:2308.16692 (2023).
This paper introduces a method to improve speech generation in a speech language model via preference optimization. The method relies on creating a dataset of "gold" speech tokens produced by a neural codec model from a speech sample, contrasted with synthetic tokens produced by a speech generating model from text. The model is then trained to preferably generate tokens closer to the "gold" ones, via a number of preference optimization methods, and evaluated using both automatic metrics (WER and speaker similarity) and human preference judgments. The results show that the model can iteratively improve its performance based on the preference dataset.
Strengths
The method is simple yet non-obvious.
The existence and impact of the distribution mismatch between gold and synthetic tokens is demonstrated via preliminary analysis and experiments.
A number of different preference optimization algorithms are tested.
The evaluation is carried out on two datasets, and accompanied by a thorough analysis and several ablations.
Weaknesses
The description of the preference optimization algorithms (CoH, DPO, RLHF-PPO) is hard to understand without already being familiar with the relevant papers.
The results are presented without showing their variance, even though the underlying data should be available as the paper mentions evaluating each model 10 times. It would be good to have the spread of the scores available in addition to the mean.
Questions
What is meant by the "reference model" in line 173?
Limitations
No discussion of limitations within the body of the paper.
We sincerely appreciate the time and effort you have dedicated to reviewing our paper. Your insights and comments have been invaluable to refining our research. Our responses are as follows:
Q1: The description of the preference optimization algorithms.
Thank you for your valuable feedback. We apologize for any confusion caused by the descriptions of the preference optimization algorithms. Due to space constraints, we did not provide a detailed explanation of these methods in the paper; we will add one in subsequent versions. Brief introductions to each algorithm are as follows:
Chain-of-Hindsight (CoH): CoH improves the quality of responses by constructing a "chain of hindsight" that enables the model to learn from past experiences, especially user feedback. The CoH technique involves converting all types of responses into a sequence of sentences and fine-tuning the model by leveraging the language model's reading comprehension abilities. For instance, in the paper lines 159-162, we design different prompts for positive and negative samples during training, and during inference, we use the prompt for positive samples to guide the model in generating correct responses.
RLHF-PPO: RLHF-PPO is a reinforcement learning method based on human feedback. It involves training a reward model on preference data to score responses based on quality and then training a policy model (the model requiring RLHF) to generate responses that maximize rewards. This process is complex and often unstable.
Direct Preference Optimization (DPO): DPO is a new method for optimizing language models based on human preferences. Unlike RLHF, DPO does not rely on explicit reward modeling or reinforcement learning. DPO works by increasing the log probability of preferred samples and decreasing the log probability of non-preferred responses. In contrast to traditional methods that use a preference model to train a reward model and then train a policy based on that reward model, DPO directly defines preference loss based on the policy.
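As an illustration of the objective described above, here is a minimal PyTorch sketch of the standard DPO loss, assuming sequence-level log-probabilities have already been computed under the policy and a frozen reference model. This is a generic sketch of the published DPO formulation, not the paper's implementation.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards are the log-prob differences against the frozen
    # reference model; the loss pushes chosen sequences above rejected ones.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```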
Q2: The variance of the evaluation results.
Thanks for the reminder. We provide the variances here and will add them to each result in the final paper.
| Model | LibriSpeech WER (↓) | LibriSpeech SIM (↑) | LibriSpeech QMOS (↑) | LibriSpeech SMOS (↑) | VCTK WER (↓) | VCTK SIM (↑) | VCTK QMOS (↑) | VCTK SMOS (↑) |
|---|---|---|---|---|---|---|---|---|
| Groundtruth | 4.0 | - | 4.40 | 4.06 | 2.4 | - | 4.33 | 4.60 |
| SpeechAlign-sft | 7.2±0.009 | 0.87±0.004 | 3.20±0.08 | 3.20±0.09 | 8.8±0.012 | 0.79±0.004 | 3.27±0.12 | 3.13±0.09 |
| Continue SFT | 8.0±0.008 | 0.88±0.005 | 3.07±0.07 | 3.13±0.06 | 9.8±0.014 | 0.80±0.003 | 3.20±0.08 | 3.20±0.08 |
| SpeechAlign-CoH | 7.3±0.008 | 0.89±0.004 | 3.33±0.07 | 3.47±0.08 | 10.2±0.008 | 0.81±0.004 | 3.53±0.06 | 3.73±0.09 |
| SpeechAlign-BoN | 8.0±0.012 | 0.88±0.005 | 3.40±0.12 | 3.70±0.06 | 7.5±0.009 | 0.79±0.005 | 3.50±0.07 | 3.40±0.08 |
| SpeechAlign-RLHF-PPO | 7.1±0.010 | 0.89±0.004 | 3.60±0.09 | 3.87±0.06 | 8.5±0.010 | 0.80±0.006 | 3.53±0.08 | 3.80±0.05 |
| SpeechAlign-DPO-Iter1 | 6.7±0.009 | 0.88±0.005 | 3.20±0.06 | 3.33±0.05 | 8.5±0.011 | 0.82±0.004 | 3.33±0.06 | 3.07±0.09 |
| SpeechAlign-DPO-Iter2 | 6.2±0.008 | 0.89±0.003 | 3.67±0.04 | 3.40±0.07 | 8.0±0.008 | 0.83±0.006 | 3.33±0.07 | 3.33±0.06 |
| SpeechAlign-DPO-Iter3 | 6.0±0.009 | 0.90±0.005 | 3.73±0.08 | 3.93±0.06 | 7.9±0.004 | 0.83±0.006 | 3.47±0.05 | 3.60±0.06 |
Q3: The meaning of the reference model.
The reference model is initialized from the SFT-trained model and is typically the same model as the policy model at initialization. It remains frozen and does not participate in training; its role is to ensure that, during the reinforcement learning process, the policy model's distribution does not drift too far from that of the reference model.
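For concreteness, here is a minimal sketch of how a frozen reference model is typically used for reward shaping in RLHF-PPO. This is the generic formulation, not the paper's exact implementation; the inputs are assumed to be per-token log-probability tensors.

```python
def kl_penalized_reward(reward_score, policy_logprobs, ref_logprobs, kl_coef=0.1):
    # Standard RLHF shaping: a per-token KL estimate between the policy and
    # the frozen reference model is subtracted from the reward-model score,
    # keeping the policy close to its SFT initialization.
    kl = policy_logprobs - ref_logprobs
    return reward_score - kl_coef * kl
```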
Many thanks for your response. My positive assessment of the paper remains unchanged.
This study analyzes the training-inference mismatch that occurs in codec language models, a branch of personalized speech synthesis research, and mitigates it through preference optimization methods. By avoiding the labor-intensive process of collecting human preference test results, the researchers efficiently gathered data and used it to further fine-tune the model, achieving improved results in personalization.
Strengths
- Preference optimization has not yet been sufficiently explored in personalized speech synthesis, and this study demonstrates its effectiveness for this purpose.
- They conducted thorough observations and analyses of the problem they defined.
- The illustrations and figures added to the paper aid in comprehension, and the writing is clear and easy to understand.
- They evaluated various components and details they used within the paper, effectively demonstrating the impact of each component.
Weaknesses
I believe the evaluation with other baselines might be insufficient. I have included additional evaluation questions below for consideration.
Questions
- (Questions) Distribution Gap in Section 2.3:
- You mentioned a distribution gap between golden AR tokens and synthetic AR tokens. Which attributes cause this distribution difference? Generally, discrete features are known to be more robust against error propagation compared to continuous features, and many studies have leveraged this characteristic for TTS. Does the distribution gap arise because the SpeechTokenizer fails to remove acoustic information within the AR token, leaving residual information that causes a mismatch in synthetic AR tokens? What specific information is being mismatched?
- Scalability and Dataset Size:
- While recent zero-shot personalized speech synthesis models typically train on large-scale datasets of at least tens of thousands of hours, this study used the relatively smaller LibriSpeech dataset. This raises questions about the scalability of your methodology. Can the model achieve better performance with more data, as demonstrated by many recent models that show excellent personalized speech synthesis performance using vast datasets? Evaluating the model's potential with larger datasets is crucial for understanding its full capabilities. Although Section 5.2 provides an analysis of data size, it is limited to the amount of data from LibriSpeech and still falls short of recent studies in terms of data volume and data distribution (data from various sources such as GigaSpeech, ...).
- Baselines
- Given the smaller dataset size compared to recent zero-shot approaches, it would be helpful to know the relative competitiveness of your current model (trained solely on LibriSpeech) against recent personalization technologies. Even if the performance falls short compared to models trained on over 10,000 hours of data, if the gap is not significant, this could indicate an important direction of research focused on efficient personalized modeling with less data. Although recent personalized speech synthesis trends show a lack of open-source models, it might still be possible to use samples from their demo pages to compare metrics like SIM, QMOS, and SMOS.
Limitations
They addressed their limitations in the appendix.
Thank you for your kind response. Regarding Q2, I realize I may have misunderstood the point. Nonetheless, I appreciate the experiments you conducted to increase the data. Additionally, thank you for the comparison with current high-performing models in response to Q3. I will raise the score to 5 points.
I have an additional question: Is your methodology applicable to non-autoregressive state-of-the-art models (e.g., NaturalSpeech, Voicebox, SoundStorm, HierSpeech++, etc.)? If so, have you attempted to apply it to any of these models?
Thank you for raising the score! Regarding your new question, I believe that the SpeechAlign method can also be applied to non-autoregressive speech generation methods.
For discrete-token-based methods (such as SoundStorm), we can use the same process as in SpeechAlign to construct a preference dataset and then replace the cross-entropy loss in SoundStorm with the DPO loss. For diffusion/flow-matching-based methods (such as NaturalSpeech 2 and Voicebox), we can refer to Tango 2 [1] and SPIN-Diffusion [2], which apply DPO to diffusion models. We will extend SpeechAlign to non-autoregressive methods in the next version.
[1] Majumder, Navonil, et al. "Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization." arXiv preprint arXiv:2404.09956 (2024).
[2] Yuan, Huizhuo, et al. "Self-play fine-tuning of diffusion models for text-to-image generation." arXiv preprint arXiv:2402.10210 (2024).
We sincerely appreciate the time and effort you have dedicated to reviewing our paper. Your insights and comments have been invaluable to refining our research. Our responses are as follows:
Q1: Which attributes cause this distribution difference between golden AR tokens and synthetic AR tokens?
The distribution gap between golden AR tokens and synthetic AR tokens arises from several factors during the training of the AR model. Insufficient training, limited data availability, or constraints imposed by cross-entropy can prevent the model from accurately learning the distribution of golden AR tokens. Consequently, the model lacks the capability to generate AR tokens that fit within the golden distribution. It is important to note that this issue is unrelated to our use of SpeechTokenizer tokens.
Q2: Scalability and Dataset Size
Our baseline model, SpeechAlign-sft, was trained on the large-scale dataset Multilingual LibriSpeech, which encompasses 40,000 hours of speech data. SpeechAlign primarily enhances speech language model performance through preference optimization, which typically requires significantly less data than what is used for pretraining. To examine scalability, we conducted experiments on the GigaSpeech dataset using SpeechAlign-DPO. The results, as shown below, indicate that increasing the dataset size can lead to further performance improvements:
| Dataset Size | WER | SIM |
|---|---|---|
| 50K | 6.7 | 0.88 |
| 250K | 6.6 | 0.88 |
| 3M (10000 hours) | 5.6 | 0.91 |
Q3: Baselines
SpeechAlign is not a competing alternative to other zero-shot TTS approaches; rather, it can be applied on top of any zero-shot TTS method and brings iterative self-improvement to it. Thus, the performance of SpeechAlign is closely linked to the base zero-shot TTS approach. In our follow-up work [1], SpeechAlign was applied to Voicecraft, a state-of-the-art personalized speech synthesis method, showing promising improvements.
The comparative results with some current state-of-the-art Zero-shot TTS systems are as follows:
| Method | WER | SIM |
|---|---|---|
| UniAudio (demo) | 2.4 | 0.93 |
| Voicecraft (open-source) | 8.4 | 0.84 |
| SpeechAlign | 6.0 | 0.90 |
[1] Chen C, Hu Y, Wu W, et al. Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback. arXiv preprint arXiv:2406.00654, 2024.
The paper introduces "SpeechAlign," a method aimed at improving text-to-speech (TTS) performance by aligning speech generation with human preferences. It addresses the distribution mismatch between ground truth AR tokens and predicted AR tokens in neural codec language models. The proposed method involves preference optimization and iterative training, which has shown to enhance the performance of TTS systems.
Strengths
- Insightful Analysis: The paper provides a valuable insight into the distribution mismatch between ground truth AR tokens and predicted AR tokens, identifying it as a key issue affecting TTS performance.
- Effective Methodology: The proposed preference optimization and iterative training strategies are well-reasoned and demonstrate a clear improvement in TTS performance.
- Comprehensive Evaluation: The experimental results are extensive and include both subjective and objective evaluations, providing strong evidence for the effectiveness of the proposed method.
Weaknesses
- Performance Gap: Despite the improvements, the performance of the proposed method still falls short of the ground truth.
- Limited Scope: The method is specifically tailored to the VALL-E based TTS model and does not present a general framework for other types of speech language models.
Questions
- Hyperparameter Tuning: Did the authors comprehensively tune the hyperparameters of the baseline methods, including exploring different learning rates?
- Reward Model Accuracy: What is the accuracy of the reward model used in the preference optimization process?
- Iterative Training: Why does only DPO have iterative training, while SpeechAlign-RLHF-PPO does not? What justifies this choice?
- Generalizability: Can the method generalize to other single-stage codec-based TTS systems like Voicecraft, where there is no distribution mismatch between AR and NAR models, and a single AR model generates all levels of speech tokens?
Limitations
- The method is more of a specific technique to alleviate the domain mismatch or error propagation of the pipeline codec-based TTS model.
- The method relies on ground truth tokens as the chosen samples in the preference data, potentially overlooking other high-quality token sequences that can also serve as the chosen samples.
We sincerely appreciate the time and effort you have dedicated to reviewing our paper. Your insights and comments have been invaluable to refining our research. Our responses are as follows:
Q1: Performance Gap: The performance of the proposed method still falls short of the ground truth.
We acknowledge that an inevitable distribution gap remains between the synthetic AR tokens generated by our model and the ground truth AR tokens, as illustrated in Figure 1(b). Although SpeechAlign brings the distribution of synthetic AR tokens closer to that of the ground truth AR tokens, the two do not completely overlap, and this remaining gap accounts for the performance gap between the proposed method and the ground truth. We conclude in Section 5.1 that iterative self-improvement has an upper bound; we believe the reason is that, as the number of iterations increases, the negative samples become so high in quality and so similar to the positive samples that reward modeling becomes harder and the reward model's performance degrades. However, we believe that by increasing the reward model's parameter count and training data during the iterative PPO process, the reward model can maintain strong capabilities. Consequently, the model's performance can continue to improve and approach the ground truth after preference optimization.
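To summarize the loop referenced here, below is a hypothetical outline of the iterative self-improvement procedure. The `build_pairs` and `dpo_step` callables stand in for the preference-dataset construction and DPO training routines; they are assumptions for illustration, not the paper's API.

```python
def iterative_self_improvement(model, build_pairs, dpo_step, n_iters=3):
    # After each round of preference optimization, the updated model
    # regenerates the rejected samples, so the preference pairs become
    # progressively harder; the golden (chosen) side stays fixed.
    for _ in range(n_iters):
        pairs = build_pairs(model)      # chosen = golden tokens, rejected = current model output
        model = dpo_step(model, pairs)  # one round of preference optimization
    return model
```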
Q2: Limited Scope: The method is specifically tailored to the VALL-E based TTS model.
The core of SpeechAlign lies in utilizing ground truth samples and synthetic samples to construct a preference dataset, which, combined with preference optimization, allows the model to continuously self-improve. This approach is not limited to the VALL-E based TTS model, which comprises AR and NAR models. In [1], SpeechAlign was applied to an AR-only TTS model under the Voicecraft framework, resulting in promising performance improvements. This demonstrates that SpeechAlign is not exclusively tailored to the VALL-E based TTS model.
Q3: Hyperparameter Tuning:
We tuned the hyperparameters of all baseline systems to ensure that the results reported in the paper represent their optimal performance.
Q4: Reward Model Accuracy
We selected 200 speech samples from the VCTK test set to measure the reward model's accuracy, which resulted in an accuracy of 0.87.
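For reference, pairwise reward-model accuracy can be computed as the fraction of held-out preference pairs on which the chosen (golden) sequence receives a higher score than the rejected (synthetic) one. A minimal sketch with an assumed `score` interface and the `PreferencePair`-style fields used above:

```python
def reward_model_accuracy(pairs, reward_model):
    # `pairs` holds prompt / chosen / rejected token sequences; `score` is
    # an assumed reward-model interface returning a scalar per sequence.
    correct = sum(
        reward_model.score(p.prompt_ids, p.chosen_ids)
        > reward_model.score(p.prompt_ids, p.rejected_ids)
        for p in pairs
    )
    return correct / len(pairs)
```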
Q5: Iterative Training: Why does only DPO have iterative training?
PPO training exhibits instability and requires considerable time and computational resources to succeed. Therefore, we only conducted iterative optimization on DPO. We plan to include iterative PPO experiments in the camera-ready version if our paper is accepted.
Q6: Generalizability: Can the method generalize to other single-stage codec-based TTS systems like Voicecraft?
In our follow-up work [1], SpeechAlign was applied to Voicecraft, demonstrating improvements in zero-shot TTS. The results are as follows:
| Method | WER | SIM |
|---|---|---|
| VoiceCraft | 8.4 | 0.84 |
| VoiceCraft + SpeechAlign-DPO | 7.2 | 0.91 |
Q7: The method is more like a specific method to alleviate the domain mismatch or error propagation of the pipeline codec-based TTS model.
The answer is the same as that of Q2.
Q8: Overlooking other high-quality token sequences that could also serve as chosen samples.
SpeechAlign focuses on exploring methods to make speech language models self-improve iteratively without external data, so we don't use other high-quality token sequences in our paper. But the framework is compatible with other high-quality token sequences.
[1] Chen C, Hu Y, Wu W, et al. Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback. arXiv preprint arXiv:2406.00654, 2024.
Hello, Reviewer. The author has submitted a response to your comments. Whether or not it addresses your concerns, it would be greatly appreciated if you could acknowledge that you have reviewed the reply.
Thanks for the response. The authors address most of the questions, and the follow-up work [1] on Voicecraft does convince me that the method can be generalized to other codec-based TTS systems. I raised my score from 6 to 7.
The paper introduces "SpeechAlign," a method designed to enhance spoken LM performance by aligning speech generation with human preferences. The performance improvement results from addressing the distribution mismatch between ground truth AR tokens and predicted AR tokens in spoken LMs, thereby improving the quality of synthesized speech. The paper does not have significant drawbacks.
Two reviewers raised concerns about the paper's motivation, specifically regarding the focus on addressing the distribution mismatch between ground truth AR tokens and predicted AR tokens. The authors' rebuttal addressed these concerns.