No rating data available
ICLR 2025
RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis
TL;DR
We propose RALL-E, a robust codec language modeling method for text-to-speech (TTS) synthesis that uses chain-of-thought prosody prompts and duration-guided masking to improve robustness.
Abstract
We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis.
While previous codec language modeling methods have demonstrated impressive performance in zero-shot TTS, they often struggle with robustness issues, such as unstable prosody (irregular pitch and rhythm/duration) and high word error rates (WER), largely due to their autoregressive prediction style.
RALL-E addresses these issues through chain-of-thought (CoT) prompting, which breaks the task into simpler steps to improve the stability of TTS.
First, RALL-E predicts prosody tokens (pitch and duration) from the input text and uses them as intermediate conditions to guide the prediction of speech tokens in a CoT manner.
Second, RALL-E uses the predicted duration prompt to guide the computation of self-attention weights in the Transformer, forcing the model to focus on the corresponding phonemes and prosody tokens during speech token prediction.
Comprehensive objective and subjective evaluations show that RALL-E significantly improves robustness in zero-shot TTS compared to the baseline method VALL-E, reducing WER from $5.6\%$ to $2.5\%$ without reranking, and from $1.7\%$ to $1.0\%$ with reranking.
Furthermore, RALL-E outperforms several prior approaches aimed at improving the robustness of codec language models, and successfully synthesizes challenging sentences that VALL-E struggles with, lowering the error rate from $68\%$ to $4\%$.
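The duration-guided masking described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact masking rule, so everything below is an assumption made for illustration: the function names `duration_guided_mask` and `masked_softmax` are hypothetical, the `window` parameter (how many neighboring phonemes a frame may see) is not taken from the paper, and only phoneme positions are masked here, whereas the paper also mentions prosody tokens.

```python
import math

def duration_guided_mask(durations, window=1):
    """Build a frame-by-phoneme attention mask from per-phoneme durations.

    durations[p] is the number of speech frames assigned to phoneme p.
    mask[t][p] is True when frame t may attend to phoneme p.
    `window` is an assumed hyperparameter, not a value from the paper.
    """
    # Phoneme index aligned to each frame, e.g. [2, 1, 3] -> [0, 0, 1, 2, 2, 2]
    frame_to_phoneme = [p for p, d in enumerate(durations) for _ in range(d)]
    num_phonemes = len(durations)
    # A frame sees its aligned phoneme and neighbors within `window` positions.
    return [
        [abs(aligned - p) <= window for p in range(num_phonemes)]
        for aligned in frame_to_phoneme
    ]

def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores; masked positions get weight 0."""
    kept = [s for s, m in zip(scores, mask_row) if m]
    top = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

# Three phonemes lasting 2, 1, and 3 frames respectively (made-up durations).
mask = duration_guided_mask([2, 1, 3], window=1)
weights = masked_softmax([0.0, 0.0, 0.0], mask[0])
```

In this sketch the first frame is aligned to the first phoneme, so it can attend only to the first two phonemes and its attention weight on the third is exactly zero. A larger `window` loosens the constraint toward ordinary unmasked attention.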
Keywords
robust text-to-speech synthesis, codec language models, chain-of-thought prompting
Reviews and Discussion
Program Chairs: Desk Rejection
Desk rejection reason
This paper is desk rejected because the GitHub URL linked in the introduction (https://ralle-demo.github.io/RALL-E/) reveals the authors' identity. This breaks double-blind review.