TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation
Abstract
Reviews and Discussion
This paper presents an end-to-end framework for speech-to-speech translation that preserves speaker and voice characteristics while leveraging unsupervised training. The authors also refine the speech tokenizer by distilling semantic information and enhance the sampling mechanism to support textless NAR acoustic modeling.
Strengths
The authors have made significant improvements throughout the S2ST pipeline, resulting in a solid contribution. They have also conducted experiments with various datasets and compared their results to those of capable models such as Seamless for S2ST and Encodec for Codec, achieving reasonable improvements. Although some of the experimental results are mixed, given the limited computational resources (only 32 GPUs) and datasets (only 5k hours), the improvements are still impressive and demonstrate the effectiveness of the proposed methods.
Weaknesses
- The paper is well-written and informative, but it covers a wide range of topics, which can make it overwhelming to read. The introduction and subsequent sections could be restructured to better emphasize the different contributions. For example, the introduction covers four bullet points, where the first two are modeling designs for S2ST and the last two are codec related. Then in the method section (Sec. 3), everything is presented together. It would be helpful to dissect Section 3 into two main sections and add pointers from the introduction to improve clarity. Additionally, consider adding visualizations for the distillation strategy in the main content to further illustrate the proposed methods.
- My major concern with the proposed framework is the inference speed. It would be helpful to include an analysis of the inference speed of the proposed architecture, as the use of an autoregressive decoder for predicting codec tokens may significantly slow down the process, even with deduplicated units.
Questions
- Following up on my second point in weaknesses, could you provide some comparisons with the UnitY system or others in terms of inference speed? Additionally, there are several direct S2ST translation works [1,2,3] that utilize non-autoregressive modeling on the CVSS dataset, and it might be worth comparing with such methods.
- I am a little confused about Figure 2's target clip encoding. What is the motivation for using that representation for the [sep] token? From your description (lines 116-120), I do not find any mention of the use of the target speech clip. Can you explain why the pooled representation of that clip is used and what happens during inference when such clips do not exist?
[1] Lee et al. (2022). Direct speech-to-speech translation with discrete units. [2] Huang et al. (2023). TranSpeech: Speech-to-speech translation with bilateral perturbation. [3] Tan et al. (2024). DiffNorm: Self-supervised normalization for non-autoregressive speech-to-speech translation.
Limitations
No concerns on the limitations.
We sincerely appreciate your efforts in reviewing our paper and providing us with valuable, constructive feedback. The detailed responses are listed below.
R1. About presentation (Weakness 1)
We will reorganize the paper, particularly by dividing section 3 into two main subsections and adding pointers from the introduction to improve clarity. With an additional page allowed in the main content of the final version, we will also consider adding visualizations to further illustrate the proposed methods.
R2. About inference speed (Weakness 2, Question 1)
Your concern about inference speed is valid. Generally, the inference speed of AR models is slower than that of non-AR models. To address this, we have conducted an analysis of inference speed. The RTF (real-time factor) of our model architecture is close to 1, while that of SeamlessExpressive (with NAR T2U modeling) is 0.3, as measured on an Nvidia A6000 GPU with fp16. We believe that the inference speed could be improved by 2 to 4 times by leveraging Grouped Code Modeling, as proposed in VALL-E 2 (https://arxiv.org/pdf/2406.05370). Our current application scenario is video dubbing, which can be done in the cloud and offline. Therefore, inference speed has not been thoroughly investigated in this study.
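For reference, RTF can be measured along the following lines (a minimal sketch assuming a generic `model.generate` interface; this is not the paper's actual measurement code):

```python
import time

def real_time_factor(model, source_audio, sample_rate):
    """RTF = wall-clock generation time / duration of the generated audio.
    Values below 1 mean faster than real time."""
    start = time.perf_counter()
    translated = model.generate(source_audio)   # hypothetical S2ST inference call
    elapsed = time.perf_counter() - start
    duration = len(translated) / sample_rate    # assumes a 1-D output waveform
    return elapsed / duration
```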
R3. Clarifying target clip encoding (Question 2)
The pooled representation of the target clip serves as the acoustic prompt that the generation of the first-layer codec tokens is conditioned on. It is similar to the acoustic prompt in VALL-E, except that we use only one token instead of a token sequence for acoustic prompting. The benefit of using a single token is that it creates an information bottleneck, preventing too much semantic information from passing through. This way, we hope the model can learn purely acoustic/speaker-related information. We will clarify this in the revised version.
In response to your question about what happens during inference when such clips do not exist, our ablation study results in Table 3 indicate that the speaker similarity suffers. Without this prompt, the first-layer codec produces a neutral voice, and the NAR model struggles to convert it to the target voice.
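For illustration, here is a minimal PyTorch sketch of how a single pooled prompt token can be formed; the module and variable names are our own assumptions, not the actual implementation:

```python
import torch
import torch.nn as nn

class SingleTokenAcousticPrompt(nn.Module):
    """Pools a reference clip's frame features into one embedding.
    The pooling acts as an information bottleneck: a single vector can
    carry speaker/acoustic traits but little semantic content."""
    def __init__(self, feat_dim: int, model_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, model_dim)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (batch, frames, feat_dim) from the acoustic encoder
        pooled = clip_feats.sum(dim=1)           # sum pooling over time
        return self.proj(pooled).unsqueeze(1)    # (batch, 1, model_dim)

# Usage: prepend this single token (the [sep]/prompt position) to the decoder
# input so that first-layer codec generation is conditioned on it.
```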
Thanks to the authors for clarifying a few points mentioned in my review and providing additional data points on the inference speed. I will maintain my score and evaluation.
This paper proposes an end-to-end speech translation framework that adds several improvements to existing textless encoder-decoder speech translation architectures.
Strengths
- There are several interesting ideas in this paper, such as the isochrony embedding and layer beam search.
- The proposed method outperforms one of the previous SOTA systems on En-Fr translation tasks.
Weaknesses
- The writing needs to be improved. There are some typos here and there (e.g., line 280, page 7), and the clarity of the writing can be improved.
- There is a lack of ablation studies investigating how much each design choice affects model performance.
- The evaluation has only been performed on translation tasks between En and Fr. It is not clear whether the model will perform well on other language pairs, especially between English and non-European languages.
- For isochronic translation, the model can either translate as usual and then adjust the speech rate to fit the timing boxes, or it can translate more intelligently so that the output reflects the isochrony constraint. I could not find any discussion of this aspect in the paper.
Questions
See weaknesses.
Limitations
Adequately discussed.
We sincerely appreciate your efforts in reviewing our paper and providing us with valuable, constructive feedback. The detailed responses are listed below.
R1. About improving writing and clarity (Weakness 1)
We will reorganize the paper, include ablations in the main section, revise the unclear writing, and conduct thorough proofreading. These revisions will appear in the updated version.
R2. About the limited ablation studies (Weakness 2)
We have presented ablation studies, such as with and without acoustic embedding, the choice of codec, with and without text BPE, and Layer Beam Search in the NAR acoustic model. Due to the limited space in the main content, we had to present these ablation studies in the appendix. We will also add an ablation study related to isochrony preservation in the revised version. In this study, we compare our model with the following three baselines and report ASR-BLEU, SLC_p (Speech Length Compliant, as defined in the paper), and the overlap ratio (i.e., speech overlap between the reference and the hypothesis) as follows.
| System | ASR-BLEU | Overlap | SLC_0.2 | SLC_0.4 |
|---|---|---|---|---|
| No IC | 30.81 | 0.689 | 0.63 | 0.87 |
| Dec IC | 30.51 | 0.748 | 0.75 | 0.90 |
| Dec IC + FPI | 30.45 | 0.766 | 0.77 | 0.91 |
| Enc IC (Proposed) | 30.62 | 0.784 | 0.82 | 0.95 |
where
- No isochrony control (No IC).
- Isochrony control on the decoder (Dec IC). This involves adding the isochrony embedding to the input of the decoder as another positional embedding. We implemented the method from ref [1] in our system.
- Isochrony control on the decoder with future pause information (Dec IC + FPI). This is an improvement over Dec IC. In addition to the distance to the global end and VAD information, two extra pieces of information are encoded: the distance to the next pause and the number of pauses in the future. We implemented the method from ref [2] in our system (an illustrative sketch of these factors is given after the references below).
Ref: [1] Y. Wu, et al. “VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing,” AAAI, 2023.
Ref: [2] P. Pal, et al. “Improving Isochronous Machine Translation with Target Factors and Auxiliary Counters”, Interspeech, 2023.
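For concreteness, here is a small illustrative sketch of the kind of per-token isochrony factors described above for Dec IC + FPI; this is our own reconstruction for exposition, and the function name and feature layout are assumptions rather than the actual code from refs [1] and [2]:

```python
def isochrony_factors(durations, is_pause):
    """durations: per-token duration in frames; is_pause: per-token VAD flag
    (True for pause tokens). Returns, for each position: remaining duration to
    the global end, the VAD flag, distance to the next pause, and the number
    of pauses still to come."""
    n = len(durations)
    total = sum(durations)
    factors, elapsed = [], 0
    for i in range(n):
        remaining = total - elapsed
        future_pauses = [j for j in range(i + 1, n) if is_pause[j]]
        dist_next_pause = (sum(durations[i:future_pauses[0]])
                           if future_pauses else remaining)
        factors.append({
            "remaining_to_end": remaining,
            "is_pause": is_pause[i],
            "dist_to_next_pause": dist_next_pause,
            "num_future_pauses": len(future_pauses),
        })
        elapsed += durations[i]
    return factors
```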
Please let us know if more ablation studies should be added.
R3. About evaluation on more language pairs (Weakness 3)
We will validate the method with an additional language pair in future work. Our model is built on the multilingual Seamless model, and we believe the proposed methods can be extended to other language pairs, including those between English and non-European languages. The main hurdle to such an extension is the availability of publicly accessible data.
R4. About the discussion on isochronic translation (Weakness 4)
We will add discussions regarding isochronic translation as follows in the revised version.
The conventional method for isochronic translation involves first translating as usual and then adjusting the speech rate to match the length of the source speech. This approach ensures that the translation quality is not compromised by isochronic control. However, for long videos with multiple utterances, an inconsistent speaking rate can significantly affect the naturalness of the translated speech.
We aim to use isochrony control to translate optimally by considering both the timing boxes and the speech rate in real application scenarios; both should align with the source speech. In our proposed method, the generation of both text and speech is conditioned on global isochrony information. Our experimental results also show that this approach improves the ASR-BLEU score compared to isochrony control on the decoder, meaning the model is more confident and accurate during generation and makes fewer errors such as repetition and truncation.
Finally, we would like to express our gratitude once again for your time and effort in reviewing our paper. Considering the interesting ideas, SOTA performance, adequate ablation studies, and improved presentation of our paper, we would greatly appreciate it if you could consider increasing your score.
Thanks for the rebuttal. I would suggest putting at least part of the ablation studies in the main paper, as they are very important. I am a bit reluctant to raise my score now since a lot has been promised in the "updated version" of the paper which I haven't seen yet. I will consider raising the score once I read it.
Dear Reviewer ujH4,
Thank you for your suggestions. We appreciate that you are considering raising your score. However, due to the review rules, we cannot send you a revised PDF through the system. We also fully understand your reluctance. Let's try our best to include the revision within this 5,000-character comment box. Otherwise, we will ask the AC how we can send you an updated version anonymously.
With an additional page allowed in the main content of the final version, we will include the following on that page.
Ablation Studies
1. Acoustic Embedding
We compared the inference with and without acoustic embedding and presented the results in Table 3. Without the acoustic embedding, speaker similarity scores decreased by 0.040 in French-English (Fr-En) translations and by 0.032 in English-French (En-Fr) translations. Additionally, there was a slight decline in the AutoPCP scores.
Table 3. Ablation on the acoustic embedding in the joint translation model.
| System | ASR-BLEU | BLEU | SIM | A.PCP | Nat. |
|---|---|---|---|---|---|
| TransVIP Fr-En | 32.60 | 35.34 | 0.320 | 2.49 | 3.19 |
| - A.Embed | 32.47 | 35.18 | 0.280 | 2.45 | 3.23 |
| TransVIP En-Fr | 27.28 | 33.02 | 0.395 | 2.67 | 3.40 |
| - A.Embed | 26.84 | 33.15 | 0.362 | 2.45 | 3.46 |
2. Choice of Codec
We compared training TransVIP using different codecs: SpeechTokenizer [9] and our SASC. In this study, the joint translation model was trained with a subset containing only CVSS-T Fr-En uni-directional data. For the NAR acoustic model, SASC uses 16 codec layers, while SpeechTokenizer uses 8 layers, as it only has an 8-layer version. With both codecs, we trained a full system (AR+NAR) with CVSS-T data only. The results are presented in Table 4. Compared to SpeechTokenizer, the model trained with SASC exhibits superior performance in all aspects. Most notably, the speaker similarity improved by 0.04, from 0.226 to 0.264, aligning with the improvement in codec re-synthesis results.
Table 4. Ablation on the choice of codec
| Codec Model | ASR-BLEU | BLEU | SIM | SLC_0.2 | SLC_0.4 | Nat. |
|---|---|---|---|---|---|---|
| SpeechTokenizer | 29.81 | 34.18 | 0.226 | 0.76 | 0.93 | 3.02 |
| SASC | 30.62 | 34.30 | 0.264 | 0.82 | 0.95 | 3.09 |
3. NAR Acoustic Model
We conducted two comparisons. First, we compared the performance of the NAR acoustic model with and without text input, i.e., using BPE as input. Second, we assessed the inference results with and without the Layer Beam Search (LBS) algorithm to determine its impact on performance. The results are presented in Table 5, which indicates that the textless model consistently outperforms the model with text input across all metrics: ASR-BLEU, speaker similarity, and naturalness. Moreover, employing LBS yields superior results compared to greedy decoding.
Table 5. Ablation on the BPE and Layer Beam Search
| NAR Model | ASR-BLEU | SIM | Nat. |
|---|---|---|---|
| NAR w/o text | 32.60 | 0.320 | 3.19 |
| - LBS | 32.30 | 0.309 | 3.17 |
| NAR w/ text | 31.52 | 0.307 | 3.10 |
| - LBS | 31.03 | 0.298 | 3.09 |
4. Isochrony Control
We compared our proposed isochrony control method against no control and other control strategies, and present the results in Table 6, which demonstrates that our approach achieves the best performance in terms of BLEU score and the isochrony evaluation metrics.
Table 6. Ablation on the isochrony control strategy
| System | BLEU | Overlap | SLC_0.2 | SLC_0.4 |
|---|---|---|---|---|
| No IC | 30.81 | 0.689 | 0.63 | 0.87 |
| Dec IC | 30.51 | 0.748 | 0.75 | 0.90 |
| Dec IC + FPI | 30.45 | 0.766 | 0.77 | 0.91 |
| Enc IC (Proposed) | 30.62 | 0.784 | 0.82 | 0.95 |
where
a. No Isochrony control (No IC).
b. Isochrony control on the decoder (Dec IC). This involves adding the isochrony embedding to the input of the decoder as another positional embedding.
c. Isochrony control on the decoder with future pause information (Dec IC + FPI). This is an improvement over (b). In addition to the distance to the global end and VAD information, two extra pieces of information are encoded: the distance to the next pause and the number of pauses in the future.
Furthermore, we have made several improvements to the paper. We have employed a professional proofreading service to fix typos and improve the writing. We have added a discussion on isochronic translation, as shown in our previous response. We have also rewritten several paragraphs, such as the one on the acoustic encoder, to make the design easier to understand and the structure clearer.
Please let us know if you have any further concerns.
best regards,
Authors
Thanks for the material. I decided to raise my score by 1.
Dear Reviewer ujH4,
We have checked with the AC, and there isn't a way to send an updated paper. Therefore, we can only use the official comment box for any updates. In fact, most of the ablation studies were already included in the appendix of the submitted version (please refer to pages 14-16). We have revised them and moved them to the main content in the updated version.
Further suggestions and concerns are welcome. Each time, we can leverage this 5,000-character comment box to address them. Thanks!
Best,
Authors
This paper proposes TransVIP, a speech-to-speech translation model with voice and isochrony preservation, i.e., pauses and segment durations are preserved between the source and the target, for example for automatic dubbing applications. The proposed model architecture is modular, with multiple encoders for semantic, acoustic, and isochrony information, an intermediate text output, and a non-autoregressive acoustic model that generates a sequence of codes which are then decoded to produce a waveform. Contributions include the modular architecture trained end to end and the model's capabilities, in particular isochrony preservation. Experiments show that the proposed approach is either competitive with or outperforms a strong baseline (SeamlessExpressive) on translation quality and speaker and prosody similarity, while substantially improving isochrony preservation.
Strengths
- This is important research work with important applications such as automatic dubbing, especially since most of the research on speech-to-speech translation does not emphasize isochrony preservation.
- The proposed architecture is novel
- The empirical results are positive and compare to a strong baseline
Weaknesses
- The empirical evaluation could be improved: validate the method in one more language pair and optionally compare to cascaded solutions.
- In terms of presentation, the paper could be more self-contained and include ablations in the main part of the paper rather than the appendix. The paper could also benefit from proofreading (there are quite a few typos).
Questions
“Acoustic information (A)”: the title only talks about voice preservation, but the evaluation also measures prosody preservation. Could the authors clarify which components of the model are specifically designed to preserve prosody?
Limitations
The authors acknowledge that only the French-English pair is involved but we still consider this a weakness for a translation related paper.
We sincerely appreciate your efforts in reviewing our paper and providing us with valuable, constructive feedback. The detailed responses are listed below.
R1. About empirical evaluation (Weakness 1)
We will validate the method with one more language pair in future work. Additionally, to compare with cascaded solutions, we will add the results of cascaded ST + TTS systems as follows:
| System | ASR-BLEU | BLEU | SIM | AutoPCP | Rate | Pause | SLC_0.2 | SLC_0.4 | Nat. |
|---|---|---|---|---|---|---|---|---|---|
| ST + StyleTTS (Fr->En) | 33.57 | 34.58 | 0.173 | 2.74 | 0.33 | 0.51 | 0.56 | 0.85 | 3.25 |
| TransVIP (Fr->En) | 32.60 | 35.34 | 0.320 | 2.49 | 0.55 | 0.44 | 0.70 | 0.91 | 3.19 |
| ST + VALLE-X (En->Fr) | 22.50 | 34.89 | 0.418 | 2.87 | 0.27 | 0.54 | 0.65 | 0.89 | 3.32 |
| TransVIP (En->Fr) | 27.28 | 33.02 | 0.395 | 2.67 | 0.45 | 0.65 | 0.81 | 0.99 | 3.40 |
where
- ST is the SeamlessExpressive speech-to-text translation model.
- We evaluated the cascaded systems using TTS models from StyleTTS (open-sourced) and VALL-E X (our implementation) and reported the better one in terms of objective measurements.
R2. About presentation (Weakness 2)
We will reorganize the paper, include ablations in the main section, and conduct thorough proofreading. These revisions will appear in the updated version.
R3. About prosody preservation (Question 1)
Our framework has not been explicitly designed to preserve prosody. Therefore, prosody is not purposefully maintained but is preserved alongside the voice feature. We have kept the metric in the paper to provide a comprehensive comparison with Seamless. Recently, we have observed an increase in the use of explicit prosody modules in zero-shot TTS. We may consider adding one in future work.
Dear Authors,
Thank you for the additional experiments! I rechecked my review scores which were already quite high so I'm not planning to modify them. For completeness, it may be good to include both ST + StyleTTS and ST + VALLE-X for both directions. In an updated version, it would also be interesting to discuss the strengths and weaknesses of both systems since the proposed approach is not outperforming the baseline in all categories. Can you clarify why the trend for BLEU is the reverse of the trend for ASR-BLEU?
Best,
--Reviewer kQBQ
Dear reviewer kQBQ,
Thanks again for your appreciation! As for the cascaded system results, we currently do not have a TTS system that performs well in both English and French. Our version of VALL-E X does not perform well in English, and StyleTTS is only capable of English, so we had to use two separate models for the two directions. By the time of the final version, we will probably have an updated version of VALL-E X and report its performance in both directions.
The different models' strengths and weaknesses are good points for discussion. StyleTTS is trained on the clean but small LibriTTS dataset, so its audio is clean and its ASR-BLEU is high, but speaker similarity is poor. On the other hand, VALL-E X is trained on large amounts of real data; therefore its speaker and prosody similarity is high, but its ASR-BLEU and noise resistance are poor (the performance drops when the input prompt is noisy). Our system reaches a balance between similarity and ASR-BLEU, surpassing the Seamless baseline in most metrics with limited data.
I think this can also explain the reversed trend in BLEU and ASR-BLEU. The margin between BLEU and ASR-BLEU reflects how accurately the model pronounces the words. StyleTTS is an accurate baseline while VALL-E X is not as accurate, resulting in the reversed trend.
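For context, ASR-BLEU is typically computed along these lines (a minimal sketch assuming Whisper as the ASR model and sacrebleu for scoring; the paper's actual evaluation pipeline may differ):

```python
import whisper
import sacrebleu

def asr_bleu(audio_paths, reference_texts):
    """Transcribe generated speech with an ASR model, then score the transcripts
    against the reference translations with BLEU. The gap to text BLEU reflects
    how intelligibly the synthesized speech pronounces the words."""
    asr = whisper.load_model("base")
    hypotheses = [asr.transcribe(path)["text"].strip() for path in audio_paths]
    return sacrebleu.corpus_bleu(hypotheses, [reference_texts]).score
```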
I hope this solves your puzzle and thanks again for your review.
best regards,
Authors
The paper introduces TransVIP, a novel speech-to-speech translation system designed to maintain both the speaker's voice characteristics and isochrony during the translation process. TransVIP simplifies the complex task of speech-to-speech translation (S2ST) by breaking it down into two sequential subtasks while retaining an end-to-end framework. It conditions the generation of the target speech not just on semantic information, but also on isochrony and acoustic details extracted from the source speech. The paper demonstrates the effectiveness of TransVIP through experiments on French-English translation, showing superior performance compared to state-of-the-art models.
Strengths
- The motivation is interesting. Recent studies have paid attention to voice preservation during speech-to-speech translation (S2ST), and this paper further proposes to preserve isochrony information for an ideal speech translation effect. This may provide useful insights for future work.
- The paper decouples the S2ST model into multiple modules and offers several significant innovations.
- The proposed method achieves new state-of-the-art performance.
Weaknesses
- The title may not be fully representative, as only a part of the innovation focuses on voice and isochrony information preservation. As shown in Section 3, only the first subsection is closely related to the title. While the latter subsections present good innovations, the overall relevance among them could be strengthened.
- The experimental comparison is limited. This paper focuses on voice and isochrony preservation, but does not provide any comparison with related work, like VALL-E X and PolyVoice. As Seamless does not consider voice information, the real advantage of voice preservation is unclear.
- This paper proposes multiple innovations with many techniques. However, there are few ablation studies to analyze the individual components. There is also a lack of in-depth analysis on the design choices for voice and isochrony preservation.
Questions
The paper is difficult to read, as some technical details are missing, making it challenging to fully understand the design. For example, it is unclear how the acoustic encoder learned from scratch can be expected to extract the desired acoustic information.
Limitations
Not applicable
We sincerely appreciate your efforts in reviewing our paper and providing us with valuable, constructive feedback. The detailed responses are listed below.
R1. About the title not being fully representative (Weakness 1)
Thank you for acknowledging the numerous innovations presented in our paper. We will refine the title or add a subtitle to encompass as many aspects as possible.
R2. About limited experimental comparisons (Weakness 2)
We are always eager to compare our work with related research to verify the effectiveness of our proposed methods. However, most works, such as PolyVoice and MSLM-S2ST, have neither open-sourced models/code nor matched language pairs for investigation. As we know, S2ST is a complicated system, and reproducing others' entire systems is both challenging and unaffordable. Moreover, we need to point out that Seamless does consider voice information. According to its technical report (https://arxiv.org/pdf/2312.05187), they explicitly encode the speaker and prosody information into the speech generation process. Additionally, the Seamless team leveraged far more data for training than we did. Therefore, we already compare against a very strong baseline.
We will also add the following results from a cascaded system (ST+TTS) as a comparison to our model.
| System | ASR-BLEU | BLEU | SIM | AutoPCP | Rate | Pause | SLC_0.2 | SLC_0.4 | Nat. |
|---|---|---|---|---|---|---|---|---|---|
| ST + StyleTTS (Fr->En) | 33.57 | 34.58 | 0.173 | 2.74 | 0.33 | 0.51 | 0.56 | 0.85 | 3.25 |
| TransVIP (Fr->En) | 32.60 | 35.34 | 0.320 | 2.49 | 0.55 | 0.44 | 0.70 | 0.91 | 3.19 |
| ST + VALLE-X (En->Fr) | 22.50 | 34.89 | 0.418 | 2.87 | 0.27 | 0.54 | 0.65 | 0.89 | 3.32 |
| TransVIP (En->Fr) | 27.28 | 33.02 | 0.395 | 2.67 | 0.45 | 0.65 | 0.81 | 0.99 | 3.40 |
where
- ST is the Seamless speech-to-text translation model.
- We evaluated the cascaded systems using TTS models from StyleTTS (open-sourced) and VALL-E X (our implementation) and reported the better one in terms of objective measurements.
R3. About the limited ablation studies and in-depth analysis (Weakness 3)
We have presented ablation studies, such as with and without acoustic embedding, the choice of codec, with and without text BPE, and Layer Beam Search in the NAR acoustic model. Due to the limited space in the main content, we had to present these ablation studies in the appendix. We will also include an ablation study related to isochrony preservation in the revised version. In this study, we compare our model with the following three baselines and report ASR-BLEU, SLC_p (Speech Length Compliant, as defined in the paper), and the overlap ratio (i.e., speech overlap between the reference and the hypothesis) as follows.
| System | ASR-BLEU | Overlap | SLC_0.2 | SLC_0.4 |
|---|---|---|---|---|
| No IC | 30.81 | 0.689 | 0.63 | 0.87 |
| Dec IC | 30.51 | 0.748 | 0.75 | 0.90 |
| Dec IC + FPI | 30.45 | 0.766 | 0.77 | 0.91 |
| Enc IC (Proposed) | 30.62 | 0.784 | 0.82 | 0.95 |
where
- No isochrony control (No IC).
- Isochrony control on the decoder (Dec IC). This involves adding the isochrony embedding to the input of the decoder as another positional embedding. We implemented the method from ref [1] in our system.
- Isochrony control on the decoder with future pause information (Dec IC + FPI). This is an improvement over Dec IC. In addition to the distance to the global end and VAD information, two extra pieces of information are encoded: the distance to the next pause and the number of pauses in the future. We implemented the method from ref [2] in our system.
Ref: [1] Y. Wu, et al. “VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing,” AAAI, 2023.
Ref: [2] P. Pal, et al. “Improving Isochronous Machine Translation with Target Factors and Auxiliary Counters”, Interspeech, 2023.
Please let us know if more ablation studies should be added.
R4. About unclear writing (Question 1)
We will revise the paper to be as clear as possible. For example, we have refined the description as follows to address how an acoustic encoder learned from scratch can be expected to extract the desired acoustic information.
Many previous works have adopted a reversed gradient approach to remove semantic information from acoustic features. However, these approaches require an additional decoder and training objective, which increases the training burden. In our work, we use the information bottleneck to train the acoustic extractor. The sum pooling serves as the information bottleneck, preventing too much information, especially semantic information, from passing through. Additionally, we design the system to use part of the target speech as input and predict the other part, making it less likely for the acoustic encoder to learn anything semantically meaningful. This approach integrates seamlessly into the original training process, with no extra modules or loss required.
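To make this concrete, here is a minimal sketch of this training-time setup under our own assumptions (the random split, names, and shapes are illustrative, not the actual implementation):

```python
import torch

def split_target_for_acoustic_prompt(target_feats, min_frac=0.2, max_frac=0.5):
    """Randomly cut a clip out of the target utterance. The clip is sum-pooled
    into a single vector (the information bottleneck) and used as the acoustic
    prompt; the model is trained to predict the remaining part, so the prompt
    is unlikely to carry useful semantic content."""
    n_frames = target_feats.size(0)                         # (frames, feat_dim)
    clip_len = int(n_frames * torch.empty(1).uniform_(min_frac, max_frac).item())
    start = torch.randint(0, n_frames - clip_len + 1, (1,)).item()
    clip = target_feats[start:start + clip_len]             # acoustic prompt source
    rest = torch.cat([target_feats[:start], target_feats[start + clip_len:]])
    prompt = clip.sum(dim=0)                                 # single-vector bottleneck
    return prompt, rest                                      # condition on prompt, predict rest
```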
Finally, we would like to express our gratitude once again for your time and effort in reviewing our paper. Considering the multiple innovations, adequate ablation studies, added comparisons and improved presentation of our paper, we would greatly appreciate it if you could consider increasing your score.
Thank you for the detailed clarification and sorry for the late response. While the explanation has addressed most of my concerns, I still feel that the direct comparison to strong voice cloning systems is not sufficient. The voice information is not the primary focus of the Seamless system.
Additionally, I believe the title of the paper is an important consideration. The authors should choose a title that is representative of the work presented, to accurately convey the scope and focus of their research.
Overall, I will raise my score to 5, as the response has addressed the majority of my initial concerns.
Dear Reviewer WU4H
We hope we have addressed your questions. Please let us know if you have any further concerns, as the discussion between the reviewers and authors will end soon. Thanks!
Best regards,
Authors
Dear reviewer WU4H,
We appreciate your efforts in increasing the rating for our paper. Your suggestions and comments are all valid, and we will address them in either the final version or future work due to the short rebuttal period. Additionally, further suggestions and concerns are welcome until the end of the reviewer and author discussion period.
Thanks!
Authors
Dear Reviewers,
Thank you for your efforts in reviewing our paper. We greatly appreciate your acknowledgment of our contributions, including multiple innovations, state-of-the-art performance, and important research work. However, we received diverse ratings, ranging from 4 to 7. We noticed that the two reviewers who gave the lower scores both had concerns about the limited number of ablation studies.
In fact, we have presented several ablation studies, such as with and without acoustic embedding, the choice of different codecs, with and without text BPE, and Layer Beam Search in the NAR acoustic model. Due to the limited space for the main content, we had to present these ablation studies in the appendix. If this was one of the reasons for the lower scores, we sincerely hope the reviewers could adjust their scores accordingly.
Additionally, we will include new results in the revised version, including:
• A comparison with a cascaded system, i.e., ST+TTS.
• An ablation study related to isochrony preservation.
We will also reorganize the content, include the ablation studies in the main content, and conduct thorough proofreading to improve the presentation.
Please check the details in the responses to the individual reviewers.
Thanks again!
Authors
The paper describes a comprehensive end-to-end speech-to-speech architecture using the latest ideas around joint semantic-acoustic tokens and joint text encoders. Notably, the work consciously integrates isochrony (beyond simple rate adjustment), which is generally neglected in strong systems despite its importance to automated dubbing. The result improves on the Seamless work (in En-Fr, Fr-En), which is also component-wise engineered, considers voice, and uses far more data. I agree with the reviewers that there are several interesting ideas and that the system is SOTA in an overall sense; gains in isochrony metrics while preserving naturalness are particularly strong.
Some reviewers noted that the single language pair and the lack of comparison to dedicated voice-cloning systems are weaknesses (RTF is worse, but that is less relevant offline), along with a desire for more ablations (though there are many moving parts, and most were already in the appendix or rebuttal), which the authors addressed. I believe this stronger architecture will be informative and cautiously expect the methods to generalize. All four reviewers ultimately recommend some form of acceptance. I recommend acceptance with the expectation of results on an additional language pair in the camera-ready (preferably one with more linguistic divergence), and that the authors consider moving key ablations from the appendix to the main paper.