PaperHub
7.2 / 10
Poster · 4 reviewers
Min: 3 · Max: 5 · Std: 0.8
Scores: 5 / 3 / 3 / 4
ICML 2025

High-Fidelity Simultaneous Speech-To-Speech Translation

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

natural, high-quality simultaneous speech translation with voice preservation

Abstract

Keywords

audio language models, speech translation, multimodal language models, speech-to-speech

Reviews and Discussion

Official Review
Rating: 5

The paper introduces Hibiki, a decoder-only model for simultaneous speech-to-speech (S2ST) and speech-to-text (S2TT) translation. Unlike offline approaches, Hibiki translates speech in real time using a multistream language model that synchronously generates text and audio tokens. The model leverages contextual alignment to determine optimal delays for translation, improving fluency and speaker similarity. Experimental results show strong performance on French-English translation, with real-time capabilities on both GPUs and mobile devices.

Questions for the Authors

  • I want to know more about the alignment-aware TTS: how exactly is it implemented to take the alignment into account during synthesis?
  • For the text-only pretraining, I am wondering what the machine translation performance is after training. Since the model is trained from scratch, I assume it is always trained to perform the translation task? Or is it trained as a general LM and then adapted into a decoder-only MT model?

Claims and Evidence

The paper claims Hibiki achieves state-of-the-art translation quality, speaker similarity, and naturalness, supported by BLEU scores, speaker similarity metrics, and human evaluations. The contextual alignment method is validated through ablation studies, demonstrating its impact on latency-quality trade-offs.

However, the claim that Hibiki provides an optimal balance between latency and accuracy is questionable, as Seamless achieves lower latency (LAAL and End Offset). Hibiki does achieve better quality, but it may sacrifice latency to some extent. I think this is acceptable, though, because the authors add a human evaluation and show that Hibiki is preferred.

Methods and Evaluation Criteria

The paper employs standard evaluation metrics for S2ST, including BLEU for translation quality, speaker similarity (cosine similarity), and MOS for human evaluation. It also uses LAAL (Length-Adaptive Average Lagging) to measure latency, ensuring a fair comparison with existing models.
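
For readers unfamiliar with the metric, a sketch of the usual LAAL definition (following Papi et al., 2022; the notation below is assumed and is not taken from the paper under review):

```latex
% Length-Adaptive Average Lagging (LAAL), sketch of the usual definition.
% d_i  : delay (in source time) at which the i-th target token is emitted
% |X|  : total source duration
% |Y|  : hypothesis length, |Y*| : reference length
% tau  : index of the first target token emitted after the source ends
\mathrm{LAAL} \;=\; \frac{1}{\tau} \sum_{i=1}^{\tau}
  \left[\, d_i \;-\; (i-1)\,\frac{|X|}{\max\left(|Y|,\,|Y^{*}|\right)} \,\right]
```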

One issue is that the experiments are limited to French-English, making it unclear how Hibiki generalizes to other languages.

Theoretical Claims

N/A

Experimental Design and Analysis

The experimental design is generally strong, with comprehensive comparisons to Seamless and StreamSpeech. Ablation studies effectively highlight the impact of alignment strategies and classifier-free guidance. However, latency trade-offs need more discussion, as Hibiki has higher lag than Seamless. Additionally, the alignment-aware TTS system lacks detail, making it difficult to verify how timing constraints are enforced during synthesis. The missing Appendix C (mentioned in section 3.2 line 203) further limits transparency.

Supplementary Material

Yes, I saw the visualization of context-aware alignment.

Relation to Prior Work

Hibiki builds on prior work in S2ST, alignment modeling, and multistream processing. It extends Seamless (Barrault et al., 2023) and StreamSpeech (Zhang et al., 2024a) with an adaptive alignment approach and improved speaker transfer. Its multistream modeling is inspired by Moshi (Défossez et al., 2024), originally designed for full-duplex spoken dialogue.

Missing Important References

N/A

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

N/A

Author Response

We thank the reviewer for their constructive feedback.

Updates in the revised version of the paper

We first inform the reviewer that we will update the reported results of Hibiki-M in Table 2 after fixing an issue that further improves performance. Following the reviewers' suggestions and thanks to the extra page provided for the camera ready, we will revise the paper on the following aspects:

  • Clarify the model's architecture, configurations and the nature/size of the datasets used.
  • Add quality/latency trade-off curves by varying hyperparameters of our contextual alignment method.
  • Add COMET evaluation scores.
  • Extend experiments to the English->French direction.
  • Improve the references section and discuss similar works pointed out by the reviewers.

Comments

On the generalization of the contextual alignment method to many language pairs

We acknowledge that our method is only illustrated on a single language direction in the paper. However, nothing in our method is specific to the language pair, other than the fact that MADLAD -- which we used to derive contextual alignments -- performs well on these languages. We expect our method to work as strongly on pairs of languages where SOTA text translation models perform well and can thus allow us to derive reliable contextual alignments. Given the massively multilingual nature of MADLAD, or even more recent systems like GemmaX2-28-9B (Cui et al., 2025), we expect this approach to be a good candidate for scaling to many language pairs. We provide the reviewers with examples of contextual alignments for other languages. As a first step towards more language directions, we have extended our experiments to the English->French direction and provide experimental results that we will add to the revised paper.
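
As a minimal sketch of the likelihood-based idea discussed here (assumed pseudocode, not the authors' pipeline; `score_fn` stands in for scoring a candidate word with an MT model such as MADLAD under teacher forcing):

```python
# Hypothetical sketch of likelihood-based contextual alignment; not the
# authors' released code. `score_fn(src_prefix, tgt_prefix, word)` is assumed
# to return the MT model's log-probability of `word` as the next target word
# given only the source prefix and the target words emitted so far.
def contextual_alignment(score_fn, src_words, tgt_words, threshold=-1.0):
    """Return, for each target word, the length of the shortest source
    prefix after which the MT model predicts that word confidently."""
    alignment = []
    for t_idx, word in enumerate(tgt_words):
        chosen = len(src_words)  # default: wait for the full source
        for s_idx in range(1, len(src_words) + 1):
            if score_fn(src_words[:s_idx], tgt_words[:t_idx], word) > threshold:
                chosen = s_idx
                break
        alignment.append(chosen)
    # Enforce monotonicity: a word cannot be available before its predecessors.
    for i in range(1, len(alignment)):
        alignment[i] = max(alignment[i], alignment[i - 1])
    return alignment
```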

On the quality/latency trade-off

As mentioned in Section 3.2.2 (l.183), we enforce a 2s delay between words associated through contextual alignment, as we found it to provide a good balance between latency and translation quality. We acknowledge that this choice can be reconsidered and that trade-off curves would give the reader a clearer picture. We thus produced a trade-off curve by varying the delay, as requested by the reviewers. The results of this quality/latency study show that Hibiki provides an overall better trade-off than Seamless. We will add this figure to the revised version of the paper.
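
A minimal sketch (details assumed) of how such a fixed lag could turn a contextual alignment into target-side timestamps; sweeping `delay` over a range of values is one way a quality/latency trade-off curve could be produced:

```python
# Minimal sketch of scheduling target words with a fixed lag, in the spirit
# of Section 3.2.2 of the paper; implementation details are assumed.
def target_timestamps(src_word_end_times, alignment, delay=2.0):
    """alignment[i] is the 1-based source-prefix length required by the i-th
    target word (see the alignment sketch above). Each target word is
    scheduled no earlier than `delay` seconds after its aligned source word
    ends, and never before the previous target word."""
    times, prev = [], 0.0
    for s_idx in alignment:
        t = max(src_word_end_times[s_idx - 1] + delay, prev)
        times.append(t)
        prev = t
    return times
```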

On the alignment-aware TTS

We acknowledge that a non-negligible part of the technical details inherited from Défossez et al. (2024) was not exposed in our paper, as we preferred to focus on the data creation pipelines, the experimental protocol, and the results. We will take advantage of the extra page provided for the camera-ready to improve the clarity of technical details such as the alignment-aware TTS. Moreover, we would like to highlight that there is no missing appendix in our paper: we referred to Appendix C of Défossez et al. (2024) in Section 3.2, l.203. We will improve the formulation in the updated version of the paper.

As a brief summary of the explanations given in Appendix C of Défossez et al. (2024), one can force text tokens directly into the text stream of a TTS model derived from the Moshi architecture. Thanks to the contextual alignment, we can ensure that a given text token is fed at the right timestamp (i.e., not too early), using PAD tokens to delay its insertion.
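
A hedged sketch of this forcing mechanism (the PAD token and the 12.5 Hz Mimi frame rate follow Défossez et al., 2024; everything else below is assumed for illustration):

```python
# Sketch of alignment-aware TTS forcing: text tokens are teacher-forced into
# the text stream of a Moshi-style TTS at (or after) their aligned frame,
# with PAD filling the gaps so no word appears too early.
PAD = "<pad>"
FRAME_RATE = 12.5  # Mimi frames per second

def build_forced_text_stream(words_with_times, tokenize, n_frames):
    """words_with_times: list of (word, earliest_time_sec) pairs from the
    contextual alignment. Returns one text token per audio frame."""
    stream = [PAD] * n_frames
    cursor = 0
    for word, t in words_with_times:
        frame = max(cursor, int(t * FRAME_RATE))  # never insert too early
        for tok in tokenize(word):
            if frame >= n_frames:
                break
            stream[frame] = tok
            frame += 1
        cursor = frame
    return stream
```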

On text-only pretraining

The text-only pretraining phase is standard next-token prediction; there is no adaptation into an MT model during this phase. In early development, we tried alternating batches of text translation and speech translation (starting from a pretrained text model); however, while this model performed quite well on text translation, it did not yield any measurable improvement in speech translation. We hypothesize that this lack of transferability is due to the fact that MT samples were built by concatenating the source and target texts in the text stream, which is radically different from what is seen in the text stream for a speech translation sample (where the source is audio-only and the target is time-aligned). This highlights a challenge in aligning text and speech representations in speech-text LLMs so that each modality benefits the other, which we believe will be critical to extending speech translation to more language pairs, as text translation data is much more accessible than speech translation data.
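
For illustration only (both formats below are assumed reconstructions, not the authors' exact tokenization), the mismatch can be pictured as:

```python
# Text-translation sample: source and target concatenated in the text stream.
mt_text_stream = ["Bonjour", "le", "monde", "<sep>", "Hello", "world"]

# Speech-translation sample: the source is audio-only, so the text stream
# carries only the time-aligned target, mostly PAD tokens at the frame rate.
s2st_text_stream = ["<pad>"] * 20 + ["Hello"] + ["<pad>"] * 10 + ["world"]
```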

Official Review
Rating: 3

This paper introduces Hibiki, a decoder-only model for simultaneous speech-to-speech/text translation. Hibiki adapts the architecture of Moshi, a full-duplex dialogue model, to simultaneous translation by modeling source speech as user input and target speech as agent response. To train Hibiki, the authors synthesize trajectories of simultaneous translation by leveraging the log-probabilities of a pretrained machine translation model. The experimental results on the Fr-En direction of the CVSS dataset demonstrate that 1) Hibiki achieves higher speech quality and better voice transfer than strong baselines like Seamless, albeit with higher latency; 2) Hibiki enables efficient batched inference, and the distilled version can even run on a smartphone in real time.

update after rebuttal

Some of my major concerns have been addressed. However, the latency remains somewhat high, as shown in the quality-latency trade-off presented in the rebuttal. Combined with the limited coverage of language directions, I will maintain my current score.

Questions for the Authors

  1. Is there a way to adjust the latency of Hibiki at inference time? If so, does it provide a better quality-latency trade-off curve than Seamless?
  2. Is Hibiki able to generalize to unbounded speech? By unbounded speech I mean streaming speech input of infinite length.
  3. Is Hibiki still better than StreamSpeech when using only CVSS training data?

Claims and Evidence

Claim 1: The architecture of a full-duplex dialogue model Moshi can be adapted for simultaneous translation.

This claim is supported. It is natural to regard source speech as user speech input and target translation speech as agent response output. Also, the experiments show that this modeling is able to conduct simultaneous translation effectively.

Claim 2: Decoder-only architecture enables efficient inference.

This claim is also supported. Prior architectures indeed struggle with efficient batched inference due to their complex policy designs, while a decoder-only model with an implicit policy makes it much more convenient.

Methods and Evaluation Criteria

Method

  1. The method naturally adapts a dialogue model to simultaneous speech-to-speech translation.
  2. The translations and source-target alignments are both generated by the MADLAD-3B model. The authors do not provide a quality analysis here.
  3. Hibiki does not support adjusting the latency during inference, which means a separate model needs to be trained for each latency level.

Evaluation

  1. Dataset: CVSS is a common dataset for evaluating speech to speech translation.
  2. Latency metric: LAAL and Offset are commonly used metrics for latency evaluation.
  3. Translation quality metric: BLEU is a widely used metric for translation quality evaluation. However, BLEU is an n-gram-based method and is outperformed by later neural metrics like COMET and MetricX, as shown in recent WMT workshops.
  4. Speech quality metric: the human evaluation is conducted on only 30 speech samples, which may not provide enough statistical significance.

Theoretical Claims

No theoretical claims.

Experimental Design and Analysis

  1. Hibiki is trained on more, and more refined, speech data than the baseline StreamSpeech, so the comparison is not fair.
  2. The comparison in Table 2 is not that informative since the systems are not compared at the same latency. A vast literature on both simultaneous speech-to-text and text-to-text translation [e.g., 1-2] already shows that quality can be much higher when the allowed latency is higher.
  3. There are other existing ways to build source-target alignments, like the one introduced in [3], but the authors do not compare against them.
  4. The experiments only test the Fr-En direction, but simultaneous translation can behave very differently across language directions due to differences in linguistic structure. More language directions are needed.

[1] Papi, S., Turchi, M., Negri, M. (2023). AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation. Proc. Interspeech 2023, 3974-3978. doi: 10.21437/Interspeech.2023-170
[2] Donglei Yu, Xiaomian Kang, Yuchen Liu, Yu Zhou, and Chengqing Zong. 2024. Self-Modifying State Modeling for Simultaneous Machine Translation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9781-9795, Bangkok, Thailand. Association for Computational Linguistics.
[3] Wang, M., Vu, T. T., Wang, Y., Shareghi, E., & Haffari, G. (2024). Conversational SimulMT: Efficient Simultaneous Translation with Large Language Models. arXiv preprint arXiv:2402.10552.

Supplementary Material

No.

Relation to Prior Work

  1. Hibiki is one of the first decoder-only models for simultaneous speech-to-speech translation and exhibits advantages in efficient batched inference. There are similar findings in simultaneous text translation [1-2], but not in simultaneous speech-to-speech translation.
  2. Synthesizing source-target alignments is not a new idea; [1] previously proposed a word-alignment-based approach. However, the perplexity-based method introduced in this paper is new, as far as I know.

[1] Wang, M., Vu, T. T., Wang, Y., Shareghi, E., & Haffari, G. (2024). Conversational SimulMT: Efficient Simultaneous Translation with Large Language Models. arXiv preprint arXiv:2402.10552.
[2] Yu, D., Zhao, Y., Zhu, J., Xu, Y., Zhou, Y., & Zong, C. (2025). SimulPL: Aligning Human Preferences in Simultaneous Machine Translation. arXiv preprint arXiv:2502.00634.

Missing Important References

The key contributions of this paper are the decoder-only architecture for simultaneous speech-to-speech translation and a synthetic alignment-building method, both of which are discussed by [1] in the context of simultaneous text-to-text translation, but [1] is neither cited nor discussed.

[1] Wang, M., Vu, T. T., Wang, Y., Shareghi, E., & Haffari, G. (2024). Conversational simulmt: Efficient simultaneous translation with large language models. arXiv preprint arXiv:2402.10552.

Other Strengths and Weaknesses

The writing needs improvement. The authors assume some prior knowledge of the Moshi model, the RQ-Transformer, and related techniques. It would be better to have a figure illustrating the model architecture, so that it is easier for a broader audience to understand.

Other Comments or Suggestions

  1. Figure 2 is a bit confusing at first glance. A more complete illustration of the architecture would be helpful.
  2. What noise augmentation techniques are used in Hibiki?
  3. Lines 311-316: the description of the EOS tokens is confusing.

Author Response

We thank the reviewer for their constructive feedback.

Updates in the revised version of the paper

We first inform the reviewer that we will update the reported results of Hibiki-M in Table 2 after fixing an issue that further improves performance. Following the reviewers' suggestions and thanks to the extra page provided for the camera ready, we will revise the paper on the following aspects:

  • Clarify the model's architecture, configurations and the nature/size of the datasets used.
  • Add quality/latency trade-off curves by varying hyperparameters of our contextual alignment method.
  • Add COMET evaluation scores.
  • Extend experiments to the English->French direction.
  • Improve the references section and discuss similar works pointed out by the reviewers.

Comments

On the generalization of the contextual alignment method to many language pairs

We acknowledge that our method is only illustrated on a single language direction; however, nothing in our method is specific to the language pair, other than the fact that MADLAD -- which we used to derive contextual alignments -- performs well on these languages. We expect our method to work as strongly on pairs of languages where SOTA MT performs well and allows for reliable contextual alignments. Given the massively multilingual nature of MADLAD, it is a good candidate for scaling to many language pairs. We provide the reviewers with examples of contextual alignments for other languages. As a first step towards more language directions, we have extended our experiments to the English->French direction and provide experimental results that we will add to the revised paper.

On the quality/latency trade-off and controllable latency

As mentioned in Section 3.2.2 (l.183), we enforce a 2s delay between words associated through contextual alignment. We acknowledge that this choice can be reconsidered and have produced a trade-off curve by varying the delay, as requested by the reviewers. The results of this quality/latency study show that Hibiki provides a better trade-off than Seamless. We will add this figure to the revised version of the paper.

We also acknowledge that the proposed version of Hibiki does not allow for inference-time latency control. We could rely on conditional training, as we did for speaker similarity, to simultaneously train the model on multiple latency levels, making it possible to control the latency at inference by changing the conditioning. We will add this mention to the limitations section.

On references

We acknowledge the contributions made by Papi et al. (2023), Wang et al. (2024) and Yu et al. (2025). We also acknowledge the progress made in streaming and speech translation for complex language pairs such as English-Japanese as highlighted in Ahmad et al. (2024). We will add these references to the related work in the updated version of our paper.

On the usage of a single text translation model

We used a single model for translation and alignment as we expect this model to be the most appropriate for deriving a reliable likelihood-based alignment. However, we acknowledge in Section 4.6.2 that we may overfit to MADLAD, and that diversifying the models used to generate and align data may improve the robustness of our system.

On neural evaluation of quality

We added COMET evaluations and executed the comet-compare script, which gave the following system ranking: MADLAD-3B > Hibiki > Seamless, with a p-value < 0.05.
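
For reference, a comparable segment-level COMET evaluation can be run with the Unbabel COMET toolkit (the checkpoint name below is an assumption, as the authors do not state which model they used; the comet-compare script they mention additionally performs the paired significance test between two systems):

```python
# Hedged example of a COMET evaluation with the unbabel-comet package.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")  # assumed checkpoint
model = load_from_checkpoint(model_path)

data = [
    {"src": "Bonjour le monde.", "mt": "Hello world.", "ref": "Hello, world."},
    # ... one entry per segment (here, ASR transcripts of the generated speech)
]
output = model.predict(data, batch_size=8, gpus=1)
print(output.system_score)  # corpus-level COMET score
```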

On statistical significance of human evaluation

Indeed, we may get a more robust estimate from more samples; however, the gap between approaches (as demonstrated by the error intervals) is wide enough that we consider these results trustworthy. We also encourage the reviewer to listen to the examples on the demo webpage.

Answer: What noise augmentation techniques are used in Hibiki?

We use samples from freesound.org that are randomly added at various intensities to the input audio during training. We will add details about this in the revised version of the paper, and the associated code will be released with the training code.
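
A minimal sketch of such additive noise augmentation (the SNR range and implementation details are assumptions; the released code may differ):

```python
# Sketch: mix a noise sample into the input speech at a random SNR.
import torch

def add_noise(speech: torch.Tensor, noise: torch.Tensor, snr_db_range=(0.0, 30.0)):
    """Mix `noise` into `speech` (1-D waveforms) at a random signal-to-noise ratio."""
    snr_db = torch.empty(1).uniform_(*snr_db_range)
    if noise.shape[-1] < speech.shape[-1]:  # loop the noise if it is too short
        reps = speech.shape[-1] // noise.shape[-1] + 1
        noise = noise.repeat(reps)
    noise = noise[: speech.shape[-1]]
    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-10)
    # Solve for the gain so that 10*log10(P_speech / (g^2 * P_noise)) = snr_db.
    gain = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise
```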

Answer: Is Hibiki able to generalize to unbounded speech?

The updated version of Hibiki that we will release is trained with windowed attention to extrapolate beyond a few minutes.

Answer: Is Hibiki still better than StreamSpeech using only CVSS training data?

We acknowledge that we never trained Hibiki on CVSS data only, as we aimed to handle real-world use cases with longer and more diverse speech inputs, while CVSS only contains single sentences of a few seconds.

Official Review
Rating: 3

This paper introduces a model named Hibiki for real-time speech-to-speech translation. Hibiki employs a multi-stream architecture to synchronously process source and target speech, and generates both text and audio through multi-task learning. Trained with a weakly supervised method, Hibiki demonstrates SOTA performance in a French-to-English translation task, achieving good translation quality, speaker fidelity, and naturalness. Its simple inference process supports batch processing and real-time deployment on devices.

Questions for the Authors

  • How well does the approach generalize to other languages, especially low-resource ones?

  • Can the latency-quality trade-off be adjusted at inference, or is it fixed based on training?

  • Does using stochastic sampling for decoding lead to inconsistent translations across different runs?

  • How tightly are the text and speech token streams aligned—does each word directly correspond to a speech segment?

Claims and Evidence

All the claims seem reasonable; however, the proposed methods are only validated on French-English, and their effectiveness on other languages needs further study.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria generally make sense.

Theoretical Claims

I have checked all equations and they are all correct.

Experimental Design and Analysis

I have reviewed the experimental designs and analyses presented in the paper, and they generally appear to be sound and valid.

Supplementary Material

I reviewed the Appendix of the paper and the demo page.

Relation to Prior Work

The paper studies an important question in the speech domain: real-time speech-to-speech translation.

Missing Important References

I think all the related works are discussed in this paper.

Other Strengths and Weaknesses

Strengths

  • Hibiki integrates simultaneous speech-to-speech and speech-to-text translation into a single decoder-only model, simplifying inference.
  • Achieves strong BLEU scores, outperforming previous models in both offline and real-time speech translation.
  • Produces fluent, well-paced speech with better voice preservation than prior models.
  • Uses a weakly-supervised alignment method to determine optimal delays, improving real-time accuracy.
  • Simple inference process allows for batched GPU translation and real-time on-device deployment.

Weaknesses

  • Relies heavily on synthetic training data, requiring high-quality ASR, MT, and TTS models.
  • The writing needs to be improved, and the paper is a little difficult to follow; many details are derived from the Moshi paper.
  • Currently evaluated only on French-English, raising questions about generalizability.
  • Speaker similarity is improved but not perfect, and accent transfer may not always be desirable.

Other Comments or Suggestions

  • Testing on more language pairs and domains would strengthen claims of generalizability.

  • A more intuitive explanation of multistream decoding would help readers understand the model’s structure.

  • Incorporating real interpreter speech in training or fine-tuning could improve performance.

  • Exploring alternative speaker adaptation methods might enhance voice retention without needing classifier-free guidance.

Author Response

We thank the reviewer for their constructive feedback.

Updates in the revised version of the paper

We first inform the reviewer that we will update the reported results of Hibiki-M in Table 2 after fixing an issue that further improves performance. Following the reviewers' suggestions and thanks to the extra page provided for the camera ready, we will revise the paper on the following aspects:

  • Clarify the model's architecture, configurations and the nature/size of the datasets used.
  • Add quality/latency trade-off curves by varying hyperparameters of our contextual alignment method.
  • Add COMET evaluation scores.
  • Extend experiments to the English->French direction.
  • Improve the references section and discuss similar works pointed out by the reviewers.

Comments

On the generalization of the contextual alignment method to many language pairs

We acknowledge that our method is only illustrated on a single language direction; however, nothing in our method is specific to the language pair, other than the fact that MADLAD -- which we used to derive contextual alignments -- performs well on these languages. We expect our method to work as strongly on pairs of languages where SOTA MT performs well and allows for reliable contextual alignments. Given the massively multilingual nature of MADLAD, it is a good candidate for scaling to many language pairs. We provide the reviewers with examples of contextual alignments for other languages. As a first step towards more language directions, we have extended our experiments to the English->French direction and provide experimental results that we will add to the revised paper.

On the quality/latency trade-off and controllable latency

As mentioned in Section 3.2.2 (l.183), we enforce a 2s delay between words associated through contextual alignment. We acknowledge that this choice can be reconsidered and have produced a trade-off curve by varying the delay, as requested by the reviewers. The results of this quality/latency study show that Hibiki provides a better trade-off than Seamless. We will add this figure to the revised version of the paper.

We also acknowledge that the proposed version of Hibiki does not allow for inference-time latency control. We could rely on conditional training, as we did for speaker similarity, to simultaneously train the model on multiple latency levels, making it possible to control the latency at inference by changing the conditioning. We will add this mention to the limitations section.

On incorporating real interpreter speech in training or fine-tuning

Real interpreter speech would indeed be an ideal source of data. However, given the scarcity of such data in terms of volume, number of speakers, covered languages, etc., we believe that developing pipelines for synthetic paired data generation is the best path towards scaling speech translation to more languages and conditions.

On speaker similarity and Classifier-Free Guidance (CFG)

Accent transfer is indeed a limitation, which we mention in the comments on the classifier-free guidance ablation in Section 4.6. We expect that both high speaker similarity and a reduced accent can be achieved by labeling our data with a speaker identification system that is invariant to accent and more accurate in terms of identity.

While CFG offers fine-grained control over the strength of the conditioning, it also doubles the computational cost at inference. Cideron et al. (2024) have proposed distilling the post-CFG logits into a student model. Since our submission, we have experimented with this method, distilling the logits obtained with γ = 3 into a student model such that the latter can run without CFG. In our long-form evaluations, while Hibiki-M without CFG reaches a speaker similarity of 0.33, after CFG-distillation it reaches 0.38 (without CFG), close to the 0.39 obtained with CFG. This suggests that we can use distillation to remove the need for CFG at inference.
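
A sketch of the CFG combination and one possible form of the distillation objective (γ = 3 as in the experiment above; the exact loss used is an assumption, following the spirit of Cideron et al., 2024):

```python
# Sketch of classifier-free guidance on logits and a CFG-distillation loss.
import torch
import torch.nn.functional as F

def cfg_logits(cond_logits: torch.Tensor, uncond_logits: torch.Tensor, gamma: float = 3.0):
    """Standard CFG: move gamma times further in the conditional direction."""
    return uncond_logits + gamma * (cond_logits - uncond_logits)

def cfg_distillation_loss(student_logits, cond_logits, uncond_logits, gamma=3.0):
    """Train the student to match the post-CFG teacher distribution so that
    inference no longer needs two forward passes."""
    teacher = torch.softmax(cfg_logits(cond_logits, uncond_logits, gamma), dim=-1)
    return F.cross_entropy(
        student_logits.flatten(0, -2), teacher.flatten(0, -2).detach()
    )
```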

On decoding with stochastic sampling

Stochastic sampling indeed induces variability in the output, some of which is desirable (e.g., acoustic diversity) while some is undesirable (unreliable translation). We thus use a lower top-k inference parameter on the text stream than on the audio streams, disentangling acoustic from linguistic diversity by keeping the former high while lowering the latter.
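
A minimal sketch of per-stream top-k sampling (the k and temperature values below are made up for illustration; the authors do not report theirs here):

```python
# Sketch: sample with a small k on the text stream (stable translation) and a
# larger k on the audio streams (acoustic diversity).
import torch

def sample_top_k(logits: torch.Tensor, k: int, temperature: float = 0.8) -> torch.Tensor:
    """Sample one token from the k most likely entries of `logits`."""
    values, indices = torch.topk(logits / temperature, k, dim=-1)
    probs = torch.softmax(values, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return indices.gather(-1, choice)

# e.g., text_token  = sample_top_k(text_logits, k=25)    # low linguistic diversity
#       audio_token = sample_top_k(audio_logits, k=250)  # high acoustic diversity
```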

On the alignment of text and speech tokens

As described in Section 3.4.4 of Défossez et al. (2024), special PAD and EPAD (End of PADding) tokens are inserted in the text stream to account for the difference between the constant framerate of the audio tokens and the variable rate of text tokens.
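
A tiny illustration of this convention (the exact placement of EPAD is assumed, after Section 3.4.4 of Défossez et al., 2024):

```python
# Sketch: the last PAD before each word is replaced by EPAD, so the model
# announces when a padding run ends and a new word begins.
PAD, EPAD = "<pad>", "<epad>"

def mark_epad(stream):
    out = list(stream)
    for i in range(1, len(out)):
        if out[i] not in (PAD, EPAD) and out[i - 1] == PAD:
            out[i - 1] = EPAD  # consecutive word tokens need no marker
    return out

print(mark_epad([PAD, PAD, "Hel", "lo", PAD, PAD, "world"]))
# ['<pad>', '<epad>', 'Hel', 'lo', '<pad>', '<epad>', 'world']
```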

Official Review
Rating: 4

This paper proposes a state-of-the-art speech-to-speech translation system called Hibiki. It is a chunk-based decoder-only model built on the Mimi codec, and a number of techniques (alignment-related, synthetic data creation, classifier-free guidance, etc.) are introduced to achieve state-of-the-art performance on several public benchmarks. The method seems to be applied only to Fr-En.

update after rebuttal

I acknowledge the authors' efforts in presenting the trade-off curve and providing additional English-to-French translation results, and I have accordingly raised my score. I also encourage the authors to follow through on their stated commitments, including open-sourcing their code.

Questions for the Authors

  • Section 3: "X is padded"—what happens when X is longer than Y? Is Y padded in that case?
  • Will the implementation in Section 4.2 be open-sourced?
    • "We build a French-English speech translation dataset of approximately 40K hours in each language." Which data sources were used? Will this dataset be released?

Claims and Evidence

This paper proposes a number of techniques

  • target text scaffolding
  • contextual alignment based on an off-the-shelf MT system and its application to target text/audio alignment
  • Alignment-aware TTS generation
  • Speaker similarity improvement in TTS generation
  • Conditional training with classifier-free guidance

The effectiveness of these techniques is validated experimentally (e.g., the ablations in Tables 4/5).

Methods and Evaluation Criteria

The paper presents four evaluation metrics: fidelity, measured through subjective MOS scores (Table 3); speaker similarity; ASR-BLEU; and latency measures such as LAAL. Given that this study focuses on simultaneous speech translation, it is crucial to analyze the performance-latency trade-off, particularly through ASR-BLEU and LAAL. However, Table 2 provides only a single condition, making it difficult to thoroughly examine this trade-off.

I suggest that the authors illustrate trade-off curves by varying latency control parameters and compare their method against competitors, discussing the advantages and limitations of each approach.

Theoretical Claims

This paper does not have theoretical claims.

Experimental Design and Analysis

As mentioned earlier, the paper should focus more on the performance-latency trade-off rather than drawing conclusions from a specific latency setup. For example, Section 4.6 states that Hibiki outperforms Seamless, but a 1.4-second latency is quite large for a speech interface. If generating trade-off curves or testing various latency conditions is not feasible, the authors should at least soften their claims to account for this limitation.

Supplementary Material

I checked Figure 7 in the appendix section to check the contextual alignment examples.

Relation to Prior Work

Simultaneous speech-to-speech translation is one of the most important human language technologies for removing language barriers around the world.

Missing Important References

The paper sufficiently cites related work. However, I would like to highlight that advancements in simultaneous speech-to-speech translation have not been driven solely by major industries but also by contributions from various researchers in the IWSLT community. I recommend that the authors acknowledge these efforts by citing relevant IWSLT summary papers.

Other Strengths and Weaknesses

Strengths

  • Achieves state-of-the-art performance in simultaneous speech-to-speech translation. The results in Table 1 are impressive, as the proposed approach outperforms the offline system despite operating in a streaming setting.
  • Proposes several techniques to enhance performance, with their effectiveness validated through an ablation study.

Weaknesses

  • The training procedure is complex, making it difficult to reproduce the results. While Section 1 states that the authors will release the code, models, and dataset, it is unclear whether the release will include the full dataset creation process and detailed training configurations. I recommend the authors clarify this.
  • The performance-latency tradeoff between Seamless and the proposed method is not clearly analyzed (see my comments above).
  • The alignment methods appear to be tailored to a specific language pair, raising concerns about their applicability to other language pairs.

Other Comments or Suggestions

  • Section 3.1.1 requires some prior knowledge of the Mimi codec but is well written.
  • Section 3.1.4 is difficult to understand. As this section presents the main proposed architecture, it requires more detailed explanations, such as equations or figures, to enhance clarity.
  • Section 4.6 presents strong results, but it would be more informative if the authors included details on the training data for each system and the number of parameters.

Author Response

We thank the reviewer for their constructive feedback.

Updates in the revised version of the paper

We first inform the reviewer that we will update the reported results of Hibiki-M in Table 2 after fixing an issue that further improves performance. Following the reviewers' suggestions and thanks to the extra page provided for the camera ready, we will revise the paper on the following aspects:

  • Clarify the model's architecture, configurations and the nature/size of the datasets used.
  • Add quality/latency trade-off curves by varying hyperparameters of our contextual alignment method.
  • Add COMET evaluation scores.
  • Extend experiments to the English->French direction.
  • Improve the references section and discuss similar works pointed out by the reviewers.

Comments

On the generalization of the contextual alignment method to many language pairs

We acknowledge that our method is only illustrated on a single language direction in the paper. However, nothing in our method is specific to the language pair, other than the fact that MADLAD -- which we used to derive contextual alignments -- performs well on these languages. We expect our method to work as strongly on pairs of languages where SOTA text translation models perform well and can thus allow us to derive reliable contextual alignments. Given the massively multilingual nature of MADLAD, or even more recent systems like GemmaX2-28-9B (Cui et al., 2025), we expect this approach to be a good candidate for scaling to many language pairs. We provide the reviewers with examples of contextual alignments for other languages. As a first step towards more language directions, we have extended our experiments to the English->French direction and provide experimental results that we will add to the revised paper.

On the quality/latency trade-off

As mentioned in Section 3.2.2 (l.183), we enforce a 2s delay between words associated through contextual alignment, as we found it to provide a good balance between latency and translation quality. We acknowledge that this choice can be reconsidered and that trade-off curves would give the reader a clearer picture. We thus produced a trade-off curve by varying the delay, as requested by the reviewers. The results of this quality/latency study show that Hibiki provides an overall better trade-off than Seamless. We will add this figure to the revised version of the paper.

We also acknowledge that the proposed version of Hibiki does not allow for inference-time latency control. We could rely on conditional training, as we did for the speaker similarity, to simultaneously train the model on multiple latency levels making it possible to control the latency at inference by changing the conditioning. We will add this mention to the limitations section.

On the release of code and data

We acknowledge that some critical parts of our framework are particularly challenging to reproduce, in particular the synthetic data generation using contextual alignment, and we will release our code for these steps along with training and inference code, trained models, and around 900h of synthetic paired data with voice preservation, corresponding to our speech translation fine-tuning dataset introduced in Section 4.2. To build the speech translation training dataset, we relied on various data sources and will release the portion that their licenses allow us to release.

On references

We thank the reviewers for their suggestion and acknowledge the contributions made by Papi et al. (2023), Wang et al. (2024) and Yu et al. (2025) that are particularly relevant with respect to our work. We also acknowledge the progress made in streaming and speech translation for complex language pairs such as English-Japanese as highlighted in Ahmad et al. (2024). We will add these references to the related work in the updated version of our paper.

Answer to: Section 3: "X is padded" - What happens when X is longer than Y ? Is Y padded in that case?

At this level of explanation (l.106), we also assume that the modeling of Y given X should be causal, which implies that Y is longer than X.

Final Decision

This paper proposes Hibiki, a decoder-only speech-to-speech translation model. The paper integrates multiple techniques (alignment-related, synthetic data creation, classifier-free guidance, etc.) and achieves high performance on several public benchmarks.

Strengths of the paper:

  • The proposed model Hibiki achieves state-of-the-art performance in simultaneous speech-to-speech translation.
  • Hibiki is able to conduct efficient batched inference, and the distilled model is able to run on a smartphone in real time.

Weaknesses of the paper:

  • The model is only evaluated on the French-English direction, which is not enough evidence of its generalization.
  • The human evaluation only contains 30 samples.

All reviewers are satisfied with author responses.