PaperHub

NeurIPS 2025 · Poster · Overall score 6.8/10

SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

4 reviewers · Ratings: 4, 5, 4, 4 (min 4, max 5, std 0.4)
Confidence: 4.0 · Novelty: 3.0 · Quality: 2.8 · Clarity: 2.5 · Significance: 2.5

Keywords: Simultaneous Speech Translation, Streaming Text-To-Speech

Reviews and Discussion

Review (Rating: 4)

The paper presents a learned policy for simultaneous speech translation (SimulST) called SimulMEGA. SimulMEGA combines a prefix-based SimulST policy, which is trained in an unsupervised way, with Mixture-of-Experts (MoE) to improve the system's performance. The framework is applied to both speech-to-text (S2T) and text-to-speech (TTS). The SimulST model is trained in two steps: an offline pretraining and a simultaneous training with a standard prefix loss and a novel refiner loss for the MoE. The performance is validated on six languages and popular benchmarks and ablation studies are presented.

Strengths and Weaknesses

Strengths:

  • It is the first time that the MoE approach is applied to SimulST, to the best of my knowledge.
  • The architectural modifications required to apply this method are minor, facilitating its adoption.
  • The results of the proposed architecture are promising, outperforming the tested baselines.
  • The ablation study clarifies the importance of some experimental choices.

Weaknesses:

  • No results reported using LAAL, which is the most used latency metric for SimulST (e.g., see the last IWSLT Evaluation Campaign editions from 2024 and 2025)
  • Insufficient baselines and lack of clarity in their presentation: The paper compares with "Seamless", but it is not clearly stated that the original SeamlessM4T models (both Medium and Large-v2) are different from the simultaneous version of Seamless, StreamingSeamless [1], which I suppose is the model adopted for the comparisons in Figure 3. Moreover, the wait-k policy (which is one of the two terms of comparisons for SimulST) has been outperformed for many years now (see also the related works), and is not a strong term of comparison. The choice of the remaining baseline, Dig-SST, has not been justified, and it is unclear why recent state-of-the-art SimulS2TT works (already cited by the authors), such as models tailored for SimulST tasks and publicly available (such as StreamSpeech [2] and NAST-S2X [3]) and policies applicable to models without any retraining (such as EDAtt [4] and AlignAtt [5]), have not been used for this comparisons. It is good to report additional baselines, such as wait-k, but the set of baselines should be carefully selected and clearly explained to ensure the validity of the paper's claims. Likewise, results for the SimulS2ST task lack comparison with the aforementioned works (StreamSpeech and NAST-S2X), which also cover the speech-in-output modality.
  • It is not clear if the system actually processes long-form speech (akin to real-world streaming processing) or if the speech has been previously segmented into shorter segments. It seems that the authors used CoVoST and FLEURS, which are human-segmented, but no details about this are provided. Following recent work [6], this should be clarified to position the paper's contribution appropriately.
  • While it is true that the approach requires a minimal modification of the architecture, it requires an ad-hoc simultaneous training stage to be applied. Moreover, the pretraining stage, despite being offline, already entails some choices for making the model more efficient during inference (chunk-AR and NAR), and even if desirable in general, this is not the standard way offline S2TT and S2ST models are trained, meaning that the same findings might not be applicable when starting from common, publicly available models.

[1] Barrault, L., Chung, Y.A., Meglioli, M.C., Dale, D., Dong, N., Duppenthaler, M., Duquenne, P.A., Ellis, B., Elsahar, H., Haaheim, J. and Hoffman, J., 2023. Seamless: Multilingual Expressive and Streaming Speech Translation. arXiv preprint arXiv:2312.05187.

[2] S. Zhang, Q. Fang, S. Guo, Z. Ma, M. Zhang, and Y. Feng, “StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning,” 2024.

[3] Zhengrui Ma, Qingkai Fang, Shaolei Zhang, Shoutao Guo, Yang Feng, and Min Zhang. 2024. A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1557–1575, Bangkok, Thailand. Association for Computational Linguistics.

[4] S. Papi, M. Negri, and M. Turchi, “Attention as a Guide for Simultaneous Speech Translation,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto, Canada: Association for Computational Linguistics, 2023, pp. 13 340–13 356.

[5] S. Papi, M. Turchi, and M. Negri, “AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation,” in INTERSPEECH 2023, 2023, pp. 3974–3978.

[6] Papi, S., Polak, P., Macháček, D. and Bojar, O., 2025. How “Real” is Your Real-Time Simultaneous Speech-to-Text Translation System?. Transactions of the Association for Computational Linguistics, 13, pp.281-313.

Questions

  • I suggest adding LAAL as a metric, since it is now the standard for SimulST evaluation and avoids rewarding over-generating systems.
  • Why did the author choose the wait-k and the DigSST policies for comparison while excluding state-of-the-art works such as StreamSpeech, NAST-S2X, EDAtt, and AlignAtt?
  • What is the framework used for evaluation? I suggest adding to the inference details (and, maybe, in the limitations) that the model has been tested on short-form and human-segmented test sets such as CoVoST and FLEURS, thereby simulating only part of a real-world SimulST model.
  • How much does the choice of the mixed chunk-AR and NAR encoder affect the final results?

Limitations

The limitations section is very brief and to the point. I suggest elaborating a little on the drawbacks mentioned in lines 299-301.

Final Justification

I decided to keep my score, as I think it already represented the correct assessment of this work. The weaknesses are clear from my comments: despite the method performing well (which is the reason for my positive rating), the overall paper strongly depends on architectural requirements that are not at all standard in practice (i.e., inference-efficient approaches such as chunk-AR and NAR), and therefore the claim that it only requires a lightweight finetuning is not true. The additional experiments provided in the rebuttal regarding the change in architecture are not compared with other state-of-the-art methods (AlignAtt, EDAtt) applied to the mentioned architectures, but only with SeamlessStreaming. Lastly, the evaluation framework is not very clear, as it does not follow standard frameworks and would be difficult to reproduce.

Formatting Issues

No.

Author Response

Dear Reviewer,

Thank you for your insightful comments and constructive feedback on our work. We have carefully addressed each of your concerns as follows:

Addressing Weaknesses

  1. Latency Evaluation: We have incorporated numerical evaluation results on the CoVoST2 test set using LAAL as the latency metric, presented in Table 1. The results confirm our original conclusion that SimulMEGA continues to marginally outperform baseline methods. In the final version, all quality-latency curves will be systematically re-plotted using the LAAL metric for comprehensive comparison.
  2. Baselines and presentation clarification:
    • While AlignATT did not function effectively with our base model (due to persistent attention fixation on particular frames leading to over-generation), we successfully integrated EDATT (Papi et al., 2023), which shares conceptual similarities. As shown in Table 1, EDATT outperforms both DIG-SST and wait-K baselines, further validating our approach's competitive advantage.
    • Publicly available implementations of StreamSpeech and NAST-S2X differ significantly from SimulMEGA in three critical dimensions: model architecture scale, training data volume, and multilingual capability scope. To provide meaningful comparison while acknowledging these differences, we present results on the CVSS FR→EN benchmark in Table 2, where SimulMEGA achieves more than 10 BLEU points higher than these public implementations. We maintain AL (Average Lagging) as the latency metric to ensure comparability with the original publications. These comparative results will be included in the appendix of the final version.
    • Regarding the "Seamless" baseline: this refers to the integrated system described in Section 6 of the Seamless documentation, which combines Seamless Expressive with Seamless Streaming components. We will expand our baseline descriptions in the appendix to prevent any potential misinterpretation.
  3. Real-world streaming processing: Our supplementary materials include a deployed web demonstration that processes streaming input using a VAD (Voice Activity Detection) system to segment audio into optimal 10-20 second chunks for model processing. While standardized long-form evaluation benchmarks remain scarce in the field, we constructed a specialized test set following Whisper's methodology, comprising 17 TED-LIUM talks with an average duration of nine minutes. We translated the reference transcripts to Chinese using ByteDance's Doubao translation service for evaluation. The S2TT performance metrics are presented in Table 3.
  4. Reproducibility on standard pretrained ST model: Unlike attention-based policies (EDATT and AlignATT) or divergence-based approaches (DIG-SST), SimulMEGA introduces two features: (1) it fine-tunes the base model without freezing parameters, and (2) it learns the read/write policy without relying on empirical patterns. Our training process demonstrates stable convergence in both loss values and routing score distributions. Theoretical analysis suggests minimal impact from the chunk streaming mask design, which is empirically validated by consistent performance across different mask configurations. Furthermore, SimulMEGA's successful adaptation to streaming TTS demonstrates the method's robustness across both architectural variants and task domains.

Responding to Questions

  1. Evaluation framework: While we did not utilize SimulEval as our primary evaluation framework, we implemented a rigorously controlled evaluation pipeline where all models undergo identical processing conditions. Our latency metrics directly implement the same computational definitions as SimulEval to ensure measurement consistency with established SimulST literature.
  2. Effect of streaming mask: Although comprehensive ablation studies on architectural variations were not our primary focus, we conducted targeted experiments evaluating different mask configurations. Our findings indicate that the 640ms chunk mask configuration yields performance equivalent to processing without chunk masking. However, we observed measurable performance degradation with the 320ms configuration, suggesting an optimal balance between temporal resolution and contextual completeness at the 640ms threshold.

We appreciate your thorough review and welcome any additional questions you may have regarding these clarifications. All revisions outlined above will be incorporated into the final manuscript to enhance both methodological transparency and experimental rigor.

Table 1: CoVoST2 Evaluation

Seamless Streaming
| Setting   | 0.3   | 0.5   | 0.7   | 0.9   | 1     |
| X-EN BLEU | 33.50 | 33.75 | 34.05 | 34.37 | 35.05 |
| X-EN LAAL | 2.280 | 2.410 | 2.522 | 2.665 | 3.149 |
| EN-X BLEU | 33.35 | 33.70 | 33.83 | 34.11 | 35.41 |
| EN-X LAAL | 1.966 | 2.065 | 2.144 | 2.268 | 2.660 |

Wait-K
| Setting   | 2.5   | 1.9   | 1.5   | 1.1   | 0.5   |
| X-EN BLEU | 29.27 | 31.82 | 32.64 | 33.45 | 34.93 |
| X-EN LAAL | 1.731 | 2.151 | 2.296 | 2.565 | 3.309 |
| EN-X BLEU | 34.25 | 35.83 | 36.12 | 36.49 | 37.41 |
| EN-X LAAL | 1.975 | 2.294 | 2.376 | 2.541 | 3.127 |

DigSST
| Setting   | 0.1   | 0.07  | 0.05  | 0.03  | 0.01  |
| X-EN BLEU | 31.43 | 33.53 | 35.05 | 35.99 | 36.52 |
| X-EN LAAL | 2.035 | 2.237 | 2.442 | 2.752 | 3.340 |
| EN-X BLEU | 17.12 | 17.51 | 18.10 | 19.44 | 23.38 |
| EN-X LAAL | 0.734 | 0.770 | 0.808 | 0.902 | 1.252 |

EDATT
| Setting   | 0.03  | 0.025 | 0.02  | 0.015 | 0.01  |
| X-EN BLEU | 33.70 | 34.81 | 35.60 | 36.09 | 36.56 |
| X-EN LAAL | 1.880 | 2.062 | 2.301 | 2.610 | 3.056 |
| EN-X BLEU | 35.00 | 35.82 | 36.77 | 37.63 | 38.20 |
| EN-X LAAL | 1.963 | 2.131 | 2.384 | 2.688 | 3.129 |

SimulMEGA
| Setting   | 0.7   | 0.6   | 0.5   | 0.4   | 0.3   |
| X-EN BLEU | 35.58 | 35.97 | 36.27 | 36.46 | 36.60 |
| X-EN LAAL | 1.729 | 1.860 | 1.982 | 2.104 | 2.233 |
| EN-X BLEU | 37.83 | 38.11 | 38.14 | 38.28 | 38.39 |
| EN-X LAAL | 2.059 | 2.246 | 2.400 | 2.542 | 2.690 |
Table 2: Evaluation on CVSS FR-EN

SimulMEGA
| Threshold | 0.8   | 0.7   | 0.6   | 0.5   | 0.4   | 0.3   | 0.2   | 0.0   |
| ASR-BLEU  | 33.42 | 34.23 | 33.79 | 34.08 | 34.46 | 34.72 | 35.08 | 35.74 |
| AL        | 1.391 | 1.511 | 1.605 | 1.767 | 1.810 | 1.929 | 2.058 | n/a   |

NAST-S2S
| t_s + t_a | 320+320 | 640+640 | 1280+1280 | 1920+1280 | 1920+1920 | 2560+1280 | 2560+2560 |
| ASR-BLEU  | 19.67   | 19.15   | 20.20     | 21.77     | 22.70     | 23.58     | 24.88     |
| AL        | -0.393  | 1.533   | 3.330     | 3.975     | 4.335     | 4.408     | 4.976     |

StreamSpeech
| C×40ms   | 320   | 640   | 960   | 1280  | 1920  | 2560  | 3200  | 4800  |
| ASR-BLEU | 22.89 | 24.41 | 25.00 | 25.20 | 25.50 | 25.68 | 25.75 | 26.14 |
| AL       | 1.270 | 2.326 | 2.803 | 3.146 | 3.450 | 3.812 | 4.157 | 4.873 |

Seamless
| Threshold | 0.3   | 0.5   | 0.7   | 0.9   | 1.0   |
| ASR-BLEU  | 31.63 | 31.60 | 31.60 | 31.62 | 31.99 |
| AL        | 1.819 | 1.947 | 2.058 | 2.199 | 2.826 |
Table 3: Long-form S2TT result on TED-LIUM

| System    | BLEU  | AL       | LAAL     |
| SimulMEGA | 55.13 | 16.15655 | 24.62662 |
| Seamless  | 38.18 | 7.016107 | 15.42395 |
Comment

I thank the authors for their response. Below, they can find my answers:

We have incorporated numerical evaluation results on the CoVoST2 testset using LAAL

I appreciate the inclusion of the LAAL results. However, I would like to know which methods can achieve a latency ≤ 1.5 s (which is commonly referred to as the low-latency regime); in the original EDAtt paper, we have results spanning 1-1.5 s. I believe this is important because common real-world SimulST scenarios require low-latency constraints.

While AlignATT did not function effectively with our base model (due to persistent attention fixation on particular frames leading to over-generation)

Both AlignAtt and EDAtt are based on cross-attention scores, and the usage of one should be possible if it is possible to use the other. Attention "fixation" can be easily solved by a frame-wise normalization, to regularize the attention patterns. I suggest that the authors apply such a method to both AlignAtt and EDAtt for solving the attention sink problem, allowing for a comparison with state-of-the-art approaches (as mentioned also by Reviewer U9os). Lastly, it is important to notice that EDAtt and AlignAtt are policies that do not require any retraining as they just rely on the cross-attention scores of the original model, while the proposed method requires specific architectural choices and specific fine-tuning.

Real-world streaming processing + Question 1

I appreciate the elaboration on long-form speech. I would like to point out that, following the paper I mentioned during the review, the specific settings used in the experiments should be adequately described in the main paper (VAD segmentation), as the evaluation framework should be clear to the reader to correctly interpret the reported results, especially when a publicly available and widely used framework such as SimulEval is not used. It seems that the long-form approach adopted in Table 3 is similar to [1], while direct approaches to long-form translation (without the need for an additional segmentation step) also exist [2,3], and [2] does not require retraining. I believe that further elaborating on the motivation for the VAD-based approach reported in the supplementary material, against the aforementioned approaches, or pointing them out as potential further improvements/future directions for the long-form scenario (that is, the real-world scenario) would make the discussion more complete.

[1] Polák, Peter, and Ondřej Bojar. "Long-form end-to-end speech translation via latent alignment segmentation." 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024.

[2] Sara Papi, Marco Gaido, Matteo Negri, and Luisa Bentivogli. 2024. StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3692–3707, Bangkok, Thailand. Association for Computational Linguistics.

[3] Siqi Ouyang, Xi Xu, and Lei Li. 2025. InfiniSST: Simultaneous Translation of Unbounded Speech with Large Language Model. In Findings of the Association for Computational Linguistics: ACL 2025, pages 3032–3046, Vienna, Austria. Association for Computational Linguistics.

Reproducibility on standard pretrained ST model

Unfortunately, the response given by the authors doesn't address the concern regarding the "a priori" design choice of the model. Specifically, the claim of minimal modification does not hold, as the architecture chosen for the experiments, implementing inference-efficient approaches (chunk-AR and NAR), is inherently different and already optimized for simultaneous tasks. However, this is not the case for standard S2TT and S2ST models that do not usually employ such strategies (e.g., Whisper and Seamless). Therefore, the concern of my original review, "meaning that the same findings might not be applicable when starting from common, publicly available models," has not been addressed.

Comment

Thank you for your prompt and thoughtful response. We appreciate the opportunity to clarify a few points where we believe there may be some misunderstanding.

in the original EDAtt paper, we have results spanning 1-1.5s

Actually, in the original EDAtt paper, results are reported over an AL range of 1-2.5 s. At 1.5 s AL, the method suffers a degradation of approximately 5 BLEU points (around 25%) compared to full-context performance. In contrast, in our experiments, EDAtt incurs only a 3 BLEU drop (10%) at 1.8 s LAAL, indicating that our implementation of EDAtt serves as a stronger baseline than the original. This suggests that our overall framework achieves better latency-accuracy trade-offs.

Attention "fixation" can be easily solved by a frame-wise normalization.

We respectfully disagree with the claim that attention fixation "can be easily solved by frame-wise normalization." Our implementation of EDAtt strictly follows the official codebase from FBK-FairSeq, where frame-wise normalization is already applied. However, neither the AlignAtt paper nor its public code includes such normalization. Moreover, it is unclear how normalization, applied before an argmax operation, would fundamentally resolve fixation issues, as the argmax is invariant to scale. We would welcome further clarification on this point.

while direct approaches to long-form (without the need for an additional segmentation step) also exist

While it is true that direct long-form streaming methods exist, their practical applicability is limited. For instance, StreamAtt introduces up to 2 seconds of computational latency, which exceeds typical real-time requirements. InfiniSST, though innovative, still relies on crafted alignment data and exhibits performance degradation. In contrast, our VAD-based segmentation enables processing of segments of up to 30 seconds with stable performance in practice. We fully acknowledge that automatic segmentation of unbounded speech and preservation of history context remain challenging, and we have noted this as a direction for future work. We will also add the current reliance on VAD to the limitations section.
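For illustration, a minimal energy-based sketch of the kind of VAD segmentation described above (cut at silent regions, cap segment length at roughly 30 s). The thresholds and the pure-energy criterion are assumptions for this sketch only; a deployed system may instead use a dedicated VAD model.

```python
import numpy as np

def vad_segments(wave, sr=16000, frame_ms=30, energy_thresh=1e-4,
                 min_silence_ms=300, max_segment_s=30):
    """Cut a long waveform (numpy float array) into speech segments at
    silent regions. Threshold values are illustrative, not tuned."""
    frame = int(sr * frame_ms / 1000)
    energies = [float(np.mean(wave[i:i + frame] ** 2))
                for i in range(0, len(wave) - frame, frame)]
    segments, start, silence = [], None, 0
    min_silence = min_silence_ms // frame_ms
    for idx, e in enumerate(energies):
        t = idx * frame
        if e > energy_thresh:
            silence = 0
            if start is None:
                start = t          # speech onset
        else:
            silence += 1
        long_pause = start is not None and silence >= min_silence
        too_long = start is not None and (t - start) / sr >= max_segment_s
        if long_pause or too_long:
            segments.append((start, t))   # close the current segment
            start, silence = None, 0
    if start is not None:
        segments.append((start, len(wave)))
    return segments  # list of (start_sample, end_sample)
```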

it is important to notice that EDAtt and AlignAtt are policies that do not require any retraining as they just rely on the cross-attention scores of the original model, while the proposed method requires specific architectural choices and specific fine-tuning.

We emphasize that our method does not require a specific architecture. Its successful application to a decoder-only architecture (e.g., Whisper-based models) demonstrates its architectural flexibility. Regarding fine-tuning, we only use the same data as in the original pre-training; no additional labeled data is introduced. The results show that our approach significantly outperforms EDAtt, particularly in low-latency regimes. Thus, the choice boils down to whether one prefers to improve streaming performance through lightweight fine-tuning (as we propose) or accept higher latency and quality degradation with attention-based policies.

the claim of minimal modification does not hold, as the architecture chosen for the experiments

Finally, the base model used in our experiments was adapted from Whisper, which originally formed part of a "wait-K" streaming system. The modifications we made were aimed at balancing efficiency and performance, not at introducing architectural novelty. There is no evidence that our method is restricted to non-standard architectures. In fact, we are among the few works to validate our approach across two fundamentally different architectures (encoder-decoder and decoder-only). To address your concern, we have conducted a quick experiment with full-parameter finetuning of a Qwen-omni encoder + Qwen_3_0.6B on the en-zh pair. The result is shown below: SimulMEGA (Whisper) is the model in the paper and SimulMEGA (SpeechLLM) is the model with the Qwen architecture.

Seamless
| Threshold | 0.3   | 0.5   | 0.7   | 0.9   | 1     |
| BLEU      | 37.28 | 37.28 | 37.60 | 37.94 | 38.55 |
| LAAL      | 2.074 | 2.178 | 2.260 | 2.407 | 2.813 |

SimulMEGA (Whisper)
| Threshold | 0.8   | 0.7   | 0.6   | 0.5   | 0.4   | 0.3   | 0.2   |
| BLEU      | 42.24 | 42.77 | 43.06 | 43.11 | 43.14 | 43.17 | 43.33 |
| LAAL      | 2.036 | 2.305 | 2.516 | 2.680 | 2.823 | 2.976 | 3.139 |

SimulMEGA (SpeechLLM)
| Threshold | 0.8   | 0.7   | 0.6   | 0.5   | 0.4   | 0.3   | 0.2   |
| BLEU      | 43.56 | 43.85 | 43.98 | 43.97 | 44.07 | 44.26 | 44.58 |
| LAAL      | 2.050 | 2.237 | 2.458 | 2.544 | 2.734 | 2.864 | 3.026 |
Comment

Actually, in the original EDAtt paper, results are reported over an AL range of 1-2.5 s. At 1.5 s AL, the method suffers a degradation of approximately 5 BLEU points (around 25%) compared to full-context performance. In contrast, in our experiments, EDAtt incurs only a 3 BLEU drop (10%) at 1.8 s LAAL, indicating that our implementation of EDAtt serves as a stronger baseline than the original. This suggests that our overall framework achieves better latency-accuracy trade-offs.

I was saying that a low latency regime (up to 1/1.5s) is important in SimulST, and the current paper does not report metrics for this scenario (latency up to 1.5s). Moreover, since the underlying model is different compared to the original model, which is a weaker model trained on MuST-C only for two languages separately, it is not surprising that it can achieve better quality results. However, this has nothing to do with ensuring that the proposed framework is better in terms of latency and accuracy tradeoff. The absence of a low-latency regime is a serious weakness of the current experimental results.

Both AlignAtt and EDAtt are based on cross-attention scores, and the usage of one should be possible if it is possible to use the other. Attention "fixation" can be easily solved by a frame-wise normalization, to regularize the attention patterns. I suggest that the authors apply such a method to both AlignAtt and EDAtt for solving the attention sink problem, allowing for a comparison with state-of-the-art approaches (as mentioned also by Reviewer U9os). Lastly, it is important to notice that EDAtt and AlignAtt are policies that do not require any retraining as they just rely on the cross-attention scores of the original model, while the proposed method requires specific architectural choices and specific fine-tuning.

The important point here is that both AlignAtt and EDAtt should work, as they both rely on the same source of information (cross-attention). It is unlikely that the second can be used while the first cannot, as attention sinks should be present in both cases. However, the authors managed to show the results using EDAtt. The normalization should be taken frame-wise, while the argmax is token-wise. An example of that implementation can be found in the mentioned repo.
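To make the distinction concrete, here is a small numpy sketch of an AlignAtt-style check with frame-wise normalization applied before the token-wise argmax; this is a paraphrase of the policy for illustration only, not the FBK-fairseq implementation.

```python
import numpy as np

def alignatt_emit(cross_attn, f=4, eps=1e-8):
    """cross_attn: [num_tokens, num_frames] cross-attention of the current
    hypothesis over the audio frames read so far. Returns True if the newest
    token may be emitted under an AlignAtt-style rule (illustrative sketch)."""
    # Frame-wise normalization: rescale each frame (column) over the tokens,
    # so an attention-sink frame cannot dominate every token's distribution.
    col_sums = cross_attn.sum(axis=0, keepdims=True) + eps
    normed = cross_attn / col_sums
    # Token-wise argmax: which frame the newest token mostly attends to.
    aligned_frame = int(np.argmax(normed[-1]))
    num_frames = cross_attn.shape[1]
    # Emit only if the alignment does not fall within the last f (still
    # uncertain) frames; otherwise wait for more audio.
    return aligned_frame < num_frames - f
```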

While it is true that direct long-form streaming methods exist, their practical applicability is limited. For instance, StreamAtt introduces up to 2 seconds of computational latency [...] In contrast, our VAD-based segmentation enables processing of segments of up to 30 seconds with stable performance in practice. We fully acknowledge that automatic segmentation of unbounded speech and preservation of history context remain challenging, and we have noted this as a direction for future work. We will also add the current reliance on VAD to the limitations section.

I don't see "2 seconds of computational latency" reported in the original StreamAtt paper, as it also achieves results competitive with gold-segmented upper bounds when used with AlignAtt. Anyway, I was critical about the missing details and discussion of the evaluation framework in the main paper, and I still think they should be included in the paper's experimental settings rather than in the limitations, as they are a core part of SimulST empirical validation.

We emphasize that our method does not require a specific architecture [...] Regarding fine-tuning, we only use the same data as in the original pre-training; no additional labeled data is introduced. The results show that our approach significantly outperforms EDAtt, particularly in low-latency regimes. Thus, the choice boils down to whether one prefers to improve streaming performance through lightweight fine-tuning (as we propose) or accept higher latency and quality degradation with attention-based policies.

This is indeed not true. As I already mentioned in my comments, the architecture chosen for the experiments implements inference-efficient approaches (chunk-AR and NAR), is inherently different, and is already optimized for simultaneous tasks, while standard S2TT and S2ST models do not usually employ such strategies. Therefore, the proposed method is not a lightweight fine-tuning but requires specific architectural choices. Moreover, as I also mentioned before, 1.7 s is not low latency (see also the IWSLT campaigns for references on the latency regimes used in the community).

There is no evidence that our method is restricted to non-standard architectures.

Unfortunately, there is also no evidence that the method works with standard architectures, such as Whisper. Indeed, no chunking and NAR frameworks have been used for building Whisper, which is a standard autoregressive encoder-decoder model.

Comment
  1. As you requested, we have extended the regime to around 1 s LAAL. However, for practical applications, we still do not recommend using this specific regime.
  2. We have also successfully implemented AlignATT after adding the frame-wise normalization; thank you for the suggestion on the implementation. Experiments show that EDATT and AlignATT are similar in performance, and SimulMEGA shows an even larger advantage in the low-latency regime.
  3. The "2 seconds of computational latency" comes from Table 2 of the StreamAtt paper. We obtained the extra computational latency as (LAAL_CA - LAAL_NCA). Please clarify if this interpretation is incorrect.
  4. We can slightly widen the 1.5 s LAAL boundary, as translations between Mandarin and Latin languages are included in the evaluation.
  5. Our commitment is to explicitly state the current reliance on VAD for real-world deployment as a limitation. All evaluation details will be covered, and we have verified that our evaluation pipeline perfectly matches the results of SimulEval.
  6. We have shown the versatility of our method by training a streaming TTS model on top of CosyVoice2. We have also switched the architecture to a QwenAudio-style SpeechLLM and shown that it reaches better results. SimulMEGA succeeded on each attempt and always converged to the best policy given the training dataset. We believe this is strong grounding that our method is not picky about its starting base. We hope the effort we invested in this will not be dismissed.

Evaluation results on CoVoST X-EN; the BLEU score of the base model is 37.04.

EDATT
| Setting | 0.05  | 0.03  | 0.025 | 0.02  | 0.015 | 0.01  |
| BLEU    | 27.46 | 33.7  | 34.81 | 35.6  | 36.09 | 36.56 |
| LAAL    | 1.132 | 1.88  | 2.062 | 2.301 | 2.61  | 3.056 |

AlignATT
| Setting | 32    | 48    | 64    | 80    |
| BLEU    | 29.92 | 32.81 | 35.13 | 35.72 |
| LAAL    | 1.339 | 1.713 | 2.179 | 2.463 |

SimulMEGA
| Setting | 0.9   | 0.8   | 0.7   | 0.6   | 0.5   | 0.4   | 0.3   | 0.2   |
| BLEU    | 32.46 | 34.97 | 35.58 | 35.97 | 36.27 | 36.46 | 36.6  | 36.73 |
| LAAL    | 1.142 | 1.562 | 1.729 | 1.86  | 1.982 | 2.104 | 2.233 | 2.390 |
Comment

The computational latency is the latency computed considering the actual elapsed time during simultaneous translation (and therefore includes latency that depends on, for instance, hardware and the efficiency of the codebase), while the ideal latency is computed considering only the latency introduced by the method itself, which is the latency the authors are considering here. This follows the literature on simultaneous translation; see the SimulEval paper for further reference.

I don't think that I'm being too defensive of attention-based methods (and by "standard architecture" in my response I am not referring to attention-based methods, but rather to autoregressive models, as I already mentioned), but assuming that the underlying architecture of the model should already implement inference-efficient approaches (i.e., chunk-AR and NAR) is not realistic, as almost all such models are not based on that.

Lastly, I agree that the discussion is not constructive anymore with such behavior, as I don't get why pointing out a weakness in a work should be disrespectful, especially since my rating is still positive and I'm not penalizing it, but, conversely, I'm trying to point out possible problems and points of improvement.

Comment

We don't mean any disrespect, and we really appreciate your engagement in the conversation, which has benefited us a lot. We have already addressed many weaknesses in our paper following your suggestions.

We are simply trying to show that the architectural choice should not hinder the utility of our method: it can be fine-tuned directly from a standard model such as Qwen2Audio, and we have provided some evidence for this. We really hope we can reach an agreement on that, so that everyone can try it on their own base model of any kind. But we respect your judgement on that.

We have also taken your suggestion and experimented on a standard QwenAudio architecture. That is all we can do for now. We have removed the potentially controversial wording and apologize for the trouble.

Review (Rating: 5)

The paper introduces SimulMEGA (Simultaneous Generation by Mixture-of-Experts Gating), a novel framework for Simultaneous Speech Translation (SimulST), which enables real-time cross-lingual communication by integrating speech recognition and machine translation under stringent latency constraints. Unlike traditional offline systems, SimulMEGA processes streaming audio input to generate translations incrementally, mimicking human interpreters. The framework addresses challenges in balancing translation quality, latency, and semantic coherence, particularly in multilingual many-to-many translation scenarios. SimulMEGA combines prefix-based training with a Mixture-of-Experts (MoE) refiner module to learn implicit read/write policies without introducing inference-time overhead. It employs a hybrid streaming speech encoder (combining chunk-wise autoregressive and non-autoregressive blocks) and a standard transformer-based text decoder, augmented by a global routing gate and an MoE refiner active only during training. The framework supports both speech-to-text and text-to-speech (TTS) streaming tasks, achieving state-of-the-art performance.

Strengths and Weaknesses

Strengths:

  • Simultaneous Speech Translation is a very valuable and promising research topic.
  • The paper presents a high-quality technical contribution, supported by rigorous evaluation across six language pairs.
  • The paper is well-structured, with clear descriptions of the SimulMEGA framework.

Weaknesses:

  • While the evaluation on six language pairs is comprehensive, the paper does not provide detailed analysis of performance across specific language pairs or linguistic phenomena (e.g., low-resource languages or complex syntactic structures). This limits insight into the framework’s robustness across diverse linguistic contexts. Additionally, the TTS evaluation is less detailed than the speech-to-text results.
  • The performance of the proposed method in Simultaneous Speech-to-Speech Translation should be further evaluated on the CVSS benchmark which is commonly used in previous work.

Questions

  • The paper describes the MoE refiner as comprising prefix and global experts, with merging weights determining the read/write policy. Could the authors provide more details on how these weights are learned during training?
  • The paper references prior policy-based frameworks but does not explicitly compare SimulMEGA’s policy learning approach to these methods. How does the unsupervised MoE-based policy learning differ from, say, the CTC-based policy or the attention-based policy.

Limitations

Yes

Final Justification

Thank you for your response. After carefully considering both the content of the paper and the rebuttal, I have decided to keep my score unchanged.

Formatting Issues

N/A

Author Response

Dear Reviewer,

Thank you for your valuable feedback. We have carefully addressed each of your concerns as follows:

Addressing Weaknesses

  1. Evaluation on specific language pairs: Comprehensive numerical results for each language pair in the CoVoST2 test set are currently available in our supplementary materials. In the final version, we will expand these into a complete table presented in the appendix for easier reference. Additionally, the supplementary materials include detailed translation examples that illustrate linguistic phenomena affecting performance. For instance, we observe higher translation latency between English and Mandarin due to their inverted grammatical structures, a finding that aligns with established linguistic research on language directionality in translation systems.
  2. TTS evaluation: Our current TTS implementation supports two languages, which explains the more focused evaluation scope. Nevertheless, our results demonstrate that SimulMEGA-TTS achieves lower streaming latency while maintaining the high-quality output standards of CosyVoice 2. We invite you to experience the practical performance through our audio samples and video demonstration in the supplementary materials, which showcase real-time translation capabilities across various speech patterns and speaking rates.
  3. Benchmark Clarification: While CVSS [1] is derived from CoVoST2, it specifically focuses on speech-to-speech translation pairs. In the final version, we will revise our benchmark description to accurately reflect this distinction and better align with the S2ST task framework.

Responding to Conceptual Questions

  1. MoE router weights learning: As in a standard MoE mechanism, the router automatically selects the best expert for a given input hidden state. The router weights are learned through gradient descent -- the weight assigned to the optimal expert increases during training, as the gradient suggests. Under this mechanism, we turn the MoE router into a read/write decision maker (a minimal code sketch is given after the references below):
    • Training Setup: For a given source sentence, the model only accesses the first half during training.
    • Expert Roles:
      • A prefix expert analyzes the visible partial input (h^{prefix}).
      • A global expert infers high-level context from the full sentence (H^{global}).
    • Policy Learning: The MoE router learns to select between experts when predicting each target token (e.g., preferring the prefix expert for early tokens and the global expert for later tokens). Crucially, no explicit policy labels are used; the router implicitly learns when input is sufficient for generation.
    • At inference (streaming mode), the global expert is unavailable (since future input is unknown). If the router "attempts" to use the global expert (i.e., assigns it high weight), this signals insufficient input, triggering a wait decision. Thus, the MoE mechanism naturally emerges as a read/write policy without supervision.
  2. Difference to CTC-based and attention-based policies:
  • CTC-based policy [2]: These require learning CTC forced alignments between source speech and target translations. In multilingual settings, this becomes problematic, as the system must establish alignments across multiple target languages, a challenge with, as far as we know, no demonstrated successful multilingual SimulST implementation to date.
  • Attention-based policy [3]: These leverage attention patterns from offline models for streaming decisions. While innovative, they typically require using the offline model without streaming-specific fine-tuning, potentially compromising performance in real-time scenarios.

[1] Jia, Ye, et al. “CVSS Corpus and Massively Multilingual Speech-to-Speech Translation.” International Conference on Language Resources and Evaluation 2022.

[2] S. Zhang, Q. Fang, S. Guo, Z. Ma, M. Zhang, and Y. Feng, “StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning,” 2024.

[3] S. Papi, M. Negri, and M. Turchi, “Attention as a Guide for Simultaneous Speech Translation,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto, Canada: Association for Computational Linguistics, 2023, pp. 13 340–13 356.
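To make the expert-merging mechanism described in point 1 concrete, here is a minimal PyTorch-style sketch of a two-expert refiner with a router gate. The module and tensor names (PrefixGlobalRefiner, h_prefix, h_global) and the sigmoid gating are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class PrefixGlobalRefiner(nn.Module):
    """Toy two-expert MoE refiner: a prefix expert sees only the visible
    speech prefix, a global expert sees full-sentence context, and a router
    mixes their outputs per target token. Names are illustrative only."""

    def __init__(self, d_model: int):
        super().__init__()
        self.prefix_expert = nn.Linear(d_model, d_model)
        self.global_expert = nn.Linear(d_model, d_model)
        # Router emits a scalar gate per target position:
        # ~0 -> rely on the prefix expert, ~1 -> needs global (future) context.
        self.router = nn.Linear(d_model, 1)

    def forward(self, dec_state, h_prefix, h_global):
        # dec_state: [B, T_tgt, D] decoder hidden states
        # h_prefix / h_global: [B, T_tgt, D] expert inputs derived from the
        # prefix-only / full-sentence encoder outputs, respectively.
        gate = torch.sigmoid(self.router(dec_state))           # [B, T_tgt, 1]
        out_prefix = self.prefix_expert(h_prefix)
        out_global = self.global_expert(h_global)
        refined = (1 - gate) * out_prefix + gate * out_global  # soft mixing
        # Returning the gate lets a refiner loss push it toward small values
        # for tokens that the prefix alone can already predict.
        return refined, gate.squeeze(-1)
```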

Comment

Thank you for your reply, which has addressed my concerns very well. I hope that the discussion on the policy and related prior work can be included in the final version, as I believe this will enhance the impact of the work. Therefore, I have decided to keep my positive score.

Review (Rating: 4)

This paper proposes SimulMEGA, an unsupervised policy learning framework for Simultaneous Speech Translation. It introduces a MoE Refiner and a routing gate that jointly enable the model to implicitly learn read/write policies without adding inference-time overhead. The proposed system supports both speech-to-text and text-to-speech streaming tasks and demonstrates good latency-quality trade-offs across six languages. The authors provide extensive experiments, including comparisons with SeamlessM4T, streaming TTS baselines, and simultaneous speech-to-speech translation tasks.

Strengths and Weaknesses

Strengths

  • The proposed method achieves SOTA performance on the SimulST task, supported by extensive experiments and strong comparisons against baselines.
  • The paper further extends the evaluation to streaming TTS and simultaneous speech-to-speech translation tasks, demonstrating the versatility of the proposed framework.
  • The method is relatively simple and efficient, requiring minimal architectural modifications to existing systems while still providing significant performance improvements.
  • The supplementary materials include comprehensive demos and audio samples.

Weaknesses

  • The clarity of the paper needs improvement. Many important concepts (such as h_{\text{prefix}} and L_{\text{offline}}) are not clearly explained, which makes the paper harder to follow.
  • Since the core contribution of the paper is the Mixture of Experts (MoE) refiner, the authors should provide more motivation for introducing it. Why is MoE necessary? How does it compare to native solutions without MoE? The use of gating mechanisms is also not particularly novel and would benefit from a more in-depth justification.
  • The implementation details of the streaming TTS component need to be described more thoroughly.

Questions

  • What specific problems does the introduction of the MoE refiner solve, and how does it contribute to the final performance improvement? It would be helpful if the authors could provide additional experiments to explicitly demonstrate the effectiveness of the MoE module.

  • The authors are encouraged to provide a more detailed explanation of the various loss functions introduced in the model, including their exact formulations. Additionally, key concepts such as h_{\text{prefix}} and other latent representations should be clearly defined.

  • More detailed implementation descriptions for the Streaming TTS component would be beneficial, particularly regarding system design, latency control, and how streaming is achieved in practice.

Limitations

Yes.

Final Justification

My final score is 4 (borderline accept). I believe this is a paper with innovative contributions. It proposes a MoE Refiner and a routing gate that jointly enable the model to implicitly learn read/write policies without introducing inference-time overhead. The method also achieves very strong results. However, I am not very familiar with this domain (hence my confidence score is 2), and there are some parts I did not fully understand. Therefore, I will keep my current score, which I believe is an acceptable evaluation.

Formatting Issues

No formatting issues.

Author Response

Dear Reviewer,

Thank you for your thoughtful feedback regarding the accessibility of our work for readers unfamiliar with simultaneous translation. We acknowledge that space constraints limited our ability to include extensive background in the main text, and we sincerely appreciate your engagement with this challenge. Below, we clarify the core concepts to address your concerns.

The fundamental challenge in streaming (simultaneous) translation is the read/write decision problem: as the system processes speech in continuous chunks, it must dynamically decide whether to generate output tokens or wait for future input. Suboptimal decisions lead to excessive latency (waiting too long) or over-generation (outputting incorrect tokens due to incomplete input).

While a robust read/write policy is essential, supervised training data for such policies is scarce, and existing data may not reflect optimal decisions. We thus propose an unsupervised approach where the model learns the policy intrinsically. Our solution leverages a Mixture-of-Experts (MoE) gating mechanism to enable this learning:

  1. Training Setup: For a given source sentence, the model only accesses the first half during training.
  2. Expert Roles:
    • A prefix expert analyzes the visible partial input (h^{prefix}).
    • A global expert infers high-level context from the full sentence (H^{global}).
  3. Policy Learning: The MoE router learns to select between experts when predicting each target token (e.g., preferring the prefix expert for early tokens and the global expert for later tokens). Crucially, no explicit policy labels are used—the router implicitly learns when input is sufficient for generation.

At inference (streaming mode), the global expert is unavailable (since future input is unknown). If the router "attempts" to use the global expert (i.e., assigns it high weight), this signals insufficient input, triggering a wait decision. Thus, the MoE mechanism naturally emerges as a read/write policy without supervision.
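To make the resulting inference-time behaviour concrete, here is a minimal sketch of the read/write loop implied by the description above; the helper names (router_global_weight, decode_next_token) and the threshold value are illustrative assumptions, not the actual SimulMEGA code.

```python
def simultaneous_decode(model, audio_stream, wait_threshold=0.5):
    """Toy read/write loop: WRITE while the router trusts the prefix expert,
    READ (wait for more audio) when it would lean on the global expert."""
    received = []          # audio chunks read so far
    output_tokens = []     # emitted target tokens
    for chunk in audio_stream:          # READ one streaming chunk
        received.append(chunk)
        while True:
            # Hypothetical helpers: router weight for the global expert at the
            # next position, and greedy decoding of the next token.
            g = model.router_global_weight(received, output_tokens)
            if g > wait_threshold:      # input judged insufficient -> wait
                break
            token = model.decode_next_token(received, output_tokens)
            if token == model.eos_id:
                break
            output_tokens.append(token)  # WRITE
    return output_tokens
```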

Regarding your query about "native" baselines: all comparative methods inherently implement some read/write policy (e.g., fixed latency rules or learned triggers). There is no "policy-free" baseline for simultaneous translation, as the task requires such decisions. We have added a new baseline per other reviewers’ suggestions to further strengthen comparisons.

For streaming TTS, the underlying mechanism is identical to simultaneous translation (as noted in Appendix X), differing only in modality-specific implementations. We kept this section concise to prioritize core contributions but are happy to expand it if needed.

We believe all notations (e.g., h^{prefix}, H^{global}, L^{prefix}) are clearly defined in Figures 1-2 and Equation 4. Should any term remain unclear, we will gladly refine the explanations in revision. Thank you for your time and valuable critique; we welcome the opportunity to improve the manuscript further.

Comment

Thank you for your response. I think the authors have addressed my concerns. I will keep my score.

Review (Rating: 4)

This paper presents SimulMEGA, a policy learning framework for simultaneous speech-to-text translation (S2TT) and text-to-speech synthesis (TTS). It introduces a mixture-of-experts (MoE) refiner on top of a standard Transformer architecture to enable adaptive read/write decision-making. The framework also incorporates a joint loss function designed to balance translation quality with latency-aware decision accuracy. Experiments on CoVoST2 and FLEURS across six languages demonstrate that SimulMEGA outperforms Seamless in multilingual settings.

Strengths and Weaknesses

Strengths

  1. SimulMEGA outperforms Seamless in both offline translation quality and the quality-latency trade-off for simultaneous translation, despite being a smaller multilingual model.
  2. The framework naturally extends to simultaneous text-to-speech synthesis, achieving near-offline synthesis quality with a latency of approximately 1 second.

Weaknesses

  1. The use of Average Lagging (AL) is problematic, as it has been shown to be not robust to over-generation [1]. A more reliable alternative is Length-Adaptive Average Lagging (LAAL), which has been adopted as the standard latency metric in recent IWSLT evaluations.
  2. Missing strong baseline: The paper omits AlignAtt [2], one of the most effective streaming policies that can be directly applied to offline ST models. Including it would provide a more comprehensive and fair comparison.
  3. Lack of experimental details: The paper does not specify the threshold configurations used during training and inference for read/write decision.

[1] Over-Generation Cannot Be Rewarded: Length-Adaptive Average Lagging for Simultaneous Speech Translation (Papi et al., AutoSimTrans 2022)

[2] AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation (Papi et al., InterSpeech 2023)

Questions

  1. What is the rationale behind keeping the non-autoregressive (NAR) layers from Whisper? Is this a trade-off between translation quality and computational efficiency?
  2. The score normalization in Equation (7) is somewhat unclear. Could you elaborate on the rationale behind this formulation?
  3. How is inference performed for the SimulMEGA-TTS model?
  4. Which translation API is used to obtain the reference translations?
  5. Lines 280–281 mention router scores in Figure 5(b), but I couldn’t find them in the figure. Could you clarify?
  6. What exactly does L^{\text{trunc}} refer to in line 285?

Limitations

The model is not tested on unbounded speech, where input streams last minutes or even hours, as in real-world applications. Evaluating on such long-form inputs is essential for assessing robustness and practical usability in realistic streaming settings.

Final Justification

My main concerns were:

  1. The use of AL, which has been shown to be a flawed metric.
  2. The absence of strong baselines such as AlignAtt.
  3. Some lack of clarity in the writing.

The authors’ response to my and other reviewers’ comments shows that:

  1. SimulMEGA remains strong under the LAAL metric.
  2. Results for AlignAtt have been added and are not as strong as SimulMEGA.
  3. The method also performs well in long-form inference.

Overall, I will maintain my positive score.

Formatting Issues

None.

Author Response

Dear Reviewer,

Thank you for your insightful comments and constructive feedback on our work. We have carefully addressed each of your concerns as follows:

Addressing Weaknesses

  1. Latency Evaluation: We have incorporated numerical evaluation results on the CoVoST2 test set using LAAL (Length-Adaptive Average Lagging) as the latency metric, presented in Table 1. The results confirm our original conclusion that SimulMEGA continues to marginally outperform baseline methods. In the final version, all quality-latency curves will be systematically re-plotted using the LAAL metric for comprehensive comparison.

  2. Baseline Comparison: While AlignATT did not function effectively with our base model (due to persistent attention fixation on particular frames leading to over-generation), we successfully integrated EDATT (Papi et al., 2023), which shares conceptual similarities. As shown in Table 1, EDATT outperforms both DIG-SST and wait-K baselines, further validating our approach's competitive advantage.

  3. Threshold Documentation: The final version will include comprehensive tables in the appendix detailing numerical results alongside thorough explanations of threshold selection methodology. For your immediate reference, Table 1 presents our threshold selection rationale, with additional implementation details available in the supplementary materials.

  4. Notation Correction: We sincerely appreciate your careful reading: the reference to L^{\text{trunc}} in line 285 should indeed be L^{\text{prefix}}. This was an oversight in notation consistency, and we will correct it in the final version.

  5. Real-world streaming processing: Our supplementary materials include a deployed web demonstration that processes streaming input using a VAD (Voice Activity Detection) system to segment audio into optimal 10-20 second chunks for model processing. While standardized long-form evaluation benchmarks remain scarce in the field, we constructed a specialized test set following Whisper's methodology, comprising 17 TED-LIUM talks with an average duration of nine minutes. We translated the reference transcripts to Chinese using ByteDance's Doubao translation service for evaluation. The S2TT performance metrics are presented in Table 2.

Clarifying Conceptual Questions

  1. Architecture Design: Our decision to apply chunk masking selectively rather than across all layers stems from preliminary experiments showing that universal application degraded translation performance. While we acknowledge the value of ablation studies, training additional base models on this data scale would be computationally prohibitive. We note that architectural choices serve as implementation details rather than core contributions of this work.

  2. Routing Score Normalization: The normalization mechanism is designed to align routing scores proportionally with available input information. Since information content is intractable, we approximate it using input duration. For instance, when processing half a sentence, the average routing score should approach 0.5 (with ideal scores transitioning from 0 in the first half to 1 in the second half). To account for boundary uncertainty, we slightly adjust the normalization target beyond the strict duration proportion.

  3. Generation Process: Upon receiving each text chunk, we initiate a new autoregressive generation process incorporating all previous text chunks and generated speech tokens as context. Generation terminates when the routing score reaches the predetermined threshold (a toy sketch of this chunk-incremental loop appears after this list). This approach maintains full compatibility with vLLM's CosyVoice implementation, achieving real-time factors (RTF) of 0.1 in deployment.

  4. Pseudo-Labeling: We employed Google Translate for pseudo-label generation, collecting translation pairs between November 2024 and March 2025 to ensure temporal consistency in the translation service's behavior.

  5. Router Score Interpretation: The notation p^{router} represents the router probability score. For visual clarification, Figure 6 in the appendix (supplementary materials) illustrates the dynamic behavior of these scores.
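For illustration, a toy sketch of the chunk-incremental generation loop described in point 3; the method names (route_score, next_speech_token) are placeholders and not the actual CosyVoice/vLLM interface.

```python
def streaming_tts(model, text_chunks, stop_threshold=0.5):
    """Incremental TTS sketch: grow the speech-token stream chunk by chunk."""
    text_so_far, speech_tokens = [], []
    for chunk in text_chunks:
        text_so_far.append(chunk)
        # New autoregressive pass over all text seen so far plus the speech
        # tokens already generated (hypothetical model interface).
        while True:
            score = model.route_score(text_so_far, speech_tokens)
            if score > stop_threshold:   # router asks for more text: stop here
                break
            tok = model.next_speech_token(text_so_far, speech_tokens)
            speech_tokens.append(tok)
            yield tok                    # stream the speech token downstream
```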

We would like to highlight that our supplementary materials contain complete implementation code, comprehensive numerical results, extended appendices, and a video demonstration of our deployed system, which may provide additional context for evaluation.

We sincerely appreciate your thorough review and valuable suggestions. These improvements have significantly strengthened our manuscript, and we welcome any further feedback you might have regarding our responses.

Table 1: CoVoST2 Evaluation

Seamless Streaming
| Setting   | 0.3   | 0.5   | 0.7   | 0.9   | 1     |
| X-EN BLEU | 33.50 | 33.75 | 34.05 | 34.37 | 35.05 |
| X-EN LAAL | 2.280 | 2.410 | 2.522 | 2.665 | 3.149 |
| EN-X BLEU | 33.35 | 33.70 | 33.83 | 34.11 | 35.41 |
| EN-X LAAL | 1.966 | 2.065 | 2.144 | 2.268 | 2.660 |

Wait-K
| Setting   | 2.5   | 1.9   | 1.5   | 1.1   | 0.5   |
| X-EN BLEU | 29.27 | 31.82 | 32.64 | 33.45 | 34.93 |
| X-EN LAAL | 1.731 | 2.151 | 2.296 | 2.565 | 3.309 |
| EN-X BLEU | 34.25 | 35.83 | 36.12 | 36.49 | 37.41 |
| EN-X LAAL | 1.975 | 2.294 | 2.376 | 2.541 | 3.127 |

DigSST
| Setting   | 0.1   | 0.07  | 0.05  | 0.03  | 0.01  |
| X-EN BLEU | 31.43 | 33.53 | 35.05 | 35.99 | 36.52 |
| X-EN LAAL | 2.035 | 2.237 | 2.442 | 2.752 | 3.340 |
| EN-X BLEU | 17.12 | 17.51 | 18.10 | 19.44 | 23.38 |
| EN-X LAAL | 0.734 | 0.770 | 0.808 | 0.902 | 1.252 |

EDATT
| Setting   | 0.03  | 0.025 | 0.02  | 0.015 | 0.01  |
| X-EN BLEU | 33.70 | 34.81 | 35.60 | 36.09 | 36.56 |
| X-EN LAAL | 1.880 | 2.062 | 2.301 | 2.610 | 3.056 |
| EN-X BLEU | 35.00 | 35.82 | 36.77 | 37.63 | 38.20 |
| EN-X LAAL | 1.963 | 2.131 | 2.384 | 2.688 | 3.129 |

SimulMEGA
| Setting   | 0.7   | 0.6   | 0.5   | 0.4   | 0.3   |
| X-EN BLEU | 35.58 | 35.97 | 36.27 | 36.46 | 36.60 |
| X-EN LAAL | 1.729 | 1.860 | 1.982 | 2.104 | 2.233 |
| EN-X BLEU | 37.83 | 38.11 | 38.14 | 38.28 | 38.39 |
| EN-X LAAL | 2.059 | 2.246 | 2.400 | 2.542 | 2.690 |
Table 2: Long-form S2TT result on TED-LIUM

| System    | BLEU  | AL       | LAAL     |
| SimulMEGA | 55.13 | 16.15655 | 24.62662 |
| Seamless  | 38.18 | 7.016107 | 15.42395 |
Comment

While standardized long-form evaluation benchmarks remain scarce in the field

There are two options you can consider.

  1. One is ACL 60/60 dev set used in IWSLT 2025 Simul Track https://iwslt.org/2025/simultaneous
  2. The other is RealSI https://github.com/byteresearchcla/RealSI

Table 2

The latency for long-form speech is too large to be usable in real life. A usable system should keep latency under 5 seconds at most. This is achieved by many prior methods [1, 2, 3].

[1] Siqi Ouyang, Xi Xu, and Lei Li. 2025. InfiniSST: Simultaneous Translation of Unbounded Speech with Large Language Model. In Findings of the Association for Computational Linguistics: ACL 2025, pages 3032–3046, Vienna, Austria. Association for Computational Linguistics.

[2] Sara Papi, Marco Gaido, Matteo Negri, and Luisa Bentivogli. 2024. StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3692–3707, Bangkok, Thailand. Association for Computational Linguistics.

[3] Cheng S, Bao Y, Huang Z, Lu Y, Peng N, Xu L, Yu R, Cao R, Han T, Li Z, Liu S. Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice. arXiv preprint arXiv:2507.17527. 2025 Jul 23.

Comment

Thank you for your suggestion and meaningful discussion. We would like to clarify our latency evaluation methodology:

First, our AL and LAAL are computed over the entire long-form audio. Both AL and LAAL assume a uniform information distribution along the input, which rarely holds in long-form speech. As a result, these metrics can be misleading in practice.
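For reference, a compact sketch of how AL and LAAL are typically computed for speech input (delays measured in seconds of consumed source audio). This follows the commonly used definitions (Ma et al. for AL, Papi et al. for LAAL) rather than the exact SimulEval implementation, so treat it as illustrative.

```python
def average_lagging(delays, src_duration, ref_len=None):
    """delays[i]: seconds of source audio consumed when target token i was
    emitted. AL uses the hypothesis length as the rate denominator; LAAL
    (pass ref_len) uses max(hypothesis, reference) so that over-generation
    is not rewarded. Sketch of the usual definitions, not SimulEval itself."""
    hyp_len = len(delays)
    rate_len = max(hyp_len, ref_len) if ref_len is not None else hyp_len
    # Only average up to the first token emitted after the source has ended.
    tau = next((i + 1 for i, d in enumerate(delays) if d >= src_duration),
               hyp_len)
    lag = [delays[i] - i * src_duration / rate_len for i in range(tau)]
    return sum(lag) / tau

# Example: 3 s of audio, 4 tokens emitted at 1.0 s, 1.5 s, 2.5 s, 3.0 s.
al = average_lagging([1.0, 1.5, 2.5, 3.0], src_duration=3.0)
laal = average_lagging([1.0, 1.5, 2.5, 3.0], src_duration=3.0, ref_len=6)
```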

Second, the results reported in [1] and [2] use the StreamLAAL metric, which computes LAAL on each pre-segmented audio clip. StreamLAAL is sensitive to the segmentation method, making fair comparisons between systems difficult. In our experiment, segments are produced automatically by a VAD (without known transcripts); therefore, we are unable to report StreamLAAL results.

As suggested by Papi et al. [4], long-form evaluation should rely on automatically segmented speech rather than human segmentation. For instance, Cheng et al. [3] state that “AL and FLAL measure translation latency at the segment level,” which implies use of the original human-segmented RealSI test set; strictly speaking, those are not long-form results. In [5], paragraph-level RealSI results likewise exhibit very large latency figures.

Regarding Table 4, Seamless achieves lower latency than our system primarily because it clips speech into shorter (~5 s) segments, whereas we use longer (20-30 s) segments to maximize context. This longer context leads to noticeably better translation quality in our system.

We fully acknowledge that establishing a standard, intuitive, and fair evaluation protocol for long-form translation remains an open challenge. Our long-form experiments were conducted under tight time constraints, and we included them here to demonstrate real-world applicability with VAD-based segmentation. The aforementioned works [1,2,3], though inherently more suitable for unbounded speech, have their own drawbacks (either relying on human-crafted data or suffering from high computation costs). In future work, we will pursue semantic-aware automatic segmentation and develop comprehensive, fair long-form evaluation metrics, which we will also add to our limitations section.

[1] Siqi Ouyang, Xi Xu, and Lei Li. 2025. InfiniSST: Simultaneous Translation of Unbounded Speech with Large Language Model. In Findings of ACL 2025.

[2] Sara Papi et al. 2024. StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection. In ACL 2024 (Long Papers).

[3] Shanbo Cheng et al. 2025. Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice. arXiv:2507.17527.

[4] Papi, S., Polak, P., Macháček, D., & Bojar, O. 2025. How “Real” is Your Real-Time Simultaneous Speech-to-Text Translation System? TACL, 13:281–313.

[5] Shanbo Cheng et al. 2024. Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent. arXiv:2407.21646.

Comment

StreamLAAL is sensitive to the segmentation method, making fair comparisons between systems difficult.

StreamLAAL is at least better than VAD + LAAL.

segments are produced automatically by a VAD (without known transcripts); therefore, we are unable to report StreamLAAL results.

You are testing on TED-LIUM, which has metadata detailing the start and end timestamps of each gold utterance, so you can actually calculate StreamLAAL.

Cheng et al. [3] state that “AL and FLAL measure translation latency at the segment level,” which implies use of the original human-segmented RealSI test set; strictly speaking, those are not long-form results.

I consulted the authors and they confirmed that they evaluate on the unsegmented test set, have human annotators segment the long-form hypotheses, and then compute segment-level AL, so it is a long-form result.

Comment

The following are StreamLAAL results on TED-LIUM with the provided segmentation. We also add another result for the EN-FR pair. The previous result was just meant to showcase the capability of our deployed online system, and we acknowledge that it is better to follow the IWSLT standard for evaluation.

| System    | EN-ZH BLEU | EN-ZH StreamLAAL | EN-FR BLEU | EN-FR StreamLAAL |
| SimulMEGA | 52.16      | 2.227            | 35.53      | 1.527            |
| Seamless  | 38.19      | 1.965            | 35.48      | 1.967            |
Final Decision

(a) Summary:

This paper introduces SimulMEGA, a Mixture-of-Experts (MoE) based framework for simultaneous speech translation (SimulST) that learns read/write policies in an unsupervised manner. The method extends to both speech-to-text (S2TT) and streaming text-to-speech (TTS), with minimal architectural modifications. The authors demonstrate improved trade-offs between latency and quality compared to Seamless and other baselines across six languages. They also provide ablation studies, long-form speech experiments, and supplementary demonstrations.

(b) Strengths:

  1. Clear technical innovation through the introduction of an MoE refiner that yields implicit read/write policies without inference-time overhead
  2. Strong empirical performance, outperforming Seamless on CoVoST2 and FLEURS benchmarks with smaller model size
  3. Extension beyond S2TT to simultaneous TTS and speech-to-speech translation, highlighting versatility
  4. Thorough experimental analysis, including ablations, long-form settings, and supplementary demonstrations
  5. The method is relatively lightweight and adaptable to multiple architectures, as shown by experiments on Whisper-derived and Qwen-based systems

(c) Weaknesses:

  1. Evaluation originally lacked LAAL (Length-Adaptive Average Lagging), now the standard in SimulST; the authors added results during the rebuttal.
  2. The selection of baselines ignored state-of-the-art works such as StreamSpeech or NAST-S2X.
  3. The long-form evaluation relies on VAD-based segmentation rather than direct unbounded-streaming methods, which reviewers noted as a limitation.
  4. Some reliance on non-standard architectural modifications (chunk-AR, NAR encoders) limits the general applicability to standard S2TT/S2ST architectures.

(d) Reasons for decision:

The paper presents meaningful contributions to SimulST, with demonstrated empirical improvements over strong baselines. It proposes a novel use of MoE for implicit policy learning and shows adaptability across tasks and architectures. However, limitations in evaluation protocol, baseline coverage, and clarity reduce its case for spotlight or oral. Thus poster acceptance is appropriate: the work is innovative and technically solid, but does not yet demonstrate the broad maturity and impact required for a higher recommendation.

(e) Discussion and rebuttal:

  • Reviewer U9os raised concerns about the use of AL as a latency metric, absence of strong baselines (e.g., AlignAtt), and lack of threshold details. The authors addressed these by providing LAAL results, adding EDAtt and AlignAtt comparisons, and clarifying thresholds. This satisfied the reviewer, who maintained a positive score.

  • Reviewer Eauk emphasized lack of clarity around MoE motivation and TTS implementation. The rebuttal clarified the roles of prefix/global experts and provided more implementation details, which addressed most concerns.

  • Reviewer FapF requested deeper analysis across language pairs and more detailed S2ST benchmarks. The authors provided supplementary tables and clarified benchmark scope. Reviewer acknowledged the responses positively.

  • Reviewer 2Fcn strongly emphasized missing baselines and clarity on evaluation frameworks. Authors added AlignAtt results, low-latency regime experiments, and comparisons to NAST-S2S and StreamSpeech. Nonetheless, the reviewer remained critical about the reliance on non-standard architectures and the absence of truly low-latency (<1.5s) evaluation. The final score stayed borderline positive.

Overall, all reviewers acknowledged improvements after rebuttal. While some concerns (especially about architecture assumptions and low-latency evaluation) were only partially addressed, the reviewers did not oppose acceptance. The consensus points to a technically solid, novel contribution that deserves inclusion as a poster.