Learning Monotonic Attention in Transducer for Streaming Generation
Abstract
Reviews and Discussion
This paper presents a solution to the challenges faced by Transducer-based streaming generation models in handling non-monotonic alignments, particularly in simultaneous translation tasks. The authors introduce a learnable monotonic attention mechanism that integrates with Transducer decoding. Utilizing the forward-backward algorithm, they infer alignment probabilities between predictor states and input timestamps, allowing adaptive adjustments in attention scope. Experimental results show significant performance improvements of the MonoAttn-Transducer, offering a robust method for complex streaming generation tasks. Their code is publicly available.
Strengths
- This paper improves the Transducer architecture, making it more suitable for streaming generation tasks.
- The experiments demonstrate that the proposed method achieves commendable performance in the Speech-to-text Simultaneous Translation task.
Weaknesses
- The experiments in this paper only provide results on the Speech-to-Text Simultaneous Translation task, and do not validate the method on other streaming generation tasks.
- Under lower latency conditions (about 1000 ms) on the En-De and En-Es tasks, the performance of MA-T appears to be inferior to that of CAAT, another Transducer-based method. This may indicate that the proposed method is not highly effective.
- The proposed method in this paper shares a very similar model structure with both CAAT (Liu et al., 2021) and MILk (Arivazhagan et al., 2019), which raises concerns about the lack of novelty in the article.
References
[1] Dan Liu, Mengge Du, Xiaoxi Li, Ya Li, and Enhong Chen. Cross attention augmented transducer networks for simultaneous translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.
[2] Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, Chung-Cheng Chiu, Semih Yavuz, Ruoming Pang, Wei Li, and Colin Raffel. Monotonic infinite lookback attention for simultaneous machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019.
Questions
- After incorporating a unidirectional encoder, the model structures of MA-T and CAAT are very similar. What are the specific differences between the two? How does MA-T manage to perform attention computations while ensuring that the memory overhead remains O(1)?
- Using AL as a latency metric may not accurately assess the phenomenon of over-generation. Could you provide alternative latency metrics, such as LAAL or LAAL-CA?
- Regarding the comparison between MA-T and TAED, could you provide more experimental results to demonstrate that your method truly achieves better performance than TAED? From the perspective of AL and BLEU, while TAED has a larger computational overhead, it does achieve higher translation quality at lower latency, and the overhead is still acceptable.
Thank you for your insightful comments! We sincerely appreciate the time and effort you dedicated to reviewing our work. Below, we provide detailed responses to your questions and hope these clarifications could address your concerns.
———— Response to Questions about Model Design and Complexity ————
[Q1]. After incorporating a unidirectional encoder, the model structures of MA-T and CAAT are very similar. What are the specific differences between the two? How does MA-T manage to perform attention computations while ensuring that the memory overhead remains O(1)?
In the Transducer, introducing an attention module results in an exponentially large alignment space between source and target, making the training process infeasible. Our MonoAttn-Transducer and CAAT solve this problem in two different ways.
CAAT solves this problem by moving all cross-attention modules in the predictor on top of the self-attention modules, which ensures that the representation of each predictor state is independent of the specific READ/WRITE path and is solely determined by the number of source tokens accessible to that predictor state. However, this solution has the following drawbacks:
- CAAT imposes restrictions on the design of the predictor, requiring that the cross-attention must come after all self-attention, disallowing the typical interleaved structure. This constrains the model's expressive power.
- During training, it is necessary to compute the representation of each predictor state when attending to every possible number of source tokens. This results in the predictor's forward pass being executed once per possible source prefix within the PyTorch computation graph, significantly increasing GPU memory overhead.
In this work, we solve this problem by estimating the posterior probability of alignment between source and target (Eq. 5 and Eq. 7 in the paper). This estimation from the output lattice directly gives us clues about how much source each target token can attend to during training. We use this posterior alignment probability to estimate the expected representation of each predictor state, thus ensuring that the memory overhead remains O(1), without imposing any restrictions on the design of the predictor.
This nice property allows our model to scale effectively to scenarios involving very long sequences, such as streaming speech generation, which we will discuss in the following.
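To make Eq. 8 concrete, below is a minimal PyTorch sketch of the expected-context computation. The shapes, variable names, and the prefix-sum formulation are our illustration for this response, not the released code:

```python
import torch

def expected_monotonic_context(q, k, v, post):
    """Hedged sketch of Eq. 8 (illustrative shapes and names only).

    q:    (U, d) queries, one per predictor state
    k, v: (T, d) encoder keys and values
    post: (U, T) posterior alignment, post[u, t] = p(state u aligns to step t),
          with each row summing to 1 (estimated via Eq. 5 and Eq. 7)

    Returns (U, d) expected contexts:
        c_u = sum_t post[u, t] * Attention(q_u, k_{<=t}, v_{<=t})
    """
    scores = (q @ k.t()) / q.size(-1) ** 0.5            # (U, T) energies
    e = (scores - scores.max(dim=1, keepdim=True).values).exp()
    num = torch.cumsum(e.unsqueeze(-1) * v, dim=1)      # prefix sums: (U, T, d)
    den = torch.cumsum(e, dim=1).unsqueeze(-1)          # prefix sums: (U, T, 1)
    prefix_ctx = num / den                              # softmax over k_{<=t}
    return (post.unsqueeze(-1) * prefix_ctx).sum(dim=1)
```

Note that the predictor and this attention run only once per training step: the expectation under the alignment posterior replaces CAAT's per-prefix forward passes, which is where the memory saving comes from.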
[W3]. The proposed method in this paper shares a very similar model structure with both CAAT and MILk.
As analyzed above, our MonoAttn-Transducer differs significantly from CAAT. Our model reduces the predictor's memory consumption from O(T) to O(1) during training and imposes no restrictions on the predictor's design. Moreover, MonoAttn-Transducer learns monotonic attention based on the posterior alignment of the Transducer, whereas MILk uses learnable Bernoulli variables, which is significantly different.
———— Response to Questions about Experiments ————
[W1]. The experiments in this paper only provide results on the Speech-to-Text Simultaneous Translation task, and do not validate the method on other streaming generation tasks.
In response to your advice, we conducted experiments on textless Speech-to-Speech Simultaneous Translation. This challenging streaming task requires implicitly performing ASR, MT, and TTS simultaneously, while also handling the non-monotonic alignments in streaming generation. We adopted a textless setup in our experiments, directly modeling the mapping between the source speech spectrogram and discrete units of the target speech [1].
Experiments are conducted on the CVSS-C French-to-English Speech-to-Speech Translation dataset. Results are provided below.
| Fr->En Speech-to-Speech | Metric | Chunk Size 320 ms | Offline |
|---|---|---|---|
| MonoAttn-Transducer | ASR-BLEU | 18.3 | 19.3 |
| | AL (ms) | 118 | - |
| | LAAL (ms) | 972 | - |
| Transducer | ASR-BLEU | 17.1 | 18.0 |
| | AL (ms) | 153 | - |
| | LAAL (ms) | 984 | - |
In speech-to-speech streaming generation, MonoAttn-Transducer also demonstrated a significant improvement in speech generation quality. This indicates that our approach is not limited to streaming text generation but also effective for streaming speech generation.
[Q2]. Using AL as a latency metric may not accurately assess the phenomenon of over-generation. Could you provide alternative latency metrics, such as LAAL or LAAL-CA?
Sorry for the confusion. We actually provide the LAAL and LAAL-CA results in Table 7, Appendix C. For your convenience, we have included the results from that table here. (Only En-Es is shown for the sake of space.)
| MuST-C En-Es | Chunk Size (ms) | 320 | 640 | 960 | 1280 |
|---|---|---|---|---|---|
| Transducer | LAAL (ms) | 1258 | 1563 | 1942 | 2312 |
| | LAAL-CA (ms) | 1444 | 1673 | 2028 | 2389 |
| MonoAttn-Transducer | LAAL (ms) | 1317 | 1582 | 1957 | 2305 |
| | LAAL-CA (ms) | 1501 | 1702 | 2056 | 2387 |
We found that LAAL exhibits a similar trend to AL, showing that MonoAttn-Transducer improves performance while having almost no impact on latency. Moreover, the values of LAAL-CA are almost the same as those of LAAL, which strongly demonstrates the benefit of our model's low decoding complexity compared with TAED's.
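For clarity, below is a simplified sketch of how the two metrics differ (an illustration based on the metric definitions, not SimulEval's exact code; the computation-aware variants simply use wall-clock elapsed time as the delays):

```python
def average_lagging(delays, src_len, ref_len, length_adaptive=False):
    """AL (length_adaptive=False) vs. LAAL (True); simplified sketch.

    delays[i]: amount of source (e.g., ms) read before emitting target token i.
    AL scales the ideal per-token delays by the reference length, so
    over-generated hypotheses receive overly generous budgets and can be
    rewarded with lower lag; LAAL substitutes the max of hypothesis and
    reference length to remove that reward.
    """
    hyp_len = len(delays)
    gamma = (max(hyp_len, ref_len) if length_adaptive else ref_len) / src_len
    # average only up to the first token emitted after the full source is read
    tau = next((i for i, d in enumerate(delays) if d >= src_len), hyp_len - 1) + 1
    return sum(delays[i] - i / gamma for i in range(tau)) / tau
```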
[W2 & Q3]. Regarding the experimental comparison between our MonoAttn-Transducer, CAAT and TAED.
We would like to point out that the results for CAAT and TAED provided in our paper are directly borrowed from their respective papers. However, a direct comparison with our results may not be entirely fair, as the training corpora are not identical (due to differences caused by knowledge distillation) and the use of additional enhancement tricks varies (beam search and auxiliary regularization in CAAT).
For CAAT, recently reproduced results from the research community can be found in [2]. (In fact, we follow the data settings in [2] in our experiments.)
We compare with results in [2] here.
| MuST-C En-Es | Metric | | | | |
|---|---|---|---|---|---|
| Transducer | BLEU | 24.33 | 25.82 | 26.36 | 26.40 |
| | LAAL (ms) | 1168 | 1466 | 1847 | 2220 |
| CAAT [2] | BLEU | 25.1 | 26.0 | 26.6 | 26.7 |
| | LAAL (ms) | 1020 | 1370 | 2050 | 2380 |
| MonoAttn-Transducer | BLEU | 24.72 | 26.74 | 27.05 | 27.41 |
| | LAAL (ms) | 1230 | 1475 | 1837 | 2204 |
It can be observed that under lower latency conditions (about 1000 ms), our MonoAttn-Transducer shows no significant difference compared to CAAT. In fact, at around 1400 ms, it already demonstrates a noticeable improvement. Nevertheless, we do not claim that our model outperforms CAAT or other SOTA methods. What we want to clarify is that the more important contribution of this work is providing a low-complexity algorithm for introducing the attention mechanism into the Transducer, effectively addressing the issue of non-monotonic alignment in streaming generation, rather than surpassing other models.
Thank you again for taking the time to review our rebuttal. If you have any further questions, we would be happy to discuss them.
References
[1] Lee et al. (2022). Direct speech-to-speech translation with discrete units.
[2] Papi et al. (2023). Attention as a Guide for Simultaneous Speech Translation.
Thank you for your response. However, I’m afraid I will maintain the original score.
Dear Reviewer,
Thank you for taking the time to review our work. We understand your busy schedule.
As the rebuttal period is nearing its close, could you please take a moment to check our response? If you have any further questions or require clarification, we would be more than happy to discuss them with you.
Thank you in advance for your time.
Authors
Dear Reviewer 2LWH,
Thank you for your valuable contributions to the review process for the paper! The authors have submitted their rebuttal, and I would greatly appreciate it if you could take a look and provide your response.
We respect your decision and appreciate your response. However, could you at least specify your reasons and let us know whether our response addresses your concerns?
We took great effort to address your concerns in a very detailed way. Such a perfunctory response only suggests that our efforts during the rebuttal are not being respected.
This paper describes an extension for Transducers for tasks where the input and output are not monotonically aligned, such as speech translation. The method is evaluated on two languages of the MuST-C benchmark and it is compared to existing methods.
Strengths
- The paper addresses an interesting setting which has practical implications
- The method is described in detail and a large portion of the paper is dedicated to this description.
Weaknesses
- The method is fairly complex, requiring a range of steps in addition to Transducer training as shown in Algorithm 1 (which is already complex).
- The description of the method is sometimes not very clear. It would be helpful to provide intuition in addition to equations.
- The evaluation was done on languages with relatively similar word order: En-Es and En-De differ in their word order, but much less so than, say, English-Chinese or English paired with any non-European language. There are speech translation benchmarks that enable these settings, e.g., FLEURS, which would provide more interesting results.
- It seems a bit surprising that the BLEU improvements for En-Es and En-De between the Transducer and the new method are fairly similar across most chunk sizes (about 1 BLEU, Table 2). I would have expected En-De to benefit more from the new method than En-Es, given that the word order of En-De is more different.
Questions
- Did you consider evaluating on other settings than En-De/En-Es where word order is more different?
Thank you for your insightful comments! We sincerely appreciate the time and effort you dedicated to reviewing our work. Below, we provide detailed responses to your questions and hope these clarifications could address your concerns.
———— Response to Questions about Intuition & Motivation ————
[W1 & W2]. The method is fairly complex, requiring a range of steps in addition to Transducer training as shown in Algorithm 1 (which is already complex). The description of the method is sometimes not very clear. It would be helpful to provide intuition in addition to equations.
Sorry for the confusion. We will try to briefly explain the intuition of this work here.
The Transducer is a widely applied architecture in streaming ASR. However, it is less suitable for simultaneous interpretation, where the alignment between languages is often non-monotonic (e.g., from an SVO language to an SOV language). The Transducer's non-monotonicity problem arises from its synchronization mechanism, which relies solely on an MLP-based joiner to connect its encoder and predictor (analogous to the decoder in a Transformer; please see Fig. 1 in the paper). Unlike the Transformer, it lacks an attention module to facilitate reordering of target outputs.

Introducing an attention module into a streaming generation architecture is not trivial, as there is no way to know how many source tokens a target token can attend to during training. Specifically, in the Transducer, introducing an attention module results in an exponentially large alignment space between source and target, making training infeasible. In this work, we solve this problem by estimating the posterior probability of alignment between source and target (Eq. 5 and Eq. 7 in the paper). This gives us clues about how much source each target token can attend to during training. We use this posterior alignment probability to estimate the attention representation (Eq. 8), which finally lets us train a Transducer with a monotonic attention synchronization mechanism while maintaining the same complexity as the standard Transducer.
In conclusion, the major contribution of this work is providing a low-complexity algorithm to introduce the attention mechanism into Transducer, effectively addressing the issue of non-monotonic alignment in streaming generation.
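To make Eq. 5 and Eq. 7 concrete, below is a hedged sketch of the forward-backward pass that yields the alignment posterior. The notation and the naive O(TU) Python loops are our illustration for readability; a practical implementation would be vectorized:

```python
import torch

def alignment_posterior(log_emit, log_blank):
    """Hedged sketch (our notation; the released code may differ).

    log_emit:  (T, U)   log p(emit token u+1 | source step t, u tokens emitted)
    log_blank: (T, U+1) log p(emit blank     | source step t, u tokens emitted)

    Returns (U+1, T): p(predictor state u is active at source step t),
    normalized over t, computed on the standard Transducer lattice with the
    same forward-backward recursion used for the loss.
    """
    T, U1 = log_blank.shape
    U = U1 - 1
    alpha = log_emit.new_full((T, U + 1), float("-inf"))
    beta = log_emit.new_full((T, U + 1), float("-inf"))
    alpha[0, 0] = 0.0
    beta[T - 1, U] = log_blank[T - 1, U]  # final blank consumes the last frame
    for t in range(T):
        for u in range(U + 1):
            cands = [alpha[t, u]]         # keeps the (0, 0) initialization
            if t > 0:
                cands.append(alpha[t - 1, u] + log_blank[t - 1, u])
            if u > 0:
                cands.append(alpha[t, u - 1] + log_emit[t, u - 1])
            alpha[t, u] = torch.logsumexp(torch.stack(cands), dim=0)
    for t in range(T - 1, -1, -1):
        for u in range(U, -1, -1):
            cands = [beta[t, u]]          # keeps the (T-1, U) initialization
            if t < T - 1:
                cands.append(beta[t + 1, u] + log_blank[t, u])
            if u < U:
                cands.append(beta[t, u + 1] + log_emit[t, u])
            beta[t, u] = torch.logsumexp(torch.stack(cands), dim=0)
    gamma = alpha + beta - beta[0, 0]     # log-occupancy of each lattice node
    return gamma.t().softmax(dim=-1)      # normalize over source steps
```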
———— Response to Questions about Experiments ————
[W3 & Q1]. The evaluation was done on languages with relatively similar word order: En-Es and En-De differ in their word order, but much less so than, say, English-Chinese or English paired with any non-European language. There are speech translation benchmarks that enable these settings, e.g., FLEURS, which would provide more interesting results. Did you consider evaluating on other settings than En-De/En-Es where word order is more different?
Thank you for your valuable suggestions. In response to your feedback, we conducted experiments on the CoVoST 2 En-Zh Speech-to-Text dataset. (The reason we chose CoVoST 2 is that we currently only have preprocessed CoVoST data, which allows us to complete the experiments within the tight schedule.) Results are presented below.
| En-Zh Speech-to-Text | Metric | Chunk Size 320 ms | Chunk Size 1280 ms | Offline |
|---|---|---|---|---|
| MonoAttn-Transducer | BLEU | 14.4 | 17.2 | 19.5 |
| | AL (ms) | 2323 | 2937 | - |
| Transducer | BLEU | 14.1 | 16.7 | 18.8 |
| | AL (ms) | 2399 | 2794 | - |
To our surprise, we found the improvement on CoVoST En-Zh is relatively minor compared to the improvements on MuST-C En-De/Es. We suspect the following factors might contribute to this: the CoVoST dataset is relatively noisy. We observed that an ASR model trained on its source speech-transcript data has a WER of 30, suggesting the dataset may not accurately reflect the model's performance. Additionally, we noted that even with a chunk size of 320 ms, the model's latency (AL) on this dataset is as high as 2300 ms, again indicating considerable noise in the dataset. We plan to conduct experiments on the FLEURS dataset later and report the results.
To further alleviate your concern, we additionally conducted experiments on French-to-English Speech-to-Speech Simultaneous Translation. This challenging streaming task requires implicitly performing ASR, MT, and TTS simultaneously, while also handling the non-monotonic alignments in streaming generation. We adopted a textless setup in our experiments, directly modeling the mapping between the source speech spectrogram and discrete units of the target speech [1].
We conducted experiments on the CVSS-C Speech-to-Speech Translation dataset. Results are provided below.
| Fr->En Speech-to-Speech | Metric | Chunk Size 320 ms | Offline |
|---|---|---|---|
| MonoAttn-Transducer | ASR-BLEU | 18.3 | 19.3 |
| | AL (ms) | 118 | - |
| | LAAL (ms) | 972 | - |
| Transducer | ASR-BLEU | 17.1 | 18.0 |
| | AL (ms) | 153 | - |
| | LAAL (ms) | 984 | - |
As shown above, MonoAttn-Transducer also demonstrated a significant improvement in French-to-English Speech-to-Speech Simultaneous Translation. This indicates that our approach is also effective for streaming speech generation.
[W4]. It seems a bit surprising that the BLEU improvements for En-Es and En-De between Transducer and the new method seem fairly similar across most chunk sizes (about 1 BLEU, Table 2). I would have expected En-De to benefit more the new method than En-Es given that the word order of En-De is more different.
Indeed, we observed the same phenomenon with BLEU. However, by examining the COMET scores in Table 1, we see that the average improvement for En-De reached 1.7 COMET, significantly higher than the 0.98 COMET for En-Es. Considering that COMET scores have been shown to correlate more strongly with human evaluations [2], we believe the COMET results demonstrate the superiority of our approach in handling non-monotonic alignments in streaming generation.
Thank you again for taking the time to review our rebuttal. If you have any further questions, we would be happy to discuss them.
Reference
[1] Lee et al. (2022). Direct speech-to-speech translation with discrete units.
[2] Rei et al. (2022). COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task
I appreciate the authors rebuttal, however, I will maintain my rating since the new results are not more convincing than the existing results.
Thanks for your response.
As we have provided the results you required (speech-to-text En->Zh) and the results in other modalities (speech-to-speech Fr->En) in our rebuttal, could you specify why you are not satisfied with our response to help us improve our work?
Hi Reviewer ak6X,
Thank you once again for your time and effort. I understand how busy your schedule must be. However, could you kindly let us know which parts of our rebuttal you are not satisfied with and what specific information you would like to see in our response?
Your time would be greatly appreciated.
Best regards,
Authors
Dear Reviewer ak6X,
Thank you for your valuable contributions to the review process for the paper! The authors have submitted their rebuttal, and I would greatly appreciate it if you could take a look and provide your response.
This paper proposes MonoAttn-Transducer, a novel approach to enhance Transducer models with learnable monotonic attention for streaming generation tasks. The key contributions are:
- A method to integrate monotonic attention into Transducer's architecture while maintaining its efficient training through the forward-backward algorithm.
- A training algorithm that uses posterior alignment probabilities to estimate context representations for monotonic attention.
- A chunk synchronization mechanism to bridge the gap between training and inference.
- Extensive experiments demonstrating improved performance on simultaneous translation tasks.
Strengths
Technical Innovation:
- The proposed method cleverly solves the exponential state space problem by using posterior alignments to estimate context representations
- The solution maintains the same computational complexity as the vanilla Transducer
- The chunk synchronization mechanism shows thoughtful consideration of practical deployment
Theoretical Foundation:
- The approach is well-grounded in probability theory and previous work on Transducers
- The mathematical derivations are sound and clearly explained
- The relationship between prior and posterior alignments is well-analyzed
Empirical Results:
- Comprehensive experiments on MuST-C dataset
- Strong improvements over the baseline Transducer (+0.75-1.0 BLEU, +0.95-2.06 COMET)
- Thorough ablation studies and analysis
- Competitive performance against SOTA methods
Weaknesses
Limited Experimental Scope:
- Experiments focus only on speech-to-text translation.
- Only two language pairs (En->De, En->Es) are tested.
- No experiments on other streaming generation tasks like ASR or TTS.
Training Efficiency:
- The paper doesn't discuss training time comparison with baseline Transducer.
- Memory usage analysis could be more detailed.
- No discussion of potential overhead from posterior alignment calculation.
Algorithmic Limitations:
- The method still requires chunk-based processing.
- The impact of chunk size on performance could be better theoretically explained.
- The choice of diagonal prior distribution seems somewhat arbitrary.
Questions
/
Thank you for your insightful comments! We sincerely appreciate the time and effort you dedicated to reviewing our work. Below, we provide detailed responses to your questions and hope these clarifications could address your concerns.
———— Response to Questions about Limited Experimental Scope ————
[W1 & W2 & W3]. Experiments focus only on speech-to-text translation. No experiments on other streaming generation tasks like ASR or TTS. Only two language pairs (En->De, En->Es) are tested.
Following your advice, we additionally conducted experiments on textless speech-to-speech simultaneous translation. Speech-to-speech simultaneous translation encompasses three streaming tasks: ASR, MT, and TTS. This challenging task requires implicitly performing all three simultaneously, while also handling the non-monotonic alignments in streaming generation, making it a suitable benchmark to evaluate the capabilities of our MonoAttn-Transducer in streaming speech generation. We adopted a textless setup in our experiments, directly modeling the mapping from the source speech spectrogram to discrete units of the target speech [1]. We conducted our experiments on the CVSS-C French-to-English Speech-to-Speech Translation dataset. Results are provided below.
| Fr->En Speech-to-Speech | Metric | Chunk Size 320 ms | Offline |
|---|---|---|---|
| MonoAttn-Transducer | ASR-BLEU | 18.3 | 19.3 |
| | AL (ms) | 118 | - |
| | LAAL (ms) | 972 | - |
| Transducer | ASR-BLEU | 17.1 | 18.0 |
| | AL (ms) | 153 | - |
| | LAAL (ms) | 984 | - |
In speech-to-speech streaming generation, MonoAttn-Transducer also demonstrated a significant improvement in speech generation quality. This indicates that our approach is not limited to streaming text generation but also effective for streaming speech generation.
———— Response to Questions about Training Efficiency ————
[W4]. The paper doesn't discuss training time comparison with baseline Transducer.
[W5]. Memory usage analysis could be more detailed.
[W6]. No discussion of potential overhead from posterior alignment calculation.
Thank you for the valuable feedback on efficiency! We provide a discussion here and will include it in a revised version of our paper.
Training Time
We first analyze each step in Algorithm 1 to theoretically compare the training time differences between MonoAttn-Transducer and baseline.
Firstly, we observe that Lines 1, 2, and 6 involve naive matrix computation without requiring gradients, which do not significantly impact the overall overhead.
The additional time overhead introduced by our method arises from Lines 3, 4, and 5. Specifically, this includes an additional forward pass of the predictor component and the computation for the posterior alignment.
The overhead from the posterior alignment calculation is approximately equivalent to the overhead incurred during loss calculation, as both rely on the same forward-backward algorithm.
Empirically, we found that MonoAttn-Transducer is 1.33 times slower than the Transducer baseline with the same configuration on an Nvidia L40 GPU.
Training Memory
Compared to Transducer baseline, the additional memory overhead of MonoAttn-Transducer comes solely from its Monotonic Attention module.
The extra forward pass of the predictor is performed without gradient computation (i.e., within a `torch.no_grad()` context), so it does not contribute to the large computation graph.
Empirically, we observed that the peak memory usage of the Transducer baseline is 28 GB, while MonoAttn-Transducer exhibits a slightly higher peak usage of 32 GB when the total number of source frames is fixed at 40,000 on a single Nvidia L40 GPU.
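As a minimal illustration of this point (stand-in modules and shapes, not our training code):

```python
import torch
import torch.nn as nn

# Stand-in predictor: any autoregressive module would do for this illustration.
predictor = nn.GRU(input_size=16, hidden_size=16, batch_first=True)
target_embeds = torch.randn(2, 7, 16)

# The auxiliary pass used for posterior estimation (Lines 3-5 of Algorithm 1)
# records no autograd graph, so it adds compute but almost no peak memory.
with torch.no_grad():
    aux_states, _ = predictor(target_embeds)

# The differentiable pass that actually trains the model.
states, _ = predictor(target_embeds)
print(aux_states.requires_grad, states.requires_grad)  # False True
```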
———— Response to Questions about Algorithmic Limitations ————
[W7]. The method still requires chunk-based processing.
[W8]. The impact of chunk size on performance could be better theoretically explained.
[W9]. The choice of diagonal prior distribution seems somewhat arbitrary.
Regarding Chunk Setting
As you mentioned, chunk-based processing remains a crucial component for our model to work in streaming scenarios.
However, although the chunk size is set manually, the model can still flexibly adjust its streaming strategy by emitting blank tokens.
Intuitively, a larger chunk size results in better generation quality but longer latency.
Adaptively selecting an optimal chunk size could be another key problem, and we leave this for future work.
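To illustrate the mechanism, here is a simplified sketch of chunk-synchronized decoding, with hypothetical encode_chunk, predict, and joint callables (not our exact inference code):

```python
def stream_decode(chunks, encode_chunk, predict, joint, blank_id, max_tokens=64):
    """Hedged sketch: fixed-size READs, adaptive WRITEs via the blank token."""
    hyp, enc = [], []
    for chunk in chunks:                  # READ: one fixed-size source chunk
        enc.extend(encode_chunk(chunk))   # incrementally extend encoder states
        while len(hyp) < max_tokens:      # WRITE: emit tokens until blank
            token = joint(enc, predict(hyp))
            if token == blank_id:
                break                     # wait for the next chunk instead
            hyp.append(token)
    return hyp
```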
Regarding Choice of Prior
Indeed, we did not design the prior distribution in great detail; the alignment prior primarily serves as a means to estimate the alignment posterior.
In fact, as demonstrated in Section 6.1 and Appendix D, we found that the alignment posterior is not sensitive to the choice of alignment prior. We believe this highlights the robustness of our approach.
Thank you again for taking the time to review our rebuttal. If you have any further questions, we would be happy to discuss them.
Reference
[1] Lee et al. (2022). Direct speech-to-speech translation with discrete units.
Dear Reviewer,
Thank you for taking the time to review our work. We understand your busy schedule.
As the rebuttal period is nearing its close, could you please take a moment to check our response? If you have any further questions or require clarification, we would be more than happy to discuss them with you.
Thank you in advance for your time.
Authors
Dear Reviewer uLi8,
Thank you for your valuable contributions to the review process for the paper! The authors have submitted their rebuttal, and I would greatly appreciate it if you could take a look and provide your response.
This paper discusses the challenges faced by Transducer-based streaming generation models, particularly in tasks requiring non-monotonic alignments, such as simultaneous translation. These challenges arise from the input-synchronous decoding mechanism of Transducers, which can result in suboptimal performance. To address this, the authors propose integrating a learnable monotonic attention mechanism within the Transducer architecture. This mechanism uses a forward-backward algorithm to calculate the posterior probability of alignments between predictor states and input timestamps. Consequently, it allows for the estimation of context representations of monotonic attention during training, enabling the model to adaptively adjust its attention scope based on predictions. This innovative approach eliminates the need to explore the exponentially large alignment space. Experiments reveal that the proposed MonoAttn-Transducer significantly improves performance in streaming generation tasks dealing with non-monotonic alignments. The codes for this study are available in the supplementary materials.
Strengths
- The authors propose a learnable monotonic attention mechanism within the Transducer architecture to solve the non-monotonic alignments in streaming generation.
- The proposed method is effective and easy to reproduce.
Weaknesses
- I am unable to discern the difference between the proposed monotonic attention mechanism and the cross-attention mechanism.
- Furthermore, the authors should compare the cross-attention approach (Liu et al., 2021; Tang et al., 2023) with other methods that utilize input history.
Questions
N/A
Dear Chairs and Reviewers,
Thank you for your efforts in reviewing our paper.
Unfortunately, we believe the review provided by Reviewer zyS6 strongly violates the reviewers' code of conduct. The feedback is not relevant to this work, is illogical, and even includes misunderstandings of basic concepts in this area.
We sincerely hope Reviewer zyS6 could read the paper and provide a thoughtful review.
Best regards,
Authors
Dear Reviewer zyS6,
Thank you for your valuable contributions to the review process for the paper! The authors have submitted their rebuttal, and I would greatly appreciate it if you could take a look and provide your response.
Dear Reviewers,
As the rebuttal phase is nearing its end, we would greatly appreciate it if you could kindly take a moment to review our responses. We would be pleased to discuss further questions if you have any.
Best regards,
Authors
Dear Reviewers,
As the rebuttal phase is ending in 48hrs, we sincerely hope you could kindly take a moment to review our responses.
Thank you for your time in advance.
Best regards,
Authors
UPDATE 2024/12/4
The discussion period is finished. Sadly, NONE of the reviewers were willing to engage in the rebuttal discussion. However, we provide a summary of our rebuttal here and hope it will be helpful to anyone interested in this paper. We believe this reflects the value of OpenReview.
Intuition of this work (Question from Reviewer ak6X)
The Transducer is a widely applied streaming architecture. However, it is less suitable for tasks where the alignment is non-monotonic (e.g., simultaneous interpretation from an SVO language to an SOV language). The non-monotonicity problem arises from its synchronization mechanism, which relies solely on an MLP-based joiner; it lacks an attention module to facilitate reordering of target outputs. However, introducing an attention module into a streaming generation architecture is not trivial, as there is no way to know how many source tokens a target token can attend to during training. The major contribution of this work is providing a low-complexity algorithm for introducing the attention mechanism into the Transducer, effectively addressing the issue of non-monotonic alignment in Transducer-based streaming generation.
Why does our method perform attention computations while ensuring that the memory overhead of the forward-backward algorithm remains O(1)? (Question from Reviewer 2LWH)
We solve this problem by estimating the posterior probability of alignment between source and target (Eq. 5 and Eq. 7 in the paper). This estimation from the output lattice directly gives us clues about how much source each target token can attend to during training. We use this posterior alignment probability to estimate the expected representation of each predictor state, thus ensuring that the memory overhead remains O(1).
Questions about Training Efficiency (Question from Reviewer uLi8)
Training Time
We first analyze each step in Algorithm 1 to theoretically compare the training time differences between MonoAttn-Transducer and baseline.
Firstly, we observe that Lines 1, 2, and 6 involve naive matrix computation without requiring gradients, which do not significantly impact the overall overhead.
The additional time overhead introduced by our method arises from Lines 3, 4, and 5. Specifically, this includes an additional forward pass of the predictor component and the computation for the posterior alignment.
The overhead from the posterior alignment calculation is approximately equivalent to the overhead incurred during loss calculation, as both rely on the same forward-backward algorithm.
Empirically, we found that MonoAttn-Transducer is 1.33 times slower than the Transducer baseline with the same configuration on an Nvidia L40 GPU.
Training Memory
Compared to Transducer baseline, the additional memory overhead of MonoAttn-Transducer comes solely from its Monotonic Attention module.
The extra forward pass of the predictor is performed without gradient computation (i.e., within a `torch.no_grad()` context), so it does not contribute to the large computation graph.
Empirically, we observed that the peak memory usage of the Transducer baseline is 28 GB, while MonoAttn-Transducer exhibits a slightly higher peak usage of 32 GB when the total number of source frames is fixed at 40,000 on a single Nvidia L40 GPU.
Experiments on other streaming generation task (Question from Reviewer uLi8 & 2LWH)
We additionally conducted experiments on textless speech-to-speech simultaneous translation. Speech-to-speech simultaneous translation encompasses three streaming tasks: ASR, MT, and TTS. This challenging task requires implicitly performing all three simultaneously, while also handling the non-monotonic alignments in streaming generation, making it a suitable benchmark to evaluate the capabilities of our MonoAttn-Transducer in streaming speech generation.
| Fr->En Speech-to-Speech | Metric | Chunk Size 320 ms | Offline |
|---|---|---|---|
| MonoAttn-Transducer | ASR-BLEU | 18.3 | 19.3 |
| | AL (ms) | 118 | - |
| | LAAL (ms) | 972 | - |
| Transducer | ASR-BLEU | 17.1 | 18.0 |
| | AL (ms) | 153 | - |
| | LAAL (ms) | 984 | - |
In speech-to-speech streaming generation, MonoAttn-Transducer also demonstrated a significant improvement in speech generation quality. This indicates that our approach is not limited to streaming text generation but also effective for streaming speech generation.
2024/11/25
Dear Reviewers,
We have uploaded a revised version of our paper addressing your questions. The updates include:
- Experiments on more streaming generation tasks, discussed in Section 5.
- Analysis of training efficiency, discussed in Section 6.
Thank you for your valuable feedback and we sincerely hope you could take a moment to read our rebuttal.
We would greatly appreciate it if you could reply.
Best regards,
Authors
Dear Reviewers,
It has been a week since we submitted our rebuttal, and we are very hopeful to receive a response.
We would greatly appreciate any discussions or feedback.
Best regards,
Authors
We appreciate the time and effort of everyone involved in reviewing the paper. However, the process has left us feeling strongly that our efforts are not being respected.
1. We used GPTZero to evaluate the official review from Reviewer zyS6. We found that the summary section is highly likely to have been generated by AI, and the weaknesses part lacks basic coherence and clarity.
2. After waiting a long time since submitting our rebuttal, we sadly found that no reviewer was willing to engage in a discussion. It seems all reviewers are unwilling even to share whether their concerns have been addressed or to specify any details they are still worried about. The only comments received concern the overall score. This is TOXIC and disrespects our efforts during the rebuttal period.
We are not begging the reviewers to increase their scores. In fact, we welcome any discussion or even criticism in the responses. Our only hope is that these discussions will be professional and specific. We are not trying to be emotional, but it is hard to believe we are encountering such a situation at a top-tier academic conference.
Authors
This paper improves Transducer-based streaming models for tasks that need non-monotonic alignments (like speech-to-text simultaneous translation). Specifically, the authors integrate a learnable monotonic attention mechanism into a Transducer architecture, leveraging the forward-backward algorithm to compute posterior alignment probabilities between predictor states and input timestamps. Experimental results on EN-DE and EN-ES speech-translation tasks suggest that the proposed MonoAttn-Transducer outperforms a standard Transducer baseline.
Strengths: (1) the work tackles a known issue of standard Transducers with a principled solution; (2) the empirical gains on two benchmarks are solid.
Weaknesses: (1) some reviewers pointed out the limited scope of the evaluation, which only considers speech-to-text translation on closely related languages; more experiments on diverse language pairs and tasks are preferred. (2) Some reviewers questioned the novelty compared to existing works.
Decision
While the reviewers agree that the paper tackles an important real-world challenge for streaming translation tasks, concerns remain about the novelty and experimental scope, which could not be resolved during the rebuttal phase. So, I am leaning toward rejection and would like to encourage the authors to include more experiments with clearer comparisons in a future version.
The review of zyS6 seems low quality (a concern also raised by the authors), and no further engagement happened during the rebuttal phase. Therefore, the judgment was made by discarding the review of zyS6.
Additional Comments from the Reviewer Discussion
The reviewers mainly questioned the training efficiency and requested additional experiments on more language pairs and tasks. While the authors added experiments on Fr-En and En-Zh, the reviewers did not find the new results more convincing than the existing ones.
Reject