Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
We propose a unified speech recognition method for ASR, VSR, and AVSR.
Abstract
Reviews and Discussion
This paper proposes a unified architecture and training method for auditory/visual speech recognition. Building upon this model, the authors introduce a semi-supervised pseudo-labeling method to leverage unlabeled audio-visual data, as well as self-supervised pre-training to enhance model performance. Experiments indicate that the model achieves state-of-the-art performance on A/V/AVSR.
Strengths
- This work for the first time proposes an effective model and training procedure for unifying auditory and visual speech content recognition, which is of high novelty and practical significance.
- The authors conducted comprehensive and extensive ablation studies, verifying the characteristics of the model and the effectiveness of each step in the training paradigm. The experimental results are robust and credible, offering significant guidance for related research.
Weaknesses
The article has no obvious flaws, but there are some questions that I hope the authors can clarify (see Questions).
Questions
- How is the weight of the teacher model in self-supervised pretraining initialized? Is it initialized randomly or with pretrained weights from another task?
- Did the authors compare the teacher-student self-supervised pretraining in the paper with masked-autoencoding training of audio and/or visual features? Is the proposed pretraining method superior?
- Did the authors investigate the effect of different masking ratios?
Limitations
The authors discuss the limitations of their work in Appendix A.
Thank you for your detailed review and time. Below we address the key concerns raised.
How is the weight of the teacher model in self-supervised pretraining initialized? Is it initialized randomly or with pretrained weights from another task?
The teacher model is randomly initialised at the start of pre-training and improves throughout training via bootstrapping [44], i.e., its weights are updated as an exponential moving average (EMA) of the student's weights, as in related works [15, 17, 18]. We will emphasise this more in the paper.
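For illustration, a minimal sketch of such a bootstrapped EMA update in PyTorch (the module names and momentum value are illustrative assumptions, not taken from the paper):

```python
import copy
import torch

def init_teacher(student: torch.nn.Module) -> torch.nn.Module:
    # The teacher starts as a copy of the randomly initialised student
    # and is never updated by gradient descent.
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad = False
    return teacher

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, momentum: float = 0.999):
    # teacher <- momentum * teacher + (1 - momentum) * student
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
```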
Did the authors compare the teacher-student self-supervised pretraining in the paper with masked-autoencoding training of audio and/or visual features? Is the proposed pretraining method superior?
We are not aware of works in audio-visual self-supervised learning for speech recognition that employ masked autoencoding (MAE) for pre-training, i.e., predicting raw audio and/or visual features as in [45]. Although the investigation of MAE in the context of audio-visual speech recognition is outside the scope of our work, we agree that it is an interesting direction for future research. Our pre-training task, which combines masked prediction (similarly to MAE) with teacher-student training, aligns more closely with recent successful models in this area, such as RAVEn [17], AV-data2vec [15], and BRAVEn [18].
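To make the distinction concrete, here is a toy sketch of a masked-prediction objective with teacher-generated targets (simplified assumptions: per-frame features, random frame masking, and an MSE regression loss; the paper's exact masking scheme and loss may differ):

```python
import torch
import torch.nn.functional as F

def masked_prediction_loss(student, teacher, feats, mask_prob=0.4):
    # feats: (batch, time, dim) audio-visual features.
    B, T, _ = feats.shape
    mask = torch.rand(B, T, device=feats.device) < mask_prob  # frames to mask

    masked_input = feats.clone()
    masked_input[mask] = 0.0            # crude masking of the student's input

    with torch.no_grad():
        targets = teacher(feats)        # the teacher sees the unmasked input

    preds = student(masked_input)       # the student predicts the teacher's features
    return F.mse_loss(preds[mask], targets[mask])  # regress only on masked frames
```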
Did the authors investigate the effect of different masking ratios?
Initially, we conducted a set of preliminary experiments and chose to use the masking ratio from [18]. However, we agree that exploring different masking ratios is a valuable ablation study, and thank you for prompting us to investigate this further. We have now conducted experiments with different masking ratios for pre-training (see Table 2a in the rebuttal PDF). We find that the model performs well with a masking probability of 0.4-0.6 and is, interestingly, quite robust to the exact choice of masking ratio. We added this ablation in the Appendix.
Thanks to the authors for actively addressing my concerns. I think my current score already reflects my positive evaluation of the paper, so I choose to maintain it.
This paper proposes a training methodology for a single model that can use either audio, visual, or audio-visual features as input for automatic speech recognition. This is done by enforcing that each training batch always includes (feature, label) pairs of all three modalities, using a 1D/2D ResNet-18 feature extractor for audio and video, respectively. These features are processed by a Transformer encoder-decoder model to obtain an ASR prediction. Furthermore, the authors explore a semi-supervised fine-tuning approach and a self-supervised initialization stage, both using a student-teacher approach, and within the same unified methodology. This allows the authors to produce a model that is competitive with state-of-the-art models while using significantly less data.
Strengths
I think the proposed method is interesting for researchers in the audio-visual ASR domain and will spur future work. The paper is well-written with clear English, barring some questions I have stated below. The authors do a good job presenting their results, referring to details in the appendix where required. The ablation experiments clearly show readers how their proposed methodology behaves and why certain design decisions were made. The authors also shared their code and model checkpoints, which significantly increases the reproducibility and impact of this paper.
Weaknesses
The model architecture seems a bit unclear to me. Specifically, line 88 states the use of a transformer encoder-decoder model. However, line 104 states a single FC layer on top of the encoder for vocabulary predictions, while line 107 states to use the decoder output sequence, which is subsequently not used, as 1 − λ_ctc = 0. So the decoder is not actually used during fine-tuning? How is inference actually done?
I see no mention of a fixed random seed for running experiments; are all models initialized equally? This seems important, as the paper does not include error bars / does not run experiments multiple times.
Minor editing comments:
- Table titles must appear above the table as per the formatting instructions.
- The table/figure combinations on Page 6 are confusing. Could you separate the figures as not part of a (sub)table?
- A small description of LRS3 would be desirable for those not familiar with the dataset (e.g., how many hours does the unlabeled portion have (line 190), what is the data source, how was it collected, how large is the test set?)
- line 97: 0.4 and 0.6 seconds for each second of ...
Questions
In which settings/experiments is the transformer decoder used?
In table 3 (A), is there a reason for not trying targets A + V + AV, as during fine-tuning?
You state in line 103 that features from the 3 modalities are concatenated along the batch dimension for efficient processing. However, Table 1 (B) shows that random sampling of modalities performs much worse, requiring 3x more epochs for similar performance. So it seems to me it's not only done for efficient processing, but also for effective optimization?
Also, do [13, 15] in line 179 share parameters for each task or not? According to Table 4 they do not, but if you use random sampling of modalities, how does this explain their relevance to Table 1 (B)?
What is the CTC attention in Table 2 (C)? Is this simply equation 3 with ? I might have missed it, but it seems to me the method section does not explain these 2 different loss types?
Limitations
The method requires all data to be audio-video paired. An interesting future direction could be the inclusion of audio-only data in the framework.
Thank you for your detailed review and time. Below we address the key concerns raised.
Line 104 states a single FC layer on top of the encoder for vocabulary predictions, while line 107 states to use the decoder output sequence, which is subsequently not used, as 1 − λ_ctc = 0. So the decoder is not actually used during fine-tuning? How is inference actually done?
We apologise for the confusion; this is a typo. The intended value for the CTC loss weight was 0.1, not 1, and the decoder is indeed used during fine-tuning. Inference is then performed with both CTC and attention scores (see Appendix C.6). We have now fixed this typo in the paper.
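For readers unfamiliar with joint CTC/attention decoding, a schematic of how the two scores can be combined when ranking hypotheses (the weight and variable names are illustrative; the paper's exact procedure is described in Appendix C.6):

```python
def joint_score(ctc_log_prob: float, att_log_prob: float, ctc_weight: float = 0.1) -> float:
    # Linearly interpolate the CTC and attention-decoder log-probabilities
    # of a hypothesis, as is common in joint CTC/attention decoding.
    return ctc_weight * ctc_log_prob + (1.0 - ctc_weight) * att_log_prob

# e.g., pick the best beam-search hypothesis by its joint score:
# best = max(hypotheses, key=lambda h: joint_score(h.ctc_lp, h.att_lp))
```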
I see no mention of a fixed random seed for running experiments; are all models initialized equally?
Thank you for raising this point. We indeed use a fixed random seed (42) for our experiments, and have now clarified this in the paper. Due to high computational demands and in line with previous studies [13-18, 20], we do not include error bars for our main results. However, we do show error bars for a subset of our main experiments in Table 13, where we observe that the results are consistently stable around the mean.
Minor editing comments
Thank you for the editing suggestions, which have now been incorporated into the paper. Specifically, we made the following changes: we placed the captions above all tables; we separated Table 2a and the rest of Table 2 into distinct figures and tables; we added a description of the datasets in the Appendix; and we fixed the typo you pointed out.
In which settings/experiments is the transformer decoder used? What is the CTC attention in Table 2 (C)?
We apologise again for the confusion caused by the typo regarding the CTC loss weight, which should be 0.1. In Table 2d, CTC-attention (i.e., including the Transformer decoder) is our default loss. The CTC-only configuration, which corresponds to a CTC loss weight of 1, is used solely for the purposes of this ablation study.
In table 3 (A), is there a reason for not trying targets A + V + AV, as during fine-tuning?
Please note that during fine-tuning, we use only the AV targets (see Figure 1, top-left) because they provide the richest information and allow us to amortise the pseudo-labelling cost across the three modalities. Incorporating all three targets during fine-tuning is challenging, as it is unclear how to efficiently combine / predict the per-modality pseudo-label sequences, which are generated auto-regressively by the decoder. However, during pre-training, which involves only the encoder, we can more easily combine the targets from the three modalities. In Table 2b of the rebuttal PDF, we compare audio-visual targets (our default) with the sum of the per-modality targets. We observe that combining the per-modality targets does not outperform predicting only the AV targets. Additionally, predicting all targets is more computationally expensive because the teacher encoder must process auditory, visual, and audio-visual features rather than just the audio-visual ones. We added this experiment in the Appendix.
You state in line 103 that features from the 3 modalities are concatenated along the batch dimension for efficient processing. However, Table 1 (B) shows that random sampling of modalities performs much worse, requiring 3x more epochs for similar performance. So it seems to me it's not only done for efficient processing, but also for effective optimization?
Randomly sampling the modalities means that, on average, the model is exposed to only one-third of the modalities at each iteration compared to concatenating all three modalities along the batch dimension. As a result, it would require approximately three times more epochs to achieve similar performance. However, an advantage of our approach is that it allows us to use the same pseudo-labels for all three modalities at each iteration, thus amortising the pseudo-label generation cost across modalities, which would not be possible with random modality sampling. We have updated the caption of Table 1b to explain why random sampling is trained for longer.
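As a rough sketch of what this batch construction could look like, assuming the per-modality frontends produce time-aligned features of the same shape and using a simple additive fusion as a stand-in for the paper's actual audio-visual fusion:

```python
import torch

def build_multimodal_batch(audio_feats: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
    # audio_feats, video_feats: (B, T, D) frame-level features from the
    # 1D/2D ResNet-18 frontends, assumed to be aligned in time.
    av_feats = audio_feats + video_feats  # placeholder fusion for illustration
    # Stack the audio-only, video-only, and audio-visual views along the batch
    # axis so that a single encoder pass covers all three tasks, and the same
    # pseudo-labels can be reused for each view.
    return torch.cat([audio_feats, video_feats, av_feats], dim=0)  # (3B, T, D)
```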
Also, do [13, 15] in line 179 share parameters for each task or not? According to Table 4 they do not, but if you use random sampling of modalities, how does this explain their relevance to Table 1 (B)?
[13, 15] do share parameters during pre-training, hence their relevance to Table 1b, but they separately fine-tune the resulting model on ASR, VSR, and AVSR, resulting in a separate model for each task during inference. In contrast, USR yields a single unified model that is capable of performing all three tasks during inference (section 2, “Single model for multiple modalities” includes a relevant discussion).
The method requires all data to be audio-video paired. An interesting future direction could be the inclusion of audio-only data in the framework.
Indeed, thank you for this interesting suggestion, which we have now added to our Conclusion.
I acknowledge the rebuttal. I see no reason to change my (favorable) score.
This paper proposes USR, a unified speech recognition model that leverages pseudo labels during fine-tuning. It introduces a single model capable of handling three tasks—ASR, VSR, and AVSR—simultaneously, delivering state-of-the-art performance.
Strengths
- The paper is well-organized. Although the USR system is relatively complex, the paper presents each module with detailed descriptions and clear illustrations, making it easy for readers to follow.
- The experiments, including ablations, are extensive. All experimental details are included, making it easy to reproduce the results.
- The USR system leverages pseudo labels during the fine-tuning stage. While pseudo labeling is not a novel technique in ASR or AVSR, USR enhances the performance of ASR, VSR, and AVSR through carefully designed training procedures. The illustration of the pseudo labeling process is also clear.
- The system achieves nearly state-of-the-art performance across all tasks.
- The literature review is thorough.
Weaknesses
- While not a weakness unique to this paper, the complexity of training current SSL-based VSR or AVSR systems remains a challenge. Introducing additional modalities significantly increases complexity compared to speech-only SSL systems. Notably, the reduction in GPU hours is minimal compared to previous works, and the convergence speed is exceedingly slow. Future work should address these issues.
- Performance is highly sensitive to certain configurations, such as the ratios of pseudo labels and the use of EMA. However, the paper lacks an analysis of why this sensitivity occurs or suggestions on how to mitigate it. These are common weaknesses in related work.
- The results do not consistently achieve state-of-the-art performance. The authors should experiment with other hyperparameters, such as learning rates, during fine-tuning to improve outcomes.
- Failure cases were not discussed in much detail.
Questions
- During pretraining, have you explored using audio-only targets? If so, what was the performance like compared to AV targets? How does it compare to AV-HuBERT?
- Why do you incorporate all three features (audio, video, audio-visual) during fine-tuning? Is there a rationale or experimental evidence supporting this approach?
- There is no need to adhere strictly to architectures like AV-HuBERT or AV-data2vec. Consider experimenting with more advanced video encoders, since visual features are often not well extracted in previous studies.
- For pseudo label sampling, why opt for a greedy search? Have you considered trying soft sampling instead?
Limitations
The limitations have been discussed in the paper.
Thank you for your detailed review and time. Below we address the key concerns raised.
The complexity of training current SSL-based VSR or AVSR systems remains a challenge.
We recognise that VSR and AVSR systems present unique challenges compared to audio-only systems, and one of our future goals is to improve the computational efficiency of multi-modal speech recognition. Despite these challenges, we believe that audio-visual speech representation learning is an exciting and promising area of research. Leveraging visual modalities like lip movements offers significant advantages in disambiguating difficult utterances, particularly in noisy environments or when audio is unavailable (see Table 12, Appendix). VSR and AVSR can also improve communication for individuals who have difficulty producing voiced speech. Additionally, evidence suggests that incorporating multiple modalities during pre-training can enhance the performance of audio-only systems (e.g., see [13, 17]).
Performance is highly sensitive to certain configurations, such as the ratios of pseudo labels and the use of EMA.
While self-supervised learning systems can indeed be sensitive to hyperparameters, we believe that USR significantly reduces this sensitivity. Our semi-supervised pseudo-labelling framework is sensitive to extreme threshold values (see Table 2a) and somewhat sensitive in terms of VSR to the weighting of labelled and unlabelled losses (see Table 2b). In Section 4.2, we hypothesise that this sensitivity arises from the inherent trade-off between label quantity and quality, which must be balanced. However, as shown in Table 3, USR’s performance remains stable across a range of pre-training hyperparameter configurations. We attribute this stability to our semi-supervised method, which leverages abundant unlabelled samples during fine-tuning. Appendix E demonstrates that our method is more robust to pre-training target types than supervised fine-tuning (compare Table 11b and Table 3a) and does not require additional training tricks for strong performance (see Table 11a), unlike other works (e.g., [13, 17, 18]). We have updated the paper to better highlight USR's reduced sensitivity to pre-training hyperparameters.
The results do not consistently achieve state-of-the-art performance.
We have indeed carefully tuned hyperparameters, such as the learning rate, to obtain our final results. In the well-established and highly competitive LRS3 high- and low-resource benchmarks (see Table 4), our model surpasses (sometimes by a large margin) or matches the previous state-of-the-art modality-specific models in 16 out of 18 cases. In the remaining two cases, we are marginally behind (1.6% vs. 1.4% and 2.4% vs. 2.3% WER). Crucially, we achieve these results using a single model for ASR, VSR, and AVSR, while other methods require separate models for each modality, resulting in ~3x the number of weights during inference. We also achieve SotA results on WildVSR (Table 9) and LRS2 (Table 10).
Failure cases were not discussed in much detail.
Thank you for the suggestion. We added a discussion on failure cases in the Appendix; see Table 3 in the rebuttal PDF. We observe that, while VSR tends to produce more errors than ASR and AVSR, these errors often involve phonetically similar words, such as "this" vs. "these" or "disguised" vs. "denies". Additionally, using both visual and auditory modalities (AVSR) can improve the model's ability to distinguish challenging samples, such as "Mali Wear" vs. "malware".
During pretraining, have you explored using audio-only targets? If so, what was the performance like compared to AV targets? How does it compare to AV-HuBERT?
We have explored using audio-only targets for pre-training with our semi-supervised fine-tuning in Table 3a, where we observe that AV targets work best. Still, with audio-only targets, we achieve WERs of 37.3% for VSR, 3.2% for ASR, and 3.1% for AVSR, significantly outperforming AV-HuBERT's 51.8%, 4.9%, and 4.7%, respectively. Additionally, we experimented with supervised fine-tuning (see Appendix E), observing WERs of 43.9%, 4.8%, and 4.6% for the three tasks (see Table 11b). Notably, AV-HuBERT uses a separate model for each task, whereas we use a single unified model.
Why do you incorporate all three features (audio, video, audio-visual) during fine-tuning?
Our goal was to develop a single, unified model capable of performing well on audio-only (ASR), video-only (VSR), and audio-visual (AVSR) data during inference, thereby reducing the computational and memory redundancies associated with separate models per task. To achieve this, we fine-tune the model using all three types of data, enabling it to effectively perform each task. We have now emphasised this point in Section 3.1.
Consider experimenting with more advanced video encoders.
We used Transformer-based architectures with convolutional frontends to align with closely related works [13-18, 20] for fair comparisons (e.g., see Table 4). However, given that USR is agnostic to the choice of encoder architecture, we agree it would be interesting to explore other architectural variants, which could improve results even further. Still, we believe this direction lies outside the scope of our present work and therefore defer it to future research. This idea for future work has now been included in the Conclusion.
For pseudo label sampling, why opt for a greedy search? Have you considered trying soft sampling instead?
We initially chose a greedy search for its efficiency and effectiveness. Based on your suggestion, we have now also experimented with a soft sampling approach, where we use weighted sampling at each generation step. The results are in Table 1b of the rebuttal PDF. We see that greedy (hard) sampling outperforms this variant of soft sampling, but we believe that exploring more sophisticated methods to effectively increase pseudo-label variety is a promising direction for future research.
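A toy sketch of the difference between the two decoding choices at a single generation step (illustrative only; not the paper's implementation):

```python
import torch

def next_token(logits: torch.Tensor, soft: bool = False) -> int:
    # logits: (vocab_size,) decoder scores at the current generation step.
    if soft:
        probs = torch.softmax(logits, dim=-1)
        return int(torch.multinomial(probs, num_samples=1))  # weighted (soft) sampling
    return int(torch.argmax(logits))  # greedy (hard) selection
```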
Thanks for including the new experiments/results such as failure cases and soft sampling. Although those results were made after the paper submission, I believe that by including them, it will be a more solid paper in the future. I still have some questions:
"In the remaining two cases, we are marginally behind (1.6% vs. 1.4% and 2.4% vs. 2.3% WER)." Have you just tried tuning the learning rate or the mask ratio, or what else? Sometimes you might be able to achieve SOTA by doing this. It is a pity that it is just slightly behind.
Another question is: "we hypothesise that this sensitivity arises from the inherent trade-off between label quantity and quality." Can you detail this? Would high-quality or low-quality labels lead to higher sensitivity? Likewise, what about the quantity?
Thank you for your prompt follow-up.
Thanks for including the new experiments/results such as failure cases and soft sampling. Although those results were made after the paper submission, I believe that by including them, it will be a more solid paper in the future.
Indeed, thank you for the suggestions, which we believe have strengthened our paper. The conference allows us to revise the paper for the camera-ready deadline, and so we have added these new experiments in the Appendix.
"In the remaining two cases, we are marginally behind (1.6% vs. 1.4% and 2.4% vs. 2.3% WER)." Have you just tried tuning the learning rate or the mask ratio, or what else? Sometimes you might be able to achieve SOTA by doing this. It is a pity that it is just slightly behind.
We extensively tuned hyperparameters (including learning rate, weight decay, and all hyperparameters in our ablations) using the Base model in the low-resource setting and then applied most of the same hyperparameters to train the larger models with more data. The only exceptions were the learning rate and drop path rate, which we separately adjusted for the larger models. While tuning all hyperparameters for each of the six settings in Table 4 could potentially improve results, the high computational demands of training the larger models on the larger datasets made this impractical. Additionally, one of our objectives was to demonstrate the scalability of our method across different model and dataset sizes with minimal extra hyperparameter tuning. We will make these points clearer in the revised paper.
We also emphasise that, in these two cases, our method is only marginally behind BRAVEn in ASR performance, despite using a single model for ASR, VSR, and AVSR, while BRAVEn employs separate models - each of the same size and architecture as our single model - for each task. Given this, we believe that being just 0.1-0.2% behind the best modality-specific model in these two cases is still a strong outcome. Moreover, our paper goes beyond this, matching or surpassing the state-of-the-art modality-specific models in all other tasks and settings (Tables 4, 5, 9, and 10).
"we hypothesis that this sensitivity arises from the inherent trade-off between label quantity and quality." Can you detail this? Would high-quality or low-quality labels lead to higher sensitivity? Likewise, what about the quantity?
This point refers to the 'Quantity/quality trade-off' paragraph in Section 4.2, where we highlight that pseudo-labels, while more abundant due to the availability of unlabelled data, are generally noisier and of lower quality. In contrast, groundtruth labels are of higher quality but less abundant. Our hyperparameters, which control the weighting of the labelled versus unlabelled losses for audio/audio-visual and visual inputs, respectively, adjust the balance between quantity and quality. We hypothesise that the moderate sensitivity to these hyperparameters arises because finding the proper balance between quantity and quality is important.
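Schematically, the trade-off can be seen as a weighted combination of the labelled and pseudo-labelled loss terms; the sketch below uses placeholder weights standing in for the paper's symbols, not the actual formulation:

```python
def semi_supervised_loss(loss_labelled: float,
                         loss_unlab_audio_av: float,
                         loss_unlab_visual: float,
                         w_audio_av: float = 1.0,
                         w_visual: float = 1.0) -> float:
    # loss_labelled:        supervised loss on ground-truth labels (high quality, scarce)
    # loss_unlab_*:         pseudo-label losses on unlabelled data (abundant, noisier)
    # w_audio_av, w_visual: placeholder weights balancing label quality vs. quantity
    return loss_labelled + w_audio_av * loss_unlab_audio_av + w_visual * loss_unlab_visual
```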
Thank you for the follow-up comments. I still have one more question regarding the 'Quantity/quality trade-off.' Is the takeaway that using only high-quality labels will lead to higher sensitivity to hyperparameters, while introducing pseudo-labels (low-quality labels) will mitigate this situation?
I still have one more question regarding the 'Quantity/quality trade-off.' Is the takeaway that using only high-quality labels will lead to higher sensitivity to hyperparameters, while introducing pseudo-labels (low-quality labels) will mitigate this situation?
Thank you for the insightful question, which touches on the difference between supervised and semi-supervised fine-tuning (as discussed in Appendix E). Let us assume we have a fixed amount of labelled and unlabelled data, where typically, the labelled data is less than the unlabelled (as in the “low-resource” setting). In conventional self-supervised approaches, the unlabelled data is used exclusively during the pre-training phase, with fine-tuning performed on the relatively small set of labelled data. This limited amount of labelled data during fine-tuning can increase the risk of overfitting, making the model more sensitive to hyperparameters. In contrast, USR employs semi-supervised fine-tuning, which incorporates the unlabelled data not only during pre-training but also throughout the fine-tuning process. This strategy reduces sensitivity to hyperparameters by effectively increasing the total volume of training data during fine-tuning. Thus, your assertion that “using only high-quality labels will lead to higher sensitivity to hyperparameters, while introducing pseudo-labels (low-quality labels) will mitigate this situation” is indeed accurate from this perspective. However, it is important to clarify that the reduction in overfitting and hyperparameter sensitivity is not due to the lower quality of pseudo-labels; in fact, improving the quality of pseudo-labels would likely further reduce hyperparameter sensitivity. Instead, we claim that the decreased sensitivity stems from the increased total amount of training data available during fine-tuning.
Finally, it is important to distinguish between this general difference in hyperparameter sensitivity between supervised and semi-supervised fine-tuning, and the 'Quantity/Quality Trade-off' discussion in Section 4.2 and our previous response. Those discussions address why our semi-supervised approach requires tuning of the loss-weighting hyperparameters (Table 2b), emphasising the importance of appropriately balancing the contributions of the labelled and unlabelled losses.
Thanks for the response. That is clear to me now. I still keep my ratings.
This paper unifies the ASR, VSR, and AVSR tasks in a single model and shows the performance benefits of a single model on LRS3 data. There have been several attempts at unifying these three tasks, but I think this is the first successful trial of realizing it. The paper proposes an effective training strategy to avoid losing performance on each task. Together with their self-supervised training, the model achieves SOTA performance with a similar amount of training data.
Strengths
- the first successful method of realizing the ASR, VSR, and AVSR tasks in a single model while maintaining/improving the performance for each task
- Good reproducibility based on the code release, use of the public data, and detailed experimental configurations/analyses.
- Easy to read. Although the technique is a bit complicated, with many terms depending on the architecture (CTC, attention, modality, training modes (self-supervised/supervised)), the paper always provides rationales (e.g., from references or experiments) to justify its methods.
- detailed ablation experiments support their design choices and strategies.
- The paper also shows the effectiveness with multiple databases (LRS3, LRS2, and WildVSR)
Weaknesses
- the technical novelty is not very strong. Most techniques are well-known or straightforward (e.g., the use of CTC, pseudo-label filtering, etc.).
Questions
- Page 4, line 110: I'm a bit confused about "We set λ_ctc to 1." Do you mean that you always set λ_ctc to 1? No attention weights? Is it related to Table 2-d? Please clarify it.
- Equation (4): Why didn't you prepare a different weight for a and av?
- Section 3.2, Filtering: Did you use the same threshold for CTC and ATT? The dynamic range of c and a could be different, and I'm not sure that using the same threshold is optimal.
- Section 4: Did you only use a Transformer architecture? How about using a Conformer architecture?
- It is not a question but a suggestion. I recommend you emphasize the results of the multiple databases in the abstract to claim the generalization of this work across the database.
Limitations
The paper has independent sections on Limitations and Societal Impact, which describe the current issues due to computational cost, the importance of VSR, and the risks of general speech recognition technology.
Thank you for your detailed review and time. Below we address the key concerns raised.
the technical novelty is not very strong. Most techniques are well-known or straightforward (e.g., the use of CTC, pseudo-label filtering, etc.).
While individual components of our work have been previously used in other studies (as discussed in the Related Work), we believe that the USR framework as a whole represents significant novelty. As noted by the reviewer, earlier efforts to unify ASR, VSR, and AVSR have often lagged behind modality-specific approaches. Our research demonstrates, for the first time in the literature, that a combination of self-supervised and semi-supervised learning can produce a unified model that achieves state-of-the-art performance across all tasks. This success is attributed to key technical design choices, including (but not limited to) the use of a greedy, computationally efficient attention-based pseudo-labelling approach; a multi-modal feature extraction step that enables amortisation of the pseudo-label generation costs across the three modalities; and multi-modal (audio-visual) target prediction in self-supervised pre-training, which previously proved unsuccessful with supervised fine-tuning. Furthermore, we believe that the straightforward and intuitive nature of USR enhances its utility and potential impact in the community.
Page 4, line 110: I'm a bit confused about "We set λ_ctc to 1." Do you mean that you always set λ_ctc to 1? No attention weights? Is it related to Table 2-d? Please clarify it.
We apologise for the confusion; this is a typo. The intended value for the CTC loss weight was 0.1, not 1. We always use CTC-attention training, except for the ablation in Table 2d where we study the performance of a CTC-only loss. We have fixed this typo in the paper.
Equation (4): Why didn't you prepare a different weight for a and av?
In preliminary experiments, we observed that the training dynamics for ASR and AVSR were very similar, and hence decided to use a combined weight for the two modalities in Eq. 4 and Eq. 7 in order to reduce the number of hyperparameters. We have now added this point to the paper (Section 3.1).
Section 3.2, Filtering: Did you use the same threshold for CTC and ATT? The dynamic range of c and a could be different, and I'm not sure that using the same threshold is optimal.
Similarly, we use the same threshold for CTC and attention for simplicity. However, we agree that the dynamic ranges between the two could be different and have run experiments with separate thresholds to investigate this point. The results are in Table 1a in the attached rebuttal PDF. We observe that USR's performance remains consistent across a range of different thresholds, with no clear improvement when using separate thresholds. We added this experiment in the Appendix.
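As an illustration of how such filtering might look with separate thresholds for the two scores (threshold values and score normalisation are assumptions, not the paper's settings):

```python
def keep_pseudo_label(ctc_score: float, att_score: float,
                      tau_ctc: float = -1.0, tau_att: float = -1.0) -> bool:
    # Scores are assumed to be length-normalised log-probabilities of the
    # generated pseudo-label sequence; a sample is kept only if both scores
    # exceed their thresholds (which may be shared or set separately).
    return ctc_score >= tau_ctc and att_score >= tau_att
```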
Section 4: Did you only use a Transformer architecture? How about using a Conformer architecture?
We used Transformer-based architectures with convolutional frontends to align with closely related works [13-18, 20] for fair comparisons (e.g., see Table 4). However, given that USR is agnostic to the choice of encoder architecture, we agree it would be interesting to explore other architectural variants, including Conformers, which could improve results even further. Still, we believe this direction lies outside the scope of our present work and therefore defer it to future research. This idea for future work has now been included in the Conclusion.
It is not a question but a suggestion. I recommend you emphasize the results of the multiple databases in the abstract to claim the generalization of this work across the database.
Thank you for this suggestion. We have updated the abstract accordingly.
Thanks for your detailed answers.
As I mentioned in my previous review, although this paper does not have strong technical novelty, it has many insightful findings and is valuable as the first successful method of realizing the ASR, VSR, and AVSR tasks in a single model. I appreciate these benefits and rate this paper "6: Weak Accept" despite its weakness. Thanks for your explanations regarding the weakness point, but these explanations are basically the same as my understanding in the first review, and they are not sufficient to change this point. Also, thanks for your additional experiments, which resolve some of my questions, but they do not change my overall impression. Thus, I want to keep my score as it is.
We sincerely thank the reviewers for their thoughtful comments, which have greatly contributed to improving our paper. We are pleased that the reviewers recognise the effectiveness of our method (Reviewers d2RY, d9WG, WdRe), the quality of our experiments (Reviewers d2RY, d9WG, Fi7g, WdRe), and the reproducibility of our results (Reviewers d2RY, d9WG, Fi7g). We also appreciate their acknowledgment of the potential impact and practical significance of our work (Reviewers Fi7g, WdRe), as well as the quality of our writing and presentation (Reviewers d2RY, d9WG, Fi7g, WdRe).
We have addressed the reviewers' concerns with individual responses to each review. Please see the attached rebuttal PDF, which includes new experimental results. Key changes to the paper are summarised as follows:
- We included in the Appendix more ablations for self-supervised pre-training (different mask probabilities and target types) as well as for semi-supervised fine-tuning (additional filtering thresholds and comparisons between hard and soft sampling).
- We provided in the Appendix detailed descriptions of the datasets used in the paper.
- We added failure cases and a corresponding discussion in the Appendix.
- We moved captions above the tables and separated Table 2a and the rest of Table 2 into distinct figures / tables.
- We highlighted our state-of-the-art results on LRS2 and WildVSR in the Abstract.
- We improved the clarity of the text, for example, by emphasising our method’s reduced sensitivity to pre-training hyperparameters and clarifying that the teacher is randomly initialised in the pre-training stage.
- We added further ideas for future work in the Conclusion, including exploring alternative encoder architectures and the use of extra audio-only data.
- We fixed typos identified by the reviewers, including an error in the CTC loss weight, which was mistakenly listed as 1 instead of 0.1 in the original paper.
All four reviewers vote to accept this paper. The paper is well written, easy to read, and of great interest to the community. The main weakness is that the novelty of the paper could be stronger.