PaperHub
Rating: 8.0 / 10 (Poster, 4 reviewers; scores 8, 8, 8, 8; min 8, max 8, std 0.0)
Confidence: 4.8
ICLR 2024

Test-time Adaptation against Multi-modal Reliability Bias

OpenReview · PDF
Submitted: 2023-09-24 · Updated: 2024-04-21
TL;DR

Reveal a new problem named reliability bias for multi-modal TTA, and propose a new method to achieve reliable fusion and robust adaption.

Abstract

Keywords
Test-time adaption, Imbalanced multi-modal learning

Reviews & Discussion

Review (Rating: 8)

This paper studies the problem of multi-modal test-time adaption (TTA) under the challenge of reliability bias, which refers to the information discrepancies across different modalities due to the distribution shifts between domains. The paper proposes a novel method, RFRA, which consists of two modules: a self-adaptive attention module for reliable fusion across modalities, and a confidence-aware loss function for robust adaption. The paper also provides two new benchmarks for multi-modal TTA with reliability bias based on Kinetics and VGGSound datasets. The paper shows that RFRA outperforms several state-of-the-art TTA methods on these benchmarks under various corruptions.

Strengths

  1. This paper is well-written, well-organized, and easy to follow.
  2. The paper addresses a novel and important problem, i.e., multi-modal TTA with reliability bias, which has not been well-studied in the literature. Accordingly, the paper proposes an effective method, RFRA, which leverages self-adaptive attention and confidence-aware loss to achieve reliable fusion and robust adaption across modalities. Moreover, the confidence-aware loss is simple yet effective and enjoys a non-monotonic gradient property.
  3. The paper provides two new benchmarks for multi-modal TTA with reliability bias, which could facilitate future research on this topic. The paper conducts extensive experiments and ablation studies on these benchmarks to demonstrate the effectiveness and superiority of RFRA over existing TTA methods.

Weaknesses

  1. The paper lacks some experimental details, such as clarifying the terms “Statical” and “Dynamic” in Table 1 and explaining the distinctions between the proposed Self-Adaptive Attention and traditional self-attention. It is unclear what these terms mean and how they affect the performance of different methods. The paper should provide more definitions and discussions on these terms. Moreover, the paper should elaborate on how the Self-Adaptive Attention differs from the conventional self-attention in terms of design, implementation, and advantages.
  2. The confidence threshold $\gamma$ serves as an important parameter in the confidence-aware loss. However, the influence of $\gamma$ on TTA performance is not explored, and it would be beneficial to understand its role more explicitly. The paper could conduct more experiments and analysis to show how different values of $\gamma$ affect the accuracy and robustness of TTA, and could provide some insights on how to choose an appropriate value of $\gamma$.
  3. How is the attention value in Figure 4 calculated? Some explanation is needed. The paper should provide more details on how to compute the attention value for each modality pair in Figure 4.
  4. A tangential question: how is audio information used in autonomous vehicles? Most autonomous-driving scenarios rely on visual sensors; could your method be applied in such cases?
  5. There is a typographical error in Eq. (4); “A” should be bold. This is a minor mistake that can be easily corrected.

Questions

The primary questions for the rebuttal primarily arise from the "weaknesses" section. It would be highly appreciated if the authors could provide further explanations regarding the experiments and address the raised concerns, which will strengthen the paper. Overall, I recommend accepting this paper.

Comment

Thanks for the detailed comments. In the following, we will answer your questions one by one.

Q1.1: The paper lacks some experimental details, such as clarifying the terms “Statical” and “Dynamic” in Table 1 and explaining the distinctions between the proposed Self-Adaptive Attention and traditional self-attention. It is unclear what these terms mean and how they affect the performance of different methods. The paper should provide more definitions and discussions on these terms. Moreover, the paper should elaborate on how the Self-Adaptive Attention differs from the conventional self-attention in terms of design, implementation, and advantages.

A1.1: We apologize for any confusion regarding these terms. Allow us to elaborate:

"Statical" and "Dynamic". "Statical" denotes the state where the layer normalization (LN) and attention-based fusion (AF) layers remain frozen during the test-time phase. Conversely, "Dynamic" implies the updating of parameters within all normalization layers rooted in the source model as most existing TTA methods do in the test-time phase.

Self-Adaptive Attention vs. traditional self-attention. The traditional self-attention mechanism usually keeps the parameters inherited from the source model and performs information integration across modalities. Clearly, the distribution shift between training and test-time data might hinder the similarity estimation between tokens. As a result, reliable fusion on biased modalities cannot be guaranteed. Instead, we hope that the model could focus more on the unbiased modalities and avoid interference from the bias. To this end, we propose repurposing the cross-modal attention-based fusion layers in a self-adaptive way. As depicted in Figure 2 of the manuscript, our default approach in the RFRA framework involves updating only the parameters of the last Transformer layer (referred to as the AF layer) in a self-adaptive manner. We dub this paradigm self-adaptive attention-based fusion, abbreviated as "SAF". SAF essentially repurposes the standard AF operation by modulating the parameters within the attention layer under the guidance of the proposed objective function.
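To make the SAF paradigm concrete, below is a minimal PyTorch sketch of how only the attention projections of the last (fusion) Transformer layer might be unfrozen for test-time updates while everything else stays frozen. The attribute names (`fusion_layer`, `attn`, `q_proj`, `k_proj`, `v_proj`) are placeholders, not the actual CAV-MAE module names.

```python
import torch

def collect_saf_params(model):
    """Freeze the whole source model, then re-enable gradients only for the
    Q/K/V projections of the last (attention-based fusion) Transformer layer.
    Attribute names below are hypothetical; adapt them to the real backbone."""
    for p in model.parameters():
        p.requires_grad_(False)

    attn = model.fusion_layer.attn  # last Transformer (AF) layer
    trainable = []
    for proj in (attn.q_proj, attn.k_proj, attn.v_proj):
        for p in proj.parameters():  # weight and bias of each projection
            p.requires_grad_(True)
            trainable.append(p)
    return trainable

# During test time, only these parameters would be passed to the optimizer, e.g.:
# optimizer = torch.optim.SGD(collect_saf_params(model), lr=1e-3)
```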

Q2: The confidence threshold $\gamma$ serves as an important parameter in the confidence-aware loss. However, the influence of $\gamma$ on TTA performance is not explored, and it would be beneficial to understand its role more explicitly. The paper could conduct more experiments and analysis to show how different values of $\gamma$ affect the accuracy and robustness of TTA, and could provide some insights on how to choose an appropriate value of $\gamma$.

A2: Thanks for your comment. In response to your concern, we investigate the influence of the only hyper-parameter (i.e., the threshold $\gamma$ in Eq. 6) in our approach. To this end, we vary $\gamma$ in the range $[0.1, 0.2, 0.3, e^{-1}, 0.4, 0.5]$ and perform corresponding experiments on the Kinetics50-C benchmark with fog and traffic corruptions. Results are depicted in Fig. 10 within the revised manuscript. For your convenience, we attach the corresponding numerical results in the following table.

| Threshold ($\gamma$ in Eq. 6) | 0.1 | 0.2 | 0.3 | $e^{-1}$ | 0.4 | 0.5 |
| --- | --- | --- | --- | --- | --- | --- |
| Video-Fog | 54.2 | 56 | 56.8 | 57.4 | 57.5 | 57.7 |
| Audio-Traffic | 69.2 | 69.1 | 69.1 | 69.0 | 69.3 | 69.0 |

The results illustrate the stability of RFRA across varying threshold values of $\gamma$.

Q3: How is the attention value in Figure 4 calculated? Some explanation is needed. The paper should provide more details on how to compute the attention value for each modality pair in Figure 4.

A3: Thanks for your comment. In Figure 4, the attention values between modalities, denoted as "Attention X-Y" ($X, Y \in \{A, V\}$), are calculated using Equation 4 from the manuscript:

$\mathbf{A}=\operatorname{Softmax}\left(\frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d}}\right)$.

For instance, "Attention X-X" represents the self-attention, computed by setting both the query (Q\mathbf{Q}) and key (K\mathbf{K}) as the tokens of modality X. Meanwhile, "Attention X-Y" (X, Y \in {A, V}) corresponds to the configuration where the query and key tokens come from modality X and modality Y, respectively.

Comment

Q4: A tangential question: how is audio information used in autonomous vehicles? Most autonomous-driving scenarios rely on visual sensors; could your method be applied in such cases?

A4: Thanks for your comment. It is crucial to note that the multi-modal reliability bias revealed in this study is not limited to autonomous vehicles and can manifest across various modalities in diverse scenarios. While the majority of autonomous vehicles primarily incorporate sensors like cameras, radars, and lidars, the potential utilization of microphones and audio data in autonomous driving has received increasing attention. Interestingly, audio information could significantly complement the capabilities of the autonomous system. For instance, as highlighted in [A], audio input aids in emergency vehicle recognition. This underlines the broader applicability of the multi-modal reliability bias challenge and its potential implications for the wider community.

Q5: There is a typographical error in Eq. (4); “A” should be bold. This is a minor mistake that can be easily corrected.

A5: We appreciate your feedback on the notation typos. In the updated version, we have carefully revised the typos.

[A] Ivan Kharitonov. "Survey on Acoustic Sensors in Self-Driving Cars." 2023. https://hackernoon.com/survey-on-acoustic-sensors-in-self-driving-cars

Review (Rating: 8)

This paper introduces a novel approach for addressing reliability bias in multi-modal test-time adaptation (TTA), a challenge arising from information disparities between modalities due to distribution shifts. To investigate the impact of reliability bias, the authors conduct comprehensive analyses involving various multi-modal fusion strategies and state-of-the-art TTA methods. The results underscore two pivotal aspects of effective TTA against reliability bias: dynamic information integration across modalities and noise-resilient adaptation across domains. To tackle these challenges, the authors devise a self-adaptive attention module to facilitate reliable cross-modal fusion and a confidence-aware loss function to ensure robustness against noisy predictions. Furthermore, this paper contributes two benchmark datasets focusing on multi-modal action recognition and event classification. Extensive comparison experiments against existing TTA methods and imbalanced multi-modal learning methods validate the effectiveness of the proposed method.

Strengths

  1. This paper studies a new challenge (i.e., reliability bias) for multi-modal test-time adaption. Test-time adaption methods aim at adapting the pre-trained model from the source domain to the target domain in real time, and most existing methods focus on single-modality tasks against domain shifts. On the one hand, this paper takes the more complex multi-modal scenarios into consideration. On the other hand, the authors study and tackle the reliability bias challenge.

  2. The authors conduct extensive experiments to validate the importance of developing robust TTA methods against reliability bias. On the one hand, the existing cross-modal fusion methods (late fusion, attention-based fusion, etc.) suffer from reliability bias and cannot achieve reliable cross-modal fusion. On the other hand, the existing TTA methods cannot completely reconcile the distribution shifts by updating the parameters of normalization layers, leaving reliability bias across modalities unresolved. Furthermore, the authors show that simply handling reliability bias during test time outperforms the imbalanced multi-modal learning methods that alter the training process to handle the problem.

  3. The proposed method is novel and technically sound. First, the authors focus on the characteristics of multi-modal TTA and design the self-adaptive attention module that repurposes the attention layers during test time for achieving reliable cross-modal fusion. I think the design would inspire the community to design task-specific parameter modulation instead of solely updating the parameters of normalization layers following most existing methods. Second, to achieve robustness against heavy noise during adaption, the authors propose the robust loss function which not only eliminates the influence of noisy predictions but also boosts utilization of the clean predictions with theoretical guarantees.

Weaknesses

Although this paper is well-motivated and extensively validated, I still have the following concerns or suggestions, hoping to make the paper more clear and solid.

  1. The experiment results are mainly obtained under the setting of severity 5. To establish the method's generality, it is encouraged to expand the empirical results across a spectrum of scenarios, including different severity levels. A broader array of experiments, encompassing various severity levels, would not only fortify the method's reliability but also enhance the comprehensiveness of this study. Moreover, it would be beneficial to evaluate how the method performs in the context of test-time adaptation (TTA) under unbiased reliability conditions. Specifically, investigating the method's effectiveness in both i.i.d. and non-i.i.d. scenarios would render it more practical and versatile.
  2. The paper contains extensive analysis and experiments, but some settings require further clarification. For instance, it's not entirely clear what "Attention A-V" means in Figure 4 and how the results demonstrate the method's robustness. In the analysis of Figure 5, claims are made about the importance of maintenance between audio and video modalities, but the contrast with clean results is not evident. Additional clarification or supplementary results are needed to support these claims.
  3. There are a few typos and vague statements in the paper, such as "traffic noise in the audio modality" in the caption of Figure 1, "information bias" on Page 2, and inconsistent notations like "(Stat. LN) & AF" or "Stat. (LN & AF)" in Tables 1-3. These should be corrected for clarity and consistency.

Questions

My questions mainly lie in some unclear experiment analysis and the generalizability of the proposed approach to a broader range of severity levels.

Comment

Thanks for the insightful reviews. We will answer your questions one by one in the following.

Q1.1: The experiment results are mainly obtained under the setting of severity 5. To establish the method's generality, it is encouraged to expand the empirical results across a spectrum of scenarios, including different severity levels. A broader array of experiments, encompassing various severity levels, would not only fortify the method's reliability but also enhance the comprehensiveness of this study.

A1.1: Thanks for your valuable suggestions. In the submission, we reported the comparison results on the VGGSound-C and Kinetics50-C benchmarks under 15 visual corruptions and 6 audio corruptions with a severity level of 5.

In response to your constructive feedback, we conduct additional experiments to assess our RFRA's performance across different corruption types at severity level 3. We've summarized these results in Tables 10-12 within the revised manuscript. For your convenience, the corresponding results are provided in the following tables.

**Kinetics50-C (Video Corruption, Severity Level 3)**

| Method | Gauss. | Shot | Impul. | Defoc. | Glass | Mot. | Zoom | Snow | Frost | Fog | Brit. | Digital | Contr. | Pix. | JPEG | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Source ((Stat. LN) & LF) | 46.6 | 47.8 | 46.9 | 71 | 63.4 | 74.4 | 68.1 | 62.1 | 58.9 | 65.4 | 77.6 | 68.2 | 76.1 | 77.1 | 73 | 65.1 |
| + MM-TTA (Dyn. LN) | 48.8 | 50.8 | 50.6 | 66 | 60.6 | 70.9 | 63.5 | 59.8 | 56.3 | 58.1 | 75.1 | 59.3 | 72.2 | 74.7 | 68.7 | 62.4 |
| + Tent (Dyn. LN) | 44.6 | 46.6 | 44.9 | 71.2 | 64.6 | 74.6 | 68.7 | 62.3 | 56.5 | 65.2 | 77.9 | 68.5 | 76.3 | 77 | 73.2 | 64.8 |
| + EATA (Dyn. LN) | 46.8 | 48.2 | 47.3 | 70.8 | 63.9 | 74.6 | 68.4 | 62.3 | 58.9 | 65.4 | 77.8 | 68.1 | 76 | 77 | 73 | 65.2 |
| + SAR (Dyn. LN) | 46.7 | 47.9 | 47 | 70.6 | 63.3 | 74.4 | 68.2 | 62.3 | 58.9 | 65.2 | 77.7 | 68 | 76 | 77 | 72.7 | 65.1 |
| + RFRA (Dyn. LN) | 49.3 | 50 | 49.4 | 71.1 | 65.7 | 75 | 70.3 | 64.5 | 61.5 | 67.1 | 78.1 | 69.5 | 76.6 | 77.2 | 73.7 | 66.6 |
| Source (Stat. (LN&AF)) | 54.1 | 54.8 | 54.6 | 73.5 | 68.3 | 76.6 | 71.5 | 69.2 | 64.7 | 69.5 | 79.3 | 72.1 | 77.6 | 79.4 | 75.4 | 69.4 |
| + Tent (Dyn. LN) | 54.2 | 55.1 | 55.2 | 73.6 | 69.6 | 76.8 | 71.9 | 69.5 | 65.6 | 70.2 | 79.4 | 72.9 | 78.3 | 79.2 | 75.3 | 69.8 |
| + EATA (Dyn. LN) | 54.4 | 54.9 | 55 | 73.4 | 69.1 | 76.5 | 71.6 | 69.2 | 65.1 | 69.5 | 79.5 | 72.3 | 77.7 | 79.1 | 75.2 | 69.5 |
| + SAR (Dyn. LN) | 54.2 | 54.8 | 55 | 73.1 | 68.2 | 76.4 | 71.1 | 69.1 | 64.8 | 69.4 | 79.1 | 72 | 77.4 | 79.1 | 75 | 69.2 |
| + RFRA (SAF) | 56.1 | 56.9 | 56.4 | 73.9 | 70.5 | 76.6 | 72.8 | 70 | 68.1 | 70.8 | 79.3 | 73.3 | 78.2 | 79.6 | 75.6 | 70.5 |
Comment
**VGGSound-C (Video Corruption, Severity Level 3)**

| Method | Gauss. | Shot | Impul. | Defoc. | Glass | Mot. | Zoom | Snow | Frost | Fog | Brit. | Digital | Contr. | Pix. | JPEG | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Source ((Stat. LN) & LF) | 45.6 | 45.3 | 45.4 | 55.7 | 54 | 57.6 | 55.4 | 55.1 | 53.7 | 53.4 | 58.5 | 53.9 | 58.3 | 58.1 | 56.5 | 53.8 |
| + MM-TTA (Dyn. LN) | 18.6 | 17.5 | 15.8 | 50.4 | 44.3 | 51.8 | 48.4 | 41.4 | 28.1 | 46.5 | 52 | 46.2 | 52 | 52 | 51.6 | 41.1 |
| + Tent (Dyn. LN) | 19.8 | 17.2 | 18.4 | 55.9 | 55.3 | 57.3 | 55.9 | 55.3 | 45.3 | 34.8 | 58.4 | 56.4 | 58.4 | 58.4 | 57.1 | 46.9 |
| + EATA (Dyn. LN) | 45.8 | 45.6 | 45.7 | 56.3 | 55.2 | 58 | 56 | 55.8 | 54.4 | 54.5 | 58.9 | 55.3 | 58.8 | 58.5 | 57.1 | 54.4 |
| + SAR (Dyn. LN) | 45.4 | 45.2 | 45.2 | 55.8 | 54.3 | 57.7 | 55.6 | 55.3 | 53.9 | 53.7 | 58.5 | 54.2 | 58.5 | 58.2 | 56.7 | 53.9 |
| + RFRA (Dyn. LN) | 46 | 46 | 46.3 | 53 | 52.9 | 56.3 | 54.1 | 53.8 | 53.3 | 53 | 58 | 53.8 | 57.7 | 56.8 | 55.1 | 53.1 |
| Source (Stat. (LN&AF)) | 54.7 | 54.6 | 54.7 | 59.3 | 58.4 | 60.4 | 59 | 58.3 | 57.4 | 57.8 | 61.3 | 58 | 61 | 60.9 | 60 | 58.4 |
| + Tent (Dyn. LN) | 54.4 | 54.3 | 54.4 | 58.8 | 57.8 | 59.7 | 58.5 | 57.6 | 56.9 | 58.6 | 60.6 | 58 | 61 | 60.9 | 60 | 58.1 |
| + EATA (Dyn. LN) | 54.8 | 54.6 | 54.8 | 59.4 | 58.4 | 60.3 | 59.1 | 58.2 | 57.5 | 58.7 | 61.3 | 58.9 | 61.1 | 60.8 | 60.1 | 58.5 |
| + SAR (Dyn. LN) | 54.8 | 54.6 | 54.7 | 59.4 | 58.3 | 60.3 | 58.9 | 58.3 | 57.5 | 58.2 | 61.2 | 58.3 | 60.9 | 60.8 | 59.9 | 58.4 |
| + RFRA (SAF) | 55.3 | 55.4 | 55.4 | 60 | 59.1 | 61.1 | 59.8 | 59.2 | 58.5 | 59.3 | 61.9 | 59.8 | 61.5 | 61.5 | 60.7 | 59.2 |

**Kinetics50-C (Audio Corruption, Severity Level 3)**

| Method | Gauss. | Traff. | Crowd. | Rain | Thund. | Wind | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Source ((Stat. LN) & LF) | 74.2 | 68.8 | 68.7 | 66.7 | 71.6 | 70.4 | 70.1 |
| + MM-TTA (Dyn. LN) | 72.8 | 69.6 | 68.9 | 68.7 | 70.7 | 70.3 | 70.2 |
| + Tent (Dyn. LN) | 74.2 | 69 | 69.6 | 64.8 | 71.9 | 71.1 | 70.1 |
| + EATA (Dyn. LN) | 74.1 | 68.8 | 69.1 | 67.3 | 71.8 | 70.6 | 70.3 |
| + SAR (Dyn. LN) | 73.9 | 68.8 | 68.9 | 66.7 | 71.6 | 70.3 | 70 |
| + RFRA (Dyn. LN) | 74.2 | 69.6 | 70 | 69 | 72.7 | 70.8 | 71 |
| Source (Stat. (LN&AF)) | 75.9 | 64.4 | 68.7 | 70.3 | 67.9 | 70.3 | 69.3 |
| + Tent (Dyn. LN) | 73.9 | 67.4 | 69.2 | 69.3 | 69 | 72.1 | 70.1 |
| + EATA (Dyn. LN) | 76 | 65.7 | 68.9 | 69.8 | 69.1 | 72.1 | 70.3 |
| + SAR (Dyn. LN) | 76 | 64.6 | 68.7 | 69.3 | 68.6 | 72.2 | 69.9 |
| + RFRA (SAF) | 76.4 | 69.6 | 70.8 | 72 | 72.6 | 72.3 | 72.3 |
Comment
**VGGSound-C (Audio Corruption, Severity Level 3)**

| Method | Gauss. | Traff. | Crowd. | Rain | Thund. | Wind | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Source ((Stat. LN) & LF) | 39.6 | 23.8 | 25 | 28.7 | 36.5 | 26.9 | 30.1 |
| + MM-TTA (Dyn. LN) | 13.8 | 7.1 | 7.6 | 16.2 | 10.6 | 5.4 | 10.1 |
| + Tent (Dyn. LN) | 11.2 | 4.1 | 3.4 | 5.2 | 12.8 | 5.1 | 7.0 |
| + EATA (Dyn. LN) | 40.3 | 23.9 | 24.7 | 28.7 | 36.5 | 26.9 | 30.2 |
| + SAR (Dyn. LN) | 39.9 | 23.6 | 24.9 | 28.7 | 36.4 | 26.8 | 30 |
| + RFRA (Dyn. LN) | 44.5 | 29.9 | 31.5 | 33.2 | 37 | 31.2 | 34.6 |
| Source (Stat. (LN&AF)) | 42.1 | 29.4 | 19.5 | 27.6 | 31.2 | 29.4 | 29.9 |
| + Tent (Dyn. LN) | 8.1 | 4 | 2.3 | 4.7 | 7.8 | 6.1 | 5.5 |
| + EATA (Dyn. LN) | 46.7 | 30.5 | 28 | 31.4 | 35.4 | 33.8 | 34.3 |
| + SAR (Dyn. LN) | 43.1 | 17.3 | 8.3 | 29 | 31.6 | 30.5 | 26.6 |
| + RFRA (SAF) | 47.3 | 32.7 | 29.9 | 33.2 | 38.3 | 33.7 | 35.8 |

Q1.2: Moreover, it would be beneficial to evaluate how the method performs in the context of test-time adaptation (TTA) under unbiased reliability conditions. Specifically, investigating the method's effectiveness in both i.i.d. and non-i.i.d. scenarios would render it more practical and versatile.

A1.2: Thanks for your comment. We understand your concern and conduct additional experiments in the context of test-time adaptation (TTA) under unbiased reliability conditions. Specifically, we directly adapt RFRA and other baseline methods into the test sets of the VGGSound and Kinetics datasets without adding any corruptions. Results are summarized in Table 9 within the revised manuscript. For your convenience, we attach the corresponding results in the following table.

| Method | Source (Stat. (LN&AF)) | Tent | EATA | SAR | RFRA |
| --- | --- | --- | --- | --- | --- |
| VGGSound | 63.3 | 62.6 | 63.1 | 63.1 | 63.5 |
| Kinetics50 | 82.3 | 82.1 | 82.3 | 82.3 | 82.2 |

The results indicate that the robustness of RFRA doesn't compromise the performance of the original target domain. In essence, RFRA remains effective even in scenarios where the presence of distribution shifts is uncertain.

Q2.1: The paper contains extensive analysis and experiments, but some settings require further clarification. For instance, it's not entirely clear what "Attention A-V" means in Figure 4 and how the results demonstrate the method's robustness.

**A2.1:** What "Attention A-V" means. "Attention A-V" denotes the cross-attention from the audio modality to the video modality. Specifically, in Figure 4, the attention values between modalities, denoted as "Attention X-Y" ($X, Y \in \{A, V\}$), are derived using Equation 4 from the manuscript:

$\mathbf{A}=\operatorname{Softmax}\left(\frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d}}\right)$.

For instance, "Attention X-X" represents the self-attention, computed by setting both the query (Q\mathbf{Q}) and key (K\mathbf{K}) as the tokens of modality X. Meanwhile, "Attention X-Y" (X, Y \in {A, V}) corresponds to the configuration where the query and key tokens correspond to modality X and modality Y, respectively.

How the results demonstrate the robustness. Larger attention values indicate a greater focus on a particular modality during cross-modal fusion. In Figure 4, "Tent" displays slightly better robust fusion effects compared to "AF," potentially due to the narrowed domain gap achieved by repurposing the LN. In contrast, our method demonstrates significant improvements in reliability estimation (attention values) for both video and audio bias situations across varying severities. This highlights the necessity of the self-adaptive attention-based fusion paradigm for multi-modal TTA, as it enhances the model's capability to adapt to reliability bias scenarios.

Comment

Q2.2: In the analysis of Figure 5, claims are made about the importance of maintenance between audio and video modalities, but the contrast with clean results is not evident. Additional clarification or supplementary results are needed to support these claims.

A2.2: Thanks for your comment. We have supplemented more visualization results on different baselines for comparison. Specifically, Fig. 11 in the Appendix of the manuscript shows the attention matrices of the source model with vanilla attention-based fusion (AF) and of the model adapted by Tent with dynamic LN.

Q3: There are a few typos and vague statements in the paper, such as "traffic noise in the audio modality" in the caption of Figure 1, "information bias" on Page 2, and inconsistent notations like "(Stat. LN) & AF" or "Stat. (LN & AF)" in Tables 1-3. These should be corrected for clarity and consistency.

A3: We appreciate your feedback on the typos. In the updated version, we have carefully corrected them.

Comment

After reading the comments from the other reviewers and the responses from the authors, I would like to raise my score to 'accept'. This paper deserves to be presented at ICLR 2024.

Comment

We sincerely appreciate your positive recognition and assessment of our work!

Comment

Dear reviewer 9FmQ,

As the author/reviewer discussion will draw to a close soon, we would like to know if our response has addressed your concerns and questions. If you have any further concerns or suggestions for the paper or our rebuttal, please let us know. We would be happy to engage in further discussion and manuscript improvement.

Thank you again for the time and effort you dedicated to reviewing this work.

Review (Rating: 8)

This paper delves into test-time adaption under the multi-modal setting and reveals an interesting and practical challenge, namely, reliability bias. In the wild, it is common that some modalities would suffer from distribution shifts compared to their counterparts in the source domain. As a result, the task-specific information across the modalities would be more inconsistent, thus contributing to the reliability differences for different modalities. Extensive empirical studies have been conducted to investigate the impact of the reliability bias using different cross-modal fusion strategies. To achieve robust multi-modal TTA against reliability bias, the authors propose a novel method, dubbed reliable fusion and robust adaption (RFRA). Different from the existing TTA methods that mainly repurpose the normalization layers to achieve adaption, RFRA modulates the attention module to achieve reliable cross-modal fusion during test time. Besides, RFRA adopts a new objective function with desirable mathematical properties to combat noise during adaption. To highlight the necessity of developing reliability-bias robust multi-modal TTA, the authors construct two new benchmarks with different settings of reliability bias based on the Kinetics and VGGSound datasets. Finally, the authors validate the effectiveness of the proposed method and give a deep analysis of the reliability bias challenge.

Strengths

  1. Revealing a new problem. This paper studies a new and practical challenge in the context of multi-modal test-time adaptation, namely, reliability bias. In the wild, distribution shifts are inevitably introduced in some modalities. As a result, the task-specific information across the modalities becomes more inconsistent, thus contributing to the reliability differences between modalities. The authors design extensive experiments to investigate the influence of reliability bias under different multi-modal fusion manners, and the results validate the necessity of handling reliability bias for multi-modal TTA. I think the revealed challenge would bring some insights to the TTA community.
  2. Constructing meaningful benchmarks. To highlight the necessity of developing reliability-bias robust multi-modal TTA, the authors construct two new benchmarks with different settings of reliability bias based on the Kinetics and VGGSound datasets. Concretely, the benchmarks consist of both video and audio modalities, and each modality is corrupted at different severity levels so that reliability bias is simulated. On the one hand, the corruption on the video modality follows ImageNet-C, which ensures comparison fairness. On the other hand, the incorporation of audio corruption types extends the utility of this research to the audio domain with TTA. Besides, the benchmarks encompass diverse multi-modal tasks, including action recognition and event classification, thus providing a comprehensive evaluation and supplementing existing multi-modal test-time adaptation tasks.
  3. Innovative paradigm for TTA. This paper proposes a new paradigm for TTA, namely, repurposing the attention layers during test time. Intuitively, updating the parameters of normalization layers, as most existing TTA methods do, can only handle the distribution shifts between domains. In contrast, the parameter modulation of attention layers helps learn the importance difference between modalities, resulting in reliable fusion for multi-modal TTA. The authors perform extensive experiments to show the superiority of the proposed new paradigm against reliability bias.

Weaknesses

  1. While the paper is well-written, it lacks essential details about CAV-MAE, such as the number of attention layers used for fusion. As many evaluations rely on the CAV-MAE framework, providing this information is crucial for the readers and reviewers to fully grasp the methodology. Additionally, visualizing the corruption types on both video and audio modalities would enhance the paper's clarity and help readers better understand the benchmarks.
  2. The paper introduces the modulation of Q, K, and V parameters to address reliability bias. However, it's not clear why the authors chose to update all parameters of Q, K, and V simultaneously. Explaining this choice and discussing the possibility of updating only the MLP parameters of Q and K would provide valuable insights. Furthermore, while the novel attention layer repurposing is effective for reliability bias, the paper should address whether this approach comes at the cost of efficiency for test-time adaptation.
  3. The paper introduces a crucial hyper-parameter, the confidence threshold ($\gamma$) in Equation 6. While this threshold is fixed in all experiments, it's essential to include ablation studies to explore the sensitivity of the proposed method to variations in this hyper-parameter. A more comprehensive analysis would provide a deeper understanding of the method's robustness.

Minor: Some of the figures, such as Figure 1c and Figure 3b, suffer from low image clarity. Improving the quality of these figures would enhance the paper's visual presentation and make the findings more accessible to readers.

Overall, I think this paper is above the bar for ICLR regarding motivation and novelty.


Upon reviewing the response, I note that my concerns have been effectively addressed. Considering the consensus of other reviewers, I wholeheartedly recommend this work with a high level of approval.

Questions

My primary concerns revolve around the lack of comprehensive experiment details and the design of the modulation strategies, as highlighted in the weaknesses.

Comment

Thanks for your valuable reviews. We would like to address your concerns one by one in the following.

Q1.1: While the paper is well-written, it lacks essential details about CAV-MAE, such as the number of attention layers used for fusion. As many evaluations rely on the CAV-MAE framework, providing this information is crucial for the readers and reviewers to fully grasp the methodology.

A1.1: We apologize for the missing details on the backbone, which may have confused the reviewer. In the revised manuscript, we have supplemented the details of CAV-MAE ([A]) and more implementation details of our RFRA in Section C. For your convenience, we attach the added statement as follows.

In the implementation, we use the CAV-MAE model as the backbone. CAV-MAE adopts an encoder-decoder-like architecture that is pre-trained on large-scale video data with both the contrastive learning and masked image modeling paradigms. The CAV-MAE encoder consists of 11 Transformer layers dedicated to each modality for the modality-specific feature extraction, alongside one Transformer layer for cross-modal fusion. The input to the CAV-MAE encoder involves 10-second video clips containing both video and corresponding audio data. For the video stream, CAV-MAE samples 10 frames within each video clip and randomly selects one frame feeding into the visual Transformer encoder. For the audio stream, each 10-second audio waveform is converted into one spectrogram and then input to the audio Transformer encoder.

During the fine-tuning phase, we maintain the visual and audio encoders of the pre-trained model and add one randomly initialized classification head to them. The fine-tuned model is regarded as the source model and denoted as “Source (Stat. (LN & AF))”. To investigate the robustness of different fusion manners, we design another variant of the source model that utilizes 12 Transformer layers for feature extraction and performs late fusion between the classification logit of each modality. The corresponding model variant is denoted as “Source ((Stat. LN) & LF)”. During the test-time adaption phase, unless otherwise specified, all baselines update the parameters of all normalization layers rooted in the source model, i.e., referred to as “Dyn. LN”. In contrast, as depicted in Fig. 2, our default approach in the RFRA framework involves updating only the parameters of the last Transformer layer to ensure reliable fusion, denoted as “SAF”.

Q1.2: Additionally, visualizing the corruption types on both video and audio modalities would enhance the paper's clarity and help readers better understand the benchmarks.

A1.2: Thanks for your valuable suggestion. In this work, to comprehensively evaluate modality bias, we introduce different distribution shifts on the video and audio modalities for the test sets of the VGGSound ([B]) and Kinetics ([C]) datasets. For the video corruptions, we follow [D] to apply 15 kinds of corruptions to the video, each with 5 severity levels for extensive validation. Specifically, the corruptions on the video modality include ''Gaussian Noise", ''Shot Noise", ''Impulse Noise", ''Defocus Blur", ''Glass Blur", ''Motion Blur", ''Zoom Blur", ''Snow", ''Frost", ''Fog", ''Brightness", ''Elastic", ''Pixelate", ''Contrast", and ''JPEG". Similar to the video modality, we add 6 kinds of common audio noise captured in the wild (https://freesound.org), each with 5 severity levels. Specifically, the corruptions on the audio modality include ''Gaussian Noise", ''Paris Traffic Noise", ''Crowd Noise", ''Rainy Noise", ''Thunder Noise" and ''Windy Noise".
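For readers who want a concrete picture of the audio side, here is a rough sketch of how a recorded noise clip (e.g., traffic or rain) could be mixed into a clean waveform at a severity-dependent signal-to-noise ratio. The SNR schedule below is an illustrative assumption, not the benchmark's actual configuration.

```python
import numpy as np

def corrupt_audio(clean, noise, severity):
    """Mix a recorded noise clip into a clean waveform at a severity-scaled SNR.

    `SNR_DB` maps severity 1-5 to a signal-to-noise ratio in dB; these values
    are illustrative placeholders, not the benchmark's actual settings.
    """
    SNR_DB = {1: 20.0, 2: 15.0, 3: 10.0, 4: 5.0, 5: 0.0}
    # Tile/trim the noise clip to the clean signal's length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]

    p_clean = np.mean(clean ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    target_ratio = 10.0 ** (SNR_DB[severity] / 10.0)
    scale = np.sqrt(p_clean / (target_ratio * p_noise))
    return clean + scale * noise
```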

In response to your valuable suggestion, we've incorporated visualizations of corruption types in both video and audio modalities within the updated manuscript. Fig. 6 showcases the visualization results of various visual corruption types observed within the constructed Kinetics-C benchmark. Additionally, Fig. 7 presents the Mel spectrogram visualizations, highlighting the raw audio and the corresponding audio corruption types. For a comprehensive understanding, we encourage you to refer to the updated manuscript for an in-depth exploration of these visualizations.

Comment

Q2.1: The paper introduces the modulation of Q, K, and V parameters to address reliability bias. However, it's not clear why the authors chose to update all parameters of Q, K, and V simultaneously. Explaining this choice and discussing the possibility of updating only the MLP parameters of Q and K would provide valuable insights.

A2.1: Thanks for your constructive comment. In our default approach, we update $W_{\Theta^h}$ and $B_{\Theta^h}$ ($h \in \{q, k, v\}$) within the last Transformer layer of the source model to ensure reliable fusion.

In response to your valuable comment, we conduct additional experiments to explore the impact of different repurposing schemes. To this end, we design three variants: one that updates only the query and key projection layers, another that updates only the value projection layers, and a third that updates the final classification head. Results are summarized in Table 8 within the revised manuscript. For your convenience, we attach the corresponding results in the following table.

| Corruption | Source | QK | V | MLP | QKV (ours) |
| --- | --- | --- | --- | --- | --- |
| Video-Fog | 46.7 | 51.7 | 53.6 | 49.1 | 57.4 |
| Audio-Traffic | 65.5 | 68.8 | 67.2 | 66.7 | 69.0 |

Results in the table above illustrate that the default setting, which updates the query, key, and value projection layers simultaneously, exhibits significant performance superiority. Modulating the classification head demonstrates minimal effectiveness (e.g., from 46.7 to 49.1). Conversely, the attention modulation scheme achieves adaptive fusion between discrepant modalities, mitigating the multi-modal reliability bias problem (e.g., from 46.7 to 51.7). Moreover, modulation of the query, key, and value projection layers introduces additional parameters for reliable fusion, resulting in further improvements in robustness (e.g., from 46.7 to 57.4).

Q2.2: Furthermore, while the novel attention layer repurposing is effective for reliability bias, the paper should address whether this approach comes at the cost of efficiency for test-time adaptation.

A2.2: Thanks for your valuable suggestion. Different from most TTA methods that update the parameters of normalization layers, our RFRA repurposes the last Transformer layer of the CAV-MAE model, as elaborated in Section 2 of the manuscript.

In response to your insightful suggestion, we conduct additional experiments comparing the efficiency of the two paradigms. To this end, we choose the attention-fusion-based CAV-MAE model as the source model (i.e., source (Stat. (LN & AF))), and conduct experiments on the VGGSound-C benchmark. We measure both the size of learnable parameters and the GPU time during the test-time adaptation phase. Results are summarized in Table 7 within the revised manuscript. For your convenience, we attach the corresponding results in the following table.

| Method | #params (M) | GPU time (14,046 pairs) |
| --- | --- | --- |
| Tent (Dyn. LN) | 0.2 | 209.5 seconds |
| EATA (Dyn. LN) | 0.2 | 207.6 seconds |
| SAR (Dyn. LN) | 0.2 | 286.1 seconds |
| RFRA (SAF) | 1.8 | 134.1 seconds |

The results underscore that RFRA achieves adaptation more efficiently, primarily due to its module repurposing approach. While the normalization layer updating scheme requires fewer parameters, it necessitates more time for propagation.
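For reference, the two quantities in the table could be measured with a sketch like the one below: trainable-parameter count in millions and wall-clock time for one pass over the corrupted test stream. `adapt_step` is a placeholder for one TTA update plus prediction; this is not the exact script used for Table 7.

```python
import time
import torch

def count_trainable_params_m(model):
    """Number of learnable parameters (in millions), i.e. those left
    unfrozen for test-time adaptation."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

def time_adaptation(adapt_step, loader, device="cuda"):
    """Wall-clock GPU time for one pass over the test stream.
    `adapt_step(batch)` is a placeholder for one TTA update + prediction."""
    torch.cuda.synchronize(device)
    start = time.time()
    for batch in loader:
        adapt_step(batch)
    torch.cuda.synchronize(device)
    return time.time() - start
```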

Q3: The paper introduces a crucial hyper-parameter, the confidence threshold ($\gamma$) in Equation 6. While this threshold is fixed in all experiments, it's essential to include ablation studies to explore the sensitivity of the proposed method to variations in this hyper-parameter. A more comprehensive analysis would provide a deeper understanding of the method's robustness.

A3: Thanks for your comment. In response to your concern, we investigate the influence of the only hyper-parameter (i.e., the threshold $\gamma$ in Eq. 6) in our approach. To this end, we vary $\gamma$ in the range $[0.1, 0.2, 0.3, e^{-1}, 0.4, 0.5]$ and perform corresponding experiments on the Kinetics50-C benchmark with fog and traffic corruptions. Results are depicted in Fig. 10 within the revised manuscript. For your convenience, we attach the corresponding numerical results in the following table.

| Threshold ($\gamma$ in Eq. 6) | 0.1 | 0.2 | 0.3 | $e^{-1}$ | 0.4 | 0.5 |
| --- | --- | --- | --- | --- | --- | --- |
| Video-Fog | 54.2 | 56 | 56.8 | 57.4 | 57.5 | 57.7 |
| Audio-Traffic | 69.2 | 69.1 | 69.1 | 69.0 | 69.3 | 69.0 |

The results illustrate the stability of RFRA across varying threshold values of $\gamma$.

Comment

Q4: Minor: Some of the figures, such as Figure 1c and Figure 3b, suffer from low image clarity. Improving the quality of these figures would enhance the paper's visual presentation and make the findings more accessible to readers.

A4: We appreciate your feedback. In the revised manuscript, we have enhanced the quality of these figures to improve their visual presentation.

Reference:

[A] Yuan Gong, Andrew Rouditchenko, Alexander H Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, and James R Glass. Contrastive audio-visual masked autoencoder. In ICLR, 2023.

[B] Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. Vggsound: A large-scale audiovisual dataset. In ICASSP, 2020.

[C] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv:1705.06950, 2017.

[D] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, 2019.

Comment

Thank you for upgrading your score! We appreciate the time and effort you dedicated to reviewing this work.

Review (Rating: 8)

This work proposes a method for multi-modal test-time adaption (TTA) in the presence of cross-modal reliability bias. While there are a number of works on test-time adaption, most of them focus on single-modality tasks and few consider practical multi-modal scenarios. In contrast, the authors investigate the characteristics of multi-modal TTA and reveal the task-specific cross-modal reliability bias setting, where the information between modalities becomes unbalanced during test time due to distribution shifts in some modalities. The authors conduct analysis experiments and find that unreliable cross-modal fusion and noisy predictions hinder robustness against cross-modal reliability bias. As a remedy, this paper proposes a novel method dubbed reliable fusion and robust adaption (RFRA). The idea of RFRA is straightforward: self-adaptive attention is employed to achieve reliable cross-modal fusion during test time, and a robust loss is adopted to prevent prediction noise from dominating the adaption process. The authors experimentally validate their method on two new benchmarks against several baselines, showing reasonable improvements.

Strengths

  1. This paper has a good motivation. The authors focus on multi-modal test-time adaption and reveal a NEW task-specific challenge (i.e., cross-modal reliability bias) for the first time. The paper first empirically shows that the existing test-time adaption methods cannot tackle the cross-modal reliability bias problem. Furthermore, the authors also investigate the effect of adopting existing unbalanced multi-modal learning methods to handle the cross-modal information discrepancy. The unbalanced multi-modal learning methods handle unbalanced multi-modal data during training time with the help of labeled data, which resembles the test-time training paradigm in the domain adaption community. The results indicate that these multi-modal learning methods do not show superiority on this challenge. In other words, the authors support the claim that designing a dedicated method for cross-modal reliability bias during test time is essential.

  2. The authors take an in-depth look at the cross-modal reliability bias challenge and find that robust cross-modal fusion and noise resistance are essential to achieve multi-modal TTA against reliability bias. To this end, the authors design a novel method dubbed reliable fusion and robust adaption (RFRA). On the one hand, RFRA achieves reliable fusion through the self-adaptive fusion module. On the other hand, RFRA employs an elaborately designed objective function to achieve noise-robust adaption. Figs. 1 and 2 clearly depict the motivation and key idea of the paper. Overall, this paper has strong motivation, a technically sound approach, extensive experiments, and good writing.

Weaknesses

  1. The authors have conducted extensive evaluations on two newly constructed benchmarks regarding the most challenging setting (severity 5 in this paper), and the results indeed verify the effectiveness of handling the cross-modal reliability bias challenge. Even so, I think some challenging TTA settings that are orthogonal to cross-modal reliability bias might help to improve the practicality and impact of this work. First, in practical multi-modal scenarios, the severity of distribution shifts might dynamically vary. It would make the work more practical if the authors could additionally investigate the robustness of the proposed method under the setting of MIXED SEVERITIES. Furthermore, it is also common that the corruption types continually vary in the wild, resulting in the demand for continual test-time adaption (CTTA). The proposed approach would be more solid and universal if it could work in the MIXED DISTRIBUTION SHIFTS setting (i.e., the continual TTA setting).

  2. Some details are missing. It is not clear how many layers of the attention module are used and repurposed during test time adaption in the proposed approach. And how many parameters do these self-attention layers account for? It is encouraged to supplement comparisons with the TTA baselines regarding the number of modulated parameters and the adaption time, which would make the comparisons more comprehensive.

  3. The results of the default setting can be added to the ablation tables for clear clarification. In the current form, the readers need to compare the main table (1,2,3) and the ablation tables for contrast.

  4. I carefully read the paper and found two potential mathematical typos: $\theta_{s}^{m}$ in Line 5, page 4, and $\partial$ in Eq. 9.

Questions

My questions are mainly in the efficiency comparisons between the proposed SAF module and the TTA baselines, and some clarification on statements. Moreover, I wonder about the effect of the proposed method on the settings of MIXED SEVERITIES and MIXED DISTRIBUTION SHIFTS. Certainly, this is optional during the rebuttal time because the settings are out of the scope of the paper, but I think the results would strengthen this work.

Comment

Q1.2: Furthermore, it is also common that the corruption types continually vary in the wild, resulting in the demand for continual test-time adaption (CTTA). The proposed approach would be more solid and universal if the proposed method could work in the MIXED DISTRIBUTION SHIFTS setting (i.e., continual TTA setting).

A1.2: Following the setting of continual TTA ([F, G]), we conduct additional experiments on the Kinetics50-C benchmark under the MIXED DISTRIBUTION SHIFTS setting, as the reviewer suggested. In this setting, both baseline methods (Tent and SAR) and our RFRA continually adapt to evolving corruption types, and the averaged performance across all corruption types is reported. To ensure comprehensive evaluations, we vary the severity levels from 1 to 5. The results are summarized in Fig. 9 within the updated manuscript. For your convenience, we attach the corresponding numerical results (accuracy) in the following tables.

**Kinetics50-C (Video Corruption, Mixed Distribution Shifts)**

| Method | Severity 1 | Severity 2 | Severity 3 | Severity 4 | Severity 5 | AVG |
| --- | --- | --- | --- | --- | --- | --- |
| Tent ([A]) | 76 | 72.1 | 67.1 | 39.5 | 38.3 | 58.6 |
| SAR ([E]) | 76.1 | 71.8 | 69.0 | 64.3 | 59.6 | 68.2 |
| RFRA (Ours) | 76.8 | 72.5 | 69.0 | 65.0 | 61.7 | 69.0 |

**Kinetics50-C (Audio Corruption, Mixed Distribution Shifts)**

| Method | Severity 1 | Severity 2 | Severity 3 | Severity 4 | Severity 5 | AVG |
| --- | --- | --- | --- | --- | --- | --- |
| Tent ([A]) | 71.2 | 70.8 | 70.0 | 69.4 | 68.5 | 70.0 |
| SAR ([E]) | 71.0 | 70.6 | 70.0 | 69.8 | 69.3 | 70.1 |
| RFRA (Ours) | 72.2 | 71.6 | 71.0 | 70.4 | 69.7 | 71.0 |

Notably, the performance of the vanilla TTA method (Tent) significantly degrades, especially under high-level mixed distribution shifts (e.g., 38.3 at Severity 5). In contrast, our RFRA showcases relative robustness against these shifts across different severity levels. It's worth highlighting that while our RFRA was primarily designed to address the challenge of multi-modal reliability bias, it demonstrates consistent performance superiority compared to SAR, an approach explicitly tailored for handling mixed distribution shifts in the continual TTA setting. This observation underscores the adaptability and resilience of RFRA beyond its initial design scope.
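A minimal sketch of the continual TTA protocol described above: the model is adapted to one corruption type after another without being reset in between, and per-type accuracy is collected. `adapt_fn` is a placeholder for running any of the compared methods over one corrupted test stream; it is not the actual evaluation script.

```python
def continual_tta(model, adapt_fn, corruption_loaders):
    """Continual TTA sketch: adapt to corruption types sequentially without
    resetting the model, and record the accuracy obtained on each type.

    `adapt_fn(model, loader)` is a placeholder that runs test-time adaptation
    over one corrupted test stream and returns the accuracy on it.
    """
    accuracies = {}
    for corruption, loader in corruption_loaders.items():
        # No model reset here: adapted parameters carry over to the next type.
        accuracies[corruption] = adapt_fn(model, loader)
    avg = sum(accuracies.values()) / len(accuracies)
    return accuracies, avg
```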

Q2.1: Some details are missing. It is not clear how many layers of the attention module are used and repurposed during test time adaption in the proposed approach. And how many parameters do these self-attention layers account for?

A2.1: We apologize for the initial oversight in providing comprehensive details. In the revised manuscript, we have supplemented the details of CAV-MAE ([H]) and implementation details of our RFRA. For your convenience, we attach the added statement as follows.

In the implementation, we use the CAV-MAE model as the backbone. CAV-MAE adopts an encoder-decoder-like architecture that is pre-trained on large-scale video data with both the contrastive learning and masked image modeling paradigms. The CAV-MAE encoder consists of 11 Transformer layers dedicated to each modality for the modality-specific feature extraction, alongside one Transformer layer for cross-modal fusion. The input to the CAV-MAE encoder involves 10-second video clips containing both video and corresponding audio data. For the video stream, CAV-MAE samples 10 frames within each video clip and randomly selects one frame feeding into the visual Transformer encoder. For the audio stream, each 10-second audio waveform is converted into one spectrogram and then input to the audio Transformer encoder.

Comment

During the fine-tuning phase, we maintain the visual and audio encoders of the pre-trained model and add one randomly initialized classification head to them. The fine-tuned model is regarded as the source model and denoted as “Source (Stat. (LN & AF))”. To investigate the robustness of different fusion manners, we design another variant of the source model that utilizes 12 Transformer layers for feature extraction and performs late fusion between the classification logit of each modality. The corresponding model variant is denoted as “Source ((Stat. LN) & LF)”. During the test-time adaption phase, unless otherwise specified, all baselines update the parameters of all normalization layers rooted in the source model, i.e., referred to as “Dyn. LN”. In contrast, as depicted in Fig. 2, our default approach in the RFRA framework involves updating only the parameters of the last Transformer layer to ensure reliable fusion, denoted as “SAF”.

Q2.2: It is encouraged to supplement comparisons with the TTA baselines regarding the number of modulated parameters and the adaption time, which would make the comparisons more comprehensive.

A2.2: Thanks for your valuable suggestion. Different from most TTA methods that update the parameters of normalization layers, our RFRA repurposes the last Transformer layer of the CAV-MAE model as elaborated in Section 2 of the manuscript. In response to your insightful suggestion, we conduct additional experiments comparing the efficiency of the two paradigms. To this end, we choose the attention-fusion-based CAV-MAE model as the source model (i.e., source (Stat. (LN & AF))), and conduct experiments on the VGGSound-C benchmark. We measure both the size of learnable parameters and the GPU time during the test-time adaptation phase. Results in the following table highlight that our RFRA accomplishes adaptation in less time. The efficiency of RFRA can be attributed to its module repurposing approach. Although the normalization layer updating scheme occupies fewer parameters, it demands more time for propagation.

| Method | #params (M) | GPU time (14,046 pairs) |
| --- | --- | --- |
| Tent (Dyn. LN) | 0.2 | 209.5 seconds |
| EATA (Dyn. LN) | 0.2 | 207.6 seconds |
| SAR (Dyn. LN) | 0.2 | 286.1 seconds |
| RFRA (SAF) | 1.8 | 134.1 seconds |

Q3: The results of the default setting can be added to the ablation tables for clear clarification. In the current form, the readers need to compare the main table (1,2,3) and the ablation tables for contrast.

A3: We apologize for the confusion arising from the initial presentation. To address this concern, we've made necessary adjustments in the revised manuscript by incorporating the default setting's results, distinctly highlighted in pink, within the ablation tables (Tables 4, 5, and 6).

Q4: I carefully read the paper and found two potential mathematical typos: $\theta_{s}^{m}$ in Line 5, page 4, and $\partial$ in Eq. 9.

A4: Thanks for your careful reading. We apologize for the typos and have revised them in the updated manuscript.

Reference:

[A] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In ICLR, 2021.

[B] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In ICML, 2020.

[C] Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Yaofo Chen, Shijian Zheng, Peilin Zhao, and Mingkui Tan. Efficient test-time model adaptation without forgetting. In ICML, 2022.

[D] Inkyu Shin, Yi-Hsuan Tsai, Bingbing Zhuang, Samuel Schulter, Buyu Liu, Sparsh Garg, In So Kweon, and Kuk-Jin Yoon. Mm-tta: multi-modal test-time adaptation for 3d semantic segmentation. In CVPR, 2022.

[E] Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Zhiquan Wen, Yaofo Chen, Peilin Zhao, and Mingkui Tan. Towards stable test-time adaptation in dynamic wild world. In ICLR, 2023.

[F] Yulu Gan, Yan Bai, Yihang Lou, Xianzheng Ma, Renrui Zhang, Nian Shi, and Lin Luo. Decorate the newcomers: Visual domain prompt for continual test time adaptation. In AAAI, 2023.

[G] Qin Wang, Olga Fink, Luc Van Gool, and Dengxin Dai. Continual test-time domain adaptation. In CVPR, 2022.

[H] Yuan Gong, Andrew Rouditchenko, Alexander H Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, and James R Glass. Contrastive audio-visual masked autoencoder. In ICLR, 2023.

Comment

Thanks for your constructive reviews and suggestions. In the following, we will answer your questions one by one.

Q1.1: I think some challenging TTA settings that are orthogonal to cross-modal reliability bias might help to improve the practicality and impact of this work. First, in practical multi-modal scenarios, the severity of distribution shifts might dynamically vary. It would make the work more practical if the authors could additionally investigate the robustness of the proposed method under the setting of MIXED SEVERITIES.

A1.1: In this paper, we reveal the multi-modal reliability bias challenge for the TTA community and propose a reliable fusion and robust adaption (RFRA) approach to tackle the issue. Following the widely used TTA evaluation protocol ([A, B, C, D]), we introduce different corruption types to either the video or audio modality. In the main paper, we reported the performance under different corruption types with a severity level of 5.

In response to your insightful suggestion, we conduct additional experiments on the Kinetics50-C benchmark, investigating the robustness of the proposed method under the setting of MIXED SEVERITY. To this end, we create test pairs for each corruption type by blending severity levels 1 through 5, resulting in 5N test pairs, where N represents the original size of the test data. After that, we shuffle the obtained test pairs and randomly choose N pairs for each corruption type. To verify the effectiveness of RFRA under the MIXED SEVERITY setting, we choose the typical TTA method Tent ([A]) and the SOTA TTA method SAR ([E]) as baselines for comparisons. The results are depicted in Fig. 8 within the revised manuscript. For your convenience, we attach the corresponding numerical result (regarding accuracy) in the following tables.

**Kinetics50-C (Video Corruption, Mixed Severity)**

| Method | Gauss. | Shot | Impul. | Defoc. | Glass | Mot. | Zoom | Snow | Frost | Fog | Brit. | Digital | Contr. | Pix. | JPEG | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tent ([A]) | 58.4 | 58.3 | 57.0 | 73.4 | 71.3 | 76.0 | 71.7 | 67.1 | 68.3 | 63.7 | 79.5 | 69.4 | 74.9 | 76.5 | 72.9 | 69.2 |
| SAR ([E]) | 58.6 | 58.2 | 56.8 | 73.1 | 70.1 | 75.6 | 71.1 | 66.4 | 67.1 | 63.8 | 79.4 | 69.2 | 74.5 | 76.5 | 72.3 | 68.8 |
| RFRA (Ours) | 59.4 | 59.3 | 57.9 | 73.5 | 72.1 | 76.2 | 72.6 | 68.2 | 70.3 | 68.7 | 79.4 | 69.7 | 75.1 | 76.7 | 73.7 | 70.2 |

**Kinetics50-C (Audio Corruption, Mixed Severity)**

| Method | Gauss. | Traff. | Crowd. | Rain | Thund. | Wind | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Tent ([A]) | 76.1 | 68.9 | 70.3 | 70.4 | 68.0 | 72.5 | 71.0 |
| SAR ([E]) | 76.1 | 66.3 | 69.1 | 69.1 | 68.5 | 72.2 | 70.2 |
| RFRA (Ours) | 76.5 | 71.2 | 71.2 | 72.2 | 72.6 | 72.5 | 72.7 |

The results indicate the effectiveness of RFRA in addressing cross-modal reliability bias across various corruption types exhibiting mixed severity levels.
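For clarity, the MIXED SEVERITY construction described in A1.1 above can be sketched as follows; `test_pairs_by_severity` is a hypothetical container mapping each severity level to its N corrupted test pairs for one corruption type, and the sketch is illustrative rather than the actual benchmark code.

```python
import random

def build_mixed_severity_split(test_pairs_by_severity, n, seed=0):
    """Construct the MIXED SEVERITY test set for one corruption type.

    `test_pairs_by_severity` maps severity level (1-5) to the list of N
    corrupted test pairs at that level. The pools are merged (5N pairs),
    shuffled, and N pairs are drawn, matching the construction described
    in A1.1 above.
    """
    pool = [pair for pairs in test_pairs_by_severity.values() for pair in pairs]
    random.Random(seed).shuffle(pool)
    return pool[:n]
```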

AC Meta-Review

This paper presents a new approach for addressing reliability bias in multi-modal test-time adaptation. The research problem is well motivated, and the proposed method is technically sound. The paper also introduces two new benchmark datasets, and experimental results are extensive and convincing. Reviewers raised some concerns about technical details, experimental results, ablation studies, and paper writing, which have been adequately addressed in the authors' responses. In addition, some descriptions in the paper are unclear, such as: "Comparisons between different fusion manners", "The blocks of the top left and bottom right denote", and "Pink denote the default setting." The authors are encouraged to proofread the paper carefully.

Why not a higher score

Reviewers raised some comments on technical details and experimental details, which should be carefully addressed in the final version.

Why not a lower score

This paper studies a new problem, i.e., reliability bias for multi-modal test-time adaptation. The paper is well written.

Final Decision

Accept (poster)