PaperHub

Rating: 4.2 / 10 (withdrawn; 5 reviewers; lowest 3, highest 5, standard deviation 1.0)
Individual ratings: 5, 3, 3, 5, 5
Confidence: 3.0
ICLR 2024

Robust Video Perception by Seeing Motion

OpenReview · PDF
Submitted: 2023-09-17 · Updated: 2024-03-26
TL;DR

We show that enforcing motion consistency at test time can improve video perception models' robustness.

Abstract

Keywords
Action Recognition, Robustness

Reviews and Discussion

Review
Rating: 5

The core idea of the paper is to use motion as a temporal constraint to improve the robustness of video perception tasks. The paper presents a test-time optimization algorithm based on this temporal constraint, which is then evaluated on UCF101 and HMDB-51 on a video classification task.

Strengths

  1. Using motion as a constraint for video defense is intuitive and effective for certain types of attacks.
  2. The proposed test-time optimization is intriguing.
  3. The analysis section is resourceful.
  4. The paper is easy to follow.

Weaknesses

  1. The main concern is that the paper's assumption may be too strong. The authors assume that all attacks will destroy the optical flow and that the flows can be recovered from the existing information. In Figure 5, the authors demonstrate how motion consistency differs among clean, attacked, and defended videos. However, this may not hold in general, for example under the rectangular occlusion attack (ROA) mentioned in [1]. Many other common types of video attacks are not discussed in the paper.
  2. I understand that the proposed method is a test-time optimization method, but that cannot be an excuse for not comparing against related existing methods in the paper.
  3. Necessary ablations are missing. There are hyperparameters such as the step size, K, and the bounds, but the authors do not discuss how sensitive the framework is to these hyperparameters.

[1] Wu, T., Tong, L., & Vorobeychik, Y. (2019). Defending against physically realizable attacks on image classification. arXiv preprint arXiv:1909.09552.

Questions

  1. How long does the test-time optimization take to defend against each type of attack?
  2. During the experiments, did the authors ever encounter certain types of attacks that were not well suited to the proposed defense method?

Details of Ethics Concerns

N/A

Review
Rating: 3

The authors propose a method to help video perception by fixing the potential corruption of video signals. The fix relies on the shifts reflected in the motion flows. The idea is to improve video perception by adopting test-time video repair. The proposed method is claimed to be effective for both corrupted and (under certain assumptions) attacked videos.

Strengths

  • The proposed method is supported by a very intuitive idea and a straightforward implementation that is easy to follow.
  • The writing is generally easy to follow. The technical section can be quickly understood even by someone who is not actively working in this domain.
  • The experiments provide empirical evidence to support the effectiveness of the proposed method under different protocols and on different datasets.

Weaknesses

  • The experiments lack comparison with other existing solutions in this line of work, which makes it hard to assess the significance of the method.
  • The proposed method relies on multiple constraint terms, such as smoothness, brightness, etc. However, all of these constraints have long been studied in previous works. I therefore have little confidence in the novelty of the proposed method.
  • In the related works, the authors discuss motion estimation methods, "Robustify Machine Learning Models" (I believe "robustify" is a typo here), and "Adversarial Attacks and Defenses on Videos". However, for someone not familiar with the domain, it would be expected to discuss whether the techniques used in the method, such as the consistency terms and the idea of "repair then perceive", have already been adopted in related works. I would expect the authors to make a clearer claim about, and elaboration of, the novelty of the proposed method.

Questions

Overall, my concern is about the novelty of the proposed method and its empirical evidence. I would expect more elaboration on the novelty and a comparison with recently published related works to help calibrate the experimental significance.

Review
Rating: 3

This paper proposes a motion-based method for improving the robustness of video classification models to adversarial perturbations and natural corruptions such as noise or challenging weather conditions. The base model first computes optical flow and then classifies the flow field using an established architecture. As a defense, the authors optimize a small change to the input images in order to minimize the warping loss used for training unsupervised optical flow estimation methods. The video is then classified from the adapted images with the original model. This test-time optimization is shown to improve the robustness of the model against several attack strategies.
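
For concreteness, the kind of test-time repair described above could look like the following minimal sketch (assuming a PyTorch setup; `flow_model`, `classifier`, the flow channel convention, and the hyperparameter values are illustrative assumptions, not the authors' implementation):

```python
# Hypothetical sketch: perturb the input frames to minimize a photometric
# warping loss, then classify the repaired clip with the unchanged model.
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp `frame` (B, C, H, W) with `flow` (B, 2, H, W), channels = (dx, dy) in pixels."""
    _, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(frame.device)      # (2, H, W) pixel coordinates
    coords = base.unsqueeze(0) + flow                                  # target sampling locations
    # grid_sample expects a (B, H, W, 2) grid normalized to [-1, 1], x before y.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(frame, torch.stack((gx, gy), dim=-1), align_corners=True)

def defend_and_classify(frames, flow_model, classifier,
                        steps=10, step_size=1 / 255, bound=8 / 255):
    """frames: (T, B, C, H, W) clip in [0, 1]; flow_model must be differentiable."""
    delta = torch.zeros_like(frames, requires_grad=True)
    for _ in range(steps):
        repaired = (frames + delta).clamp(0, 1)
        loss = 0.0
        for t in range(frames.shape[0] - 1):
            flow = flow_model(repaired[t], repaired[t + 1])            # flow from frame t to t+1
            loss = loss + (repaired[t] - warp(repaired[t + 1], flow)).abs().mean()
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()                     # descend the warping loss
            delta.clamp_(-bound, bound)                                # keep the repair small
            delta.grad.zero_()
    return classifier((frames + delta).clamp(0, 1).detach())
```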

Strengths

  • Robustness to noise and adversarial perturbations is a very relevant topic. Explicitly using the motion in videos to improve robustness is a plausible and interesting approach.
  • The paper is well written and therefore easy to follow.

Weaknesses

In my view, the main weakness of the paper is the empirical evaluation of the adversarial robustness of the model. However, I do not have a background in adversarial robustness and would therefore be interested in the other reviewers' opinions on these points.

  • The evaluation in Section 4.3 seems irrelevant to me. The adversarial perturbation is optimized against the original, undefended model but tested on the adapted, defended model. From high performance on these stimuli we learn nothing about the robustness of the adapted model.
  • In Section 4.4, attacks against the adapted model are considered.
    • The first two attacks use knowledge of the test-time loss in the optimization of the adversarial perturbation. The additional constraints, however, seem limiting. For example, an adversarial perturbation that hardly changes the optical flow estimate, as optimized in "Adaptive Attack 2", will clearly be problematic for the defense. However, adversarial perturbations with this property are only a subset of all adversarial perturbations. There might be strong adversarial perturbations that cannot be found with this approach.
    • Using gradient approximation as in Adaptive Attack 3 is a much stronger approach (see the sketch after this list). Consequently, much stronger perturbations are found, against which only a small fraction of the performance is restored by the defense (Table 5, last row). As mentioned by the authors, the better performance of the multiple-constraints defense seems to stem mainly from the computational cost of the attacks.
  • Only gradient-based attacks are considered in this paper. Since gradient-based attacks are computationally difficult for a defense relying on test-time optimization, it would be interesting to also consider a gradient-free attack (cf. Carlini et al. 2019).
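
One common form of gradient approximation is a BPDA-style straight-through attack, sketched below (this is a generic illustration, not necessarily the paper's "Adaptive Attack 3"; `defend`, `classifier`, and the hyperparameters are hypothetical):

```python
# Hypothetical BPDA-style attack: run the defense in the forward pass but
# treat it as the identity when backpropagating to the input.
import torch
import torch.nn.functional as F

def bpda_attack(frames, label, defend, classifier,
                steps=20, step_size=2 / 255, eps=8 / 255):
    """frames: clean clip in [0, 1]; returns an adversarial copy within an L-inf ball of radius eps."""
    adv = frames.clone()
    for _ in range(steps):
        # Forward through the (possibly non-differentiable) defense, then take
        # gradients only with respect to its output.
        repaired = defend(adv).detach().requires_grad_(True)
        loss = F.cross_entropy(classifier(repaired), label)
        loss.backward()
        with torch.no_grad():
            adv = adv + step_size * repaired.grad.sign()               # untargeted: increase the loss
            adv = torch.min(torch.max(adv, frames - eps), frames + eps).clamp(0, 1)
    return adv
```

Because each attack step reruns the full test-time optimization inside `defend`, such attacks are expensive, which is consistent with the computational-cost caveat above.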

In summary, I am not convinced that the adapted model is substantially more adversarially robust than the base model. The good results reported for many attacks rather seem to be due to the computational cost of optimizing adversarial perturbations caused by the use of test-time optimization.

Questions

  • What is the clean performance for the results reported in Table 1? Is it 86.6, as reported in Table 2? If so, the defense seems to restore model performance only to a small degree. Why is this the case?

Review
Rating: 5

This paper proposes a solution based on recovering motion information to address the issues of distribution shift and adversarial attacks in computer vision models, thereby enhancing the model's accuracy at test time. From the perspective of motion-information consistency, the paper designs multiple constraints to recover motion information and validates the method on two datasets. Experimental results show no significant improvement under noisy conditions but demonstrate satisfactory results under adversarial attacks.

Strengths

  • This paper conducts experiments using various types of adversarial attacks, and the proposed method shows clear improvements.
  • The design of the adversarial attacks is innovative. The paper also discusses the impact of adaptive adversarial attacks, and even under strong adaptive attacks, the proposed method still performs well.
  • The proposed method is relatively simple and can be applied at the testing phase. It can be used in conjunction with other methods to enhance model robustness. Depending on the need, one can choose whether or not to use this method.

Weaknesses

  • This paper investigates the robustness of video action recognition, but it lacks a comparison with test-time adaptation (TTA) methods, such as [A-B]. These TTA methods also aim to adapt to out-of-distribution data when the input is disturbed by noise. Although these TTA methods mainly focus on updating model parameters while this paper primarily focuses on adjusting the input data, how can it be shown that processing the data is superior to adjusting the model parameters? I believe a comparison based on experimental results should be made.
  • Under noisy conditions, many TTA methods can achieve desirable results, while the improvement brought by this paper's method is relatively small.
  • In Appendix A.2.1, under noisy conditions, the average performance improvement brought by this paper's method is very small and can even be counterproductive under certain noise conditions. Does this indicate a problem with the approach of changing the input data?
  • How can the reliability of the long-range photometric consistency in Section 3.3 be verified (one plausible formulation is sketched after this list)? Are there any ablation results reflecting the performance gain brought by each part?
  • The explanation of the formulas of Algorithm 1 in the main body is not clear enough.
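
For context on the long-range consistency question above, one plausible shape of such a term (a hypothetical formulation, not necessarily the paper's Section 3.3 loss) compares each frame against frames several steps away, warped back by the corresponding flow:

```latex
% Illustrative long-range photometric consistency term (an assumption, not the paper's exact loss):
% I_t is frame t, F_{t -> t+k} the estimated flow from t to t+k,
% W a backward-warping operator, and rho a robust penalty.
\mathcal{L}_{\text{long}} = \sum_{t} \sum_{k=2}^{K}
  \rho\!\left( I_t - \mathcal{W}\!\left( I_{t+k},\, F_{t \to t+k} \right) \right)
```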

[A] Temporal Coherent Test-Time Optimization for Robust Video Classification. ICLR 2023.
[B] Video Test-Time Adaptation for Action Recognition. CVPR 2023.

Questions

  • What does ω represent in Equation 3? Does it need to be obtained through additional training?
  • How is the ground-truth flow in Figure 3 obtained? Is it supervised? This is not clearly described in the paper.
  • In Appendix A.3.1, the method is compared with the RAFT method (2020), which is outdated. Why not compare with newer methods?

Review
Rating: 5

In this paper, the authors find that natural corruptions and adversarial attacks harm both video classifiers and motion estimation. They therefore try to improve the model's robustness with a test-time constraint that uses motion information. Experiments on UCF101 and HMDB-51 demonstrate its effectiveness.

Strengths

  • The paper is well-written and organized, with clear figures and tables.
  • The logic is clear and easy to follow.
  • As the authors claim, this is the first inference-time defense for videos that uses motion consistency to improve robustness.

Weaknesses

As I'm not familiar with this research topic, I may not give a fair review:

  1. What does "standard" refer to in the different tables?
  2. What about the time cost? Is it better than those methods that require training?
  3. Is this method suitable for videos with fast motion?
  4. How does this method extend to general action recognition, given that most current video backbones only use RGB frames?

Questions

  • The reference style in the main paper may be wrong: "Mao et al. (2020)" should be "(Mao et al., 2020)" for ICLR.
  • The caption of Table 4 is wrong.