Full-Distance Evasion of Pedestrian Detectors in the Physical World
Abstract
Reviews and Discussion
This paper is dedicated to improving the field of physical adversarial attacks on pedestrian detection, focusing on the robustness of attack performance over distance. To bridge the appearance gap caused by distance between the digital and physical spaces, the authors propose a distant image converter (DIC). To address the inconsistency of features required for short and long distances, the authors propose multi-frequency optimization (MFO). The authors tested the attack performance in the physical world at distances ranging from 4 to 40 meters. Experiments show that the proposed DIC and MFO can improve the attack success rate in scenarios with varying distances.
Strengths
- The authors focused on a key factor affecting the performance of physical adversarial attacks: distance. This is an important exploration.
- The authors conducted ablation experiments and comprehensive evaluations of the DIC and MFO modules, making the method's design convincing.
- The DIC designed by the authors considers atmospheric conditions, camera effects, and effect filters, which is a reasonable and thorough modeling approach.
- The method proposed in this paper has been validated through extensive experiments in the physical world, and the authors provided a quantitative evaluation by collecting data from the physical world.
Weaknesses
- From the videos provided in the supplementary materials of this paper, the performance of the attack does not seem to be as good as claimed in the paper. For example, in 8m.mp4 and 14m.mp4, in most frames, the model can detect pedestrian instances with adversarial patches, indicating that the attack failed. In the quantitative evaluation of the paper, the ASR at a distance of 14 meters reached over 50%. This inconsistency undermines the effectiveness of the proposed method.
- This paper lacks a comparison with other popular methods, such as AdvPatch [1], Adv-Tshirt [2], NAP [3], and T-SEA [4].
- The DIC proposed in this paper is physics-based modeling. Such modeling generally involves non-differentiable operations. The authors used SGD during the training of DIC, and gradient propagation is also required when updating the patch. How did the authors address this issue?
- From the code provided in the supplementary materials, it can be seen that the method proposed in this paper appears to be related to YOLOv2. However, YOLOv2 is not mentioned in the paper. Please explain this issue.
- Unclear description: It's not clear what "D" refers to on line 205.
- Some writing issues: on line 32, "In DIC, We find...". Some abbreviations are not explained the first time they appear, such as "DNN" on line 18 and "FDA" on line 39.
- Lack of important references, such as [4, 5].
[1] Thys S, Van Ranst W, Goedemé T. Fooling automated surveillance cameras: adversarial patches to attack person detection[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops. 2019: 0-0.
[2] Xu K, Zhang G, Liu S, et al. Adversarial t-shirt! evading person detectors in a physical world[C]//Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16. Springer International Publishing, 2020: 665-681.
[3] Hu Y C T, Kung B H, Tan D S, et al. Naturalistic physical adversarial patch for object detectors[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 7848-7857.
[4] Huang H, Chen Z, Chen H, et al. T-sea: Transfer-based self-ensemble attack on object detection[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023: 20514-20523.
[5] Wei H, Tang H, Jia X, et al. Physical adversarial attack meets computer vision: A decade survey[J]. arXiv preprint arXiv:2209.15179, 2022.
Questions
Please address the points raised in Weaknesses.
Limitations
The authors discussed limitations and societal impact.
Question: From the videos provided in the supplementary materials of this paper, the performance of the attack does not seem to be as good as claimed in the paper at 8m and 14m. Answer: We have extracted all frames in the 8-meter and 14-meter demonstration videos and evaluated the ASRs by counting. At 8 meters, the attack was successful in 54 out of 132 frames, resulting in an ASR of 41%. At 14 meters, the attack succeeded in 111 out of 171 frames, yielding an ASR of 65%. Both results are comparable to those reported in Figure 8(a). Please note that, due to persistence of vision, the bounding boxes linger in a viewer's perception longer than they actually appear in the videos. This may explain why the subject holding the patch seemed to be detected in most frames of the 14-meter demonstration video.
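For concreteness, here is a minimal sketch of the frame-counting evaluation described above. The `detect_persons` callback, the variable names, and the IoU > 0.5 success criterion are illustrative assumptions on our part, not the authors' released evaluation code:

```python
# Hypothetical per-frame ASR count; detect_persons(frame) stands in for
# any person detector that returns bounding boxes as (x1, y1, x2, y2).

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def asr(frames, gt_boxes, detect_persons, iou_thresh=0.5):
    """Fraction of frames in which the subject evades detection: a frame
    counts as a success if no predicted box matches the ground truth."""
    successes = sum(
        all(iou(box, gt) <= iou_thresh for box in detect_persons(frame))
        for frame, gt in zip(frames, gt_boxes)
    )
    return successes / len(frames)
```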
Question: This paper lacks a comparison with other popular methods, such as AdvPatch, Adv-Tshirt, NAP, and T-SEA. Answer: Please note that in Figure 8(a), we conducted a head-to-head comparison between FDA and Adv-Tshirt. In Figure 9(a), we compared FDA with TCA [r1], which is also a commonly used method for full-angle clothing attacks. Following your suggestion, we obtained the AdvPatch, NAP, and T-SEA patterns targeting YOLOV5 by either using the patterns provided in the respective papers or running the released code. These patterns were evaluated following the physical-world patch attack setting described in Section 5.2, with results averaged over three different pedestrians. The AdvPatch, NAP, and T-SEA patterns achieved average ASRs of 19%, 19% and 42%, respectively, all lower than the 75% average ASR achieved by the FDA pattern. We will include these results in the final paper.
Question: The DIC proposed in this paper is physics-based modeling. Such modeling generally involves non-differentiable operations. How did the authors propagate gradients through the DIC? Answer: In our work, we implemented all three DIC modules with differentiable computations. The atmospheric perspective module is a differentiable function (Equation 1, line 119). The camera simulation module is constructed from two convolutional layers (Equation 6, line 170). The style filter simulation module comprises a sequence of differentiable functions that simulate various style filters (Equation 7, line 175). If there are any details that require clarification, please feel free to discuss them with us during the discussion period.
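To illustrate that such physics-based modules can indeed stay differentiable end to end, here is a minimal PyTorch sketch. It is not the authors' implementation: the extinction coefficient, airlight, and color-adjustment constants are placeholders, and the exact forms of the paper's Equations 1, 6 and 7 differ.

```python
import torch
import torch.nn as nn

class DifferentiableDIC(nn.Module):
    """Illustrative stand-in for the three DIC modules; every operation
    below is differentiable, so gradients flow back to the patch."""

    def __init__(self):
        super().__init__()
        # Camera simulation as two convolutional layers (cf. Equation 6).
        self.camera = nn.Sequential(
            nn.Conv2d(3, 3, kernel_size=5, padding=2),
            nn.Conv2d(3, 3, kernel_size=5, padding=2),
        )

    def forward(self, img, distance):
        # Atmospheric perspective (cf. Equation 1): a standard scattering
        # model with illustrative extinction coefficient and airlight.
        beta, airlight = 0.01, 0.8
        t = torch.exp(-beta * distance)
        img = img * t + airlight * (1.0 - t)
        # Camera simulation (blur / sensor response) via the conv layers.
        img = self.camera(img)
        # Style filter simulation (cf. Equation 7): a chain of simple
        # differentiable color adjustments (contrast and brightness here).
        img = torch.clamp(1.1 * (img - 0.5) + 0.5 + 0.02, 0.0, 1.0)
        return img

# Because every step is differentiable, the patch receives gradients:
patch = torch.rand(1, 3, 64, 64, requires_grad=True)
out = DifferentiableDIC()(patch, distance=torch.tensor(20.0))
out.mean().backward()
assert patch.grad is not None
```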
Question: Please explain the YOLOV2-related code in the supplementary material. Answer: We built our code on the TCA [r1] codebase, which treated the YOLOV2 detector as the main target model. We forgot to remove some YOLOV2-related code imported by the TCA authors, of which we only leveraged some general utility functions. Please note that since YOLOV2 is now commonly considered outdated, we have not performed any experiments on that model. A tidier version of the code will be included in the final paper.
Question: Unclear description: It's not clear what "D" refers to on line 205. Answer: D refers to the number of distances optimized. We will clarify it in the final version.
Question: Some writing issues: on line 32, "In DIC, We find...". Some abbreviations are not explained the first time they appear, such as "DNN" on line 18 and "FDA" on line 39. Answer: Thank you for your careful reading. We will correct them in the final version.
Question: Lack of important references, such as [4, 5]. Answer: Thank you for highlighting the two interesting pieces of research. We will discuss our work in respect to the two papers in the final version.
[r1] Hu, Z., Huang, S., Zhu, X., Sun, F., Zhang, B., & Hu, X. (2022). Adversarial Texture for Fooling Person Detectors in the Physical World. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 13307-13316).
Dear reviewer ouf1, have our answers in our rebuttal properly addressed your concerns? If there are any follow-up questions, please feel free to discuss them with us during this discussion period.
Thank you for your response. My concerns have been addressed. I will maintain my current rating in support of accepting this paper.
Dear Reviewer,
The authors have provided a rebuttal. Can you please provide your feedback after reading the rebuttal? The deadline is approaching fast.
Thanks, AC
To achieve full-distance attacks, the authors identify three factors that distort attack performance over distance: atmospheric conditions, camera hardware, and effect filters. The authors then simulate these factors in the digital world via the DIC. To overcome the conflict that different distances require different low-frequency patterns, the authors further propose a Multi-Frequency Optimization (MFO) technique.
Strengths
- The summarized factors for full-distance attack are precise and important, and the DIC simulation for these factors conducted by the authors looks reasonable.
- The victim detectors used in this paper are advanced, which better illustrates the effectiveness of the proposed method.
Weaknesses
- The best-performing approach in reference [1] is TC-EGA, while TCA is the suboptimal one. Why does this paper use TCA as a comparison method instead of TC-EGA? Additionally, why does this paper not adopt the TC-EGA approach to generate an expandable FDA, which is supposed to yield better results?
- In Fig. 9, I notice that the FDA pattern looks like many abstract human heads stacked together, and the detector might also perceive it as a combination of multiple small human heads. Considering that the IoU threshold for calculating ASR in this paper is only IoU > 0.5, these small human heads do not meet this condition and are therefore directly filtered out. However, these heads can indeed impact the attack's effectiveness. Namely, if the heads are detected, the attack should also be considered a failure. Therefore, using a single IoU threshold to measure ASR may not accurately reflect the attack's effectiveness. I hope the authors can address this issue, for instance, by providing ASR results at different IoU thresholds.
- The pattern of FDA appears very different with and without TC. Does this imply that FDA has a considerable number of different local optima during optimization?
- Compared to ASR, mASR (mean ASR with different confidence scores) [1] can more comprehensively reflect the attack's effectiveness.
- Providing experimental comparison results in the digital world can better illustrate the simulation effectiveness of the proposed method during training.
[1] Hu Z, Huang S, Zhu X, et al. Adversarial texture for fooling person detectors in the physical world[C]//CVPR 2022.
Questions
Please refer to the Weaknesses.
Limitations
The authors didn't discuss limitations in the paper, but they explained their reasoning in the checklist, which sounds acceptable.
Question: Why does this paper not adopt the TC-EGA approach as a baseline and generate an Expandable FDA? Answer: Unlike the adversarial texture paper [1] which used the YOLOV2 and V3 models as the target models, we used YOLOV5 as our main target model. Table S3 of the NAPGuard Appendix [4] shows that TC-EGA performed poorly against YOLOV5, as it was only able to reduce the model’s AP from 100% to 65%. We confirmed this in our own experiments, making TC-EGA unsuitable as our baseline. In contrast, TCA obtained a good AP of 36% on YOLOV5 and it also supports full-angle clothing attacks, so we used TCA as our baseline method.
Question: The FDA pattern looks like many abstract human heads stacked together, which might cause small detection boxes to show up. So, using a single IoU threshold to measure ASR may not accurately reflect the attack's effectiveness.
Answer:
First of all, please note that it is a convention to use an IoU threshold of around 0.5 in physical attack evaluations [1,5], since in the literature of detectors, if a predicted bounding box has a small IoU relative to the ground truth, it is conventionally considered a false positive. We had to follow this convention to reproduce existing results and perform comparisons with the baselines.
Second, please note that with the TC (toroidal cropping) design, depending on the target model used, it is common for abstract human-like patterns to show up in adversarial patterns. For example, in Figure S1 (d), (f), (g) and (h) of the TCA paper's Appendix [1], the adversarial patterns targeting YOLO, Faster R-CNN and Mask R-CNN are formed by abstract human heads and trunks stacked together. In contrast, when we changed the target model to Deformable DETR or RetinaNet with PVT backbone, no human-like patterns showed up (Figs. 1 and 2 of the uploaded rebuttal PDF).
So we found that, when we used YOLOV5 as the target model, due to the presence of abstract human heads, as the IoU threshold decreased from 0.5 to 0.4, 0.3, 0.2, 0.1 and 0.0, more and more small detection boxes corresponding to the abstract human heads were no longer filtered out, causing the average ASRs to decrease from 73% to 65%, 40%, 39% and 38%, respectively (the average ASR of normal clothing was 0% regardless of the IoU threshold used). Please note that even in the most extreme case of an IoU threshold of 0, where the attack would be considered a failure if any small bounding boxes showed up on the subject due to the abstract human heads, the FDA pattern still obtained an average ASR of 38%. Additionally, on detectors where abstract human heads did not appear in the optimized adversarial pattern, the average ASR did not drop significantly as the IoU threshold decreased. That is, when treating Deformable DETR as the target model, at IoU thresholds of 0.5, 0.4, 0.3, 0.2, 0.1 and 0.0, the average ASRs were 71%, 70%, 68%, 68% and 68%, respectively. Similarly, when treating RetinaNet with PVT backbone as the target model, the average ASRs were 73%, 73%, 72%, 70% and 67%, respectively.
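Under the assumptions of the earlier sketch, this threshold sweep amounts to re-running the hypothetical `asr()` helper at each IoU cutoff:

```python
# Hypothetical IoU-threshold sweep, reusing the asr() helper and the
# placeholder frames / gt_boxes / detect_persons from the earlier sketch.
for t in (0.5, 0.4, 0.3, 0.2, 0.1, 0.0):
    print(f"IoU > {t:.1f}: ASR = {asr(frames, gt_boxes, detect_persons, iou_thresh=t):.0%}")
```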
Question: The pattern of FDA appears very different with and without TC. Does this imply that FDA has a considerable number of different local optima during optimization? Answer: Please note that with and without TC, the feasible region of optimization is different (TC imposes a constraint on the feasible region), so it is quite normal to obtain different adversarial patterns in the two situations. It implies that multiple points in the feasible region of the unconstrained FDA optimization problem are effective in attacking the detectors. So, your hypothesis might be correct, but it requires additional analysis to confirm.
Question: Compared to ASR, mASR can more comprehensively reflect the attack's effectiveness. Answer: Following the TCA paper [1], we calculated the mean average ASRs across confidence thresholds of 0.1, 0.2, …, 0.9. The mean average ASR of the FDA clothing and TCA clothing was 78% and 43%, respectively.
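As a sketch of this convention (again reusing the hypothetical `asr()` helper; `detect_persons_at` is a placeholder for running the detector with a given confidence cutoff, not the authors' code):

```python
# Hypothetical mASR: average the ASR over detector confidence thresholds
# 0.1, 0.2, ..., 0.9, following the TCA convention cited above.
conf_thresholds = [k / 10 for k in range(1, 10)]
masr = sum(
    asr(frames, gt_boxes, lambda f, c=c: detect_persons_at(f, conf=c))
    for c in conf_thresholds
) / len(conf_thresholds)
```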
Question: Providing experimental comparison results in the digital world can better illustrate the simulation effectiveness of the proposed method. Answer: For the result in Figure 8(a), the corresponding digital-world ASRs of the FDA pattern at different distances (4m, 8m, 14m, 20m, 26m, 34m and 40m) were 79%, 53%, 76%, 88%, 66%, 64%, 88%, which resulted in an average ASR of 73%. The digital-world ASRs of the Adv-Tshirt pattern at different distances were 83%, 66%, 17%, 8%, 3%, 8%, 13%, which resulted in an average ASR of 28%. We observed that the physical-world attack results were consistent with the digital-world attack results. We will include this result in the final paper.
[1] Hu, Z., Huang, S., Zhu, X., Sun, F., Zhang, B., & Hu, X. (2022). Adversarial Texture for Fooling Person Detectors in the Physical World. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 13307-13316).
[2] Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., & Shao, L. (2021). Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (pp. 568-578).
[3] Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2021). Deformable DETR: Deformable Transformers for End-to-End Object Detection. In International Conference on Learning Representations.
[4] Wu, S., Wang, J., Zhao, J., Wang, Y., & Liu, X. (2024). NAPGuard: Towards Detecting Naturalistic Adversarial Patches. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 24367-24376).
[5] Zhu, X., Hu, Z., Huang, S., Li, J., & Hu, X. (2022). Infrared Invisible Clothing: Hiding From Infrared Detectors at Multiple Angles in Real World. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) .
Thanks for the response. My Weaknesses 1, 3, and 4 are adequately addressed by the authors. For Weakness 5, I think AP may be a more precise and commonly used metric in the digital world, though the ASR results are also acceptable.
I still have questions about W2.
- Firstly, I think FDA is much more similar to human heads compared with TCA, considering the FDA pattern even looks like a human head with two eyes.
- Secondly, I wonder how the ASR could remain 38% when IoU = 0; does that mean the network sometimes may not consider them as human heads like we perceive?
- Thirdly, in real-world scenarios, IoU cannot be calculated since we don't have the ground truth. The detector only cares about whether a bbox of the human class exists. Therefore, I think poor performance at low IoU would essentially undermine the effectiveness of FDA in real-world applications.
We sincerely thank you for your effort in engaging with our rebuttal. We will include the corresponding digital-world APs at different distances in the final version.
Regarding the three new questions, our answers are as follows.
Question: I think FDA is much more similar to human heads compared with TCA, considering FDA even looks like the human head has two eyes. Answer: If you zoom in on Figure S1 (g) and (h) of TCA paper’s appendix [1], similar abstract human head patterns with two eyes can also be identified.
Question: I wonder how the ASR could remain 38% when IoU = 0; does that mean the network sometimes may not consider them as human heads like we perceive? Answer: Yes, an ASR of 38% remained when the IoU threshold was 0, since the network sometimes may not consider the patterns to be human heads. Please note that those human-head-like patterns are abstract.
Question: Thirdly, in real-world scenarios, IoU cannot be calculated since we don't have the ground truth. The detector only cares about whether a bounding box of the human class exists. Therefore, I think poor performance at low IoU would essentially undermine the effectiveness of FDA in real-world applications. Answer: Thank you for stressing the importance of using a low IoU threshold. Although we used an IoU threshold of around 0.5 following existing works, from our discussion we agree that a low IoU threshold should be used to better demonstrate attack performance. In the final version, we will include results at IoU thresholds of 0 to 0.5. Regarding the question, we wish to bring two points to your attention. (1) Although the FDA pattern targeting YOLOV5 had a lowered ASR of 38% when the IoU threshold decreased to 0, on Deformable DETR [3] and RetinaNet with PVT backbone [2] the average ASRs of the FDA patterns were 68% and 67%, which we believe is strong evidence that when the "abstract human heads" are not present in the adversarial pattern, the FDA attack can be very effective even on this harder metric. We will include the corresponding results in the final version. (2) When the target model was YOLOV5 and the IoU threshold was 0, the average ASRs of FDA clothing, TCA clothing, random clothing and normal clothing were 38%, 8%, 2% and 0%, respectively. Considering that the goal of our work is to find a method that boosts the full-distance attack performance of existing methods under identical settings, we find our method to have fulfilled this goal.
Thanks for the in-depth analysis in your response. I wish the analysis from the rebuttal could be properly incorporated into the camera-ready version for better comprehension. Despite the human-heads problems in this paper are still severe from my perspective, I do agree that it is a universal problem in the community that needs to be researched in the future. I will raise my final rating.
Thank you for your effort in reviewing our paper and providing us with feedback. We will try our best to properly incorporate our analysis from the rebuttal into the final version of our paper.
This paper presents an adversarial attack that fools object detectors in the real world at different distances. The method utilizes several advanced techniques, such as atmospheric perspective, camera and filter simulations, and multi-frequency optimization, to bring the attack closer to real-world scenarios. The method is evaluated against YOLOV5, Mask R-CNN, and Deformable DETR.
Strengths
- Easy-to-follow writing with vivid illustrative figures, a clear background section, and clear text throughout. It really helps to grasp the ideas.
- The change in image appearance across distances is indeed a problem for fooling deep detectors. This paper leverages principles of physics and camera hardware design to carry patch-based digital attacks, which usually do not transfer to the physical world, into real-world scenarios.
- The method is assessed across different kinds of detectors and seems to work at different distances most of the time. Moreover, the crafted clothes transfer across networks, enabling black-box attacks. Several ablation studies in the appendix are also interesting.
Weaknesses
- The adversarial patches/clothes are really massive.
- At some distances (5-10 m), Adv-Tshirt performs better than the proposed FDA.
Questions
- Did you consider variation not across distance, but across different angle of views, or other variations? If a person wearing the proposed adversarial clothes is still going to be adversarial if they change their pose, angle towards camera, etc?
- How are you going to propagate gradients for the detectors that use some non-differentiable operations like in MTCNN?
- How long does it take to optimize an attack?
Limitations
Authors mention some limitations in the NeurIPS checklist and potential social impact is discussed in the Appendix L.
Question: The adversarial patches/clothes are really massive.
Answer:
We used relatively large patches and full-body clothing in our experiments for two reasons. First, it is now common practice to employ large adversarial patterns, such as full-body clothing [5][6] or fully-covered car paint [3][4], to conduct adversarial attacks. Second, considering that the goal of this study is to enhance attack performance over a wide range of distances, when the subject is far away, scale reduction blurs a small patch or small piece of cloth so much that it cannot retain a sufficiently complex pattern for an adversarial attack.
As demonstrated in Figures 8(a) and 9(a), when printed with random patterns at the current sizes, the patches and clothing had negligible attack performance. Additionally, at the current patch and clothing sizes, the baseline methods had limited full-distance attack performance. This indicates that the patch and clothing sizes alone do not account for our full-distance attack performance.
Question: At some distances (5-10m) Adv-Tshirt is better than the proposed FDA. Answer: Yes, you are right. Please note that the goal of this study is to maximize the average ASR across different distances, so we find it acceptable for the proposed method not to perform perfectly at a small number of distances. But your comment reminds us that there is room for further improvement, and we will explore it in the future.
Question: Is the FDA clothing adversarial under different angles and poses? Answer: Yes, the FDA clothing is adversarial under different angles and poses. Regarding different angles, other than evaluating the clothing from the front and back view, we have also evaluated the performances of FDA, TCA and random clothing from the side view. The average ASRs were 61%, 46% and 0% respectively (lines 336-337 of the paper). Regarding different poses, when performing the clothing experiments, at various distances, we asked the subjects to walk with their arms swinging to collect the testing images. Therefore, the results in Figure 9 (a) were already averaged over different poses.
Question: How are you going to propagate gradients for the detectors that use some non-differentiable operations like in MTCNN? Answer: To the best of our knowledge, most mainstream detectors of the last five years are differentiable, which is why we performed FDA by gradient backpropagation. However, there are several potential ways to perform FDA if the target model contains non-differentiable operations. First, by using the ensemble attack method [1], FDA patterns that generalize well across detectors can be generated, enabling attacks on models with non-differentiable operations. Additionally, there are works that estimate adversarial attack gradients (BPDA) [2], generate attack gradients with a surrogate network [3], perform attacks using search algorithms [4], or propose a general adaptive attack method [7] that accounts for the presence of non-differentiable operations. These approaches can potentially be leveraged to craft FDA patterns for models with non-differentiable operations.
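As an example of the BPDA idea mentioned above [2], here is a minimal PyTorch sketch. The quantization step is an illustrative stand-in for any non-differentiable operation in a detection pipeline, not a specific detector's code:

```python
import torch

class BPDAIdentity(torch.autograd.Function):
    """BPDA [2]: run the non-differentiable op in the forward pass and
    approximate its gradient with the identity in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        # Stand-in for a non-differentiable step, e.g. 8-bit quantization.
        return (x * 255).round() / 255

    @staticmethod
    def backward(ctx, grad_output):
        # Identity (straight-through) gradient approximation.
        return grad_output

x = torch.rand(1, 3, 32, 32, requires_grad=True)
y = BPDAIdentity.apply(x)
y.sum().backward()  # x.grad now holds the straight-through gradient
```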
Question: How long does it take to optimize an attack? Answer: It usually takes about 16 hours on a single NVIDIA 3090 GPU to optimize an FDA pattern. With the same set of code, the AdvTshirt and TCA patterns also required a comparable amount of time to properly optimize. We note that the optimization speed was not our main concern in this work and it could certainly be improved.
[1] Yanpei Liu, Xinyun Chen, Chang Liu, & Dawn Song (2017). Delving into Transferable Adversarial Examples and Black-box Attacks. In International Conference on Learning Representations.
[2] Anish Athalye, Nicholas Carlini, & David A. Wagner (2018). Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. In International Conference on Machine Learning.
[3] Yang Zhang, Hassan Foroosh, Philip David, & Boqing Gong (2019). CAMOU: Learning Physical Vehicle Camouflages to Adversarially Attack Detectors in the Wild. In International Conference on Learning Representations.
[4] Tong Wu, Xuefei Ning, Wenshuo Li, Ranran Huang, Huazhong Yang, & Yu Wang (2020). Physical Adversarial Attack on Vehicle Detector in the Carla Simulator. ArXiv, abs/2007.16118.
[5] Hu, Z., Chu, W., Zhu, X., Zhang, H., Zhang, B., & Hu, X. (2023). Physically Realizable Natural-Looking Clothing Textures Evade Person Detectors via 3D Modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 16975-16984).
[6] Jing Li, Sisi Zhang, Xingbin Wang, and Rui Hou. 2023. Mimic Octopus Attack: Dynamic Camouflage Adversarial Examples Using Mimetic Feature for 3D Humans. In Information Security and Cryptology: 18th International Conference, Inscrypt 2022, Beijing, China, December 11–13, 2022, Revised Selected Papers. Springer-Verlag, Berlin, Heidelberg, 429–444. https://doi.org/10.1007/978-3-031-26553-2_23
[7] Yao, C., Bielik, P., Tsankov, P., & Vechev, M. (2021). Automated Discovery of Adaptive Attacks on Adversarial Defenses. In Advances in Neural Information Processing Systems (pp. 26858–26870). Curran Associates, Inc..
Thanks for your answers. After reading other reviews and replies, I am keeping my score 5 with leaning to accept the paper, but it can go either way.
Thank you for your effort in reviewing our paper and providing us with feedback.
We sincerely appreciate the time and effort the reviewers have put into this manuscript, which will help us improve the quality of the revised paper. It is encouraging that the reviewers find our work to be "important" [vr4j, ouf1], our design to be "reasonable" and "convincing" [vr4j, ouf1], and our presentation to be "vivid" and "clear" [e7ej]. We addressed the concerns in each reviewer's individual rebuttal area, and the paper will be updated accordingly. We hope you are satisfied with our response. We welcome any further comments and would be happy to answer them in the discussion period. Please note that the uploaded PDF file provides supporting figures in response to a question from reviewer vr4j.
The rebuttal addressed most of the concerns raised by the reviewers and now all reviewers support accepting the paper. Hence, it will be accepted. The reviewers particularly liked that the paper identified an interesting problem and solved it effectively and presented various experiments showing its effect.