PaperHub

Overall rating: 5.5 / 10
Poster · 4 reviewers
Ratings: 6, 5, 7, 4 (min 4, max 7, std 1.1)
Confidence: 4.0
Correctness: 3.3 · Contribution: 3.0 · Presentation: 3.0

NeurIPS 2024

Revisiting Adversarial Patches for Designing Camera-Agnostic Attacks against Person Detection

OpenReview · PDF
Submitted: 2024-04-25 · Updated: 2024-11-06
TL;DR

We propose a camera-agnostic physical adversarial attack, CAP (Camera-Agnostic Patch), addressing the oversight of camera influence in existing physical adversarial attack methods.

Abstract

Keywords
Physical Adversarial Attack · Person Detection · Adversarial Patch · Camera ISP

Reviews and Discussion

Review (Rating: 6)

The authors note that existing physical adversarial attack methods overlook the transition from the physical domain to the digital domain, which involves the camera. They therefore propose a camera-agnostic attack to enhance the stability of adversarial patches across different cameras. The proposed adversarial optimization framework employs a differentiable camera ISP proxy network as a defender, engaging in a zero-sum game with the attacker to enhance the attack performance of adversarial patches. Experimental validation across multiple imaging devices demonstrates the effectiveness of the proposed approach.

Strengths

  1. The authors identify a flaw in the existing pipeline of physical adversarial attack methods, specifically the oversight of transitioning from the physical domain to the digital domain, which they assert is mediated by the camera. They subsequently evaluate the camera's influence on the performance of physical adversarial attacks. Their research approach is logical and well-founded.

  2. The adversarial optimization framework proposed in this paper, utilizing a camera ISP proxy network as a defender, is concise and intuitive.

  3. The multi-box detection issue introduced in this paper is a significant problem worthy of attention.

Weaknesses

  1. An explanation of the hyperparameters and their value ranges for the camera ISP proxy network is needed.

  2. Essentially, this paper aims to address the issue of camera black-box attacks. However, the authors overlook another black-box scenario, namely the model. The transferability of attack methods across different models is also an important research direction. This paper lacks discussion on the model black-box aspect.

  3. Placing Figure A from the supplementary materials into the main text can help readers better understand the motivation behind this paper.

Questions

  1. While introducing a camera ISP module is reasonable, the authors chose 6 hyperparameters for the camera ISP proxy network and listed their corresponding ranges. What criteria guided the selection of these hyperparameters and their ranges?

  2. Could the authors provide more discussion on the threat model and the black-box model scenario?

Limitations

The authors discuss the limitations of this paper, specifically regarding the simulation of the camera module.

Author Response

Q1: An explanation of the hyperparameters and their value ranges for the camera ISP proxy network is needed.

A1: Thank you for your insightful comment regarding the hyperparameter selection for our camera ISP proxy network.

Camera ISPs consist of multiple processing stages. Tseng et al.[1] summarize common imaging pipelines into the following stages: (1) Optics, (2) White Balance & Gain, (3) Demosaicking, (4) Denoising, (5) Color & Tone Correction, and (6) Color Space Conversion & Compression. The first three stages pertain to RAW data acquisition and processing, while the latter three involve operations on RGB values. Since our task primarily focuses on images in RGB format, we chose conditional parameters for the proxy network from the latter three stages. We selected six parameters that significantly impact visual attributes, such as Brightness Contrast Control and Gamma Adjustment, as detailed in Table 2 of the paper. Empirical evaluations demonstrate that these six parameters affect attack performance.

We observed that certain combinations could lead to complete information loss in the image. To balance the quality and diversity of the images generated by the ISP proxy network, we selected the ranges specified in Table 2 of the paper. For parameter $a$, values below 64 may result in images being too dark. For parameter $b$, values below 64 may cause colors to appear washed out. For the $\gamma$ setting, values above the specified range may introduce noise in dark areas, while values below the range can reduce image contrast. For parameter $c$, deviations from the specified interval may lead to poor image quality. For parameter $d$, values below the specified range may result in insufficient spatial filtering, causing excessive noise. For parameter $e$, values less than 1 may not effectively remove noise.
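For illustration, here is a minimal sketch of how a batch of such conditional parameters might be sampled for the proxy network. The parameter names and most numeric bounds below are assumptions for demonstration only; the authoritative values are those in Table 2 of the paper.

```python
import torch

# Hypothetical names and ranges for the six conditional ISP parameters;
# the actual values are those listed in Table 2 of the paper.
PARAM_RANGES = {
    "a_brightness": (64.0, 255.0),  # below 64 the image may be too dark
    "b_contrast":   (64.0, 255.0),  # below 64 colors may look washed out
    "gamma":        (0.8, 2.2),     # assumed bounds for gamma adjustment
    "c_tone":       (0.5, 1.5),     # assumed tone-correction interval
    "d_spatial":    (1.0, 5.0),     # assumed spatial-filtering strength
    "e_denoise":    (1.0, 3.0),     # below 1 noise is not removed
}

def sample_isp_params(batch_size: int) -> torch.Tensor:
    """Uniformly sample conditional parameters to feed the ISP proxy network."""
    lows = torch.tensor([lo for lo, _ in PARAM_RANGES.values()])
    highs = torch.tensor([hi for _, hi in PARAM_RANGES.values()])
    u = torch.rand(batch_size, len(PARAM_RANGES))
    return lows + u * (highs - lows)  # shape: (batch_size, 6)
```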

We will include a more comprehensive discussion of this topic in the revised manuscript.

[1] Hyperparameter optimization in black-box image processing using differentiable proxies. ACM TOG 2019.

Q2: Essentially, this paper aims to address the issue of camera black-box attacks. However, the authors overlook another black-box scenario, namely the model. The transferability of attack methods across different models is also an important research direction. This paper lacks discussion on the model black-box aspect.

A2: Thank you for your insightful comment. We agree that exploring the transferability of our proposed attack method across different models is an important avenue for future research. We have therefore included additional experiments with other detectors, as shown in the table below.

YOLOv3:

| Method | AP (the lower the better) | ASR (the higher the better) |
| --- | --- | --- |
| Random Noise | 71.3 | 11.3 |
| AdvPatch | 48.1 | 33.3 |
| AdvT-shirt | 55.8 | 24.4 |
| AdvCloak | 52.7 | 30.5 |
| NAP | 66.2 | 14.0 |
| LAP | 65.2 | 14.6 |
| TC-EGA | 56.9 | 24.7 |
| CAP (Ours) | 41.5 | 43.3 |

YOLOv8:

| Method | AP (the lower the better) | ASR (the higher the better) |
| --- | --- | --- |
| Random Noise | 77.9 | 4.0 |
| AdvPatch | 75.6 | 8.8 |
| AdvT-shirt | 77.2 | 6.2 |
| AdvCloak | 73.7 | 10.1 |
| NAP | 78.1 | 5.0 |
| LAP | 78.6 | 4.6 |
| TC-EGA | 77.3 | 6.7 |
| CAP (Ours) | 60.5 | 14.7 |

The above comparative evaluations were conducted under a black-box setting. The detection models were pretrained on the COCO dataset and then fine-tuned on the INRIAPerson dataset. We observe that, although the effectiveness of our CAP attack decreases compared to white-box attacks (see Table 3 in the paper), it still outperforms other methods.

Your suggestion is valuable and has inspired us to not only consider the black-box camera setting but also the black-box model setting. In future work, we will explore physical adversarial attacks under a double black-box scenario.

Q3: Placing Figure A from the supplementary materials into the main text can help readers better understand the motivation behind this paper.

A3: Thank you for your detailed comment. We agree that including Figure A in the main text would offer a more direct and intuitive understanding of the motivation behind our work. We will carefully consider your suggestion.

Q4: Discussion on the threat model.

A4: Thank you for your careful review. We discuss the threat model from the following three aspects:

  • Attack Goal: Our method aims to make the target person evade detection. This is achieved by applying a carefully crafted adversarial perturbation to the target individual's physical appearance, ensuring that the perturbation remains effective across various camera settings.
  • Adversary Capability: The adversary has: (1) knowledge of the object detection model and (2) the ability to generate and physically realize adversarial perturbations.
  • Design Requirements: (1) The method must be physically realizable and (2) maintain effectiveness across different imaging devices.

We appreciate your high evaluation of our work and the valuable suggestions for future improvements. We believe that the novel perspective introduced by our approach will significantly contribute to the field.

Comment

Thank you for your rebuttal. Your responses have addressed my concerns, so I am willing to give a positive rating.

Review (Rating: 5)

This paper proposes a cross-camera physical adversarial attack, the Camera-Agnostic Patch (CAP) attack, against person detection. The method incorporates a differentiable camera Image Signal Processing (ISP) proxy network to compensate for the physical-to-digital domain transition gap. Additionally, the camera ISP proxy network serves as a defense module, forming an adversarial optimization framework with the attack module. The attack module optimizes adversarial perturbations to maximize effectiveness, while the defense module optimizes the conditional parameters of the camera ISP proxy network to minimize attack effectiveness. Experimental results demonstrate the effectiveness of the proposed Camera-Agnostic Patch attack on different hardware.

Strengths

This paper focuses on a commonly overlooked aspect of adversarial attacks—the ISP—and demonstrates good performance across different devices.

The experiments are comprehensive. Good quantitative results are observed even in challenging physical world evaluation.

Weaknesses

  1. In the experimental setup, only the patch size is provided, while the input image size is missing.

  2. Some key experimental settings are missing. The total number of images used to calculate ASR is not provided.

  3. The robustness of the proposed CAP against real-world disturbances (e.g., random noise, blur, etc.) should also be discussed.

Questions

See weakness.

Limitations

Yes

Author Response

Q1: In the experimental setup, only the patch size is provided, while the input image size is missing.

A1: Thank you for your careful review of our paper. The input image size we used is 640x640, consistent with the official YOLOv5 repository. It is worth noting that YOLOv5 supports both 640x640 and 1280x1280 input image sizes, and we used the former. We will include the specific input image sizes used in our experiments in the revised version of the paper.
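For reference, here is a minimal sketch of running the official YOLOv5 model at this input size via torch.hub; the image path is a placeholder.

```python
import torch

# Load a pretrained YOLOv5 model from the official Ultralytics repository.
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

# Inference letterboxes the input to 640x640, matching the paper's setting.
results = model("street_scene.jpg", size=640)  # placeholder image path
results.print()  # per-class detections, including 'person'
```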

Q2: Some key experimental settings are missing. The total number of images used to calculate ASR is not provided.

A2: Thank you for your valuable feedback.

In the digital-space evaluation, we used the publicly available INRIAPerson dataset, which consists of 613 training images and 288 test images. Specifically, the training set contains 3,019 person instances, while the test set contains 855 person instances. Therefore, the ASR in the digital space is calculated based on 288 images containing 855 person instances.

In the physical-space evaluation, we collected images using 6 cameras at 4 different times to avoid interference from unrelated factors. For each patch setting, 5 images were captured per camera per session, resulting in 6x4x5=120 images per patch. We compared 6 adversarial patches in the physical domain, yielding a total of 6x120=720 images. The ASR in the physical space is thus calculated based on these 720 images.
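For clarity, the physical-space image counts reduce to a simple product:

```python
# Physical-space evaluation counts, as stated above.
cameras, sessions, shots_per_session = 6, 4, 5
images_per_patch = cameras * sessions * shots_per_session  # 6 * 4 * 5 = 120
patches = 6
total_images = patches * images_per_patch                  # 6 * 120 = 720
```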

We will clarify the specific number of images in the revised version of the paper.

Q3: The robustness of the proposed CAP against real-world disturbances (e.g., random noise, blur, etc.) should also be discussed.

A3: Thank you for your insightful comment. We acknowledge the importance of robustness to real-world disturbances. During the training of CAP, we implemented techniques to enhance robustness, specifically (1) adding random noise and (2) adding random rotations. These measures strengthen the robustness of the CAP attack, supporting its effectiveness in real-world scenarios, as demonstrated in Section 4.3 of the paper.
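A minimal sketch of this kind of noise-and-rotation augmentation is shown below; the noise magnitude and rotation range are illustrative assumptions, not the authors' exact settings.

```python
import torch
import torchvision.transforms.functional as TF

def augment_patch(patch: torch.Tensor,
                  noise_std: float = 0.05,   # assumed noise magnitude
                  max_angle: float = 20.0    # assumed rotation range (degrees)
                  ) -> torch.Tensor:
    """Apply random noise and a random rotation to a (C, H, W) patch tensor."""
    noisy = patch + noise_std * torch.randn_like(patch)
    angle = float((2 * torch.rand(1) - 1) * max_angle)  # uniform in [-max, max]
    return TF.rotate(noisy, angle).clamp(0.0, 1.0)
```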

Blur, commonly caused by camera motion in videos, was not considered in our evaluation as the detectors we targeted, such as the YOLO series and Faster R-CNN, are image-based. Your comment highlights an important area for future research, particularly the impact of blur on physical adversarial attacks. To the best of our knowledge, this has not been explored extensively. In future work, we plan to investigate physical adversarial attacks on video detection models and evaluate the effects of blur. We believe that modeling video motion and incorporating temporal frame consistency during training are promising approaches.

We appreciate your positive feedback on our work and the valuable suggestions for our future research. We believe that the new perspective offered by our method will significantly contribute to the field.

Review (Rating: 7)

The authors present an improved method for human-detection-blocking adversarial patch attacks, with a particular focus on ensuring said attacks are robust to changes in camera. This approach is motivated by a study of the impact of camera Image Signal Processing (ISP) pipelines, which the authors show is a considerable factor in the inconsistency of real-world adversarial patch attacks. The authors’ proposed Camera-Agnostic Patch (CAP) attack uses a proxy network to simulate the ISP, and said network is optimized through an adversarial learning approach to make patches that are robust across different cameras. The authors demonstrate this with real data collected with 6 different imaging devices.

Strengths

The authors' motivation is well presented and well justified, as they point out that much research has been invested into converting adversarial perturbations from digital to physical, but little research focuses on the impact of the camera, which converts the physical patch back into a digital image before it is actually fed into the network.

The CAP method is effective and achieves the stated goal of making attacks robust to changes in camera. This also helps to address significant limitations in the reproducibility of prior works with physical adversarial patches.

The authors present comprehensive experiments, including digital and physical evaluations, along with ablations and adversarial defenses.

The authors also include insightful analysis of the best performing baseline (T-SEA) and show that while it does disrupt the detections, it does not fully remove them and instead fragments them into smaller pieces.

Overall, the work is clear and well presented.

Weaknesses

While the authors' experimental evaluation is quite comprehensive, one aspect they did not test for was differences in target model, as the same victim detector is used in all experiments. This could potentially be improved by demonstrating CAP attacks for different models. In addition, it would be interesting to see if the CAP attacks are any better at generalizing across target models, such as for black-box style attacks.

Questions

I note that the supplemental material includes an additional visualization video, and I am aware that videos have their own unique artifacts due to their encoding. Does the difference between video capture and image capture have an impact on attack effectiveness? Does CAP help improve the generalization of attacks from image to video?

Limitations

The authors have discussed some limitations of the work, including other camera settings like exposure time and aperture size, which are not modelled in their current approach. As this is a work presenting an adversarial attack, the authors have also discussed the potential risks of such attacks and presented some defense evaluations.

The authors present results with six different imaging devices. This number could of course be increased, though this is not strictly necessary in my opinion.

As this is a work focused on presenting an adversarial attack, there is some risk for negative societal impact.

The statement about the random seed (line 189) may not be necessary in the main paper.

There is duplicated text on lines 254-255.

Author Response

Q1: While the authors experimental evaluation is quite comprehensive [...].

A1: Thank you for your valuable feedback. We acknowledge that evaluating the performance of our approach on other detectors would be meaningful. To address your concerns, we have conducted additional experiments, as detailed in the table below.

YOLOv3:

| Method | AP (the lower the better) | ASR (the higher the better) |
| --- | --- | --- |
| Random Noise | 71.3 | 11.3 |
| AdvPatch | 48.1 | 33.3 |
| AdvT-shirt | 55.8 | 24.4 |
| AdvCloak | 52.7 | 30.5 |
| NAP | 66.2 | 14.0 |
| LAP | 65.2 | 14.6 |
| TC-EGA | 56.9 | 24.7 |
| CAP (Ours) | 41.5 | 43.3 |

YOLOv8:

| Method | AP (the lower the better) | ASR (the higher the better) |
| --- | --- | --- |
| Random Noise | 77.9 | 4.0 |
| AdvPatch | 75.6 | 8.8 |
| AdvT-shirt | 77.2 | 6.2 |
| AdvCloak | 73.7 | 10.1 |
| NAP | 78.1 | 5.0 |
| LAP | 78.6 | 4.6 |
| TC-EGA | 77.3 | 6.7 |
| CAP (Ours) | 60.5 | 14.7 |

The above comparative evaluations were conducted under a black-box setting. The detection models were pretrained on the COCO dataset and then fine-tuned on the INRIAPerson dataset. We observe that, although the effectiveness of our CAP attack decreases compared to white-box attacks (see Table 3 in the paper), it still outperforms other methods.

Your suggestion is valuable and has inspired us to not only consider the black-box camera setting but also the black-box model setting. In future work, we will explore physical adversarial attacks under a double black-box scenario.

Q2: Video artifacts.

A2: Thank you for your insightful comment. The difference between video capture and image capture indeed affects attack effectiveness. The key distinction lies in the inherent camera and scene motion in video, which leads to scene shifts and introduces motion blur-related artifacts that significantly impact attack effectiveness. We observed that attacks sometimes failed in the presence of blur.

Our CAP attack focuses on images because detectors like the YOLO series and Faster R-CNN are image-based. Consequently, we did not consider the effects of video artifacts. However, to enhance the generalization of the adversarial perturbations, we incorporated tricks such as (1) adding random noise and (2) applying random rotations during the optimization process.

Your comment provides valuable insights for future research, especially regarding physical adversarial attacks on video detection models. To the best of our knowledge, no work has been done in this area. The generalization from images to videos remains an open problem. We believe that modeling video motion and considering temporal frame consistency during training are promising approaches.

Q3: Camera number.

A3: Thank you for your valuable comment. We chose to design and validate our method on smartphone cameras (iPhone, Redmi, Huawei, and Samsung) and typical consumer cameras (Sony and Canon) first because these cameras are portable, widely used, and more susceptible to attacks. At the same time, they are less reliable than industrial cameras, especially in challenging environments like high temperatures, humidity, and electromagnetic interference.

We agree that expanding the range of cameras would further enhance the applicability of our method. In future work, we plan to extend our evaluation by using industrial cameras to validate the attack's effectiveness. The following table lists industrial cameras commonly used in autonomous driving and video surveillance that we have surveyed.

| Manufacturer | Model | Application Areas |
| --- | --- | --- |
| Sony | IMX415 | Autonomous vehicles, robotics |
| ON Semiconductor | AR0132 | Surveillance, drones |
| Hikvision | DS-2CD2043G1-I | Surveillance |

Q4: Negative societal impact.

A4: We fully acknowledge it is crucial to consider these potential negative societal impacts. The ethical statement about our work is given below:

Our work successfully achieves physical adversarial attacks in person detection tasks. Given the effectiveness of our attack method across various imaging devices, its real-world application is feasible. This exposes potential security risks in existing DNNs-based applications, particularly when the technology is leveraged for malicious purposes. We advocate for the responsible and ethical use of technology. Furthermore, we offer comprehensive methodological descriptions and openly address the implications of our work, encouraging discourse within and beyond the scientific community to contribute to the advancement of trustworthy and dependable AI.

Additionally, to mitigate the potential risks of our CAP attacks, we discuss defense strategies in Section 4.5 of the paper. Through evaluation across three defense strategies, the experimental results indicate that adversarial training effectively mitigates our CAP attack in both digital and physical spaces. Moving forward, we will actively collaborate with the security community to explore the broader implications of our work and identify further mitigation strategies.

Q5: Random seed.

A5: Thank you for your careful review. As per your suggestion, we will remove the statement about the random seed from the main paper and include it in the appendix, leaving more space in the main text for more valuable discussions.

Q6: There is duplicated text on lines 254-255.

A6: Thank you for bringing this to our attention. We will eliminate the redundancy and thoroughly proofread the entire paper to ensure clarity throughout.

We appreciate your high praise for our work and the valuable suggestions for our future research. We believe that the new perspective offered by our method will significantly contribute to the field.

Comment

I thank the authors for their responses and discussion. I believe the authors' focus on the role of the camera is critically underexamined, and their work has very important implications for real-world adversarial attacks. I maintain my original rating in support of accepting this work.

Review (Rating: 4)

The paper "Revisiting Adversarial Patches for Designing Camera-Agnostic Attacks against Person Detection" addresses the limitations of current physical adversarial attack methods, which often fail to consider the variability introduced by different camera Image Signal Processing (ISP) pipelines. This oversight leads to instability and reduced effectiveness of attacks in real-world scenarios. The authors propose a novel approach that incorporates a differentiable camera ISP proxy network into the adversarial patch generation process. This network models the transformations that images undergo when captured by different cameras, making the adversarial patches more robust and effective across various camera systems. The approach involves an adversarial optimization framework with two key components: an attack module that generates adversarial perturbations and a defense module that optimizes the ISP parameters to mitigate the attack's effectiveness.

Strengths

By introducing a differentiable camera Image Signal Processing (ISP) proxy network, the authors address a critical and previously overlooked aspect of these attacks: the variability introduced by different camera systems. This novel inclusion significantly enhances the robustness and generalizability of adversarial patches, marking a substantial departure from traditional methods that often neglect the role of the camera.

Weaknesses

  1. Some typographical errors: The citation in line 52 is incorrect; citation [22] should be changed to [40]. There are similar errors with multiple citations throughout the paper. It is recommended to thoroughly check each citation to avoid confusion. Moreover, there is a duplication error in line 255.
  2. The method flowchart is not very clear and does not seem related to attacking person detection. It is recommended to be more specific and should include detailed training methods and more details about the ISP proxy network.
  3. Although different smartphone cameras and typical cameras were tested, object detection is often used in autonomous driving or video surveillance. It is recommended to expand the testing to include cameras commonly used in these systems and to focus on these cameras.
  4. The paper is overly focused on engineering aspects and lacks thorough theoretical analysis. For example, it is recommended to explain in detail why the training strategy in the GAN framework is effective.

Questions

  1. Does the method in this paper have any differences from mainstream patch-based methods in terms of the patch training process?
  2. Is this paper primarily considering targeted attacks or untargeted attacks? In equation 1, it seems to be an untargeted attack; however, the goal of concealing person instances is more likely a targeted objective. If considering untargeted attacks, it is more appropriate to use AP (Average Precision) as a metric. If considering targeted attacks, ASR (Attack Success Rate) is more relevant.

Limitations

The authors deserve commendation for their transparency and integrity in acknowledging the limitations of their work and the potential ethical issues associated with their research.

Author Response

Q1: Typographical errors [...].

A1: We have made the following corrections:

  • Corrected the citation in line 52: Zhang et al.[40] employed [...].
  • Corrected the citation in line 129: as Zhang et al.[40] demonstrated [...].
  • Removed the duplication in line 255.

We will thoroughly proofread the paper.

Q2: Flowchart.

A2: Thanks for your feedback. The revised flowchart can be found in the one-page PDF file.

Q3: Although different smartphone [...].

A3: Thanks for your comment. Smartphone and consumer cameras are portable and widely used but less reliable in harsh environments (e.g., high temperatures, electromagnetic interference) compared to industrial cameras used in autonomous driving and surveillance. Exploring their security implications is crucial, so we initially validated our method on these cameras.

To substantiate the claims in the paper, our approach introduces a wide parameter space for the ISP proxy network and uses six different cameras, offering a broad sampling space for training and evaluation.

In future work, we plan to include industrial cameras to enhance applicability in autonomous driving and surveillance. The following table lists commonly used industrial cameras we have surveyed.

| Manufacturer | Model | Application Areas |
| --- | --- | --- |
| Sony | IMX415 | Autonomous vehicles, robotics |
| ON Semiconductor | AR0132 | Surveillance, drones |
| Hikvision | DS-2CD2043G1-I | Surveillance |

Q4: Theoretical analysis.

A4: As outlined in Sections 3.2 and 3.4, our approach is inspired by the GAN theoretical framework. The Attacker acts as the Generator, producing adversarial perturbations, while the Defender acts as the Discriminator, optimizing the ISP hyperparameters. Our alternating optimization mirrors the convergence analysis of GANs. We provide a detailed theoretical argument below:

Global Optimality of the Adversarial Optimization Framework

We begin by analyzing the optimal Defender strategy given any Attacker strategy.

Proposition 1: For a fixed Attacker strategy, the optimal Defender strategy minimizes the effectiveness of the adversarial perturbations.

Proof: The Attacker aims to maximize the error induced by the adversarial perturbations, while the Defender aims to minimize this error. Formally, let $P$ denote the perturbations and $\theta_d$ the ISP hyperparameters. The Attacker's objective is:

$$L_A = \max_P \mathbb{E}_{x \sim p_\text{data}} \left[ \ell(f(x + P), y) \right]$$

where $\ell$ is the loss function, $f$ is the neural network, and $y$ denotes the ground-truth labels.

The Defender seeks to minimize the effectiveness of these perturbations by optimizing $\theta_d$:

$$L_D = \min_{\theta_d} \mathbb{E}_{x \sim p_\text{data}} \left[ \ell(f(g(x + P; \theta_d)), y) \right]$$

where $g$ represents the ISP proxy network processing.

The optimal Defender strategy $\theta_d^*$ satisfies:

$$\theta_d^* = \arg \min_{\theta_d} L_D$$

Given this optimal strategy, the Attacker's objective function becomes:

$$L_A^* = \max_P L_D(\theta_d^*)$$

The interplay between the Attacker and Defender can be modeled as a minimax game:

$$\min_{\theta_d} \max_P \mathbb{E}_{x \sim p_\text{data}} \left[ \ell(f(g(x + P; \theta_d)), y) \right]$$

Since the ISP proxy network aims to minimize the impact of adversarial perturbations, the optimal strategy for the Defender minimizes the Attacker's maximum achievable loss, concluding the proof.

Convergence of the Optimization Algorithm

Proposition 2: If the Attacker and Defender have sufficient capacity, and at each iteration of the optimization algorithm each optimizes its respective objective, then the adversarial perturbations converge to an equilibrium at which their effectiveness is minimized by the ISP proxy network.

Proof: The optimization algorithm follows an alternating training strategy, akin to that of GANs. At each step, the Attacker optimizes the perturbations $P$ while keeping the ISP parameters $\theta_d$ fixed, and vice versa.

The Attacker's update rule is:

$$P^{(t+1)} = P^{(t)} + \alpha \nabla_P L_A$$

The Defender's update rule is:

$$\theta_d^{(t+1)} = \theta_d^{(t)} - \beta \nabla_{\theta_d} L_D$$

By iteratively updating $P$ and $\theta_d$ in this manner, the algorithm performs gradient descent on a convex-concave function, ensuring convergence to a local Nash equilibrium. At this equilibrium, the adversarial perturbations $P^*$ are countered effectively by the optimized ISP parameters $\theta_d^*$, concluding the proof.
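A schematic PyTorch sketch of these alternating updates follows. Here f, g, and loss_fn are placeholders for the detector, ISP proxy, and detection loss, and the plain gradient rule from Proposition 2 stands in for whatever optimizer the authors actually use.

```python
import torch

def alternating_step(P, theta_d, f, g, loss_fn, x, y, alpha=0.01, beta=0.01):
    """One round of the minimax game from Propositions 1-2:
    gradient ascent on the perturbation P, then gradient descent on the
    ISP proxy parameters theta_d. Both P and theta_d must be tensors with
    requires_grad=True; f, g, and loss_fn are differentiable placeholders."""
    # Attacker step: maximize the detection loss w.r.t. P (theta_d fixed).
    L_A = loss_fn(f(g(x + P, theta_d)), y)
    (grad_P,) = torch.autograd.grad(L_A, P)
    P = (P + alpha * grad_P).detach().requires_grad_(True)

    # Defender step: minimize the detection loss w.r.t. theta_d (P fixed).
    L_D = loss_fn(f(g(x + P, theta_d)), y)
    (grad_theta,) = torch.autograd.grad(L_D, theta_d)
    theta_d = (theta_d - beta * grad_theta).detach().requires_grad_(True)
    return P, theta_d
```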

Q5: Differences of the patch training process?

A5: Mainstream patch-based methods apply a random patch to the target image and update it to maximize the discrepancy between predictions and the ground truth, analogous to the Attacker in our method. Our approach further introduces a Defender, represented by an ISP proxy network. This addition enables the Attacker and Defender to form an adversarial optimization framework, enhancing the attack's transferability across different cameras. A simplified sketch of the patch-application step is given below.
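This sketch only illustrates the generic patch-pasting step common to mainstream methods; real pipelines warp the patch into each detected person's bounding box before optimizing.

```python
import torch

def apply_patch(image: torch.Tensor, patch: torch.Tensor,
                top: int, left: int) -> torch.Tensor:
    """Paste a (C, h, w) patch onto a (C, H, W) image at (top, left).
    A simplified stand-in for the placement used in patch attacks."""
    out = image.clone()
    _, h, w = patch.shape
    out[:, top:top + h, left:left + w] = patch
    return out
```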

Q6: Targeted or untargeted attacks.

A6: Our approach is an untargeted attack. In classification tasks, targeted attacks mislead the classifier into categorizing the target instance as a specific incorrect class, while untargeted attacks cause the classifier to misclassify the target instance into any incorrect class. Based on this criterion, we tend to classify the task of concealing person instances as an untargeted attack.

Our method, as an untargeted attack, was evaluated using AP. However, AP alone can be misleading, especially in multi-box detection scenarios where AP may be low despite ineffective attacks (see Section B of our supplementary material). ASR compensates for this limitation of AP, as it is not affected by an increase in True Positive samples. Thus, we employ both AP and ASR.
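For illustration, one simple instance-level ASR computation is sketched below; this is an assumed definition, and the paper's exact matching protocol (e.g., IoU thresholds) may differ.

```python
def attack_success_rate(gt_persons: int, detected_persons: int) -> float:
    """Fraction of ground-truth person instances that evade detection.
    E.g., gt_persons=855 for the INRIA test set mentioned above."""
    return (gt_persons - detected_persons) / gt_persons
```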

Lastly, we would like to thank you for acknowledging the strengths of our method, as you noted: (1) a critical [...] and (2) This novel [...]. We hope our responses address your concerns and contribute to an improved rating for our paper.

Author Response

We thank the reviewers for their insightful and constructive comments. As summarized by reviewers xDi9, eDby, and ai8f, our work focuses on a critical and previously overlooked aspect—the impact of camera variability on physical adversarial attacks—and proposes an effective and robust adversarial patch capable of cross-camera attacks. To the best of our knowledge, ours is the first work to design and evaluate across multiple imaging devices, aligning with real-world scenarios.

All reviewers acknowledged the value of our work. We are pleased that most reviewers agreed that:

  • Incorporating a differentiable camera ISP proxy network into the adversarial patch generation process is a novel approach. This inclusion significantly enhances the robustness and generalizability of adversarial patches. [reviewer xDi9]
  • Our motivation is well presented and well justified. [reviewer N8jz]
  • The experiments are comprehensive. [reviewers N8jz and eDby]
  • The research approach is logical and well-founded. The framework is concise and intuitive. [reviewer ai8f]

We have provided point-by-point responses to each reviewer's comments. Specific feedback will be addressed in our individual responses. We hope to demonstrate the validity of our claims through detailed discussions and additional experiments.

Comment

Dear Reviewers,

I hope this message finds you well. We would like to know if our responses to your comments have addressed your concerns. If you have any additional questions or suggestions that could help us improve the quality of our work, we would be more than happy to address them.

Final Decision

This work focuses on a critical and previously overlooked aspect—the impact of camera variability on physical adversarial attacks. The authors have proposed an effective and robust adversarial patch capable of cross-camera attacks.

Reviewer xDi9 raised two main questions: (1) how the proposed method differs from mainstream patch-based methods in terms of the patch training process, and (2) whether the attack is targeted or untargeted.

The authors have responded in Q5 and Q6 of their rebuttal to Reviewer xDi9. For (1), the main difference is that the proposed method introduces a Defender, which is represented by an ISP proxy network. This addition enables the attacker and defender to form an adversarial optimization framework. The authors have shown further analysis of the global optimality of the adversarial optimization framework in their Q4 response.

Reviewer xDi9 followed up with additional concerns on (1) “the decision to focus on consumer cameras, while partially justified, leaves room for further improvement in demonstrating the method's applicability to more relevant scenarios, such as autonomous driving and surveillance. The effectiveness of the approach may not be fully validated until these additional tests are conducted.” and (2) “I recognize that the proposed approach makes a significant contribution to the field by addressing an important and previously overlooked aspect of adversarial attacks. However, the novelty and effectiveness of the method could be better demonstrated with additional experimentation and theoretical justification.”

I think for (1), focusing on consumer cameras is a pretty solid scope already, and extending to autonomous driving and surveillance could be a next follow-up work. For (2), I think the authors have done an OK job demonstrating the proposed method via experimentation and theoretical justification.

Given that Reviewer xDi9’s concerns are mostly addressed and the other 3 reviewers (Reviewer N8jz, Reviewer eDby, and Reviewer ai8f) are all trending towards the positive side, I am inclined to accept this work.