Cross-Modality Perturbation Synergy Attack for Person Re-identification
Abstract
Reviews and Discussion
The paper addresses security concerns in cross-modality person re-identification systems, focusing on systems that use both RGB and infrared images. Traditional ReID systems have primarily focused on RGB images, but the differences between RGB and infrared modalities present unique challenges. The authors propose a universal perturbation attack method designed for cross-modality ReID, which optimizes perturbations using gradients from multiple modalities. Experiments on the RegDB and SYSU-MM01 datasets demonstrate the effectiveness of this method.
Strengths
1. This work investigates vulnerabilities in cross-modality ReID models.
2. It proposes a cross-modality perturbation synergy (CMPS) attack that uses a triplet loss to optimize perturbations leveraging shared knowledge across modalities.
3. Extensive experiments on the RegDB and SYSU datasets show the method's effectiveness and provide insights for future improvements in cross-modality ReID robustness.
Weaknesses
1. The authors use random grayscale transformations to reduce modality differences. Are the three-channel grayscale images based on visible light, infrared, or both?
2. What is the intended meaning of the decision boundary in Figure 2?
3. Figure 3 is difficult to understand.
4. Can the method be discussed on more datasets, such as the LLCM dataset?
5. Does the size of the adversarial boundary affect the experimental results?
6. What is the overall loss function used in the paper? How are the functions discussed in Section 3.2 and Section 3.4 related? Additionally, in Section 3.4, the sequence of formulas seems inconsistent with the context.
7. The SYSU-MM01 common tests are conducted in all-search and indoor-search modes. Which mode is the experiment in Table 1 based on? It is recommended to discuss both modes.
8. The authors should select more diverse types of baseline models to verify the generalizability of the method.
9. In Algorithm 1, the CMPS attack is described as using grayscale images to update perturbations, but in the ablation experiments in Table 3, the two are validated as separate modules. What is the reason for this?
Questions
See the weaknesses above.
Limitations
The authors explain the limitations of their work, and there is no negative societal impact.
We would like to express our sincere gratitude for your thorough review and valuable feedback on our paper. Your input has played a pivotal role in enhancing the quality and clarity of the manuscript. Our responses to the individual questions are given below.
Reviewer’s Comment :
“ ...... Are the three-channel grayscale images based on visible light, infrared, or both? ”
Response:
The three-channel grayscale image is based on both visible light and infrared.
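As a rough illustration of what a three-channel grayscale transform looks like, here is a minimal sketch. It is not the authors' implementation: the function names, the BT.601 luma weights, and the application probability `prob` are assumptions made for this example, and the same function would simply be applied to images from either modality that are stored with three channels.

```python
import numpy as np

def to_three_channel_gray(image: np.ndarray) -> np.ndarray:
    """Convert an H x W x 3 image to grayscale replicated over three channels."""
    # Assumed luma weights (ITU-R BT.601); the paper may use a different conversion.
    weights = np.array([0.299, 0.587, 0.114])
    gray = image @ weights                      # shape H x W
    return np.repeat(gray[..., None], 3, axis=2)

def random_grayscale(image: np.ndarray, prob: float, rng: np.random.Generator) -> np.ndarray:
    """With probability `prob` (a hypothetical parameter), return the grayscale version."""
    if rng.random() < prob:
        return to_three_channel_gray(image)
    return image
```

Keeping three identical channels means the augmented image can be fed to a network that expects RGB-shaped input without changing its first layer.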
Reviewer’s Comment :
“What is the intended meaning of the decision boundary in Figure 2? ”
Response:
Universal adversarial perturbations leverage the manifold hypothesis, indicating inherent structures and relationships among data. These perturbations push different sample features into a common sub-region that affects the model's accuracy. In Figure 2, we use a spherical representation of the manifold. Identical shapes with different colors represent features of the same person across different modalities. Our method generates universal perturbations that direct these features into a common sub-region.
Reviewer’s Comment :
“The Figure 3 is difficult to understand ”
Response:
We noticed your comment about the difficulty in understanding Figure 3. To address this, we have redrawn it (see Figure 1 in the rebuttal supplementary material PDF). It shows how our approach captures intrinsic associations between modalities by pushing apart feature distances of positive pairs with the same identity and pulling closer feature distances of negative pairs with different identities across modalities.
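The direction of this objective can be made concrete with a small sketch. It assumes a hinge-style triplet form with Euclidean distances and a hypothetical `margin` value; the exact loss in the paper is not reproduced here. From the attacker's point of view the loss is low when the positive (same identity, other modality) is far from the anchor and the negative (different identity) is close, which is the reverse of the usual training objective.

```python
import numpy as np

def attack_triplet_loss(anchor: np.ndarray, positive: np.ndarray,
                        negative: np.ndarray, margin: float = 0.5) -> float:
    """Hinge loss the attacker minimizes: push the positive away, pull the negative in."""
    d_pos = np.linalg.norm(anchor - positive)   # distance to same-identity feature
    d_neg = np.linalg.norm(anchor - negative)   # distance to different-identity feature
    # Zero once the negative is closer than the positive by at least `margin`.
    return float(max(0.0, d_neg - d_pos + margin))
```

For example, with the anchor at the origin, a positive at distance 1, and a negative at distance 5, the loss is 4.5; swapping the two distances drives it to 0, which corresponds to a retrieval that already fails.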
Reviewer’s Comment:
“Can the method be discussed on more datasets, such as the LLCM dataset? ”
Response:
We have added the relevant experiments. The results (ϵ=8) of our proposed method on the LLCM dataset [1] with the DEEN baseline [1] are shown in the table below (lower accuracy indicates a stronger attack).
| Method | Visible to Infrared (r=1) | Infrared to Visible (r=1) |
|---|---|---|
| DEEN Baseline | 62.5% | 54.9% |
| M-FGSM Attack | 28.4% | 24.8% |
| LTA Attack | 15.1% | 19.5% |
| ODFA Attack | 26.3% | 23.7% |
| Col.+Del. Attack | 8.6% | 9.1% |
| Our Attack | 5.8% | 6.4% |
LLCM is a dataset designed for cross-modality ReID under low-light conditions. Compared to other datasets, it presents more challenges for attackers due to its diverse scenarios and low-light conditions. This complexity and uncertainty make it difficult for adversarial samples to remain effective, reducing the attack success rate.
[1] Zhang, Y., et al. (2023). Diverse embedding expansion network and low-light cross-modality benchmark for visible-infrared person re-identification. In CVPR (pp. 2153-2162).
Reviewer’s Comment:
“Does the size of the adversarial boundary affect the experimental results? ”
Response:
Yes, the size of the adversarial boundary can significantly affect the experimental results. The experimental results on AGW using the RegDB dataset are as follows:
| Adversarial Boundary (ϵ) | Visible to Thermal (r=1) | Thermal to Visible (r=1) |
|---|---|---|
| - | 70.0% | 70.5% |
| 2 | 32.7% | 40.5% |
| 4 | 9.6% | 13.8% |
| 8 | 2.3% | 2.0% |
| 16 | 0.3% | 0.5% |
As the adversarial boundary ϵ increases, the attack effect is significantly enhanced, and the model's rank-1 accuracy drops rapidly.
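To make the role of ϵ explicit, the following sketch shows how a perturbation is commonly kept inside an L∞ ball of radius ϵ before being added to an image. It assumes pixel values on a 0 to 255 scale, so that ϵ = 8 means at most 8 intensity levels per pixel; the function names are illustrative and not taken from the paper.

```python
import numpy as np

def project_perturbation(delta: np.ndarray, epsilon: float) -> np.ndarray:
    """Clip every entry of the perturbation to [-epsilon, epsilon]."""
    return np.clip(delta, -epsilon, epsilon)

def apply_perturbation(image: np.ndarray, delta: np.ndarray, epsilon: float) -> np.ndarray:
    """Add the bounded perturbation and keep the result a valid image (assumed 0-255 range)."""
    adversarial = image + project_perturbation(delta, epsilon)
    return np.clip(adversarial, 0.0, 255.0)
```

A larger ϵ enlarges the set of perturbations the optimizer may choose from, which is consistent with the monotone drop in rank-1 accuracy in the table above, at the cost of a more visible change to the image.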
Reviewer’s Comment:
“What is the overall loss function used in the paper? How are the functions discussed in Section 3.2 and Section 3.4 related ....... ”
Response:
Section 3.2 introduces the framework and overall optimization objective of our method, providing a macro overview. Section 3.4 further details the specific process of perturbation optimization under different modalities. The implementation of this method is quite flexible. Adversarial perturbations can be optimized on a per-sample basis or in batches. Section 3.4 does not explain this, which may have caused confusion regarding its relationship with Section 3.2. Additionally, adjusting the order of Formula 1 will indeed provide a better reading experience. We appreciate the reviewer pointing these out and will make the necessary optimizations. Regarding the flexibility issue, there is further discussion in subsequent questions.
Reviewer’s Comment:
“The SYSU-MM01 common tests are conducted in all-search and indoor-search modes. Which mode is the experiment in Table 1 based on? ”
Response:
For the "Visible to Infrared" scenario, we used the all-search mode. For the "Infrared to Visible" scenario, we used the indoor-search mode. We chose these modes to better align with the typical use cases and challenges presented by each cross-modality scenario. We will clarify this information in the revised manuscript to avoid any confusion. Thank you for pointing this out.
Reviewer’s Comment:
“The authors should select more diverse types of baseline models to verify the generalizability of the method ”
Response:
In the previous response, we evaluated the DEEN baseline on the LLCM dataset and, as requested by other reviewers, assessed the transferability of the proposed method on different network architectures, including IDE, PCB, and ResNet18 (see the response to Reviewer fT1U). These experiments validate the method's generalizability.
Reviewer’s Comment:
“ ...... CMPS attack ...... validated as two separate modules. What is the reason for this? ”
Response:
Separating these components allows us to assess their individual contributions to the overall method, clearly demonstrating their roles in the final performance. Different environments may require varying method components. For example, in sketch-RGB systems, grayscale images may not be effective for attack augmentation. By validating these components independently, we showcase the method's flexibility and adaptability, proving that specific enhancements improve the effectiveness of universal perturbations.
Dear Reviewer tAG1, As the deadline for the discussion period is approaching, we would like to kindly request your feedback on our responses. We wish to express our deepest gratitude for the time and efforts you have dedicated to reviewing our work. We sincerely hope that our detailed responses have adequately addressed all the concerns and suggestions you raised. We fully understand that you may be occupied with other commitments, but we would greatly value any comments you can provide on our responses before the deadline. Thank you for your attention to this matter. We eagerly look forward to hearing from you soon. Sincerely, 9505 Authors
Thank you for this feedback. I still have some concerns regarding the rebuttal.
- The author has provided an explanation for Figure 2, but Figure 2 does not seem to convey the author's intended message.
- The author has explained the impact of the adversarial boundary, but in my view, the choice of adversarial boundary ϵ can control the strength of the attack but does not ensure that the adversarial attack will be effective in all scenarios. Attackers may need to fine-tune for different models and datasets. How does the author address the issue that, in practical applications, it may not be possible to determine an optimal ϵ to ensure the success rate of the attack, which could increase both time and economic costs?
Additionally, I still have concerns that the use of gradients from different modalities to optimize perturbations does not seem to be a new idea. Classic attack methods such as FGSM, MI-FGSM, and PGD are all based on gradient perturbations, suggesting a lack of originality. According to the author's explanation, Aug seems to be as effective as CMPS, meaning that the general random grayscale changes have actually contributed to the improved performance reported in the paper.
We sincerely appreciate you taking the time to respond to our work, and we would like to clarify a few points further.
Regarding your feedback on Figure 2, we take your concerns very seriously. We understand there may still be some dissatisfaction with the current version, and we are willing to spend more time to further optimize it. To maintain a compact and aesthetically pleasing layout, we kept the title of Figure 2 concise, with more detailed explanations provided in the main text. This might have contributed to the figure’s lack of immediate clarity. We will carefully consider your feedback and strive to improve Figure 2 in future iterations.
Research on the security of cross-modality person re-identification is still at an early, exploratory stage, with infrared and thermal imaging being the primary scenarios studied in this field. Regarding your comment that "...... does not ensure that the adversarial attack will be effective in all scenarios": gradient-based methods may indeed be limited in how well attacks generalize and adapt to further scenarios. This is one of our future research directions and, as mentioned in our response to Reviewer Zepn, we will continue to address it in subsequent work.
As for your question that "...... in practical applications, it may not be possible to determine an optimal ϵ to ensure the success rate of the attack, which could increase both time and economic costs?": generally speaking, the larger the ϵ value, the more effective the attack. The ϵ value mainly controls the magnitude of the perturbation. In real-world applications, many tasks require a balance between the visibility of the perturbation and the effectiveness of the attack, so ϵ is typically not set too high. In most tasks, ϵ = 8 is considered reasonable.
The time cost is primarily determined by the number of iterations used when optimizing the adversarial perturbation. Generally, more iterations yield a better-optimized perturbation and a stronger attack. In addition, a significant advantage of universal perturbations over other attack methods is that they do not need to be customized for each sample (i.e., there is no need to redesign the perturbation for different samples). Although optimizing the universal perturbation may take more time initially, applying it afterwards incurs no additional time or economic cost.
Regarding the originality of the method, we would like to clarify further. Although FGSM, MI-FGSM, and PGD are classical gradient-based methods, they are primarily designed for single-modality scenarios and lack an intrinsic mechanism to handle the correlation of information across different modalities. This is precisely the motivation behind our research—through the collaborative optimization of information across different modalities, we propose the Cross-Modality Perturbation Synergy (CMPS) attack, a universal perturbation method that effectively addresses the security challenges in cross-modality ReID. Moreover, another key contribution of our paper lies in the theoretical analysis of the intrinsic mechanism of the proposed method, which demonstrates its effectiveness from a theoretical perspective. We hope that the reviewers will consider this contribution.
As for the Aug method you mentioned, we have incorporated it as an auxiliary enhancement strategy within CMPS to further improve the generalization of cross-modality perturbations. In traditional methods like FGSM, MI-FGSM, and PGD, which typically focus on perturbation optimization in single-modality contexts, there is a lack of an intrinsic mechanism to address cross-modality differences, making it difficult to integrate enhancement methods like Aug. Our work effectively integrates Aug through the CMPS strategy, enabling it to produce more significant results in cross-modality scenarios.
Once again, we thank you for your attention and suggestions on our work. Your feedback has not only helped us refine our current research but also provided direction for our future studies. We hope that our work can contribute to the security research of cross-modality ReID systems.
Sincerely,
9505 Authors
This paper investigates adversarial attacks on cross-modality person re-identification (ReID) systems. It is purportedly the first study to investigate vulnerabilities in cross-modality ReID models, with the goal of evaluating the security of these systems.
To this end, the paper introduces an innovative universal perturbation attack method specifically designed for cross-modality ReID. This new method includes a cross-modality attack augmentation technique, which helps the perturbation synergy attack to better bridge the gap between different modalities. Experimental results on the RegDB and SYSU datasets show that the CMPS attack significantly reduces the accuracy of ReID systems, outperforming existing traditional methods.
Strengths
This paper introduces a novel perturbation attack specifically designed for cross-modality person re-identification, addressing a significant gap in the existing literature. The authors have made a notable contribution by being the first to explore vulnerabilities in these cross-modality systems. The importance of this work lies in its innovative approach to tackling the complex challenges posed by person re-identification systems that use different imaging modalities, such as RGB and infrared. The paper provides rigorous theoretical analysis and thorough experimental validation of the proposed method, demonstrating its significance both theoretically and practically. Additionally, the quality of writing is high, with the authors presenting their ideas clearly and concisely, making it accessible even to readers who are not experts in the field.
Weaknesses
- The difference between VI-ReID and regular ReID is that VI-ReID requires matching pedestrians across different modalities. Theoretically, it is sufficient to attack the RGB features, rendering them inadequate to match the infrared image features. Can focusing solely on attacking RGB features also fulfill the requirements?
- The article states that infrared images are grayscale images, which might not be entirely accurate. Near-infrared images appear visually similar to grayscale images and lack color information.
- While the paper demonstrates the effectiveness of the CMPS attack in bridging the gap between RGB and infrared modalities, extending this research to a broader range of modalities would provide deeper insights into the applicability and limitations of the proposed attack method. Additionally, the current experiments are primarily focused on the RegDB and SYSU datasets. Future studies incorporating more diverse datasets from various scenarios and conditions would help validate the robustness and generalizability of this method in complex environments.
Questions
Have the authors considered how to enhance the robustness of cross-modality ReID systems to resist attacks similar to CMPS? The paper could further expand its impact by investigating the resilience of CMPS attacks against existing adversarial defense mechanisms. By discussing or empirically evaluating the performance of CMPS attacks in the presence of adversarial training or other defense strategies, a deeper understanding of the proposed method's robustness can be gained, guiding the development of more secure cross-modality ReID systems.
Limitations
Yes, the authors have adequately addressed the limitations.
We would like to express our sincere gratitude for your thorough review and valuable feedback on our paper. Your input has undoubtedly played a pivotal role in enhancing the quality and clarity of the manuscript.
Our responses to the individual questions are given below.
Reviewer’s Comment:
“The difference between VI-ReID and regular ReID is that VI-ReID requires matching pedestrians across different modalities. Theoretically, it is sufficient to attack the RGB features, rendering them inadequate to match the infrared image features. Can focusing solely on attacking RGB features also fulfill the requirements?”
Response:
Yes, theoretically, attacking only the RGB features can partially disrupt a VI-ReID system. However, this approach has significant limitations, and the effectiveness of the attack can be relatively constrained. As shown in the table below, 'RGB Attack' refers to using only RGB visible images from the cross-modality dataset RegDB for the attack. In this context, smaller values indicate better attack effectiveness.
AGW Baseline:
| Method | Visible to Thermal (r=1) | Thermal to Visible (r=1) |
|---|---|---|
| AGW baseline | 70.1% | 70.5% |
| RGB attack | 15.7% | 22.3% |
| RGB+Infrared attack | 5.1% | 16.9% |
DDAG Baseline:
| Method | Visible to Thermal (r=1) | Thermal to Visible (r=1) |
|---|---|---|
| DDAG baseline | 69.3% | 68.0% |
| RGB attack | 18.3% | 30.5% |
| RGB+Infrared attack | 4.6% | 19.5% |
It can be seen that when considering only RGB, the effectiveness of the attack is limited and is even more constrained in the “Non-Visible to Visible” testing scenario. While attacking RGB features can be somewhat effective in a single RGB-infrared ReID system, its generalizability is likely poor in more complex real-world scenarios and when considering more modalities. If we focus only on attacking RGB features in the RGB-infrared ReID system, the effectiveness of the attack may significantly diminish when transferring to RGB-thermal ReID systems or even infrared-thermal ReID systems. Therefore, it is more important to research and develop universal perturbations that exhibit good transferability between different modalities to ensure the effectiveness of attacks in various practical applications.
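The difference between an RGB-only attack and one that uses both modalities can be sketched with a toy model. This is only an illustration under strong assumptions, not the CMPS algorithm from the paper: the feature extractor is a fixed linear map `W`, the objective is the mean squared feature distance between perturbed queries and clean same-identity gallery samples, and the step size and all function names are hypothetical. The single shared perturbation `delta` is updated with a signed gradient step from each modality in turn.

```python
import numpy as np

def grad_push_apart(W: np.ndarray, queries: np.ndarray,
                    gallery: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """Gradient w.r.t. delta of mean ||W(q + delta) - W g||^2 over matched pairs."""
    diff = (queries + delta - gallery) @ W.T    # feature differences, shape (n, d)
    return 2.0 * (diff @ W).mean(axis=0)        # back through the linear map, shape (p,)

def synergy_step(W: np.ndarray, rgb: np.ndarray, ir: np.ndarray,
                 delta: np.ndarray, epsilon: float, step: float) -> np.ndarray:
    """One update of the shared perturbation using gradients from both modalities."""
    # First RGB queries against the IR gallery, then IR queries against the RGB gallery.
    for queries, gallery in ((rgb, ir), (ir, rgb)):
        grad = grad_push_apart(W, queries, gallery, delta)
        # Signed ascent step, then projection back into the epsilon ball.
        delta = np.clip(delta + step * np.sign(grad), -epsilon, epsilon)
    return delta
```

Restricting the loop to the `(rgb, ir)` pair alone would correspond to the RGB-only setting in this toy model; the second pass is what lets the infrared-side gradient shape the same perturbation.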
Reviewer’s Comment:
“Have the authors considered how to enhance the robustness of cross-modality ReID systems to resist attacks similar to CMPS? ...... By discussing or empirically evaluating the performance of CMPS attacks in the presence of adversarial training or other defense strategies ...... guiding the development of more secure cross-modality ReID systems.”
Response:
Thank you for your valuable feedback. Our future work will focus on evaluating the effectiveness of CMPS attacks against existing adversarial defense mechanisms. We have conducted some experiments. We injected adversarial perturbations with a magnitude of 8 into the training samples on the RegDB dataset and then performed adversarial training by mixing the adversarial samples with the original samples. The results are shown in the table below:
| Method | Visible to Thermal (r=1) | Thermal to Visible (r=1) |
|---|---|---|
| AGW Baseline | 70.1% | 70.5% |
| Before Adversarial Training | 2.3% | 2.0% |
| After Adversarial Training | 25.7% | 29.4% |
The results indicate that adversarial training remains an effective defense method in cross-modality scenarios.
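The data preparation described in this response, perturbing training samples with magnitude 8 and mixing them with the originals, can be sketched as follows. The mixing ratio is not stated in the response, so this example assumes a simple one-to-one concatenation; pixel values are assumed to lie in the 0 to 255 range and the function name is hypothetical.

```python
import numpy as np

def build_mixed_batch(images: np.ndarray, labels: np.ndarray,
                      delta: np.ndarray, epsilon: float = 8.0):
    """Return clean plus perturbed copies of a batch, with labels duplicated to match."""
    bounded = np.clip(delta, -epsilon, epsilon)          # enforce the perturbation budget
    adversarial = np.clip(images + bounded, 0.0, 255.0)  # same perturbation for every sample
    mixed_images = np.concatenate([images, adversarial], axis=0)
    mixed_labels = np.concatenate([labels, labels], axis=0)
    return mixed_images, mixed_labels
```

The mixed batch would then be passed to the usual ReID training loop, so the model sees each identity both with and without the universal perturbation.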
After reading the author's defense and the opinions and responses of other reviewers, my doubts have been fully resolved. Therefore, I have decided to maintain the current rating.
Dear Reviewer Th4d, As the deadline for the discussion period is approaching, we would like to kindly request your feedback on our responses. We wish to express our deepest gratitude for the time and efforts you have dedicated to reviewing our work. We sincerely hope that our detailed responses have adequately addressed all the concerns and suggestions you raised. We fully understand that you may be occupied with other commitments, but we would greatly value any comments you can provide on our responses before the deadline. Thank you for your attention to this matter. We eagerly look forward to hearing from you soon. Sincerely, 9505 Authors
This paper is the first to explore the security vulnerabilities of cross-modality ReID models and proposes a universal perturbation attack method for cross-modality person re-identification (ReID) systems, called the Cross-Modality Perturbation Synergy (CMPS) attack. This method innovatively utilizes gradient information from both RGB and infrared images to generate a universal perturbation, maintaining its effectiveness across multiple modalities. Experiments conducted on the RegDB and SYSU datasets demonstrate that the CMPS method significantly reduces the accuracy of ReID models, outperforming existing traditional attack methods. This study emphasizes the necessity of considering multiple modalities in the security evaluation of ReID systems and provides new perspectives for future research.
Strengths
This paper makes a notable contribution by addressing security challenges in cross-modality person re-identification (ReID) systems with a pioneering approach. It stands out as the first to explore vulnerabilities in these systems, advancing the understanding and testing of their robustness against adversarial attacks, particularly in under-explored cross-modality scenarios. By focusing on this critical gap, the paper opens new directions for security evaluation in complex multimodal environments.
The novelty of the proposed attack method lies in its innovative use of aggregated feature gradients from different modality images to probe vulnerabilities in cross-modality ReID models. The writing quality is excellent, featuring clear, concise, and well-structured explanations that effectively communicate the paper's concepts. The insights provided are poised to drive improvements in the security of ReID systems.
Weaknesses
- Figure 1 would benefit from using the same set of gallery images for both the before and after comparisons.
- The paper exhibits inconsistencies in referencing "Table" and "Tab," as observed in lines 228 and 229. Standardizing these references would enhance readability and professionalism.
Questions
The paper offers initial insights into the transferability of the proposed attack across various models and datasets. A more thorough investigation into factors influencing perturbation transferability, such as diverse model architectures and characteristics of cross-modality data, would advance the understanding of adversarial attacks in this domain. Such comprehensive research could elucidate the performance of perturbations in different settings, optimizing attack methods and enhancing their practical applicability.
Limitations
Yes, the authors have appropriately addressed the limitations of their study.
We would like to express our sincere gratitude for your thorough review and valuable feedback on our paper. Your input has undoubtedly played a pivotal role in enhancing the quality and clarity of the manuscript.
Our responses to the individual questions are given below.
Reviewer’s Comment :
“A more thorough investigation ...... such as diverse model architectures and characteristics of cross-modality data, would advance the understanding of adversarial attacks in this domain ...... ”
Response:
We conducted adversarial transferability experiments using IDE, PCB, and ResNet18. The rank-1 transfer attack success rates (higher values indicate better transferability of the attack) are shown in the table below:
| Source \ Target Model | IDE (Ours/Col.+Del.) | PCB (Ours/Col.+Del.) | ResNet18 (Ours/Col.+Del.) |
|---|---|---|---|
| IDE | 98.7% / 94.3% | 84.5% / 81.2% | 87.4% / 86.1% |
| PCB | 85.1% / 80.4% | 97.6% / 92.8% | 88.3% / 85.7% |
| ResNet18 | 81.0% / 78.5% | 77.5% / 74.9% | 98.2% / 95.6% |
All models exhibit high success rates for attacks generated against themselves (above 97% in every case), indicating vulnerabilities in recognizing and handling adversarial samples. Overall, the success rate of attacks transferring from one model to another is also relatively high. Specifically, perturbations generated on ResNet18 succeed against ResNet18 itself at 98.2% but transfer slightly less well to the other models (81.0% on IDE and 77.5% on PCB), which may be due to differences in feature representation between ResNet18 and the other models.
For the effectiveness of the proposed method in attacking more diverse modality data, please refer to the response to Reviewer fT1U. The chart shows the transferability of attacks using our method across various modality datasets (SYSU, RegDB, Sketch[1], CnMix).
The Sketch ReID dataset [1] contains 200 individuals, each represented by one sketch and two photographs. The photographs of each individual were captured during daylight using two cross-view cameras. Raw images (or video frames) were manually cropped to ensure that each photograph includes only the specific individual. Additionally, we applied random channel mixing to images from the Market1501 [2] dataset to simulate a new modality dataset, which we refer to as CnMix. Market1501 includes 1,501 pedestrians captured by six cameras (five HD cameras and one low-definition camera).
[1] Lu Pang, Yaowei Wang, Yi-Zhe Song, Tiejun Huang, and Yonghong Tian. Cross-domain adversarial feature learning for sketch re-identification. In Proceedings of the 26th ACM international conference on Multimedia, pages 609-617, 2018.
[2] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In Proceedings of the IEEE international conference on computer vision, pages 1116-1124, 2015.
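The response does not spell out how the random channel mixing for CnMix is performed. One plausible reading, given here purely as an assumption, is that each output channel is a random convex combination of the three input channels. The sketch below implements that reading; the function name and the Dirichlet sampling of the mixing weights are choices made for this example.

```python
import numpy as np

def random_channel_mix(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Mix the channels of an H x W x 3 image with random convex weights."""
    # Each row of `mix` holds non-negative weights summing to 1 for one output channel.
    mix = rng.dirichlet(np.ones(3), size=3)     # shape (3, 3)
    return image @ mix.T                        # output channel j uses row j of `mix`
```

Because the weights are convex, pixel intensities stay within the range of the input, while the color relationships that an RGB-trained model relies on are scrambled, which is what makes the result behave like a new modality.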
Dear Reviewer fT1U, As the deadline for the discussion period is approaching, we would like to kindly request your feedback on our responses. We wish to express our deepest gratitude for the time and efforts you have dedicated to reviewing our work. We sincerely hope that our detailed responses have adequately addressed all the concerns and suggestions you raised. We fully understand that you may be occupied with other commitments, but we would greatly value any comments you can provide on our responses before the deadline. Thank you for your attention to this matter. We eagerly look forward to hearing from you soon. Sincerely, 9505 Authors
Although there is still room for improvement in this paper, considering its technical and theoretical contributions to the safety of cross-modality ReID, I have decided to maintain the current score.
This paper proposes an innovative strategy called Cross-Modality Perturbation Synergy (CMPS) attack, aimed at revealing security vulnerabilities in cross-modality person re-identification (ReID) systems. These systems are crucial in security applications, typically using RGB and infrared imaging to identify individuals under various lighting conditions and camera setups. The study highlights that current security research mainly focuses on single-modality (RGB-based) ReID systems, neglecting the complexities and potential vulnerabilities in cross-modality scenarios.
Strengths
- By focusing on cross-modality security vulnerabilities, this paper fills a critical gap in the field of person re-identification (ReID), where previous research has primarily concentrated on single-modality (RGB-based) studies. This novel perspective is highly significant as it extends the understanding of ReID systems to real-world scenarios where different modalities are frequently used.
- The proposed Cross-Modality Perturbation Synergy (CMPS) attack is innovative, leveraging gradient information from both RGB and infrared images to generate universal perturbations that remain effective across multiple modalities. This approach not only demonstrates theoretical originality but also provides practical value through robust experimental validation.
- The experimental results on widely used datasets such as RegDB and SYSU convincingly demonstrate the superiority of the CMPS method compared to existing traditional attack methods, highlighting the method's effectiveness in reducing the accuracy of ReID systems.
- This paper exhibits strong writing and organizational skills, with logically tight explanations that make complex concepts easy to understand. The comprehensive description and intuitive presentation of the methodology and experimental setup contribute to the paper's clarity and comprehensibility.
Weaknesses
- Although the paper is clearly written, some technical sections are still quite complex. Simplifying these parts would benefit a wider audience. For example, in the section "3.2 Optimizing Loss Functions for Attacking," the derivation process from equations (4) to (11) is rather dense and might be difficult for non-specialist readers to understand. Adding more explanatory text between these equations would make it easier for a broader range of readers to follow.
- The descriptions of "Figure" and "Fig" in lines 53 and 55 of the article are inconsistent, and they need to be carefully checked and unified.
Questions
- The effectiveness of the CMPS attack method has been validated on RGB, infrared, and thermal images. Have the authors considered whether this method can be extended to more modalities, or even any modality? If so, what adjustments would be necessary to the current method?
- The paper presents an interesting model attack scenario where universal perturbations are added to query images to mislead ReID models. Have the authors considered how to deploy such interference in real-world ReID systems?
Limitations
Yes
We would like to express our sincere gratitude for your thorough review and valuable feedback on our paper. Your input has undoubtedly played a pivotal role in enhancing the quality and clarity of the manuscript.
Our responses to the individual questions are given below.
Reviewer’s Comment:
“Have the authors considered whether this method can be extended to more modalities, or even any modality? If so, what adjustments would be necessary to the current method?”
Response:
In our subsequent work, we addressed this issue by expanding adversarial attacks to more modalities using a dual-layer optimization framework. First, we utilized image gradients within each modality to learn universal perturbations, ensuring their effectiveness in the specific modality. Then, in the second optimization layer, we employed evolutionary computation methods to search for shared features across more different image modalities. This layer of evolutionary search aimed to identify sparse perturbations that could be effectively transferred to other modalities, further optimizing and enhancing the cross-modal transferability of the universal perturbations learned in the previous step. The experimental results are shown in the table below. For example, RegDB->SYSU indicates that we optimize the universal perturbation using the RegDB dataset and then transfer it to the SYSU dataset for testing. Thermal images in the RegDB dataset lose more detailed information compared to sketch images and the channel-randomized CnMix images, and its transfer attack performance is also the worst. Therefore, we hypothesize that the smaller the gap between modalities, the better the transferability.
| Transfer (source -> target) | r=1 | r=10 | r=20 | mAP |
|---|---|---|---|---|
| SYSU -> RegDB | 19.62 | 49.70 | 60.23 | 15.93 |
| RegDB -> SYSU | 22.37 | 51.92 | 62.17 | 19.05 |
| SYSU -> CnMix | 15.81 | 31.36 | 40.85 | 15.02 |
| CnMix -> SYSU | 17.23 | 35.78 | 45.62 | 16.74 |
| SYSU -> Sketch | 17.14 | 35.10 | 45.82 | 16.75 |
| Sketch -> SYSU | 18.38 | 36.70 | 47.82 | 17.63 |
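The two-layer idea described in this response can be illustrated with a minimal sketch. It is not our actual implementation: the `tanh` linear map stands in for a ReID feature extractor, the fitness is a plain feature-distortion score, and all names (`features`, `distortion`, `learn_universal`, `evolve_sparse_mask`) and hyperparameters are hypothetical. Layer 1 learns one bounded perturbation by sign-gradient ascent on a source modality; layer 2 runs an elitist evolutionary search for a sparse mask scored on a different modality.

```python
import numpy as np

def features(x, W):
    # Toy stand-in for a ReID feature extractor (assumption, not a real model).
    return np.tanh(x @ W.T)

def distortion(delta, images, W):
    # Mean squared feature shift caused by adding delta to every image.
    diff = features(images + delta, W) - features(images, W)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def learn_universal(images, W, eps=0.1, step=0.01, iters=50, seed=0):
    # Layer 1: sign-gradient ascent on the distortion, one delta for all images.
    rng = np.random.default_rng(seed)
    delta = 0.1 * rng.uniform(-eps, eps, size=images.shape[1])
    for _ in range(iters):
        f_adv = features(images + delta, W)
        diff = f_adv - features(images, W)
        # Analytic gradient of the distortion w.r.t. delta for the tanh model.
        grad = 2.0 * np.mean((diff * (1.0 - f_adv ** 2)) @ W, axis=0)
        delta = np.clip(delta + step * np.sign(grad), -eps, eps)
    return delta

def evolve_sparse_mask(delta, target_images, W, k, generations=100, seed=0):
    # Layer 2: (1+1) evolutionary search over masks with exactly k active
    # entries (requires 0 < k < delta.size), scored on another modality.
    rng = np.random.default_rng(seed)
    mask = np.zeros(delta.size, dtype=bool)
    mask[rng.choice(delta.size, size=k, replace=False)] = True
    best = distortion(delta * mask, target_images, W)
    for _ in range(generations):
        child = mask.copy()
        # Mutation: swap one active index with one inactive index.
        child[rng.choice(np.flatnonzero(mask))] = False
        child[rng.choice(np.flatnonzero(~mask))] = True
        score = distortion(delta * child, target_images, W)
        if score >= best:  # elitist selection: fitness never decreases
            mask, best = child, score
    return mask, best

# Synthetic data standing in for two modalities (purely illustrative).
rng = np.random.default_rng(1)
W = 0.1 * rng.normal(size=(16, 32))
source = rng.uniform(0.0, 1.0, size=(8, 32))
target = rng.uniform(0.0, 1.0, size=(8, 32))
delta = learn_universal(source, W)
mask, score = evolve_sparse_mask(delta, target, W, k=8)
```

The swap mutation keeps the number of active entries fixed, so sparsity is enforced by construction rather than by a penalty term; a real system would replace the toy distortion score with a retrieval-based objective.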
The Sketch ReID dataset [1] contains 200 individuals, each represented by one sketch and two photographs. The photographs of each individual were captured during daylight using two cross-view cameras. Raw images (or video frames) were manually cropped to ensure that each photograph includes only the specific individual. Additionally, we applied random channel mixing to images from the Market1501 [2] dataset to simulate a new modality dataset, which we refer to as CnMix. Market1501 includes 1,501 pedestrians captured by six cameras (five HD cameras and one low-definition camera).
[1] Lu Pang, Yaowei Wang, Yi-Zhe Song, Tiejun Huang, and Yonghong Tian. Cross-domain adversarial feature learning for sketch re-identification. In Proceedings of the 26th ACM International Conference on Multimedia, pages 609-617, 2018.
[2] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, pages 1116-1124, 2015.
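The exact channel-mixing procedure behind CnMix is not specified in this response, so the following is only one plausible reading, given as a hedged sketch: each output channel is a random convex combination of the input R, G, and B channels. The function name `random_channel_mix` and the Dirichlet sampling of the weights are assumptions made for illustration.

```python
import numpy as np

def random_channel_mix(image, rng):
    # image: H x W x 3 float array with values in [0, 1].
    # Each row of `mix` holds non-negative weights that sum to 1, so every
    # output channel is a convex combination of R, G, B (assumed scheme).
    mix = rng.dirichlet(np.ones(3), size=3)  # shape (3, 3)
    return image @ mix.T

rng = np.random.default_rng(0)
rgb = rng.uniform(0.0, 1.0, size=(4, 5, 3))  # synthetic stand-in for a photo
mixed = random_channel_mix(rgb, rng)
```

Because the weights are convex, the mixed image stays in the original value range without clipping.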
Reviewer’s Comment:
“Have the authors considered how to deploy such interference in real-world ReID systems?”
Response:
This research is currently in its early stages, but its potential applications and impacts have garnered significant attention. As the technology matures, attackers may develop more sophisticated methods. For instance, they could embed such perturbations into specially designed stickers or attach them to people's clothing. While these stickers or patterns may appear harmless, they could actually have a significant impact on surveillance systems.
These attacks could exploit the perturbations to deliberately interfere with the images captured by surveillance cameras, thereby degrading the accuracy of person re-identification (ReID) systems. ReID systems are widely used in areas such as public safety, traffic management, and smart retail for automatically identifying and tracking individuals. Attackers could achieve several objectives through these methods:
- Identity Disguise: Attackers could use these perturbations to disguise themselves as others, evading detection by security systems. This could pose serious security risks in high-security locations such as airports, government buildings, or financial institutions.
- Surveillance Interference: The perturbations could also be used to disrupt the normal functioning of surveillance systems, preventing them from accurately capturing and identifying targets. In such cases, security personnel may be unable to identify and respond to potential threats or abnormal situations in time.
- Privacy Protection: On the other hand, some individuals might use these techniques to protect their privacy and avoid being tracked by surveillance systems. This could raise legal and ethical issues, especially in public spaces or monitored private areas.

Overall, as research progresses and technology advances, addressing these potential attacks will become an important area of focus. This will require not only technical innovation but also legal and policy measures to ensure a balance between security and privacy.
The authors have addressed all my questions, and the motivation, innovation, and practicality of the paper are satisfactory, so I maintain my original score.
We thank all the reviewers for their detailed feedback and valuable time. We are pleased to see that the reviewers found our paper insightful (Zepn, fT1U, Th4d, tAG1), appreciated our experimental validation (fT1U, Th4d), and recognized the theoretical originality and practical value of our work (Zepn, Th4d). In this work, we proposed the Cross-Modality Perturbation Synergy (CMPS) attack method, filling a critical gap in the security research of cross-modality person re-identification (ReID) systems.
Regarding the experiments, Zepn and tAG1 suggested validating the transferability of the perturbations across more diverse model architectures and cross-modality datasets. Therefore, we will supplement experiments in this area. Additionally, Th4d expressed doubts about whether attacking only RGB features would be sufficient, and we will also supplement relevant experiments to address this point. Furthermore, tAG1 pointed out some inconsistencies and confusion in our presentation. We have carefully reviewed these issues and addressed them in our individual responses to ensure clarity and consistency in the revised manuscript.
Based on the reviewers' comments, we are conducting further experiments and will report the results and make certain revisions to the draft in the coming days to incorporate the reviewers' feedback.
The authors propose the CMPS attack, a perturbation method designed for cross-modality person ReID. It leverages aggregated feature gradients from RGB and IR images to generate cross-modality perturbations. Four experts in the field reviewed this paper. Their recommendations are Accept, Weak Accept, Weak Accept, and Borderline Reject. Overall, the reviewers appreciated the paper because it is well written and organized, and it addresses the important and unexplored problem of adversarial attacks in visible-infrared person ReID. The paper provides an approach with practical significance. It also demonstrates the CMPS method’s effectiveness through comprehensive experimental evaluations on the challenging RegDB, SYSU, LLCM, and DEEN datasets, and offers valuable insights for future improvements in cross-modality ReID robustness.
Based on the reviewers’ feedback and the authors’ satisfactory rebuttal, which addresses most concerns, I recommend the paper for acceptance. However, the reviewers have raised some issues and concerns in their Weaknesses sections that should be addressed in the final camera-ready version of the paper. In particular, the authors should provide a more detailed explanation of Figure 2 and a justification for the selected ϵ value. The authors should also highlight the technical novelty of their approach to make clear that they have not merely applied the classic FGSM and PGD methods and integrated relatively conventional strategies into visible-infrared ReID models without exploring the specific characteristics of this task. Finally, some reviewers suggested validating the transferability of the perturbations across more diverse model architectures and cross-modality datasets. The authors are encouraged to make the necessary changes to the best of their ability. We congratulate the authors on the acceptance of their paper!