GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models
Abstract
Reviews and Discussion
This paper proposes a data-independent metric called GREAT Score for evaluating adversarial robustness. To be specific, GREAT Score is calculated as the mean of the gaps in the likelihood of model prediction between the ground-truth class and the most likely class other than the ground-truth class achieved by the samples generated from a generative model. The empirical results seem to validate the effectiveness of GREAT Score in evaluating robustness.
Strengths
- The GREAT Score is data-independent, which makes it applicable to privacy-sensitive black-box evaluation.
- The GREAT Score is less sensitive to data points compared to AutoAttack.
Weaknesses
- The definition of robustness in Eq. (1) and Eq. (4) seems to be confusing and possibly wrong. In Eq. (1), $\Delta_{\min}(x)$ is defined as the minimal perturbation of a sample-label pair $(x, y)$ causing the change of the top-1 class prediction. Then, I understand the GREAT Score is a lower bound of $\Delta_{\min}(x)$. However, in Eq. (3), the score is defined as the gap between two probabilities. Therefore, I am confused about how the gap between two probabilities measures the minimal perturbation.
- The ranking of models in terms of adversarial robustness is unclear and confusing. As shown in Table 1, the rank of a model based on GREAT Score can differ from its rank based on AutoAttack. Besides, the rank of a model based on GREAT Score can also differ from its rank based on the calibrated GREAT Score. Therefore, researchers would be confused about the genuine rank of a model in terms of its robustness.
- How can you guarantee that the gap between the two probabilities achieved by the generated data is achieved in the worst case? As far as I can see, the authors directly used the generated samples from the generative models, which are supposed to be natural samples instead of adversarial samples.
- The tables and figures should be resized to be larger, which would make them easier to read.
Questions
Please refer to my comments in “Weaknesses”.
Thanks for the authors' replies. My questions have been partially resolved. However, I still have the following concerns:
- Could you provide any high-level explanation of how the gap between two probabilities (i.e., the GREAT Score) measures the lower bound of the minimal perturbation ($\Delta_{\min}$)? The scale of the two probabilities' gap is $[0, 1]$, right? If your theorem holds, the lower bound of the minimal perturbation belongs to $[0, \sqrt{\pi/2}]$, right? It is weird that the minimal distance should be related to the dimensionality whereas the gap between two probabilities is irrelevant to the dimensionality.
- You mean that the "uncalibrated score is the default choice". However, it is still confusing for me to judge which model is more robust based on your metric. For example, in terms of the uncalibrated score, Gowal_extra (0.534) is more robust than Rebuffi_extra (0.507); in terms of AutoAttack accuracy, Gowal_extra (80.53) is less robust than Rebuffi_extra (82.32). Further, in terms of the calibrated score, Rebuffi_28_ddpm (1.214) is more robust than Rebuffi_70_ddpm (1.208); in terms of AutoAttack accuracy, Rebuffi_28_ddpm (78.80) is less robust than Rebuffi_70_ddpm (80.42). I believe this inconsistency makes the rank of the model unclear in terms of robustness, which degrades the soundness and contribution.
We really appreciate the reviewer's active engagement in the discussion! Please find our point-by-point responses below.
- The reviewer is correct that our theorem suggests the lower bound of the minimal perturbation belongs to $[0, \sqrt{\pi/2}]$ when the input data range is in $[0, 1]$. This result is obtained by upper-bounding the gap term by its maximal value of 1. We would like to provide a high-level explanation of how we build the connection between the proposed metric and the certified lower bound, based on our derivations in Sec. 6.4.
- First of all, we start from the Lipschitz continuity in gradient form (Lemma 1) to quantify the maximal change on the model output concerning input perturbations.
- Then, based on Lemma 1, Lemma 2 shows a certified local perturbation bound involving the gap and a local Lipschitz term.
- In Lemma 3, we show that applying Stein's lemma gives a closed-form expression of the Lipschitz term when data are generated from generative models taking Gaussian inputs.
- Putting all things together, we obtain the certified lower bound in the form of eq. (3). Please see the paragraph below eq. (14) for details.
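Schematically, the chain of bounds can be summarized as follows (a reconstruction sketch using the notation above, where $L$ denotes the Lipschitz constant of the smoothed classifier; the precise statement is in Sec. 6.4):

$$\Delta_{\min}(x) \;\ge\; \frac{\max\{f_y(x) - \max_{k \neq y} f_k(x),\, 0\}}{L} \quad \text{(Lemmas 1--2)}, \qquad L \;\le\; \sqrt{2/\pi} \quad \text{(Lemma 3, via Stein's lemma)},$$

so dividing the gap by $\sqrt{2/\pi}$ yields the $\sqrt{\pi/2}$ scaling in eq. (3) and the $[0, \sqrt{\pi/2}]$ range discussed above.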
The reviewer's intuition is also correct that the minimal perturbation should scale with the dimension. We believe this dimension dependency is implicitly characterized in the gap term and the generative models that generate input data of the same dimension.
- We note that a root cause of the discrepancy in ranking stems from the fundamental differences in the evaluation methodology of certified robustness (margin-based assessment) versus empirical robustness (norm-bounded attack-accuracy-based assessment). Intuitively, due to different evaluation objectives, certified bounds do not need to be fully consistent with the trends of attack accuracy, especially when the attack evaluation is sensitive to the hyperparameter of the perturbation level. As discussed in Sec. 4.3,
GREAT Score measures classification margin, while AutoAttack measures accuracy under a fixed perturbation budget ϵ. AutoAttack’s ranking will change if we use different ϵ values... Therefore, we note that GREAT Score and AutoAttack are complementary evaluation metrics and they don’t need to match perfectly.
We also note we are not advocating GREAT Score as an alternative to AutoAttack. Rather, we would argue GREAT Score and RobustBench/AutoAttack are complementary evaluation metrics, as together they consider broad data distributions and cover both margin-based and norm-bounded robustness evaluations.
We hope these explanations clarify your concerns. While we know the discussion deadline is approaching, we are at the reviewer's disposal to address any remaining concerns the reviewer may have.
Thanks for the authors' responses. While I appreciate the authors' dedicated efforts, I have the following two major concerns; thus I have decided to keep my score.
- I appreciate the detailed explanation of the proof. However, I am still not fully convinced of the correctness of the lower bound since it is irrelevant to the dimensionality. It seems that when GREAT Score is applied to datasets of different resolutions, the margin-based perturbation has a consistent lower bound. Intuitively, I find it difficult to believe the correctness.
- My major concern is that the GREAT Score and AutoAttack accuracy make the robustness ranking ambiguous. If we have an assumption on the adversarial budget, AutoAttack accuracy provides a deterministic rank, and we can say that the model is good against a particular adversary. The uncalibrated and calibrated GREAT Scores provide two other deterministic ranks, which are irrelevant to the adversary's budget. However, when there exist some contradictions between these different ranks of GREAT Score and the ranks of AutoAttack, it makes the comparison more difficult/ambiguous and makes it harder for researchers to reach a conclusion. Therefore, I doubt the value of the proposed method.
Dear Reviewer fKuC,
We appreciate your detailed questions and comments, which help us better understand and address your raised concerns. Based on the review comments, we believe there are some critical misunderstandings about our theoretical results and responses. We would like to use this opportunity to clarify your concerns.
- First of all, we did not claim our lower bound is irrelevant to the dimensionality. In fact, our proposed metric in eq. (3) is not dimension-agnostic, because the gap term is present, and the gap term depends on the model output evaluated on inputs of the full data dimension. We feel that the reviewer misses this dependency of the model output on the input dimensionality. In [R1], the authors prove that neural network classifiers possess an inherent input-dimension dependency in adversarial robustness, where the minimal perturbation in $\ell_2$ norm scales inversely with the square root of the input dimension. Therefore, applying this result to the gap term in our proposed metric in eq. (3), our proposed metric is indeed dimension-dependent and diminishes with the input dimension, as the reviewer expected.
[R1] Simon-Gabriel, Carl-Johann, et al. "First-order adversarial vulnerability of neural networks and input dimension." International conference on machine learning. PMLR, 2019.
- We believe the reviewer's confusion comes from the different robustness evaluation methodologies between certified robustness and empirical robustness. We believe the reviewer would agree that a model having high accuracy against AutoAttack (or any state-of-the-art attack) does not guarantee that the model is attack-proof. In fact, empirical robustness only evaluates the accuracy against generated adversarial examples; it does not give any guarantee that there won't exist other adversarial examples within the same attack budget. On the other hand, certified robustness guarantees that there won't exist any adversarial examples within the certified region. Our method lies in the certified robustness category, and therefore we have been stressing that both perspectives (AutoAttack and ours) are complementary, rather than competing. Due to their differences in the robustness evaluation criterion, one should not expect the model rankings to be completely the same. However, a high correlation between certified and empirical robustness methods should be expected if both are reasonably effective.
On calibrated vs. uncalibrated GREAT Scores, we note that both are valid certified lower bounds. However, since the calibrated GREAT Score leverages additional information on actual attack perturbations, its certified range is larger than that of the uncalibrated score. We believe their use cases are straightforward and distinct, depending on the information available at the time of certified robustness evaluation. Again, this is another way to demonstrate the complementary benefits of certified and empirical robustness.
We express sincere gratitude for your valuable feedback and constructive comments.
Question 1: The definition of robustness in Eq. (1) and Eq. (4) seems to be confusing and possibly wrong; how does the gap between two probabilities measure the minimal perturbation?
The reviewer's question touches on exactly one of the major contributions of our work: establishing the connection between the defined metric in Eq. (1) and the derived certified lower bound in Appendix A.2.3. The novelty lies in the use of a generative model taking random Gaussian noise as input and adapting Stein's Lemma to derive a global certified lower bound on the minimal perturbation defined in Eq. (1). Please see the proof of Theorem 1 in Appendix 6.4 for the detailed derivation. If the reviewer still feels uncertain about this connection, we are willing to engage in further discussion to address the concern.
Question 2: Determining which rank should be used.
The criterion depends on whether or not additional information is available to a user at the time of evaluating global robustness. We would like to emphasize that the calibrated GREAT score ranking is expected to be better than the uncalibrated one since calibration improves the rankings by providing additional information (e.g. calibrated with CW attack distortion). In principle, if no additional information is provided, then the uncalibrated score is the default choice. If some auxiliary information (e.g., attack perturbations) is available, then we suggest the use of a calibrated score.
Question 3: How can you guarantee the gap between the two probabilities achieved by the generated data is achieved in the worst case?
Based on the reviewer's comment, we believe there may be some misunderstandings about our guarantee. We would like to use the following points to clarify the reviewer's concern.
- For a generated sample $x$ with ground-truth class $y$, the gap in our context refers to the difference in the prediction probability between the correct class and the most likely (top-1) predicted class other than $y$, where the gap is defined as $f_y(x) - \max_{k \neq y} f_k(x)$. We note that $f_k(x)$ is the prediction probability of class $k$, instead of a probability distribution.
- Our goal is not to generate data that achieves the worst-case "gap" (it is not entirely clear to us what "gap" refers to in the reviewer's comment). Instead, we use our defined gap to derive a certified lower bound on the true global robustness, as proved in Sec. 6.4.
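As a minimal illustration (a Python sketch, not the paper's released code; `probs` is an assumed softmax output vector), the per-sample gap can be computed as:

```python
import numpy as np

def sample_gap(probs: np.ndarray, y: int) -> float:
    """Gap between the true-class probability and the most likely other class."""
    runner_up = np.max(np.delete(probs, y))       # highest probability among classes != y
    return max(float(probs[y] - runner_up), 0.0)  # clipped at 0 for misclassified samples

print(sample_gap(np.array([0.7, 0.2, 0.1]), y=0))  # 0.5
```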
We hope our explanation clarifies the reviewer's concern. We are at the reviewer's disposal to answer any further questions the reviewer may have.
Question 4: The table and figure should be resized to be larger, which makes it easier to read.
In the updated version, we have included new changes requested by reviewers and shortened some parts to remain in 9 pages.
To address the lack of proper global adversarial robustness evaluation, the limitation to white-box settings, and computational inefficiency, this paper proposes GREAT Score for global robustness evaluation of adversarial perturbation using generative models. The authors have introduced and compared the proposed method, but the novelty of this paper is limited. Here are some of my suggestions for improving the quality of this paper.
Strengths
The paper is well written with good typesetting, and the research content is of practical value.
Weaknesses
(1) There are obvious spelling and coincidence errors in the paper.
(2) In the background and related works section, it is suggested to simplify the introduction of generative models and add the description of global robustness evaluation rather than deferring it to Appendix 6.3.
(3) It is recommended to draw flow charts of the algorithm proposed by the author and other algorithms to highlight the innovation of the algorithm.
(4) It is recommended to give a more detailed time complexity calculation.
(5) In the experimental results section, the layout of the experimental results graphs and tables is confusing, and it is recommended to rearrange them.
(6) Appendix chapters need to be arranged according to the order in which they appear in the main text.
(7) Some of the examples in the article need to be replaced to adhere to the previous arguments.
Questions
(1) For other competitive methods, would it not be better to use all the data in the dataset to measure robustness rather than the strategy proposed by the authors?
(2) Does it still make sense to use a lower bound if a true global robustness evaluation can be obtained with high computational efficiency?
Question 6: Appendix chapters need to be arranged according to the order in which they appear in the main text.
We have updated our paper to rearrange the order.
Question 7: Some of the examples in the article need to be replaced to adhere to the previous arguments.
We have rearranged the graphs and appendix.
Question 8: For other competitive methods, it would be better to use all the data in the dataset to measure robustness rather than the strategy proposed by the author?
For a fair comparison, we indeed compared our metric with other methods on the same set of generated samples. For baseline methods, we also reported their robust accuracy on the entire test set. For example, as demonstrated in Table 2, running AutoAttack on both synthetic and original data resulted in a low rank correlation coefficient of 0.7296, suggesting the instability of using AutoAttack for model ranking. Moreover, scalability is a unique strength of our method, as it only requires forward inferences with the predictions made on the generated data samples, whereas attack algorithms are computationally intensive, as discussed in the run-time analysis in Sec. 4.4.
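For reference, such rank correlations can be computed directly (a minimal sketch with hypothetical per-model scores; `scipy.stats.spearmanr` is a standard implementation):

```python
from scipy.stats import spearmanr

# Hypothetical robust accuracies of the same five models under two evaluations.
acc_on_synthetic = [80.5, 82.3, 78.8, 76.1, 70.2]
acc_on_original = [81.0, 81.9, 77.5, 77.0, 69.8]

rho, pval = spearmanr(acc_on_synthetic, acc_on_original)
print(f"Spearman's rank correlation: {rho:.4f}")
```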
Question 9: Does it still make sense to use a lower bound if a true global robustness evaluation can be obtained with high computational efficiency?
Based on the findings presented in the following research, it has been established that the computation of the local minimum perturbation is an NP-hard/NP-complete problem, which means the true global robustness is computationally intractable to obtain. Therefore, computationally efficient lower bounds such as GREAT Score become a viable alternative for global robustness evaluation.
Daskalakis, Constantinos, Stratis Skoulakis, and Manolis Zampetakis. "The complexity of constrained min-max optimization." Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing. 2021.
Katz, Guy, et al. "Reluplex: An efficient SMT solver for verifying deep neural networks." Computer Aided Verification: 29th International Conference, CAV 2017, Heidelberg, Germany, July 24-28, 2017, Proceedings, Part I 30. Springer International Publishing, 2017.
We express sincere gratitude for your valuable feedback and constructive comments.
Question 1: There are obvious spelling and coincidence errors in the paper.
We are sorry about the spelling errors and typos. We have run a careful check in the revised version.
Question 2: Simplify the introduction of generative models and add the description of global robustness evaluation.
In the updated version, we have included new changes requested by reviewers and shortened some parts to remain in 9 pages.
Question 3: Draw flow charts of the algorithm.
This is a great suggestion! We have added a flowchart right after the algorithm in Figure 5 of the Appendix.
Question 4: Give a more detailed time complexity calculation.
Here we analyze the time complexity of our algorithm. The time complexity of the GREAT Score algorithm is determined by the number of iterations (generated samples) in the loop, which is denoted as $N_s$. Within each iteration, the algorithm performs operations such as random selection, sampling from a Gaussian distribution, generating samples, and predicting class labels using the classifier. We assume these operations have constant time complexity and absorb them in the big-$O$ notation.
Additionally, the algorithm computes the sample mean of the recorded statistics, which involves summing and dividing the values. As there are $N_s$ values to sum and divide, this step has a time complexity of $O(N_s)$.
Therefore, the overall time complexity of the algorithm can be approximated as $O(N_s)$.
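A structural sketch of this loop (with assumed `generator` and `classifier` callables; not the paper's exact implementation) makes the $O(N_s)$ cost explicit:

```python
import numpy as np

def great_score(generator, classifier, n_samples: int, latent_dim: int) -> float:
    """Mean per-sample gap over n_samples generated inputs: O(n_samples) forward passes."""
    gaps = []
    for _ in range(n_samples):                    # N_s iterations, constant work per iteration
        z = np.random.randn(latent_dim)           # Gaussian latent code
        x, y = generator(z)                       # generated sample and its (conditioned) label
        probs = classifier(x)                     # one forward inference per sample
        runner_up = np.max(np.delete(probs, y))   # most likely class other than y
        gaps.append(max(float(probs[y] - runner_up), 0.0))
    return float(np.sqrt(np.pi / 2) * np.mean(gaps))  # sqrt(pi/2) scaling per Theorem 1
```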
For CW attack, the optimization process iteratively uses gradients to find the adversarial perturbation. The number of iterations required depends on factors such as the desired level of attack success and the convergence criteria. Each iteration involves computing gradients, updating variables, and evaluating the objective function. It also involves a hyperparameter search stage to adjust the weighted loss function.
Let $T_b$ be the complexity of backpropagation, $I$ be the number of iterative optimization steps, and $B$ be the number of binary search steps for the hyperparameter. The dominant computation complexity of the CW attack for $N_s$ samples is in the order of $O(N_s \cdot B \cdot I \cdot T_b)$. Normally, $I$ is set to 1000 and $B$ is set to 9. Therefore, the CW attack algorithm is much more time-consuming than ours. In the table below, we include a run-time comparison of GREAT Score and the CW attack on the same 500 examples for the Rebuffi_extra and Gowal_extra models.
| Model Name | GREAT Score (seconds per sample) | CW Attack (seconds per sample) |
|---|---|---|
| Rebuffi_extra | 0.038 | 11.376 |
| Gowal_extra | 0.034 | 11.544 |
Question 5: The layout of the experimental results graphs and tables is confusing.
We understand the reviewer's concern, but the space of our revision is constrained by the ICLR submission guideline that does not allow for an extra page in the revised version. Nonetheless, we have made adjustments to our revised version to include the requested changes by all reviewers and optimize the graph and table presentations.
Thanks for the authors' reply. After considering the comments of other reviewers and the content of the reply, I have decided to maintain my original score.
We thank the reviewer for the feedback. We view the response as a positive sign that your raised concerns were properly addressed. We hope our responses and discussions with all reviewers are instrumental in finalizing your rating.
Dear Reviewer npik,
First of all, we would like to thank you all again for your valuable time and efforts spent reviewing our paper and helping us improve it. We have also revised our paper to reflect our responses to your questions.
As the discussion period is closing, we sincerely look forward to your feedback.
It would be very much appreciated if you could once again help review our responses and let us know if these address your concerns. We are eager to answer any further questions you might have. We strive to improve the paper consistently, and it is our pleasure to have your feedback!
Best regards,
Authors
The authors propose GREAT Score, a privacy-friendly metric that may be used to quickly evaluate the robustness of black-box models. Towards this end, GREAT Score utilizes generative models (e.g., GANs and DDPMs) to generate (potentially label-conditioned) samples which are fed into the black-box model of interest. In this manner, the black-box model's outputs (over the input classes) are gathered and used in the calculation of the GREAT Score. The GREAT Score itself is interesting; it leverages the overlapping input-noise characteristics utilized by both GANs and DDPMs (i.e., zero-mean isotropic Gaussian noise) to calculate a certified lower bound on the true global robustness (with respect to the underlying generative model used). GREAT Score may thus be used either as a standalone auditing strategy for black-box models, or in conjunction with computationally intensive benchmarks which directly test robustness via adversarial attacks (e.g., RobustBench or specific attacks, such as AutoAttack, FGSM, PGD, etc.). One limitation of the presented work is that GREAT Score's theoretical guarantees only apply to $\ell_2$-based attacks; its behavior under other widely used distance metrics ($\ell_0$ and $\ell_\infty$) requires further follow-up work.
Strengths
The overall writing and derivation of the GREAT Score itself are easy to follow and intuitive. Furthermore, the authors do a good job of extensively testing GREAT Score, empirically contrasting its results to RobustBench, and demonstrating its use across a suite of face recognition APIs. While the theoretical results are interesting, the exhaustive experiments are impressive and convincing of the utility of this score for measuring the robustness of black-box models.
Weaknesses
Given the extensive research concerning robustness with respect to $\ell_0$/$\ell_2$/$\ell_\infty$-based adversarial attacks, GREAT Score's limitation to $\ell_2$-based attacks is a major weakness. As only $\ell_2$-based attacks are considered in the paper, either GREAT Score must be extended to cover $\ell_0$/$\ell_\infty$-based attacks (which, the authors note, is unlikely under the current derivation) or follow-up work must explore how effective GREAT Score is at measuring robustness against $\ell_0$- and $\ell_\infty$-based attacks.
The paper's description of some relevant work is either terse or lacking. For instance:
- "using the Rebuffi_extra model (Rebuffi et al., 2021)" <- please give a short inline description of this model
- "successful adversarial perturbations (an upper bound on the minimal perturbation of each sample) returned by any norm-minimization adversarial attack method such as the CW attack (Carlini & Wagner, 2017)" <- Please give a brief description of the CW attack
- In Table 1, please add a column showing which generative model is being evaluated.
Furthermore, GREAT Score's role as a lower bound for the CW distortion is noted throughout the paper. However, this is not entirely clear in the current version of the paper. Please explain that the CW ($\ell_2$) attack solves an optimization objective which results in low $\ell_2$ distortion. Also, it is important to note why GREAT Score serves as a lower bound for the CW attack distortion; Equation 3 is equivalent to the non-$\ell_2$ term of the CW ($\ell_2$) attack, with $c = \sqrt{\pi/2}$ (in practice, CW determines $c$ via a grid search). Thus, assuming $c \ge \sqrt{\pi/2}$, GREAT Score trivially lower bounds the CW $\ell_2$ attack distortion.
Questions
- "Moreover, as a byproduct of using generative models, our adversarial robustness evaluation procedure is executed with only synthetically generated data instead of real data, which is particularly appealing to privacy-aware robustness assessment schemes, e.g., remote robustness evaluation or auditing by a third party with restricted access to data and model." <- Please note that this is not a panacea, i.e., see:
  - Carlini, Nicolas, et al. "Extracting training data from diffusion models." 32nd USENIX Security Symposium (USENIX Security 23). 2023.
  - Duan, Jinhao, et al. "Are diffusion models vulnerable to membership inference attacks?" arXiv preprint arXiv:2302.01316 (2023).
- "In Figure 2, we compare the cumulative robust accuracy (RA)" <- RA is used before its definition on page 7; please define RA on its first use
- "using the Rebuffi_extra model (Rebuffi et al., 2021)" <- please give a short inline description of this model
- "DMs reverse the forward process and implement a sampling process from Gaussian noises to reconstruct the true samples" <- Please mention that this is done by solving a stochastic differential equation.
Question 3.2: Explain why the GREAT score serves as a lower bound for the CW attack distortion.
We would like to clarify that the derivation of GREAT Score is independent of the CW attack, and GREAT Score is not an attack algorithm that finds adversarial perturbations. In fact, we proved in Sec. 3.1 that GREAT Score is a lower bound of the minimal perturbation, making it a natural lower bound for all successful adversarial perturbations, including those found by the CW attack.
We further emphasize their differences using the following points:
- Although our equation (3) may look similar to the second term of the CW attack objective, the only similarity is that both evaluate the classification margin criterion for misclassification. Note that our metric does not involve any norm loss term as is used in the CW attack.
- Another notable difference is that the CW attack aims to find a perturbation $\delta$, whereas our metric uses a generated sample from a generative model for global robustness evaluation. Our metric provides a certified perturbation range and does not involve the design of attack perturbations.
- In Theorem 1, we showed that GREAT Score serves as a certified lower bound of the true robustness (minimal perturbation), whereas the CW attack (and any other attack method) provides an upper bound of the minimal perturbation.
Question 4: Privacy-aware question for training data.
Thank you for your advice. To our understanding, the mentioned paper discloses the risk of leaking private training data through model queries at inference time. However, our statement is about preventing the use of private and sensitive data to evaluate/audit a model. The scope is different from the privacy of training data. For example, when evaluating online facial recognition APIs (as discussed in Sec. 4.5), using our generated samples is more privacy-preserving than using real private and sensitive facial images. On the other hand, in this context, the mentioned paper is interested in finding private facial images used to train those APIs, which is beyond our scope.
Question 5: "In Figure 2, we compare the cumulative robust accuracy (RA)" <- RA is used before its definition on page 7; please define on the first use of the acronym RA.
We have now defined it in the summary of classifiers in the RobustBench part.
Question 6: "using the Rebuffi_extra model (Rebuffi et al., 2021)" <- please give a short inline description of this model.
We have added it to the updated version.
Question 7: "DMs reverse the forward process and implement a sampling process from Gaussian noises to reconstruct the true samples" <- Please mention that this is done by solving a stochastic differential equation.
We have added the description in the updated version.
I thank the authors for their time and detailed responses, I have read all other reviews and rebuttals.
We thank the reviewer for the positive feedback. Your review comments and suggestions are instrumental to our work.
We express sincere gratitude for your valuable feedback and constructive comments.
Question 1: Limitation to $\ell_2$-based attacks.
As stated and explained in Sec. 5, we acknowledge that GREAT Score focuses on the $\ell_2$ norm, due to the restriction of extending Stein's Lemma to obtain a closed-form global Lipschitz constant in other norms. Please see Lemma 4 in Appendix A.3 for details about the derivation.
Nonetheless, we hope the new insights and contributions from this paper, such as global certified robustness using generative models and its use in evaluating the robustness of online black-box APIs can outweigh this limitation.
Question 2: The paper's description of some relevant work is either terse or lacking.
We understand the reviewer's concern, but our revision is constrained by the ICLR submission guideline that does not allow for an extra page in the revised version. Nonetheless, we have made adjustments to our revised version to expand the descriptions of related work in the main paper. We also note that we have provided extended discussion in the supplementary material.
Question 3: Please explain that the CW ($\ell_2$) attack solves an optimization objective which results in low $\ell_2$ distortion. Also, it is important to note why the GREAT Score serves as a lower bound for the CW attack distortion.
Question 3.1: Explain how the CW ($\ell_2$) attack solves an optimization objective that results in low $\ell_2$ distortion.
Using our notation, consider a $K$-way classifier $f$. Let $x$ be a data sample and $y$ be its top-1 classification label. Denote $\delta$ as the adversarial perturbation. The untargeted CW attack ($\ell_2$ norm) solves the following optimization objective:

$$\min_{\delta} \; \|\delta\|_2^2 + c \cdot \max\{f_y(x+\delta) - \max_{k \neq y} f_k(x+\delta),\; 0\},$$

where $f_k(\cdot)$ is the prediction probability of the $k$-th class, and $c > 0$ is a hyperparameter.
If the second term of the objective function is zero, then by definition $x+\delta$ is a perturbed sample of $x$ causing label flipping. In the CW attack implementation, given $c$, it uses a gradient-based optimization solver and runs it for $I$ iterations to solve for the smallest $\delta$ that makes the second term zero. Then, it performs a binary search on $c$ over $B$ trials (in each trial, another round of $I$ iterations is executed) to ensure the second term remains zero while aiming to minimize the first term (the square of the perturbation level). Finally, the attack algorithm returns the smallest perturbation $\delta$ found in the entire process such that the label flips (equivalently, when the second term is 0). If no solution is found, one should increase either $I$ or $B$.
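To make the source of the $B \times I$ iteration cost explicit, here is a structural sketch (the `attack_step` gradient update is a hypothetical placeholder; this is not the reference CW implementation):

```python
import numpy as np

def cw_l2_search(x, y, f, attack_step, c_lo=1e-3, c_hi=1e2, B=9, I=1000):
    """Binary search over c with I gradient steps per trial: O(B * I) backward passes."""
    best = None
    for _ in range(B):                              # B binary-search trials on c
        c = np.sqrt(c_lo * c_hi)                    # midpoint of the search range (one choice)
        delta = np.zeros_like(x)
        for _ in range(I):                          # I optimization iterations for this c
            delta = attack_step(x, y, f, delta, c)  # one gradient update on the CW objective
        if np.argmax(f(x + delta)) != y:            # second term is zero: attack succeeded
            if best is None or np.linalg.norm(delta) < np.linalg.norm(best):
                best = delta
            c_hi = c                                # try smaller c to shrink the distortion
        else:
            c_lo = c                                # failed: put more weight on the gap term
    return best                                     # smallest successful perturbation found
```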
The dominant computation complexity of the CW attack for $N_s$ samples is in the order of $O(N_s \cdot B \cdot I \cdot T_b)$, where $T_b$ is the complexity of backpropagation. On the other hand, the complexity of GREAT Score is $O(N_s \cdot T_f)$, where $N_s$ is the number of generated samples and $T_f$ is the cost of one forward inference through the generative model and the classifier. A run-time comparison is shown in Sec. 4.4. In the table below, we include a run-time comparison of GREAT Score and the CW attack on the same 500 examples for the Rebuffi_extra and Gowal_extra models.
| Model Name | GREAT Score (seconds per sample) | CW Attack (seconds per sample) |
|---|---|---|
| Rebuffi_extra | 0.038 | 11.376 |
| Gowal_extra | 0.034 | 11.544 |
The paper focuses on an important problem, "global robustness", which aims to evaluate the robustness of classifiers with respect to the underlying, unknown data distribution. This work can be classified into the family of robustness evaluation algorithms. The authors propose GREAT Score for evaluating the global robustness of classifiers by leveraging generative models. Specifically, the GREAT Score utilizes a generative model to approximate the data distribution and then calculates a certified lower bound on the minimal adversarial perturbation level, averaged over the sampling distribution of the generative model. This allows estimating the distribution-wide robustness without needing the true data distribution.
Strengths
- This problem setting is important. The overwhelming majority of work in the space of robustness evaluation has considered aggregated local robustness statistics over test samples. However, one might argue that these test samples might be biased and unable to represent the true data distribution. The global robustness of models with respect to the entire data space is still under-explored. In this sense, the problem the authors seek to address is relevant and will likely continue to grow in relevance.
- The paper is very well-written and organised. The main insights are well explained, and it is easy to follow. The use of generative models is well-motivated.
Weaknesses
- The authors utilise generative models to estimate models' global robustness and try to provide statistical guarantees, which is claimed as one of the main theorems. My biggest concern comes from this point. Since generative models are not perfect, there exists a notable gap between the generative and true data distributions. We are still worried about the error bound induced by the difference between the underlying data distribution and the generative data distribution. The paper would greatly benefit from including probabilistic guarantees or error bounds on the estimated global robustness with respect to the true data distribution, as opposed to solely with respect to the generative data distribution. Such bounds would significantly enhance the appeal and credibility of the results.
- Table 1 illustrates the comparison between the (calibrated) GREAT Score vs. the minimal distortion found by the CW attack on CIFAR-10. However, the table does not explicitly show the advantages of the GREAT Score compared to other metrics. For example, the difference in CW distortion between Rebuffi_extra and Gowal_extra is pretty large, but the GREAT Score differs by only 0.003.
- Are there any other options besides generative models such as GANs that can approximate the unknown data distribution?
- Figures 1 and 2 should be larger. Similarly, it is hard to get a close look at Tables 1 and 2. The figures and tables on Pages 7 and 8 are not well-constructed.
Questions
Please see the Weaknesses section.
We express sincere gratitude for your valuable feedback and constructive comments.
Question 1: Error bound induced by the difference between the underlying data distribution and generative data distribution.
The reviewer's intuition is correct. In our ablation study of generative models (GMs), we do find that better GMs give a higher ranking correlation. In Figure 3, we show that increasing the Inception Score of GMs can significantly increase the Spearman's rank correlation. Intuitively, a higher Inception Score means the GM better learns the underlying data distribution, resulting in improved ranking consistency in our case. We also discuss the approximation guarantees of some GMs with respect to the true data distribution in Sec. 6.2.
Question 2: Table 1 lacks clear advantages of GREAT score over other metrics despite significant distortion differences.
The main message we wanted to deliver for Table 1 is to show the complete statistics of different methods on all CIFAR-10 models. The results also verify our main theoretical contribution (Theorem 1) that GREAT Scores are indeed certified lower bounds of minimal perturbation, as the perturbations found by CW Attack (an upper bound) are all numerically larger than that of GREAT Score.
The "Advantages" of GREAT Scores are demonstrated in other figures and tables in the same section. For example, high model ranking correlation (Sec. 4.3, Table 2 & Table 3), fast run time (Sec. 4.4 & Fig. 4), and its utility in the evaluation of online black-box APIs (Sec. 4.5, Table 4 & Table 5).
Question 3: Are there any other options besides generative models or GANs, which can approximate the unknown data distribution?
This is a great question! Beyond our discussion in Sec. 6.2, in addition to generative models such as GANs or Diffusion models, another approach commonly used to approximate unknown data distributions is Kernel Density Estimation (KDE). KDE is a non-parametric statistical tool that can provide an estimation of the underlying data distribution based on observed data samples.
Several research papers have explored the application of KDE for approximating hidden data distributions, such as
S. A. Bigdeli, G. Lin, L. A. Dunbar, T. Portenier and M. Zwicker, "Learning Generative Models Using Denoising Density Estimators," in IEEE Transactions on Neural Networks and Learning Systems, doi: 10.1109/TNNLS.2023.3308191.
M. Rosenblatt, "Remarks on some nonparametric estimates of a density function." Ann. Math. Stat. 27, 832–837 (1956).
E. Parzen, "On estimation of a probability density function and mode." Ann. Math. Stat. 33, 1065–1076 (1962).
However, it is important to note that KDEs typically perform well in low-dimensional settings and they are known to suffer from the curse of dimensionality. For common datasets like CIFAR-10 and ImageNet, KDE may not be the most effective approach to accurately approximate the underlying data distribution given the limited size of their training data.
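As a minimal sketch of this alternative (hypothetical 2-D data; `scipy.stats.gaussian_kde` is a standard implementation), a KDE can be fit and sampled as follows:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical low-dimensional samples; gaussian_kde expects shape (dim, n_samples).
data = np.random.randn(2, 500)

kde = gaussian_kde(data)        # bandwidth chosen by Scott's rule by default
density = kde(data[:, :5])      # estimated density at query points, shape (5,)
synthetic = kde.resample(100)   # new samples drawn from the estimate, shape (2, 100)
```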
Question 4: Figures 1 and 2 should be larger. Similarly, it is hard to get a close look at Tables 1 and 2. The figures and tables on Pages 7 and 8 are not well-constructed.
Thanks for your advice. Although the ICLR submission guideline does not allow for an extra page in the revised version, we have rearranged the tables and figures to improve the clarity of figures. Please see the updated version.
Dear Reviewer kzkL,
First of all, we would like to thank you all again for your valuable time and efforts spent reviewing our paper and helping us improve it. We have also revised our paper to reflect our responses to your questions.
As the discussion period is closing, we sincerely look forward to your feedback.
It would be very much appreciated if you could once again help review our responses and let us know if these address your concerns. We are eager to answer any further questions you might have. We strive to improve the paper consistently, and it is our pleasure to have your feedback!
In this paper, motivated by the point that prior works on adversarial robustness mainly focus on aggregating local robustness results from a set of data samples to evaluate and rank different models, the authors address this challenge and present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models. The authors' responses have only partially satisfied the reviewers' comments, as there is no consensus between Reviewer fKuC and the authors. In particular, Reviewer fKuC is still not satisfied with the authors' responses that (1) the lower bound is not claimed to be irrelevant to the dimensionality, and (2) the inconsistency of the GREAT Score rankings is due to the difference between certified robustness and empirical robustness. This paper, though it possesses some merits, is suggested to be rejected as the reviewers did not reach a consensus.
Why not a higher score
The authors' responses have only partially satisfied the reviewers' comments, as there is no consensus between Reviewer fKuC and the authors. In particular, Reviewer fKuC is still not satisfied with the authors' responses.
Why not a lower score
Some of the comments have been addressed well.
Reject