PaperHub
Overall rating: 3.3 / 10 (Rejected, 3 reviewers)
Ratings: 3, 6, 1 (min 1, max 6, std. dev. 2.1)
Confidence: 3.7 · Correctness: 2.3 · Contribution: 1.7 · Presentation: 2.3
ICLR 2025

Gaussian Loss Smoothing Enables Certified Training with Tight Convex Relaxations

OpenReview | PDF
Submitted: 2024-09-26 · Updated: 2025-02-05
TL;DR

We show that Gaussian Loss Smoothing allows us to overcome the Paradox of Certified Training and yields better networks when training with tighter bounds.

Abstract

Keywords
Certified Robustness · Adversarial Robustness · Certified Training · Convex Relaxation · Neural Network Verification

Reviews and Discussion

Review
Rating: 3

This paper proposes to use Gaussian Loss Smoothing (GLS) with certified training algorithms such as IBP, CROWN, etc. The motivation is that a previous paper pointed out that methods other than IBP, though having tighter relaxations, suffer from discontinuity, non-smoothness, and perturbation sensitivity. The authors argue that GLS can make the loss surface smoother, using a theoretical result and some plots as evidence. The method is tested on MNIST, CIFAR-10 and TinyImageNet.
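For readers unfamiliar with GLS, the core idea is to replace the loss with its expectation under Gaussian parameter noise. The following is a minimal illustrative sketch, not the paper's implementation; the names `smoothed_loss` and `loss_fn` are our own:

```python
import numpy as np

def smoothed_loss(loss_fn, theta, sigma=0.1, n_samples=64, rng=None):
    """Monte Carlo estimate of the Gaussian-smoothed loss
    E_{eps ~ N(0, sigma^2 I)}[loss_fn(theta + eps)]."""
    rng = np.random.default_rng(rng)
    eps = rng.normal(scale=sigma, size=(n_samples, theta.size))
    return float(np.mean([loss_fn(theta + e) for e in eps]))

# Toy non-smooth loss: |t| has a kink at 0, but its Gaussian-smoothed
# version is smooth; at theta = 0 it approaches E|N(0,1)| = sqrt(2/pi).
value = smoothed_loss(lambda t: np.abs(t).sum(), np.zeros(1),
                      sigma=1.0, n_samples=10_000, rng=0)
```

Even this toy example shows why smoothing matters for certified training: the kinked loss becomes differentiable everywhere after convolution with the Gaussian.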

Strengths

  1. The writing is fine. There is little difficulty understanding the paper.

Weaknesses

  1. Contribution: This paper proposes to apply GLS to existing certified training methods such as IBP and DeepPoly, and one can use PGPE or RGS to compute the GLS. Note that, however, all of IBP, DeepPoly, GLS, PGPE and RGS are existing methods. Theorem 3.1 is a direct application of an existing result. So if I understand correctly, the only novelty of this submission is combining these things together, and thus the contribution is incremental.
  2. Performance: How does the proposed method perform? From Table 1, it seems that the proposed method (PGPE) is not significantly better than the standard method (GRAD), if not worse. It is true that PGPE makes DeepPoly better than IBP, but the way it does that is to make IBP worse, not making DeepPoly better. On MNIST (0.3) and CIFAR-10 (8/255), the best PGPE method is worse than the best GRAD method. On the other two settings, it is only slightly better. For all settings, IBP-PGPE is worse than IBP-GRAD (standard). Thus, this result suggests that one probably should not use PGPE. There is no evidence that the proposed method works in practice.
  3. Experimental results: Tables 1 and 3 seem to contradict each other. In Table 1, standard IBP on CIFAR-10 (2/255) has natural accuracy 48.05% and certified accuracy 37.69%; in Table 3, the two reported numbers are 54.92% and 45.36%. Probably this is because CNN5 is better than CNN3, but then what is the point of Table 1? Why not always use a bigger model (CNN7 seems even better)? And why not compare with RGS in Table 1? I don't think there is anything preventing you from comparing with RGS in Table 1. The authors report DP-RGS to be the best method in Table 3, but since there are so many problems with the tables I do not trust this result.

Overall, this submission proposes to combine a bunch of existing methods, but the experiments show that this is even worse than the original methods. Thus, I recommend rejecting this submission.

Questions

  1. DeepPoly-GRAD achieves 90.04% certified accuracy on MNIST (0.1) in Table 1; in Table 2 this number is only 68.47%. Is this a typo? I don't believe that changing the model size can make such a big difference; plus, only DeepPoly-GRAD has such a huge drop in performance.
  2. Is it correct that you are only changing the loss of training the model, but you are not changing the way of certifying the model?
Comment

Q: In Table 1, standard IBP on CIFAR-10 (2/255) has natural accuracy 48.05% and certified accuracy 37.69%; in Table 3, the two reported numbers are 54.92% and 45.36%. Is this solely because CNN5 is better than CNN3? Why not always use a bigger model, e.g., CNN7?

The discrepancy between the results in Tables 1 and 3 is indeed caused by a difference in network size and capacity. Specifically, CNN3 has ~5k parameters for MNIST and ~7k parameters for CIFAR-10, while CNN5 has 166k parameters for MNIST and 281k parameters for CIFAR-10. In general, using larger models improves performance, and for this reason we experimented with RGS on CNN5, as we were not able to scale PGPE to this architecture. Moreover, our results in Table 3 also show that further increasing network capacity (from CNN5 to CNN5-Large) also improves performance. Unfortunately, our best-performing method on this architecture (DP-RGS) is unable to scale to the SOTA architecture CNN7 because of memory constraints: training even with DeepPoly-GRAD on CNN7 with a batch size of 1 requires ~100 GB of VRAM, and even using high-end GPUs, the training time would be prohibitively long. Note that adding RGS would only increase the training time by a factor of 2 while using the same amount of memory, thus the main bottleneck comes from the expensive DeepPoly relaxation.

Q: Could you also add RGS results in Table 1?

Sure. We added results for RGS in Table 1 in the revised manuscript.

Q: DeepPoly-GRAD achieves 90.04% certified accuracy on MNIST (0.1) in Table 1; in Table 2 this number is only 68.47%. Is this an error?

We would like to thank the reviewer for the insightful observation. We have checked our data and found there was an error in collecting this number, and the correct number for Table 2 is 82.04%. Note that this corrected number does not alter any of the observations and claims made with respect to this table. The other numbers have been carefully rechecked to avoid other potential number collection errors. This has been fixed in the revised manuscript.

Q: Is it correct that you are only changing the training loss of the model, but you are not changing the way of certifying the model?

Yes. GLS is a modification to the training loss, thus the certification procedure remains the same. In particular, we use MN-BaB, one of the SOTA complete certification algorithms, throughout this work.

References

[1] Lee et al., Towards better understanding of training certifiably robust models against adversarial examples.

[2] Jovanovic et al., On the Paradox of Certified Training.

Comment

We are happy to hear that Reviewer \Rx finds our paper easy to read and understand. In the following, we address all concrete questions raised by Reviewer \Rx.

Q: Is the only novelty of this submission combining existing things together, thus incremental?

The main novelty of this study is a solution to the paradox of certified training. We acknowledge that the concept of GLS and the related algorithms, PGPE and RGS, were developed by previous works. However, these concepts and algorithms were not previously known to be effective against the paradox of certified training. We note that this paradox is considered strongly problematic [1,2], yet no effective solution had been found despite many identified causes. Therefore, the significance of this study is to show that GLS simultaneously addresses all identified causes behind the paradox of certified training, both theoretically (Theorem 3.1) and empirically (through extensive evaluation with PGPE and RGS).
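For reference, the smoothing the response invokes can be written as a Gaussian convolution; this is a paraphrase of the idea behind Theorem 3.1, not the paper's exact statement, and the notation L̃_σ is ours:

```latex
\tilde{L}_\sigma(\theta)
  \;=\; \mathbb{E}_{\epsilon \sim \mathcal{N}(0,\,\sigma^2 I)}\big[L(\theta + \epsilon)\big]
  \;=\; \big(L * \mathcal{N}(0,\,\sigma^2 I)\big)(\theta).
```

Since convolution with a Gaussian density yields an infinitely differentiable function under mild integrability assumptions on L, the smoothed loss is continuous and smooth even when L itself is discontinuous, which is exactly the set of properties identified as missing for tight relaxations.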

Q: Performance-wise, the best PGPE numbers are worse than IBP-GRAD in large perturbation settings. Does DP-PGPE work simply because optimizing with the PGPE algorithm makes IBP worse? Is the improvement achieved in small perturbation settings trivial? Is there any evidence that the proposed method works in practice?

While DeepPoly-PGPE does not outperform IBP-GRAD in large perturbation settings, the method's efficacy cannot be attributed simply to PGPE making IBP worse. We note that the goal of our experiment in Table 1 is not to show that PGPE is the best method one should use for certified training. Instead, Table 1 demonstrates that tight relaxations, such as DeepPoly, can achieve better results than loose relaxations, like IBP, when the undesired optimization barriers related to DeepPoly are mitigated by GLS algorithms such as PGPE. The optimization power of PGPE is clearly lowered because it only estimates low-rank gradients; therefore, the results of IBP-PGPE are always lower than those of IBP-GRAD, since IBP benefits less from the theoretical properties of GLS (IBP is continuous and smoother than tight relaxations, etc.). More importantly, the advantages of GLS instantiated by PGPE outweigh the disadvantages of low-power optimization when training with tight relaxations such as DeepPoly, resulting in better performance for DeepPoly-PGPE compared to DeepPoly-GRAD, which is the main goal.

The performance improvements obtained by PGPE + DeepPoly in Table 1 are not very substantial, and we attribute this to (1) the weak optimization power of PGPE and (2) the limited capacity of the networks used. We note that Table 1 represents a proof-of-concept for the promise of GLS to solve the paradox, while in Table 3 we show that RGS, a much stronger optimization algorithm, applied to much larger networks (CNN5 is two orders of magnitude larger than CNN3) also achieves more substantial improvements over IBP-GRAD and even SOTA grad-based methods.

While our PGPE experiments mostly serve as an empirical validation of the theoretical insight that GLS has the potential to solve the paradox of certified training, our findings also show two main benefits of PGPE with potential practical applicability: (1) PGPE can be used to train networks using even tighter relaxations that produce non-differentiable loss surfaces, further improving performance (Table 2), and (2) when the trained network has a size comparable to the population size used in PGPE, the problem of low-rank gradient optimization fades and PGPE shows improved performance (Table 7 in the Appendix). These benefits do not extend to large-scale practical applications such as computer vision and language processing, but could still enhance both the accuracy and robustness of small-scale networks deployed in safety-critical environments such as medical devices and aircraft control.
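To make the gradient-free training discussed above concrete, a PGPE-style update with symmetric (mirrored) sampling can be sketched as follows. This is a simplified illustration under our own naming (`pgpe_step`, `loss_fn`), not the authors' implementation, and it omits PGPE's adaptive exploration variance:

```python
import numpy as np

def pgpe_step(loss_fn, theta, sigma=0.1, pop_size=32, lr=0.01, rng=None):
    """One PGPE-style update: estimate the gradient of the Gaussian-smoothed
    loss from paired parameter perturbations, without any backpropagation."""
    rng = np.random.default_rng(rng)
    eps = rng.normal(scale=sigma, size=(pop_size, theta.size))
    # Antithetic (mirrored) sampling reduces estimator variance.
    l_pos = np.array([loss_fn(theta + e) for e in eps])
    l_neg = np.array([loss_fn(theta - e) for e in eps])
    grad = ((l_pos - l_neg)[:, None] * eps).mean(axis=0) / (2 * sigma**2)
    return theta - lr * grad

# Toy usage: minimize a quadratic using only loss evaluations.
theta = np.array([3.0])
for _ in range(200):
    theta = pgpe_step(lambda t: (t**2).sum(), theta)
```

Because only loss evaluations are used, the estimator works even when the relaxation makes the loss non-differentiable, at the cost of the weaker, population-limited optimization power the response describes.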

Review
Rating: 6

This paper proposes to use a method (Gaussian Loss Smoothing) to address the paradox of certified training which is caused by the discontinuity/non-smoothness/sensitivity issues of certifiable training with tighter convex relaxations. Moreover, the authors also use a gradient-based method called Randomized Gradient Smoothing (RGS) to scale GLS to larger models.
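For readers unfamiliar with RGS, its core idea can be sketched as averaging ordinary gradients taken at Gaussian-perturbed copies of the parameters. This is an illustrative simplification with hypothetical names (`rgs_gradient`, `grad_fn`), not the paper's code:

```python
import numpy as np

def rgs_gradient(grad_fn, theta, sigma=0.1, n_samples=8, rng=None):
    """Randomized Gradient Smoothing (sketch): average exact gradients at
    Gaussian-perturbed parameter copies. This approximates the gradient of
    the Gaussian-smoothed loss while reusing standard backpropagation,
    which is what allows scaling to larger models than PGPE."""
    rng = np.random.default_rng(rng)
    eps = rng.normal(scale=sigma, size=(n_samples, theta.size))
    return np.mean([grad_fn(theta + e) for e in eps], axis=0)

# Toy example: the gradient of |t| is sign(t), a step function; the
# smoothed gradient at t = 0 averages out to (approximately) zero.
g = rgs_gradient(np.sign, np.array([0.0]), sigma=1.0, n_samples=10_000, rng=0)
```

The contrast with PGPE is that RGS still requires a differentiable loss, but each sample contributes a full-rank gradient rather than a scalar loss evaluation.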

Strengths

I think the following strengths are by themselves enough for acceptance.

  • The paradox of certified training is an important subject to study.
  • The proposed method (GLS) is well-motivated and backed up by theoretical results.
  • The empirical results on small perturbation settings in Table 3 are quite impressive.

Weaknesses

This paper is well-motivated, but has a minor weakness in performance (on large perturbations), detailed as follows:

  • Table 3 (Table 5 in Appendix) shows the results for small (large) perturbation settings. The results for large perturbations are not significant. IBP performs well for large perturbations and other tighter methods do well for small perturbations. This is because of the continuity, smoothness and sensitivity of the loss landscape. Thus, to check the effectiveness of GLS, I think it is crucial to check the results of tighter methods on large perturbation settings. However, setups (i) MNIST ϵ = 0.3 and (ii) CIFAR-10 ϵ = 8/255 show that GLS (or RGS) (i) does not show a significant performance gain (DP-RGS (IBP) vs MTL-IBP; 88.69 vs 88.68) or (ii) shows worse performance (29.25 vs 29.62). This should be discussed in the main text (not in the Appendix). I don't think the performance itself is a reason for rejection, but it needs more discussion.
  • In Table 1, "the more precise DeepPoly bounds now yield the best certified accuracy across all settings, even outperforming accuracy at low perturbation radii", but not for large perturbations (IBP vs DeepPoly-PGPE; 77.23 vs 74.28, 25.72 vs 22.19).
  • In GRAD Training, "IBP dominates the other methods, confirming the paradox of certified training", but in the original paper of CROWN-IBP, CROWN-IBP outperforms IBP for a larger network (see their Table 2 https://arxiv.org/pdf/1906.06316).
  • (For a larger model,) IBP outperforms the other methods (e.g., CROWN-IBP, CAP) for large perturbations, not for small perturbations (e.g., see Table 1 in Lee et al. (2021)). This implies that the paradox plays a more important role for larger perturbations.

Please check the Questions part together. I think unclear presentation is also a weakness.

Questions

  • There is no comparison with CROWN-IBP in Table 3. Why?
  • "We remark that scaling to CNN7 used by the SOTA methods is still infeasible due to the high computational cost of evaluating DeepPoly." How about applying RGS to CROWN-IBP, as it is cheaper than DeepPoly?
  • What do the italic numbers mean in Table 3?
  • DP-RGS appears first in L480, but never defined (seems like DP-RGS = DeepPoly-RGS).
Comment

Q: Could you apply RGS to CROWN-IBP on CNN7?

Sure, we added new experimental results in Appendix D1.

We show that CROWN-IBP-RGS improves over CROWN-IBP-GRAD on CNN7, consistent with our previous result that GLS improves tight relaxations, but we also observe that further work is required to support BatchNorm layers. This is because RGS creates multiple perturbed copies of the original network, so BatchNorm in its original form does not combine naturally with RGS. We also observe that the trends are similar to those observed for the CNN5 architecture, so we would expect the same generalization for DeepPoly training, although it is too costly to run on CNN7. These results showcase that the promise of using GLS for certified training with tighter relaxations also scales to SOTA architectures.

Q: What do the italic numbers mean in Table 3?

In Table 3, italic numbers represent the SOTA results obtained on CNN7, reported by [1]. They represent the best that current IBP-based methods can achieve. We note that these results are based on a model architecture that is more than 10 times larger than the CNN5-L reported in Table 3. We have also clarified this in the revised manuscript.

Q: DP-RGS is not formally defined. Is this DeepPoly-RGS?

Yes, this means DeepPoly-RGS. We have changed all instances of DP-RGS to DeepPoly-RGS in the revised manuscript for clarity.

References

[1] Mao et al., CTBENCH: A Library and Benchmark for Certified Training

[2] Shi et al., Fast Certified Robust Training with Short Warmup

[3] Mueller et al., Certified Training: Small Boxes are All You Need

[4] Mao et al., Connecting Certified and Adversarial Training

[5] De Palma et al., Expressive Losses for Verified Robustness via Convex Combinations

[6] Jovanovic et al., On the Paradox of Certified Training

[7] Xu et al., Automatic Perturbation Analysis for Scalable Certified Robustness and Beyond

Comment

We are happy to hear that Reviewer \Rd considers our studied problem important, the proposed method well-motivated, and the empirical results on small perturbation settings impressive. We are particularly encouraged by the positive opinion of Reviewer \Rd. In the following, we address all concrete questions raised by Reviewer \Rd.

Q: Could you discuss large perturbation settings in the main text rather than appendix? Why is the performance of tight relaxations still worse than IBP even after applying GLS on large perturbation settings?

Yes. We now highlight the conclusions of the large perturbation settings experiments from the appendix in the main text (see line 484). We note that the results table and a more detailed discussion are still included in the appendix due to the page limit.

Indeed, DeepPoly-RGS and DeepPoly-PGPE are still worse than IBP-GRAD in large perturbation settings, despite significant improvement in small perturbation settings, e.g., CIFAR-10 ϵ = 2/255. While the real causes are not yet identified, we propose a few speculative hypotheses. First, RGS and PGPE are off-the-shelf solvers that we revised only minimally, so they may not fully exploit the strength of GLS, and there may exist more effective GLS methods. For example, PGPE has weaker optimization capability when the number of parameters to optimize is high (Table 6), although it can optimize non-differentiable relaxations (Table 2). In this regard, more effective GLS methods need to be developed in future work. Second, large perturbation settings may require more regularization to generalize well. This is partially supported by the fact that none of the SOTA IBP-based methods significantly improves over IBP in large perturbation settings [1]. In this regard, more insight into the inconsistency between small and large perturbation settings needs to be developed in future work.

Q: Line 425-426 states that DeepPoly bounds now yield the best certified accuracy across all settings. How should this be interpreted?

We note that this statement compares different relaxations under the same optimization algorithm, PGPE. Different optimization algorithms have different optimization capabilities, thus different relaxations cannot be directly compared when optimized with different algorithms (discussed in Section 4.1). The main goal of Table 1 (the study related to this statement) is to show that after mitigating all the identified problems (non-smoothness, etc.), tight relaxations (DeepPoly) can achieve better results than loose relaxations (IBP) under the same optimization algorithm (PGPE), thus resolving the paradox of certified training in this setting, rather than to claim that DeepPoly-PGPE is uniformly better than IBP-GRAD (which is only established in small perturbation settings). The key implication is that if more powerful GLS optimization algorithms are developed in the future, DeepPoly has the potential to exceed IBP even in large perturbation settings. We will make this logic clearer in the revised manuscript.

Q: Line 418 states that across all these settings IBP dominates the other methods (CROWN-IBP). Is this in conflict with the original CROWN-IBP paper, which shows it is better than IBP?

They are not in conflict, because IBP has been significantly improved by [2]. Specifically, [2] proposes special initialization and regularization to improve IBP, while CROWN-IBP benefits less from these specialized training tricks [2]. The community [3,4,5] has been using this improved version of IBP, and refers to it as IBP as well, which is also adopted by our paper. In addition, [6] finds that for small networks, IBP performs better than CROWN-IBP even without these tricks, while the original CROWN-IBP paper uses a large network. Therefore, both our and the original paper are correct.

Q: Could you provide a comparison with CROWN-IBP in Table 3?

Yes. We have added CROWN-IBP-RGS results in the table for the CNN5 architecture on MNIST, CIFAR-10, and TinyImageNet. We observe that CROWN-IBP-RGS is better than CROWN-IBP-GRAD in these settings, similar to the case of DeepPoly, as expected.

We note that due to limited time and resources, for TinyImageNet we only ran the Loss-Fusion [7] version of CROWN-IBP. This version is much cheaper (~2x IBP cost) than standard CROWN-IBP, given the large number of classes in TinyImageNet (200 classes, meaning ~200x IBP cost). We will also add results for standard CROWN-IBP in the final manuscript.

Comment

Thank you for the answer. I've never heard of [2].
I still believe that it is crucial to analyze the result on large perturbation settings because the performance on large perturbation explains the effectiveness of the proposed method and understanding of the underlying phenomenon (more than the small perturbation counterpart).

Comment

We thank Reviewer \Rd for their response and are happy to know that we have addressed most of their concerns.

We agree that performance under large perturbations provides critical insights into the effectiveness of certified training methods and sheds light on underlying phenomena that may not be fully captured in small perturbation settings.

Although the gains are less pronounced than in small perturbation settings, we are encouraged that our proposed methods also yield some improvements in large perturbation scenarios (Table 1 shows that on MNIST ϵ = 0.3, both DeepPoly-PGPE and DeepPoly-RGS are better than DeepPoly-GRAD, and on CIFAR-10 ϵ = 8/255, DeepPoly-RGS is better than DeepPoly-GRAD). This highlights both the potential of our approach and the challenges that remain in achieving state-of-the-art performance under large perturbations.

We hope our work motivates further study into this difficult but essential problem. As Reviewer \Rd correctly noted, understanding and improving performance in large perturbation settings is crucial for advancing certified robustness and achieving practical applicability at scale. We see this as an exciting direction for future research, and we are optimistic that continued exploration will yield meaningful advances.

Once again, we appreciate Reviewer \Rd's constructive feedback and the opportunity to clarify our work.

Review
Rating: 1

The paper addresses the variation in performance across certified training methods: although tighter relaxations sometimes reduce performance compared to looser bounds, the authors introduce Gaussian Loss Smoothing (GLS) as a method for smoothing the loss landscape to reduce the discontinuity and sensitivity that are the major hurdles in certified training with tight relaxations.

They provide two realizations of GLS: the first is PGPE, a gradient-free approach based on policy gradients with parameter-based exploration; the second is RGS, a gradient-based approach using randomized gradient smoothing. Experimental results on several datasets illustrate that GLS outperforms current methods relying on tight relaxations; its performance is mainly tested with the DeepPoly relaxation.

Strengths

GLS introduces novelty in applying Gaussian smoothing to the loss landscape, which is strongly problematic in certified training.

Weaknesses

Writing Style

  • Introduction: In some respects, the introduction's treatment of adversarial and certified robustness could be improved. It is abrupt and reads more like a related-work summary.

  • Transitions and Structure: Transitions from one section to another are missing, making the paper hard to follow. Statements such as "While CROWN-IBP is not strictly more or less precise than either IBP or DEEPPOLY" need references or explanations.

  • Undefined Terms: Terms such as "soundness" and "sensitivity" are undefined.

  • Results of experiments: These are presented in a very untidy fashion.

Related Work

  • Definitions: Some more clear definitions can be provided, for instance, adversarial attack and adversarial robustness because the latter is different from certified robustness.

  • Redundant Subsections: The paragraphs "Training for Robustness" and "Adversarial Training" in Section 2.1 seem redundant.

Theoretical Findings

  • Lack of Rigor: Theoretical statements, like Theorem 3.1, are informal and not mathematically precise. References to established results, such as Stein's Lemma [2], would strengthen the proofs.

  • Flaws in the Proofs of the Lemmas: In the proof of Lemma B.1, important terminology and symbols are not defined (for example, δθ, P_{ε₁}, P_{ε₂}, P_{N(0, σ²)}). The authors seem to take the limit δθ → 0, but this is never stated anywhere in the proof, and the integral of the derivative of the loss L′ should be a multi-dimensional integral, as the input space is multi-dimensional. In the proof of Lemma B.2 the simplifications are excessive; the structure of the proof is highly defective in many places and needs a thorough revision.

Experimental Results

  • Lack of Detail: Neither the dataset nor the architecture used in Figure 1 are specified.

  • Performance Gains: Although the experimental results are somewhat improved, they are not very significant compared to previous methods; therefore, this added complexity is questionable in value. Presentation of the standard deviations over multiple runs would have given more robust conclusions since apparent gains are marginal.

Major Concerns

  1. Theoretical Rigor: The proofs are not rigorous; better structure and formalization are required. The use of Stein's Lemma might improve the underlying framework, giving credibility to the theoretical results.

  2. Marginal Gains and Complexity: GLS brings in computational complexity especially with PGPE, while performance benefits are marginal.

  3. Readability and Clarity: The writing style and the structure take away from clarity, making the paper difficult to follow.

[2] Stein, “Estimation of the Mean of a Multivariate Normal Distribution,” The Annals of Statistics, 1981.

Questions

GLS appears to be related to Sharpness-Aware Minimization (SAM) [1], as both approaches aim to smooth the loss landscape for a more regular surface. It would be beneficial to include a discussion or a related work section to clarify this connection in the paper.

[1] Foret et al., “Sharpness-Aware Minimization for Efficiently Improving Generalization,” ICLR, 2021.

Comment

Q: The current introduction is more work-related. Could the authors provide a more detailed introduction about certified robustness?

Sure. Due to the page limit, the current introduction provides sufficient background only on certified training, which is the focus of our work, and does not elaborate on certified robustness in general, which does not directly affect understanding of this work. We will incorporate a more detailed introduction to general certified robustness in the appendix.

Q: What is the meaning of the terms "soundness" and "sensitivity"? Should they be explicitly defined in the paper?

Soundness is a textbook term in certification: a certification is sound if and only if the certified property holds in reality. Sensitivity is defined by [2] as the degree of the output polynomial quotient, i.e., P(x)/Q(x), where P(x) and Q(x) are polynomials. Such a definition is overly obfuscated and complex, only covers a special function family, and serves as a convenient theoretical proxy in [2]. This is why we replace it with a friendlier and more general metric, termed deviation from convexity, as both analyze the optimization difficulty. We will add formal definitions of both in the appendix for interested readers.

Q: Could you give additional references or explanations for the statement "While CROWN-IBP is not strictly more or less precise than either IBP or DeepPoly"?

Yes. This is established by [2], and we will reference it in this statement. Intuitively, this is because DeepPoly is not strictly more precise than IBP; thus CROWN-IBP, which adds a DeepPoly component, is not strictly more precise than IBP, and vice versa.

In practice, even though [2] shows that CROWN-IBP is on average more precise than IBP, there are cases where the result of propagating the same input region with CROWN-IBP neither includes nor is included in the propagation obtained by IBP. The same is true for DeepPoly.
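To make the bound-propagation comparison concrete, IBP through one affine + ReLU layer can be sketched as follows. This is an illustrative toy with hypothetical helper names (`ibp_affine`, `ibp_relu`), not the paper's verifier:

```python
import numpy as np

def ibp_affine(l, u, W, b):
    """Propagate the elementwise box [l, u] through x -> W @ x + b
    using interval arithmetic (the IBP relaxation)."""
    center, radius = (u + l) / 2, (u - l) / 2
    c = W @ center + b
    r = np.abs(W) @ radius
    return c - r, c + r

def ibp_relu(l, u):
    """ReLU is monotone, so the box maps through exactly."""
    return np.maximum(l, 0), np.maximum(u, 0)

# Tiny example: one affine layer followed by ReLU.
W, b = np.array([[1.0, -1.0]]), np.array([0.0])
l, u = ibp_relu(*ibp_affine(np.array([-0.1, -0.1]),
                            np.array([0.1, 0.1]), W, b))
# Pre-activation interval is [-0.2, 0.2]; after ReLU it is [0, 0.2].
```

Tighter relaxations like DeepPoly keep linear symbolic bounds per neuron instead of only boxes; as the response above notes, the resulting regions neither always contain nor are always contained in the IBP boxes.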

Q: Could you include precise definitions for “adversarial attack” and “adversarial robustness” in contrast with “certified robustness”?

Yes. The revised manuscript includes more precise definitions for these terms.

Q: Are paragraph titles "Training for Robustness" and "Adversarial Training" in Section 2.1 redundant?

Indeed, the paragraph titles are slightly overlapping. We have fixed this in the revised manuscript.

Q: Could you provide more experimental details about Figure 1 as it does not specify dataset and architecture?

Figure 1 is an easy-to-understand version of Table 1, presented in the introduction without overwhelming experimental details. These numbers are based on CNN3 networks evaluated on MNIST with ϵ = 0.1. We have specified this in the revised manuscript.

Q: Could you provide results regarding the standard deviations of performance over multiple runs?

Most of our experiments regarding DeepPoly training are very expensive, requiring multiple days on 8 GPUs, which makes providing standard deviation results practically infeasible.

Comment

We thank Reviewer \Rv for their valuable suggestions on our writing. We have addressed all concrete writing problems raised by Reviewer \Rv. If Reviewer \Rv has additional questions regarding the writing, we are happy to address them further.

References

[1] Wen and Ma, How Does Sharpness-Aware Minimization Minimize Sharpness?

[2] Jovanovic et al., On the Paradox of Certified Training

Comment

I thank the authors for their feedback. However, I respectfully maintain that the theorem is not new, and its results are well-established, akin to Stein's lemma and related classical results. While the presentation may offer a different perspective, the underlying principles and conclusions remain consistent with well-known findings. Therefore, I will maintain my current score.

Comment

We thank the reviewer for their thoughtful feedback. While we acknowledge that the theoretical foundation of our work is based on established prior work (we do not seek to reinvent the wheel), we would like to emphasize the significant practical relevance and insights provided by our approach.

Impactful Improvements: Our work addresses the longstanding challenge of overcoming the paradox of certified training—a problem that has remained unresolved despite significant interest from the community. By leveraging well-understood principles in a novel combination and application, we achieve state-of-the-art (SOTA) improvements that we believe are both meaningful and valuable to practitioners in the field.

Practicality Over Complexity: As the reviewer may be aware, the current SOTA method for certified training (MTL-IBP, [1]) is itself a linear combination of two basic ideas proposed almost a decade ago (Adversarial training [2] and IBP training [3]). This underscores an important aspect of progress in our field: significant advancements often come not from introducing highly complex new methods but from insightful refinements and combinations of existing ideas.

Thinking Outside the Box: Our findings are an example of finding solutions to hard and important research questions by borrowing and adapting existing techniques from related problems and fields, rather than by exclusively focusing on developing entirely new methods. This interdisciplinary approach has been central to many breakthroughs in the fields of Machine Learning and Neural Networks (e.g., the adoption of convolution operations from signal processing and neuroscience for computer vision, attention mechanisms inspired by cognitive neuroscience for NLP, and the use of evolutionary algorithms from biology for neural architecture search). These examples highlight how innovative applications of established concepts can drive significant progress in addressing long standing challenges.

Broader Implications for the Community: While novelty is essential, it is equally important to recognize the value of approaches that integrate existing tools in innovative ways to solve long-standing problems. If we, as a community, focus solely on complex methodologies not relying on previous results, we risk undervaluing work that offers practical, impactful advancements.

Final Comment

We hope the reviewer and meta-reviewer recognize that our work exemplifies this philosophy, offering actionable improvements and addressing a persistent problem in our field. While the theoretical results are less novel and only serve as the foundation and motivation of our methods, the practical implications and insights derived from our approach represent a significant step forward.

We respectfully ask the reviewer to consider these aspects and their importance to the community when evaluating our work.

References

[1] De Palma et al., Expressive Losses for Verified Robustness via Convex Combinations, ICLR 2024

[2] Madry et al., Towards Deep Learning Models Resistant to Adversarial Attacks, ICLR 2018

[3] Gowal et al., On the Effectiveness of Interval Bound Propagation for Training Verifiably Robust Models, NeurIPS SECML 2018 Workshop

Comment

We are happy to hear that Reviewer v31k considers our work to be novel and agrees that we study a very important problem. In the following, we address all concrete questions raised by Reviewer v31k.

Q: The proofs are not fully rigorous. Could the authors provide a better structure and formalization in the proof?

We have completely rewritten the full proof. The new proof features the following improvements: (1) it extends the original proof from one dimension to d dimensions; (2) using the fact that GLS is equivalent to Gaussian convolution, it avoids explicit integrals, improving readability; (3) notation is defined clearly at the beginning of the proof, avoiding potential confusion; (4) it uses a better proxy for non-convexity than sharpness (named deviation from convexity), which directly measures how non-convex a function is; (5) it reasons more explicitly, e.g., the existence of the n-th derivative is now shown explicitly, and all computations are spelled out in full. If Reviewer v31k finds anything further unclear in the new proof, we are happy to address it.
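For concreteness, the convolution identity invoked in point (2) can be written as follows (a sketch in our own generic notation, where σ is the smoothing scale and d the parameter dimension; the revised manuscript's notation may differ):

```latex
% Gaussian loss smoothing as convolution with a Gaussian kernel
L_\sigma(w)
  := \mathbb{E}_{\epsilon \sim \mathcal{N}(0,\,\sigma^2 I_d)}\!\left[ L(w + \epsilon) \right]
  = (L * \varphi_\sigma)(w),
\qquad
\varphi_\sigma(u) = (2\pi\sigma^2)^{-d/2}\exp\!\left(-\frac{\|u\|_2^2}{2\sigma^2}\right).
```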

Q: GLS brings in computational complexity, especially with PGPE, while performance benefits are marginal. What is the value of GLS (PGPE)?

PGPE and RGS are studied in this paper as concrete optimization algorithms with GLS properties. They demonstrate the potential of GLS in resolving the paradox of certified training. While the current algorithms are not fully satisfactory, e.g., PGPE incurs significant computational overhead, they successfully show that GLS can mitigate the paradox of certified training, and have already resolved it in small-perturbation settings. In this regard, they are a preliminary but successful step towards a long-known but unsolved problem. Further, RGS, as another instantiation of GLS, has significantly less overhead and scales better than PGPE. However, PGPE has its own benefits, as it allows training with non-differentiable relaxations (Table 2). Therefore, we believe both algorithms are worth discussing, and both strengthen our key conclusion: GLS can resolve the currently identified causes of the paradox of certified training.
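As a hedged illustration (our own sketch, not the authors' implementation), a PGPE-style gradient estimate of the Gaussian-smoothed loss in parameter space can look as follows; `loss_fn`, `sigma`, and `n_samples` are placeholder names we introduce here:

```python
import numpy as np

def gls_gradient(loss_fn, w, sigma=0.1, n_samples=100, rng=None):
    """Estimate the gradient of the Gaussian-smoothed loss
    E_{eps ~ N(0, sigma^2 I)}[loss_fn(w + eps)] via antithetic sampling,
    as in PGPE-style evolution strategies. Only loss *values* are needed,
    so this works even when loss_fn is non-differentiable or discontinuous."""
    rng = np.random.default_rng(rng)
    grad = np.zeros_like(w)
    for _ in range(n_samples):
        eps = rng.normal(0.0, sigma, size=w.shape)
        # Antithetic pair (w + eps, w - eps) reduces estimator variance.
        grad += (loss_fn(w + eps) - loss_fn(w - eps)) * eps
    return grad / (2.0 * n_samples * sigma**2)
```

The computational overhead mentioned above is visible directly: each update requires 2 × `n_samples` forward passes through the (relaxed) loss instead of one backward pass.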

Q: What is the connection between sharpness-aware minimization (SAM) and GLS?

The SAM algorithm as used in [1] is fundamentally different from GLS. This is because GLS takes the expectation of the loss over a neighborhood rather than its worst case; in fact, SAM is closer to adversarial training with FGSM than to GLS. In particular, SAM does not resolve the discontinuity problem, while GLS provably does (Theorem 3.1). To see this, consider the threshold function I(x > 0) and an initial point x₀ = 0.1. Any single-point gradient-based method (including SAM) only ever receives zero gradients, and thus cannot optimize it. Therefore, while GLS likely enjoys the benefit of reduced sharpness as well, its benefits are fundamentally different from SAM's.
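The threshold example can be made concrete with a few lines of Python (our own sketch): the raw step loss has zero gradient at x₀ = 0.1, while its Gaussian smoothing has the closed form Φ(x/σ), whose derivative is a normal density and hence strictly positive everywhere:

```python
import math

def step(x):
    """Threshold loss I(x > 0); its gradient is zero almost everywhere,
    so single-point gradient methods (including SAM) cannot optimize it."""
    return 1.0 if x > 0 else 0.0

def smoothed_step(x, sigma):
    """Gaussian loss smoothing of step(): E_{eps ~ N(0, sigma^2)}[step(x + eps)],
    which equals Phi(x / sigma), the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def smoothed_grad(x, sigma):
    """Derivative of Phi(x / sigma): a normal pdf, strictly positive
    everywhere, so gradient descent can leave the flat plateau at x0 = 0.1."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
```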

To confirm this empirically, we apply SAM to IBP and DeepPoly training. Specifically, we update the parameters with gradients computed at the adversarially perturbed network w′ = w + ρ · ∇_w L / ‖∇_w L‖₂. We train with IBP and DeepPoly on MNIST ε = 0.1 with the same CNN3 architecture used in the paper. The results are shown below; all networks are certified with MN-BaB:

| Method | nat (%) | cert (%) |
| --- | --- | --- |
| IBP-Grad | 96.02 | 91.23 |
| IBP-SAM ρ=0.1 | 96.08 | 90.20 |
| IBP-SAM ρ=0.01 | 96.32 | 93.32 |
| IBP-SAM ρ=0.001 | 95.80 | 91.73 |
| DP-Grad | 95.95 | 90.04 |
| DP-SAM ρ=0.1 | 94.22 | 88.39 |
| DP-SAM ρ=0.01 | 96.93 | 92.34 |
| DP-SAM ρ=0.001 | 96.95 | 90.91 |
| DP-PGPE | 97.44 | 91.53 |
| DP-RGS | 97.37 | 91.88 |

We observe that for a well-chosen hyperparameter (ρ = 0.01), SAM does indeed improve performance for IBP and DP. While SAM performs better than PGPE for this very shallow network, it does not address the paradox, as expected from our theoretical analysis above. In particular, IBP-SAM still performs better than DP-SAM uniformly for every choice of ρ. While combining SAM with PGPE or other certified training methods might thus constitute an interesting future direction, SAM does not explain the reranking of approximation methods we observe for PGPE and RGS (DP-PGPE > IBP-PGPE and DP-RGS > IBP-RGS, versus IBP-SAM > DP-SAM). We therefore conclude that the sharpness-aware aspect of GLS is not (solely) responsible for its effectiveness in resolving the paradox of certified training. We have incorporated this discussion into the revised manuscript (see Appendix D4).
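For clarity, the SAM variant used in this ablation can be sketched as a single first-order update step (a minimal NumPy illustration of the rule w′ = w + ρ · ∇_w L / ‖∇_w L‖₂, not the actual training code; `grad_fn` is a hypothetical callable returning ∇_w L):

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.01):
    """One sharpness-aware minimization (SAM) step: compute the gradient
    at the adversarially perturbed weights w' = w + rho * g / ||g||_2
    and descend from w using that gradient (first-order SAM)."""
    g = grad_fn(w)
    norm = np.linalg.norm(g)
    if norm == 0.0:
        # Unlike GLS, SAM sees the same zero gradient as plain SGD
        # on a locally flat or piecewise-constant loss surface.
        return w
    w_adv = w + rho * g / norm      # ascend toward the sharp neighbor
    return w - lr * grad_fn(w_adv)  # descend using its gradient
```

The early-return branch makes the key difference explicit: on a flat plateau SAM is stuck exactly like plain gradient descent, whereas the GLS estimator above still receives an informative signal.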

Comment

We thank all reviewers for their insightful reviews, helpful feedback, and interesting questions. We are particularly encouraged that reviewers consider our work novel, well motivated, and intriguing (9DHW, v31k, XmeV), the experimental results and conclusions insightful for the community (9DHW, v31k), and the paper well-written (9DHW, XmeV). We will incorporate the reviewers' writing suggestions.

As we did not identify any shared questions among reviewers, we will address each reviewer’s questions in individual responses. We look forward to the reviewers’ replies.

AC Meta-Review

The main claim of the paper is that issues arising in certified training can be alleviated by smoothing the loss surface in parameter space of a neural network by convolving it with a Gaussian distribution.

Combining ideas of loss smoothing and convex relaxation is an interesting direction, and some of the experiments in the small perturbation setting are encouraging. However, there are several concerns pointed out by the reviewers regarding theoretical rigor, marginal gains in the experiments and incremental contributions.

I recommend to reject the paper and encourage the authors to take the detailed comments of the reviewers into account for a resubmission.

Additional Comments from Reviewer Discussion

Reviewers v31k and XmeV pointed out various issues regarding presentation, novelty, contribution and the experiments. These issues were not cleared during the rebuttal phase. Therefore, I recommend to reject the work.

Final Decision

Reject