PaperHub
Overall: 6.4/10 · Poster · 4 reviewers (min 3, max 5, std 0.7)
Ratings: 4, 5, 4, 3 · Confidence: 3.5
Novelty: 3.0 · Quality: 3.0 · Clarity: 3.5 · Significance: 2.5
NeurIPS 2025

Scalable Neural Network Geometric Robustness Validation via Hölder Optimisation

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

NN robustness validation against geometric perturbations using Hölder Optimisation

Abstract

Neural Network (NN) verification methods provide local robustness guarantees for a NN in the dense perturbation space of an input. In this paper we introduce H$^2$V, a method for the validation of local robustness of NNs against geometric perturbations. H$^2$V uniquely employs a Hilbert space-filling construction to recast multi-dimensional problems into single-dimensional ones and Hölder optimisation, iteratively refining the estimation of the Hölder constant for constructing the lower bound. In common with related global optimisation methods, Hölder optimisation might theoretically converge to a local minimum, thereby producing an incorrect robustness result. However, we here identify conditions for H$^2$V to be provably sound, and show experimentally that even outside the soundness conditions, the risk of incorrect results can be minimised by introducing appropriate heuristics in the global optimisation procedure. Indeed, we found no incorrect results validated by H$^2$V on a large set of benchmarks from SoundnessBench and VNN-COMP. To assess the scalability of the approach, we report the results obtained on large NNs ranging from ResNet34 to ResNet152 and vision transformers. These point to SoA scalability of the approach when validating the local robustness of large NNs against geometric perturbations on the ImageNet dataset. Beyond image tasks, we show that the method's scalability enables for the first time the robustness validation of large-scale 3D-NNs in video classification tasks against geometric perturbations for long-sequence input frames on the Kinetics/UCF101 datasets.
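The core recipe in the abstract (a Hilbert space-filling curve to collapse the perturbation box to one dimension, plus Hölder optimisation with an adaptively estimated constant) can be illustrated with the following minimal Python sketch. This is our own illustration, not the authors' implementation: the helper names (`d2xy`, `validate_1d`, `margin_fn`), the Hölder exponent of 0.5, and the Piyavskii-style lower bound over a uniform grid are assumptions made for exposition, and the sketch is restricted to two perturbation parameters for brevity.

```python
import numpy as np

def d2xy(order, d):
    """Classic Hilbert-curve decoding: index d in [0, 4**order) -> (x, y) on a 2**order grid."""
    x = y = 0
    t, s = d, 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

def validate_1d(margin_fn, box, order=10, samples=2048, r=1.3, alpha=0.5):
    """box has shape (2, 2): per-dimension [low, high]. margin_fn maps a parameter
    vector to the classification margin (negative = misclassified). Returns
    SAFE / UNSAFE / Unknown from a Hölder lower bound along the Hilbert curve."""
    side = 1 << order
    idx = np.linspace(0, 4 ** order - 1, samples).astype(np.int64)
    t = idx / (4 ** order - 1)                                  # curve position in [0, 1]
    pts = np.array([d2xy(order, int(i)) for i in idx]) / (side - 1)
    params = box[:, 0] + (box[:, 1] - box[:, 0]) * pts          # map onto the perturbation box
    vals = np.array([margin_fn(p) for p in params])
    if vals.min() < 0:                                          # concrete counterexample found
        return "UNSAFE"
    holder = r * np.max(np.abs(np.diff(vals)) / np.diff(t) ** alpha)  # reliability-scaled estimate
    lower = vals.min() - holder * (np.max(np.diff(t)) / 2) ** alpha   # lower bound on the true minimum
    return "SAFE" if lower > 0 else "Unknown"

# Toy usage with a smooth surrogate margin over a rotation (rad) x scaling box.
box = np.array([[-0.3, 0.3], [0.8, 1.2]])
print(validate_1d(lambda p: 0.5 + 0.1 * np.sin(5 * p[0]) * p[1], box))
```

In the actual method the constant is refined iteratively during the search rather than estimated once from a fixed grid, and the reliability parameter r plays the role discussed in the reviews below.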
Keywords
Robustness Validation · Hölder Optimisation · Geometric Robustness · Hilbert Curve Dimensionality Reduction

Reviews and Discussion

Official Review (Rating: 4)

This paper introduces HOVER, a novel method for verifying the robustness of neural networks against low-dimensional geometric perturbations like rotation, scaling, and translation. The core technical contribution is a clever use of Hilbert space-filling curves to reduce the multi-dimensional verification task to a one-dimensional problem. This problem is then solved using Hölder optimisation with an adaptive estimation of the Hölder constant. The authors provide an extensive evaluation showing that HOVER scales to very large models, including Vision Transformers and, for the first time, 3D-NNs for video classification. While the method is not theoretically guaranteed to be sound, the authors provide empirical evidence from SoundnessBench and VNN-COMP showing it produces no incorrect results in their experiments.

Strengths and Weaknesses

The paper's main strength is its achievement in scalability. By successfully verifying models with hundreds of millions of parameters (e.g., ResNet152, ViT) and enabling the first-ever robustness analysis of large video classifiers, the work pushes the practical boundary of what is possible in verification. This is driven by a novel technical approach that creatively combines dimensionality reduction with global optimisation. Furthermore, the extensive experimental validation is a major plus; the perfect record on SoundnessBench is an empirical counterargument to the method's primary weakness.

However, the paper's central weakness is its lack of theoretical soundness. For a method presented in the context of "verification," whose very purpose is to provide formal guarantees, this is a fundamental limitation. The optimization can theoretically converge to a local minimum, leading to an incorrect certificate. The authors' argument that the "aggregation of these results, not a single query, that enables the analysis of a model's robustness" is a pragmatic but concerning stance from a safety-critical perspective.

Other notable weaknesses include:

  1. The method is explicitly designed for and limited to low-dimensional perturbation spaces. It is not a general-purpose verifier and cannot handle the high-dimensional $\ell_p$-norm perturbations that are the focus of most verification literature.
  2. The method's reliability hinges on parameters like the reliability parameter r and the optimisation budget ϵ. The paper offers no constructive procedure for selecting them, leaving their tuning to user expertise and heuristics.

Questions

I think this is generally a good paper, despite the potential unsoundness concern. I will increase my score if the following questions are well answered:

  1. Your work makes a clear trade-off, sacrificing theoretical soundness for practical scalability. Could you provide a quantitative assessment of this trade-off? For instance, what is the performance and runtime cost if you modify HOVER to use more conservative, sound bounding techniques (like interval arithmetic for the lower bounds) instead of the current optimisation approach? This would clarify whether an unsound approach is necessary or merely convenient.
  2. For HOVER to be adopted, users need guidance on setting critical parameters. Can you provide practical heuristics or an empirical study showing how the choice of the reliability parameter r, the Hilbert curve resolution m, and the optimisation budget ϵ affects the trade-off between runtime, memory usage, and the tightness of the verification bound?
  3. Your evaluation against SoA verifiers is limited. Why were general-purpose open-source verifiers like $\alpha,\beta$-CROWN not benchmarked on the main proposed geometric robustness tasks? A direct comparison would be necessary to establish if HOVER's specialized approach is truly superior to state-of-the-art general methods on this specific problem class. Furthermore, can you clarify the missing baseline results for GeoRobust in Table 1?

Limitations

Yes, the authors acknowledge their main limitations

Final Justification

Your response to Question 2 is excellent. The comprehensive ablation studies for parameters, along with concrete recommendations based on optimization literature, significantly strengthen the paper's practical usability.

Your response to Question 3 is also convincing. The technical explanation of $\alpha,\beta$-CROWN's scalability limitations (memory footprint growing with neurons and input resolution, lack of STN operator support) provides compelling evidence for why existing general-purpose verifiers cannot handle your problem scale.

However, your response to Question 1 - the most fundamental issue I raised - is insufficient. You state it's "unclear" how HOVER could be combined with sound methods and dismiss IBP after testing on only one small model. This sidesteps my core request: a quantitative assessment of the soundness vs. performance trade-off. The theoretical soundness issue I identified as your paper's "central weakness" remains unresolved. For a method presented as "verification," the lack of formal guarantees is problematic, especially for safety-critical applications. Your empirical evidence (500 queries with no false results in Soundbench) is valuable but doesn't substitute for theoretical analysis.

Formatting Issues

None

Author Response

We thank the reviewer for the comments. We address the specific questions and provide clarifying points to the identified weaknesses.

Question 1.

We have run HOVer on 500 verification queries from the SoundnessBench benchmark, for which the ground-truth results are known. We observed no unsound results. HOVer is based on global optimisation and, unlike other methods such as SIP, it does not compute bounds for the nodes of the model, so it is unclear to us how it could be combined with SIP or IBP. For a comparison, we did use IBP (auto_LiRPA) to compute bounds for a small ResNet model (WideResNet trained on TinyImageNet); the bounds derived were too loose to establish a good robustness result (see Appendix B1).

Question 2.

[Guidance on parameter r]

Consolidated results from the optimisation literature suggest that a value slightly higher than 1, e.g., 1.3, yields close to optimal solutions [1]. We report an ablation study on r in Table 1 on two benchmarks with ground truth: TLL Verify Bench from VNN-COMP (2 dimensions) and MLP5 from SoundnessBench (10 dimensions). We report the counts of SAFE, UNSAFE, and Unknown cases, along with the average runtime in seconds (SAFE+UNSAFE, excluding Timeout); the results also confirm [1]. Hence, we recommend using the consolidated values from the optimisation literature here.

Table 1: Ablation study on the reliability parameter r

| r | TLL: SAFE (sound) | UNSAFE | Unknown | Runtime (s) | MLP5: SAFE (sound) | UNSAFE | Unknown | Runtime (s) |
|---|---|---|---|---|---|---|---|---|
| 1.1 | 15 (15) | 17 | 0 | 1.00 | 16 (16) | 3 | 14 | 20.92 |
| 1.3 | 15 (15) | 17 | 0 | 1.04 | 14 (14) | 1 | 18 | 10.58 |
| 1.5 | 13 (13) | 17 | 2 | 1.16 | 10 (10) | 2 | 21 | 10.58 |
| 3 | 1 (1) | 17 | 14 | 0.12 | 0 | 2 | 31 | 1.96 |
| 6 | 0 | 17 | 15 | 0.03 | 0 | 2 | 31 | 59.56 |
| 10 | 0 | 17 | 15 | 0.03 | 0 | 2 | 31 | 32.79 |

[Guidance on the Hilbert curve resolution parameter m]

The computational complexity of the Hilbert space mapping grows linearly in m. Any value of m > 30, which already provides small calibration errors (see mathematical expression (10) in the paper), would be adequate to use. We report an ablation study on m in Table 2, which confirms this intuition - any value of m > 10 resolves all verification queries with stable overheads.

Table 2: Ablation study on the Hilbert curve resolution parameter m

| m | TLL: SAFE (sound) | UNSAFE | Unknown | Runtime (s) | MLP5: SAFE (sound) | UNSAFE | Unknown | Runtime (s) |
|---|---|---|---|---|---|---|---|---|
| 10 | 0 | 17 | 15 | 0.18 | 14 (14) | 2 | 17 | 13.02 |
| 30 | 15 (15) | 17 | 0 | 0.33 | 13 (13) | 2 | 18 | 7.59 |
| 50 | 15 (15) | 17 | 0 | 0.33 | 14 (14) | 1 | 18 | 10.58 |
| 70 | 15 (15) | 17 | 0 | 0.33 | 14 (14) | 2 | 17 | 16.13 |
| 90 | 15 (15) | 17 | 0 | 0.34 | 14 (14) | 2 | 17 | 11.37 |

[Guidance on the Optimisation budget parameter ϵ]

The optimisation budget corresponds to the typical optimisation gap present in all convergent algorithms. As the ablation study in Table 3 shows, the smaller its value, the more verification queries can be resolved, albeit with longer runtime. Its value is thus contingent on the complexity of the queries in question and on the available resources. Our recommendation is to tune ϵ in line with the comprehensive ablation study provided.

Table 3. Ablation study on the optimisation budget parameter ϵ

| ϵ | TLL: SAFE (sound) | UNSAFE | Unknown | Runtime (s) | MLP5: SAFE (sound) | UNSAFE | Unknown | Runtime (s) |
|---|---|---|---|---|---|---|---|---|
| 1e-3 | 12 (12) | 17 | 3 | 0.62 | 0 | 0 | 33 | - |
| 1e-6 | 15 (15) | 17 | 0 | 0.68 | 4 (4) | 0 | 29 | 2.33 |
| 1e-9 | 15 (15) | 17 | 0 | 0.78 | 9 (9) | 0 | 24 | 2.52 |
| 1e-13 | 15 (15) | 17 | 0 | 0.88 | 13 (13) | 1 | 19 | 5.03 |
| 1e-15 | 15 (15) | 17 | 0 | 1.04 | 14 (14) | 1 | 18 | 10.58 |

Question 3.

We have attempted to run αβ‑CROWN on ImageNet‑scale networks (with more than 24 million NN parameters) that we analyse here (e.g. ResNet‑50 on a 3 × 224 × 224 input and Inception‑v3 on 2 × 299 × 299), but we were not able to. One major obstacle is operator coverage: although the Spatial Transformer Network (STN) can be represented in the computation graph, in our understanding the current αβ‑CROWN implementation does not provide the bounded operations for geometric perturbations, specifically, affine_grid and grid_sample.

A further obstacle is scalability: αβ‑CROWN’s memory footprint grows roughly linearly with both the number of neurons and the full input resolution which makes branching/splitting prohibitively expensive. To illustrate this, we constructed a one‑pixel perturbation (corresponding to 3 dimensions) as a comparable low‑dimensional proxy, as our geometric experiments involve low‑dimensional perturbations.

We used the ResNet18_cifar model provided in αβ‑CROWN’s repository and used its wider variant to increase the inplanes from 2 to 8, yielding a larger model. We then performed verification queries at ε = 0.1 and ε = 0.3 on these two models over a random pixel, reasonably assuming that they are expected to be robust (i.e., safe) within these perturbation radii.

Table 4 reports the results for this simplified setting, where αβ‑CROWN verifies the smallest case at ε=0.1 but times out or runs out of memory (OOM) as ε increases or the model grows; in contrast, HOVer consistently returns safe with lower and more stable memory, as it only requires the forward pass. In our experiments αβ‑CROWN runs out of memory (OOM) on a relatively small model with 175,802 parameters. In conclusion we see no way to scale it to models with tens/hundreds of millions of parameters as we do here, or to ImageNet‑scale inputs with geometric perturbations.

Table 4: Verification results on one‑pixel perturbation for CIFAR10 models

| Model (#params) & ε | Method | Result / Peak Mem (MB) / Time (s) |
|---|---|---|
| ResNet18_cifar (11,270), ε=0.1 | PGD (10k steps × 5 restarts) | unknown / 656 / 12.75 |
| | αβ‑CROWN | safe / 644 / 4.68 |
| | HOVer | safe / 636 / 30.28 |
| ResNet18_cifar (11,270), ε=0.3 | PGD (10k steps × 5 restarts) | unknown / 656 / 13.32 |
| | αβ‑CROWN | unknown / 4298 / Timeout |
| | HOVer | safe / 636 / 30.20 |
| ResNet18_cifar, 8 inplanes (175,802), ε=0.1 | PGD (10k steps × 5 restarts) | unknown / 678 / 14.16 |
| | αβ‑CROWN | out of memory (OOM) |
| | HOVer | safe / 638 / 24.88 |

[Comment on Low-dimensional perturbation spaces]

Geometric transformations are important in many practical applications including object detection since geometric adversarial vulnerability is evidenced in practice [2]. HOVer is specifically designed for robustness analysis with respect to geometric perturbations. Within this scope, we do not consider said perturbation as a limitation, as all combinations of geometric transformations can be encoded using HOVer. We noted elsewhere that the same approach will be effective on other low-dimensional perturbations, such as bias-field, contrast, luminosity and blur. However, we acknowledge that if the goal were to develop a general-purpose method for neural network analysis, this focus would indeed constitute a limitation.

Clarification

The missing baseline result for GeoRobust in Table 1 is due to an update in the timm (PyTorch Image Models) library, which previously caused the ResNet model to achieve only ∼9% robust accuracy (abnormally low). The updated model now achieves a more reasonable ∼25% robust accuracy. We will rerun GeoRobust using the updated model and include the result in the revised version.


[1] Introduction to global optimization exploiting space-filling curves, 2013

[2] Adversarial examples based on object detection tasks, 2023

Comment

Thank you for the comprehensive rebuttal. I appreciate the detailed ablation studies and technical explanations provided.

Your response to Question 2 is excellent. The comprehensive ablation studies for parameters, along with concrete recommendations based on optimization literature, significantly strengthen the paper's practical usability.

Your response to Question 3 is also convincing. The technical explanation of $\alpha,\beta$-CROWN's scalability limitations (memory footprint growing with neurons and input resolution, lack of STN operator support) provides compelling evidence for why existing general-purpose verifiers cannot handle your problem scale.

However, your response to Question 1 - the most fundamental issue I raised - is insufficient. You state it's "unclear" how HOVER could be combined with sound methods and dismiss IBP after testing on only one small model. This sidesteps my core request: a quantitative assessment of the soundness vs. performance trade-off. The theoretical soundness issue I identified as your paper's "central weakness" remains unresolved. For a method presented as "verification," the lack of formal guarantees is problematic, especially for safety-critical applications. Your empirical evidence (500 queries with no false results in Soundbench) is valuable but doesn't substitute for theoretical analysis.

While your rebuttal strengthens the paper's practical contributions significantly, the fundamental theoretical limitation remains unaddressed. I maintain my score (4) for further discussion.

Comment

Thank you for clarifying you are satisfied with our answers to Questions 2 and 3.

Regarding Question 1, we would like to comment as follows.

Firstly, we agree with you and indeed acknowledge in the submission that the lack of theoretical soundness is a limitation of the method.

In our view this is mitigated by the fact that the only underlying source of unsound results stems from potential non-convergence of the underlying global optimisation approach. However, these are very widely used and well-understood methods in global optimisation about which a lot is known. We use best practice in optimisation and adopt conservative parameters to minimise the risk of unsoundness.

So we see the evaluation showing no unsound cases across all the publicly available test cases with ground truth as further confirmation that the method is defined in a principled way. We also argued that our overall objective is to evaluate overall model robustness; this analysis requires running many validation instances, which makes the theoretical possibility of hitting one unsound case even less important.

Even so, we do see this is a limitation of the method. We embrace the recommendation by reviewer rxtg that avoiding completely the term "verification" when referring to HOVer could bring further clarity. We refer to our conversation with reviewer rxtg above; but in a nutshell in the paper we would avoid using the term verification and instead refer to the method as a technique for "NN robustness validation". In this way the limitation of the method is even clearer from the outset.

Secondly, we agree with you that being able to incorporate conservative estimates into the approach, and thereby regain soundness, would be advantageous. However note that, unlike in bound propagation methods, the only source of potential, however unlikely, unsoundness here is that the repeatedly estimated upper bounds for the Lipschitz constant may not be valid. If we could guarantee valid upper bounds we could in principle compute an exact solution, hence any result would indeed be sound. This depends on obtaining sound estimates, which is certainly possible; for example, we could use spectral norms. By doing so HOVer could at present verify small trained models with soundness guarantees, but it would not scale to the large models analysed in the paper.

So, we agree with your remark that in principle we could also recover soundness, as you suggest. We chose this way of presenting the approach to focus on scalability, keep a simple message, and be upfront about the limitations. But we could certainly add a small result that under a sound computation of the Lipschitz constant (spectral norms or others) the soundness of any result is actually theoretically guaranteed. Also, note that sound Lipschitz estimation is the only limiting factor; differently from much of the verification literature, model size is not.

We would value your opinion as to whether or not adding the above is useful to the paper.

Comment

Thank you for the discussion, I will keep my score.

Comment

Dear Reviewer BsfB,

Thank you for acknowledging our response. May we ask whether our answer addressed your question? We are asking this as we were not sure of the question and remain more than happy to clarify further.

We took your comments further today (with our interpretation), and to provide a further quantitative assessment related to your question, we conducted additional experiments. We took the publicly available CIFAR10 and RESNET models from VNNCOMP and evaluated HOVer on provably sound Lipschitz overapproximations (by multiplying the spectral norms of the network), denoted as HOVer* in the table, and compared with HOVer, defined as in the paper on commonly accepted parameters in global optimisation. The queries are 1-pixel perturbations. As previously discussed, in this setting HOVer* reports provably sound results.

Table 5: Verification results on different CIFAR10 models (HOVer*: HOVer with a true upper bound of the Lipschitz constant)

| Model (# params) & (ε=0.1) | Upper bound of Lipschitz constant | Method | Result | Time (s) |
|---|---|---|---|---|
| marabou-cifar10/cifar10_small (# 2,456) | 4.8095e+00 | HOVer | SAFE | 2.11 |
| | | HOVer* | SAFE | 1.54 |
| marabou-cifar10/cifar10_medium (# 9,008) | 6.9372e+00 | HOVer | SAFE | 3.39 |
| | | HOVer* | SAFE | 1.56 |
| marabou-cifar10/cifar10_large (# 34,400) | 1.0747e+01 | HOVer | SAFE | 2.90 |
| | | HOVer* | SAFE | 1.55 |
| cifar10_resnet/resnet2b (# 112,006) | 1.2080e+02 | HOVer | SAFE | 2.89 |
| | | HOVer* | SAFE | 556.10 |
| cifar10_resnet/resnet4b (# 123,734) | 4.9105e+02 | HOVer | SAFE | 3.55 |
| | | HOVer* | Unknown | Timeout |

As the results show, HOVer* verifies some models but times out when the Lipschitz constant exceeds a certain value. In contrast, of course, HOVer validates all models and scales to models with much larger Lipschitz constants (e.g., around 8e+50 for ResNet34) and much larger models (over 300M parameters), as reported in the paper. We are adding this table as perhaps this is the explicit quantitative assessment you would have liked to see to identify exactly where the barrier is.

As in the previous message, we stress that, unlike much of the existing verification literature, model size itself is not the primary bottleneck in our setting. This is instead the value of the Lipschitz constant, as the experiments above demonstrate.
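For readers who want to see what the spectral-norm route looks like in code, the following is a minimal PyTorch sketch of a provably valid (if typically loose) Lipschitz upper bound obtained by multiplying layer spectral norms. This is our illustration, not the HOVer* implementation: it covers only fully-connected layers composed with 1-Lipschitz activations such as ReLU; convolutional layers require dedicated operator-norm bounds.

```python
import torch
import torch.nn as nn

def spectral_lipschitz_upper_bound(model: nn.Module) -> float:
    """Product of layer spectral norms: a valid Lipschitz upper bound for a chain
    of Linear layers interleaved with 1-Lipschitz activations (ReLU, tanh, ...)."""
    bound = 1.0
    for layer in model.modules():
        if isinstance(layer, nn.Linear):
            bound *= torch.linalg.matrix_norm(layer.weight, ord=2).item()
    return bound

# Example on a small fully-connected classifier.
mlp = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
print(spectral_lipschitz_upper_bound(mlp))
```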

We hope this further clarifies the issue raised previously.

Official Review (Rating: 5)

The paper proposes an optimization-based approach for the robustness analysis of neural networks w.r.t. geometric perturbations. To this end, the paper phrases the problem as a low-dimensional optimization problem (where the few dimensions correspond to the configuration of the perturbation, e.g. rotation or scaling) and uses a space-filling curve to reduce the problem to a one-dimensional verification problem. Subsequently, a Hölder-constant based optimization approach is deployed.

Strengths and Weaknesses

Strengths:

  • (S10) The innovative approach is conceptually interesting in its usage of space filling curves to reduce the dimensionality of the optimization problem.
  • (S20) The results are very promising as the approach is capable of scaling its analysis technique to NNs much larger than traditionally analyzed in the field of NN verification, e.g. ImageNet networks or Video Classification Networks.
  • (S30) While some of the details of the optimization approach are quite technical, the paper provides a good overview of the approach.

Opportunities for Improvement:

  • (O10) A significant portion of the paper is spent on discussing the theoretical unsoundness of the approach. I appreciate that the authors were up front about this. To this end, my understanding of the paper's argument is that (1) empirically the obtained results so far have always been correct, and (2) that other State of the Art verifiers are unsound, too. I find this argument unconvincing: Yes, some verification tools have bugs that lead to unsound results, but for any of the methods underlying alpha-beta-CROWN, NeuralSAT, Marabou or other NN verifiers we know that it is, in principle, possible to implement them in a sound manner and these bugs should be fixable (even if this may require some assumptions about floating point operation order). The purpose of NN verification is precisely to go beyond what can be achieved via empirical methods and hence an empirical soundness result is insufficient in my view. To this end, it is also important to emphasize that the field of NN verification is broader than "just" robustness verification. Hence, even if robustness analysis might get away without rigorous soundness guarantees, other application scenarios do not and require that something labeled "NN verification" actually comes with a rigorous soundness guarantee. This might be nitpicking (with my luck I probably end up being reviewer 2), but I believe it is important to be careful with terms like "verification": Without a soundness guarantee in the classical sense it is my stance that this approach should not be called verification. My suggestion would be to relabel it as, e.g., "approximate verification" (see, e.g., approximate model counting), "robustness analysis" or "approximate certification".
  • (O20) The paper mentions that tools from VNNCOMP are not capable of resolving queries of the kind evaluated (line 288ff): To this end, I know that, e.g., alpha-beta-CROWN is in principle able to process computational graphs which, in my understanding, should be able to capture the geometric transformations performed in the paper. This makes me wonder whether the authors actually tried to apply alpha-beta-CROWN and it failed or whether they performed a comparison on some other benchmarks where they demonstrated that alpha-beta-CROWN is indeed less performant than their proposed approach. Especially given the low input dimensionality I expect that some NN verification tools might scale further than is evaluated in classical VNNCOMP benchmarks.

Minor Notes:

  • Line 48: "Hölder" (o->ö)
  • Line 188: Where is $h_k$ defined?
  • Line 230: "first": missing r
  • Line 338: Unnecessary "the"

Questions

  • (Q1) Would the authors be willing to change the title of their paper to make it more transparent, that the approach does not come with a provable soundness guarantee
    (e.g. "approximate verification", "robustness analysis", "approximate certification" or something similar)
  • (Q2) Have the authors performed any experiments that demonstrate that their tool indeed outperforms alpha-beta-CROWN or other VNNCOMP participants? Are there fundamental reasons why the benchmarks used for evaluation (e.g. ImageNet) cannot be evaluated using alpha-beta-CROWN via an encoding as a computational graph?

I want to emphasize that I am very willing to increase my score if these questions are addressed.

Limitations

yes

Final Justification

Overall, I find the author's rebuttal convincing. I have hence moved my score to "Accept" under the assumption that the authors will rephrase the contribution from "Verification" to something akin to robustness analysis (or similar). The comparison to alpha beta CROWN nicely showcases that their approach indeed outperforms other techniques in the considered setting. The authors' rebuttal lifts my concerns outlined in (O10) and (O20).

Formatting Issues

none

Author Response

Question 1.

We fully agree with the reviewer’s remarks. Our goal was to be as transparent and scientifically precise as possible, and we welcome the opportunity to further improve this. We are happy to refer to our approach as a method for "robustness analysis" or "robustness validation" and to change the title and all referenced text accordingly. Indeed, we may prefer these over "approximate verification" as they more clearly emphasise the contrast to "verification", in the spirit of the reviewer's comment, which we endorse. That said, we believe that the method presented here strictly goes beyond, and is conceptually different from, the kind of empirical robustness analysis typically performed using PGD and derivatives. We welcome further comments on this.

If the paper is accepted, we will revise the paper to acknowledge that, in principle, a sound method can indeed be implemented correctly. We had absolutely no intention of being dismissive to existing work, which we believe has great value, and are grateful for the opportunity to clarify our position.

Question 2.

As discussed in the paper tools from VNNCOMP cannot solve queries on geometric perturbations on large models that we evaluate here; this is not only due to the nature of the perturbations but also the large input size and large models.

We have attempted to run αβ‑CROWN on the ImageNet‑scale networks (with more than 24 million NN parameters) that we analyse here (e.g. ResNet‑50 on a 3 × 224 × 224 input and Inception‑v3 on 2 × 299 × 299), but we were not able to. One major obstacle is operator coverage: although the Spatial Transformer Network (STN) can be represented in the computation graph, to our knowledge the current αβ‑CROWN implementation does not provide the bounded operations for geometric perturbations, specifically, affine_grid and grid_sample.

A further obstacle is scalability: αβ‑CROWN’s memory footprint grows roughly linearly with both the number of neurons and the full input resolution, which makes branching/splitting prohibitively expensive. To illustrate this, we constructed a one‑pixel perturbation (corresponding to 3 dimensions) as a comparable low‑dimensional proxy, as our geometric experiments involve low‑dimensional perturbations.

We used the ResNet18_cifar model provided in αβ‑CROWN’s repository and used its wider variant to increase the inplanes from 2 to 8, yielding a larger model. We then performed verification queries at ε = 0.1 and ε = 0.3 on these two models over a random pixel, reasonably assuming that they are expected to be robust (i.e., safe) within these perturbation radii.

Table 1 reports the results for this simplified setting, where αβ‑CROWN verifies the smallest case at ε=0.1 but reports TIMEOUT or runs out of memory (OOM) as ε increases or the model grows; in contrast, HOVer consistently returns safe with lower and more stable memory as it only requires the forward pass. In our experiments αβ‑CROWN runs out of memory (OOM) on a relatively small model with 175,802 parameters. In conclusion we see no way to scale it to models with tens/hundreds of millions of parameters as we do here, or to ImageNet‑scale inputs.

Table 1: Verification results on one‑pixel perturbation for CIFAR10 models

| Model (#params) & ε | Method | Result / Peak Mem (MB) / Runtime (s) |
|---|---|---|
| ResNet18_cifar (#11,270), ε=0.1 | PGD (10k steps × 5 restarts) | unknown / 656 / 12.75 |
| | αβ‑CROWN | safe / 644 / 4.68 |
| | HOVer | safe / 636 / 30.28 |
| ResNet18_cifar (#11,270), ε=0.3 | PGD (10k steps × 5 restarts) | unknown / 656 / 13.32 |
| | αβ‑CROWN | unknown / 4298 / Timeout |
| | HOVer | safe / 636 / 30.20 |
| ResNet18_cifar, 8 inplanes (#175,802), ε=0.1 | PGD (10k steps × 5 restarts) | unknown / 678 / 14.16 |
| | αβ‑CROWN | out of memory (OOM) |
| | HOVer | safe / 638 / 24.88 |

Minor Notes.

We thank the reviewer for pointing out these typos, which will be corrected in the final version if the paper is accepted. $h_k$ is defined in lines 182-183; it denotes the global Hölder constant estimated so far.

Comment

Thank you for the rebuttal.

Given your response and your willingness to update the wording, I will be updating my score to Accept.

I agree that your technique goes far beyond what is achievable by PGD as it clearly goes beyond "testing". I am also wondering what the right title might be, and I agree that "robustness analysis" or "robustness validation" (or maybe "robustness estimation"?) might be a sensible option. Graybox techniques (if we consider PGD to be "black box" which is also not quite correct) is another word that comes to mind, although I'm also not quite sure if it fits perfectly. I appreciate that you understand my concern about the usage of "verification".

I appreciate the comparison to alpha beta CROWN, which demonstrates the utility of the proposed approach.

Comment

We are thankful for the valuable suggestions, which will be acknowledged in the final version if the paper is accepted. Our preferred presentation is indeed that HOVer provides a way to conduct robustness analysis, or robustness validation; so the revised title would be "Scalable Neural Network Robustness Validation against Geometric Perturbations via Hölder Optimisation", or "Scalable Neural Network Geometric Robustness Validation via Hölder Optimisation".

Thank you for the other remarks on the additional benchmarking and considerations on PGD too.

Official Review (Rating: 4)

This paper introduces HOVER, a novel global optimization framework for verifying the local robustness of neural networks against geometric perturbations (e.g., rotation, translation, scaling). Unlike traditional methods constrained by input dimensionality or heavy reliance on theoretically sound (but often impractical) optimization guarantees, HOVER leverages a dimensionality reduction strategy via Hilbert space-filling curves and adapts a Hölder continuous optimization method that is capable of verifying robustness for very large models (e.g., ResNet152, ViTs, and 3D CNNs).

The approach formulates the robustness verification problem as a univariate optimization task, reducing the original high-dimensional geometric perturbation space to a single dimension. HOVER provides empirical reliability via adaptive estimation of Hölder constants, interval refinement heuristics, and dynamic calibration, even though theoretical soundness is not always assured. Extensive experiments show that HOVER: 1. Outperforms previous methods on benchmarks like ImageNet, Kinetics, and UCF101; 2. Is applicable to both image and video classification; 3. Performs competitively on SoundnessBench and VNN-COMP, often with higher empirical correctness than “sound” solvers like αβ-CROWN.

Strengths and Weaknesses

Strengths

HOVER can verify large-scale models, including 3D CNNs for video, which is unprecedented in the robustness verification literature for geometric perturbations.

The use of Hilbert curves for dimensionality reduction enables the transformation of multivariate geometric robustness problems into tractable univariate optimization tasks. This is both elegant and effective.

Despite theoretical limitations, HOVER shows very strong empirical soundness across 500+ verification queries, outperforming several state-of-the-art verifiers.

The authors present results on diverse tasks (image and video classification), architectures (CNNs, ResNets, ViTs, 3D CNNs), and datasets (ImageNet, Kinetics, UCF101), along with comparisons against certified and adversarial baselines.

The appendices thoroughly cover theoretical proofs, ablation experiments, extended results, and implementation details, improving the paper's transparency and credibility.

Weaknesses

While empirical correctness is high, the algorithm lacks a constructive guarantee of convergence to a global minimum unless the unknown optimal Hölder constant is accurately estimated. This may be problematic in safety-critical applications.

The core contributions (space-filling curves + Hölder optimization + heuristics) are well-engineered but do not introduce fundamentally new theory. The work is closer to a strong engineering improvement than a conceptual leap.

The method is only applicable to low-dimensional perturbation spaces (e.g., geometric transformations). It does not generalize to pixel-level or noise-based robustness verification tasks, limiting its scope compared to symbolic interval propagation or MILP-based solvers.

On certain architectures like Swin or ViT-Large, HOVER underperforms in certified robustness compared to GeoRobust. Some claims about “no unsound results” would benefit from further third-party validation.

The choice of reliability parameter r, budget ε, and re-estimation strategies are empirical and somewhat arbitrary. No ablation study shows how sensitive results are to these choices.

Questions

  1. The practical reliability of HOVER depends heavily on the heuristic tuning of parameters like the reliability constant r. Could you include an ablation study showing how variations in r affect false positives/negatives or convergence time across benchmarks?

  2. The current method is explicitly limited to low-dimensional transformations. Do you see a path toward generalizing the Hilbert-based Hölder optimization to moderately high-dimensional perturbation spaces (e.g., 20–50 dimensions)? What are the theoretical or practical barriers?

  3. Are there examples where HOVER returns an incorrect robustness label? What mitigation strategies are in place for safety-critical applications where even one unsound result could be unacceptable?

  4. Can you provide wall-clock runtime comparisons between HOVER and GeoRobust, PGD, or αβ-CROWN on the same tasks?

  5. You fix m = 50 throughout the experiments. Could you show the effect of varying m on both verification success rate and runtime? This would clarify its role in the trade-off between accuracy and speed.

Limitations

Partially addressed. The authors acknowledge the limitation of HOVER's applicability to low-dimensional geometric transformations and mention that noise perturbations are better handled by symbolic interval methods. However, the discussion of societal or safety risks is absent. Given the focus on robustness and safety-critical use cases, it would be important to address: 1. The risks of false confidence in verification outcomes; 2. The challenges in applying HOVER under distribution shifts; 3. The potential for misuse or overtrust in industrial settings.

Formatting Issues

  1. Some equations (e.g., those involving the Hilbert curve mappings and Hölder constants) could benefit from clearer inline explanations or accompanying visual intuition.

  2. Some figures (e.g., optimization curve illustrations) are overly dense or low contrast.

  3. The main text is very dense. Consider condensing some derivations or moving more setup into the appendix.

  4. Occasionally uses citation numbers without context (e.g., [6]) in places where a brief summary of the cited method would help readers.

Author Response

We thank the reviewer for the comments. We address the specific questions as follows.

Question 1.

[Reliability parameter r]

Empirical results from the optimisation literature suggest that using a small value close to 1, e.g., 1.3, yields close to optimal solutions [1]. We report an ablation study on r in Table 1, which confirms the observations in [1]. Specifically, we conducted ablation studies on the reliability parameter r on two benchmarks with ground truth: TLL Verify Bench from VNN-COMP (2 dimensions) and MLP5 from SoundnessBench (10 dimensions). We report the counts of SAFE, UNSAFE, and Unknown cases, along with the average runtime in seconds (SAFE+UNSAFE, excluding Timeout). The results show that HOVer outputs no unsound results for any value of r>1, but it becomes hard to answer verification queries with bigger values of r, which, given the above, should not be used. In summary: we have never witnessed a false positive in any of the experiments. Any counterexample is provably correct, so we cannot have any false negatives, and indeed never encountered one.

Table 1: Ablation study on the reliability parameter r

| r | TLL: SAFE (sound) | UNSAFE | Unknown | Runtime (s) | MLP5: SAFE (sound) | UNSAFE | Unknown | Runtime (s) |
|---|---|---|---|---|---|---|---|---|
| 1.1 | 15 (15) | 17 | 0 | 1.00 | 16 (16) | 3 | 14 | 20.92 |
| 1.3 | 15 (15) | 17 | 0 | 1.04 | 14 (14) | 1 | 18 | 10.58 |
| 1.5 | 13 (13) | 17 | 2 | 1.16 | 10 (10) | 2 | 21 | 10.58 |
| 3 | 1 (1) | 17 | 14 | 0.12 | 0 | 2 | 31 | 1.96 |
| 6 | 0 | 17 | 15 | 0.03 | 0 | 2 | 31 | 59.56 |
| 10 | 0 | 17 | 15 | 0.03 | 0 | 2 | 31 | 32.79 |

[Hilbert curve resolution parameter m]

The computational complexity of modern libraries providing the Hilbert space mapping grows linearly in m, so any value of m > ~30, which already provides small calibration errors (see mathematical expression (10) in the paper), would be adequate to use. We report an ablation study on m in Table 2 which confirms this intuition - any value of m > 10 resolves all verification queries with stable overheads.

Table 2: Ablation study on the Hilbert curve resolution parameter m

| m | TLL: SAFE (sound) | UNSAFE | Unknown | Runtime (s) | MLP5: SAFE (sound) | UNSAFE | Unknown | Runtime (s) |
|---|---|---|---|---|---|---|---|---|
| 10 | 0 | 17 | 15 | 0.18 | 14 (14) | 2 | 17 | 13.02 |
| 30 | 15 (15) | 17 | 0 | 0.33 | 13 (13) | 2 | 18 | 7.59 |
| 50 | 15 (15) | 17 | 0 | 0.33 | 14 | 1 | 18 | 10.58 |
| 70 | 15 (15) | 17 | 0 | 0.33 | 14 | 2 | 17 | 16.13 |
| 90 | 15 (15) | 17 | 0 | 0.34 | 14 | 2 | 17 | 11.37 |

[Optimisation budget parameter ϵ]

The optimisation budget corresponds to the typical optimisation gap present in all convergent algorithms. As the ablation study in Table 3 shows, the smaller its value, the more verification queries can be solved, albeit with longer runtime. Its value is thus contingent on the available computational and time resources.

Table 3: Ablation study on the optimisation budget parameter ϵ

| ϵ | TLL: SAFE (sound) | UNSAFE | Unknown | Runtime (s) | MLP5: SAFE (sound) | UNSAFE | Unknown | Runtime (s) |
|---|---|---|---|---|---|---|---|---|
| 1e-3 | 12 | 17 | 3 | 0.62 | 0 | 0 | 33 | - |
| 1e-6 | 15 | 17 | 0 | 0.68 | 4 | 0 | 29 | 2.33 |
| 1e-9 | 15 | 17 | 0 | 0.78 | 9 | 0 | 24 | 2.52 |
| 1e-13 | 15 | 17 | 0 | 0.88 | 13 | 1 | 19 | 5.03 |
| 1e-15 | 15 | 17 | 0 | 1.04 | 14 | 1 | 18 | 10.58 |

Question 2.

HOVer is designed for geometric verification, and is therefore inherently restricted to low‑dimensional perturbation spaces. As shown in Theorem 1, the convergence bound includes an irreducible residual term $H \cdot (\epsilon/2)^{1/N}$. Because $(\epsilon/2)^{1/N} \rightarrow 1$ as $N$ grows, the resulting approximation gap, governed by the Hölder constant, also grows with $N$, thereby deteriorating convergence and giving overly loose certificates in higher dimensions. Consequently, pushing the method to high-dimensional transformations such as noise remains an open challenge. However, note that extending HOVer to other low-dimensional perturbations such as contrast, luminosity, bias field, blur, etc., is entirely feasible.
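To make the dimension dependence concrete, a back-of-the-envelope evaluation (ours, for illustration only) of the residual factor at ϵ = 10⁻⁶ is:

\[
\left(\tfrac{\epsilon}{2}\right)^{1/N}
  = \left(5 \times 10^{-7}\right)^{1/N}
  \approx
  \begin{cases}
    0.03 & N = 4,\\
    0.48 & N = 20,\\
    0.75 & N = 50,
  \end{cases}
\]

so the irreducible term $H \cdot (\epsilon/2)^{1/N}$ approaches $H$ itself as $N$ grows, which is why the certificates loosen quickly outside low-dimensional perturbation spaces.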

Question 3.

In our extensive evaluations, we have observed no cases where HOVer produces an incorrect robustness result. In principle, by further increasing r beyond what we used in the experiments, we reduce even further the possibility of any unsound results. Currently however, HOVer provides no formal mitigation for potential unsoundness. We believe this to be an acceptable trade-off in the context of a robustness analysis across many test points for the very large models here that can now be analysed.

Question 4.

[GeoRobust and PGD]

We ran 100 randomly sampled ImageNet images through the ResNet‑50 considered in the paper and compared the performance of our method (HOVer) against PGD and GeoRobust under the 4‑parameter perturbations. The termination threshold was set to 500 seconds or 50,000 queries per sample. Table 4 reports the average runtime observed. As expected, PGD, while fast, cannot in principle establish any SAFE cases. In comparison to GeoRobust, HOVer is markedly more performant.

Table 4: Comparison with GeoRobust and PGD

| Method (Rotation + Translation + Isotropic Scaling) | SAFE | UNSAFE | Unknown | Runtime (s) |
|---|---|---|---|---|
| HOVer | 33 | 60 | 7 | 59.49 |
| GeoRobust | 33 | 57 | 10 | 286.01 |
| PGD | - | 34 | 66 | 4.18 |

[αβ\alpha\beta-CROWN]

We have attempted to run αβ‑CROWN on the ImageNet‑scale networks (with more than 24 million NN parameters) that we analyse here (e.g. ResNet‑50 on a 3 × 224 × 224 input and Inception‑v3 on 2 × 299 × 299), but we were not able to. One major obstacle is operator coverage: although the Spatial Transformer Network (STN) can be represented in the computation graph, to our knowledge neither the current αβ‑CROWN implementation nor any other IBP/SIP tool provides the bounded operations for geometric perturbations, specifically affine_grid and grid_sample.

A further obstacle is scalability: αβ‑CROWN’s memory footprint grows roughly linearly with both the number of neurons and the full input resolution, which makes branching/splitting prohibitively expensive. To clarify this, we constructed a one‑pixel perturbation (corresponding to 3 dimensions) as a comparable low‑dimensional proxy, as our geometric experiments involve low‑dimensional perturbations. In this simplified setting and in contrast to HOVer, αβ‑CROWN often times out or runs out of memory (OOM). We refer to the response to Question 2 of Reviewer rxtg for the experimental results.

Question 5.

Please see the answer for Question 1.

[Comment on Low-dimensional perturbation spaces]

Geometric transformations are important in many practical applications, including object detection since geometric adversarial vulnerability is evidenced in practice [2]. HOVer is specifically designed for robustness analysis with respect to geometric perturbations. Within this scope, we do not consider said perturbation as a limitation, as all combinations of geometric transformations can be encoded using HOVer. We noted elsewhere that the same approach will be effective on other low-dimensional perturbations, such as bias-field, contrast, luminosity and blur. However, we acknowledge that if the goal were to develop a general-purpose method for neural network analysis, this focus would indeed constitute a limitation.

[Comment on Soundness]

We agree it would be ideal to have a theoretical guarantee on no false positives. As argued in the paper, we believe this is an acceptable trade-off in the context of robustness analysis for very large models that could not be analysed before. We remark that we observed no false positives in all the experiments run.

[Comment on Underperformance wrt GeoRobust]

We observed in the paper that GeoRobust reported incorrect results in some cases. We believe this is due to an underestimation of the Lipschitz constant in their approach; this leads to faster resolution of some cases.

Limitations (possible overconfidence).

We have found no cases where the answer of the tool was incorrect. But we would like to stress that no safety-critical systems, such as aviation systems, are validated purely on the basis of formal verification. An assurance case in industry is much more comprehensive, and verification is only a part of the process. Even verification results from sound tools are evaluated in the context of tool qualification and confidence, and of the possibility of errors creeping in in various ways. This is also likely to be the case with ML systems, in view of the difficulty of coverage quantification of the Operational Design Domain. Therefore we do not believe this risk to be high. Still, we do not believe this removes value from the method, as it significantly pushes forward robustness validation for large models, something that was not possible before.


[1] Introduction to global optimization exploiting space-filling curves, 2013

[2] Adversarial examples based on object detection tasks, 2023

Comment

Thank you for the comprehensive and thoughtful rebuttal. I appreciate the detailed experimental additions and clarifications you've provided, which significantly strengthen the submission.

  1. Your response addressed my concern about heuristic tuning by including ablation studies on the reliability parameter r (Table 1) and discretization parameter m (Table 2), as well as the optimization budget. These studies confirm the practical robustness of your choices and demonstrate how parameter tuning impacts verification reliability and runtime. This added empirical grounding is appreciated and increases confidence in the proposed framework's stability.

  2. Your honest discussion of the method’s current limitation to low-dimensional geometric perturbations is appreciated. The theoretical explanation based on the Hölder bound scaling with $N$ (i.e., convergence degrading in higher dimensions) is clear, and I agree that focusing on geometric transformations (and their combinations) is a practical and important niche. Your indication that other low-dimensional transformations (e.g., contrast, blur) are within scope is promising. That said, I also appreciate your acknowledgement that generalization to higher-dimensional perturbation spaces—such as pixel-level noise—remains an open challenge. Clarifying this boundary helps position the contribution appropriately.

  3. I value your transparency around the absence of formal unsoundness guarantees, and your clarification that no incorrect results were observed in extensive experiments. While this tradeoff is likely acceptable in many practical contexts—especially for large-scale models that previous tools cannot handle—an explicit mitigation strategy or fallback mechanism for safety-critical scenarios would further strengthen the work. Your contextualization of formal verification as just one part of a broader assurance case is well-taken.

  4. Thank you for the new runtime comparisons against PGD and GeoRobust (Table 4). These results highlight HOVER’s practical advantages at scale and reinforce the claim that existing sound tools cannot feasibly handle large architectures like ResNet-50 under geometric perturbations. Your observations about αβ-CROWN’s scalability limitations and operator coverage gaps help clarify why direct comparisons weren’t possible in those cases.

Overall, I find this to be a compelling paper that introduces a novel and practical approach to robustness verification under geometric transformations. While it does not offer theoretical novelty in the strictest sense, the integration of space-filling curves, Hölder-based optimization, and reliability-aware heuristics is well-engineered and empirically validated across challenging domains, including video models and large-scale image classifiers.

I am maintaining my original score, but I now view the contribution more favorably in light of the added experiments and your thoughtful rebuttal. This is a strong and relevant submission that pushes the boundary of scalable robustness analysis in a focused domain. Thank you again for the detailed clarifications and engagement.

Comment

Thank you for the response. We do not understand the points made. We suspect this response was meant for a different paper. Could you please revisit this? Thank you.

Comment

Dear Reviewer,

Thank you for your deep and thoughtful comments and the positive words on the technical aspects and the clarity and openness of our submission. We appreciate the open discussion held and of course, we are happy that you are now leaning even more positively towards the paper.

In the spirit of open scientific discussion, in relation to your point 3, please note our response to Reviewer BsfB, Table 5, where we clarify and provide experiments for the tradeoff on Lipschitz constants for which HOVer can be shown to be theoretically sound in a range of noteworthy cases.

Given your excellent understanding of the paper, in case you have the time, we would appreciate your thoughts on including such results if the paper is accepted.

Thank you.

Official Review (Rating: 3)

The work describes a verification approach, HOVer, that estimates the local robustness of neural networks to low-dimensional geometric attacks such as rotation, scaling, and translation. HOVer initially maps the N-dimensional perturbation space to one dimension along a Hilbert space-filling curve before applying an adaptive Hölder global-optimisation process, along with a set of heuristics to ensure that a reliable bound results. Empirical results show that HOVer scales to reasonably large models (ResNet-152, ViT-Large) and 3-D CNNs on video classification, and outperforms prior work (GeoRobust) while remaining reliable (no incorrect results) on 500 ground-truth queries of SoundnessBench and VNN-COMP.

Strengths and Weaknesses

Strengths

The paper describes a new combination of Hilbert curve dimensionality reduction and adaptive Hölder optimization to allow 1-D search over 4-parameter geometric spaces.

The evaluation shows that this approach scales to the largest models (to my knowledge) for verified video tasks.

Although the certified accuracy of HOVer is comparable to GeoRobust, and the adversarial accuracy is lower on most models, it appears that HOVer is likely the better solution for certified classification as GeoRobust is unsound.

Weaknesses

Although most prior geometric certification techniques could not come close to scaling to models of the size used in this eval, it would still be helpful to see how the performance compares to those methods, albeit on smaller models.

The evaluation covers accuracy and soundness, but does not mention anything about the cost and how it scales with model and data size. Without this, the audience cannot tell how practical the method really is.

While the approach seems to improve on prior work in terms of model scale, the significance of doing so for certified geometric robustness is somewhat limited. There is quite a bit of prior work on this problem, but its lack of impact on practical applications does not stem from the cost of those methods, but rather the impact that they have on accuracy and the fact that the threat model is not well aligned with practical concerns.

Questions

How many queries saw their outcome change with growing r, and what was the accompanying runtime overhead?

How would the method scale if more geometric parameters (e.g. shear, aspect-ratio) are taken into account?

Limitations

Yes

Formatting Issues

No

Author Response

Question 1.
Empirical results from the optimisation literature suggest that using a small value close to 1, e.g., 1.3, yields close to optimal solutions [1]. We report an ablation study on r in Table 1, which confirms these empirical observations. Specifically, we conducted three ablation studies on the reliability parameter r on two benchmarks where the ground-truth for the verification queries is known: TLL Verify Bench from VNN-COMP (2 dimensions) and MLP5 from SoundnessBench (10 dimensions). We report the counts of SAFE, UNSAFE, and Unknown cases, along with the average runtime in seconds (SAFE+UNSAFE, excluding Timeout). The results show that HOVer outputs no unsound results for any value of r>1, but it becomes increasingly challenging to solve verification queries with bigger values of r. This is in line with literature findings [1] which suggest that big values for r should not be used.

Table 1: Ablation study on the reliability parameter r on Benchmark TLL and MLP5

| r | TLL: SAFE (sound) | UNSAFE | Unknown | Runtime (s) | MLP5: SAFE (sound) | UNSAFE | Unknown | Runtime (s) |
|---|---|---|---|---|---|---|---|---|
| 1.1 | 15 (15) | 17 | 0 | 1.00 | 16 (16) | 3 | 14 | 20.92 |
| 1.3 | 15 (15) | 17 | 0 | 1.04 | 14 (14) | 1 | 18 | 10.58 |
| 1.5 | 13 (13) | 17 | 2 | 1.16 | 10 (10) | 2 | 21 | 10.58 |
| 3 | 1 (1) | 17 | 14 | 0.12 | 0 | 2 | 31 | 1.96 |
| 6 | 0 | 17 | 15 | 0.03 | 0 | 2 | 31 | 59.56 |
| 10 | 0 | 17 | 15 | 0.03 | 0 | 2 | 31 | 32.79 |

Question 2.

Since any composition of scaling, shearing, rotation, and translation can be represented as an affine transformation of only 7 parameters, HOVer scales to the models reported in the paper for any composition of transformations.

Specifically, although our initial experiments considered only 4 parameters, our approach extends seamlessly to the 7‑parameter affine group by replacing the transformation matrix with:

$$\left[ \begin{matrix} \lambda_x (\cos \gamma + s_x \sin \gamma) & \lambda_y (-\sin \gamma + s_x \cos \gamma) & t_x \\ \lambda_x (\sin \gamma + s_y \cos \gamma) & \lambda_y (\cos \gamma - s_y \sin \gamma) & t_y \end{matrix} \right]$$

where $\lambda_x$ and $\lambda_y$ are independent scaling factors along the horizontal and vertical axes; $\gamma$ is the rotation angle; $s_x$ and $s_y$ are the horizontal and vertical shear coefficients; and $t_x$ and $t_y$ represent translations along the x-axis and y-axis.
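As a concrete illustration of how such a matrix would drive the STN-style warping operators mentioned elsewhere in this discussion (torch.nn.functional.affine_grid and grid_sample), here is a short PyTorch sketch. It is our own example, not the authors' code: the function name `warp`, the parameter values, and the single-image interface are assumptions for exposition.

```python
import math
import torch
import torch.nn.functional as F

def warp(image, lx, ly, gamma, sx, sy, tx, ty):
    """Apply the 7-parameter affine transformation above to a (C, H, W) image tensor."""
    cos_g, sin_g = math.cos(gamma), math.sin(gamma)
    theta = torch.tensor([
        [lx * (cos_g + sx * sin_g), ly * (-sin_g + sx * cos_g), tx],
        [lx * (sin_g + sy * cos_g), ly * (cos_g - sy * sin_g),  ty],
    ], dtype=image.dtype).unsqueeze(0)                       # shape (1, 2, 3)
    grid = F.affine_grid(theta, [1, *image.shape], align_corners=False)
    return F.grid_sample(image.unsqueeze(0), grid, align_corners=False).squeeze(0)

# Example: mild rotation, anisotropic scaling, shear and translation on a dummy image.
img = torch.rand(3, 224, 224)
out = warp(img, lx=1.05, ly=0.95, gamma=0.1, sx=0.02, sy=-0.02, tx=0.05, ty=-0.03)
```

Each of the seven scalars then becomes one coordinate of the perturbation box that the Hilbert-curve reparameterisation collapses to a single dimension.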

We evaluated 100 randomly sampled ImageNet images using ResNet‑50 with our method (HOVer) under both 4‑parameter and 7‑parameter perturbations. The termination threshold was set to 500 seconds or 50,000 queries per sample. As shown in Table 2, which reports the SAFE/UNSAFE/Unknown cases and the average runtime for the SAFE + UNSAFE cases, there is only a modest performance drop when moving from isotropic scaling to anisotropic scaling plus shearing (due to the increased number of geometric parameters and the resulting higher number of unsolved cases). Nevertheless, HOVer remains able to verify higher‑dimensional geometric transformation spaces.

Table 2: Robust performance on 100 ImageNet images (ResNet‑50), with clean Acc 70%

| Geometric Perturbation | SAFE | UNSAFE | Unknown | Runtime (s) |
|---|---|---|---|---|
| Rotation + Translation + Isotropic Scaling | 33 | 60 | 7 | 59.49 |
| Rotation + Translation + Anisotropic Scaling + Shearing | 20 | 66 | 14 | 32.36 |

[Scaling]

In Appendix B, we have also reported the performance comparisons on smaller models for the MNIST, CIFAR‑10, and TinyImageNet datasets, where we evaluate against prior geometric certification techniques, including DeepG, GSmooth, TSS, and Auto_LiRPA with CGT.

The timing tables below report the average running time (in seconds) for SAFE + UNSAFE cases, excluding timeouts. Measurements are based on 100 ImageNet images and 50 Kinetics videos, evaluated across neural networks ranging from 22M to 300M parameters and data sizes from image to video. We will include all the runtimes in all tables in our revised version.

Table 3: Runtime comparison on the ImageNet dataset

| Model | NN params | Clean accuracy | SAFE | UNSAFE | Unknown | Runtime (s) |
|---|---|---|---|---|---|---|
| ResNet34 | 22M | 64% | 20 | 77 | 3 | 29.67 |
| ResNet50 | 26M | 70% | 33 | 60 | 7 | 59.49 |
| ResNet101 | 45M | 69% | 37 | 53 | 10 | 97.77 |
| ResNet152 | 60M | 74% | 39 | 51 | 10 | 104.30 |
| ViT | 300M | 79% | 13 | 67 | 20 | 65.80 |

Table 4: Runtime comparison on the Kinetics dataset

| Model | NN params | Clean accuracy | SAFE | UNSAFE | Unknown | Runtime (s) |
|---|---|---|---|---|---|---|
| x3d_m | 3.79M | 76% | 16 | 30 | 4 | 194.98 |
| c2d_r50 | 24.33M | 78% | 12 | 31 | 7 | 145.84 |

[Geometric transformations]

Geometric transformations are important in many practical applications including object detection, since geometric adversarial vulnerability is evidenced in practice [2][3]. We emphasise that our method is specifically designed for robustness analysis with respect to geometric perturbations. Within this scope, we do not consider said perturbation as a limitation, as all combinations of geometric transformations can be effectively encoded using our approach. However, we acknowledge that if the goal were to develop a general-purpose method for neural network analysis, this focus would indeed constitute a limitation.

[Comparison with the SoA]

The Appendix includes a comparison on small models against the previous SoA that can solve geometric certification queries.


[1] Introduction to global optimization exploiting space-filling curves, 2013

[2] Exploring the Landscape of Spatial Robustness, 2017

[3] Adversarial examples based on object detection tasks, 2023

Final Decision

Over the course of a very active discussion period, the authors assuaged the concerns of all reviewers except one, who did not respond to the authors or to me after multiple prompts. The actual concerns of that reviewer appear to me to be minor ones that the authors addressed in their responses. Overall, this paper describes an interesting method that is notable in its capability for scaling to the types of architectures in use today for state-of-the-art tasks. That makes it highly relevant for NeurIPS and worthy of acceptance.