/10

Poster4 位审稿人

最低2最高4标准差0.7

ICML 2025

Adaptive Sensitivity Analysis for Robust Augmentation against Natural Corruptions in Image Segmentation

Laura Yu Zheng,Wenjie Wei,Tony Wu,Jacob Clements,Shreelekha Revankar,Andre Harrison,Yu Shen,Ming Lin

提交: 2025-01-23更新: 2025-07-24

TL;DR

We estimate sensitivity curves for different perturbation functions at intervals during train-time, and sample these curves for sensitivity-based augmentation.

摘要

关键词

image segmentationaugmentationrobustnessheuristicnatural corruptionsadverse weather

评审与讨论

审稿意见

评分: 32025-03-08

This work proposes a sensitivity-guided method to improve model robustness against image corruptions. The sensitivity measure enables a selection of proper model-free augmentation policies. The experiments show that the method improves robustness of models on both real and synthetic datasets, compared to SOTA augmentation methods in image segmentation tasks.

给作者的问题

The results of AutoAugment on ACDC (Fig. 3) are surprisingly low. Are the augmentation policy the default one, or computed on ACDC and IDD separately?
Is it a model-free augmentation policy or model-agnostic method? Since the sensitivity analysis is highly related to a model, it is hard to call it model-free.

论据与证据

LN191-192: The authors claimed that: The set of α values that fulfills Q are at equal intervals along the function g. However, there aren't any proof on why they should be equal spacing. And from Figure 1, it does not look like having equal spacing.
LN160-161: The authors claimed that: we seek to find a set of increasing, nontrivial augmentation intensities α1 < α2 < . . . < αL that maximize sensitivity. However, it comes abruptly that it is important to find out α1 - αL, without explaining why.
Same as the second point, the point of having 'adequate spacing' is not explained well (LN162-163 right)

方法与评估标准

The metrics: standard metrics like mean average precision should be used.
The sensitivity analysis only considers the impact of augmentation intensity (strength), without considering the types of augmentation.
Confusing equation 4: it is unclear where the max-min is applied to. The authors claimed to maximize the minimum value, but it is still uncertain what is being optimized. The objective function should be better formulated.

理论论述

N\A

实验设计与分析

Experiments seem to use only one random seeds. There should be multiple random seeds used, and the results should provide the average and standard deviation.
Missing recent SOTAs: the authors miss the augmentation operations that are applied to the frequency domain of images, like AFA, VIPAug, HybridAug++, where the intensities of the operations also matter. And the authors can also consider adding Fourier-basis functions as one of the augmentation options when carrying out sensitivity analysis.
Figure 4: unclear meaning of different lines in one subplot. The authors mentioned 'recency', but I cannot find the definition of recency in the paper. Also, the different lines in Lighter/Darker H have high variations. The authors only explained the changing tendency but not the high variations.
Concern regarding finetuning foundation model: the model was trained on numerous data, which means it might possibly have seen the corruption images with adverse weather effects. This makes the experiments less convincing, as: 1. the authors claimed that the proposed method is robust to unseen corruptions (LN061-062), 2. the authors did not disclose what exact augmentation operations are used, and also, whether they use the same set of operations in other SOTAs for comparison is unclear.

补充材料

N\A

与现有文献的关系

The sensitivity-based augmentation technique is an interesting topic regarding efficient data augmentation. But the insights brought by this sensitivity analysis method are unclear,

遗漏的重要参考文献

More recent SOTA augmentation techniques:

Hendrycks, et al., PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures, 2022.
Yucel, et al., HybridAugment++: Unified Frequency Spectra Perturbations for Model Robustness, 2023.
Wang, et al. Fourier-basis Functions to Bridge Augmentation Gap: Rethinking Frequency Augmentation in Image Classification, 2024.
Lee, et al., Domain Generalization with Vital Phase Augmentation, 2024.

其他优缺点

The contribution points are not strong, especially the third point which depicts mostly what has been done. There should be more emphasis on the benefits and insights brought by the proposed method.

其他意见或建议

LN145-146: strange line space
Table 5: the subscript for Ours∼g and Ours∼p seems to be reversed
LN1128: 'Figure ??'
Table 1: too far from where it is referred

作者回复

2025-03-30

Thank you for your review! We clarify some misunderstandings below.

"Figure 1 not support equal spacing along function g, and there is no proof."

By equal spacing, we meant that given the set of α values that fulfills Q, the set of g(α_i) are at equal intervals along the y-axis of the function g. We have included a proof for equal spacing here: https://drive.google.com/file/d/1oOqCZXiSmLV4k4T5YyJqZ2NDmS91U2RE/view?usp=sharing

We will be sure to include this proof and clarification in the final revision.

"Authors explain that α1 < α2 < . . . < αL, without explaining why."

Thanks for pointing this out. The intuitive explanation behind solving for “adequate spacing” can be thought of as solving for “uniformly difficult augmentation levels” with respect to model sensitivity. For example, it could be that a model is robust to a wide range of intensity values for a particular perturbation, but robustness quickly degrades past a certain value.

"Standard metrics like mAP should be used."

While mAP is standard for object recognition and instance segmentation, it isn’t commonly used with semantic segmentation, which classifies by-pixel.

"Sensitivity analysis only considers augmentation intensity, without the type."

Our work considers only intra-augmentation sensitivity; currently, all augmentation types have an equal chance to be sampled at train-time, when models may be more sensitive to one type of augmentation over the other (e.g., geometric types and photometric types). Implementation wise, the change is simple: the augmentation distribution sampler can be modified to account for the absolute sensitivity of each augmentation all together. Note: we indeed considered such weighing in practice, but this distribution may skew too heavily towards certain types of augmentations, resulting in unintended overfitting.

"Experiments only use one fixed random seed. There should be multiple random seeds used."

We agree that multiple iterations of each experiment under different random seeds is important to validate that improvements in performance are not attributed to randomness. While we conduct all experiments under one fixed seed in this work, previous work (AdvSteer) has validated the consistency for sensitivity analysis results across multiple random seeds that our work has compared against. Additionally, we reduce the role of randomness in experiments by initializing ALL models to the same initialization weights.

"Authors miss the augmentation operations that are applied to the frequency domain of images."

The mentioned works in frequency space augmentation primarily deal with classification tasks, while our work focuses on segmentation. Direct translation of these works to segmentation is nontrivial, as segmentation is likely much more dependent on high frequency details. This is apparent in that some works explore frequency-based domain adaptation specifically for this task (https://openreview.net/pdf?id=b7hmPlOqr8). However, this idea is very interesting and we believe it valuable to explore in future work! Currently, the set of augmentation operations we use are consistent with other baselines we benchmark against. We choose to augment in the image space due to consistency with previous work, as well as interpretability/explainability and intuition w.r.t. sensitivity curves. Our framework can be directly applied in the frequency space, although the results in frequency space may not be a direct comparison anymore to the methods used in our current experiments.

"The fine-tuning experiment results may be questionable since the baseline model may have already seen corrupted examples drawn from the same distribution as the evaluation set."

While the foundation models we initialize weights with are based on much larger datasets, which may have seen adverse weather samples at some point in their training, we still observe improvements over baseline fine-tuning experiments when applying augmentation at fine-tuning. Considering that downstream fine-tuning does NOT involve adverse weather samples, we still find that improvement on generalization on natural corruptions to be valuable, since they were not involved in the fine-tuning process whatsoever. Did not disclose exact augmentation operations used, and whether they use the same set of operations in other methods for comparison In Section 4 (Experiments), the “Experiment Setup” excerpt mentions that all methods use the same set of augmentation operations, with the exception of IDBH, which includes two additional augmentations. As explained in the main text, hyperparameter tuning details can be found in Appendix Section A5, which describes all the augmentation operations.

"Figure 4: unclear meaning of different lines in one subplot"

We evaluate color channel sensitivity several times during training. This plot is meant to show how model sensitivity changes across color channels over the course of training.

审稿人评论

2025-04-08

Thank you for the response, which addressed most of my concerns. Hence, I decided to raise the score to weak accept.

作者评论

2025-04-08

Thank you very much for re-visiting our submission, we are glad the response clarified most concerns! We greatly appreciate your review. Again, improvements per your feedback will be included in future revisions.

审稿意见

评分: 22025-03-09

The paper addresses a practical challenge of enhancing model robustness to natural corruptions in semantic segmentation, a critical area for real-time perception applications. It proposes a novel, computationally efficient online adaptive sensitivity analysis approach (10x faster and 200x less storage than existing sensitivity analysis methods), facilitating practical deployment during training (lines 110-118).

It including real-world adverse conditions (Fog, Rain, Night, Snow) on the ACDC dataset (lines 275-288) and synthetic benchmarks such as ImageNet-C and AdvSteer across multiple datasets (6).

给作者的问题

Please see the review feedback above mentioned.

论据与证据

The authors claim significant efficiency improvements such as "10x faster computation and 200x less storage" compared to existing methods (abstract lines 010-014). However, these claims lack explicit evidence in the form of tables or figures within the main text. While Table 1 briefly touches upon this (compared with 1), crucial details like memory computation benchmarks or clear runtime comparisons during inference are missing, making these claims difficult to verify. for example. AutoAug (Cubuk et al., 2019), Basis Perturbations (Shen et al. 2021).
The paper claims to achieve state-of-the-art performance; however, the reported results (Table 2, lines 275-288 and Table 3 lines 330-363) suggest otherwise. For example, according to Table 2, AugMix and IDBH demonstrate better performance under rainy or snowy weather conditions, highlighting the dependency on adaptive sensitivity. Specifically, the authors' augmentation approach is less than 1.6% compared to existing SOTA methods for rainy conditions. Even for foggy conditions, their improvement is only marginal (0.6%) over IDBH. Additionally, there is no comparison provided regarding computational savings in terms of time, speed, or memory cost.
In Table 3, which compares results across a broader set of datasets (six different datasets), this limitation becomes more apparent. The authors' method achieves top performance only in basic augmentation scenarios, while in Clean, AdvSteer, and IN-C scenarios, TrivialAug and IDBH outperform the authors' method in 2/6, 4/6, and 5/6 datasets, respectively. Furthermore, in unseen scenarios, IDBH consistently leads performance metrics. This raises concerns about the actual benefits of including the proposed method. The authors should therefore clearly articulate the unique advantages or novel contributions of their augmentation approach.

方法与评估标准

The evaluation procedure lacks clarity and comprehensiveness, particularly concerning memory and runtime costs, as mentioned earlier. Furthermore, the experimental analyses provided (Figures 3 and Tables 2-4) are insufficient to justify claims such as "10x faster" and "200x less storage,"

理论论述

The presentation and explanation of critical algorithmic details (Algorithm 1, lines 110-164) lack clarity. For instance, abbreviations such as "pf" and "PDF" appear without clear definitions or context within the main text. Terms like "BetaBinom," "Levels append," and "Metrics append" (lines 206, 179) are introduced without proper explanation or motivation.
Additionally, equations (5) and (6) (lines 206-219) require clearer descriptions and justification to enhance reader understanding and reproducibility.

实验设计与分析

Please consider to provide explicit computational and memory storage comparisons, including clear tables and figures, supporting the claims of "10x faster" and "200x less storage", not only to AdvSteer.
Given the moderate to negligible improvement over existing methods like IDBH on certain benchmarks, could the authors clarify what distinct advantages their augmentation approach offers, particularly in practical deployment scenarios?
Highly suggested to provide clearer explanations or intuitive descriptions for the choices of parameters and methods used in Algorithm 1, particularly "BetaBinom," "Levels append," and "Metrics append"?line 179, 206, Why Gmax=2?

补充材料

Yes, we have gone through the supplementary part.

与现有文献的关系

They author compaired Adverse Conditions Dataset with Correspondences (ACDC) (Sakaridis et al., 2021) with four weather scenarios: Fog, Rain, Night, Snow (Table 2). ADE20K (Zhou et al., 2019) etc ImageNet-C (IN-C) synthetic corruption benchmark (Hendrycks & Dietterich, 2019), AdvSteer synthetic augmentation benchmark (Shen et al., 2021).

遗漏的重要参考文献

Several essential references in data augmentation and explainable AI are missing. Notably, key datasets or methods such as CVPR2020 Cityscapes, ICLR2022 Image-9, CVPR2024 XimageNet-12, BDD100K from Berkeley, Rain100L / Rain100H / Rain800 real rainy images for deraining and robustness studies, provide important contextual background, color channel and are highly relevant, have not been cited or discussed.

其他优缺点

Please see the review feedback above mentioned.

其他意见或建议

Please see the review feedback above mentioned.

作者回复

2025-03-30

Thank you! We appreciate the suggestions and will improve clarity in the revision, adding more references and notation. We address points below:

“Efficiency improvement claims are missing benchmarks like inference runtime benchmarks and memory usage.”

We would like to clarify that our efficiency improvements are strictly compared to the previous works in sensitivity analysis like AdvSteer, as presented in Table 1. Previously, sensitivity analysis (SA) was NOT computationally feasible to compute during training; it required a huge amount of local storage and long computation times dependent on dataset sizes. Our work makes SA feasible for online training, and opens many possibilities of future work along this direction (see response above to mfiq). Our efficiency claims in the faster inference time or memory usage would not be a fair comparison against other augmentation techniques. Thus, our significant efficiency claims are benchmarked purely for sensitivity analysis. In terms of training complexity, our experiments re-compute SA only 4x during training, amounting to merely 38.4 more GPU minutes for one experiment. Dataset sizes do not influence the sensitivity analysis runtime due to the use of KID, whose data efficiency we verify against Frechet Inception Distance (FID) in Appendix Section A.15. Otherwise, the total train time and memory of our adaptive SA method is comparable to randomized augmentation approaches.

“The results suggest that this approach is not SOTA, as shown in some results of Table 2 and 3.”

We respectfully disagree, and would like to clarify some interpretations of the results. In Table 2, our method occasionally performs worse on absolute accuracy (aAcc) metrics compared to other methods on rain and snow scenarios. As we mention in the corresponding Section 4.1 discussing the table results, higher numbers on aAcc but lower numbers on mIoU may be indicative of poor generalization to class imbalances, as aAcc measures the total # of correct pixels. In many of the Cityscapes data, there are disproportionately many pixels classified to “sky”. Our method achieves a higher mIoU than the other method (47.53 -> 49.36 compared to AugMix for Rain, and 45.35 -> 48.16 compared to IDBH for Snow). While more PIXELS are correctly classified, lower values on mIoU may suggest that underrepresented classes are poorly classified. We include BOTH metrics in table results to provide an interpretable metric (absolute pixel acc) as well as a class-balanced metric (mIoU) for a more complete picture comparison.

As for Table 3, while our method does not perform best on the AdvSteer benchmark compared to the next-best method, we note that it outperforms other methods in most other scenarios across multiple datasets. AdvSteer benchmark involves heavily altered synthetic data, as shown in the Appendix Section A.10; these alterations are not meant to reflect real-world corruptions, but should rather be interpreted as a synthetic limit test. In contrast, ImageNet-C benchmark reflects transformations meant to replicate real-world effects such as frost, snow, etc. We interpret the boost in performance of our results as contributing primarily to realistic corruptions, which aligns with our goals. Additionally, we include the results for AdvSteer for transparency.

“Could the authors clarify what distinct advantages their augmentation approach offers, particularly in practical deployment scenarios?”

Our method estimates model sensitivity to various augmentation transformations and samples augmentations with uniform difficulty based on this sensitivity. Unlike randomized approaches (e.g., IDBH, TrivialAug, RandomAug, AugMix) that ignore model state or model-based methods (e.g., AutoAugment) requiring pre-trained policies, our approach is a middle ground by using a generic image classifier trained on ImageNet for KID computations. Although our method incurs additional computational cost compared to randomized approaches—adding ~38.4 GPU minutes to training (4 sensitivity evaluations at ~9.6 minutes each on an RTX A4000 GPU)—it scales efficiently across datasets since KID evaluation requires only a fixed number of samples. Sensitivity analysis currently runs on a single GPU, further optimization and acceleration possible. Also, our method affects only training time and has no impact on inference speed. Sensitivity analysis is also useful for interpretability of robustness, as shown in Figure 4.

In summary, our method 1) provides a middle ground between model-based and randomized augmentation, 2) adds a negligible amount of fixed overhead that is agnostic to dataset size due to use of KID, and 3) provides interpretability and explainability on failure cases for practical deployment.

To show inference comparison with competing methods, we include timing results on experiments with Segment Anything: https://docs.google.com/spreadsheets/d/1KF0lY8iyv7Uo8K53EJmvfZqVqGVfKbuItcu-6iXXjNM/edit?usp=sharing

审稿意见

评分: 42025-03-16

This paper introduces an adaptive, sensitivity-guided augmentation method to improve the robustness of image segmentation models against natural corruptions. The idea is to perform a lightweight, online sensitivity analysis during training to identify the most impactful perturbations. This approach aims to bridge the gap between the efficiency of random augmentation techniques and the effectiveness of policy-based augmentations guided by sensitivity analysis. The authors claim their sensitivity analysis runs significantly faster and requires less storage than previous methods, enabling practical online estimation during training.

给作者的问题

What's your take on the fact that uniform augmentation of computed sensitivity analysis values (alpha) is almost as good as beta-binomial sampling?

论据与证据

The paper claims 10x speed up but to be precise it's more like 9.3x according to Table 1, where it provides a runtime and storage comparison with AdvSteer (Shen et al., 2021).
The paper mentions in the introduction that the proposed approach is general and can be applied to other tasks, architectures, or domains, which are mainly shown in the appendix:
- Different domains : medical domains in Appendix A.6
- Different architectures: Appendix A.11
- Different tasks: Classification, Appendix A.12

方法与评估标准

The authors propose an adaptive sensitivity analysis method that iteratively approximates model sensitivity curves. They use Kernel Inception Distance (KID) to measure image degradation and define sensitivity as the ratio of change in model accuracy to change in KID. They optimize an objective function (Equation 4) to find optimal augmentation intensities. The method includes a training loop (Algorithm 1) that incorporates the sensitivity analysis.
The paper uses absolute pixel accuracy (aAcc), mean pixel accuracy (mAcc), and mean Intersection-over-Union (mIoU) to evaluate segmentation performance. They evaluate on real-world corrupted datasets (ACDC, IDD) and synthetic benchmarks (ImageNet-C, AdvSteer).
The use of KID for measuring image degradation is well-motivated, and the evaluation metrics are standard for segmentation tasks. The choice of datasets covers both real-world and synthetic corruptions.

理论论述

Not applicable as this is not a theory paper.

实验设计与分析

The paper contains several experiments to evaluate their method such as evaluation on real-world corruptions (ACDC), synthetic datasets (ADE20K, VOC2012, etc), ablation studies to analyze the contribution of different components of the proposed method.
The analysis of color channel sensitivity is an interesting addition that showcases the potential of sensitivity analysis for interpretability.

补充材料

Not in details.

与现有文献的关系

Addressing the problem of robustness against natural corruptions, which is a well-explored area in image classification, for semantic segmentation is an important real-world consideration such as self-driving cars etc.
Contrasting their adaptive sensitivity analysis with previous methods like AdvSteer (Shen et al., 2021), emphasizing the improvements in efficiency and practicality.
Building upon existing data augmentation techniques (e.g. AutoAugment, DeepAug) and highlighting their limitations.

遗漏的重要参考文献

The related work section seems well-written.

其他优缺点

The proposed adaptive sensitivity-guided augmentation method is novel, and the 10x speedup boost compared to the previous sensitivity analysis is a strong point.
Improving the robustness of segmentation models is crucial for real-world applications, and the paper demonstrates significant improvements on challenging datasets.
The paper is well-written and easy to follow.

其他意见或建议

p8. sec5: "Our model can complements" -> "Our model can complement"

作者回复

2025-03-30

Thank you for your comments, we are grateful to hear that you find our work impactful in real-world robotics applications and our analyses interesting for model sensitivity! We will be sure to add writing fixes in paper revision regarding Table 1 and increase visibility of results related to our claims, which many of were in the appendix.

Regarding the question, “why is uniform augmentation of computed sensitivity analysis values (alpha) almost as good as beta-binomial sampling?”, we believe that this may be due to that the most optimal sampling in the basis augmentation spaces is already close to uniform. In future revisions, we will include an experiment with such a scenario to emphasize the advantage of not needing corruption gradients.

审稿意见

评分: 32025-03-19

This paper proposes an adaptive, on-the-fly sensitivity analysis approach to design data augmentation for increasing the robustness of the semantic segmentation models under naturally occurring corruptions. The proposed approach attempts to bridge the gap between choosing random augmentations like Trivial Aug and learned policies through RL. They do this by solving an optimization problem on-the-fly. The problem is essentially posed as finding the right set of intensities /parameters of the chosen augmentations. The objective is set through the lens of the sensitivity analysis i.e, change in model accuracy with respect to change in intensities and this guides the augmentation. The change in intensities is captured through a Kernel Inception distance which measures the difference in Inception net features for the dataset due to one augmentation and the dataset with the other augmentation. By adaptively sampling from the intensity levels to which the model is most sensitive, the authors reduce overhead compared to prior sensitivity-analysis-based methods. They present extensive experiments on real-world driving datasets, as well as generic and domain-specific segmentation benchmarks demonstrating notable improvements over several augmentation baselines. They also highlight the applicability of their approach with foundation-model fine-tuning (e.g., DinoV2).

给作者的问题

Please see strengths and weaknesses

论据与证据

Yes

方法与评估标准

Yes

理论论述

Not Applicable

实验设计与分析

Yes

补充材料

Yes; All of it

与现有文献的关系

It is related well with respect to the current literature on designing augmentation strategies for complex computer vision tasks that extend beyond classification

遗漏的重要参考文献

其他优缺点

The idea of focusing on where the model is most vulnerable (high-sensitivity intensities) is interesting and grounded. Adapting this process for on-the-fly training is a good contribution. Furthermore, the paper’s evaluation is comprehensive and the results compelling both in terms of accuracy and efficiency

Weaknesses and Questions

The approach adaptively selects corruptions, somewhat similar to meta-learning’s inner/outer loops (adaptation on a “task” and validation on a target). It would be helpful to discuss whether meta-learning techniques (e.g., using a gradient-based measure for how each corruption influences final performance) could offer a more direct optimization. Furthermore authors assume quite reasonably that they have an understanding of how the test environment behaves - a small discussion here could help in the rebuttal phase.

Next, the authors examine a broad set of “basis” transformations (RGB, HSV, blur, etc.), there may be domain-specific or more complex corruptions that aren’t captured. The paper could clarify how one might extend the method to less parameter-friendly corruptions.
Also, while they consider generic augmentations, a lot of domain specific augmentations have been developed- for example this paper (https://ieeexplore.ieee.org/document/10350672) designs a context-aware augmentation protocol for object detection, it is not clear how this approach will scale to those kind of augmentations. Also, it would be interesting to compare with parameter-efficient or partial fine-tuning approaches (e.g., LoRA) that might mitigate overfitting to specific corruptions. The paper mentions partial or efficient adaptation in passing, but an actual baseline or experiment (especially for smaller datasets) would be better. Finally, given that deep networks can exhibit “grokking” or double-descent (e.g., https://arxiv.org/abs/2201.02177), the authors relying on short adaptation intervals for measuring sensitivity could be problematic. Could there be cases where the model’s sensitivity at a short training horizon is misleading at full convergence? I would like to hear authors’ thoughts on this. Ofcourse, it was surprising that there was no discussion on segment anything which is now almost the defacto segmentation model. Can the authors comment on it too?

其他意见或建议

$\alpha_{max}$ is not defined. Is it $\alpha_{l}$ ?

作者回复

2025-03-30

Thank you! We appreciate your feedback, and address points below:

"Could meta-learning offer a more direct optimization?"

Yes. NOTE: the difference is mostly wrt the choice between data augmentation vs meta-learning as the training approach, rather than an alternative for the sensitivity analysis (SA). In MAML, we can interpret each parameterized augmentation type as a synthetic task. Then, during bilevel optimization, inner loop updates would optimize performance on specific augmentations (support) while the outer loop maintains overall performance on all aug’s (query). Our SA modeling can still contribute additional info in this case when choosing the task data subsets; the way task data is selected or parameterized is largely open-ended in meta-learning. [1] strengthens model robustness by adding a learned adversarial noise to query (outer loop) data. The difference between this approach and our proposed technique is the paradigm in which sensitivity analysis is applied to add robustness to models (data augmentation sampling vs. noise sampling query attacks in meta-learning). We may also add adversarial noises in our framework.

"How to extend to less parameter-friendly corruptions?"

One of our motivations is to utilize parameterized corruptions to improve generalization on natural corruptions that may not be easily parameterized. Prior art [1] has shown that many natural corruptions can be replicated with a composition of “basis augmentations”, from which this work is inspired by and generalized from. In cases where we don’t have access to a parameter ‘alpha’, the problem becomes similar with AdvSteer [1], which samples sensitive augmentations from a fixed set of augmentation values (instead of a continuous range). However, this approach has increased computational complexity, since we need to test all values from a selected subset of augmentations.

"How does it scale to domain-specific augmentations?"

In InterAug’s case, the primary contribution appears to be a re-contextualization around the subjects in the image, s.t. spurious co-occurrences between subjects and background do not occur throughout training. Our is a direct complement to the context area extraction. Instead of considering entire images in our SA computation, we can consider the context bounding box only. The two concepts applied together may produce context-specific sensitivities. As for other domains, the adaptation may be case-by-case. For example, lesion augmentation in medical applications involves synthetically increasing the diversity of lesion shapes, locations, intensities, and load distributions [2] – all of these can be considered as aug- types within our SA framework.

"How might parameter-efficient approaches like LoRA mitigate overfitting to specific corruptions?"

Training to increase robustness is often accompanied by degradation in clean accuracy, which may suggest either conflicting gradients or overfitting to corruptions. Using LoRA layers to mitigate this has been shown to work in AutoLoRA [3]. Since LoRA layers work very well when trained on small datasets, we may use our sensitivity analysis approach to select “uniformly difficult” augmentations for each class to generate the task dataset for LoRA training. Then, a routing approach similar to Polytropon [4] or MHR [5] can be used for inference on unseen data. This is an interesting extension of our work and a valuable future direction.

"A benchmark for efficient adaptation would be nice, especially with a small dataset."

We show results on fine-tuning for the ACDC Snow dataset in the bottom of the following spreadsheet: https://docs.google.com/spreadsheets/d/1KF0lY8iyv7Uo8K53EJmvfZqVqGVfKbuItcu-6iXXjNM/edit?usp=sharing and plan to include some more small dataset finetuning experiments in future revisions.

"Relying on short adaptation intervals might be problematic given grokking is common."

We observe in practice that increasing the number of training iterations (thus increasing the number of iterations per interval, since the number of intervals is fixed) has very little effect on performance outcome. This may suggest that the current interval values may be sufficient for generalization to sensitivity curves within intervals. We can include analysis on this in future revisions. How does this work perform relative to Segment Anything We show downstream fine-tuning results and inference time/memory benchmarks on SegmentAnything in the same spreadsheet as above. We’ll also include these in the updated revision.

" $a_{max} \neq a_L$ . $a_L$ is a level we solve for, but $a_{max}$ is the max intensity of the parameter range, which for our operations is 1."

[1] https://proceedings.neurips.cc/paper/2020/file/cfee398643cbc3dc5eefc89334cacdc1-Paper.pdf

[2] https://arxiv.org/abs/2308.09026

[3] https://openreview.net/forum?id=09xFexjhqE

[4] https://arxiv.org/abs/2202.13914

[5] https://arxiv.org/abs/2211.03831

最终决定Accept (poster)

2025-05-01

The paper received two weak accepts, an accept and a weak reject. Some of the important points raised by the reviewers prior to rebuttal stage were:

scaling to domain-specific augmentations
overstatement in case of reporting efficiency
methods lacks SOTA performance in all results
possible scenario dependent performance
concerns regarding fine-tuning foundation models

All reviewers acknowledged the rebuttal and several engaged in post-rebuttal discussion. Almost all of the major concerns were adequately addressed in the rebuttal. Some points such as non SOTA in a few cases and possibly scenario-dependent performance remained after post-rebuttal stage, but during discussion with reviewers, the consensus was that they are not of a major concern and should not lead to rejection. Also, reviewers liked the idea and found it sufficiently interesting. To this end, the decision is to recommend the acceptance of the paper.