Average rating: 6.3/10 · Poster · 4 reviewers (min 5, max 8, std 1.1)
Ratings: 6, 6, 8, 5
Average confidence: 4.0
ICLR 2024

Domain-Inspired Sharpness-Aware Minimization Under Domain Shifts

Submitted: 2023-09-19 · Updated: 2024-04-21
TL;DR

The paper presents DISAM, an algorithm enhancing optimization under domain shifts by ensuring domain-level convergence consistency, resulting in quicker convergence, improved generalization, and outperforming existing methods.

Abstract

Keywords

generalization, sharpness-aware minimization, domain shift

Reviews and Discussion

Review
Rating: 6

Targeting the domain generalization scenario with possible shifts among domains, this paper proposes to take 'per-domain optimality' into consideration when finding the perturbation of SAM. The proposed DISAM is shown to have an improved convergence rate. Numerically, DISAM outperforms other SAM alternatives.

Strengths

S1. The idea of tackling domain shift in SAM is novel.

S2. A new algorithm, DISAM, is proposed with satisfying numerical results. DISAM improves over state-of-the-art by a large margin.

Weaknesses

W1. Stronger motivation needed. The authors motivate the domain difference using Fig. 1(b). While the convergence behaviors among domains are indeed inconsistent at the early stage, the losses are similar after, e.g., 30 epochs. The authors should also explain why the difference in convergence in the early phase impacts the generalization of SAM.

W2. More discussions on $\lambda$ in Eq. (7) are needed. This is a critical parameter that considers the variance/domain shifts in DISAM. However, this $\lambda$ does not appear in Theorem 1. Can the authors illustrate more on this point? And how does the choice of $\lambda$ influence convergence and generalization?

Questions

Q1. Relation with a recent work (https://arxiv.org/abs/2309.15639).

The paper above also proposes approaches to reduce variance for finding perturbations, although not designed for the domain generalization setting. How does this work relate with the proposed DISAM?

Q2. Theorem 1 illustrates that the convergence of DISAM benefits from $\Gamma$. Can the authors explain more on the discussion of

as DISAM enjoys a smaller $\Gamma$ than SAM, DISAM can permit the potential larger $\rho$ than that in SAM, thus yielding a better generalization

In particular, how does the convergence rate link with generalization?

Q3. The last sentence in Sec 3 claims that

... allowing larger $\rho$ for better generalization.

Why does larger $\rho$ relate to better generalization?

Q4. (minor) The notation in, e.g., Eq. (5) can be improved, because the multiple subscripts $i$ in $\sum_{i} \frac{C_i}{\sum_i C_i}$ are confusing.

Comment

Q2 & Q3

Q2: Theorem 1 illustrates that the convergence of DISAM benefits from $\Gamma$. Can the authors explain more on the discussion of "as DISAM enjoys a smaller $\Gamma$ than SAM, DISAM can permit the potential larger $\rho$ than that in SAM, thus yielding a better generalization". In particular, how does the convergence rate link with generalization? Q3: The last sentence in Sec 3 claims that "... allowing larger $\rho$ for better generalization." Why does larger $\rho$ relate to better generalization?

Thank you for the advice on explaining convergence and generalization; we have added more discussion and analysis in Appendix B.3 on page 21. In the following, we provide some clarification on these points.

  • Generalization Theorem of SAM: In the SAM framework, the parameter $\rho$ plays a crucial role in determining generalizability. As established in [1], there exists an upper bound on the generalization error for SAM, suggesting that a larger $\rho$ could potentially enhance generalization, provided that convergence is not impeded. Here is the relevant generalization bound from [1]:

For any $\rho > 0$ and any distribution $\mathcal{D}$, with probability $1-\delta$ over the choice of the training set $S \sim \mathcal{D}$,
$$\mathcal{L}_{\mathcal{D}}(w) \leq \max_{\|\epsilon\|_2 \leq \rho} \mathcal{L}_{S}(w+\epsilon) + \sqrt{\frac{k \log\left(1 + \frac{\|w\|_2^2}{\rho^2}\left(1+\sqrt{\frac{\log(n)}{k}}\right)^2\right) + 4\log\frac{n}{\delta} + \tilde{O}(1)}{n-1}},$$
where $n = |S|$, $k$ is the number of parameters, and we assumed $\mathcal{L}_{\mathcal{D}}(w) \leq \mathbb{E}_{\epsilon_i \sim \mathcal{N}(0,\rho)}[\mathcal{L}_{\mathcal{D}}(w+\epsilon)]$. DISAM leverages a smaller $\Gamma$ than SAM, as shown in Theorem 1 on page 5. This allows DISAM to employ a potentially larger $\rho$, enhancing generalizability.

  • Practical Implications: Combining the above theorem with the convergence theorem (Theorem 1 on page 5), there is a trade-off with respect to $\rho$. A larger $\rho$ might theoretically enhance generalization but poses greater challenges for convergence. This reflects the intuitive notion that searching for flatter minima across a broader range is inherently more complex, which can potentially affect training efficiency. However, if $\mathcal{L}_{S}(w+\epsilon)$ converges to a sufficiently small value, a larger $\rho$ corresponds to better generalization. DISAM, with a smaller $\Gamma$ than SAM, converges faster, which means that at the same convergence speed, a larger $\rho$ can be used to achieve better generalization (a short derivation of this trade-off follows this list).

  • Empirical Validation: Our experiments, as illustrated in Figures 3(c) and (d) on page 6, demonstrate that DISAM effectively employs a larger $\rho$ compared to traditional SAM. DISAM's ability to handle a larger $\rho$ results in both consistent convergence and improved generalization compared to SAM, demonstrating its superiority in domain shift scenarios.
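To make this trade-off explicit, two observations follow directly from the bound quoted above (nothing beyond it is assumed): the sharpness term is non-decreasing in $\rho$, while the complexity term is decreasing in $\rho$ (for $\|w\|_2 > 0$):

$$\rho_1 \leq \rho_2 \;\Longrightarrow\; \max_{\|\epsilon\|_2 \leq \rho_1} \mathcal{L}_S(w+\epsilon) \leq \max_{\|\epsilon\|_2 \leq \rho_2} \mathcal{L}_S(w+\epsilon), \qquad \frac{\partial}{\partial \rho}\, k\log\left(1 + \frac{\|w\|_2^2}{\rho^2}\left(1+\sqrt{\tfrac{\log(n)}{k}}\right)^2\right) < 0.$$

Hence enlarging $\rho$ tightens the second (complexity) term of the bound but can loosen the first (sharpness) term; DISAM's faster convergence keeps the first term small even at a larger $\rho$, which is how the convergence rate links to generalization here.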

We appreciate the reviewer's comments on the theoretical parts and have adjusted the corresponding parts to make these points clearer.

Reference:

[1]. Sharpness-aware minimization for efficiently improving generalization, ICLR2021.

Q4

(minor) The notation in, e.g., Eq. (5) can be improved, because the multiple subscripts $i$ in $\sum_{i}\frac{C_i}{\sum_{i}C_i}$ are confusing.

We appreciate your detailed feedback on the notation in Eq. (5) and have updated it on page 4 to avoid confusion.

Comment

W1

Stronger motivation needed. The authors motivate the domain difference using Fig. 1(b). While the convergence behaviors among domains are indeed inconsistent at the early stage, the losses are similar after, e.g., 30 epochs. The authors should also explain why the difference in convergence in the early phase impacts the generalization of SAM.

We would like to kindly point out that the y-axis of Figure 1(b) represents the convergence degree rather than the loss value; it is computed by dividing the current loss value by the converged maximum loss. Following the reviewer's advice, we have added a section to strengthen the explanation of the motivation in Appendix C.5 of the revised version.

Relationship between early-stage convergence and generalization: In Figure 7(b) on page 24, we present the curves of various domain losses during training, which aids understanding. Essentially, the inconsistency of the convergence degree in the early phase, shown in Figure 7(a), impairs the overall convergence of the model. As can be seen in Figure 7(b), SAM converges at a higher loss value, which indicates that the model is optimized towards a poorer local minimum, resulting in worse generalization performance.

W2

More discussions on $\lambda$ in Eq. (7) are needed. This is a critical parameter that considers the variance/domain shifts in DISAM. However, this $\lambda$ does not appear in Theorem 1. Can the authors illustrate more on this point? And how does the choice of $\lambda$ influence convergence and generalization?

Thank you very much for the advice. We have elaborated on the role of $\lambda$ in DISAM in Appendix B.2 (pages 18-21) of the revised submission. In the proof of Theorem 1, specifically Eq. (15) on page 20, $\lambda$ is integrated into $\beta$, serving as a hyperparameter that regulates the weight adjustment in DISAM. It functions by modulating the degree of correction for domain shifts:

$$\beta^i_t = \alpha_i - \frac{2\lambda}{M}\left(\mathcal{L}^i(w_t) - \frac{1}{M}\sum_{j=1}^M \mathcal{L}^j(w_t)\right)$$

The influence of $\lambda$: The choice of $\lambda$ determines how aggressively DISAM responds to variance or domain shifts, with a higher $\lambda$ leading to more pronounced adjustments in $\beta$. Our experimental analysis in Figures 5(c) and (d) on page 9 reveals that DISAM's performance remains relatively stable across a wide range of $\lambda$ values. However, choosing a $\lambda$ that is too large can result in overly aggressive early-training adjustments, negatively impacting the convergence process and increasing the variance across repeated experiments. Consequently, we adopted a default $\lambda$ value of 0.1 in all experiments.
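As a concrete illustration of this weighting, here is a minimal plain-Python sketch of the expression above; the function name `disam_weights` and the example loss values are our own (hypothetical) choices, not taken from the paper or its code.

```python
def disam_weights(domain_losses, alphas, lam):
    """Compute beta_t^i = alpha_i - (2 * lam / M) * (L_i - mean(L)), as in Eq. (15).

    beta_i falls below alpha_i when L_i exceeds the mean domain loss and rises
    above alpha_i when L_i is below it; lam scales how strong the correction is.
    """
    M = len(domain_losses)
    mean_loss = sum(domain_losses) / M
    return [a - (2.0 * lam / M) * (L - mean_loss)
            for a, L in zip(alphas, domain_losses)]

# Hypothetical example: M = 3 domains with uniform alpha_i = 1/3.
losses = [1.2, 0.8, 1.0]
print(disam_weights(losses, [1 / 3] * 3, lam=0.1))  # default lambda: mild correction
print(disam_weights(losses, [1 / 3] * 3, lam=1.0))  # larger lambda: stronger correction
```

With the default $\lambda = 0.1$ the weights stay close to the base $\alpha_i$, whereas a much larger $\lambda$ shifts them substantially, matching the stability-versus-aggressiveness trade-off described above.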

Q1

Relation with a recent work (https://arxiv.org/abs/2309.15639). The paper above also proposes approaches to reduce variance for finding perturbations, although not designed for the domain generalization setting. How does this work relate with the proposed DISAM?

Thank you for recommending this excellent work on variance suppression (VaSSO) [1]. We have added this method to the revised submission with a proper discussion. Generally, while DISAM and VaSSO both enhance perturbation direction generation in SAM, they target different aspects:

  • VaSSO Approach: VaSSO aims to reduce noise from mini-batch sampling by using averaged previous perturbation directions. Its primary focus is on stabilizing perturbation generation within the same domain.

  • DISAM's Unique Focus: In contrast, DISAM incorporates domain information to specifically address domain-level convergence inconsistencies, a challenge prevalent in domain shift scenarios. DISAM's approach is to impose a variance minimization constraint on domain loss during the perturbation generation process, thereby enabling a more representative perturbation location and enhancing generalization.

We are running experiments to compare/combine DISAM with VaSSO. Once the experiments are finished, we will report the results here and include them in the submission.

Reference:

[1]. Enhancing Sharpness-Aware Optimization Through Variance Suppression, arXiv2023.

评论

Q1

Relation with a recent work (https://arxiv.org/abs/2309.15639). The paper above also proposes approaches to reduce variance for finding perturbations, although not designed for the domain generalization setting. How does this work relate with the proposed DISAM?

We conduct experiments on the DomainBed benchmark to compare VaSSO[1] and DISAM.

  • VaSSO achieves a significant improvement in in-domain convergence by reducing perturbation direction noise in PACS, VLCS, and TerraInc datasets. However, since it does not consider domain shift explicitly and only aims to make perturbation directions more consistent, it cannot guarantee that the perturbation directions are more representative, leading to a relatively modest improvement in out-of-domain performance.
  • DISAM demonstrates better generalization by incorporating domain information during perturbation direction generation. Nonetheless, DISAM exhibits reduced convergence performance within the target domain on certain datasets, such as VLCS.
  • By combining DISAM and VaSSO, we can achieve synchronized improvements in both in-domain and out-of-domain performance.

[Table 1: In-domain results]

| Method | PACS | VLCS | OfficeHome | TerraInc | DomainNet | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| SAM | 97.3 | 84.8 | 85.8 | 88.9 | 68.5 | 85.1 |
| VaSSO [1] | 97.8 | 85.7 | 86.4 | 94.5 | 68.6 | 86.6 |
| DISAM | 97.8 | 84.4 | 86.3 | 94.8 | 70.2 | 86.7 |
| VaSSO+DISAM | 97.9 | 85.8 | 86.6 | 94.9 | 70.1 | 87.1 |

[Table 2: Out-of-domain results]

| Method | PACS | VLCS | OfficeHome | TerraInc | DomainNet | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| SAM | 85.8 | 79.4 | 69.6 | 43.3 | 44.3 | 64.5 |
| VaSSO [1] | 86.1 | 79.9 | 70.5 | 46.1 | 44.8 | 65.5 |
| DISAM | 87.3 | 80.1 | 70.7 | 47.9 | 45.8 | 66.4 |
| VaSSO+DISAM | 87.2 | 80.2 | 70.9 | 48.0 | 45.8 | 66.5 |
Comment

Dear Reviewer BeTH,

We sincerely appreciate the effort and time you have devoted to providing constructive reviews, as well as your positive evaluation of our submission. As the deadline for discussion and paper revision is approaching, we would like to offer a brief summary of our responses and updates:

  • Supplementary explanation for Figure 1(b)
  • The role and principles of $\lambda$ in DISAM.
  • Similarities and differences between DISAM and VaSSO.
  • Mechanism of $\rho$'s impact on convergence and generalization.

Would you mind checking our responses and confirming if you have any additional questions? We welcome any further comments and discussions!

Best Regards,

The authors of Submission 1547

Comment

Dear Reviewer BeTH,

Thanks very much for your time and valuable comments.

We understand you might be quite busy. However, the discussion deadline is approaching, and we have only a few hours left.

Would you mind checking our response and confirming whether you have any further questions?

Thanks for your attention.

Best regards,

The authors of submission 1547.

Review
Rating: 6

The paper introduces the Domain-Inspired Sharpness Aware Minimization (DISAM) algorithm, a novel approach for optimizing under domain shifts. The motivation behind DISAM is to address the issue of inconsistent convergence rates across different domains when using Sharpness Aware Minimization (SAM), which can lead to optimization biases and hinder overall convergence.

The key innovation of DISAM lies in its focus on maintaining consistency in domain-level convergence. It achieves this by integrating a constraint that minimizes the variance in domain loss. This strategy allows for adaptive gradient perturbation: if a domain is already well-optimized (i.e., its loss is below the average), DISAM will automatically reduce the gradient perturbation for that domain, and increase it for less optimized domains. This approach helps balance the optimization process across various domains.

Theoretical analysis provided in the paper suggests that DISAM can lead to faster overall convergence and improved generalization, especially in scenarios with inconsistent domain convergence. The paper supports these claims with extensive experimental results, demonstrating that DISAM outperforms several state-of-the-art methods in various domain generalization benchmarks. Additionally, the paper highlights the efficiency of DISAM in fine-tuning parameters, particularly when combined with pretraining models, presenting a significant advancement in the field.

Strengths

As of now, there has not yet been a sharpness-aware minimization (SAM) methodology developed specifically for addressing distribution shifts. The issue of varying convergence rates across different domains, as observed in SAM, is undeniably a significant challenge.

This methodology presents an impressive degree of compatibility, as it can be integrated with a variety of sharpness-variants. An especially commendable aspect of this approach is its computational efficiency. Compared to standard SAM techniques, it does not incur additional computational costs, making it a practical option for scenarios where resource constraints are a consideration.

In summary, the development of a SAM methodology that is adept at handling distribution shifts, and particularly its implications for domain convergence, is both novel and highly relevant in the current landscape of optimization challenges.

Weaknesses

The idea of minimizing the variance between losses, a core aspect of the presented methodology, is not entirely novel. Similar concepts have been previously explored in methods like vREX (Out-of-Distribution Generalization via Risk Extrapolation) and further extended to gradient computations in methodologies like Fishr (Invariant Gradient Variances for Out-of-Distribution Generalization). In this context, the proposed approach appears to be an incremental adaptation of vREX principles applied specifically to the challenges faced in Sharpness Aware Minimization (SAM) scenarios.

The improvement in out-of-distribution (OOD) performance using the DISAM methodology does not appear intuitive. In fact, when comparing its performance enhancements to those achieved with CLIPOOD, as reported, the difference seems marginal. This observation raises questions about the actual effectiveness of DISAM, particularly in the context of fine-tuning methodologies.

Questions

Similar to how transitioning from ERM to vREX in optimization has been shown to enhance domain generalization performance, the application of vREX to SAM in the form of this methodology could be seen as a natural extension that brings comparable performance improvements. Furthermore, it is a valid assertion that incorporating various algorithms tailored for domain generalization (such as Fish, Fishr, gradient alignment) into the SAM optimization framework could potentially yield performance enhancements. The logic here is that these methods, when applied within the context of SAM, could enhance its ability to generalize across domains.

However, the critique that DISAM may simply be an incremental version of applying domain generalization methodologies to SAM is not without its counterarguments. It's important to consider the specific challenges and nuances of the SAM framework and how DISAM addresses these. If DISAM introduces significant modifications or adaptations that are uniquely tailored to the idiosyncrasies of SAM, then its contribution could extend beyond a mere incremental update. The key would lie in the specifics of how DISAM modifies or enhances the existing principles of SAM and domain generalization methods, making it more than just a straightforward application of known techniques.

In summary, while the perspective that DISAM is an incremental version of existing methodologies is certainly tenable, a comprehensive evaluation would require a deeper exploration of how DISAM specifically adapts or augments the SAM framework to address its unique challenges. If such adaptations are significant, they could justify the novelty and utility of DISAM beyond a simple combination of existing techniques.

Can you provide the reproducible code during the rebuttal period?


Comment

W2

The improvement in out-of-distribution (OOD) performance using the DISAM methodology does not appear intuitive. In fact, when comparing its performance enhancements to those achieved with CLIPOOD, as reported, the difference seems marginal. This observation raises questions about the actual effectiveness of DISAM, particularly in the context of fine-tuning methodologies.

We would like to kindly clarify that in Table 2, the results of CLIPOOD in gray are from the original paper [1], but we could not reproduce these results with the open-source code of [1] (even after considerable hyperparameter searching). The best results obtained with their open-source code are reported as CLIPOOD*, and on the basis of these results, DISAM's improvement is actually not marginal (see the following table for convenience).

Besides, it is crucial to highlight that DISAM's advantages become more evident in open-class scenarios. As can be seen in the following table (or Table 3 on page 8), DISAM notably outperforms CLIPOOD in these scenarios, achieving an average improvement of 2.8% on new classes and 1.0% on base classes. This is significant considering that CoOp and CLIPOOD even underperform the zero-shot results on new classes.

| Method | Results on DomainBed | Open-class Results on Base Classes | Open-class Results on New Classes |
| --- | --- | --- | --- |
| Zero-shot | 70.2 | 72.6 | 67.4 |
| CoOp | 73.4 | 74.4 | 66.3 |
| +DISAM | 74.8 | 75.3 | 69.6 |
| CLIPOOD* | 77.9 | 76.0 | 66.9 |
| +DISAM | 78.8 | 77.0 | 69.7 |

For a comprehensive understanding of how DISAM enhances fine-tuning in open-class scenarios, we kindly refer the reviewer to the in-depth analysis provided in Appendix C.6 on page 25. This section substantiates DISAM's role in improving generalization during fine-tuning, especially in contexts where existing methods struggle to achieve improvements.

Reference:

[1]. CLIPOOD: Generalizing CLIP to Out-of-Distributions, ICML2023.

Reproducible code

Can you provide the reproducible code during the rebuttal period?

We provide an anonymized version of the code repository, accessible through this 2-hop link: [https://openreview.net/forum?id=I4wB3HA3dJ&noteId=e1Uu30vHqy]. To aid understanding of the code, the reviewers can also refer to Algorithm 1 and the "Pseudo Code of DISAM" in Appendix D on pages 25-26.

Comment

W1 & Question

W1: The idea of minimizing the variance between losses, a core aspect of the presented methodology, is not entirely novel. Similar concepts have been previously explored in methods like vREX (Out-of-Distribution Generalization via Risk Extrapolation) and further extended to gradient computations in methodologies like Fishr (Invariant Gradient Variances for Out-of-Distribution Generalization). In this context, the proposed approach appears to be an incremental adaptation of vREX principles applied specifically to the challenges faced in Sharpness Aware Minimization (SAM) scenarios. Question: In summary, while the perspective that DISAM is an incremental version of existing methodologies is certainly tenable, a comprehensive evaluation would require a deeper exploration of how DISAM specifically adapts or augments the SAM framework to address its unique challenges. If such adaptations are significant, they could justify the novelty and utility of DISAM beyond a simple combination of existing techniques.

Second, DISAM is orthogonal to existing state-of-the-art methods, including V-REx and Fishr, and can improve their generalization performance. The following tables present results that demonstrate DISAM's superiority:

[Table 1. Comparison with existing methods.]

| Method | PACS | VLCS | OfficeHome | TerraInc | DomainNet | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| IRM | 83.5 | 78.5 | 64.3 | 47.6 | 33.9 | 61.6 |
| V-REx [1] | 84.9 | 78.3 | 66.4 | 46.4 | 33.6 | 61.9 |
| V-REx+DISAM | 85.8 | 78.4 | 70.5 | 45.9 | 42.3 | 64.6 |
| Fish [3] | 85.5 | 77.8 | 68.6 | 45.1 | 42.7 | 63.9 |
| Fishr [4] | 86.9 | 78.2 | 68.2 | 53.6 | 41.8 | 65.7 |
| Fishr+DISAM | 87.5 | 79.2 | 70.7 | 54.8 | 43.9 | 67.2 |

[Table 2. Comparison with existing SAM-based methods.]

| Method | PACS | VLCS | OfficeHome | TerraInc | DomainNet | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| SAM | 85.8 | 79.4 | 69.6 | 43.3 | 44.3 | 64.5 |
| SAM+DISAM | 87.3 | 80.1 | 70.7 | 47.9 | 45.8 | 66.4 |
| GSAM | 85.9 | 79.1 | 69.3 | 47.0 | 44.6 | 65.1 |
| GSAM+DISAM | 87.2 | 80.0 | 70.8 | 50.6 | 45.6 | 66.8 |
| SAGM | 86.6 | 80.0 | 70.1 | 48.8 | 45.0 | 66.1 |
| SAGM+DISAM | 87.5 | 80.7 | 71.0 | 50.0 | 46.0 | 67.0 |

These results clearly demonstrate DISAM’s distinctive approach and its effectiveness in enhancing generalization, supporting its novelty and utility beyond existing methodologies.

We appreciate the reviewer's constructive suggestion and have added a comparison table and more discussion of related works (see Appendix A.2.5 on pages 17-18) to clarify the novelty of DISAM in the revision.

Reference:

[1]. Out-of-distribution generalization via risk extrapolation (rex), ICML2021.

[2]. Invariant risk minimization, arXiv2019.

[3]. Gradient matching for domain generalization, arXiv2021.

[4]. Fishr: Invariant gradient variances for out-of-distribution generalization, ICML2022.

Comment

W1 & Question

W1: The idea of minimizing the variance between losses, a core aspect of the presented methodology, is not entirely novel. Similar concepts have been previously explored in methods like vREX (Out-of-Distribution Generalization via Risk Extrapolation) and further extended to gradient computations in methodologies like Fishr (Invariant Gradient Variances for Out-of-Distribution Generalization). In this context, the proposed approach appears to be an incremental adaptation of vREX principles applied specifically to the challenges faced in Sharpness Aware Minimization (SAM) scenarios. Question: In summary, while the perspective that DISAM is an incremental version of existing methodologies is certainly tenable, a comprehensive evaluation would require a deeper exploration of how DISAM specifically adapts or augments the SAM framework to address its unique challenges. If such adaptations are significant, they could justify the novelty and utility of DISAM beyond a simple combination of existing techniques.

First, we apologize for a typo in Eq. (7) that may have misled the reviewer, which we have corrected with a proper description in the revised submission. Concretely, the $w$ (i.e., $\hat{w}$ in the revised version) in the variance term does not have its derivative taken when optimizing the model parameters. That is to say, the $w$ (i.e., $\hat{w}$ in the revised version) only takes effect during the inner loop for perturbation generation. This makes DISAM intrinsically different from V-REx [1] (an extension of IRM [2]). We use the table below to comprehensively compare DISAM and V-REx and clarify our difference. Generally, V-REx focuses on achieving consistent loss values across domains, while Fish [3] and Fishr [4] emphasize gradient consistency across domains to foster out-of-domain generalization.

In comparison, DISAM targets the challenge of generating effective perturbation directions for sharpness estimation in domain shift scenarios by introducing the guidance of domain-level loss-variance minimization, which does not affect the training objective (i.e., the first term of Eq. (7)). Unlike V-REx, which directly minimizes domain loss variance and can negatively impact convergence, or Fish and Fishr, which constrain gradient updates, DISAM adopts a distinct strategy, minimizing the sharpness of multiple domains consistently, to enhance generalization (a minimal code sketch of this two-step procedure follows the table below).

| Method | Total Optimization Function | Optimization on $w$ | Optimization on $\epsilon$ |
| --- | --- | --- | --- |
| ERM | $\min_{w} \sum_{i=1}^M \alpha_i \mathcal{L}_i(w)$ | Same as left | $\times$ |
| V-REx [1] | $\min_{w} \sum_{i=1}^M \alpha_i \mathcal{L}_i(w) + \beta \mathrm{Var}\{\mathcal{L}_i(w)\}_{i=1}^M$ | Same as left | $\times$ |
| Fish [3] | $\min_{w} \sum_{i=1}^M \alpha_i \mathcal{L}_i(w) - \gamma \frac{2}{M(M-1)} \sum_{i,j\in[1,M]}^{i\neq j} \nabla\mathcal{L}_i(w) \cdot \nabla\mathcal{L}_j(w)$ | Same as left | $\times$ |
| Fishr [4] | $\min_{w} \sum_{i=1}^M \alpha_i \mathcal{L}_i(w) - \lambda \frac{1}{M} \sum_{i=1}^M \|\nabla\mathcal{L}_i(w) - \nabla\mathcal{L}(w)\|^2$ | Same as left | $\times$ |
| SAM | $\min_{w} \max_{\|\epsilon\|_2\leq\rho} \sum_{i=1}^M \alpha_i \mathcal{L}_i(w+\epsilon)$ | $\min_{w} \sum_{i=1}^M \alpha_i \mathcal{L}_i(w+\epsilon)$ | $\max_{\|\epsilon\|_2\leq\rho} \sum_{i=1}^M \alpha_i \mathcal{L}_i(w+\epsilon)$ |
| DISAM | $\min_{w} \max_{\|\epsilon\|_2\leq\rho} \left[\sum_{i=1}^M \alpha_i \mathcal{L}_i(w+\epsilon) - \lambda \mathrm{Var}\{\mathcal{L}_i(\hat{w}+\epsilon)\}_{i=1}^M\right]$ | $\min_{w} \sum_{i=1}^M \alpha_i \mathcal{L}_i(w+\epsilon)$ | $\max_{\|\epsilon\|_2\leq\rho} \left[\sum_{i=1}^M \alpha_i \mathcal{L}_i(w+\epsilon) - \lambda \mathrm{Var}\{\mathcal{L}_i(w+\epsilon)\}_{i=1}^M\right]$ |
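To make the role of $\hat{w}$ concrete, below is a minimal PyTorch-style sketch of one update following the DISAM row above. It is our own illustrative reconstruction from Eq. (7), not the authors' released implementation; the function name `disam_step`, the per-domain batch interface, and the uniform $\alpha_i$ are assumptions made for brevity.

```python
import torch

def disam_step(model, domain_batches, loss_fn, optimizer, rho=0.05, lam=0.1):
    """One illustrative DISAM update (a sketch, not the official code).

    Step 1: the perturbation follows the gradient of
            sum_i alpha_i * L_i(w) - lam * Var{L_i(w)}, so the variance term
            shapes the perturbation direction only.
    Step 2: w is updated with the plain objective sum_i alpha_i * L_i(w + eps);
            the variance term does not enter this outer gradient (the hat-w point).
    """
    params = [p for p in model.parameters() if p.requires_grad]
    M = len(domain_batches)
    alphas = [1.0 / M] * M  # e.g. alpha_i = n_i / N; uniform here for simplicity

    # Step 1: perturbation objective and its gradient at the current w.
    losses = torch.stack([loss_fn(model(x), y) for x, y in domain_batches])
    inner = sum(a * L for a, L in zip(alphas, losses)) - lam * losses.var(unbiased=False)
    grads = torch.autograd.grad(inner, params)
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    eps = [rho * g / grad_norm for g in grads]

    # Step 2: evaluate the variance-free loss at w + eps and update w with it.
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)                          # w -> w + eps
    outer = sum(a * loss_fn(model(x), y) for a, (x, y) in zip(alphas, domain_batches))
    optimizer.zero_grad()
    outer.backward()                           # gradient of the first term only
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)                          # restore w before the optimizer step
    optimizer.step()
    return outer.item()
```

The table's key distinction is visible in Step 2: only the weighted sum of domain losses is backpropagated for the weight update, while the variance term influenced only where the sharpness was probed.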
Comment

Dear Reviewer d1EH,

We sincerely appreciate the effort and time you have devoted to providing constructive reviews, as well as your positive evaluation of our submission. As the deadline for discussion and paper revision is approaching, we would like to offer a brief summary of our responses and updates:

  • Detailed explanation of DISAM's novelty and contributions.
  • Description and analysis in an open-class setting for DISAM.
  • Submission of reproducible code.

Would you mind checking our responses and confirming if you have any additional questions? We welcome any further comments and discussions!

Best Regards,

The authors of Submission 1547

Comment

We greatly appreciate the reviewer's dedicated time and effort in thoroughly reviewing our paper and providing professional and valuable feedback.

Best,

The authors of submission 1547.

Review
Rating: 8

This paper introduces a novel optimization algorithm named Domain-Inspired Sharpness Aware Minimization (DISAM) tailored for challenges arising from domain shifts. It seeks to maintain consistency in sharpness estimation across domains by introducing a constraint to minimize the variance in domain loss. This approach facilitates adaptive gradient adjustments based on the optimization state of individual domains. Theoretical and empirical findings show the proposed method offers faster convergence and superior generalization under domain shifts.

Strengths

  1. The proposed method targets model generalization under domain shifts, which is a common challenge in machine learning. To date, there has been a lack of thorough investigation into sharpness-based optimization in the context of domain shifts, and the idea of constraining the variance of losses among training domains is interesting.

  2. The paper not only presents theoretical evidence showcasing the efficiency of DISAM, but it also provides empirical data to support this claim, demonstrating the improved performance across various domain generalization benchmarks.

  3. The analytical experiments conducted in this paper are comprehensive and lucid, providing evidence of DISAM's efficacy in enhancing convergence speed and mitigating model sharpness. Additionally, the study investigates the application of DISAM for fine-tuning a CLIP-based model, aiming to achieve improved open-class generalization.

Weaknesses

  1. SAM-based optimization incurs twice the computational overhead and additional storage overhead in comparison to the commonly used SGD. While DISAM, the method proposed in this paper, demonstrates faster convergence under domain shift conditions when compared to SAM, it does not include a comparison with optimizers such as SGD or Adam.

  2. This paper employs multiple benchmarks to evaluate the performance of multi-source domain generalization. The article highlights the need for advancements in the domain shift perspective of the SAM method and suggests conducting comparisons between DISAM and the state-of-the-art (SOTA) method to further validate the effectiveness of the proposed approach.

  3. The value of $\rho$ in DISAM significantly influences both the convergence speed and generalizability. More discussion is needed on how to effectively determine this value to maximize the benefits of the proposed method.

Questions

  1. The article presents a theoretical analysis suggesting that larger values of the parameter $\rho$ should lead to improved generalization, given that convergence is guaranteed. It is important to reflect this aspect in the experiments to provide stronger evidence and validation.

  2. Regarding the open-class generalization of the CLIP-based model, further experimental analysis should be conducted to elucidate the reasons behind the superior performance of DISAM.

For other questions, please refer to the weaknesses.

Comment

W3 & Q1

W3: The value of $\rho$ in DISAM significantly influences both the convergence speed and generalizability. It needs more discussion on how to effectively determine the value to maximize the benefits of the proposed method. Q1: The article presents a theoretical analysis suggesting that larger values of the parameter $\rho$ should lead to improved generalization, given that convergence is guaranteed. It is important to reflect this aspect in the experiments to provide stronger evidence and validation.

Thank you for the suggestion. Following the advice, we have enriched the discussion about the value of $\rho$ in Appendix B.3 (page 21) of the revised version. Regarding the experiments, we have actually conducted the corresponding experiments to verify this aspect. We kindly refer the reviewer to Appendix B.3 for more details. We summarize the parts relevant to the reviewer's questions as follows.

  • Generalization Theorem of SAM: In the SAM framework, the parameter $\rho$ plays a crucial role in determining generalizability. As established in [1], there exists an upper bound on the generalization error for SAM, suggesting that a larger $\rho$ could potentially enhance generalization, provided that convergence is not impeded. Here is the relevant generalization bound from [1]:

For any $\rho > 0$ and any distribution $\mathcal{D}$, with probability $1-\delta$ over the choice of the training set $S \sim \mathcal{D}$,
$$\mathcal{L}_{\mathcal{D}}(w) \leq \max_{\|\epsilon\|_2 \leq \rho} \mathcal{L}_{S}(w+\epsilon) + \sqrt{\frac{k \log\left(1 + \frac{\|w\|_2^2}{\rho^2}\left(1+\sqrt{\frac{\log(n)}{k}}\right)^2\right) + 4\log\frac{n}{\delta} + \tilde{O}(1)}{n-1}},$$
where $n = |S|$, $k$ is the number of parameters, and we assumed $\mathcal{L}_{\mathcal{D}}(w) \leq \mathbb{E}_{\epsilon_i \sim \mathcal{N}(0,\rho)}[\mathcal{L}_{\mathcal{D}}(w+\epsilon)]$. This theorem's proof focuses solely on the magnitude of $\rho$, thus affirming the applicability of this theoretical framework to DISAM.

  • Practical Implications: When considering the convergence Theorem 1 on page 5 alongside the above generalization theorem, a critical trade-off emerges with respect to $\rho$. A larger $\rho$ might theoretically enhance generalization but poses greater challenges for convergence. This reflects the intuitive notion that searching for flatter minima across a broader range is inherently more complex, which can potentially affect training efficiency.

  • Empirical Validation: DISAM, with its accelerated convergence, can utilize a larger $\rho$ while still maintaining acceptable convergence. This advantage is empirically showcased in Figures 3(c) and (d) on page 6, where we demonstrate that DISAM effectively employs a larger $\rho$ compared to traditional SAM. This ensures both convergence and enhanced generalization. Such a capability to balance convergence efficiency and generalization is a distinguishing feature of DISAM over conventional SAM methods.

Reference:

[1]. Sharpness-aware minimization for efficiently improving generalization, ICLR2021.

Q2

Regarding the open class generalization of the clip-based model, further experimental analysis should be conducted to elucidate the reasons behind the superior performance of DISAM.

Thank you for the suggestion. To clarify the reasons, we have added more analysis in Appendix C.6 on page 25 of the revised version. Generally, according to the results in Table 3 (we select some results in the following table for reference) and Figure 4, we can see:

  • ERM tends to overfit to training data classes: As shown in the table below, although CoOp and CLIPOOD perform better on base classes than zero-shot, their performance on new classes is worse than zero-shot. This suggests that the fine-tuned parameters overfit to the existing training data distribution from both the domain and class perspectives. Figure 4 visualizes the change in performance trends during the training process, and we observe a trend where ERM initially performs well on base classes but then exhibits a decline on new classes, suggesting a collapse of the feature space onto the training data classes.

  • DISAM improves generalization on new classes: Although SAM offers some relief from overfitting, its performance on new classes does not match zero-shot levels. In contrast, DISAM, by minimizing sharpness more effectively, shows improved performance on new classes, especially in domain shift scenarios.

| Method | Results on Base | Results on New |
| --- | --- | --- |
| Zero-shot | 72.6 | 67.4 |
| CoOp | 74.4 | 66.3 |
| CoOp+DISAM | 75.3 | 69.6 |
| CLIPOOD | 76.0 | 66.9 |
| CLIPOOD+DISAM | 77.0 | 69.7 |
Comment

W1

SAM-based optimization incurs twice the computational overhead and additional storage overhead in comparison to the commonly used SGD. While DISAM, the method proposed in this paper, demonstrates faster convergence under domain shift conditions when compared to SAM, it does not include a comparison with optimizers such as SGD or Adam.

Thank you for highlighting the point about computational cost. We acknowledge that although SAM-based methods help improve generalization, this usually comes at the expense of approximately double the computational effort per iteration compared to standard SGD, due to the sharpness-aware perturbation. DISAM shares a similar cost with other SAM-based methods, since we aim to address the domain-level inconsistency during sharpness estimation rather than training acceleration. In the revised version, we highlight this point with empirical verification in Figure 8 on page 25, comparing DISAM with ERM: although DISAM converges faster than SAM, it does not outpace ERM in terms of convergence speed. A potential extension combining DISAM with acceleration methods like [1,2] can be explored in future work.

Reference:

[1]. Efficient sharpness-aware minimization for improved training of neural networks, ICLR2022.

[2]. Sharpness-aware training for free, NeurIPS2022.

W2

This paper employs multiple benchmarks to evaluate the performance of multi-source domain generalization. The article highlights the need for advancements in the domain shift perspective of the SAM method and suggests conducting comparisons between DISAM and the state-of-the-art (SOTA) method to further validate the effectiveness of the proposed approach.

We would like to kindly clarify that in Table 1, we have conducted comparisons with some SOTA methods [1-2], such as the latest SAM-based method SAGM [1] in combination with CORAL [2]. To further address the reviewer's concern, we here add two new SOTA methods, "Decompose, Adjust, Compose" (DAC-SC) [3] and Fishr [4] (following reviewer d1EH's recommendation). The table below presents the performance of different methods across multiple domains:

| Method | PACS | VLCS | OfficeHome | TerraInc | DomainNet | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| ERM | 85.5 | 77.5 | 66.5 | 46.1 | 43.8 | 63.9 |
| SAM | 85.8 | 79.4 | 69.6 | 43.3 | 44.3 | 64.5 |
| DISAM | 87.3 | 80.1 | 70.7 | 47.9 | 45.8 | 66.4 |
| Fishr (ICML2022) [4] | 86.9 | 78.2 | 68.2 | 53.6 | 41.8 | 65.7 |
| Fishr+DISAM | 87.5 | 79.2 | 70.7 | 54.8 | 43.9 | 67.2 |
| DAC-SC (CVPR2023) [3] | 87.5 | 78.7 | 70.3 | 46.5 | 44.9 | 65.6 |
| DAC-SC+DISAM | 88.7 | 79.1 | 70.6 | 47.4 | 45.6 | 66.3 |
| SAGM (CVPR2023) [1] | 86.6 | 80.0 | 70.1 | 48.8 | 45.0 | 66.1 |
| SAGM+DISAM | 87.5 | 80.7 | 71.0 | 50.0 | 46.0 | 67.0 |

These results clearly illustrate the consistent improvement of DISAM on a range of methods in the field of multi-source domain generalization.

Reference:

[1]. Sharpness-aware gradient matching for domain generalization, CVPR2023.

[2]. Deep coral: Correlation alignment for deep domain adaptation, ECCV2016.

[3]. Decompose, Adjust, Compose: Effective Normalization by Playing with Frequency for Domain Generalization, CVPR2023.

[4]. Fishr: Invariant gradient variances for out-of-distribution generalization, ICML2022.

Comment

Dear Reviewer mma9,

We sincerely appreciate the effort and time you have devoted to providing constructive reviews, as well as your positive evaluation of our submission. As the deadline for discussion and paper revision is approaching, we would like to offer a brief summary of our responses and updates:

  • Comparison of convergence speed with ERM.
  • Expanded comparisons with more state-of-the-art methods.
  • Discussion on $\rho$'s impact on convergence and generalization.
  • Explanations for the open-class experiments.

Would you mind checking our responses and confirming if you have any additional questions? We welcome any further comments and discussions!

Best Regards,

The authors of Submission 1547

Comment

Thanks for the response. After reading the authors' thorough rebuttal and other reviewers' comments, I feel the authors have well addressed my concerns and thus will increase my score.

Best, The reviewer

Comment

We sincerely appreciate your feedback regarding our efforts to address your concerns, and we would like to express our gratitude for your positive support. We will carefully consider all of your advice and incorporate the resulting improvements into the final version.

Best,

The authors of submission 1547.

Review
Rating: 5

Due to the inconsistent convergence degree of SAM across different domains, the optimization may bias towards certain domains and thus impair the overall convergence. To address this issue, this paper considers the domain-level convergence consistency in the sharpness estimation to prevent the overwhelming perturbations for less optimized domains. Specifically, DISAM introduces the constraint of minimizing variance in the domain loss. When one domain is optimized above the averaging level w.r.t. loss, the gradient perturbation towards that domain will be weakened automatically, and vice versa.

Strengths

They identify that the use of SAM has a detrimental impact on training under domain shifts, and further analyze that the reason is the inconsistent convergence of training domains that deviates from the underlying i.i.d assumption of SAM.

Weaknesses

This paper considers the domain-level convergence consistency in SAM for multiple domains and proposes to adopt the domain loss variance in the training loss. Convergence consistency is a general issue and the solution is standard, so the novelty is not sufficiently clear for publication at ICLR.

Questions

  1. In the definition of the variance between different domain losses, the values of the loss in different domains are restricted. Which one is more important: the value of the losses in different domains, or the minimization speed of the loss in different domains?
  2. In the learning of multiple domains there is multi-objective optimization, so is domain-level convergence consistency a general issue under domain shifts, or is convergence consistency a general issue in multi-objective optimization?
  3. This paper considers the domain-level convergence consistency in SAM for multiple domains and proposes to adopt the domain loss variance in the training loss. Convergence consistency is a general issue and the solution is standard, so the novelty is not sufficiently clear.
Comment

Q1

In the definition of the variance between different domain losses, the values of loss between different domains are restricted. Which one is more import? The value of losses in different domains, or the minimization speed of loss in different domains?

Firstly, we are sorry that a typo in Eq. (7) has misled your understanding. In the revised manuscript, we have addressed this typo issue in the description of Eq. (7) as follows.

$$\min_{w} \mathbb{E}_{\xi \in \mathcal{S}}[\mathcal{L}_{DISAM}(w;\xi)] \triangleq \min_{w} \max_{\|\epsilon\|_2 \leq \rho}\left[\sum_{i=1}^M \alpha_i \mathcal{L}_i(w+\epsilon) - \lambda \mathrm{Var}\{\mathcal{L}_i(\hat{w}+\epsilon)\}_{i=1}^M\right]$$

Here $\hat{w}$ is $w$ without its derivative taken during optimization, and it only takes effect in the $\max_{\|\epsilon\|_2 \leq \rho}$ loop without affecting the first term.

  • Variance Minimization Focus: DISAM primarily focuses on minimizing variance during the generation of perturbation directions $\epsilon$. The outer optimization w.r.t. $w$ does not involve a trade-off between the empirical loss term and the variance term, as we enforce $\hat{w}$ (assigned by $w$) without its derivative taken.
  • Crucial Role of Minimizing Variance: Minimizing domain-level variance in the perturbation generation loop is critical. Our experiments, illustrated in Figures 5(c) and 5(d) on page 9, show a marked decrease in generalization performance when $\lambda = 0$, confirming its essential effectiveness in DISAM. Furthermore, DISAM exhibits robust performance across a broad range of $\lambda$ values.

Q2

In the learning of multiple domains there is multi-objective optimization, so is domain-level convergence consistency a general issue under domain shifts, or is convergence consistency a general issue in multi-objective optimization?

Differences from multi-objective optimization:

  • The research topic is intrinsically different from multi-objective optimization. Specifically, DISAM has a single ultimate objective: improving the generalization performance of a model trained on multiple domains. This emphasizes both in-domain generalization and out-of-domain generalization, while multi-objective optimization usually refers to improving the multiple objectives of all collaborating tasks.
  • Methodologically, we still use the SAM optimization framework and do not involve a multi-objective optimization process during training. Specifically, $\{\alpha_i = n_i/N\}_{i=1}^M$ are constants in Eq. (7), unlike the task-specific variables to be learned in multi-objective optimization. We observed the negative impact of domain-level convergence inconsistency on SAM-based methods during the perturbation direction generation process. DISAM achieves better perturbation directions for out-of-domain generalization by minimizing the variance of the domain losses.
Comment

Weakness & Q3

Weakness: This paper considers the domain-level convergence consistency in SAM for multiple domains and proposes to adopt the domain loss variance in the training loss. Convergence consistency is a general issue and the solution is standard, so the novelty is not sufficiently clear for publication at ICLR. Q3: This paper considers the domain-level convergence consistency in SAM for multiple domains and proposes to adopt the domain loss variance in the training loss. Convergence consistency is a general issue and the solution is standard, so the novelty is not sufficiently clear.

Differences from general convergence consistency issue:

  • Distinct focus: DISAM focuses on the issue that SAM-based methods are unable to accurately estimate sharpness in domain shift scenarios, leading to ineffective sharpness minimization and reduced generalization performance.
  • Enhancing on top of general methods: While traditional solutions[1,2,3] aim at convergence consistency in parameter optimization, DISAM's methodology is distinct and orthogonal. It builds upon methods like V-REx[2] and Fishr[3], but goes further in enhancing out-of-domain generalization through better sharpness minimization. This is evident in our experiments, where combining DISAM with Fishr results in significant performance gains (as shown in the table below).
| Method | PACS | VLCS | OfficeHome | TerraInc | DomainNet | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| V-REx [2] | 84.9 | 78.3 | 66.4 | 46.4 | 33.6 | 61.9 |
| V-REx+DISAM | 85.8 | 78.4 | 70.5 | 45.9 | 42.3 | 64.6 |
| Fishr [3] | 86.9 | 78.2 | 68.2 | 53.6 | 41.8 | 65.7 |
| Fishr+DISAM | 87.5 | 79.2 | 70.7 | 54.8 | 43.9 | 67.2 |

Novelty and contributions:

  • We first identify that the straightforward application of SAM has a detrimental impact on training under domain shifts (as shown in Figure 1 and Table 1). Specifically, we observed that the way SAM generates perturbation directions amplifies the inconsistency in convergence between domains, leading to inaccurate sharpness estimation and making sharpness minimization less effective.
  • DISAM handles the above challenge by imposing a variance minimization constraint on the domain losses during the sharpness estimation process, thereby enabling a more representative perturbation location and enhancing generalization.
  • The significant improvements in extensive experimental results (as shown in Table 1-3) validate DISAM's novelty and practical relevance.

Reference:

[1]. Invariant risk minimization, arXiv2019.

[2]. Out-of-distribution generalization via risk extrapolation (rex), ICML2021.

[3]. Fishr: Invariant gradient variances for out-of-distribution generalization, ICML2022.

Comment

Dear Reviewer sVwJ,

We sincerely appreciate the effort and time you have devoted to providing constructive reviews, as well as your positive evaluation of our submission. As the deadline for discussion and paper revision is approaching, we would like to offer a brief summary of our responses and updates:

  • Clarification on the novelty and contributions, and expanded comparisons with more state-of-the-art methods.
  • Clarification and discussion on the mechanism of the optimization function.
  • Discussion on the differences between DISAM and multi-objective optimization.

Would you mind checking our responses and confirming if you have any additional questions? We welcome any further comments and discussions!

Best Regards,

The authors of Submission 1547

Comment

Dear Reviewer sVwJ,

The authors greatly appreciate your time and effort in reviewing this submission, and eagerly await your response. We understand you might be quite busy. However, the discussion deadline is approaching, and we have only a few hours left.

We have provided detailed responses to every one of your concerns/questions. Please help us to review our responses once again and kindly let us know whether they fully or partially address your concerns and if our explanations are in the right direction.

Best Regards,

The authors of Submission 1547

Comment

Most concerns were resolved by the authors' rebuttal. I really appreciate the authors' efforts. However, I have further questions about Q3. To validate the efficacy of DISAM over SAM in the incremental application of existing domain generalization methods, the authors should also conduct experiments on

  • V-REX+SAM vs V-REX+DISAM
  • V-REX+SAGM vs V-REX+DISAM

and

  • Fishr+SAM vs Fishr+DISAM
  • Fishr+SAGM vs Fishr+DISAM

If these experiments (a simpler evaluation is okay) also show statistically significant results, I am willing to increase the score from 5 to 6.

Comment

Thanks for your valuable question! Considering the time-consuming nature of experiments on DomainNet, we have conducted additional experiments on the remaining four datasets in DomainBed (PACS, VLCS, OfficeHome and TerraInc).

| Method | PACS | VLCS | OfficeHome | TerraInc | Avg. |
| --- | --- | --- | --- | --- | --- |
| ERM | 85.5 | 77.3 | 66.5 | 46.1 | 68.9 |
| SAM | 85.8 | 79.4 | 69.6 | 43.3 | 69.6 |
| DISAM | 87.3 | 80.1 | 70.7 | 47.9 | 71.5 |
| SAGM | 86.6 | 80.0 | 70.1 | 48.8 | 71.4 |
| SAGM+DISAM | 87.5 | 80.7 | 71.0 | 50.0 | 72.3 |
| V-REx [1] | 84.9 | 78.3 | 66.4 | 46.4 | 69.0 |
| V-REx+SAM | 86.0 | 77.9 | 68.0 | 45.1 | 69.3 |
| V-REx+DISAM | 85.8 | 78.4 | 70.5 | 45.9 | 70.2 |
| V-REx+SAGM | 86.1 | 78.4 | 69.6 | 45.4 | 69.9 |
| Fishr [2] | 86.9 | 78.2 | 68.2 | 53.6 | 71.7 |
| Fishr+SAM | 87.0 | 78.7 | 69.0 | 47.1 | 70.5 |
| Fishr+DISAM | 87.5 | 79.2 | 70.7 | 54.8 | 73.1 |
| Fishr+SAGM | 87.0 | 79.3 | 70.6 | 48.5 | 71.4 |

As can be seen from the table above:

  • DISAM achieves consistent performance improvements on top of V-REx/Fishr+SAM and V-REx/Fishr+SAGM.
  • SAM performs poorly on the TerraInc dataset, leading to a decrease in generalization on top of V-REx and Fishr. SAGM offers a slight improvement, but its enhancement is not as significant as that achieved by DISAM.

We speculate that:

  • SAM exhibits poor performance on the TerraInc dataset due to significant domain shift and convergence inconsistency at the domain level. SAGM partially mitigates this inconsistency issue by constraining gradient directions. However, DISAM directly addresses domain-level convergence inconsistency, leading to a more substantial performance boost.
  • It is important to note that DISAM and SAGM improve the first and second steps of SAM separately, allowing for their combination. In Table 1 of the main text (page 7), we present the notable gains in generalization achieved by combining DISAM and SAGM. To better present the performance of DISAM, we are currently conducting experiments with V-REx+SAGM+DISAM and Fishr+SAGM+DISAM. We will update the results as soon as possible.

References:

[1]. Out-of-distribution generalization via risk extrapolation (rex), ICML2021.

[2]. Fishr: Invariant gradient variances for out-of-distribution generalization, ICML2022.

Comment

Thank you again for the constructive comments. We now complement the comparison with more experiments covering different combinations. Please refer to the following table for the results. Note that, as DomainNet (600,000 images) is too time-consuming for us to finish training within the remaining time of this phase, we provide the results without DomainNet (we are still running on this dataset, and the results can be provided in the final version).

| Method | PACS | VLCS | OfficeHome | TerraInc | Avg. |
| --- | --- | --- | --- | --- | --- |
| ERM | 85.5 | 77.3 | 66.5 | 46.1 | 68.9 |
| SAM | 85.8 | 79.4 | 69.6 | 43.3 | 69.6 |
| DISAM | 87.3 | 80.1 | 70.7 | 47.9 | 71.5 |
| SAGM | 86.6 | 80.0 | 70.1 | 48.8 | 71.4 |
| SAGM+DISAM | 87.5 | 80.7 | 71.0 | 50.0 | 72.3 |
| V-REx [1] | 84.9 | 78.3 | 66.4 | 46.4 | 69.0 |
| V-REx+SAM | 86.0 | 77.9 | 68.0 | 45.1 | 69.3 |
| V-REx+DISAM | 85.8 | 78.4 | 70.5 | 45.9 | 70.2 |
| V-REx+SAGM | 86.1 | 78.4 | 69.6 | 45.4 | 69.9 |
| V-REx+SAGM+DISAM | 86.5 | 79.2 | 71.0 | 46.6 | 70.8 |
| Fishr [2] | 86.9 | 78.2 | 68.2 | 53.6 | 71.7 |
| Fishr+SAM | 87.0 | 78.7 | 69.0 | 47.1 | 70.5 |
| Fishr+DISAM | 87.5 | 79.2 | 70.7 | 54.8 | 73.1 |
| Fishr+SAGM | 87.0 | 79.3 | 70.6 | 48.5 | 71.4 |
| Fishr+SAGM+DISAM | 87.8 | 80.1 | 71.2 | 55.3 | 73.6 |

Two conclusions can be made:

  • Under the backbone of V-REx, SAM, DISAM and SAGM improve the performance of vanilla V-REx by 0.3, 1.2 and 0.9 respectively, and DISAM performs the best. Besides, DISAM and SAGM both outperform vanilla SAM from different design perspectives in optimization, and DISAM's perspective appears to be more effective. Their combination (i.e., V-REx+SAGM+DISAM) achieves the best performance compared to either alone, which demonstrates their orthogonality and composability.
  • Under the backbone of Fishr, SAM even has a negative impact on vanilla Fishr (70.5 vs. 71.7), and SAGM cannot remedy the negative impact induced by SAM, yielding overall lower performance than vanilla Fishr (71.4 vs. 71.7). DISAM (73.1) significantly outperforms SAM and SAGM, and exhibits similar orthogonality and composability with SAGM.

Although DISAM shows consistent superiority over SAM and SAGM under the backbones of V-REx and Fishr, we would like to summarize their differences as follows for clarity and proper claims.

  • Vanilla SAM suffers from impairment under multi-domain discrepancy, while DISAM and SAGM can both alleviate this issue, as to some extent their design improvements inherently consider the potential bias in the two-stage SAM optimization procedure.
  • Differently, SAGM modifies the second stage of SAM: it considers the direction difference between the perturbed gradient from SAM and the original gradient from SGD during gradient updates, narrowing the angle between these two gradient directions. In contrast, DISAM focuses on the first stage of SAM, namely the perturbation direction generation process, enhancing domain-level convergence consistency to achieve better sharpness estimation and, consequently, better generalization. DISAM intervenes in the domain shift problem more directly, while SAGM has only an implicit effect.

Overall, we do not intend to criticize SAGM but to point out the difference in the scenario of our study. We hope the experiments and analysis can address the reviewer's remaining concerns. Any further comments and advice are welcome.

Best,

The authors of submission 1547.

Comment

Thanks for all reviewers' efforts in reviewing our paper. To avoid concerns about reproducibility and the detailed setups of our experiments, we open our source code in this anonymous repository: https://anonymous.4open.science/r/DISAM-BF40.

Comment

Summary

We thank the reviewers for their valuable feedback, and appreciate the great efforts made by all reviewers, ACs, SACs and PCs. We appreciate that the reviewers have multiple positive impressions of our work, including: (1) the focused problem is a common and significant challenge (mma9, d1EH); (2) a novel and reasonable method (mma9, BeTH, d1EH); (3) comprehensive experiments (mma9) with good results (BeTH); and (4) practicality with little additional computational cost (d1EH).

We provide a summary of our updates, and for detailed responses, please refer to the feedback of each comment/question point-by-point.

  • We meticulously enhance the motivation of our study and include additional comparisons and discussions with recent works in the Introduction (refer to Section 1) and the Related Work (see Appendix A). Furthermore, we provide an in-depth analysis of Figure 1(b) in Appendix C.5.
  • We enrich the analysis and explanations of the DISAM algorithm, providing a detailed explanation of the impact of $\rho$ on both generalization and convergence from theoretical and experimental perspectives (see Appendix B.3 and C.6). We also improve the notation of the equations in the Method (refer to Section 3).
  • We conduct extensive experimental evaluations against V-REx [1] and other new state-of-the-art methods (DAC-SC [2] and Fishr [3]) to compare and integrate them with DISAM. These analyses are detailed in our responses to the reviewers' comments. Moreover, we have expanded the comparison results for the convergence speed experiments (see Appendix C.7).

The above updates in the revised draft (including the regular pages and the Appendix) are highlighted in blue color.

We once again express our gratitude to all reviewers for their time and effort devoted to evaluating our work. We eagerly anticipate your further responses and are hopeful for a favorable consideration of our revised manuscript.

Reference

[1]. Out-of-distribution generalization via risk extrapolation (rex), ICML2021.

[2]. Decompose, Adjust, Compose: Effective Normalization by Playing with Frequency for Domain Generalization, CVPR2023.

[3]. Fishr: Invariant gradient variances for out-of-distribution generalization, ICML2022.

AC Meta-Review

This paper proposes an improvement to Sharpness-Aware Minimization (SAM), which is a training method aiming to find flat minima and hence improve domain generalization. It is observed that, in the case of multiple source domains, the convergence of SAM in different domains might not be synchronized, which can bias sharpness estimation toward some domains (i.e., those with high losses). Named Domain-Inspired Sharpness Aware Minimization (DISAM), the proposed method addresses the issue by imposing a variance minimization constraint on domain losses. The reviewers generally find the problem interesting, the method novel and reasonable, and the experiments comprehensive and the result good. There is a slight negative score with low confidence (3) and three positive scores with high confidence (4 or 5).

Why not a higher score

See above

Why not a lower score

See above

Final Decision

Accept (poster)