Adaptive Invariant Representation Learning for Non-stationary Domain Generalization
This paper presents (1) theoretical bounds on accuracy in non-stationary domain generalization, and (2) a proposed model that achieves good accuracy on unseen domains generated from a non-stationary environment.
Abstract
Reviews and Discussion
This paper addresses the continuous domain generalization problem (also called the non-stationary DG setting in the paper) and proposes an adaptive invariant representation learning algorithm to solve it. The model uses a Transformer layer and an LSTM model to capture the evolving patterns between time steps. The experimental results validate the proposed method. However, the non-stationary DG setting is not new and some relevant literature is missing from this paper, and the theoretical analysis seems trivial; together, these issues greatly lower the technical contribution of this work.
Strengths
After carefully reading this paper, I think the strengths mainly lie in the design of the model.
- The authors propose a novel model architecture to handle the continuous DG problem, which contains a transformer layer and an LSTM model to capture the time-dependent patterns.
- The experimental results, which involve many datasets and baselines, convincingly validate the proposed method.
Weaknesses
I have several concerns regarding the theoretical analysis, the model design, and the experimental results.
- The theoretical analysis is trivial and there is little novelty in it.
- Firstly, Theorem 4.5 is a conventional generalization bound via Rademacher complexity, and the constant term is the upper bound of the loss function, which means the bound is quite loose and not very meaningful for guiding the design of methods. That is, if the upper bound is too loose, one could add any term to it that is consistent with the method. Also, the meaning of the constant in the bound is not explained in the paper.
- Secondly, Proposition 4.6 is also trivial, and the reweighting method it inspires has little relationship with invariance.
- The model design:
- Firstly, why would the alignment of distributions lead to invariance? The authors did not formally define the invariance property, and it seems vague.
- Secondly, the designed model seems to carry a heavy computational burden; could the authors analyze it or provide some empirical analysis/results on this (for example, extra running time)?
- The experiments: the authors did not report the variance across different runs. Further, the validation protocol is not reported.
Questions
Please refer to Weaknesses.
Ethics Concerns Details
N/A.
Thanks for your review. We would like to address your concerns as follows.
Q1. Regarding Theorem 4.5.
We clarify that the usage of Rademacher complexity is not our main contribution. Rademacher complexity is a conventional way to relate the learning bound to the finite-data setting (in practice we only have access to finite data). It has been used to construct learning bounds for many domain adaptation/generalization settings [1,2,3]. The key contribution of Theorem 4.5 is to construct a learning bound based on the non-stationary mechanism, which provides the guarantee for the proposed algorithm.
We also clarify that the assumption of a bounded loss function is common in the domain adaptation/generalization literature [1,4,5]. This assumption is mild and can be easily satisfied by many loss functions. For example, the cross-entropy loss can be bounded by clipping the softmax output away from zero, i.e., lower-bounding each predicted probability by a small positive constant.
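As a concrete illustration of this clipping trick, here is a minimal sketch; the threshold `eps` is a placeholder we chose for illustration, not the exact constant in the paper:

```python
import math

def bounded_cross_entropy(probs, label, eps=1e-3):
    # Lower-bound the predicted probability by eps before taking the log,
    # so the loss can never exceed -log(eps), however wrong the model is.
    p = max(probs[label], eps)
    return -math.log(p)

# Even a maximally wrong prediction yields a finite, capped loss of -log(eps).
worst_case = bounded_cross_entropy([1.0, 0.0], label=1)
assert worst_case <= -math.log(1e-3) + 1e-9
```

With this modification the loss is bounded by `-log(eps)`, which is the bounded-loss condition the theory assumes.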
The constant in the bound depends on the statistical distance used to measure the discrepancy between two distributions; it takes different values when the distance is the KL divergence and when it is the JS divergence. The details are already given in the proof of Lemma A.1 in Appendix A. Based on your suggestion, we have revised Theorem 4.5 to include this information.
Q2. Regarding Proposition 4.6 and reweighting method.
Could you provide more details for your argument that "Proposition 4.6 is also trivial"? We note that this proposition is not trivial: its goal is to connect Theorem 4.5 (which suggests finding the mapping such that the distance between the joint distributions of two domains is minimized) to our two-stage algorithm shown in Remark 4.7 (which minimizes (1) the distance between label distributions via importance weighting and (2) the distance between label-conditional feature distributions via invariant representation learning).
We also clarify that the importance weighting method is used to minimize the distance between the label distributions of two domains, rather than to achieve label-conditional invariant representation learning, which is equivalent to minimizing the distance between the label-conditional distributions.
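A minimal sketch of this first stage, assuming label samples are available from both domains and using empirical class frequencies (the actual estimator in the paper may differ):

```python
from collections import Counter

def label_importance_weights(labels_src, labels_tgt):
    """Per-class weights w(y) = P_tgt(y) / P_src(y).

    Reweighting source examples by w(y) matches the source label
    distribution to the other domain's, so that only the
    label-conditional feature shift remains for the invariant
    representation learning stage."""
    p_src = {y: c / len(labels_src) for y, c in Counter(labels_src).items()}
    p_tgt = {y: c / len(labels_tgt) for y, c in Counter(labels_tgt).items()}
    return {y: p_tgt.get(y, 0.0) / p for y, p in p_src.items()}

w = label_importance_weights([0, 0, 0, 1], [0, 1, 1, 1])
# class 0: (1/4)/(3/4) = 1/3 ; class 1: (3/4)/(1/4) = 3
```

After reweighting, the effective class proportions of the two domains agree, isolating the conditional-shift term that stage two addresses.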
Q3. Regarding invariant property.
As suggested by Proposition 4.6, after conducting importance weighting, we need to find a sequence of mappings in the input space that minimizes the distance between the label-conditional distributions of two consecutive source domains. However, finding such mappings directly in the input space may be non-trivial due to the complexity of the input space. We alleviate this issue by instead finding a sequence of mappings in the representation space, so that each input-space mapping can be replaced by a pair of representation mappings for the two consecutive domains. In essence, we find representation mappings such that the distance between the label-conditional distributions of the representations of two consecutive source domains is minimized. This objective is equivalent to learning a label-conditional invariant representation between two consecutive source domains (i.e., the representation distribution conditioned on the label is identical across the two domains).
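To make the objective concrete, here is a toy stand-in for the alignment term, assuming per-class mean matching as the distribution distance; the paper's actual objective may use a stronger (e.g., adversarial) metric:

```python
def conditional_alignment_penalty(feat_a, y_a, feat_b, y_b):
    """Toy alignment penalty between two consecutive domains: the sum of
    squared distances between per-class feature means. A penalty of zero
    means the (mean-level) label-conditional representations coincide."""
    def class_mean(feats, labels, y):
        rows = [f for f, l in zip(feats, labels) if l == y]
        return [sum(col) / len(rows) for col in zip(*rows)]

    penalty = 0.0
    for y in set(y_a) & set(y_b):  # classes present in both domains
        mu_a = class_mean(feat_a, y_a, y)
        mu_b = class_mean(feat_b, y_b, y)
        penalty += sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    return penalty
```

Minimizing such a penalty over the representation mappings of each consecutive domain pair yields the sequence of label-conditional invariant spaces described above.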
Q4. Regarding computation complexity.
Compared to existing works for non-stationary DG, our method (AIRL) requires fewer computational resources because of our effective design for capturing non-stationary patterns. Moreover, compared to the base model (ERM), AIRL adds only one Transformer layer and one LSTM layer, incurring only minimal computational overhead. Note that the Transformer and LSTM layers are needed only during training. In the inference stage, predictions are made by the fixed representation mapping and the classifier pre-generated by the LSTM, resulting in inference time similar to ERM's. Meanwhile, LSSAE and DRAIN have more complex architectures and objective functions, resulting in much longer training time than our method. While DPNET has slightly shorter training time than ours, it requires storing previous data to make predictions and does not generalize to multiple target domains. To further support our claim, the average training times (in seconds) of these methods on different datasets are shown in the table below.
| Method | Circle | Circle-Hard | RMNIST | Yearbook | CLEAR |
|---|---|---|---|---|---|
| AIRL | 32 | 25 | 382 | 749 | 1504 |
| LSSAE | 184 | 175 | 1727 | 1850 | 13287 |
| DRAIN | 460 | 230 | 2227 | 5538 | 1920 |
| DPNET | 18 | 13 | 208 | 448 | 1542 |
Q5. Regarding standard deviation and validation protocol.
Thanks for your comment. We have revised Table 1 to include standard deviations calculated from the results of 5 different random seeds. In our experiments, models are trained on a sequence of source domains and evaluated on target domains under two settings: Eval-S and Eval-D. In Eval-S, models are trained once on the first half of the domain sequence and then deployed to make predictions on the second half. In Eval-D, the source and target domains are not static but are updated periodically as new data/domains become available. In both settings, we split the training set into smaller subsets with an 81:9:10 ratio; these subsets are used as the training, validation, and in-distribution testing sets. The details of the datasets and evaluation settings are given in Appendix C of our manuscript.
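The Eval-S protocol described above can be sketched as follows; the 81:9:10 ratio is taken from our description, while the per-domain splitting is a simplifying assumption for illustration:

```python
def eval_s_split(domains, ratios=(81, 9, 10)):
    """Sketch of the Eval-S protocol: the first half of the domain
    sequence forms the training pool and the second half the target
    domains; each source domain is further split 81:9:10 into
    train / validation / in-distribution test."""
    half = len(domains) // 2
    pool, targets = domains[:half], domains[half:]
    train, val, test = [], [], []
    for dom in pool:
        n = len(dom)
        a = n * ratios[0] // 100          # end of the training slice
        b = a + n * ratios[1] // 100      # end of the validation slice
        train.append(dom[:a])
        val.append(dom[a:b])
        test.append(dom[b:])
    return train, val, test, targets
```

Under Eval-D the same split would be recomputed each time a new domain arrives, with the pool/target boundary sliding forward.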
References
[1] Mansour, Yishay, Mehryar Mohri, and Afshin Rostamizadeh. "Domain adaptation: Learning bounds and algorithms." arXiv preprint arXiv:0902.3430 (2009).
[2] Cortes, Corinna, Mehryar Mohri, and Andrés Muñoz Medina. "Adaptation algorithm and theory based on generalized discrepancy." Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2015.
[3] Redko, Ievgen, et al. "Optimal transport for multi-source domain adaptation under target shift." The 22nd International Conference on artificial intelligence and statistics. PMLR, 2019.
[4] Cortes, Corinna, and Mehryar Mohri. "Domain adaptation and sample bias correction theory and algorithm for regression." Theoretical Computer Science 519 (2014): 103-126.
[5] Zhang, Chao, Lei Zhang, and Jieping Ye. "Generalization bounds for domain adaptation." Advances in neural information processing systems 25 (2012).
Dear reviewer VfLk,
Thank you for your time helping review and improve the paper. We hope our response has addressed all your concerns. Since we are approaching the end of the discussion period, please let us know if you have any other questions, and we are happy to discuss more. If you’re satisfied with our response, we sincerely hope you could reconsider the rating.
Best regards,
Authors
The paper delves into the challenges associated with Domain Generalization (DG) under non-stationary environments. The authors investigate the effects of such non-stationary environments on model performance and provide theoretical upper bounds for errors when models are applied to target domains. To address the identified challenges, they introduce a new algorithm that is rooted in invariant representation learning. This algorithm uses the observed non-stationary patterns to develop a model that is expected to evolve and achieve better performance on target domains. The paper validates the effectiveness of the proposed algorithm through experiments conducted on both synthetic and real-world data.
Strengths
- The article presents an innovative approach to tackle non-stationary domain generalization, a notable hurdle in practical scenarios. It delivers an extensive analysis of the associated difficulties and meticulously evaluates the impact of environmental changes over time.
- A major academic contribution of this work is the formulation of theoretical upper limits for model error, lending a robust theoretical underpinning to their methodology.
- The effectiveness of their method is validated with both synthetic and real data, showcasing its potential real-world applications.
Weaknesses
- The evaluation section lacks significance testing and does not report standard deviations in the results table. Given the close performance across many results, the inclusion of standard deviations is crucial to affirm the method's efficacy.
- The datasets and networks employed in the study are somewhat limited in size. The paper does not address scalability, leaving questions about the method's performance with larger datasets or more complex network architectures.
- While the theoretical analysis provided is compelling, the practical implementation details in Section 5 are complex and challenging to comprehend, even with the pseudo-code. A more detailed and clearer illustration would greatly enhance understanding.
Questions
Please refer to the weaknesses mentioned above.
Thanks for your review. We would like to address your concerns as follows.
Q1. Regarding standard deviation.
Thanks for your suggestion. We have revised Table 1 to include standard deviations calculated from the results of 5 different random seeds.
Q2. Regarding scalability.
We clarify that the design of our method allows it to scale well to larger datasets and more complex network architectures. In particular, compared to the base model (ERM), our method (AIRL) adds only one Transformer layer and one LSTM layer, incurring only minimal computational overhead. Note that the Transformer and LSTM layers are needed only during training. In the inference stage, predictions are made by the fixed representation mapping and the classifier pre-generated by the LSTM, resulting in inference time similar to ERM's. Compared to existing works for non-stationary domain generalization, our method requires fewer computational resources because of our effective design for capturing non-stationary patterns. Specifically, LSSAE and DRAIN have more complex architectures and objective functions, resulting in much longer training time than our method. While DPNET has slightly shorter training time than ours, it requires storing previous data to make predictions and does not generalize to multiple target domains. To further support our claim, the average training times (in seconds) of these methods on different datasets are shown in the table below.
| Method | Circle | Circle-Hard | RMNIST | Yearbook | CLEAR |
|---|---|---|---|---|---|
| AIRL | 32 | 25 | 382 | 749 | 1504 |
| LSSAE | 184 | 175 | 1727 | 1850 | 13287 |
| DRAIN | 460 | 230 | 2227 | 5538 | 1920 |
| DPNET | 18 | 13 | 208 | 448 | 1542 |
Q3. Regarding section 5 writing.
Thanks for your suggestion. Could you provide more details about what we should do to improve the clarity of Section 5? We will revise this section based on your comments in the final version.
Dear reviewer vTu8,
Thank you for your time helping review and improve the paper. We hope our response has addressed all your concerns. Since we are approaching the end of the discussion period, please let us know if you have any other questions, and we are happy to discuss more. If you’re satisfied with our response, we sincerely hope you could reconsider the rating.
Best regards,
Authors
This paper considers a non-stationary domain generalization problem. The authors first establish theoretical upper bounds for the model error on the target domain, and then leverage the non-stationary pattern to train a model using invariant representation learning. Experiments show some improved results over existing methods.
Strengths
- A more general setting regarding non-stationary DG problem.
- A novel invariant algorithm is proposed based on theoretical bounds.
- Improved empirical results across a wide range of datasets.
Weaknesses
- Description of the setting is not clear and the key assumption seems strong.
- Proposed algorithm seems to be ad-hoc and complicated.
- Presentation can be made more concise.
Questions
I have some questions regarding the problem setting and assumptions:
- What's the difference between this problem's setting and IRM's, where it is assumed that some invariance exists across domains? I can see that some datasets you use (like RMNIST) fall into IRM's setting. A related question is, why do you consider learning invariant representations using only two consecutive domains?
- And do you implicitly assume that the domain indexes and their order are given? If so, I think this setting is limited in this sense. Note that most benchmark methods do not utilize the order information. Please clarify.
- Major Concern: It is stated that "We note that Assumption 4.3 is mild because it is required only for the optimal mechanism M∗. This assumption implies that there exists at least one hypothesis in M under which the non-stationary patterns learned from source can generalize sufficiently well to the target (with a bounded gap)."
  - Regarding the criterion: I don't think Definition 4.1 necessarily implies a good estimate of the mechanism. That is, the mechanism can be bad for all domain pairs, so that each error term inside the difference is large while the difference itself is small. Perhaps some discussion should be added here.
  - More importantly, I would like to think that Assumption 4.3 is rather strong. Note that it is NOT equivalent to assuming that there exists a pattern in the hypothesis space; here it is a "specific" one, which minimizes the divergence of the observed datasets and could provide an almost optimal estimate of an unseen domain. If so, this would be a strong assumption on the relationship between observed domains and unseen domains. Please clarify.
- Minor: Section 5 is hard to follow. Please be more concise.
- Experiments: how many domains are used for training? And what if the domain index is re-ordered?
Overall, I think this paper has some interesting and useful contributions. I am happy to increase my evaluation if authors can address my concerns.
Thanks for your review. We would like to address your concerns as follows.
Q1. Regarding problem setting and invariant learning.
We consider DG in a non-stationary environment where domains evolve along a specific direction (e.g., time, space). This setting is different from the conventional DG setting considered by existing methods (including IRM). In essence, conventional DG assumes domains are sampled from a stationary environment and target domains lie on or near the mixture of source domains. In addition, the evaluation protocol also differs between non-stationary and conventional DG. Models designed for conventional DG follow "leave-one-out" evaluation (i.e., one domain is the target while the remaining domains are sources) because the goal is to generalize to arbitrary domains near the mixture of source domains. Meanwhile, in non-stationary DG, models are trained on previous domains and evaluated on future domains in the domain sequence. We also note that the dataset (colored MNIST) used in the IRM paper is different from the one (rotated MNIST) used in our work.
Under the assumption that target domains lie on or near the mixture of source domains, existing works that enforce invariant representations across all source domains can help the model generalize to target domains [1,2,3]. However, this assumption may not hold in non-stationary DG, where target domains may be far from the mixture of source domains, resulting in the failure of existing methods. In non-stationary DG, the key is to capture non-stationary patterns from the sequence of source domains, which can help us estimate the target domains. Our proposed method captures these patterns in the representation space by learning a sequence of invariant representation spaces, each of which corresponds to a pair of consecutive source domains. Note that this is different from existing works that learn a single invariant representation space for all source domains.
Q2. Regarding the usage of domain index and order.
We clarify that both our method and all currently existing baselines for non-stationary DG rely on domain order information. This information is necessary to capture non-stationary patterns over the source domains, thereby playing a pivotal role in enabling models to generalize to target domains in the context of non-stationary DG. It is noteworthy that in numerous applications in computer vision [4], natural language processing [5], and healthcare [6], this information emerges naturally and requires no additional collection effort.
Q3. Regarding Definition 4.1.
We clarify that a small value of the "generalizability" term defined in Definition 4.1 does not imply the mechanism is good, because its errors on both source and target domains may be large while their difference is small. Instead, the term measures how consistent the mechanism's performance is in capturing non-stationary patterns across source and target domains. To avoid ambiguity, we have revised Definition 4.1 by replacing the term "generalizability" with "non-stationary consistency".
We also note that rather than working with an arbitrary mechanism, we focus on the optimal mechanism in our analysis (Theorem 4.5). In practice, we can achieve a small error for this mechanism on the source domains during training. Then, under Assumption 4.3, the optimal mechanism is guaranteed to achieve a small error on the target domain.
Q4. Regarding Assumption 4.3.
We clarify that Assumption 4.3 about the optimal mechanism is reasonable. If Assumption 4.3 does not hold, it means that the non-stationary patterns underlying the target domain are very different from the patterns estimated from the source domains, making the task of generalizing to the target domain infeasible. It is noteworthy that we do not claim Assumption 4.3 is equivalent to "assuming that there exists a pattern in the hypothesis space", because it applies only to the optimal mechanism. We only mentioned that Assumption 4.3 implies that assumption. We have revised our writing to avoid this ambiguity.
Q5. Regarding section 5 writing.
Thanks for your suggestion. Could you provide more details about what we should do to improve the clarity of Section 5? We will revise this section based on your comments in the final version.
Thanks for clarifications.
-
Regarding Q1: To clarify, Rotated MNIST is not the one used in the IRM paper but falls into the setting of IRM. And in combination with Q2, this seems to indicate that for some settings the proposed method does not work as well as stationary-DG methods. Of course, a method does not need to beat all other methods; nevertheless, an explicit discussion shall be appreciated.
- Regarding Q4 on Assumption 4.3: sorry for the previous misunderstanding. Yet I still have a different opinion on this assumption, because something is assumed to be "stationary" too (here, the transition), and it can be obtained by directly minimizing the ERM loss on the observed datasets. As such, it would be a strong assumption in terms of "generalization". Or can you show that it is indeed a necessary condition?
- Regarding Section 5's writing: one suggestion from my side is to move some implementation details into the appendix and also simplify the notation.
Based on the current response, I would like to keep my evaluation as my major concern remains.
Thanks for your quick response. We would like to address your remaining concerns as follows.
Q7. Regarding comparison with IRM.
We would like to emphasize the distinctive nature of the setting in our work (non-stationary DG) as opposed to the setting in the IRM paper (conventional DG). Domains in conventional DG are generated from a stationary environment characterized by the absence of any order among the domains. The objective is to train a model capable of generalizing to target domains near the mixture of source domains. In contrast, domains in non-stationary DG evolve along a specific direction (i.e., there exists a mechanism generating domains in a sequential manner). The goal is to train a model on previous domains such that it can generalize to future domains in the domain sequence. While the rotated MNIST dataset can be employed to benchmark IRM, given that both settings fall under the domain generalization umbrella, this does not imply the suitability of IRM for non-stationary DG. As illustrated in Table 1, our method consistently outperforms IRM across all non-stationary datasets.
Due to the inherent distinctions between these two settings, our method, tailored explicitly for non-stationary DG, is not optimized for performance in conventional DG. Similarly, existing methods crafted for conventional DG are not adept at achieving optimal results in non-stationary DG. It is crucial to emphasize that our objective is not to devise a method that achieves good performance in both settings. Instead, we aim to demonstrate that non-stationary DG possesses unique characteristics distinct from conventional DG. This realization serves as motivation for the development of a novel method specifically tailored to effectively address the challenges posed by this distinctive setting.
Q8. Regarding Assumption 4.3 (transition between domains).
We wish to emphasize that the transition between domains in our setting is not required to be stationary. Specifically, in non-stationary DG, a mechanism generates a sequence of mappings, where each mapping in the sequence maps one domain to the next. It is noteworthy that in this setting we do not assume consecutive mappings are identical. Therefore, the mapping from the last source domain to the target domain is distinct from the previous mappings. Moreover, the consistency term evaluates the consistency of the hypothesis's error (rather than the mapping) between source and target domains. We therefore clarify that Assumption 4.3 about the optimal mechanism is not strong.
Regarding your question about the mapping/transition between domains, we have already conducted experiments on the synthetic datasets Circle and Circle-Hard to explore it (detailed descriptions of these two datasets are given in Appendix C1). In the Circle dataset, the distance between two consecutive domains can be seen as fixed, whereas in the Circle-Hard dataset, the distance between two consecutive domains increases proportionally with the domain index. As shown in Table 1, our method consistently outperforms the baselines on both datasets, confirming its effectiveness in non-stationary DG regardless of whether the transition between domains is stationary.
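The difference between the two regimes can be illustrated with a small sketch of the domain positions; the step size here is a placeholder we chose for illustration, not the exact value used in the datasets:

```python
import math

def domain_angles(num_domains, hard=False, step=math.pi / 30):
    """Illustrative angular positions of domains on the circle.

    Circle: a fixed gap between consecutive domains (stationary transition).
    Circle-Hard: the gap grows with the domain index, so the domain
    transition is itself non-stationary."""
    angles, theta = [], 0.0
    for t in range(num_domains):
        angles.append(theta)
        theta += step * (t + 1) if hard else step
    return angles
```

With `hard=False` the consecutive gaps are all equal to `step`; with `hard=True` they grow as `step, 2*step, 3*step, ...`, mimicking the proportionally increasing inter-domain distance in Circle-Hard.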
Q6. Regarding source domain training.
We experiment with several datasets with varied numbers of domains. In our experiments, models are trained on a sequence of source domains and evaluated on target domains under two settings: Eval-S and Eval-D. In Eval-S, models are trained once on the first half of the domain sequence and then deployed to make predictions on the second half. In Eval-D, the source and target domains are not static but are updated periodically as new data/domains become available. The details of the datasets and evaluation settings are given in Appendix C of our manuscript.
In non-stationary DG, the key is to capture non-stationary patterns from the sequence of source domains, which helps models generalize to the target domains. When the order of the source domains is shuffled, the non-stationary patterns learned from the source domains cannot generalize to the target domain (Assumption 4.3 does not hold), resulting in a performance drop for methods designed for non-stationary DG.
References
[1] Albuquerque, Isabela, et al. "Generalizing to unseen domains via distribution matching."
[2] Sicilia, Anthony, Xingchen Zhao, and Seong Jae Hwang. "Domain adversarial neural networks for domain generalization: When it works and how to improve."
[3] Phung, Trung, et al. "On learning domain-invariant representations for transfer learning with multiple sources."
[4] Lin, Zhiqiu, et al. "The clear benchmark: Continual learning on real-world imagery." Thirty-fifth conference on neural information processing systems datasets and benchmarks track (round 2). 2021.
[5] Clement, Colin B., et al. "On the use of arxiv as a dataset." arXiv preprint arXiv:1905.00075 (2019).
[6] Johnson, Alistair, et al. "Mimic-iv." PhysioNet. Available online at: https://physionet. org/content/mimiciv/1.0/(accessed August 23, 2021) (2020).
Dear reviewer 6RJQ,
Thank you for your time helping review and improve the paper. We hope our response has addressed all your concerns. Since we are approaching the end of the discussion period, please let us know if you have any other questions, and we are happy to discuss more. If you’re satisfied with our response, we sincerely hope you could reconsider the rating.
Best regards,
Authors
Many machine learning algorithms rely on the assumption that training and test data are drawn i.i.d. from the same distribution. However, this is commonly violated in real-world cases, as the data distribution may shift between train and test times. This has encouraged many researchers to develop techniques such as domain generalization and domain adaptation. However, such methods cannot accommodate the case where the data distribution changes over time according to some mechanism. To tackle this problem, this paper studies domain generalization in a non-stationary environment, which aims to learn a model from a sequence of source domains that can capture the non-stationary patterns and generalize to unseen target domains. The authors examined the impacts of non-stationary distribution shifts and investigated how the model learned on the source domains performs on the target domains. Experiments were done on simulated, semi-synthetic, and real-world datasets.
Strengths
- The problem this paper addresses is interesting. Multiple domains evolving over time is a realistic scenario but has hardly been studied before. Also, the authors demonstrated that existing typical multiple-source DG/DA methods fall short in this setting.
- The problem setup is more flexible than those of the related works as it allows the modeling of non-stationary dynamics and can be applied to multiple unseen target domains.
- Well-written and easy to follow.
Weaknesses
- The definition of the generalizability term is a bit ambiguous. In Definition 4.1 and Remark 4.2, it consists of the difference between the average error of the mechanism on the source domains and its error on the target domain. It doesn't consider whether each error is low enough, but only their difference. Thus, by this definition, a mechanism is "generalizable" even if it has high errors on both source and target domains. I am unsure whether such a mechanism is truly generalizable.
- Moreover, Assumption 4.3 tells us that we can find a decent mechanism with a small error on the source domains, but it seems to say nothing about the error on the target domain. I am unsure if I understood this statement correctly, but it doesn't seem like Assumption 4.3 implies that the error of the mechanism on the target domain is small. I think this is one of the key statements of this study, so it needs to be clarified.
- The authors need to add an uncertainty metric to Table 1.
Questions
- I appreciate the theoretical analysis and experimental evidence that the authors presented. Could the authors further specify under what conditions the proposed approach will have guaranteed improvement compared to either ERM or some conventional DG method?
- I think the FMoW dataset from the WILDS benchmark aligns with the problem setup in the paper. (The authors already included sufficient experimental results, so I do not request the authors to add an additional dataset.)
- The authors may want to add some dotted lines to Table 1 to differentiate different approaches - ERMs / conventional DA/DGs, ...
Ethics Concerns Details
n/a
Thanks for your review. We would like to address your concerns as follows.
Q1. Regarding the definition of the generalizability term.
To clarify, the term is defined as the difference between two quantities: the error of the mechanism on the sequence of source domains and its error on the target domain. In other words, it measures how consistent the mechanism's performance is in capturing non-stationary patterns across source and target domains, which is what the term "generalizability" in Definition 4.1 was meant to convey. Clearly, a good mechanism makes this difference small, but a small difference does not imply the mechanism can capture the evolution on the target domain, because it may have large errors on both source and target domains. Thanks for your comment; we have revised Definition 4.1 by replacing the term "generalizability" with "non-stationary consistency" to avoid ambiguity.
Q2. Regarding Assumption 4.3.
We clarify that Assumption 4.3 implies that the error of the optimal mechanism on the target domain is small. Note that Assumption 4.3 holds only for the optimal mechanism, not for an arbitrary one. During training, we can feasibly achieve a small source error for the optimal mechanism (e.g., by using a high-capacity hypothesis class such as a neural network). This fact, combined with Assumption 4.3, implies that the error of the optimal mechanism on the target domain is small.
Q3. Regarding uncertainty metric to Table 1.
Thanks for your suggestion. We have revised Table 1 to include standard deviations calculated from the results of 5 different random seeds.
Q4. Regarding the settings in which the proposed approach outperforms conventional DG methods.
To clarify, our method is designed for the setting where data evolves along a specific direction (e.g., space, time), and by taking the non-stationary environment into consideration, it achieves better performance than existing DG methods in this setting. In essence, existing DG methods often enforce invariant representations across all domains, and it has been shown that when target domains lie on or near the mixture of source domains, enforcing invariant representations across all source domains can help the model generalize to target domains [1,2,3]. However, this assumption may not hold in non-stationary DG, where target domains may be far from the mixture of source domains, resulting in the failure of existing methods.
To further support this argument, we conducted an experiment on the rotated MNIST dataset with DANN [4], a model that learns invariant representations across all domains. Specifically, we create 5 domains by rotating images by 0, 15, 30, 45, and 60 degrees, respectively, and follow leave-one-out evaluation (i.e., one domain is the target while the remaining domains are sources). Clearly, the settings where the target domain consists of images rotated by 0 or 60 degrees can be considered non-stationary DG, while the other settings can be considered conventional DG. The performance of DANN with different target domains is shown in the following table. As we can see, the accuracy drops significantly when the target domain is images rotated by 0 or 60 degrees. This result demonstrates that conventional DG methods are not suitable for non-stationary DG.
| Target domain (degrees) | 0 | 15 | 30 | 45 | 60 |
|---|---|---|---|---|---|
| DANN accuracy (%) | 51.2 | 59.1 | 70.0 | 69.2 | 53.9 |
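The domain construction and leave-one-out protocol described above can be sketched as follows. This is a minimal NumPy/SciPy illustration with random stand-in images; `make_rotated_domains` and `leave_one_out_splits` are hypothetical helper names for exposition, not code from the paper.

```python
import numpy as np
from scipy.ndimage import rotate

ANGLES = [0, 15, 30, 45, 60]  # one domain per rotation angle

def make_rotated_domains(images, angles=ANGLES):
    """Create one domain per rotation angle (a toy stand-in for RMNIST)."""
    return {a: np.stack([rotate(img, a, reshape=False) for img in images])
            for a in angles}

def leave_one_out_splits(domains):
    """Yield (target_angle, source_domains, target_data) triples."""
    for target in domains:
        sources = {a: x for a, x in domains.items() if a != target}
        yield target, sources, domains[target]

# Usage with random stand-in images.
imgs = np.random.rand(8, 28, 28)
domains = make_rotated_domains(imgs)
for target, sources, target_data in leave_one_out_splits(domains):
    # Targets 0 and 60 require extrapolating beyond the source angles
    # (non-stationary DG); interior targets interpolate between source
    # angles (conventional DG).
    extrapolating = target in (min(ANGLES), max(ANGLES))
```

The 0-degree and 60-degree splits are the extrapolation cases in which DANN's accuracy drops in the table above.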
Q5. Regarding the FMoV dataset.
Thanks for pointing out this great resource related to our research. We will explore this dataset in future work.
Q6. Regarding Table 1 presentation.
Thanks for your suggestion. We’ve revised Table 1 to distinguish different approaches.
References
[1] Albuquerque, Isabela, et al. "Generalizing to unseen domains via distribution matching."
[2] Sicilia, Anthony, Xingchen Zhao, and Seong Jae Hwang. "Domain adversarial neural networks for domain generalization: When it works and how to improve."
[3] Phung, Trung, et al. "On learning domain-invariant representations for transfer learning with multiple sources."
[4] Ganin, Yaroslav, et al. "Domain-adversarial training of neural networks."
Dear reviewer Uk8r,
Thank you for your time in helping to review and improve the paper. We hope our response has addressed all your concerns. Since we are approaching the end of the discussion period, please let us know if you have any other questions; we are happy to discuss further. If you are satisfied with our response, we sincerely hope you will reconsider the rating.
Best regards,
Authors
Dear reviewers,
We have made point-by-point responses to your comments. We have also revised our manuscript (text in red) to reflect the changes. In this general post, we provide a summary of our responses:
- We clarified the usage of the term . We also revised Definition 4.1 regarding . To avoid ambiguity, we replaced the term “generalizability” in Definition 4.1 with “non-stationary consistency”. (Reviewers Uk8r and 6RJQ)
- We revised Assumption 4.3 to clarify that it implies the error of on the target domain is small. (Reviewers Uk8r and 6RJQ)
- We revised Table 1 to include the standard deviation calculated from 5 different random seeds. (Reviewers Uk8r and VfLk)
- We clarified the difference between non-stationary DG and conventional DG settings. We also conducted an additional experiment to explain why methods designed for conventional DG are not suitable for non-stationary DG. (Reviewer Uk8r)
- We conducted an additional experiment to show that our method is scalable in terms of computational complexity. (Reviewers vTu8 and VfLk)
- We clarified the roles of our theoretical analysis (Theorem 4.5 and Proposition 4.6) as well as our proposed algorithm (importance weighting and label-conditional invariant representation learning between every pair of consecutive source domains). (Reviewers 6RJQ and VfLk)
- We clarified the evaluation settings used in non-stationary DG. (Reviewer VfLk)
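To illustrate the summary point above about label-conditional invariant representation learning between every pair of consecutive source domains, here is a minimal sketch. The mean-difference penalty is a toy stand-in (our assumption) for the actual alignment term, and both function names are hypothetical.

```python
import numpy as np

def mean_diff_loss(f_a, f_b):
    # Toy invariance penalty: squared distance between mean features
    # (a stand-in for an adversarial or MMD-style alignment term).
    return float(np.sum((f_a.mean(axis=0) - f_b.mean(axis=0)) ** 2))

def consecutive_pair_loss(domain_feats, inv_loss=mean_diff_loss):
    # Enforce invariance only between consecutive source domains
    # D_1~D_2, D_2~D_3, ..., rather than across all domain pairs
    # as conventional DG methods do.
    return sum(inv_loss(domain_feats[t], domain_feats[t + 1])
               for t in range(len(domain_feats) - 1))
```

Aligning only consecutive pairs preserves the evolving pattern across the domain sequence instead of collapsing all domains to a single representation.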
Best Regards,
Authors
This paper studies the problem of domain-generalization in the context of non-stationary environments. The contribution includes a theoretical analysis, an algorithm based on invariant representation learning and an experimental evaluation.
On the positive side, the problem addressed by the paper and the originality/novelty of the paper have been appreciated, as well as the empirical evaluation.
On the negative side, the writing/presentation of some parts is unclear or ambiguous, some theoretical results are not convincing or are limited, some complex parts need to be clarified, and the experimental evaluation is limited in some aspects.
During the rebuttal, the authors provided multiple answers to the points raised by the reviewers.
During the discussion, the reviewers agreed that some issues were still unresolved; the notions of "mechanism" and "transition mapping" still appear fuzzy after the authors' rebuttal.
The clarity and the analysis still need to be improved.
Based on the elements mentioned above, it appears that this paper is not yet ready for ICLR.
I therefore propose rejection.
Why not a higher score
The reviewers' feedback is not that positive.
Why not a lower score
N/A
Reject