PaperHub
Rating: 4.0/10 · Poster · 3 reviewers
Reviewer scores: 3, 1, 3 (min 1, max 3, std 0.9)
ICML 2025

Distributionally Robust Policy Learning under Concept Drifts

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We propose a minimax optimal offline policy learning algorithm that is robust under concept drifts.

Abstract

Distributionally robust policy learning aims to find a policy that performs well under the worst-case distributional shift, and yet most existing methods for robust policy learning consider the worst-case *joint* distribution of the covariate and the outcome. The joint-modeling strategy can be unnecessarily conservative when we have more information on the source of distributional shifts. This paper studies a more nuanced problem --- robust policy learning under the *concept drift*, when only the conditional relationship between the outcome and the covariate changes. To this end, we first provide a doubly-robust estimator for evaluating the worst-case average reward of a given policy under a set of perturbed conditional distributions. We show that the policy value estimator enjoys asymptotic normality even if the nuisance parameters are estimated with a slower-than-root-$n$ rate. We then propose a learning algorithm that outputs the policy maximizing the estimated policy value within a given policy class $\Pi$, and show that the sub-optimality gap of the proposed algorithm is of the order $\kappa(\Pi)n^{-1/2}$, where $\kappa(\Pi)$ is the entropy integral of $\Pi$ under the Hamming distance and $n$ is the sample size. A matching lower bound is provided to show the optimality of the rate. The proposed methods are implemented and evaluated in numerical studies, demonstrating substantial improvement compared with existing benchmarks.
Keywords
distributionally robust optimization, offline policy learning, concept drift, bandit learning, reinforcement learning

Reviews and Discussion

Review
Rating: 3

This paper proposes a distributionally robust method for offline bandits under concept shift, where P(Y|X) shifts. The authors propose a doubly robust estimator and DRO under the KL divergence for offline policy learning. They show asymptotic normality of the off-policy evaluation (OPE) estimator and propose a policy learning algorithm with a corresponding regret bound.

Questions for Authors

See above.

Claims and Evidence

  1. 'To be concrete, imagine that the distribution of covariates changes while that of Y | X remains invariant — in this case, the distribution shift is identifiable/estimable since the covariates are often accessible in the target environment. As a result, it is often unnecessary to account for the worst-case covariate shift rather than directly correcting for it.' This is not true when there is OOD data, especially in RL.

  2. Compared with [1], what are the other differences besides the doubly robust estimator and the fact that you consider only concept shift instead of joint shift?

  3. Could you provide the bias and variance of the proposed value estimator? Also, what is the benefit of asymptotic normality; in other words, why do we want it, and what is the advantage over an estimator that is not asymptotically normal?

[1] Si, N., Zhang, F., Zhou, Z., and Blanchet, J. Distributionally robust batch contextual bandits. Management Science, 2023.

Methods and Evaluation Criteria

Looks good to me.

Theoretical Claims

Yes. I checked the proof of the regret bound.

Experimental Design and Analysis

Yes. I checked the whole data generation process.

Supplementary Material

I only went through some of the proofs of the lemmas from the main text.

Relation to Broader Scientific Literature

The paper provides a reliable method for policy learning under concept shift, which is crucial for domain adaptation under concept drift.

Essential References Not Discussed

The following paper also discusses DRO for policy learning; perhaps you can also discuss it in the related work.

[1] Distributionally Robust Policy Gradient for Offline Contextual Bandits

Other Strengths and Weaknesses

See above.

Other Comments or Suggestions

Maybe highlight the differences between your method and [1].

Discuss the Hamming entropy integral intuitively; currently it is hard to follow.

[1] Si, N., Zhang, F., Zhou, Z., and Blanchet, J. Distributionally robust batch contextual bandits. Management Science, 2023.

Author Response

We would like to thank the reviewer for dedicating the time to review our paper and for providing insightful comments. Due to the character limit, we cannot upload the revised manuscript; however, we have edited it according to the reviewer's helpful suggestions. References can be found in our reply to Reviewer wdwV.

Claims And Evidence

1 To the best of our knowledge, context shift and concept shift are two major sources of out-of-distribution data. In our setting, we are concerned with concept-shifted data, i.e., OOD data such that the distribution of $Y|X$ in the testing data is shifted from that in the training data. In the quoted sentence, we are discussing the other case of context shift, where the distribution of $X$ in the testing data is shifted from that in the training data, with the conditional reward distribution $Y|X$ unchanged. Please let us know if we have misunderstood your question.

In our revision, we have extended our framework to incorporate context shift. One can find that the policy value in this case is $\mathcal{V}_{\delta}(\pi) = -\mathbb{E}_{P}\big[r(X)\big(\alpha^*_\pi(X)e^{-\frac{Y(\pi(X))+\eta^*_\pi(X)}{\alpha^*_\pi(X)}-1}+\eta^*_\pi(X)+\alpha^*_\pi(X)\delta\big)\big]$, where $r(x)=\frac{P'_X}{P_X}(x)$ with $P'_X$ being the shifted context distribution. Here, we make a conscious choice to estimate the context shift (as opposed to hedging against the worst-case shift) because in most practical situations, users have access to the covariates in the target environment and the context shift is identifiable and estimable; it is thus unnecessary to guard against the worst-case shift. We have added this extension to our manuscript.
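For readers less familiar with the duality behind this expression, the following is a sketch of the standard KL-DRO dual at a single context $x$ (a generic reminder of the classical result, not a restatement of the paper's Lemma 2.3):

$$
\inf_{Q:\,\mathrm{KL}(Q\,\|\,P_{Y(a)\mid X=x})\le\delta} \mathbb{E}_{Q}[Y(a)]
= \sup_{\alpha\ge 0}\Big\{-\alpha\log \mathbb{E}\big[e^{-Y(a)/\alpha}\,\big|\,X=x\big]-\alpha\delta\Big\}
= \sup_{\alpha\ge 0,\,\eta\in\mathbb{R}}\Big\{-\mathbb{E}\big[\alpha e^{-\frac{Y(a)+\eta}{\alpha}-1}\,\big|\,X=x\big]-\eta-\alpha\delta\Big\},
$$

where the first equality is the classical Donsker–Varadhan-based KL-DRO dual and the second follows by optimizing over $\eta$ (the optimum is attained at $\eta=\alpha(\log\mathbb{E}[e^{-Y(a)/\alpha}\mid X=x]-1)$). The $\eta$-form avoids a logarithm of an expectation and is therefore convenient for sample-average (ERM-style) objectives; plugging in the per-context optimizers $\alpha^*_\pi(x),\eta^*_\pi(x)$ for $a=\pi(x)$ and averaging over $X$ with weight $r(X)$ recovers the displayed value $\mathcal{V}_\delta(\pi)$.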

2 Thank you for your question. The concept-shift setting is one of the major differences: we aim to provide a better solution to DRO policy learning when the type of distribution shift is additionally known. Such a change of objective brings substantial technical challenges, which we address in this paper.

Beyond the differences in settings, [1] assumes a known behavior policy $\pi_0$ (and thus a known propensity score), while our setting allows for an unknown $\pi_0$, which adds a new challenge: slow estimation rates of the propensity score could result in a high regret bound. We note that the unknown-$\pi_0$ setting is ubiquitous in observational studies [9]. This challenge calls for regression methods for fitting $\pi_0$ and an intricate design of the empirical risk minimization (ERM) method, combined with a doubly robust construction, to compensate for the unknown $\pi_0$. Despite these challenges, we show theoretically and empirically that, if only concept shift takes place, then employing [1] is suboptimal, and our algorithm does better with this one bit of extra information. We have added this comparison to our literature review section.

In terms of theoretical analysis, we adopt the chaining technique to achieve a regret rate of $O(n^{-1/2})$, while [1] uses a quantile trick.

3 Thank you for your question. To learn a near-optimal policy that maximizes the distributionally robust value $\mathcal{V}_{\delta}(\pi)$ from a dataset, a consistent (indeed, asymptotically normal) estimator of $\mathcal{V}_{\delta}(\pi)$ is a necessary intermediate step to achieve good-quality learning. Asymptotic normality allows for inference on the policy value (e.g., constructing confidence intervals, conducting hypothesis tests). In terms of the variance of the proposed value estimator, the asymptotic variance is $\sigma_\pi^2$ as stated in Theorem 3.5. That is to say, the bias is asymptotically 0, given consistent nuisance parameter estimators. The non-asymptotic bias contributed by each nuisance parameter estimator is carefully analyzed in Appendix D.2, in the proof of Theorem 3.5.
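As an illustration of how asymptotic normality enables inference, here is a minimal sketch of a Wald-type confidence interval for the robust policy value; the names `v_hat` (the point estimate) and `sigma_hat` (an estimate of $\sigma_\pi$) are hypothetical placeholders, not quantities defined in the paper.

```python
import numpy as np
from scipy import stats

def wald_ci(v_hat: float, sigma_hat: float, n: int, level: float = 0.95):
    """Two-sided Wald interval v_hat +/- z * sigma_hat / sqrt(n),
    valid when sqrt(n) * (v_hat - v) converges to N(0, sigma^2)."""
    z = stats.norm.ppf(0.5 + level / 2)
    half_width = z * sigma_hat / np.sqrt(n)
    return v_hat - half_width, v_hat + half_width

# Example usage with made-up numbers.
lo, hi = wald_ci(v_hat=1.23, sigma_hat=0.8, n=2000)
```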

Other Comments Or Suggestions

1 See Claims and Evidence 2

2 The Hamming entropy integral is a variant of the classical entropy integral introduced in [10], based on the Hamming distance, a well-known metric measuring the similarity between two equal-length arrays (policies in our context) whose elements are supported on discrete sets. The Hamming entropy integral is widely used in the offline learning literature [1-4] for measuring the complexity of a policy class. Details and examples can be found in [1].
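To make the Hamming distance concrete: two policies with discrete actions can be compared on a batch of contexts by the fraction of contexts on which their actions disagree. A small illustrative sketch (the threshold policies below are arbitrary stand-ins, not policies from the paper):

```python
import numpy as np

def hamming_distance(actions_1: np.ndarray, actions_2: np.ndarray) -> float:
    """Fraction of positions at which two equal-length action arrays disagree."""
    assert actions_1.shape == actions_2.shape
    return float(np.mean(actions_1 != actions_2))

# Two toy threshold policies evaluated on the same contexts.
rng = np.random.default_rng(0)
X = rng.normal(size=1000)
pi_1 = (X > 0.0).astype(int)
pi_2 = (X > 0.2).astype(int)
print(hamming_distance(pi_1, pi_2))  # fraction of contexts where the policies differ
```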

Reviewer Comment

Thanks for the effort in addressing my questions. I will raise my score to 3.

Author Comment

Thank you so much for your time and positive feedback!

Review
Rating: 1

This paper investigates distributionally robust policy learning with concept shift. While this problem has been previously studied in the literature, the current work extends it to a more general setting where the context space is not necessarily finite. To address this generalized setup, the authors propose a doubly robust estimator. The paper demonstrates that the policy value estimator exhibits asymptotic normality when the nuisance parameters are estimated with sufficient accuracy. Other key contributions include establishing upper bounds for general context spaces and providing corresponding lower bounds.

Questions for Authors

None.

Claims and Evidence

I found the problem setup with its exclusive focus on concept shift to be somewhat artificial. If there is only concept shift without uncertainty about the marginal distribution of the context $X$, it seems more natural to optimize the policy for each $X$ individually. Indeed, when $\Pi$ encompasses all possible policies (or is rectangular in the sense of having separate constraints for each context), by the interchangeability principle, the optimal policy that solves equation (1) satisfies $\pi^*(X) \in \arg\max_{p \in \Delta(A)} \inf_{Q_{Y|X} \in \mathcal{P}(P_{Y|X},\delta)} \mathbb{E}_{Y|X}[Y(p)\mid X]$, where $\Delta(A)$ is the set of randomized actions.

The restriction to specific policy families in the paper typically serves the purpose of generalization—learning a policy that performs well across the entire context space. However, the concept-shift-only setup implies no need for such generalization since $P_X$ is assumed to be known exactly. While there might be practical applications requiring certain policy forms or imposing constraints across different contexts, the paper apparently does not focus on these considerations. Joint or separate uncertainty in both $P_X$ and $P_{Y|X}$ makes more sense to me.

Furthermore, given the assumption of no uncertainty in the marginal distribution of the context $X$, one would expect generalization error bounds that are stronger than those presented in the paper, potentially avoiding the curse of dimensionality described at the end of Section 3.1. It's particularly concerning that Assumption 3.4 requires a dimension-independent convergence rate, which the results in Section 3.1 cannot achieve for high-dimensional spaces without imposing unrealistically strong smoothness constraints.

Methods and Evaluation Criteria

The numerical experiments would be substantially strengthened by including real-world datasets to demonstrate the effectiveness of the proposed method. Currently, the empirical validation appears limited to artificial synthetic data, which may not fully capture the complexities encountered in practical applications. Having real data would also help solidify the concept-shift-only setup investigated in this paper.

Theoretical Claims

The results look reasonable to me and the proofs are quite standard, although I did not check every detail and I wonder if they are optimal or too conservative given that the context distribution is assumed to be known exactly.

Experimental Design and Analysis

As mentioned above, I think the experiments are too preliminary, and it would be nicer to have real datasets supporting the assumed setup in the paper.

Supplementary Material

I went through them on a high level.

Relation to Broader Scientific Literature

In the literature, both joint and separate uncertainty in concept and covariate shift are studied. This paper focuses exclusively on concept shift. It might have important applications and implications, but unfortunately the paper, in its current form, does not articulate them well.

Essential References Not Discussed

None.

Other Strengths and Weaknesses

NA.

Other Comments or Suggestions

None.

Author Response

We would like to thank the reviewer for dedicating the time to review our paper and for providing insightful comments. Due to the character limit, we cannot upload the revised manuscript; however, we have edited it according to the reviewer's helpful suggestions. References can be found in our reply to Reviewer wdwV.

Claims and Evidence

1&2 We would like to first note that our work does not assume a known context distribution $P_X$. We only assume that no covariate shift takes place (which has been relaxed in our revision), and we aim to learn an optimal policy that is robust to any shift of $P_{Y\mid X}$ within a KL-divergence of $\delta$. The setup is similar to the setting of [2], with the latter focusing on a finite covariate space.

We also note that the per-$x$ optimization formulation proposed in your comment 1 is itself highly challenging when $X$ has continuous components and/or is high-dimensional --- indeed, it then requires evaluating $\mathbb{E}[Y \mid X=x]$ for each $x$. At a high level, this is the challenge our work addresses.

In our revision, we have extended our framework to incorporate context shift. One can find that the policy value in this case is $\mathcal{V}_{\delta}(\pi) = -\mathbb{E}_{P}\big[r(X)\big(\alpha^*_\pi(X)e^{-\frac{Y(\pi(X))+\eta^*_\pi(X)}{\alpha^*_\pi(X)}-1}+\eta^*_\pi(X)+\alpha^*_\pi(X)\delta\big)\big]$, where $r(x)=\frac{P'_X}{P_X}(x)$ with $P'_X$ being the shifted context distribution. Here, we make a conscious choice to estimate the context shift (as opposed to hedging against the worst-case shift) because in most practical situations, users have access to the covariates in the target environment and the context shift is identifiable and estimable; it is thus unnecessary to guard against the worst-case shift. We have added this extension to our manuscript.

3 As discussed in our previous reply, we do not assume knowing $P_X$, and solving the per-$x$ optimization problem is challenging with continuous and/or high-dimensional $X$ --- this is where the curse of dimensionality kicks in (think of the task of estimating a conditional mean).

We also note that Assumption 3.4 is standard in the offline learning literature [1,2,3,4,5]. An empirical sensitivity analysis of Assumption 3.4 can be found in [7], which justifies it. The results in [7] also parallel standard conditions in double machine learning, achievable by a variety of machine-learning methods [6].

Methods And Evaluation Criteria: We are now running a new set of experiments with a real-world dataset, which will be ready before the camera-ready version.

Theoretical Claims: The optimality of our regret bound is verified by our lower bound of $\Omega(n^{-1/2})$ in Theorem 4.6. In terms of results in the literature, our setting is similar to that of [2]; however, [2] only considers discrete $P_X$ with finite support, while we extend to continuous unknown $P_X$ with infinite support. We also improve the regret bound of [2]. Please see Table 1 for an overview of results in the literature and our results.

Relation To Broader Scientific Literature: Concept shifts occur in many real-world situations. For example, in advertising, customer behavior can evolve over time as the environment changes, while the population remains largely the same. In personalized product recommendation, similar population segments in developed and emerging markets may prefer different product features.

Most existing robust policy learning algorithms model joint distributional shift without distinguishing the sources. The suboptimality of these algorithms under concept shift arises because the worst-case distributions under the joint-shift model and the concept-drift model can be substantially different, so it would be a “waste” to consider joint shift under concept drift. With one extra bit of information, our work shows that we can obtain a better policy. The above discussion was already in our introduction, but we have expanded it in our revision.
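To make the contrast concrete, here is a rough schematic of the two uncertainty sets (an illustration consistent with the descriptions in this discussion, not the exact definitions used in [1] or in the paper):

$$
\mathcal{U}_{\mathrm{joint}}(\delta)=\big\{Q_{X,Y}:\ \mathrm{KL}(Q_{X,Y}\,\|\,P_{X,Y})\le\delta\big\},
\qquad
\mathcal{U}_{\mathrm{concept}}(\delta)=\big\{Q_{X,Y}:\ Q_X=P_X,\ \mathrm{KL}(Q_{Y(a)\mid X=x}\,\|\,P_{Y(a)\mid X=x})\le\delta\ \ \forall\, x,a\big\}.
$$

The concept-drift set pins down the covariate marginal, so the adversary can only perturb the conditional reward distributions; this is the "one extra bit of information" that the joint-shift model ignores.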

Reviewer Comment

Thank you for your response. I still have some concerns about the paper's setting that appear contradictory:

  1. You mention the revision assumes that there is no context shift in $X$, yet at the same time, your response aims to address continuous and/or high-dimensional $X$ scenarios. Could you clarify whether different iid training and testing samples from the same underlying distribution are viewed as a type of distribution shift in your framework?

  2. In the proposed way to handle context shift, the revision seems to require that the shifted distribution $P_{X'}$ is absolutely continuous with respect to $P_X$. This appears to be a strong assumption, especially for the continuous and/or high-dimensional $X$ emphasized in the response. Also, how would you obtain $r(X)$?

  3. The bound in Theorem 1 of [7] depends on the dimension $d$, and it only satisfies Assumption 3.4 when the sieve estimators meet specific smoothness requirements that may be too restrictive in high-dimensional settings.

I feel my main concerns haven't been adequately addressed. I will keep my score, but am open to further discussion.

Author Comment

We would like to thank the reviewer for the timely responses and questions.

1 We are not sure if there is any misunderstanding, but we see no contradiction between "no distributional shift in $P_X$" and "continuous and/or high-dimensional context $X$". This just means that the distribution $P_X$ (which can be continuous and/or high-dimensional) that generates the training contexts and the testing contexts is the same. The challenges that come from the continuity and high dimensionality of $X$ are orthogonal to context shift.

To avoid any further confusion, please allow us to reiterate our problem setting. We aim to learn a concept-shift-robust policy from a training dataset $\mathcal{D}=\{(X_i,A_i,Y_i)\}_{i=1}^n$ consisting of i.i.d. samples. The context (possibly continuous and/or high-dimensional) is drawn as $X_i\sim P_X$, the action $A_i\sim\pi_0(X_i)$ is drawn conditioned on the context $X_i$, and the outcome $Y_i\sim P_{Y(A_i)\mid X_i}$ is drawn from a distribution supported on $\mathbb{R}$, conditioned on the context $X_i$ and the action $A_i$. The optimal policy $\pi^*$ is robust to any kind of concept shift, which is to say it gains the highest expected outcome over any testing sample path $\mathcal{D}'=\{(X'_i,\pi^*(X'_i),Y'_i)\}_{i=1}^n$, where the contexts $X'_i\sim P_X$ (a different sample path from $(X_i)_{i=1}^n$ in $\mathcal{D}$, but the underlying distribution $P_X$ is the same), the actions $\pi^*(X'_i)$ are taken by the policy $\pi^*$ conditioned on the context $X'_i$, and the outcomes are drawn from a shifted distribution $Y'_i\sim P'_{Y(\pi^*(X'_i))\mid X'_i}$, such that the KL-divergence between $P_{Y(a)\mid X'_i}$ and $P'_{Y(a)\mid X'_i}$ is within $\delta$ for every action $a$ in the action set. This is the standard problem setting in the offline distributionally robust optimization literature [1-5].
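For concreteness, here is a purely illustrative (toy) generator of such a training dataset; the specific distributions and dimensions below are placeholders and are not the simulation design of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_actions = 1000, 5, 3

# Contexts X_i ~ P_X (here a toy standard normal; could be continuous/high-dimensional).
X = rng.normal(size=(n, d))

# Behavior policy pi_0: context-dependent action probabilities (toy softmax).
logits = X @ rng.normal(size=(d, n_actions))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
A = np.array([rng.choice(n_actions, p=p) for p in probs])

# Outcomes Y_i ~ P_{Y(A_i)|X_i} (toy conditional model with action-specific effects).
beta = rng.normal(size=(n_actions, d))
Y = np.sum(beta[A] * X, axis=1) + rng.normal(scale=0.5, size=n)

dataset = list(zip(X, A, Y))  # offline dataset D = {(X_i, A_i, Y_i)}
```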

Our revision includes the extension of context shift, which has the same problem setup as above, except that now $P_X$ in the training dataset is shifted to $P'_X$ in the testing dataset and $P_{Y(a)\mid X}$ does not shift. As before, the context $X$ can be continuous and/or high-dimensional.

To conclude, we study the offline concept-shift-robust learning problem, and in our revision we also add the extension to context-shift-robust learning under an estimable likelihood ratio.

2 Absolute continuity is required for all widely used $f$-divergences, including the KL divergence, the chi-squared divergence, and the total variation distance. These divergences are well studied in the offline distributionally robust optimization literature [1-5], even under continuous and/or high-dimensional $X$ [1,3-5], and as a result, absolute continuity has been assumed therein. We would like to politely point out that this is not a strong assumption considering the literature; on the contrary, it is a standard assumption used to define the distributional-shift-robust learning problem [1-5].

For learning $r(x)$, we note that by definition $r(x)=\frac{dP'_X}{dP_X}(x)$, where $\frac{dP'}{dP}$ is the Radon-Nikodym derivative. As discussed before, since the context shift is often identifiable (i.e., we have access to context samples before and after the distributional shift, which are empirical realizations of $P_X$ and $P'_X$, respectively), we can use the regression techniques in our manuscript to fit $r(x)$, similar to the estimation of the propensity score $\pi_0(x)$. We also note that we have derived double-robustness results in the presence of context shift.
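One standard way to fit such a likelihood ratio from samples of the two covariate distributions is probabilistic classification, a common technique in the covariate-shift literature; the sketch below is one possible implementation under that assumption, not necessarily the estimator used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_density_ratio(X_source: np.ndarray, X_target: np.ndarray):
    """Estimate r(x) = dP'_X / dP_X via a probabilistic classifier that
    distinguishes target-environment covariates (label 1) from source (label 0)."""
    X = np.vstack([X_source, X_target])
    z = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    clf = LogisticRegression(max_iter=1000).fit(X, z)
    prior_ratio = len(X_source) / len(X_target)

    def r(x_new: np.ndarray) -> np.ndarray:
        p = clf.predict_proba(x_new)[:, 1]
        return prior_ratio * p / (1.0 - p)  # Bayes' rule: odds times class-prior correction

    return r
```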

3 We agree that the convergence rate depends on the dimension $d$, but such difficulties induced by high dimensionality are intrinsic to estimating and/or learning with conditional mean functions in nonparametric statistics. Note that the previous work [2] only considers $X$ with finite support.

We would be glad to learn about references overcoming this issue if you could kindly point them out.

Review
Rating: 3

This paper develops a distributionally robust policy learning framework under concept drift by focusing on shifts in conditional reward distributions while assuming stable covariate distributions. It introduces a doubly robust estimator with root‑n convergence for policy evaluation and proposes an efficient policy learning algorithm with optimal regret bounds.

Questions for Authors

  1. What is the effect on the overall estimation error when nuisance parameter estimators converge slower than the required rate?

  2. Can the proposed ERM and de-biasing approach be efficiently extended to handle continuous action spaces?

Claims and Evidence

The paper supports its claims through rigorous theoretical proofs (e.g., asymptotic normality and regret bounds) and empirical studies comparing with benchmark methods. However, some claims rely on strong assumptions for nuisance parameter estimation, which might require additional discussion.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria are well-tailored to the problem of concept drift. The separation of conditional reward shifts from joint distribution shifts is well motivated, and the use of simulated experiments with cross-fitting provides a reasonable evaluation framework.

Theoretical Claims

I reviewed the proofs for asymptotic normality and the regret upper/lower bounds. They appear methodologically sound.

Experimental Design and Analysis

The experimental design is robust, featuring multiple data splits, cross-validation, and a clear comparison against an established benchmark. However, the reliance on simulated data and sensitivity to hyperparameters may limit insights into performance on real-world datasets.

Supplementary Material

I examined the proofs provided in Appendix D.1 (for strong duality), D.2 (for asymptotic normality of the policy value estimator), and D.4 (for the regret lower bound).

Relation to Broader Scientific Literature

The key contributions relate closely to recent advances in distributionally robust optimization and double machine learning. The work refines prior approaches—such as those by Si et al. (2023) and Mu et al. (2022)—by targeting concept drift specifically.

Essential References Not Discussed

While the paper cites relevant studies, it could benefit from discussing additional works on robust inference under distribution shifts, particularly recent advances in DRO and robust causal inference that address similar issues in a broader context.

Other Strengths and Weaknesses

Strengths:

  1. Provides a clear theoretical framework with rigorous proofs and optimal regret bounds.

  2. Effectively integrates doubly robust estimation with de-biasing and cross-fitting techniques.

  3. Presents a well-structured algorithm and detailed explanation of the methodological steps.

Weaknesses:

  1. Relies on strong assumptions for nuisance parameter estimation rates without extensive empirical sensitivity analysis.

  2. Some derivations, especially in the ERM formulation, lack complete justification.

  3. Limited discussion of potential drawbacks or failure cases in the simulation setup.

Other Comments or Suggestions

  1. Clarify the presentation of cross-fitting steps and the functions involved in de-biasing, as some parts could benefit from more detailed explanations.

  2. Provide additional details on hyperparameter selection in the simulation studies and discuss potential impacts of different settings.

  3. A brief discussion on how the methodology might generalize to real-world data or continuous action spaces would enhance the paper's practical insights.

Author Response

We would like to thank the reviewer for dedicating the time to review our paper and for providing insightful comments. Due to the character limit, we cannot upload the revised manuscript, but we have edited it according to the reviewer's helpful suggestions. We use the following references.

[1] Si, N., Zhang, F., Zhou, Z., and Blanchet, J. Distributionally robust batch contextual bandits. Management Science, 2023.

[2] Mu, T., Chandak, Y., Hashimoto, T. B., and Brunskill, E. Factored dro: Factored distributionally robust policies for contextual bandits. Advances in Neural Information Processing Systems, 35:8318–8331, 2022.

[3] Athey, S. and Wager, S. Policy learning with observational data. Econometrica, 89(1):133–161, 2021.

[4] Zhou, Z., Athey, S., and Wager, S. Offline multi-action policy learning: Generalization and optimization. Operations Research, 71(1):148–183, 2023.

[5] Kallus, N., Mao, X., and Uehara, M. Localized debiased machine learning: Efficient inference on quantile treatment effects and beyond. arXiv preprint arXiv:1912.12945, 2019.

[6] Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.

[7] Jin, Y., Ren, Z., and Zhou, Z. Sensitivity analysis under the f-sensitivity models: a distributional robustness perspective. arXiv preprint arXiv:2203.04373, 2022.

[8] Kallus, N. and Zhou, A. Policy evaluation and optimization with continuous treatments. International Conference on Artificial Intelligence and Statistics, PMLR, 2018.

[9] Rosenbaum, P. R. Observational Studies. Springer Series in Statistics, Springer, New York, NY, 2002.

[10] Dudley, R. M. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967.

Weaknesses

1 See Question 1

2 Thank you for your helpful comments. The ERM step follows from standard duality results in the DRO literature. To improve readability, we have added a detailed explanation of the ERM derivation in our manuscript. With an empirical dataset, it is natural to propose an ERM solution based on the loss function inspired by the strong duality result in Lemma 2.3.

3 A potential drawback of our framework (as well as of other distributionally robust optimization works) is the choice of $\delta$. The parameter $\delta$ controls the size of the uncertainty set considered and thus the degree of robustness in our model --- the larger $\delta$, the more robust the output. The empirical performance of the algorithm depends substantially on the selection of $\delta$: a small $\delta$ leads to a negligible robustification effect and the algorithm would learn an over-aggressive policy, while a large $\delta$ tends to yield more conservative results. A more detailed discussion can be found in [1]. We have incorporated the above in the revision.

Comments and Suggestions

1 In terms of the cross-fitting and de-biasing techniques, we have added a more detailed explanation to our manuscript.

2 In our simulation, we set $K=3$, which is the minimal number of splits possible, and the default spline threshold at 0.001 without fine-tuning. Under this default choice, we see that the algorithm already performs well. Increasing $K$ and decreasing the spline threshold would increase the computational complexity.
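For readers unfamiliar with cross-fitting, the following is a minimal sketch of a $K=3$-fold scheme of the kind used in double machine learning: the nuisance model is fit on two folds and evaluated on the held-out fold, so that every sample's nuisance estimate is out-of-fold. The nuisance regressor here is a generic placeholder, not the paper's exact estimator.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor

def cross_fit_predictions(X: np.ndarray, Y: np.ndarray, K: int = 3) -> np.ndarray:
    """Out-of-fold predictions of a generic nuisance regression E[Y | X]."""
    preds = np.empty_like(Y, dtype=float)
    for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(X[train_idx], Y[train_idx])         # fit on K-1 folds
        preds[test_idx] = model.predict(X[test_idx])  # evaluate on the held-out fold
    return preds
```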

3 In real-world applications, knowing the source of the distribution shift effectively shrinks the uncertainty set, thereby yielding less conservative results (compared with the joint-modeling approach). Moreover, since in most cases practitioners have access to the covariates in the target environment, it is possible to identify and estimate covariate shifts: when the decision maker observes little or no covariate shift and would like to hedge against the risk of concept drift, it is suitable to apply our method, which outperforms existing methods designed for learning under joint distributional shifts. We are now applying our method to a real dataset, which will be ready before the camera-ready version.

See Question 2 for generalization to continuous action spaces.

Questions

1 The rate condition in Assumption 3.4 is standard in the literature [1,2,3,4,5]: it suffices to have $o_P(n^{-1/4})$ rates on all nuisance parameters, or no rate on $\hat{g}_\pi$ at all if $\pi_0$ is given. This assumption also parallels standard conditions in double machine learning, achievable by a variety of machine-learning methods [6]. An empirical sensitivity analysis that justifies Assumption 3.4 can be found in [7]. We have added this discussion to our manuscript.

2 We agree with the reviewer that it might be possible to extend our ERM approach to continuous action spaces; however, the problem setting would deviate from the current discrete case, as discussed in [8]. We leave this for future work.

Reviewer Comment

Thank you for your response. My questions have been addressed.

Author Comment

Thank you so much for your time and positive feedback!

Final Decision

This paper presents a study on distributionally robust policy learning, specifically focusing on concept drift. The proposed methods and theoretical analyses are reasonable and the implementation appears sound.

However, the core ideas presented in this work, such as the separate treatment of covariate and concept shifts via DRO, the use of doubly robust techniques for evaluation and learning, and the derivation of learning guarantees, are not entirely novel and have been well explored in existing literature. While the paper presents extensions in several aspects, the overall contribution is perceived as somewhat marginal in light of the existing body of research. Therefore, I recommend a weak acceptance of this paper.