Distributionally Robust Policy Learning under Concept Drifts
We propose a minimax optimal offline policy learning algorithm that is robust under concept drifts.
Abstract
Reviews And Discussion
This paper proposes a distributionally robust method for offline bandits under concept shift, where the conditional distribution P(Y|X) shifts. The authors propose a doubly robust estimator together with DRO under the KL divergence for offline policy learning. They establish asymptotic normality of the off-policy evaluation (OPE) estimator and propose a policy learning procedure with a corresponding regret bound.
Questions For Authors
See above.
Claims And Evidence
- 'To be concrete, imagine that the distribution of covariates changes while that of Y | X remains invariant — in this case, the distribution shift is identifiable/estimable since the covariates are often accessible in the target environment. As a result, it is often unnecessary to account for the worst-case covariate shift rather than directly correcting for it.' This is not true when there is OOD data, especially in RL.
- Compared with [1], what are the differences besides the doubly robust estimator and the fact that you consider only concept shift rather than joint shift?
- Could you provide the bias and variance of the proposed value estimator? Also, what is the benefit of asymptotic normality; in other words, why do we want it, and what is the advantage over estimators that are not asymptotically normal?
[1] Si, N., Zhang, F., Zhou, Z., and Blanchet, J. Distributionally robust batch contextual bandits. Management Science, 2023.
Methods And Evaluation Criteria
Looks good to me.
Theoretical Claims
Yes. I checked the proof of the regret bound.
Experimental Design And Analyses
Yes. I checked the whole data generation process.
Supplementary Material
I only went through some of the proofs of the lemmas from the main text.
Relation To Broader Scientific Literature
The paper provides a reliable method for policy learning under concept shift, which is crucial for domain adaptation under concept drift.
Essential References Not Discussed
The following paper also discusses DRO for policy learning; it may be worth covering it in the related work.
[1] Distributionally Robust Policy Gradient for Offline Contextual Bandits
Other Strengths And Weaknesses
See above.
Other Comments Or Suggestions
Maybe highlight the differences of your method compared with [1].
Discuss the Hamming entropy integral intuitively; currently it is hard to follow.
[1] Si, N., Zhang, F., Zhou, Z., and Blanchet, J. Distributionally robust batch contextual bandits. Management Science, 2023.
We would like to thank the reviewer for dedicating the time to review our paper and for providing the insightful comments. Due to the character limit, we cannot upload the revised manuscript; however, we have edited it according to the reviewer's helpful suggestions. References can be found in our reply to Reviewer wdwV.
Claims And Evidence
1 To the best of our knowledge, context shift and concept shift are the two major sources of out-of-distribution data. In our setting, we are concerned with concept-shift data, i.e., OOD data such that the distribution of Y | X in the testing data is shifted from that in the training data. In the quoted sentence, we are discussing the other case of context shift, where the distribution of X in the testing data is shifted from that in the training data, with the conditional reward distribution unchanged. Please let us know if we have misunderstood your question.
In our revision, we have extended our framework to incorporate context shift. One can find that the policy value in this case is a reweighted version of the concept-shift robust value, where the weight is the likelihood ratio of the shifted context distribution with respect to the original one. Here, we make a conscious choice to estimate the context shift (as opposed to hedging against the worst-case shift) because in most practical situations, users have access to the covariates in the target environment, so the context shift is identifiable and estimable; it is thus unnecessary to guard against the worst-case shift. We have added this extension to our manuscript.
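For illustration, a hedged sketch of the reweighted objective described above, in notation introduced here for this reply only (the manuscript's symbols may differ): writing $w(x)$ for the likelihood ratio between the shifted and the original context distribution, $\delta$ for the KL radius, and $P_{Y \mid X, a}$ for the training conditional reward distribution,

$$
Q_{\delta}^{w}(\pi) \;=\; \mathbb{E}_{X \sim P_X}\Big[\, w(X) \sum_{a \in \mathcal{A}} \pi(a \mid X) \inf_{Q:\, D_{\mathrm{KL}}(Q \,\|\, P_{Y \mid X, a}) \le \delta} \mathbb{E}_{Y \sim Q}[Y] \Big].
$$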
2 Thank you for your question. The concept-shift setting is one of the major differences: we aim to provide a better solution to DRO policy learning when the type of distribution shift is additionally known. Such a change of objective brings substantial technical challenges, which are what we address in this paper.
Beyond the differences in settings, [1] assumes a known behavior policy (and thus a known propensity score), while our setting allows for an unknown behavior policy, which adds new challenges since slow estimation rates of the propensity score could result in a large regret bound. We note that the unknown-behavior-policy setting is ubiquitous in observational studies [9]. This challenge calls for regression methods for fitting the propensity score and an intricate design of the empirical risk minimization (ERM) method, combined with a doubly robust construction, to compensate for the unknown behavior policy. In spite of these challenges, we show theoretically and empirically that, if only concept shift takes place, then employing [1] is suboptimal, and our algorithm does better with this one extra bit of information. We have added this comparison to our literature review section.
In terms of theoretical analysis, we adopt the chaining technique to obtain our regret rate, while [1] uses a quantile trick.
3 Thank you for your question. To learn a near-optimal policy that maximizes the distributionally robust value from a dataset, a consistent (indeed asymptotically normal) estimator of the robust policy value is a necessary intermediate step for good-quality learning. Asymptotic normality allows for inference on the policy value (e.g., constructing confidence intervals and conducting hypothesis tests). In terms of the variance of the proposed value estimator, the asymptotic variance is given in Theorem 3.5. This also implies that the bias is asymptotically 0, given consistent nuisance parameter estimators. The non-asymptotic bias contributed by each nuisance parameter estimator is carefully analyzed in Appendix D.2, in the proof of Theorem 3.5.
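As a concrete illustration of the inferential benefit (a generic construction, not a claim about the paper's specific estimator): if $\sqrt{n}\,(\hat{Q} - Q) \Rightarrow \mathcal{N}(0, \sigma^2)$ and $\hat{\sigma}^2$ is a consistent variance estimate, then

$$
\hat{Q} \;\pm\; z_{1-\alpha/2}\,\frac{\hat{\sigma}}{\sqrt{n}}
$$

is an asymptotically valid $(1-\alpha)$ confidence interval for the robust policy value; an estimator without a limiting distribution would not directly support such inference.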
Other Comments Or Suggestions
1 See Claims and Evidence 2
2 The Hamming entropy integral is a variant of the classical entropy integral introduced in [10], based on the Hamming distance, a well-known metric measuring the similarity between two equal-length arrays (policies in our context) whose elements are supported on discrete sets. The Hamming entropy integral is widely used in the offline learning literature [1-4] for measuring the complexity of a policy class. Details and examples can be found in [1].
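For intuition, a hedged sketch of one common convention from the offline policy learning literature (e.g., [4]); the manuscript's exact definition may differ: for policies $\pi_1, \pi_2 \in \Pi$ evaluated on contexts $x_1, \dots, x_n$, the Hamming distance is

$$
H(\pi_1, \pi_2) \;=\; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{\pi_1(x_i) \neq \pi_2(x_i)\},
$$

and the entropy integral is $\kappa(\Pi) = \int_0^1 \sqrt{\log N_H(\epsilon^2, \Pi)}\, d\epsilon$, where $N_H(\epsilon^2, \Pi)$ is the covering number of $\Pi$ at radius $\epsilon^2$ under $H$. Intuitively, $\kappa(\Pi)$ measures how quickly the policy class can be approximated at finer and finer resolutions, and it enters the regret bound as the complexity of $\Pi$.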
Thanks for the effort in addressing my questions. I will raise my score to 3.
Thank you so much for your time and positive feedback!
This paper investigates distributionally robust policy learning with concept shift. While this problem has been previously studied in the literature, the current work extends to a more general setting where the context space is not necessarily finite. To address this generalized setup, the authors propose a doubly robust estimator. The paper demonstrates that the policy value estimator exhibits asymptotic normality when the nuisance parameters are estimated with sufficient accuracy. Besides, other key contributions include establishing upper bounds for general spaces and providing corresponding lower bounds.
Questions For Authors
None.
Claims And Evidence
I found the problem setup with its exclusive focus on concept shift to be somewhat artificial. If there is only concept shift without uncertainty about the marginal distribution of the context, it seems more natural to optimize the policy for each context individually. Indeed, when the policy class encompasses all possible policies (or is rectangular in the sense of having separate constraints for each context), by the interchangeability principle, the optimal policy that solves equation (1) can be obtained by maximizing the robust value pointwise over the set of randomized actions, as sketched below.
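(For concreteness, a hedged sketch of the pointwise formulation described above, in notation introduced here for illustration: writing $\Delta(\mathcal{A})$ for the set of randomized actions over the action set $\mathcal{A}$, $\delta$ for the KL radius, and $P_{Y \mid x, a}$ for the nominal conditional reward distribution,

$$
\pi^{\star}(x) \;\in\; \arg\max_{p \in \Delta(\mathcal{A})} \; \sum_{a \in \mathcal{A}} p(a) \inf_{Q:\, D_{\mathrm{KL}}(Q \,\|\, P_{Y \mid x, a}) \le \delta} \mathbb{E}_{Y \sim Q}[Y],
$$

i.e., when the policy class is unrestricted one could in principle optimize separately for each context $x$.)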
The restriction to specific policy families in the paper typically serves the purpose of generalization—learning a policy that performs well across the entire context space. However, the concept-shift-only setup implies no need for such generalization, since the context distribution is assumed to be known exactly. While there might be practical applications requiring certain policy forms or imposing constraints across different contexts, apparently the paper does not focus on these considerations. To me, joint or separate uncertainty in both the context distribution and the conditional reward distribution makes more sense.
Furthermore, given the assumption of no uncertainty in the marginal distribution of the context, one would expect generalization error bounds that are stronger than those presented in the paper, potentially avoiding the curse of dimensionality described at the end of Section 3.1. It is particularly concerning that Assumption 3.4 requires a dimension-independent convergence rate, which the results in Section 3.1 cannot achieve for high-dimensional spaces without imposing unrealistically strong smoothness constraints.
Methods And Evaluation Criteria
The numerical experiments would be substantially strengthened by including real-world datasets to demonstrate the effectiveness of the proposed method. Currently, the empirical validation appears limited to artificial synthetic data, which may not fully capture the complexities encountered in practical applications. Having real data would also help solidify the concept-shift-only setup investigated in this paper.
Theoretical Claims
The results look reasonable to me and the proofs are quite standard, although I did not check every detail and I wonder if they are optimal or too conservative given that the context distribution is assumed to be known exactly.
Experimental Design And Analyses
As mentioned above, I think the experiments are too preliminary, and it would be nicer to have real datasets supporting the assumed setup in the paper.
Supplementary Material
I went through them on a high level.
Relation To Broader Scientific Literature
In the literature, both joint and separate uncertainty in concept and covariate shift are studied. This paper focuses exclusively on concept shift. It might have important applications and implications, but unfortunately the paper, in its current form, does not articulate them well.
Essential References Not Discussed
None.
Other Strengths And Weaknesses
NA.
Other Comments Or Suggestions
None.
We would like to thank the reviewer for dedicating the time to review our paper and for providing the insightful comments. Due to the character limit, we cannot upload the revised manuscript; however, we have edited it according to the reviewer's helpful suggestions. References can be found in our reply to Reviewer wdwV.
Claims and Evidence
1&2 We would like to first note that our work does not assume a known context distribution. We only assume that no covariate shift takes place (which has been relaxed in our revision), and we aim to learn an optimal policy that is robust to any shift of the conditional reward distribution within a KL-divergence ball. The setup is similar to the setting of [2], with the latter focusing on a finite covariate space.
We also note that the per-context optimization formulation proposed in your comment 1 is itself highly challenging when the context has continuous components and/or is high-dimensional --- indeed, it then requires estimating the conditional reward distribution and solving a separate robust optimization problem for each context value. At a high level, this is the challenge our work addresses.
In our revision, we have extended our framework to incorporate context shift. One can find that the policy value in this case is a reweighted version of the concept-shift robust value, where the weight is the likelihood ratio of the shifted context distribution with respect to the original one. Here, we make a conscious choice to estimate the context shift (as opposed to hedging against the worst-case shift) because in most practical situations, users have access to the covariates in the target environment, so the context shift is identifiable and estimable; it is thus unnecessary to guard against the worst-case shift. We have added this extension to our manuscript.
3 As discussed in our previous reply, we do not assume knowledge of the context distribution, and solving the per-context optimization problem is challenging with continuous and/or high-dimensional contexts --- this is where the curse of dimensionality kicks in (think of the task of estimating a conditional mean).
We also note that Assumption 3.4 is standard in the offline learning literature [1,2,3,4,5]. An empirical sensitivity analysis of Assumption 3.4 can be found in [7], which justifies it. The results in [7] also parallel standard conditions in double machine learning, achievable by a variety of machine-learning methods [6].
Methods And Evaluation Criteria: We are now running a new set of experiments with a real-world dataset, which will be ready before the camera-ready version.
Theoretical Claims: The optimality of our regret bound is verified by our lower bound result in Theorem 4.6. In terms of results in the literature, our setting is similar to that of [2]; however, [2] only considers discrete contexts with finite support, while we extend to a continuous, unknown context distribution with infinite support. We also improve upon the regret bound of [2]. Please see Table 1 for an overview of the results in the literature and our results.
Relation To Broader Scientific Literature: Concept shift occurs in many real-world situations. For example, in advertising, customer behavior can evolve over time as the environment changes, while the population remains largely the same. In personalized product recommendation, similar population segments in developed and emerging markets may prefer different product features.
Most existing robust policy learning algorithms model joint distributional shift without distinguishing the sources. These algorithms are suboptimal under concept shift because the worst-case distributions under the joint-shift model and the concept-drift model can be substantially different, so it would be a “waste” to guard against joint shift when only concept drift occurs. With this one extra bit of information, our work shows that we can obtain a better policy. The above discussion was already in our introduction, but we have expanded it in our revision.
Thank you for your response. I still have some concerns about the paper's setting that appear contradictory:
- You mention that the revision assumes there is no context shift in X, yet at the same time, your response aims to address continuous and/or high-dimensional X scenarios. Could you clarify whether different iid training and testing samples from the same underlying distribution are viewed as a type of distribution shift in your framework?
- In the proposed way to handle context shift, the revision seems to require that the shifted context distribution is absolutely continuous with respect to the original one. This appears to be a strong assumption, especially for the continuous and/or high-dimensional X emphasized in the response. And how would you obtain the likelihood ratio?
- The bound in Theorem 1 of [7] depends on the dimension of the context, and it only satisfies Assumption 3.4 when the sieve estimators meet specific smoothness requirements that may be too restrictive in high-dimensional settings.
I feel my main concerns haven't been adequately addressed. I will keep my score, but am open to further discussion.
We would like to thank the reviewer for the timely responses and questions.
1 We are not sure if there is any misunderstanding, but we see no contradiction between "no distributional shift in X" and "continuous and/or high-dimensional context scenario". This just means that the distribution of X (which can be continuous and/or high-dimensional) that generates the training contexts and the testing contexts is the same. The challenges that come from the continuity and high-dimensionality of X are orthogonal to context shift.
To avoid any further confusion, please allow us to reiterate our problem setting. We aim to learn a concept-shift robust policy from a training dataset consisting of iid samples. Each sample consists of a context (which can be continuous and/or high-dimensional), an action taken by the behavior policy conditioned on the context, and an outcome sampled from a conditional distribution, supported on a fixed outcome space, given the context and the action. The optimal policy is robust to any kind of concept shift, which is to say that it attains the highest worst-case expected outcome over testing sample paths, where the testing contexts follow the same underlying distribution as the training contexts (though they form a different sample path), actions are taken by the policy conditioned on the context, and outcomes are sampled from a shifted conditional distribution whose KL divergence from the training conditional distribution is within the prescribed radius for every action in the action set. This is the standard problem setting in the offline distributionally robust optimization literature [1-5].
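(As a hedged formalization of the uncertainty set just described, in notation introduced here for this reply only; the exact form in the manuscript may differ: if $P_{Y \mid X, a}$ denotes the training conditional outcome distribution and $\widetilde{P}_{Y \mid X, a}$ the testing one, the admissible concept shifts are those satisfying

$$
D_{\mathrm{KL}}\big(\widetilde{P}_{Y \mid X, a} \,\big\|\, P_{Y \mid X, a}\big) \;\le\; \delta \quad \text{for every action } a \in \mathcal{A},
$$

conditioned on the context, where $\delta$ is the prescribed radius.)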
Our revision includes the extension to context shift, which has the same problem setup as above, except that now the context distribution in the training dataset is shifted to a different context distribution in the testing dataset (a shift handled by estimating the likelihood ratio, as described above). As before, the context can be continuous and/or high-dimensional.
To conclude, we studied the offline concept-shift robust learning problem, and in our revision we also add the extension to context-shift robust learning under an estimable likelihood ratio.
2 Absolute continuity is required for all of the widely used f-divergences, including the KL divergence, the chi-squared divergence, and the total variation distance. These divergences are well studied in the offline distributionally robust optimization literature [1-5], even for continuous and/or high-dimensional contexts [1,3-5], and as a result, absolute continuity has been assumed therein. We would like to politely point out that we do not view this as a strong assumption in light of the literature; on the contrary, it is a standard assumption used to define the distributional-shift robust learning problem [1-5].
For learning the likelihood ratio, we note that by definition it is the Radon-Nikodym derivative of the shifted context distribution with respect to the original one. As discussed before, since the context shift is often identifiable (i.e., we have access to context samples before and after the distributional shift, which are empirical realizations of the two distributions respectively), we can use the regression techniques in our manuscript to fit the likelihood ratio, similar to the estimation of the propensity score. We also note that we have derived double robustness results in the presence of context shift.
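As an illustration of the regression-based route mentioned above, here is a hedged sketch (not the paper's code; the function name is hypothetical) of estimating the context likelihood ratio by probabilistic classification, one standard way to fit a density ratio from samples of the source and target context distributions:

```python
# Hedged sketch: estimate w(x) = d(target P_X)/d(source P_X) by classifying
# source vs. target context samples, then converting predicted probabilities
# into density-ratio estimates on the source points.
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_likelihood_ratio(x_source, x_target):
    """Fit a source-vs-target classifier and return w_hat(x) on x_source."""
    X = np.vstack([x_source, x_target])
    z = np.concatenate([np.zeros(len(x_source)), np.ones(len(x_target))])
    clf = LogisticRegression(max_iter=1000).fit(X, z)
    p = clf.predict_proba(x_source)[:, 1]        # P(target | x)
    prior_ratio = len(x_source) / len(x_target)  # corrects for sample sizes
    return prior_ratio * p / (1.0 - p)

# Example with synthetic contexts: the target distribution has a shifted mean.
rng = np.random.default_rng(0)
x_src = rng.normal(0.0, 1.0, size=(500, 3))
x_tgt = rng.normal(0.5, 1.0, size=(500, 3))
w_hat = estimate_likelihood_ratio(x_src, x_tgt)
```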
3 We agree that the convergence rate depends on the dimension of the context, but such difficulties induced by high dimensionality are intrinsic to estimating and/or learning with conditional mean functions in nonparametric statistics. Note that the previous work [2] only considers contexts with a finite support.
We would like to learn about references overcoming this issue if you can kindly point them out.
This paper develops a distributionally robust policy learning framework under concept drift by focusing on shifts in conditional reward distributions while assuming stable covariate distributions. It introduces a doubly robust estimator with root‑n convergence for policy evaluation and proposes an efficient policy learning algorithm with optimal regret bounds.
Questions For Authors
- What is the effect on the overall estimation error when the nuisance parameter estimators converge more slowly than the required rate?
- Can the proposed ERM and de-biasing approach be efficiently extended to handle continuous action spaces?
Claims And Evidence
The paper supports its claims through rigorous theoretical proofs (e.g., asymptotic normality and regret bounds) and empirical studies comparing with benchmark methods. However, some claims rely on strong assumptions for nuisance parameter estimation, which might require additional discussion.
Methods And Evaluation Criteria
The proposed methods and evaluation criteria are well-tailored to the problem of concept drift. The separation of conditional reward shifts from joint distribution shifts is well motivated, and the use of simulated experiments with cross-fitting provides a reasonable evaluation framework.
Theoretical Claims
I reviewed the proofs for asymptotic normality and the regret upper/lower bounds. They appear methodologically sound.
Experimental Design And Analyses
The experimental design is robust, featuring multiple data splits, cross-validation, and a clear comparison against an established benchmark. However, the reliance on simulated data and sensitivity to hyperparameters may limit insights into performance on real-world datasets.
Supplementary Material
I examined the proofs provided in Appendix D.1 (for strong duality), D.2 (for asymptotic normality of the policy value estimator), and D.4 (for the regret lower bound).
Relation To Broader Scientific Literature
The key contributions relate closely to recent advances in distributionally robust optimization and double machine learning. The work refines prior approaches—such as those by Si et al. (2023) and Mu et al. (2022)—by targeting concept drift specifically.
Essential References Not Discussed
While the paper cites relevant studies, it could benefit from discussing additional works on robust inference under distribution shifts, particularly recent advances in DRO and robust causal inference that address similar issues in a broader context.
Other Strengths And Weaknesses
Strengths:
- Provides a clear theoretical framework with rigorous proofs and optimal regret bounds.
- Effectively integrates doubly robust estimation with de-biasing and cross-fitting techniques.
- Presents a well-structured algorithm and detailed explanation of the methodological steps.
Weaknesses:
- Relies on strong assumptions for nuisance parameter estimation rates without extensive empirical sensitivity analysis.
- Some derivations, especially in the ERM formulation, lack complete justification.
- Limited discussion of potential drawbacks or failure cases in the simulation setup.
Other Comments Or Suggestions
1. Clarify the presentation of the cross-fitting steps and the functions involved in de-biasing, as some parts could benefit from more detailed explanations.
2. Provide additional details on hyperparameter selection in the simulation studies and discuss the potential impacts of different settings.
3. A brief discussion on how the methodology might generalize to real-world data or continuous action spaces would enhance the paper's practical insights.
We would like to thank the reviewer for dedicating the time to review our paper and for providing the insightful comments. Due to the character limit, we cannot upload the revised manuscript, but we have edited it according to the reviewer's helpful suggestions. We use the following references.
[1] Si, N., Zhang, F., Zhou, Z., and Blanchet, J. Distributionally robust batch contextual bandits. Management Science, 2023.
[2] Mu, T., Chandak, Y., Hashimoto, T. B., and Brunskill, E. Factored dro: Factored distributionally robust policies for contextual bandits. Advances in Neural Information Processing Systems, 35:8318–8331, 2022.
[3] Athey, S. and Wager, S. Policy learning with observational data. Econometrica, 89(1):133–161, 2021.
[4] Zhou, Z., Athey, S., and Wager, S. Offline multi-action policy learning: Generalization and optimization. Operations Research, 71(1):148–183, 2023.
[5] Kallus, N., Mao, X., and Uehara, M. Localized debiased machine learning: Efficient inference on quantile treatment effects and beyond. arXiv preprint arXiv:1912.12945, 2019.
[6] Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.
[7] Jin, Y., Ren, Z., and Zhou, Z. Sensitivity analysis under the f-sensitivity models: a distributional robustness perspective. arXiv preprint arXiv:2203.04373, 2022.
[8] Kallus, N. and Zhou, A. Policy evaluation and optimization with continuous treatments. International Conference on Artificial Intelligence and Statistics, PMLR, 2018.
[9] Rosenbaum, P. R. Observational Studies. Springer Series in Statistics, Springer, New York, NY, 2002.
[10] Dudley, R. M. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967.
Weaknesses
1 See Question 1
2 Thank you for your helpful comments. The ERM step follows from standard duality results in the DRO literature. To improve readability, we have added a detailed explanation of the ERM derivation in our manuscript. With an empirical dataset, it is natural to propose an ERM solution based on the loss function inspired by the strong duality result in Lemma 2.3.
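For the reader's convenience, a hedged sketch of the generic KL-DRO duality from the literature that such an ERM loss is built on (this is the standard result, not necessarily the exact statement of Lemma 2.3): for an outcome $Y \sim P$ and radius $\delta > 0$,

$$
\inf_{Q:\, D_{\mathrm{KL}}(Q \,\|\, P) \le \delta} \mathbb{E}_{Q}[Y] \;=\; \sup_{\alpha \ge 0} \Big\{ -\alpha \log \mathbb{E}_{P}\big[e^{-Y/\alpha}\big] - \alpha\delta \Big\},
$$

so replacing the expectation on the right-hand side with its empirical counterpart and optimizing jointly over the policy and the dual variable $\alpha$ yields a natural ERM objective.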
3 A potential drawback of our framework (as well as of other distributionally robust optimization works) is the choice of the uncertainty-set radius. This parameter controls the size of the uncertainty set considered and thus the degree of robustness in our model --- the larger the radius, the more robust the output. The empirical performance of the algorithm depends substantially on the selection of the radius: a small radius leads to a negligible robustification effect, and the algorithm would learn an over-aggressive policy; a large radius tends to yield more conservative results. A more detailed discussion can be found in [1]. We have incorporated the above in the revision.
Comments and Suggestions
1 In terms of the cross-fitting and de-biasing techniques, we have added a more detailed explanation in our manuscript.
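As a generic illustration of the cross-fitting construction referred to above (a hedged sketch, not the paper's implementation; `fit_nuisance` and `evaluate_fold` are hypothetical placeholders):

```python
# Hedged sketch: K-fold cross-fitting so that each sample is scored by
# nuisance models trained on the other folds.
import numpy as np
from sklearn.model_selection import KFold

def cross_fit(data, fit_nuisance, evaluate_fold, n_splits=2, seed=0):
    """Fit nuisances on K-1 folds, score the held-out fold, and aggregate."""
    scores = np.empty(len(data))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, eval_idx in kf.split(data):
        nuisances = fit_nuisance(data[train_idx])   # e.g., propensity + outcome models
        scores[eval_idx] = evaluate_fold(data[eval_idx], nuisances)
    return scores.mean()
```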
2 In our simulation, we set the number of cross-fitting folds to the minimal number of splits possible, and the default spline threshold to 0.001 without fine-tuning. Under this default choice, we see that the algorithm already performs well. Increasing the number of folds and decreasing the spline threshold would increase the computational complexity.
3 In real-world applications, knowing the source of the distribution shift effectively shrinks the uncertainty set, thereby yielding less conservative results (compared with the joint modeling approach). Moreover, since in most cases practitioners have access to the covariates in the target environment, it is possible to identify and estimate covariate shifts: when the decision maker observes no or little covariate shift and would like to hedge against the risk of concept drift, it is suitable to apply our method, which outperforms existing methods designed for learning under joint distributional shifts. We are now applying our method to a real dataset, which will be ready before the camera-ready version.
See Question 2 for generalization to continuous action space.
Questions
1 The rate condition in Assumption 3.4 is standard in the literature [1,2,3,4,5]: it suffices to have sufficiently fast rates on all nuisance parameters, or no rate requirement on the propensity score at all if it is given. This assumption also parallels standard conditions in double machine learning, achievable by a variety of machine-learning methods [6]. An empirical sensitivity analysis can be found in [7], which justifies Assumption 3.4. We have added this discussion to our manuscript.
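For context, a hedged sketch of the type of double-machine-learning condition referred to above (the exact form of Assumption 3.4 may differ): writing $\hat{e}$ for the estimated propensity score and $\hat{\mu}$ for the estimated outcome-regression nuisance,

$$
\|\hat{e} - e\|_{L_2} \cdot \|\hat{\mu} - \mu\|_{L_2} \;=\; o_P\big(n^{-1/2}\big),
$$

which holds, for example, when each nuisance is estimated at an $o_P(n^{-1/4})$ rate [6].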
2 We agree with the reviewer that it might be possible to apply our ERM approach to a continuous-action-space extension; however, the problem setting would deviate from the current discrete case, as discussed in [8]. We leave this for future work.
Thank you for your response. My questions have been addressed.
Thank you so much for your time and positive feedback!
This paper presents a study on distributionally robust policy learning, specifically focusing on concept drift. The proposed methods and theoretical analyses are reasonable and the implementation appears sound.
However, the core ideas presented in this work, such as the separate treatment of covariate and concept shifts via DRO, the use of doubly robust techniques for evaluation and learning, and the derivation of learning guarantees, are not entirely novel and have been well explored in existing literature. While the paper presents extensions in several aspects, the overall contribution is perceived as somewhat marginal in light of the existing body of research. Therefore, I recommend a weak acceptance of this paper.