PaperHub

Rating: 3.5 / 10 (4 reviewers; status: withdrawn)
Individual ratings: 3, 3, 5, 3 (min 3, max 5, std. dev. 0.9)
Confidence: 3.3
ICLR 2024

AdaO2B: Adaptive Online to Batch Conversion for Out-of-Distribution Generalization

Submitted: 2023-09-24 · Updated: 2024-03-26
TL;DR

An adaptive online to batch conversion approach that can handle out-of-distribution generalization.

Abstract

Keywords
online to batch conversion, out-of-distribution (OOD) generalization, streaming applications, bandit

Reviews and Discussion

Official Review (Rating: 3)

Motivated by real-world streaming applications that may be sampled from time-varying distributions, this paper aims to address the out-of-distribution (OOD) generalization challenge faced by existing online-to-batch conversions for contextual bandits. To this end, an adaptive online-to-batch conversion (AdaO2B) approach is proposed, which is designed according to a theoretical analysis of the OOD generalization bound. Experimental results seem to verify the performance of the AdaO2B approach.

Strengths

  1. The existence of time-varying distributions in real-world streaming applications is convincing, which does affect the generalization of the online-to-batch (O2B) technique. So, it is natural and reasonable to investigate the out-of-distribution (OOD) generalization of O2B.
  2. The authors try to address the OOD generalization problem of O2B by first establishing a generalization error bound regarding the combination weights used in O2B, and then developing an algorithm, namely AdaO2B, to find suitable weights that are helpful for reducing the generalization error bound.
  3. Experimental results on both synthetic data and real-world data are reported.

Weaknesses

  1. Although the motivation for investigating OOD generalization of the online-to-batch (O2B) conversions is convincing, it is actually not clear why the O2B conversion of bandit algorithms should be the focus. The authors should explain this motivation in more detail. Moreover, it is natural to ask whether the OOD generalization of O2B with full-information algorithms has already been addressed.
  2. I am not convinced that the generalization error defined in Eq. (3) can reflect the generalization ability of the estimated reward parameters $\theta$, because the first term $C_{ER}$ is an upper bound that does not depend on $\theta$, and the second term directly computes the reward using $\theta$. Note that in the bandit setting, one should first use the estimated parameters $\theta$ to select an arm, and the reward of that arm is then determined by the underlying reward model, i.e., $\theta^\ast$ (see the sketch after this list).
  3. It is not clear whose regret is reflected by the weighted regret defined in Eq. (4). So, combined with the above concern about the definition of the generalization error, I am not convinced that the generalization error bound derived in this paper can really reflect the hardness of OOD generalization.
  4. Although the authors argue that "theoretical analysis provides justification for why and how the learned adaptive batch learner can achieve OOD generalization error guarantees", the proposed algorithm is not actually designed strictly according to their theoretical results. Specifically, learning the weights via a Multi-Layer Perceptron is too heuristic.
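
To make the evaluation point in item 2 concrete, here is a minimal sketch (hypothetical code, not from the paper) of how one would measure the quality of estimated parameters in a linear contextual bandit: the estimate $\theta$ only selects the arm, while the realized reward is governed by the true parameters $\theta^\ast$.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_arms = 5, 4
    theta_star = rng.normal(size=d)                          # true reward parameters
    theta_hat = theta_star + rng.normal(scale=0.5, size=d)   # estimated parameters

    def mean_reward(theta_hat, theta_star, n_rounds=1000):
        """Average reward of the greedy policy induced by theta_hat, under theta_star."""
        total = 0.0
        for _ in range(n_rounds):
            arms = rng.normal(size=(n_arms, d))   # arm feature vectors for this round
            a = np.argmax(arms @ theta_hat)       # theta_hat only selects the arm...
            total += arms[a] @ theta_star         # ...the reward comes from theta_star
        return total / n_rounds

    print(mean_reward(theta_hat, theta_star))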

Questions

My main concerns are described in the Weaknesses part above. I also have some suggestions.

  1. In the related work, the authors should provide a more detailed review of previous online learning studies, especially those on the bandit setting.
  2. In the related work, the review of transfer learning should be reorganized. Its current placement is too abrupt, because "transfer learning" has hardly been discussed in the preceding context.
Official Review (Rating: 3)

This work studies a setting for the online-to-batch conversion in which the data distributions diverge between the training and testing phases. The work offers theoretical analysis showing that the generalization error in the testing stage can be bounded by the weighted regret in the online learning stage and the distributional divergence between the training and testing stages. Based on these findings, the paper further proposes a heuristic algorithm to return the batch learner, where the authors consider more adaptive data-buffer collection methods, e.g., a sliding-window approach. Experiments are conducted to validate these findings and to compare different heuristic algorithmic designs.
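
To illustrate the buffer-collection strategies mentioned above, here is a minimal sketch (hypothetical interfaces, not the paper's code) of a sliding-window buffer next to classical reservoir sampling; the former favors recent data, the latter uniform coverage of the whole stream.

    import random
    from collections import deque

    class SlidingWindowBuffer:
        """Keep only the most recent `capacity` samples (favors fresh data)."""
        def __init__(self, capacity):
            self.data = deque(maxlen=capacity)

        def add(self, sample):
            self.data.append(sample)

    class ReservoirBuffer:
        """Keep a uniform random subset of the whole stream (Algorithm R)."""
        def __init__(self, capacity):
            self.capacity, self.data, self.seen = capacity, [], 0

        def add(self, sample):
            self.seen += 1
            if len(self.data) < self.capacity:
                self.data.append(sample)
            else:
                j = random.randrange(self.seen)  # uniform index into the stream so far
                if j < self.capacity:
                    self.data[j] = sample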

Strengths

The writing of this paper is comprehensive, with an organization that is easy to follow. The paper carries out a series of experiments to validate its findings.

Weaknesses

In the context of OOD settings with online-to-batch conversion techniques, the theoretical contributions of the paper fall short of providing novel insights. The established relationship between the generalization error, the online-stage regret, and the distribution divergence seems somewhat intuitive and might be anticipated by some information-theoretic results. Given the paper's emphasis on OOD settings, it is noticeable that the algorithmic design leans more towards addressing distribution shift than actively identifying potential outliers. Additionally, the parameter $\beta$ appears to be important in determining the final model's performance, so ablation studies showing the impact of different choices of $\beta$ would have been a valuable addition to this work.

Questions

See the weaknesses above.

Official Review (Rating: 5)

This work focuses on the OOD generalization problem for the online-to-batch (O2B) conversion. On the theoretical side, it proves a relationship between the online regret and the OOD generalization error. On the experimental side, it proposes a new O2B conversion method, AdaO2B, with learnable weights for the decision models, and designs a context-aware loss function to learn these weights. AdaO2B shows good performance on both synthetic and real-world datasets.

Strengths

The AdaO2B framework, with its learnable weights and context-aware loss function, is widely applicable to different bandit algorithms.

Weaknesses

Weakness 1: The proof of Theorem 1 can be decomposed into (i) bounding the difference between $\theta_{ada}$ and $\theta^*$ on the training dataset $P$, and (ii) bounding the distance for $\theta_{ada}$ between distributions $P$ and $Q$. For the first part, the proof is the same as that of Theorem 3.1 of [1]. For the second part, this work uses the results of [2]. The technical novelty of Theorem 1 is therefore not clear and needs to be discussed in detail.

Weakness 2: In the experimental part, the dataset varies across episodes, and the AdaO2B framework prefers to assign larger weights $\beta_n$ to those $\theta_n$ whose training distribution is similar to the testing distribution $Q$. However, in the theoretical analysis, this work assumes that the training distributions in the different episodes $n$ are the same, which neither captures the dynamic nature of the data stream during training nor matches the intuition behind the adaptive $\beta_n$. Hence, it would be better to discuss distribution shift in the training process, to close the gap between the theoretical analysis and the experimental results.

[1] Orabona, F. (2019). A Modern Introduction to Online Learning. arXiv preprint arXiv:1912.13213.
[2] Shui, C., Chen, Q., Wen, J., Zhou, F., Gagné, C., & Wang, B. (2020). Beyond H-Divergence: Domain Adaptation Theory with Jensen-Shannon Divergence. arXiv preprint arXiv:2007.15567.

Questions

Question 1: Can you discuss the novel points in the proof of Theorem 1?

Question 2: Can you discuss how to deal with the distribution shift in the training process in Theorem 1?

Question 3: In the discussion of the KuaiRec dataset, this work mentions that reservoir sampling performs well since it can capture severe distribution drift. Can we observe a similar phenomenon if we add severe distribution drift to the synthetic data?

Typo 1: It should be $W$ instead of $K$ in Corollary 1.

Official Review (Rating: 3)

The authors present an online-to-batch conversion method named AdaO2B for ensembling a sequence of online bandit models, each having been run on a different, subsequent time period (e.g., morning of day 1, afternoon of day 1, night of day 1, morning of day 2, ...) and trained to fit the obtained rewards [Li et al., 2010, http://rob.schapire.net/papers/www10.pdf].

To perform online-to-offline batch conversion, the authors snapshot K of their past bandit models (as well as the corresponding logged actions, contexts, and rewards), and train an MLP neural network to predict a mixture over the K bandits which fits the logged rewards. In other words, given a context, the MLP predicts a softmax weighting of the K bandit models such that the corresponding weighted mixture of their estimated rewards fits the logged reward in the training data.
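
A minimal sketch of this mechanism as described (PyTorch; the shapes, sizes, and names here are my assumptions, not the authors' code): an MLP maps the context to softmax weights over the K snapshotted bandits, and the weighted mixture of their reward estimates is fit to the logged reward.

    import torch
    import torch.nn as nn

    K, d = 8, 16  # number of bandit snapshots and context dimension (assumed)

    class AdaptiveMixer(nn.Module):
        def __init__(self, d, K, hidden=64):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, K))

        def forward(self, context, reward_estimates):
            # context: (B, d); reward_estimates: (B, K), one column per bandit
            beta = torch.softmax(self.mlp(context), dim=-1)  # adaptive mixture weights
            return (beta * reward_estimates).sum(dim=-1)     # mixed reward prediction

    mixer = AdaptiveMixer(d, K)
    ctx = torch.randn(32, d)       # logged contexts
    r_hat = torch.randn(32, K)     # each bandit's reward estimate per context
    r_logged = torch.randn(32)     # logged rewards
    loss = nn.functional.mse_loss(mixer(ctx, r_hat), r_logged)
    loss.backward()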

The authors provide regret bounds to suggest that an adaptive mixture can do better than a fixed online-to-offline batch conversion strategy (e.g., a static mixture of all K bandits' reward estimates), and provide a number of different strategies for selecting the K constituent bandit models in the ensemble.

The paper concludes with experiments on a synthetic dataset and a real-world video recommendations dataset (KuaiRec) to compare AdaO2B with different fixed online-to-offline batch conversion methods. The authors also provide ablations showing the effects of different choices for K, and the different methods of selecting the K constituent bandits.

Strengths

Dealing with change points and non-stationarity is an important part of building adaptive online interactive systems, so the problem addressed is worthy of focus. Moreover, learning an adaptive mixture of recommenders, each fitted to a different regime or time period, is a sensible approach to dealing with non-stationarity.

Weaknesses

The main weaknesses of this paper are in the areas of 1) novelty; 2) soundness of the theoretical analysis; and 3) completeness and significance of the empirical evaluations.

Regarding novelty, there is already a great deal of prior work on learning adaptive or hierarchical mixtures. The adaptive method proposed in this paper was previously proposed by Jacobs et al. 1991 (https://www.cs.toronto.edu/~hinton/absps/jjnh91.pdf), though not cited here. Adaptive mixtures have also been employed elsewhere in the recommendations space: Kumthekar et al. 2019 (https://research.google/pubs/pub49380/) focus on adaptively mixing shared models for multitask learning, which is admittedly different from handling temporal drift.

Regarding theoretical soundness, my main concern is that the bound in Theorem 1, Equation (5), does not seem particularly useful, but perhaps I am misunderstanding it. It has two terms: the first is what I would call the "value-add" term of using an adaptive mixture rather than a fixed one, and the second describes the worst-case, "out-of-distribution" penalty, which cannot be alleviated by the proposed method. Therefore, only the first term is of potential use. But here, the value-add term is simply assumed to have some worst-case fixed upper bound $C_{WReg}$, and the denominator is the sum of the mixture weights $\beta_{1:N}$, which is often constrained to equal 1.0. So I am not sure what useful information this bound conveys.
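
For orientation, the two-term structure under discussion, written schematically from the descriptions in this and the preceding reviews (my reconstruction, not the paper's exact statement; the constants and the precise divergence may differ):

    \mathrm{Err}_{Q}(\theta_{ada}) \;\le\; \frac{\mathbb{E}_{P}\big[C_{WReg}\big]}{\sum_{n=1}^{N} \beta_n} \;+\; O\Big(\sqrt{D(P \,\|\, Q)}\Big)

where the first term is the "value-add" of the adaptive mixture and the second is the irreducible OOD penalty; per the third review, the divergence term follows the Jensen-Shannon results of Shui et al. (2020).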

Further, the corollary in Equation (6) puzzles me because it assumes that nature's densities $p_i$ and $q_i$ are paired in some fashion. But in actuality there is no guarantee that nature will swap $q_i$ for $p_i$; it could instead swap $q_j$ for $p_j$. A double sum therefore seems more appropriate:

   $\sqrt{\, 2 \log K \;+\; K \sum_{i,j} \pi_i \tau_j \, D(p_i \,\|\, q_j) \,}$.

Finally, regarding the empirical results, I have two concerns:

i) In the experiments on KuaiRec (the real-world dataset), AdaO2B shows no benefit over using a fully online (FOL) bandit. Therefore, I am not sure why one would use it, especially since it requires training the constituent models online anyway.

ii) The experiments do not include a comparison against the baseline of training a single recommender system offline, directly on the logged bandit feedback (see the sketch below). This baseline might work just as well as, or better than, the proposed adaptive mixture-of-bandits model, and it is the more obvious thing to try.
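
For clarity, the baseline in (ii) could be as simple as the following sketch (a hypothetical direct-method regression on the logged data; `Ridge` is just a placeholder for any offline reward model):

    import numpy as np
    from sklearn.linear_model import Ridge

    # Logged bandit feedback: contexts X, chosen actions a, observed rewards r.
    n, d, n_actions = 1000, 16, 8
    X = np.random.randn(n, d)
    a = np.random.randint(0, n_actions, size=n)
    r = np.random.randn(n)

    A = np.eye(n_actions)[a]                    # one-hot encode the logged actions
    model = Ridge().fit(np.hstack([X, A]), r)   # single recommender trained offline

    def act(x):
        """Greedy action under the single offline reward model."""
        feats = np.hstack([np.tile(x, (n_actions, 1)), np.eye(n_actions)])
        return int(np.argmax(model.predict(feats)))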

Questions

I am interested to know whether I misunderstand the usefulness of Theorem 1. Related to this question: are the mixture weights $\beta_{1:N}$ constrained to sum to 1.0?

Additionally, I am confused that the quantity $C_{WReg}$ in Equation (5) does not seem to be a random quantity, yet an expectation is applied to it anyway: $\mathbb{E}_P[C_{WReg}]$. Why?

Ethics Concerns

no concerns