PaperHub
Overall rating: 5.0 / 10 — Rejected (4 reviewers)
Individual ratings: 5, 6, 6, 3 (min 3, max 6, std. dev. 1.2)
Confidence: 2.8 · Correctness: 2.8 · Contribution: 2.8 · Presentation: 2.8
ICLR 2025

Distribution-Dependent Rates for Multi-Distribution Learning

Submitted: 2024-09-27 · Updated: 2025-02-05
TL;DR

We devise adaptive and non-adaptive algorithms for the multi-distribution learning problem and provide distribution-dependent guarantees using tools from empirical process theory and drawing inspiration from pure exploration multi-armed bandits.

Abstract

Keywords

multi-distribution learning, distributionally robust optimization, pure exploration multi-armed bandits

Reviews & Discussion

Review (Rating: 5)

This paper presents innovative strategies within the framework of Multi-Distribution Learning (MDL), with the primary objective of identifying the best-performed distribution. It is informed by principles from Distributionally Robust Optimization (DRO) and multi-armed bandit theory, proposing both non-adaptive and adaptive methodologies. The non-adaptive techniques, namely Uniform Exploration (UE) and Non-Uniform Exploration (NUE), yield both distribution-independent and distribution-dependent bounds. Furthermore, the paper introduces an adaptive method in the interactive environment, LCB-DR, which further optimizes performance by employing optimistic sampling strategies analogous to the Upper Confidence Bound for Exploration (UCB-E) utilized in multi-armed bandit scenarios.

Strengths

The paper is well-written and clearly conveys the problem setting and the conclusion to the reader. This work provides a distribution-dependent bound with analysis, which has not been reported in the literature. The distribution-dependent bound enjoys an exponential decay, which can be compared to the probability of identification failure in Best-Arm Identification. Furthermore, this paper does not limit itself to non-adaptive exploration but extends to an adaptive exploration strategy, which is the UCB-E algorithm.

Weaknesses

Since the author mentions that MDL draws inspiration from multi-armed bandits, identifying the best-performing distribution can be viewed as analogous to best-arm identification (BAI) in MAB. In lines 199-203, the author also mentions a connection between this work and BAI via an $H_a$ term; it would be better if the author could draw more comparisons between MDL and BAI. Several works in BAI achieve instance-independent bounds (Audibert et al., 2010; Chen et al., 2017); how does the objective upper-bound guarantee relate to the existing bounds in the BAI literature? I believe it is important to show whether the given bound is tight when the problem setting is reduced to the existing work.

Questions

  • What is the definition of $a^\star$ in lines 199-203?

  • What is $l$, appearing in the RHS of the equation in lines 263-266?

  • Is $M$ a known parameter or an unknown parameter to the agent?

  • In lines 320-321, why does $\Delta_{\text{DR}}(a) \approx B_n$ induce the comparison to be $\tfrac{M^2}{n}$ vs. $\sigma^2_T + \Sigma^2_T + V_T$? Both exponential terms should be $1$ when the exponential factor becomes $0$.

Ethics Concerns

There are no ethical concerns.

Comment

Thank you for the feedback. To answer your first point, let us detail how the algorithms presented mirror their BAI counterparts:

  • The UE strategy in the BAI literature similarly samples $n$ times from each arm $a$ and ultimately picks the one with best empirical reward $\mu_T^\text{o}(a) \coloneqq \frac{1}{n}\sum_{i=1}^n r(X_a^{(i)})$. Since this is an unbiased estimate of the true mean, the regret analysis is a straightforward application of Hoeffding's inequality and yields the rate $\sum_a \Delta(a)\exp(-n\Delta^2(a))$, ignoring constants. In the MDL setting, there are two variables: the decision $a$ and the distributions $Q$ (serving the role of the ''arm'' in the BAI context). In essence, we apply UE on the latter to construct the empirical proxy $\mu_T^\text{o}(a) = \min_Q \frac{1}{n}\sum_{i=1}^n r(a,X_Q^{(i)})$ (a small illustrative sketch follows this list). Now, the analysis is more involved since $\mu_T^\text{o}$ is no longer an unbiased estimate of the true objective function, due to the minimization. For this reason, we instead rely on tools from empirical process theory and obtain the analogous regret of $\sum_a \Delta(a)\exp(-n[\Delta(a)-\sqrt{\log k / n}]^2)$. For more details on the BAI UE, we refer to Chapter 33 of Lattimore and Szepesvári's Bandit Algorithms.

  • Our LCB-DR algorithm essentially uses the UCB-E strategy from BAI to find the worst-performing distribution for each decision $a$. Hence, let us take an iteration $j$ of LCB-DR and compare it to UCB-E for $k$ arms over $T$ rounds. Focusing on the exponentially decaying term, the latter guarantees a rate of $\exp(-(T-k)/H)$, where $H = \sum_a \Delta^{-2}(a)$ captures the instance complexity. In LCB-DR, our goal is to apply the same idea on the distributions $Q\in\mathcal{U}$. However, by round $j$, we have already collected data in the first $j-1$ rounds and would like to make use of it, in addition to the $T_j$ samples collected in this round. Intuitively, if some distribution has already been played enough times, the learner can identify it as suboptimal. This is what the set $\mathcal{U}_j\subset\mathcal{U}$ captures: it is a guess of the distributions that the learner is still unsure about and will thus end up playing on round $j$ (note that we do not constrain the decision to be from $\mathcal{U}_j$, so this quantity can be unknown). In Appendix D, we derive a new analysis of UCB-E for the setting where the learner starts the game with some samples, as is the case here. Corollary D.1 shows that $\mathcal{U}_j$ indeed contains the arms played. The regret bound shows that the learner essentially plays UCB-E only on arms in $\mathcal{U}_j$ and, thus, the resulting rate scales with $H_j=\sum_{Q\in \mathcal{U}_j}\Delta_{a_j}^{-2}(Q)$, which can be much smaller than $H$ (in MDL, this translates to summing over all of $\mathcal{U}$) as it only sums over a subset of $\mathcal{U}$. In addition, the total number of pre-existing samples $\tilde T_j$ from $\mathcal{U}_j$ also contributes to the rate and we only offset by the size $k_j$ of $\mathcal{U}_j$. Putting these together, the LCB-DR regret for this round scales with $\exp(-(C_{a_j}^2\wedge 1)(T_j+\tilde T_j-k_j)/H_j)$, where $C_{a_j}$ is a complexity measure specific to MDL (see Section 4.1 for some intuition).
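To make the analogy concrete, below is a minimal sketch (not the authors' implementation; the sampler, reward function, and parameter values are hypothetical stand-ins) of the UE proxy for MDL and of a UCB-E-style adaptive search for the worst-case distribution of a fixed decision:

```python
# Illustrative sketch only: UE for MDL and a UCB-E-style inner loop.
# The sampler `sample`, the reward `reward`, and all constants are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def sample(q_idx, size):
    # Hypothetical sampler: draws `size` points from distribution Q_{q_idx}.
    return rng.normal(loc=0.1 * q_idx, scale=1.0, size=size)

def reward(a, x):
    # Hypothetical bounded reward r(a, x) in [0, 1].
    return float(np.clip(1.0 - abs(a - x), 0.0, 1.0))

def ue_mdl(decisions, k, n):
    """UE: draw n samples from each of the k distributions, then return the
    decision maximizing the robust proxy min_Q (1/n) * sum_i r(a, X_Q^(i))."""
    data = [sample(q, n) for q in range(k)]
    def proxy(a):
        return min(np.mean([reward(a, x) for x in data[q]]) for q in range(k))
    return max(decisions, key=proxy)

def ucbe_worst_distribution(a, k, budget, exploration=1.0):
    """UCB-E-style loop for a fixed decision a: adaptively pull the distribution
    whose lower confidence bound on the mean reward is smallest, i.e. be
    optimistic about which distribution is the worst case."""
    counts = np.ones(k)
    sums = np.array([reward(a, sample(q, 1)[0]) for q in range(k)])  # one pull each
    for _ in range(budget - k):
        lcb = sums / counts - np.sqrt(exploration / counts)
        q = int(np.argmin(lcb))
        sums[q] += reward(a, sample(q, 1)[0])
        counts[q] += 1
    return int(np.argmin(sums / counts))  # empirical worst-case distribution

print(ue_mdl(decisions=[0.0, 0.5, 1.0], k=4, n=200))
print(ucbe_worst_distribution(a=0.5, k=4, budget=100))
```

The point of the sketch is only the structural parallel: the distributions play the role of the arms, UE averages uniformly over them, and the adaptive routine concentrates its budget on the candidates for the worst case.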

With regards to tightness of these bounds, it remains an open question. Translating results from BAI to MDL is not very straightforward, for the reasons mentioned above, and our work does not focus on the lower bounds. It would be interesting to see such guarantees in follow-up works.

As for the questions posed:

  • We apologize for the omission. Here, $a^* = \arg\max_{a\in\mathcal{A}} \mu_\text{DR}(a)$ is the optimal distributionally robust decision, which is assumed to be unique for the LCB-DR algorithm (its analysis is the only one that uses $C_a$). More generally, we can define
$$C_a=\begin{cases} \frac{\Delta_\text{DR}(a)}{\Delta_{a,\text{min}}} & a \notin \arg\max_{a'\in\mathcal{A}} \mu_\text{DR}(a') \\ \frac{\Delta_{\text{DR,min}}}{\Delta_{a,\text{min}}} & \text{otherwise.} \end{cases}$$
  • $l=|\mathcal{A}|$ is the number of decisions available to the learner, as defined in Assumption 1.

  • $M$ does not have to be known to the learner.

  • Perhaps a more appropriate characterization would be $\Delta_\text{DR}(a)-B_n$ being suitably small. The idea is to mirror the Hoeffding vs. Bernstein difference: for a random variable bounded in $[0,M]$ and with variance $\sigma^2$, the bounds are $\exp(-t^2/M^2)$ vs. $\exp(-t^2/(\sigma^2+Mt))$, up to constants. For suitably small $t$, the $\sigma^2$ term dominates the sum $\sigma^2+Mt$ and the comparison boils down to $M^2$ vs. $\sigma^2$, where the smaller term is better (note that here, the latter is always better). Similarly, when $\Delta_\text{DR}(a)-B_n$ is small relative to $\sigma_T^2+\Sigma_T^2+V_T$ (which occurs in small sample regimes), the latter dominates the denominator of the NUE rate and the UE vs. NUE comparison reduces to $M^2/n$ vs. $\sigma_T^2+\Sigma_T^2+V_T$.
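For completeness, the reduction can be summarized in one line (a restatement of the standard Hoeffding/Bernstein facts invoked above, not a new result from the paper; the last inequality is Popoviciu's bound on the variance of a $[0,M]$-valued variable):

$$\underbrace{\exp\!\Big(-\tfrac{t^2}{M^2}\Big)}_{\text{Hoeffding-type}} \;\text{vs.}\; \underbrace{\exp\!\Big(-\tfrac{t^2}{\sigma^2+Mt}\Big)}_{\text{Bernstein-type}} \quad\xrightarrow{\;t\,\lesssim\,\sigma^2/M\;}\quad \exp\!\Big(-\tfrac{t^2}{M^2}\Big) \;\text{vs.}\; \exp\!\Big(-\tfrac{t^2}{\sigma^2}\Big), \qquad \sigma^2\le\tfrac{M^2}{4}.$$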

Comment

I like the response by the author and their explanation is clear to me. I will keep my score.

Review (Rating: 6)

This paper studies multi-distribution learning (MDL) from a bandit optimization point of view. By connecting the pure-exploration setting in bandits to MDL, it develops sharp instance-dependent regret rates, thereby improving on the current instance-agnostic rates in the MDL literature.

Strengths

The paper uses techniques from pure-exploration bandits to develop instance-dependent simple regret rates, which serve as less-conservative complements to the current instance-agnostic MDL error bounds.

Weaknesses

  • The paper claims that one of its contributions is developing problem-dependent rates for MDL, because “Oftentimes, it is more intuitive to analyze the learner’s performance in a fixed setting, as opposed to considering a worst-case instance for each sample size. When domain knowledge is available, a “one-size-fits-all” rate does not provide any insight on how to take advantage of this information”. However, the upper-bound rates developed in this paper depend on knowledge of the unknown optimality gap; how this would be integrated into domain knowledge remains unclear.
  • The problem setting seems very similar to Kirschner et al.'s distributionally robust online contextual bandit problem, but no discussion is provided on the differences and connections.

Questions

  • Following up on the first point in the weaknesses: would the story of the paper rather be that, given the knowledge that the uncertainty set $\mathcal{U}$ is fixed, one can develop exponential rates that scale with an unknown but fixed sub-optimality gap of each arm?
  • The paragraph in the introduction states “The current literature is populated with distribution-independent rates”, but no relevant literature is cited in this paragraph.
  • How would the proof be adjusted to accommodate the case where the learner interacts with the environment but the optimal arm is non-unique?
  • Typo on Line 223: a fixed number times -> a fixed number of times
Comment

Thank you for the thoughtful comments. Below, we address the weaknesses and questions posed.

Weaknesses

  • (Distribution-dependent analysis) Our motivation for studying distribution-dependent rates is not necessarily to leverage domain knowledge of suboptimality gaps to devise algorithms, but to understand how regret scales with instance-specific quantities (e.g., suboptimality gaps). In the multi-armed bandit literature, it is well-established that such guarantees typically decay much faster with the sample budget compared to their distribution-independent counterparts (e.g., $\exp(-T)$ vs. $1/\sqrt{T}$). Our goal is to bring this perspective to the MDL paradigm.

    A consequence of this is that when domain knowledge is indeed available, distribution-dependent rates can reveal the benefits of exploiting this information. For example, the NUE analysis shows the interplay between variance and sample size allocations, so that when the data is real-valued and the learner has some understanding of the variances across distributions, they can gain insight on how directing more data to higher-variance distributions impacts regret.

  • (DR Bayesian optimization) The setting of https://arxiv.org/pdf/2002.09038 is quite different from ours: they assume the learner receives some stochastic context $c_t$, sampled from a distribution in an uncertainty set $\mathcal{U}_t$ that changes with time, makes a decision $x_t$ and receives a reward $f(x_t,c_t)$ with additive noise. Their objective is to optimize a cumulative notion of regret. In our setting, we have a fixed uncertainty set and the task is to collect data from it, with the end goal of producing a decision that is distributionally robust in an offline sense.

Questions

  • (Paper motivation) This interpretation is a good way to put it. As highlighted above, the idea is to understand how the rates scale with (possibly unknown) instance-specific parameters. The stark contrast between the distribution-dependent and independent guarantees stems precisely from this difference in perspective: the former fixes an environment and studies how an algorithm performs in it, whereas the latter applies to worst-case environments for each sample size and, thus, cannot scale with environment-specific quantities.

  • (Omitted literature) This literature is presented in Section 1.2. We will make the appropriate citations earlier for further clarity.

  • (Non-unique optimum) The LCB-DR analysis (the other strategies do not assume a unique optimum) remains unchanged if we more generally redefine the complexity measure

$$C_a=\begin{cases} \frac{\Delta_\text{DR}(a)}{\Delta_{a,\text{min}}} & a \notin \arg\max_{a'\in\mathcal{A}} \mu_\text{DR}(a') \\ \frac{\Delta_{\text{DR,min}}}{\Delta_{a,\text{min}}} & \text{otherwise.} \end{cases}$$
  • (Typo) Thank you for pointing this out. We will make the necessary adjustment.
Review (Rating: 6)

The authors study distribution-dependent guarantees in the multi-distribution learning framework. They prove that distribution-dependent bounds are tighter than distribution-independent bounds. Specifically, they derive finite sample bounds under uniform and non-uniform exploration and propose an algorithm that improves over non-adaptive counterparts.

Strengths

The authors clearly compared the finite-sample bounds under uniform exploration against those under non-uniform exploration, and they highlighted where non-uniform exploration could have gains.

Weaknesses

The authors compared their proposed algorithm against uniform sampling, but not non-uniform sampling. Non-uniform sampling benefits from varied sample sizes and would be a stronger baseline to compare against.

It would be nice to provide experimental results, even in very simple setups, to showcase the strength of the proposed algorithm. The main results of this work are theoretical and there are substantial theoretical contributions, and thus I understand that experimental results may not be necessary.

Questions

How does this work relate to active learning, where one would also take an adaptive strategy? What are the barriers that prevent active learning algorithms from being applied to the problem setting studied in this work?

Are there relevant lower bounds available? If so, how do the upper bounds proven in this work compare to the lower bounds?

Comment

Thank you for the thoughtful feedback. Below, we address the identified weaknesses and respond to your specific questions.

Weaknesses

  • (UCB vs. NUE) Our core interest is in comparing adaptive vs. non-adaptive strategies. While NUE incorporates a more nuanced approach than UE by exploiting the variance of different distributions, its advantage arises from a fundamentally different principle than LCB-DR, which leverages optimism by collecting data in an adaptive manner. The comparison between UE and LCB-DR directly examines the impact of adaptivity in terms of suboptimality gaps, providing insights into the trade-offs between simple exploration and more sophisticated adaptive strategies. Switching to NUE does not add any insight in this direction. In essence, NUE and LCB-DR offer complementary benefits relative to UE, and we opted to evaluate each one separately.
  • (Experimental results) While experiments are indeed interesting to evaluate our methods, the central goal of this work is to provide a fundamentally different type of analysis for MDL algorithms. Our distribution-dependent approach examines how regret decays relative to instance-specific parameters (e.g., suboptimality gaps), offering a perspective that previous analyses did not address. This type of theoretical insight is challenging to illustrate through experiments, as the distribution-dependent rates of existing algorithms are not known.

Questions

  • Active learning (AL) shares some similarities with MDL in that both involve actively collecting data. However, there are two fundamental differences between the two:

    • AL is specific to supervised learning, where the data is i.i.d. from a single distribution; the learner's goal is then to select which points to label, given access to covariates. In contrast, MDL can also address unsupervised tasks and involves multiple distributions; the learner's goal is to decide which distributions to sample from. This distinction means that, in MDL, the data can be tailored to fundamentally different tasks (when distributions are very dissimilar), leading to sampling strategies that differ significantly from those in AL.
    • The ultimate goal of AL is to produce a decision with good expected performance under the single data-generating distribution. In MDL, the objective is instead to obtain uniformly good performance across a collection of distributions. As a result, the way the learner uses the collected data to make decisions can vary significantly between the two settings.

    Due to these fundamental differences in the scope of both settings, ideas from AL are not so directly transferable to MDL.

  • As far as we are aware, there are no lower bounds for distribution-dependent rates.

Comment

I thank the authors for the response. I will keep my score and defer to other reviewers who are more familiar with this research area.

Review (Rating: 3)

This paper addresses the multi-distribution learning problem, where the learner aims to optimize the model's worst-case performance across a set of distributions. The main contribution is a reformulation of this problem as a pure exploration multi-armed bandit task, along with simple regret bounds that depend on the sub-optimality gaps of actions. The first part of the paper studies the non-adaptive case, where the learner cannot interact with the environments. Here, the authors provide simple regret bounds for both uniform exploration (UE) and non-uniform exploration (NUE). The second part explores the interactive case, where environment interaction is permitted, and proposes an LCB-based algorithm that achieves a lower simple regret than UE.

Strengths

  • This paper offers a new perspective by formulating multi-distribution learning as a pure exploration problem in multi-armed bandits.
  • Based on this view, gap-dependent bounds are derived for both adaptive and non-adaptive cases for the multi-distribution learning problem.

Weaknesses

  • On the Significance of the Results: One of my main concerns is the significance of the results achieved in this paper, as they rely on strong assumptions, and their implications are not rigorously discussed.

    • Assumptions: The paper appears to address a simplified case where the action space $\mathcal{A}$ is discrete and finite, and the data space is restricted to 1 dimension in the analysis. In contrast, continuous decision sets are more commonly studied in the literature, such as Blum et al. (2017), Sagawa et al. (2020), and Soma et al. (2022). Although Section 5 discusses an extension to infinite decision sets, the proposed approach using an $\epsilon/k$-cover would result in a method with prohibitive computational costs.
    • Results for the Non-Adaptive Case: It is not entirely clear to me why the results for Non-Uniform Exploration (NUE) would be better than Uniform Exploration (UE), as the NUE outcomes depend on $\min_Q n_Q$, which could potentially be very small. The arguments in Section 3.3 are too intuitive to me, with several approximations made that require further justification. For instance, I am confused by the statement "considers a case $\Delta_{\text{DR}}(a) \approx B_n$" in line 321. It is uncertain whether we can disregard the term $\Delta_{\text{DR}}(a) - B_n$ in the comparison, as this term varies across arms, and the value of $B_n$ differs between the UE and NUE cases.
  • On the Proposed Method in Section 4: The proposed method for the adaptive case requires knowledge of $H_j$, which depends on the suboptimality gap $\Delta_{a,\min}$ and is generally unknown in practice. Although the authors provide some discussion in Remark 4, it remains unclear how this issue would be addressed. Additionally, the setting of $\epsilon_t$ is also confusing. While this quantity appears only to need to be lower bounded, it is still unclear how to set this value to ensure that Condition Eq. (1) is not violated.

  • About the literature review: The discussion of the convergence rate for related work in lines 139-152 is inaccurate. Although Soma et al. (2022) claim a result of $O\left(\frac{\sqrt{B^2 + k}}{T}\right)$, their analysis overlooks the non-oblivious property of the learning process, rendering the result invalid. This issue was identified by Zhang et al. (2023), and the currently best-known result in this line of research remains $O\left(\frac{\sqrt{B^2 + k \log k}}{T}\right)$. One may check Section 2.3 of Zhang et al. (2023) for a discussion.

Questions

  • Is it possible to extend the results to continuous spaces without a significant increase in computational complexity? For example, could pure exploration in linear bandit settings be considered?
  • Could you provide more detailed explanations on the comparison between UE and NUE? (Please refer to the first point under weaknesses for a more detailed discussion.)
  • How can the parameter $T_j$ be set in practice to ensure the theoretical guarantees?
Comment

Thank you for the insightful comments. Below, we address the issues presented:

  • (1-dimensional data assumption) The restriction to real-valued data is only for NUE and its comparison to UE. Otherwise, the analyses hold for arbitrary spaces.

  • (NUE results) The UE vs. NUE comparison mirrors the Hoeffding vs. Bernstein inequalities for variables bounded in $[0,M]$ and with variance $\sigma^2$. In particular, the comparison reduces to $\exp(-t^2/M^2)$ vs. $\exp(-t^2/(\sigma^2+Mt))$, up to constants. For suitably small $t$, this boils down to $M^2$ vs. $\sigma^2$, where the latter always wins. Similarly, in Section 3.3, we compare the rates when $\Delta_\text{DR}(a)-B_n$ is suitably small, which occurs in small sample regimes, yielding the analogous $M^2/n$ vs. $\sigma_T^2+\Sigma_T^2+V_T$. As noted, this term indeed varies across arms and may be more ''suitably small'' for some.

    Moreover, while the guarantees of UE and NUE use $B_n=\sqrt{\log k/n}$ and $B_n=G_T$, respectively, we note in Appendix C that either term works for both rates. We use $G_T$ for NUE as it showcases the dependence on variance. However, when comparing the strategies, we assume small sample sizes, where $B_n$ is less significant and the focus is on the quantities above, so that we apply the same $B_n$ to both algorithms.

As for the prohibitive cost of creating a cover, this may indeed be a barrier in full generality. Let us highlight some settings where this is more feasible, the latter being the subject of your first question.

  • (VC classes) Binary classification has been a major focus of the literature (e.g., Blum et al. (2017)). We mention this in Section 5 and further develop it in Appendix F. Suppose that the data is of the form $(X,Y)\in\mathcal{X}\times\{0,1\}$, decisions are binary-valued functions on $\mathcal{X}$ and the reward function is $r(a,(x,y)) = \mathbb{I}\{a(x)=y\}$. In Appendix F.1, we show that given a finite $\sqrt{\epsilon/k}$-cover $\mathcal{A}_\epsilon$ of $(\mathcal{A},L^2(\bar Q_\mathcal{X}))$ (we refer to Appendix F.1 for the definitions), we can run the algorithms on it and incur an approximation penalty of $\epsilon$ on the regret. Suppose that $\mathcal{A}$ has finite VC dimension $d$. To construct a cover with high probability, we can sample $O(k^2(d+\log(1/\delta))/\epsilon^2)$ points from $\bar Q_\mathcal{X}$ and project $\mathcal{A}$ onto this sample (a small illustrative sketch follows this list). By the Sauer–Shelah lemma, we also know that $|\mathcal{A}_\epsilon|\lesssim(k^2(d+\log (1/\delta))/(d\epsilon^2))^d$. With a slightly different analysis, we can also reduce the $k^2$ to $k$.
  • (Linear classes) Suppose that $r(a,x)=v^\top\psi(a,x)$, for some known vector $v$ and feature mapping $\psi$, both lying in the unit Euclidean ball in $d$ dimensions (recall that the learner has access to $r$). Let $\psi(a,\bar Q)\coloneqq \mathbb{E}_{X\sim\bar Q}[\psi(a,X)]$, where $\bar Q$ is the uniform mixture of $Q_{1:k}$. The problem then reduces to constructing a cover of $\mathcal{A}$ w.r.t. the pseudo-metric $d(a,a')=\Vert\psi(a,\bar Q)-\psi(a',\bar Q)\Vert_2$. While we cannot access $\bar Q$ directly, we can sample from it to approximate $\psi(a,\bar Q)$. With $O(d\log(1/(\epsilon\delta))/\epsilon^2)$ samples, we can construct an $\epsilon$-cover of size $O(1/\epsilon^d)$ with high probability.
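As a toy illustration of the projection step described in the VC-class bullet above (hypothetical code, not from the paper: the mixture and the class of one-dimensional threshold classifiers are stand-ins for the general setting):

```python
# Illustrative sketch only: build an empirical cover of a binary-valued class by
# projecting it onto a sample drawn from the mixture \bar{Q}_X. The class used
# here (1-D threshold classifiers, a_t(x) = 1{x >= t}) is a hypothetical stand-in.
import numpy as np

rng = np.random.default_rng(0)

def sample_from_mixture(m):
    # Hypothetical mixture \bar{Q}_X over [0, 1].
    return rng.uniform(0.0, 1.0, size=m)

thresholds = np.linspace(0.0, 1.0, 1001)  # parameterizes the class

def empirical_cover(n_points):
    """Project the class onto n_points sample points: classifiers inducing the
    same labelling of the sample are identified, keeping one representative."""
    xs = sample_from_mixture(n_points)
    representatives = {}
    for t in thresholds:
        pattern = tuple((xs >= t).astype(np.int8))  # labelling induced by a_t
        representatives.setdefault(pattern, float(t))
    return sorted(representatives.values())

cover = empirical_cover(n_points=200)
# Sauer-Shelah: for a class of finite VC dimension, the number of distinct
# labellings (hence the cover size) grows only polynomially in the sample size.
print(len(cover))
```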

Regarding unknown quantities in LCB-DR, we make some remarks:

  • $\epsilon_j$ depends on the unknown quantities $H_{a_1:a_{j-1}}$ and $\Delta_{a_j,\text{min}}$. The former are sums of the suboptimality gaps of $a_{1:j-1}$, for which we have already run UCB-E. Hence, we can estimate them using the current data: $\Delta_{a_r}(Q)\approx\hat\mu_{n_{\bar T_{r}}}(a_r; Q) - \mu_T^\circ(a_r)$ (see the short sketch after this list). The term $\Delta_{a_j,\text{min}}$ is trickier as we have not yet collected samples for $a_j$. One idea is to use the data gathered so far via the proxy $Q\mapsto\hat\mu_{n_{\bar T_{j-1}}}(a_j; Q)$. However, this may not yield a good estimate if the available data is not informative for $a_j$. Given the assumption $r\in[0,1]$, another option is to use $1$ as a crude upper bound of $\Delta_{a_j,\text{min}}$. Note that a large value for $\epsilon_j$ is not an issue since the regret is a decreasing function of it.
  • Regarding $T_j$, we make a correction to Remark 4: Theorem 3 holds provided that the lower bound $T_j\gtrsim\epsilon_jH_j-\tilde T_j+k_j$ is satisfied. The regret, however, scales with $\epsilon_j$, and if the bound is loose, the dependence on $T_j$ is worse. The change is in Line 464: if we plug in a different value for $\epsilon_j$, the rate is no longer $O\left(\frac{(C_{a_j}^2\wedge 1)(T_j+\tilde T_j-k_j)}{H_j}\right)$. Ideally, we would choose $T_j$ so that the lower bound is as tight as possible. In practice, if the right-hand side is difficult to estimate, the learner can choose a large value for $\epsilon_j$, making the other terms relatively small, and select $T_j$ slightly above it. As discussed, a large $\epsilon_j$ should not be an issue.
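As referenced in the first bullet, here is a minimal sketch of the plug-in estimates (hypothetical names; `mu_hat_a` would hold the empirical means $\hat\mu(a_r;Q)$ computed from the data collected so far, and dropping the zero gap of the empirical minimizer is a simplification):

```python
import numpy as np

def estimate_gaps_and_H(mu_hat_a):
    """Plug-in proxies: gap(Q) ~ mu_hat(a; Q) - min_Q' mu_hat(a; Q'), and
    H_a ~ sum over empirically suboptimal Q of gap(Q)^{-2}."""
    mu_hat_a = np.asarray(mu_hat_a, dtype=float)
    gaps = mu_hat_a - mu_hat_a.min()      # empirical suboptimality gaps
    positive = gaps[gaps > 0]             # drop the empirical worst case
    if positive.size == 0:                # degenerate case: all distributions tie
        return gaps, float("inf")
    return gaps, float(np.sum(positive ** -2.0))

# Example with made-up empirical means of one decision across four distributions.
gaps, H_a = estimate_gaps_and_H([0.62, 0.55, 0.41, 0.40])
print(gaps, H_a)
```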
Comment

I appreciate the authors' detailed feedback. However, I remain concerned about the rigor of the arguments and the assumptions presented (e.g., the one-dimensional assumption for the NUE case and the difficulties in setting the parameter). In the comparison of UE and NUE, it is unclear how the term $\min_{Q \in \mathcal{U}} n_Q$ influences the performance of the two bounds. I believe a more rigorous analysis would be beneficial. For instance, it would be helpful to provide a specific condition (e.g., $n$ being greater than certain quantities) under which NUE is strictly better than UE, rather than relying on the asymptotic arguments in Section 3.3. Additionally, the establishment of covering still appears to be computationally challenging in the stated cases, as its complexity scales exponentially with the dimension and the size of the data.

Comment

Thank you for the follow-up highlighting these issues. Below, we provide additional clarification on the concerns raised:

(Rigor of UE vs. NUE) We agree that the UE vs. NUE comparison is rather handwavy, so let us extend the example of Section 3.4.3 for a more concrete setup: suppose that we have $C\gg 1$ distributions $Q_1,\dots,Q_C$, where $Q_1,\dots,Q_{C-1}$ share a common variance of 1 and $Q_C$ has the much larger variance of $C$. Consider the non-uniform allocation of $2C$ samples that allocates 1 sample to each of $Q_1,\dots,Q_{C-1}$ and $C$ samples to $Q_C$ (more formally, it should be $C+1$, but we ignore the $+1$). In addition, suppose that the reward function is bounded in $r\in[0,C]$. Note that a uniform allocation here would sample 2 times from each distribution, so applying the same $B_n$ (namely, the first formula of Theorem C.1 scaled by the bound on $r$) to both UE and NUE yields $B_n \asymp C\sqrt{\log C}$ for both bounds. Then, using Lemma G.4 (as in Section 3.4.3), we can conclude that the exact bounds, up to constants, are

$$\underbrace{\exp\left(-\frac{(\Delta_{\text{DR}}(a)-C\sqrt{\log C})^2}{C^2}\right)}_{\text{UE}} \quad\text{vs.}\quad \underbrace{\exp\left(-\frac{(\Delta_{\text{DR}}(a)-C\sqrt{\log C})^2}{\sqrt{\log C} + C(\Delta_{\text{DR}}(a)-C\sqrt{\log C})}\right)}_{\text{NUE}}$$

Now, suppose that $\Delta_{\text{DR}}(a) = C\sqrt{\log C} + \epsilon_a/C$ for some small $\epsilon_a>0$. Then, the comparison becomes

$$\underbrace{C^2}_{\text{UE}} \quad\text{vs.}\quad \underbrace{\sqrt{\log C}+\epsilon_a}_{\text{NUE}}$$

Evidently, when $C$ is large, the NUE strategy of sampling more from the higher-variance distribution $Q_C$ yields a better rate. This is true for any arm, provided that $\epsilon_a\ll C^2$.
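For concreteness, a quick numeric check of the two denominators above (the values of $C$ and $\epsilon_a$ are arbitrary illustrations):

```python
import math

# Compare the exponent denominators from the display above:
# C^2 for UE vs. sqrt(log C) + eps_a for NUE.
eps_a = 0.1
for C in (10, 100, 1000):
    ue_denom = C ** 2
    nue_denom = math.sqrt(math.log(C)) + eps_a
    print(f"C={C:5d}   UE: {ue_denom:9.1f}   NUE: {nue_denom:6.3f}")
```

Since both exponents share the numerator $(\epsilon_a/C)^2$, the much smaller NUE denominator translates into a much faster decaying bound as $C$ grows.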

This example illustrates a broader setting where NUE can outperform UE: the term $\min_Q n_Q$ did not pose an issue here since, up to constants, it is equal to the UE allocations. Then, by leveraging the different variances, NUE significantly reduced the dependence on them.

(Computational challenges) Indeed, the exponential dependence on the dimension poses a computational challenge. The primary focus of this work is on the statistical complexity of learning in the MDL problem, and computational efficiency, while an important aspect, is outside the scope of this analysis. We note that such challenges are not unique to our work but are inherent to problems involving high-dimensional distributions. This aligns with the standard PAC learning framework, where a statistical-computational tradeoff often exists. Bridging this gap in the MDL paradigm, whether through new algorithms or alternative covering techniques, is an interesting direction for future research.

AC Meta-Review

This is a borderline paper. The paper addresses multi-distribution learning by means of multi-armed bandit algorithms, and the authors attain regret bounds in this setting. The reviewers are particularly critical of the strong assumptions that are imposed and the lack of justification and discussion of these assumptions. I believe that a major revision is necessary to address the shortcomings pointed out by the reviewers.

Additional Comments on Reviewer Discussion

The reviewers replied to the authors after the rebuttal but no major discussion evolved.

Final Decision

Reject