PaperHub
Overall score: 7.3/10
Poster · 4 reviewers
Ratings: 4, 5, 5, 4 (min 4, max 5, std 0.5)
Average confidence: 3.5
Novelty: 2.8 · Quality: 3.0 · Clarity: 2.8 · Significance: 3.0
NeurIPS 2025

Improved Regret and Contextual Linear Extension for Pandora's Box and Prophet Inequality

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

We study the Pandora’s Box problem in an online learning setting with semi-bandit feedback. In each round, the learner sequentially pays to open up to $n$ boxes with unknown reward distributions, observes rewards upon opening, and decides when to stop. The utility of the learner is the maximum observed reward minus the cumulative cost of opened boxes, and the goal is to minimize regret defined as the gap between the cumulative expected utility and that of the optimal policy. We propose a new algorithm that achieves $\widetilde{O}(\sqrt{nT})$ regret after $T$ rounds, which improves the $\widetilde{O}(n\sqrt{T})$ bound of Agarwal et al. [2024] and matches the known lower bound up to logarithmic factors. To better capture real-life applications, we then extend our results to a natural but challenging contextual linear setting, where each box's expected reward is linear in some known but time-varying $d$-dimensional context and the noise distribution is fixed over time. We design an algorithm that learns both the linear function and the noise distributions, achieving $\widetilde{O}(nd\sqrt{T})$ regret. Finally, we show that our techniques also apply to the online Prophet Inequality problem, where the learner must decide immediately whether or not to accept a revealed reward. In both non-contextual and contextual settings, our approach achieves similar improvements and regret bounds.
Keywords
Pandora's Box · Prophet Inequality · Semi-bandits · Regret minimization

Reviews and Discussion

Review (Rating: 4)

This work improves the regret guarantee for the online learning version of the Pandora's Box problem from $\tilde{O}(n\sqrt{T})$ to $\tilde{O}(\sqrt{nT})$, via a Bernstein-type bound in the optimistic construction. The paper further extends the problem to the contextual case with linear rewards and achieves an $\tilde{O}(nd\sqrt{T})$ regret in that problem, where $d$ is the dimension of the context.

Strengths and Weaknesses

The results of this paper are solid, and the paper is written well, striking a good balance between intuitions and technical details. However, I think some discussions are still required to further improve the readability of this paper.

First, Bernstein-type solutions were given in Guo et al. (2019) to address the sample complexity of auctions and other related problems. This paper states that its use of this tool in the analysis is substantially different from that of Guo et al. (2019), but does not explain the difference in detail. To me, the idea of using the Bernstein inequality to construct stochastically dominated empirical distributions is similar.

Second, it is interesting and natural to ask why the dependence on the number of boxes $n$ differs between the non-contextual and contextual problems in this paper, but the authors do not discuss this point.

Questions

  • Could the authors explain in detail how their use of the Bernstein inequality differs from Guo et al. (2019)?
  • Could the authors explain why a similar algorithm gives a different dependence on $n$ in the regret bounds for the non-contextual and contextual problems?

Limitations

Yes.

Justification of Final Score

I will keep my score.

Formatting Issues

No.

Author Response

We thank the reviewer for the valuable feedback. Below we address your concerns.


W1 and Q1: This paper mentions that the use in analysis is substantially different from that of Guo et al. (2019), but not in detail. As for me, the idea of using Bernstein inequality to construct stochastically dominated empirical distributions is similar. Could the authors explain in detail how their use of Bernstein inequality is different from Guo et al. (2019)?

A: Our use differs significantly in both the algorithmic perspective and the analysis.

  • Algorithmic perspective. Since Guo et al. (2019) focus on sample complexity, they first collect sufficiently many samples from the distributions of all boxes to construct empirical distributions, and then output a near-optimal policy from pessimistic distributions, which reallocate the probability mass of the empirical distributions toward lower outcomes. This is exactly the opposite of our optimistic scheme, where we shift mass to higher outcomes (an illustrative sketch of such a construction is given after this list). Since we focus on regret minimization, the algorithm needs to balance exploration and exploitation. To this end, our algorithm uses optimistic distributions with a Bernstein-type adjustment, which encourages it to adaptively explore the distributions of the boxes.

  • Analytical perspective. The analysis of Guo et al. (2019) requires a sufficient and equal number of samples for each box. However, since our algorithm explores boxes adaptively, the sample size varies across boxes, and some boxes may have only a small number of samples. We therefore conduct a finer-grained study of the exploration process and show that a Bernstein-type concentration bound naturally captures its adaptiveness and delivers the desired regret guarantees.
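To make the contrast concrete, here is a minimal illustrative sketch of one possible Bernstein-type optimistic construction (the exact adjustment used in the paper may differ; the constant $c$ and the logarithmic factor are placeholders for a union bound over boxes, rounds, and support points). Given $m_{t,i}$ samples of box $i$ with empirical CDF $\hat{F}_{t,i}$, one can lower the CDF pointwise by an empirical-Bernstein-style width, which is equivalent to shifting probability mass toward higher outcomes:
$$\tilde{F}_{t,i}(v) \;=\; \max\Big\{0,\ \hat{F}_{t,i}(v) - \sqrt{\tfrac{2\,\hat{F}_{t,i}(v)\,(1-\hat{F}_{t,i}(v))\,\log(2nT^2)}{m_{t,i}}} - \tfrac{c\,\log(2nT^2)}{m_{t,i}}\Big\}.$$
With high probability $\tilde{F}_{t,i}(v) \le F_i(v)$ for all $v$, so the optimistic distribution stochastically dominates the true one; the pessimistic construction of Guo et al. (2019) instead raises the CDF, reallocating mass toward lower outcomes.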


W2 and Q2: It is interesting and natural to ask why the dependence on the number of boxes $n$ differs between the non-contextual and contextual problems in this paper, but the authors do not discuss this point. Could the authors explain why a similar algorithm gives a different dependence on $n$ in the regret bounds for the non-contextual and contextual problems?

A: The different dependence on $n$ arises primarily because, in the contextual setting, our analysis needs to bound the estimation error of the context-dependent means, which leads to a linear dependence on $n$. Unlike the non-contextual case, where the reward samples are i.i.d., in the contextual case the mean reward varies with the context, and the observed samples are not identically distributed across rounds. To ensure stochastic dominance, we construct an optimistic distribution by shifting the sample values in an optimistic manner. The regret introduced by this value-shifting scales linearly with $n$, and the technique that yields the $\widetilde{O}(\sqrt{nT})$ bound in the non-contextual setting does not extend to the contextual case.
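As a rough illustrative accounting (a sketch only, not the exact analysis in the paper), suppose each box $i$ keeps its own parameter $\theta_i$, design matrix $V_{t,i}$, and context $x_{t,i}$. The optimistic value shift for box $i$ in round $t$ is then of the order of the confidence width $\beta_t\,\lVert x_{t,i}\rVert_{V_{t,i}^{-1}}$ with $\beta_t = \widetilde{O}(\sqrt{d})$, and summing over rounds and boxes gives
$$\sum_{t=1}^{T}\sum_{i=1}^{n} \beta_t\,\lVert x_{t,i}\rVert_{V_{t,i}^{-1}} \;\le\; \beta_T \sum_{i=1}^{n}\sqrt{T\sum_{t=1}^{T}\lVert x_{t,i}\rVert^2_{V_{t,i}^{-1}}} \;=\; \widetilde{O}\big(\sqrt{d}\cdot n\sqrt{dT}\big) \;=\; \widetilde{O}(nd\sqrt{T}),$$
using Cauchy-Schwarz and an elliptical-potential-type bound per box. Each box contributes its own $\widetilde{O}(d\sqrt{T})$ term, whereas in the non-contextual case the i.i.d. samples allow the per-box contributions to be combined into a single $\sqrt{nT}$ factor.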

Comment

I thank the authors for their reply. I appreciate it, and highly recommend adding the above discussions to the final version.

Review (Rating: 5)

The authors study the Pandora's Box problem in a semi-bandit feedback setting, where the learning agent observes the realized rewards of the opened boxes in a round. The first main contribution is a regret bound of $\tilde{O}(\sqrt{nT})$, which improves upon the state of the art. The improvement is brought about by an application of a Bernstein-type concentration inequality for a tighter upper confidence bound in the OFU-based algorithm, as well as a more refined sensitivity analysis of the error term $\mathrm{Term}_{i,t}$ in Section 3.2. The authors also provide an improvement in a linear contextual generalization, and they improve the regret bound scaling from $T^{5/6}$ ($T$ is the number of rounds) to $T^{1/2}$.

Strengths and Weaknesses

Strengths:

  1. I find the overall theoretical contributions substantial and worthwhile. Although the algorithm design follows the well-established optimism-in-the-face-of-uncertainty principle, the careful analysis that leads to the $\sqrt{n}$ dependence is still an interesting technical contribution. The sensitivity analysis is tailored to the Pandora's Box setting and appears novel in my opinion.

  2. The proof sketch in Section 3.2 is quite clearly written, and will be of interest to researchers exploring further generalizations.

  3. In the linear contextual case, the improvement in regret bound is substantial.

Weaknesses:

  1. The boundedness assumption on the random noise in the linear contextual case seems limiting. In traditional linear bandits, it suffices to assume a bounded mean reward, while the noise can have unbounded support as long as it is sub-Gaussian. What is the fundamental barrier that forbids such a generalization?

  2. The authors assert that their regret bound of $O(\sqrt{nT})$ is tight, due to the regret lower bound of $\Omega(\sqrt{nT})$ in Gatmiry et al. [16]. My understanding is that Gatmiry et al. [16] focus on the purely bandit feedback case, meaning that after each round the learning agent only observes the net utility gained, but does not observe the individual rewards of the opened boxes. Can the authors confirm that the lower bound in Gatmiry et al. [16] also applies in the authors' semi-bandit feedback setting? If yes, the authors should provide more explanation. If not, the authors should emphasize the difference between the feedback models.

Questions

In addition to the question raised in Weakness point 2, I have an additional question and an additional suggestion:

Question: It appears to me that the construction of the empirical distributions in the linear contextual case is different from the non-contextual case. Can I check that, if I apply the algorithm in Section 4 to the non-contextual case (1 feature dimension per box), we end up with $O(n^2 \sqrt{T})$ regret, or is there some better way?

Suggestion: The authors' model could have some connection to the cascading bandit model:

https://arxiv.org/abs/1502.02763

While the two models involve semi-bandit feedback depending on the ordering of basic arms/boxes, they are quite different in terms of the reward function. Nevertheless, I believe the above work is worth mentioning.

Limitations

Yes

Justification of Final Score

I have read the response and am satisfied with the response on the lower bound. I recommend that the authors provide more details showing that the lower bound also applies to their feedback setting, to make things clearer.

Formatting Issues

None

Author Response

We thank the reviewer for the valuable and positive feedback. Below we address your concerns.


W1: The boundedness assumption on the random noise in the linear contextual case seems limiting. In traditional linear bandits, it suffices to assume a bounded mean reward, while the noise can have unbounded support as long as it is sub-Gaussian.

A: In fact, this assumption can be relaxed. For each box, if the noise distribution is sub-Gaussian, then the reward distribution is also sub-Gaussian, and our algorithms still work with minor modifications. Specifically, if the reward distribution of box $i$ is $K_i$-sub-Gaussian, one can set a range $[-K_i\sqrt{2\log(2T^2 n)},\, K_i\sqrt{2\log(2T^2 n)}]$. The algorithm then updates its parameters only if the received reward of each box $i$ falls within this range. On those rounds, the algorithm and its regret analysis coincide exactly with the bounded-reward setting. Meanwhile, by a union bound on sub-Gaussian tails, the probability that any reward ever exceeds its range is at most $1/T$. Therefore, its contribution to the expected regret is only $\widetilde{O}(1)$.
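For completeness, the $1/T$ failure probability follows from a standard sub-Gaussian tail bound plus a union bound (a sketch; $r_{t,i}$ denotes the reward of box $i$ in round $t$, notation introduced here for illustration):
$$\Pr\Big[\exists\, t\le T,\ i\le n:\ |r_{t,i}| > K_i\sqrt{2\log(2T^2 n)}\Big] \;\le\; \sum_{t=1}^{T}\sum_{i=1}^{n} 2\exp\big(-\log(2T^2 n)\big) \;=\; T\cdot n\cdot\frac{1}{T^2 n} \;=\; \frac{1}{T}.$$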


W2: Does the $\Omega(\sqrt{nT})$ lower bound in Gatmiry et al. [16] apply here?

A: The $\Omega(\sqrt{nT})$ lower bound in Gatmiry et al. [16] indeed applies to our setting, since their lower bound holds even in the full-information setting and therefore also in ours. Specifically, Gatmiry et al. prove the $\Omega(\sqrt{nT})$ regret lower bound by using an $\Omega(n/\epsilon^2)$ lower bound on the sample complexity. To see this, we reiterate their proof idea: under full-information feedback, if an online algorithm achieves $o(\sqrt{nT})$ regret, then one can use online-to-batch conversion to obtain an $\epsilon$-optimal policy with $o(n/\epsilon^2)$ sample complexity, which contradicts the $\Omega(n/\epsilon^2)$ lower bound.
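For concreteness, the conversion step can be spelled out as follows (a standard online-to-batch calculation, sketched here rather than quoted from Gatmiry et al.). Let $\bar\sigma$ be a policy drawn uniformly from $\{\sigma_1,\dots,\sigma_T\}$; then
$$\mathbb{E}\big[R(\sigma_D;D)-R(\bar\sigma;D)\big] \;=\; \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[R(\sigma_D;D)-R(\sigma_t;D)\big] \;=\; \frac{o(\sqrt{nT})}{T} \;=\; o\big(\sqrt{n/T}\big),$$
so driving this below $\epsilon$ requires only $T = o(n/\epsilon^2)$ full-information rounds, i.e., an $\epsilon$-optimal policy from $o(n/\epsilon^2)$ samples of the product distribution, contradicting the $\Omega(n/\epsilon^2)$ sample-complexity lower bound.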


Q1: Can I check that, if I apply the algorithm in Section 4 to the non-contextual case (1 feature dimension per box), we end up with an $O(n^2\sqrt{T})$ regret bound?

A: Our contextual setting uses a separate parameter $\theta_i$ for each box $i$, which subsumes the non-contextual case as a special instance. In the non-contextual case, it suffices to set the feature dimension $d=1$, leading to an $\widetilde{O}(n\sqrt{T})$ regret bound. While this is worse than the $\widetilde{O}(\sqrt{nT})$ bound of our dedicated non-contextual algorithm, the linear dependence on $n$ arises from the estimation error of the context-dependent means, and the technique used to obtain the tighter $\widetilde{O}(\sqrt{nT})$ bound does not carry over to the contextual setting. Whether the $\widetilde{O}(nd\sqrt{T})$ bound is tight remains an open question for future work.


Q2: The authors' model could have some connection to the cascading bandit model: ``Kveton et al., Cascading Bandits: Learning to Rank in the Cascade Model, ICML 2015''. While the two models involve semi-bandit feedback depending on the ordering of basic arms/boxes, they are quite different in terms of the reward function. Nevertheless, I believe the above work is worth mentioning.

A: We thank the reviewer for highlighting this interesting paper. We will make sure to include this reference in the revision.

Comment

Dear Reviewer oZNk,

Thank you for taking the time to review our paper. We wanted to check in to see whether you have any remaining questions or concerns following the rebuttal. We'd be happy to clarify or discuss further if needed.

Best,

Authors

Review (Rating: 5)

This paper studies an online variant of the Pandora's Box problem with $n$ boxes. In this problem, the player sequentially plays $T$ instances of the Pandora's Box problem, one in each round. The assumption is that the reward distributions are unknown to the player but remain fixed throughout the horizon. In this setting, the authors prove an $O(\sqrt{nT})$ regret bound against the best policy in hindsight, thereby improving the existing $O(n\sqrt{T})$ regret bound. Their algorithm first constructs an optimistic version of the empirical distribution of each box, and then computes thresholds by applying Weitzman's method to the optimistic distributions. They obtain the improved regret by sharpening a regret upper bound obtained earlier by Agarwal et al.
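For readers unfamiliar with the policy being mimicked, below is a minimal Python sketch of the threshold-based (Weitzman-style) rule into which optimistic empirical distributions can be plugged; the helper names and the bisection solver are illustrative, not taken from the paper.

```python
import numpy as np

def reservation_value(samples, cost, tol=1e-9):
    """Solve E[(X - sigma)^+] = cost for sigma over a sample-based
    (e.g., optimistic empirical) distribution; the map is continuous
    and non-increasing in sigma, so bisection suffices."""
    samples = np.asarray(samples, dtype=float)
    surplus = lambda s: np.mean(np.maximum(samples - s, 0.0)) - cost
    lo, hi = samples.min() - cost - 1.0, samples.max()  # surplus(lo) > 0 >= surplus(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if surplus(mid) >= 0:
            lo = mid
        else:
            hi = mid
    return lo

def weitzman_policy(reservation_values, costs, draw_reward):
    """Generalized Weitzman rule: open boxes in decreasing order of their
    thresholds; stop once the best observed reward beats every remaining
    threshold. draw_reward(i) reveals box i's reward upon opening."""
    order = np.argsort(-np.asarray(reservation_values))
    best, total_cost = 0.0, 0.0
    for i in order:
        if best >= reservation_values[i]:
            break                          # stopping rule
        total_cost += costs[i]             # pay to open box i
        best = max(best, draw_reward(i))   # semi-bandit feedback
    return best - total_cost               # realized utility of the round
```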

Strengths and Weaknesses

The refined analysis is technically novel and interesting. The authors also extend their result to a linear contextual setting and the prophet inequality. The paper is well-written, easy to follow, and the related work has been discussed adequately.

Questions

  1. Is it possible to establish an $O(\sqrt{nT})$ regret bound with high probability? The paper only proves this bound in expectation.
  2. The authors assume that the reward distributions have a bounded support. Can the results be extended to distributions with unbounded support?
  3. For completeness, it is better to briefly reproduce the arguments from [2] leading to Eqn (4).

Limitations

N/A

Justification of Final Score

The authors have addressed my concerns. I'll keep my positive score.

Formatting Issues

None

Author Response

We thank the reviewer for the positive feedback and the acknowledgment of our contributions. Below we address your concerns.


Q1: Is it possible to establish an $\widetilde{O}(\sqrt{nT})$ regret bound with high probability?

A: Yes. We can obtain a high-probability bound of $\widetilde{O}(\sqrt{nT})$ under the regret metric $\sum_{t=1}^T \big(R(\sigma_{D};D)-R(\sigma_{t};D)\big)$ via two modifications in our analysis. First, the LHS of Eq. (4) now has no expectation, but since $\sigma_t$ and $H_{t,i}$ are deterministic given the history, we have $R(\sigma_t;H_{t,i-1})-R(\sigma_t;H_{t,i}) = \mathbb{E}_t\big[R(\sigma_t;H_{t,i-1})-R(\sigma_t;H_{t,i})\big]$, where $\mathbb{E}_t[\cdot]$ is the conditional expectation given the history before round $t$. Then, we can follow the same analysis to arrive at $\sqrt{\sum_{i,t} \frac{Q_{t,i}}{m_{t,i}}}$. To handle this term, let $M_{t,i}=\frac{Q_{t,i}-I_{t,i}}{m_{t,i}}$, so that $\{M_{t,i}\}_{t}$ is a martingale difference sequence. By Freedman's inequality, with probability at least $1-\delta$, we have $\sum_{t} M_{t,i} \leq \sqrt{2V \log(1/\delta)} + \log(1/\delta)$, where $V= \sum_{t=1}^T \mathbb{E}_t[M_{t,i}^2]$. Notice that $V = \sum_{t=1}^T \frac{Q_{t,i}-Q_{t,i}^2}{m_{t,i}} \leq \sum_{t=1}^T \frac{Q_{t,i}}{m_{t,i}}$. Thus, we have $\sum_{t} M_{t,i} \leq \sqrt{2\log(1/\delta)\sum_{t}Q_{t,i}/m_{t,i}}+\log(1/\delta) \leq \frac{1}{2}\sum_{t}Q_{t,i}/m_{t,i}+ 2\log(1/\delta)$, where the last step uses the AM-GM inequality. Rearranging gives $\sum_{t=1}^T \frac{Q_{t,i}}{m_{t,i}} \leq O\big(\log(1/\delta)+\sum_{t=1}^T \frac{I_{t,i}}{m_{t,i}}\big)$. Therefore, by choosing $\delta$ properly and applying a union bound, with high probability the regret is bounded by $\widetilde{O}(\sqrt{nT})$.


Q2: The authors assume that the reward distributions have a bounded support. Can the results be extended to distributions with unbounded support?

A: For an arbitrary unbounded distribution, we believe the online Pandora's Box problem is intractable without further assumptions. The main source of hardness is that a box may yield an extremely large reward with very small probability. For example, consider two instances, each with a single box of cost $1$. In instance $I_1$, the box yields reward $2^T$ with probability $2^{-T+1}$ and reward $0$ otherwise. In instance $I_2$, the box yields the same reward $2^T$ with probability $2^{-T-1}$ and $0$ otherwise. The learner must decide whether to open the box. The optimal strategy in $I_1$ always opens the box, whereas the optimal strategy in $I_2$ never opens it. However, distinguishing these two instances is statistically hard within $T$ rounds, and thus online Pandora's Box with arbitrary unbounded distributions is intractable.
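Concretely, the expected utility of opening the single box differs in sign between the two instances:
$$I_1:\ 2^{T}\cdot 2^{-T+1} - 1 \;=\; 2 - 1 \;=\; 1 \;>\; 0, \qquad I_2:\ 2^{T}\cdot 2^{-T-1} - 1 \;=\; \tfrac{1}{2} - 1 \;=\; -\tfrac{1}{2} \;<\; 0,$$
so the optimal policy opens the box in $I_1$ but not in $I_2$, while any learner that opens the box sees the large reward $2^T$ with probability at most $O(T\cdot 2^{-T})$ over $T$ rounds and thus cannot reliably tell the two instances apart.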

However, if the reward distribution of each box is sub-Gaussian, then our algorithms still work with minor modifications. Specifically, if the reward distribution of box $i$ is $K_i$-sub-Gaussian, one can set a range $[-K_i\sqrt{2\log(2T^2 n)},\, K_i\sqrt{2\log(2T^2 n)}]$. The algorithm then updates its parameters only if the received reward of each box $i$ falls within this range. On those rounds, the algorithm and its regret analysis coincide exactly with the bounded-reward setting. Meanwhile, by a union bound on sub-Gaussian tails, the probability that any reward ever exceeds its range is at most $1/T$. Therefore, its contribution to the expected regret is only $\widetilde{O}(1)$.


Q3: For completeness, it is better to briefly reproduce the arguments from [2] leading to Eqn (4).

A: Thank you for the suggestion. We will reproduce those arguments in the next version.

Comment

I would like to thank the authors for their responses. The responses did not significantly change my perspective on the paper, thus, I will keep my current evaluation.

Review (Rating: 4)

This paper studies the online Pandora’s Box problem and proposes an algorithm that achieves a regret bound matching the known lower bound and improves over Agarwal et al. in the non-contextual setting. The authors then extend their approach to both the contextual linear setting and the online Prophet Inequality problem. The key idea is to mimic Weitzman’s optimal policy by constructing optimistic empirical distributions that stochastically dominate the unknown true distributions. This allows the algorithm to compute reservation values using Weitzman’s thresholding rule, while encouraging exploration.

Strengths and Weaknesses

Strengths:

This paper tackles the online Pandora’s Box problem and proposes a novel algorithm that improves over prior work in both regret performance and generality. By leveraging optimistic empirical distributions to mimic Weitzman’s optimal threshold-based policy, the authors match the known minimax regret lower bound in the non-contextual setting, improving over the prior regret bound from Agarwal et al. The theoretical contributions are technically sound and show clear improvement over existing methods, and the analysis is rigorous with detailed proofs included.

Weaknesses:

The main weaknesses of this paper lie in clarity and structure, which make the contributions harder to interpret than necessary:

Some terms (e.g., full-information feedback, magnitude of the utility function) lack explanations.

Several naming choices could be improved: e.g., “Term” in Equation (4) is vague and could be replaced with something more meaningful.

The thresholds in Weitzman’s algorithm could be referred to by more meaningful standard terminology, such as reservation values, Weitzman’s indices, or Gittins indices, which are widely used in prior work.

The paper does not clearly distinguish between previously known and newly introduced formulations and results. For example, Section 3 should be titled "non-contextual online Pandora’s Box" to reflect the setting.

Theorem statements and algorithm descriptions lack sufficient context. For example, Algorithm 2 is used for the online problem, but is simply labeled for “Pandora’s Box”, which could be misread as referring to the classic setting. Similarly, Algorithm 1 is essentially Weitzman’s original policy for the classic setting, but this is never explicitly stated; it could be deferred to the appendix to prioritize space for new contributions such as Algorithm 5. Parameters such as $\delta$ are not referred to by meaningful names in all theorem statements.

The problem formulation is fragmented: the assumptions, utility function definitions, and cumulative regret notions are introduced in separate places, making it difficult for readers to see the overall setup. A unified and clearly structured formulation section, distinguishing old and new contributions, would greatly improve clarity. It is also unclear why the contextual-linear setting is defined on a $[-1/4,1/4]$ support; this choice is not motivated or explained.

Moreover, the benchmark definitions used for regret are not clearly stated across settings. For example, the paper mentions that Gergatsouli and Tzamos consider a weak benchmark—but what precisely is that benchmark? Similarly, what is the optimal policy benchmark considered in Gatmiry et al.? What is the benchmark used in this work for both the non-contextual and contextual cases?

Finally, the proofs in the main text could be better structured by emphasizing high-level intuition and key steps, rather than presenting long sequences of inequalities that would be more appropriate in the appendix.

Questions

Why is the contextual-linear setting defined on a $[-1/4,1/4]$ support?

Why is the principle of optimism under uncertainty helpful for improving the regret guarantee in this work?

What is the weak benchmark considered by Gergatsouli and Tzamos? Similarly, what is the optimal policy benchmark considered in Gatmiry et al.? What is the benchmark used in this work for both the non-contextual and contextual cases?

Limitations

The limitations in the assumptions of the contextual linear setting compared to real-world motivations could be mentioned.

Justification of Final Score

After reading the responses, I agree the technical contribution is significant but can be stated more clearly. The writing also needs much improvement.

Formatting Issues

No major formatting issues.

Author Response

We thank the reviewer for the valuable feedback. Below we address your concerns.


W1: Improvement of clarity and structure.

A: Thank you for the suggestions; we will edit the paper accordingly. In particular, we will better emphasize the intuition for the proofs in the main text and defer technical details to the appendix. That said, several of your comments are very minor and can be easily addressed. In fact, it is worth noting that all other reviewers commented and agreed that the paper is well-written. While we appreciate your suggestions and will revise accordingly, we sincerely hope that you could consider raising the score given that the main weakness you pointed out pertains to clarity and structure.


W2: Algorithm 1 is essentially Weitzman’s original policy for the classic setting, but this is never explicitly stated.

A: Algorithm 1 is a general version of Weitzman's algorithm since it accepts any threshold vector as an input. Algorithm 1 coincides with the classical Weitzman algorithm precisely when those thresholds are chosen optimally for the true underlying distributions, as noted in Section 2.


Q1: Why is the contextual-linear setting defined on a $[-1/4,1/4]$ support?

A: This choice is made purely for simplicity, avoiding the need to re-derive some standard results under a different scaling. In fact, both our algorithm and analysis extend easily to any $[-K,K]$ support by simply replacing every $[-1/4,1/4]$-bounded argument with its $[-K,K]$-bounded analogue. We will clarify in the revision that $[-1/4,1/4]$ is chosen for convenience and to keep the paper concise by citing existing results where possible.


Q2: Why is the principle of optimism under uncertainty helpful for improving the regret guarantee in this work?

A: Our contribution does not stem from the optimism-under-uncertainty principle itself, which Agarwal et al. (2024) already apply, yielding only an $\widetilde{O}(n\sqrt{T})$ regret bound. While we build on their framework, the improvement derives from a different construction of the optimistic distribution (shifting probability mass based on a Bernstein-type bound) and, more importantly, from our highly refined analysis. We highlight these differences in algorithm design and analysis in Section 3.1 and Section 3.2, respectively.


Q3: What is the weak benchmark considered by Gergatsouli and Tzamos? Similarly, what is the optimal policy benchmark considered in Gatmiry et al.? What is the benchmark used in this work for both the non-contextual and contextual cases?

A: Since Gergatsouli and Tzamos (2022) allow the distributions of the boxes to be chosen adversarially across rounds, competing against the optimal strategy is intractable. Therefore, to ensure sublinear regret, Gergatsouli and Tzamos (2022) introduce two weak benchmarks: (1) a non-adaptive benchmark that opens the same fixed set of boxes in every round, and (2) a partially-adaptive benchmark that may choose a different set in each round based on the algorithm's past choices. On the other hand, Gatmiry et al. (2024) and our work (in both the non-contextual and contextual settings) use the optimal strategy as the benchmark, namely Weitzman's algorithm with full knowledge of the product distribution at each round. We will clarify this in the next version.

Comment

Thanks for your clarifications. I think you can make it clearer that, compared to prior work, your work introduces structural assumptions that make comparison with the optimal policy and a sublinear guarantee achievable, if I understand correctly. While I agree the technical contribution of your work is significant, I do think the writing needs improvement. That said, I'm happy to raise my score.

Final Decision

The paper studies an online version of the classic Pandora's Box problem and improves the regret bound of prior work. Overall, the reviewers are positive about the submission and find the paper's technical contributions interesting. I would also recommend that the authors include some of the clarifying discussions from the rebuttal in the final version of the paper.