Feasible Action Search for Bandit Linear Programs via Thompson Sampling
An efficient method based on Thompson Sampling to find feasible actions for LPs with bandit feedback
Abstract
Reviews and Discussion
This paper studies the linear feasible action search problem, where the goal is to find a point in the input space such that some linear constraints are satisfied. In each round the learner can query a point and observe a noisy version of its constraints. The authors provide a Thompson sampling approach to this problem and provide theoretical guarantees on its performance, both in the standalone feasible action search problem and when applied to the safe linear bandit problem.
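For concreteness, a hedged formalisation of the setup just summarised; the symbols below ($\mathcal{X}$, $A$, $M$, $\Delta^*$) and the convention $a_i^\top x \le 0$ are our own generic choices, not necessarily the paper's notation:

```latex
% Hedged sketch of the FAS setup: unknown A with rows a_1, ..., a_m;
% at round t the learner queries x_t in X and observes y_t = A x_t + noise.
\[
  M(x) \;=\; \min_{1 \le i \le m} \bigl(-a_i^{\top} x\bigr),
  \qquad
  \Delta^{*} \;=\; \max_{x \in \mathcal{X}} M(x).
\]
% Goal: with high probability, either output x with M(x) close to
% Delta^* (when Delta^* > 0), or declare the constraints infeasible
% (when Delta^* < 0).
```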
Questions for Authors
- Theoretical and experimental comparison to Gangrade et al (2024b).
Claims and Evidence
- Some hyperparameters of the algorithm are not clearly specified (e.g. omega).
- From the writing of the paper, I do not understand if the noise distribution is inflated.
- Theorem 8 is not written very coherently. The sentence below says that it reduces to something much more interpretable, which is not clear to see. Perhaps it would be better to just state the more interpretable condition?
- On pg7 there is a discussion on the mild dependence on m, but m is not clearly defined here.
Methods and Evaluation Criteria
- There are no experiments in the main paper. Experimental results are shown in the appendix, but these are just comparisons of their algorithm under different parameter settings. The algorithm run in these experiments seems to have a slightly different stopping time; it would be good to see how the original algorithm performs.
- Most importantly, it would be interesting to see an experimental comparison to Gangrade et al (2024b).
Theoretical Claims
- The statement and proof of Lemma 11 contain many typos.
- Lemma 5 – it is not clear what 'under ball' means. Not enough detail was provided to verify the proof sketch of Lemma 5. Looking at the full proof of Lemma 5 in B.1, there were also some steps I could not verify, e.g. line 858.
- I tried to verify the proof of Theorem 8 but could not. I particularly struggled with lines 1019-1025.
Experimental Design and Analyses
I do not follow the motivation for the experimental setup, nor do I really understand what all the results are showing. It would be good to clarify the writing of Appendix A and give motivation for all parameter choices.
Supplementary Material
I looked quickly at the experimental results and some of the proofs.
Relation to Broader Literature
- More referencing of related work that this paper builds upon would be good, e.g. line 55 the authors state that their approach builds on a recent bandit feasibility test but do not cite the work proposing this.
- The authors mention that their approach builds on prior work by Gangrade et al (2024b) but the comparison of the results in this paper to those results is not sufficient. The authors claim their approach is more computationally efficient (which I can believe), but they do not mention how their theoretical bounds compare to those in Gangrade et al (2024b). Additionally, it would be good to see experiments to demonstrate the claimed efficiency gains.
Essential References Not Discussed
Missing discussion of Bayesian analyses of TS, e.g. Russo & Van Roy.
Other Strengths and Weaknesses
The introduction included a lot of notation which was not always defined. This made reading it quite difficult.
It would be good to give specific examples of where it is not possible to know a safe action a priori – this doesn’t seem like too strong of an assumption in many cases (e.g. often there is prior data available). However, it appears to be one of the key motivations of this work so some clear examples would be good.
Other Comments or Suggestions
Proofread carefully and make sure all notation is defined (and also include reminders to save having to scroll pages to find something hidden in the middle of a sentence).
Our thanks for your work.
Comparison to Gangrade et al 2024b. Thanks for pointing out these omissions, which we will fix.
- Theoretical: Gangrade et al. 2024b only study testing. We can extend it to FAS via something like our Lemma 7. In this case, one obtains bounds both (a) for their Algorithm 1, and (b) for the game relaxation proposed on their page 9.
- These are similar to Theorem 9, up to an extra factor for us relative to (a). This factor shows up in all known efficient methods (line 347, col 2).
- While they don't talk about safety costs, similar bounds to ours should also hold for them.
- Experimental: As discussed starting at line 150, col 1, the method of Gangrade et al. is inefficient. In particular, they propose choosing actions by solving matrix games in each round (page 9 of their paper). In the setting of the simulations in Appendix A, our method takes milliseconds per round on a laptop, whereas their proposal would need to solve so many games per round that it would literally take years!
- We will include a comparison for small problem sizes.
- However, making their method practical at nontrivial scales is an open problem beyond the scope of our submission. This would require studying how to appropriately relax and solve the optimisation problem described in (our) line 160, col 1.
- In particular, we want to point out that even simple ideas like nets don't work: one would need a net of fine resolution, which means exponentially many games per round; this is prohibitive even in low dimensions. A back-of-the-envelope version of this count is sketched below.
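The count in question (a standard covering-number fact, with the resolution $\rho$ as our own placeholder rather than the paper's symbol):

```latex
% Standard covering bound, not taken from the paper: a net of resolution
% rho on a d-dimensional unit box has cardinality
\[
  N(\rho) \;\asymp\; (1/\rho)^{d},
\]
% so a net-based relaxation must solve on the order of (1/rho)^d matrix
% games per round; exponential in d, hence prohibitive even for small d.
```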
Claims & Evidence: We disagree that terms are not defined, but are happy to clarify.
- $\omega$: This is defined on line 184, col 2, and used throughout the paper. It is the standard radius of the confidence ellipsoid.
- Noise distribution: The underlying noise distribution is stated in Theorem 9. This is indeed "inflated" in the sense of the common terminology for linTS (although our paper is not about linTS).
- $m$: Recall that $A \in \mathbb{R}^{m \times d}$ (line 30 col 2, line 162 col 2). So $m$ is the number of rows of $A$, or equivalently, the number of unknown constraints. Line 334, col 1 (just before Theorem 8) explicitly says this.
- Theorem 8: We disagree. This theorem first describes a condition on a base measure. It then defines a second measure as the law of the sampled perturbation (equivalently, a pushforward of the base measure under a linear map), and says that this pushforward satisfies a CO condition with parameters instantiated in terms of the condition on the base measure. The CO condition itself is described in Def. 4, and forms one of the central technical concepts of the paper.
Theoretical Claims:
- "Under an event": as stated on line 154, col 1, an inequality is said to hold under an event if it holds once both sides are multiplied by the indicator of that event. The statement of the lemma defines the relevant event, and the inequality of line 267, col 2 holds once multiplied by the appropriate indicator. This "under an event" terminology is used to lighten the visual load of expressions in the cramped two-column format.
- Line 858: by definition, the minimiser attains the smallest value of the objective over the domain, so under the event in question this inequality is trivial.
- Theorem 8 proof, lines 1019-1025: There is a typo in line 995: relative to the statement of Theorem 8, there is an extra sign in that line. If this is corrected, things should make sense.
- The display in 1019-1021 opens up the quantity of interest. Then, by the discussion preceding 1019, we find a condition which implies local optimism (the relevant event). But the condition on the sampling measure, and the independence of the sample from the history, directly mean that this event has the claimed chance given the history (which fixes the conditioning). Finally, the norm of the rows of the sampled matrix is just the norm of the sampled vector, which is also controlled by the condition on the measure.
- Typo in Lemma 11: thanks; we will correct the statement.
- We will make sure to proofread the paper closely, and ensure that symbols are reiterated in their context to smooth over the issues you describe.
Relation to broader literature:
- Line 55: This is the work of Gangrade et al. 2024b (discussed at line 150, col 1); we will add the explicit citation there.
- Theoretical bounds of Gangrade et al. 2024b: see above.
Rationale of the experiments: In a nutshell, Appendix A attempts to build and study a practical methodology out of the theoretical study in the main text. The plan is discussed in detail on page 11. We will both discuss the early-stopping trick in the main text, and include plots without it.
Thanks for clarifying some of my confusion about notation. I will keep my score the same as I still think experimental comparison to Gangrade et al 2024b would be good, and have not yet seen the results. I hope in the revised version the authors also make it explicit that Gangrade et al 2024b have improved theoretical guarantees.
Apologies - we didn't realise that you wanted to see the experimental comparison. A pilot version of this is included below.
We ran an implementation of the method of Gangrade et al. in the setting of their experiment with varying margin, following their Section 5, and matched their parameter settings, including the noise standard deviation. However, unlike their experiment, we take the domain to be a box (they looked at a ball domain instead), which allows us to be slightly faster in solving the games.
Within this setup, we studied the two scenarios that Gangrade et al. look at (up to the ball-to-box domain change):
- We study an infeasible scenario, with a pair of constraints chosen so that the instance is infeasible with a controlled (negative) margin.
- We study a feasible scenario, with a corresponding pair of constraints; again, the margin is controlled. We set the approximation level in this case to a fixed value. If accepted, we will expand this to studying various approximation levels, and more granular variation in the margin.
Each experiment was carried out 50 times, and we studied four margin values in both scenarios. Since you had indicated that you wanted to see results without early stopping, we do exactly that: both methods are run without the early-stopping heuristic. Note that both methods would be sped up by this heuristic.
Observations. Firstly, every run of both methods was correct - this indicates that both methods are too conservative. However, surprisingly, not only does FAST take less time per round, it also takes significantly fewer rounds in total! This can partly be expected: instead of (relaxed) linUCB/OFUL, FAST uses Thompson-sampling-based exploration, and so should enjoy the improved behaviour usually seen for TS. The difference becomes even starker upon realising that the stopping time for FAST improves quadratically in the advantage that TS has (see the sketch below). This, coupled with the ultimately quite lossy relaxation used for EOGT (Gangrade et al.), might explain the improvement.
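To spell out the quadratic effect (a heuristic sketch in our own generic symbols, not a claim from the paper):

```latex
% Heuristic: suppose a method stops roughly when its confidence width
% w_t ~ c * sqrt(d / t) falls below the margin Delta. Then
\[
  w_{t^{*}} \approx \Delta
  \quad\Longrightarrow\quad
  t^{*} \;\approx\; \frac{c^{2} d}{\Delta^{2}},
\]
% so shrinking the effective constant c by a factor kappa (the "TS
% advantage") shrinks the stopping time by a factor kappa^2.
```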
Mean stopping time for the infeasible case (mean over 50 runs, rounded to the nearest whole number); columns correspond to the four margin values studied, in increasing order:

| Method | | | | |
| --- | --- | --- | --- | --- |
| Ours | 887 | 170 | 58 | 24 |
| EOGT-based | 14436 | 2840 | 1031 | 374 |
Mean stopping time for the feasible case (mean over 50 runs, rounded to the nearest whole number); columns as above:

| Method | | | | |
| --- | --- | --- | --- | --- |
| Ours | 1000 | 224 | 95 | 39 |
| EOGT-based | 16472 | 3809 | 1598 | 618 |
Computational Costs: note that ultimately EOGT requires solving a large number of games per round, and takes roughly sixteen times more samples, so in total it costs vastly more than our method in this setup. To keep this simulation cheap, instead of using a library linear-programming method, we hard-coded the fact that the action lies in a box domain, which reduces each round to optimising over choices of the game weight (further using the fact that the weight is essentially one-dimensional, since there are only two constraints and the weights sum to one). This took roughly one hour for the whole EOGT-based simulation, while the whole simulation for our method took about 2 seconds in total. A minimal sketch of this shortcut follows.
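A minimal sketch of that shortcut, under our own assumptions (constraints of the form $a_i^\top x \le 0$, box domain $[-1,1]^d$, and a uniform grid over the game weight); the function names and grid size are illustrative, not the actual implementation:

```python
import numpy as np

def max_over_box(c, lo=-1.0, hi=1.0):
    """Maximise c @ x over the box [lo, hi]^d; attained coordinatewise."""
    x = np.where(c >= 0.0, hi, lo)
    return float(c @ x), x

def two_constraint_margin(a1, a2, n_grid=101):
    """Approximate the game value max_x min(-a1 @ x, -a2 @ x): by the
    minimax theorem this equals the min over weights lam in [0, 1] of the
    closed-form box maximisation of the mixed objective."""
    best = np.inf
    for lam in np.linspace(0.0, 1.0, n_grid):
        c = -(lam * a1 + (1.0 - lam) * a2)  # mixed, negated constraint row
        val, _ = max_over_box(c)
        best = min(best, val)  # outer minimisation over the weight
    return best

# Example: two opposing constraints on the box [-1, 1]^2 give margin 0.
print(two_constraint_margin(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))
```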
This paper studies the problem of identifying a point that satisfies a set of linear constraints, using only noisy observations of the linear constraints. They give an algorithm that identifies a strictly feasible point after a number of rounds governed by the largest safety margin of any action. They then consider the safe linear bandit setting and show that, by combining their algorithm with standard safe linear bandit algorithms, they can guarantee bounds on regret and cumulative violation (hiding all non-leading factors).
Update after rebuttal
I maintain my positive score. The authors addressed the minor concerns that I had.
Questions for Authors
- The problem setup states that the algorithm only needs to handle the cases where the optimal margin is either strictly less than zero or strictly greater than zero. As such, this doesn't handle the case where the optimal margin is exactly zero. I think it is important for the paper to discuss this case, as the problem is still feasible then.
- What is the advantage of interpreting the minimum of the linear functions via the zero-sum game formulation? This is mentioned several times and appears in the algorithm design, but at the end of the day it looks like the parameter distribution is chosen to be the same for each row of the constraint matrix (actually each row uses the same realization). Couldn't we have gotten the same result by treating each of the rows as separate linear functions and then taking the minimum, as is done in the regret-minimization safe linear bandit literature (e.g. Pacchiano et al. 2021)? I agree that the paper's approach is elegant, but the game-theoretic formulation also seems to add more difficulty in understanding.
- In the application to safe linear bandits in Corollary 10, the paper applies the algorithm LC-LUCB from Pacchiano (2024). To my knowledge, LC-LUCB requires a safe action and the constraint value at that action. However, the paper only passes a safe action and the margin lower bound. I believe it needs to be clarified how exactly LC-LUCB was applied here.
Claims and Evidence
Overall, the claims look okay. I am looking for some clarification on the guarantees for the safe linear bandit setting, as detailed in #3 of the Questions box.
Methods and Evaluation Criteria
The approach of using Thompson sampling for efficient decision-making is sensible. Evaluating the approach with stopping time, safety cost, regret, and risk makes sense.
Theoretical Claims
My only possible concern with the theoretical claims is the proof of Corollary 10, which I detail in #3 of the Questions box.
Other than this, I did not check the proofs in close detail, but the theoretical results are consistent with what I would expect.
Experimental Design and Analyses
There are no experiments.
Supplementary Material
I only looked at the proof of Corollary 10 in the appendix.
Relation to Broader Literature
To my understanding, the paper has the following contributions to the literature:
- An efficient algorithm to the problem of identifying a feasible point of linear constraints with noisy feedback.
- The analysis approach extends the linear TS approach of Abeille and Lazaric (2017) to this new setting.
- An algorithm for safe linear bandits with regret and risk guarantees without knowing a feasible point. I think this contribution is more minor, given that a simple extension of Gangrade et al. (2024b) would give the same guarantees.
Essential References Not Discussed
None that I can think of.
Other Strengths and Weaknesses
No others
Other Comments or Suggestions
No others
Our thanks for your work.
Questions:
- Corollary 10: This is indeed an important point, thanks for your careful reading. The issue is resolved by Remark 11 (page 8) and Appendix F.1 (page 47) of the paper of Pacchiano et al. (https://arxiv.org/pdf/2401.08016).
- In short, Remark 11 summarises how it is enough for them to have a lower bound on the safety margin of the safe action that is only a constant factor off. In our case, the margin of the output is sandwiched by the stopping condition, so we get exactly such a constant-factor approximation. There is, however, a small modification needed (as discussed by Pacchiano et al.), in that the orthogonal-projection step in LC-LUCB cannot be carried out with only a lower bound. This does not affect the regret bounds.
- We will expand upon the discussion of Section B.4 to cover this aspect.
- We also wanted to point out that in fact most such methods for SLBs only really need a lower bound on the margin of the safe action (and the regret then scales with this bound). Intuitively, the point is just that this bound yields an initial exploratory set, and determines the rate of expansion of the safe set estimate. LC-LUCB is a bit of a special case because of the orthogonal projection step they take.
- The case of zero optimal margin: We will be happy to discuss this. The short answer is that in this case no method can operate in finite time.
- As an example, consider a single unknown mean with standard Gaussian feedback noise. The question then becomes one of testing whether the mean is nonpositive or positive using samples from the corresponding Gaussian. But with any finite number of samples, we cannot reliably distinguish a zero mean from an arbitrarily small positive one, and so we can never reliably conclude the procedure in finite time. This simple scenario embeds into richer situations, and so if the optimal margin is exactly zero, we can never be sufficiently sure of this fact, and so never stop in finite time. This is also reflected in the lower bound (line 340, col 2), which diverges as the margin tends to zero. (The standard calculation behind this is sketched after this list.)
- Rows instead of the game: Yes, we can write things this way with no change. The reason the game formulation was introduced was to give ourselves the option of using the minimax theorem when working out the results. This (surprisingly) ended up not being needed, but by the time we figured this out, much of the paper was already written in this language, and it was too late to rewrite all of it.
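The standard calculation behind the zero-margin example above (generic notation, ours rather than the paper's):

```latex
% Testing mu = 0 against mu = eps > 0 from N(mu, 1) samples: the
% per-sample divergence is
\[
  \mathrm{KL}\bigl(\mathcal{N}(0,1) \,\big\|\, \mathcal{N}(\epsilon,1)\bigr)
  \;=\; \frac{\epsilon^{2}}{2},
\]
% so any test that is reliable at a fixed confidence level needs on the
% order of 1/eps^2 samples; letting eps -> 0 forces the stopping time of
% any sound procedure to diverge.
```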
Relation to broader literature: while we mostly agree with your reading, we would like to point out that although an extension of Gangrade et al. 2024b along the lines of our submission would do this job, the resulting method is completely impractical.
- Their proposed relaxation (page 9 of their ICML paper) requires solving a very large number of matrix games in each round. In the setting studied in our Appendix A, each round would take years on a laptop, while our method runs in milliseconds.
I appreciate the authors' detailed and clear response. I maintain my positive score.
This work studies Feasible Action Search problems with linear constraints, in which a learner aims to find a point with maximal safety margin. The learner may declare that the constraints are infeasible if it fails to find a feasible point.
The authors suggest an algorithm called FAST. It is based on Thompson sampling: in each time period, it randomly perturbs the estimated constraint matrix by adding noise, and then selects a point that maximizes the safety margin with respect to the perturbed constraint matrix. The algorithm keeps track of an upper confidence bound on the optimal safety margin and a lower confidence bound on the progressive mean of safety margins, and employs a stopping rule that declares infeasibility when the upper confidence bound drops below zero, or outputs the average of attempted points when the lower confidence bound becomes sufficiently close to the upper confidence bound. Under a carefully chosen noise distribution for the random perturbation, the algorithm finds a near-optimal point within a bounded number of interactions with high probability, nearly matching the lower bound.
The algorithm FAST can also be useful for SLB problems, in the sense that by running FAST in an initial phase the learner can find a feasible solution, and then run SLB algorithms starting from the discovered feasible solution.
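To make the mechanics concrete, here is a schematic sketch of one such round and of the stopping test, reconstructed only from the summary above. All names, the shared Gaussian perturbation, and the exact form of the tests are our own illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def fast_round(A_hat, V, beta_t, argmax_margin, rng):
    """One TS-style round: perturb the constraint estimate inside the
    confidence ellipsoid, then play the maximin (safest) point."""
    d = A_hat.shape[1]
    root_V_inv = np.linalg.cholesky(np.linalg.inv(V))  # ellipsoid shape
    z = rng.standard_normal(d)
    # One shared perturbation added to every row of the estimate
    # (the review above notes all rows use the same realisation).
    A_tilde = A_hat + beta_t * (root_V_inv @ z)
    return argmax_margin(A_tilde)  # x maximising min_i(-a_i @ x)

def should_stop(ucb_opt_margin, lcb_mean_margin, alpha):
    """Stopping sketch: infeasible if even the optimistic margin is
    negative; accept (output the running average of attempted points)
    once the tracked lower bound is within a multiplicative factor
    alpha of the optimistic bound."""
    if ucb_opt_margin < 0.0:
        return "declare-infeasible"
    if lcb_mean_margin >= (1.0 - alpha) * ucb_opt_margin:
        return "output-average"
    return None  # keep sampling
```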
Questions for Authors
- Will the UCB version of the suggested algorithm work? I am imagining a UCB-like algorithm that runs without random perturbation. More specifically, the algorithm may pick the query point optimistically within the confidence ellipsoids for the constraint rows, by solving the corresponding optimistic maximin problem (a reconstruction of this selection rule is sketched after this list). I guess that it could be easier for the authors to analyze this UCB version…
- Does the notion of local optimism lead to over-exploration? Since the occurrence of local optimism ensures the occurrence of global optimism, making local optimism occur sufficiently often may induce global optimism to occur too often.
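For reference, our reconstruction of the selection rule the first question gestures at (generic symbols; $\mathcal{C}_{i,t}$ denotes a per-row confidence ellipsoid, and the sign convention matches the margin $\min_i(-a_i^\top x)$):

```latex
% Reconstructed UCB-style (perturbation-free) selection rule:
\[
  x_t \;\in\; \arg\max_{x \in \mathcal{X}} \;
  \min_{1 \le i \le m} \;
  \max_{\theta \in \mathcal{C}_{i,t}} \bigl(-\theta^{\top} x\bigr),
\]
% i.e. be optimistic within each ellipsoid before taking the pessimistic
% minimum over constraints; the authors' response notes that this inner
% max-min-max step is the computationally hard part.
```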
Claims and Evidence
This paper mainly focuses on making theoretical claims. The proofs look solid to me.
Methods and Evaluation Criteria
I believe that the suggested algorithm is sensible -- it can also be implemented in practice.
However, the choice of performance measure is not completely convincing to me. The suggested algorithm aims to maximize the safety margin, whereas the eventual goal would be to find one feasible action.
Theoretical Claims
I carefully read the statements and the discussions provided in the main body of the paper, and I couldn’t find any suspicious claim. Although I am not that familiar with the random perturbation technique, I was able to understand the technical novelty made by the authors (global optimism vs. local optimism).
Experimental Design and Analyses
Some simulation study is included in the appendix. Although I couldn't check all the details, the experiment was designed reasonably, and the behavior/performance of the algorithm is well reported.
Supplementary Material
The supplementary material was not submitted.
Relation to Broader Literature
The authors have already discussed the relation to a wide range of literature, namely, Best Arm Identification, Best Safe Arm ID, Minimax and Pareto Bandits, Feasibility Testing, etc. I guess that this work may also be related to online linear programming (e.g., Li and Ye (2021), Online Linear Programming: Dual Convergence, New Algorithms, and Regret Bounds), but I am not certain about the connection.
Essential References Not Discussed
None.
Other Strengths and Weaknesses
This work adopts the safety margin as the learner's objective. This choice allows the authors to utilize the techniques developed in minimax/Pareto bandits, deviating from the existing work on safety bandits. This would be a unique feature of this work, but further justification is needed. Why do we need the safest action if our goal is to optimize another objective eventually? Perhaps the problem should be called "Safest Action Search" instead of "Feasible Action Search".
Other Comments or Suggestions
See Questions for Authors.
Our thanks for your work.
- Max-safety-margin vs. only feasible: Thanks for bringing up this important point. The short answer is that our setup already accommodates the viewpoint you express, through our use of the multiplicative approximation.
- Indeed, any feasible point has a nonnegative safety margin, and so satisfies our notion of multiplicative approximation at the coarsest approximation level. So, if we just run FAST at this level, we will stop with a feasible action in correspondingly less time, gaining a factor in the stopping time.
- We note that the stopping-time bound above is just as tight as the general bound in Theorem 9. By the same reasoning as in line 340, col 2, finding a feasible point is equivalent to finding an approximate optimum in a low-regret problem, and so needs a comparable number of samples. We lose a factor that appears in all known efficient methods (line 346, col 2).
- Finding a point with near-maximum margin of course has practical applications, mainly in terms of building resiliency. For instance, a manufacturer balancing various quality and cost constraints would want the solution to remain feasible under small perturbations of the process (translating to perturbations of the constraints in our setting). This would be true for a point with a comfortably positive safety margin, but not so if the margin is close to zero. The approximation level captures a tradeoff between the cost of identifying such a point and its resilience.
- The application of optimising a different objective later is captured well by the low-regret SLB problem discussed in Section 4. Here, in fact, positive margin is a necessity, since this margin determines the rate at which information can be accrued in the early rounds (also see the lower bound due to Pacchiano et al. 2021). This is reflected in the regret bounds scaling with the margin of the initial safe action. Our proposed solution improves upon this general bound: we pay an extra constant factor in the exploration costs to end up with a point with close-to-optimal margin, which in turn improves the regret by a potentially large factor (Line 80, Col 1). If instead we had stopped with near-zero margin, the regret would suffer.
- Of course, here the exact optimal margin is not necessary, which is why Corollary 10 fixes a constant approximation level for this application.
- We do want to say that your point is well taken, and these aspects should indeed have been clarified and discussed in more detail in the paper. If accepted, we will include something like the above to clarify these issues.
- UCB version: yes, this works, but is computationally infeasible.
- This method was proposed as a test by Gangrade et al 2024b (and can be extended to FAS via a tracking result similar to our Lemma 7).
- However, as discussed in the paragraph starting at line 150, col 1, for continuous domains this technique needs to solve a large number of games in each round. This makes it completely impractical even in the setting we simulated in Appendix A. For instance, our method runs in milliseconds per round, but this method would take years. Finding an efficient UCB version of such algorithms is an open problem.
- Does local optimism lead to over exploration? This may indeed be true, which is why the main CO condition (Def. 4) is formulated in terms of global optimism.
- However, we note that our bounds on the optimism rate match those of prior work on single-objective TS, even though they are studying global optimism.
- To be explicit: under the same conditions on the noise under which prior work shows global optimism at a given rate, our coupled construction gives local optimism at the same rate. This suggests that new techniques are needed to control global optimism rates well, not only for FAST but for linTS in general. This is beyond the scope of our submission.
- Relation to online linear programming: while interesting, this appears to be a quite different problem. In the setup of the reference, the constraints are revealed one by one (and completely), while in our case constraints are never directly revealed, but we can noisily measure (all of) their values at queried points. Of course, the other important difference is that they aim to optimise, while we aim at maximin points. Nevertheless, we will read this work thoroughly to see if it is appropriate to discuss.
The paper studies the feasible action search problem for linear bandits. In particular, the goal of the learner is to identify a point in a convex set that satisfies a set of linear constraints. The learner repeatedly interacts with the environment and receives as feedback the values of all the constraints at the played action, with some additional noise. The learner should stop when it identifies an action that almost maximizes the safety margin, or determines that the system of linear inequalities is infeasible. The approach is based on Thompson sampling. The authors prove that the algorithm stops within a bounded number of rounds while controlling cumulative constraint violation, with both quantities governed by the optimal safety margin. It detects feasibility or identifies a multiplicative approximation of an optimal solution. The designed algorithm can be exploited to derive novel results for regret minimization in linear bandits with safety constraints. In particular, it can be used to remove the assumption that the learner knows a safe action.
Questions for Authors
None.
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes.
Theoretical Claims
No.
Experimental Design and Analyses
No.
Supplementary Material
No.
Relation to Broader Literature
The problem under study is novel, even though it has many connections with other works extensively discussed in the paper.
Essential References Not Discussed
The literature on primal-dual methods for online learning with long-term constraints is closely related to low-regret SLBs. There, one crucial component is the estimation of Slater's parameter, which is essentially your optimal safety margin. Probably the most closely related work is [1], which derives a general primal-dual method (applicable to linear bandits) to learn a strictly feasible solution. Even if not explicitly stated, their results should imply some constant (horizon-independent) bounds for your setting.
[1] Castiglioni, M., Celli, A., Marchesi, A., Romano, G., & Gatti, N. (2022). A unifying framework for online optimization with long-term constraints. Advances in Neural Information Processing Systems, 35, 33589-33602.
Other Strengths and Weaknesses
The paper studies the important problem of learning safe strategies in linear bandits. This is a crucial component for designing regret minimization algorithms for linear bandits. Moreover, the proposed algorithm is efficient and the theoretical analysis is involved. Finally, the work has interesting connections with other problems as discussed in the comprehensive analysis of related works.
Other Comments or Suggestions
Line 29, right column: typo in the displayed symbol.
Our thanks for your work. Thanks also for the interesting reference. Section 8 of that paper certainly describes a related idea, although there is no real adaptation to the value of the Slater parameter (i.e., the margin), which is an important aspect of our study. We will both try to carefully work out the concrete implications (by working out the primal/dual regrets appropriate here), and of course discuss this paper in the related-work section.
The paper tackles the problem of finding the point maximizing the safety margin relative to the unknown constraints and objective parameter of an LP, under noisy evaluations. Two flavours of this problem are studied. In one, the feasible action search (FAS) setting, the goal is to find a point within a multiplicative factor of the optimal safety margin with high probability, while minimizing the number of samples required. In the other, the regret-control setting, the goal is to control both the safety cost and the regret, defined as the difference between the value of the objective function and the true optimal value of the objective function.
The authors introduce a Thompson-sampling-based algorithm for this setting, FAST (Feasible Action Search via Thompson Sampling). The algorithm is accompanied by formal guarantees that it stops after a bounded number of samples in the FAS setting, with the bound governed by the optimal margin; it is also shown that using FAST to instantiate existing safe linear bandit algorithms yields corresponding regret and risk guarantees.
The paper could benefit from improved clarity in notation and general presentation, numerical experiments and possibly a few words about the importance of the problem tackled here. Nonetheless, the reviewers agree that this submission presents a solid contribution to the conference.