7.1

/10

Poster5 位审稿人

最低4最高5标准差0.5

3.0

置信度

创新性2.8

质量2.6

清晰度2.8

重要性2.4

NeurIPS 2025

Thompson Sampling for Multi-Objective Linear Contextual Bandit

Somangchan Park,Heesang Ann,Min-hwan Oh

OpenReview PDF

提交: 2025-05-08更新: 2025-10-29

TL;DR

We propose the first Thompson Sampling algorithm with Pareto regret guarantees in multi-objective linear contextual bandit.

摘要

We study the multi-objective linear contextual bandit problem, where multiple possible conflicting objectives must be optimized simultaneously. We propose $MOL-TS$, the first Thompson Sampling algorithm with Pareto regret guarantees for this problem. Unlike standard approaches that compute an empirical Pareto front each round, $MOL-TS$ samples parameters across objectives and efficiently selects an arm from a novel effective Pareto front, which accounts for repeated selections over time. Our analysis shows that $MOL-TS$ achieves a worst-case Pareto regret bound of $\widetilde{O}(d^{3/2}\sqrt{T})$, where $d$ is the dimension of the feature vectors, $T$ is the total number of rounds, matching the best known order for randomized linear bandit algorithms for single objective. Empirical results confirm the benefits of our proposed approach, demonstrating improved regret minimization and strong multi-objective performance.

关键词

Multi-Objective Multi-armed BanditLinear Contextual BanditPareto OptimalityThompson Sampling

评审与讨论

审稿意见

评分: 5置信度: 22025-07-02

In this paper, the authors study stochastic linear contextual bandits under the multi-objective setting, in which each arm corresponds to a vector of $L$ objective rewards, whose expected reward is the inner product between the hidden objective value and the arm’s context. As the Pareto front is non-convex, they define the effective Pareto optimality, which is essentially the convex hull of all arms. They measure performance using the cumulative minimal effective Pareto sub-optimality, defined as the $\ell_\infty$ distance between the performance and the convex hull.

They present a Thompson Sampling-based algorithm to solve this problem efficiently. Specifically, for each objective, they sample a logarithmic number of hidden objective parameters and estimate the empirical objective reward of each arm using the maximum inner product between the sampled parameters and the context. The algorithm then randomly plays an action sampled uniformly from the empirical Pareto front. They establish a regret bound of order $\sqrt{T}$ for their algorithm. Experiments are provided to demonstrate the benefits of the algorithm.

优缺点分析

Strength:

The multi-objective linear contextual bandit setting is definitely significant and interesting. Applying Thompson Sampling to this setting is a natural idea. The authors propose a clean algorithm for this setting, with a regret upper bound that matches the performance of Thompson Sampling in the linear contextual bandit setting.
I think optimistic sampling is a novel idea, as it allows one to perform optimistic estimation in a computationally efficient way compared to the UCB bounds ( $\mathcal{O}(d^2 M)$ for optimistic sampling per action vs. $\mathcal{O}(d^3)$ in UCB). The corresponding intuition and proof sketch look correct to me. This analysis may shed light on future work in the bandit literature.

Weakness:

I feel that the main reason one would prefer a TS-based approach over a UCB-based approach is its computational efficiency, as suggested by Agrawal et al. (2012), especially when there are many arms. Unfortunately, I don't see the authors highlighting this point. I would suggest that the authors add a section discussing the computational complexity of the proposed approach.

Agrawal, Shipra and Navin Goyal. “Thompson Sampling for Contextual Bandits with Linear Payoffs.” International Conference on Machine Learning (2012).

问题

How does $M$ affect the constant in the regret bound of Theorem 2? For example, what happens if $M$ doubles?

局限性

Yes

最终评判理由

I would maintain my positive evaluation of this work and recommend it for acceptance.

格式问题

作者回复

2025-07-31

We thank the reviewer for your time and thoughtful review of our paper. We greatly appreciate the insightful feedback, especially the positive recognition of our use of optimistic sampling.

Computational Cost : We will gladly discuss the computational complexity of our proposed approach. Unlike previous works that require computing the entire Pareto front, which involves comparing all pairs of arms and has a computational complexity of $O(K^2)$ (where $K$ is the number of arms), our method significantly reduces the computational cost. Instead of explicitly identifying all effective Pareto optimal arms, we play an arm based on a random weight vector, thus selecting an arm that is optimal with respect to the random weight, which is effective Pareto optimal arm. This approach takes the computational complexity to $O(KL)$ . Our method saves computational time since $K>>L$ .

Discussing the computational complexity involves several factors, including both the selection of arms and the method for computing the Pareto front. Beside comparing computational complexity between UCB and TS which involves considering multiple factors of each algorithm, our experiments show that the proposed Thompson Sampling algorithm is computationally faster. We believe that the computational efficiency of TS, as demonstrated empirically, arises from the way it avoids the explicit computation of the entire Pareto front and instead samples arms based on a random weight vector, leading to lower computational overhead. The following table shows the average time complexity with $T = 10000$ , and $L = 4$ .

Time Complexity (s)	$`MOL-UCB`$	$`MOL-TS`$ (ours)
$d = 5,\ \ K = 50$	16.73	5.09
$d = 10, K = 50$	17.39	6.92
$d = 5,\ \ K = 100$	34.64	5.97
$d = 10, K = 100$	37.25	7.29
$d = 15, K = 100$	40.87	11.43
$d = 10, K = 200$	79.95	7.81
$d = 15, K = 200$	88.08	11.69

It is also important to mention that anthor main reason of using TS over UCB is the superior empirical regret performances which are presented in the experiments in the paper (Figure 2~13).

The constant dependency with respect to $M$ : The regret bound in Theorem 2 depends on $M$ through the term $\sqrt{2d\log\frac{2LMdT}{\delta}}$ , which arises from the fact that multiple sampled parameters need to satisfy the concentration property [1]. Doubling $M$ results in a logarithmic increase only in the regret bound.

[1]Abeille and Lazaric. "Linear thompson sampling revisited" (2017).

2025-08-06

Thanks to the authors for their answer. I have no further questions.

审稿意见

评分: 4置信度: 32025-07-02

The paper proposes a Thompson sampling (TS) based approach for multi-objective linear bandits, with the goal of optimizing the effective Pareto regret. The authors define this new notion to overcome the issues associated with measuring cumulative regret using the Pareto regret idea. The algorithm (MOL-TS) essentially takes multiple samples from the posteriors to construct a randomized optimistic confidence bound for each arm, based on which the effective Pareto front is determined at that round, and action is taken randomly from this front. The paper establishes a $\tilde{O}(d^{3/2}\sqrt{T})$ worst-case regret bound, which matches the lower bound for single-objective linear bandit TS.

优缺点分析

The paper is well-written and the results are promising. However, due to the missing appendix, the proofs cannot be verified and experimental details are missing at present.

问题

The optimistic TS using $M$ samples basically behaves like a UCB-type algorithm, by taking large number of samples and taking the maximum, this maximum (for large $M$ ) behaves like the mean added with an upper confidence term, depending on the Gaussian tail and $M$ , however unlike UCB it is slightly randomized (amount of randomization controlled by $M$ ) - can the authors provide any comment on this connection?
Can you explain why for each individual objective, greedy outperforms UCB, but somehow the Pareto regret is way better for UCB (in Figure 2)? This figure is misleading and leads to question the effectiveness of the Pareto regret (and effective PR) notion. Appendix is missing - proofs and other experimental details.
Can you be more explicit in the regret bounds in the orders of dependence on $K$ and $L$ ? Is the bound tight in these? Also, it might be good to show the dependence on $M$ (or $p$ ) in theorem 2 and corollary 1.
[Contextual Bandits with Linear Payoﬀ Functions by Chu et al. 2011] shows lower bound for linear contextual bandits (single-objective) of the order $\Omega(\sqrt{Td})$ and a matching upper bound (up to log terms). For TS, the recent [Geometry-Aware Approaches for Balancing Performance and Theoretical Guarantees in Linear Bandits by Luo et al] used a slight modification of TS to achieve this regret bound for randomized algorithms. Even in the high-dimensional case, the dependence on dimension and sparsity for TS is $O(s\log d)$ (e.g. [Thompson Sampling for High-Dimensional Sparse Linear Contextual Bandits by Chakraborty et al.]), demonstrating better dependence on dimension (stochastic setting). - in the current paper, the dependence on $d$ seems to be slightly higher at $d^{3/2}$ - can the authors comment on this? At least the authors should include a brief discussion on single-objective regret bounds in the literature to claim “best known order for randomized linear bandit algorithms for single objective”.
What is the parameter $\delta$ in the algorithm 1?
From a practician’s perspective, how do you tune all the hyper parameters in algorithm 1. Theoretical results give some indication on $M$ , but what about the others, in particular $\lambda, c$ and $\delta$ ? Did you tune these parameters for the numerical experiments? If so, did you tune the associated parameters for greedy and UCB based methods for a fair comparison?

局限性

See the questions.

最终评判理由

I increased my score based on the authors responses.

格式问题

Clarify the setting, whether the contexts are adversarial or stochastic. It seems the former given there are no stochastic assumptions, still it would be good to clarify your setup.

作者回复

2025-07-31

We thank the reviewer for your time reviewing our paper. We sincerely hope that our answers can help clarify your questions.

The appendix has been uploaded. We would like to respectfully clarify that the full appendix, containing all proofs, experimental details, and additional experimental results, has been submitted as part of the supplementary material. Every theoretical claim made in the main paper is rigorously proved in the appendix, and our empirical evaluations are documented in detail.

We sincerely hope that, in light of this information, the reviewer might reconsider any concerns based on the (unintended) mistaken assumption that proofs or technical details are missing. We are concerned that an evaluation influenced by this misunderstanding could inadvertently affect the overall assessment of the well-prepared paper. We greatly appreciate the reviewer’s time and thoughtful engagement, and we are happy to point to any specific section of the appendix upon request.

We provide answers to your questions below:

Does the optimistic TS using $M$ samples behaves like UCB-type algorithm? : No, they are fundamentally different. UCB algorithm is deterministic, as they use upper confidence bound to evaluate arms, and the optimism comes from deterministically evaluating arms within the estimated confidence bounds. This algorithm is a fixed strategy for exploration.

Thompson Sampling is a randomized algorithm. The optimism in TS is handled through multiple sampling from the posterior distribution where the number $M$ controls the randomness of optimism and controls the boundary that the sampled parameters following concentration property [1] (more specifically, for every $m\in [M]$ , $\lVert\tilde\theta_{t,m}^{(\ell)}-\hat\theta_t^{(\ell)}\rVert_{V_t} \le O(d\log LM)$ holds with probability at least $1-\delta$ ) . Even with large $M$ , the maximum of the sampled parameters does not guarantee an optimistic value in the same way as UCB does.

[1] Abeille and Lazaric. "Linear thompson sampling revisited" (2017).

Comparison between greedy and UCB algorithm : We are more than happy to clarify this question. The single-metric regret measure may not fully explain the observed behavior in Figure 2 (also Figure 3~13 in Appendix F) because it is difficult to calculate a single value of regret that captures the full complexity of the reward vector across multiple objectives. It becomes more difficult as the number of objectives grows, which may lead to misleading interpretations in certain cases. To the best of our knowledge, for the first time, our paper demonstrates the experiments of objective-wise total rewards.

Pareto regret focuses on the trade-offs between objectives, but it does not account for the total rewards across all objectives. Our newly defined effective Pareto regret covers this problem, which offers a more comprehensive measure of performance. This new definition represents an important step forward in overcoming the difficulties associated with measuring the regret in multi-objective settings.

Explicit regret equation : In the Appendix B.2, Theorem 2 shows that algorithm $`MOL-TS`$ has the worst-case regret upper bound by $\left(1+\frac{2}{p-\frac{\delta}{T}}\right)c_T(\delta)\sqrt{2Td\log\left(1+\frac{T}{\lambda}\right)} + 2\delta \Delta_{\text{max}}$ , where $c_T(\delta) = \left(R\sqrt{d\log\left(\frac{1+(T-1)/(\lambda d)}{\delta/L}\right)}+\lambda^{1/2}\right)\left(1+\sqrt{2d\log\frac{2LMdT}{\delta}}\right).$ The regret depends on the number $L$ and $M$ up to logarithmic factor, and it does not depend on the number of arms $K$ .

Comparisons with LinTS-type algorithms for single objective : We are happy to elaborate on this. First of all, the well-known minimax lower bound for the linear contextual bandit is $\Omega(d\sqrt{T})$ [2], so there is a $\sqrt d$ gap between ours and the lower bound. (Please note that $O(\sqrt{dT})$ bounds such as Chu et al. 2011 possess $log K$ factors for finite arms and $O(d^{3/2}\sqrt T)$ regret algorithms including ours do not depend on the number of arms. We hope that we are discussing the results based on the relevant context.) The regret bound in our paper is $\widetilde O(d^{3/2}\sqrt T)$ , which matches the best-known regret bound of the linear Thompson Sampling (LinTS) type algorithms (e.g., [1]) in single objective setting. It is known that this bound cannot be improved for LinTS and its variants due to the inevitable requirement of posterior variance inflation by a factor of $\sqrt d$ , a limitation that arises in the analysis of worst-case regret for Thompson Sampling compared to fixed optimism-based algorithms [3]. Our work extends this well-known result to the multi-objective setting, and the core regret bound still aligns with the fundamental limitations of LinTS in single objective setting.

Our work extends this well-known result from single objective to multi-objective setting, where we also face the same fundamental limitations in terms of the scaling with $\sqrt d$ . Therefore, while the regret bound in the multi-objective setting grows more complex due to the additional factor of $L$ (the number of objectives), the worst-case regret bound is $\widetilde O(d^{3/2}\sqrt T)$ .

[2] Dani, Hayes, and Kakade. "Stochastic Linear Optimization under Bandit Feedback" COLT (2008).
[3] Hamidi and Bayati. "On Frequentist Regret of Linear Thompson Sampling" (2020).

What is $\delta$ ? : In Algorithm 1, the input $\delta$ represents a confidence bound input that plays two key roles, RLS estimations and concentration property. As described in Appendix C, $\delta$ measures the probability bound on the distance between the true parameter and the RLS estimates [4]. This ensures that the estimated parameters are close enough to true values with probability at least $1-\delta$ . Also, it controls the probability that the sampled parameters follow the concentration property [1]. This property ensures that sampled parameters are close enough to RLS estimate with high probability. It is crucial property for random exploration in the Thompson Sampling algorithm, as it helps balance the trade-off between exploration and exploitation, while maintaining reliable confidence bounds.

[4] Abbasi-Yadkori et al. "Improved algorithms for linear stochastic bandits" (2011).

Extra experiments of $\lambda$ , $c$ , $\delta$ : The main focus of our experiments is to compare the performance of the algorithms with different values of $M$ and $L$ . The other hyperparameters, including $\lambda, c$ , and $\delta$ were kept fixed in our experiments, with the value set to $\lambda=1, c=1$ , and $\delta=0.01$ . We did explore additional results for different initial values of these hyperparameters. But these values were not central to the core ideas of our paper, and varying these parameters does not change the results. Regarding the comparison with greedy and UCB algorithm, we used the same fixed hyperparameters for fairness, ensuring that the comparison is made under consistent restriction. The following table measures the average of total regret over 10 different instances in $T = 10000$ with different $\lambda, c$ , and $\delta$ .

$\lambda = 2, c = 1, \delta=0.01$	$\epsilon$ -Greedy	$`MOL-UCB`$	$`MOL-TS`$ (ours)
Pareto Regret	240.71	178.29	73.93
Effective Pareto Regret	285.38	268.92	95.50

$\lambda = 1, c=2, \delta=0.01$	$\epsilon$ -Greedy	$`MOL-UCB`$	$`MOL-TS`$ (ours)
Pareto Regret	235.12	196.08	105.86
Effective Pareto Regret	279.80	289.89	133.27

$\lambda = 1, c=1, \delta=0.1$	$\epsilon$ -Greedy	$`MOL-UCB`$	$`MOL-TS`$ (ours)
Pareto Regret	239.18	211.66	77.42
Effective Pareto Regret	285.00	310.29	99.16

Adversarial setting : We are happy to modify Section 3 of the paper, that the contexts are adversarially given.

2025-08-06

Thank you for the response. Most of the questions were answered by the authors. I went through the appendix and my concerns regarding this are satisfied. The paper is pretty good in terms of the theoretical contributions and I raised my score to reflect that. For the final version, it might be good to state delta used in Algorithm 1 in the main text, at least a brief note on what this parameter indicates.

While I understand that TS and UCB are fundamentally different, I meant that in this case they can be seen connected in the following way. For linUCB type methods, $\text{UCB}_t(x) = x^\top \hat{\theta}_t + \alpha \cdot \sqrt{x^\top V_t^{-1} x}$ . In your case, you take $M$ samples from the posterior and take the maximum - if you look at the distribution of this, this is centered around a term roughly similar to the UCB-type, where $M$ controls the exploration-exploitation trade-off ( $\alpha$ ). If you take one sample, then you cannot control this - taking a correct choice of $M$ properly balances this trade-off, as you show.

I am still a bit confused about Figure 2 - I understand that it is harder to metrize performance across all objectives by a single measure, but it seems greedy outperforms the MOL-UCB for ALL objectives, however, the pareto regret shows a different story. So with this new metric, it seems it is possible to do better overall (in the new metric sense) when individually you perform worse for all underlying objectives. Can the authors provide an explanation for this?

2025-08-06

Thank you very much for engaging with our rebuttal and for taking the time to re-evaluate your score. We truly appreciate the increased score and your acknowledgment after reviewing the appendix.

We also value your perspective on the connection between UCB methods and optimistic sampling in Thompson Sampling, particularly regarding the role of $\alpha$ (or $M$ in our setting) in controlling the exploration–exploitation trade-off.

To address your question on Figure 2, we are happy to further explore scenarios that could lead to the results shown in Figure 2.

Consider a setting with four arms, each associated with a deterministic two-dimensional reward:

$\boldsymbol{r}_{a_1} = [1, 0]^\top$
$\boldsymbol{r}_{a_2} = [0, 1]^\top$
$\boldsymbol{r}_{a_3} = [0.9, 0.9]^\top$
$\boldsymbol{r}_{a_4} = [0.8, 0.8]^\top$

Arms $a_1$ , $a_2$ , and $a_3$ are Pareto optimal (and also effective Pareto optimal), whereas arm $a_4$ is strictly dominated, resulting in a Pareto regret (and effective Pareto regret) of $0.1$ per round when $a_4$ is selected.

Now compare two algorithms:

The first algorithm selects arm $a_1$ in half of the total rounds and arm $a_2$ in the other half (order does not matter). This results in a per-round reward of $[0.5, 0.5]$ and zero Pareto regret.
The second algorithm selects arm $a_4$ in all rounds, achieving a reward of $[0.8, 0.8]$ but incurring a constant regret of $0.1$ per round.

Despite the first algorithm achieving no Pareto regret and optimal total rewards in each individual run, its average reward vector is lower than that of the second algorithm. This highlights the unintuitive possibility that an algorithm with lower regret can yield lower average rewards than another.

While this phenomenon may occur under both Pareto regret and effective Pareto regret, it is less pronounced with effective Pareto regret (e.g., Figure 2, the difference between Pareto regret and effective Pareto regret). This is because the effective Pareto front may exclude Pareto optimal arms that lie inside the convex hull. For example, as illustrated in the left subfigure of Figure 1, some arms that are Pareto optimal may still fall outside the effective Pareto front.

To see this more concretely, consider a similar setting with reward vectors:

$\boldsymbol{r}_{a_1} = [1, 0]^\top$
$\boldsymbol{r}_{a_2} = [0, 1]^\top$
$\boldsymbol{r}_{a_3} = [0.7, 0.7]^\top$
$\boldsymbol{r}_{a_4} = [0.6, 0.71]^\top$

In this case, choosing $a_4$ incurs zero Pareto regret (with losses in rewards in both dimension) but non-zero effective Pareto regret. Effective Pareto regret better copes with these cases than the plain Pareto regret does.

We greatly appreciate your insightful question and the opportunity to elaborate on the distinctions between Pareto regret and effective Pareto regret, and their relationship to singleton reward performance.

If you have any further questions, we would be happy to provide additional clarification. Thank you again for your valuable time and thoughtful feedback.

2025-08-09

Thank you for the clarification. It might be good to elaborate on this example regarding the seemingly strange behavior in Figure 2 in the revised version.

审稿意见

评分: 4置信度: 32025-07-03

This paper studies the multi-objective linear bandit problem, which extends traditional linear contextual bandit problems by requiring optimization across multiple, often conflicting objectives simultaneously. In this paper, the authors proposes MOL-TS as the first randomized algorithm for multi-objective contextual bandits with Pareto regret guarantees, achieving $O(d^{3/2}\sqrt{T})$ regret. Also, this paper firstly introduces "effective Pareto optimal arm" that not only satisfies standard Pareto optimality but also ensures total rewards across all objectives remain Pareto optimal with repeated selections. Numerical experiments demonstrate improved performance in both regret minimization and objective-wise total reward maximization

优缺点分析

Strengths

This research addresses multi-objective bandit literature by introducing the first Thompson Sampling approach with theoretical guarantees.
The algorithm sample multiple models from the posterior distribution and compute an optimistic reward estimation to adapt to the multi-objective setting, which is novel.
The effective Pareto optimality concept provides a more robust framework for multi-objective optimization, ensuring better long-term performance across all objectives rather than just instantaneous optimality.

Weaknesses

The regret bound achieved by this paper is $O(d^{3/2}\sqrt{T})$ , which is worse than UCB-based approaches for multi-objective linear bandits.
It would be better if the authors also discuss how to modify the previous approaches so that they could work for the new definition of Pareto optimality.

问题

In feel-good TS[1], it is shown that for standard linear bandits, the regret upper bound is $O(d\sqrt{T})$ , which matches with the UCB-based approaches. Can similar techniques be applied in multi-objective linear bandit settings to achieve a tighter bound?

[1] Tong Zhang. Feel-good thompson sampling for contextual bandits and reinforcement learning.

局限性

Yes.

最终评判理由

The authors' rebuttal helps me assess this work better. I will keep my original score.

格式问题

作者回复

2025-07-31

We sincerely thank the reviewer for taking the time to carefully review our paper. We especially appreciate the positive comments on the strength of our work, especially acknowledging optimistic sampling.

Regret bound of $\widetilde{O}\bigl(d^{3/2}\sqrt{T}\bigr)$ and possible adaptation of Feel-Good TS

We are happy to answer this. First, the regret bound of $\widetilde{O}\bigl(d^{3/2}\sqrt{T}\bigr)$ for our algorithm—viewed as a member of the LinTS family—is not improvable in terms of the $d$ dependence. Hamidi & Bayati (2023) [1] show that Linear Thompson Sampling incurs an inherent posterior-variance inflation by a factor of $\sqrt{d}$ , leading to a frequentist regret lower bound of $\widetilde{O}(d^{3/2}\sqrt{T})$ is the best possible for LinTS-type algorithms. If not for such an inflation, LinTS-type algorithms would incur a linear regret in $T$ [1]

Our work extends this single-objective result to the multi-objective setting, and the bound we obtain remains consistent with these fundamental limitations of LinTS.

[1] Hamidi & Bayati. “On Frequentist Regret of Linear Thompson Sampling” (2023).

Adapting Feel-Good TS is indeed intriguing. However, even in the single-objective case, adding the Feel-Good exploration term typically requires approximate MCMC to sample from a non-conjugate posterior—unless strong distributional assumptions are imposed. Because we assume only classical sub-Gaussian noise, such sampling would introduce significant computational overhead.

For these reasons we base our algorithm on LinTS techniques. Extending Feel-Good TS to the multi-objective setting is an interesting avenue for future research, but lies beyond the scope of the present work.

Previous methods and our new notion of Pareto optimality: We are happy to revise the discussion to clarify how our newly defined concept—effective Pareto optimality—can benefit existing algorithms as well. Existing multi-objective bandit methods typically sample directly from the standard Pareto front, which can lead to inferior cumulative rewards across objectives. Replacing that step with our notion of effective Pareto optimality could improve the performance of those algorithms as well as our own, further underscoring the contribution of this work.

2025-08-05

Thanks for the answers. It helps me assess this work better. I will keep my original score.

审稿意见

评分: 4置信度: 42025-07-03

The paper studies the multi-objective linear contextual bandit problem, where a decision-maker (or learner) must simultaneously optimize several potentially conflicting objectives. The goal is to learn a policy for selecting an effective Pareto optimal arm (Definition 5) for each given context, whose mean reward vector is Pareto optimal and maximizes the cumulative rewards across all objectives. The performance of the learned policy is measured in terms of minimizing effective Pareto regret (Definition 7), which is the sum of effective Pareto sub-optimality gap (defined in Eq. (1)).

The authors propose a Thompson sampling-based algorithm, MOL-TS (Thompson Sampling for Multi-Objective Linear Bandits), for this setting with sub-linear effective Pareto regret guarantees. This paper introduces the notion of effective Pareto front, which consists of arms whose selection maximizes the total reward. The authors have also validated the performance of MOL-TS using synthetic problem instances and show that MOL-TS outperforms baseline algorithms in both regret minimization and cumulative objective-wise rewards.

优缺点分析

The following are the strengths of the paper:

This paper studies the multi-objective multi-armed bandit problem, which has applications in many areas, especially having several potentially conflicting objectives.
This paper proposes a Thompson sampling-based algorithm, MOL-TS, to learn a policy to select the best effective Parato optimal arms for each context. They have theoretically shown that MOL-TS enjoys a sub-linear regret guarantee while choosing a better arm that maximizes the total reward.
The authors have also empirically validated the performance (lower regret and higher cumulative rewards) of MOL-TS using synthetic problem instances.

The following are the weaknesses of the paper:

Assumption: Assuming a linear reward function may not be practical in real-life applications. Since there has been a lot of work considering the non-linear bandit setting, it would be much better to have a more practical algorithm by considering a general reward function.
Selecting the best arm: Selecting an effective Pareto optimal arm can lead to a better performance than simply selecting a Pareto optimal arm. Since there can be multiple effective Pareto-optimal arms, it is unclear how to choose a single best arm among them. Doing this may need additional constraints, but it may make it more practical, as one can explain why a specific effective Pareto-optimal arm is selected. Otherwise, different effective Pareto-optimal arms may lead to different values of the objectives.
Analysis novelty: The multi-objective linear contextual bandits and Thompson sampling-based bandit variants are now separately well-studied problems. From the paper, it is unclear what the key challenges are faced in the regret analysis, especially after getting an upper bound on the effective Pareto sub-optimality gap in Section 5.2.

问题

Please address the weaknesses raised in Strengths And Weaknesses*.

局限性

I have raised a few limitations of the paper in my response to the Strengths And Weaknesses*. Since the paper is a theoretical contribution to bandit literature, I do not find any potential negative societal impact of this work.

最终评判理由

The authors’ rebuttal has addressed my concerns. Overall, this paper proposes, for the first time, a Thompson sampling–based algorithm for the multi-objective contextual bandits setting. This work can serve as a fundamental building block for designing bandit algorithms in more complex real-world applications, including modeling nonlinear objectives or incorporating refined criteria for the optimal arm from Pareto optimal arms.

格式问题

I found no major formatting issues in this paper.

作者回复

2025-07-31

We would like to sincerely thank the reviewer for taking the time to review our paper. We address the feedback you provided point by point below.

Why linear assumption? : We appreciate the reviewer’s feedback regarding the assumption of a linear reward function, but we believe this is not a fundamental weakness of the paper. Analyzing Thompson Sampling in multi-objective bandits (particularly in contextual bandits including linear and generalized linear settings) has remained as an open problem, and to the best of our knowledge, for the first time, our work presents a fundamental milestone in the development of randomized algorithms with the (worst-case) Pareto regret guarantees in multi-objective linear contextual bandits.

The focus of our paper is providing the first randomized algorithm, with Pareto regret analysis and empirical validation in linear reward function. Extending our approach to the (non-linear) GLM is straightforward. Hence, our results aren't limited to linear settings. Further extension to handle more general function class is certainly an interesting direction. However, it is important to note that the optimistic sampling that we propose for Pareto regret analysis can be naturally generalized to other non-linear reward functions. Our work is a noteworthy contribution that sets the foundation for future work with more complex settings. Overall, the fact that we have solved this long-standing open problem in the multi-objective linear contextual bandit should be considered a notable contribution, rather than considered as a weakness.

What is a best single arm? : In multi-objective bandits—and, more broadly, in multi-objective optimization—a single best optimal solution is generally not guaranteed to exist. As noted in our Introduction, multiple objectives often conflict, so improvement in one objective can degrade another. The appropriate notion of optimality is therefore Pareto optimality: the set of solutions that cannot be improved in any objective without worsening at least one other.

If a unique “best” arm did exist under some scalarized criterion, the problem would degenerate to a single-objective bandit setting—contradicting the essence of multi-objective learning. Prior work thus focuses on identifying the Pareto-optimal set (the Pareto front), where each arm is non-dominated by the rest. However, simply enumerating Pareto arms does not address how to maximize cumulative reward across all objectives. Our contribution is to introduce effective Pareto-optimal arms, together with an algorithm that attains sub-linear regret for this richer goal. This adds a fresh perspective to the multi-objective bandit literature.

Should one wish to impose problem(or user)-specified weights and reduce the problem to a single scalar objective, our algorithm naturally specializes to that case and still guarantees sub-linear regret. Nevertheless, that scenario is not the focus of this paper, and therefore it should not be viewed as a weakness of our work.

Technical challenges in analysis : We are happy to elaborate on this. First of all, a simple, straightforward combination of Thompson Sampling (TS) techniques with existing analyses for multi-objective linear contextual bandits is not sufficient for multi-objective linear contextual bandits This gap explains why no worst-case regret analysis of TS currently exists for this setting.

As explained in Section 5.3, one of the key technical challenges in deriving the regret bound of Thompson Sampling algorithm for multi-objective linear contextual bandits lies in ensuring the probability of optimism. Previously, there are many papers of multi-objective UCB-type algorithm with Pareto regret analysis (Drugan & Nowe (2013), Tekin & Turgay (2018), Turgay et al. (2018), Lu et al. (2019), Xu & Klabjan (2023)). The analysis of UCB algorithm is almost similar to that of single objective setting, attaining Pareto regret bound $\widetilde O(d\sqrt{T})$ where the number of objectives depend up to logarithmic factor. By contrast, deriving comparable guarantees for TS in the multi-objective linear contextual setting remains an open problem, and our work is the first to tackle these technical obstacles directly.

Suppose we follow standard Thompson Sampling algorithm, by setting $M=1$ . Widely known in single objective setting $(L=1)$ , the probability that $\tilde\theta_t^{(1)}$ is sufficiently optimistic is $\mathbb{P}\bigl( x_{t,a_t}^\top(\tilde\theta_t^{(1)}-\hat\theta_t^{(1)})\ge c_{1,t}(\delta)\lVert x_{t,a_t}\rVert_{V_t^{-1}} \bigr) \ge \tilde p$ , where $\tilde p$ is a constant probability. In multi-objective setting $(L>1)$ , as the arm is randomly selected, this optimism must hold for all objectives. For any randomly selected arm, we need to ensure this optimism for all parameters $(\tilde{\theta}\_t^{(\ell)})\_{\ell\in[L]}$ , i.e., $\bigcap_{\ell\in[L]}\mathbb{P}(x_{t,a_t}^\top(\tilde\theta_t^{(\ell)}-\hat\theta_t^{(\ell)})\ge c_{1,t}(\delta)\lVert x_{t,a_t}\rVert_{V_t^{-1}})\ge \tilde p^{L}$ . In the worst-case, this results in the regret bound $\displaystyle\widetilde O(\frac{1}{\tilde p^L}d^{3/2}\sqrt{T})$ , which become exponentially large as the number of objectives $L$ increases.

The optimistic sampling resolves this problem with multiple sampled parameters. Suppose $M>1$ . For each objective $\ell\in[L]$ , there are $M$ sampled parameters $(\tilde\theta\_{t,m}^{(\ell)})\_{m\in[M]}$ . The arm is evaluated with the sampled parameter that maximizes the evaluation, i.e., $\tilde\theta\_{t,a}^{(\ell)}=\arg\max\_{\theta}\{(x_{t,a}^\top\tilde\theta\_{t,m}^{(\ell)})\_{m\in[M]}\}$ . Then, the probability bounds for optimism become $\mathbb{P}(x_{t,a_t}^\top(\tilde\theta_{t,a}^{(\ell)}-\hat\theta_t^{(\ell)})\ge c_{1,t}(\delta)\lVert x_{t,a_t}\rVert\_{V_t^{-1}}) \ge 1-(1-\tilde p)^M$ . Following above steps, we have the probability of optimism at least $(1-(1-\tilde p)^M)^L$ . We choose large enough $M$ so that this probability is bounded below by $\tilde p$ .

We modify Thompson sampling algorithm with optimistic sampling to ensure the optimism and attaining the regret bound $\widetilde O(d^{3/2}\sqrt{T})$ . Addressing this challenging problem requires the development of a novel approach and perpectives. We would be more than happy to include more explanations in Section 5.3 in the final version to further strengthen our contributions.

Overall remarks : We appreciate the reviewer’s feedback, but we do not believe the points raised constitute fundamental weaknesses of our paper. We are confident that our work makes substantial contributions to the multi-objective contextual bandit literature and respectfully ask the reviewer to re-evaluate our work in light of our clarifications.

If any questions remain, please let us know, and we would be more than happy to provide further explanations.

2025-08-08

Dear Reviewer fjDD,

Thank you again for your service and the time you invested in reviewing our submission.

We have addressed each of the points you raised in our rebuttal and provided detailed clarifications. With fewer than 36 hours remaining in the discussion period, we would be grateful if you could consider our responses and let us know whether any questions remain.

We would sincerely appreciate your re-evaluation of the paper in light of the clarifications provided, and we are ready to answer any further questions you may have before the discussion window closes.

Sincerely,
Authors

2025-08-09

Thank you for your detailed rebuttal. Since all my concerns have been addressed, I am increasing my rating. Please consider incorporating these points into the revised paper, especially highlighting your main technical contributions.

审稿意见

评分: 5置信度: 32025-07-03

The paper studies the multi-objective linear contextual bandit problem. The paper introduces a new concept effective Pareto regret and proposes a Thompson sampling based algorithm relying on the principle of optimism, The paper proves the effective Pareto regret (hence the Pareto regret) of the algorithm. The paper concludes with experiments.

优缺点分析

Strength:

Significance: The paper establishes the first TS based algorithm with worst case Pareto regret guarantee, and the regret bound matches the best known order that in the single objective setting. The multi-objective is very common in real word scenarios but less studied.
Clarity: The presentation leading up to the introduction of the algorithm is very clear for someone like me who do not have the extensive background in multi-objective settings.
Originality: I really like the optimistic sampling intuition

Weakness:

I think this is minor but I think it would be interesting to see some stretch analysis on the number $L$ in the appendix (or in the main text). Is there any reason that you omitted the comparison between $M=1$ and $M = \log L$ when $L=8$ in the appendix? What if $L$ keeps getting larger.

问题

This is still on the problem of $L$ . I am curious why in the appendix if we keep $L$ the same, the difference between $M=1$ and $M=\log L$ is larger when $d$ is larger?
I am also a bit confused about why $M=1$ does not seem to be that bad given the algorithm relies on optimistic sampling.

局限性

Yes

最终评判理由

All my concerns are resolved and I vote for acceptance.

格式问题

None

作者回复

2025-07-31

We thank the reviewer for taking the time to review our paper. We sincerely appreciate the insightful comments, particularly the recognition of the optimistic sampling. We view this as an excellent opportunity to further clarify and strengthen our contributions, ensuring a more thorough understanding of our approach and its implications.

Streching Analysis on the number $L$ : We thank the reviewer for highlighting this issue. We will gladly revise the technical lemmas in Appendix C to provide a more detailed explanation of how the number of objectives $L$ affects the regret analysis of our algorithm.

The Comparison between $M=1$ and $M=\log L$ , for $L=8$ : We are also grateful for this feedback and happy to show our results. Due to the limited space (by large number of objectives), we omitted the comparison between $M=1$ and $M=\log L$ . We will add the results of additional experiments comparing these two settings. However, when $L\ge8$ , the results demonstrate that Thompson Sampling with optimistic sampling clearly exhibits lower effective Pareto regret. The following table is the experimental results of total regret when $L=8, 12$ and $T=10000$ .

Pareto Regret $(L=8)$	$M=1$	$M=O(\log L)$
$d = 5$	3.72	3.29
$d = 10$	10.53	8.81
$d = 15$	24.39	17.84

Effective Pareto Regret $(L=8)$	$M=1$	$M=O(\log L)$
$d = 5$	4.53	3.89
$d = 10$	13.81	11.39
$d = 15$	32.16	23.96

Pareto Regret $(L=12)$	$M=1$	$M=O(\log L)$
$d = 5$	2.42	1.57
$d = 10$	6.59	4.15
$d = 15$	11.87	8.73

Effective Pareto Regret $(L=12)$	$M=1$	$M=O(\log L)$
$d = 5$	2.96	1.91
$d = 10$	8.86	5.43
$d = 15$	16.45	11.97

The difference with respect to $d$ : As the dimension $d$ increases, the theoretical worst-case regret increases as well, due to the higher uncertainty in the parameter estimation. Empirically, this behavior is consistent with our experiments, as the regret gap between $M=1$ and $M=\log L$ grows more significant.

Why $M=1$ does not seem to be bad? : The worst-case theoretical regret bound does not imply that the algorithm must fail to the worst-case empirical results. It is essential to note that without optimistic sampling, the theoretical worst-case regret could grow exponentially. We resolve this issue by adopting optimistic sampling in $`MOL-TS`$ , which mitigate the potential for exponential growth in regret. However, the empirical results do not necessarily be the worst-case scenario. While optimistic sampling still achieving better outcomes empirically, optimistic sampling is employed in $`MOL-TS`$ to resolve the theoretical issue and ensure better performance in the worst-case scenario.

评论- Response to rebuttal

2025-08-03

Thanks for the response. I don't have any remaining issues.

最终决定Accept (poster)

2025-09-17

This paper studies the problem of cumulative reward maximization in multi-objective stochastic linear contextual bandits. The paper observes that the conventional definition of Pareto regret does not take into account the optimality of cumulative rewards. To address this issue, the authors introduce the notion of Effective Pareto optimality of cumulative reward vectors. Building on this, they propose an algorithm called MOL-TS, based on Linear Thompson Sampling, which aims to minimize the Effective Pareto regret. In MOL-TS, regularized least squares are performed for each objective, and for each objective $M$ samples are drawn from the posterior distribution. By taking the maximum, each parameter is evaluated optimistically. The authors also prove a worst-case regret bound. The key algorithmic and analytical novelty lies in incorporating this form of optimism, which prevents the regret coefficient from growing exponentially with respect to $L$ . However, this appears to be essentially the only substantial contribution of the paper.

After the review and discussion phases, all reviewers expressed a positive evaluation of this paper. While there were questions about the technical novelty, comparisons with UCB-type algorithms, and whether the dependency on $d$ is optimal, the authors’ responses were found satisfactory, and several reviewers raised their scores accordingly.

As also pointed out by Reviewer EmBN, the current theorem statement (Theorem 2) in the main part of the paper does not make the explicit dependence on important parameters such as $L$ entirely clear. If possible, presenting such dependencies directly in the main part would help make the paper’s main contributions more transparent.