PaperHub
Rating: 6.6/10 · Poster · 4 reviewers
Scores: 3, 4, 4, 3 (min 3, max 4, std 0.5)
ICML 2025

Adaptive Sample Sharing for Multi Agent Linear Bandits

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

This paper addresses agent collaboration in linear bandit problems via sample sharing, deriving an effective bias-variance trade-off for regret minimization.

Abstract

Keywords
linear bandit · collaboration · sample sharing

Reviews & Discussion

Review (Rating: 3)

The paper studies an adaptive sample-sharing problem for multi-agent linear bandits, where agents' true parameters may differ. The authors propose a separation test that detects the stopping time for beneficial collaboration. They provide both a separation-time upper bound and a cumulative pseudo-regret upper bound, and compare the pseudo-regret with existing methods. Finally, they conduct experiments on both synthetic and real-world data.

Questions for Authors

In Figure 1, how do you compare BASS with CLUB and SCLUB? The constant terms of SCLUB and CLUB depend on $M$, and your algorithm's upper bound has different parameters. It's unclear why BASS has a tighter upper bound.

Claims and Evidence

I have several points of confusion from reading the paper.

  1. Can you elaborate on the bias-variance tradeoff in Figure 1? Why is the orange ellipsoid inside the blue ellipsoid?

  2. I'm confused about the connection between Table 1 and Theorem 6.4. How do you get the $\sqrt{T}\log(T/d)$ upper bound from Theorem 6.4?

Methods and Evaluation Criteria

The paper studies the separation time and cumulative regret of the proposed algorithm, which looks reasonable to me.

Theoretical Claims

I feel the theoretical parts are not well presented. More specifically, in Theorem 6.4, what is $v(\delta, T)$? How do you quantify $n_i^e(t)$? The authors should add a remark discussing the quantities in the bound. Currently, the upper bound is not very clear to me.

Experimental Design and Analysis

I did not find major issues with the experiments. This paper is mostly a theoretical work.

Supplementary Material

The theoretical proofs look mostly clear to me. Minor issue: in the appendix table of contents, the real section names are not presented.

Relation to Existing Literature

This research relates to the broader literature on multi-agent collaborative learning, where multiple agents share data for learning purposes.

Essential References Not Discussed

I am not very familiar with the collaborative learning literature in linear bandits so I'm unsure whether essential references are missed.

Other Strengths and Weaknesses

The paper provides a detailed upper-bound analysis of the pseudo-regret, the number of misaligned agents, and the separation time. It also provides a lower bound on the separation time. However, a lower bound for the pseudo-regret is missing. How close is your algorithm's upper bound to the lower bound?

Other Comments or Suggestions

I'm curious why the authors define "pseudo-regret" instead of "regret". The definition looks the same as the standard regret.

Author Response

We would like to thank the reviewer for their positive and constructive comments. Below, we address your concerns and provide detailed answers to your questions.

1/ The orange ellipsoid (representing the collaborative estimation) corresponds to a reduced estimation variance, as it yields a smaller Mahalanobis radius of the confidence ellipsoid; note that the smaller ellipsoid might not be contained in the bigger one, and we will provide a more general illustration to underline that. However, notice that the center of the orange ellipsoid differs from that of the blue ellipsoid due to the introduced bias, illustrating the bias (OLS estimate deviation) vs. variance (confidence ellipsoid radius reduction) tradeoff.
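To make this tradeoff concrete, here is a minimal synthetic sketch (our illustration, not code from the paper; all variable names are hypothetical): pooling a neighbor's samples grows the design matrix, shrinking the confidence ellipsoid, while pulling the OLS estimate away from the agent's true parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 5, 200, 0.5
theta_i = rng.normal(size=d)                  # agent i's true parameter
theta_j = theta_i + 0.1 * rng.normal(size=d)  # agent j: close, but not equal

X_i = rng.normal(size=(n, d))                 # agent i's pulled arms
X_j = rng.normal(size=(n, d))                 # agent j's pulled arms
y_i = X_i @ theta_i + sigma * rng.normal(size=n)
y_j = X_j @ theta_j + sigma * rng.normal(size=n)

def ols(X, y):
    A = X.T @ X                               # design matrix
    return np.linalg.solve(A, X.T @ y), A

theta_solo, A_solo = ols(X_i, y_i)            # independent estimate (blue ellipsoid)
theta_pool, A_pool = ols(np.vstack([X_i, X_j]),
                         np.concatenate([y_i, y_j]))  # shared estimate (orange)

# Variance side: pooling inflates log det A, i.e., shrinks the confidence ellipsoid.
print("solo   log det A:", np.linalg.slogdet(A_solo)[1])
print("pooled log det A:", np.linalg.slogdet(A_pool)[1])
# Bias side: the pooled center drifts away from theta_i toward theta_j.
print("solo   deviation from theta_i:", np.linalg.norm(theta_solo - theta_i))
print("pooled deviation from theta_i:", np.linalg.norm(theta_pool - theta_i))
```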

2/ We will add a section in the appendix to properly detail the steps to obtain Table 1 from Theorem 6.4 and Lemma 6.1. The proof is similar to those of CLUB and SCLUB, and the improvements stem from both the Mahalanobis separation criterion and a refined analysis following Abbasi-Yadkori et al. (2011).

3/ The definition of $\nu$ remains unchanged across all theorems: it is the regret upper bound without sample sharing. Considering the upper bound of Theorem 6.4, the first term corresponds to the regret-minimization improvement due to the variance reduction (represented by $\mu$), and the second term corresponds to the cost of sharing (i.e., the shared OLS estimate bias due to misassigned agents). This second term features an interplay between the determinant and the exponential term in $n_i^e(t)$, which overall results in a $\mathcal{O}\left(\frac{4L}{\delta \rho_{\min}\sqrt{T}^{\,\gamma^2 R^2 - 1}}\right)$ regret. This leads to an overall behavior of the upper bound in $\mathcal{O}(\sqrt{T} \log(T/d))$, as depicted in Table 1.

4/ Thank you for pointing out the inconsistency in the appendix content table; we will revise the naming of the appendix sections accordingly.

5/ To derive a lower bound on the cumulative pseudo-regret, we consider the optimal case of $N$ collaborative agents sharing the same bandit parameters, forming a single (known) cluster. This leads to: $\frac{1}{N}\mu(\delta, T) \leq R_{i,0,T}^{\mathrm{cluster}}$. We will further compare the additional terms stemming from the clustering estimation in the upper bound of Theorem 6.4 to this lower bound to underline the cost of clustering.

6/ As defined in [1], pseudo-regret uses expected rewards, while regret uses realized rewards; in our paper we use Bubeck's definition (see the sketch below).
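For reference, a sketch of the two notions in generic linear-bandit notation (our rendering, following [1]; the symbols are illustrative, not the paper's exact display):

```latex
% Pseudo-regret compares expected rewards; regret compares realized rewards.
% With rewards r_{i,t}(x) = <theta_i, x> + eps_t and x_i^* = argmax_x <theta_i, x>:
\bar{R}_{i,T} \;=\; \sum_{t=1}^{T} \langle \theta_i,\, x_i^{\star} - x_{i,t} \rangle ,
\qquad
R_{i,T} \;=\; \max_{x} \sum_{t=1}^{T} \big( r_{i,t}(x) - r_{i,t}(x_{i,t}) \big) .
```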

7/ Thank you for pointing out the typo in Table 1: the proper constant term should be $d\sqrt{\frac{M}{N}}$ for CLUB and SCLUB, which is equivalent to $\frac{d}{\sqrt{\rho_{\min} N}}$ for balanced cluster sizes. As mentioned in 2/, the differences (a $\sqrt{d}$ factor and $T/d$ in the logarithm) are due to the use of a Mahalanobis metric combined with a refined analysis (i.e., applying the Elliptical Potential Lemma to the design matrix $\mathbf{A}_i$).

[1] Bubeck, S., & Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1), 1-122.

Additional comments:

If you have any additional questions about the paper and the theoretical and empirical analysis, we would be happy to provide further clarification.

Please let us know if you need any additional information, clarification or modification that we could provide for you to improve your score.

Reviewer Comment

Thank you for your detailed responses. I have decided to increase my score to 3.

Review (Rating: 4)

This paper considers the problem of multi-agent linear bandits in a collaborative setting where the aim is to maximise the cumulative reward across all agents. In this problem, each agent has the same (static) set of arms but a different parameter. The idea is that if the parameters of two agents are close enough together, then pooling their results to create a single parameter estimator is better than learning independently.

Whereas previous works assume that the agents form clusters in which the intra-cluster parameter distances are within a certain threshold, this work makes no such assumptions. The authors also show that, in the clustering case, their algorithm theoretically outperforms existing algorithms. The algorithm is also relatively efficient.

They also give the results of experiments, both on real and synthetic data, showing that their algorithm outperforms others.

Questions for Authors

None

Claims and Evidence

I have not read the proofs, so I can't confirm.

Methods and Evaluation Criteria

Yes

Theoretical Claims

I have not read the proofs

Experimental Design and Analysis

I did not check

Supplementary Material

I did not read the supplementary material.

Relation to Existing Literature

There has been much work on collaborative multi-agent linear bandits. This paper gives the first algorithm that makes no assumptions on the distances between the parameters of the agents. Most other works assume a clustering structure, and this work also shows that the same algorithm achieves a state-of-the-art result in that setting as well (albeit only a slight improvement on the previous state of the art). These two facts combined make this work a significant contribution to the field.

Essential References Not Discussed

Unknown

Other Strengths and Weaknesses

I feel the result is strong. Also, as far as I am aware from interacting with ChatGPT, the original clustering algorithm of Gentile et al. treats all agents in a cluster the same once it has learnt the cluster. If significant time passes, playing every agent in a given cluster the same way is suboptimal, and it is better to learn them independently; the algorithm of this paper appears to eventually learn the agents independently, which seems to be another strength (although I guess other clustering algorithms have been proposed that do this).

Other Comments or Suggestions

I very much like the way the paper is presented, building up from the fundamental principles of two agents sharing observations and the single-agent stochastic linear bandit. However, in my subjective opinion, it is best to place the full theoretical result as early as possible in the work (which would be before these sections).

Author Response

We would like to thank the reviewer for their very positive and insightful comments. Thank you very much for your kind appreciation of the paper's results.

Thank you for the suggestion regarding the structure of the paper; we will revise it, as we agree it improves the overall logical flow.

Additional comments:

If you have any additional questions about the paper and the theoretical and empirical analysis, we would be happy to provide further clarification.

Review (Rating: 4)

This paper studies the collaboration of multiple heterogeneous agents in addressing a linear bandit problem. It proposes an adaptive sample-sharing approach to dynamically determine whether agent pairs should cooperate (utilize each other's observations). Based on this technique, the paper proposes the BASS algorithm to address the multi-agent linear bandit problem with a tighter theoretical regret bound and better empirical performance.

After rebuttal

This is a nice piece of work. The reviewer would like to see the final version with all the comments addressed accordingly.

Questions for Authors

See above

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

The reviewer did not check the proofs in detail, but went through the detailed explanations in the main text and believes the intuition preceding the theoretical results of this paper makes sense.

Experimental Design and Analysis

Yes, the simulations look reasonable to the reviewer.

Supplementary Material

No

Relation to Existing Literature

The idea proposed in this paper may have a broader impact in the sequential decision-making and reinforcement learning community.

Essential References Not Discussed

No

Other Strengths and Weaknesses

Overall, the reviewer believes the paper makes a fair contribution, and thinks the approach can be extended to broader areas, such as other heterogeneous multi-agent bandit settings, as well as MARL.

Strengths:

  • New approach and better results: the adaptive approach discussed in Sections 3.2 and 5 is new to the reviewer, and the BASS algorithm based on this technique is also novel in the literature, with slightly tighter regret bounds.

  • Clear writing: the authors did a good job of presenting their work, starting with a toy two-agent example and extending to the multi-agent case, beginning with synchronous actions and then generalizing to heterogeneous actions. The illustrative figures (Figures 1 and 2) also help the reviewer better appreciate the fundamental idea behind the approach.

Weakness:

  • Some sections could be merged; for example, Section 4 could be moved into the Preliminaries, and Section 7 could be stated as a remark after Theorem 6.4 instead of a full section.
  • Some presentation/writing needs further clarifications, see "Other Comments" below.

Other Comments or Suggestions

In Section 5:

  • The presentation logic also confuses the reviewer a bit. In the end, the action sequences, i.e., the $\mathbf{A}_i$, differ across agents. Why does the section start from the general case in Eqs. (3) and (4), then move to the homogeneous-sequence case in Lemma 5.2, and then go back to the general case in Definition 5.1? Is there a challenge in extending from Lemma 5.2 to Definition 5.1?
  • In the equation of Definition 5.1, the reviewer is a little bit lost: what is $s$ in the definition, and where does $t$, as an input to $\Psi$, appear in the equation?

In Section 6:

  • In Algorithm 1's Line 3, are the neighbors $\hat{\mathcal{N}}$ determined by the updated graph $\mathcal{G}_t$ in Line 9, or by the environment?

Author Response

We would like to thank the reviewer for their positive and constructive comments. Below, we address your concerns and provide detailed answers to your questions.

1/ Thank you for the remark; we will revise the structure of our paper, as we agree it improves the overall logical flow.

2/ There is indeed a challenge in analyzing the effects of asynchronous pulling. It leads to differently shaped ellipsoids and requires a rescaling term of the form $\mathbf{A}_i^{-1} \mathbf{A}_j$, which makes the proof much more difficult. Therefore, we restrict our analysis to the synchronous setting. In Definition 5.1, we use the general case because this definition is used in the practical implementation of the algorithm, which allows for empirical evaluation of our method in the asynchronous setting.

3/ Thank you for pointing out this typo: the correct definition considers the minimum w.r.t. $s$ and compares the latter to 0:

$$\Psi(i, j, t) = \mathbf{1}\left\{ \min_{s \in \,]0, 1[} \frac{\gamma^2}{4} - \sum_{l=1}^d \mu_{tl}^2 \frac{s (1 - s)}{\beta_{i,t}^2 + s (\beta_{j,t}^2 \eta_{tl} - \beta_{i,t}^2)} < 0 \right\} \enspace.$$

We will correct this and update the corresponding subscripts.
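For readers following along, a minimal numerical sketch of this corrected test (our illustration; `mu`, `eta`, `beta_i`, `beta_j`, and `gamma` are hypothetical placeholders for the quantities in the display above), minimizing over $s$ on a grid:

```python
import numpy as np

def separation_test(mu, eta, beta_i, beta_j, gamma, grid=1000):
    """Sketch of the corrected Psi(i, j, t): flags separation when the minimum
    over s in ]0, 1[ of
        gamma^2/4 - sum_l mu_l^2 * s(1-s) / (beta_i^2 + s (beta_j^2 eta_l - beta_i^2))
    is negative. mu and eta are length-d vectors; the betas and gamma are scalars."""
    s = np.linspace(0.0, 1.0, grid + 2)[1:-1]                      # open interval
    denom = beta_i**2 + np.outer(s, beta_j**2 * eta - beta_i**2)   # shape (grid, d)
    vals = gamma**2 / 4 - ((mu**2) * (s * (1 - s))[:, None] / denom).sum(axis=1)
    return bool(vals.min() < 0)

# Toy call with hypothetical values (d = 3):
print(separation_test(mu=np.array([0.8, 0.1, 0.05]),
                      eta=np.array([1.2, 0.9, 1.1]),
                      beta_i=0.3, beta_j=0.25, gamma=0.5))
```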

4/ The graph $\mathcal{G}_t$ is updated and stored on a central server, which first requests the design matrix and the regressand of each agent in the neighborhood of agent $i$, and then sends the collaborative OLS variables to agent $i$ to pull an arm. Hence, the neighborhood is determined by a central server that manages the sample-sharing logic within the agent network. Note that our work focuses on sample efficiency rather than communication; however, sample sharing in a general decentralized setting is a promising line of future work.
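A hypothetical sketch of this federated flow (class and method names are ours, not the paper's; the exact protocol is an assumption):

```python
import numpy as np

class CentralServer:
    """Sketch of the sample-sharing logic: the server gathers (A_j, b_j) from
    agent i's current neighborhood in G_t and returns collaborative OLS
    variables for agent i's next pull."""
    def __init__(self, agents, graph):
        self.agents = agents  # agent_id -> object with .A (d x d) and .b (d,)
        self.graph = graph    # agent_id -> set of neighbor ids (the graph G_t)

    def collaborative_ols(self, i, lam=1e-6):
        ids = {i} | self.graph[i]
        A = sum(self.agents[j].A for j in ids)   # pooled design matrix
        b = sum(self.agents[j].b for j in ids)   # pooled regressand
        theta = np.linalg.solve(A + lam * np.eye(b.shape[0]), b)
        return theta, A                          # used by agent i's UCB step

    def update_graph(self, new_graph):
        self.graph = new_graph                   # G_{t+1} after separation tests
```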

Additional comments:

If you have any additional questions about the paper and the theoretical and empirical analysis, we would be happy to provide further clarification.

Reviewer Comment

The reviewer thanks the author for the response. The reviewer will hold their positive evaluation on the paper.

Review (Rating: 3)

This paper considers a multi-agent linear bandit problem in which each agent seeks to estimate its own linear parameter (so that it can minimize regret) while all agents select arms from a shared set. In this setting, agents are allowed to share reward observations with other agents to reduce the uncertainty of parameter estimates, at the cost of increasing the bias of the estimates. To construct proper collaboration sets for agents, the authors leverage the idea of an overlapping-ellipsoid test, proposing an OFUL-based algorithm with a Mahalanobis distance-based agent similarity criterion. This paper provides theoretical guarantees for the cumulative pseudo-regret, the agent separation time (length of communication), and the expected number of misassigned agents. The proposed algorithm is also numerically evaluated on both synthetic and real-world datasets.

Questions for Authors

  • I am trying to better understand the claim that, unlike "most" existing approaches, this work does not rely on any assumptions about the structure of the bandit parameters. Do works that assume an agent cluster structure (e.g., Gentile et al. 2014, Ban & He 2021, Wang et al. 2023) impose assumptions on the parameters? Could the authors clarify this point?

  • I am puzzled by the discussion following Lemma 5.1, which states that as we consider synchronous pulling, we have $\mathbf{A}_{i,t} = \mathbf{A}_{j,t} = \mathbf{A}_t$. My understanding is that even if agents $i$ and $j$ are synchronous, if they pull different arms, then $\mathbf{A}_{i,t} \neq \mathbf{A}_{j,t}$. Could the authors clarify this point? Please let me know if I have misunderstood anything.

Claims and Evidence

This paper claims three key contributions: no parameter assumption, anisotropic approach, and analysis. However, as one of the main contributions, the algorithm description in Section 6.1 appears too brief. Could the authors provide a more detailed explanation to enhance clarity?

Methods and Evaluation Criteria

This paper employs appropriate metrics, such as cumulative pseudo-regret and clustering scores, to evaluate performance. However, while the number of communications (i.e., agent separation time) appears to be considered in the theoretical analysis, it is not reported in the numerical study. Could the authors clarify this or provide the corresponding simulation results?

Theoretical Claims

  • In Theorem 6.2, the term $\beta(\delta, \mathbf{A}_{T_s})$ is not explicitly specified.
  • In Theorems 6.3 and 6.4, the terms $\nu(\delta, T)$ are not explicitly specified. Could the authors clarify whether they take the same form as in Theorem 6.2?
  • I am seeking a better understanding of the general regret upper bound for all agents presented in Theorem 6.3. Could the authors summarize the assumption-free bounds in a manner similar to Table 1 for clarity?

Providing these clarifications in the paper would enhance readability.

Experimental Design and Analysis

Could the authors provide comments on the results presented in Table 3? In particular, could they offer an explanation or intuition for why almost all benchmark algorithms received zero scores? Additionally, could the authors discuss whether the approaches proposed in this paper could be incorporated into these algorithms?

Supplementary Material

I am aware of the BASS algorithm implementation details and complexity analysis in the Appendix.

Relation to Existing Literature

This work may bring attention to the potential of incorporating Mahalanobis distance as a similarity measure in the multi-agent regret minimization research community.

Essential References Not Discussed

I am not aware of essential references that were not discussed.

Other Strengths and Weaknesses

Strength:

  • This paper proposes an interesting algorithm for multi-agent linear bandits and provides extensive theoretical analysis.

Weakness:

  • Some of the results are not presented clearly.

Other Comments or Suggestions

N/A

Author Response

We would like to thank the reviewer for their positive and constructive comments. Below, we address your concerns and provide detailed answers to your questions.

1/ We will add a specific section in the appendix, featuring a flow chart, where we expand the description of the algorithm's major steps: a/ the sample sharing (L. 2-4), b/ the UCB pull (L. 5-6), c/ the local variable update (L. 7), and d/ the graph $\mathcal{G}_t$ update (L. 9).

2/ The focus of this paper is to analyze the effect of sample sharing on regret minimization in the general case—i.e., without any assumption on the agent clustering structure. To that end, we ignore considerations such as communication cost. However, for completeness, we will add a section in the appendix presenting the empirical communication cost in the federated setting (using a central server), along with a theoretical upper bound in the case where the clustering structure is assumed (since in that setting, the number of agents per cluster can be quantified).

3/ The term $\beta(\delta, \mathbf{A}_t)$ is stated in Theorem 4.1 and corresponds to the confidence ellipsoid radius of the OLS estimate. We will restate and clarify the definition of $\beta(\delta, \mathbf{A}_t)$ in Theorems 5.1, 6.1, 6.2, 6.3, and 6.4, and underline that the definition of $\nu$ remains unchanged across these theorems.

4/ A particularity of the sample-sharing problem is that it unfolds in two distinct phases: one where sharing is beneficial, and another where independent parameter estimation is preferable. Theorems 6.2 and 6.3 present the cumulative regret at the transition point between these phases, highlighting the gain from sharing before switching to an independent regime. Table 1 reports the asymptotic regret in the independent setting, after collaboration has ceased. Thus, the results of Theorems 6.2 and 6.3 cannot be presented in the same form. However, the upper bound can be summarized as a tradeoff between the acceleration of the first term (in $\mu$), compared to the independent regret ($\nu$), and the bias from data sharing, captured in the second term.

5/ The other algorithms perform poorly in terms of clustering. Rather than accurately recovering the cluster structure, other approaches focus on regret minimization, either by design or by regret-oriented hyperparameter optimization. Figure 8 in Appendix R.3 shows the progressive degradation of the clustering score. Notably, no existing work in the literature provides theoretical guarantees or empirical validation regarding the quality of the clustering.

6/ Beyond the analysis of sample sharing itself, the most significant aspect we revise is the use of the Euclidean distance criterion (as in Gentile et al. (2014); Ban & He (2021); Wang et al. (2023)), which is effectively equivalent to the test proposed by Gilitschenski et al. (2012) where the ellipsoids considered are $\mathcal{E}(\hat{\theta}_i, \lambda^{\min}(\mathbf{A}_t)\mathbf{I}, \tilde{\beta})$ and $\mathcal{E}(\hat{\theta}_j, \lambda^{\min}(\mathbf{A}_t)\mathbf{I}, \tilde{\beta})$, thereby ignoring most of the arm-pulling history (see the sketch below).
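A small sketch contrasting the two criteria (our illustration; the scaling is a simplified assumption, not the paper's exact test):

```python
import numpy as np

def isotropic_gap(theta_i, theta_j, A):
    # Euclidean-style criterion: the design matrix is collapsed to
    # lambda_min(A) * I, discarding the anisotropy of the pulling history.
    lam_min = np.linalg.eigvalsh(A).min()
    return float(np.sqrt(lam_min) * np.linalg.norm(theta_i - theta_j))

def mahalanobis_gap(theta_i, theta_j, A):
    # Anisotropic criterion: the gap is measured in the geometry of A itself,
    # so well-explored directions weigh more than poorly explored ones.
    diff = theta_i - theta_j
    return float(np.sqrt(diff @ A @ diff))
```

When arm pulls concentrate in a few directions, the isotropic gap can stay small (and keep merging agents) even though the gap along well-explored directions is already large.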

7/ We will add a summary table in the appendix outlining the different clustering assumptions considered in the literature and mentioned in our related work:

  • no assumption: our paper, except in “Regret analysis with clustered parameters”.
  • same bandit parameter within a cluster: our paper with Assumption 6.1.
  • same bandit parameter within a cluster and a minimum distance between clusters: Gentile et al. 2014, Li et al. 2016, Li & Zhang, 2018, Wang et al. 2023, Yang et al. 2024.
  • close bandit parameters within a cluster and overlapping clusters definition: Ban & He 2021.

Note that our primary intention is to derive a principled rule for sharing samples between agents without introducing assumptions on their distribution. We then extend our results to the clustered setting to provide a fair comparison with existing work.

8/ In the synchronous setting, all agents in a cluster select an arm based on the same shared history. Since UCB is a deterministic rule, two agents using the same history will select the same arm; therefore, their design matrices will also be identical. We will add a section in the appendix to better explain this aspect of our setting.
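A toy sketch of this argument (our construction; the arm set and exploration bonus are illustrative): two agents with identical histories apply the same deterministic OFUL/UCB rule, hence pull the same arm, so their design matrices remain identical.

```python
import numpy as np

def ucb_arm(arms, theta_hat, A_inv, beta):
    # Deterministic OFUL-style rule: argmax of <theta_hat, x> + beta * ||x||_{A^{-1}}.
    bonus = np.sqrt(np.einsum("nd,dk,nk->n", arms, A_inv, arms))
    return int(np.argmax(arms @ theta_hat + beta * bonus))

rng = np.random.default_rng(1)
arms = rng.normal(size=(10, 4))               # shared (static) arm set
theta_hat, A_inv, beta = rng.normal(size=4), np.eye(4), 1.0

# Same shared history => same (theta_hat, A_inv) => same argmax for both agents,
# so the rank-one updates A + x x^T keep the design matrices identical.
assert ucb_arm(arms, theta_hat, A_inv, beta) == ucb_arm(arms, theta_hat, A_inv, beta)
```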

Additional comments:

If you have any additional questions about the paper and the theoretical and empirical analysis, we would be happy to provide further clarification.

Please let us know if you need any additional information, clarification or modification that we could provide for you to improve your score.

Reviewer Comment

I greatly appreciate the authors' clarification in response to my questions, as well as their plan to incorporate additional material into the paper. I continue to maintain my positive recommendation for this work.

Final Decision

I appreciate this work, as collaborative learning in bandits is an important and practical topic, and the results obtained in this paper are solid. Two of the four reviewers are on the fence and remained so after the rebuttal. While the overall evaluation is somewhat borderline, I am leaning towards accepting the paper and concur more with the two more positive reviewers.