5.5

/10

Rejected4 位审稿人

最低5最高6标准差0.5

2.8

置信度

正确性3.0

贡献度3.0

表达2.5

ICLR 2025

Clusters Agnostic Network Lasso Bandits

Sofien Dhouib,Steven Bilaj,Behzad Nourani-Koliji,Setareh Maghsudi

OpenReview PDF

提交: 2024-09-27更新: 2025-02-05

TL;DR

We solve and theoretically analyze a multi-task contextual bandit problem given a task relation graph using a network lasso objective..

摘要

关键词

multitask learningcontextual banditsgraph total variationnetwork lassoclustering

评审与讨论

审稿意见

评分: 5置信度: 32024-11-04

This paper studies a multi-task linear contextual bandits setting. The tasks are embedded in a graph $G=(V,E)$ . The paper shows the regret bound for the network Lasso policy, which updates the parameter estimation in a predefined way, and presents numerical experiment results to show the effectiveness of the network Lasso policy.

优点

This paper gives a rather complete analysis to the network Lasso policy.

缺点

The paper mixes problem setting with algorithms which creates confusion: e.g., Eq. (2) about how $\hat \Theta$ is updated is not related to problem setting.
The paper sometimes refer to elements of $\mathcal V$ by task, sometimes by user (e.g. Line 129).

问题

The setting is agnostic to the graph. It then becomes opaque to me why we must utilize the graph structure, and how utilizing the graph structure could benefit.
The paper seems to restrict their study to the particular algorithm of network Lasso. This choice seems not justified, and I'm not convinced by its importance.
The regret bound seems to have no dependency on dimension $d$ . This seems strange, e.g. if we plug in $|\mathcal V| =1$ to Theorem 3. (Update: I'm better understanding the paper after the authors' response. Basically, to compare this with linear bandits, I think the polynomial dependency in $d$ becomes the polynomial dependency in a param that's related to the graph structure. I'd raise my score.)

评论- Response to Reviewer kEcQ

2024-11-19

Mixing problem setting with algorithms: Since one of our main contributions is the oracle inequality itself, we have written the network lasso problem in the problem setting. However, since our paper is a bandit algorithm paper, we can move that optimization problem to the algorithms section.
Referring to elements of $\mathcal{V}$ by tasks or users: As it is mentioned in the introduction section, each task represents a user in our motivating example of an e-commerce website. We will also add a sentence indicating that we use the terms 'task' and 'user' interchangeably.
The setting being agnostic to the graph: We respectfully disagree with the reviewer since the graph is given as part of the problem setting and it is used in the optimization framework in Equation (2) via the regularization term defined by edges and weights. The setting is agnostic to the cluster structure i.e. the agent is not explicitly aware which users belong to which clusters. It utilizes side information in the form of a given graph by using the network lasso regularization in order to implicitly learn the cluster structure. This type of approach is common when it comes to multi-task bandit problems with side information (e.g. [16,17]).
The choice of network lasso being unjustified: We respectfully disagree. Indeed, the network lasso problem is a well-known choice when one is given a graph and wants to estimate a piecewise constant signal over that graph [12,13,14]. Our choice is analogues to choosing the LASSO estimator when dealing with high dimensional sparse preference vectors $\theta$ in the single task case [1,2,3,5]
Absence of dependency on dimension $d$ : We respectfully disagree and point out that our regret bound indeed depends on the dimension $d$ . In Theorem 3 we have the term $\log(|\mathcal{V}|d)$ as part of the overall regret. Even in the case of $|\mathcal{V}|=1$ , we would still have a $\log(d)$ dependence.

审稿意见

评分: 6置信度: 22024-11-04

This paper propose a bandit algorithm based on network lasso and provide improved regret bound based on the technique.

优点

Network lasso seems to be an interesting technique applied to bandit problems.

缺点

The current existing benchmarks are mostly using different bandit algorithms designs than the proposed greedy one. Thus, I don't think these are fair comparisons. Typically, greedy bandit algorithms have better performance than non-greedy ones. Can the authors compare their algorithm with other greedy algorithms, such as greedy OLS-Bandit (Bastani et al., 2021) and Lasso-Bandit (Bastani and Bayati, 2020)?

问题

N/A

审稿意见

评分: 5置信度: 32024-11-04

This paper investigates multi-task contextual bandit settings, wherein the learner is provided with a graph that encodes the relationships among the bandit tasks. It is assumed that the preference vectors for these tasks are piecewise constant over the graph, thereby forming distinct clusters. To estimate the reward parameters, the paper formulates and solves a network lasso problem in each learning round.

优点

The paper addresses federated multi-task learning, a growing area of interest due to its collaborative nature that enhances the effectiveness of the learning process. The paper presents both theoretical and experimental results.

缺点

The paper presents an approach to solving the multi-task learning problem by partitioning tasks into clusters, stipulating that tasks within each cluster share the same reward model, meaning they utilize the same feature vector. This approach is particularly relevant in scenarios where the number of clusters exceeds the problem dimension $d$ ; otherwise, the problem effectively reduces to a low-rank multi-task learning problem, which has been extensively studied in the literature. Therefore, to assess the significance of the results, it is crucial to compare both the theoretical and experimental findings with those from low-rank multi-task learning studies. However, the simulations conducted only address cases where the number of clusters is greater than the dimension, which raises questions about the completeness of the experimental analysis.

问题

What are the relaxed symmetry and balanced covariance assumptions in Assumption 2, and how do these assumptions impact the practical applicability of the method?
Assumption 3 states that tasks within the same cluster share the same reward parameter. Under this condition, the problem aligns with the well-explored domain of low-rank multi-task learning. As noted in the paper, the only justification for this approach is when the number of clusters exceeds the dimension d. However, this scenario is typically uncommon and may limit the broader applicability of the method. To this end,

a) Please clarify how the approach differs from or improves upon low-rank multi-task learning methods when the number of clusters exceeds d.

b) Also, discuss potential applications or scenarios where having more clusters than dimensions would be relevant.
In all the experiments conducted, the number of clusters is smaller than $d$ , effectively reducing the problem to a low-rank multi-task learning problem, as studied in works such as Yang et al. and Lin et al. Therefore, the significance of the contribution needs to be reassessed, as it aligns with existing literature on low-rank multi-task learning. Some relevant representative works are: Yang et al. Impact of representation learning in linear bandits, arXiv:2010.06531, 2020. Lin et al., Fast and Sample Efficient Multi-Task Representation Learning in Stochastic Contextual, ICML, 2024.

a) Can you include experimental comparisons with low-rank multi-task learning methods for the scenarios presented in Figure 1?

b) Can you add experiments where the number of clusters exceeds d to demonstrate the unique benefits of their approach in those cases?
Can you provide an experimental comparison with the low-rank multi-task learning results presented in Yang et al. and Lin et al., given that all the plots in Figure 1 correspond to a low-rank case?
Can you compare the proposed approach with the low-rank multi-task learning results specifically for settings where the number of clusters exceeds the problem dimension? This comparison is crucial to demonstrate the effectiveness of the proposed method, as the paper emphasizes these scenarios.
Can you provide a table or figure comparing the regret bounds of the proposed approach in this paper with those of the relevant baseline methods? This comparison would help to clarify how the proposed method performs relative to established approaches in terms of regret guarantees. Further, can you discuss the implications of any differences in the regret bounds, particularly for cases where the number of clusters exceeds d?

伦理问题详情

None

评论- Response to Reviewer kZp1: part 2

2024-11-19

Comparing the regret bounds of the proposed approach in this paper with those of the relevant baseline methods: Our regret bound does not depend on the number of clusters, but rather on other properties of the graph such as the worst topological centrality index and the size of the boundary. In general, the bounds are incomparable, due to our high reliance on the graph. To make the number of clusters appear, we can make some additional assumptions:
- We assume that the algebraic connectivity of a cluster $|\mathcal{C}|$ is $\Omega(|\mathcal{C}|)$ . This holds for example for fully connected clusters having an algebraic connectivity equal to their size. As a result, from Proposition 3 of the appendix, the topological centrality index of such cluster $\mathcal{C}$ behaves as $\Omega(|\mathcal{C}|)$ .
- For a weightless graph, for each cluster, its vertices linking it to the outside have a number at most equal to the number of edges linking them to the boundary. Hence, we can bound the isoperimetric ratio as follows: $\max_\mathcal{C}\iota(\mathcal{C}) = \max_\mathcal{C}\frac{|\partial_v\mathcal{C}|}{|\mathcal{C}|} \leq 2\frac{w(\partial \mathcal{P})}{\min_{\mathcal{C}}|\mathcal{C}|}$ With such assumptions, we have

$\sqrt{\frac{\max_\mathcal{C}\iota(\mathcal{C})}{\min_\mathcal{C}c(\mathcal{C})}} \leq \sqrt{2}\frac{\sqrt{w(\partial\mathcal{P})}}{\min_\mathcal{C}|\mathcal{C}|}.$ where $|\mathcal{P}|$ is the number of clusters. As a result, the horizon-dependent part of the regret in Corollary 1 grows as $\tilde O\left(\sqrt{\overline{T}|\mathcal{V}|} + \frac{w(\partial\mathcal{P})}{\min_\mathcal{C}|\mathcal{C}|}\sqrt{\overline{T}|\mathcal{V}|}\right)$ . The minimum cluster size has a substantial influence on the performance in this case. Since it appears in the denominator, we need an assumption that it behaves as $\Omega(\frac{|\mathcal{V}|}{|\mathcal{P}|})$ . As a result, the last big $\tilde O$ term becomes $\tilde O\left(\sqrt{\overline{T}|\mathcal{V}|} + \frac{w(\partial\mathcal{P})}{\sqrt{|\mathcal{V}|}}|\mathcal{P}|\sqrt{\overline{T}}\right)$ . Thus, when the number of clusters is less than the dimension, we have a linear dependence in the rank of the preference vectors matrix via the $|\mathcal P|$ term.

Compared to the $\tilde O(\sqrt{r\overline{T} |\mathcal V| d}) + |\mathcal V| \sqrt{r\overline{T}})$ regret of [6], our linear dependence on the rank is better as the dependence of [6] is in $\sqrt{rd}$ . This should not be surprising since in our setting, the graph provides additional information that guides the estimation of these dimensions better than without it.
Compared to the $\tilde O(\sqrt{r\overline{T}|\mathcal{V}|})$ $\tilde{O} (r \overline{T} ∣ V ∣)$ of [7], their bound seemingly has a better dependence. However, one must take into account two observations about Theorem 5.4 of [7] that states the regret bound:
- the regret bound holds with a probability at least $O(1-\delta - d^{-10})$ , with the stated bound being independent of $d$ . Hence, seemingly the result gets better when one increases the dimension.
- Looking closer at the proof of Theorem 5.4 of [7], one can see that the regret is bounded by the quantity $\mathcal{G}\_{m} - \mathcal{G}\_{m-1}$ . This quantity is assumed in that theorem to be $\Omega(d)$ , and it is bounded by the number of rounds per task. This implies that the number of rounds per task is $\Omega(d)$ . However, in our paper, we do not assume any relation between the horizon and the dimensionality.

评论- Response to Reviewer kZp1: part 1

2024-11-19

Relaxed symmetry and balanced covariance assumptions in Assumption 2, and their impact on the practical applicability of the method: These are technical assumptions that [3] proposed for the first time, and that [4,5] also used in their analysis, with [5] studying a multi-task setting. According to [3], both assumptions are satisfied for several probability distributions. Whether weaker assumptions can be used to generalize to real-world data distributions is an open question worth considering.
The approach being advantageous only when the number of clusters exceeds the dimension: We did not mention anywhere in our paper that the only justification for this approach is when the number of clusters exceeds the dimension $d$ . Our approach is useful when there is an available graph representing side information that helps in accelerating learning. Had the graph not been available, other approaches leveraging the low rank structure would of course be more interesting. In other words, our work is about leveraging the available graph's stucture conveniently. 2.a Difference from of improvement upon low-rank multi-task learning methods when the number of clusters exceeds d: As we mentioned in the appendix, the rank of the matrix of feature vectors in our case is at most equal to $\min(|\mathcal{P}|,d), which is equal to$ d$ when the number of clusters exceeds the dimension. 2.b Example where the number of clusters is more than the number of dimensions: We think that such a setting makes sense if we want to group the users of an e-commerce website for example due to storage constraints. The dimension can be reduced via some preprocessing step, leading to it being less than the number of clusters.
The experimental setting being restricted to the case where the dimension is higher than the number of clusters : Nowhere in the paper have we claimed that our algorithm is only applicable to the low-rank setting. In fact, it can very well be applied in the case that the number of clusters exceeds the number of dimensions. We chose the low-rank setting to showcase the advantageous dependence on the dimension, guaranteed by our regret bound. Even though [6] corresponds to the low-rank setting, our algorithm surpasses the TraceNorm approach as it leverages the available graph conveniently 3.a Experimental comparisons with low-rank multi-task learning methods for the scenarios presented in Figure 1: First, we insist that our goal is not to solve the multi-task setting with a low rank, but to conveniently leverage a graph encoding similarities between tasks when it is available and when the tasks signal is constant over it. Second, we included a comparison to the TraceNorm policy of [6] in our paper, and we do not find any online implementation of [7]. Moreover, we think that [6] is the best low-rank approach to compare to as it uses the same type of assumptions we have in our work (RE/RSC condition, relaxed symmetry, balanced covariance ...) and similar proof techniques. 3.b Experiments where the number of clusters exceeds $d$ : We provide two sets of additional experiments with $d=2, |\mathcal{P}| = 20$ and $d=10, |\mathcal{P}| = 25$ in a new appendix section D.4, in the revised version, that we write in blue. Both of these cases do not correspond to the low-rank setting, and we can see that our algorithm still outperforms the low-rank approach of [6].
Comparison with the low-rank multi-task learning results presented in Yang et al. and Lin et al., given that all the plots in Figure 1 correspond to a low-rank case: Kindly refer to our answers to questions 3.a and 3.b.
Comparing the proposed approach with the low-rank multi-task learning results specifically for settings where the number of clusters exceeds the problem dimension: Kindly refer to our answer to question 3.b.

审稿意见

评分: 6置信度: 32024-11-08

The paper explores a multi-task contextual bandit setting where tasks are connected by a graph structure, encoding relationships between them. Here, user preference vectors are assumed to form clusters within the graph, but the algorithm doesn’t explicitly determine these clusters. Instead, it uses a network lasso approach with a dynamic regularization parameter, promoting smoothness in preferences within clusters while allowing for distinct clusters across the graph.

优点

Key contributions include:

Oracle Inequality: The authors establish an oracle inequality leveraging a restricted eigenvalue condition, which helps ensure the accuracy of preference vector estimations.
Regret Bound: The proposed approach achieves sublinear regret, showing improved performance over independent task models, especially in large-scale graphs with high-dimensional contexts.
Theoretical and Empirical Results: The paper provides theoretical bounds on estimation error and regret, supported by experiments comparing this method to other graph-based and clustering bandit algorithms, demonstrating its effectiveness in high-dimensional settings.

The work is relevant for recommendation systems and similar applications where user preferences are unknown initially but can be inferred from limited interactions, exploiting known relationships among tasks to enhance learning efficiency.

缺点

While the paper proposes a new problem, the results do not particularly exploit the network LASSO structure as much. They seem to be a straightforward extension of existing results from Oh et. al. (2021) to this setting.

问题

How do I interpret your results and their dependence on various parameters associated with the clusters? You mention different types of network structures but the results are very opaque ( so are the implications of the assumptions).

AC 元评审

2024-12-21

The authors study an instance of the Network Lasso problem within the context of graph-structured clustering bandit problems. They provide an oracle inequality and relate it to fundamental quantities that characterize the relationship between the graph and the true preference vectors of the users. A regret bound is derived, and simulated experiments support the effectiveness of the proposed method. The paper is generally solid, with a comprehensive set of theoretical results and empirical demonstrations. However, reviewers raised concerns about the clarity of the paper, particularly regarding the contribution and interpretation of the theoretical results, due to a lack of detailed discussion. For example, the presentation of the theoretical results could benefit from further elaboration on opaque quantities in the regret bound and the tightness of the dependence on graph parameters. Additionally, the reviewer requested a comparison with relevant methods such as greedy OLS-Bandit, but the authors did not directly address this either with results or discussion. As a result, the paper is borderline rejected in its current form. I encourage the authors to improve the discussion of the theoretical results and provide a comparison with the greedy algorithm in the revised version.

审稿人讨论附加意见

During the discussion, the authors addressed some of the clarity concerns by referencing related literature. However, the interpretation of the theoretical results remains insufficiently addressed, and experiments comparing the proposed method with related approaches have not been added. As a result, the reviewers were not enthusiastic about the paper in its current form.

最终决定Reject

2025-01-22

Reject