Concurrent Reinforcement Learning with Aggregated States via Randomized Least Squares Value Iteration
We prove a worst-case regret bound for concurrent RLSVI, reducing the space complexity by a factor of $K$ at the cost of only a $\sqrt{K}$ increase in regret compared to the single-agent algorithms of \citep{agrawal2021improved,russo2019worst}.
Abstract
Reviews and Discussion
The authors extend the classic Least Squares Value Iteration (LSVI) method to the setting of multi-agent systems, and show guarantees for a randomized variant.
Questions for Authors
Please add an Impact Statement :)
Page 2: Line 94: Can you please elaborate on the difference between the techniques used?
Page 3: Why is the value function defined in this way?
Page 4: Can you please further explain the reward and transition-probability estimators?
Page 4: Can you please elaborate on the concept of aggregated-state representations?
Page 5: I do not understand the perturbation that is captured by Equation (4).
Page 6: Can you please clarify the choice of Equation (7)?
Page 8: More discussion on future work is required! :)
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
Yes, to some extent.
Experimental Design and Analysis
Yes, to some extent.
Supplementary Material
No.
Relation to Prior Literature
The theoretical guarantees cover an important gap in the literature.
Missing Essential References
No.
Other Strengths and Weaknesses
None.
The writing is clear!
Other Comments or Suggestions
None.
We thank the reviewer for the positive feedback!
-
We will add an impact statement later :)
-
[2] improves the regret bound of [1] by using clipping in their algorithm to avoid unreasonable estimates of the value functions. The rest of their algorithm closely follows [1], using a similar proof approach: constructing a confidence set for the empirical MDP to bound its deviation from the true MDP. They decompose the regret similarly to [1] and obtain a tighter upper bound on the approximation error of the $Q$-value function thanks to their clipping technique.
Our paper uses a different algorithm and proof strategy from [2] to improve the regret bound of [1]. Firstly, we aggregate information at the aggregated-state level, unlike the algorithms in [1,2], which record information for every state-action pair. Secondly, we use a different loss function and penalty term for updating the estimates of the $Q$-value functions. By the first-order condition, the estimated $Q$-value function is updated according to the derivation between line 1004 and line 1028, and the perturbation and penalty parameters are carefully chosen based on the concentration bound in Lemma 8, so that (i) Lemma 9 holds, which is a key step in the proof of Lemma 10 via an iteration trick; and (ii) the term in (37) between line 1375 and line 1389 can be controlled by a tight upper bound using the Cauchy-Schwarz inequality.
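As a generic template only (the paper's exact loss, penalty, and noise terms differ; the symbols below, including the noise $w$, the dataset $\mathcal{D}_h$, and the penalty weight $\lambda$, are illustrative), RLSVI-style updates solve a randomly perturbed regularized least-squares problem whose first-order condition yields the estimate in closed form:

$$
\hat{Q}_h \in \arg\min_{Q} \;\sum_{(s,a,r,s') \in \mathcal{D}_h} \Bigl( Q(s,a) - \bigl( r + w + \max_{a'} \hat{Q}_{h+1}(s',a') \bigr) \Bigr)^{2} \;+\; \lambda \,\mathrm{pen}(Q),
$$

which is why the scale of the noise and the penalty weight are exactly the parameters that a concentration bound such as Lemma 8 is used to pin down.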
-
$V^{\pi}_h(s)$ is the value function of adopting policy $\pi$ at state $s$ during period $h$, which is a standard definition (see e.g. [3]). We include $h$ in the subscript because we consider inhomogeneous transition probabilities and reward functions (i.e., the transition probabilities and reward functions are allowed to differ at each period in our model).
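For concreteness, one standard way to write this definition (with generic notation; the paper's exact symbols for the period-dependent rewards $r_t$, transitions $P_t$, and horizon $H$ may differ) is

$$
V^{\pi}_h(s) \;=\; \mathbb{E}\!\left[ \sum_{t=h}^{H} r_t\bigl(s_t, \pi_t(s_t)\bigr) \;\middle|\; s_h = s \right],
\qquad s_{t+1} \sim P_t\bigl(\cdot \mid s_t, \pi_t(s_t)\bigr),
$$

where the period index $t$ on $r_t$ and $P_t$ reflects the inhomogeneity mentioned above.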
-
On the one hand, the first pair of quantities are the estimates of the rewards and transition probabilities used in a given episode, built upon all the historical data across all agents from the first episode up to the previous one; here the reward estimate is the estimated reward received at a state-action pair at a given time step, and the transition estimate is the estimated probability of transferring to a state when taking an action at a state during that time step. On the other hand, the second pair of estimates only uses data from the latest episode (much less data compared to the previous two estimates). The algorithm utilizing the all-history estimates has a lower regret bound because it uses more data in the buffer (information set), but it can be computationally infeasible, especially as the number of agents gets large, as we explained in the paper.
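As an illustration with generic notation (the symbols below, e.g. the pooled visit counts $n^{k}_h(s,a)$ and $n^{k}_h(s,a,s')$, are not necessarily the paper's), the all-history empirical estimates used in episode $k$ take the familiar form

$$
\hat{r}^{\,k}_h(s,a) \;=\; \frac{1}{n^{k}_h(s,a)} \sum_{\text{visits of }(s,a)\text{ at step }h\text{ before episode }k} r,
\qquad
\hat{P}^{\,k}_h(s' \mid s,a) \;=\; \frac{n^{k}_h(s,a,s')}{n^{k}_h(s,a)},
$$

while the episode-only variants are built from the same formulas restricted to the transitions collected by the agents in the latest episode alone.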
-
As we pointed out at the beginning of Subsection 2.1, when the state-action space is too large, the algorithm can be computationally infeasible. So we aggregate state-action pairs that have close $Q$-values under the optimal policy. Under the error-bounded aggregated-state representation, state-action pairs are placed into the same block if the differences of their optimal $Q$-values are within the aggregation error specified in Definition 1.
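A common way to formalize this kind of aggregation (our Definition 1 may differ in details; the aggregation map $\phi$, tolerance $\varepsilon$, and optimal values $Q^{*}_h$ below are generic symbols) is to require

$$
\phi_h(s,a) = \phi_h(s',a') \;\Longrightarrow\; \bigl| Q^{*}_h(s,a) - Q^{*}_h(s',a') \bigr| \le \varepsilon,
$$

so that each block only groups pairs whose optimal values differ by at most $\varepsilon$.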
-
In Equation (4), randomization is applied to the value function estimates by perturbing the observed rewards, which drives exploration and is essential to the randomized LSVI algorithm (see [4] for more details).
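Schematically (this is the generic RLSVI recipe of [4], not necessarily the exact form of our Equation (4); $\sigma$ is an illustrative noise scale), each observed reward $r$ is replaced by a noisy copy before the least-squares fit:

$$
\tilde{r} \;=\; r + w, \qquad w \sim \mathcal{N}(0, \sigma^{2}),
$$

so that each regression on the perturbed data produces a randomly optimistic value estimate, which in turn randomizes the greedy policy and drives exploration.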
-
As we responded in the 2nd point, the quantity in Equation (7) is a parameter carefully derived from the concentration bound in Lemma 8 in Appendix B, so that Lemma 9 holds and the relevant terms in the proof can be controlled by a tight upper bound. It is a technical choice, but it is important for making both the algorithm and the proof work.
-
We thank the reviewer for the suggestion about future work. We will extend this discussion in a revised version. For example, we plan to implement our algorithm on real-world data, derive concurrent RL results for function approximation beyond the current tabular setup, and compare our results with the existing concurrent function-approximation literature (e.g., [5,6]).
[1] Russo, Daniel. "Worst-case regret bounds for exploration via randomized value functions." Advances in Neural Information Processing Systems 32 (2019).
[2] Agrawal, Priyank, Jinglin Chen, and Nan Jiang. "Improved worst-case regret bounds for randomized least-squares value iteration." Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 8, 2021.
[3] Bertsekas, Dimitri, and John N. Tsitsiklis. Neuro-dynamic programming. Athena Scientific, 1996.
[4] Osband, Ian, et al. "Deep exploration via randomized value functions." Journal of Machine Learning Research 20.124 (2019): 1-62.
[5] Desai, Nishant, Andrew Critch, and Stuart J. Russell. "Negotiable reinforcement learning for pareto optimal sequential decision-making." Advances in Neural Information Processing Systems 31 (2018).
[6] Min, Yifei, et al. "Cooperative multi-agent reinforcement learning: asynchronous communication and linear function approximation." International Conference on Machine Learning. PMLR, 2023.
The paper presents a novel approach to concurrent reinforcement learning (RL) using Randomized Least Squares Value Iteration (RLSVI) with aggregated states. The authors propose a framework where multiple agents interact with a common environment, sharing their experiences to improve decision-making collectively. The paper also discusses the role of the discount factor in RL, arguing that it serves primarily as a tuning parameter to stabilize value updates rather than being an intrinsic part of the environment.
Questions for Authors
- How does the proposed algorithm compare to other concurrent RL algorithms in terms of both theoretical guarantees and empirical performance?
Claims and Evidence
See the comments below
Methods and Evaluation Criteria
See the comments below
Theoretical Claims
See the comments below
Experimental Design and Analysis
See the comments below
Supplementary Material
See the comments below
Relation to Prior Literature
See the comments below
Missing Essential References
See the comments below
Other Strengths and Weaknesses
Strengths:
-
The paper addresses an important problem in RL, particularly in the context of multi-agent systems.
-
The theoretical results are novel and provide improvements over existing bounds.
-
The empirical results are consistent with the theoretical predictions, providing strong evidence for the efficacy of the proposed algorithms.
Weaknesses:
-
The empirical validation is limited to synthetic environments. It would be beneficial to see how the algorithm performs in more complex, real-world scenarios.
-
The paper could benefit from a more detailed discussion of the practical implications of the theoretical results, particularly in terms of scalability and computational efficiency.
Other Comments or Suggestions
I suggest adding a systematic comparison table to explicitly contrast the proposed approach with existing concurrent RL methods, highlighting theoretical guarantees, empirical performance, and practical considerations to emphasize its novelty and superiority.
We thank the reviewer for the positive feedback and constructive comments.
-
We would like to emphasize two points regarding the role of experiments in this work. First, the primary contribution of this paper is the theoretical analysis of concurrent model-free RLSVI algorithms. Notably, neither of the two most closely related works [1,2], which focus on single-agent RLSVI, includes numerical results. Second, our experiments on synthetic data are explicitly designed to validate the theoretical findings in a controlled environment, a common practice in the theoretical literature. Importantly, unlike [1,2], our theoretical results are derived for practical, storage-efficient algorithms, and the corresponding numerical validations provide valuable insights for practitioners. Nevertheless, we agree with the reviewer that extending the experiments to real-world datasets would be an interesting direction for future exploration, which we plan to pursue.
-
For scalability, as detailed in lines 248–258 (the paragraph under Equation (6)), prior approaches [1,2] require storing all historical data, making them computationally impractical at scale, while ours is storage-efficient. Following the reviewer's suggestion, we provide a comparison table below to explicitly contrast our approach with existing related work, including the two most closely related works [1,2] on single-agent RLSVI:
| Agent | Setup | Algorithm | Regret Bound | Regret-Type | Data Stored | Numerical |
|---|---|---|---|---|---|---|
| Single | Tabular | RLSVI [1] | | Worst-case | All-history | N/A |
| Single | Tabular | RLSVI [2] | | Worst-case | All-history | N/A |
| Multi | Tabular | Concurrent RLSVI [3] | N/A | Bayes | All-history | Synthetic |
| Multi | Linear function approximation | Concurrent LSVI [4] | | Worst-case | All-history | N/A |
| Multi | Linear function approximation | Concurrent LSVI [5] | | Worst-case | All-history | N/A |
| Multi | Tabular | Concurrent RLSVI (ours-1) | | Worst-case | All-history | N/A |
| Multi | Tabular | Concurrent RLSVI (ours-2) | | Worst-case | One episode | Synthetic |
The following discussions will be added to the paper:
-
The aggregated-state count appearing in our regret bounds denotes the number of aggregated states. Working at this level reduces complexity and accelerates learning by focusing on grouped state-action pairs.
-
To the best of our knowledge, all existing concurrent RLSVI work focuses on the finite-horizon case, while we also cover the infinite-horizon case. The regret bound for the infinite-horizon case is similar to that of the finite-horizon case, with the episodic quantity replaced by the total number of time steps.
-
The methods marked as "N/A" in the "Numerical" column store all agents' trajectories at every step, making them computationally infeasible as the number of agents grows large. These approaches (including our first finite-horizon algorithm, the second-to-last row in the table) provide only theoretical analyses without numerical results.
-
Although [3] stores all data empirically, their RLSVI algorithm assumes a known parametric feature matrix, making it simpler to implement than ours. They also evaluate only Bayes regret, which is less stringent than our worst-case regret, and provide empirical results exclusively on synthetic data without theoretical guarantees.
-
The last row corresponds to storing only the latest episode's data, which increases the regret bound by a factor of $\sqrt{K}$ but reduces the space complexity by a factor of $K$, making the algorithm computationally feasible; a minimal storage sketch is given below.
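To illustrate the storage contrast only (a toy sketch; the class names `AllHistoryBuffer` and `LatestEpisodeBuffer` and the aggregated-state statistics kept here are hypothetical, not the paper's implementation):

```python
from collections import defaultdict

class AllHistoryBuffer:
    """Keeps every transition from every agent in every episode, so memory grows with
    the number of agents and episodes (the design of the all-history rows above)."""
    def __init__(self):
        self.transitions = []  # (episode, agent, step, state, action, reward, next_state)

    def add(self, episode, agent, step, state, action, reward, next_state):
        self.transitions.append((episode, agent, step, state, action, reward, next_state))


class LatestEpisodeBuffer:
    """Keeps only running statistics of the latest episode, keyed by (step, aggregated block),
    and discards everything once a new episode starts."""
    def __init__(self):
        self.current_episode = None
        self.stats = defaultdict(lambda: [0, 0.0])  # (step, block) -> [visit count, reward sum]

    def add(self, episode, step, block, reward):
        if episode != self.current_episode:  # new episode: drop the old data
            self.current_episode = episode
            self.stats.clear()
        entry = self.stats[(step, block)]
        entry[0] += 1
        entry[1] += reward
```

The second buffer's footprint is bounded by the number of (step, block) pairs touched within one episode, independent of how many past episodes have been played.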
[1] Russo, Daniel. "Worst-case regret bounds for exploration via randomized value functions." Advances in Neural Information Processing Systems 32 (2019).
[2] Agrawal, Priyank, Jinglin Chen, and Nan Jiang. "Improved worst-case regret bounds for randomized least-squares value iteration." Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 8, 2021.
[3] Taiga, Adrien Ali, Aaron Courville, and Marc G. Bellemare. "Introducing Coordination in Concurrent Reinforcement Learning." ICLR 2022 Workshop on Gamification and Multiagent Solutions.
[4] Desai, Nishant, Andrew Critch, and Stuart J. Russell. "Negotiable reinforcement learning for pareto optimal sequential decision-making." Advances in Neural Information Processing Systems 31 (2018).
[5] Min, Yifei, et al. "Cooperative multi-agent reinforcement learning: asynchronous communication and linear function approximation." International Conference on Machine Learning. PMLR, 2023.
The paper adapts the concurrent learning framework via randomized least squares value iteration with an aggregated state representation, to improve exploration efficiency and the worst-case regret bound. Extensive experiments are conducted to support the theoretical findings.
Questions for Authors
None.
Claims and Evidence
The claims are well supported by evidence.
Methods and Evaluation Criteria
The methods and evaluation make sense to me.
Theoretical Claims
Yes. The proofs are correct.
Experimental Design and Analysis
Yes, the experimental designs are valid.
Supplementary Material
No.
Relation to Prior Literature
The paper is related to the concurrent randomized least-squares value iteration literature.
Missing Essential References
None.
Other Strengths and Weaknesses
Strength
- The paper focuses on an important problem in RL, with a sound proof of a worst-case regret bound that improves over existing ones.
Weakness
- The experiments are only on synthetic datasets. Experiments on real-world datasets could demonstrate the improvements better.
Other Comments or Suggestions
None.
We thank the reviewer for the positive feedback and constructive comments. We are aware that our experiments are only on synthetic data. We would like to emphasize two points regarding the role of experiments in this work.
Firstly, the main focus and contribution of our work is the first theoretical analysis of concurrent model-free RLSVI algorithms. It is worth noting that neither of the two most closely related papers [1,2] (single-agent RLSVI) includes any numerical results.
Secondly, we conduct experiments on synthetic data specifically to validate the theoretical findings in a controlled manner, in line with common practice in the theory literature. It is also worth noting that, unlike [1,2], our theoretical results are based on practical, storage-efficient algorithms, and the carefully designed numerical validations provide valuable insights for practitioners. We will add a table comparing our results with existing theoretical results to discuss the practical implications in terms of scalability and computational efficiency.
That said, we agree with the reviewer that experiments on real-world data would be an interesting direction to extend the current work, and we plan to explore it in the future.
[1] Russo, Daniel. "Worst-case regret bounds for exploration via randomized value functions." Advances in Neural Information Processing Systems 32 (2019).
[2] Agrawal, Priyank, Jinglin Chen, and Nan Jiang. "Improved worst-case regret bounds for randomized least-squares value iteration." Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 8, 2021.
This paper proposes a concurrent reinforcement learning framework based on randomized least-squares value iteration (RLSVI) in multi-agent systems to address the problem of exploration. It demonstrates improvements in efficiency and provides theoretical worst-case regret guarantees, supported by synthetic experiments.
All reviewers agree on the clear motivation and contribution of the proposed method, the soundness and clarity of the theoretical analysis, the validity of the experimental setup, and the overall rigor of the presentation. A common concern was the lack of experiments on real-world tasks and the need for more detailed discussion on the implications and comparisons of the results. The authors addressed these points by explaining the rationale behind using controlled experiments and providing a systematic comparison with prior works. Reviewer 35te also raised several technical questions, which were adequately addressed in the rebuttal.
The authors' response was thorough and resolved most of the reviewers' concerns. Given its novelty, technical depth, and relevance, this paper would be a valuable addition to ICML.