An Extensive Analysis on the Underlying Premises Behind Deep Reinforcement Learning Algorithm Design
Abstract
Reviews and Discussion
The paper promises to provide an extensive analysis of the underlying premises behind RL algorithm design. Specifically, it promises to use the low- and high-data regimes as the explanation. It provides a theoretical analysis, and it provides experiments with a small number of algorithms.
Strengths
An analysis of the difference between large and small experiments would be quite useful, and could provide insight into whether small experiments can be used to predict large-regime outcomes.
Weaknesses
It appears to fulfil this promise, although questions remain. It does not explain the reason behind the low/high-data-regime performance of different classes of algorithms. The theoretical results are unconvincing; no reason is given why Proposition 4.2 explains the algorithmic behaviour. The experiments do not provide enough detail (hyperparameters, seeds) to see if the reported differences in performance are statistically significant. The writing style is full of hyperbole and is repetitive. No convincing insight is provided in this paper. The writing of the capital Q is non-standard and distracting.
Questions
Anonymous Code is not available. Can you please provide this?
Can you explain the conclusions that you draw based on the theoretical analysis?
Can you please explain the choice of algorithms?
Can you please tone down the claims in the text, to make them more in line with a regular paper?
Thank you for allocating your time to provide feedback on our paper.
1. “The experiments do not provide enough detail (hyperparameters, seeds) to see if the reported differences in performance are statistically significant”
The experiments do indeed provide details. Please see supplementary material.
2. “Can you please explain the choice of algorithms?”
This is already explained in the first paragraph of Section 4.
3. "It provides experiments with a small number of algorithms."
Our paper provides comparisons of 14 different algorithms across the entire Arcade Learning Environment benchmark, both in the low-data regime (Atari 100K) and in the high-data regime (200 million frames of training). Our study is one of the largest experiments conducted in reinforcement learning research on algorithm performance comparison across RL benchmarks.
4. “Could provide insight into whether small experiments can be used to predict large-regime outcomes.”
Regarding the focus of our paper please see Author Response item 1 for Reviewer ReTn.
5. “Can you please tone down the claims in the text, to make them more in line with a regular paper?”
Yes, we would be willing to tone down the style, as long as the paper still conveys its message about the implicit assumptions in deep reinforcement learning research that lead to systematically faulty decisions in algorithm design.
Thank you very much for the responses. I understand your points better now; thank you for explaining your reasoning. In light of the other reviews, my original rating may have been generous, and I also question the novelty of the insights. I will, however, not downgrade my rating but leave it unchanged.
The authors question a common assumption in low-data RL research that suitable baselines are the best-performing RL algorithms in high-data RL research (with tuned hyperparameters). They derive a theorem in the linear function approximation setting that shows that lower capacity function approximators (with larger approximation error) can outperform larger capacity models (with smaller approximation error) in the low data regime. They then gather results with a range of standard DRL methods on the Atari 100k benchmark to show their findings extend empirically to current research on DRL.
Strengths
The authors question a common preconception in the RL community, which is always good to check. After providing the requisite background on the general settings and specific algorithms, they go on to prove their claim in the (linear) function approximation setting. And finally, the authors show that this holds with DRL algorithms on the benchmark that researchers are currently using, with appropriate consideration of changes in hyperparameters for the reduced sample complexity regime.
Weaknesses
The authors only focus on DQN variants, and in particular, the form of the value function learned. The more expressive the value function, the better it does asymptotically, but the worse it does in the low-data regime. This therefore calls into question how general their claims are.
Questions
Could you provide more clarity regarding how far your theory and results apply only to the form of value function (scalar vs. various forms of distributional)? It would be useful to be clearer about what the scope of the paper is, because the title and abstract make very general claims about the "data-abundant regime" and the "data-limited regime", whilst the latter half of the paper is restricted to DQN variants.
Edit: I have read the other reviews, the authors' feedback, and the updated paper. I believe the authors were able to address some questions, but I feel there is still a mismatch between the generality of the claims and the experiments, which focus on scalar vs. distributional value functions. As such, I will retain my original rating.
Thank you for dedicating your time to provide a well-considered review.
1. “Could you provide more clarity regarding how far your theory and results apply only to the form of value function (scalar vs. various forms of distributional)? It would be useful to be clearer about what the scope of the paper is, because the title and abstract make very general claims about the "data-abundant regime" and the "data-limited regime", whilst the latter half of the paper is restricted to DQN variants.”
We would like to highlight that the focus of the paper is to understand the implicit assumptions in deep reinforcement learning, and to explicitly discuss how these assumptions can lead to faulty decisions in algorithm design that can systematically affect many consecutive studies in reinforcement learning research (including but not limited to [1,2,3,4,5]). The particular instances of these assumptions here happen to concern, as can be observed from the extensive line of work [1,2,3,4,5], the Rainbow algorithm, which is based on distributional reinforcement learning. We believe that an explicit, concrete analysis of these implicit assumptions might actually help in understanding the effects and the costs of these faulty decisions, and help clarify this line of research and future research focuses.
[1] Max Schwarzer, Johan Obando-Ceron, Aaron Courville, Marc G. Bellemare, Rishabh Agarwal, Pablo Samuel Castro. Bigger, Better, Faster: Human-level Atari with human-level efficiency, ICML 2023.
[2] Max Schwarzer, Ankesh Anand, Rishab Goel, R. Devon Hjelm, Aaron C. Courville, Philip Bachman. Data-Efficient Reinforcement Learning with Self-Predictive Representations. ICLR 2021.
[3] Model-Based Reinforcement Learning for Atari, ICLR 2020.
[4] Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels, ICLR 2021.
[5] PLASTIC: Improving Input and Label Plasticity for Sample Efficient Reinforcement Learning, NeurIPS 2023.
The paper contributes theoretical arguments for the non-monotonic performance profiles of different algorithms when constrained to various data regimes. The proofs are concerned either with linear value function approximation or with the general problem of learning the distribution of a random variable or its expected value. It then proceeds to illustrate the theoretical claims empirically, in the context of deep learning algorithms on the Atari 100k benchmark.
Strengths
This contribution adds to the body of work that highlights the difficulties of evaluating RL agents and points to some implicit assumptions that could lead to unwarranted performance claims.
It provides a theoretical argument for why larger capacity models can underperform in a low-data regime (compared to low capacity models), which I found sound as far as I could follow it, but which I also find limiting since it only concerns linear value function approximation and, in some respects, is empirically rendered vacuous by recent work, such as Schwarzer 2023.
Similarly, it reiterates some concerns regarding the merits of distributional RL by employing a statistical argument regarding the difficulty of learning a distribution as opposed to a mean. It should be noted that these concerns appear multiple times in the literature (e.g. Lyle 2019, Bellemare 2023, and even the original Bellemare 2017 paper).
Schwarzer, 2023: https://arxiv.org/pdf/2305.19452.pdf
Lyle, 2019: https://arxiv.org/pdf/2305.19452.pdf
Bellemare, 2023: https://www.distributional-rl.org/
Weaknesses
I found the theoretical work limiting, as it is only concerned with the linear case and therefore ignores the many optimisation challenges of Deep RL (see Lyle, 2023, for a good example of the difficulties of pinpointing which exact intervention in the algorithm correlates with performance), which could easily dominate the sample complexity for any given data regime.
Furthermore, I don't find the empirical work very compelling:
- The paper repeatedly depicts distributional RL as being inefficient in the low-data regime. While I completely applaud the call for simpler baselines (as the authors are doing) and better ablations, the paper does not include SPR, yet another Rainbow-based, distributional algorithm, that generally was found to outperform most other methods in Atari 100k (see Agarwal 2021 for a solid empirical study and Schwarzer 2023 for recent work that uses SPR as a starting point).
Including SPR or any of its variations would invalidate the empirical observation claimed in this paper that in the low-data regime "the performance profile of the simple base algorithm dueling (sic) is significantly better than any other algorithm that learns the state-action value distribution" or, elsewhere, "the simple base algorithm dueling (sic) performs significantly better than any algorithm that focuses on learning the distribution".
- While I find the claim that algorithms developed in one data regime won't monotonically transfer to other data regimes is generally true, I don't think it is a novel or surprising observation. There are several papers on tuning Rainbow alone (van Hasselt 2019 with DER, Kielak 2019 with OTR, more recently Schmidt et al) that make this exact point, that various data constraints will require essentially different algorithms.
Schwarzer, 2023: https://arxiv.org/pdf/2305.19452.pdf
Lyle, 2023: https://arxiv.org/pdf/2303.01486.pdf
Agarwal, 2021: https://arxiv.org/pdf/2108.13264.pdf
Schmidt: https://openreview.net/pdf?id=GvM7A3cv63M
Questions
- Any reasons for not including SPR in the empirical evaluation?
- If the aim is to illustrate the large sample complexity of distributional RL in the low-data regime, why not pick the best-performing distributional algorithm on Atari 100K, replace the objective with the expected version, and fine-tune both to see if there is indeed a difference?
- I couldn't find details on how Figure 3 was generated. A short description of the experiment in the main body followed by detail in the annex would help.
Thank you for allocating your time to provide a review of our paper.
1. ”While I find the claim that algorithms developed in one data regime won't monotonically transfer to other data regimes is generally true, I don't think it is a novel or surprising observation. There are several papers on tuning Rainbow alone (van Hasselt 2019 with DER, Kielak 2019 with OTR, more recently Schmidt et al) that make this exact point, that various data constraints will require essentially different algorithms.”
We would like to kindly highlight that this description of the prior work, which we already cite in our paper [1,2,3], is not accurate. These studies demonstrate that one cannot reuse in Atari 100K the exact same hyperparameters that are tuned for 200 million frames of training, and they state that, if the hyperparameters are specifically tuned, the best-performing algorithm in the high-data regime will again be the best-performing one in the low-data regime.
On the other hand, our paper demonstrates that, even when hyperparameters are specifically tuned for the data regime, one should not and cannot expect the performance profile of an algorithm to transfer monotonically from the high-data regime to the low-data regime, and vice versa.
[1] When to use parametric models in reinforcement learning? NeurIPS 2019.
[2] Do recent advancements in model-based deep reinforcement learning really improve data efficiency? CoRR, 2019.
[3] Dominik Schmidt and Thomas Schmied. Fast and Data-Efficient Training of Rainbow: An Experimental Study on Atari.
2. “Including SPR or any of its variations would invalidate the empirical observation claimed in this paper that in the low-data regime "the performance profile of the simple base algorithm dueling (sic) is significantly better than any other algorithm that learns the state-action value distribution" or, elsewhere, "the simple base algorithm dueling (sic) performs significantly better than any algorithm that focuses on learning the distribution".”
We believe there is some confusion here. The SPR algorithm simply uses a temporally self-predictive model to predict future state representations on top of the Rainbow algorithm. Thus, in SPR the Rainbow algorithm can simply be replaced by the dueling algorithm; in other words, the SPR paper could have been built on top of the dueling algorithm instead of Rainbow. Just as in the other studies that we discuss in our paper, SPR is built on top of the Rainbow algorithm because, again as we discuss in our paper, Rainbow was the best-performing algorithm at the time. Again, in this paper we see no ablation study on the algorithm choice (i.e., what SPR is built on). The question here, once more, is what would have happened if the SPR paper had not implicitly assumed that, since Rainbow is the best-performing algorithm in the high-data regime, it must be so in the low-data regime as well, and had instead conducted ablation studies on the algorithm choice, or even simply used the dueling algorithm?
Thus, the SPR case is another instance of exactly the implicit assumptions that appear in many recent works, which our paper highlights, discusses, and provides a deep, concrete, explicit analysis of.
SPR: Max Schwarzer, Ankesh Anand, Rishab Goel, R. Devon Hjelm, Aaron C. Courville, Philip Bachman. Data-Efficient Reinforcement Learning with Self-Predictive Representations. ICLR 2021.
3. ”It provides a theoretical argument for why larger capacity models can underperform in a low-data regime (compared to low capacity models), which I found sound as far as I could follow it, but which I also find limiting since it only concerns linear value function approximation and, in some respects, is empirically rendered vacuous by recent work, such as Schwarzer 2023.”
Please note that the algorithm proposed in Schwarzer et al. (2023) [1] is also SPR on top of Rainbow. Hence, this paper is also making the exact same faulty assumption that our paper describes, discusses and provides a deep concrete explicit analysis for.
Furthermore, note that these faulty implicit assumptions and underlying premises still persistently appear in a large body of work, even at a more recent conference to be held in December 2023, i.e., NeurIPS 2023 [2].
It is evident that learning a distribution is harder than learning its mean, yet there is an entire research line making this mistake systematically, as discussed above and in our paper. Thus, explaining this explicitly and in detail may be justified in this case, to provide insight into why these mistakes lead to systematically faulty decisions in algorithm design.
[1] Max Schwarzer, Johan Obando-Ceron, Aaron Courville, Marc G. Bellemare, Rishabh Agarwal, Pablo Samuel Castro. Bigger, Better, Faster: Human-level Atari with human-level efficiency, ICML 2023.
[2] PLASTIC: Improving Input and Label Plasticity for Sample Efficient Reinforcement Learning, NeurIPS 2023.
Thank you for your answers.
1. I agree, strictly speaking [1,2,3] don't argue that relative rankings of methods developed for one data regime necessarily don't hold for other data regimes.
They do show that large gaps in performance can be closed by careful tuning and interventions on the base implementation that eventually leads to a different algorithm. For example, in order to reach the performance of SimPLE authors of [2] tune the dopamine version of Rainbow which does not have dueling, noisy nets or over-estimation correction. Sure, it's Rainbow in the name but it's hardly the same algorithm. What this prior work implied for me was that you will need to tweak algorithms developed in the large data regime to an extent large enough that you shouldn't expect relative rankings to be preserved.
I also don't believe those papers argued that relative rankings are preserved as you said in your response: "and if the hyperparameters are specifically tuned these papers state that the best performing algorithm in the high data regime will be again best performing in the low data regime", but only that you can tweak and tune them to make them efficient enough for whatever bar you set yourself.
At this point I'm not sure to what degree the assumption of monotonic performance profiles is real in the literature or more of a forced premise so I'm willing to change my score slightly and acknowledge your reading of the prior work.
2. You have at least two claims in your paper that suggest that the algorithm you propose will dominate any algorithm with a distributional objective in the low-data regime:
- "the performance profile of the simple base algorithm dueling (sic) is significantly better than any other algorithm that learns the state-action value distribution"
- "the simple base algorithm dueling (sic) performs significantly better than any algorithm that focuses on learning the distribution"
Therefore I pointed to one of the current SOTA methods, which uses a categorical objective and which would invalidate the "any" in your claims. I then proceeded, in the questions section, to suggest that if you want to show that this is indeed the case, then you need to take the SOTA distributional method on Atari 100k and ablate it while tuning both algorithms.
If the aim instead is to show that one specific algorithm with a distributional objective will be dominated by its non-distributional counter-part then this has to be done by tuning both algorithms and showing the performance profiles across hyperparameters and not on a fixed, predetermined set that might favor one or another accidentally.
3. The reason I cited Schwarzer is because it demonstrates that larger models perform better in the low data regime which goes against the insights from your theoretical analyses. Maybe they are wrong to use Rainbow but I don't think it has any bearing on this specific point.
Thank you for your response.
1. ”They do show that large gaps in performance can be closed by careful tuning and interventions on the base implementation that eventually leads to a different algorithm. For example, in order to reach the performance of SimPLE authors of [2] tune the dopamine version of Rainbow which does not have dueling, noisy nets or over-estimation correction. Sure, it's Rainbow in the name but it's hardly the same algorithm. What this prior work implied for me was that you will need to tweak algorithms developed in the large data regime to an extent large enough that you shouldn't expect relative rankings to be preserved.”
Please note that the manuscript you are referring to [2] did not intentionally remove those parts (dueling, noisy nets and double Q-learning) to obtain the score. The reason [2] used a Rainbow without dueling, noisy nets or over-estimation correction was simply that the Dopamine codebase did not provide the code for dueling, noisy nets and double Q-learning. Thus, the choice made by [2] here is not due to an ablation study demonstrating that it is good to remove these parts, but rather simply due to the resources available at that time. The authors of [2] then conducted hyperparameter tuning on this code, in particular discovering the importance of the replay ratio, which, as has recently been demonstrated by [3], can increase performance by 71.4%; thus the authors of [2] possibly did not need to remove these parts (i.e., dueling, double Q-learning) to obtain the score.
Furthermore, it is possible to reach and surpass the performance of SimPLE without removing these components (i.e., dueling, double Q-learning, NoisyNetworks) at all, just by hyperparameter tuning, as demonstrated by [1]. Please see further details in [1], which is also cited in our paper.
[1] Hado van Hasselt, Matteo Hessel, John Aslanides. When to use parametric models in reinforcement learning? NeurIPS 2019. [Available Online 12 June 2019]
[2] Kacper Piotr Kielak. Do recent advancements in model-based deep reinforcement learning really improve data efficiency? Unpublished Manuscript, 2019. [Available Online 25 September 2019]
[3] Max Schwarzer, Johan Obando-Ceron, Aaron Courville, Marc G. Bellemare, Rishabh Agarwal, Pablo Samuel Castro. Bigger, Better, Faster: Human-level Atari with human-level efficiency, ICML 2023.
2. “If the aim instead is to show that one specific algorithm with a distributional objective will be dominated by its non-distributional counter-part then this has to be done by tuning both algorithms and showing the performance profiles across hyperparameters and not on a fixed, predetermined set that might favor one or another accidentally.”
We would like to highlight that the distributional reinforcement learning algorithm here is DER [1], which is specifically a Rainbow variant with hyperparameters tuned for the low-data regime.
For the dueling algorithm, in order to provide a consistent comparison between DrQ and dueling, we used the exact same hyperparameters that were tuned for the DrQ algorithm in the original paper. Hence, it may even be possible to obtain a higher score with dueling if the hyperparameters are specifically searched for and tuned for the dueling algorithm.
[1] Hado van Hasselt, Matteo Hessel, John Aslanides. When to use parametric models in reinforcement learning? NeurIPS 2019.
3. “The reason I cited Schwarzer is because it demonstrates that larger models perform better in the low data regime which goes against the insights from your theoretical analyses. Maybe they are wrong to use Rainbow but I don't think it has any bearing on this specific point.”
One important point in [1] is that, with the same number of parameters used in the prior studies, the CNN architecture obtains an IQM of 0.28 while the ResNet architecture with the same number of parameters obtains 0.55. Thus one of the main gains achieved here (a 96% performance gain over the CNN) is not due to an increased parameter count but rather simply to using a better architecture.
Also note that the paper [1] uses SPR with weight decay, annealing, harder resets, a receding update horizon, and an increasing discount factor, all of which are quite recent techniques that contribute substantially to sample efficiency. So much so that, for instance, as can be seen from Figure 5 of [1], simply not using resets reduces the performance from 1.01 to 0.40; thus using resets alone yields a 152.5% performance gain. Similarly, using a standard fixed n instead of the receding update horizon reduces the performance from 0.96 to 0.54, so the receding update horizon alone contributes a 77% performance gain. Without annealing the performance drops from 0.96 to 0.65, so annealing alone yields a 47.7% performance gain. Without harder resets the performance drops from 1.01 to 0.6, so harder resets alone yield a 68.3% performance gain.
Thus, looking over all of these techniques that contribute substantially to the reported performance, and even more importantly the completely different architecture with the same number of parameters yielding a 96% performance gain, demonstrates that the performance achieved in [1] is driven by resets (152.5%), the receding update horizon (77%), the different architecture (96%), and harder resets (68.3%); see the short calculation sketch below. Given this, it is not true that the paper [1] goes against the insights from our theoretical analyses.
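For reference, a minimal sketch of how the relative gains quoted above are computed, as (with − without) / without; the IQM values are the ones quoted above from [1], and the dictionary labels are our own shorthand rather than the paper's exact ablation names:

```python
# Relative gain from including a component: (IQM_with - IQM_without) / IQM_without.
# The IQM values below are the ones quoted above from Schwarzer et al. (2023).
ablations = {
    "ResNet vs. CNN (same parameter count)": (0.28, 0.55),
    "resets vs. no resets": (0.40, 1.01),
    "receding update horizon vs. fixed n": (0.54, 0.96),
    "annealing vs. no annealing": (0.65, 0.96),
    "harder resets vs. softer resets": (0.60, 1.01),
}

for name, (without, with_component) in ablations.items():
    gain = 100.0 * (with_component - without) / without
    print(f"{name}: {gain:.1f}% relative gain")
```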
[1] Max Schwarzer, Johan Obando-Ceron, Aaron Courville, Marc G. Bellemare, Rishabh Agarwal, Pablo Samuel Castro. Bigger, Better, Faster: Human-level Atari with human-level efficiency, ICML 2023.
The paper theoretically shows that:
- for finite horizon Markov decision processes (MDP) with linear approximation, lower (resp. higher) capacity function approximation can lead to lower regret in low-data (resp. high-data) regime.
- distribution-based algorithms may have higher sample complexity.
In addition, it empirically demonstrates that:
- higher capacity models (e.g., distribution based or higher dimension) may perform well in high-data regime, but lower capacity models can perform better in low-data regime.
- dueling DQN performs best in low-data regime and better than distribution-based algorithms.
Strengths
The authors propose a theoretical formalization showing that lower-capacity models can perform better than higher-capacity models in low-data regime, while the opposite holds in high-data regime.
The experimental results validate the theoretical results.
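As a toy illustration of this general capacity-versus-data trade-off, the following supervised-learning sketch (with an arbitrary target function, noise level, polynomial degrees, and sample sizes; it is not the paper's MDP construction) shows a high-degree polynomial winning with abundant data but losing to a low-degree one when samples are scarce:

```python
import numpy as np

rng = np.random.default_rng(0)

def test_mse(n_train, degree):
    """Fit a polynomial of the given degree to n_train noisy samples of sin(x)
    and return its mean squared error on a dense noiseless test grid."""
    x_train = rng.uniform(-3.0, 3.0, n_train)
    y_train = np.sin(x_train) + 0.3 * rng.standard_normal(n_train)
    coeffs = np.polyfit(x_train, y_train, degree)
    x_test = np.linspace(-3.0, 3.0, 1000)
    return float(np.mean((np.polyval(coeffs, x_test) - np.sin(x_test)) ** 2))

for n_train in (20, 2000):                 # low-data vs. high-data regime
    for degree in (3, 9):                  # low vs. high capacity
        errors = [test_mse(n_train, degree) for _ in range(50)]
        print(f"n={n_train:5d}, degree={degree}: mean test MSE = {np.mean(errors):.4f}")
```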
Weaknesses
I feel that the result that larger-capacity models require more data to be trained properly is quite evident (although a theoretical proof in the finite horizon MDP setting is appreciated). For instance, I think the paper proposing DER (data-efficient rainbow), which the authors cite, also somewhat acknowledges this fact and consequently tries to improve the sample complexity of rainbow with various tricks to make it perform better in the low-data regime.
I believe the title is too broad and doesn't clearly reflect the content of the paper. I would strongly suggest that the authors consider changing the title to mention their focus on the low- vs. high-data regime.
I found the presentation a bit confusing on my first read. For instance, Section 2 introduces the infinite horizon case, while Section 3 considers the finite horizon case with slightly different assumptions and notations (e.g., stochastic policy vs deterministic policy, discounted vs undiscounted, stationary MDP vs non-stationary, …). See below for some specific examples of issues. I would suggest that the authors start Section 3 by emphasizing the difference with Section 2 and mentioning that they follow the framework proposed by Zanette et al., 2020.
Presentation and formalization issues: The formalization of an MDP is a bit strange and not rigorous: \mathcal P is directly defined as a probability distribution over S x A x S. \rho_0 is not defined. \pi(s, a) is not a map from states to actions. Moreover, the notation doesn't fit the mapping \pi : S \to \Delta(A). The expected cumulative discounted rewards R is not very well defined, since s_{t+1} is also a random variable that depends on a_t.
The notations in (2-5) should be recalled. The definition of Q_\beta should be checked.
In Section 3, why does the probability kernel \mathcal P_t depend on t? Same question for \mathcal R_t. This point was not mentioned in the definition of a finite horizon MDP, just above.
Is the discount factor missing in (6)? If not, it may be worthwhile to clearly specify that in the finite horizon case, the total reward criterion is used to avoid any confusion for the reader.
Above (8): Notation \theta_t \in d_t is incorrect.
Regarding \mathcal Q_t(\theta_t)(s_t, a_t), why not write it as \mathcal Q_t(s_t, a_t; \theta_t) like in Section 2?
The definition of the intrinsic Bellman error should be recalled.
Top of page 4: Typo: dimesions
The last sentence before Section 4 is too broad in my opinion. If the algorithms are compared using models with similar capacity, I think the comparison is fair.
Page 7: I find the first paragraph quite hard to understand. First of all, the authors should cite the original reference when mentioning an algorithm for the first time. For instance, DRQ^{ICLR} does not have any reference. Agarwal et al., 2021 write "for DrQ, we used the source-code obtained from the authors". Does DRQ^{NeurIPS} refers to their DrQ(\epsilon)? Also, as far as I know DrQ does not use the dueling architecture.
Fig. 3 (right) does not seem to be commented on in the text.
Edit: After reading all the reviews and the rebuttal, I believe that most of the weaknesses I raised still hold.
Questions
The interpretation of Prop. 4.1 is not clear to me. I understand Prop. 4.1 as follows: even a small total variation error may lead to an incorrect order over actions. However, the last sentence after this proposition seems to say: if the total variation error is not small, then there may be an incorrect order over actions. Did I misunderstand something?
What is DRQ^{NeurIPS}?
Does DRQ^{ICLR} use the dueling architecture?
Thank you for providing kind feedback on our paper. We have fixed the typos and improved the clarity of the presentation with respect to your comments, and updated the paper.
1. "What is DRQ^{NeurIPS}?"
DRQ^{NeurIPS} refers to the paper [1].
[1] Deep Reinforcement Learning at the Edge of the Statistical Precipice, NeurIPS 2021. [Oral Presentation]
2. "Does DRQ^{ICLR} use the dueling architecture?"
Yes, it does. See Section 4.2 and Appendix C of [2].
[2] Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels, ICLR 2021. [Spotlight Presentation]
3. "The interpretation of Prop. 4.1…Did I misunderstand something?"
Prop 4.1 states that making an error of epsilon in total variation distance may lead to an incorrect order over actions.
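To make the mechanism concrete, here is a minimal illustrative construction (it assumes the return random variables take values in an interval of width at most 1, as in a normalized setting; it is not the exact construction used in the proof):

```latex
% Illustrative example; assumes returns lie in an interval of width at most 1.
\[
X(s,a) = v \ \text{(a point mass)}, \qquad
Y = \begin{cases} v   & \text{with probability } 1-\epsilon,\\[2pt]
                  v-1 & \text{with probability } \epsilon,
    \end{cases}
\]
\[
\mathrm{TV}\big(Y, X(s,a)\big) = \epsilon, \qquad
\mathbb{E}[Y] = (1-\epsilon)\,v + \epsilon\,(v-1) = \mathbb{E}[X(s,a)] - \epsilon .
\]
% Hence, if another action a' satisfies E[X(s,a)] - epsilon < E[X(s,a')] < E[X(s,a)],
% an estimate that is epsilon-accurate in total variation can still rank a' above a,
% i.e., a small total variation error may flip the greedy action ordering.
```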
The paper's main message is that high-capacity RL models that perform best in high-data regimes may perform worse than low-capacity RL models in low-data regimes. This is supported by a theoretical analysis and empirical evidence. The reviewers generally agree with the paper's main message (in fact this is well known for all of ML in general), but the analysis raised a lot of questions and discussion. The rebuttal helped to address those questions, but some concerns remain.

For instance, the main message of Proposition 4.1 remains questionable. Proposition 4.1 shows that in the worst case there exists a random variable Y such that TotalVariation(Y, X(s,a)) <= epsilon, but E[Y] <= E[X(s,a)] - epsilon. However, why does this mean that minimizing total variation to within epsilon is naturally comparable to minimizing the difference in expectation to within epsilon?

At the end of the day, the main point that Section 4 tries to make is that distributional RL techniques are more data intensive than techniques that simply estimate the mean. This can be argued very simply as follows. First consider the case where distributional and non-distributional techniques are equally expressive. A non-distributional technique that directly estimates the mean will naturally be more data efficient than a distributional technique that optimizes some other difference (e.g., total variation) between two distributions, simply because the mean is ultimately what we strive to optimize. In other words, optimizing the right metric generally requires less data. In addition, suppose that the distributional technique is more expressive because it represents the entire distribution, while the non-distributional technique is less expressive because it represents only the mean of the distribution; then the distributional technique is more likely to overfit in low-data regimes.

Overall, the main message of the paper is fine, but the theoretical arguments seem more complex than needed and the write-up raises unnecessary questions. The experiments support the main message of the paper well.
Why not a higher score
The theory is more complex than needed and it raises unnecessary questions that undermine the main message of the paper.
Why not a lower score
N/A
Reject