PaperHub
Overall rating: 5.0 / 10
Decision: Rejected (3 reviewers)
Individual ratings: 5, 5, 5 (min 5, max 5, std 0.0)
Confidence: 4.0
Correctness: 2.3
Contribution: 2.3
Presentation: 2.3
ICLR 2025

Learning in complex action spaces without policy gradients

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05

Abstract

Keywords

action-value learning, policy gradient methods, complex action spaces

Reviews and Discussion

Review
Rating: 5

This paper identifies three universal principles from policy gradient methods and incorporates them into an action-value framework.

The method is evaluated on DMC domains with continuous action spaces.

However, the proposed method does not show a significant improvement in performance over DDPG or D4PG, especially in domains with high-dimensional action spaces.

Strengths

1. The proposed principles underlying the scalability of policy gradient methods are intriguing observations, and a careful analysis of the gap between the two paradigms is insightful.

2. The presentation of the proposed ideas is clear.

3. Experiments were conducted across multiple domains, although without significant improvement.

Weaknesses

1. There are no theoretical guarantees for computing the maximization over $A_m$ instead of $\mathcal{A}$ (Eq. 17). (In policy gradient methods, using an MC estimator in place of exact summation or integration has theoretical foundations.)

2. The proposed method trains all the predictors from historical argmax approximations to construct a small action space $A_m$ for computing an approximation of the best action. The iterative dependency restricts the actions to a subset where most actions are similar, potentially leading to a suboptimal solution.

3. As illustrated in Figure 2, QMLE can transcend DPG through uniform sampling over the action space ([global]). But in high-dimensional action spaces, uniform sampling is usually inefficient. This paper lacks an ablation study of the uniform sampling and the ensemble of argmax predictors in domains with high-dimensional action spaces.

Questions

1. The proposed method uses the log-likelihood gradient to train the argmax predictors. Why not use the policy gradient directly to update the ensemble of argmax predictors?

Comment

Thank you for recognizing the intriguing observations and clear presentation of our work. We appreciate your constructive feedback and the opportunity to address your concerns and questions.


Question 1: "The proposed method uses the log-likelihood gradient to train the argmax predictors. Why not use the policy gradient directly to update the ensemble of argmax predictors?"

Thank you for this pertinent question. The primary reason we opted for the log-likelihood gradient over the policy gradient for training the ensemble of argmax predictors lies in maintaining the off-policy nature of Q-learning. Using the policy gradient directly to update the ensemble would inherently tie the training of the argmax predictors to the behavior policy, thereby compromising the off-policy learning capability of Q-learning.

In traditional algorithms like DDPG, the policy is constrained to a delta function to facilitate off-policy updates while still leveraging policy gradients (Section 2.3). However, this restriction limits the method to continuous action domains and confines the policy to deterministic predictors. By employing the log-likelihood gradient instead of the policy gradient, we decouple the training of the argmax predictors from the policy gradient framework. This allows us to:

  • Maintain Off-Policy Learning: Q-learning remains fully off-policy, enabling the use of diverse and potentially historical data without being constrained by the current policy's distribution.

  • Enhance Flexibility in Action Spaces: We can train argmax predictors that are not limited to delta functions, supporting discrete, continuous, and hybrid action spaces (e.g. see Figure 12). This flexibility also permits the use of an ensemble of arbitrary distribution types, rather than being restricted to deterministic policies.

In essence, the log-likelihood gradient facilitates the training of a versatile ensemble of argmax predictors while preserving the foundational benefits of off-policy Q-learning. This strategic choice underpins the scalability and adaptability of our QMLE framework across various action space complexities.
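
To make this concrete, below is a minimal sketch of what a log-likelihood update for an argmax predictor might look like. This is our own illustrative code, not the paper's implementation; names such as `predictor` and `q_net`, and the assumption that the predictor returns a `torch.distributions` object, are ours. The key point is that the predictor is regressed toward whichever candidate action currently maximizes the learned Q-function, so no behavior-policy probabilities ever enter the update.

    import torch

    def mle_update(predictor, q_net, optimizer, state, candidate_actions):
        # Score every candidate action with the current (action-in) Q-network
        # and pick the empirically best one.
        with torch.no_grad():
            q_values = q_net(state.expand(candidate_actions.shape[0], -1), candidate_actions)
            best_action = candidate_actions[q_values.argmax()]

        # Log-likelihood (MLE) step: increase the probability of the best sampled
        # action under the predictor's distribution. The behavior policy never
        # appears here, so Q-learning remains fully off-policy.
        dist = predictor(state)                  # assumed to return a torch.distributions object
        loss = -dist.log_prob(best_action).sum()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()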


Weakness 1: Lack of “theoretical guarantees for computing the maximization in $A_m$ instead of $\mathcal{A}$”

We appreciate this concern. Our approach is theoretically grounded by recent work such as DAVI (Tian et al., 2022), which formally studies the use of Monte Carlo (MC) approximations for Q-maximization within Asynchronous Value Iteration (AVI). DAVI was shown to converge to the optimal value function under standard conditions, providing a theoretical grounding for the combination of tabular AVI with MC approximation for Q-maximization and learned tabular argmax predictors. Given that Q-learning is an instance of AVI, such foundations also extend to tabular Q-learning (with standard modifications to the conditions).
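
In symbols (our notation, not the paper's equation numbering), the approximation under discussion replaces the exact maximization over the full action space with a maximization over a small sampled subset:

    \max_{a \in \mathcal{A}} Q(s, a) \;\approx\; \max_{a \in \mathcal{A}_m} Q(s, a),
    \qquad \mathcal{A}_m \subset \mathcal{A}, \quad |\mathcal{A}_m| = m,

where $\mathcal{A}_m$ contains both uniformly drawn actions and actions proposed by the learned argmax predictors; the DAVI analysis cited above studies exactly this kind of replacement in the tabular setting.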


Weakness 2: "The iterative dependency restricts the actions to a subset where most actions are similar, potentially leading to a suboptimal solution."

To mitigate the risk of restricting actions to a narrow subset, our framework incorporates uniform sampling alongside learned argmax predictors (as detailed in Equation 21). By allocating 90% of the sampling budget to uniform sampling (see Table 1) and the remaining 10% to the ensemble, we maintain diversity in action samples, thereby avoiding statistical bias and ensuring comprehensive coverage. This balanced sampling strategy helps prevent the iterative dependency from causing the action set to become overly similar, reducing the likelihood of converging to suboptimal policies.
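
For illustration, here is a minimal sketch of how such a mixed candidate set could be assembled for approximate $Q$-maximization. This is our own simplification in PyTorch, not the paper's code: the function and parameter names, the default budget of 20 samples, and the action bounds are assumptions chosen only to mirror the 90%/10% split described above.

    import torch

    def approx_max_q(q_net, ensemble, next_state, m=20, rho_uniform=0.9,
                     act_dim=6, act_low=-1.0, act_high=1.0):
        # ~90% of the budget: uniform samples over the action space, keeping the
        # candidate set statistically unbiased and globally diverse.
        n_uniform = int(rho_uniform * m)
        uniform = torch.rand(n_uniform, act_dim) * (act_high - act_low) + act_low

        # Remaining ~10%: actions proposed by the learned argmax predictors.
        n_learned = m - n_uniform
        proposed = torch.stack([p(next_state).sample().squeeze(0)
                                for p in ensemble][:n_learned])

        candidates = torch.cat([uniform, proposed], dim=0)   # the small action set A_m

        # Approximate max_a Q(s', a) by maximizing over the candidate set only.
        with torch.no_grad():
            q_values = q_net(next_state.expand(candidates.shape[0], -1), candidates)
        return q_values.max()

The returned value would then plug into a standard Q-learning target of the form r + γ · approx_max_q(...).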


Weakness 3: "As illustrated in Figure 2, QMLE can transcend DPG through uniform sampling over the action space. But in high-dimensional action spaces, uniform sampling is usually inefficient."

You are correct that uniform sampling becomes inefficient in high-dimensional action spaces. To address this, our approach combines uniform sampling with an ensemble of learned argmax predictors. While uniform sampling ensures unbiased coverage, the ensemble focuses computational resources on promising regions of the action space, enhancing sample efficiency.

We believe this hybrid strategy provides a robust foundation and can be further optimized with more sophisticated sampling techniques in future work.


We hope this explanation provides clarity, and we’re always open to further discussion if there are any aspects that need additional attention or if you feel your concerns haven’t been fully resolved.

Comment

Thank you for your response. I will maintain my score.

I appreciate the motivation of the paper, which aims to identify the key principles of policy gradient methods that are effective in complex action spaces. However, it is unfortunate that the proposed action-value framework appears rough, and the experimental evaluation does not truly address complex action spaces.

In particular, the paper's title and introduction both emphasize complex action spaces “involving large, continuous, combinatorial, or structured sub-action spaces” as the focus. Yet, the experiments are conducted solely in the continuous action spaces of DMC, which makes the work seem overclaimed.

Additionally, the ablation study I mentioned in Review Weakness 4 has still not been added. This omission raises further concerns:

In high-dimensional action spaces, while uniform sampling theoretically guarantees optimality, it is often inefficient. At the same time, the iterative dependency in the proposed method restricts actions to a subset where most actions are similar, potentially leading to suboptimal solutions. The effectiveness of both aspects in high-dimensional action spaces remains questionable. Although the authors achieved results comparable to DDPG in DMC, I would like to know which of these two aspects plays a more significant role in the proposed method.

Comment

Thank you for your engagement during the discussion phase and thoughtful consideration of our response.

Below we address your remaining concerns.


Remaining Concern 1

"The paper's title and introduction both emphasize complex action spaces “involving large, continuous, combinatorial, or structured sub-action spaces” as the focus. Yet, the experiments are conducted solely in the continuous action spaces of DMC, which makes the work seem overclaimed."

We would like to clarify that the environments feature both continuous and multi-dimensional (i.e. combinatorial) action spaces, ranging from 1-dimensional (Cartpole and Pendulum) to 38-dimensional (Dog). We agree that showing results on large discrete action spaces would be beneficial to further demonstrate the versatility of QMLE.

  • Rebuttal Revision (Added Experiment): In the newly added Appendix D of our revised version, Figure 10 shows the performance of a discrete-action QMLE on discretized versions of several DMC tasks. Notably, the discretized Dog environment features an enormous action space with $1.35 \times 10^{18}$ discrete actions. The results show that the discrete-action QMLE performs strongly in such enormous action spaces, where DQN fails with an out-of-memory error.
  • Rebuttal Revision (Reduced Claim): In the revised version, we have now altered the introduction to reduce our claim to large, continuous, or combinatorial action spaces (removing structured action spaces).
  • Note that, with this added experiment, we have shown strong functionality of QMLE in complex action spaces including (1) continuous, (2) combinatorial (up to 38 action dimensions), and (3) large discrete (up to $1.35 \times 10^{18}$ actions).

Remaining Concern 2:

"Additionally, the ablation study I mentioned in Review Weakness 4 has still not been added. This omission raises further concerns: In high-dimensional action spaces, while uniform sampling theoretically guarantees optimality, it is often inefficient. At the same time, the iterative dependency in the proposed method restricts actions to a subset where most actions are similar, potentially leading to suboptimal solutions. The effectiveness of both aspects in high-dimensional action spaces remains questionable. Although the authors achieved results comparable to DDPG in DMC, I would like to know which of these two aspects plays a more significant role in the proposed method."

  • Rebuttal Revision (Added Ablation Studies): In the revised version, we have now included ablation studies that should address your remaining concern. Specifically, Appendix D.1 compares the performance of QMLE against its ablation without amortized maximization. That is, the ablated variant relies solely on uniform sampling for $Q$-maximization, whereas the standard QMLE uses both a delta-based argmax predictor and uniform sampling. In this experiment, we have reduced the number of samples in each maximization operation to only 2: (i) the ablated variant allocates both samples to uniform sampling, (ii) the standard variant allocates 1 sample to the delta-based argmax predictor and 1 to uniform sampling. This is an extreme undersampling scenario designed specifically to expose the impact of using a learned argmax predictor in QMLE.

  • Additionally, we urge the reviewer to kindly examine the newly added results in Appendix D.2 and D.3, as we believe they help depict a more concrete picture of the ingredients in QMLE.


We believe that our responses and the revisions made to the paper address your remaining concerns. We hope that this clarifies our contributions and merits a reconsideration of your evaluation. If you have any further questions or require additional clarification, please do not hesitate to reach out.

Review
Rating: 5

This paper challenges the assumption that policy gradient methods are inherently superior for environments with complex/continuous action spaces. The authors identify core principles (Monte Carlo approximations, maximum likelihood estimation, and action-in architectures) in policy gradients and incorporate them into action-value methods. Their method achieves performance comparable to policy gradient methods in continuous control tasks in DMC, demonstrating that action-value methods can handle complex action spaces without the need for policy gradients.

Strengths

  1. The identification of the three core principles is noteworthy and has the potential to influence future research on policy gradient methods, even considering that the paper's goal is to provide alternatives to traditional policy gradients.
  2. The experimental setup appears sound, with comparisons against several baselines across multiple seeds, and the results generally favor the proposed method. However, I would suggest including an ablation study (see Point 3 below).

Weaknesses

  1. My major concern is with the overall goal of the paper, as its central premise is unclear to me. First, what is the issue with policy gradients that the authors are aiming to replace them? This should have been clarified to justify the need for an alternative approach. If the goal is to develop a method that is simpler (i.e., fewer components, reduced computation, fewer hyperparameters) than policy gradients, similar to the approach in [1], I would argue that the proposed method is in fact more complex than policy gradients. It introduces additional hyperparameters and design decisions. Furthermore, the method eventually requires a parametric component for predicting the argmax (Principle 2 in Section 4). In my view, this component functions as a type of policy, albeit in a somewhat convoluted and unintuitive way. Its inclusion seems to contradict the paper’s aim of eliminating the need for a policy and policy gradients. In fact, I would argue that the proposed method cannot be considered purely as an action-value method, as claimed by the authors.

  2. The argument regarding the equivalence of policy gradients and MLE (Section 3.2), while likely correct, is presented in a very informal manner. A more rigorous analysis or formal proof of this equivalence would significantly strengthen the paper.

  3. The paper lacks an ablation study on the different components of the proposed method (Principles 1, 2, and 3). Given the number of design choices involved, an ablation study is essential to better understand the contribution of each principle.

  4. Many of the equations in the paper (Equations 3, 4, 8, 9, 10, 11, 13, 14) are written without an equal sign, making them appear as disconnected mathematical expressions. While their meaning can be inferred from the surrounding text, the lack of clarity in some cases creates an informal mathematical tone and makes it difficult to interpret the exact role of these equations.

  5. The background section spans three pages and covers material likely familiar to most readers. I recommend condensing this section and moving some of the content to the appendix. This would allow for a more detailed presentation of the algorithm (currently in Appendix A) in the main body of the paper, as well as addressing the concerns I raised above.

[1] Seyde, Tim, et al. "Solving Continuous Control via Q-learning." The Eleventh International Conference on Learning Representations. 2023.

Questions

  1. What are the limitations of policy gradients that lead to their replacement?
  2. Why does the number of seeds vary across different tasks? Additionally, what do the shaded regions in the figures represent?
Comment

We appreciate your feedback and the opportunity to clarify and elaborate on several aspects of our work. Below, we address your questions in detail.


Question 1: "What are the limitations of policy gradients that lead to their replacement?"

Our primary aim is not to replace policy gradient (PG) methods but to demonstrate that action-value methods can effectively incorporate the core principles that make PG methods successful in complex action spaces. Nonetheless, while PG methods excel in continuous and high-dimensional action domains, they inherently possess certain limitations:

  1. On-Policy Learning Constraints: PG methods are fundamentally on-policy, relying solely on data generated by the current policy. This restricts their sample efficiency and flexibility.
  2. Complexity of Off-Policy Extensions: Extending PG methods to off-policy settings often involves restrictive policy choices (e.g. deterministic policies in DDPG and TD3) and intricate mechanisms around importance sampling, which generally introduce bias in exchange for practical success (e.g. V-Trace).
  3. Local Optimality Issues: As illustrated in Section 5.1 (Figure 2), PG methods can converge to local optima even in simple problems, due to the local tendencies of policy gradients.

How QMLE Addresses These Limitations:

  • Off-Policy data utilization: Being based on Q-learning, QMLE naturally supports off-policy learning, allowing it to leverage diverse data sources more efficiently without restrictive policy parameterizations or importance sampling corrections.

  • Amortized maximization: By integrating Monte Carlo sampling and amortized maximization through Maximum Likelihood Estimation (MLE), QMLE can approximate the argmax operation with statistical global coverage, reducing the risk of converging to local optima.


Question 2.1: "Why does the number of seeds vary across different tasks?"

Simulating complex physical domains such as Dog and Humanoid is more computationally intensive than simulating simpler ones such as Hopper and Walker, which makes sampling from these domains more time-consuming per environment step. At the same time, these tasks are also more challenging and require longer training (5-10 million environment steps). To balance thoroughness and practicality, we ran the more simulation-expensive tasks (Dog and Humanoid) over 5 random seeds, and all other tasks over 10 random seeds.


Question 2.2: "What do the shaded regions in the figures represent?"

The shaded regions represent the standard error of the mean across the seeds (as discussed in Section B.4).


We hope this response satisfactorily addresses your questions. We look forward to engaging further during the discussion period.

Comment

Thank you for your response. I have also read the feedback from other reviewers.

However, the concerns I raised in the weaknesses remain unaddressed, and I will be maintaining my original score.

Comment

Thank you for your response and for taking the time to read the other reviewers' comments.

We have performed the ablation studies you suggested and will shortly include them in the manuscript for your consideration. Additionally, we will address each of the identified weaknesses in detail in our forthcoming response. Apologies for not addressing these issues in the first instance; our priority was to respond to your questions first and then incorporate revisions based on the weaknesses raised.

In the meantime, does our initial response address your questions satisfactorily?

Comment

We thank the reviewer for their engagement during the discussion phase and careful consideration of our response.

Below we address your remaining concerns.


Weakness 3: "The paper lacks an ablation study on the different components of the proposed method."

In the revised paper, we have included a section for ablation studies (Appendix D) with three subsections, each shedding light on the impact of a core component in our framework. We ask the reviewer to kindly examine this newly added section and reach out with questions. We thank the reviewer for suggesting these studies, which we believe have improved the paper significantly.


Weakness 2: "The argument regarding the equivalence of policy gradients and MLE (Section 3.2), while likely correct, is presented in a very informal manner. A more rigorous analysis or formal proof of this equivalence would significantly strengthen the paper."

We would like to clarify that our argument is not that the pure MLE and policy gradient expressions are equivalent. Rather, it is their differences that enable QMLE to maintain off-policy learning and to bypass local-optimality issues in policy gradients (for more details, see our initial response to Question 1 raised by the reviewer). We included the side-by-side comparison of the MLE and PG expressions because we believed such a juxtaposition helps illustrate the nuanced form of MLE that is implicitly used by policy gradient methods.


Weakness 4: "Many of the equations in the paper (Equations 3, 4, 8, 9, 10, 11, 13, 14) are written without an equal sign, making them appear as disconnected mathematical expressions. While their meaning can be inferred from the surrounding text, the lack of clarity in some cases creates an informal mathematical tone and makes it difficult to interpret the exact role of these equations."

Many of these cases are expressions, without a strict equality tied to them. For instance, Eqs. 3, 4, and 9 are expressions for TD residuals. Equation 10 is a problem statement. Equation 11 does not have a concrete equality; however, we could equate it to a symbol $J$ by definition ($\doteq$) (we would be happy to make this change if the reviewer believes the extra definition around the symbol $J$ would help readability).

We will make the following changes in a revised version:

  • Eq. 8 (which is an approximation, tied to Eq. 7): We would like to examine this more closely to ensure our revision does not break anything in the paper. But we would like to assure the reviewer that an adjustment will be made to this equation in a subsequent revision (though we may not be able to do so before the end of the discussion phase).
  • Eqs. 13 and 14 (which are approximations): We will connect them with an $\approx$ sign to the rightmost expression of Eq. 12.

Weakness 5: "The background section spans three pages and covers material likely familiar to most readers. I recommend condensing this section and moving some of the content to the appendix. This would allow for a more detailed presentation of the algorithm (currently in Appendix A) in the main body of the paper, as well as addressing the concerns I raised above."

We thank the reviewer for this suggestion. Our background section was crafted to provide a nuanced exposition of familiar material, in a way that can help even an experienced RL researcher or practitioner see something new (which goes hand in hand with our objective of challenging a conventional wisdom). Additionally, the background ties closely with the main contribution of the paper; as we state in lines 74--75, "[...] our emphasis in this work is less on algorithmic novelty and more on dissecting the core principles that bridge the gap between the two paradigms." To this end, we placed the details of the specific algorithm in the appendix, as it serves only as a proof-of-concept and validation of the core insights.

Comment

Weakness 1.1: "First, what is the issue with policy gradients that the authors are aiming to replace them? This should have been clarified to justify the need for an alternative approach."

We believe our initial response (posted on Nov 18) to the Question 1 raised by the reviewer should have addressed this question. Please do let us know if more clarification is needed.


Weakness 1.3: "Furthermore, the method eventually requires a parametric component for predicting the argmax (Principle 2 in Section 4). In my view, this component functions as a type of policy, albeit in a somewhat convoluted and unintuitive way. Its inclusion seems to contradict the paper’s aim of eliminating the need for a policy and policy gradients. In fact, I would argue that the proposed method cannot be considered purely as an action-value method, as claimed by the authors."

In comparison to PG methods, QMLE does not introduce any additional burden in choosing a parametric distribution. The choice of policy in policy gradient methods is itself a variable design decision. For instance, in Schulman et al. (2015), "Trust Region Policy Optimization", the authors used a diagonal-covariance Gaussian policy in continuous-action problems and a factored-categorical policy in the discrete-action domains of Atari. However, Beta policies (see, e.g., Chou et al. (2017), "Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution") and (non-factored) Categorical policies are also used with such PG methods, among other possible choices. Deterministic policies (e.g. Delta policies), on the other hand, are incompatible with on-policy PG methods such as TRPO and PPO.

Then, to enable off-policy learning for policy gradient methods, DDPG uses Delta policies (which requires changes to the entire algorithm). While off-policy learning is gained, the restriction to Delta policies removes the possibility of switching the policy type to a more expressive one if needed (e.g. in multimodal reward landscapes), a stochastic one, or a discrete-action one.

In conclusion, QMLE unifies the possibility of choosing any distribution type (whether discrete, continuous, or a mixture of both, and whether stochastic, deterministic, or a mixture of both). This is, in our view, a simplification by unification. Lastly, what are considered fixed choices in policy gradient methods are merely best practices over originally variable components, which have been refined over time and made their way into mainstream implementations of policy gradient methods such as PPO.
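
As a purely illustrative sketch of this flexibility (the class names and interfaces below are hypothetical, not taken from the paper), an ensemble could mix a deterministic (delta) predictor for a continuous sub-action with a factored-categorical predictor for a discrete one, as long as both can propose candidate actions; the delta case is trained by regression toward the best sampled action, the categorical case by log-likelihood.

    import torch
    from torch import nn
    from torch.distributions import Categorical

    class DeltaPredictor(nn.Module):
        """Deterministic (delta) argmax predictor for a continuous sub-action."""
        def __init__(self, obs_dim, act_dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                     nn.Linear(64, act_dim), nn.Tanh())
        def propose(self, obs):
            return self.net(obs)                          # a single point-mass sample

    class CategoricalPredictor(nn.Module):
        """Factored-categorical argmax predictor for a discretized sub-action."""
        def __init__(self, obs_dim, num_dims, num_bins):
            super().__init__()
            self.logits = nn.Linear(obs_dim, num_dims * num_bins)
            self.num_dims, self.num_bins = num_dims, num_bins
        def propose(self, obs):
            logits = self.logits(obs).view(-1, self.num_dims, self.num_bins)
            return Categorical(logits=logits).sample()    # one bin index per action dimension

    # Both predictors expose the same `propose` interface, so they can sit in one
    # ensemble feeding the candidate set, with no change to the Q-learning update.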


Weakness 1.2: "If the goal is to develop a method that is simpler (i.e. fewer components, reduced computation, fewer hyperparameters) than policy gradients, similar to the approach in [1], I would argue that the proposed method is in fact more complex than policy gradients. It introduces additional hyperparameters and design decisions."

To respond to the concern, let us examine a continuous-only variant of QMLE with a Delta-based argmax predictor for continuous control problems. In comparison to DDPG, we only have the following added hyperparameters (see also the top segment of Table 1 for QMLE-specific hyperparameters in a general case):

  • sampling budgets $m_\textrm{target}$ and $m_\textrm{greedy}$
  • sampling ratios $\rho_\textrm{uniform}$ and $\rho_\textrm{delta}$

Note that the same network architecture can be used by DDPG and QMLE, as done in Section 5.1.

Now we answer two key questions:

  • Why deal with the extra hyperparameters? The added hyperparameters come with more global properties for QMLE in comparison to DDPG, as illustrated in Section 5.1 and Figure 2.

  • How sensitive are the choices of these hyperparameters? As we show in Appendix D.1 (blue curves), even the most basic and minimal choice of sampling budgets $m_\textrm{target} = m_\textrm{greedy} = 2$ and sampling ratios $\rho_\textrm{uniform} = \rho_\textrm{delta} = 0.5$ yields strong performance for the delta-based QMLE agent. Also, as we show in Appendix D.2, changing the sampling budgets from 2 to 1000 does not cause significant changes in performance, which shows that QMLE is relatively robust to these new hyperparameters.


We believe that our responses and the revisions made to the paper address your remaining concerns. We hope that this clarifies our contributions and merits a reconsideration of your evaluation. If you have any further questions or require additional clarification, please do not hesitate to reach out.

Review
Rating: 5

The paper challenges the conventional belief that policy gradient methods are superior to action-value methods for complex action spaces. The authors propose that this advantage comes from universal principles that can also be applied to action-value methods, such as using Monte Carlo approximations, amortized maximization, and scalable architectures. They introduce Q-learning with Maximum Likelihood Estimation (QMLE), a framework that adapts these principles to action-value learning. Empirical results show that QMLE performs comparably to policy gradient methods on complex action spaces, particularly on the DeepMind Control Suite. The findings suggest that the strengths of policy gradient methods are not intrinsic but can be replicated within action-value approaches.

Strengths

  • The paper presents a novel perspective by challenging the conventional wisdom that policy gradient methods are inherently superior for complex action spaces, opening up new avenues for research and exploration.
  • The introduction of the Q-learning with Maximum Likelihood Estimation (QMLE) framework effectively integrates principles traditionally associated with policy gradients into action-value learning, showcasing versatility and adaptability.
  • The authors provide robust empirical results demonstrating that QMLE can achieve performance comparable to policy gradient methods on complex tasks, specifically in the DeepMind Control Suite, thereby validating their theoretical claims.
  • The paper emphasizes the scalability of the proposed methods, which is crucial for real-world applications involving high-dimensional action spaces, making it a practical contribution to the field of reinforcement learning.
  • By revealing that the strengths of policy gradient methods can be replicated in action-value approaches, the paper encourages a reevaluation of existing strategies in reinforcement learning, potentially influencing future research directions and methodologies.

Weaknesses

  • While the paper introduces the QMLE framework, it lacks in-depth theoretical analysis or proof of convergence properties, which could strengthen the foundational understanding of the proposed method.
  • The empirical evaluations are primarily conducted in the DeepMind Control Suite, which may not fully represent the challenges and complexities found in more diverse real-world environments, limiting the generalizability of the findings.
  • The paper could benefit from a more thorough comparison with other state-of-the-art action-value methods beyond policy gradient approaches to provide a clearer context for the advantages and limitations of the QMLE framework.
  • There is little discussion on the sensitivity of the QMLE approach to hyperparameters, which can significantly impact performance in practice; this omission could hinder practitioners from effectively applying the method.
  • The paper does not offer concrete suggestions for future research directions or improvements to the QMLE framework, which could leave readers uncertain about the next steps for advancing this line of inquiry.

Questions

  • What are the key theoretical assumptions underlying the QMLE framework, and how do they compare to traditional action-value methods?
  • How does the QMLE framework perform in environments outside the DeepMind Control Suite? Are there specific tasks where its advantages or limitations become more pronounced?
  • What guidelines can be provided for selecting hyperparameters when using the QMLE framework, and how does their choice impact performance across different environments?
  • Can you elaborate on the implementation challenges encountered when applying the QMLE framework in practice? Were there any unexpected behaviors observed during training?
  • How does the QMLE framework stack up against other recent approaches in reinforcement learning, particularly those utilizing action-value methods? Are there specific scenarios where it outperforms others?
  • What are the authors’ thoughts on potential extensions or modifications to the QMLE framework that could enhance its applicability or efficiency in more complex scenarios?
  • How robust is the QMLE approach to noise and variability in the environment? Have any experiments been conducted to assess its stability under such conditions?
Comment

Q5: Comparisons with Other Methods

In Appendix C, Figure 7 provides a comparison against QT-Opt and AQL, two notable action-value methods designed to extend Q-learning to complex action spaces. QMLE demonstrates superior performance in most tested domains. We have also included comparisons against SAC, TD3, MPO, PPO, TRPO, and A2C in Figures 5 and 6.

Additionally, methods like DecQN (Seyde et al., 2023) are designed for combinatorial discrete action spaces. They scale to high-dimensional discrete-action problems using rank-1 action-hypergraph representations (Tavakoli et al., 2021). While effective in such settings, these methods are not directly applicable to continuous-action domains (without discretization). We are happy to include performance curves for DecQN if the reviewer believes this would add value.


Q6: Extensions and Modifications

In Appendix E (pages 26–28), we outline three promising avenues for future exploration, including:

  • Combining QMLE with advances in MLE (e.g. for heteroscedastic variance estimation) and deep RL (e.g. cross-entropy loss for approximate Q-learning and parallel Q-learning).

  • Extending QMLE to multi-agent reinforcement learning under centralized training with decentralized execution (CTDE).

  • Exploring adaptive curriculum strategies through dynamically growing action spaces.

We aim to release the QMLE code to encourage the community to iterate on these ideas.


Q7: Robustness to Noise and Variability

We have not explicitly tested QMLE under noisy or highly variable conditions. However, we are eager to explore this direction, particularly if the reviewer has specific scenarios in mind. Testing robustness under such conditions could provide valuable insights into the framework’s practical utility and resilience.


We hope this response clarifies the concerns raised and provides additional context for evaluating our work. Please let us know if there are any aspects that require further elaboration. We are grateful for your constructive feedback and look forward to engaging further.

Comment

Thank you for your insightful review and for highlighting both the strengths and areas for improvement in our paper. We appreciate your thoughtful questions and address them below.


Q1: Theoretical Assumptions

The QMLE framework builds upon established theoretical foundations related to sampling-based approximations for Q-maximization. Specifically, Tian et al. (2022) explored this within the context of tabular Asynchronous Value Iteration (AVI), utilizing uniform sampling alongside a tabular mechanism for caching and retrieving the best historical argmax actions in each state. Our approach extends this by training an ensemble of parametric argmax predictors, thereby generalizing the caching mechanism to function approximators.

These theoretical insights are applicable to the broader family of Value Iteration (VI) algorithms, with tabular Q-learning being an instance. Importantly, Tian et al.'s (2022) convergence proofs accommodate the replacement of a tabular argmax predictor with an ensemble of approximate predictors, provided uniform sampling is maintained.

Our third principle in QMLE—using action-in architectures for representation learning and generalization—is specifically pertinent to action-value learning with function approximation (e.g. neural networks). The theoretical underpinnings of approximate Q-learning remain an open problem, even for the simpler case of action-out architectures. Nevertheless, we believe that conducting a comparative study of action-in versus action-out architectures across diverse domains would be a useful direction for future research.


Q2: Performance Beyond the DeepMind Control Suite

Our focus in this work was to demonstrate that action-value methods can compete with policy gradient (PG) approaches in continuous, high-dimensional action spaces—a domain where PG methods are conventionally considered dominant. The DeepMind Control Suite (DMC) serves as a standard benchmark for such tasks, encompassing complexities like:

  • Combinatorial action spaces: Tasks range from single-dimensional actions to high-dimensional ones (e.g. 38 action dimensions in the Dog task).

  • Continuous action spaces: These domains are where PG methods are typically favored, making DMC a representative testbed.

Additionally, our experiments within these domains demonstrate QMLE's flexibility in handling a fusion of discrete and continuous action spaces through two approaches:

  • Dual Predictors: Utilizing both discrete and continuous argmax predictors simultaneously, allowing the framework to leverage the strengths of each type in different contexts.

  • Action Space Growth: Initiating with a discrete action space and progressively expanding it into the original continuous space, effectively implementing a curriculum that enhances learning efficiency and policy refinement (see Figure 12).

While we have not yet tested QMLE in environments beyond DMC, we believe its principles are broadly applicable. We hope this work inspires the community to explore QMLE in more diverse domains and welcome suggestions for environments that could further test its robustness and limitations.


Q3: Hyperparameter Selection Guidelines

The hyperparameters of QMLE can be categorized into three groups, as outlined in Table 1 of the paper:

  1. DQN. These follow standard practices from DQN and are adapted directly from Seyde et al. (2023) to avoid overfitting QMLE on the benchmark.

  2. Prioritized Replay. Similarly, these mirror the same values as used by Seyde et al. (2023).

  3. QMLE-specific. These include:

    • Sampling budgets ($m_\text{target}$, $m_\text{greedy}$): Appendix A (lines 935--975) provides guidelines for selecting these values. In general, higher values of $m_\text{greedy}$ improve the quality of training interactions, while moderate values of $m_\text{target}$ balance computational cost against overestimation risk.

    • Sampling ratios ($\rho_k$): We allocate 90% of the sampling budget to uniform sampling to ensure statistically unbiased approximations, and dedicate the remainder to local sampling strategies (e.g. from deterministic delta distributions or other argmax predictors). For non-deterministic predictors, evenly distributing the remaining budget (beyond the uniform-sampling ratio) across ensemble members should be a reasonable choice (see the small worked example after this list).
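
As a small worked example of this guideline (the concrete numbers are ours, purely for illustration):

    def allocate_samples(m_greedy, rho_uniform, ensemble_size):
        # e.g. allocate_samples(20, 0.9, 2) -> 18 uniform samples, 1 per predictor
        n_uniform = int(rho_uniform * m_greedy)
        n_per_predictor = (m_greedy - n_uniform) // ensemble_size
        return n_uniform, n_per_predictor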


Q4: Implementation Challenges

We did not encounter any implementation challenges or unexpected behaviors. Our QMLE implementation is straightforward, built on a simple DQN implementation from CleanRL (Huang et al., 2022). The results presented in the paper are from the initial implementation after correctness validation, underscoring its simplicity and reproducibility.

Comment

Once again, we would like to thank the reviewer for their thoughtful feedback. Below we would like to outline the additions which we have incorporated into a revised version.

We have added ablation studies for QMLE in Appendix D.

  • Appendix D.1 examines QMLE with and without amortized maximization. Both variants use only 2 action samples for approximate maximization (i.e. sampling budgets of $m_\textrm{target} = m_\textrm{greedy} = 2$). The ablated variant relies solely on uniform sampling (i.e. sampling ratio $\rho_\textrm{uniform} = 1$). The standard variant uses only a delta-based argmax predictor, allocating 1 sample to the argmax predictor and 1 to uniform sampling. The results demonstrate that amortized maximization significantly improves performance, particularly as the complexity of the action space increases.

  • Appendix D.2 examines performance under sampling budgets of 2 (the minimal number of samples needed to allow 1 sample from the learned argmax predictor and 1 from the uniform distribution) and 1000. The results show that, as expected, performance mildly improves with a budget of 1000 due to a more accurate approximation of the argmax. However, the difference is minimal owing to amortization, which effectively dampens the negative impact of undersampling by enabling the reuse of past computations over time. This shows the robustness of QMLE with respect to large variations in the sampling budgets.

  • Appendix D.3 compares the performance of a discrete-only variant of QMLE (using a learned argmax predictor based on a factored categorical distribution) against DQN (which is a variant of QMLE with the action-in architecture ablated, in turn obviating the need for amortization and approximate maximization). The results illustrate the benefits of action-in architectures with regard to sample efficiency and generalization over actions (as demonstrated in the lower-dimensional action domains) and computational scalability (as seen in the higher-dimensional action problems, where DQN fails entirely with memory errors).


We hope that these newly added results shed light on your question about the sensitivity of QMLE to its hyperparameters. Overall, we hope that our responses to your questions and concerns, together with the added experiments, further clarify our contributions and merit a reconsideration of your evaluation.

Comment

Dear Reviewer uFAB,

We want to express our sincere gratitude for your thorough and insightful feedback on our work, "Learning in complex action spaces without policy gradients." We understand that reviewing service is a significant commitment and a time-consuming task. We truly appreciate the attention to detail and the thoughtful consideration you have given to our work. We have carefully considered your concerns and provided detailed responses to the points you raised. We hope that our explanations address your questions and improve the understanding of our work.

The open discussion phase is a distinctive and highly valued aspect of the ICLR experience. We are eager to continue benefiting from this collaborative process. If you have the opportunity, we would greatly appreciate any final feedback or thoughts you might have on our revisions and in light of our responses.

Thank you again for your time and consideration of our work.

Best regards,

The Authors

Comment

Dear Reviewers,

As the discussion period is coming to an end, we kindly ask if you could take a moment to review our responses and paper revisions and consider reevaluating our paper in light of them.

Thank you for your time and consideration.

Best regards, The Authors

AC Meta-Review

The paper aims to identify why policy-based methods have seemed to outperform value-based ones in large action spaces and to use this insight to improve value-based RL. The motivation and writing are generally good, with some exceptions regarding theoretical and mathematical rigor. Reviewers raised a myriad of other concerns, including concerns about evaluation (choice of domains and ablations), hyperparameter sensitivity, and paper organization, which were largely addressed by the authors. However, reviewers remain skeptical of the significance of the paper.

Additional Comments on Reviewer Discussion

None of the reviewers raised major concerns, but rather a collection of more minor concerns that together decreased the perceived significance of the paper. The authors appear to have addressed many of these concerns rather well, but the reviewers were unconvinced, leaving me with an overall feeling of a missed opportunity for a productive discussion that would have identified the core strengths and weaknesses of the paper (beyond what could quickly be addressed).

I'd like to take this opportunity to discourage one type of concern that I think should be used very sparingly, namely Weakness 1 (and Question 1) by Reviewer rHPN. As a community, we shouldn't require anything specific to be wrong with policy-gradient (PG) methods in order to pursue good and insightful non-PG alternatives. There are many reasons why: the insight could help PG; the alternatives could eventually improve beyond PG; there could be specific cases where they do outperform PG; etc. I'd like to see us dismiss the practice of rejecting papers because they don't outperform unrelated methods.

Final Decision

Reject