PaperHub
Rating: 4.3/10 · Rejected (3 reviewers; min 4, max 5, std 0.5)
Individual ratings: 4, 5, 4
Confidence: 2.7
Correctness: 2.7 · Contribution: 2.3 · Presentation: 2.7
NeurIPS 2024

Achieving Precise Control with Slow Hardware: Model-Based Reinforcement Learning for Action Sequence Learning

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

The paper introduces a biologically plausible model of sequence learning that achieves precise control by utilizing different temporal resolutions for training and control.

Abstract

Keywords: Action Sequence Learning, Basal Ganglia, Prefrontal Cortex, Reinforcement Learning, Model Based

Reviews and Discussion

Review (Rating: 4)

This paper addresses the problem that commonly in RL, small reaction times and high action frequencies are required, which is not the case for computations in the brain. As a more accurate model, the authors propose an RL method that learns an internal model to improve performance in high-latency applications.

Strengths

The paper rightfully describes the issue that, usually, RL methods operate under different conditions than the human brain (higher latencies/lack of information). Developing a biologically plausible model that can deal with higher latencies and lower perceptual frequencies would be beneficial to get a better understanding of the human brain and for realizing certain applications in robotics. The approach of learning an internal model guiding the actions, if no observations are present, seems to follow intuition and could be promising. I would therefore rate the problem as relevant and interesting, and the general model (at least on a computational level - this could be stated more clearly in the paper) as "plausible".

The problem and approach (besides my points in the Weaknesses/Questions sections) are clearly described. The method is evaluated on a wide range of six control problems, showing that the model can produce reasonable control signals.

Weaknesses

I found a few parts of the paper difficult to follow (mainly the policy and evaluation section, see also my questions for this), and the paper could benefit from improvements in these parts. In particular, I am uncertain whether the approach introduced in the paper fits the problem description. It seems that the approach targets the setting of high action frequency but not delays of perception.

Furthermore, I had problems understanding how the policy of the proposed approach works. The authors seem to define a probabilistic policy $\pi_\omega$ with parameters $\omega$ that takes the previous action as input (line 177). In Eq. 5, which describes the loss for learning the policy, the previous action does not seem to appear. For this loss, they need state-action pairs from a deterministic policy $\pi_\psi$. For a final evaluation, I would need clarification on this (see questions).

Also, the experiment section was not easy to follow. I believe it would be valuable to give an overview at the beginning of the section of how the evaluation is structured and to state the goals of each conducted experiment (linked to the motivation of the paper). As an example, I think (please correct me if I am mistaken) that the comparison to SAC was conducted to show that the proposed method can learn a controller that, even with the limitation of a lower frequency, does not compromise much performance. I am a bit unclear as to why the proposed method outperforms SAC (which it does in 4/6 tasks), as to my understanding the strength of the method should lie in the specific scenario where the hardware is more similar to the human brain. That the proposed method achieves higher performance even in this standard scenario rather suggests to me a lack of appropriate parameter tuning. Also, the setting of the ASL subsection and the online planning section should be stated more clearly (see questions).

In lines 104-105, it is confusing to me that the authors imply that, in comparison to model-based RL, the proposed method has the advantage of not needing a model after training. The main purpose of applying model-based RL is to learn a model with the option to replan after training. Model-free RL, which could be more similar to this application, is not mentioned.

Section 3.2 is about macro-actions but does not provide even a one-sentence overview of what they are. The authors claim that an advantage of their approach is that it uses the principles of RL. Based on my understanding, the concepts of hierarchical reinforcement learning and movement primitives would be very relevant here but are not discussed.

Minor:

  • Line 17 "not"
  • Line 163: $\psi$ should be a subscript
  • Line 202 Typo in "Experiemental"
  • I find it confusing that parentheses are used for both equations and citations. Most papers therefore use square brackets for citations.
  • The plots do not use the space in the paper well (large whitespaces between subplots; the graphs could be made wider), and font sizes differ significantly between plots. Legends are usually integrated into the first subplot only but could be placed, e.g., next to the plots to make this information more obvious.
  • The Readme of the provided code seems to be incomplete. The code cannot be directly used to reproduce the figures of the paper.

Questions

  • Line 31-32: "When RL agents are constrained to human-like reaction times, even state-of-the-art algorithms struggle to perform in simple environments." Is there a reference for this statement?
  • I am a bit unclear about which exact problem / setting the proposed method addresses. Is it the delayed perception of the state as mentioned in the introduction? Then, why in the evaluation do the authors repeat the same actions instead of introducing a delay? Also, it seems to me that the proposed method can act on the current state for the first action in the action sequence (eq. 5). Or, is it the frequency of action execution? Then, why is the method motivated by the introduction of human latencies in decision-making? I could imagine that stating that humans operate under lower frequencies than autonomous systems might not be accurate.
  • Is the model probabilistic or deterministic? In (1) it seems deterministic, while in line 163 it seems like a state is sampled from a distribution.
  • In the section "learning policy": Where is the deterministic policy $\pi_\psi$ described, and where do the state-action pairs come from?
  • For the generative replay in latent space / Figure 5: Are the plots of the right figure also provided for the other environments?
  • Why does in Fig. 6 HSP outperform SAC in 4/6 tasks? To my understanding, the SAC implementation there does not force any latencies or action repetitions.
  • How does the approach of SAC with action repetition work? Is the repetition already introduced for learning? Why do you work with action repetition and not use a more coarse-grained discretization?
  • How does the model-based online planning approach work in detail? What algorithm and approach do you use?
  • What is the purpose of the experiment "Generative Replay in Latent Space" and what are the implications of the results?

Limitations

I think so.

Author Response

Thank you for taking the time to read our work and for the detailed review.

  1. We demonstrate this in Figure 3 of the current work: SAC fails at human-like reaction times. We have updated the text to state that we demonstrate this in our experiments.

  2. We provide a control solution for the following setting: slow compute frequency but fast actuation and control frequency. We assume no delay in perception or actuation. For RL algorithms, reaction time can be thought of as the time between each reaction (output) of the agent, or simply as the time between each input taken by the agent, since no delay is assumed and each input produces an output. Additionally, delay can be handled in RL by setting the timestep equal to the delay and appending the last action to the state space, thus making the MDP stationary again [1]. In this setting, the delay is then equal to the timestep. Therefore, we use the timestep as the reaction time for RL. To avoid any confusion, we have removed any mention of latency from the abstract. We provide references in our paper for the average human reaction times and the RL control frequencies to reinforce our claim that RL control frequencies are faster than the average human reaction times.

  3. The model is deterministic. Line 163 is a typo and we have fixed it.

  4. There is a typo on line 192. It should be $\tilde{a}_{t:t+J} \sim \pi_\phi(s_t, a_{t-1})$. Therefore, the actions come from the probabilistic policy $\pi_\omega$. The states $\tilde{s}_{t+1:t+J}$ come from iteratively running the model on $s_t$ and the actions produced by $\pi_\omega$. There is no policy $\pi_\psi$; the parameters $\psi$ are for the critic network. The critic network does not require the previous action in order to output the Q-value of a state-action pair. The previous action is required because HSP uses a recurrent network and the action acts as the recurrent connection. In response to this question, we have fixed all the typos within the equations and added a detailed explanation of the temporal recall mechanism.

  5. We did not run the ASL experiment for other environments as the performance was worse than the normal variant.

  6. We are assuming that Fig. 6 is a typo. In Fig. 2, HSP does outperform SAC on 3/6 environments, and indeed it is a surprising result. We used the same hyperparameters as the SAC paper for both SAC and HSP and did not do hyperparameter tuning beyond the actor-update-frequency parameter for HSP. SAC is evaluated without any action repetition or delay of any sort. HSP is evaluated at ASL = $J$ so that it sees an input only every $J$ steps. Therefore, HSP is indeed at a disadvantage. We hypothesize that optimizing multiple future actions based on a single state results in a reduced preference for myopic policies that maximize the value of the next state but decrease the overall reward. This effect is prominent on the Hopper-v2 task, where HSP-4 outperforms even HSP-2. In response to this question, we have added this insight to the updated revision. We also note that while it might seem like HSP beats SAC on the Humanoid task, SAC is marginally better than HSP when evaluated: for instance, HSP-2 evaluated at an ASL of 2 has an average reward of 8852.38, while SAC has an average reward of 9006.25.

  7. We do not introduce repetition during learning because then we would be evaluating one policy of HSP against many different policies of SAC (one for each ASL). Even in this setting HSP beats SAC, and in response to this question we have added these results to the appendix. We wanted to demonstrate the setting of a single policy being able to adapt to missing inputs (at different frequencies). Before training, the frequency modulation can be achieved either by action repetition or by changing the timestep. However, after training, we can only use action repetition.

  8. For the model-based online planning, we use the HSP policy and the model trained during the training of HSP. We use a simple online planning algorithm: for any ASL, we feed the first action produced by HSP to the model to get the next predicted state. This state is then fed into HSP to produce the next action. This is repeated until the number of actions equals the ASL. All the actions in the action sequence are then performed in the environment (a short sketch of this loop is given after this list). Regarding lines 104-105: it is true that our method is more related to model-free approaches; therefore, we compare to SAC. However, we did not find any previous work that demonstrates model-free sequence generation. Therefore, we mention model-based online planning in related work.

  9. As mentioned in the paper, generative replay benefits greatly from replay in the latent space rather than in the input space. This is because the impact of poor replay precision is lower in latent space than in input space. Thus, latent-space replay could also be introduced to HSP. In this work, we implemented a simple generative replay based on the TCRL work as preliminary work in this direction. Remarkably, on the Walker environment, latent-space replay demonstrates very strong results: even when evaluated at an ASL of 16, HSP-16 produces results competitive with SAC evaluated at an ASL of 1. We present this result to encourage research and to further increase the benefit of our work to the community. While we performed limited experiments and hyperparameter tuning on the latent-space networks, we still present the results on all environments for completeness. We have added the above explanation to the paper.
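
To clarify the online planning baseline described in point 8, the loop can be sketched as follows (a simplified illustration, not our exact implementation; `policy`, `model`, and the Gym-style `env` interface are placeholders):

```python
def online_planning_rollout(policy, model, env, state, asl):
    """Plan `asl` actions with the learned model, then execute them."""
    actions = []
    s = state
    for _ in range(asl):
        a = policy(s)      # action for the current (predicted) state
        actions.append(a)
        s = model(s, a)    # predicted next state replaces the real observation
    total_reward = 0.0
    for a in actions:      # execute the whole planned sequence in the environment
        state, reward, done, _ = env.step(a)
        total_reward += reward
        if done:
            break
    return state, total_reward
```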

Additionally, in response to the weaknesses and minor points, we have added a definition of macro-actions, added a discussion of hierarchical RL and movement primitives to Section 3.2, fixed the typos mentioned, and changed the citations to square brackets. Thank you for the detailed comments.

[1] Chen, Baiming, et al. "Delay-aware model-based reinforcement learning for continuous control." Neurocomputing 450 (2021): 119-128.

Comment

Thank you for your clarifications.

2: It is still not clear to me whether you apply any delay in RL (appending the action to the state). If not, I still do not understand why there should be a difference between delay and reaction time, as usually, the reaction time is the time between the stimulus (state) and decision (action), which is not the case in usual RL.

I am also unclear about how realistic the model of human decision-making is, as reaction times should not be conflated with observing the state at a lower rate.

6, 7: It is still surprising to me that HSP outperforms SAC with less information, which indicates to me that the evaluation is somehow flawed.

I still tend towards rejection of the work, as at least the presentation and the motivation with links to the human sensorimotor system need to be improved in clarity. If the other reviewers, however, had a better understanding and saw great value in the paper, I would not fight for rejection; therefore, I increase my rating to borderline reject (4).

Comment

Thank you for the updated rating. We would like to provide further clarifications on the points you raise:

  1. Typically, reaction time and delay mean the same thing for biological agents and can be measured as the time between stimulus and response or decision. However, if biological agents acted like current RL agents:
    1. Observe the state of the environment
    2. Compute the action for the observed state
    3. Perform the action in the environment until the next action is computed
    4. Repeat

Then the reaction time for biological agents would be equal to the timestep (the amount of time an action is performed in the environment), since each action is performed until the next action is computed. Thus, we can compare the timestep of traditional RL algorithms to biological reaction times.

Despite the slow reaction times of biological agents, we are capable of remarkable feats of control such as playing tennis, riding a mountain bike, and gymnastics. This is possible due to two mechanisms, both equally important:

  1. Predictive coding to handle delay: In order to handle the large delays (150 ms), the brain employs predictive coding, where actions are picked for the state predicted at the end of the delay, when the actions will actually be performed, rather than for the current state. As we mentioned, this problem has been addressed in previous work, and a simple solution for stationary MDPs is to append the previous action to the current state. However, even if the delay is handled in such a way, the biological agent would still have to perform any computed action for the duration of the reaction time (150 ms). If this were the case, complex control tasks would be impossible.

  2. Macro-actions / sequence of actions: Thus, biological agents output a sequence of actions to address the slow reaction time. However, learning a sequence of actions is a difficult problem that has not been previously addressed. In our current work, we provide a solution that can compete with SOTA RL algorithms.

Without our algorithm, RL agents are forced to control the compute frequency and the action frequency with a single parameter: the timestep. Thus, RL agents need super-human frequencies not to address delay but to control the action frequency. For this reason, slow hardware like a Raspberry Pi is often incapable of complex RL control due to the low action frequency, even though solutions to handle delay exist.

We hope this clarifies why we equate the timestep to the reaction time of biological agents even though we do not utilize delay: it is difficult to measure the decision frequency of the brain, so we use the reaction time as an approximation of it.
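
For concreteness, the two settings can be sketched as follows (a simplified illustration rather than our actual implementation; the Gym-style `env` interface and the `sequence_policy` helper are placeholders):

```python
def standard_rl_episode(env, policy, n_steps):
    """Standard loop: one observation and one compute call per timestep,
    so the effective reaction time equals the environment timestep."""
    s = env.reset()
    for _ in range(n_steps):
        a = policy(s)                 # compute on every timestep
        s, _, done, _ = env.step(a)
        if done:
            break


def sequence_agent_episode(env, sequence_policy, n_steps, asl):
    """Action-sequence loop: the agent observes and computes only every
    `asl` timesteps but still emits one fine-grained action per timestep,
    so a slow compute frequency coexists with a fast control frequency."""
    s = env.reset()
    t = 0
    while t < n_steps:
        action_seq = sequence_policy(s, asl)   # ASL actions from one observation
        for a in action_seq:
            s, _, done, _ = env.step(a)
            t += 1
            if done or t >= n_steps:
                return
```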

  1. Higher performance with less information is not unprecedented and has already been demonstrated in previous work. Specifically in RL, it has been demonstrated that performance can increase when the frame-skip parameter is increased [1]. Braylan et al. demonstrated that on Atari games, a frame-skip as high as 180 (180 times less information) results in a significant increase in performance in some games (when compared to a frame-skip of 0).

We hope these clarifications help shed light on our work. If any points remain unclear, we would greatly appreciate it if you could specify which aspects of our submission or responses need further explanation. Your feedback will be invaluable in refining our manuscript. Additionally, we welcome any further questions or comments you may have during this discussion period.

[1] Braylan, A., Hollenbeck, M., Meyerson, E., & Miikkulainen, R. (2015, April). Frame skip is a powerful parameter for learning to play atari. In Workshops at the twenty-ninth AAAI conference on artificial intelligence.

Review (Rating: 5)

This paper introduces the Hindsight-Sequence-Planner (HSP), a reinforcement learning (RL) model inspired by the brain's ability to achieve precise control using slow neurons. The model aims to mimic human-like sensory and reaction times by leveraging an environmental model for sequence learning. HSP demonstrates competitive performance with fewer observations and actor calls compared to faster RL models. The model is evaluated on various continuous control tasks, showing robust performance even with longer action sequences.

Strengths

  1. The concept of mimicking brain-like conditions in RL models is innovative and offers a fresh perspective on sequence learning. The integration of temporal recall and sequence learning inspired by neural mechanisms is novel.
  2. The use of a temporal recall mechanism allows for fine-tuned action sequence learning despite operating on slower hardware.
  3. HSP demonstrates competitive performance across various continuous control tasks, showcasing its robustness and adaptability.
  4. The experiments cover a range of continuous control tasks and provide comparisons with state-of-the-art models like Soft Actor-Critic (SAC), highlighting HSP’s efficiency.

Weaknesses

  1. The paper lacks specific numerical performance comparisons to quantify improvements over baseline models.
  2. Sections like "Learning the Model" and "Learning Critic" need further elaboration to highlight their specific contributions and novelty.
  3. The broader implications and potential real-world applications of HSP are not fully discussed.
  4. There is insufficient discussion on how HSP handles situations with highly inaccurate model predictions.
  5. The related work section lacks details on "Macro-Actions," the scalability issues of current methods, and the meaning of "principles."

Questions

  1. Can the authors provide more specific numerical comparisons of HSP's performance gains over baseline models in the experimental section?
  2. How does HSP handle situations where the model predictions are highly inaccurate?
  3. Could the authors elaborate on the potential real-world applications where HSP would be particularly beneficial?
  4. Lower computation and scalability in HSP are contributions; how can these two metrics be quantified?

Limitations

  1. The authors mention the reliance on an inaccurate model, but could they provide a more in-depth analysis of how this affects different types of tasks?
  2. The scalability of HSP to very large action spaces and high-dimensional state spaces is not thoroughly discussed.
  3. Typo issue in the abstract: "demonstrating that it not can achieve comparable performance at ‘human-like’ frequencies by relying on significantly fewer observations and actor calls" → "demonstrating that it can achieve comparable performance at ‘human-like’ frequencies by relying on significantly fewer observations and actor calls".
Author Response

Thank you for the detailed review and for acknowledging the novelty and the performance of our method. We first address the weaknesses not already covered by the questions:

  1. The novelty of our approach is the temporal recall mechanism and the introduction of a simultaneously trained model using delayed updates. In response to this, we have updated the sections to highlight the contributions and novelty of our work in the updated revision.
  2. In response to this, we have added a definition of macro-actions and elaborated on the scalability issues of existing approaches. We have removed the word ‘principles’ since it is not required and does not add anything to the sentence.

Questions:

  1. We follow the format of the SAC papers, which provide plots rather than tables [1,2]. The main improvements of HSP can be seen in Figure 3. However, providing the same information in numerical format would require a table with 16 rows and two columns for each environment. We can, however, release an Excel sheet with the raw data required to recreate the figures presented in the paper. If the reviewer can specify for which result they would like numerical comparisons, we will be happy to provide them. The numerical performance gains of HSP over SAC are summarized in Table 2 of the attached PDF. The table shows the longest action sequence length that produces greater than 50% of the best performance on each environment (higher is better). Here, we pick 50% as a threshold for catastrophic failure; however, the threshold can differ between applications and environments.

  2. That is a good question. Since the model is trained in parallel with the actor and critic, model predictions are highly inaccurate during the initial training. However, the model improves with training. We find that the final model accuracy has a direct impact on the performance of HSP for longer ASL and is a better predictor of performance than the complexity of the environment. For example, the final mean squared error for the Half-Cheetah task, which has 17 state dimensions, is around 0.05, while that of Ant is around 0.002 even though it has 27 dimensions. As a result, HSP can maintain performance for longer ASL on Ant than on Half-Cheetah. In future work, we will focus on improving the accuracy of the model to further improve the results. In response to this question, we have added a discussion of model accuracy and HSP performance to the updated revision.

  3. We believe that HSP will have broad real-world applications. Reinforcement learning algorithms are powerful; however, in the current RL setting the observation, compute, and action frequencies all use the same timestep and need to be synchronized. This limits practicality, as it requires fast, costly hardware. Additionally, there are many applications that might be beyond the reach of even the fastest computers. For example, a recent work demonstrated control of tokamak plasmas using deep RL [3], which might help accelerate fusion research. However, their work required a control frequency of 10 kHz. This also means that the required compute frequency is 10 kHz, which might require costly dedicated hardware. With HSP, the compute frequency can be lowered while maintaining the high control frequency, as demonstrated in our work, thus potentially reducing costs significantly. Similarly, other applications include robotics, drone control (125 Hz or 8 ms [4]), and any other application that utilizes RL. In response to this question, we have added a detailed discussion of the real-world applications of HSP to the updated revision.

  4. Lower computation can be quantified by calculating the multiply accumulate operations (MACs) for the number of actions produced.

In our current work, we did not optimize the hidden-units hyperparameter for the GRU and linear layers in HSP. We used the same number of hidden units (256) for all neural networks across all baselines. As a result, HSP is currently more computationally intensive than SAC, but it is still superior to online planning since it does not use the model during inference. In Table 1 of the PDF, we provide the average multiply-accumulate operations (MACs) for different numbers of actions produced by each benchmark.
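
As an illustration of how such a count can be obtained, here is a simplified sketch (the layer layout and dimensions below are hypothetical and are not the exact architecture used in the paper; only the 256 hidden units match our setup):

```python
def linear_macs(in_features, out_features):
    """Multiply-accumulate operations for one dense layer (biases ignored)."""
    return in_features * out_features


def gru_macs(input_size, hidden_size):
    """Approximate MACs for one GRU step: three gates, each with an
    input-to-hidden and a hidden-to-hidden matrix multiply."""
    return 3 * (input_size * hidden_size + hidden_size * hidden_size)


# Hypothetical dimensions: 17-D state, 256 hidden units, 6-D action, ASL of 8.
state_dim, hidden, action_dim, asl = 17, 256, 6, 8

# A feed-forward actor with two hidden layers pays its cost once per action.
feedforward_actor_macs = (linear_macs(state_dim, hidden)
                          + linear_macs(hidden, hidden)
                          + linear_macs(hidden, action_dim))

# A recurrent sequence actor encodes the state once per macro-step and then
# pays the GRU and output cost once per generated action.
sequence_actor_macs = (linear_macs(state_dim, hidden)
                       + asl * (gru_macs(action_dim, hidden)
                                + linear_macs(hidden, action_dim)))

print(feedforward_actor_macs, sequence_actor_macs)
```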

Scalability can be quantified by robust performance on difficult RL environments. The difficulty of an environment can be quantified by its action and state-space dimensions and by the performance of SOTA algorithms on it. Current macro-action approaches do not scale well to difficult continuous control tasks like Ant, HalfCheetah, and Humanoid, and thus demonstrate poor performance on them.

In response to this question, we have added the compute table to the appendix and an explanation of scalability to the updated revision.

Limitations:

  1. We believe question two answers this limitation to some extent. However, in response to this, we will add an updated plot for model-accuracy vs. performance (largest ASL above a threshold) in the appendix of the paper.
  2. As mentioned above, the scalability of HSP depends on the accuracy of the model and the scalability of the SAC algorithm which HSP utilizes. In response to this, we have added a discussion on the topic to the paper.
  3. We have fixed the typo. Thank you for pointing it out.

[1] Haarnoja, Tuomas, et al. "Soft actor-critic algorithms and applications." arXiv preprint arXiv:1812.05905 (2018).

[2] Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." International conference on machine learning. PMLR, 2018.

[3] Degrave, Jonas, et al. "Magnetic control of tokamak plasmas through deep reinforcement learning." Nature 602.7897 (2022): 414-419.

[4] Kaufmann, Elia, et al. "Champion-level drone racing using deep reinforcement learning." Nature 620.7976 (2023): 982-987.

Comment

Thank you for your detailed and thoughtful rebuttal. I appreciate the clarifications on the unique contributions of the temporal recall mechanism and the challenges addressed by the HSP model. After carefully considering your responses and the additional context provided, I will maintain my current score. As the reviewer with the highest score, I believe this is an appropriate evaluation given the innovative nature of your approach and the robust performance demonstrated in various continuous control tasks. However, I also recognize the concerns raised by other reviewers, particularly regarding the need for clearer presentation and deeper exploration of certain aspects, such as the relationship between action sequence learning and real-world biological systems.

Review (Rating: 4)

This paper proposes HSP, a biologically inspired framework for learning-based control. Motivated by the human brain, HSP can deal with observation and computation at different frequencies by making the "actor" produce action sequences, similar to the functioning pattern of the basal ganglia and the prefrontal cortex in the human brain. HSP employs a model-based training approach to achieve model-free control, resulting in precise behavior despite running on slow hardware. The authors demonstrate the performance of HSP on various continuous control tasks.

Strengths

This paper is well structured with clear figures, and the motivation and introduction are interesting, bridging the gap between biological and artificial reinforcement learning systems. The empirical results across various continuous control tasks are impressive as well. Overall, the work presents a promising direction for developing more efficient and adaptable RL algorithms that could have broad implications for robotics and other real-world applications.

Weaknesses

  • One major concern is the performance of the proposed framework. The experimental results do not show significantly better performance than traditional RL methods, and the performance is worse for $J=16$ in most cases. The latent-space variant of HSP shows improvements only in one environment (Walker2d). Further investigation into why this approach does not generalize well to other environments would be valuable.
  • The novelty of the proposed framework is also questionable. In control literature, especially trajectory optimization control, producing action sequences is commonly used, and a similar technique could be viewed as a variant of model predictive control (MPC). It would be better if the authors could provide a detailed comparison with existing control algorithms (like MPC) in the experiment section.
  • The comparison with model-based online planning is somewhat limited. A more comprehensive comparison with state-of-the-art model-based RL methods would provide better context for HSP's contributions.
  • Just a small suggestion: the title of the paper mentions "hardware," so I expected to see some real-world control experiments in the paper. It would be nice if the authors could actually demonstrate the control performance using "slow hardware" like a Raspberry Pi or an even slower platform.

Questions

  • How does HSP perform in environments with stochasticity? Traditional RL and control methods could flexibly correct errors due to high control frequency. Could you provide results or insights on how the algorithm might be adapted for such scenarios?
  • The current implementation uses a fixed action sequence length. Have you considered mechanisms for dynamically adjusting the sequence length based on the task or environment complexity?
  • Can you provide more detailed analysis of the computational costs of HSP compared to baseline methods, especially during inference?
  • The paper doesn't specify the computational resources required for training the models. Given that training RL models could be computationally intensive, could you provide more details about this?

Some typos:

  • Equation (1): $a_t$ should be $a_{t'}$
  • Line 272: HSL to HSP

Limitations

Limitations are discussed.

Author Response

Thank you for the detailed review and your acknowledgement of the promising direction and applications of our work.

Weaknesses:

Framework performance: We understand that these concerns are based on Figure 2. During training, we evaluate HSP-$J$ at an action sequence length of $\text{ASL}=J$. Therefore, the HSP-$J$ learning curves are evaluated at a disadvantage: they receive an input only every $J$ steps. This is more apparent when Figure 2 and Figure 3 are viewed together. For instance, on the Ant-v2 environment, Figure 2 shows an HSP-16 average reward of around 4000 since it is evaluated at an ASL of 16. However, when it is evaluated at an ASL of 1, the average reward is 5405.75. While this performance is still lower than that of SAC (7073.27), when compared at higher ASL, the performance is significantly better than SAC for all environments and all $J$. Again, looking at Ant-v2 at an ASL of 2, HSP-16 has an average reward of 5726.08 while SAC has an average reward of -186.15. Thus, HSP trades off the average reward at 50 ms (by 23.57%) in order not to fail at 100 ms. We believe that this demonstrates the significant benefit of HSP over existing methods.

SOTA algorithms in RL focus on performance while disregarding other factors. Therefore, all the hyperparameters including timestep are picked to maximize performance. On the other hand, HSP provides an alternative algorithm that can be robust to changes in observation frequency or compute frequency.

Latent space variant: We used the same method as the TCRL paper without any modifications. We presented the Walker-2d results as an encouraging example for further research and to demonstrate that further improvements are possible, thereby increasing the benefit of our work to the community. We understand that we did not clearly state this in the paper and we have added a clarification text in response to your comment. We are willing to move the latent space variant to the appendix and mention it in future work.

Novelty: We respectfully disagree on this point. The novelty of our work is not the production of action sequences but the production of action sequences without utilizing a model. As the reviewer rightly points out, there is a vast amount of work, including MPC and model-based RL, that utilizes action sequences. However, our work is more similar to model-free RL than to any other area. HSP is primarily a reinforcement learning algorithm based on SAC and does not start with any trained model or system knowledge. Once trained, it again does not need any model for deployment. In comparison, MPC requires an accurate dynamics model of the system, which limits its practicality to systems that have already been sufficiently modeled. HSP makes no such assumptions. Thus, it is not possible to directly compare HSP to MPC systems. Finally, like existing RL approaches, current MPC approaches require super-human compute frequencies to operate well. Examples:

  1. Galliker, Manuel Y., et al. (2022) walker at 10 ms.
  2. Farshidian, Farbod, et al. (2017) four legged robot at 4 ms.
  3. Di Carlo, Jared, et al. (2018) MIT cheetah 3 at 33.33 ms.

Model-based RL: MBRL methods either require a pre-trained model of the environment, require a model during inference, or both. Thus, they are not directly comparable to our method, HSP. However, we provide the comparison to online planning as it demonstrates an interesting property of HSP. The performance of online planning depends on the policy and the accuracy of the model: planning longer sequences requires a model of very high accuracy to reduce the effect of additive noise. The results demonstrate that additive noise does not affect the performance of HSP to the same extent as online planning. We do not utilize any noise-reducing algorithms or SOTA approaches that improve online planning performance, since they are not in the scope of our work. In response to this, we have added a detailed explanation of the setting and the results for the online planning experiment.

We understand that the word hardware in the title might cause confusion. We simulated the slow hardware and presented a biologically plausible mechanism for the slow biological hardware. We will change the title of the paper to avoid this confusion. We agree that hardware implementation is a great future demonstration. We are planning to implement it.

Questions:

1 & 2. Like any RL algorithm, the final performance of HSP will depend on the type of stochasticity in the environment. If the stochasticity can be predicted from states and actions and managed by the actions of the agent, then $J$ can be larger. However, unpredictable stochasticity can be treated similarly to a noisy model: a larger $J$ will lead to poor performance. We believe that the best solution, as you also suggest, is to develop a method to adapt $J$ and the ASL based on the expected noise and stochasticity. This proposed solution is similar to the attention mechanism in the brain and is an important future direction of our work. In response to this question, we have added a discussion on stochasticity and added adaptive ASL as future work.

  3. In our current work, we did not optimize the hidden-units hyperparameter for the GRU and linear layers in HSP. We used the same number of hidden units (256) for all neural networks across all baselines. As a result, HSP is currently more computationally intensive than SAC, but it is still superior to online planning since it does not use the model during inference. We provide the average multiply-accumulate operations (MACs) for different numbers of actions produced by each benchmark in the attached PDF as a measure of computational cost. In response to this question, we have added the table to the appendix.

  4. We provide the relevant details in Appendix A.2.

We would also like to thank the reviewer for their detailed read of our paper and have fixed the typos they pointed out.

Comment

Additionally, in response to the online planning comment, we have updated Figure 4 to also display the average model prediction error for the last predicted state as a function of the ASL. The error increases exponentially due to the cumulative effect. Since we cannot provide the updated figure here, we provide the relevant data below:

| ASL | Avg. Reward HSP | Avg. Reward Online Planning | Avg. MSE for last predicted state |
|-----|-----------------|-----------------------------|-----------------------------------|
| 2   | 8405.57         | 8393.16                     | 0.03                              |
| 4   | 8445.52         | 8407.19                     | 0.64                              |
| 6   | 8297.33         | 5642.60                     | 3.56                              |
| 8   | 7928.48         | 2334.13                     | 76.83                             |
| 10  | 7070.55         | 376.83                      | 487.46                            |
| 12  | 4814.15         | 328.20                      | 3723.16                           |

The above table demonstrates the sensitivity of the performance of online planning to the model prediction MSE. On the other hand, the performance of HSP is much more robust to very high model prediction MSE.
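
This compounding can be measured with a simple open-loop rollout; the sketch below is a generic illustration (with placeholder `model`, `policy`, and Gym-style `env`), not our exact evaluation script:

```python
import numpy as np

def last_state_mse(model, policy, env, asl):
    """Open-loop model rollout of length `asl`: the planner acts on predicted
    states while the simulator provides the ground truth for comparison."""
    s_true = env.reset()
    s_pred = np.array(s_true, dtype=float)
    for _ in range(asl):
        a = policy(s_pred)                 # action chosen from the predicted state
        s_pred = model(s_pred, a)          # predicted next state; errors accumulate
        s_true, _, done, _ = env.step(a)   # true next state from the environment
        if done:
            break
    return float(np.mean((s_pred - np.array(s_true, dtype=float)) ** 2))
```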

Author Response

We thank all the reviewers for their detailed comments. We believe that all their comments have improved the quality and clarity of our paper. Here we attach additional tables to the PDF to support our rebuttal. They are referenced in the individual rebuttals.

Final Decision

This submission introduces the "Hindsight-Sequence-Planner (HSP) algorithm, a biologically plausible model for sequence learning." The claims to biological plausibility are not supported in the paper with any reference to specific neuroscientific or behavioral data. Beyond this, reviewers found the performance to be not significant or questionable and the proposed approach to be improperly positioned as novel in the context of prior work on reinforcement learning at varying timescales. In addition, the experiments present no quantification of uncertainty to demonstrate that the performance reported is broadly representative of the presented approach, nor other robustness experiments (Henderson et al., 2017; Agarwal et al., 2021); the plots appear to have been produced via a single seed. Lastly, the checklist claims code is provided to "to reproduce the results," but the only link to code is an existing open-source implementation of SAC, thus this claim is made spuriously.