Intrinsic Goals for Autonomous Agents: Model-Based Exploration in Virtual Zebrafish Predicts Ethological Behavior and Whole-Brain Dynamics
A novel intrinsic motivation method based on world-model memory mismatch enables embodied agents to exhibit robust autonomous behaviors that closely match whole-brain neural data from zebrafish.
Abstract
Reviews and Discussion
In this work, the Authors introduce 3M-Progress, a novel framework for intrinsic motivation, and compare the predictions of this framework to previously published behavior and neural activity of larval zebrafish in the futility-induced passivity task. They show that the proposed model, compared to several existing baselines, best matches the animals' behavior in this task and reproduces the animals' neural activity marginally better than the baselines. The proposed model is mapped onto the interaction between neurons and astrocytes in the brain.
Strengths and Weaknesses
Strengths:
-
The research here pursues an interesting and relevant topic of how intrinsic motivation affects decision making. Although the concept of intrinsic motivation has been used in machine learning for quite some time now, it couldn't be readily used to model biological motivations, despite being loosely inspired by them.
-
In this work, the Authors compare their model to neuroscience data and consider a number of baseline models, which is, to date, the best way to propose and validate neuroscience-related hypotheses. They show that the proposed model best matches the animals' behavior in the task among the baselines and is in the same ballpark as the other considered models in predicting the animals' neural activity.
-
The text is well-structured and, for the most part, written clearly, allowing readers and reviewers to know exactly what was done in this project and thus to arrive at a well-grounded evaluation of this work.
Weaknesses:
-
The language of the introduction and, overall, the non-descriptive parts of the paper appears inflated, overstating the scope of the provided results. The paper correctly states that existing models in the field "fail to produce robust autonomous behaviors observed in animals", "reproduce specific experimental findings", or offer "simplified, bottom-up mathematical models"; that current models are limited to "constrained environments with either handcrafted object-centric inputs, or externally-defined rewards", or "non-embodied settings". The current work, in contrast, is claimed to "bridge this gap" by "leveraging biologically-realistic, open-ended scenarios rather than simplified tasks" to yield "a naturalistic theory of neural-glial computation, offering insights into evolutionary pressures underlying the detailed heterogeneity observed at level of individual neural and glial cells." This seems inconsistent because i) the model is compared to a single experimental observation in a single model organism and ii) the models used in RL/robotics are quite sophisticated. This is fine in itself; however, so as not to mislead the reader who will learn from this text, the scope of the claims needs to be adjusted.
-
While this work compares the animals' behavior to multiple baselines, which is the proper thing to do, these baselines, as the Authors point out, are not necessarily biologically plausible. As for the specific model proposed here, which matches the behavioral observations in zebrafish larvae, the design choices in the model are made specifically to account for particular biological observations and thus may not generalize to broader scenarios, which would be a typical test generally applied to new models. Specifically, the distributional matching, enforced by the KL divergence in the Equation on Line 210, directly addresses the problem of "Noisy TVs", i.e., exploration hijacked by white noise, and the constraint in the Equation on Line 214 with the symmetric function f described in Line 218 directly enforces the periodicity of behavioral phases observed in larval zebrafish and shown in Figure 3A. Thus, while this makes the proposed model principled and valid by design, it is unclear (1) whether the model will generalize to new observations and (2) whether the observed effect could also be explained by other models.
-
Checking the models against the neural activity is, again, a highly appropriate step; however, the proposed analysis may not be sufficient to decide between the models based on the similarity to the observed neural activity. Specifically, all the model-based frameworks in Panel A exhibit high correspondence to the neural activity recorded in larvae. Yet, they lead to different behaviors. Would the latent dynamics of the other models look similar in Figure 5? Does that mean that the bulk of the neural activity is then irrelevant for this behavior? Should we then rely on this similarity at all?
Overall, it's an interesting and promising work; however, I believe that (1) the strength of the claims needs to be adjusted to reflect the scope of the work and (2) the work's merit then needs to be discussed among the reviewers with regard to NeurIPS's expectations, in light of the work's results revisited during step (1).
Questions
-
What other biologically plausible models could explain the observed behavior?
-
What would be the steps to independently distinguish the models based on their neural activity?
-
What additional effect of the observed behavior or what additional behaviors would the model predict?
-
Perhaps I've overlooked this in the text, but it would be great to see / highlight the proposed circuit based on the results of this paper and to test / discuss its presence in the brain.
Limitations
- Limited evaluation: a single model organism, a single behavioral observation, and a single biologically plausible model.
Final Justification
Post-rebuttal: some (not all) of the concerns were addressed; raising the score accordingly
Post-rebuttal: second iteration: the remaining concerns that I had in the original review were addressed; raising the score.
Formatting Issues
N/A
Overall Response
We thank the reviewer for their detailed, thorough, and constructive feedback. We found the advice and suggestions of the review very helpful for improving the quality and clarity of the paper, and we have implemented all of the reviewer's suggestions. Below, we provide responses to the reviewer's concerns and specific questions.
Response to Specific Concerns
We agree that some of the language of the introduction was grandiose and nonspecific, and should be rewritten to reflect the specific contributions of the paper.
Response to concern #1
Regarding the comparison of our model to a single experimental observation in a single model organism, we agree that this is a limiting factor in demonstrating the generality of our method. However, to the best of our knowledge, the dataset provided in Mu et al. 2019 is the only example of whole-brain recording during an autonomous cognitive animal behavior. So, it is not feasible to test this method on other behaviors or in other organisms, since the validation data does not yet exist.
We would also like to point out that the baseline algorithms are state-of-the-art methods in RL/robotics used for model-based exploration. The goal of our paper is to use this unique dataset to demonstrate that these existing approaches to autonomous exploration in reinforcement learning, especially curiosity-driven methods, may not capture the nature of autonomous behavior in real animals. The zebrafish dataset is one example of this gap, and can already serve to motivate new algorithms to understand animal autonomy.
In the future, we are excited for new experiments in other organisms to be released so we can continue to develop and validate our model on a wide variety of data. Until then, we are currently working on demonstrating how 3M-Progress can be used as an exploration mechanism in general reinforcement learning problems. However, to emphasize the generality of 3M-Progress, we have added a section in the main paper describing its implementation and intuition for other embodiments and new environments.
Response to concern #2
We agree with the reviewer's observation that the design choices for 3M-Progress were specifically inspired by observations in the zebrafish. However, the 3M-Progress algorithm itself is a general intrinsic motivation strategy that can be used for other embodiments and tasks. The focus of our paper, however, is to show that existing intrinsic motivation algorithms do not generalize to autonomous animal behavior, and we propose 3M-Progress as a working solution. Rather than testing 3M-Progress on a suite of RL benchmarks, we are testing existing algorithms on an animal autonomy benchmark.
Although 3M-Progress can be benchmarked in standard RL domains, it is beyond the scope of the paper since we are not introducing 3M-Progress to compete with existing algorithms in standard RL domains. To reiterate, we are flipping this view by testing curiosity-driven algorithms in real-life neuroscience tasks for which we have behavioral and neural data to validate against.
Finally, we would like to remark that the purpose of our work is not to propose 3M-Progress as a final or complete model of autonomous exploration, but rather to provide a platform for future work on reverse-engineering autonomous behavior in animals, since our agent can be deployed in settings other than the ones it was trained in.
Response to concern #3
Regarding whether the observed effect could be explained by other models, we first point out that none of the considered algorithms, both model-free and model-based intrinsic motivation, captured the autonomous behavior. Here, biological plausibility just refers to whether the algorithm could be feasibly implemented by an animal brain. For example, Disagreement assumes that an agent maintains an ensemble of world models simultaneously learned online, which is generally intractable for biological agents.
However, the remaining algorithms (learning progress, ICM, max-entropy, PID control) are biologically plausible in principle and yet they do not capture the behavior. Further, we could design a simple algorithm that does capture the behavior, such as balancing an action cost with a homeostatic drive, which seems biologically plausible and avoids curiosity or introducing world models. The PID controller (Figure 4A) is one such example, designed to simply accumulate feedback error until hitting a threshold, where it switches between one of two possible states. However, Mu et al. 2019 explicitly ruled out such hypotheses with extensive experiments, and found that the futility-induced passivity behavior was not driven by fatigue nor positional homeostasis. Rather, both the behavioral and neural evidence support an error-tracking mechanism implemented by neural-glial circuits.
This provides the basis for curiosity-driven intrinsic motivation, since error-tracking in a novel environment implicitly assumes the presence of a predictive model. Given this basis, our paper focuses on how a predictive model should be used for intrinsic motivation, and show that classic curiosity-driven approaches from reinforcement learning do not capture this specific case of autonomous animal behavior.
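To make the ruled-out alternative concrete, here is a minimal sketch of a threshold-switching error integrator of the kind described above (the PID-style baseline). All constants, names, and the loop structure are hypothetical illustrations, not the paper's implementation:

```python
def threshold_switching_controller(n_steps=200, gain=1.0, leak=0.97,
                                   upper=5.0, lower=0.5):
    """Toy closed loop: feedback error accrues only while the agent is active."""
    integrator, state, states = 0.0, "active", []
    for _ in range(n_steps):
        err = 1.0 if state == "active" else 0.0   # failed swim attempts only while active
        integrator = leak * integrator + gain * err
        if state == "active" and integrator > upper:
            state = "passive"                      # futility: behavioral suppression
        elif state == "passive" and integrator < lower:
            state = "active"                       # error has leaked away: resume swimming
        states.append(state)
    return states

states = threshold_switching_controller()
print(states.count("active"), states.count("passive"))
```

A controller of this form does oscillate between states, but, as noted above, the experiments in Mu et al. 2019 rule out fatigue- and homeostasis-based accounts of the zebrafish behavior.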
Response to concern #4
Regarding the saturation of the chosen metric, this can be attributed to our choice of evaluation criteria, which we explain below. Although the zebrafish behavior itself involves multiple transitions that occur in sequence over a single trial, we instead chose to compare behavior at the level of individual transitions for convenience.
First, we point out that no intrinsic reward algorithms exhibited stable behavioral transitions within an episode over the course of training except for 3M-Progress (Figure 2A). For example, where 3M-Progress agents exhibit a passive-to-active transition, no such transition occurs for other agents in the same episode. This is because the baseline algorithms are unstable, exhibiting spurious or transient behaviors. Thus, in order to facilitate comparisons between individual transitions in 3M-Progress agents and baseline algorithms, we took a generous approach by allowing the missing transition in baseline algorithms to come from any episode from any model checkpoint. This means that despite the baseline agents not meeting the full behavioral desiderata of stable transitions within a single episode, the individual transitions are on equal footing with 3M-Progress agents, yielding a relatively high model-brain alignment (albeit still lower than 3M-Progress).
One way to address this is to require that both active and passive transitions must come from the same episode for all baseline algorithms, as was done for 3M-Progress and as the animal exhibits. Since other algorithms fail to produce both transitions in a single episode, we simply choose the behavior following the identified transition to represent the missing transition. For example, RND often exhibits a valid passive transition early in the episode, and remains passive for the remainder of the time. For the active transition, we take any window of time following the passive transition to represent the active transition. Using this criterion, RND, the next most performant model overall after 3M-Progress, falls substantially in model-brain alignment on active transitions. This evaluation criterion reflects the fact that the zebrafish behavior is characterized by sequences of transitions within a trial.
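For illustration, a minimal sketch of the stricter criterion described above (behavior labels and the helper name are hypothetical; the paper's actual evaluation pipeline may differ):

```python
def has_both_transitions(behavior_labels):
    """behavior_labels: per-timestep 'active'/'passive' labels for a single episode."""
    transitions = {(a, b) for a, b in zip(behavior_labels, behavior_labels[1:]) if a != b}
    return ("active", "passive") in transitions and ("passive", "active") in transitions

# Example: an episode with a passive bout followed by a return to activity passes.
episode = ["active"] * 5 + ["passive"] * 5 + ["active"] * 5
print(has_both_transitions(episode))  # True
```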
We can further distinguish between models on the basis of their latent dynamics (Figure 5). While we did not originally include the latent dynamics of the other intrinsic reward algorithms, we have added this to Appendix F. Despite the model-brain alignment for ICM, RND, learning progress, and Disagreement being fairly high (but still lower than 3M-Progress), their latent dynamics fail to capture the neural-glial circuit observed in Mu et al. 2019.
Response to Questions
Q1: What other biologically plausible models could explain the observed behavior?
We answer this question above in the "Response to Concern #3" section.
Q2: What would be the steps to independently distinguish the models based on their neural activity?
We answer this question above in the "Response to Concern #4" section.
Q3: What additional effect of the observed behavior or what additional behaviors would the model predict?
Given any dynamics prior that encodes intrinsic preferences, 3M-Progress agents will seek out behaviors in new environments relative to their alignment with their prior. The specific behavior they pursue primarily depends on the pretraining environment(s) from which the prior memory is obtained.
Q4: Perhaps I've overlooked this in the text, but it would be great to see / highlight the proposed circuit based on the results of this paper and to test / discuss its presence in the brain.
We agree that the proposed circuit is a core component of our paper, and we thank the reviewer for the opportunity to clarify this for the reader. 3M-Progress is a direct abstraction of the model-based error-tracking and alignment mechanism---each of its components reflects a computational motif of the neural-glial circuit proposed in Mu et al. (2019). We explain these details in the last paragraph of the "3M-Progress" section within the Methods portion, as well as in the "Latent Dynamics of 3M-Progress Agents Reflect Underlying Neural-Glial Computations" section within the Results portion. However, in order for the proposed circuit to be clearly outlined in the paper, we have added a section in the Methods portion that summarizes these details to explain how 3M-Progress provides a normative circuit model.
I would like to thank the Authors for their detailed response.
Many of its points sounded convincing. These include that statements that:
- There aren't other similar datasets to perform similar analyses.
- A goal here is to show that ML notions of motivation do not apply, while biological ones do.
I am still on the fence about a few remaining points including:
- The lack of other biologically plausible algorithms. I agree with the Authors' rebuttal point that the readily-available models, especially for the task at hand, may not be abundant; still, having alternative models (even if the Authors had to come up with them based on the biological motivation literature) could improve the work by a lot.
- I may be missing it, but what about the following concern? "all the model-based frameworks in Panel A exhibit high correspondence to the neural activity recorded in larvae. Yet, they lead to different behaviors. <...> Does that mean that the bulk of the neural activity is then irrelevant for this behavior?"
Overall, I appreciate the Authors' willingness to adjust the claims in the paper. Even though the Authors have provided a detailed response here about the circuit, I would encourage the Authors to crystallize the circuit in one diagram and a related short text paragraph in the Results. Otherwise, even though I still have some reservations as outlined above, some of my concerns that were addressed warrant an increase in the score. While still borderline, it should be sufficient for the paper's acceptance.
Please feel free to reach out regarding the remaining concerns.
We thank the reviewer again for their constructive feedback. We would like to take the opportunity to clarify any remaining questions.
- Regarding your first concern, we would like to clarify that several of the readily-available models (ICM, RND, learning progress) are biologically plausible. Please see “Response to Concern #3” and our response to “Q3” to reviewer Ypm3 for full details. In short, each algorithm assumes the agent maintains at most two memories that are learned in either inverse or random feature spaces, both of which are biologically plausible design choices. Additionally, the homeostatic model (Fig. 4, 5) and PID controller (Fig. 5) are both algorithms inspired by the biological motivation literature. The homeostatic algorithm consists of an agent trained to balance positional homeostasis with energy expenditure. The PID controller uses sensorimotor feedback error to act as a homeostatic regulator between active and passive states, directly based on the hypothesis presented in Mu et al. (2019) from the experimental neuroscience literature. Additionally, we test several ablations of 3M-Progress (see item D2 in response to reviewer fjGy), each of which is an alternative biologically plausible algorithm. We have clarified these details in the paper so that the biological plausibility of the various algorithms is apparent to the reader.
- Regarding your second concern, the “Response to Concern #4” was meant to address this. Specifically, it is not the case that the bulk of the neural activity is irrelevant for behavior. Rather, the high performance of other models despite exhibiting different behaviors is primarily due to our evaluation criteria. To reiterate some details in “Response to Concern #4”, we were generous in our initial evaluation criteria by allowing transitions in baseline algorithms to come from any model rollout or checkpoint because they do not exhibit both transitions within an episode. On the contrary, the real zebrafish behavior is characterized by sequences of transitions within a trial. When we apply the criterion that both transitions must come from a single model episode, the next most performant model drops in model-brain alignment by nearly 40%. Please see “Response to Concern #4” for more details and let us know if you have any additional questions.
- We have followed the reviewer’s advice to add a short paragraph illustration of the circuit in order to clarify this result. Here we provide a draft of this paragraph, and we hope this improves the clarity of the proposed circuit to the reader.
"3M-Progress detects sensory-motor mismatch using a prior memory as an expectation of how action is coupled to sensory-feedback under normal, ethological environment dynamics (Fig. 2A). This is functionally equivalent to signaling failed swim-attempts by NE-MO neurons in zebrafish. Similarly, the exponential weighted average is a discrete-time leaky integrator on model-memory-mismatch (NE-MO input), which is functionally equivalent to radial astrocytes that accumulate NE-MO signals during failed swim-attempts and decay during passivity (Fig. 2C). PCA reveals that these design choices lead to the emergence of the proposed neural-glial circuit (Fig. 2A) in the dominant latent dimensions of the agent's core module (Fig. 5). Thus, agents trained with 3M-Progress successfully validate the error-detection and accumulation circuit hypothesis, whereby glial responses accumulate evidence of motor futility via noradrenergic signaling to drive behavioral suppression, and neural responses reflect transient activation patterns associated with detecting mismatches between expected and actual sensory outcomes during unsuccessful swim attempts."
Once again, we kindly thank the reviewer for their helpful and thorough feedback. Please reach out if you have any remaining questions or concerns.
Thanks, that all makes sense.
Re: 2): please include the more stringent of your criteria in the final version of your paper.
In the light of the clarity of the provided responses, I am raising my score.
Thank you so much, we will most definitely include the more stringent criteria in the final version.
The paper describes a model of learning, behavior, and neural activity in zebrafish; the simulated fish swims head-fixed in a simulated dm-lab environment. Although this is an ambitious project requiring a large amount of engineering work at multiple levels, the paper focuses in particular on the learning algorithm which enables natural bi-modal swimming behavior and brain-like neural activity. The learning model requires training a vision encoder, a proprioception encoder, a policy model, a world model (used to define an intrinsic reward), and a value model (used to predict the intrinsic reward for PPO). The core of the analysis looks into alternative definitions of the world model and intrinsic reward that enable a realistic behaviour. 3M-Progress is described as a working solution. The resulting behaviour nicely oscillates between “passive” and “active” behaviour. The artificial neural network activity is then compared with the recorded neural and glial cell activity, and the activity correlation is also highest with this algorithm.
Strengths and Weaknesses
The project is extremely ambitious and requires a lot of engineering work. While the project is impressive and achieves multiple conceptual milestones on a high level, I believe this paper would have a greater impact if it were written better. For instance, a clearer description of the contributions and findings is needed, and some of the mathematical notation is also unclear. I believe the technical work is deep, and this paper could become a very good paper with a substantial re-writing of the contributions and the technical description.
One high-level concern: In hindsight, the paper falls partially into some of the difficult existential problems of large-scale simulation work in computational neuroscience. While the algorithm is sound and the simulation is impressive, I am doubtful that 3M-Progress will remain as “the good” model of the head-fixed zebrafish. In the end, the behaviour appears to be a bi-modal oscillatory behaviour between active tail movement and not moving. Although 3M-Progress seems to be a working solution, I wonder if a simpler model would also achieve that (like oscillating between maximizing action and fatigue, without modeling curiosity or proprioception), and I also wonder if unintended details of 3M-Progress are responsible for the results. For instance, could the oscillating behaviour just be an effect of the lag between the new and old parameters?
A possible simple way to address my high-level concern is to make a clear list of technical contributions and provide more details to make them individually usable or reproducible. I believe that great individual technical contributions are shadowed by the grand claims, which are less solid. I am thinking of technical contributions that have a higher chance to be useful to other researchers, like the embodied zebrafish dm-lab simulation (if this is a contribution of this work), the novelties of the intrinsic RL algorithm, or setting a benchmark with a clear quantitative metric (e.g., correlation of neural activity from the zebrafish dataset).
Suggestions to improve the writing, which is currently sub-optimal. Addressing some of them is very likely to increase my grade.
A) Clarifying the contributions: make clearer what the list of technical contributions is in the intro.
B) Unclear technical description of the model:
B1) When reading paragraph 139 about the PPO algorithm, more basic things like the action space were not well defined
B2) The notation Phi is not clear; it is used with small variations in Figure 2a and ll. 166, 169, 174 (sometimes Phi(S_t), sometimes Phi^I, sometimes Phi_t), but it is never defined rigorously. I feel that it broadly means “some network model”, which is not a good notation. There should be clear notation to dissociate the vision encoder, proprioception encoder, encoders used in other algorithms, etc.
B3) Paragraph line 167: It is not clear if this section is a summary of related work or a contribution of the paper. I then understood that these are the baseline models, but this is quite obscure when reading the paper linearly in logical order.
B4) The notation Var( {omega} ) is not very clear. omega is a Gaussian density, that’s okay, but what is Var( {omega} )?
B5) The whole paragraph between lines 210 and 211 is quite vague. It is written as if it were a proof, but I believe it’s also part of the definition of the 3M-Progress algorithm? What is L_theta_old in line 210?
B6) The definition of the intrinsic reward has no equation number. Why not simply define r = abs(epsilon_hat - epsilon)? The usage of a general f is not helping, but rather adds to the general confusion.
C) More details for the quantitative results, and More info is needed to reproduce the results or compare results on the same dataset.
C1) It would be useful to define mathematically the way the Pearson correlation is computed. Is this trial-averaged in some way? Is the correlation computed with some smoothing over time? Currently, all the “model-brain alignment” metrics are very obscure: a high single-trial correlation over time for two units would be very strange, because the behaviour itself is mostly different in each trial, so each unit is most likely not correlated with itself under different behavior histories. Ideally, one would need more than one metric, see for instance [3]. I also find it unsatisfactory that the chosen metric is saturated for most models (see Supplementary Figures 6 and 7).
C2) Also, a mathematical definition of the “one-to-one” alignment would be needed, as it is needed to compute the correlation metrics and reproduce the quantitative results. Are you using some optimal bijective mapping between the two networks (as in [3])? Is it first assumed that the number of neurons in the model and the brain are identical via random sampling before applying the optimal assignment? These kinds of choices strongly affect the quantitative results, so they should be explained clearly.
D) I wish the results were analyzed more deeply. The paper tends to claim grand achievements (“the model passes the neuro ai turing test”). I would not necessarily recommend that, but if such claims are made, it is at least necessary to give all quantitative and technical details to reproduce and justify this wording, and to scratch below the surface to illustrate the stated results. Here are some suggestions: D1) Grand claims like “the model passes the neuro ai turing test” need to be clearly linked to a set of behaviour and neural activity measurements which are clearly defined and detailed.
D2) I wish an ablation study had been run on 3M-Progress to better understand what element of the model is responsible for the quantitative reports.
E) Missing reference to previous work: In addition to the fly model from Janelia, there have been other attempts to model the organism and its environment before the “embodied neuro ai” terminology, like the “swimming neuron” [1] or the open worm project [2]. Reference [3] also modeled neural activity and behavior at large in mice.
[1] https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010899 [2] https://openworm.org/ [3] https://proceedings.neurips.cc/paper_files/paper/2023/hash/ec702dd6e83b2113a43614685a7e2ac6-Abstract-Conference.html
Questions
Questions are part of my main feedback above; the main ones are:
-
I wonder if a simpler model would also achieve that (like oscillating between maximizing action and fatigue, without modeling curiosity or proprioception)
-
How is the model-brain alignment correlation metric defined mathematically?
Limitations
I see no potential negative impact on society.
Final Justification
The authors have written in their rebuttal that they will describe and highlight their technical contributions better and clarify the model description when it was mathematically ambiguous.
Formatting Issues
Minor things: some of the equations do not have numbers. Some of the lines do not have numbers.
Overview
We thank the reviewer for their detailed, thorough, and constructive feedback. We found the advice and suggestions of the review very helpful for improving the quality and clarity of the paper. We have implemented all of the reviewer's suggestions. Below, we outline the specific changes we made per suggestion, and answer the reviewer's questions.
Response to Specific Concerns
A. We agree that the technical contributions of the paper were overshadowed in the introduction of the paper. We have rewritten the introduction to be specific, less grandiose, and include a list of technical contributions:
- 3M-Progress, a novel intrinsic motivation algorithm that tracks divergence between an online world model and frozen dynamics prior.
- 3M-Progress agents reproduce whole-brain neural-glial calcium dynamics and behavior, providing the first goal-driven model of neural-glial computation.
- A virtual environment that closely replicates zebrafish experiments in silico, serving as a platform for further investigating whole-brain neural-glial mechanisms.
B. We agree that there were several areas in the technical description of the model that were lacking. We addressed each of the reviewer's concerns as follows.
B1-B2. We have added a description of the action space to paragraph 139, as well as a clear definition of the encoder notation to avoid confusion. Specifically, we altered Figure 2A to denote two separate feature encoders for the visual and proprioceptive inputs, respectively. Then, we specified that when the encoder symbol is used without a subscript, it refers to the concatenation of the features from each encoder. Finally, we make clear that these encoders form the basic architecture of the agent and remain the same for all the intrinsic reward algorithms. In addition, we point the reader to specific details of the encoding models in Appendix D.
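As a rough illustration of the clarified notation (dimensions and module names are hypothetical stand-ins, not the paper's architecture), the shared feature is simply the concatenation of the two encoders' outputs:

```python
import torch
import torch.nn as nn

class ConcatEncoder(nn.Module):
    def __init__(self, visual_dim=64, proprio_dim=16, feat_dim=32):
        super().__init__()
        self.visual = nn.Linear(visual_dim, feat_dim)    # stand-in for the vision encoder
        self.proprio = nn.Linear(proprio_dim, feat_dim)  # stand-in for the proprioception encoder

    def forward(self, obs_visual, obs_proprio):
        # The unsubscripted feature is the concatenation of both encodings.
        return torch.cat([self.visual(obs_visual), self.proprio(obs_proprio)], dim=-1)

enc = ConcatEncoder()
print(enc(torch.randn(1, 64), torch.randn(1, 16)).shape)  # torch.Size([1, 64])
```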
B3. We agree that the placement of the description of baseline algorithms was not properly demarcated and interrupted the logical flow of the section. To fix this, we moved the description of the baselines to Appendix C.1, where we are able to provide additional detail on their respective algorithms and the intuition behind their intrinsic rewards. As a result, the Methods cleanly flows directly from the mathematical description of Model-Based Intrinsic Motivation to the description of 3M-Progress.
B4. We agree that the notation Var({omega}) for Disagreement was unclear because each omega is a Gaussian density; we are actually taking the variance across the ensemble of predictions. Since our implementation uses the mean of each Gaussian as a point estimate of the prediction, we now write the variance over these point estimates explicitly.
B5. We agree that the paragraph at 210 was vague, and unclear whether it was part of the 3M-Progress definition. The intention of this paragraph was to provide a formal justification for how choosing the pretraining and evaluation environment determine which behavioral partitions the agent discovers. However, it wasn't well connected to the existing text. We have removed this paragraph from the main description of 3M-Progress. Instead, we have rewritten it as a formal proof and clearly motivate it in another section where we explain the intuitions behind 3M-Progress reward dynamics (we describe this in more detail in our response to D2).
B6. We have added equation numbers to the intrinsic reward, and reformatted the description so that each element (KL divergence, exponential filter, and intrinsic reward definition) appears together at once. We agree that using the extra symbol f was confusing. This choice was originally made to emphasize that different activation functions can be used with 3M-Progress to elicit different behaviors. However, this could be more clearly motivated in the text since we did not provide a description of how such activation functions could be used. To fix this, we have changed both Figure 2A and the description of 3M-Progress to use only the absolute value function to avoid confusion. In addition, we added a clear description of how each component (including the activation function) affects the learned behavior---we explain this change in more detail in our response to D2.
C. We agree with the reviewer that the definition of the metric used for model-brain and inter-animal alignment was not defined in the paper. We added a complete description to Appendix E.
C1. The Pearson correlation is not computed from single trial data, but from trial averaged responses between subjects within the same behavioral block as identified and labeled by Mu et al. 2019. So, high correlation over time for two units is common in this case, since the units taken between subjects are undergoing the same behavior. We added a full derivation of the one-to-one correlation metric to Appendix E---more on this in C2.
Regarding the saturation of the chosen metric, we explain why this occurs and provide alternative evaluation criteria in our "Response to concern #4" within our rebuttal for Reviewer M8K4. To avoid repetition, we kindly refer the reviewer there for complete details.
C2. The mathematical definition of the one-to-one alignment has been added to Appendix E and referenced in the relevant sections. The mapping is not bijective, as it does not assume an equal number of neurons between the model and the brain. Rather, the alignment maps each neuron in the animal brain to any unit in the model, and this mapping is restricted to any individual model unit in the 100th percentile of correlation. That is, unlike typical model-brain alignment metrics that rely on linear regression between a set of model units in order to predict a single neuron, our metric uses a single unit.
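To make the metric concrete, here is a simplified sketch of the one-to-one mapping described above: each recorded neuron (a trial-averaged trace) is assigned the single most-correlated model unit, and the scores are aggregated across neurons. Noise correction and any cross-validation from the cited procedure are omitted, and all names are illustrative:

```python
import numpy as np

def one_to_one_alignment(brain, model):
    """brain: (n_neurons, T) trial-averaged traces; model: (n_units, T) unit activations."""
    bz = (brain - brain.mean(1, keepdims=True)) / brain.std(1, keepdims=True)
    mz = (model - model.mean(1, keepdims=True)) / model.std(1, keepdims=True)
    corr = bz @ mz.T / brain.shape[1]   # (n_neurons, n_units) Pearson correlations
    best = corr.max(axis=1)             # single best-matching model unit per neuron
    return np.median(best)

rng = np.random.default_rng(0)
brain, model = rng.standard_normal((20, 100)), rng.standard_normal((50, 100))
print(one_to_one_alignment(brain, model))
```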
D. We agree with the reviewer that some of the achievements, especially the NeuroAI Turing Test, could be better defined in the text. In addition to adding a precise definition (Appendix F) and explaining how our model meets that definition in this context, we have also rewritten the paper with less grandiose language overall in order to make the contributions and results more clear.
D1. We have added the definition of the NeuroAI Turing Test to the results section. Passing this test means that according to a chosen metric and a given task, the behavioral and neural alignment between the model and animal is close to, or the same as, the behavioral and neural alignment between two individual animals.
D2. We agree that the elements of 3M-Progress could be better explained to understand how each component contributes to the model-brain and model-behavior alignment. However, one can view some of the baseline algorithms as ablations themselves. For example, learning progress can be viewed as 3M-Progress without the pretrained memory and symmetric activation function. Its failure to capture the behavioral switching indicates that learning both memories online is not sufficient. ICM can be viewed as 3M-Progress without the second memory, since it uses predictions from one world model alone.
We agree that a 3M-Progress-specific ablation would still be helpful. We have conducted additional experiments to illustrate how each component contributes to matching the behavior, and thus to explaining the neural-glial data. For example, raising the progress horizon leads to longer, sustained activity in each behavioral transition. Changing the activation function to something asymmetric, like ReLU, leads to one-way transitions towards model-memory agreement. We add these ablations and discuss their consequences in the Appendix.
E. We have added the recommended references to the related work section.
Response to Questions
Q1: I wonder if a simpler model would also achieve that (like oscillating between maximizing action and fatigue, without modeling curiosity or proprioception)?
One seemingly plausible explanation for the behavior is that the zebrafish cycles between a positional homeostatic drive to move forward and fatigue induced by an action cost. However, Mu et al. 2019 explicitly ruled out this hypothesis with extensive experiments, and found that the futility-induced passivity behavior was not driven by fatigue nor positional homeostasis. Rather, both the behavioral and neural evidence support an error-tracking mechanism implemented by neural-glial circuits.
This provides the basis for curiosity-driven intrinsic motivation, since error-tracking in a novel environment implicitly assumes the presence of a predictive model. Given this basis, our paper focuses on how to use a predictive model for intrinsic motivation, and show that existing curiosity-driven approaches from reinforcement learning do not capture this specific case of autonomous animal behavior.
Further, model-free methods were consistently lower in whole-brain alignment than model-based methods, supporting our hypothesis based on the data in Mu et al. 2019 that this behavior is driven by world models. In particular, a PID controller (Appendix C), designed explicitly to integrate feedback error in a model-free way and switch between two control states (active and passive), is among the worst performing models on whole-brain data (Figure 3B).
Thus, although a simpler model could achieve the behavior, it would be inconsistent with the whole-brain data and neural-glial circuit proposed in Mu et al. 2019.
Q2: How is the model-brain alignment correlation metric defined mathematically?
We adopt the model-brain alignment metric defined in [1], with details provided on page 18, "Noise-corrected neural predictivity".
[1] Nayebi, Aran, et al. "Mouse visual cortex as a limited resource system that self-learns an ecologically-general representation." PLOS Computational Biology (2023)
Thank you for responding to my comments and editing the manuscript accordingly. With better writing and better descriptions of the empirical contributions, it becomes a very good paper.
This paper introduces a novel intrinsic reward based on modeling the divergence between an ethological prior and a current world model. This intrinsic reward model is applied to studying larval zebrafish, specifically their different behavioral patterns and their whole-brain neural-glial dynamics. Using this model, the paper demonstrates far greater alignment with zebrafish behavioral data compared to other intrinsic reward models, as well as outperforming them in predicting neural imaging data.
Strengths and Weaknesses
Strengths:
- Overall, this paper does an excellent job in clearly establishing the value of the 3M-Progress method, implementing a suite of baselines for each experiment and reporting error bars showing a clear improvement in comparison to those baselines. There is little to fault with regards to the quality, significance, and originality of this paper.
Weaknesses:
- The placement of baselines in L167-182 is confusing, as I initially had the impression that these were a component of the actual model being introduced. This content could all be moved to the supplementary material.
- More time could be spent elaborating on the 3M-Progress reward model, specifically regarding the intuition behind the scenarios in which the reward might be especially high or low.
- A couple of minor typos: “sizing relaitve” in L134, “ecnoded” in the Fig. 2 caption.
Questions
- What kinds of behavior does this reward elicit in other environments of the DeepMind Control Suite?
- What is the behavior of the zebrafish agent in moments of high/low reward?
- Are there differing computational constraints for the different intrinsic reward models? If so, what implications does that have for the biological plausibility of the model?
Limitations
Yes
Final Justification
The authors have clearly answered my questions in their rebuttal, and agreed to reorganize the paper in a way that addresses my concerns about clarity.
Formatting Issues
N/A
Overview
We thank the reviewer for their time and their positive position on our paper. We address the concerns and questions presented in the review below.
Response to Specific Concerns
- Organization: We agree that the organization of the baselines in L167-182 was confusing and not clearly explicated. We have moved this content to the supplementary material, where we explain these algorithms in more detail.
- Clarity: We agree that adding more intuition on the 3M-Progress reward would be helpful to the reader. Following the initial motivation and explanation in the 3M-Progress section, we have added a section describing how the reward function changes as the agent pursues different behaviors.
- Typos: We thank the reviewer for their careful attention and have fixed the typos.
Response to Questions
Q1: What kinds of behavior does this reward elicit in other environments of the DeepMind Control Suite?
Because 3M-Progress is designed for continual exploration in environments with dynamics that structurally differ from the pretraining environment, the standard DeepMind Control Suite is not a suitable benchmark since it was designed for policies to learn an environment-specific control task, where training and testing are always in-distribution. However, it is not uncommon to introduce domain shifts or task variation if policy generalization is the central question (e.g., the DMControl Generalization Benchmark), and we agree that this would be an interesting line of future work.
The focus of our paper, however, is to show that existing intrinsic motivation algorithms do not generalize to autonomous animal behavior, and we propose 3M-Progress as a working solution. Rather than testing 3M-Progress on a suite of RL benchmarks, we are testing existing algorithms on an animal autonomy benchmark.
Although 3M-Progress can be benchmarked in standard RL domains, it is beyond the scope of the paper since we are not introducing 3M-Progress to compete with existing algorithms in standard RL domains. To reiterate, we are flipping this view by testing curiosity-driven algorithms in real-life animal environments for which we have behavioral and neural data to validate against.
In order to concisely illustrate how 3M-Progress can be used in other contexts, we devote some time in the methods section to explaining a generalization of the algorithm to new behaviors and other environments. Specifically, we show how to generalize 3M-Progress from a single pretraining environment to multiple pretraining environments, across a variety of sampling procedures to choose behaviors in the evaluation environment.
Q2: What is the behavior of the zebrafish agent in moments of high/low reward?
In moments of high reward, the agent is pursuing a behavior that is aligned with its prior memory. In moments of low reward, the agent is pursuing the same behavior over an extended amount of time. This is best illustrated by the rightmost panel of Figure 2B. The agent first computes the KL divergence between its current predictive model and its pretrained one. This divergence will be high when the agent chooses actions that produce transition dynamics that differ from those of the pretraining environment---we call this memory disagreement. Initially, the moving average baseline is zero, and so sustaining a behavior with high KL divergence will be rewarded. Eventually, the moving baseline will catch up according to the timescale determined by the progress horizon. As it does so, the reward for this behavior decays. The agent is thus incentivized to pursue a behavior with low KL divergence in order to deflect away from the current baseline and receive a positive reward. This corresponds to pursuing transition dynamics that are as close as possible to those of the pretraining environment---we call this memory agreement. Applying the absolute value as an activation function creates a cyclic transition behavior.
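A minimal sketch of these reward dynamics, under the assumption that the intrinsic reward is the absolute deflection of the current model-memory mismatch (the KL divergence) from its exponential moving average; constants and names are illustrative, not the paper's code:

```python
import numpy as np

def three_m_progress_reward(kl_trace, horizon=20):
    """kl_trace: per-step KL divergence between the online and prior world models."""
    alpha = 1.0 / horizon                    # the progress horizon sets the baseline's timescale
    baseline, rewards = 0.0, []
    for kl in kl_trace:
        rewards.append(abs(kl - baseline))   # reward deflection in either direction
        baseline += alpha * (kl - baseline)  # leaky-integrator / moving-average baseline
    return np.array(rewards)

# A sustained high-mismatch behavior is rewarded at first; once the baseline
# catches up, switching to a low-mismatch behavior becomes rewarding again.
kl = np.concatenate([np.full(60, 2.0), np.full(60, 0.1)])
print(three_m_progress_reward(kl)[[0, 59, 60]])
```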
We completely agree that more time should be spent explaining these properties in detail, and thank the reviewer for pointing this out. As mentioned before, we have added an additional section for this in the paper. This includes the description of 3M-Progress mentioned in response to Q1.
Q3: Are there differing computational constraints for the different intrinsic reward models? If so, what implications does that have for the biological plausibility of the model?
3M-Progress imposes an additional computational constraint by requiring a pretrained world model memory. Whereas the other intrinsic reward models learn everything online, 3M-Progress uses a two-stage approach. We argue this feature makes our method more biologically plausible than a completely online method, since animals come equipped with inductive biases when they solve new decision-making tasks or encounter new environments, rather than starting from a tabula rasa network. These biases are both determined by genomic information and learned from online experience as the animal develops locomotor maturity.
Learning progress assumes an agent implements long-term memory by decaying a world model learned from online experience, and computes intrinsic reward by comparing predictions between the long-term memory and the current model to track temporal history. This is biologically plausible in the sense that animals are free to create online memories that decay in time.
The Intrinsic Curiosity Module (ICM) computes intrinsic reward as the prediction error of a world model learned in a feature space optimized for inverse dynamics prediction. This design choice is biologically plausible, as there is evidence of inverse dynamics features in animal neural data [1].
Random network distillation trains a neural network to predict random features, and this prediction-error is used as the intrinsic reward. This is biologically plausible, since animals may encode sensory input with random projections [2].
Disagreement maintains an ensemble of world models, and the intrinsic reward is computed as the variance in their predictions. This is not biologically plausible, since it is intractable for animals to create and learn many memories simultaneously for a single task (e.g., maintaining such ensembles is already intractable for RL systems [3, 4]).
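For reference, a minimal sketch of a Disagreement-style reward of the kind discussed above: an ensemble of forward models each predicts the next feature vector, and the intrinsic reward is the variance across their point-estimate predictions (network sizes and the ensemble count are illustrative only):

```python
import torch
import torch.nn as nn

ensemble = nn.ModuleList([nn.Linear(10 + 2, 10) for _ in range(5)])  # 5 toy forward models

def disagreement_reward(feat, action):
    # Each ensemble member predicts the next feature vector from (feature, action).
    preds = torch.stack([m(torch.cat([feat, action], dim=-1)) for m in ensemble])
    # Variance across members, averaged over feature dimensions, per batch element.
    return preds.var(dim=0).mean(dim=-1)

print(disagreement_reward(torch.randn(4, 10), torch.randn(4, 2)).shape)  # torch.Size([4])
```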
[1] Aldarondo, Diego, et al. "A virtual rodent predicts the structure of neural activity across behaviours." Nature (2024)
[2] Ganguli, Surya, and Haim Sompolinsky. "Compressed sensing, sparsity, and dimensionality in neuronal information processing and data analysis." Annual review of neuroscience (2012)
[3] Kim, Kuno, et al. "Active world model learning with progress curiosity." International conference on machine learning. PMLR, 2020.
[4] Doyle, Chris, et al. "Developmental curiosity and social interaction in virtual agents." arXiv preprint (2023).
Thank you very much for responding in depth to my questions and agreeing to make the necessary edits to the paper.
The paper proposes 3M-Progress, a model-based intrinsic motivation reward that allows a virtual zebrafish to exhibit autonomous behaviors matching behavioral and neural-glial dynamics recorded in autonomously behaving larval zebrafish. A 6-link virtual zebrafish embedded in a customized fluid-dynamics MuJoCo environment was used in the experiments. Agents trained with 3M-Progress are compared to baselines implementing other intrinsic reward signals: ICM, RND, Disagreement, and learning progress, as well as a model-free entropy-based exploration term and a random policy. The results showed 3M-Progress was significantly better at matching behavioral and neural-glial dynamics from real zebrafish recordings in both active and passive transitions.
Strengths and Weaknesses
Strengths:
- This work is a timely cross-disciplinary contribution, as it is one of the first works to bring goal-driven embodied agent settings to modeling neuron–astrocyte interaction that is quantitatively validated against whole-brain recordings, which significantly improves ethological validity over prior work.
- 3M is a clever and novel formulation of intrinsic drive in animals, as the memory-based divergence elegantly avoids the “white-noise” issues of prediction-error signals and yields stable, repeatable behavioral cycles.
- The empirical evaluation is very solid with multi-level metrics, and the results are statistically significant with non-trivial gains over other methods.
Weaknesses:
- The current environments are constrained to only allow passive transitions and fluid movement to simulate active transition observations. This limits the ecological realism, but I do understand the effort of creating such environment and agent is non-trivial.
Questions
Is there a reason the prior world model is only trained in a separate phase from the online world model? What would the effect of using a moving average be here?
Limitations
yes
Final Justification
The author has addressed my concerns and answered my question. I maintain my initial assessment that it's a good paper with insights that would be beneficial to share with the community.
Formatting Issues
N/A
Overview
We thank the reviewer for their positive position on our paper. Below, we clarify and questions and concerns presented in the review.
Response to Specific Concerns
We agree that the ecological realism of the evaluation environment can be improved (though we note that this differs from the pretraining environment). The evaluation environment was chosen to match the parameters of zebrafish experiments conducted in Mu et al. (2019), including the head-fixed constraint on movement as well as the striped grating constraint on visual input. This allows us to make meaningful behavioral and neural comparisons between agents trained in simulation with zebrafish acting in vivo. For the pretraining environment, however, agents were trained in a naturalistic environment with ecological visual inputs (mossy/grassy floors) and fluid forces.
Future work will include extending the 3M-Progress algorithm to other behaviors, both in zebrafish and in other animals. The core bottleneck to this procedure, however, is the existence of open source neural and behavioral data for autonomous behavior in animals in order to validate a model of animal autonomy. To the best of our knowledge, the dataset from Mu et al. (2019) is the first of its kind.
Response to Questions
Q1: Is there a reason the prior world model is only trained in a separate phase from the online world model?
Yes, the pretraining stage to obtain the prior world model is the crux of how 3M-Progress differs from existing intrinsic motivation algorithms. Traditional algorithms (like learning progress) learn both the old and new world models online in the evaluation environment, and use the difference between the old and new models as a learning signal. In contrast, we propose that the prior model be trained in an environment with entirely different transition dynamics than the evaluation environment. Motivated by the idea that animals are endowed with strong priors on how environment dynamics unfold in ethological circumstances, this model provides a fixed prior on the transition dynamics the agent expects to encounter. This is achieved by training in a separate phase characterized by a separate environment. In the evaluation environment, the prior model is frozen and compared with the online model, and this residual forms the learning signal. This signal tells the agent how different the dynamics in the new environment are from the old environment, and it can be used to motivate the agent to seek out behaviors that align the dynamics between the two.
Q2: What would the effect of using a moving average be here?
Both 3M-Progress and learning progress employ exponential moving averages, but on different variables. Learning progress obtains its long-term memory by applying an exponential filter over the current world model parameters. The effect of this is a signal that loosely corresponds to the rate of change of the prediction error and avoids perseveration on dynamics with constant prediction error (e.g., white noise). On the other hand, we fix the prior model while the online model is learned (as described above), and apply the exponential moving average to the model-memory mismatch (the KL divergence). This results in a signal that tracks how the online predictions compare to the predictions from a pretrained model and avoids perseveration on dynamics with constant KL divergence (e.g., white noise).
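A tiny sketch of this contrast, with hypothetical names: learning progress applies the exponential filter to the world-model parameters, whereas 3M-Progress keeps the prior frozen and applies the filter to the mismatch signal itself:

```python
def learning_progress_memory(theta_old, theta, beta=0.99):
    """Learning progress: the long-term memory is an EMA of the online model weights."""
    return {k: beta * theta_old[k] + (1 - beta) * theta[k] for k in theta}

def three_m_progress_baseline(baseline, kl, alpha=0.05):
    """3M-Progress: the prior model is frozen; the EMA runs over the KL mismatch."""
    return baseline + alpha * (kl - baseline)

print(learning_progress_memory({"w": 0.0}, {"w": 1.0}), three_m_progress_baseline(0.0, 2.0))
```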
Thank you for your thorough responses. My questions are addressed and I don't have any more follow up. I maintain my initial assessment that it's a good paper with insights that would be beneficial to share with the community.
This paper introduces a novel intrinsic reward mechanism based on modeling the divergence between an ethological prior and a current world model. The proposed approach is applied to the study of larval zebrafish, specifically examining their diverse behavioral patterns and whole-brain neural–glial dynamics. The research addresses an interesting and relevant question concerning how intrinsic motivation influences decision-making. Most of the reviewers’ comments have been adequately addressed by the authors, making this paper suitable for publication.