PaperHub
7.0/10
Poster · 3 reviewers (lowest 4, highest 5, std 0.5)
Scores: 5, 4, 4
Confidence: 3.7
Novelty: 2.3 · Quality: 3.0 · Clarity: 3.3 · Significance: 2.3
NeurIPS 2025

Flexible inference for animal learning rules using neural networks

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Keywords
Computational neuroscience, learning rule inference, decision-making, de novo learning, neural networks, animal behavior

Reviews and Discussion

Review
5

In this paper, the authors propose methods for inferring learning rules underlying behavior in de novo learning tasks. In particular, they use neural networks (both feedforward and recurrent) to parameterize changes to input weights over time. They apply their method to simulated data as well as the IBL dataset, and demonstrate that recurrent architectures succeed in inferring non-Markovian learning rules, and also outperform methods like REINFORCE.

Strengths and Weaknesses

Strengths:

  • The paper is well-written and the methods are clearly explained and intuitive.
  • The proposed methods reveal interesting learning rule phenomenology (e.g. non-Markovian, asymmetric, suboptimal negative baseline) from real data that is not captured by existing methods like REINFORCE.

Weaknesses:

  • As far as I can tell, across both the simulated and real data experiments, only tasks with 1D stimuli are used. This is understandable in the context of the IBL dataset, but it is unclear whether these methods do well beyond this simplified setting, given possible concerns about sample complexity (as illustrated in Fig. 2c).
  • The choice of modeling the animal's policy as a GLM seems prohibitively restrictive in general, although it does aid interpretability of learning rules, and it does seem adequate for the IBL dataset based on Table 1. Similar to the previous point, I would imagine this modeling choice would be problematic for any task with multiple stimuli that, e.g., are selectively attended to in a context-dependent manner? If so, this concern limits the applicability of the proposed method.

Questions

  • Regarding Fig. 3, the RNN GLM appears to miss the weak dependence of $\Delta W_{\text{stim}}$ on the stimulus $s$, something that is attributed to finite data effects. Why should finite data be a noticeable problem for this task, considering that it uses simulated data?

Limitations

Yes

Final Justification

The additional experiments have clarified some of my concerns about the scope of the proposed method. As such, I am increasing my score to 5.

Formatting Issues

N/A

Author Response

We sincerely thank the reviewer for their thoughtful and constructive feedback. We appreciate the recognition of our manuscript’s clarity and the interesting learning rule phenomenology uncovered. Your suggestions (particularly regarding scope and generalization) helped us expand the manuscript with new analyses, clarify our framing and contributions, and better highlight the broader applicability of our approach. Below, we address your points in detail and summarize the corresponding improvements to the paper.

Extension to higher-D input and better clarification of scope & significance. We thank the reviewer for highlighting the importance of extending our approach to higher-dimensional and more complex learning settings. To evaluate generalization to higher-D inputs, we increased the input dimensionality by adding additional distractor input streams (results summarized in Table C below). We also tested a 2D version of the Poisson clicks task with context modulation (see the next section). We acknowledge that extending to higher-dimensional inputs increases sample complexity, as expected from the curse of dimensionality. Nonetheless, our experiments show that even with higher-D inputs, key behavioral features are recoverable (Table C)—and with more data, recovery quality improves substantially. We will include these results (Table C and fitted-function plots) and this discussion in the updated manuscript, substantially strengthening it by demonstrating robustness beyond the 1D setting.

Additional Input Dim | Default (Relative RMSE) | Features Captured | 3× Samples (Relative RMSE) | Features Captured
1                    | 1.00                    | ✓✓✓               | 0.59                       | ✓✓✓
3                    | 1.09                    | ✓✓✓               | 0.62                       | ✓✓✓
10                   | 1.35                    | ✓✓✓               | 0.75                       | ✓✓✓
30                   | 1.58                    | ✓✓✓               | 1.20                       | ✓✓✓
100                  | –                       | –                 | 1.67                       | ✓✓✓

Table C: Recovery performance of the inferred learning rule across increasing input dimensionality. The additional input streams follow the same distribution as the original input. Checkmarks (✓) indicate the model successfully captured key qualitative features from Fig. 2: (1) slowing at high weights, (2) amplitude-dependent learning, and (3) choice asymmetry; – indicates failure to capture these features. Relative RMSE is normalized to the baseline (1D input, default sample size).
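For concreteness, the normalization can be written as follows (a restatement of the caption above, under the assumption that RMSE is computed between the inferred and ground-truth weight updates):

```latex
\mathrm{RMSE}_{\mathrm{rel}}(\text{setting}) =
\frac{\mathrm{RMSE}\!\left(\Delta W^{\text{inferred}}_{\text{setting}},\; \Delta W^{\text{true}}\right)}
     {\mathrm{RMSE}\!\left(\Delta W^{\text{inferred}}_{\text{1D, default}},\; \Delta W^{\text{true}}\right)}
```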

To better clarify the scope of our original submission: as the reviewer also acknowledged, we focused on simplified 1D tasks because they mirror the IBL datasets and many laboratory neuroscience settings where animals are introduced to novel tasks and must learn from scratch. This de novo learning scenario—where prior structure or instruction is minimal—is both conceptually significant and common in experimental contexts (see also discussion with CMVq under “Novelty”).

Even this simple task setting involves several key hallmarks of biological learning—e.g., history dependence, biases, and suboptimal strategies—and provides a tractable platform for reverse-engineering learning dynamics. As in prior foundational work (e.g., XOR, Poisson clicks [Brunton et al., Science, 2013]), this ‘start simple, validate, then generalize’ strategy allows us to isolate core principles before scaling up. Our new results (Table C) demonstrate that this approach can extend to higher-D inputs, indicating its broader applicability and motivating future work. We will add a paragraph reflecting these discussion points to the manuscript.

GLM too restrictive. We thank the reviewer for this insightful point, which prompted us to run new simulations and better delineate the scope and limits of our current framework. While this choice of GLM aids interpretability and aligns with common practice in neuroscience, we acknowledge that it limits expressivity and will highlight this limitation more explicitly in the revised manuscript.

To probe this concern, we tested a more complex 2D version of the Poisson clicks task with context-dependent decisions, where the relevant stimulus dimension is modulated multiplicatively by a context variable (Table D below). This task introduces a nonlinear decision mapping that violates GLM assumptions. To better understand the limits of our method, we first evaluated the idealized case where the true decision mapping is known to the inference framework—a setting not assumed to reflect real-world use but helpful for benchmarking inference quality. As expected, our method accurately recovers the learning rule in this case. Under model mismatch—where the inference framework assumes a standard GLM instead—performance degrades, but the model can still capture key qualitative features of the learning rule. Both the robustness and limitations of our current framework are illustrated by these results and the clarified scope discussion, which will be incorporated into the revised manuscript.

                     | Relative RMSE | Features Captured
Decision model known | 1.0           | ✓✓✓
Model mismatch       | 1.62          | ✓✓✓

Table D: Recovery of key qualitative features (from Fig. 2) in a context-modulated Poisson clicks task. The first row corresponds to the case where the inference framework knows the true decision structure, $\mathrm{logit}(P(\text{right})) = w_L \cdot c \cdot x_L + w_R \cdot (1 - c) \cdot x_R$, with the rightward cumulative cue, a context variable $c$ between 0 and 1, and the leftward cumulative cue between 0 and −1. The second row represents model mismatch, where the inference framework still assumes a standard GLM, $\mathrm{logit}(P(\text{right})) = \hat{w}_L \cdot x_L + \hat{w}_R \cdot x_R + \hat{w}_c \cdot c$. ✓ = feature captured; RMSE relative to the known-model case.
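As a concrete illustration of the mismatch summarized in Table D, a minimal sketch of the two decision mappings is given below (our illustration, not the authors' simulation code; the weight values and example trial are hypothetical).

```python
# Illustration of the two decision mappings compared in Table D (hypothetical values).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_right_true(w_L, w_R, x_L, x_R, c):
    # True context-modulated mapping: logit = w_L*c*x_L + w_R*(1-c)*x_R
    return sigmoid(w_L * c * x_L + w_R * (1.0 - c) * x_R)

def p_right_glm(w_hat_L, w_hat_R, w_hat_c, x_L, x_R, c):
    # Mismatched GLM assumed at inference time: logit = wL_hat*x_L + wR_hat*x_R + wc_hat*c
    return sigmoid(w_hat_L * x_L + w_hat_R * x_R + w_hat_c * c)

# Example trial: rightward cue in [0, 1], leftward cue in [-1, 0], context c in [0, 1].
x_R, x_L, c = 0.6, -0.3, 0.8
print(p_right_true(2.0, 2.0, x_L, x_R, c))      # context gates which cue stream drives the choice
print(p_right_glm(1.0, 1.0, 0.5, x_L, x_R, c))  # an additive GLM cannot express the multiplicative gating
```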

About Figure 3. We thank the reviewer for the great question. While the black curve (conditioned on past several consecutive correct trials) clearly shows stimulus dependency, the red curve (conditioned on past several consecutive incorrect trials) shows weaker effects. This is primarily due to finite data, which applies even in our simulations: our simulated animals learn relatively quickly, resulting in few instances of consecutive errors. This imbalance makes it harder for the model to learn the error-conditioned cases due to sparser coverage. To test this explanation, we simulated with animals starting at worse performance, thereby increasing the number of consecutive incorrect trials. As expected, the model then recovered the dependency more reliably. This observation also connects to the Fisher Information analysis in Appendix A, which highlights how various experimental variables (e.g. initial condition) can affect inference, and that knowledge could help to inform experimental design. We will include these new plots and a paragraph reflecting this discussion in the revised manuscript.

Broader Framing and Impact. We appreciate the reviewer’s comments, which prompted us to reflect more deeply on the broader vision and significance of this work.

Our study is especially timely for the NeurIPS community, which is increasingly engaging with problems at the intersection of deep learning and neuroscience. As the literature on inferring biological learning rules continues to grow (Related Works), a central and still underexplored challenge is to model de novo learning settings, where prior structure is minimal and animals must learn from scratch. These settings demand models that are both flexible enough to capture complex, history-dependent behavior and interpretable enough to yield scientific insights.

Our framework directly addresses this need. By fixing the decision model using standard, well-supported forms, we isolate the inference of trial-by-trial weight updates—an underexplored but foundational problem. This modular design mirrors a common strategy in both neuroscience and machine learning, where complex systems are understood by studying components separately before tackling joint inference (e.g., [46]). Our results show that even within this simplified setup, rich behavioral features can be recovered and generalized to more complex settings (e.g., higher-dimensional inputs and context-dependent tasks), as we now demonstrate in our revised manuscript.

This framework thus provides a new lens on suboptimal, history-dependent learning in animals and lays the groundwork for future extensions—such as jointly inferring both learning rules and policy mappings—once additional data or stronger inductive biases are available. Motivated by the reviewer’s feedback, we have added new experiments and clarified these broader implications in the manuscript to better articulate how this work serves as a stepping stone toward modeling more complex and ecologically valid settings of biological learning. We will revise the manuscript to more clearly highlight this broader framing.

Comment

I thank the authors for their detailed reply. The additional experiments have clarified some of my concerns about the scope of the proposed method. As such, I am increasing my score to 5.

Comment

Thank you again for your insightful and constructive feedback!

Review
4

This paper proposes a method to meta-learn algorithmic-level learning rules from binary decision-making data during learning. The authors compare two learning rule parameterizations. Both learning rules output updates to the weights of a dynamic Bernoulli GLM (the policy). The learning rules’ inputs are the current stimulus, task covariates, the current policy weights, and reward. They compare a feedforward neural network (DNNGLM), which is Markovian in the sense that only the current trial’s information is used to compute the changes to the policy, and a recurrent network (RNNGLM), which is non-Markovian, as it can keep previous-trial information in its hidden neuron dynamics.
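For readers unfamiliar with the setup, a minimal sketch of the two parameterizations summarized above might look like the following (our illustration in PyTorch, not the authors' released code; class names, hidden sizes, and the exact inputs packed into `trial_info` are assumptions).

```python
# Sketch: two parameterizations of a per-trial weight-update rule for a Bernoulli GLM policy.
import torch
import torch.nn as nn

class DNNGLM(nn.Module):
    """Markovian rule: the weight update depends only on the current trial."""
    def __init__(self, n_weights, n_inputs, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, hidden), nn.Tanh(), nn.Linear(hidden, n_weights)
        )

    def forward(self, trial_info, h=None):
        return self.net(trial_info), None        # delta-w, no carried state

class RNNGLM(nn.Module):
    """Non-Markovian rule: a GRU hidden state carries information across trials."""
    def __init__(self, n_weights, n_inputs, hidden=32):
        super().__init__()
        self.gru = nn.GRUCell(n_inputs, hidden)
        self.readout = nn.Linear(hidden, n_weights)

    def forward(self, trial_info, h=None):
        h = self.gru(trial_info, h)
        return self.readout(h), h                # delta-w, updated hidden state

def rollout(rule, stimuli, rewards, w0):
    """Roll the Bernoulli GLM policy forward, updating its weights with the rule."""
    w, h, p_right = w0.clone(), None, []
    for s, r in zip(stimuli, rewards):
        p_right.append(torch.sigmoid((w * s).sum()))  # GLM policy: logit = w . s
        trial_info = torch.cat([s, w, r.view(1)])     # stimulus, current weights, reward
        dw, h = rule(trial_info, h)
        w = w + dw                                    # per-trial weight update
    return torch.stack(p_right)
```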

They first verify the ability of their method to recover a ground-truth learning rule on an in silico dataset, before inferring learning rules on an IBL dataset. On this dataset, they find that RNNGLM can fit the data better than DNNGLM, thus providing evidence for non-Markovian learning in rodents. They reverse engineer the successful learning rules, which display interesting features such as negative baseline and side asymmetries.

Strengths and Weaknesses

Strengths:

  • The paper is clearly written and presented, with a thorough literature review.
  • The methods and most of the results appear sound and trustworthy to me.
  • The paper goes beyond proposing a novel method and testing it on toy tasks by considering a real dataset and making actual predictions for learning strategies in mice. Note that I don’t feel knowledgeable enough to assess the novelty/impact of these results on interpreting the inferred learning rules (non-Markovian, side asymmetry and negative baseline).

Weaknesses:

  • Novelty (from the method point of view): This work positions itself in a crowded subfield of meta-learning learning rules that has been very active in the past 5+ years, and I am slightly hesitant to label the novelties of the method highlighted by the authors as properly novel rather than variations on previous methods (see question 1 below).
  • Degeneracy: I would expect many different learning rules to fit the binary decision data well, i.e. your model should be quite under-constrained by the experimental data at hand, and I would thus expect many different learning rules to perform well (see question 2 below).

Questions

1/ Novelty: The authors insist on the “de novo” learning aspect, but this seems to me more of a dataset choice than a limitation of previous methods. Could you elaborate?

2/ Degeneracy: how robust are the conclusions made in this study, specifically for the IBL dataset? I saw in the supplementary material what appeared to be a thorough robustness analysis for the simulated-data case, but not for the IBL data case (though maybe I’m mistaken). What are the error bars in Fig. 4B? If the robustness analysis I’m asking for is already present in the paper, I am a bit surprised by how similar all the recovered learning rules look; are there differences between them?

3/ Is the comparison of DNNGLM to RNNGLM a fair one? Doesn’t the RNN have more parameters (the hidden state), and isn’t it a generalization of the DNNGLM, such that it would be expected to fit the data at least as well as the DNNGLM? Is this point covered by the DNNGLM with the history of previous trials?

4/ L78: on the disadvantages of synaptic level learning rules: these types of learning rules can and have been fit to behavioral data in the references already cited in this manuscript, they are not restricted to neural data. I think that besides having similar names (learning rules and plasticity/learning rules), these rules have little in common, and there is no need for the authors to further differentiate themselves from them.

5/ L265: “recovering mechanistic insights”. I don’t understand what mechanisms are being uncovered in this study; the learning rules considered operate at an abstract level: the weights are policy weights, not actual synapses.

Limitations

Yes

Final Justification

The authors have made meaningful improvements in response to my (and other reviewers’) comments, adding clarifications and analyses that undeniably strengthen the paper. However, while the method is novel, technically sound, and clearly presented, I still have reservations about this paper's potential impact and distinctiveness in an already crowded subfield. I therefore maintain my score of 4 and recommend borderline acceptance for NeurIPS.

Formatting Issues

None

Author Response

We sincerely thank the reviewer for their thoughtful and constructive feedback. We appreciate the recognition of our work’s clarity, soundness, and relevance to understanding learning strategies from real data. Your comments helped us substantially improve the manuscript by clarifying our contributions and novelty, as well as adding new analyses to address concerns about degeneracy and fairness. Below, we respond point-by-point to describe these improvements and how your suggestions strengthened the paper.

Novelty. We thank the reviewer for this important comment. In particular, we appreciate the opportunity to clarify why the de novo framing represents more than a dataset choice — it reflects a fundamentally different modeling challenge. Most prior meta-learning work focuses on training agents to learn well across tasks, often assuming normative objectives (e.g., reward maximization). These approaches are designed to improve task performance, not to recover the actual learning rules employed by an agent (e.g., an animal).

In contrast, our work addresses a fundamentally different challenge: inferring the learning rule that a biological agent used to learn from scratch. This de novo setting presents several new challenges. First, animals often behave in "suboptimal" ways that would not appear to increase their expected reward (Akrami et al., Nature, 2018; Kastner et al., Cell reports, 2022), which violates the normative assumption. Without this assumption, the solution space becomes far less constrained—animals may exhibit complex and surprising learning dynamics, including history-dependence, biases, or inconsistent strategies. Capturing such richness requires flexible models that do not rely on strong structural priors. Second, external stimuli (not just reward histories) must be integrated into the update, which introduces complexity and goes beyond classic multi-armed bandit models. Third, many existing approaches assume fixed feature representations, static policy forms, or hand-designed learning rules that do not adapt during training — limiting their flexibility. These challenges limit the applicability of previous methods, which achieve lower performance compared to ours (Table 1 in the manuscript).

We will revise Related Work and Discussion to better highlight this distinction and clarify the methodological and conceptual novelty of our contribution.

Degeneracy. We agree that inferring flexible learning rules from behavior poses fundamental challenges around degeneracy—an inherent issue in this domain, also encountered in related work such as [58]. Indeed, running our models with different random seeds yields slight variations in recovered rules, despite similar train and test log-likelihoods. To quantify this, we repeated all inference runs across multiple seeds and added shaded regions denoting standard deviation to the learning rule plots. While some variation exists, the key interpretable features—such as side bias, negative baseline, and reward history dependencies—are consistent across seeds. We will include these plots in the updated manuscript.

Importantly, these features are not merely visual artifacts but functionally meaningful: eliminating the negative baseline via a non-negativity constraint significantly reduced performance (Table 1 in the manuscript), and so did truncating history (Table B in response to Reviewer DYot). Additionally, our method recovers ground-truth rules in simulation, further demonstrating its ability to identify meaningful structure despite under-constrained settings.

We initially omitted shaded variability regions to highlight overall trends, which appeared stable across seeds. However, we appreciate the reviewer’s point and will revise all plots to reflect variability—further underscoring that the key trends are robust and exceed the variation. We will also expand the limitations section to explicitly discuss degeneracy and highlight these new plots illustrating robustness. We will also update Fig. 4B figure legend to explain that the error bars reflect standard deviation across cross-validation seeds; we apologize for the omission.

Finally, we would like to clarify that our work has so far incorporated several mitigation strategies to help address degeneracy, which we will more clearly highlight in the revised Discussion. First, we used cross-validation to assess generalization. Second, we included structural ablations and constraints (e.g., non-negativity, history truncation) to test the necessity of key features. Third, we leveraged large-scale behavioral data, where extended training over thousands of trials is common, and pooling across animals increases data availability. We will expand our Discussion section to more clearly articulate these points.

Is the comparison of DNNGLM to RNNGLM fair. We appreciate the reviewer’s concern and agree that RNNGLMs have more parameters due to their hidden states. To test whether the performance gap stems purely from model capacity, we conducted additional comparisons in which we matched the parameter count between DNNGLM and RNNGLM. Even with equalized capacity, RNNGLM continued to outperform DNNGLM (p = 5.53e-04), suggesting that the key driver of the gap is not model size. Importantly, no matter how large a DNN becomes, its memoryless architecture prevents it from capturing history dependencies—unless history is manually appended as input, which introduces brittleness and hand-tuning. We will add these new results and clarifications in the revised manuscript.
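For reference, the capacity-matching check described above amounts to a simple parameter count; the sketch below uses hypothetical layer sizes (not the actual models) to show the idea.

```python
# Sketch of matching parameter counts between a feedforward and a recurrent rule
# (hypothetical sizes; the GRU width is chosen so the two totals are roughly comparable).
import torch.nn as nn

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

dnn_rule = nn.Sequential(nn.Linear(7, 64), nn.Tanh(), nn.Linear(64, 3))
rnn_rule = nn.ModuleList([nn.GRUCell(7, 12), nn.Linear(12, 3)])
print(n_params(dnn_rule), n_params(rnn_rule))  # compare totals before comparing fits
```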

On synaptic level learning rules. We agree with the reviewer that synaptic-level learning rules and the behavioral-level rules we study operate at fundamentally different levels of abstraction, and thus have little in common beyond the shared term “learning rule.” To avoid confusion, we will simply cite these synaptic plasticity studies as a complementary line of work—focused on neural-level mechanisms that can also, in some cases, be fit to behavioral data—and remove all subsequent unnecessary sentences that attempt to further distinguish our approach.

Mechanistic insights. We agree that “mechanistic” may be misleading here. Our intent was to highlight interpretable structure in the inferred learning rules (e.g., asymmetries, biases), not biophysical mechanisms. We will revise the manuscript to remove the word “mechanistic”.

Broader outlook. Before closing, we’d like to note that the reviewer’s comment, on novelty and de novo learning, has prompted us to reflect more deeply on the broader vision and significance of this work.

De novo learning—where agents must learn from scratch in a novel task without strong priors—is especially important in neuroscience, where animals are often trained in entirely new tasks. Understanding how learning unfolds during this phase is both scientifically valuable and practically important for improving training protocols and experimental design. Yet despite its ubiquity, this regime remains understudied.

We also note that learning rule inference is gaining traction, with recent works (e.g., [21, 24, 46, 58]) exploring related questions in structured learning and agent modeling—though not on the de novo setting we consider here. This makes our focus on inferring learning rules from scratch particularly timely for NeurIPS 2025, as the field begins to tackle more open-ended and biologically grounded formulations of learning. If the reviewer is aware of especially relevant prior work, we would be genuinely happy to engage further.

By reflecting on the reviewer’s feedback, we believe our approach lays a foundation for a broader research direction. While our work is the first to infer nonparametric, non-Markovian learning rules in de novo settings, we see this as just the beginning. Animals in naturalistic environments face even more complex challenges that require flexible strategies constructed from scratch [14]. Our approach offers a principled first step toward modeling such capabilities, and we hope it inspires future work that builds on this foundation using richer tasks, additional behavioral signals, and more expressive architectures.

We will revise the manuscript to more clearly highlight this broader framing in the Related Work and Discussion sections.

Comment

I thank the authors for their clarifications and the welcome additional experiments.

Comment

Thank you again for your thoughtful review and for engaging with our rebuttal. We’re currently preparing a follow-up comment to share additional results related to your points, and will follow up soon.

Comment

To clarify my comment above, I did not mean additional experiments were required, only that the ones presented in the rebuttals were a welcome/beneficial addition to the paper. Apologies for the ambiguity.

Comment

Hi Reviewer CMVq, just a quick follow-up — we wanted to share a compact summary of results already described in our initial rebuttal, related to your comment on degeneracy.

As noted in our initial response, we reran our analyses across multiple random seeds, which yielded similar log-likelihoods and qualitatively consistent learning rule characteristics. We also added shaded regions to the plots to show the standard deviation across seeds.

To convey the key takeaways despite the file upload restriction, we provide Table E below, which summarizes the critical features observed in the updated plots:

      | After correct    | After incorrect
Right | 0.0228 ± 0.0043  | -0.0155 ± 0.0048
Left  | 0.0127 ± 0.0041  | -0.0062 ± 0.0051

Table E. Mean ± standard deviation of $\Delta W_{\text{stim}}$ across random seeds, computed for different trial outcomes (correct vs. incorrect) and stimulus side. These values summarize the updated DNNGLM learning rule plots (Fig. 5A), with similar trends observed for RNNGLM. This table provides results for the current $W_{\text{stim}}$ at 0, but similar trends are observed at other values. The results capture key features: a negative baseline following errors and side-dependent asymmetries; these features are robust across seeds—that is, the condition-dependent differences in $\Delta W_{\text{stim}}$ exceed the variability introduced by random initialization.

In a similar vein, we summarize our updated plot in Fig. 5B (past four trials) in Table F below to show the consistency of history-dependency across seeds:

               | Current correct | Current incorrect
Past correct   | 0.0205 ± 0.0077 | -0.0082 ± 0.0089
Past incorrect | 0.0014 ± 0.0068 | -0.0268 ± 0.0072

Table F. Similar to Table E but for RNNGLM, conditioned on the past four consecutive trials being correct vs. incorrect (Fig. 5B). These results capture a history dependency that is robust across seeds.

If any additional clarification and discussion would be helpful, we’d be happy to provide it. Thank you again for your insightful and constructive feedback!

Review
4

This paper looks at how animals learn new tasks from scratch. The authors argue that previous methods for modeling animal learning are too restrictive. They either assume a specific, predefined mathematical form for the learning rule (like Q-learning or REINFORCE) or are limited to simple bandit tasks where the underlying task structure is already known.

The main contribution of this paper is a flexible new framework that uses deep neural networks to infer the learning rule directly from an animal's behavioral data. The goal is to discover the actual, and potentially complex, strategies animals use without being constrained by human assumptions.

Strengths and Weaknesses

Strengths

The framework demonstrated its accuracy in simulation. The DNNGLM successfully recovered the weight trajectories and the underlying known Markovian learning rule. Crucially, when faced with a non-Markovian rule, the DNNGLM failed, while the RNNGLM successfully captured the history-dependent effects, consistent with the authors’ claim that the recurrent architecture is necessary for complex learning.

Weaknesses

The choice of a GRU is perhaps a limitation, since the field has largely moved toward Transformer-based architectures given their ability to capture long-range dependencies. The paper limits its discussion of Transformers to a single sentence in the future directions section.

The non-Markovian effects reported span just four trials (Fig. 5B). Is this a genuine feature of mouse learning, or is it a limitation imposed by the GRU's difficulty in propagating information over longer horizons? A Transformer, with its direct attention mechanism, could potentially discover much longer temporal dependencies, and could lend even more weight to the authors’ conclusions.

Questions

  1. Could you clarify the nature of the error bars in Figure 4B (Standard Deviation? Standard Error of the Mean over animals? Over cross-validation seeds?)?

  2. The current model assumes the learning rule f_θ is static throughout the entire course of learning. However, it's plausible that animals "learn how to learn". Is it possible that the learning rule itself is non-stationary and adapts as the animal gains more experience with the task structure?

  3. Could you perform an analysis where you fit your RNNGLM independently to the first half and the second half of the sessions for each animal? If the inferred f_θ is qualitatively different between the early and late phases of learning, this could be strong evidence for a non-stationary learning rule that the current model cannot capture.

Limitations

Yes

Final Justification

I thank the authors for their detailed reply. I am increasing my score to 4.

Formatting Issues

NA

Author Response

We are very grateful to the reviewer for their detailed and insightful comments. We especially appreciate the suggestions regarding model architecture, temporal structure, and nonstationarity, which helped us expand and clarify several key aspects of our work. Below, we summarize how your feedback led to specific improvements in our analyses and manuscript.

Transformer extension. We strongly agree that extending to Transformer-based models is an important next step, given the known limitations of GRUs. Motivated by the reviewer’s suggestion, we implemented a Transformer-based parameterization of the weight update function. As shown in Table A below, it achieves modestly improved performance. The inferred learning rule plot, which we will add to the manuscript, also captures non-Markovian reward history effects (stronger weight updates after consecutive rewards, similar to RNNGLM). These findings further underscore the history-dependent characteristics we uncover and highlight the promise of Transformer-style models for more in-depth future work, particularly in the larger-data settings envisioned for future extensions. We will add these results and the corresponding discussion to the revised manuscript.

Model          | Test per-trial LL | P-val (comparing to RNNGLM)
RNNGLM         | -0.4837           | n/a
TransformerGLM | -0.4806           | 9.72e-03

Table A: Comparison of RNNGLM and TransformerGLM test performance on the IBL dataset.

We also hope to clarify that the GRUs were chosen as a practical and interpretable starting point for de novo learning rule inference (see our discussion with CMVq, under “Novelty”, for more details). GRUs offer favorable inductive biases, which are especially useful in behavior-level modeling with limited and noisy data. In particular, MLPs (a substructure of GRUs) are known to exhibit spectral bias, favoring smoother, low-frequency functions during training [70-82]; this inductive bias is well characterized and can be directly tuned (e.g., via width), which can promote generalization even in data-limited regimes. Also, our new analysis in the next paragraph indicates that GRUs can capture dependencies beyond just a few trials. In contrast, Transformer-based models are more expressive but have less well-understood inductive biases; that said, recent work (e.g., Vasudeva et al., ICLR’25) shows that Transformers can also exhibit simplicity biases under certain training conditions.
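As a rough illustration of what a Transformer-based parameterization of the per-trial weight update could look like, a hedged sketch is given below (an assumed architecture consistent with the description above, not the authors' TransformerGLM implementation; layer sizes and the packing of per-trial inputs are illustrative).

```python
# Sketch: a causal Transformer encoder that maps the trial history to per-trial weight updates.
import torch
import torch.nn as nn

class TransformerGLMRule(nn.Module):
    def __init__(self, n_inputs, n_weights, d_model=32, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(n_inputs, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=64,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.readout = nn.Linear(d_model, n_weights)

    def forward(self, trial_seq):
        # trial_seq: (batch, T, n_inputs) with stimulus, current weights, and reward per trial
        T = trial_seq.shape[1]
        causal = nn.Transformer.generate_square_subsequent_mask(T)  # attend only to the past
        z = self.encoder(self.embed(trial_seq), mask=causal)
        return self.readout(z)  # (batch, T, n_weights): a weight update for every trial
```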

Why non-Markovian effects are illustrated over only four trials. Thank you for raising this thoughtful question. The four-trial horizon in Fig. 5B is not meant to imply that GRUs cannot learn dependencies beyond four trials; it is a pragmatic choice of illustration (we will clarify this in the updated figure legend). In our dataset, long sequences of consecutive unrewarded trials (e.g., >5 in a row) are rare, making it difficult to estimate such effects without extrapolating beyond the fitted data. To further probe long-range dependencies, we analyzed $\Delta W$ conditioned on reward outcomes from ten trials ago—specifically comparing sequences in which all four consecutive trials from eleven to eight steps prior were rewarded versus unrewarded. We observed a consistent separation in $\Delta W$ across these conditions, suggesting that the model captures non-Markovian effects extending well beyond a few trials. We will add this plot and clarification to the revised manuscript, and could include additional analyses to quantify how far back in time these differences remain detectable. Our goal was to highlight clear evidence of non-Markovian reward-history structure using an illustrative example as a step beyond conventional Markovian models.

Prompted by the reviewer’s suggestion, we ran additional analyses truncating the history window. As shown in Table B below, performance degraded when history was limited to 10 trials, confirming the importance of long-range dependencies. We also ran a new simulation with a handcrafted longer-horizon ground truth learning rule to show that RNNGLM can effectively capture influences well beyond a few trials. Both the new simulation plot and the following table will be included in the revised manuscript to underscore the importance of modeling extended history-dependencies.

Model                                  | Test per-trial LL | P-val (comparing to no truncation)
TransformerGLM (with truncated memory) | -0.4838           | 1.44e-03

Table B: Impact of history truncation (after 10 trials) on test log-likelihood on the IBL dataset. We compared model performance with and without memory truncation (i.e., restricting context length) to assess the importance of an extended temporal window.

Clarify nature of the error bars. Thank you for catching this omission. The error bars reflect standard deviation across cross-validation seeds. We will clarify this in the updated figure legends.

The nonstationarity extension. This is a great question. As noted in the manuscript, our current model assumes a static $f_\theta$, consistent with common practice. To probe potential nonstationarity in the learning rule, we followed the reviewer’s suggestion and fit separate models to the early and late learning phases. However, we did not observe a significant improvement on test data (p = 9.98e-01); and while the fitted learning rules differed slightly, core qualitative features (e.g., reward-history dependence and asymmetry) remained consistent. We also tried adding the trial number as an auxiliary input, again with no performance gain. These results suggest stable trial-level learning patterns in this dataset—though we agree that more explicit modeling of dynamic rules is a promising future direction (e.g., via phased fitting as the reviewer suggested, or by adding the trial number as an auxiliary input). We thank the reviewer for raising this and will include these results and discussion in the revision.
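A minimal sketch of the early/late comparison described above is given below (illustrative only; `fit_rule` and `test_loglik` are hypothetical placeholders, and in practice held-out sessions would be used within each phase).

```python
# Hypothetical sketch of the phase-split check for a non-stationary learning rule.
def phase_split_check(sessions, fit_rule, test_loglik):
    """Compare a single static rule against phase-specific rules (early vs. late)."""
    half = len(sessions) // 2
    early, late = sessions[:half], sessions[half:]
    static_rule = fit_rule(sessions)
    early_rule, late_rule = fit_rule(early), fit_rule(late)
    ll_static = test_loglik(static_rule, sessions)
    ll_phased = 0.5 * (test_loglik(early_rule, early) + test_loglik(late_rule, late))
    # A reliable gain of ll_phased over ll_static would indicate a non-stationary rule.
    return ll_static, ll_phased
```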

Comment

I thank the authors for their detailed reply. I am increasing my score to 4.

Comment

Thank you again for your constructive and insightful comments!

Final Decision

This paper proposes a flexible framework for inferring learning rules directly from animal behavior in de novo learning tasks. Rather than assuming a parametric form like Q-learning or policy gradient, the authors parameterize the learning rule with neural networks, allowing per-trial policy updates to be learned from data. They validate the approach on simulations, where the method recovers known rules, and then apply it to a large mouse dataset. Results show that recurrent parameterizations capture history-dependent, non-Markovian learning dynamics, and reveal asymmetries between correct and incorrect trials. The rebuttal further strengthened the paper by adding Transformer-based extensions, long-horizon analyses, and checks for nonstationarity.

Reviewers agreed that the paper is clearly written, technically sound, and provides a useful framework for reverse-engineering learning rules from behavior. Strengths include the novelty of applying flexible neural parameterizations to de novo learning, the combination of simulation and real animal data, and interpretable findings such as side biases and negative baselines. Reviewers also appreciated that the authors went beyond toy problems to extract insights from real data, and that additional experiments addressed concerns about scope, degeneracy, and architecture choices.

The main weakness concerns were about novelty in a crowded subfield, potential degeneracy in inferred rules, and limits of the modeling choices (GLM decision policy, GRUs vs Transformers, 1D tasks). Some reviewers felt the contribution is incremental relative to prior meta-learning work, and that applicability to richer tasks remains to be shown. Still, the rebuttal added convincing evidence that key behavioral features are robust, clarified limitations, and positioned the work well within ongoing efforts in computational neuroscience. On balance, this is a timely and solid contribution that moves the field forward. I recommend accept (poster).