Inverse Reinforcement Learning with Switching Rewards and History Dependency for Characterizing Animal Behaviors
We develop a novel inverse reinforcement learning framework that can model history-dependent switching reward functions in complex animal behaviors
Abstract
Reviews and Discussion
The paper introduces Switching Inverse RL (SWIRL), an inverse reinforcement learning framework for characterizing animal behaviors. In this problem setting, the goal is to infer reward functions and policies from animal behavior trajectories. To achieve this, SWIRL introduces two main design choices: time-varying rewards controlled by latent modes and biologically plausible history dependency. The time-varying rewards allow the model to capture switching behaviors. The history dependency occurs at two levels: the decision level (transitioning from one hidden mode to another conditioned on the previous state) and the action level (history-conditioned policies and rewards). The framework is optimized with an expectation-maximization procedure to recover the parameters for the hidden transition kernel, the policy, and the reward function. The paper compares SWIRL to prior IRL baselines and ablations of SWIRL on three sets of experiments: a 5x5 gridworld environment, a labyrinth dataset where water-deprived mice move freely based on varying internal goals, and a dataset where mice wandered an empty arena without explicit rewards. Across these experiments, they find SWIRL to outperform baselines in reward correlation, log-likelihood, and segment accuracy.
Questions For Authors
- How is the reward function optimized? Looking at Algorithm 1 in the appendix and the auxiliary function (Eqn 5, 6, 7), it seems the gradient of the reward function is not directly passed through (because of the soft Q iteration). Do you backpropagate the gradient for the reward function through the Q iteration? If so, how many Q iterations do you use?
- Why do you apply different optimizers to the transition kernel (L-BFGS) and the reward function (Adam)? Is it an empirical design choice?
- Is there a reason why ARHMM and rARHMM are not included in experiments 1 and 2?
Updates After Rebuttal
The authors have adequately addressed my questions. I will maintain my score.
Claims And Evidence
The central claim of this paper is that incorporating switching rewards and history dependency results in better inverse RL methods for characterizing animal behaviors. This claim is supported by three sets of experiments with tabular state and action spaces:
- The first is a 5x5 gridworld environment where the underlying reward is accessible. Compared to baselines with constant rewards and less history dependence, SWIRL scores the highest correlation between predicted and ground truth rewards, log-likelihood of the held-out test data under the policy, and accuracy of mode prediction.
- The second set of experiments is on a water-restricted labyrinth dataset. Since the ground truth rewards and behavior modes are not accessible, the paper only provides a quantitative comparison of the data log-likelihood under the policy, where SWIRL outperforms the baselines. They qualitatively visualize the mode switching in trajectories as well as the reward maps. Both indicate that the method learns a reasonable switching reward.
- The third set of experiments is on a free-space wandering dataset. The held-out log-likelihoods and reward visualizations support their claim. They further plot the log-likelihood as a function of the number of hidden modes and show that SWIRL performance improves with more hidden modes, whereas removing the state dependence leads to degrading performance as the number of hidden modes increases.
Methods And Evaluation Criteria
The paper proposes an inverse RL framework for characterizing animal behaviors. Since animal behaviors are likely controlled by time-varying objectives (e.g. find water when thirsty, otherwise go home), the paper proposes to infer a switching reward conditioned on latent modes. Moreover, animal behaviors are influenced by the history of observations, and thus the paper introduces history dependence to the hidden transition kernel, the reward, and the policy. This framework is optimized by an expectation-maximization procedure, which iteratively performs RL on the current reward estimate, uses the policy to estimate an evidence lower bound on the data likelihood, and updates the parameters via gradient descent. The proposed method is biologically plausible and mathematically sound.
The evaluation criteria vary based on the problem setting. When the animal behavior data is synthetically generated by following a time-varying reward, it is feasible to compare the correlation between predicted and ground truth rewards and the accuracy of mode switching. Otherwise, if one only has access to the dataset, then data likelihood under the policy is the only feasible evaluation criterion.
Theoretical Claims
The paper did not make theoretical claims.
Experimental Design And Analyses
As discussed in the "Claims And Evidence" section, the experimental design is sound and supports the claims made in this paper.
Supplementary Material
I reviewed all parts of the supplemental material.
Relation To Broader Scientific Literature
This work has implications for both the neuroscience community and the machine learning community, offering a useful tool for understanding animal behaviors and a flexible framework for general inverse RL problems.
Essential References Not Discussed
The paper provides sufficient context for the reader to understand its contributions.
Other Strengths And Weaknesses
Strengths
- The paper introduces a novel inverse RL framework for characterizing animal behaviors. The proposed method models switching rewards using a hidden mode variable and incorporates history context into the hidden transition kernel, the policy, and the reward in a biologically plausible manner.
- The proposed method shows strong empirical results across three sets of experiments.
- The visualizations of learned rewards and mode transitions are particularly elucidating.
Weaknesses
- The experiments in this paper feature tabular state and action spaces. The implications for continuous domains remain unclear.
- The proposed method is computationally heavy and requires manual tuning of the number of hidden modes.
- The method description is not quite clear. See questions below.
Other Comments Or Suggestions
- I suggest adding a more detailed description of evaluation metrics such as log-likelihood and segmentation accuracy. Currently, they are briefly introduced on lines 250-256 and lack clarity.
- I suggest adding a more detailed description of the evaluation domains. For example, describe the state and action spaces for each environment used for evaluation.
We greatly appreciate the time and effort reviewer 5en4 dedicated to analyzing our work and providing such constructive feedback. We are pleased that the reviewer recognized the novelty of our work, its technical soundness, its implications for both the neuroscience and ML communities, and the experimental design and presentation. Below, we address each of the questions and concerns:
1. Scalability concern: Please refer to section 1 of our reply to Reviewer 9ThW.
2. Evaluation metrics: We will add a detailed description of the evaluation metrics in our next revision. Test log-likelihood is computed from the forward variable, whose computation is described in Appendix A.2. For test hidden-mode segmentation accuracy, we use the Viterbi algorithm to predict the hidden mode at each timepoint and compute the percentage of correctly predicted hidden modes compared to the ground truth. For reward correlation, we compute the Pearson correlation between the learned reward and the ground truth reward.
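For concreteness, a minimal sketch of how these three metrics can be computed (illustrative only; the function and variable names below are our choices for this reply, not the notation of the paper):

```python
import numpy as np
from scipy.special import logsumexp

def test_log_likelihood(log_pi0, log_A, log_emission):
    """Forward recursion in log space for a switching model.
    log_pi0: (K,) initial hidden-mode log-probs.
    log_A: (T-1, K, K) mode-transition log-probs (time-indexed because the
        transition may depend on the current state/history).
    log_emission: (T, K) log-prob of the observed action under each mode's policy."""
    log_alpha = log_pi0 + log_emission[0]
    for t in range(1, log_emission.shape[0]):
        log_alpha = logsumexp(log_alpha[:, None] + log_A[t - 1], axis=0) + log_emission[t]
    return logsumexp(log_alpha)  # log p(trajectory)

def segmentation_accuracy(z_pred, z_true):
    # fraction of timepoints whose predicted hidden mode matches the ground truth
    return float(np.mean(np.asarray(z_pred) == np.asarray(z_true)))

def reward_correlation(r_learned, r_true):
    # Pearson correlation between flattened reward maps
    return float(np.corrcoef(np.ravel(r_learned), np.ravel(r_true))[0, 1])
```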
3. Experiment setup: We will add details of each environment in our next revision. The state and action spaces are as follows (see also the brief sketch after the list):
5x5 Gridworld: State space: Discrete(25); Action space: Discrete(5). The 5 actions are ‘up’, ‘left’, ‘down’, ‘right’, ‘stay’.
Labyrinth: State space: Discrete(127); Action space: Discrete(4). The labyrinth has a binary-tree structure and the 4 actions are ‘move in left’, ‘move in right’, ‘move out from leaf node’, ‘move out from non-leaf node’.
Spontaneous Behavior: State space: Discrete(9); Action space: Discrete(9). Each state is a MoSeq syllable that defines a behavior motif. The 9 actions are simply the next state the agent transitions into.
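For readability, the three settings can be summarized as below (gymnasium notation is used purely for illustration; this is not necessarily how the environments are implemented):

```python
# Purely illustrative summary of the three tabular settings.
import gymnasium as gym

gridworld = dict(obs=gym.spaces.Discrete(25), act=gym.spaces.Discrete(5))   # up/left/down/right/stay
labyrinth = dict(obs=gym.spaces.Discrete(127), act=gym.spaces.Discrete(4))  # binary-tree maze moves
moseq     = dict(obs=gym.spaces.Discrete(9),  act=gym.spaces.Discrete(9))   # syllable -> next syllable
```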
4. How is the reward function optimized? The soft Q iteration is differentiable, and we backpropagate the gradient for the reward function through the Q iteration. We used 100 iterations and found that the soft Q iteration usually converged well before 100 iterations in our experiments. We will add this discussion in our next revision.
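As a minimal sketch of this differentiability in the tabular setting (tensor names and the placeholder objective below are assumptions for this reply, not our actual code):

```python
import torch

def soft_q_iteration(reward, P, gamma=0.95, n_iters=100):
    """reward: (S, A) learnable table; P: (S, A, S) transition probabilities.
    Every operation is differentiable, so the gradient of any downstream loss
    flows back into `reward` through the unrolled iterations."""
    Q = torch.zeros_like(reward)
    for _ in range(n_iters):
        V = torch.logsumexp(Q, dim=1)                     # soft state value
        Q = reward + gamma * torch.einsum('sat,t->sa', P, V)
    return Q

S, A = 25, 5
reward = torch.zeros(S, A, requires_grad=True)
P = torch.softmax(torch.randn(S, A, S), dim=-1)           # placeholder dynamics
Q = soft_q_iteration(reward, P)
log_policy = Q - torch.logsumexp(Q, dim=1, keepdim=True)  # soft-greedy policy
loss = -log_policy[0, 0]                                  # stand-in for the EM objective
loss.backward()                                           # gradient reaches `reward`
```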
5. Why do you apply different optimizers to the transition kernel (L-BFGS) and the reward function (Adam)? This is an empirical choice. A plausible reason is that, in the M-step, the transition kernel typically yields a smoother loss surface than the reward function. As a result, second-order optimizers like L-BFGS can exploit the smooth curvature via second derivatives, leading to faster and more stable convergence. In contrast, the reward function often has a more complex loss landscape, where Adam tends to perform better due to its robustness and adaptive step sizes. We will add this discussion in our next revision.
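A schematic of this M-step optimizer split (parameter shapes and the placeholder objective are illustrative assumptions, not our implementation):

```python
import torch

K, S, A = 3, 25, 5
transition_params = torch.randn(K, K, requires_grad=True)   # mode-transition logits
reward_params = torch.zeros(K, S, A, requires_grad=True)    # per-mode reward tables

opt_trans = torch.optim.LBFGS([transition_params], max_iter=20)
opt_reward = torch.optim.Adam([reward_params], lr=1e-2)

def neg_auxiliary():                  # placeholder for the EM auxiliary objective
    return transition_params.pow(2).sum() + reward_params.pow(2).sum()

def closure():                        # L-BFGS needs a closure that re-evaluates the loss
    opt_trans.zero_grad()
    loss = neg_auxiliary()
    loss.backward()
    return loss

opt_trans.step(closure)               # curvature-aware update for the transition kernel

opt_reward.zero_grad()
neg_auxiliary().backward()
opt_reward.step()                     # adaptive first-order update for the reward
```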
6. Is there a reason why ARHMM and rARHMM are not included in experiments 1 and 2? ARHMM/rARHMM model the emission probability directly over next states, p(s_{t+1} | s_t, z_t), where the implicit marginalization over the action space makes the log-likelihood incomparable to models like SWIRL in the general case. Experiment 3 (spontaneous behavior) is the only experiment where the action is effectively the same as the next state, making it possible to compare ARHMM/rARHMM with SWIRL by test LL.
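One way to spell out the mismatch (the notation below is our illustration for this reply, not taken from the paper):

```latex
% Illustration only; notation is assumed.
\begin{align*}
\text{ARHMM/rARHMM emission (marginalizes over actions):}\quad
  & p(s_{t+1} \mid s_t, z_t) \;=\; \sum_{a} \pi_{z_t}(a \mid s_t)\, P(s_{t+1} \mid s_t, a),\\
\text{SWIRL per-step likelihood (scores the action itself):}\quad
  & \pi_{z_t}(a_t \mid s_{t-L+1:t}).
\end{align*}
```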
We thank the reviewer again for their effort and humbly request that they consider raising their score if the above reply adequately addresses their concerns.
This paper addresses the limitation of traditional IRL, which assumes rewards depend only on the current state, making it insufficient for modeling long-term, history-dependent decision-making in animals. To capture this dependency, the paper introduces SWIRL, an IRL framework that models behavior as transitions between short-term decision-making processes, each governed by a distinct reward function. SWIRL infers these processes and their transitions using history-dependent reward modeling. The authors conducted experiments on simulated and real-world animal behavior datasets.
Questions For Authors
Although the discussion section mentions POMDPs, I believe the statement that "SWIRL can be extended to POMDPs" is not entirely accurate. POMDPs represent a parallel approach rather than a direct extension of SWIRL. It would be more appropriate to compare SWIRL directly with existing POMDP-based methods rather than framing it as a future extension. Could the authors clarify their perspective on this distinction?
Claims And Evidence
The paper makes well-motivated claims about the necessity of incorporating history-dependent rewards in IRL to model long-term animal decision-making. However, the paper does not provide a detailed analysis of how SWIRL scales to larger environments or higher-dimensional state spaces, which is crucial given its reliance on history-dependent modeling. Empirical results on computational efficiency would strengthen this claim. Overall, the paper provides substantial evidence to support its core claims but could improve in these areas to further solidify its contributions.
Methods And Evaluation Criteria
- The proposed method increases the input state size by incorporating multiple past states, resulting in larger network inputs, higher computational complexity, and slower training and inference.
- Hierarchical RL decomposes complex decision-making into multiple levels of abstraction. A high-level policy selects sub-goals or temporally extended actions, while a low-level policy executes fine-grained actions to achieve them. In SWIRL, the hidden modes resemble the sub-task structure in HRL, as they dictate different behavioral phases. However, unlike HRL, where the hierarchy is explicitly defined, SWIRL infers latent behavioral modes from data. A direct comparison with HRL-based methods—particularly goal-conditioned RL or options-based methods—could clarify whether SWIRL provides advantages in learning structured behaviors without predefined hierarchical policies.
- POMDPs explicitly model uncertainty by maintaining a belief state over hidden variables. POMDP-based approaches have been widely used to model decision-making under partial observability, where the agent must infer unobserved environmental factors. SWIRL introduces hidden modes, which serve as an internal latent state governing behavior. Given this similarity, a comparison with POMDP-based RL methods could provide insights into whether belief-state tracking could enhance SWIRL’s performance.
- The experiments only evaluate small history lengths, which may not be sufficient for capturing long-term dependencies in sequential decision-making. Given that the paper aims to model history-dependent decision-making, it could include a discussion on the feasibility of using alternative approaches, such as RNNs or Transformers, to learn sequential dependencies more effectively.
- Since the reward network must learn multiple distinct reward functions corresponding to different hidden modes, there is a potential issue when some hidden modes appear infrequently. This could lead to underfitting in the reward networks for rarely occurring modes. Is there any discussion on how to mitigate this issue, such as through data augmentation, regularization, or constraints on reward function learning?
Theoretical Claims
N/A
Experimental Design And Analyses
- Since incorporating past states increases input size and computational complexity, it would be valuable to assess the efficiency of SWIRL in larger environments. A computational runtime analysis or an empirical study on training efficiency would provide useful insights.
- The paper primarily compares SWIRL against history-agnostic IRL models. While this highlights the benefit of incorporating history dependency, additional baselines, such as RNN-based RL methods [1, 2] and Transformer-based RL models [3, 4], could provide a stronger contextual comparison in handling long-term dependencies and sequential decision-making.
- The reward network must learn multiple distinct reward functions for different hidden modes. However, if certain modes appear infrequently, they may not receive sufficient training, leading to potential underfitting. A sensitivity analysis on the frequency of hidden modes and its impact on reward learning would help assess robustness.
[1] Memory-based Control with Recurrent Neural Networks.
[2] Deep Recurrent Q-Learning for Partially Observable MDPs.
[3] Decision Transformer: Reinforcement Learning via Sequence Modeling.
[4] Offline Reinforcement Learning as One Big Sequence Modeling Problem.
Supplementary Material
Yes. I reviewed all parts of the supplementary material.
Relation To Broader Scientific Literature
- The relationship between SWIRL and broader scientific literature, particularly Hierarchical RL and POMDPs, could be further clarified, as SWIRL’s hidden modes resemble the sub-task structures in HRL and the latent states in POMDPs, making a direct comparison with these methods valuable for distinguishing its contributions.
- This paper models history dependency by explicitly incorporating past states into the input, increasing the input size and computational cost. However, alternative sequence modeling methods, such as RNNs and Transformers, have been extensively studied for capturing long-term dependencies in reinforcement learning. Discussing whether such architectures could complement or improve SWIRL’s performance would provide valuable context and better position the work within the broader literature on history-aware decision-making.
Essential References Not Discussed
POMDPs:
[1] Learning Predictive State Representations.
[2] Inverse Reinforcement Learning in Partially Observable Environments.
Hierarchical RL:
[3] Data-Efficient Hierarchical Reinforcement Learning.
RL + RNN:
[4] Memory-based Control with Recurrent Neural Networks.
[5] Deep Recurrent Q-Learning for Partially Observable MDPs.
RL + Transformers:
[6] Decision Transformer: Reinforcement Learning via Sequence Modeling.
[7] Offline Reinforcement Learning as One Big Sequence Modeling Problem.
Other Strengths And Weaknesses
See comments.
Other Comments Or Suggestions
- According to the submission guidelines: Section headings should be numbered, flush left, and set in 11 pt bold type with the content words capitalized. Leave 0.25 inches of space before the heading and 0.15 inches after the heading. Similarly, subsection headings should be numbered, flush left, and set in 10 pt bold type with the content words capitalized. Leave 0.2 inches of space before the heading and 0.13 inches afterward. However, it seems that the authors have modified these formatting rules.
- The full name of IRL appears multiple times throughout the paper, including in Section 1 (Introduction), Section 3.2, and the Discussion. It is recommended to define it only once at the first occurrence and use the abbreviation consistently thereafter.
- In line 122, the term "autoregressive process (ARHMM)" appears, while in line 189, it is written as "autoregressive hidden Markov model (ARHMM)." Please clarify which is the correct full name for ARHMM to maintain consistency.
We sincerely appreciate the time and effort reviewer CRRd dedicated to analyzing our work. Below, we address each of the questions and concerns:
1. Hierarchical RL, RL+RNN and POMDPs: We will refer to the works listed in the “Essential References Not Discussed” section using bracketed citations in this response.
1.1 [1][3][4][5][6][7]: We thank the reviewer for pointing out the RL methods with history dependency or multiple reward functions. However, it is worth noting that our method solves the Inverse RL problem (given expert demonstrations, recover the expert's policy and the reward function) instead of the RL problem (given a reward function, find the optimal policy). IRL is a much harder problem, and its goal is very different from that of RL. As a result, it is not feasible to apply those RL methods as baselines for our method.
1.2 [2]: It is worth noting that [2] infers reward functions from expert demonstrations while assuming the observation function P(z|s, a) is already known—a condition that does not hold in our case. To the best of our knowledge, there is no existing Inverse POMDP method that can effectively address our problem. We recognize the need to discuss related POMDP literature and will add it in the next revision.
2. Scalability and longer history: Please refer to section 1 and section 4 of our reply to Reviewer 9ThW. We appreciate the idea of using RNNs or Transformers to learn sequential dependencies more effectively and will add it in the next revision as a future direction.
3. Hidden mode with infrequent occurrence: In cases where a hidden mode occurs infrequently, regularization or constraints can be incorporated to improve reward recovery. For instance, if the reward is expected to be sparse, L1 regularization can be used; if large reward coefficients are undesirable, L2 regularization can help. When the occurrence of a mode is too low and prior knowledge is too limited, a common practice is to reduce the number of hidden modes, effectively merging the rare mode with a more frequent mode with a similar reward function.
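For concreteness, the regularization idea could look like the following (hypothetical tensor names; not our actual implementation):

```python
import torch

rewards = torch.zeros(3, 25, requires_grad=True)  # K modes x S states

def regularized_loss(base_loss, rewards, l1_modes=(0, 1), l2_modes=(2,),
                     lam1=1e-2, lam2=1e-3):
    # L1 pushes the (expectedly sparse) reward maps toward zero almost everywhere;
    # L2 keeps the remaining map's coefficients from growing too large.
    l1 = sum(rewards[k].abs().sum() for k in l1_modes)
    l2 = sum(rewards[k].pow(2).sum() for k in l2_modes)
    return base_loss + lam1 * l1 + lam2 * l2

loss = regularized_loss(torch.tensor(0.0), rewards)
loss.backward()
```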
4. Section heading issue: We apologize for the heading space issue and will correct it in our next revision.
5. Full name of IRL appears multiple times: We will use the abbreviation consistently in our next revision.
6. ARHMM abbreviation issue: ARHMM should be the abbreviation of the autoregressive hidden Markov model. We will address this issue in our next revision.
7. Response to Questions For Authors:
Thank you for the thoughtful comment. Upon reflection, we agree that the original statement may have been imprecise. While SWIRL is not a POMDP method in the formal sense (e.g., modeling belief states or latent dynamics explicitly), it does address non-Markovian behavior by augmenting the state space with recent history (e.g., using {s_t, s_{t-1}, s_{t-2}, …}). This is a commonly used strategy to handle partial observability or non-Markovianity—similar in spirit to how LSTMs or memory-based policies are used in POMDP settings. So while SWIRL is not derived from a POMDP framework per se, it shares key characteristics with POMDP-appropriate methods in its use of temporal context. When we wrote that it “can be extended to POMDPs,” we were referring to the potential for a more principled approach—e.g., explicitly modeling latent states or belief updates—to further generalize the method. We will revise the discussion to clarify this distinction and avoid the implication that our current method is directly extendable in a formal POMDP sense.
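As a small illustration of the history-augmentation strategy described above (the ordering and padding conventions here are assumptions for this reply, not necessarily what the paper uses):

```python
from collections import deque

def augment_with_history(states, L=3):
    """Return history tuples (s_t, s_{t-1}, ..., s_{t-L+1}), most recent first;
    the first steps are padded by repeating the initial state (a convention we
    pick here for illustration)."""
    hist = deque([states[0]] * L, maxlen=L)
    augmented = []
    for s in states:
        hist.appendleft(s)
        augmented.append(tuple(hist))
    return augmented

print(augment_with_history([3, 7, 7, 2], L=3))
# [(3, 3, 3), (7, 3, 3), (7, 7, 3), (2, 7, 7)]
```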
We thank the reviewer again for their effort and humbly request that they consider raising their score if the above reply adequately addresses their concerns.
Thank you for the clarification. While I understand that your method focuses on IRL rather than RL, my suggestion regarding RNN or Transformer-based models was motivated by your claim of modeling history dependency, not by a desire for RL baselines. Currently, history is handled via fixed-length state augmentation (up to L=4), where policy and reward depend on a short sequence of past states. While this may capture some short-term context, it lacks the expressive power of true sequence models like RNNs or Transformers, which better model long-term, variable-length dependencies. Since history modeling is a key contribution of the paper, the current implementation feels overly simplistic. I would be interested to hear the authors’ perspective on this point, particularly whether they see potential for incorporating more expressive sequence models into the SWIRL framework, or have specific reasons for preferring the current design.
Regarding hidden modes with infrequent occurrence, I appreciate the suggestions regarding regularization and mode merging; however, my concern was primarily about the robustness of reward learning under mode imbalance. A sensitivity analysis on mode frequency and its effect on reward learning would provide more concrete evidence of the method’s reliability under imbalanced settings.
We sincerely thank reviewer CRRd for acknowledging the distinct difference between RL and IRL. Following the rebuttal comment, we conducted additional experiments and address their two concerns below:
1. Using an RNN/transformer-based policy model
1.1 We acknowledge the reviewer’s concern. Although previous usage of RNN/transformer-based policy models is limited to the RL literature, we are aware that some recent IRL work concerning the fine-tuning of LLMs has essentially used transformer-based policy models [1][2].
1.2 It is possible to use an RNN/transformer-based policy model in SWIRL. We conducted an additional model-free SWIRL experiment following the same setup as Appendix D.1 but with a transformer-based policy model. To facilitate interpretation and visualization of the learned reward function, we constrained the transformer’s input sequence length to 2. Our results show that the transformer-based model-free SWIRL achieves reasonable performance. Although the recovered reward maps appear noisier compared to previous results (Fig. 2A and Fig. 8 in the paper), they still successfully identify the key high-reward regions. We visualize the transformer-based SWIRL results in https://anonymous.4open.science/r/SWIRL_rebuttal-C46B/sim_result_iql_transformer.pdf.
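A rough sketch of the kind of transformer-based policy used in this additional experiment (the exact architecture and hyperparameters below are illustrative assumptions, not the ones we used):

```python
import torch
import torch.nn as nn

class TransformerPolicy(nn.Module):
    def __init__(self, n_states=25, n_actions=5, d_model=32, seq_len=2):
        super().__init__()
        self.embed = nn.Embedding(n_states, d_model)
        self.pos = nn.Parameter(torch.zeros(seq_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_actions)

    def forward(self, state_history):               # (batch, seq_len) integer states
        h = self.embed(state_history) + self.pos
        h = self.encoder(h)
        return torch.log_softmax(self.head(h[:, -1]), dim=-1)  # log pi(a | history)

policy = TransformerPolicy()
log_probs = policy(torch.tensor([[12, 13]]))         # history (s_{t-1}, s_t)
```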
1.3 The choice between using the model-based SWIRL presented in the main body of our paper and the model-free variant depends on the specific use case. As discussed in Section 1.1 of our response to Reviewer 9ThW, IRL is inherently a challenging problem. Performing IRL in a scalable, model-free manner often sacrifices reward recovery accuracy, even though the recovered policy can perform well. Moreover, employing RNN/transformer-based policy models, while allowing compatibility with variable-length histories, also makes the interpretation of the inferred reward function harder. Therefore, for neuroscience applications that involve moderate-sized state-action spaces and prioritize reward recovery and interpretability, we recommend using the model-based SWIRL implementation. That said, we also present the above additional experiment, providing concrete evidence that the model-free SWIRL described in Appendix D.1 can be implemented with a transformer-based policy and achieve reasonable performance. For applications that prioritize scalability and focus primarily on policy recovery instead of reward recovery, the transformer-based model-free SWIRL can be an ideal choice.
2. Sensitivity analysis on mode frequency and its effect on reward learning
We conducted an additional experiment with SWIRL to examine how low mode frequency affects reward learning. In this experiment, we simulated a 5x5 gridworld environment featuring three hidden modes: water, home, and explore. Both water and home were associated with sparse reward maps, each with different high-reward states, while explore had a dense, uniform reward map. We generated three datasets with these switching reward maps, each consisting of 100 trajectories of 500 steps (50,000 timepoints in total). In the first dataset, the home mode occurred with low but still normal frequency (5,999 timepoints; 12% of the data). In the second dataset, home mode frequency was further reduced (1,202 timepoints; 2.4%), and in the third dataset, the home mode occurred extremely infrequently (332 timepoints; 0.66%). We applied SWIRL to each dataset, both with and without regularization. Without regularization, SWIRL performed well when the home mode occurred at 12%, but its performance degraded significantly at lower frequencies (2.4% and 0.66%). With regularization (applying L1 regularization to two of the reward maps and L2 regularization to the third), SWIRL was able to recover reasonable reward maps even at 2.4% occurrence. However, when the home mode appeared in only 0.66% of the data, even regularization could not achieve reasonable reward recovery. These results demonstrate that SWIRL is capable of handling reasonably low mode occurrences when appropriate regularization is applied.
We visualize the results in https://anonymous.4open.science/r/SWIRL_rebuttal-C46B/sim_sensitivity_analysis.pdf. (A) True reward maps. (B) Plot illustrating the Pearson correlation between the true and discovered reward maps over 10 runs. The x-axis represents the different datasets (12%/2.4%/0.66% home mode) and regularization settings (without/with regularization). ‘Overall’ is the correlation across all three hidden modes. ‘Home mode’ is the correlation for just the home mode. (C) SWIRL-discovered reward maps.
We thank the reviewer again for their efforts and humbly request that they consider raising the score, as we have clarified and addressed all of their concerns.
Reference: [1] Wulfmeier et al. "Imitating language via Scalable Inverse Reinforcement Learning." NeurIPS 2024 (2024). [2] Li et al. “Joint Reward and Policy Learning with Demonstrations and Human Feedback Improves Alignment.” ICLR 2025 (2025).
In this work, the Authors extend the IRL framework, designed previously to consider multiple goal maps in real-world agents concurrently (Ashwood et al & Zhu et al), by explicitly incorporating history-dependent policies and rewards into the model. Using this new framework, the Authors model several standard datasets in the field (Rosenberg et al & Markowitz et al) where they show overall improvement in performance. The relation of the proposed method to other models, viewed as its edge cases, is discussed.
Questions For Authors
What are the novel, unique insights that can be obtained from the considered and/or other data using the proposed algorithm but not the previous algorithms?
Post-rebuttal:
Upon the discussion with the Authors, I agree that 1) considering the history dependency is important and 2) that not having pre-clustered data has the potential to make a difference. Thus, I agree that the score can be raised and the work qualifies for a weak accept.
Claims And Evidence
The main claims that (1) the consideration of history-dependent policies and rewards in this specific IRL framework is novel and (2) that the consideration of the history-dependency improves the model’s predictive power on the (Rosenberg et al) dataset are correct (I elaborate in the sections below). While history-dependency has indeed been considered in other prior work, including the models of neuroscience-relevant data (e.g. by “hacking” the definition of the Markovian state through defining it as a sequence of the past states), the addition of such historic dependency to the IRL framework is novel.
Methods And Evaluation Criteria
The methods and evaluation criteria here follow the well-established pattern used previously in prior works by Ashwood et al & Zhu et al. Specifically, the work here uses the datasets by Rosenberg et al & Markowitz et al, which have been shown to be adequate for this task and have been overall widely used in the field. As such, the methods and criteria here do make sense for the task at hand.
Theoretical Claims
The methods used here make perfect sense for the task. The use of the E-M (Forward-Backward) algorithm has been previously argued to be the go-to tool for the POMDP-like problems, such as the problem here. I’ve scanned through the derivations in the Appendix; they looked overall correct.
Experimental Design And Analyses
The experimental design here follows the standard practice in the field. Apart from the use of standard datasets, the work provides a reasonable set of baseline models and ablations.
Supplementary Material
I’ve scanned through the entire supplementary material.
Relation To Broader Scientific Literature
While the Authors correctly cite prior work (including works by Ashwood et al and Zhu et al) and correctly discuss the novelty of this new work in relation to prior work (the consideration of the history-dependency), it is specifically this part that is the most concerning for me. The addition of the history dependence is, arguably, a minor tweak upon Zhu et al’s work. Zhu et al’s work, in turn, was a minor (though a much bigger) tweak upon Ashwood et al’s work and, as such, to the best of my knowledge, hasn’t made it to one of these conferences. While adding the historical dependency is an important step in the right direction, this paper would’ve been much stronger if it had provided novel insights into the existing data; however, the qualitative conclusions for the Rosenberg et al & Markowitz et al data match those of the previous work. Thus, while the work appears completely correct to me, the scope of the novelty in the current format sadly precludes me from recommending the acceptance of this paper at this point.
Essential References Not Discussed
N/A
Other Strengths And Weaknesses
N/A
Other Comments Or Suggestions
N/A
We greatly appreciate the time and effort reviewer FWjW dedicated to analyzing our work. We are sorry that the key contributions of SWIRL may have been misunderstood and apologize for not clarifying them more clearly in the paper. Below we provide a thorough discussion of the relationship between Ashwood et al., Zhu et al., and SWIRL and clarify the novel contributions of SWIRL:
1. Comparison among Ashwood et al., Zhu et al. and SWIRL
1.1 Ashwood et al. model the reward function as a dynamic combination of a number of reward maps. The inference method proposed in Ashwood et al. assumes the reward-switching dynamics are the same for every trajectory, which greatly limits its applicability. As discussed in Sec. 4.2 of our paper, Ashwood et al. used short, pre-clustered, and stereotyped labyrinth trajectories that were trialized and sampled from the original long trajectories. In their dataset, for water-restricted mice (with a water reward in the labyrinth), each trajectory is very short (20 steps) and has very similar reward-switching dynamics and state-visitation sequences (always starting from ‘home’ states, e.g., states 1 and 0, going to the water port, then returning ‘home’). They then conduct a separate experiment on another dataset of water-unrestricted mice (with no water reward in the labyrinth) to find the ‘explore’ map, where the trajectories are also very short, pre-clustered, and stereotyped.
1.2 In Zhu et al., the agent switches between a number of reward functions, and this model can handle trials with different reward-switching dynamics. However, Zhu et al. only tested it on the same labyrinth dataset as Ashwood et al. and did not really demonstrate this ability. Furthermore, in Fig. 3e of our paper, I-1 (Zhu et al.) exhibits many unreasonably fast switches between hidden modes, which shows that the method of Zhu et al. cannot accurately recover the reward-switching dynamics in the long, non-stereotyped raw behavior data. Also, Zhu et al. cannot recover rewards with action-level history dependency (e.g., Fig. 3c).
1.3 The introduction of state-dependent decision mode transitions and action-level history dependency in SWIRL is critical and enables effective reward learning on the long, non-stereotyped naturalistic trajectories of the labyrinth dataset. Compared to the datasets used by Ashwood et al. and Zhu et al., in our experiment we simply segment the original data of water-restricted mice into trajectories of 500 time steps each, without any clustering. Our trajectories are much longer, and each trajectory preserves its own reward-switching dynamics from the raw data. Our results show that only models with both decision-level and action-level history dependency can recover the switching rewards accurately.
2. Response to Questions For Authors: SWIRL provides novel insights into real animal datasets
2.1 For the labyrinth, as described above, SWIRL shows the switching between ‘home’, ‘water’ and ‘explore’ reward functions in the long, naturalistic water-restricted mouse trajectories, which is a new result that has not been reported before. Rosenberg et al. does not include an RL/IRL analysis, and Zhu et al. & Ashwood et al. only show the switching between ‘home’ and ‘water’ in the short, pre-clustered, handcrafted version of the water-restricted mouse trajectories. Our result is clearly much more naturalistic and reliable than in previous literature and supports the following important claim: mice make decisions based on both their current state and their history.
2.2 For spontaneous behavior, Markowitz et al. show that DLS dopamine activity correlates with reward through an RL experiment. We verify this claim with an IRL experiment and provide the novel insight that there exist switching hidden decision modes with varying dopamine correlations. To the best of our knowledge, we are the first to perform switching-reward analysis on this type of dataset, and our finding of switching hidden decision modes in those spontaneous behavior syllables is new.
3. In addition, we are also the first to incorporate this switching-reward idea with a model-free inverse-Q-learning-based IRL method, enabling scalable application, which is an important contribution to the ML community. We discuss the model-free SWIRL implementation in detail and demonstrate its reasonable performance on the gridworld experiment (same environment as Sec. 4.1) in Appendix D.1.
We thank the reviewer again for their effort and humbly request that they consider raising their score if the above reply adequately addresses their concerns regarding the scope of the novelty.
Thank you for your response. I would like to start a conversation with you here.
As of now, I don't think I've misunderstood the key contributions of this work, as you imply. Feel free to argue / convince me otherwise. Details below:
- The three works in question (Ashwood et al, Zhu et al, and yours) consider the same phenomenon at different timescales: Ashwood et al look at ~20-step segments; Zhu et al look at super short segments; you look at ~500-step segments. While each of the three works makes a claim that the timescale it considers is the only "correct" one (and I have my preferences here as well), the truth is that each of them corresponds to a model whose usefulness should ultimately be measured in terms of the conclusions derived from it and written in plain words. This way, Ashwood et al uncover several goal maps, which they then interpret and explore --- that is, without a doubt, useful. Zhu et al's work caters to the whole IBL / Pillow Lab idea that behavioral sequences should be modeled as short-term HMMs. While I haven't seen any novel quantitative results emerging from it, this at least conforms to a larger research agenda. What are your qualitative results? I would argue that the length of the trajectories is a modeling choice rather than a result. I would also argue that long trajectories have been considered in prior work, although in different scenarios. What does including the history-dependency teach us about the brain, compared to prior models?
Please kindly list your contributions, in plain qualitative words, and we will take it from there!
Thanks
We greatly appreciate reviewer FWjW’s willingness to start a conversation with us. Unfortunately, in ICML 2025, authors can only provide one final response after the reviewer's comment. We will try our best to clarify our contributions here.
We present the first generative IRL model able to capture naturalistic mouse behavior over hours of maze exploration with switching decision modes. SWIRL incorporates history dependency, enabling data-driven testing of the hypothesis that animal decisions depend on history in naturalistic behavior, as history dependency itself is a key concern in behavioral neuroscience [1]. [1] A.K. 'The what, how, and why of naturalistic behavior.' Current Opinion in Neurobiology (2022)
Our main contribution is a model that can capture complex naturalistic behavior that prior models lacking history dependency could not. As labs begin recording neural data during hours-long free exploration, our framework offers a powerful tool to link brain activity to naturalistic behavior: identifying when an animal seeks water, explores, or rests is essential to dissect the underlying neural circuits—each could be driven by distinct processes. This segmentation is not possible with pre-clustered, trialized data. From this perspective, the goal of prior work (Ashwood et al. (DIRL), Zhu et al.) fundamentally differs from ours.
We emphasize that the key difference between SWIRL and prior work is not the length of behavioral trajectories, but whether the model can handle raw naturalistic behavior trajectories. Detail point 2 below and the linked figure highlight the differences between our data and those used in previous work.
Details
- Animal behavior experiments in neuroscience generally fall into two types. The first involves simple, short, trialized tasks—such as 2AFC—where animals perform structured actions within brief time windows. The second involves complex, long, and non-stereotyped behaviors that are difficult to trialize, such as animals navigating a large maze. Neuroscientists are increasingly interested in the second type, as it better reflects naturalistic behavior. Advances in recording techniques have made such experiments more feasible and common.
- The Rosenberg labyrinth-with-water-port dataset was clearly designed as a second-type experiment, capturing hours of naturalistic mouse behavior in a large maze. However, by pre-clustering and downsampling the raw data, DIRL effectively constrains it to a first-type experimental setting, where behavior follows a specific path: home → water → home. All trajectories visit only the same 5–7 states, rendering most of the 127-state maze irrelevant. As a result, their ‘water’ and ‘home’ reward maps reflect a narrow, stereotyped, and trialized task rather than the original freely exploring behavior. Zhu et al. use the same processed data from DIRL and reach the same constrained conclusion. Although in theory the model of Zhu et al. is not limited to trialized behavior as DIRL is, we did test Zhu et al. (I-1) in our paper (Fig. 3E) and found that, lacking history dependency, their model cannot find meaningful hidden-mode segments in the raw naturalistic behavior data.
We visualize the difference between the original labyrinth trajectories used by SWIRL and the constrained, stereotyped trajectories used by DIRL: https://anonymous.4open.science/r/SWIRL_rebuttal-C46B/swirl_dirl_trajs.pdf.
Note: the DIRL trajectories in the above figure have unrealistic paths through walls, indicating a possible mismatch between their reported states and actual maze locations. Since their data and code release did not include the clustering and sampling pipeline (they only provided the preprocessed trajectories) and they never visualized the trajectories in their paper (nor did Zhu et al.), it is unclear what exactly went wrong. However, it is obvious that their trajectories are short, highly stereotyped, and trialized, in stark contrast to our long, naturalistic behavior data.
- In contrast, SWIRL directly uses the raw labyrinth dataset, capturing truly naturalistic, freely moving behavior over hours. This allows us to model unconstrained mouse labyrinth decision-making with a generative model and IRL for the first time. With our learned model parameters, one can even simulate realistic mouse behavior with switching rewards—embodying the principle: “What I cannot build, I do not understand.” Our model is a key step toward systematically understanding naturalistic behavior.
- In conclusion, our modeling approach represents a clear advancement beyond prior work. Our application of IRL to the original Rosenberg labyrinth dataset is entirely novel. While the reviewer noted that Zhu et al. did not publish at a major conference, we argue that our contribution is more substantial than that of Ashwood et al., which was accepted at NeurIPS.
We thank the reviewer again for their efforts and humbly request that they consider raising the score, as we have clarified our novel contribution to the study of the brain during naturalistic behavior.
This paper presents an EM-based IRL algorithm SWIRL (SWItching IRL), for learning time-varying reward functions to model animal behavior. The paper extends IRL by incorporating time-varying, history-dependent reward functions.
A key contribution of this work is that it incorporates capturing the shifting motivations and history-dependent decision-making observed in animals by modelling long behavioural sequences as transitions between short-term decision-making processes, each governed by a unique reward function.
They do this by considering history dependency at both the decision level (transitions between decision-making processes depend on previous decisions and environmental context) and the action level (actions depend on the history of states within a decision-making process). The paper is well written, coherent, and technically sound.
Questions For Authors
See above.
Claims And Evidence
The claims are well supported through multiple experiments and various baselines.
The claims on originality and novelty are well supported, i.e., it is the first IRL model to integrate both decision-level and action-level history dependency. The claims of a principled method with empirical validation are also very well supported, i.e., a mathematical formulation of SWIRL, including detailed explanations of how history dependency is incorporated at different levels, and a clear demonstration of improvements over baseline methods. In addition, the paper is well written and presented. The claim on the relevance and importance of the work lies in the modeling of animal decision-making, which has the potential to further our understanding of intelligent behavior.
Methods And Evaluation Criteria
To test the efficacy of the proposed method, SWIRL is empirically tested on simulated data and real-world animal behaviour datasets.
Multiple baselines have been compared, such as Locally Consistent IRL (Nguyen et al., 2015), ARHMM (Wiltschko et al., 2015), rARHMM (Linderman et al., 2016), and the ablations I-1, I-2, S-1, S-2. The evaluation is competitive, and even though the approach compares similarly, it does consistently well and even favorably when compared to Nguyen et al. (2015), which is very positive.
The paper further provides connections between SWIRL and autoregressive dynamics models, arguing that SWIRL offers a more generalized and principled approach to characterizing animal behaviour. Results presented demonstrated that it outperforms existing models lacking history dependency, both quantitatively and qualitatively.
Theoretical Claims
One of the major concerns with the proposed method is the difficulty of scaling to larger state spaces. Would these results hold there? How might the authors propose to tackle this? Would it be possible to outline how the method generalizes to larger state spaces?
It seems that a longer L in SWIRL requires a considerable amount of computing resources; how does the method work in this case? Have you considered environments where a longer L was required? Was the method still competitive compared to baselines? It would be useful to add discussion around this. Additionally, it would be helpful if you could point me to experiments that consider a longer L.
Experimental Design And Analyses
Experiments are clear and results are reported with rigor. The complexity and scalability of the method are unclear, given that the gridworld experiments are limited in scale relative to the dimensions one might consider in practice.
It is not clear how, in order to use this method, one might go about choosing the number of hidden modes. Was this a hyperparameter? Is there a way to meta-learn this variable?
Supplementary Material
Yes I skimmed the complexity analysis.
Relation To Broader Scientific Literature
The work is well positioned in the literature. Multiple baselines have been compared, such as Locally Consistent IRL (Nguyen et al., 2015), ARHMM (Wiltschko et al., 2015), rARHMM (Linderman et al., 2016), and the ablations I-1, I-2, S-1, S-2. The evaluation is competitive, and even though the approach compares similarly, it does consistently well and even favorably when compared to Nguyen et al. (2015), which is very positive.
The paper further provides connections between SWIRL and autoregressive dynamics models, arguing that SWIRL offers a more generalized and principled approach to characterizing animal behaviour. Results presented demonstrated that it outperforms existing models lacking history dependency, both quantitatively and qualitatively.
Essential References Not Discussed
No
Other Strengths And Weaknesses
Since a major contribution here is the consideration of action-level history, for experiments that only consider a history length of 2 (S-2), even though it is better than a length of 1, this seems too short a length to showcase the utility of history. Is this a result of the environments? How is this length determined or recommended by the method?
Other Comments Or Suggestions
See above.
We sincerely thank the reviewer 9ThW for their detailed and thoughtful comments on our paper. We are pleased that the reviewer recognized the novelty of our work, the technical soundness, its potential impact on intelligent behavior research, and the overall paper presentation. Below, we address each of the questions and concerns:
1. Scalability and Generalization to Larger State Space
1.1 Our SWIRL framework is compatible with larger state-action spaces. As discussed in Sec. 3.4, the SWIRL inference procedure can be carried out in a scalable way (model-free IRL) instead of the model-based approach shown in the main body of the paper. It is worth noting that IRL in large and/or continuous state-action spaces is a very challenging problem. Even state-of-the-art standard IRL methods, when assuming a single static reward function, still cannot accurately recover the ground-truth reward. Despite limitations in recovering the true reward function, IRL methods can still produce policies that approximate the expert’s behavior. As a result, recent IRL literature [1][2] in such settings usually focuses on policy recovery performance rather than the reward function.
1.2 In Appendix D.1, we present a detailed discussion of a scalable model-free variant of SWIRL and evaluate its performance on the same gridworld experiment as described in Section 4.1 of the main text. Despite being model-free, this scalable variant of SWIRL is still capable of achieving reasonable reward recovery. We expect that in environments with large state-action space, SWIRL will perform worse in reward recovery but still maintain reasonable policy recovery.
1.3 SWIRL has the potential to be further extended to continuous state-action spaces. We can change the DQN-style inverse-Q-learning approach in Appendix D.1 to a SAC-style one ([1] already shows SAC is compatible with inverse-Q-learning), building a SWIRL variant compatible with continuous state-action spaces. Further extension to RNN/transformer structures could improve SWIRL’s performance on data with longer history dependency.
2. Selection of number of hidden modes
Yes, it is a hyperparameter. We discussed the selection of the number of hidden modes in Appendix B.4.1. and Appendix B.4.3. It can be meta-learned based on the trend of test LL. In general, we recommend selecting the number of hidden modes at the point where the test log-likelihood curve plateaus.
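As a sketch of this plateau heuristic (the held-out log-likelihood values below are placeholder numbers for illustration, not results):

```python
def choose_num_modes(test_lls_by_K, tol=1e-2):
    """test_lls_by_K: dict mapping number of modes K -> held-out log-likelihood.
    Pick the smallest K after which the improvement falls below `tol`."""
    Ks = sorted(test_lls_by_K)
    chosen = Ks[0]
    for prev, K in zip(Ks, Ks[1:]):
        if test_lls_by_K[K] - test_lls_by_K[prev] < tol:
            break
        chosen = K
    return chosen

# placeholder numbers, for illustration only
print(choose_num_modes({1: -1.60, 2: -1.42, 3: -1.35, 4: -1.345, 5: -1.344}))  # -> 3
```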
3. Selection of L for labyrinth experiment
We discussed longer-L experiment results (S-3 & S-4) in Appendix B.4.2. Generally, L should be selected based on the trend of the test LL curve as well as the recovered hidden-mode segments and reward maps. In this case, since the hidden-mode segments and reward maps remain similar for L >= 2, we chose to show S-2 as the main experiment.
4. Consideration of longer L
4.1 For model-based SWIRL, the feasibility of a longer L depends on the environment size. In the labyrinth environment (127 states, 4 actions), model-based SWIRL can run reasonably fast (within 3 hours) on an L40S GPU with L up to 5. For a smaller environment such as the one in the spontaneous behavior experiment (9 states, 9 actions), it is easy to test a long L (e.g. L=10) with our model-based SWIRL.
4.2 For environments requiring a longer L, we can use the scalable model-free SWIRL proposed in Appendix D.1, but at the cost of inaccurate reward recovery. On the labyrinth dataset, we find it easy to test a longer L (e.g. L=20) with the model-free SWIRL. However, mouse behavior in this dataset does not show such long action-level history dependency, and such a long L actually leads to a lower test LL.
We thank the reviewer again for their effort and humbly request that they consider raising their score if the above reply adequately addresses their concerns.
Reference: [1] Garg et al. "Iq-learn: Inverse soft-q learning for imitation." Advances in Neural Information Processing Systems, 34 (2021). [2] Zeng et al. “Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees”. Advances in Neural Information Processing Systems, 35 (2022).
This work proposes SWIRL, an inverse reinforcement learning model with applications to the analysis of animal behavior. While previous work has modeled a single unobserved reward function underlying such behaviors, SWIRL posits both a latent state (e.g., hunger, thirst) with transitions and actions dependent on the previous sequence of states. (Thus, the method is not 1-Markov, as typically assumed.) The authors provide an efficient method for inference in this model in the tabular setting, with potential links to more complex problem setups, as well as a set of experiments on existing behavioral data.
Reviewers agreed that the experiments, collection of comparison models, and quality of the presentation were good. Some reviewers questioned the novelty of the present work relative to some comparison IRL methods, though experiments showed that the addition of a latent switching state improved the interpretability and accuracy of model fits to data.
Overall, a solid paper that contributes to an active area of research in behavioral neuroscience.