Interactive and Hybrid Imitation Learning: Provably Beating Behavior Cloning
Measure cost per state, not per trajectory: Stagger (State-wise DAgger) and our hybrid Warm-Stagger beat Behavior Cloning, giving the first formal proof that state-wise interactive IL outperforms Behavior Cloning.
Abstract
Reviews and Discussion
The paper analyzes a DAGGER-style algorithm which has access to both expert data, as well as interaction with the environment with access to a state-labeling oracle. The paper demonstrates theoretically how having access to both sources of data can lead to algorithms with better sample complexity.
Strengths and Weaknesses
Strengths:
- The paper furthers our understanding of DAgger-style algorithms which are fundamental to modern data pipelines for robotics.
- The presentation is excellent, with a clear and intuitive proof sketch for the core result. The main Theorem 6 and proof sketch provide a clear framework for how to think about the tradeoffs in interaction algorithms.
Weakness:
- The paper makes use of Hellinger/TV-style distances throughout and uses a very strong indicator-based loss. It is unclear if these results transfer to a continuous-control setting, which is the principal setting of the experiments and presumably downstream applications. In particular, bounds for continuous settings by necessity require much stronger assumptions on the expert (c.f. "The Pitfalls of Imitation Learning when the Action Space is Continuous", Simchowitz et al.). I believe there is considerable value in studying the tabular MDP case, but that the general claims need to be narrowed to better reflect that this only applies to discrete MDPs.
- The original DAgger paper considers probabilistic switching between the expert and the learned policy on data collection rollouts, whereas WARM-STAGGER/STAGGER appear to roll out the learned policy directly after collecting the initial data. Properly tuned probabilistic switching between the expert and the learned policy should prevent the "warm start issues" mentioned in the Proof Sketch of Theorem 6 (see the "beta" parameter in the original DAgger paper by Ross et al.). Slow-mixing DAgger should explore E with higher probability, and therefore I believe would have similar properties to the WARM-STAGGER algorithm presented in this work.
Questions
- The experiments run are all in continuous state and time, while the loss and theory use Hellinger distances and indicator functions. Considering that continuous imitation learning can be exponentially harder than its tabular MDP equivalent ("The Pitfalls of Imitation Learning when Actions are Continuous," Simchowitz et al.), would it make more sense to consider discrete/language-style tasks instead of continuous control tasks?
- The paper uses "hybrid imitation learning" to describe the problem setting, yet similar algorithms like "vanilla" DAgger and DART are usually described as "online" or "interactive" imitation learning algorithms and have the same premises as the hybrid setting described in the paper: access to an expert oracle and further rollouts. It would be helpful for the authors to clarify what they view as "offline" vs. "hybrid" vs. "online" imitation learning. Perhaps "interactive" and "non-interactive" would be better terminology, to avoid overloading with terminology from Reinforcement Learning?
Limitations
Yes
Final Justification
I think this paper is a fine addition to the literature. It is limited to the discrete-action setting, as it requires a TV-style loss, although there are tasks (such as language) where this is applicable. As the authors note in the paper, Theorem 4, which shows performance matching at least that of the baselines, is somewhat weak. However, the specific construction underlying Theorem 6, on which their method provably performs better than the baselines, is very clean and presents a nice picture for how to think about interactive methods.
I overall retain my original score of 5: as a theoretical work this is very nicely presented and thorough, although not groundbreaking (they analyze a longstanding, existing method), and the theory certainly has limitations.
Formatting Concerns
No formatting concerns.
We thank reviewer 9Cf7 for the thoughtful and detailed feedback. We appreciate your recognition of our theoretical framework, the clarity of Theorem 6, and the relevance of DAgger-style analysis to modern robotics. Below, we address your specific concerns.
For W1 and Q1: Applicability of Theoretical Results to Continuous Control
We thank the reviewer for this insightful observation. We agree that our theoretical analysis is specific to discrete-action MDPs and does not directly extend to continuous control settings such as MuJoCo. Our intention was to use MuJoCo primarily to illustrate the practical viability of our state-wise annotation model and algorithms, rather than to validate our theoretical bounds. We agree that continuous imitation learning presents unique challenges such as exponential-in-horizon compounding errors, and that DAgger-style theoretical analysis in this setting remains largely open.
We will revise relevant statements to clarify this distinction:
"Empirical results on the synthetic MDP support our theoretical findings, while MuJoCo experiments demonstrate the practical viability and competitive performance of our state-wise annotation model and algorithms on continuous control tasks."
That said, we are currently conducting additional experiments on discrete-action environments (e.g., Atari) and language model distillation tasks, which more directly align with our theoretical framework. Results from these ongoing efforts will be included in the final or extended version of the paper.
For W2: Probabilistic Switching vs. Warm-Start
For clarity of exposition, we analyze DAgger without stochastic mixing with the expert (as also done in [1]), since setting the expert-mixing parameter beta to zero yields the best policy suboptimality guarantee in the original DAgger paper. We agree that stochastic mixing with the expert is a useful practical heuristic. Meanwhile, Warm-Stagger can also be viewed as a form of stochastic mixing, with beta = 1 for the offline trajectories and beta = 0 for the interactive rounds. While properly tuned probabilistic switching may also mitigate warm-start issues in our MDP example, we adopt this hybrid IL formulation to enable a clean analysis. Extending our framework to incorporate stochastic mixing is an interesting direction for future work.
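For concreteness, here is a minimal sketch (our paraphrase, not the paper's code) of the beta-mixed data-collection step discussed above; `env`, `expert`, and `learner` are hypothetical callables. With beta = 1 every action comes from the expert, matching the offline trajectories, and with beta = 0 the learner rolls out on its own, matching Warm-Stagger's interactive rounds.

```python
import random


def beta_mixed_rollout(env, expert, learner, beta, horizon):
    """Collect one DAgger-style rollout, mixing expert and learner actions.

    At each step the expert action is executed with probability `beta`,
    otherwise the learner's action is executed; the expert label is recorded
    for every visited state either way. Here `env.step` is assumed to return
    the next state, and `expert`/`learner` map states to actions.
    """
    labeled_states = []
    state = env.reset()
    for _ in range(horizon):
        expert_action = expert(state)            # state-wise oracle label
        labeled_states.append((state, expert_action))
        action = expert_action if random.random() < beta else learner(state)
        state = env.step(action)
    return labeled_states
```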
For Q2: Clarification of Terminology – "Offline", "Interactive", "Hybrid"
We appreciate the reviewer's suggestion. In this paper, we deliberately use the terms "offline", "interactive", and "hybrid" to distinguish between three supervision regimes. The term "online" only appears in the context of online learning as induced by our theoretical reduction, and we deliberately avoid using "online imitation learning" to prevent confusion.
Thank You!
We hope the above clarifications address your concerns, and we sincerely thank you for your thoughtful and supportive feedback.
[1] D. J. Foster, A. Block, and D. Misra. “Is behavior cloning all you need? Understanding horizon in imitation learning.” NeurIPS 2024.
Thank you for the thoughtful responses to my questions. I don't personally see a need to distinguish hybrid imitation learning as a new, distinct setting but that is more a stylistic choice.
I think analysis for learning algorithms in discrete spaces, although limited in its scope, is certainly interesting in its own right, and agree that greater emphasis on discrete-action environments and language tasks, where this analysis applies, would improve the paper. Overall, Theorem 6 and the associated construction is very nicely presented and a nice contribution!
We're glad to hear that you found Theorem 6 and its construction valuable. The hybrid setting offers a quantitative framework for jointly measuring offline and interactive annotation costs in practical applications. Emphasizing discrete-action environments and language tasks would further highlight the strengths of our analysis, and this direction will be incorporated in future revisions.
This paper proposes new interactive algorithms for imitation learning. Motivated by several recent theoretical findings surrounding behavior cloning, the authors look into whether low-cost interaction can benefit behavior cloning, including in the case of function approximation. Specifically, the authors focus on how to minimize the number of expert queries, such that expert performance is achieved or some theoretical gap can be proven to occur with high probability.
Specifically, the proposed algorithm, STAgger, executes a mixture policy at each step (it seems it can effectively query a different policy at each state along the trajectory), obtaining a single sample and using it to update the set of likelihoods over a finite policy class, to which the expert is assumed to belong.
Experiments across a few MuJoCo tasks show that STAgger outperforms behavior cloning in terms of sample efficiency.
Strengths and Weaknesses
The paper is pretty theoretically sound -- the underlying theory is classical no-regret theory from the original DAgger paper for the most part, which is well-known. The experiments also show good results of the proposed method, which is pretty neat for imitation learning. As for the method, it is a pretty straightforward extension to DAgger in the case of having a finite set of policies -- you effectively want to reduce your space down through queries on the mixture distribution, and update the mixture to bias towards the (expert) labels given at each query.
There are no theoretical weaknesses in the paper (e.g. nothing is really wrong from the theory side), but from the perspective of significance, I'm not sure how much this has from a practical standpoint. There weren't many details given about the experimental setup (my big concerns are posed in the Questions section), and it seems that with something like this (if the set of policies given to you is reasonably small), a simple baseline could just be to find the policy in the (known) class that does the best on the expert dataset (e.g. some type of test-set logic here, although I may be misunderstanding this), in which case no interactions are required. It seems that interactive IL is great for theory, but not great from a practical perspective, and I'm not sure how much theoretical contribution this brings for now.
Questions
There are a few things I would like to clear up personally -- how many members of the policy class do you experiment with when you run these MuJoCo tasks? The task seems like you have a set of deterministic or Markovian policies, and you're effectively learning some sort of router/mixture of them that should in principle converge to a Dirac over the true expert. The appendix didn't give much in terms of the experimental paradigm, so I was wondering how many policies this allows for.
To scale this to resemble something close to what we may have in practice, how does this scale to a (finite) set, but which has a large number of policies? This is in effect something that is more realistic, where we would want to choose the expert from this set as quickly and with as little interaction with the oracle as possible. Maybe an ablation study can be done for that.
Limitations
The authors do mention some limitations of their work, including expert realizability (strong assumption) and lack of modeling of the environment query cost. It is clear (from an IL practitioner's perspective) that one can only be as good as an expert on any IL task at optimum, so design choices around the expert and the data collection process before IL takes place should be carefully considered from a societal standpoint. Maybe these can be put together as an addition to the last section of the paper (e.g. a "Limitations" section).
Final Justification
After the authors clarified some misconceptions I had about the paper, as well as reading what other reviewers had discussed, I will raise my score to 4.
Formatting Concerns
None for me.
We are glad that reviewer yovM found our work “theoretically sound” and appreciated that the “experiments also show good results.” Below we address specific concerns:
For W: Practical Significance and Baseline Design
We appreciate the reviewer's thoughtful question. Our work is motivated by the need to understand and formally analyze warm-start and non-full-trajectory annotation strategies in interactive imitation learning, despite their widespread use in practice (e.g., [3,4]). As noted by Reviewer 9Cf7, DAgger-style algorithms are fundamental to modern data pipelines for robotics (e.g., [6,7]).
As noted in Appendix G, we train a single parametric MLP using log-loss with empirical risk minimization (ERM), retraining the policy after each newly annotated state. Thus, in our experiments our method does not rely on a small set of predefined policies. We apologize for the confusion, and will clarify this more explicitly in the main experimental section.
We interpret the reviewer's suggestion of "test set logic" baseline as behavior cloning (BC), which we already benchmark. Please let us know if we misunderstood the suggestion.
For Q1 and Q2: Scalability of Policy Class and Experimental Setup
We use a single parametric MLP policy in the MuJoCo experiments, trained via log-loss with ERM. There is no predefined finite policy class or mixture of policies. In our synthetic MDP experiment, where the policy class scales combinatorially (e.g., the setting of Figure 2), we simulate the effect of a randomized mixture by assigning random actions to previously unseen states.
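To make the experimental protocol concrete, here is a minimal sketch of the state-wise annotation loop as we describe it above (not the actual implementation; `env`, `expert_label`, and `train_mlp` are hypothetical, and sampling a uniformly random visited state is one simple way to approximate the averaged visitation distribution):

```python
import random


def state_wise_annotation_loop(env, expert_label, train_mlp, dataset,
                               horizon, num_rounds):
    """Interactive rounds with a single parametric policy (our paraphrase).

    `dataset` may already contain offline expert demonstrations (warm start),
    `expert_label(state)` is the state-wise annotation oracle, and
    `train_mlp(dataset)` fits an MLP policy by log-loss ERM. `env.step` is
    assumed to return the next state.
    """
    policy = train_mlp(dataset)
    for _ in range(num_rounds):
        # Roll out the current learner policy for one episode.
        states, state = [], env.reset()
        for _ in range(horizon):
            states.append(state)
            state = env.step(policy(state))
        # A uniformly random timestep approximates a draw from the
        # on-policy averaged state-visitation distribution.
        queried_state = random.choice(states)
        dataset.append((queried_state, expert_label(queried_state)))
        # Retrain after each newly annotated state.
        policy = train_mlp(dataset)
    return policy
```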
For Limitation: Realizability Assumption, Environment Cost Modeling, and Expert Design
We agree that we should be aware of the suboptimality of the expert in practical deployment of imitation learning; in some applications it may be preferable to combine imitation and reinforcement learning signals (e.g., [8]). We will add this discussion to a "Limitations" section in the final version. Modeling environment query cost is an interesting direction, which we intentionally omitted for clarity of presentation (see line 301), and aim to explore in future work.
Thank You!
If the above clarifications address your concerns, we would greatly appreciate a reconsideration of the score.
[3] M. Kelly, C. Sidrane, K. Driggs-Campbell, and M. J. Kochenderfer. “HG-DAgger: Interactive imitation learning with human experts.” ICRA 2019.
[4] R. Hoque, A. Balakrishna, E. Novoseller, A. Wilcox, D. S. Brown, and K. Goldberg. “ThriftyDagger: Budget-aware novelty and risk gating for interactive imitation learning.” CoRL, 2022.
[6] L. X. Shi et al. “Yell at your robot: Improving on-the-fly from language corrections.” RSS, 2024.
[7] Y. Jiang, C. Wang, R. Zhang, J. Wu, and L. Fei-Fei. “TRANSIC: Sim-to-Real Policy Transfer by Learning from Online Correction.” CoRL, 2024.
[8] S. Ross and J. A. Bagnell. “Reinforcement and imitation learning via interactive no-regret learning.” arXiv, 2014.
Hi,
Thank you for the information! I can now understand most things in the paper. Due to new practical knowledge I've learned (thanks to you and Reviewer 9Cf7), I will raise my score to 4.
To clear up any confusion, by test-set logic, I think I had meant something like "whatever policy fits the test set best should be the one to work with" as is the case in model selection for standard supervised learning.
Hi yovM,
Thank you for your careful review and for reading through the comments from other reviewers. We're very glad to hear that the discussion has been helpful and appreciate your updated evaluation.
In our experiments, we adopt a similar notion of “test-set logic”, where the next-round policy is selected as any policy that fits the accumulated data. We will clarify this in the final version.
This paper explores theoretical underpinnings of when interactive imitation learning (i.e. variants of DAgger) can provably outperform vanilla behavior cloning (BC), and when a hybrid approach can provably beat both pure DAgger and BC, given the same budget of "expert annotations". To this end, the paper considers an amortized version of DAgger denoted STAgger, which instead of querying expert labels for each state along an on-policy rollout, simply draws a single sample from the on-policy averaged state visitation distribution. A guarantee for STAgger is established, showing that the performance upper bound can be smaller than that of BC as long as the expert is sufficiently recoverable and the relative cost of state-wise oracle queries is not too large. Given this result, a natural extension denoted Warm-STAgger is considered for hybrid settings, where an offline dataset is available in addition to online expert queries. Warm-STAgger sets the initial rollout policy to an empirical risk minimizer over the offline dataset and performs STAgger on top. This algorithm is demonstrated to satisfy a bound no worse than either pure offline BC or pure online STAgger. Some lower bound constructions and numerical experiments are provided to demonstrate the benefits of online interaction.
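To fix ideas, here is a minimal sketch of the finite-policy-class setting summarized above, specialized to a deterministic realizable expert so that the likelihood update reduces to version-space elimination (our paraphrase, not the paper's algorithm statement; `env`, `expert_label`, and the per-rollout policy sampling are assumptions):

```python
import random


def stagger_round(env, expert_label, version_space, horizon):
    """One STAgger round over a finite policy class (deterministic,
    realizable expert), so the update reduces to version-space elimination.

    Roll out one surviving policy (a stand-in for executing the uniform
    mixture), draw a single state from the rollout (a uniformly random
    timestep approximates the averaged state-visitation distribution),
    query the expert at that state, and keep only consistent policies.
    `env.step` is assumed to return the next state.
    """
    rollout_policy = random.choice(version_space)
    states, state = [], env.reset()
    for _ in range(horizon):
        states.append(state)
        state = env.step(rollout_policy(state))
    queried_state = random.choice(states)
    label = expert_label(queried_state)           # single state-wise query
    return [pi for pi in version_space if pi(queried_state) == label]


def warm_stagger(env, expert_label, policy_class, offline_data,
                 horizon, num_rounds):
    """Warm start from offline expert data, then run STAgger rounds."""
    version_space = [pi for pi in policy_class
                     if all(pi(s) == a for s, a in offline_data)]
    for _ in range(num_rounds):
        version_space = stagger_round(env, expert_label,
                                      version_space, horizon)
    return version_space
```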
Strengths and Weaknesses
Overall, this paper introduces an interesting dimension to imitation learning. The paper is generally well-written. Thus, I am leaning toward acceptance. My main concerns with the paper are regarding the actual efficacy of the proposed algorithm, and the degree to which the theorems convincingly establish the uniform benefit of hybrid imitation learning.
Strengths
The focus on reducing the number of online expert oracle queries is a perennially relevant one, and this paper achieves an interesting result that a reduced form of DAgger may actually perform on par with or better than offline BC and DAgger, as indexed by the budget of expert queries. This has practical relevance when on-policy expert queries may be expensive compared to collecting expert policy rollouts.
Weaknesses
In terms of theoretical soundness, Theorems 2 (BC), 3 (STAgger), 4 (Warm-STAgger) are technically comparing worst-case performance upper bounds. To be clear, the authors do not claim otherwise, but this technically renders claims of one method being generically better than another difficult to conclude. Theorem 6 partially closes the gap with instance constructions. However, it remains unclear when Warm-STAgger generically instance-wise dominates the performance of (ST,D)Agger or BC.
It is somewhat precarious to present the experimental results in continuous control as support for the theoretical bounds in this paper. In particular, it is known that imitation learning in control systems suffers from unique problems that otherwise do not manifest in discrete-action settings [1, 2]. Notably, imitation learning in control systems can suffer from exponential-in-horizon compounding errors. This does not conflict with, e.g., the loglossBC bound, because such bounds assume policy estimation in losses such as indicator or log-loss, which are not feasible in continuous action spaces. I am not aware of analysis of DAgger in these settings, though I imagine expert annotation along an exponentially unstable policy rollout likely yields exponentially low "coverage" around on-expert trajectories.
Minor comments
- Some inconsistent use of ln versus log, e.g. line 108
- Line 210: the equality should be an inequality
- Should the loss used for fitting the BC policy technically be the log-loss to instantiate Theorem 2? The set of empirical risk minimizers is the same for deterministic realizable experts, but it does matter with respect to which loss generalization is proven in order to yield the horizon-independent bound in Theorem 2.
[1] Tu et al. "On the Sample Complexity of Stability Constrained Imitation Learning"
[2] Simchowitz et al. "The Pitfalls of Imitation Learning when Actions are Continuous"
Questions
Please refer to Weaknesses above. My main questions can be summarized as:
When can sparse corrections like those in Warm-STAgger typically outperform both BC and a BC warm-started DAgger (as is typically done in practice)? In other words, in what regimes is Warm-STAgger actually expected to outperform both BC and (Warm-)DAgger, rather than "falling back" to the behavior of one or the other? When DAgger also gets a warm start, when is expert annotation along the full trajectory provably "wasteful" compared to state-wise annotation?
Clarifying these points will alleviate some of my hesitation about interpreting the results.
Limitations
Yes
Final Justification
I maintain my generally positive evaluation, and lean acceptance (4) on the paper. I still maintain concerns about: 1. general performance upper bounds instead of instance-wise guarantees, and 2. when sparse annotations can beat both BC and (Warm) DAgger rather than fall back to the better-performing bound of the two. These are probably non-trivial extensions of the work the authors have already presented, and therefore they do not significantly impact my positive evaluation.
Formatting Concerns
None.
We thank reviewer rsrZ for the thoughtful and constructive review. We appreciate your recognition of our theoretical contributions and your positive evaluation. Below we address specific concerns:
For W1: Worst-Case Bounds and Instance-Wise Comparisons
We acknowledge the reviewer’s valid observation that Theorems 2–4 provide worst-case upper bounds, which do not imply generic instance-wise superiority. We are glad that Theorem 6 was seen as a meaningful step toward closing this gap, and we appreciate the reviewer’s recognition of our efforts to go beyond worst-case analysis.
For W2: Applicability of Continuous Control Experiments to Support Theoretical Bounds
Indeed, our theoretical guarantees apply only to discrete-action MDPs. The MuJoCo experiments are intended to empirically evaluate the practical effectiveness of the proposed algorithms, not to validate the theory. We will revise related statements to: "Empirical results on the synthetic MDP support our theoretical findings, while MuJoCo experiments demonstrate the practical viability and competitive performance of our methods on continuous control tasks." to accurately reflect this distinction.
We agree that continuous imitation learning presents unique challenges such as exponential-in-horizon compounding errors, and that DAgger-style theoretical analysis in this setting remains largely open, which is an interesting direction. We will explicitly acknowledge this in the experiment section.
Our analysis is based on log-loss, which reduces to the indicator loss under deterministic experts in discrete-action settings. While indicator losses are indeed incompatible with continuous action spaces, log-loss remains applicable. Our implementation employs log-loss for continuous control tasks (see Appendix G, Line 1454), consistent with the Walker2d-v4 experiment with logarithmic loss in Foster et al. (2024) [1].
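For completeness, the reduction we have in mind is the following brief sketch (our notation; $\pi_E$ denotes the deterministic expert, and we restrict attention to deterministic policies over a discrete action space):
$$\ell_{\log}(\pi; s) = -\log \pi\big(\pi_E(s) \mid s\big), \qquad \ell_{0\text{-}1}(\pi; s) = \mathbb{1}\{\pi(s) \neq \pi_E(s)\}.$$
Since $\pi(\pi_E(s) \mid s) \in \{0, 1\}$ for a deterministic $\pi$, the log-loss takes values in $\{0, \infty\}$, and a policy attains zero empirical log-loss exactly when it attains zero empirical indicator loss; under realizability the two losses therefore induce the same set of empirical risk minimizers, as the reviewer also notes.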
For Minor Comments
- Inconsistent use of ln vs. log: We will use log consistently in the final version.
- Line 210, equality vs. inequality: Though the equality is technically correct in our setting, we will revise it to an inequality for consistency of presentation.
- Loss used in Theorem 2: We will clarify that Theorem 2 establishes a new horizon-independent generalization bound under log-loss, distinct from standard 0-1 loss analyses in BC.
For Q: When Does Warm-Stagger Provably Outperform BC and Warm-DAgger?
We appreciate the reviewer’s thoughtful question. To clarify, our work is motivated by the need to understand and formally analyze warm-start and non–full-trajectory annotation strategies in interactive imitation learning, despite their widespread use in practice (e.g., [3,4]).
To the best of our knowledge, we are the first to formally analyze hybrid imitation learning that combines warm-start with interactive annotation. Our analysis framework is also transferable and can be extended to analyze other DAgger-style algorithms initialized from offline data.
We are not exactly sure which version of the Warm-DAgger algorithm the reviewer is referring to; thus we present and discuss three possible interpretations below:
(1) Naively changing Warm-Stagger (Algorithm 2) to query full trajectory-wise annotation at step 5.
This variant warm-starts with offline data and, in each interaction round, executes an on-policy rollout and retroactively labels the entire trajectory. Following a similar analysis, the resulting suboptimality bound matches the Warm-Stagger bound, except that the number of state-wise annotations is replaced by the number of annotated trajectories, so the same guarantee incurs an extra factor of the horizon H when the cost is measured in state-wise annotations.
(2) Adding a warm-start procedure to the LogLossDAgger in [1] (see their Proposition 2.2 and Appendix C.2).
Translating their deterministic-expert guarantee into our notation (a fixed finite policy class and on-policy full-trajectory queries), and following a similar analysis with a warm-start initialization, the resulting suboptimality bound for warm-start LogLossDAgger again incurs an extra factor of the horizon H in state-wise annotation cost.
(3) Adding a warm-start procedure to our Algorithm 3, Tragger.
By our Theorem 26 (which achieves a better guarantee than [1] and does not require recoverability), Tragger admits a suboptimality bound under a deterministic expert, and the same analysis extends to a warm-start initialization from offline trajectories. This is, to our knowledge, the best known guarantee for warm-start DAgger under log-loss with offline initialization and trajectory-wise annotations. However, trajectory-wise annotations are more expensive than the single state-wise annotations used by Warm-Stagger, and therefore Warm-Stagger remains more sample-efficient under state-wise annotation.
In sum, under our analysis, given a fixed target policy suboptimality, Warm-Stagger achieves a better state-wise annotation cost than all three warm-start DAgger variants we consider. Therefore, if the considered variants align with the reviewer's intended formulation, our analysis shows that sparse corrections are provably more efficient.
A final remark to intuitively understand when full-trajectory expert annotation becomes wasteful compared to sparse corrections: consider MDPs with self-absorbing states (as in the lower bounds of [1,5]). In such settings, annotating all identical states in a trajectory provides no more information than annotating a single state.
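As a rough illustration (our notation): suppose a rollout reaches a self-absorbing state $e$ after step $k$, so the annotated trajectory is
$$\underbrace{(s_1, \pi_E(s_1)), \dots, (s_k, \pi_E(s_k))}_{k \text{ distinct labels}}, \quad \underbrace{(e, \pi_E(e)), \dots, (e, \pi_E(e))}_{H - k \text{ repeated labels}},$$
so a full-trajectory annotation pays for $H$ state labels but supplies at most $k + 1$ distinct ones, whereas a state-wise query spends its budget on a single label.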
Thank You!
We would be happy to include empirical comparisons of Warm-start DAgger variants (specifically, variants (1) and (3)) on MuJoCo benchmarks in the appendix, to further support our theoretical conclusions. If the above response addresses your concerns, we would sincerely appreciate a reconsideration of the score.
[1] D. J. Foster, A. Block, and D. Misra. “Is behavior cloning all you need? Understanding horizon in imitation learning.” NeurIPS 2024.
[3] M. Kelly, C. Sidrane, K. Driggs-Campbell, and M. J. Kochenderfer. “HG-DAgger: Interactive imitation learning with human experts.” ICRA 2019.
[4] R. Hoque, A. Balakrishna, E. Novoseller, A. Wilcox, D. S. Brown, and K. Goldberg. “ThriftyDagger: Budget-aware novelty and risk gating for interactive imitation learning.” CoRL, 2022.
[5] N. Rajaraman, Y. Han, L. Yang, J. Liu, J. Jiao, and K. Ramchandran. “On the value of interaction and function approximation in imitation learning.” NeurIPS 2021.
I thank the authors for their clarifications. I maintain my generally positive evaluation, and lean acceptance on the paper. I still maintain concerns about: 1. general performance upper bounds instead of instance-wise guarantees, and 2. when sparse annotations can beat both BC and (Warm) DAgger rather than fall back to the better-performing bound of the two. These are probably non-trivial extensions of the work the authors have already presented, and therefore they do not significantly impact my positive evaluation.
As for a small point of clarification regarding log-loss: the authors mention that log-loss is applicable to continuous control. This is true only if the policy class is stochastic, which is reflected by the authors using a Gaussian-parameterized policy in the experiments. However, from a learning-theoretic point of view, this can be problematic, as the expert policy (especially in many settings of continuous control) may be deterministic (consider the classical Linear-Quadratic Regulator's deterministic linear policy), which: 1. puts realizability into jeopardy, and 2. driving the policy class to incorporate deterministic continuous policies makes log-loss generalization bounds increasingly vacuous, recovering the same issue indicator losses suffer on continuous state-action spaces. This is largely a theoretical issue, but is worth mentioning.
Thank you for the thoughtful comment and continued positive evaluation. We will discuss and clarify the limitations of log-loss in continuous control settings in the final version.
This paper investigates the benefits of state-based annotations for imitation learning. A new imitation algorithm with state-based annotation, STAGGER, is developed and settings with better cost-sample efficiency tradeoff than offline imitation (behavioral cloning) are analyzed. A hybrid method that initializes from offline imitation, called WARM-STAGGER, is also developed to outperform both purely offline and the base STAGGER algorithms. An MDP is constructed to demonstrate this. Experiments in MuJoCo also demonstrate the benefits of the approach.
Strengths and Weaknesses
- This paper makes an important contribution to imitation learning by establishing the theoretical benefits of interactive imitation learning.
- The paper is nicely organized and well written in motivating, defining, and supporting its main claims.
- The description of related work makes the originality and significance of the paper's contributions clear.
- Despite being heavily based in sample complexity theory, the developed algorithms are practical.
- The experiments effectively demonstrate the benefits of the developed approach.
Overall, this is a solid paper with both theoretical and experimental contributions.
Questions
Q1. Are the theoretical bounds meaningful for the MuJoCo environment? In other words, can R and \mu be provided/estimated for each environment (and the resulting ratios discussed)?
Q2. How reliant do you anticipate the performance of your algorithms to be on the assumptions of the expert policy being realizable and deterministic?
Limitations
yes
Final Justification
This paper makes both theoretical and experimental contributions to the imitation learning literature, so I recommend acceptance.
Formatting Concerns
None
We thank reviewer p7ue for the kind and supportive review. We are glad you found the work a solid contribution both theoretically and experimentally.
For Q1: On the Applicability of Theoretical Bounds to MuJoCo
We agree that our theoretical analysis is limited to discrete MDPs and does not directly extend to continuous control. Our intention was to use MuJoCo experiments to illustrate practical viability of our state-wise annotation model and algorithms rather than to validate theoretical bounds.
We will revise related statements to: "Empirical results on the synthetic MDP support our theoretical findings, while MuJoCo experiments demonstrate the practical viability and competitive performance of our state-wise annotation model and algorithms on continuous control tasks." to accurately reflect this distinction.
That said, we are currently conducting additional experiments on discrete-action environments (e.g., Atari) and language model distillation tasks, which more directly align with our theoretical framework. Results from these ongoing efforts will be included in the final or extended version of the paper.
In MuJoCo, the exact values of R and \mu may be hard to estimate due to the infinite size of the action space. The value of R can be estimated by the cumulative reward of the expert policy over multiple episodes, while \mu can be estimated by replacing a single expert action during a rollout and measuring the maximum performance degradation compared to the original expert. We will include these empirical estimates in the final version under the experimental details section.
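A minimal sketch of one such estimation procedure, under our reading of the description above (not the paper's code; `env`, `expert`, and the finite set `candidate_actions` of replacement actions are assumptions, and `env.step` is assumed to return a (next_state, reward) pair):

```python
import random


def episode_return(env, policy, horizon):
    """Cumulative reward of `policy` over one episode."""
    total, state = 0.0, env.reset()
    for _ in range(horizon):
        state, reward = env.step(policy(state))
        total += reward
    return total


def estimate_R_and_mu(env, expert, candidate_actions, horizon,
                      num_episodes=20, num_perturbations=20):
    """Estimate R as the expert's average return, and mu as the maximum
    return degradation from replacing a single expert action."""
    R_hat = sum(episode_return(env, expert, horizon)
                for _ in range(num_episodes)) / num_episodes

    degradations = []
    for _ in range(num_perturbations):
        t_replace = random.randrange(horizon)
        total, state = 0.0, env.reset()
        for t in range(horizon):
            action = expert(state)
            if t == t_replace:
                action = random.choice(candidate_actions)  # replaced action
            state, reward = env.step(action)
            total += reward
        degradations.append(R_hat - total)
    return R_hat, max(degradations)
```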
For Q2: On Assumptions of Expert Being Realizable and Deterministic
Theoretically, the realizable and deterministic expert assumptions are what enable our favorable policy suboptimality guarantees.
Under stochastic experts, we haven't worked out the guarantees of interactive imitation learning yet; we expect that the suboptimality guarantee would include a variance term, due to the expert's randomness, that vanishes as the number of annotations grows (analogous to [1]).
Under nonrealizable experts, we expect the suboptimality guarantee of output policy to have two parts:
(1) approximation error measuring how well the policy class approximates the expert;
(2) an optimization error term that typically degrades to a slower rate in the agnostic setting (see [2]). Extending the theory to the nonrealizable and stochastic expert settings is an interesting direction for future work.
Empirically, we evaluate our algorithm with a realizable stochastic expert. Though the nonrealizable setting is beyond the scope of this work, we expect that some variant of our algorithm can still give reasonable performance, provided the policy class is expressive enough (so that the problem is not exactly realizable but still meaningful). For example, [2] observed that with nonrealizable stochastic experts, DAgger variants outperform BC, and exhibit learning curves similar to our Figure 1.
Thank You!
We appreciate the thoughtful questions and hope our clarifications address the concerns raised.
[1] D. J. Foster, A. Block, and D. Misra. “Is behavior cloning all you need? Understanding horizon in imitation learning.” NeurIPS 2024.
[2] Y. Li and C. Zhang. “Agnostic Interactive Imitation Learning: New Theory and Practical Algorithms.” ICML 2024.
Thank you for your response. I plan to keep my rating and look forward to seeing your additional experimental results in your final version.
Thank you for your kind review. We plan to include the additional experimental details in the final version.
The paper studies imitation learning in the realizable, discrete-action case. It compares non-interactive IL (such as behavior cloning) to interactive IL (such as DAgger). The paper considers a cost model where we can ask the expert for annotations either for a single state, or for an entire trajectory (at a factor-C discount from the cost if we separately queried each state). So the learner can use four kinds of data: offline or online collection and state or trajectory queries.
Earlier work pointed out that, if we’re restricted to whole-trajectory queries and the realizable case, there’s no benefit to allowing interaction in the worst case. (This contrasts sharply with the non-realizable case, in which there can be a strong benefit to interaction: we need it to see how to recover from the inevitable errors in matching a non-realizable expert. It would be really great if the authors could emphasize more strongly this dependence on realizability, e.g., in the paragraph that starts at line 38.)
The paper asks whether removing the whole-trajectory restriction can improve data efficiency for interactive IL, while leaving the other choices above intact. It answers affirmatively: it presents new algorithms (Stagger and Warm-Stagger) that have better bounds on the amount of data needed. It also presents experiments (both in settings that match the theory and in settings “inspired by” the theory) that confirm that the new algorithms can be useful.
The reviewers agree that the paper is clear and that the analysis is sound. The question is well posed, and the answer is interesting. The reviewer/author discussion was helpful, the authors were responsive, and hopefully the discussion will improve the final version.