DiLQR: Differentiable Iterative Linear Quadratic Regulator
Abstract
Reviews and Discussion
This paper proposes an efficient O(1) iterative Linear Quadratic Regulator (iLQR)-based method to analytically compute trajectory gradients for optimal control problems with constrained first-order difference equations. The forward differentiation method delivers impressive speedups compared to standard automatic differentiation tools. The method is evaluated on tasks ranging from an inverted pendulum to vision-based cartpole, consistently displaying outstanding scaling and learning performance over neural network policies.
Strengths
The paper has a careful presentation and illustration of existing work, and of how its approach differs from the rest. It carefully distinguishes between the various notions of forward and backward passes it introduces. The results are interesting, especially on computational performance and imitation learning, as they indicate near-constant scaling with the number of iLQR iterations and remarkably low losses.
Weaknesses
The paper suffers from a number of weaknesses, listed below using the same references as in the main paper:
- The contribution is somewhat limited. As you point out in section 4.7, the most important related work (Amos et al. 2018) differs from the current work because it assumes independent fixed points. You combine this with the idea of analytically deriving derivatives of a fixed point with respect to parameters. This has been done very well before, notably in (Bai et al. 2019), as you point out. I think better contextualization of your contribution and acknowledgement of these two pieces of work is important.
- Since the main contribution of this work is theoretical with the aim of outperforming auto-differentiation, I believe the comparison shouldn't be limited to Pytorch's implementation. Other frameworks like JAX could provide invaluable insights.
- The visual control task is interesting, but it lacks substantial results, analysis, and discussion. The only results available are qualitative. More generally, the work is not compared to more established methods on advanced tasks, even though it suggests potential for SOTA performance. The work mentions other works that only consider toy examples, but those have various strengths that the current work doesn't have, particularly in their exposition and strong theoretical guarantees (e.g. Xu et al. 2024a, Amos et al. 2018).
- The presentation suffers from inconsistent notations, and the limitations of the work are absent.
Minor issues
- Typo in line 288
- fixed-point typo in line 346
Questions
- I am confused about section 4.4. Is the L referenced there still referring to the loss function from line 208? Also, is the binary loss approach described in that section approximate? If so, then it might be worth clarifying that the gradients mentioned in the main contribution are equally approximate (line 094).
- I'm not sure about the first paragraph in section 4.6. In PyTorch's `torch.autograd.jacobian` documentation, I believe the flag `create_graph`, which would in turn raise `retain_graph`, should allow gradient information to flow through time steps. I couldn't figure that out from the code you attached. Can you please clarify whether this feature is what you addressed, or not?
- In line 312, you refer to a term from Eq. 9, but that term doesn't appear in Eq. 9. Did you mean Eq. 13 instead? (It makes it hard to understand Eq. 14 and 15.)
We appreciate the reviewer's serious and careful feedback. For this discussion, let’s use [BKK] as an abbreviation for “Bai et al. (2019)”.
[BKK] implemented our idea (analytically deriving derivatives for a fixed-point) before
In [BKK], fixed-point theory is used to handle the gradients in various kinds of DNNs (transformers, RNNs, etc.), which are used for language modeling. By contrast, we apply fixed-point theory to iLQR, a nonlinear controller. This context is quite different from that of DNNs. Different problems call for different technical contributions. The novelty of our method comes from: 1. observing that the structure of iLQR fits the fixed-point methodology, and 2. developing the details that make the idea work for iLQR (sections 4.2-4.6). We do not claim to be the inventors of implicit differentiation or fixed-point theory.
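The shared fixed-point differentiation principle can be illustrated on a one-dimensional toy problem. This is only an illustrative sketch, not the paper's iLQR derivation; the square-root iteration and all names below are hypothetical:

```python
import numpy as np

# Toy "solver": the Babylonian iteration x_{k+1} = f(x_k, theta) = 0.5*(x_k + theta/x_k),
# whose fixed point is x* = sqrt(theta).
def f(x, theta):
    return 0.5 * (x + theta / x)

def solve_fixed_point(theta, x0=1.0, iters=50):
    x = x0
    for _ in range(iters):
        x = f(x, theta)
    return x

theta = 2.0
x_star = solve_fixed_point(theta)

# Implicit differentiation at the fixed point x* = f(x*, theta):
#   dx*/dtheta = (1 - df/dx)^{-1} * df/dtheta,
# regardless of how many solver iterations were run in the forward pass.
df_dx = 0.5 * (1.0 - theta / x_star**2)   # partial f / partial x at (x*, theta)
df_dtheta = 0.5 / x_star                  # partial f / partial theta
dx_dtheta = df_dtheta / (1.0 - df_dx)

# Sanity check against the closed form d sqrt(theta)/dtheta = 1/(2 sqrt(theta)).
print(dx_dtheta, 1.0 / (2.0 * np.sqrt(theta)))
```

The point of the sketch is that the gradient is obtained from a single linear solve at the converged iterate, which is why the cost does not grow with the number of solver iterations.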
Other frameworks like JAX could provide invaluable insights.
We agree that JAX has special properties compared with PyTorch. But our contribution identifies an approach that saves calculation steps that any autodiff approach would otherwise have to perform. Our innovation is theoretical and does not depend on the programming language or library used. Since our contribution is primarily conceptual, we think it’s reasonable to demonstrate proof of concept on just one widely-endorsed platform. Using PyTorch suffices to demonstrate the difference in work just described.
Visual experiment lacks discussion and analysis.
Our visual experiments aim to show that visual control with our module can predict a series of images (with high accuracy) instead of a single image. We will provide some numerical analysis shortly.
"the work is not compared to more established methods on advanced tasks, and it suggests potential for SOTA performance"
We are unclear about the definitions of "SOTA" and "performance." The two cited papers also seem to lack explicit definitions. Our paper focuses on proof of concept, showing that our theoretical framework works effectively. In the future, we may combine more powerful vision models, such as large vision-language foundation models, with MPC for more complex tasks. We also provide a discussion in the common comments.
Our method lacks exposition and strong theoretical guarantees compared with our references.
We are unsure which specific aspects are referred to as "exposition" and "strong theoretical guarantees". We have added a theoretical section in appendix A.4 to further support our method. We would welcome further clarification or examples from the reviewer.
Limitations absent
In short, our method assumes that iLQR can find a fixed point, and it requires first-order and second-order derivatives of the dynamics. These assumptions may restrict the applications of our method. We will revise our submission to include them.
Inconsistent notations
We have made efforts to maintain consistent notation and to ensure that each symbol is properly explained. Nevertheless, as mentioned in the common comments, we have updated the notation to make it more readable.
Questions from the reviewer
Q1.1 In section 4.4, is the L referenced there still referring to the loss function from line 208?
The L in section 4.4 can refer to the loss in line 208; however, that is not how we use it in section 4.4. Here, we treat H as L, and we rely on the derivatives provided in Amos et al. (2018) to calculate the required terms.
Q1.2 Is the binary loss approach described in that section approximate?
No, it is not approximate: we enumerate all the elements in H to create multiple binary losses and thereby obtain the required derivatives.
Q2. Is the first paragraph in section 4.6 addressed and implemented in our code?
Yes, it is. Please refer to line 703 in lqr_step_explicit.py in our code, where true_dynamics.grad_input performs the operation described in Section 4.6. The function is defined in the model file. In line 711 of lqr_step_explicit.py, we return dtheta, which indicates that we compute the analytical gradient and pass this customized gradient onward, rather than allowing PyTorch to traverse the graph using create_graph=True, as noted by the reviewer in the mpc_explicit.py file. We enabled create_graph only for testing auto-differentiation.
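The general pattern at issue (returning an analytical gradient from a custom backward pass instead of unrolling the solver graph with create_graph=True) can be sketched in a minimal, self-contained form. This toy example is hypothetical and is not the actual code in lqr_step_explicit.py:

```python
import torch

# Minimal sketch: the forward pass runs an iterative solver under no_grad, so
# PyTorch records no unrolled graph; backward() then returns the analytical
# implicit gradient directly, rather than traversing the iterations.
class SqrtByIteration(torch.autograd.Function):
    @staticmethod
    def forward(ctx, theta):
        with torch.no_grad():
            x = torch.ones_like(theta)
            for _ in range(50):             # Babylonian iteration: fixed point is sqrt(theta)
                x = 0.5 * (x + theta / x)
        ctx.save_for_backward(x)
        return x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Implicit gradient at the fixed point: d sqrt(theta)/d theta = 1/(2 x).
        return grad_output / (2.0 * x)

theta = torch.tensor(2.0, requires_grad=True)
x = SqrtByIteration.apply(theta)
x.backward()
print(theta.grad)  # ~ 1/(2*sqrt(2)) ~ 0.3536
```

With this pattern, enabling create_graph on the outer loss is unnecessary: the backward cost is a single closed-form expression, independent of the 50 forward iterations.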
Q3. Does the term referenced from Eq. 9 refer to the term in Eq. 13?
Yes, it refers to the term in equation (13). We refer to equation (9) because we want to trace back to the root equation. We have revised this point to make the section more readable.
Typos
We thank the reviewer for pointing out the typos. We have corrected them accordingly.
Quickly commenting in here with some of my thoughts:
Relationship to DEQs [BKK]
The similarity is that a DEQ fixed-point and control problem can both be implicitly differentiated. I agree with the authors' response that there are not any specific insights from DEQs that could be applied in this setting.
Comparisons to JAX
I also agree with the authors' response that switching to JAX alone at the framework level would not lead to any interesting insights. However, the jaxopt package does have implicit differentiation support here that will use autodiff for implicit derivatives.
It seems possible to simply put the iLQR fixed point into this function. It should result in an autodiff way of obtaining the implicit derivatives of iLQR (computed via conjugate gradient).
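The mechanics behind this suggestion can be sketched without jaxopt itself (the matrices and helper names below are illustrative, not from the paper or the jaxopt API): implicit differentiation of a fixed point x* = f(x*, θ) reduces to solving the linear system (I − ∂f/∂x) v = ∂f/∂θ, which conjugate gradient can do matrix-free when the system matrix is symmetric positive definite.

```python
import numpy as np

# Hand-rolled conjugate gradient for a symmetric positive definite matvec.
def cg(matvec, b, iters=50, tol=1e-10):
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy affine fixed-point map f(x, theta) = A x + theta * b with ||A|| < 1,
# so x*(theta) = theta * (I - A)^{-1} b and dx*/dtheta = (I - A)^{-1} b.
A = np.array([[0.5, 0.1], [0.1, 0.3]])   # symmetric, spectral radius < 1
b = np.array([1.0, 2.0])
dx_dtheta = cg(lambda v: v - A @ v, b)   # solve (I - A) v = b via CG

print(np.allclose(dx_dtheta, np.linalg.solve(np.eye(2) - A, b)))  # prints True
```

For a nonsymmetric Jacobian (the typical iLQR case), one would solve the same system with a general linear solver or via normal equations; the matrix-free structure is what makes the approach scale.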
"strong" theoretical guarantees
Again I agree with the authors that the paper is not lacking in theoretical guarantees in ways that the related papers overcome.
We have added a theoretical section in appendix A.4 to give more support of our method.
I read this; the authors' description of it is slightly misleading. There are no new theorems or proofs in it. The section simply notates iLQR and implicit derivatives, and applies this to a simple example.
We sincerely thank the reviewer for his/her valuable feedback and the time and effort dedicated to evaluating our work. We’ve decided to withdraw the paper at this stage to further improve it before resubmitting. This will save the Area Chair and reviewers’ time.
The paper studies derivatives through an iLQR-based controller, which are given by implicit differentiation in Prop 1. The experiments in section 5 compare this method's runtime to the unrolled derivatives (Fig 2), then go on to imitation/inverse control settings on the pendulum and cartpole (Fig 3), and finally to a pixel-based control setting.
Strengths
Understanding controller derivatives is an important topic, and the paper investigates efficient ways of doing this.
Weaknesses
While the overall direction of the paper is okay, it is not ready for publication. My biggest concern is that there are at least two crucial pieces of related work that need to be discussed and compared against: Differentiable Optimal Control via Differential Dynamic Programming and Leveraging Proximal Optimization for Differentiating Optimal Control Solvers. The abstract of the submitted paper claims "the iterative Linear Quadratic Regulator (iLQR) [...] still lacks differentiable capabilities." However, both of these papers focus on this setting. Section A of Differentiable Optimal Control via Differential Dynamic Programming states that they use "any method (including either iLQR or DDP) for solving optimal control problems" while using the second-order terms from DDP only in the implicit derivative computation. I believe it's possible the derivatives between the submitted paper and this one are very similar, if not identical, and a discussion and direct comparison here is necessary. Furthermore, Leveraging Proximal Optimization for Differentiating Optimal Control Solvers can also solve the OC problem with iLQR and then proposes a proximal optimization way of computing the implicit derivatives.
I have a few other minor concerns with the paper: 1) Figures 1 and 2 compare autodiff through the unrolled LQR iterates to the proposed method. This is slightly misleading as it ignores the related work and alternative approaches that are not based on unrolling. 2) The experimental results on imitating the pendulum and cartpole are relatively small-scale. I would find it significantly more convincing to use the larger experimental settings from these other differentiable control papers.
Questions
I do not have any further specific questions. I am very open to discussing the weaknesses above.
We thank the reviewer for pointing out two articles that we did not cite. For the current discussion, let’s refer to these works as
[DM+] Differentiable Optimal Control via Differential Dynamic Programming;
[BPC] Leveraging Proximal Optimization for Differentiating Optimal Control Solvers.
We agree that they are relevant, but we do not agree that the derivatives of [DM+] and ours are similar. We believe that our innovations differ from [DM+, BPC]. They are not in conflict with ours, but rather complementary.
Differences between our method and [DM+,BPC]
- In our notation, the linearized model involves coefficients D and d. [DM+] essentially details a more accurate calculation of just one term (equation (19) in [DM+]) among the several terms on the right-hand side of our Equation (13). We currently use the results from Amos et al. (2018) for this term. Similarly, [BPC] provides a more accurate calculation of the same term under constraints. Both cited results omit the other terms we show on the right-hand side of Equation (13), as well as the fixed-point finding process in Equation (9). Thus our fixed-point perspective leads to materially different calculations from these works, and our experiments confirm that the difference is indeed an improvement.
- Why does the fixed-point method make a difference? We have added a comparison between the fixed-point and non-fixed-point methods in appendix A.4. Please refer to section 4.7 and the additional appendix A.4 in the revised PDF for details.
- In fact, it could be useful to combine the refinements of [DM+,BPC] with the structure suggested in our manuscript to generate a new method with all the advantages of both. However, [DM+,BPC] didn’t release their code, so we cannot complete this project during the current discussion period.
“Section A of [DM+] states that they use any method (including either iLQR or DDP) for solving optimal control problems."
As noted in the introduction of our paper, we focus on the exact gradient of iLQR. This includes aligning the forward solution with the backward pass gradient. Such alignment can be important, especially when the forward pass finds a sub-optimal path but the backward pass obtains the gradient of the path assuming the path is optimal.
[BPC] can also solve the OC problem with iLQR
[BPC] briefly suggests a way to extend their result to iLQR; however, it remains a non-fixed-point method that simply uses the chain rule to connect each derivative (the same as Amos et al. (2018)). This has been addressed in Points 1 and 2 in the reply above.
Simple example
As mentioned in our discussion section, 'Many prior works, such as Amos et al. (2018), Watter et al. (2015), Xu et al. (2024a), and Jin et al. (2020), also rely on such toy examples to demonstrate foundational concepts.' The [BPC] paper referenced by the reviewer similarly uses these exact examples. We have provided further discussion on this point in the common comments.
Thank you for the response! I believe further connecting and comparing to these works would significantly improve the paper. I do not think the new version of the paper adequately does this, as there are still no experimental comparisons. The methods seem competing rather than complementary, and not having their source code is not a sufficient reason for not comparing. Additionally, weaknesses from my original review remain unresolved:
- Figures 1 and 2 compare autodiff through the unrolled LQR iterates to the proposed method. This is slightly misleading as it ignores the related work and alternative approaches that are not based on unrolling.
- The experimental results on imitating the pendulum and cartpole are relatively small-scale. I would find it significantly more convincing to use the larger experimental settings from these other differentiable control papers.
We sincerely thank the reviewer for his/her valuable feedback and the time and effort dedicated to evaluating our work. We’ve decided to withdraw the paper at this stage to further improve it before resubmitting. This will save the Area Chair and reviewers’ time.
This paper proposes a method for calculating the analytical gradient of iLQR by leveraging implicit differentiation at the fixed point to enhance the learning and optimization of control algorithms using iterative Linear Quadratic Regulators (iLQR). The method also introduces a forward approach that reuses computations from each time step to accelerate the next step, along with parallelization techniques to further improve efficiency. The proposed approach significantly reduces computational costs and enables accurate and scalable gradient calculations compared to conventional automatic differentiation methods. Additionally, the integration of iLQR as a module within neural networks demonstrated the feasibility of end-to-end learning for high-dimensional tasks involving visual inputs.
Strengths
- The paper is well-written overall, with the proposed method's theoretical foundation clearly explained. Additionally, the positioning and differences from existing research are clearly demonstrated.
- The proposed method significantly reduces computational costs compared to conventional methods using automatic differentiation, achieving improved scalability. To the best of my knowledge, this approach is novel and highly useful, as it is applicable to large-scale control problems and tasks with long horizons. Furthermore, the integration of iLQR as a module within neural networks for end-to-end learning has been demonstrated, suggesting potential applications to more complex tasks.
Weaknesses
- The proposed method improves gradient accuracy through analytical implicit differentiation at the fixed point, and further improves speed by introducing a forward method for differentiating the linearized dynamics with respect to the nonlinear dynamics parameters, as well as by implementing parallelization. However, the results only evaluate the final method that combines these improvements, making it unclear how much each specific enhancement contributed to the overall speedup. The authors should clarify through experiments how much each component contributed to the speed and performance outcomes.
- While the paper is well-written overall, the explanation of the proposed method could be further improved. For instance, due to the parallel presentation of Sections 4.1 to 4.6, understanding the complete picture was challenging, and it was difficult to grasp how each element fits into the overall method. Including an overview at the beginning of Section 4 or presenting the entire computation as an algorithm would make the method easier to understand. Additionally, while reading the paper, I found it confusing that two different quantities were represented by the same symbol. Using different notations or adding a note for clarification would help avoid confusion.
Questions
- To clarify the proposed method's effectiveness, I would like you to improve the points I have raised in the above weaknesses.
We are grateful for the reviewer's feedback and the effort the reviewer has put into improving our work.
Relations between each component
Section 4.1 provides an overview of the pipeline for differentiable control. Section 4.2 focuses on our main method within that pipeline. Section 4.3 addresses the computation of $F_X$, $F_U$, $G_X$, $G_U$ as referenced in Section 4.2. Section 4.4 covers $\frac{\partial H}{\partial D}$ (where $D$ denotes $F$ or $G$) as used in Section 4.3. Section 4.5 deals with $\frac{\partial D}{\partial X}$ from Section 4.3, while Section 4.6 explains $\frac{\partial D}{\partial \theta}$ from Section 4.3. We can incorporate this explanation into either a pseudo-code block or an additional subsection for improved clarity.
Contribution of each component
Section 4.4 demonstrates at least a 2x acceleration when the data size is substantial (>1500 data pairs). Section 4.5 highlights significant memory-cost benefits. Section 4.6 is a critical component developed specifically for our overall method. Without this module, the fixed-point implementation would not achieve the expected time advantage. For example, with 1,500 input data pairs, the computation time for gradients (line 704 in lqr_step_explicit.py) is reduced from 5 seconds to 0.02 seconds. Additional numerical illustrations will be provided shortly. The contributions of these components are independent and can be evaluated separately without the need for an ablation study.
Notation issue
Please see the common comments for more information. We have updated the operator notation accordingly.
We sincerely thank the reviewer for his/her valuable feedback and the time and effort dedicated to evaluating our work. We’ve decided to withdraw the paper at this stage to further improve it before resubmitting. This will save the Area Chair and reviewers’ time.
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.