/10

Poster3 位审稿人

最低2最高4标准差0.8

ICML 2025

Scaling Value Iteration Networks to 5000 Layers for Extreme Long-Term Planning

Yuhui Wang,Qingyuan Wu,Dylan R. Ashley,Francesco Faccio,Weida Li,Chao Huang,Jürgen Schmidhuber

提交: 2025-01-23更新: 2025-07-24

TL;DR

A novel 5000-layer Dynamic Transition Value Iteration Network performs well in extremely long-term large-scale planning tasks

摘要

The Value Iteration Network (VIN) is an end-to-end differentiable neural network architecture for planning. It exhibits strong generalization to unseen domains by incorporating a differentiable planning module that operates on a latent Markov Decision Process (MDP). However, VINs struggle to scale to long-term and large-scale planning tasks, such as navigating a $100\times 100$ maze---a task that typically requires thousands of planning steps to solve. We observe that this deficiency is due to two issues: the representation capacity of the latent MDP and the planning module's depth. We address these by augmenting the latent MDP with a dynamic transition kernel, dramatically improving its representational capacity, and, to mitigate the vanishing gradient problem, introduce an "adaptive highway loss" that constructs skip connections to improve gradient flow. We evaluate our method on 2D/3D maze navigation environments, continuous control, and the real-world Lunar rover navigation task. We find that our new method, named Dynamic Transition VIN (DT-VIN), scales to 5000 layers and solves challenging versions of the above tasks. Altogether, we believe that DT-VIN represents a concrete step forward in performing long-term large-scale planning in complex environments.

关键词

Value Iteration NetworksLong-term PlanningReinforcement LearningDeep Neural Network

评审与讨论

审稿意见

评分: 42025-02-24

In this paper, the authors investigated to extend Value Iteration Network (VIN) to address longer-term and larger-scaled planning tasks. It is not feasible just by applying VIN, and they found the reasons from invariant transition in the network and inefficient loss design for long-term planning tasks. To address them, the authors proposed a new module, Dynamic Transition kernel which transits under the consideration of observation knowledge, and adaptive highway loss to give the learning signals to the each layer of network when the planning horizon is over the oracle length. It not just leads the network can find the shorten path but also improves the information flow. They called it as Dynamic Transition VINs (DT-VINs) and they showed this planning modules can be scaled up to 5000 layers empirically. They empirically evaluated their model on large-scaled discrete/continuous action conditioned maze environments and Rover Navigation environment. In the evaluations, their model outperformed baselines. They also studied ablations when different kernels and losses are applied and the performance change through different number of layers applications.

update after rebuttal

The authors properly addressed our concerns, we keep our score.

给作者的问题

For continuous action maze environments, you applied the pre-trained VIN model for planning. Then, in the aspect of the planning, can we think the task is similar to discrete action spaced maze? May the difference is the controller I guess.

论据与证据

This paper claims that by augmenting the standard VIN architecture with the dynamic transition kernel and an adaptive highway loss, one can overcome the limitations in long-term planning present in conventional VINs. DT-VINs achieve much higher success and optimal rates in tasks where traditional methods fail, and the ablation studies clearly show the contribution of each component. While the improvements are well supported, a deeper discussion on the trade-offs between performance gains and increased computational complexity would further strengthen the evidence.

方法与评估标准

The methodology is well motivated. The use of a dynamic transition kernel addresses the representational bottleneck in traditional VINs, and the adaptive highway loss provides an elegant solution to train extremely deep networks. The evaluation criteria—primarily success rate and optimality rate across varied maze sizes—are appropriate for assessing long-term planning performance. The extensive experiments on multiple benchmarks and controlled ablation studies make a strong case for the proposed modifications.

理论论述

The paper references theoretical work linking network depth with value estimation accuracy (e.g., via Theorem 1.12 from prior studies) but does not provide new proofs.

实验设计与分析

The experimental design is reasonable and their experiments are comprehensive. The authors evaluated DT-VINs across a variety of settings—from small-scale discrete action spaced mazes to complex continuous control and real-world rover navigation tasks. The soundness of the experimental analyses is supported by thorough ablation studies, comparisons with multiple baselines, and assessments under different noise conditions.

补充材料

We checked their appendix, especially for additional experimental detail parts and additional ablation study results.

与现有文献的关系

In the perspective of the targeting tasks, long-term planning, it can be related to the recent works applying the Diffusion model to the long-term planning tasks [1,2]. Their methods approach to generate the long-term planning through the powerful generative modeling with the strengths of the diffusion model, holistically generating data while DT-VIN can make it possible through very deep VIN model learning.

遗漏的重要参考文献

The essential references are properly addressed in this paper.

其他优缺点

Strengths:

A good motivation and well-designed methodology: One of the weaknesses of VIN for long-term planning is well addressed with properly designed methodology.
Extensive empirical results: They empirically evaluated their methodology with several environments and studied diverse ablations.

Weaknesses:

Too compressed discussion for their methodology: Their discussion for proposed methodology is relatively short, and it was hard to understand at first glance. More detailed figures and explanations are given, it would be better. For instance, why the dynamic transition kernel is required is shown through figure that shows the strength of it by comparing the region with walls and empty areas are encoded invariantly and observation conditionally.
Computational overhead discussion: Deeper network requires more computational overhead, but it is not discussed in their paper.

其他意见或建议

We do not have any other comments.

作者回复

2025-04-01

We appreciate the insightful and helpful comments from Reviewer vY56. Please find our responses to your concerns and questions below.

Suggestion 1: The tasks are related to diffusion models for long-term planning.

Answer:

Thank you for noting the connection to diffusion-based planning. As references [1,2] were missing, we reviewed related works [r3, r4]. Though differing in approach—trajectory generation vs. step-wise planning—both address long-horizon dependencies in complex tasks. We will revise the manuscript to reflect this connection and consider future integration.

Additionally, if the reviewer can provide [1,2], we will be happy to cite and discuss them in our final version.

[r3] Janner M. et al. Planning with diffusion for flexible behavior synthesis. ICML 2022.

[r4] Mishra U.A. et al. Generative skill chaining: Long-horizon skill planning with diffusion models. CORL 2023.

Weakness 1: More discussion on methodology.

Answer:

We agree that the methodology section could be clearer and will revise it for the final version. In particular, we agree with your proposal to lean more on illustrative figures in the paper. Currently, Figure 6(e) visualizes the learned dynamic latent transition kernel of DT-VIN across several states. We plan to integrate this into Figure 2 and include a side-by-side comparison with an invariant transition kernel to better highlight the advantages of the dynamic approach.

Weakness 2: More discussion on trade-offs between performance gains and the additional computational costs introduced by using deeper networks.

Answer:

We already partially address this in Appendix G, where we found that:

At the same depth, DT-VIN achieves better performance with a similar or lower GPU memory and training time (Appendix G.1, Tables 14–15).
DT-VIN’s depth scales almost linearly with planning steps, ensuring manageable complexity as tasks grow harder (Appendix G.2).

To more specifically address your concern, we have produced Tables R1 and R2 (shown below), which explicitly analyze the trade-off between computational cost and performance across different network depths $N$ . While deeper networks naturally incur higher computational costs, DT-VIN demonstrates substantial performance gains as depth increases. Specifically, it achieves +81.23% and +99.77% success rate improvements at $N=300$ and $N=600$ over $N=100$ , respectively. In contrast, VIN, GPPN, and Highway VIN show little to no improvement despite incurring similar or even higher computational costs.

We will incorporate these new tables into the revised manuscript to clarify the performance–efficiency trade-offs of DT-VIN.

Table R1: GPU memory (GB) and GPU hours (h) during training, on $35\times 35$ maze tasks for varying network depths $N$ .

GPU memory (GB):

	N=100	N=300	N=600
VIN	1.3	2.7	4.2
GPPN	42.1	135.2	182
Highway VIN	15.2	31.5	41.3
DT-VIN (ours)	18.8	35.8	53.3

GPU hours (h):

	N=100	N=300	N=600
VIN	2.8	5.2	8.4
GPPN	4.2	9.1	12.6
Highway VIN	4.9	9.8	14.3
DT-VIN (ours)	4.8	8.7	12.1

Table R2: Success rates of different models at varying depths on $35\times 35$ mazes with shortest paths in [200, 300]. See Appendix Fig. 9 for plots.

	N=100	N=300	N=600
VIN	$0.0_{\pm 0.0}$	$0.0_{\pm 0.0}$	$0.0_{\pm 0.0}$
GPPN	$0.0_{\pm 0.0}$	$0.0_{\pm 0.0}$	$0.0_{\pm 0.0}$
Highway VIN	$0.0_{\pm 0.0}$	$34.23_{\pm 11.27}$	$54.41_{\pm 10.2}$
DT-VIN (ours)	$0.0_{\pm 0.0}$	$\mathbf{81.23_{\pm 1.34}}$	$\mathbf{99.77_{\pm 0.23}}$

Q1: Whether planning in continuous action mazes is essentially the same as in discrete ones. May the main difference is the controller.

Answer:

You're generally right—the high-level planning in continuous control tasks is conceptually similar to that in discrete mazes (all planning in reinforcement learning is essentially about producing a sequence of actions—discrete or continuous—that will lead to a desired state or sequence of states).

However, DT-VIN is not simply a stack of a high-level planner and a low-level controller. Instead, it is trained end-to-end to map observations directly to actions, learning high-level planning implicitly from expert demonstrations without requiring high-level planning supervision. This simplifies training and improves robustness.

For example, in D4RL tasks, we supervise using expert control actions (e.g., from a 2-DOF ball or a 8-DOF quadruped robot). While D4RL uses the same $9\times 12$ maze layout for all training samples (varying only the start and goal positions), we evaluate on much larger, unseen mazes ( $35\times35$ , $100\times100$ ), making the task significantly more challenging.

审稿意见

评分: 32025-03-08

Value Iteration Networks (VIN) struggle to scale with large scale planning problems, typically in problems involving a higher number of steps to reach to goal. The paper provides two observations that explain the poor performance (1) low representation capacity (2) lack of depth in VINs - due to vanishing gradient issue for higher depths. The authors propose to use dynamic transition kernels to improve representational capacity (solve (1)) and propose ‘adaptive highway loss’ to solve issues with vanishing gradients allowing higher depths. Their method, named DT-VIN, shows state-of-the-art performance with respect to other VIN frameworks, especially for large scale planning problems.

Update after Rebuttal

Thanks for a detailed response. That is, I still think the paper lacks clarity in proposed method section (aligns closely with reviewer vY56). I do not agree with authors' final comment on "minor" writing issues. Although I understand minor writing issues are fine - however if they obstruct understanding of the paper (and thus critical review of the contributions) then in my opinion those issues are not "minor". Given my concern I will be retaining my original score as of now.

给作者的问题

How is suboptimality measured when success rate is less than 100%?
Why does highway VIN get a better optimality rate with increasing shortest path lengths?
line 260-261 - we examine depths of N = 600, 5000 what do you mean? Do you mean in range [600, 5000]?
Why is pretraining required in section 4.2?

论据与证据

Yes.

方法与评估标准

Yes, to my knowledge.

理论论述

N/A

实验设计与分析

The experiments sound solid for most part. See questions section for questions.

补充材料

Didn't review in detail - Appendix provides experimental details. Skimmed through the details - which made sense.

与现有文献的关系

They advance a widely used framework for large scale planning problems. I think the contribution is sufficient.

遗漏的重要参考文献

N/A

其他优缺点

Strengths

The paper provides rigorous experiments and strong ablation studies providing insightful conclusions.
The results are impressive on large scale planning problems.
The paper is well structured

Weakness

Section 3.1 is not clear to me. Specifically, justification for 'state’-dynamic and 'observation’ dynamic kernels is not clear to me. What is the difference between them in this context? How does a kernel of |A| \times F \times F make it `state’/’observation’ dependent? In this case, should we use |A| \times F \times F parameters for each state?

其他意见或建议

Minor

Text in images in too small (e.g., Fig. 1 Fig. 2)
Line 244 right ('Additional’ experiments). Due to space constraints we `report’ (or evaluate and report) Additional is also not spelled correctly.

作者回复

2025-04-01

We sincerely thank reviewer TBA2 for the valuable comments.

Concern 1: Clarification on state/observation-dynamic in Sec 3.1.

Answer:

To clarify: the observation refers to the maze image input, which is mapped by the model to a latent MDP. Each latent state corresponds to a specific node within this latent MDP.

We distinguish two types of dynamics:

Latent state-dynamic: As the reviewer correctly noted, in this setup each latent state $(i,j)$ has its own transition kernel $\overline{\mathsf{T}}_{i,j} \in \mathbb{R}^{|\overline{\mathcal{A}}| \times F \times F}$ .
Observation-dynamic: The transition kernel is generated from the input observation $x$ via a learned mapping function: $\overline{\mathsf{T}} = f^{\overline{\mathsf{T}}}(x)$ .

These two notions are orthogonal, defining independent aspects of variation in the transition kernel. They give rise to four configurations: latent state-dynamic/invariant and observation-dynamic/invariant (see Table R1). Our ablation study (Sec. 4.4) shows that both components are essential—removing either significantly degrades DT-VIN performance.

We will clarify this distinction in the final version.

Table R1: Four configurations of latent transition kernels based on whether they are dynamic with respect to latent state and/or observation.

Type of latent tranisiton kernels	latent state-dynamic?	observation-dynamic?	represenation of latent tranisiton kernels
fully invariant	$\times$	$\times$	parameter $\overline{\mathsf{T}} \in \mathbb{R}^{\lvert \overline{\mathcal{A}}\rvert\times F \times F}$
latent state-dynamic only	$\checkmark$	$\times$	parameter $\overline{\mathsf{T}} \in \mathbb{R}^{M \times M \times \lvert \overline{\mathcal{A}}\rvert \times F \times F}$
observation-dynamic only	$\times$	$\checkmark$	output of the mapping function, $\overline{\mathsf{T}} = f^{\overline{\mathsf{T}} }(x) \in \mathbb{R}^{ \lvert \overline{ \mathcal{A} } \rvert \times F \times F}$
fully dynamic	$\checkmark$	$\checkmark$	output of the mapping function, $\overline{\mathsf{T}} = f^{\overline{\mathsf{T}} }(x) \in \mathbb{R}^{ M \times M \times \lvert \overline{ \mathcal{A} } \rvert \times F \times F}$

Q1: How is suboptimality measured when the success rate is less than 100%?

Answer:

As defined in the paper (Line 220), suboptimality is measured relative to the best solution across all models, including the expert. This is applicable in the 2D Maze Navigation task, where the expert solution can be computed using Dijkstra’s algorithm with access to an underlying binary representation of the maze. For tasks without expert access (e.g., continuous control), only the success rate is reported.

Q2: Why does highway VIN get a better optimality rate with increasing shortest path lengths?

Answer:

Thank you for the thoughtful question. One possible explanation is that longer planning tasks offer a more structured search space, allowing Highway VIN to better propagate informative signals over multi-steps. In contrast, shorter mazes may contain more local optima or distractors, leading to suboptimal choices despite a shorter horizon. As Highway VIN is not our focus, we leave deeper analysis to future work.

Q3: line 260-261 - we examine depths of $N = 600, 5000$ what do you mean? Do you mean in range $[600, 5000]$ ?

Answer:

Thank you for pointing this out. We meant that we examine depths at two specific values, $N = 600$ and $N = 5000$ , rather than a continuous range. We will clarify this in the final version.

Q4: Why is pretraining required in Sec 4.2?

Answer:

Pretraining is necessary because the training dataset for the continuous control task includes only a single $9\times12$ maze layout, differing solely in start and goal positions. However, during evaluation, the model encounters significantly larger ( $35\times35$ and $100\times100$ ) and previously unseen maze layouts, substantially increasing task complexity and requiring strong generalization abilities.

To address this challenge, we leverage pretraining on the diverse and easily accessible 2D maze navigation dataset. This approach helps the model acquire generalizable planning skills, resulting in strong performance with success rates of 98% on the Point Maze and 93% on the Ant Maze tasks (refer to Table 1 in Sec. 4.2 for further details).

While it is feasibly possible for it to solve the task without pretraining, the sample complexity would be high—a property shared with VIN (and VIN variants) here.

Suggestion 1: Minor comments on figure readability and typos.

Answer:

Thank you for bringing this to our attention. We will correct this for the camera-ready version.

审稿人评论

2025-04-04

Thanks for a detailed response. I am not convinced about concern 1. That is, I still think the paper lacks clarity in proposed method section (aligns closely with reviewer vY56).

Regarding Q1. I think there was some misunderstanding regarding my question. I will try to rephrase. How do you aggregate suboptimality if an algorithm is unable to solve a problem? Do you consider the cost as infinity (or a high number) in that case? Typically, I see (and agree with) suboptimality is calculated over a set of instances that were solved by ALL algorithms that are being compared.

Given my concern I will be retaining my original score as of now.

作者评论

2025-04-07

Thank you for your thoughtful and detailed follow-up. Please find our responses below.

Concern 1-1 (based on Concern 1): Ongoing concern regarding the clarity of Sec. 3.1.

Answer:

We greatly appreciate your careful attention to detail and constructive feedback. We understand that your remaining concern primarily relates to the clarity of the dynamic latent transition kernel, and we sincerely hope our previous response has adequately addressed this issue.

Nevertheless, our original submission explicitly defined both the latent state-dynamic transitions (Lines 126–136, Right) and the observation-dynamic transitions (Lines 153–157, Right). We acknowledge that explicitly highlighting their differences and clearly specifying parameter shapes could further help resolve any remaining ambiguity.

Given your positive evaluation—highlighting the "rigorous experiments and strong ablation studies" and the "impressive results on large-scale planning problems"—we believe you would also agree that addressing this concern would involve only minor rephrasing of a few sentences rather than substantial modifications. The ambiguity pertains solely to clarity of presentation, rather than to any intrinsic methodological flaw or fundamental issue.

We hope these clarifications help demonstrate that the core contributions remain solid and that the concern can be resolved through light adjustments to the presentation.

Q1-1 (based on Q1): Ongoing question about the suboptimality.

Answer:

We apologize for the misunderstanding.
As many long-horizon planning tasks are highly challenging—for example, over 60% of tasks on the $35 \times 35$ maze involve at least one algorithm that fails (see Fig. 3a)—suboptimality is not well-defined for failed attempts. To address this, our paper adopts the optimality rate, which provides a more consistent and fair measure of both solution quality and robustness across all solvable instances. As defined in Lines 223–225, the optimality rate (OR) refers to “the rate at which the algorithm provides a relatively optimal path.” OR is computed over instances that are at least solvable by the expert policy. As noted in Appendix C.1, Line 582: “Each maze features a goal position, with all reachable positions selected as potential starting points.”

However, to further address your concern, we additionally report suboptimality as a complementary metric, defined as:

\text{suboptimality} = \frac{\text{cost of solution}}{\text{cost of optimal solution}} - 1

We consider two evaluation settings:

Suboptimality on instances solved by all algorithms:
Following standard practice, we compute suboptimality only over the subset of instances successfully solved by all evaluated algorithms.
Suboptimality with penalty for failures:
For instances where an algorithm fails to solve the task, we assign a penalty cost (7× the optimal cost) when computing suboptimality.

As shown in the table below, DT-VIN consistently achieves the lowest suboptimality under both evaluation protocols, further demonstrating its effectiveness and robustness.

Table R2: Suboptimality comparison under two evaluation settings.

	Suboptimality (instances solved by all algorithms)	Suboptimality (with penalty for failures)
VIN	$\mathbf{0.73_{\pm 0.27}}$	$\mathbf{4.23_{\pm 1.76}}$
GPPN	$\mathbf{0.41_{\pm 0.19}}$	$\mathbf{3.73_{\pm 1.22}}$
Highway VIN	$\mathbf{0.33_{\pm 0.12}}$	$\mathbf{3.13_{\pm 1.13}}$
DT-VIN (ours)	$\mathbf{0.02_{\pm 0.01}}$	$\mathbf{0.07_{\pm 0.03}}$

We hope these clarifications address your concerns. Thank you again for your valuable and expert feedback.

审稿意见

评分: 22025-03-23

This paper tackles the problem of extending value iteration networks (VINs) to handle very long-horizon planning. To achieve this, the authors propose two main modifications:

A dynamic transition kernel that relaxes the standard weight sharing of convolution layers (i.e., removing strict translation equivariance), and
An adaptive highway loss designed to enable the training of much deeper planning modules.

Experimental results are presented primarily on 2D path planning tasks, demonstrating the network’s ability to scale to thousands of planning steps. Overall, the paper is very empirical and focuses on a niche domain related to path planning.

给作者的问题

Dynamic Kernel Analysis:

Could you provide further analysis or ablation studies on how the dynamic transition kernel (i.e., relaxing weight sharing) affects the network’s translation equivariance and overall performance?
Generalization Beyond 2D Planning:

Have you considered applying your approach to planning problems outside of 2D path planning? What challenges do you foresee, and how might your method generalize to more realistic or complex environments?
Broader Implications:

Can you elaborate on what new capabilities or applications might be enabled by scaling VINs to such long horizons? For instance, could this be integrated with perception modules to tackle more challenging planning tasks?

论据与证据

Claims:
- The proposed modifications enable VINs to plan over very long horizons (up to thousands of steps) by increasing network capacity and facilitating deep credit assignment.
- Relaxing the invariant kernel (i.e., weight sharing) boosts performance on these tasks.
Evidence:
- Empirical results support improved planning in extended 2D path planning scenarios.
- However, the paper does not convincingly analyze the impact of removing translation equivariance (i.e., the dynamic kernel is simply a relaxation of the weight-sharing property) and lacks detailed insights or ablations on this point.

方法与评估标准

Methods:
- The paper proposes modifications to the standard VIN architecture by replacing invariant (shared-weight) convolutions with dynamic kernels and enforcing an adaptive highway loss to help train very deep networks.
- While the adaptive loss is an interesting idea, the paper provides limited discussion on its theoretical justification.
Evaluation:
- Experiments are conducted on standard 2D path planning benchmarks, which reflect a narrow application domain, although showing extended horizon capability. I would suggest to follow the mentioned work below for more tasks, e.g., for visual navigation with perception input using additional network heads for perception networks.
- The evaluation criteria remain focused on path planning success rates; broader or more realistic planning tasks would strengthen the work.

Scaling up and Stabilizing Differentiable Planning with Implicit Differentiation. ICLR 2023.

理论论述

paper’s contributions are mainly empirical, and no detailed theoretical proofs were provided regarding the benefits of the dynamic kernel or the adaptive loss.

实验设计与分析

Design:
- The experimental setup largely mirrors that of the original VIN paper (from eight years ago) but with extended planning horizons.
Analysis:
- While the experiments demonstrate that the approach scales to very long horizons, they remain confined to a niche 2D planning domain.
- The paper would benefit from further analysis or ablation studies that clarify how the removal of weight sharing affects network performance and equivariance properties.

补充材料

Yes. Mainly additional results and setup.

与现有文献的关系

N/A

遗漏的重要参考文献

This work also studies scaling up the training of value iteration networks, which is potentially related. It also has some more interesting domains to study.

Scaling up and Stabilizing Differentiable Planning with Implicit Differentiation. ICLR 2023.

其他优缺点

Strengths:
- Tackles an interesting and challenging problem of extending VINs to very long-horizon planning.
- Introduces conceptually appealing modifications (dynamic kernel and adaptive highway loss) that are intuitively motivated.
Weaknesses:
- The approach is limited to a niche domain (2D path planning) and may not generalize well to more realistic problems.
- The relaxation of the invariant (weight-sharing) property is not thoroughly analyzed; the paper lacks convincing insights on how this change impacts translation equivariance.
- Experimental validation remains confined to benchmarks from older literature, rather than exploring potential applications unlocked by long-horizon planning capabilities (e.g., integration with latent perception modules).

其他意见或建议

Consider broadening the experimental evaluation to include more realistic planning scenarios.
Provide additional ablation studies or analysis that clarifies the impact of using a dynamic (non-shared) kernel on equivariance properties.
Elaborate further on the adaptive highway loss to help the reader understand its benefits and limitations.
Discuss potential applications of long-horizon planning beyond pure path planning to motivate the broader relevance of the approach.

作者回复

2025-04-01

We sincerely appreciate Reviewer HD2P's valuable time and constructive comments.

To improve clarity and conciseness, we have reorganized the reviewers’ comments by grouping similar points.

Q1: Experiments are limited to 2D path planning. Evaluating more realistic tasks, integrating visual perception, and discussing broader applications.

Answer:

We thank the reviewer for their thoughtful suggestions and for pointing out [r1]; we will cite and discuss it in the final version. However, we believe there may be a misunderstanding—our submission already includes tasks aligned with [r1] and extends them in several ways:

3D Visual Maze Navigation (Sec. 4.1, Additional Experiments, Appendix C.3): Our setup is identical to that of [r1], with both following established protocols from prior work [r2]. As the reviewer noted, this involves “visual navigation with perception input using additional network heads.” Agents must infer maze layouts from noisy, ambiguous first-person visual views, making the task especially challenging.
Continuous Control Tasks (Sec. 4.2): These involve long-horizon planning with low-level torque control to actuate movement of the agent. While [r1] uses 2-DOF manipulation tasks, we evaluate on a 2-DOF ball and an 8-DOF quadruped robot—sharing the same fundamental task but introducing greater control complexity due to higher dimensionality.
Rover Navigation (Sec. 4.3): This more realistic task, inspired by the original VIN paper but absent in [r1], involves planning over noisy, incomplete aerial terrain images—adding complexity beyond synthetic 2D gridworlds.
Scale: Our benchmarks feature much larger mazes (up to 100×100 vs. up to 49×49 in [r1]), posing harder long-horizon challenges.

These tasks—spanning visual navigation, control, and realistic planning—are already described throughout the original submission (Abstract, Introduction, Experiments). Nonetheless, we will revise the final version and cite [r1] to more clearly highlight their generality and realism.

Extending the VIN architecture to broader applications is a compelling direction for future work—previously barred by the limited planning capacity of VIN-based architectures. However, this work aims to focus on the first step in this of demonstrating the performance gains of DT-VIN over related methods, which we view as a necessary precursor to deployment in more applied domains. Indeed, while we are looking into more applied domains currently, such an investigation is of the scale that would necessitate a separate paper.

[r1] Zhao et al., Scaling up and Stabilizing Differentiable Planning with Implicit Differentiation, ICLR 2023

[r2] Lee et al., Gated Path Planning Networks, ICML 2018

Q2: Concern over the impact of relaxed dynamic kernel weight sharing on translation equivariance and performance.

Answer:

Thank you for the thoughtful and constructive feedback.

While our dynamic transition kernel relaxes the weight sharing, it preserves translation equivariance—i.e., shifting the input (e.g., maze image) results in a corresponding shift in the output. This may seem counterintuitive but is straightforward upon closer inspection. This holds because the kernel $\overline{\mathsf{T}}^{\rm dyn} = f^{\overline{\mathsf{T}}}(x)$ is generated by a CNN, which is itself translation equivariant. Consequently, the downstream value iteration process retains this property (see Eq. 2).

To verify this, we conducted a small experiment comparing DT-VIN, a standard CNN, and VIN on 5,000 maze images with translations of 1 to 10 pixels. All models use stride and pooling of (1,1). We measured the L2 error between the output on a translated input and the translated output of the original. Results in Table R1 show perfect translation equivariance across all models.

Additionally, ablation studies in the original submission (Figs. 6a and 6b) compare dynamic vs. invariant kernels. Dynamic kernels consistently outperform their invariant versions, particularly in long-horizon and obstacle-rich settings—highlighting both practical advantages and robustness.

Table R1: Translation Equivariance — L2 Error

Model	Avg. L2 Error
CNN	0.0
VIN	0.0
DT-VIN (ours)	0.0

Q3: Benefits and limitations of adaptive highway loss.

Answer:

Thank you for the suggestion. The adaptive highway loss avoids the vanishing gradient problem in very deep networks by adding skip connections from intermediate layers to the final loss, guided by the planning trajectory length (computed from training data). This improves gradient flow and stabilizes training (Sec. 4.4, adaptive highway loss).

A potential limitation is computational overhead, which we mitigate by applying the loss every $J$ layers (Eq. 3). This balances efficiency and performance; e.g., $J=10$ significantly accelerating training with minimal performance loss (Appendix F.1).

We will clarify this in the revised manuscript.

最终决定Accept (poster)

2025-05-01

The paper presents a significant advancement in the field of reinforcement learning by extending the capabilities of Value Iteration Networks (VINs) to handle extremely long-term planning tasks. The authors introduce a novel architecture, Dynamic Transition VIN (DT-VIN), which incorporates a dynamic transition kernel and an adaptive highway loss to address the limitations of traditional VINs in terms of representation capacity and depth. The proposed method demonstrates impressive performance across a variety of challenging tasks, including 2D/3D maze navigation, continuous control, and applications like the Lunar rover navigation.

The introduction of a dynamic transition kernel and adaptive highway loss is a well-motivated and innovative solution to the challenges faced by traditional VINs.
The paper provides extensive empirical evidence supporting the effectiveness of DT-VIN, with experiments conducted on a diverse set of tasks and environments, and ablations supporting each component.

The paper could benefit from a more detailed discussion of the trade-offs between performance gains and the increased computational complexity associated with deeper networks. (Partially addressed in rebuttal.)

Clear paper

Overall, I recommend this submission for acceptance at ICML 2025. The paper presents significant contributions to improving VINs for long-term planning tasks through innovative architectural modifications. While there are some concerns regarding clarity and experimental breadth, the authors have adequately addressed these in their rebuttal.

Scaling Value Iteration Networks to 5000 Layers for Extreme Long-Term Planning

摘要

评审与讨论

update after rebuttal

给作者的问题

论据与证据

方法与评估标准

理论论述

实验设计与分析

补充材料

与现有文献的关系

遗漏的重要参考文献

其他优缺点

其他意见或建议

Suggestion 1: The tasks are related to diffusion models for long-term planning.

Answer:

Weakness 1: More discussion on methodology.

Answer:

Weakness 2: More discussion on trade-offs between performance gains and the additional computational costs introduced by using deeper networks.

Answer:

Q1: Whether planning in continuous action mazes is essentially the same as in discrete ones. May the main difference is the controller.

Answer:

Update after Rebuttal

给作者的问题

论据与证据

方法与评估标准

理论论述

实验设计与分析

补充材料

与现有文献的关系

遗漏的重要参考文献

其他优缺点

其他意见或建议

Concern 1: Clarification on state/observation-dynamic in Sec 3.1.

Answer:

Q1: How is suboptimality measured when the success rate is less than 100%?

Answer:

Q2: Why does highway VIN get a better optimality rate with increasing shortest path lengths?

Answer:

Q3: line 260-261 - we examine depths of N=600,5000N = 600, 5000N=600,5000 what do you mean? Do you mean in range [600,5000][600, 5000][600,5000]?

Answer:

Q4: Why is pretraining required in Sec 4.2?

Answer:

Suggestion 1: Minor comments on figure readability and typos.

Answer:

Concern 1-1 (based on Concern 1): Ongoing concern regarding the clarity of Sec. 3.1.

Answer:

Q1-1 (based on Q1): Ongoing question about the suboptimality.

Answer:

给作者的问题

论据与证据

方法与评估标准

理论论述

实验设计与分析

补充材料

与现有文献的关系

遗漏的重要参考文献

其他优缺点

其他意见或建议

Q1: Experiments are limited to 2D path planning. Evaluating more realistic tasks, integrating visual perception, and discussing broader applications.

Answer:

Q2: Concern over the impact of relaxed dynamic kernel weight sharing on translation equivariance and performance.

Answer:

Q3: Benefits and limitations of adaptive highway loss.

Answer:

Q3: line 260-261 - we examine depths of $N = 600, 5000$ what do you mean? Do you mean in range $[600, 5000]$ ?