PaperHub
NeurIPS 2025 · Oral · 4 reviewers
Overall score: 8.2/10 (ratings 5, 5, 5, 5; min 5, max 5, std 0.0)
Average confidence: 4.0
Novelty: 2.8 · Quality: 3.3 · Clarity: 3.5 · Significance: 3.0

1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

While most RL methods use shallow MLPs (~2–5 layers), we show that scaling up to 1000 layers for contrastive RL (CRL) can significantly boost performance, ranging from doubling performance to 50x on a diverse suite of robotic tasks.

Abstract

Scaling up self-supervised learning has driven breakthroughs in language and vision, yet comparable progress has remained elusive in reinforcement learning (RL). In this paper, we study building blocks for self-supervised RL that unlock substantial improvements in scalability, with network depth serving as a critical factor. Whereas most RL papers in recent years have relied on shallow architectures (around 2 -- 5 layers), we demonstrate that increasing the depth up to 1024 layers can significantly boost performance. Our experiments are conducted in an unsupervised goal-conditioned setting, where no demonstrations or rewards are provided, so an agent must explore (from scratch) and learn how to maximize the likelihood of reaching commanded goals. Evaluated on simulated locomotion and manipulation tasks, our approach increases performance on the self-supervised contrastive RL algorithm by $2\times$ -- $50\times$, outperforming other goal-conditioned baselines. Increasing the model depth not only increases success rates but also qualitatively changes the behaviors learned.
Keywords
Reinforcement Learning · Self-Supervised Learning · Contrastive RL · Goal-conditioned RL · Scaling

Reviews and Discussion

Official Review
Rating: 5

This paper explores the potential of deep network architectures in self-supervised reinforcement learning (RL) to enhance scalability. Unlike most recent RL studies that use shallow networks (2-5 layers), the authors show that increasing network depth to up to 1024 layers can significantly improve performance. The experiments are conducted in an unsupervised, goal-conditioned setting without demonstrations or rewards, requiring the agent to explore and learn to reach commanded goals. Tested on simulated locomotion and manipulation tasks, the approach achieves 2x-50x performance improvements over other goal-conditioned baselines, demonstrating that greater depth not only boosts success rates but also qualitatively alters the learned behaviors.

Strengths and Weaknesses

Strengths

  1. The idea is interesting, and the writing is clear and easy to understand.
  2. The authors conducted numerous experiments to demonstrate that increasing the depth of networks is a key factor in improving the performance of CRL.
  3. The authors carried out extensive analysis experiments to explore various issues, including some that the reviewer might be concerned about, for example, whether increasing the depth of networks can enhance the performance of other baseline algorithms. Although the answer is no, combining depth or network-size scaling with other algorithms may be one of the important future directions for RL.

Weaknesses

  1. There is an error in the citation of the appendix at line 173, which shows as "??".
  2. The paper does not mention the training and inference costs of the proposed algorithm. The training and inference costs associated with a network depth of 1000 layers may be relatively high.
  3. The novelty of the paper is somewhat limited. The core techniques involved (residual connections, normalization, activation functions, etc.) are derived from existing research; the main novelty lies in studying the impact of increased model size on performance, rather than in methodological innovation.
  4. The contributions of the article are mainly experimental, with a lack of theoretical discussion on the critical parts of the experiments. For example, why does the performance undergo a sudden change when the depth reaches a certain critical point? This, to some extent, limits the article's insight.

Questions

  1. How do the training costs of the proposed algorithm compare with those of the baselines? Can the authors show it in a table?
  2. Can the authors elaborate empirically or conceptually on why performance “jumps” occur at specific depths or why agents exhibit different behaviors at certain depths (as shown in Figure 3) in certain environments?

Limitations

yes

Final Justification

The authors' response has addressed my concerns, and I will maintain my score.

Formatting Issues

There are no formatting issues in this paper.

Author Response

We thank Reviewer 4F8D for the helpful feedback, questions, and suggestions.

1. How do the training costs of the proposed algorithm compare with those of the baselines? Can authors show it in a table?

Sure! Because the baseline algorithms use standard-sized networks (less than 4 layers), our scaled approach naturally incurs higher raw wall-clock time per environment step (Table 1a). However, for practical deployment the more meaningful metric is the total wall-clock time required to attain a given level of performance. To evaluate this, we compare our method with the strongest baseline, SAC. As shown in Table 1b, our approach surpasses SAC performance in less wall-clock time in 7 out of 10 environments.

2. Can the authors elaborate empirically or conceptually on why performance “jumps” occur at specific depths or why agents exhibit different behaviors at certain depths (as shown in Figure 3) in certain environments?

To elaborate conceptually on why performance “jumps” occur at specific depths, we draw parallels to the well-documented phenomenon of emergent abilities in the broader deep learning literature, where a sudden increase in performance arises at a critical threshold of model scale. Wei et al. 2022 [1] first proposed that large language models exhibit such emergent behavior, with performance transitioning sharply from near-random once a certain threshold in model size is crossed. Although cross-entropy loss improves smoothly with scale, the model only begins to complete tasks once a critical threshold of capability is reached. In our experiments, we observe similar patterns in the Humanoid agent: as the depth increases from 4 to 8 to 16 layers, its ability to locomote improves gradually. However, only upon reaching a critical level of coordination—when it can walk stably without falling—does the agent achieve the goal and remain upright. This manifests as a sharp, discontinuous jump in performance. Wei et al. also remark that emergent abilities often involve qualitatively novel behaviors, such as chain-of-thought reasoning in LLMs, that are absent in smaller models. We observe a parallel in the Humanoid U-Maze environment: only at a very large model scale (i.e., depth 256) does the agent achieve the complex and coordinated movement necessary to successfully vault over the wall. At smaller depths, the agent simply appears to “flounder”, its movements not yet sophisticated enough to complete the maneuver. This behavior manifests as a qualitatively distinct, “emergent” ability and is reflected in the observed performance jump.

3. There is an error in the citation of the appendix at line 173, which shows as "??".

Thank you for pointing that out. This figure comparing our approach with baselines is included as Figure 10 of the Appendix. The issue was due to a typo that resulted in an undefined reference — we have now fixed this in our local version.

4. The paper does not mention the training and inference costs of the proposed algorithm. The training and inference costs associated with a network depth of 1000 layers may be relatively high.

Thanks for this suggestion. In terms of training and inference costs, we report the wall-clock time of our approach as we scale depth (Table 2a, Table 2b). A comparison of wall-clock time against the baselines is discussed in the first paragraph of this response above, and a comparison of wall-clock time for depth vs. width scaling can be found in Table 3a and Table 3b. We also report FLOPs and memory metrics (Table 4, Table 5).

5. For example, why does the performance undergo a sudden change when the depth reaches a certain critical point? This, to some extent, limits the article's insight.

We address this in the second paragraph of this response above.

[TABLE 1a] Wall-clock Time Comparison of Our Method vs. Baselines

| Method | Humanoid | Ant Big Maze | Ant U4-Maze | Ant U5-Maze | Ant Hardest Maze | Arm Push Easy | Arm Push Hard | Arm Binpick Hard | Humanoid U-Maze | Humanoid Big Maze |
|---|---|---|---|---|---|---|---|---|---|---|
| CRL Depth 4 | 1.48 ± 0.00 | 2.12 ± 0.00 | 1.98 ± 0.05 | 9.46 ± 0.28 | 5.11 ± 0.00 | 9.97 ± 0.39 | 9.74 ± 0.40 | 18.41 ± 0.82 | 8.72 ± 0.00 | 12.45 ± 0.00 |
| CRL Depth 32 | 5.92 ± 0.00 | 6.57 ± 0.01 | 6.35 ± 0.00 | 31.49 ± 0.13 | 13.97 ± 0.00 | 14.94 ± 0.88 | 14.40 ± 0.62 | 21.91 ± 0.38 | 26.48 ± 0.02 | 30.61 ± 0.02 |
| CRL Depth 64 | 10.99 ± 0.00 | 11.66 ± 0.01 | 11.43 ± 0.01 | 46.40 ± 0.05 | 23.96 ± 0.03 | 19.52 ± 0.57 | 18.53 ± 0.02 | 29.64 ± 1.30 | 46.74 ± 0.02 | 50.33 ± 0.02 |
| SAC | 0.46 ± 0.00 | 1.55 ± 0.00 | 1.16 ± 0.00 | 5.73 ± 0.01 | 4.33 ± 0.00 | 8.32 ± 0.00 | 8.49 ± 0.04 | 20.70 ± 0.07 | 3.04 ± 0.00 | 8.55 ± 0.01 |
| SAC+HER | 0.55 ± 0.00 | 1.63 ± 0.00 | 1.26 ± 0.00 | 6.10 ± 0.00 | 4.51 ± 0.00 | 8.54 ± 0.03 | 8.60 ± 0.01 | 20.74 ± 0.02 | 3.48 ± 0.00 | 9.27 ± 0.01 |
| TD3 | 0.79 ± 0.01 | 1.68 ± 0.01 | 1.30 ± 0.00 | 6.18 ± 0.01 | 5.04 ± 0.00 | 8.41 ± 0.01 | 8.31 ± 0.07 | 18.40 ± 0.28 | 5.35 ± 0.00 | 7.50 ± 1.14 |
| GCSL | 0.42 ± 0.01 | 1.48 ± 0.26 | 0.70 ± 0.00 | 2.81 ± 0.12 | 2.13 ± 0.57 | 6.37 ± 0.12 | 5.22 ± 0.34 | 7.95 ± 0.90 | 3.07 ± 0.08 | 5.08 ± 0.04 |
| GCBC | 0.61 ± 0.03 | 1.40 ± 0.13 | 1.11 ± 0.08 | 5.56 ± 0.54 | 4.38 ± 0.52 | 8.28 ± 0.34 | 7.36 ± 0.50 | 16.24 ± 0.36 | 7.18 ± 0.78 | 11.37 ± 1.93 |

[TABLE 1b] CRL Surpasses SAC Performance in Less Wall-clock Time in 7 of 10 Environments

| Algorithm | Humanoid | Ant Big Maze | Ant U4-Maze | Ant U5-Maze | Ant Hardest Maze | Arm Push Easy | Arm Push Hard | Arm Binpick Hard | Humanoid U-Maze | Humanoid Big Maze |
|---|---|---|---|---|---|---|---|---|---|---|
| SAC | 27.8 | 92.7 | 69.8 | 343.8 | 259.8 | 499.2 | 509.7 | 1242.1 | 182.6 | 512.9 |
| Scaled CRL (Depth 64) | 382.2 ± 11.3 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 27.2 ± 3.8 | 114.7 ± 9.9 | 398.8 ± 42.1 | 266.0 ± 14.7 | N/A* | N/A* |

*N/A denotes SAC outperforms our approach, so CRL never reaches SAC performance. These account for 2 of the 3 environments where CRL fails to surpass SAC in less wall-clock time.

[Table 2a] Wallclock time of Depth 4, 8, 16, 32, and 64 across all 10 environments

| Depth | Humanoid | Ant Big Maze | Ant U4-Maze | Ant U5-Maze | Ant Hardest Maze | Arm Push Easy | Arm Push Hard | Arm Binpick Hard | Humanoid U-Maze | Humanoid Big Maze |
|---|---|---|---|---|---|---|---|---|---|---|
| Depth 4 | 1.48 ± 0.00 | 2.12 ± 0.00 | 1.98 ± 0.27 | 9.46 ± 1.75 | 5.11 ± 0.00 | 9.97 ± 1.03 | 9.74 ± 1.05 | 18.41 ± 2.16 | 8.72 ± 0.01 | 12.45 ± 0.02 |
| Depth 8 | 2.13 ± 0.01 | 2.77 ± 0.00 | 2.54 ± 0.01 | 10.99 ± 0.02 | 6.39 ± 0.00 | 11.02 ± 1.29 | 10.55 ± 1.20 | 17.48 ± 1.88 | 11.29 ± 0.01 | 15.02 ± 0.01 |
| Depth 16 | 3.40 ± 0.01 | 4.04 ± 0.01 | 3.81 ± 0.01 | 16.09 ± 0.01 | 8.94 ± 0.01 | 12.20 ± 1.43 | 11.98 ± 1.49 | 19.47 ± 0.05 | 16.36 ± 0.03 | 20.34 ± 0.01 |
| Depth 32 | 5.92 ± 0.01 | 6.57 ± 0.02 | 6.35 ± 0.01 | 31.49 ± 0.34 | 13.97 ± 0.01 | 14.94 ± 1.96 | 14.40 ± 1.64 | 21.91 ± 1.93 | 26.48 ± 0.05 | 30.61 ± 0.05 |
| Depth 64 | 10.99 ± 0.01 | 11.66 ± 0.03 | 11.43 ± 0.03 | 46.40 ± 0.12 | 23.96 ± 0.06 | 19.52 ± 1.97 | 18.53 ± 0.06 | 29.64 ± 6.10 | 46.74 ± 0.04 | 50.33 ± 0.05 |

[Table 2b] Wallclock time up to Depth 1024 in the Humanoid U-Maze environment

| Depth | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|---|---|---|---|---|
| Time (hours) | 3.23 ± 0.001 | 4.19 ± 0.003 | 6.07 ± 0.003 | 9.83 ± 0.006 | 17.33 ± 0.003 | 32.67 ± 0.124 | 73.83 ± 2.364 | 120.88 ± 2.177 | 134.15 ± 0.081 |

[TABLE 3a] Wall-clock Time Comparison of Depth vs. Width Scaling

| Width 256 | Width 512 | Width 1024 | Width 2048 | Width 4096 |
|---|---|---|---|---|
| 4.52 ± 0.34 | 5.32 ± 0.01 | 5.96 ± 0.01 | 8.39 ± 0.02 | 15.90 ± 0.64 |

| Depth 4 | Depth 8 | Depth 16 | Depth 32 | Depth 64 |
|---|---|---|---|---|
| 4.52 ± 0.34 | 5.31 ± 0.43 | 6.55 ± 0.48 | 9.14 ± 0.65 | 14.06 ± 0.66 |

[TABLE 3b] Scaling Depth Surpasses Performance of Scaling Width in Less Wall-clock time

| Environment | Width = 4096 | Depth = 64 |
|---|---|---|
| Humanoid | 22.30 ± 0.08 | 5.66 ± 0.24 |
| Ant Big Maze | 11.70 ± 0.03 | 4.04 ± 0.41 |
| Arm Push Easy | 13.69 ± 1.92 | 2.38 ± 0.18 |

[TABLE 4] Comparison of Width vs. Depth Scaling: FLOPs

| Width | 256 | 512 | 1024 | 2048 | 4096 |
|---|---|---|---|---|---|
| FLOPs (B) | 4.8 | 17.4 | 66.9 | 262.4 | 1039.9 |

| Depth | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|
| FLOPs (B) | 4.8 | 8.9 | 17.1 | 33.5 | 66.2 |

[TABLE 5] Comparison of Width vs. Depth Scaling: Memory

| Width | 256 | 512 | 1024 | 2048 | 4096 |
|---|---|---|---|---|---|
| Memory (MB) | 24.8 | 63.4 | 194.7 | 673.3 | 2494.6 |

| Depth | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|
| Memory (MB) | 24.8 | 43.8 | 81.9 | 158.0 | 310.3 |

[1] Wei, J., et al., NeurIPS 2022, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models https://arxiv.org/abs/2201.11903

Comment

The authors' response has addressed my concerns, and I will maintain my score.

Official Review
Rating: 5

The paper investigates scaling the depth of the networks used in self-supervised reinforcement learning for goal-reaching tasks. The work shows that scaling depth can significantly boost the performance of Contrastive RL in the JaxGCRL suite of online RL tasks. The authors also extensively analyze their design choices and provide qualitative analyses of the behavior and the value functions learned.

Strengths and Weaknesses

Strengths

  • Strong empirical analysis: the empirical analysis of the approach is extensive and I found it particularly useful. It clearly shows how scaling depth can significantly improve performance for Contrastive RL. Beyond that, it also provides analyses of scaling width vs. depth, the usefulness of residual layers, scaling the actor vs. the critic network, etc.
  • New SOTA on JaxGCRL: the results obtained with scaling depth and data are impressive. Also, improving the state-of-the-art in a benchmark is useful to further push the boundaries of research on that benchmark and in the field.

Weaknesses

  • Limited generality of the claims: my main concern with this work is that the idea of scaling depth seems to apply to a very specific algorithm, i.e. contrastive RL, and in very specific settings, i.e. online RL. I appreciate the authors' attempts at testing their method in offline RL settings and using CMD-1 (in the Appendix). However, the generality of their claims remains limited so far.

Questions

  • How is training time affected as you scale depth?

Also, I leave here some questions/ideas for the authors to test outside of their contrastive RL + online RL settings:

  • Given that simply scaling depth in the actor seems beneficial, would scaling depth be useful for BC and GCBC in Offline RL settings?
  • I understand the authors provide a study on CMD-1 in the Appendix, which uses the MRN network for quasimetric learning. It would be useful to also study QRL [1], which instead uses IQE for quasimetric learning.

Minor note:

  • the reference at Line 173 is undefined (probably due to the Figure being in the Appendix)

[1] Optimal Goal-Reaching Reinforcement Learning via Quasimetric Learning

Limitations

The authors have adequately addressed limitations in their work.

Final Justification

The work presents depth scaling as a successful technique to scale (self-supervised) RL. The findings seem to be mostly limited to the CRL approach (and to GCBC, as shown in some experiments posted for the rebuttal). However, the work remains an interesting case study and the authors provide an insightful analysis, which I think will be useful to the community.

Formatting Issues

No paper formatting concerns.

Author Response

We thank Reviewer kTrp for the helpful feedback, questions, and suggestions.

1. How is training time affected as you scale depth?

We report the wall-clock metrics of our approach as we scale depth in Table 1. We also conduct a comparison to baselines: since the baselines use standard-sized networks, our scaled approach naturally incurs higher raw wall-clock time per environment step (Table 2a). For practical deployment, however, the more meaningful metric is often the wall-clock time to attain a given level of performance. In Table 2b, we compare against the strongest baseline, SAC, and show that in 7 out of 10 environments our approach surpasses SAC performance in less wall-clock time.

2. Limited generality of the claims: my main concern with this work is that the idea of scaling depth seems to apply to a very specific algorithm, i.e. contrastive RL, and in very specific settings, i.e. online RL.

To address this concern, we report a new experimental result: we find that our approach successfully scales depth on a different algorithm (GCBC) in a different setting (offline RL). In Table 3a, we find that GCBC successfully scales on the antmaze-medium-stitch task in OGBench [1] when using our architecture of layer normalization, residual connections, and swish activations. In Table 3b, we show that using our architecture is critical to enabling scaling in GCBC, as scaling the standard architecture degrades performance. These results suggest that our approach to depth scaling may be adapted to other algorithms, as well as the offline setting. Future work could further investigate what algorithms and architectural components best enable depth scaling in offline RL.

3. Given that simply scaling depth in the actor seems beneficial, would scaling depth be useful for BC and GCBC in Offline RL settings?

Thanks for this suggestion! For GCBC, we have discussed the results in the response above. We also attempt to scale depth for standard BC in OGBench [1]. However, in this case we find negative results (Table 4).

4. I understand the authors provide a study on CMD-1 in the Appendix, which uses the MRN network for quasimetric learning. It would be useful to also study QRL [1], which instead uses IQE for quasimetric learning

We attempt to scale depth for QRL in both the online and offline settings, using the original QRL codebase [2] and OGBench [1] respectively. However, in both settings, we find that while QRL is able to train stably with deep networks, deep networks don’t improve performance beyond that of the shallow networks. Table 5a and Table 5b show these results.

[Table 1a] Wallclock time of Depth 4, 8, 16, 32, and 64 across all 10 environments

| Depth | Humanoid | Ant Big Maze | Ant U4-Maze | Ant U5-Maze | Ant Hardest Maze | Arm Push Easy | Arm Push Hard | Arm Binpick Hard | Humanoid U-Maze | Humanoid Big Maze |
|---|---|---|---|---|---|---|---|---|---|---|
| Depth 4 | 1.48 ± 0.00 | 2.12 ± 0.00 | 1.98 ± 0.27 | 9.46 ± 1.75 | 5.11 ± 0.00 | 9.97 ± 1.03 | 9.74 ± 1.05 | 18.41 ± 2.16 | 8.72 ± 0.01 | 12.45 ± 0.02 |
| Depth 8 | 2.13 ± 0.01 | 2.77 ± 0.00 | 2.54 ± 0.01 | 10.99 ± 0.02 | 6.39 ± 0.00 | 11.02 ± 1.29 | 10.55 ± 1.20 | 17.48 ± 1.88 | 11.29 ± 0.01 | 15.02 ± 0.01 |
| Depth 16 | 3.40 ± 0.01 | 4.04 ± 0.01 | 3.81 ± 0.01 | 16.09 ± 0.01 | 8.94 ± 0.01 | 12.20 ± 1.43 | 11.98 ± 1.49 | 19.47 ± 0.05 | 16.36 ± 0.03 | 20.34 ± 0.01 |
| Depth 32 | 5.92 ± 0.01 | 6.57 ± 0.02 | 6.35 ± 0.01 | 31.49 ± 0.34 | 13.97 ± 0.01 | 14.94 ± 1.96 | 14.40 ± 1.64 | 21.91 ± 1.93 | 26.48 ± 0.05 | 30.61 ± 0.05 |
| Depth 64 | 10.99 ± 0.01 | 11.66 ± 0.03 | 11.43 ± 0.03 | 46.40 ± 0.12 | 23.96 ± 0.06 | 19.52 ± 1.97 | 18.53 ± 0.06 | 29.64 ± 6.10 | 46.74 ± 0.04 | 50.33 ± 0.05 |

[Table 1b] Wallclock time up to Depth 1024 in the Humanoid U-Maze environment

| Depth | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|---|---|---|---|---|
| Time (hours) | 3.23 ± 0.001 | 4.19 ± 0.003 | 6.07 ± 0.003 | 9.83 ± 0.006 | 17.33 ± 0.003 | 32.67 ± 0.124 | 73.83 ± 2.364 | 120.88 ± 2.177 | 134.15 ± 0.081 |

[TABLE 2a] Wall-clock Time Comparison of Our Method vs. Baselines

| Method | Humanoid | Ant Big Maze | Ant U4-Maze | Ant U5-Maze | Ant Hardest Maze | Arm Push Easy | Arm Push Hard | Arm Binpick Hard | Humanoid U-Maze | Humanoid Big Maze |
|---|---|---|---|---|---|---|---|---|---|---|
| Scaled CRL (Depth 64) | 10.99 ± 0.00 | 11.66 ± 0.01 | 11.43 ± 0.01 | 46.40 ± 0.05 | 23.96 ± 0.03 | 19.52 ± 0.57 | 18.53 ± 0.02 | 29.64 ± 1.30 | 46.74 ± 0.02 | 50.33 ± 0.02 |
| SAC | 0.46 ± 0.00 | 1.55 ± 0.00 | 1.16 ± 0.00 | 5.73 ± 0.01 | 4.33 ± 0.00 | 8.32 ± 0.00 | 8.49 ± 0.04 | 20.70 ± 0.07 | 3.04 ± 0.00 | 8.55 ± 0.01 |
| SAC+HER | 0.55 ± 0.00 | 1.63 ± 0.00 | 1.26 ± 0.00 | 6.10 ± 0.00 | 4.51 ± 0.00 | 8.54 ± 0.03 | 8.60 ± 0.01 | 20.74 ± 0.02 | 3.48 ± 0.00 | 9.27 ± 0.01 |
| TD3 | 0.79 ± 0.01 | 1.68 ± 0.01 | 1.30 ± 0.00 | 6.18 ± 0.01 | 5.04 ± 0.00 | 8.41 ± 0.01 | 8.31 ± 0.07 | 18.40 ± 0.28 | 5.35 ± 0.00 | 7.50 ± 1.14 |
| GCSL | 0.42 ± 0.01 | 1.48 ± 0.26 | 0.70 ± 0.00 | 2.81 ± 0.12 | 2.13 ± 0.57 | 6.37 ± 0.12 | 5.22 ± 0.34 | 7.95 ± 0.90 | 3.07 ± 0.08 | 5.08 ± 0.04 |
| GCBC | 0.61 ± 0.03 | 1.40 ± 0.13 | 1.11 ± 0.08 | 5.56 ± 0.54 | 4.38 ± 0.52 | 8.28 ± 0.34 | 7.36 ± 0.50 | 16.24 ± 0.36 | 7.18 ± 0.78 | 11.37 ± 1.93 |

[TABLE 2b] CRL Surpasses SAC Performance in Less Wall-clock Time in 7 of 10 Environments

| Algorithm | Humanoid | Ant Big Maze | Ant U4-Maze | Ant U5-Maze | Ant Hardest Maze | Arm Push Easy | Arm Push Hard | Arm Binpick Hard | Humanoid U-Maze | Humanoid Big Maze |
|---|---|---|---|---|---|---|---|---|---|---|
| SAC | 27.8 | 92.7 | 69.8 | 343.8 | 259.8 | 499.2 | 509.7 | 1242.1 | 182.6 | 512.9 |
| Scaled CRL (Depth 64) | 382.2 ± 11.3 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 27.2 ± 3.8 | 114.7 ± 9.9 | 398.8 ± 42.1 | 266.0 ± 14.7 | N/A* | N/A* |

*N/A denotes SAC outperforms our approach, so CRL never reaches SAC performance. These account for 2 of the 3 environments where CRL fails to surpass SAC in less wall-clock time.

[TABLE 3a] Depth Scaling on GCBC (Antmaze-medium-stitch, OGBench)

| Depth | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|
| Performance | 0.474 ± 0.11 | 0.572 ± 0.04 | 0.574 ± 0.02 | 0.588 ± 0.00 | 0.634 ± 0.05 |

[TABLE 3b] Our Architecture is Critical: Depth Scaling on GCBC Fails with Standard Architecture

| Depth | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|
| Performance | 0.532 ± 0.088 | 0.746 ± 0.014 | 0.692 ± 0.024 | 0.698 ± 0.022 | 0.210 ± 0.210 |

[Table 4] Depth Scaling on BC (Antmaze-giant-navigate on expert SAC policy, OGBench)

| Depth | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|
| Performance | 0.696 ± 0.00 | 0.782 ± 0.03 | 0.348 ± 0.00 | 0.088 ± 0.02 | 0.123 ± 0.04 |

[Table 5a] Depth Scaling on Online QRL (FetchPush task on original QRL codebase)

| Depth | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|
| Performance | 0.160 ± 0.04 | 0.200 ± 0.00 | 0.180 ± 0.00 | 0.260 ± 0.10 | 0.250 ± 0.01 |

[Table 5b] Depth Scaling on Offline QRL (Pointmaze-giant-stitch, OGBench)

| Depth | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|
| Performance | 0.570 ± 0.02 | 0.495 ± 0.12 | 0.450 ± 0.00 | 0.600 ± 0.00 | 0.588 ± 0.00 |

[1] Park, S., et al., ICLR 2025, OGBench: Benchmarking Offline Goal-Conditioned RL, https://arxiv.org/abs/2410.20092.
[2] Optimal Goal-Reaching Reinforcement Learning via Quasimetric Learning

Comment

I thank the authors for providing additional insights about the training times of their experiments and the possibility of applying the proposed technique to other approaches, i.e. GCBC and QRL. I encourage them to add these experiments to the paper (even in Appendix) as I believe they will provide useful insights to future readers too.

I will keep my positive judgement of this paper: I think it should be accepted to the conference.

Comment

We sincerely thank the reviewer for their positive assessment of our work and for providing concrete recommendations for additional experiments. As suggested, we will incorporate these experiments and their insights into the camera-ready version of the paper.

Official Review
Rating: 5

The authors show that depth-scaling properties arise in contrastive RL, whereas width scaling is prominent in the recent vanilla RL literature [1,2]. They propose to stack neural network blocks built from residual connections, normalization (LayerNorm), and Swish activations. By increasing the number of blocks, the authors find that goal-reaching capabilities emerge after a certain threshold, shown both by an increased performance measure (Figure 1) and by novel behavior (Figure 3). In their analysis, they find that depth scaling improves contrastive representations, enhances exploration, and allows trajectory stitching.

[1] SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning., Lee et al.

[2] Bigger, Regularized, Categorical: High-Capacity Value Functions are Efficient Multi-Task Learners., Nauman et al.

Strengths and Weaknesses

Strengths

  • Strong performance and clear depth-scaling tendency to support their claims.
  • In-depth analysis on the architecture and policy as depth scales.
  • Verification of contrastive RL algorithm compared to baseline methods.

Weaknesses

  • The experiments are limited to one type of architecture. Whether the same trend would occur in other well-designed architectures in RL [1,2,3] is unclear.
  • On a similar note, insufficient analysis was made of the design choices of the architecture, other than the residual connection.
  • The connection between depth and exploration is not well explained. The results presented in figure 13 could also be due to improved learning capability of the collector.
  • Due to the model's sequential nature, a depth-scaled model is likely to take more time to generate output than a width-scaled model with the same number of parameters. The claim that depth scaling is more compute efficient than width scaling (lines 225-229) should thus be accompanied by a comparison of wall-clock time.
  • The fact that their architecture does not scale in offline GCRL weakens their point.

Overall, I think the paper provides interesting and useful knowledge to the community, although there are several issues that need to be refined.

[1] SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning., Lee et al.

[2] Hyperspherical Normalization for Scalable Deep Reinforcement Learning., Lee et al.

[3] Bigger, Regularized, Optimistic: scaling for compute and sample-efficient continuous control., Nauman et al.

Questions

Please see weaknesses above.

Limitations

Yes

Final Justification

The concerns I have raised have been addressed with additional experiment; I have raised my score from 4 to 5 accordingly. I believe this paper nicely reveals the scaling potential of GCRL compared to standard RL settings.

Formatting Issues

None.

Author Response

We thank Reviewer 5GqS for the helpful feedback, questions, and suggestions.

1. The experiments are limited to one type of architecture. Whether the same trend would occur in other well-designed architectures in RL [1,2,3] is unsure.

To address this concern, we have run a new experiment: we find that the same trend occurs when we apply our approach to another well-designed architecture for scaling RL, Simba-v2 [1]. The key innovation in Simba-v2 is replacing layer norm with hyperspherical normalization, which projects network weights onto the unit-norm hypersphere after each gradient update. Table 1 shows that the same depth-scaling trend indeed holds when we add hyperspherical normalization to our architecture, and in fact it improves the sample efficiency of depth scaling. This suggests that our method is able to integrate new architectural innovations from the emerging RL scaling literature.

2. On a similar note, no sufficient analysis was made on the design choices of the architecture, other than the residual connection.

As suggested by the reviewer, we report new ablation experiments for the other design choices of architecture: namely, layer norm and swish activation. We show that both of these design choices are critical for effective scaling:

  • In Table 2a, we show that removing layer normalization significantly hampers scalability.
  • Additionally, Table 2b shows that scaling with ReLU activations performs significantly worse than Swish activations.

These results, along with Figure 5 of the manuscript, show that all of our architectural components—residual connections, layer norm, and swish activations—are jointly essential to unlocking the full performance of depth scaling.
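
For illustration, a minimal Flax sketch of a residual block combining these three components, and a deep stack of such blocks, is shown below. The exact block composition used here (pre-norm placement, number of Dense layers per block, hidden width) is an assumption for illustration and may differ from the implementation in the paper.

```python
# Sketch of a pre-norm residual MLP block (LayerNorm + Dense + Swish) and a deep
# stack of such blocks. Block composition and norm placement are illustrative
# assumptions, not necessarily the exact architecture used in the paper.
import flax.linen as nn


class ResidualBlock(nn.Module):
    width: int

    @nn.compact
    def __call__(self, x):
        h = nn.LayerNorm()(x)          # normalize before the transformation (pre-norm)
        h = nn.Dense(self.width)(h)
        h = nn.swish(h)                # Swish activation
        h = nn.Dense(self.width)(h)
        return x + h                   # residual connection keeps gradients flowing


class DeepMLP(nn.Module):
    width: int = 256
    depth: int = 64                    # number of residual blocks to stack

    @nn.compact
    def __call__(self, x):
        x = nn.Dense(self.width)(x)    # project the input to the working width
        for _ in range(self.depth):
            x = ResidualBlock(self.width)(x)
        return nn.LayerNorm()(x)       # final norm before an output head
```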

3. The connection between depth and exploration is not well explained. The results presented in figure 13 could also be due to improved learning capability of the collector.

To clarify, by exploration we specifically mean the breadth or coverage of states visited. Shallow networks, due to their limited learning capacity, tend to explore only states near their initializations, whereas deep networks, with their greater learning capacity, learn effective long-horizon behaviors, which empirically results in much broader state coverage. Figure 13's shallow-collector experiments illustrate that improved learning capacity alone is insufficient when data diversity or coverage is poor. Conversely, Figure 13's deep-collector experiments show that good data coverage alone, without adequate learning capacity, also falls short. Thus, rather than viewing learning capacity and exploration as diametrically opposed, they mutually reinforce each other: stronger learning capacity drives more extensive exploration, and rich data coverage is essential to fully realize the potential of stronger learning capacity. Both aspects jointly contribute to improved performance.

4. Due to the model's sequential nature, a depth-scaled model is likely to take more time to generate output than a width-scaled model with the same number of parameters. The claim that depth scaling is more compute efficient than width scaling (lines 225-229) should thus be accompanied by a comparison of wall-clock time.

In Table 3a, we report the average wall-clock time for scaled width and depth, and find that width and depth empirically scale quite similarly in terms of wall-clock time. For practical deployment, the relevant metric is often wall-clock time to reach a given level of performance. Table 3b shows that scaling depth surpasses the performance of scaling width in less wall-clock time in all three environments.

5. The fact that their architecture does not scale in offline GCRL weakens their point.

Although we found negative results for scaling CRL in the offline setting, we now report new experiments demonstrating that our approach scales effectively to another widely used offline algorithm, GCBC. In Table 4a, we find that GCBC successfully scales with depth on the antmaze-medium-stitch task in OGBench [2] when using our architecture of layer normalization, residual connections, and swish activations. In Table 4b, we show that using our architecture is critical to enabling this scaling in GCBC, as scaling the standard architecture degrades performance. These results suggest that our approach may be adapted to scaling the offline RL setting. Future work could further investigate what algorithms and architectural components best enable depth scaling in offline RL.

[TABLE 1] Hyperspherical Normalization (Simba-v2) Improves Sample Efficiency of Depth Scaling

Steps to reach ≥ 200 success

| Depth | With | Without |
|---|---|---|
| 4 | | |
| 16 | 50 | 64 |
| 32 | 42 | 54 |

Steps to reach ≥ 400 success

| Depth | With | Without |
|---|---|---|
| 4 | | |
| 16 | 62 | 75 |
| 32 | 48 | 64 |

Steps to reach ≥ 600 success

| Depth | With | Without |
|---|---|---|
| 4 | | |
| 16 | 77 | |
| 32 | 67 | 77 |

[TABLE 2a] Layer Norm Ablation

| Depth | With Layer Norm | Without Layer Norm |
|---|---|---|
| 4 | 13.43 ± 4.40 | 6.58 ± 0.88 |
| 16 | 350.17 ± 44.14 | 13.01 ± 5.23 |
| 32 | 628.15 ± 41.00 | 12.34 ± 3.55 |
| 64 | 672.56 ± 40.01 | 12.94 ± 2.66 |

[TABLE 2b] Swish vs. ReLU Ablation

| Depth | Swish | ReLU |
|---|---|---|
| 4 | 13.43 ± 4.40 | 15.09 ± 1.15 |
| 16 | 350.17 ± 44.14 | 38.76 ± 6.99 |
| 32 | 628.15 ± 41.00 | 46.62 ± 6.81 |
| 64 | 672.56 ± 40.01 | 185.25 ± 107.40 |

[TABLE 3a] Wall-clock Time Comparison of Depth vs. Width Scaling

| Width 256 | Width 512 | Width 1024 | Width 2048 | Width 4096 |
|---|---|---|---|---|
| 4.52 ± 0.34 | 5.32 ± 0.01 | 5.96 ± 0.01 | 8.39 ± 0.02 | 15.90 ± 0.64 |

| Depth 4 | Depth 8 | Depth 16 | Depth 32 | Depth 64 |
|---|---|---|---|---|
| 4.52 ± 0.34 | 5.31 ± 0.43 | 6.55 ± 0.48 | 9.14 ± 0.65 | 14.06 ± 0.66 |

[TABLE 3b] Scaling Depth Surpasses Performance of Scaling Width in Less Wall-clock time

| Environment | Width = 4096 | Depth = 64 |
|---|---|---|
| Humanoid | 22.30 ± 0.08 | 5.66 ± 0.24 |
| Ant Big Maze | 11.70 ± 0.03 | 4.04 ± 0.41 |
| Arm Push Easy | 13.69 ± 1.92 | 2.38 ± 0.18 |

[TABLE 4a] Successful Depth Scaling on GCBC (Antmaze-medium-stitch, OGBench)

| Depth | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|
| Performance | 0.474 ± 0.11 | 0.572 ± 0.04 | 0.574 ± 0.02 | 0.588 ± 0.00 | 0.634 ± 0.05 |

[TABLE 4b] Our Architecture is Critical: Depth Scaling on GCBC Fails with Standard Architecture

| Depth | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|
| Performance | 0.532 ± 0.088 | 0.746 ± 0.014 | 0.692 ± 0.024 | 0.698 ± 0.022 | 0.210 ± 0.210 |

[1] Lee, H., et al., ICML 2025. Hyperspherical Normalization for Scalable Deep Reinforcement Learning. http://arxiv.org/abs/2502.15280.
[2] Park, S., et al., ICLR 2025, OGBench: Benchmarking Offline Goal-Conditioned RL, https://arxiv.org/abs/2410.20092

Comment

Thank you for the response. I find the new results really interesting and helpful. I will update my score accordingly after the discussion.

That said, I am a bit confused about the results on A4. The first sentence and Table 3a claim that width and depth have similar scaling tendencies, but Table 3b seems to tell otherwise. Why are the numbers so different between 3a and 3b? What environments were used for 3a?

Comment

We appreciate the reviewer finds our results interesting and helpful! Below, we address the reviewer's question by clarifying Table 3a and Table 3b.

The first sentence and Table 3a claim that width and depth have similar scaling tendencies, but Table 3b seems to tell otherwise. Why are the numbers so different between 3a and 3b?

To clarify, the first sentence and Table 3a indicate that scaling depth and scaling width have similar effects on raw wall-clock time, i.e., doubling depth increases the wall-clock time similarly to doubling width. This observation is purely about runtime and does not consider model performance.

In practice, what we really care about is the wall-clock time to reach a target level of performance, which reflects how efficient each scaling strategy is. This is what Table 3b compares. Table 3b shows that scaling depth is significantly more efficient than scaling width: it achieves the same performance in less wall-clock time.

What environments were used for 3a?

We use the same three environments: Humanoid, Ant Big Maze, Arm Push Easy. Table 3a reports the average wall‑clock time across these environments.

Official Review
Rating: 5

This paper investigates scaling network depth in self-supervised reinforcement learning, specifically using the Contrastive RL (CRL) algorithm for goal-conditioned tasks. The authors demonstrate that increasing network depth up to 1024 layers can yield substantial performance improvements in unsupervised goal-reaching tasks. Their approach combines three key components: (1) self-supervised contrastive RL that learns without demonstrations or rewards, (2) GPU-accelerated training frameworks for increased data throughput, and (3) very deep residual networks with architectural stabilization techniques. The experiments are conducted on locomotion, navigation, and manipulation tasks from the JaxGCRL benchmark, showing 2-50× performance improvements over shallow networks typically used in RL (2-5 layers). The paper provides extensive analysis of why depth scaling works, including improved exploration, better representational capacity, and emergent qualitatively different behaviors at critical depth thresholds.

Strengths and Weaknesses

Strengths:

  1. The paper provides a clearly written, comprehensive study showing that depth scaling (up to 1024 layers) can dramatically improve RL performance, contrary to conventional wisdom.
  2. The paper shows strong empirical evidence, by demonstrating consistent improvements across diverse tasks with proper statistical reporting (error bars across multiple seeds), across 10 diverse environments spanning locomotion, navigation, and manipulation.
  3. I appreciate the thorough analysis through visualization of Q-functions, learned representations, and emergent behaviors (wall vaulting, creative navigation).
  4. The approach provides effective adaptation of residual connections, layer normalization, and Swish activation for very deep RL networks.
  5. Good set of ablations are provided: width vs. depth, actor vs. critic scaling, batch size effects, and generalization capabilities.

Weaknesses:

  1. It seems that the scaling benefits appear specific to CRL. The traditional TD methods (SAC, TD3) show no improvement or even show degradation with depth.
  2. The preliminary experiments show no benefits in offline settings, limiting applicability.
  3. The experiments are specific to the unsupervised RL setting. It is unclear if the benefits may transfer to reward-rich (dense) or imitation learning settings.
  4. The paper would benefit from detailed analysis of computational overhead compared to baselines.
  5. As acknowledged by the authors, the approach requires substantial computational resources that may limit reproducibility.

Questions

  1. Can you provide any intuitive explanation of why the scaling benefits appear specific to CRL?
  2. Could you provide a detailed computational cost analysis comparing depth vs. width scaling? What are the FLOPs, memory, and wall-clock time trade-offs? How does this inform practical deployment decisions?
  3. In Figure 1, the performance of these approaches increases with depth, and then seems to decrease when depth is increased further. What is the reason for such behavior? Is this pattern consistent with further depth experiments? If yes, do you have any thoughts on finding the critical depth for particular tasks?
  4. In line 173, the figure that compares the proposed approach with SAC, SAC+HER, TD3+HER, and CRL seems to be missing?

Limitations

Please refer to the weaknesses section.

Final Justification

The authors' response has addressed all my concerns. The authors properly answered all the raised questions, and also provided the required resource information in the form of four tables, which satisfied my initial concerns. Therefore, I will maintain my high judgement score for the paper.

Formatting Issues

None.

Author Response

We thank Reviewer gFHA for the helpful feedback, questions, and suggestions.

Question 1: Can you provide any intuitive explanation to why the scaling benefits appear specific to CRL?

Sure! The key intuitive reason for scaling benefits specific to CRL lies in how it redefines the core learning challenge of RL. Traditional RL faces several well-known scalability challenges: regression to noisy and recursively bootstrapped Q-value targets, gradient instability, plasticity loss, amplifying overestimation biases, etc. In contrast, CRL fundamentally shifts the “heavy lifting” of the RL learning problem from recursively estimating fluctuating Q-values to addressing a simple classification problem: is the sampled future state likely to be along the same trajectory as my current state, or along a different one? This transforms the unstable RL problem into a supervised learning classification task based on cross-entropy loss—a setup known to scale exceptionally well in deep learning (and recently also in RL, see Farebrother et al., 2024 [4]). Once CRL scalably learns good contrastive representations through this supervised objective, the remaining learning burden on the actor becomes straightforward: at any given state, simply greedily choose the action that yields the representation most similar to the goal state.
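
To make this classification view concrete, here is a minimal InfoNCE-style sketch of such a contrastive critic objective. The encoder names, the batching scheme, and the particular cross-entropy variant shown are illustrative assumptions and may differ from the exact CRL objective used in the paper.

```python
# Contrastive critic objective viewed as classification: for each (state, action)
# row, the "correct class" is the future state sampled from the same trajectory
# (the diagonal of the similarity matrix); the other rows act as negatives.
import jax.numpy as jnp
import optax


def contrastive_critic_loss(phi_sa: jnp.ndarray, psi_g: jnp.ndarray) -> jnp.ndarray:
    # phi_sa: (B, d) embeddings of (state, action) pairs
    # psi_g:  (B, d) embeddings of future/goal states from the same trajectories
    logits = phi_sa @ psi_g.T                    # (B, B) similarity matrix
    labels = jnp.arange(logits.shape[0])         # positives sit on the diagonal
    return optax.softmax_cross_entropy_with_integer_labels(logits, labels).mean()
```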

Question 2: Could you provide detailed computational cost analysis comparing depth vs. width scaling? What are the FLOPS, memory, and wall-clock time trade-offs? How does this inform practical deployment decisions?

In Table 1a, we show that width and depth empirically scale similarly in terms of wall-clock time. For practical deployment, the relevant metric is wall-clock time to reach a given performance. Table 1b shows that scaling depth surpasses the performance of scaling width in less wall-clock time in all three environments. In terms of FLOPs and memory, width scales quadratically while depth scales linearly, making depth more efficient in these metrics as well. Thus, for practical deployment we recommend scaling depth in almost all cases.
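
As a rough illustration of why these costs diverge, consider a plain MLP forward pass, where each hidden layer costs on the order of width-squared multiply-adds. The sketch below ignores input/output projections, batch size, and the actor/critic split, so it is only illustrative and is not meant to reproduce the exact numbers in Tables 4 and 5.

```python
# Back-of-the-envelope forward-pass cost of a plain MLP: FLOPs grow ~linearly in
# depth but ~quadratically in width. Illustrative only; ignores input/output
# layers, batching, and the actor/critic split.
def mlp_forward_flops(depth: int, width: int) -> int:
    return 2 * depth * width ** 2      # ~2 * width^2 multiply-adds per hidden layer


base = mlp_forward_flops(depth=4, width=256)
print(mlp_forward_flops(depth=64, width=256) / base)   # 16.0  -> deepening scales linearly
print(mlp_forward_flops(depth=4, width=4096) / base)   # 256.0 -> widening scales quadratically
```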

Question 3: In Figure 1, the performance of these approaches increases with depth, and then seems to decrease when depth is increased further. What is the reason for such behavior? Do you have any thoughts on finding the critical depth for particular tasks?

Even with techniques like residual connections and layer norm, training very deep networks is known to be tricky—exploding gradients, representation staleness, plasticity loss, etc. This is likely why many RL scaling efforts show results on width scaling but not depth. In our experiments, for tasks where degradation does occur, performance has already scaled to near-optimal. Thus, we hypothesize that additional depth yields diminishing returns while sometimes increasing the risk of instability. In these cases, to answer the question on identifying critical depth:

  • Start with 32 layers: Across all 10 environments, we consistently observe stable and improving performance up to depth 32.
  • Scale deeper if needed: For more complex tasks, especially those involving locomotion or high-dimensional control (e.g., Humanoid environments), we observe stable scaling up to 1000 layers.
  • Be conservative with manipulation tasks: These tasks tend to be more sensitive to over-scaling and performance degradation at higher depths.

Question 4: In line 173, the figure seems to be missing?

This Figure is in the Appendix (Figure 10). Thanks for catching the undefined reference.

Weakness 1: It seems that the scaling benefits appear specific to CRL. The traditional TD methods (SAC, TD3) show no improvement or even show degradation with depth.

This is a known result from prior work that traditional TD methods do not scale well with depth [5, 2, 3]. As such, rather than viewing this as a limitation, we see it as a point of strength/novelty of our work: our results demonstrate that the benefits of scaling depth can indeed be unlocked in RL. Previous efforts to scale RL have largely focused on network width [5,2], with depth scaling either limited (e.g., up to 8 residual blocks in [3]) or harmful [5, 2]. Our approach unlocks the ability to scale depth, yielding performance improvements greater than width alone.

Weakness 2: The preliminary experiments show no benefits in offline settings, limiting applicability.

Although we found negative results for scaling CRL in the offline setting, we now report new experiments demonstrating that our approach scales effectively to another widely used offline algorithm, GCBC. In Table 2a, we find that GCBC successfully scales with depth on the antmaze-medium-stitch task in OGBench [1] when using our architecture of layer normalization, residual connections, and swish activations. In Table 2b, we show that using our architecture is critical to enabling scaling in GCBC, as scaling the standard architecture degrades performance. These results suggest that our approach may be adapted to scaling the offline RL setting. Future work could further investigate what algorithms and architectural components best enable depth scaling in offline RL.

Weakness 3: The experiments are specific to unsupervised RL setting. It is unclear if the benefits may transfer to reward-rich (dense) or imitation learning settings.

CRL is an unsupervised RL algorithm; as such, we focus on this setting as the primary scope of our paper. One potentially interesting direction for future research: although CRL itself is unsupervised, its self-supervised contrastive representations can be learned on any set of collected trajectories, including those of a reward-based RL algorithm. Prior work has explored integrating such representations via, for example, auxiliary losses or planning [6]. Our analysis of visualized Q-functions and learned representations gives confidence that increasingly sophisticated representations can be learned through depth scaling. We encourage future work to investigate whether these scaling benefits can thus be integrated into reward-based RL.

Weakness 4: The paper would benefit from detailed analysis of computational overhead compared to baselines.

Because the baselines use standard-size networks, our scaled approach naturally incurs higher wall-clock time per environment step at a raw computation level (Table 3a). However, as with the width-vs-depth analysis, the more meaningful metric is the total wall-clock time required to reach a given performance level. To evaluate this, we compare our method with the strongest baseline, SAC. As shown in Table 3b, our approach surpasses SAC performance in less wall-clock time in 7 out of 10 environments.

Weakness 5: The approach requires substantial computational resources that may limit reproducibility.

In fact, all of our experiments, including those with 1000 layer networks, can be run on a single 80GB A100!

[TABLE 1a] Wall-clock Time Comparison of Depth vs. Width Scaling

| Width 256 | Width 512 | Width 1024 | Width 2048 | Width 4096 |
|---|---|---|---|---|
| 4.52 ± 0.34 | 5.32 ± 0.01 | 5.96 ± 0.01 | 8.39 ± 0.02 | 15.90 ± 0.64 |

| Depth 4 | Depth 8 | Depth 16 | Depth 32 | Depth 64 |
|---|---|---|---|---|
| 4.52 ± 0.34 | 5.31 ± 0.43 | 6.55 ± 0.48 | 9.14 ± 0.65 | 14.06 ± 0.66 |

[TABLE 1b] Scaling Depth Surpasses Performance of Scaling Width in Less Wall-clock time

| Environment | Width = 4096 | Depth = 64 |
|---|---|---|
| Humanoid | 22.30 ± 0.08 | 5.66 ± 0.24 |
| Ant Big Maze | 11.70 ± 0.03 | 4.04 ± 0.41 |
| Arm Push Easy | 13.69 ± 1.92 | 2.38 ± 0.18 |

[TABLE 2a] Successful Depth Scaling on GCBC (Antmaze-medium-stitch, OGBench)

| Depth | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|
| Performance | 0.474 ± 0.11 | 0.572 ± 0.04 | 0.574 ± 0.02 | 0.588 ± 0.00 | 0.634 ± 0.05 |

[TABLE 2b] Our Architecture is Critical: Depth Scaling on GCBC Fails with Standard Architecture

| Depth | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|
| Performance | 0.532 ± 0.09 | 0.746 ± 0.01 | 0.692 ± 0.02 | 0.698 ± 0.02 | 0.210 ± 0.21 |

[TABLE 3a] Wall-clock Time Comparison of Our Method vs. Baselines

| Method | Humanoid | Ant Big Maze | Ant U4-Maze | Ant U5-Maze | Ant Hardest Maze | Arm Push Easy | Arm Push Hard | Arm Binpick Hard | Humanoid U-Maze | Humanoid Big Maze |
|---|---|---|---|---|---|---|---|---|---|---|
| Scaled CRL (Depth 64) | 10.99 | 11.66 | 11.43 | 46.40 | 23.96 | 19.52 | 18.53 | 29.64 | 46.74 | 50.33 |
| SAC | 0.46 | 1.55 | 1.16 | 5.73 | 4.33 | 8.32 | 8.49 | 20.70 | 3.04 | 8.55 |
| SAC+HER | 0.55 | 1.63 | 1.26 | 6.10 | 4.51 | 8.54 | 8.60 | 20.74 | 3.48 | 9.27 |
| TD3 | 0.79 | 1.68 | 1.30 | 6.18 | 5.04 | 8.41 | 8.31 | 18.40 | 5.35 | 7.50 |
| GCSL | 0.42 | 1.48 | 0.70 | 2.81 | 2.13 | 6.37 | 5.22 | 7.95 | 3.07 | 5.08 |
| GCBC | 0.61 | 1.40 | 1.11 | 5.56 | 4.38 | 8.28 | 7.36 | 16.24 | 7.18 | 11.37 |

[TABLE 3b] CRL Surpasses SAC Performance in Less Wall-clock Time in 7 of 10 Environments

| Algorithm | Humanoid | Ant Big Maze | Ant U4-Maze | Ant U5-Maze | Ant Hardest Maze | Arm Push Easy | Arm Push Hard | Arm Binpick Hard | Humanoid U-Maze | Humanoid Big Maze |
|---|---|---|---|---|---|---|---|---|---|---|
| SAC | 27.8 | 92.7 | 69.8 | 343.8 | 259.8 | 499.2 | 509.7 | 1242.1 | 182.6 | 512.9 |
| Scaled CRL (Depth 64) | 382.2 | 0.0 | 0.0 | 0.0 | 27.2 | 114.7 | 398.8 | 266.0 | N/A* | N/A* |

*N/A denotes SAC outperforms our approach, so CRL never reaches SAC performance.

[1] Park, S., et al., ICLR 2025, OGBench: Benchmarking Offline Goal-Conditioned RL, https://arxiv.org/abs/2410.20092
[2] Lee, H., et al., ICLR 2025, SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning. http://arxiv.org/abs/2410.09754
[3] Lee, H., et al., ICML 2025. Hyperspherical Normalization for Scalable Deep Reinforcement Learning. http://arxiv.org/abs/2502.15280
[4] Farebrother, J., et al., ICML 2024, Stop Regressing: Training Value Functions via Classification for Scalable Deep RL. http://arxiv.org/abs/2403.03950
[5] Nauman, M., et al., NeurIPS 2024. Bigger, Regularized, Optimistic: Scaling for compute and sample-efficient continuous control. http://arxiv.org/abs/2405.16158
[6] Laskin, M., et al. ICML 2020, CURL: Contrastive Unsupervised Representations for Reinforcement Learning. https://proceedings.mlr.press/v119/laskin20a.html

Comment

Dear Reviewer,

We have worked hard to incorporate the review feedback by running new experiments and revising the paper. Do the revisions and discussions above address your questions? We would greatly appreciate your engagement.

Thanks!

The Authors

Comment

I appreciate the clarifications and the detailed computation analysis and scaling based analysis. The authors' response has addressed all my concerns, and I will maintain my already high judgement score for the paper.

Final Decision

This paper received positive and enthusiastic reviews from all reviewers. The AC agrees with and shares this enthusiasm.

This paper presents a reinforcement learning paradigm to effectively train (very) deep neural networks with self-supervised reinforcement learning. Using this framework, it then presents an analysis demonstrating that, as in other areas of self-supervised learning, self-supervised RL too can scale effectively with network depth, with more and more sophisticated capabilities emerging as depth increases.

The paradigm presented in this paper is relatively straightforward and makes use of three things: first, a simple self-supervised RL algorithm, contrastive RL (CRL); second, GPU-accelerated RL simulators to collect a large amount of data; and third, modern network designs. This simplicity and the included code will make it easier for the community to build upon this work.

The results and analysis are both equally impressive. On the JaxGCRL benchmarks the performance of contrastive RL improves by 2x to 50x and their scaled CRL outperforms all other methods (by up to an order of magnitude) in nearly all environments. Their analysis is extensive and well done. While there is more than can be summarized here, the AC would like to draw attention to:

  1. their demonstration of new behaviors being discovered at greater depth,
  2. the analysis showing the importance of batch size scaling for deeper networks in CRL,
  3. and the analysis showing that deeper networks enable the policy to stitch together its training experience on short tasks into handling long tasks at evaluation time.

Overall, this paper presents a paradigm that others can easily build on, impressive results, and analysis that will improve the community's understanding of scaling self-supervised RL. Given this, an Oral presentation is appropriate.