PaperHub
Overall rating: 7.8 / 10
Poster · 3 reviewers
Individual ratings: 3, 4, 5 (min 3, max 5, standard deviation 0.8)
TL;DR

Robust and naturalistic driving emerges from self-play in simulation at unprecedented scale

Abstract

Keywords
Reinforcement Learning, Autonomy, Simulation, Driving, Self-play

Reviews and Discussion

Official Review
Rating: 3

The authors introduce a simulator which is capable of efficiently simulating joint traffic scenarios at scale. Using this simulator, they train a population of driving policies using self-play reinforcement learning. When evaluated on CARLA, NuPlan and Waymax, the authors report state of the art performance despite never training on these datasets.

Update After Rebuttal

During the rebuttal, the authors addressed my concerns regarding fine-tuning of the reward and committed to including additional references. I believe that this work warrants acceptance. However, I believe that the manner in which the authors report their performance on Waymax is misleading. I strongly urge them to improve the clarity as discussed in my review to further strengthen their work.

Questions for the Authors

My primary grievances with this paper surround the reporting of evaluation hyperparameters and the results on Waymo data. I will gladly raise my score if my questions in these areas are answered.

  1. What reward-conditioning signal was used for each of the reported benchmarks? How much tuning (if any) was done to select the reward hyperparameters for each benchmark?
  2. How does the performance of SoTA WOSAC policies compare against your policy on the Waymax benchmark? I would specifically like to see the performance of SMART which is available open source here.
  3. Have any other prior art used the baseline implementations presented in the Waymax paper as a benchmark? I was unable to find any other work which uses these results as a benchmark.
  4. The performance of Gigaflow seems highly dependent on the details of your simulator. Will you open-source this simulator to allow others to reproduce your results?

Claims and Evidence

Although technically correct, I have some concerns regarding the framing of the authors' claims of emergent state-of-the-art behaviour. The authors use three driving benchmarks to support their claim. Of these, nuPlan is a strong result, as it is widely used in this field. CARLA is less commonly used. I have no issues with either of these evaluations. My major concern is with the authors' Waymo-based evaluations. In the literature, the Waymo Open Motion Dataset is an extremely popular and competitive benchmark. The authors compare against the baseline methods reported in the original 2023 Waymax simulator paper [1] and prominently report their improvements (i.e., Figure 1, right). To my knowledge, I have never seen that paper's metrics used as a benchmark for evaluating policy performance. As I understand it, the official and substantially more widely used evaluation for the Waymo Open Motion Dataset is the Waymo Sim Agents Challenge. On this domain, the authors do not achieve state-of-the-art performance, despite only comparing their method to the 2023 leaderboard and not the newer 2024 leaderboard, where the state of the art has further improved.

[1] "Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research" Gulino et al 2023

Methods and Evaluation Criteria

In addition to my concerns about the SoTA claim, I have a few concerns about the evaluation criteria

  1. For the Waymax benchmark, the authors propose, as their summary metric, a Score metric that is not present in the original Waymax paper. Although this metric captures realism regarding collisions, off-road driving, and distance driven, it does not incorporate the displacement error with respect to the logged policy. Notably, Gigaflow performs worse than all policies except DQN on this metric.

  2. The self-play policy is conditioned on C_reward, which the authors argue has a significant impact on the behaviour of the agent. However, C_reward is not reported for any of their benchmarks. It is unclear how much hyperparameter tuning of C_reward was needed to achieve state-of-the-art performance on each dataset.

  3. As noted in the Claims and Evidence section, WOSAC should be evaluated on the newer 2024 leaderboard.

Theoretical Claims

N/A

Experimental Design and Analysis

See Methods and Evaluation Criteria

Supplementary Material

I read through the appendix. However, I was not able to find the referenced infraction video supplementary material.

Relation to Prior Literature

This paper provides a counterpoint to the large body of literature on learning traffic behavioural models. Specifically, the majority of prior art is based on various imitation learning techniques, leveraging large-scale datasets of recorded traffic behaviour. The quality and realism of the policy reported in this paper open up new possibilities for developing traffic policies through RL.

Missing Important References

The authors fail to include several more recent SoTA policy models as they do not compare to the 2024 WOSAC leaderboard. For example

  • SMART: Scalable Multi-agent Real-time Motion Generation via Next-token Prediction (Wu et al. 2024)
  • BehaviorGPT: Smart Agent Simulation for Autonomous Driving with Next-Patch Prediction (Zhou et al. 2024)

This paper references two papers on fine-tuning behavioural cloning policies with RL. I believe this subject area is highly related and warrants further discussion. For example:

  • Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving (Peng et al. 2024)
  • Learning to Drive via Asymmetric Self-Play (Zhang et al 2024)
  • Scaling Is All You Need: Autonomous Driving with JAX-Accelerated Reinforcement Learning (Harmel et al 2023)
  • The SMART 7M results from Table 1 of Wu et al. correspond to the InteractionFormer results in Table A8 of your paper.

Other Strengths and Weaknesses

In my opinion, this paper is very well done in almost every aspect. It shows evidence that large scale reinforcement learning is a promising alternative to the dominant paradigm of supervised learning for traffic modelling. The analysis of the policy performance is well done and the details provided in the appendix are thorough. Frankly, had the authors simply reported their reasonable WOSAC performance instead of claiming Waymo SoTA through their "Waymax benchmark" I would wholeheartedly recommend acceptance. Their SoTA claim in this area undermines an otherwise excellent paper.

Strengths

  • The performance reported by this paper is remarkable given the lack of supervised training on human data.
  • The qualitative analysis of the resulting policies is robust
  • The paper is clear and well written.
  • The supplementary provides many of the details required to reproduce the paper

Weaknesses

  • While obviously a strong behavioural policy, the state of the art claims on Waymo data are unconvincing.
  • The authors report training time compute but do not report hyperparameter tuning procedures or compute budget
  • Although the authors spend a large portion of the paper describing the details of their simulator, they do not offer to open source it, making reproduction of their results challenging.

Other Comments or Suggestions

No other comments

Author Response

We thank the reviewer for their thoughtful review, and list a detailed reply for all questions raised.

To contextualize this rebuttal, we would like to clarify one of the main differences between Agent and Environment Simulation (WOSAC) and learning a driving policy (this work). The goal of WOSAC (and motion forecasting more generally) is to learn how human traffic generally behaves. If a vehicle collides or goes off-road (in the real world, or in a log due to sensor noise), a good agent and environment simulator will simulate this event. The WOSAC benchmark contains 3% collision events and 12% off-road events. A competent driving policy should never collide, and rarely go off-road.

For example, the best Agent and Environment Simulations (SMART etc) should reproduce the logged trajectory almost perfectly to maximize their score. A good policy will not because certain noisy trajectories lead to collisions and off-road events.

It was never our aim to claim state-of-the-art on WOSAC, nor should a competent driving policy be expected to achieve it. We show that in interactive driving simulations (CARLA, nuPlan, Waymax) our policy is a competent driver that gets to its goals safely, robustly, and efficiently. However, it does not necessarily achieve this in a human-like manner, nor does it imitate the sensor noise of the Waymo perception stack. This leads to a lower WOSAC score. The only mention of WOSAC in the main paper is on line 429. We mention it primarily because it is interesting, though not critical to the claims of the paper, that we achieve a non-trivial WOSAC score.

Additional references

Thank you. We will add them in the final version.

For the Waymax benchmark, the authors propose a Score metric which is not present in the original Waymax paper as their summary metric...

It is unfortunately quite easy to trade individual metrics off against each other in the Waymax benchmark. Yes, BC and Wayformer perform better on Log ADE (following the demonstrated trajectory). They clearly should, as this was their training objective. However, they fall short across all other simulation objectives, with a collision rate above 4%.

As noted in the Claims and Evidence section, WOSAC should be evaluated on the newer 2024 leaderboard.

We are working on this, and hopefully will have numbers for the final version.

It is unclear how much hyperparameter tuning of C_reward was needed to achieve state of the art performance on each dataset. / How much tuning (if any) was done to select the reward hyper parameters for each benchmark?

All benchmarks use the same reward hyperparameters (the mean of the randomized range). We did not tune reward parameters, as we considered this overfitting to the target benchmark (and thus it would invalidate our zero-shot generalization claim). There is a single difference between the benchmarks, which is the radius of the target region. CARLA requires more precise navigation than Waymax and nuPlan (the agent needs to follow the given route points closely) and uses a tighter target radius. We will add these values to the appendix in the final version.
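
To make this concrete, the evaluation setup can be pictured roughly as in the sketch below. The term names, ranges, and radius values are illustrative placeholders, not our actual parameters; the exact values will be listed in the appendix.

```python
# Illustrative sketch only -- names, ranges, and radii are placeholders, not the paper's values.
# The reward condition is the mean of each randomized training range and is shared across
# benchmarks; the target-region radius is the single per-benchmark difference.
randomized_ranges = {"comfort_weight": (0.0, 1.0), "rule_violation_weight": (0.5, 2.0)}
shared_reward_condition = {k: (lo + hi) / 2 for k, (lo, hi) in randomized_ranges.items()}

eval_configs = {
    "carla":  {"reward_condition": shared_reward_condition, "target_radius_m": 2.0},   # tighter (placeholder)
    "nuplan": {"reward_condition": shared_reward_condition, "target_radius_m": 10.0},  # placeholder
    "waymax": {"reward_condition": shared_reward_condition, "target_radius_m": 10.0},  # placeholder
}
```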

How does the performance of SoTA WOSAC policies compare against your policy on the Waymax benchmark? I would specifically like to see the performance of SMART which is available open source here.

SMART is certainly performing better at Agent and Environment Simulation (WOSAC). The aim of the WOSAC results was to show the policy learns human-like driving, and not to claim state-of-the-art on this benchmark. As such, we sharply outperform SMART on collision scores, our primary metric of interest, but underperform on distance to logged trajectories, which our paper did not target and is not always relevant for a self-driving car.

Have any other prior art used the baseline implementations presented in the Waymax paper as a benchmark?

The tools developed in the Waymax paper are primarily intended for the development of human-like simulation agents. However, as the data is interesting and can be used to measure robustness, other works have started to use it in a manner similar to ours [1, 2, 3].

[1] Cornelisse, D., Pandya, A., Joseph, K., Suárez, J., & Vinitsky, E. (2025). Building reliable sim driving agents by scaling self-play. arXiv preprint arXiv:2502.14706.

[2] Xiao, L., Liu, J. J., Ye, X., Yang, W., & Wang, J. (2024). EasyChauffeur: A Baseline Advancing Simplicity and Efficiency on Waymax. arXiv preprint arXiv:2408.16375.

[3] Charraut, V., Tournaire, T., Doulazmi, W., & Buhet, T. (2025). V-Max: Making RL practical for Autonomous Driving. arXiv preprint arXiv:2503.08388.

Will you open-source this simulator to allow others to reproduce your results?

We would very much like to open-source this work, but cannot commit due to approval requirements beyond our control.

Official Review
Rating: 4

This paper presents a batched driving simulator called GIGAFLOW that enables large-scale self-play, i.e., randomly initializing scenarios and learning robust driving behaviors with pre-defined RL rewards. GIGAFLOW creates worlds based on eight maps and spawns agents at random locations with randomly perturbed accelerations, sizes, goals, and other necessary parameters. It defines rewards with respect to safety, progress, traffic rules, and comfort. It trains a unified policy for all agents, with conditions for various driving behaviors as input. The policy is trained with PPO and a variant of prioritized experience replay to filter unimportant samples. Through learning in GIGAFLOW only, the policy achieves state-of-the-art performance on multiple driving benchmarks, including CARLA, nuPlan, Waymax, and the Waymo Sim Agents challenge.
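
For readers less familiar with this kind of setup, the training loop can be pictured roughly as in the minimal sketch below. All names (the simulator object, sample_reward_conditions, top_fraction_by_priority) are illustrative placeholders of mine rather than the authors' API, and the filtering step is heavily simplified.

```python
# Minimal illustrative sketch of the described self-play loop (not the authors' code).
# One shared policy controls every agent; each agent additionally observes its own
# randomized reward-condition vector, so a single network covers many driving styles.
def train(sim, policy, ppo, num_iterations, keep_fraction=0.2):
    for _ in range(num_iterations):
        sim.reset_random_scenarios()                 # random map, spawns, goals, vehicle sizes
        conditions = sim.sample_reward_conditions()  # per-agent reward weights / agent type

        rollout = []
        for _ in range(sim.horizon):
            obs = sim.observe()                      # batched local observations for all agents
            actions = policy.act(obs, conditions)    # the single policy is queried for every agent
            rewards, dones = sim.step(actions)       # safety / progress / traffic-rule / comfort terms
            rollout.append((obs, actions, rewards, dones))

        batch = ppo.compute_advantages(rollout)
        batch = batch.top_fraction_by_priority(keep_fraction)  # drop "unimportant" timesteps
        ppo.update(policy, batch)
```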

Update After Rebuttal

The authors clarified most of my concerns and confusion on technical details during the rebuttal. This paper presents a promising direction of self-play to realize robust motion planning. The novelty, contribution, and final results all demonstrate its high quality. I did not give a higher score mainly because self-play has been adopted in other domains, which limits the broader impact on the ML community. Its reliance on accurate maps and perception results may also limit its real-world application, but I believe the industry can acquire insights from the paper and address the bottlenecks for large-scale deployment soon.

Questions for the Authors

  1. What is the meaning of "long-form evaluation" in Sec. F.3? Besides, what is the relationship of long-form evaluation with the decision-making frequency in Figure A4? I hope the authors can clarify this section in the revision.
  2. In multiple places, the authors state that they use only a minimalistic reward function (Line 43) or do not design delicate reward terms (Line 149). Meanwhile, I think the rewards listed in Sec. B.3 and Table A2 have already included the most adaptable rewards in RL for driving. Could the authors clarify which other rewards can potentially be used for more delicate designs?
  3. The description in Lines 258-268 is very interesting. But I am curious how the policy autonomously adapts to various scoring criteria if they are contradictory.
  4. Would using a single policy to train all agents, with various conditions and randomization, result in unstable training, especially during the beginning stages?
  5. How are the routing information and goals applied for short-term planning benchmarks like nuPlan? Will there be label leakage or additional information used when comparing with other methods?
  6. Sec. E.4 adopts WOSAC 2023 results. How about the public 2024 WOSAC leaderboard?

Claims and Evidence

The paper's claims are solid overall. The method does not involve human data and uses self-play with RL only, in alignment with the claims. The experimental results show that it realizes state-of-the-art performance on benchmarks and outperforms specific algorithms.

Methods and Evaluation Criteria

The method, self-play, is simple and sound. To make it effective, the authors propose several improvements, including hash-based map encoding for acceleration, the simple importance sampling strategy, careful randomization and filtering designs, etc. These methods make sense for the problem and are validated in the ablation study in the supplementary material.
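
For context, the hash-based map lookup mentioned above typically resembles the generic spatial-hash sketch below; this is my own illustration of the general technique, not the authors' implementation.

```python
from collections import defaultdict

# Generic spatial-hash sketch: bucket map elements by grid cell so that "what is near
# this agent" queries during simulation become O(1) dictionary lookups instead of scans.
CELL_SIZE = 5.0  # meters; placeholder value

def cell_key(x, y):
    return (int(x // CELL_SIZE), int(y // CELL_SIZE))

def build_index(map_points):
    index = defaultdict(list)
    for point_id, (x, y) in map_points.items():
        index[cell_key(x, y)].append(point_id)
    return index

def query_nearby(index, x, y, radius_cells=1):
    cx, cy = cell_key(x, y)
    hits = []
    for dx in range(-radius_cells, radius_cells + 1):
        for dy in range(-radius_cells, radius_cells + 1):
            hits.extend(index.get((cx + dx, cy + dy), []))
    return hits
```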

Theoretical Claims

Not applicable. This paper does not include theoretical claims.

Experimental Design and Analysis

The authors incorporate multiple benchmarks, including open-loop and closed-loop evaluations. I think the experimental results and analysis are valid and sound, with concrete descriptions on how experimental settings are adapted in the supplementary material.

Supplementary Material

I have read all the parts in the supplementary materials. The authors provide great details about implementations on the simulator, randomization, models, rewards, and results on several benchmarks, etc. The simulator designs are important to grasp how they accelerate batched training, and the randomization parts are crucial to assess the technical novelty of this work.

Relation to Prior Literature

The methodology in this paper is mainly self-play, which has been explored in existing applications such as StarCraft and Go.

Missing Important References

I think one closely related prior work is GPUDrive [1]. It also supports large-scale simulation and RL training at 1 million FPS, which is comparable to the speed of GIGAFLOW (1.2M, as shown in Table A1). Though GPUDrive mainly targets fast agent training on WOMD, it could probably support self-play as well. Additionally, GPUDrive supports sensor simulation. I suggest the authors include a discussion comparing against GPUDrive in the revision, which may be important for assessing the contribution of GIGAFLOW with respect to batched simulation.

[1] Kazemkhani, S., Pandya, A., Cornelisse, D., Shacklett, B. and Vinitsky, E., 2024. GPUDrive: Data-driven, multi-agent driving simulation at 1 million FPS. arXiv preprint arXiv:2408.01584.

Other Strengths and Weaknesses

Strengths:

  1. The map hashing part involves great engineering effort, which is crucial to designing an accelerated simulator for batched RL training.
  2. Conditioning various agent types to the policy and using a unified policy for training is interesting.
  3. Great visualizations. Very neat and informative.
  4. Results are strong, with a single policy adapted to various benchmarks.

Weaknesses:

  1. One potential weakness of this method is the reliance on maps. This work puts great effort into encoding maps to accelerate information acquisition during training and to calculate geometric constraints. However, the vectorized representation requires accurate high-definition (HD) maps to achieve robust planning, and these are not available for most areas. To enlarge the operational regions, the current driving industry is focusing on map-free or map-light methods, using only navigation maps or (inaccurate) crowdsourced maps. Therefore, this issue may limit broader application in the real-world driving industry.
  2. The paper adopts Frenet coordinates, which potentially encourage vehicles to drive in the center of lanes. This is reasonable in most cases and prevailing in current motion benchmarks. However, corner cases in real life sometimes do not satisfy this assumption; for example, a vehicle may need to deviate into the opposite lane to bypass an obstacle in front. The method also adopts a discrete set of actions (12 in total) to simplify the problem. I understand that the authors have fully analyzed the failure cases in current benchmarks, yet I wonder whether more challenging scenarios could be devised to discuss the drawbacks of the proposed method. Besides, another potential benchmark in academia could be Bench2Drive [2]. This weakness also relates to W1 in terms of maps.
  3. I hope the authors could provide further clarifications on Figure A3. The results seem to indicate that randomization mainly works for CARLA experiments, while the impact for nuPlan and Waymax is relatively minor. I think the ablations in the appendix, Sec. F, are more important to get more insights from the proposed method, compared to the visualizations in Figure 4.
  4. The RL algorithm is trained in GIGAFLOW with billions of miles, but a large portion of the data is filtered. Could the authors provide a more detailed number for the data ultimately used and compare it with the amount of data in the various benchmarks? As the dataset-specific algorithms assume that all of the public data is useful, this comparison would be a more direct and intuitive way to see how much valuable data is needed.

[2] Jia, X., Yang, Z., Li, Q., Zhang, Z. and Yan, J., 2024. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. arXiv preprint arXiv:2406.03877.

Other Comments or Suggestions

  1. Typos: Line 29/401 (e.g.,), Line 85 (Dashed Lines -> The trophy and the thumbtack?), Line 761 (intersect), Line 979 (GPUs), additional blank lines before Eq. 1 and Eq. 5.
  2. Multiple sentences state that videos for the benchmark evaluations are in the supplementary material (Sec. E, Lines 1088/1051), but in fact they are not provided.
  3. It would be helpful if the work could be open-sourced to benefit the community and the industry.

Author Response

We thank the reviewer for their thoughtful review, and list a detailed reply for all questions raised.

GPUDrive

We are aware of the work, and have been in close communication with the authors of GPUDrive from the start of their project. We will cross-cite each other's work for the camera-ready paper.

... requires accurate high-definition (HD) maps to achieve robust planning, which are not available for most areas

True, training requires HD maps. It is unclear how strong the reliance on HD maps is during deployment. Our policy generalizes across several different vector map formats (CARLA, nuPlan, Waymo). It does not directly see the vector representation, but rather an intermediate representation. A performant sense stack should be able to construct this representation.

The paper adopts the Frenet coordinates, which potentially encourage the vehicles driving on the center of lanes

This is a misunderstanding. The Frenet frame provides the policy with a rotation-invariant representation of its local world. However, it does not limit its driving ability. The policy chooses to go into the opposing lane if required (see Figure 4). After all, PPO optimized the policy on such a huge amount of experience that it seems inconceivable that biases like lane-centering would survive merely because of the input representation.
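
For clarity, the projection in question can be sketched as below. This is an illustrative polyline-based version rather than our exact implementation; it changes only the coordinate frame of the observation, not where the vehicle is allowed to drive.

```python
import math

# Illustrative Cartesian -> Frenet projection against a lane-centerline polyline.
# The pose becomes (s, d, dtheta): arc length along the lane, signed lateral offset,
# and heading relative to the lane tangent.
def to_frenet(x, y, theta, centerline):
    best = None
    s_along = 0.0
    for (x0, y0), (x1, y1) in zip(centerline[:-1], centerline[1:]):
        dx, dy = x1 - x0, y1 - y0
        seg_len = math.hypot(dx, dy)
        if seg_len == 0.0:
            continue
        t = max(0.0, min(1.0, ((x - x0) * dx + (y - y0) * dy) / seg_len**2))
        px, py = x0 + t * dx, y0 + t * dy
        dist = math.hypot(x - px, y - py)
        if best is None or dist < best[0]:
            side = 1.0 if dx * (y - y0) - dy * (x - x0) >= 0.0 else -1.0  # left of the lane is positive
            best = (dist, s_along + t * seg_len, side * dist, theta - math.atan2(dy, dx))
        s_along += seg_len
    _, s, d, dtheta = best
    return s, d, dtheta
```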

... Sec. F.3? ... Figure A4?

Our training environment is augmented with erratic drivers and severe dynamics noise. Sec. F.3 evaluates robustness under ordinary conditions. In this regime, our agents achieve over 3 million km of driving between incidents of any kind. This corresponds to approximately 99.99999997% of transitions in the simulator being collision-free. To the best of our knowledge, this is one of the first examples in the literature where such a level of robustness is demonstrated for a stochastic policy trained with RL.
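
As a rough sanity check of how these two numbers relate, assume an average speed of about 10 m/s and a 10 Hz decision rate (illustrative assumptions, not the exact values from the paper):

```python
# Back-of-the-envelope link between "3 million km between incidents" and the quoted
# ~99.99999997% collision-free transition rate. Speed and decision rate are assumed.
km_between_incidents = 3_000_000
avg_speed_mps = 10.0       # assumption
decision_rate_hz = 10.0    # assumption

transitions_between_incidents = km_between_incidents * 1000 / avg_speed_mps * decision_rate_hz
collision_free_fraction = 1.0 - 1.0 / transitions_between_incidents
print(f"{collision_free_fraction:.10%}")  # ~99.9999999667%, the same order as the quoted figure
```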

The policy drives better at a higher decision-making frequency (Fig. A4), as it has to commit to "shorter" actions (lower reaction time).

Will clarify in final paper.

... Could the authors clarify which other rewards can potentially be used for more delicate designs?

Our reward encodes road rules, but not behavior or expected interactions between agents. We do not use:

  • Penalties shaping expected proximity to other agents
  • Expected speeds around other agents
  • Forward progress

These terms trade short-term performance for generalizability.

... Lines 258-268 ... how the policy autonomously adapts to various scoring criteria if they are contradictory.

Our evaluation is zero-shot. We use a single reward condition for all benchmarks; there is no autonomous adaptation.

Would using a single policy ... result in unstable training, ...?

It does not. Ultimately, the major reward the agent receives is for reaching a goal. The randomized components of the reward are secondary. This is also reflected in the behaviors learned: the agent first learns to move towards the goal and collect nearby goals; only after that does it pay attention to red lights, driving direction, etc. The training curves are stable. We will add them to the appendix.

Routing information / potential label leakage?

We do not use the expert trajectory in any way (no leakage). For nuPlan, we use a point in the final lane segment as the goal (no routing information; our policy learned to route locally during training). For Waymax, we use the final location of the log ego vehicle, projected forward 20 m along the lane to ensure that the agent continues driving if it arrives early. In neither setup does our agent see more information than the baselines.
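
Concretely, the Waymax goal construction can be pictured as in the sketch below (an illustrative polyline version, not our exact code):

```python
import math

# Illustrative sketch: take the log ego vehicle's final position and push the goal a
# further 20 m along the lane centerline, so an agent that arrives early keeps driving.
def project_goal_along_lane(final_xy, centerline, extra_m=20.0):
    start = min(range(len(centerline)), key=lambda i: math.dist(final_xy, centerline[i]))
    remaining = extra_m
    for i in range(start, len(centerline) - 1):
        seg = math.dist(centerline[i], centerline[i + 1])
        if seg >= remaining:
            f = remaining / seg
            (x0, y0), (x1, y1) = centerline[i], centerline[i + 1]
            return (x0 + f * (x1 - x0), y0 + f * (y1 - y0))
        remaining -= seg
    return centerline[-1]  # lane ends before the full 20 m: fall back to its last point
```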

2024 WOSAC leaderboard?

We are working on it, but unfortunately it won't be finished in time for this rebuttal.

Clarifications on Figure A3 and randomization

The metrics for nuPlan and Waymax are very compressed. Both benchmarks have a certain number of scenarios that are unsolvable, so even an optimal policy scores below the maximum. CARLA has fewer scenarios that are completely unsolvable, and its metric degrades less gracefully. Reward randomization does help for all benchmarks (nuPlan 93.08 -> 93.17; Waymax 98.79 -> 98.92).

Detailed number of data used.

Over 99% of trajectories are used during training and need to be simulated to completion to calculate rewards and advantages for each timestep. Figure 1 is an accurate comparison of sizes. Gigaflow is about five orders of magnitude larger than prior public attempts.

We filter 80% of time-steps (see section 2.2). This filtering happens after simulating rollouts. Even accounting for this filtering, RL training sees at least four orders of magnitude more data than a single pass over any publicly available data would provide.
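
For illustration, the post-rollout filtering can be pictured roughly as below; ranking by absolute advantage is a simplifying assumption of this sketch and stands in for the actual priority we use.

```python
import numpy as np

# Simplified illustration of post-rollout filtering: every timestep is simulated and its
# advantage computed first, then only the top keep_fraction by a priority score
# (|advantage| as a stand-in here) is passed to the PPO update.
def filter_timesteps(observations, actions, advantages, keep_fraction=0.2):
    priority = np.abs(advantages)
    k = max(1, int(len(priority) * keep_fraction))
    keep = np.argpartition(priority, -k)[-k:]
    return observations[keep], actions[keep], advantages[keep]
```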

Typos

Thank you. Will be fixed.

Supplementary videos

Sorry, we forgot to upload them: https://drive.google.com/drive/folders/1hlDrPDQIu56-iq039fwOa18g8aUGSKYl?usp=sharing

Open-sourcing

We would very much like to open-source this work, but cannot commit due to approval requirements beyond our control.

Reviewer Comment

Thanks for the clarification.

... HD maps

Agreed. This effect is currently unclear for the method, so the limitation holds and is worth future work, together with the end-to-end simulation mentioned by Reviewer SjGz.

... Figure A4

I think the authors could add to the paper a one-sentence summary and discussion like the one in the rebuttal. Currently, the figure is not accompanied by a conclusion.

... how the policy autonomously adapts to various scoring criteria if they are contradictory.

Yes, I know that the policy is evaluated zero-shot. I do not mean the policy is adapted or fine-tuned. I am just curious why the zero-shot policy can achieve robust performance on both benchmarks if the two benchmarks have contradictory scoring criteria. I am wondering if the authors can provide a deeper analysis on this point. I think this may relate to the discussion of Reviewer EHkB as well. If the same scenarios are evaluated with two contradictory metrics, the policy should not yield satisfactory performance under both criteria. I assume that though different benchmarks favor or highlight different metrics, the overall criteria are the collision and running time, which are not contradictory. In this sense, the word "contradictory" may not be very accurate.

Author Comment

... Figure A4

Happy to add this comment.

... how the policy autonomously adapts to various scoring criteria if they are contradictory.

The main factors to trade off between the benchmarks are:

  • Comfort (nuPlan, Waymax)
  • Traffic violations (CARLA)
  • Forward progress (CARLA, nuPlan)

Many scoring criteria are shared between all benchmarks:

  • Collisions
  • Off-road events
  • Goal reaching

We have trained some policies with a subset of rewards that perform better on specific benchmarks but fail to generalize to all. For nuPlan, a policy without stop lines generally performs better: red-light violations are not properly penalized in nuPlan, and running lights (without colliding) yields much better route progress. For CARLA, removing the comfort penalty yields better results: NPC vehicles in CARLA can exhibit braking beyond 2 g, and a policy that reacts to this without regard to the driver's comfort generally avoids a few more collisions.

Our strategy has been to include all rewards that pertain to good driving during training and not to exploit each individual simulator. Fortunately this was sufficient to outperform state-of-the-art methods with a single policy.

We are happy to add this discussion to the final version of the paper.

Official Review
Rating: 5

This paper presents GIGAFLOW, a batched simulator that supports large-scale simulation to train robust driving policies via self-play. Through learning from a massive scale of self-play, the learned policy demonstrates superior performance compared to the prior state-of-the-art on multiple challenging benchmarks, without seeing any human demonstration in training.

Questions for the Authors

N/A.

Claims and Evidence

Yes, the claims made in this submission are supported by clear and convincing evidence.

Methods and Evaluation Criteria

Strengths of methods and evaluation:

  • Large-scale reinforcement learning for driving. The key component of the proposed method is the application of large-scale reinforcement learning to learn a robust driving policy. Although reinforcement learning has been adopted for driving in prior works, this is the first attempt to conduct reinforcement learning at such a scale, with over 42 years of experience.
  • Strong generalization of the learned policy on multiple benchmarks. Owing to learning from massive experience spanning a wide range of scenarios, the learned policy demonstrates strong zero-shot generalization to unseen environments. Additionally, it sets new records on several challenging benchmarks, such as CARLA and nuPlan, without special adaptation or additional learning. These results provide encouraging signals for future studies applying reinforcement learning to broader areas, including driving and robotics.

Issue of the method:

  • Lack of study of the model design. This work mainly focuses on the simulation and experimental aspects, with little detailed investigation of model design. For instance, according to L145-150 (right), the method uses a single policy for different types of agents, with a conditioning input C_conditioning as an indicator of the agent type. However, the intuition and motivation for using one policy for all agents are under-discussed in this work. Furthermore, an ablation comparing a shared policy with separate policies for different agent types is not reported. Given the distinction between different agents, such investigations could further improve the soundness of this work.

Theoretical Claims

No issues with the theoretical aspect of this work.

Experimental Design and Analysis

Strengths of experiments:

  • Extensive experimental results on multiple benchmarks. This work presents extensive experimental results to demonstrate the superiority of the proposed method, such as Figure 1 and Tables A5, A6, and A7 in the supplementary material. Providing cross-environment results requires considerable empirical effort and should therefore be encouraged and valued.

  • Abundant qualitative analyses. Qualitative visualizations in Figures 3 and 4 enhance the readability of this work and clearly illustrate the policy behavior during and after training.

Supplementary Material

Yes, I viewed all supplementary material.

Relation to Prior Literature

The key contributions of the paper relate to the areas of autonomous driving, traffic simulation, large-scale training, and applicable reinforcement learning.

Missing Important References

N/A.

Other Strengths and Weaknesses

Other Strengths:

  • This paper is well-written, well-organized, and easy to follow.
  • This paper focuses on a long-standing yet challenging problem, which is to establish a robust driving agent without extensive human demonstrations. This paper shows encouraging results in that research direction.

Limitation yet NOT weaknesses of this work:

  • Although this work shows promising results for learning a robust policy via self-play, it is notable that all policies presented in this paper operate on a highly abstracted vector space without sensor inputs, which is different from real-world driving agents that reason over their online sensor data. Therefore, there is still a significant gap in applying this method to establish a real-world driving policy. The gap lies in the difficulty of high-dimensional sensor simulation, which requires much more computation than the low-dimensional vector-space simulation in this paper. However, sensor simulation belongs to another branch of work and is not the focus of this paper, so it should not be viewed as a weakness of this work.

Also see strengths and weaknesses in the above sections.

Other Comments or Suggestions

N/A.

Author Response

We thank the reviewer for their thoughtful review, and list a detailed reply for all questions raised.

The intuition and motivation to use one policy for all agents is under-discussed in this work. Furthermore, the ablation study of comparing shared policy and separate policies for different types of agents is not reported in this work.

N separate policies are technically hard to implement efficiently at scale, for two reasons:

  • Computational cost grows, as a significant amount of routing is required from agent states to potentially non-uniformly batched inference and training steps/sequences.
  • Separate policies likely train more slowly: each separate policy sees 1/N of the experience, and prior evidence suggests that embedding a population in a single policy can accelerate learning [1].

In addition, the number of policies required to cover our continuous reward parametrization can potentially be quite large.

[1] Liu, Siqi, et al. "Neural population learning beyond symmetric zero-sum games." arXiv preprint arXiv:2401.05133 (2024).

Although this work shows promising results of learning robust policy via self-play, it is notable that all policies presented in this paper operate on highly abstracted vector space without sensor inputs, which is different from real-world driving agents that reason over their online sensor data.

The reviewer is right that scaling this work to sensor simulation or real-world deployment will require additional effort, and it presents an exciting avenue for future work. One thing to note is that we are the first to show that a single policy can generalize well across several different vector maps (CARLA, nuPlan, Waymo). The main reason for this is that our policy does not directly see the vector representation, but rather lane boundaries, etc. These inputs can likely be inferred by a sense stack, although getting a good noise profile will be part of future work.

Reviewer Comment

Thanks for clarifying the reason for using a unified policy for all types of agents. I'd like to keep my score as strong accept regarding the large-scale RL training, strong performance, and extensive experiments on multiple benchmarks of the proposed method.

Final Decision

This paper presents a reinforcement learning approach to training driving policies without human demonstrations. The paper uses Gigaflow to run massive numbers of simulations across three benchmark driving environments. Although the paper uses a metric that the authors define, the evaluation seems fair given that it is a mileage measure in dense traffic. It is a good contribution to the ICML community.