Thank you for engaging in the discussion. We address each of your remaining concerns in turn below.

Compared with previous methods, Fine-grained Trajectory Tokenization offers two main benefits. First, interactive behavior can involve several traffic instances with different spatiotemporal relationships (e.g., a lane-change attempt with multiple surrounding vehicles). Fine-grained trajectory tokens enable flexible feature fusion with these instances: in the downstream Interaction-enhanced Spatiotemporal Fusion mechanism, tokens of different trajectory segments can interact with different environmental features, which enhances the overall flexibility of trajectory generation. Second, to maintain the kinematic consistency and smoothness of the trajectory, we introduce overlap between neighboring trajectory segments. These designs are illustrated in Fig. 3 and in the experiment provided in the rebuttal above.
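
As a rough illustration of the overlapping segmentation described above, the following minimal sketch splits a trajectory into fixed-length segments that share points with their neighbors. The function name, the horizon of 80 points, the segment length of 20, and the overlap of 5 are all hypothetical values chosen for the example; they are not the settings used in Flow Planner.

```python
import numpy as np


def split_into_overlapping_segments(traj, seg_len, overlap):
    """Split a (T, D) trajectory into segments of `seg_len` points.

    Neighboring segments share `overlap` points. All sizes here are
    illustrative assumptions, not the paper's configuration.
    """
    if not 0 <= overlap < seg_len:
        raise ValueError("overlap must satisfy 0 <= overlap < seg_len")
    stride = seg_len - overlap
    num_points = traj.shape[0]
    if num_points < seg_len or (num_points - seg_len) % stride != 0:
        raise ValueError("trajectory length does not fit the segment layout")
    starts = range(0, num_points - seg_len + 1, stride)
    return np.stack([traj[s:s + seg_len] for s in starts])


# Hypothetical trajectory: 80 (x, y) waypoints along a straight line.
traj = np.stack([np.arange(80.0), np.zeros(80)], axis=1)
segments = split_into_overlapping_segments(traj, seg_len=20, overlap=5)
```

Each segment would then be embedded as one token; the shared points give adjacent tokens a common boundary region, which is the property the overlap is meant to provide.
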
Flow Matching can be considered a general form of Diffusion [1], with a more versatile choice of noise schedules. Specifically, we choose Flow Matching with the Optimal Transport path [2], which adopts a straightforward linear interpolation between noise samples and data samples: $x_t = (1 - t)\,x_0 + t\,x_1$, where $x_0$ is a noise sample, $x_1$ is a data sample, and $t \in [0, 1]$. This enables faster convergence during training, with fewer intersections between interpolation paths [3], as indicated by the performance comparison between Flow Planner and its Diffusion adaptation under the same number of training steps in the experiment above. Additionally, the linear interpolation results in a constant velocity along the path, enabling simple ODE simulation with fewer steps [2] and thus faster inference, which makes Flow Matching a widely adopted paradigm in scenarios where a high inference frequency is necessary [4][5]. In our case, we use the simple Euler method with only 4 simulation steps, compared with 20 steps for a previous Diffusion-based method [6].
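
To make the constant-velocity argument concrete, here is a minimal sketch of the linear interpolation and of 4-step Euler sampling. The notation ($x_0$ for noise, $x_1$ for data) and all function names are assumptions made for the example. The closed-form `toy_velocity` stands in for the trained network and is only valid for the toy case where every data sample equals a fixed point `mu`; it is not the Flow Planner model.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Linear path between a noise sample x0 (t=0) and a data sample x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Regression target for the velocity field; constant along the linear path."""
    return x1 - x0


def euler_sample(velocity_fn, x0, num_steps=4):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with explicit Euler."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x


# Toy stand-in for a trained network (assumption): if all data equal `mu`,
# the exact marginal velocity is (mu - x) / (1 - t).
mu = np.array([2.0, -1.0])


def toy_velocity(x, t):
    return (mu - x) / (1.0 - t)


rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 2))
samples = euler_sample(toy_velocity, x0, num_steps=4)
```

In this toy case the paths are exactly straight, so four Euler steps already reach `mu` up to floating-point error. With a learned velocity field the paths are only approximately straight, so the number of steps remains a trade-off between accuracy and speed.
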
As discussed in the rebuttal, "by learning both unconditioned and scene-conditioned behaviors, CFG captures interaction patterns—offering a novel solution for generating interactive behaviors". An illustration of the CFG mechanism can be seen in Fig. 1. Further theoretical analysis can be found in [7][8].
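
For readers less familiar with CFG, one common way to combine the two predictions at inference time is sketched below. The function name, the guidance-scale convention (a scale of 1 recovers the conditional prediction), and the example vectors are assumptions for illustration; conventions differ across [7][8], and this is not necessarily the exact form used in Flow Planner.

```python
import numpy as np


def cfg_velocity(v_cond, v_uncond, guidance_scale):
    """Extrapolate from the unconditional toward the conditional velocity.

    guidance_scale = 0 gives the unconditional prediction,
    guidance_scale = 1 gives the conditional prediction,
    guidance_scale > 1 amplifies the effect of the condition.
    """
    return v_uncond + guidance_scale * (v_cond - v_uncond)


# Hypothetical velocities predicted without and with scene conditioning.
v_uncond = np.array([1.0, 0.0])
v_cond = np.array([1.0, 0.5])
guided = cfg_velocity(v_cond, v_uncond, guidance_scale=2.0)
```

The difference `v_cond - v_uncond` isolates what the scene context changes in the prediction, which is the component that guidance amplifies.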

[1] Lipman, Yaron, et al. "Flow matching for generative modeling." arXiv preprint arXiv:2210.02747 (2022).

[2] Liu, Xingchao, Chengyue Gong, and Qiang Liu. "Flow straight and fast: Learning to generate and transfer data with rectified flow." arXiv preprint arXiv:2209.03003 (2022).

[3] Lipman, Yaron, et al. "Flow matching guide and code." arXiv preprint arXiv:2412.06264 (2024).

[4] Black, Kevin, et al. "$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control." arXiv preprint arXiv:2410.24164 (2024).

[5] Zheng, Zangwei, et al. "Open-sora: Democratizing efficient video production for all." arXiv preprint arXiv:2412.20404 (2024).

[6] Zheng, Yinan, et al. "Diffusion-Based Planning for Autonomous Driving with Flexible Guidance." The Thirteenth International Conference on Learning Representations (2025).

[7] Zheng, Qinqing, et al. "Guided flows for generative modeling and decision making." arXiv preprint arXiv:2311.13443 (2023).

[8] Ho, Jonathan, and Tim Salimans. "Classifier-free diffusion guidance." arXiv preprint arXiv:2207.12598 (2022).