TSTTC: A Large-Scale Dataset for Time-to-Contact Estimation in Driving Scenarios
In this work, we present a large-scale, object-oriented TTC dataset for driving scenes to promote TTC estimation with a monocular camera.
Abstract
Reviews and Discussion
The manuscript introduces TSTTC, a large-scale dataset for Time-to-Contact (TTC) estimation in driving scenarios.
TTC is a key metric in ADAS for assessing collision risks, which is essential for subsystems like Adaptive Cruise Control (ACC) and Automated Emergency Braking (AEB). The authors argue that there is a scarcity of real-world, large-scale TTC datasets, which has historically limited the effectiveness of deep learning-based TTC estimation methods.
To address this gap, the authors have constructed a dataset comprising 206K sequences from real-world driving scenes (highway and urban), supplemented by 1K sequences generated using NeRF for scenarios with small TTC values. Each sequence contains six consecutive frames captured at 10 Hz, annotated with 2D and 3D bounding boxes and ground-truth TTC values for vehicles.
Additionally, the manuscript proposes two baseline TTC estimation methods: Pixel MSE and Deep Scale. Both methods rely on calculating the scale ratio between consecutive frames to estimate TTC, with the latter using deep learning to enhance accuracy.
The authors evaluate these methods on the TSTTC dataset, reporting improvements in Motion-in-Depth (MiD) error and Relative TTC Error (RTE). The results indicate that Deep Scale outperforms traditional depth estimation and the other baselines.
The work highlights the benefits of their dataset for future TTC estimation research and emphasizes the importance of real-world datasets in improving the robustness of TTC predictions in autonomous driving systems.
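To make the scale-ratio idea summarized above concrete, here is a minimal sketch; it is not the authors' implementation. It assumes a pinhole camera, a rigid object of fixed physical size, and a constant closing speed between the two frames. The function name `ttc_from_scale_ratio` and the convention that the scale ratio is the later image size divided by the earlier one are illustrative assumptions.

```python
def ttc_from_scale_ratio(scale_ratio: float, dt: float) -> float:
    """Time-to-contact in seconds at the later frame (hypothetical helper).

    Under a pinhole model the image size w is proportional to 1 / Z, so
    scale_ratio = w_later / w_earlier = Z_earlier / Z_later.
    With constant closing speed v = (Z_earlier - Z_later) / dt,
    TTC = Z_later / v = dt / (scale_ratio - 1).
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    if scale_ratio == 1.0:
        # No change in apparent size: the object is not approaching.
        return float("inf")
    # A negative result means the object is receding (scale_ratio < 1).
    return dt / (scale_ratio - 1.0)


# Example: an object moves from 22 m to 20 m between frames 0.1 s apart
# (10 Hz, as in the dataset description), so its image grows by 22 / 20.
example_ttc = ttc_from_scale_ratio(22.0 / 20.0, dt=0.1)  # about 1.0 s
```

The appeal of this formulation is that it needs neither absolute depth nor velocity, only the relative change in apparent size.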
Strengths
(+) The TSTTC dataset is a valuable contribution to the community, filling a notable gap by providing a large-scale, real-world dataset specifically tailored for TTC estimation. The dataset’s focus on both urban and highway driving scenarios, combined with a wide depth range (up to 400 meters), makes it highly relevant for autonomous driving applications.
(+) The paper presents a solution to the problem of data scarcity in small TTC cases by augmenting the dataset with NeRF-generated sequences. This approach is proposed to address the imbalance in real-world data.
(+) The proposed TTC estimation methods (Pixel MSE and Deep Scale) are well-executed. The results on various metrics such as MiD and RTE demonstrate the efficacy of the methods in practical driving scenarios, providing useful baselines for future research.
Weaknesses
(-) The addition of NeRF-generated sequences is an interesting idea, but the exact impact of these synthetic data on the model’s generalization capabilities is not sufficiently explored. More detailed analysis of how much NeRF data improves performance, especially on smaller TTC values, would be helpful.
(-) The manuscript lacks thorough ablation studies for the proposed Pixel MSE and Deep Scale methods. It would be beneficial to break down the impact of different components (e.g., center shift, different scale bins) and analyze their contributions to the final performance.
(-) The manuscript does not address the computational complexity or real-time performance of the proposed TTC estimation methods. In real-world applications like autonomous driving, it is crucial to assess the trade-off between accuracy and computational cost, especially for resource-constrained systems.
(-) The dataset is primarily focused on highway and urban driving scenes, which limits its applicability in more complex scenarios like pedestrian-rich environments or non-vehicle objects. Including data for other road users, such as pedestrians and cyclists, would broaden the scope of the dataset.
Justification of Rating
While I do not have extensive experience in evaluating dataset papers, I can appreciate the significance of introducing the TSTTC dataset for TTC estimation in autonomous driving scenarios.
The dataset fills a notable gap by providing large-scale, real-world data for both urban and highway driving, which will undoubtedly be useful for researchers working on Time-to-Contact estimation tasks. The inclusion of NeRF-generated sequences for small TTC cases is an innovative approach to addressing the imbalance in real-world data, although a more detailed analysis of the impact of this synthetic data is needed.
The proposed TTC estimation methods, Pixel MSE and Deep Scale, provide useful baselines for further research, but the manuscript could be improved by discussing their computational complexity and real-time performance, which are crucial factors for practical deployment in ADAS.
Additionally, the paper lacks thorough ablation studies and analysis of the individual contributions of different components in the methods.
Given these factors, I would leave more room for other reviewers to determine the overall contribution and novelty of this work.
Questions
- Q1: Could the authors provide a more detailed analysis of the impact of NeRF-generated data on the model’s generalization capabilities, especially for smaller TTC values? How much do these synthetic sequences contribute to performance improvements?
- Q2: The manuscript lacks ablation studies. Could the authors provide more details on the individual contributions of the components in the proposed methods, such as the center shift or scale bins? This would help clarify the relative importance of each element in the final performance.
- Q3: The paper does not discuss the computational complexity or real-time performance of the proposed methods. How do Pixel MSE and Deep Scale perform in terms of runtime, and are they suitable for real-time applications in ADAS?
- Q4: The dataset primarily focuses on vehicles in highway and urban scenarios. Do the authors have plans to extend the dataset to more complex environments, such as pedestrian-rich or mixed-traffic scenarios? If so, how would the proposed methods perform in these cases?
Since the authors didn't attempt to provide a rebuttal, I have to keep the current rating for this submission.
This paper proposes a large-scale monocular Time-to-Collision (TTC) dataset for driving scenarios, including both 2D and 3D NeRF (Neural Radiance Field) image data. Additionally, it introduces two simple yet effective TTC estimation algorithms to validate the effectiveness of the proposed approach. Future work will focus on expanding the types of scenarios in the dataset or incorporating more safety-critical situations. Overall, this paper offers new insights into TTC estimation methods in the field of autonomous driving.
Strengths
- The author provides a very detailed explanation of the dataset processing.
- This work builds a large-scale TTC dataset and provides simple yet effective TTC estimation algorithms as baselines for the community.
Weaknesses
- The layout and caption of Figure 2 on page 5 require refinement.
- It is recommended that the author provide a flowchart to illustrate further details on the continuous monocular image generation of NeRF images.
- Do lighting conditions (such as nighttime or low-light environments) or weather (such as rain or snow) affect the quality of NeRF rendering? The author could include some statistics on these factors in the dataset, as they are highly relevant in autonomous driving scenarios.
- In real driving scenarios, the motion trajectories of objects can be highly complex, which may lead to cumulative errors in speed and depth estimation, ultimately affecting the accuracy of TTC estimation. It is recommended that the author, after the derivation of the formulas, further discuss how to reduce the impact of these errors on the final results, for example, by using filtering or other smoothing techniques.
- In the future, it is necessary to increase the diversity of scene types in the dataset.
Questions
- Please provide more details for the NeRF rendering concerning different weather and light conditions.
- Trajectory features are hard to maintain in complex driving scenes. Please discuss solutions for reducing the impact of cumulative trajectory errors on the final results.
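As an illustration of the kind of filtering suggested in this review, the sketch below applies an exponential moving average to the inverse TTC derived from per-frame scale ratios. This is a hedged example only: the function name `smooth_ttc`, the smoothing factor `alpha`, and the choice to smooth the inverse TTC rather than the TTC itself are assumptions, not something the paper or the reviewer specifies.

```python
def smooth_ttc(scale_ratios, dt, alpha=0.5):
    """Return smoothed TTC values (seconds), one per input scale ratio.

    scale_ratios: per-frame-pair ratios (later size / earlier size), assumed given.
    dt: time between frames in seconds.
    alpha: weight of the newest measurement, between 0 and 1 (illustrative).
    """
    smoothed_rate = None
    ttcs = []
    for s in scale_ratios:
        # Inverse TTC is linear in the scale ratio, so averaging it is
        # better behaved than averaging TTC, which diverges as s -> 1.
        rate = (s - 1.0) / dt
        if smoothed_rate is None:
            smoothed_rate = rate
        else:
            smoothed_rate = alpha * rate + (1.0 - alpha) * smoothed_rate
        # A non-positive smoothed rate means no approach is detected.
        ttcs.append(float("inf") if smoothed_rate <= 0 else 1.0 / smoothed_rate)
    return ttcs
```

A Kalman filter over depth and velocity would be a more principled alternative; the moving average is shown only because it is the simplest instance of the idea.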
Details of Ethics Concerns
none
Because there is no rebuttal file, I keep the original rating.
This work proposes a Time-to-Contact dataset and two baselines for estimating time-to-contact. Time-to-Contact (TTC), the time for an object to collide with the observer's plane, is an important metric in autonomous driving. Vision-based, and especially RGB-based, TTC estimation methods are needed for cost efficiency.
Strengths
- A large dataset.
- Promising performance.
Weaknesses
- The quality of the ground-truth labels is unclear. The authors should conduct experiments to show the quality of the labels.
- Although the dataset is large, what does it bring? Can other methods benefit from using the dataset? In other words, a cross-dataset experiment is needed.
- The introduction of the NeRF dataset is somewhat unclear. Why should it be used? How realistic is the dataset?
- The authors use the MiD metric as the main metric rather than the RTE metric, stating that "Due to the instability of TTC at larger value". This makes the reviewer wonder about the validity of the TTC metric and the correctness of the proposed approaches.
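For context on the quoted instability, the following sketch shows how a fixed error in the estimated scale ratio translates into a relative TTC error that grows with the true TTC. It is an illustration under assumptions, not an analysis from the paper: it uses the constant-closing-speed relation TTC = dt / (s - 1), where s is the later image size divided by the earlier one, and the names `relative_ttc_error` and `scale_error` are hypothetical.

```python
def relative_ttc_error(true_ttc: float, dt: float, scale_error: float) -> float:
    """Relative TTC error caused by an additive error in the scale ratio."""
    # Scale ratio that corresponds to the true TTC under TTC = dt / (s - 1).
    true_scale = 1.0 + dt / true_ttc
    # TTC recovered from the perturbed scale ratio.
    estimated_ttc = dt / (true_scale + scale_error - 1.0)
    return abs(estimated_ttc - true_ttc) / true_ttc


# Same scale-ratio error (0.001) at 10 Hz, different true TTC values.
err_near = relative_ttc_error(true_ttc=1.0, dt=0.1, scale_error=0.001)
err_far = relative_ttc_error(true_ttc=10.0, dt=0.1, scale_error=0.001)
# err_near is roughly 1%, err_far is roughly 9%.
```

Because the scale ratio approaches 1 as TTC grows, the denominator s - 1 becomes small and a given measurement error is amplified, which is one plausible reason for preferring a scale-based metric such as MiD at large TTC.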
Questions
- It is unclear why object-level TTC is used rather than pixel-level TTC. It seems that pixel-level TTC is more informative than object-level TTC.
- The objective of the proposed baseline is to estimate the scale ratio of objects. Can we use an object detection algorithm to perform such a task? Why should we use the proposed baselines? I have found that there is an object detection method (SOT) in the experiment. Is the SOT method retrained or finetuned on the proposed dataset?
- What is the LiDAR model used in Table 2?
Minors
- Line 064, what does "class 8" mean?
- The authors state that "due to page limitation", only the overall RTE metric is reported. The page limit of ICLR is 10 pages, so please present more results.
- Line 238: How do you obtain the velocity of the vehicle? Using which data, and by what means? Please be specific. Is the velocity data provided by radar accurate?
- Line 238: How exactly do you fit the velocity from the depth?
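One plausible reading of "fitting the velocity from the depth" is a least-squares line fit of depth against time, with the slope taken as the relative velocity. The sketch below illustrates that reading only; the paper's actual procedure is not described here, and the function name `ttc_from_depth_track` and its inputs are hypothetical.

```python
def ttc_from_depth_track(times_s, depths_m):
    """TTC (seconds) at the last timestamp from a linear fit of depth vs. time.

    times_s: timestamps in seconds; depths_m: object depths in meters.
    Assumes constant relative velocity over the window.
    """
    n = len(times_s)
    if n < 2 or n != len(depths_m):
        raise ValueError("need at least two matching (time, depth) samples")
    t_mean = sum(times_s) / n
    z_mean = sum(depths_m) / n
    # Ordinary least-squares slope (m/s) and intercept (m).
    num = sum((t - t_mean) * (z - z_mean) for t, z in zip(times_s, depths_m))
    den = sum((t - t_mean) ** 2 for t in times_s)
    if den == 0:
        raise ValueError("timestamps must not all be identical")
    slope = num / den
    intercept = z_mean - slope * t_mean
    if slope >= 0:
        # Depth is not decreasing: no contact under this model.
        return float("inf")
    fitted_last_depth = slope * times_s[-1] + intercept
    return -fitted_last_depth / slope


# Six frames at 10 Hz, closing at 20 m/s from 30 m.
times = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
depths = [30.0, 28.0, 26.0, 24.0, 22.0, 20.0]
track_ttc = ttc_from_depth_track(times, depths)  # about 1.0 s
```

Fitting over several frames rather than differencing two depths reduces sensitivity to per-frame depth noise, which is relevant to the reviewer's question about label accuracy.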
I will keep my rating as the authors do not respond to the comments.
This paper introduces a time-to-contact dataset motivated by the safety requirements of redundancy systems in ADAS. In addition to data collected from real-world scenarios, the dataset introduces NeRF scenes to further extend the number of safety-critical scenarios. Beyond the dataset, the paper models the TTC estimation problem as estimating the scale ratio of 2D bounding boxes, making TTC solvable from image pairs alone, and introduces Pixel MSE and Deep Scale accordingly. Experiments demonstrate their effectiveness.
Strengths
- The time-to-contact estimation task makes sense to me. Although existing tasks such as depth estimation, velocity estimation, or motion prediction can also be used to solve this task, using a TTC system as a redundancy system makes sense to me.
Weaknesses
I have concerns regarding the following three aspects:
- Dataset.
  - Object range. The paper highlights its dataset contributions in Table 1 and bolds the object range tag. However, from the histogram in Figure 3, objects farther than 200 m are relatively few. I would like to see a histogram comparing against existing datasets, such as Argoverse 2 [1] and Cityscapes 3D [2], in terms of the relative number of distant 3D objects.
  - From L186, the paper mentions that all real-world data are collected from commercial trucks. I am wondering what the size of the trucks is, and whether the dataset is applicable to ADAS systems on commercial cars rather than large trucks.
  - NeRF scenes. From Figure 8, I can see a visual domain gap between the NeRF-reconstructed scenes and the real-world scenes. Thus, I am also wondering about the effectiveness of introducing NeRF scenes. Does it bring better training performance, or does it make the testing harder? If it makes the testing scenarios harder, does this difficulty come from the domain gap?
[1] Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting
[2] Cityscapes 3D: Dataset and Benchmark for 9 DoF Vehicle Detection
- Method.
  - Method details. In L287, the paper mentions that Eq. (4) is derived from Eq. (1) and Eq. (3). However, it is not obvious to me how this is achieved. I suggest the authors add more details on the combination.
- Experiments.
  - Redundancy design. The paper highlights TTC estimation as a redundancy design for ADAS systems. Therefore, I am wondering whether TTC estimation helps planning or the ADAS system. I suggest adding experimental results demonstrating that TTC estimation helps planning or other components in autonomous driving, instead of experiments solely on TTC estimation. This would show that TTC estimation really helps.
  - Some additional experiments. As TTC estimation can be achieved either by scale difference estimation or by depth-velocity estimation (and possibly by motion prediction as well), I suggest the authors compare their methods with these alternative methods.
  - Distance breakdowns are suggested, i.e., TTC estimation errors for different object distance ranges.
Questions
See weaknesses.
Details of Ethics Concerns
N/A
This paper's main contribution is a dataset for evaluating time-to-contact that assesses collision risk in ADAS (driver assistance) systems. The proposed dataset provides monocular videos and consists of manually selected (200) sequences sampled from thousands of hours of driving footage. The proposed approach utilizes NeRFs to render additional sequences/scenarios to train the model and increase data diversity. The paper also provides two reasonable baselines evaluated on the newly constructed dataset.
However, reviewers also point out that (i) the constructed dataset lacks diversity (it focuses on highway and urban driving scenes and lacks pedestrian-populated environments); (ii) the addition of NeRF-generated sequences is not well justified (as W2A7 points out, the impact of synthetic data on the model's generalization capability is not evaluated); (iii) the quality of the provided labels is not convincingly demonstrated; and (iv) it is unclear whether methods trained on the proposed dataset can generalize across datasets.
As the authors did not respond, reviewers unanimously recommended not accepting this paper (with final ratings 4, 5, 5, 5). The AC agrees with this assessment.
Additional Comments on Reviewer Discussion
There was no rebuttal and no discussion.
Reject