
Poster · 4 reviewers

Lowest rating: 3 · Highest rating: 3 · Standard deviation: 0.0

ICML 2025

S2-Track: A Simple yet Strong Approach for End-to-End 3D Multi-Object Tracking

Tao Tang,Lijun Zhou,Pengkun Hao,Zihang He,Kalok Ho,Shuo Gu,Zhihui Hao,Haiyang Sun,Kun Zhan,Peng Jia,XianPeng Lang,Xiaodan Liang


Submitted: 2025-01-10 · Updated: 2025-07-24

TL;DR

A simple yet strong approach for end-to-end 3D multi-object tracking

Abstract

Keywords

3D Multi-Object Tracking, S2-Track

Reviews and Discussion

Review

Rating: 3 · 2025-02-27

This paper proposes a novel end-to-end 3D multi-object tracking method, aimed at addressing complex scenarios in autonomous driving perception, such as occlusions and small object tracking. The authors decompose the existing end-to-end 3D MOT framework into three core components (query initialization, query propagation, and query matching) and introduce corresponding improvements for each part. Experimental results demonstrate that the proposed method achieves state-of-the-art performance on several datasets. Although the optimizations proposed by the authors in the three parts are combinations of existing techniques, they still effectively enhance the performance of the end-to-end multi-object tracking paradigm, which sounds quite impressive.

Update after rebuttal

Thank you for the authors' response, which has addressed most of my concerns. Although the authors have adopted some existing techniques, the performance improvement in their end-to-end tracking algorithm remains commendable in my assessment. Particularly noteworthy is its successful application in real-world factory settings. Additionally, I hope the code can be made open-access to advance research in the end-to-end MOT field. I keep my score.

Questions for Authors

I noticed that the authors mentioned the inference speed of their proposed method is only 7.5 FPS on an NVIDIA A100. If deployed on an in-vehicle GPU, would it meet the requirements for real-time inference? If not, how far is this end-to-end multi-object tracking approach from practical application?

Claims and Evidence

Yes, the claims made in the submission are well-supported by clear and convincing evidence.

Methods and Evaluation Criteria

Yes, the proposed methods and evaluation criteria are well-suited for the problem of 3D multi-object tracking in autonomous driving.

Theoretical Claims

Yes, I have checked the theoretical claims of the proposed method.

Experimental Designs and Analyses

Yes, I have checked the experimental designs and analyses.

Supplementary Material

Yes, I have reviewed the supplementary material, including Additional Details, Additional Results, and Additional Visualizations.

Relation to Existing Literature

The paper builds on recent advances in end-to-end query-based trackers and leverages ideas from 2D-to-3D perception and probabilistic modeling to address limitations in 3D MOT. Therefore, I believe one of the core contributions of this paper is that the performance of its proposed end-to-end multi-object tracking method surpasses both end-to-end and non-end-to-end methods. I think end-to-end approaches represent the future trend for both autonomous driving and multi-object tracking, and this paper validates the potential of end-to-end methods.

Essential References Not Discussed

No.

Other Strengths and Weaknesses

Strengths:

1. The proposed 2D-Prompted Query Initialization and Uncertainty-aware Probabilistic Decoder represent innovative improvements to existing end-to-end multi-object tracking methods. The integration of 2D information with 3D localization for query initialization sounds interesting.

2. The Hierarchical Query Denoising strategy is a novel contribution that addresses noise issues in training, providing a new solution for enhancing the robustness of end-to-end frameworks, although the core idea originates from DN-DETR.

3. Achieving state-of-the-art (SOTA) performance is highly significant for end-to-end multi-object tracking.

Weaknesses:

1. I believe the authors should discuss the computational complexity (not only inference speed) of the proposed method in comparison to previous approaches, in order to more clearly highlight the advantages and disadvantages of their method.

2. The authors should specify whether other methods use full resolution or reduced resolution, in order to make the comparison fairer.

Other Comments or Suggestions

See Strengths and Weaknesses.

Ethical Review Concerns

N/A

Author Response

2025-03-29

Thanks for your time and insightful feedback. We especially appreciate your recognition of our well-designed modules with innovative improvements, and your view that our impressive performance validates the potential of end-to-end methods. We respond in detail below and will add these points to the revision.

Q1: Discuss the computational complexity, in order to more clearly highlight the advantages and disadvantages of their method.

Thanks for your valuable suggestion! We provide the additional computational complexity of the proposed method below. S2-Track adds only about 7.0% more parameters and 2.4% more FLOPs than PF-Track, and this trade-off yields a 12.3% improvement in AMOTA. Further improvements in tracking efficiency are a promising direction for future research. To highlight the disadvantages of our method, we have added a "Limitations and Future Work" section in the revision, hoping to inspire further exploration in this field.

Method	FLOPs	Parameters
PF-Track	534G	91.8 M
S2-Track	547G	98.3 M
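
As a quick sanity check on the overhead figures quoted above, the relative increases can be recomputed from the table. This is a minimal sketch: only the four numbers come from the rebuttal, while the dictionary layout and the helper name `relative_overhead` are illustrative assumptions.

```python
# Figures copied from the rebuttal table (FLOPs in G, parameters in M).
pf_track = {"flops_g": 534.0, "params_m": 91.8}
s2_track = {"flops_g": 547.0, "params_m": 98.3}


def relative_overhead(new: float, base: float) -> float:
    """Return the increase of `new` over `base` as a percentage."""
    return (new - base) / base * 100.0


flops_overhead = relative_overhead(s2_track["flops_g"], pf_track["flops_g"])
params_overhead = relative_overhead(s2_track["params_m"], pf_track["params_m"])

print(f"FLOPs overhead: {flops_overhead:.1f}%")
print(f"Parameter overhead: {params_overhead:.1f}%")
```

With the table values, the FLOPs overhead comes out at about 2.4% and the parameter overhead at about 7.1%, in line with the "about 7.0%" quoted in the response.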

Q2: Specify whether other methods use full resolution or reduced resolution

Thank you for your valuable suggestion! The numbers in our tables are directly taken from the corresponding methods' papers or the official nuScenes leaderboard, and we report the highest values for each method. Based on your suggestion, we have added a column to the tables in the revision to include the corresponding resolutions, allowing readers to make a more comprehensive assessment. Below, we provide some of the resolution settings.

Method	S2-Track-F	PF-Track-F	S2-Track-S	PF-Track-S	ADA-Track	Sparse4Dv3	DQTrack
Resolution	1600 × 640	1600 × 640	800 × 320	800 × 320	1600 × 900	1408 × 512	1408 × 512

Q3: If deployed on an in-vehicle GPU, would it meet the requirements for real-time inference? If not, how far is this end-to-end multi-object tracking approach from practical application

Thank you for your comment! We have successfully deployed S2-Track on a real-world autonomous vehicle equipped with the NVIDIA Drive AGX Orin platform. By incorporating various engineering optimizations, such as TensorRT quantization, we have achieved real-time performance exceeding 20 Hz on the vehicle. Additionally, we have provided a video showcasing the real-world results on an anonymous GitHub Page: https://anonymous-github-8ab1cv.github.io/s2-track/. However, to achieve large-scale production and deployment for high-level autonomous driving (e.g., L3/L4), further improvements in efficiency or platform (e.g., NVIDIA Drive AGX Thor) will be required.

Thanks again for your thoughtful feedback. We believe that your suggestions, along with our revision, have greatly enhanced the persuasiveness and completeness of our work. We hope our rebuttal can address your concerns.

Review

Rating: 3 · 2025-03-10

This paper presents a new method called S2-Track for 3D multiple object tracking (MOT), an essential component of perception in autonomous driving systems. Existing methods adopt end-to-end query-based trackers to simultaneously detect and track objects, but they fail to track objects in complex scenarios such as occlusions and small target objects. To address these issues, the authors first summarize the current end-to-end 3D MOT framework by decomposing it into three parts and then propose a corresponding improvement for each part. Experiments on the nuScenes dataset show that the method achieves 0.663 AMOTA on the test split, surpassing the previous best end-to-end solution by 8.9%. Overall, S2-Track decomposes the current end-to-end 3D MOT framework into three parts and proposes a corresponding improvement for each part, which improves AMOTA by 8.9% on the test split of the nuScenes dataset.

Questions for Authors

How do you come up with these improvements? Are there some connections among these ideas?
In Hierarchical Query Denoising section, there are few explanations. Can you explain the core ideas there more clearly?

Claims and Evidence

As mentioned in the query initialization part, each query consists of a feature vector and a 3D location. The 2D-Prompted Query Initialization section does not explain clearly enough how object queries are initialized from just a 3D location. Is the feature vector initialized randomly?
I find the Hierarchical Query Denoising section hard to follow. Without referring to the DN-DETR paper, it may be difficult to understand. More explanation would help readers understand the core ideas.

Methods and Evaluation Criteria

I think the proposed three improvements are indeed useful for the 3D multiple object tracking task. But I think all these improvements are just other papers’ ideas and the authors just simply combine these ideas together. I can’t figure out the connections between these improvements and think they are just separate tricks to improve the performance.

Theoretical Claims

I think most technical concepts are explained with appropriate detail and context, except the Hierarchical Query Denoising section, which I find hard to understand.

Experimental Designs and Analyses

This paper follows the experimental setup of previous works. I believe the thorough ablation studies, which reveal the specific contribution of each proposed improvement, and the overall superior performance compared with recent state-of-the-art methods demonstrate its efficiency.

Supplementary Material

I have read all the supplementary material the authors provided.

Relation to Existing Literature

Prior work has shown that using depth information can significantly improve the localization of objects in 3D space. The proposed 2D-prompted query initialization leverages this by using predicted 2D object locations and depth information to guide the object detection process more effectively, thereby addressing the challenge of correctly initializing queries, which is a known limitation of previous transformer-based object detection methods. By using a probabilistic decoder that models and captures uncertainty, the approach is aligned with the idea that complex environments and real-world data require models that can quantify and deal with uncertainty in predictions. This helps improve robustness and performance, especially in tasks where the true object locations or classes are not easily discernible due to noise or occlusions.

Essential References Not Discussed

There are no essential references that appear to be missing from the paper.

Other Strengths and Weaknesses

The paper presents three efficient improvements over existing methods but I think these proposed methods have already been broadly implemented in other computer vision fields.

Other Comments or Suggestions

None

Author Response

2025-03-29

Thanks for your time and insightful feedback. We especially appreciate your recognition of our three useful improvements and superior tracking performance. We respond in detail below and will add these points to the revision.

Q1: How to initialize object queries with 3D location.

Thanks for your comment! We utilize the 3D location to initialize queries through the following steps: 1) normalize the input coordinates to the [0, 2π] range; 2) generate frequency bands using exponential temperature scaling; 3) compute sine/cosine components for each dimension (X, Y, Z); 4) concatenate the encoded dimensions; 5) project the concatenated features through two linear layers with ReLU activation. We will include these implementation details in the revision.
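
The five steps listed above can be sketched in plain Python as follows. This is an illustrative reading of the description, not the authors' implementation: the coordinate range `PC_RANGE`, the sizes `NUM_FEATS` and `EMBED_DIM`, the temperature value, the Linear-ReLU-Linear ordering, and the randomly initialised weights (standing in for trained layers) are all assumptions.

```python
import math
import random

# Hypothetical settings, not taken from the paper.
PC_RANGE = ((-51.2, 51.2), (-51.2, 51.2), (-5.0, 3.0))  # (min, max) per axis
NUM_FEATS = 8         # encoding size per axis
EMBED_DIM = 16        # query embedding size
TEMPERATURE = 10000.0


def normalize(value, lo, hi):
    """Step 1: map a coordinate from [lo, hi] to [0, 2*pi]."""
    return (value - lo) / (hi - lo) * 2.0 * math.pi


def sine_encode(angle, num_feats=NUM_FEATS, temperature=TEMPERATURE):
    """Steps 2-3: exponentially spaced frequencies; sine on even slots, cosine on odd."""
    encoding = []
    for i in range(num_feats):
        freq = temperature ** (2 * (i // 2) / num_feats)
        phase = angle / freq
        encoding.append(math.sin(phase) if i % 2 == 0 else math.cos(phase))
    return encoding


def make_linear(rng, in_dim, out_dim):
    """Random (weight, bias) pair standing in for a trained linear layer."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def linear(x, layer):
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_query(xyz, layer1, layer2):
    """Steps 1-5: turn a 3D location into a query embedding."""
    encoded = []
    for value, (lo, hi) in zip(xyz, PC_RANGE):
        encoded.extend(sine_encode(normalize(value, lo, hi)))  # step 4: concatenate
    hidden = [max(0.0, h) for h in linear(encoded, layer1)]    # first linear + ReLU
    return linear(hidden, layer2)                              # second linear


rng = random.Random(0)
layer1 = make_linear(rng, 3 * NUM_FEATS, EMBED_DIM)
layer2 = make_linear(rng, EMBED_DIM, EMBED_DIM)
query = init_query((12.5, -3.0, 0.4), layer1, layer2)
print(len(query))
```

Under these assumptions the feature vector is a deterministic function of the 3D location, which speaks to the reviewer's question of whether it is initialized randomly.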

Q2: More explanations of core ideas in HQD.

Thanks for your suggestion! In complex 3D MOT scenarios, challenges such as occlusions and varying object sizes can hinder the learning and convergence of query-based methods. The slow convergence and suboptimal results stem from the instability of bipartite graph matching. To address this, we feed GT bounding boxes perturbed with noise into the decoder and train the model to reconstruct the original boxes, which effectively reduces the difficulty of graph matching and leads to faster convergence. Moreover, we define hierarchical difficulty levels for the perturbed queries to enhance the model’s ability to handle diverse driving scenarios. We have included these explanations in the revision.
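
The denoising idea described above can be illustrated with a small sketch. The response does not say what the two HQD thresholds measure, so the code below makes one plausible assumption: queries are bucketed by how far the box centre was shifted relative to the box size, into "easy", "hard", and "negative" levels. The threshold values, level names, and helper functions are all hypothetical.

```python
import random

# Hypothetical thresholds on the relative centre shift; not values from the paper.
LOWER, UPPER = 0.3, 0.7


def perturb_box(box, max_shift, rng):
    """Shift the centre of (cx, cy, cz, w, l, h) by up to max_shift times its size."""
    cx, cy, cz, w, l, h = box
    shifts = [rng.uniform(-max_shift, max_shift) for _ in range(3)]
    noisy = (cx + shifts[0] * w, cy + shifts[1] * l, cz + shifts[2] * h, w, l, h)
    return noisy, max(abs(s) for s in shifts)


def difficulty_level(rel_shift, lower=LOWER, upper=UPPER):
    """Bucket a noised query by how far it moved relative to the box size."""
    if rel_shift < lower:
        return "easy"
    if rel_shift < upper:
        return "hard"
    return "negative"


rng = random.Random(0)
gt_boxes = [(10.0, 2.0, 0.5, 1.9, 4.5, 1.6), (-4.0, 7.5, 0.9, 0.6, 0.7, 1.7)]
denoising_queries = []
for gt in gt_boxes:
    noisy, rel_shift = perturb_box(gt, max_shift=1.0, rng=rng)
    level = difficulty_level(rel_shift)
    # Non-negative queries are trained to regress back to the clean box.
    target = gt if level != "negative" else None
    denoising_queries.append((noisy, level, target))

for noisy, level, target in denoising_queries:
    print(level, [round(v, 2) for v in noisy[:3]])
```

Because each noised query carries its own ground-truth target, no bipartite matching is needed for this part of the loss, which is the property the response credits for faster convergence.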

Q3: How do you come up with these improvements? Are there connections among these ideas?

Thanks for your comment! As mentioned in the Introduction (L71–101), with the goal of enhancing existing end-to-end trackers in complex driving environments, we first decompose the current query-based framework into three constituent parts: query initialization, propagation, and matching (Fig. 1(b)). We then propose a corresponding improvement for each part: PQI, UPD, and HQD. These modules are connected by their shared foundation, the query-based framework, and all of them target challenges posed by complex environments. As Reviewer WzGu acknowledged, "The three modules (PQI, UPD, HQD) address the tasks in different stages of the query tracking lifecycle."

Q4: all these improvements are just other papers’ ideas

Thanks for your comments! We respectfully disagree. Current end-to-end trackers are still in the early stages and struggle with complex driving scenarios. In response, S2-Track comprehensively improves the existing framework. For the PQI module, we leverage predicted 2D locations and depth information to enhance query initialization. While previous works in detection have explored the use of depth information, none have leveraged it for query initialization. For the UPD module, the uncertainty issue has never been explored in 3D MOT, let alone addressed with an Uncertainty-aware Probabilistic Decoder for tracking. For the HQD module, although it draws inspiration from DN-DETR, we have improved it by introducing Hierarchical Query Denoising. As demonstrated in the ablations (Tab. 5), our improvements outperform the original DN-DETR. None of these proposed modules is just another paper's idea. Moreover, S2-Track is not a simple combination of these ideas; it delivers impressive tracking results, showcasing the potential of the end-to-end framework, as acknowledged by all other reviewers:

Reviewer WzGu: "The design of UPD is novel... The framework is both simple and strong, .... It also achieves outperforming performance with the refined transformer query mechanisms."
Reviewer WzGu: "... incorporating three novel modules, ... brought by the newly-designed modules."
Reviewer 4TdZ: "(PQI & UPD) represent innovative improvements, (PQI) sounds interesting, (HQD) is a novel contribution ... providing a new solution, (framework) the core contributions ... validates the potential of end-to-end methods."

Finally, we try to understand the reviewer's perspective. However, without S2-Track decomposing the current framework into constituent parts and proposing targeted improvements, it would be difficult for the community to grasp the current limitations clearly. If these problems were so easily solvable in tracking, how could S2-Track achieve such a significant improvement (+8.9% AMOTA) over the previous SOTA? As Reviewer 4TdZ stated: "Effectively enhances the performance of the end-to-end multi-object tracking paradigm, which sounds quite impressive." We would sincerely appreciate the reviewer's recognition of the substantial effort behind our simple and strong framework.

Thanks again for your thoughtful feedback and time. We believe that your suggestions, along with our revision, have greatly enhanced the persuasiveness and completeness of our work. We hope our rebuttal can address your concerns.

Reviewer Comment

2025-04-08

We appreciate the reviewer’s suggestion and have carefully considered it.

Author Comment

2025-04-08

Dear Reviewer,

Thank you for your response. However, we found your reply slightly unclear. Unless the message was posted in the wrong chat box, we speculate that you intended to express agreement with the other reviewers' suggestions and to acknowledge our efforts in developing a simple yet strong end-to-end 3D MOT framework. We are glad that our rebuttal may have addressed your concerns, and we would sincerely appreciate it if you would consider updating your score.

If you have any further questions or require additional clarification, we would be happy to provide more information.

Thank you once again!

Sincerely,

Authors

Review

Rating: 3 · 2025-03-13

The paper aims to improve the existing end-to-end 3D multi-object tracking framework. Specifically, the authors propose 2D-prompted query initialization, an uncertainty-aware probabilistic decoder, and hierarchical query denoising. Experimental results on the nuScenes benchmark show the effectiveness of the proposed framework.

update after rebuttal

Please see the rebuttal comment below.

Questions for Authors

In the analysis of uncertainty (line 323-325-left column), the authors do not provide an analysis on the observation that other modules also effectively reduce the uncertainty. This weakens the motivation of the designed UPD module, as other modules could also achieve that. Do the authors have insights of why it is the case?
In Section 4.4.1, the authors conduct an ablation study on the upper-bound and lower-bound thresholds. Does this require searching over parameter pairs? That would be very time-consuming. Also, these parameters may need to be searched again for different data distributions.
The major concerns of the reviewer are listed in Other Strengths And Weaknesses section. Although the limitations might not be easily addressed during rebuttal, the reviewer would appreciate any explanation or interpretation on these limitations.

The reviewer may adjust the final rating after rebuttal based on the clarifications from the authors.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

The proposed methods and evaluations are reasonable.

Theoretical Claims

No theoretical claims and proofs involved.

Experimental Designs and Analyses

The proposed method is only evaluated on nuScenes dataset, which may not be enough to demonstrate the effectiveness of the proposed method on the MOT task.

Supplementary Material

The reviewer reviews all the supplementary material, including the video.

Relation to Existing Literature

The paper further improves the existing multi-object tracking framework by incorporating three novel modules.

Essential References Not Discussed

No.

Other Strengths and Weaknesses

Strength:

The paper is well-written, with clear structure and illustrations.

Weakness:

The proposed method is evaluated on only one dataset, which is not enough to show its robustness in the multi-object tracking task.
The reviewer acknowledges the improvement brought by the newly-designed modules, but the general framework is still based on existing methods. The insights for this task and the community are limited.

Other Comments or Suggestions

The qualitative results in Figure 4(a) are hard to see. The reader cannot easily tell the ground truth. Maybe show the ground truth as a separate image, or show a clean image without any annotation.

Author Response

2025-03-29

Thanks for your time and insightful feedback. We especially appreciate your recognition of our effective framework with newly-designed modules and our well-written paper. We respond in detail below and will add these points to the revision.

Q1: The proposed method is only evaluated on nuScenes dataset

Thanks for your suggestion! First, since the nuScenes dataset provides comprehensive detection and tracking tasks, along with evaluation leaderboards, most previous detection and tracking methods (e.g., DQTrack [ICCV23], PF-Track [CVPR23], and ADA-Track [CVPR24]) have been only evaluated on nuScenes. Therefore, following them, we also perform a fair comparison on this dataset. Second, we agree with the reviewer’s point that methods evaluated on only one dataset have not demonstrated robustness across multiple datasets. To address this concern, we present additional evaluations on an in-house autonomous driving dataset, which is collected from real-world scenarios. The results show that our method effectively tracks objects in challenging environments, demonstrating the generalization and robustness of our S2-Track. Additionally, we have provided a video showcasing the real-world results on an anonymous GitHub Page: https://anonymous-github-8ab1cv.github.io/s2-track/.

Method	MOTA	MOTP	RECALL
PF-Track	0.549	0.476	54.8%
S2-Track	0.712	0.334	77.3%

Q2: The general framework is still based on existing methods. The insights to this task and the community is limited.

Thank you for your comments! However, we respectfully disagree with the "limited insights" assessment. While S2-Track employs the same query-based end-to-end framework as previous works, as mentioned in the Introduction, current end-to-end trackers are still in the early stages of development and are unable to effectively handle the various complex driving scenarios and achieve satisfactory tracking results. Therefore, S2-Track comprehensively enhances the existing end-to-end 3D MOT framework, delivering impressively robust and accurate tracking results and demonstrating the potential of end-to-end frameworks.

As Reviewer 4TdZ acknowledged, "The paper builds on recent advances in end-to-end query-based trackers and ... to address limitations in 3D MOT. Therefore, I believe one of the core contributions of this paper is that the performance of its proposed end-to-end multi-object tracking method surpasses both end-to-end and non-end-to-end methods. I think end-to-end approaches represent the future trend for both autonomous driving and multi-object tracking, and this paper validates the potential of end-to-end methods."

Q3: Analysis of other modules also effectively reduce the uncertainty.

Thanks for your comment! Our PQI module leverages learned certain priors, i.e., 2D object location and depth information, to enhance the initialization of queries, thus effectively reducing the uncertainty in query initialization and resulting in more accurate object localization and tracking. The HQD strategy introduces different levels of noise to the queries and then applies a denoising process, allowing the model to encounter varying magnitudes of noise (i.e., uncertainty) during training. This effectively helps the model reduce uncertainty during query matching, leading to more stable and accurate tracking performance. Although the motivation of these two modules is not uncertainty, they both help the model reduce uncertainty during query initialization and matching. Moreover, they are incorporated together with the UPD module, which aims to reduce uncertainty during query propagation. We will incorporate this discussion into the revision.

Q4: The parameters of HQD require searching.

Thanks for your comment! The HQD module indeed has two thresholds that need to be set. As shown in Tab 5, the impact of these two thresholds on the results remains relatively stable within a certain optimal range, meaning that we do not need to perform an extremely fine-grained search for their values. Simply identifying the approximate optimal range is sufficient. Furthermore, on our in-house autonomous driving dataset, we used the same values as those in nuScenes and also achieved satisfactory results. While a more detailed search might yield further improvements, the current settings already provide stable and satisfactory performance gains in most cases.

Q5: Add the GT in Fig 4.

Thanks for your suggestion! We have included the ground truth as separate images in Fig. 4 in the revision, making it easier for readers to understand.

Reviewer Comment

2025-04-03

Thanks for the detailed rebuttal. The authors have addressed most of my concerns. I maintain my point about limited novelty, but acknowledge the improvement from the newly designed modules. I have also read the reviews from the other reviewers and the authors' rebuttals. I would like to change my rating to weak accept.

Author Comment

2025-04-03

Dear Reviewer,

Thank you for your thoughtful feedback and for updating your scores. We appreciate that you acknowledge the improvement from our newly-designed modules, and we hope our S2-Track will inspire future research in this field. Thank you!

Sincerely,

Authors

Review

Rating: 3 · 2025-03-16

This paper proposes a simple yet strong end-to-end 3D multi-object tracking framework named S2-Track, which decomposes the tracking pipeline into three core stages: query initialization, propagation, and matching. Experiments show the effectiveness of each proposed module in complex scenarios: 2D-Prompted Query Initialization (PQI), Uncertainty-aware Probabilistic Decoder (UPD), and Hierarchical Query Denoising (HQD). The proposed framework is simple yet strong. It achieves excellent tracking performance when dealing with occlusions and small objects.

update after rebuttal

The authors have addressed most of my concerns, so I have decided to keep my positive rating.

Questions for Authors

Please refer to the above.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes. It shows that the proposed method works very well in small object tracking.

Theoretical Claims

No theoretical claim in this paper.

Experimental Designs and Analyses

Yes. For the performance comparison: On nuScenes dataset, S2-Track achieves state-of-the-art performance with an AMOTA of 66.3%, outperforming previous methods by 8.9%.

Supplementary Material

Yes. It provides a tracking video demo.

I have a question here. It seems that, even in the initial frame of the video, a very nearby lady is missed by the baseline PF-Track. I wonder if this is due to a difference in the detection part rather than in tracking. That is, the authors should use the same detection baseline to show convincingly that the tracking part of the proposed method is better.

Relation to Existing Literature

NA.

Essential References Not Discussed

One 3D object detection method that also follows a query-based paradigm and exploits a depth network is missing: 3DPPE, published in ICCV 2023. The authors should discuss it.

Other Strengths and Weaknesses

Strengths:

The three modules (PQI, UPD, HQD) address the tasks in different stages of the query tracking lifecycle. These designs improve the tracking performance in complex scenarios. Meanwhile, the design of UPD is novel, as it integrates uncertainty perception through a probabilistic attention mechanism. This allows the model to maintain stable predictions even in challenging scenarios such as occlusion, small targets, and distant objects.
The framework is both simple and strong, avoiding complex designs such as multi-stage tracking pipelines. It also achieves outperforming performance with the refined transformer query mechanisms.
Extensive ablation studies and visualization results show the effectiveness of each module. The results (66.3% AMOTA) on the nuScenes test set achieve SOTA tracking performance in query-based methods.

Weaknesses:

The paper lacks quantitative analysis of performance degradation in extreme scenarios (e.g., heavy occlusion, low-light nighttime conditions). The paper also lacks the results of category-aware AMOTA (e.g., pedestrian vs. vehicle) on the nuScenes validation set.
Is the assumption of a Gaussian distribution reasonable, or are there other distributions that might be more suitable for UPD?

Other Comments or Suggestions

I suppose that the proposed method can be also generalized to some query-based multi-modal 3D object detection methods. And a few of them also report the tracking results. The authors are suggested to add an experiment to show its generalizability.

Author Response

2025-03-29

Thanks for your time and insightful feedback. We especially appreciate your recognition of our simple and strong framework with effective modules and SOTA tracking performance. We respond in detail below and will add these points to the revision.

Q1: More analysis of other extreme scenarios

Thanks for your suggestion! In Table 4, we have already analyzed different occlusion situations, i.e., different visibilities. Here, we provide additional experiments of different weather and lighting conditions, and the results show that S2-Track is robust under different lighting and weather conditions, significantly boosting the performance under challenging rainy and nighttime scenes. The metric is AMOTA.

Method	Day	Night	Sunny	Rainy
PF-Track	41.3	12.6	41.8	36.1
S2-Track	46.5(+5.2)	19.7(+7.1)	46.7(+4.9)	42.6(+6.5)

Q2: Results of category-aware AMOTA (e.g., pedestrian vs. vehicle)

Thanks for your valuable comment! We provide detailed category-aware AMOTA results on both the val and test sets for better comparison, since previous methods and the leaderboard report comprehensive results on the test set. The results show that S2-Track achieves larger improvements in more challenging categories, e.g., pedestrian.

AMOTA	car	pedestrian	bicycle	bus	motorcycle	trailer	truck
Val-PF-Track-CVPR23	57.9	41.5	-	-	-	-	40.3
Val-S2-Track	62.0(+4.1)	47.0(+5.5)	38.6	55.3	40.6	32.5	44.6
Test-PF-Track-CVPR23	62.2	45.1	32.2	40.8	44.8	38.0	40.5
Test-ADA-Track-CVPR24	66.4	53.4	33.4	38.2	48.4	43.7	35.9
Test-S2-Track	77.4 (+15.2/11.0)	70.1 (+25.0/16.7)	57.6	65.8	67.5	64.3	61.0

Q3: Other distributions for UPD

Thanks for your suggestion! We conducted additional experiments to explore different distributions. The results show that other distributions did not achieve satisfactory performance, which may be attributed to the natural statistical properties following the central limit theorem, i.e., many natural phenomena (e.g., lighting variations, sensor noise) arise from the superposition of numerous small and independent effects, leading to normal distribution.

Distribution	Gaussian	Uniform	Exponential
AMOTA	45.8	22.3	4.7
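
For readers unfamiliar with how a Gaussian assumption can enter an attention layer, the following is a generic sketch of reparameterized Gaussian sampling applied to attention logits. It is not the authors' UPD: the thread only states that UPD uses a probabilistic attention mechanism under a Gaussian assumption, so the choice to place the distribution on the logits, the function names, and the example values are all assumptions.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_attention(mean_logits, log_std, rng):
    """Draw logits from N(mean, std^2) as mean + std * eps, then normalise."""
    noisy = [mu + math.exp(ls) * rng.gauss(0.0, 1.0)
             for mu, ls in zip(mean_logits, log_std)]
    return softmax(noisy)


rng = random.Random(0)
mean_logits = [2.0, 0.5, -1.0]  # hypothetical query-key similarity scores
confident = sample_attention(mean_logits, [-20.0] * 3, rng)  # near-zero std
uncertain = sample_attention(mean_logits, [0.0] * 3, rng)    # unit std

print([round(w, 3) for w in confident])
print([round(w, 3) for w in uncertain])
```

With a near-zero standard deviation the sampled weights collapse to ordinary softmax attention, while a larger standard deviation lets them vary from draw to draw. Writing the sample as mean plus scaled noise keeps it differentiable with respect to the mean and standard deviation, which is the usual reason for this formulation.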

Q4: Generalized to query-based 3D object detection.

Thanks for your suggestion! We have already presented our detection results on the nuScenes test and val sets in Table 9 and 10. As a framework designed for tracking, our model also achieves leading detection performance (62.7% mAP and 68.0% NDS on the test set), which clearly demonstrates our strong generalizability. While integrating our powerful modules into existing SOTA detection methods could lead to further improvements, due to computational resource constraints and the scope of this tracking paper, we leave this exploration to future work.

Q5: Missed reference: 3DPPE

Thanks for your comment! While 3DPPE [1] also involves depth priors in a query-based framework, it differs from S2-Track in several aspects. First, 3DPPE focuses on 3D object detection, whereas we tackle 3D MOT. Second, 3DPPE introduces 3D point positional encoding, while our PQI is designed for query initialization. Moreover, we also retain randomly initialized queries to explore missing objects. We will add this discussion into the revision.

[1] 3DPPE: 3D Point Positional Encoding for Transformer-based Multi-Camera 3D Object Detection

Q6: One very nearing lady is missed by the baseline PF-Track. I wonder if it is due to the difference in the detection part.

Thanks for your comment! In Table 7, S2-Track achieves comparable results with both detection heads, PETR (the default head of PF-Track) and DETR3D (the default head of S2-Track), indicating that the missed lady in the demo is not caused by the detection component. In fact, if you carefully review the video, you will notice that the nearby lady is detected in the first frame. However, due to interference from other vehicles and pedestrians, her bounding box is lost in subsequent frames. This is precisely where our carefully designed modules enhance performance, enabling stable and robust tracking in challenging scenarios.

Final Decision: Accept (poster)

2025-05-01

The paper addresses the challenges faced by current end-to-end 3D MOT frameworks in complex scenarios such as occlusions and small target sizes. The authors decompose the framework into three core components—query initialization, propagation, and matching—and introduce targeted improvements for each. Their solution, S2-Track, leverages 2D information for query initialization, employs probabilistic decoders for propagation, and uses hierarchical denoising for matching. This approach not only enhances performance but also maintains simplicity. The results demonstrate significant improvement on the nuScenes benchmark. The authors have effectively addressed the reviewers' concerns, leading all reviewers to agree on a "weakly accept" rating for the paper. ACs concur with this positive assessment and recommend acceptance of the paper.