Efficient Motion Prompt Learning for Robust Visual Tracking
Abstract
Reviews and Discussion
The paper presents a lightweight Motion Prompt Tracking module that integrates motion cues into vision-based trackers. Using a motion encoder with spatial, point, and temporal encodings and a Transformer-based fusion decoder with adaptive weighting, it improves robustness against occlusion and distractors. The method employs prompt learning, freezing the visual model while fine-tuning only the added modules.
update after rebuttal
I thank the authors for the rebuttal. I will keep my original rating, which was already positive.
Questions for Authors
N/A
Claims and Evidence
Main Claims:
- Motion prompts can effectively complement visual features to improve tracking robustness.
- A non-linear temporal positional encoding is theoretically justified and empirically optimal.
- The proposed module improves performance on multiple tracking benchmarks with minimal additional resources.
Evidence:
- Extensive quantitative results across benchmarks (VOT2018/2020/2022, LaSOT, LaSOTEXT, TNL2K, TrackingNet) demonstrate significant performance improvements.
- Ablation studies highlight the contributions of each component (positional encodings, data augmentation strategies, trajectory length).
- Visualizations and qualitative comparisons illustrate improved robustness over baseline trackers.
Methods and Evaluation Criteria
Methods:
- Motion Encoder: Encodes historical trajectories using spatial, point, and non-linear temporal positional encodings.
- Fusion Decoder: A two-layer Transformer network that fuses motion and visual features.
- Adaptive Weight Mechanism: Dynamically balances the contribution of visual and motion cues based on an auxiliary prediction.
- Training Strategy: Prompt learning with frozen visual backbones, reducing training time and memory usage.
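For a concrete picture of how these pieces fit together, here is a minimal PyTorch-style sketch of how such a motion-prompt module could be wired; all names, dimensions, and the exact fusion layout are assumptions for illustration, not the authors' implementation. In actual training, only this module's parameters would be optimized while the visual backbone stays frozen, matching the prompt-learning strategy above.

```python
import torch
import torch.nn as nn

class MotionPromptModule(nn.Module):
    """Illustrative sketch (not the paper's code): encode a trajectory of past
    boxes into prompt tokens, fuse them with frozen visual features via a
    Transformer decoder, and adaptively weight the motion-based prediction."""

    def __init__(self, d_model=256, n_heads=8, traj_len=16):
        super().__init__()
        self.motion_encoder = nn.Sequential(              # maps (x, y, w, h) per frame
            nn.Linear(4, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        # single learned embedding standing in for the spatial/point/temporal encoding trio
        self.temporal_pe = nn.Parameter(torch.zeros(traj_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.fusion_decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.weight_head = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())  # adaptive weight

    def forward(self, traj, visual_tokens, visual_pred, motion_pred):
        # traj: (B, T, 4) past boxes; visual_tokens: (B, N, d) from the frozen backbone
        prompts = self.motion_encoder(traj) + self.temporal_pe          # (B, T, d)
        fused = self.fusion_decoder(tgt=prompts, memory=visual_tokens)  # motion queries attend to vision
        w = self.weight_head(fused.mean(dim=1))                          # (B, 1) in [0, 1]
        return w * motion_pred + (1 - w) * visual_pred                   # weighted box prediction
```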
Evaluation Criteria:
- Performance metrics such as Expected Average Overlap (EAO), AUC, precision, and robustness.
- Comparison against state-of-the-art trackers and baseline models on standard benchmarks.
- Analysis of training efficiency, convergence, and ablation of different components.
Theoretical Claims
- The paper derives a non-linear temporal positional encoding using the Nyquist frequency principle.
- It argues mathematically that the chosen parameter setting optimally controls the frequency range of the temporal encoding.
- Experimental results confirm that this setting achieves the best performance compared to other values.
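As an illustration of the kind of constraint involved (the paper's exact derivation and parameterization may differ), a sinusoidal temporal encoding can be built so that its highest angular frequency stays at or below the Nyquist limit of π radians per frame for unit frame steps; the sketch below uses a geometric (non-linear) spacing of frequencies and hypothetical defaults.

```python
import torch

def nyquist_bounded_tpe(t, d_model, f_min=1e-4, f_max=torch.pi):
    """Hypothetical sketch of a temporal positional encoding whose angular
    frequencies are spaced geometrically (non-linearly) up to the Nyquist
    limit f_max = pi, so a trajectory sampled once per frame is not aliased.
    t: (T,) frame indices; returns (T, d_model)."""
    n = d_model // 2
    # geometric progression of angular frequencies in (f_min, f_max]
    freqs = f_min * (f_max / f_min) ** (torch.arange(n) / max(n - 1, 1))
    angles = t[:, None].float() * freqs[None, :]                    # (T, n)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

# Example: encode a 16-frame trajectory into 256-dim temporal embeddings
pe = nyquist_bounded_tpe(torch.arange(16), d_model=256)
```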
Experimental Design and Analysis
Design:
- Integration of the MPT module into various baseline trackers (OSTrack, SeqTrack, ARTrack).
- Evaluation on multiple challenging benchmarks to test robustness and accuracy.
Analyses:
- Ablation studies that individually assess the impact of each positional encoding, data augmentation strategies (CutMix, sparse sampling), and trajectory lengths.
- In-depth performance analysis comparing hard versus easy scenarios to validate robustness improvements.
- Visualizations of attention maps and training dynamics to further substantiate the method's benefits.
Supplementary Material
- Detailed proofs of the non-linear temporal positional encoding.
- Extended implementation details, including architecture settings and training parameters.
- Additional ablation studies and in-depth performance analyses (e.g., effects of trajectory length, data augmentation).
- Extra visualizations and qualitative comparisons, including video demonstrations.
Relation to Prior Literature
- The work builds on established vision-based tracking methods and recent advancements in prompt learning and Transformer architectures.
- It positions itself relative to methods like ARTrack, ODTrack, and other sequential training approaches by emphasizing efficiency and robustness.
- The integration of motion cues into visual tracking is contextualized in traditional motion estimation techniques and modern deep learning frameworks.
Missing Important References
N/A
Other Strengths and Weaknesses
Weaknesses:
- The focus on robustness may come at a slight cost to tracking accuracy compared to some sequential training methods.
- The approach might be sensitive to the quality of motion inputs, which could be further explored.
- Limited discussion of potential challenges when extending the method to tracking all moving objects, i.e., multi-object tracking scenarios. Does this cause any conflicts?
Other Comments or Suggestions
- A deeper analysis of failure cases and conditions under which the motion cues might mislead the tracker would be beneficial.
We thank the reviewer for the positive feedback, and address the concerns below.
-
Sensitivity to the trajectory quality: The proposed adaptive weight mechanism in our method can mitigate the impact of noisy trajectories to some extent. As shown in Figure 3 (Page 8 in the submitted paper), the predicted weights align with the accuracy of the motion predictions, which implicitly reflects the quality of the input trajectories.

|Success Rate|[0.0,0.1)|[0.1,0.2)|[0.2,0.3)|[0.3,0.4)|[0.4,0.5)|[0.5,0.6)|[0.6,0.7)|[0.7,0.8)|[0.8,0.9)|[0.9,1.0)|
|-|-|-|-|-|-|-|-|-|-|-|
|OSTrack_MPT|1%|21%|26%|32%|42%|63%|82%|94%|99%|100%|
To further analyze the sensitivity of our MPT to varying noisy trajectories, we evaluate the success rate (IoU>0.5) under different quality trajectory inputs on LaSOTEXT, as shown in the table above. Here, [a,b) represents the range where the average IoU of the trajectory is no less than a and less than b. The following conclusions can be drawn:
- The success rate increases significantly with the improvement of trajectory quality, especially when the trajectory IoU is in the range of [0.6, 1.0).
- Our method shows a certain level of robustness to noisy trajectory inputs, obtaining a success rate of 21% even when the trajectory IoU is in the range of [0.1, 0.2).
Thanks for your valuable advice; we will add the above analysis to our paper.
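For clarity, the binned analysis above could be reproduced along the following lines; the variable names and data loading are hypothetical, and the per-frame IoUs would come from running the tracker on LaSOTEXT.

```python
import numpy as np

def binned_success_rate(traj_ious, pred_ious, iou_thresh=0.5, n_bins=10):
    """traj_ious[i]: average IoU of the historical trajectory fed to frame i;
    pred_ious[i]: IoU of the tracker's prediction at frame i.
    Returns the success rate (pred IoU > iou_thresh) per trajectory-quality bin."""
    traj_ious, pred_ious = np.asarray(traj_ious), np.asarray(pred_ious)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rates = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (traj_ious >= lo) & (traj_ious < hi)   # quality bin [lo, hi)
        rates.append((pred_ious[mask] > iou_thresh).mean() if mask.any() else float("nan"))
    return rates
```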
-
Deep analysis of success/failure cases: To deeply explore the success/failure cases of our MPT, we evaluate the proportion (Rate) of these two conditions on LaSOTEXT, along with the corresponding average trajectory IoU (Avg_IoU_traj) and average IoU of the previous tracking result (Avg_IoU_pre).
Here, we define a success case when the predicted IoU of our MPT exceeds that of the visual baseline by more than 0.3. Conversely, when the predicted IoU of our MPT is lower than that of the visual baseline by more than 0.3, we define it as a failure case. As shown in the table below, success cases of our MPT usually occur when the historical trajectory IoU is around 0.72, while failure cases usually happen when the historical trajectory IoU is around 0.18. Additionally, our method is more likely to succeed when the tracking result of the previous frame is better, whereas poorer previous tracking results are more likely to lead to failure. This is mainly because the tracking result of the previous frame affects the current search area and the completeness of the target.
| |Rate|Avg_IoU_traj|Avg_IoU_pre|
|-|-|-|-|
|Success|2%|0.72|0.76|
|Failure|1%|0.18|0.1|

Moreover, we separately evaluate the distribution of trajectory IoU and previous-frame IoU in both success and failure cases, shown in the following two tables. In success cases, the input trajectory IoU primarily falls within a relatively broad range of [0.6, 1.0), especially in [0.8, 0.9). In failure cases, the input trajectory IoU mainly falls within the range of [0.0, 0.2), particularly in [0.0, 0.1). This indicates that addressing misleading tracking caused by extremely poor-quality trajectories remains a challenge for our method; this limitation requires future research.

|Distribution_succ|[0.0,0.1)|[0.1,0.2)|[0.2,0.3)|[0.3,0.4)|[0.4,0.5)|[0.5,0.6)|[0.6,0.7)|[0.7,0.8)|[0.8,0.9)|[0.9,1.0)|
|-|-|-|-|-|-|-|-|-|-|-|
|IoU_traj|1%|2%|3%|4%|6%|6%|10%|20%|34%|15%|
|IoU_pre|2%|0%|1%|3%|4%|6%|9%|16%|37%|23%|

|Distribution_fail|[0.0,0.1)|[0.1,0.2)|[0.2,0.3)|[0.3,0.4)|[0.4,0.5)|[0.5,0.6)|[0.6,0.7)|[0.7,0.8)|[0.8,0.9)|[0.9,1.0)|
|-|-|-|-|-|-|-|-|-|-|-|
|IoU_traj|55%|12%|8%|6%|6%|5%|4%|3%|1%|0%|
|IoU_pre|74%|11%|4%|2%|2%|2%|1%|1%|1%|1%|

Thanks for your valuable advice; we will add the above analysis to our paper.
-
Extension to MOT: Different from single-object tracking (SOT), typical multi-object tracking (MOT) frameworks involve object detection and association. Therefore, integrating our MPT into an MOT framework will differ from the SOT setting. However, the core idea of our method remains highly compatible with the MOT problem. Specifically, the historical trajectory of each object can be utilized as a prompt to enhance MOT performance. On the one hand, our MPT can be integrated into the object detection process to enable motion-aware detection, thereby improving detection accuracy. On the other hand, during the association process, our MPT can help maintain the coherence of each object trajectory, achieving more accurate matching.
In addition, compared with SOT, MOT usually encounters more complex backgrounds with a higher number of distractors. As an advantage of our MPT, introducing motion prompts can potentially help address distractor challenges. However, complex scenes are more prone to generating noisy trajectories, which may also pose a risk of misleading tracking issues.
Thanks for your valuable advice; we will add the above discussion to our paper.
This paper presents a flexible and efficient motion prompt tracking method, which complements existing trackers with motion prompts. It includes a motion encoder with three different positional encodings, a Transformer-based fusion decoder, and an adaptive weight mechanism. The tracker is evaluated on several benchmarks and achieves encouraging results.
Questions for Authors
Please refer to the above suggestion.
Claims and Evidence
Comprehensive analysis and support are provided concerning the significance of motion patterns and the heavy computational resources required by trackers relying on sequential training. The MPT module exhibits outstanding performance on certain datasets, such as the VOT datasets, while its improvements on other datasets are less pronounced.
Methods and Evaluation Criteria
The proposed MPT method makes sense as it addresses the challenges of incorporating temporal coherence in tracking. Its flexible integration and training-efficient design fit the problem well.
Theoretical Claims
The MPT module can be flexibly integrated into existing vision-based trackers and enhances various baseline trackers, which is demonstrated through a description of its internal components and their interactions. For example, with lightweight frame-level fine-tuning, MPT achieves performance comparable to sequential-training SOTA methods. Although the text provides an overview of the design and anticipated functions of MPT, for trackers such as ARTrack, which already incorporate temporal information, the improvements from MPT turn out to be fairly limited. Likewise, its gains on datasets such as LaSOT and TrackingNet are rather modest as well.
Experimental Design and Analysis
- The efficiency comparison is insufficient. For example, the increase in FLOPs ought to be provided.
- The authors are encouraged to provide results of fully fine-tuning the head (or using LoRA) instead of training only an adaptive weight mechanism, to support the claims of effectiveness and efficiency.
Supplementary Material
Yes. Training details and more analysis.
Relation to Prior Literature
This paper contributes to the broader literature by refining motion-prompt-based tracking through improved encoding, Transformer-based fusion, and adaptive weighting. It aligns with prior research on Transformers in tracking but introduces motion prompt mechanisms to enhance efficiency and flexibility.
Missing Important References
No
Other Strengths and Weaknesses
Strengths: The paper introduces a motion encoder with three different positional encodings, which appears to be an effective motion-tracking approach.
Weaknesses:
1. Include experimental results comparing with LoRAT and other recent trackers to ensure fair benchmarking.
2. Provide efficiency metrics (e.g., FLOPs, latency, memory footprint) to justify the claim of being "efficient."
Other Comments or Suggestions
More comparisons with PEFT-based tracking methods, such as LoRAT, are suggested.
We thank the reviewer for the constructive feedback, and address the concerns below.
-
More efficiency metrics: We have provided several efficiency metrics for both the baselines and our methods in Table 3 (Page 7 in the submitted paper), including memory footprint (Params and Mem), latency (FPS), and training time (Time_tr). Thanks for your advice; we will add the FLOPs metric to our paper to further demonstrate the efficiency, as shown in the table below.

| |OSTrack|+MPT(Ours)|ARTrack|+MPT(Ours)|
|-|-|-|-|-|
|FLOPs (G)|29.1|33.7|40.3|41.9|
-
Impact of fine-tuning strategies: Following your advice, we supplement the ablation study on fine-tuning strategies. Our method freezes all of the baseline parameters and only fine-tunes the parameters of MPT during training. Here, we also evaluate two variants: FinetuneAll fine-tunes both the MPT and baseline parameters, while FinetuneHead fine-tunes the parameters of MPT and the tracking head of the baseline, i.e., only the backbone of the baseline model is frozen. As shown in the table below, our method slightly outperforms the other two variants, and is more efficient in terms of memory footprint and training time. The potential reason is that freezing the baseline parameters enables the model to focus on learning motion cues and the fusion mechanism.

| |OSTrack-B256|+MPT(Ours)|FinetuneAll|FinetuneHead|
|-|-|-|-|-|
|VOT22(EAO)|0.530|0.572|0.567|0.548|
|LaSOTEXT(AUC)|46.9|48.7|48.3|48.1|
|Mem(GB)|35.0|6.1|37.5|7.7|
|Training Time(h)|38|5|8|6|
Thanks for your valuable advice; we will add the above experimental analysis to our paper.
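A hedged sketch of how the three training configurations compared above could be set up in PyTorch; module names such as `baseline.backbone`, `baseline.head`, and `mpt`, as well as the learning rate, are placeholders rather than the authors' actual implementation.

```python
import torch

def configure_trainable(baseline, mpt, strategy="prompt"):
    """Freeze/unfreeze parameters for the three fine-tuning strategies."""
    for p in baseline.parameters():
        p.requires_grad = False                     # default: frozen baseline (prompt learning)
    for p in mpt.parameters():
        p.requires_grad = True                      # MPT is always trained
    if strategy == "finetune_all":                  # FinetuneAll: baseline also trained
        for p in baseline.parameters():
            p.requires_grad = True
    elif strategy == "finetune_head":               # FinetuneHead: only the tracking head unfrozen
        for p in baseline.head.parameters():
            p.requires_grad = True
    trainable = [p for m in (baseline, mpt) for p in m.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4)    # learning rate is illustrative
```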
-
Comparison with PEFT-based tracking methods, like LoRAT: PEFT methods primarily include adapter-based and prompt learning-based approaches, both of which demonstrate great efficiency in various downstream tasks. LoRAT successfully applies LoRA, a typical adapter-based PEFT method, to the tracking task. The main contribution of LoRAT lies in designing a LoRA-friendly visual tracking model to achieve efficient training and inference. In contrast, our work aims to build a joint tracking framework that incorporates both visual and motion cues, leveraging the temporal consistency of videos and compensating for the limitations of visual cues. Additional motion inputs are naturally compatible with prompt learning-based PEFT, which is why we choose prompt learning to ensure efficiency.
Furthermore, we integrate our MPT into LoRAT, as shown in the table below. Results demonstrate that our MPT improves LoRAT by an average performance of 1.5% across the two datasets, showing good complementarity of our method with such PEFT-based tracking methods.
| |LoRAT|+MPT(Ours)|
|-|-|-|
|VOT22(EAO)|0.541|0.558|
|LaSOTEXT(AUC)|49.9|51.2|

Thanks for your advice; we will add the above comparison to our paper.
-
Comparison with recent trackers: Since our method is a plug-and-play module and can be flexibly integrated into various visual trackers to achieve performance complementarity, especially in challenging scenarios, the relative performance improvement over baseline methods fairly demonstrates the effectiveness of our method. Thank you for your advice; we will add the above experimental comparison with LoRAT to our paper to showcase the effectiveness of our method on recent trackers.
-
Minor performance gains for ARTrack and on some datasets: Compared with ARTrack, we provide new insights into efficient temporal tracking. In addition, the apparent performance bias is primarily due to the varying distributions of easy and hard sequences across different datasets, which is further clarified through the quantitative analysis of performance bias in Table 6 (Page 13 of the submitted paper). Our method consistently exhibits robustness advantages in challenging scenarios, demonstrating its prospects for real-world tracking applications.
Thank you for your rebuttal. Some questions were well resolved. However, could there be some errors in the FLOPs calculation? Please refer to the FLOPs comparison published by the ARTrack authors [1]. I also share the same concern as Reviewer 7U4B regarding the performance of appearance-based prompt trackers.
[1] https://github.com/MIV-XJTU/ARTrack/issues/80
-
FLOPs calculation: We greatly appreciate your pointing out the discrepancy in the FLOPs calculation. Since ARTrack does not provide an official implementation for FLOPs evaluation, our previous evaluation was based on the official implementation of OSTrack, which adopted the profile() function from the thop library. To obtain more accurate FLOPs calculations, we carefully use the FlopCountAnalysis() function from the fvcore library to recompute FLOPs. The results are shown in the table below. The updated FLOPs show only a slight difference compared to those provided by ARTrack (55.2 vs. 57.6). Nonetheless, we believe that the efficiency comparison, based on the same FLOPs calculation method, remains fair and credible. Our method introduces only a small increase in FLOPs compared to the baselines. We will include the new FLOPs results and the detailed calculation method in our paper, and will also try to contact the authors of ARTrack to inquire about their FLOPs calculation for further verification.

| |OSTrack|+MPT(Ours)|ARTrack|+MPT(Ours)|
|-|-|-|-|-|
|FLOPs (G)|31.2|35.9|55.2|56.8|
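For reference, the two counting routes mentioned above differ roughly as follows; the model construction and input sizes here are placeholders, and both libraries can disagree on operators they do not support.

```python
import torch
from thop import profile                       # route 1: thop, as in OSTrack's script
from fvcore.nn import FlopCountAnalysis        # route 2: fvcore, used for the updated numbers

model = build_tracker()                        # placeholder: the tracker under test
template = torch.randn(1, 3, 128, 128)         # illustrative input sizes
search = torch.randn(1, 3, 256, 256)

macs, params = profile(model, inputs=(template, search))
print(f"thop:   {macs / 1e9:.1f} G (MAC-based), {params / 1e6:.1f} M params")

flops = FlopCountAnalysis(model, (template, search))
print(f"fvcore: {flops.total() / 1e9:.1f} G")
```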
-
Performance of appearance-based prompt trackers: To the best of our knowledge, the concept of appearance-based prompt tracking was first introduced in multi-modal trackers [1,2], which use visual prompts from other modalities to enhance tracking performance. In contrast, for the RGB tracking task, all visual trackers naturally utilize appearance cues for tracking (i.e., visual information provided by the initial template). Beyond that, dynamically updated templates align better with the concept of appearance prompts. PromptVT [3], mentioned by Reviewer 7U4B, proposes a dynamic template prompter to learn appearance changes between initial and dynamic templates, which shares a very similar appearance-based prompt idea with visual trackers employing dynamic template mechanisms.
As we replied to Reviewer 7U4B, appearance-based prompt trackers (like PromptVT and those trackers with dynamic template mechanisms) enhance appearance features using dynamic appearance information, thereby enabling the model to be robust to appearance changes during tracking. However, the tracking problem is still treated as a discrete image processing task. In contrast, our MPT leverages the temporal consistency of videos by incorporating continuous object trajectories into the visual tracking framework. Our method proves to be more effective in addressing visually challenging scenarios, such as occlusions and distractors. Therefore, these two kinds of methods are complementary: while appearance-based prompt methods focus on enhancing visual features, our MPT can integrate historical motion cues into this visual framework, achieving a joint enhancement of both visual and motion information.
Our experiments integrating our MPT into SeqTrack [4], a typical visual tracker with a dynamic template mechanism, demonstrate the performance improvement brought by our MPT within such appearance-based prompt trackers. As shown in the table below, our MPT improves the two SeqTrack models by an average of 1.2% AUC on LaSOTEXT, 1.1% AUC on LaSOT, and 0.8% EAO on VOT22. More results can be found in Table 1 and Table 2 (Pages 6-7 in the submitted paper).
| |SeqTrack-B256|+MPT(Ours)|SeqTrack-L384|+MPT(Ours)|
|-|-|-|-|-|
|VOT22(EAO)|0.523|0.528|0.568|0.579|
|LaSOTEXT(AUC)|49.8|50.8|50.4|51.7|
|LaSOT(AUC)|69.4|70.1|72.5|73.9|
[1] Prompting for multi-modal tracking, ACMMM2022.
[2] Visual Prompt Multi-Modal Tracking, CVPR2023.
[3] PromptVT: Prompting for efficient and accurate visual tracking, IEEE TCSVT2024.
[4] SeqTrack: Sequence to sequence learning for visual object tracking, CVPR2023.
This paper proposes an Efficient Motion Prompt Learning (EMPL) framework for robust visual tracking. The key idea is to leverage historical motion trajectories as prompts that are encoded via three components: Spatial Encoding (SPE), Point Embedding, and Temporal Positional Encoding (TPE).
update after rebuttal
I have reviewed the rebuttal and found that all my concerns were thoroughly addressed with clear analysis and additional experiments. I therefore maintain my original score of 4 (Accept).
Questions for Authors
- Have you considered how your motion prompt framework might be integrated with appearance-based prompt methods like PromptVT?
- Can you provide more insights or experiments on how sensitive your model is to errors in the initial detection of bounding box corners?
- While you have chosen a two-layer Transformer for the Fusion Decoder, did you perform any preliminary experiments with varying depths?
Claims and Evidence
The paper claims that incorporating motion prompts can significantly improve tracking robustness, particularly in the presence of occlusions and distractors. Experimental evidence on several benchmarks (VOT, LaSOT, etc.) supports this claim.
Methods and Evaluation Criteria
The proposed method employs SPE, Point Embedding, and TPE to map motion trajectories into a high-dimensional vector space aligned with visual features. The evaluation is based on established tracking benchmarks, and ablation studies are performed to validate the key design choices.
Theoretical Claims
The paper includes a theoretical rationale behind the non-linear initialization of the Temporal Positional Encoding (TPE) using Nyquist frequency principles. The provided intuition is compelling.
Experimental Design and Analysis
The experimental design is comprehensive, testing on multiple datasets and including ablation studies on individual components.
Supplementary Material
The supplementary material was reviewed, including demo videos and source code.
Relation to Prior Literature
The paper builds on established ideas in positional encoding and prompt learning, extending them to the domain of motion cues for visual tracking.
Missing Important References
Can you discuss PromptVT [A] in this paper?
[A] Zhang, Minghua, et al. "Promptvt: Prompting for efficient and accurate visual tracking." IEEE Transactions on Circuits and Systems for Video Technology (2024).
Other Strengths and Weaknesses
Strengths:
- Innovative use of motion trajectories as prompts, which is relatively unique in the tracking literature.
- The design is efficient, both in terms of computational cost and training resources.
Weaknesses:
- The method’s reliance on accurate spatial (SPE) and point (Point Embedding) encodings may be vulnerable to errors in detection; any mistakes could potentially lead to cascading errors.
- There is no direct comparison with appearance-based prompt methods (e.g., PromptVT), which would help contextualize the benefits of using motion cues exclusively.
- The Fusion Decoder is fixed at two layers without analysis on whether a deeper or alternative architecture could further improve performance.
Other Comments or Suggestions
N/A
We thank the reviewer for the positive feedback, and address the concerns below.
-
Appearance-based prompt trackers: Taking PromptVT as an example, appearance-based prompt methods efficiently enhance appearance features using dynamic appearance information, thereby enabling the model to be robust to appearance changes during tracking. However, the tracking problem is still treated as a discrete image processing task. In contrast, our MPT leverages the temporal consistency of videos by incorporating continuous object trajectories into the visual tracking framework. Our method proves to be more effective in addressing visually challenging scenarios, such as occlusions and distractors. These two kinds of methods are complementary: while appearance-based prompt methods focus on enhancing visual features, our MPT can integrate historical motion cues into this visual framework, achieving a joint enhancement of both visual and motion information.
Thanks for your valuable advice. We will add the above comparison and related references to our paper.
-
Sensitivity to the trajectory quality: The proposed adaptive weight mechanism in our method can mitigate the impact of noisy trajectories to some extent. As shown in Figure 3 (Page 8 in the submitted paper), the predicted weights align with the accuracy of the motion predictions, which implicitly reflects the quality of the input trajectories.

|Success Rate|[0.0,0.1)|[0.1,0.2)|[0.2,0.3)|[0.3,0.4)|[0.4,0.5)|[0.5,0.6)|[0.6,0.7)|[0.7,0.8)|[0.8,0.9)|[0.9,1.0)|
|-|-|-|-|-|-|-|-|-|-|-|
|OSTrack_MPT|1%|21%|26%|32%|42%|63%|82%|94%|99%|100%|
To further analyze the sensitivity of our MPT to varying noisy trajectories, we evaluate the success rate (IoU>0.5) under different quality trajectory inputs on LaSOTEXT, as shown in the table above. Here, [a,b) represents the range where the average IoU of the trajectory is no less than a and less than b. The following conclusions can be drawn:
- The success rate increases significantly with the improvement of trajectory quality, especially when the trajectory IoU is in the range of [0.6, 1.0).
- Our method shows a certain level of robustness to noisy trajectory inputs, obtaining a success rate of 21% even when the trajectory IoU is in the range of [0.1,0.2).
Due to the character limit, please find the additional in-depth analysis of success/failure cases in our response to Reviewer EZ8N (2).
-
Architecture of the fusion decoder: Following your advice, we supplement two sets of ablation studies on the architecture of the fusion decoder, as shown in the table below.

| |OSTrack-B256|+MPT(Ours)|+OnewayCA|D3|D4|
|-|-|-|-|-|-|
|VOT22(EAO)|0.530|0.572|0.554|0.555|0.555|
|LaSOTEXT(AUC)|46.9|48.7|43.8|47.7|49.2|
-
First, our method employs bidirectional Cross-Attention (CA) for better alignment and fusion. Compared with one-way CA, our method obtains an average performance improvement of 3.4% across the two datasets, which demonstrates the effectiveness of our fusion decoder architecture.
-
Second, our method implements a two-layer fusion decoder, improving the baseline by an average performance of 3.0% across the two datasets. Deeper variants (D3/D4) exhibit slightly inferior or comparable performance, with average improvements of 1.7% and 2.4%, respectively. Therefore, the two-layer structure excels in both performance and efficiency. Besides reduced efficiency, there are two potential risks for a deeper fusion decoder. First, since the number of prompt tokens is much smaller than that of visual tokens, a deeper fusion process may lead to the dilution and loss of motion prompts, thereby weakening the effectiveness of the motion information. Second, deeper decoders introduce more trainable parameters, increasing the risk of overfitting.
Thanks for your valuable advice. We will add the above experimental analysis to our paper.
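To illustrate the distinction between the bidirectional and one-way cross-attention variants discussed above, a minimal sketch of one bidirectional fusion layer might look as follows; the shapes, names, and normalization layout are assumptions, not the paper's implementation.

```python
import torch.nn as nn

class BidirectionalCALayer(nn.Module):
    """One fusion layer with cross-attention in both directions:
    motion prompts attend to visual tokens, and visual tokens attend back."""

    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.m2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.v2m = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_m = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, motion, visual):
        # motion: (B, T, d) prompt tokens; visual: (B, N, d) backbone tokens
        m, _ = self.m2v(query=motion, key=visual, value=visual)   # motion <- vision
        motion = self.norm_m(motion + m)
        v, _ = self.v2m(query=visual, key=motion, value=motion)   # vision <- motion
        visual = self.norm_v(visual + v)
        return motion, visual        # a one-way variant would skip the second attention
```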
-
The paper proposes a lightweight, plug-and-play motion prompt tracking method to enhance existing vision-based trackers. This method integrates motion and vision cues for robust tracking through efficient prompt learning. Key components include a motion encoder with three positional encodings to encode long-term motion trajectories into visual embedding space, and a fusion decoder with an adaptive weight mechanism to dynamically fuse visual and motion features. Experiments on seven challenging tracking benchmarks show significant improvements in robustness with minimal training costs and negligible speed sacrifice.
The authors have successfully addressed most of the reviewers' concerns, resulting in an "accept," a "weak accept," and a "weak reject" rating for the paper. The main concern of the reviewer rating "weak reject" is computational load, which the authors have addressed in the rebuttal. After discussion with the Senior AC, we concur with the positive assessment and recommend acceptance of the paper. The authors should address the reviewers' comments in the final version.