General Compression Framework for Efficient Transformer Object Tracking
We propose a novel and general model compression framework for efficient transformer object tracking.
Abstract
Reviews and Discussion
The paper presents CompressTracker, a framework for general model compression designed to enhance the efficiency of transformer-based object-tracking models. This approach utilizes a unique stage partitioning strategy that divides the transformer layers of the teacher model into distinct stages, allowing the student model to simulate each corresponding stage better.
Furthermore, the authors introduced a replacement training technique, where specific stages in the student model are randomly replaced with those from the teacher model. This strategy, combined with predictive guidance and staged feature imitation, provides additional supervision to help the student model mimic the teacher model more effectively during the compression process.
Extensive experiments were conducted on ViT-based trackers, showing that the proposed method can make trackers substantially lighter while maintaining comparable performance.
Strengths
- The author clearly presents a framework for compressing single object tracking models, effectively reducing larger models to smaller, efficient versions. Extensive experiments demonstrate the method's effectiveness.
- One of the main benefits of the proposed framework is its structural agnosticism, meaning it can work with any transformer architecture. This adaptability allows CompressTracker to fit different student model configurations, making it suitable for various deployment environments and computational limits.
- The paper shows through extensive experiments that CompressTracker strikes an impressive balance between inference speed and tracking accuracy. It significantly speeds up the tracking process while preserving high performance, achieving nearly 96% of the original accuracy with a 2.17× increase in speed.
Weaknesses
- The primary drawback of this method lies in its dependence on various distillation techniques, such as different training strategies, feature mimicking, and loss guidance. This lack of a clear, consistent framework among these techniques may undermine the generalization ability and transferability of the proposed approach, despite the author's assertions to the contrary.
- Moreover, the overall complexity of the method raises concerns about its usability for other researchers. For instance, when applied to Mixformer V2, which has only two layers, the improvement in performance is minimal, while the processing speed remains unchanged. Such results indicate possible limitations of the method, as the intricate techniques lead to only marginal benefits.
- The proposed techniques (stage division, progressive replacement, Replacement Training, Prediction Guidance, and Stage-wise Feature Mimicking) appear to be independent. The title "General Framework" raises my expectations significantly.
Questions
See weakness. Moreover, the paper does not compare with other model compression techniques, such as knowledge distillation, model quantization, and pruning. It helps if you provide some comparisons or analysis.
Thank you for recognizing the efficiency and value of our work. We sincerely appreciate your insightful feedback and thoughtful suggestions. Your support and recognition mean a great deal to us, and we deeply value your encouragement for our research!
Q1: Inherent Consistency
Thank you for your valuable comments. Our CompressTracker is a unified and generalized framework designed to accommodate any transformer tracking structure. Reviewers hzRe and ejyx acknowledged the novelty, simplicity, and strong generalization of our CompressTracker. We would like to clarify how our contributions are intrinsically connected and collectively contribute to the cohesive functioning of our framework.
Motivation of stage division strategy Current dominant trackers are one-stream models that utilize sequential transformer encoder layers to iteratively refine temporal features across frames. This layer-wise refinement process naturally lends itself to viewing the model as a series of interconnected stages, where each stage plays a distinct role in feature extraction and temporal alignment. Building on this insight, we propose the stage division strategy, which segments the teacher model into distinct stages corresponding to the layers of the student model. This approach enables each stage of the student model to learn and replicate the functionality of its corresponding stage in the teacher model, fostering more effective and targeted knowledge transfer.
Motivation of replacement training Building on this foundation, we propose a replacement training methodology centered on the dynamic substitution of stages during training. This innovation is made possible by our stage division strategy, which decouples the teacher model into distinct, independent stages. In contrast, previous methods tightly couple layers, making replacement training impractical or potentially confusing due to the strong interdependence between stages in the student model. By decoupling these stages, our approach enables effective replacement training, resulting in improved accuracy and more efficient knowledge transfer.
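For concreteness, a minimal PyTorch-style sketch of how stage division and replacement training could be wired together is given below, assuming a 12-layer teacher compressed into a 4-layer student. The helper names (`split_into_stages`, `forward_with_replacement`) and the way the substitution probability is drawn are illustrative simplifications relative to the pseudo code in Appendix Algorithms 1-3, not the exact implementation.

```python
import random
import torch.nn as nn

def split_into_stages(layers, num_stages):
    """Evenly partition a list of transformer encoder layers into contiguous stages."""
    per_stage = len(layers) // num_stages
    return [nn.Sequential(*layers[i * per_stage:(i + 1) * per_stage])
            for i in range(num_stages)]

def forward_with_replacement(x, student_stages, teacher_stages, replace_prob):
    """One forward pass in which each student stage is swapped for the corresponding
    (frozen) teacher stage with probability `replace_prob`.

    Teacher parameters are assumed to be frozen (requires_grad=False), so gradients
    still flow through a substituted teacher stage to earlier student stages, while
    the teacher itself is never updated.
    """
    for s_stage, t_stage in zip(student_stages, teacher_stages):
        stage = t_stage if random.random() < replace_prob else s_stage
        x = stage(x)
    return x
```

In practice the teacher would be frozen once before training, e.g. `for p in teacher.parameters(): p.requires_grad_(False)`, so a substituted teacher stage contributes gradients to earlier student stages without being updated itself.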
Motivation of prediction guidance and stage-wise feature mimicking After that, to accelerate convergence, we introduce prediction guidance using the teacher's predictions as supervision. Additionally, our stage-wise feature mimicking strategy aligns the feature representations at each stage of the student model with those of the teacher, ensuring more accurate and consistent learning throughout the training process.
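A minimal sketch of how these two supervision signals could be combined into one training loss is shown below; the specific loss functions (MSE for stage features, L1 for box predictions) and the weighting factors are illustrative simplifications rather than the exact losses used in the paper.

```python
import torch.nn.functional as F

def compression_losses(student_stage_feats, teacher_stage_feats,
                       student_pred, teacher_pred, gt_boxes,
                       lambda_pred=1.0, lambda_feat=1.0):
    """Combine prediction guidance with stage-wise feature mimicking.

    student_stage_feats / teacher_stage_feats: lists of feature maps taken at the
    end of each stage; student_pred / teacher_pred / gt_boxes: box outputs and
    ground truth (the box format and losses are simplified for illustration).
    """
    # Stage-wise feature mimicking: align every stage boundary, not only the last layer.
    feat_loss = sum(F.mse_loss(s, t.detach())
                    for s, t in zip(student_stage_feats, teacher_stage_feats))

    # Prediction guidance: the frozen teacher's prediction acts as an extra soft
    # target on top of the ordinary supervised loss against the ground truth.
    guide_loss = F.l1_loss(student_pred, teacher_pred.detach())
    gt_loss = F.l1_loss(student_pred, gt_boxes)

    return gt_loss + lambda_pred * guide_loss + lambda_feat * feat_loss
```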
Inner Connection Our contributions are sequential and deeply interconnected. We begin with the stage division strategy, which lays the foundation that makes replacement training possible. Building on this, we introduce prediction guidance and the stage-wise feature mimicking strategy, further enhancing the student's ability to learn from the teacher. Each contribution lays the groundwork for the next, resulting in a strong, inherent consistency throughout our approach.
Simplicity Our framework is quite simple and requires minimal code modifications. We provide pseudo code in Appendix Algorithms 1, 2, and 3. Reviewers ejyx and hzRe acknowledge this simplicity. We believe that other researchers will be able to reproduce our method quickly and easily. Additionally, CompressTracker requires only a straightforward, end-to-end, efficient training process, unlike the complex, multi-stage training processes seen in previous works such as MixFormerV2. This simplicity further highlights the strong transferability of our framework. We hope this clarifies that, while effective, our framework remains user-friendly and highly accessible to other researchers.
In conclusion, our contributions are inherently consistent, with each building upon the previous one. Extensive experiments verify the effectiveness and generalization ability of our framework, and all other reviewers acknowledged its novelty, simplicity, and generalization. We hope this statement addresses your concerns about intrinsic continuity, and that our extensive experiments demonstrate the strong generalization and ease of use of our work. We will revise our manuscript to clarify the inner connection of our contributions. We sincerely hope you can reconsider our work; we would appreciate it very much.
Q2: MixFormerV2-S
We have clarified the simplicity and generalization of our CompressTracker in the response to Q1. Indeed, MixFormerV2-S consists of 4 transformer layers. For more details, please refer to Lines 402-419. Following MixFormerV2-S, we use MixFormerV2-B as the teacher and compress it into a student model with 4 layers. Our CompressTracker-M-S shares the same structure and MLP channel dimensions as MixFormerV2-S, and outperforms MixFormerV2-S by about 1.4% AUC on LaSOT, as shown in Table 2. As highlighted in MixFormerV2-S, a reduced feature dimension can lead to decreased accuracy. The feature dimension of MixFormerV2-S and CompressTracker-M-S is lower than that of the teacher model, yet CompressTracker-M-S still delivers better performance than MixFormerV2-S. Furthermore, since CompressTracker-M-S shares the same structure as MixFormerV2-S, the inference speed remains identical. Additionally, MixFormerV2-S requires a complex, multi-stage training process taking 120 hours, while our CompressTracker only requires a simple, end-to-end training procedure.
We will revise the paper to provide further clarification on the simplicity and generalization of our framework and will include additional comparisons with MixFormerV2-S in the final version to further emphasize the effectiveness and simplicity of our CompressTracker.
Q3: Generalization
Thanks to our stage division strategy, our framework exhibits strong generalization capabilities, providing flexibility in designing the student model and supporting any transformer architecture. This level of flexibility is unique to our approach and cannot be achieved by previous methods, which lack the stage division strategy.
We conduct extensive experiments to validate the effectiveness and generalization ability of our framework (Tables 1, 2, 3). Besides, we have also added additional experiments, and the results can be summarized as follows:
| # | Model | LaSOT | LaSOT_ext | TNL2K | TrackingNet | UAV123 | FPS |
|---|---|---|---|---|---|---|---|
| | **Model Generalization** | | | | | | |
| 1 | CompressTracker-4 | 66.1 (96%) | 45.7 (96%) | 53.6 (99%) | 82.1 (99%) | 67.4 (99%) | 228 (2.17×) |
| 2 | CompressTracker-4-ODTrack | 70.5 (96%) | 50.9 (97%) | 60.4 (99%) | 82.8 (97%) | 69.2 (98%) | 86.5 (1.74×) |
| 3 | CompressTracker-4-SeqTrack | 68.1 (95%) | 47.9 (96%) | 54.5 (99%) | 83.1 (98%) | 68.4 (98%) | 62.1 (1.36×) |
| | **Stage Scalability** | | | | | | |
| 4 | CompressTracker-2 | 60.4 (87%) | 40.4 (85%) | 48.5 (89%) | 78.2 (94%) | 62.5 (92%) | 346 (3.30×) |
| 5 | CompressTracker-3 | 64.9 (94%) | 44.6 (94%) | 52.6 (97%) | 81.6 (98%) | 65.4 (96%) | 267 (2.54×) |
| 6 | CompressTracker-4 | 66.1 (96%) | 45.7 (96%) | 53.6 (99%) | 82.1 (99%) | 67.4 (99%) | 228 (2.17×) |
| 7 | CompressTracker-6 | 67.5 (98%) | 46.7 (99%) | 54.7 (101%) | 82.9 (99%) | 67.9 (99%) | 162 (1.54×) |
| 8 | CompressTracker-8 | 68.4 (99%) | 47.2 (99%) | 55.2 (102%) | 83.3 (101%) | 68.2 (99%) | 127 (1.21×) |
| | **Larger Transformer Scalability** | | | | | | |
| 9 | CompressTracker-4-L | 67.5 (96%) | 45.9 (98%) | 58.3 (98%) | 83.2 (99%) | 67.4 (99%) | 228 (2.84×) |
| | **Higher Resolution Scalability** | | | | | | |
| 10 | CompressTracker-4-384 | 67.7 (96%) | 48.1 (96%) | 54.3 (99%) | 82.7 (99%) | 68.2 (98%) | 228 (3.90×) |
| | **Heterogeneous Structure Robustness** | | | | | | |
| 11 | CompressTracker-M-S | 62.0 (88%) | 44.5 (88%) | 50.2 (87%) | 77.7 (93%) | 66.9 (96%) | 325 (1.97×) |
| 12 | CompressTracker-SMAT | 62.8 (91%) | 43.4 (92%) | 49.6 (91%) | 79.7 (96%) | 65.9 (96%) | 138 (1.31×) |
We compare the AUC scores and performance gaps relative to the original models across several benchmarks at a resolution of 256. Our experiments involve six teacher models (OSTrack, OSTrack-384, OSTrack-L, ODTrack, MixFormerV2, SeqTrack) and 11 student models. We train OSTrack with ViT-L as the backbone ourselves. We would like to emphasize that CompressTracker is designed as a scalable framework, capable of adapting to variations in image resolution (e.g., #4-8, 10), teacher model size (#4-8, 9), and student model size (#4-8). It demonstrates strong generalization across different teacher models (#1-3, 11) and exhibits structural robustness when applied to various student model architectures (#11, 12). These extensive experiments demonstrate the strong generalization, scalability, and robustness of CompressTracker. It supports any transformer structure and layer count in the student model, any input resolution, and any teacher model, achieving general compression. We will include additional experiments on more tracking models in the revised manuscript to verify the generalization and scalability of our CompressTracker.
Q4: Other Model Compression Techniques
We have compared CompressTracker with several model compression techniques in our paper. As shown in Table 13 and Figure 3, our CompressTracker outperforms knowledge distillation ('Distill Training' in Figure 3) and achieves a 5.5% AUC improvement over MixFormerV2-S on LaSOT, even though MixFormerV2-S utilizes pruning for speedup (Table 4).
| # | Model | LaSOT | FPS |
|---|---|---|---|
| 1 | CompressTracker-4 | 66.1 | 228 |
| 2 | Distillation (4 layers in Figure 3) | 63.8 | 228 |
| 3 | Pruning (MixFormerV2-S) | 60.6 | 325 |
We appreciate your suggestion and will add experiments comparing other model compression techniques, such as model quantization, in the final version of the manuscript.
Thank you for the time and effort you have dedicated to reviewing our work. We sincerely appreciate your valuable feedback and the recognition of the contributions of our CompressTracker. Based on your insightful suggestions, we have revised the manuscript in Appendix and updated the PDF accordingly. Specifically, we have added a summary of experiments evaluating the generalization capability of CompressTracker, as well as comparisons of its performance on CPU and with other model compression methods. Additionally, we have carefully reviewed the manuscript and corrected any typographical errors. Please refer to the latest version of the PDF for these updates.
Once again, we would like to express our gratitude for your hard work and thoughtful suggestions, and we are deeply appreciative of your continued support for our work.
Thanks for the answer. I maintain the initial rating as positive. I would let the AC make the final decision.
We are glad that our response has addressed your concerns. We are grateful for your recognition of the novelty and efficiency of our work. Thank you very much for your support of our work.
Dear Reviewer:
Sorry to bother you. Thank you for your follow-up and for confirming that our rebuttal has addressed your concerns. We are delighted that we could resolve the issues you raised. If possible, we would be sincerely grateful if you could consider raising your score. Your recognition and support mean a great deal to us, and we deeply appreciate the effort you have put into reviewing our work.
Thank you again for your time and effort.
Best regards
This paper introduces a general model compression framework based on the teacher-student knowledge distillation method for efficient transformer object tracking, named CompressTracker, which designs a stage division strategy and a replacement training technique.
Strengths
The motivation of this paper is clear, and it shows a degree of innovation.
Weaknesses
Only two classical models are used to verify the effectiveness of the proposed CompressTracker, which is not enough to demonstrate its applicability and scalability.
Questions
- Only OSTrack and MixFormer are used to verify the effectiveness of the proposed CompressTracker. For a general model compression framework, this is far from enough.
- This paper proposes a novel stage division strategy. To demonstrate its effectiveness, the corresponding ablation results are shown in Table 8. Table 8 shows that the even dividing strategy is better than the uneven dividing strategy, but it does not show that a model with the stage division strategy outperforms one without it.
- What is the difference between the ablation experiments in Tables 9 and 10?
- Does CompressTracker only apply to models composed of several identical feature extractor modules? If the structures of different stages differ, does it still work?
- The format of Table 5 is non-standard. A table should not contain other tables.
- There is a spelling mistake, such as "Stucture Limitation" on Page 1.
Q3: Tables 9 and 10
The two tables examine distinct aspects of our approach. Table 9 evaluates the overall effectiveness of our replacement training. For detailed descriptions of Table 9, please refer to Lines 494–504. In #1 (Random), we implement the replacement training and progressive replacement. In # 2 (Decouple-300), we apply decoupled training, sequentially training and freezing each stage for 75 epochs, followed by 30 epochs of fine-tuning. The 'Decouple-300' approach (# 2) involves a complex, multi-stage training pipeline along with supplementary fine-tuning. By comparison, our CompressTracker achieves superior performance with a straightforward, end-to-end, single-step training process, demonstrating the effectiveness of our replacement training strategy.
Table 10 investigates the impact of the progressive replacement strategy specifically. For more details, please refer to Lines 505–510. In the first row (#1), we apply the progressive replacement strategy as used in CompressTracker. In the second row (#2), we fix the sampling probability at 0.5 and train the student model for 300 epochs, followed by 30 fine-tuning epochs. Removing the progressive replacement strategy results in a 0.4% drop in AUC performance, highlighting the importance and effectiveness of this approach.
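For illustration, the sketch below shows one way the sampling probability could be scheduled under progressive replacement, in contrast to keeping it fixed at 0.5 throughout training; the linear decay, the function name, and the default epoch counts are assumptions chosen only to make the comparison concrete, not the exact schedule from the paper.

```python
def teacher_substitution_prob(epoch, total_epochs=300, finetune_epochs=30):
    """Illustrative progressive-replacement schedule.

    Early epochs frequently substitute frozen teacher stages into the student's
    forward pass; the probability then decays so that the final `finetune_epochs`
    run the student entirely on its own.
    """
    replace_epochs = total_epochs - finetune_epochs
    if epoch >= replace_epochs:
        return 0.0                              # pure student fine-tuning
    return 1.0 - epoch / replace_epochs         # decays linearly from 1.0 towards 0.0
```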
In summary, Table 9 examines the overall advantages of replacement training, while Table 10 focuses on the effects of progressive replacement. We appreciate your valuable feedback and will revise the manuscript to further clarify the distinct purposes of the two tables.
Q4: Different Feature Extractor Modules
Yes, our CompressTracker is compatible with a variety of feature extractor modules. As shown in rows #11 and #12 of the table in the response to Q1, in the following table, and in Tables 2 and 3 of our paper, our CompressTracker consistently achieves competitive performance across different setups.
| # | Model | LaSOT | LaSOT_ext | TNL2K | TrackingNet | UAV123 | FPS |
|---|---|---|---|---|---|---|---|
| 11 | CompressTracker-M-S | 62.0 (88%) | 44.5 (88%) | 50.2 (87%) | 77.7 (93%) | 66.9 (96%) | 325 (1.97×) |
| 12 | CompressTracker-SMAT | 62.8 (91%) | 43.4 (92%) | 49.6 (91%) | 79.7 (96%) | 65.9 (96%) | 138 (1.31×) |
In #11, we compress MixFormerV2-B into a student model, CompressTracker-M-S, with 4 layers, which has a reduced hidden feature dimension in the MLP layer compared to the teacher model. The model structure and feature dimension align with those of MixFormerV2-S, but the feature extractor modules of CompressTracker-M-S and MixFormerV2-S differ from those of the teacher model MixFormerV2-B. As highlighted in MixFormerV2-S, a reduced feature dimension can lead to decreased accuracy, yet our CompressTracker-M-S still outperforms MixFormerV2-S. Besides, our CompressTracker-M-S only requires simple, end-to-end training, while MixFormerV2-S relies on a complex, multi-stage training process.
In # 12, we compress OSTrack into a student model CompressTracker-SMAT, which matches the number and structure of transformer layers in SMAT. SMAT replaces the vanilla attention mechanism in its transformer layers with separated attention, resulting in feature extractor modules in SMAT and CompressTracker-SMAT that differ from the teacher model OSTrack. Notably, CompressTracker-SMAT outperforms SMAT.
We have conducted experiments on two different teacher models and two different student models to verify the generalization of CompressTracker across different feature extractor modules. We will further clarify the generalization of our CompressTracker in the revised manuscript.
Q5: Table and Typo
Thanks for your valuable advice. We will remove Table 5 as recommended and carefully review and revise our manuscript to enhance its clarity and presentation based on your feedback.
Thank you for acknowledging the efficiency and novelty of our work. We deeply appreciate your insightful feedback and constructive comments. Based on your suggestions, we will carefully review and revise our manuscript to further enhance its clarity and presentation. We sincerely hope you will reconsider your evaluation, and we would be truly grateful for your support.
Q1: Generalization
Thank you for your valuable feedback. In response to your comments, we have conducted additional experiments to further demonstrate the generalizability of our CompressTracker. Our CompressTracker supports any transformer structure and layer count in the student model, any input resolution, and any teacher model. To showcase its versatility, we have added experiments on compressing the ODTrack and SeqTrack models. The results are presented below:
| # | Model | LaSOT | LaSOT_ext | TNL2K | TrackingNet | UAV123 | FPS |
|---|---|---|---|---|---|---|---|
| | **Model Generalization** | | | | | | |
| 1 | CompressTracker-4 | 66.1 (96%) | 45.7 (96%) | 53.6 (99%) | 82.1 (99%) | 67.4 (99%) | 228 (2.17×) |
| 2 | CompressTracker-4-ODTrack | 70.5 (96%) | 50.9 (97%) | 60.4 (99%) | 82.8 (97%) | 69.2 (98%) | 86.5 (1.74×) |
| 3 | CompressTracker-4-SeqTrack | 68.1 (95%) | 47.9 (96%) | 54.5 (99%) | 83.1 (98%) | 68.4 (98%) | 62.1 (1.36×) |
| | **Stage Scalability** | | | | | | |
| 4 | CompressTracker-2 | 60.4 (87%) | 40.4 (85%) | 48.5 (89%) | 78.2 (94%) | 62.5 (92%) | 346 (3.30×) |
| 5 | CompressTracker-3 | 64.9 (94%) | 44.6 (94%) | 52.6 (97%) | 81.6 (98%) | 65.4 (96%) | 267 (2.54×) |
| 6 | CompressTracker-4 | 66.1 (96%) | 45.7 (96%) | 53.6 (99%) | 82.1 (99%) | 67.4 (99%) | 228 (2.17×) |
| 7 | CompressTracker-6 | 67.5 (98%) | 46.7 (99%) | 54.7 (101%) | 82.9 (99%) | 67.9 (99%) | 162 (1.54×) |
| 8 | CompressTracker-8 | 68.4 (99%) | 47.2 (99%) | 55.2 (102%) | 83.3 (101%) | 68.2 (99%) | 127 (1.21×) |
| | **Larger Transformer Scalability** | | | | | | |
| 9 | CompressTracker-4-L | 67.5 (96%) | 45.9 (98%) | 58.3 (98%) | 83.2 (99%) | 67.4 (99%) | 228 (2.84×) |
| | **Higher Resolution Scalability** | | | | | | |
| 10 | CompressTracker-4-384 | 67.7 (96%) | 48.1 (96%) | 54.3 (99%) | 82.7 (99%) | 68.2 (98%) | 228 (3.90×) |
| | **Heterogeneous Structure Robustness** | | | | | | |
| 11 | CompressTracker-M-S | 62.0 (88%) | 44.5 (88%) | 50.2 (87%) | 77.7 (93%) | 66.9 (96%) | 325 (1.97×) |
| 12 | CompressTracker-SMAT | 62.8 (91%) | 43.4 (92%) | 49.6 (91%) | 79.7 (96%) | 65.9 (96%) | 138 (1.31×) |
We compare the AUC scores and performance gaps relative to the original models across several benchmarks at a resolution of 256. Our experiments involve six teacher models (OSTrack, OSTrack-384, OSTrack-L, ODTrack, MixFormerV2, SeqTrack) and 11 student models. We train OSTrack with ViT-L as the backbone ourselves. We would like to emphasize that CompressTracker is designed as a scalable framework, capable of adapting to variations in image resolution (e.g., #4-8, 10), teacher model size (#4-8, 9), and student model size (#4-8). It demonstrates strong generalization across different teacher models (#1-3, 11) and exhibits structural robustness when applied to various student model architectures (#11, 12). These extensive experiments demonstrate the strong generalization, scalability, and robustness of CompressTracker. It supports any transformer structure and layer count in the student model, any input resolution, and any teacher model, achieving general compression. We will include additional experiments on more tracking models in the revised manuscript to verify the generalization and scalability of our CompressTracker.
Q2: Stage Division
Thank you for your valuable feedback. We would like to further clarify the advantages of our stage division strategy. By introducing stage division, we decouple the model structure. One benefit of stage division is that it enables replacement training, which improves accuracy. Additionally, stage division allows us to apply feature supervision at each individual stage of the model. In contrast, traditional distillation approaches typically apply feature supervision and prediction guidance only at the final layer, limiting the effectiveness of knowledge transfer. This conventional setup overlooks the valuable intermediate representations of the teacher model, which can result in suboptimal performance.
To verify the effectiveness of our stage division, we conducted an experiment comparing a model without stage division, which applies prediction guidance and feature supervision only at the last layer, against a model using our stage division (the same as #7 in Table 13). For a fair comparison and to isolate the impact of stage division, replacement training is excluded from this analysis. As shown in the following table, the model with stage division outperforms the model without it, which demonstrates the advantage of our stage division.
We sincerely appreciate your suggestion and will include additional experiments in the revised version of our paper to further demonstrate the benefits of stage division. Thank you once again for your insightful comments.
Thank you for the time and effort you have dedicated to reviewing our work. We sincerely appreciate your valuable feedback and the recognition of the contributions of our CompressTracker. Based on your insightful suggestions, we have revised the manuscript in Appendix and updated the PDF accordingly. Specifically, we have added a summary of experiments evaluating the generalization capability of CompressTracker, as well as comparisons of its performance on CPU and with other model compression methods. Additionally, we have carefully reviewed the manuscript and corrected any typographical errors. Please refer to the latest version of the PDF for these updates.
Once again, we would like to express our gratitude for your hard work and thoughtful suggestions, and we are deeply appreciative of your continued support for our work.
Dear Reviewer:
We hope this message finds you well. As the discussion period is progressing, we would greatly value the opportunity to engage with you regarding your feedback on our submission. Your insights and suggestions are immensely important to us, and we are eager to address any remaining questions or concerns you may have.
We are committed to providing timely and detailed responses to ensure that all aspects of our work are clarified. We would appreciate it very much if you could reconsider our work and offer your support.
Thank you again for your time and effort, and we look forward to discussing with you.
Best regards
Dear Reviewer:
As the discussion period is progressing, we would greatly value the opportunity to engage with you regarding your feedback on our submission. Your insights and suggestions are immensely important to us, and we are eager to address any remaining questions or concerns you may have.
We are committed to providing timely and detailed responses to ensure that all aspects of our work are clarified. If our responses have satisfactorily addressed your concerns, we kindly request that you consider reflecting this in your evaluation and possibly revising your score.
Thank you again for your time and effort, and we look forward to discussing with you.
Best regards
Dear Reviewer:
As the discussion period is progressing, we would greatly value the opportunity to engage with you regarding your feedback on our submission. Your insights and suggestions are immensely important to us, and we are eager to address any remaining questions or concerns you may have.
We are committed to providing timely and detailed responses to ensure that all aspects of our work are clarified. If our responses have satisfactorily addressed your concerns, we kindly request that you consider reflecting this in your evaluation and possibly revising your score.
Thank you again for your time and effort, and we look forward to discussing with you.
Best regards
Dear Reviewer:
As the discussion period is progressing, we would greatly value the opportunity to engage with you regarding your feedback on our submission. Your insights and suggestions are immensely important to us, and we are eager to address any remaining questions or concerns you may have.
We are committed to providing timely and detailed responses to ensure that all aspects of our work are clarified. If our responses have satisfactorily addressed your concerns, we kindly request that you consider reflecting this in your evaluation and possibly revising your score.
Thank you again for your time and effort, and we look forward to discussing with you.
Best regards
Dear Reviewer:
As the discussion period is progressing, we would greatly value the opportunity to engage with you regarding your feedback on our submission. Your insights and suggestions are immensely important to us, and we are eager to address any remaining questions or concerns you may have.
We are committed to providing timely and detailed responses to ensure that all aspects of our work are clarified. If our responses have satisfactorily addressed your concerns, we kindly request that you consider reflecting this in your evaluation and possibly revising your score.
Thank you again for your time and effort, and we look forward to discussing with you.
Best regards
Dear Reviewer:
I hope this message finds you well. Thank you for your time and efforts in reviewing our submission. Your insights and expertise are greatly appreciated.
We submitted our rebuttal on November 17 and value your evaluation and feedback. As the discussion period is nearing its conclusion in two days, we kindly follow up for your review of our response.
Please feel free to let us know if you have any additional questions to discuss. We are more than willing to provide further clarification or engage in discussion to address any concerns.
Best regards,
The paper presents CompressTracker, a novel general model compression framework aimed at enhancing the efficiency of transformer-based object tracking models for deployment on resource-constrained devices. The framework employs a stage division strategy to segment the teacher model into distinct stages, which are then emulated by a lighter student model. CompressTracker introduces a replacement training technique, where student model stages are dynamically replaced with teacher model stages during training, enhancing the student's ability to replicate the teacher's behavior. Additionally, prediction guidance and stage-wise feature mimicking are incorporated to refine the learning process. The framework is structurally agnostic and compatible with various transformer architectures.
Strengths
The paper proposes an innovative approach to segmenting transformer layers into stages, allowing for more granular knowledge transfer from a teacher model to a student model.
The experiments show that CompressTracker can achieve a substantial speedup while maintaining a high level of accuracy, which is crucial for real-world applications on resource-constrained devices.
The framework's compatibility with any transformer architecture is a significant advantage, as it increases its applicability and flexibility.
Weaknesses
There is still a performance gap between the teacher and student models, indicating that there might be room for further improvement in the compression strategy to achieve lossless compression.
While the framework simplifies the compression process, the introduction of multiple training strategies might increase the complexity of the training regimen, which could be a barrier for some users.
Questions
In the article, the proposed framework is applied to OSTrack, where the search region is 256 × 256 pixels and the template is 128 × 128 pixels. What is the effect when the search region is 384 × 384 and the template is 192 × 192? Can tracking accuracy still be maintained?
What are the limitations of the stage division strategy, and how does it affect the generalization capabilities of the student model?
According to the experimental section, speed and accuracy cannot be maximized simultaneously: the faster the speed, the lower the accuracy. Taking OSTrack as an example, its encoder has 12 layers. Would it be possible to achieve a similar effect by appropriately reducing the number of layers? Would the compression method proposed in the article still have advantages in accuracy and speed?
Thanks for your insightful feedback. We sincerely appreciate your valuable comments and your recognition of the novelty and effectiveness of our work. We would be grateful if you can reconsider our work and offer your support.
Q1: Performance Gap
Thank you for highlighting the performance gap between the teacher and student models. We agree that there is room for improvement, as achieving fully lossless compression remains a challenging goal. However, it is important to note that our primary objective is to significantly reduce model size and computational cost while minimizing, rather than entirely eliminating, performance loss. Our framework effectively balances compression and performance retention, and surpasses all previous methods. The performance gap between our CompressTracker and the teacher is very small; e.g., CompressTracker-4 exhibits only a 0.7% AUC reduction on TNL2K, while our CompressTracker-8 even outperforms OSTrack by 0.9% AUC (Table 1). Besides, our CompressTracker-4 surpasses state-of-the-art models (e.g., HiT-Base) by at least 2.1% AUC on TrackingNet at the same inference speed. Our results demonstrate that our method achieves substantial compression with minimal degradation, highlighting the effectiveness of our CompressTracker framework.
We would like to clarify that in model compression, a certain level of performance trade-off is generally considered acceptable. The primary goal in this field is to balance efficiency with performance retention. Given this, the small gap observed between the teacher and student models reflects the inherent trade-off typical in most compression frameworks, rather than the weakness of our approach. We are also actively exploring more advanced techniques to further reduce this gap in future work.
We believe our work offers a novel perspective on this issue for the tracking field, and the trade-off should not be viewed as a weakness but as an important area of focus for tracking systems. We would appreciate it very much if you can rethink our work and support us.
Q4: Limitation of Stage Division
Thank you for your insightful question regarding the limitations of the stage division strategy and its potential impact on the generalization capabilities of the student model. In our framework, we divide the teacher model into stages that correspond to the number of layers in the student model. This stage division approach is highly flexible and can support various forms of layer partitioning. As shown in Table 8 of our paper, we evaluated different stage division configurations, and the results demonstrate that variations in stage division have minimal impact on both the model's performance and its generalization capabilities. This flexibility ensures that our approach remains robust and adaptable, allowing users to customize the stage divisions as needed without compromising the student model's ability to generalize effectively. Thanks for your advice; we will clarify the impact of different stage divisions in the revised manuscript.
Q5: Performance and Layers
While it is true that reducing the number of layers in a model can lead to faster speeds, this often results in a significant drop in accuracy, as you have pointed out. However, our proposed compression method takes a more comprehensive approach to balance both speed and accuracy. Rather than simply reducing the number of layers, our method leverages stage division and replacement training to effectively distill knowledge. This allows us to preserve the accuracy of the teacher model while reducing computational overhead. Our CompressTracker achieves the best balance between accuracy and speed, and outperforms previous works on several benchmarks.
As shown in Figure 3, we compare our CompressTracker with directly decreasing the layer number ('Naive Training'), and our CompressTracker achieves consistent performance improvements at the same inference speed. These results demonstrate that our approach is more effective than simply reducing the number of layers in terms of preserving accuracy.
Q3: Resolution
Yes! Our CompressTracker demonstrates strong scalability and supports any input resolution. We have conducted experiments in which we compressed an OSTrack model with a 384 resolution into a student model with four layers and a 256 input resolution. The results, shown below, indicate that our CompressTracker-4-384 still maintains high accuracy and inference speed, highlighting the scalability of our framework. Thanks for your advice, and we will add more experiments on the different input resolution in the revised manuscript.
| # | Model | LaSOT | LaSOT_ext | TNL2K | TrackingNet | UAV123 | FPS |
|---|---|---|---|---|---|---|---|
| 1 | CompressTracker-4 | 66.1 (96%) | 45.7 (96%) | 53.6 (99%) | 82.1 (99%) | 67.4 (99%) | 228 (2.17×) |
| 2 | CompressTracker-4-384 | 67.7 (96%) | 48.1 (96%) | 54.3 (99%) | 82.7 (99%) | 68.2 (98%) | 228 (3.90×) |
Q2: Training Complexity
Thank you for your valuable feedback. In fact, our framework requires only minimal modifications to standard training code. To aid in the implementation, we have provided pseudo code in the Appendix. We believe other researchers can reproduce our method quickly and easily.
Besides, our CompressTracker requires only a simple, end-to-end, efficient, one-step training process, unlike the complex, multi-step training procedures employed in previous works such as MixFormerV2. As shown in Table 12, our CompressTracker-4 only requires about 20 hours on 8 3090 GPUs, while MixFormerV2-S takes about 120 hours on 8 RTX8000 GPUs (about 80 hours on 8 3090 GPUs). We hope this clarifies that our framework, while effective, remains user-friendly and highly accessible for other users.
Thank you for the time and effort you have dedicated to reviewing our work. We sincerely appreciate your valuable feedback and the recognition of the contributions of our CompressTracker. Based on your insightful suggestions, we have revised the manuscript in Appendix and updated the PDF accordingly. Specifically, we have added a summary of experiments evaluating the generalization capability of CompressTracker, as well as comparisons of its performance on CPU and with other model compression methods. Additionally, we have carefully reviewed the manuscript and corrected any typographical errors. Please refer to the latest version of the PDF for these updates.
Once again, we would like to express our gratitude for your hard work and thoughtful suggestions, and we are deeply appreciative of your continued support for our work.
Dear Reviewer:
We hope this message finds you well. As the discussion period is progressing, we would greatly value the opportunity to engage with you regarding your feedback on our submission. Your insights and suggestions are immensely important to us, and we are eager to address any remaining questions or concerns you may have.
We are committed to providing timely and detailed responses to ensure that all aspects of our work are clarified. We would appreciate it very much if you could reconsider our work and offer your support.
Thank you again for your time and effort, and we look forward to discussing with you.
Best regards
Dear Reviewer:
As the discussion period is progressing, we would greatly value the opportunity to engage with you regarding your feedback on our submission. Your insights and suggestions are immensely important to us, and we are eager to address any remaining questions or concerns you may have.
We are committed to providing timely and detailed responses to ensure that all aspects of our work are clarified. If our responses have satisfactorily addressed your concerns, we kindly request that you consider reflecting this in your evaluation and possibly revising your score.
Thank you again for your time and effort, and we look forward to discussing with you.
Best regards
Dear Reviewer:
As the discussion period is progressing, we would greatly value the opportunity to engage with you regarding your feedback on our submission. Your insights and suggestions are immensely important to us, and we are eager to address any remaining questions or concerns you may have.
We are committed to providing timely and detailed responses to ensure that all aspects of our work are clarified. If our responses have satisfactorily addressed your concerns, we kindly request that you consider reflecting this in your evaluation and possibly revising your score.
Thank you again for your time and effort, and we look forward to discussing with you.
Best regards
Dear Reviewer:
As the discussion period is progressing, we would greatly value the opportunity to engage with you regarding your feedback on our submission. Your insights and suggestions are immensely important to us, and we are eager to address any remaining questions or concerns you may have.
We are committed to providing timely and detailed responses to ensure that all aspects of our work are clarified. If our responses have satisfactorily addressed your concerns, we kindly request that you consider reflecting this in your evaluation and possibly revising your score.
Thank you again for your time and effort, and we look forward to discussing with you.
Best regards
Dear Reviewer:
As the discussion period is progressing, we would greatly value the opportunity to engage with you regarding your feedback on our submission. Your insights and suggestions are immensely important to us, and we are eager to address any remaining questions or concerns you may have.
We are committed to providing timely and detailed responses to ensure that all aspects of our work are clarified. If our responses have satisfactorily addressed your concerns, we kindly request that you consider reflecting this in your evaluation and possibly revising your score.
Thank you again for your time and effort, and we look forward to discussing with you.
Best regards
Dear Reviewer:
I hope this message finds you well. Thank you for your time and efforts in reviewing our submission. Your insights and expertise are greatly appreciated.
We submitted our rebuttal on November 17 and value your evaluation and feedback. As the discussion period is nearing its conclusion in two days, we kindly follow up for your review of our response.
Please feel free to let us know if you have any additional questions to discuss. We are more than willing to provide further clarification or engage in discussion to address any concerns.
Best regards,
In this paper, the authors propose a general model compression framework for efficient Transformer object tracking, named CompressTracker. The method adopts a novel stage partitioning strategy to divide the Transformer layers of the teacher model into different stages, enabling the student model to more effectively simulate each corresponding teacher stage. The authors also design a unique replacement training technique, which involves randomly replacing specific stages in the student model with corresponding stages from the teacher model. Replacement training enhances the student model's ability to replicate the behavior of the teacher model. To further encourage the student model to simulate the teacher model, the authors combine predictive guidance and staged feature imitation to provide additional supervision during the compression process. The authors conducted a series of experiments to verify the effectiveness and generality of CompressTracker.
Strengths
The author has clear ideas and the article is easy to understand. He proposes a general compression framework for single object tracking. This method can efficiently compress large object tracking models into small models. The author has conducted a large number of experiments to prove the effectiveness of this method.
Weaknesses
- The innovation is slightly insufficient. The author’s innovation focuses on replacement training, prediction guidance, and feature mimicking. The latter two are common methods of distillation and are not enough to be the innovation of this article. Therefore, the innovation of this article is more focused on the replacement training strategy, but the author’s explanation of the intuitive reasons why this strategy is useful is poor.
- The author claims that the method is a general compression framework, but the paper only experiments on OSTrack and MixFormerV2. However, there are many transformer-based trackers, such as SimTrack, ODTrack, LoraT, etc. The reviewer thinks it cannot be called a general compression framework after verifying only two trackers.
- The author only conducted experiments on GPUs with sufficient computing power. However, efficient trackers are more targeted at devices with insufficient computing power, such as CPUs and edge devices. The author did not conduct experiments on such devices to verify the term efficient.
Questions
The application scenarios of efficient tracking models are mostly devices with insufficient computing power. The author should provide the speed of the model on a CPU or an edge device to verify the word "efficient", rather than only testing the speed on a GPU. For other issues, see weaknesses.
We sincerely appreciate your recognition of the effectiveness and contributions of our work, along with your thoughtful feedback and suggestions. We are also pleased that other reviewers, including Reviewer hzRe, Reviewer AUCs, and Reviewer 3q3t, have recognized the novelty and effectiveness of our approach. We would be truly grateful if you could reconsider our work and offer your support!
Q1: Novelty
Reviewer hzRe and Reviewer AUCs have both acknowledged the novelty and innovation of our work, and we deeply appreciate their recognition. Please allow us to restate the key aspects of our novelty and contributions to further clarify the impact of our approach. We would be extremely grateful if you could reconsider the novelty of our work in this context.
We have summarized our novelty as follows:
- We introduce CompressTracker, a novel and general model compression framework designed to enable efficient transformer-based object tracking.
- We propose a stage division strategy, allowing for fine-grained imitation of the teacher model at the stage level, which enhances both the precision and efficiency of knowledge transfer.
- We introduce replacement training to enhance the student model's ability to effectively replicate the teacher model's behavior.
- We further integrate prediction guidance and feature mimicking to accelerate and refine the learning process, ensuring more efficient knowledge transfer.
- CompressTracker overcomes traditional structural limitations, adapting flexibly to various transformer architectures for student models. It outperforms existing methods, significantly accelerating OSTrack by 2.17× while maintaining approximately 96% of the original accuracy (66.1% AUC on LaSOT).
First, we introduce our first innovation, the CompressTracker framework, a novel and general model compression framework for efficient transformer-based object tracking. Our CompressTracker supports any transformer structure and layer count in the student model, any input resolution, and any teacher model, while maintaining an end-to-end, efficient training process. Extensive experiments in Tables 1, 2, 3, and 4 validate the strong generalization ability, scalability, and robustness of CompressTracker.
Second, we introduce our second innovation, the stage division strategy, addressing limitations in previous works that either restrict student architectures or result in suboptimal knowledge transfer. By partitioning the teacher model into stages, this approach allows the student to learn both raw feature matching and the refined strategies developed by the teacher. We are the first to implement stage division, which offers several key advantages: (1) Structural Flexibility. Stage division supports any transformer architecture and any layer count in the student model. (2) Enhanced Mimicking. By applying distillation supervision at each stage, we ensure more effective knowledge transfer. These advantages are lacking in previous works, and the ablation studies in Tables 8 and 13 prove the effectiveness of our stage division.
Third, we introduce our third innovation, replacement training, which integrates the teacher and student models for collaborative training. The reasons for the effectiveness of replacement training can be summarized as follows: (1) Enhanced Knowledge Transfer. Dynamic stage replacements enable direct, stage-level interactions, allowing the student to better replicate the teacher's strategies. (2) Structural Flexibility. Replacement training works without strict architectural alignment, enabling independent refinement of each stage without a rigid dependency on an identical layer structure. (3) Adaptive Learning. Random stage substitutions expose the student to diverse teacher capabilities, fostering a deeper and broader understanding of the teacher's knowledge. Our replacement training has several advantages: (1) Simplified Training Process. The single, end-to-end process reduces training time and avoids the risk of suboptimal performance associated with complex methods. (2) Better Accuracy. Replacement training brings a 0.9% AUC improvement on LaSOT (Tables 9, 10, 13 and Figure 3), enhancing both efficiency and accuracy.
Then, to accelerate convergence, we introduce the fourth innovation: prediction guidance and stage-wise feature mimicking, aligning the student's features with the teacher's at each stage for consistent learning. Experiments in Table 13 and Figure 3 (a 1.5% AUC increase) prove the effectiveness of the two techniques. To emphasize that the improvements stem from the CompressTracker framework itself, we implemented these techniques in their simplest forms. This highlights that the framework drives the performance gains, while leaving room for further enhancements with more advanced techniques. This underscores the framework's novelty and general applicability, demonstrating its effectiveness without relying on complex auxiliary components.
Q2: Generalization
Thank you for your valuable feedback. In response to your comments, we have conducted additional experiments to further demonstrate the generalizability of our CompressTracker. Our CompressTracker supports any transformer structure and layer count in the student model, any input resolution, and any teacher model. To showcase its versatility, we have added experiments on compressing the ODTrack and SeqTrack models. The results are presented below:
| # | Model | LaSOT | LaSOT_ext | TNL2K | TrackingNet | UAV123 | FPS |
|---|---|---|---|---|---|---|---|
| | **Model Generalization** | | | | | | |
| 1 | CompressTracker-4 | 66.1 (96%) | 45.7 (96%) | 53.6 (99%) | 82.1 (99%) | 67.4 (99%) | 228 (2.17×) |
| 2 | CompressTracker-4-ODTrack | 70.5 (96%) | 50.9 (97%) | 60.4 (99%) | 82.8 (97%) | 69.2 (98%) | 86.5 (1.74×) |
| 3 | CompressTracker-4-SeqTrack | 68.1 (95%) | 47.9 (96%) | 54.5 (99%) | 83.1 (98%) | 68.4 (98%) | 62.1 (1.36×) |
| | **Stage Scalability** | | | | | | |
| 4 | CompressTracker-2 | 60.4 (87%) | 40.4 (85%) | 48.5 (89%) | 78.2 (94%) | 62.5 (92%) | 346 (3.30×) |
| 5 | CompressTracker-3 | 64.9 (94%) | 44.6 (94%) | 52.6 (97%) | 81.6 (98%) | 65.4 (96%) | 267 (2.54×) |
| 6 | CompressTracker-4 | 66.1 (96%) | 45.7 (96%) | 53.6 (99%) | 82.1 (99%) | 67.4 (99%) | 228 (2.17×) |
| 7 | CompressTracker-6 | 67.5 (98%) | 46.7 (99%) | 54.7 (101%) | 82.9 (99%) | 67.9 (99%) | 162 (1.54×) |
| 8 | CompressTracker-8 | 68.4 (99%) | 47.2 (99%) | 55.2 (102%) | 83.3 (101%) | 68.2 (99%) | 127 (1.21×) |
| | **Larger Transformer Scalability** | | | | | | |
| 9 | CompressTracker-4-L | 67.5 (96%) | 45.9 (98%) | 58.3 (98%) | 83.2 (99%) | 67.4 (99%) | 228 (2.84×) |
| | **Higher Resolution Scalability** | | | | | | |
| 10 | CompressTracker-4-384 | 67.7 (96%) | 48.1 (96%) | 54.3 (99%) | 82.7 (99%) | 68.2 (98%) | 228 (3.90×) |
| | **Heterogeneous Structure Robustness** | | | | | | |
| 11 | CompressTracker-M-S | 62.0 (88%) | 44.5 (88%) | 50.2 (87%) | 77.7 (93%) | 66.9 (96%) | 325 (1.97×) |
| 12 | CompressTracker-SMAT | 62.8 (91%) | 43.4 (92%) | 49.6 (91%) | 79.7 (96%) | 65.9 (96%) | 138 (1.31×) |
We compare the AUC scores and performance gaps relative to the original models across several benchmarks at a resolution of 256. Our experiments involve six teacher models (OSTrack, OSTrack-384, OSTrack-L, ODTrack, MixFormerV2, SeqTrack) and 11 student models. We train OSTrack with ViT-L as the backbone ourselves. We would like to emphasize that CompressTracker is designed as a scalable framework, capable of adapting to variations in image resolution (e.g., #4-8, 10), teacher model size (#4-8, 9), and student model size (#4-8). It demonstrates strong generalization across different teacher models (#1-3, 11) and exhibits structural robustness when applied to various student model architectures (#11, 12). These extensive experiments demonstrate the strong generalization, scalability, and robustness of CompressTracker. It supports any transformer structure and layer count in the student model, any input resolution, and any teacher model, achieving general compression. We will include additional experiments on more tracking models in the revised manuscript to verify the generalization and scalability of our CompressTracker.
Our contributions are both sequential and interdependent. The stage division strategy forms the basis for replacement training, which is further enhanced by prediction guidance and stage-wise feature mimicking. Together, these components ensure a cohesive and consistent framework.
In a nutshell, our innovations are fourfold: (1) the CompressTracker framework, (2) stage division, (3) replacement training, and (4) prediction guidance and feature mimicking. These innovations are cohesively linked, ensuring a consistent and effective overall approach. We will take your valuable feedback into account by clarifying the specific innovations in our work and providing a more detailed explanation of the effectiveness of replacement training in the revised manuscript. We hope this addresses your concerns and strengthens the overall contribution of our work. Thanks for your valuable comments, and we would appreciate it if you could reconsider the novelty of our work.
Q3: CPU Speed
Thank you for your valuable advice. We evaluate the inference speed of the models on an Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz. These experiments demonstrate that our framework maintains high efficiency even on resource-constrained devices, and our CompressTracker achieves an optimal balance between accuracy and speed. Results are presented in the tables below (a brief sketch of the timing setup follows the tables).
| Model | AUC on LaSOT | FPS(CPU) |
|---|---|---|
| CompressTracker-2 | 60.4 | 29 |
| CompressTracker-3 | 64.9 | 22 |
| CompressTracker-4 | 66.1 | 18 |
| CompressTracker-6 | 67.5 | 13 |
| E.T.Track | 59.1 | 42 |
| FEAR-XS | 53.5 | 26 |
| Model | AUC on LaSOT | FPS(CPU) |
|---|---|---|
| CompressTracker-M-S | 62.0 | 30 |
| MixFormerV2-S | 60.6 | 30 |
| Model | AUC on LaSOT | FPS(CPU) |
|---|---|---|
| CompressTracker-SMAT | 62.8 | 31 |
| SMAT | 61.7 | 33 |
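For reference, a minimal sketch of the kind of timing loop that could be used to obtain such CPU throughput numbers is given below; the `(template, search)` call signature, input shapes, and iteration counts are illustrative assumptions rather than the exact benchmarking protocol used for the tables above.

```python
import time
import torch

@torch.no_grad()
def measure_cpu_fps(model, search_size=256, template_size=128,
                    warmup=10, iters=100):
    """Rough CPU throughput measurement for a tracker.

    The (template, search) call signature and input shapes follow a common
    one-stream tracker interface and may need adapting to the actual model.
    """
    model.eval().to("cpu")
    template = torch.randn(1, 3, template_size, template_size)
    search = torch.randn(1, 3, search_size, search_size)
    for _ in range(warmup):            # warm-up iterations to stabilise timing
        model(template, search)
    start = time.perf_counter()
    for _ in range(iters):
        model(template, search)
    return iters / (time.perf_counter() - start)   # frames per second
```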
We propose a general model compression framework rather than a specific model. To showcase the effectiveness of our framework, we applied it to compress several tracking models. Due to the framework's strong generalization capabilities, other researchers can select suitable student models based on their hardware and apply our framework accordingly. We will provide a more detailed explanation of this aspect in the revised version.
Thank you for the time and effort you have dedicated to reviewing our work. We sincerely appreciate your valuable feedback and the recognition of the contributions of our CompressTracker. Based on your insightful suggestions, we have revised the manuscript in Appendix and updated the PDF accordingly. Specifically, we have added a summary of experiments evaluating the generalization capability of CompressTracker, as well as comparisons of its performance on CPU and with other model compression methods. Additionally, we have carefully reviewed the manuscript and corrected any typographical errors. Please refer to the latest version of the PDF for these updates.
Once again, we would like to express our gratitude for your hard work and thoughtful suggestions, and we are deeply appreciative of your continued support for our work.
Dear Reviewer:
We hope this message finds you well. As the discussion period is progressing, we would greatly value the opportunity to engage with you regarding your feedback on our submission. Your insights and suggestions are immensely important to us, and we are eager to address any remaining questions or concerns you may have.
We are committed to providing timely and detailed responses to ensure that all aspects of our work are clarified. We would appreciate it very much if you could reconsider our work and offer your support.
Thank you again for your time and effort, and we look forward to discussing with you.
Best regards
Dear Reviewer:
As the discussion period is progressing, we would greatly value the opportunity to engage with you regarding your feedback on our submission. Your insights and suggestions are immensely important to us, and we are eager to address any remaining questions or concerns you may have.
We are committed to providing timely and detailed responses to ensure that all aspects of our work are clarified. If our responses have satisfactorily addressed your concerns, we kindly request that you consider reflecting this in your evaluation and possibly revising your score.
Thank you again for your time and effort, and we look forward to discussing with you.
Best regards
Thanks for the authors' rebuttal. My rating has been raised to 6.
Thanks for your support. Your suggestions have helped us improve the manuscript.
We sincerely appreciate the thorough review provided by all the reviewers. The valuable feedback from the reviewers has significantly contributed to enhancing the quality of our manuscript. We extend our gratitude to Reviewer hzRe and Reviewer AUCs for acknowledging the novelty of our work. Their positive recognition of the innovation in our research is greatly appreciated. All reviewers affirm the effectiveness of our work. Furthermore, we kindly request Reviewer ejyx, Reviewer hzRe and Reviewer AUCs to reconsider our work after reviewing our response. Your reconsideration will be highly valued.
Based on the comments from the reviewers, we summarize the strengths of our paper as follows:
- Clear motivation and certain innovation. (ejyx, hzRe, AUCs)
- Compatibility with any transformer architecture to fit various deployment environments and computational limits. (hzRe, 3q3t)
- High efficiency and effectiveness, and impressive balance between inference speed and tracking accuracy. (ejyx, hzRe, 3q3t)
- Extensive experiments to verify the effectiveness of our CompressTracker. (ejyx, 3q3t)
We have summarized our novelty as follows:
- We introduce CompressTracker, a novel and general model compression framework for efficient transformer-based object tracking.
- We propose a stage division strategy, allowing for fine-grained imitation of the teacher at the stage level, which enhances both the precision and efficiency of knowledge transfer (a minimal sketch of one possible division rule is given after this list).
- We introduce replacement training to enhance the student model's ability to effectively replicate the teacher model's behaviour.
- We further integrate prediction guidance and feature mimicking to accelerate the learning process.
- CompressTracker overcomes traditional structural limitations, adapting flexibly to various transformer architectures for the student model. It outperforms existing methods, accelerating OSTrack by 2.17× while maintaining approximately 96% of the original accuracy (66.1% AUC on LaSOT).
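For clarity, here is a minimal sketch of one possible stage-division rule, as referenced in the second bullet above: an even split of the teacher's transformer layers, with each resulting stage mapped to one student layer. The helper name and the even-split assumption are illustrative; other partitioning schemes fit the same framework.

```python
def divide_into_stages(teacher_layers, num_student_layers):
    """Evenly partition the teacher's transformer layers into stages.

    Each stage corresponds to one student layer, e.g. a 12-layer teacher
    compressed into a 4-layer student yields 4 stages of 3 teacher layers.
    The even-split rule and the function name are illustrative assumptions.
    """
    n_teacher = len(teacher_layers)
    assert n_teacher % num_student_layers == 0, "sketch assumes an even split"
    step = n_teacher // num_student_layers
    return [teacher_layers[i:i + step] for i in range(0, n_teacher, step)]

# Example: 12 teacher layers -> 4 stages of 3 layers each.
stages = divide_into_stages(list(range(12)), num_student_layers=4)
# stages == [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
```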
In response to the reviewers' shared concerns regarding the generalization of CompressTracker, we provide a unified response. We have conducted additional experiments and summarize them, together with the experiments already in the paper, in the table below.
| # | Model | LaSOT | LaSOT_ext | TNL2K | TrackingNet | UAV123 | FPS |
|---|---|---|---|---|---|---|---|
| | **Model Generalization** | | | | | | |
| 1 | CompressTracker-4 | 66.1 (96%) | 45.7 (96%) | 53.6 (99%) | 82.1 (99%) | 67.4 (99%) | 228 (2.17×) |
| 2 | CompressTracker-4-ODTrack | 70.5 (96%) | 50.9 (97%) | 60.4 (99%) | 82.8 (97%) | 69.2 (98%) | 86.5 (1.74×) |
| 3 | CompressTracker-4-SeqTrack | 68.1 (95%) | 47.9 (96%) | 54.5 (99%) | 83.1 (98%) | 68.4 (98%) | 62.1 (1.36×) |
| | **Stage Scalability** | | | | | | |
| 4 | CompressTracker-2 | 60.4 (87%) | 40.4 (85%) | 48.5 (89%) | 78.2 (94%) | 62.5 (92%) | 346 (3.30×) |
| 5 | CompressTracker-3 | 64.9 (94%) | 44.6 (94%) | 52.6 (97%) | 81.6 (98%) | 65.4 (96%) | 267 (2.54×) |
| 6 | CompressTracker-4 | 66.1 (96%) | 45.7 (96%) | 53.6 (99%) | 82.1 (99%) | 67.4 (99%) | 228 (2.17×) |
| 7 | CompressTracker-6 | 67.5 (98%) | 46.7 (99%) | 54.7 (101%) | 82.9 (99%) | 67.9 (99%) | 162 (1.54×) |
| 8 | CompressTracker-8 | 68.4 (99%) | 47.2 (99%) | 55.2 (102%) | 83.3 (101%) | 68.2 (99%) | 127 (1.21×) |
| | **Larger Transformer Scalability** | | | | | | |
| 9 | CompressTracker-4-L | 67.5 (96%) | 45.9 (98%) | 58.3 (98%) | 83.2 (99%) | 67.4 (99%) | 228 (2.84×) |
| | **Higher Resolution Scalability** | | | | | | |
| 10 | CompressTracker-4-384 | 67.7 (96%) | 48.1 (96%) | 54.3 (99%) | 82.7 (99%) | 68.2 (98%) | 228 (3.90×) |
| | **Heterogeneous Structure Robustness** | | | | | | |
| 11 | CompressTracker-M-S | 62.0 (88%) | 44.5 (88%) | 50.2 (87%) | 77.7 (93%) | 66.9 (96%) | 325 (1.97×) |
| 12 | CompressTracker-SMAT | 62.8 (91%) | 43.4 (92%) | 49.6 (91%) | 79.7 (96%) | 65.9 (96%) | 138 (1.31×) |
Our experiments involve multiple teacher models (OSTrack, OSTrack-384, OSTrack-L, ODTrack, MixFormerV2, SeqTrack) and 11 student models. We would like to emphasize that CompressTracker is designed as a scalable framework, capable of adapting to variations in image resolution (e.g., #4-8, #10), teacher model size (#4-8, #9), student model size (#4-8), teacher model (#1-3, #11), and student model architecture (#11, #12). These extensive experiments demonstrate the strong generalization, scalability, and robustness of CompressTracker: it supports any transformer structure and number of layers in the student model, any input resolution, and any teacher model, achieving general compression.
We believe that the innovative contributions of our work significantly enhance its value for visual object tracking. We have addressed each reviewer's comments in detail and will incorporate the reviewers' insightful suggestions by adding the essential experiments. We kindly request that the reviewers reconsider our research with these aspects in mind and extend their support.
This paper presents a model compression framework aimed at enhancing the efficiency of transformer-based object tracking. It received mixed reviews, with scores of 5, 5, 6, and 6, resulting in an average score of 5.5.
Reviewers acknowledged the paper’s strengths, particularly its innovative approach to compressing transformer-based tracking models to improve efficiency, as highlighted by reviewers 3q3t and ejyx.
However, several weaknesses were noted, including limited technical contributions (pointed out by Reviewers ejyx and hzRe), a performance gap (identified by Reviewer hzRe), and insufficient experimental validation (mentioned by Reviewers AUCs and ejyx).
The Area Chair (AC) also assessed the paper and noted that the three technical contributions—stage division, replacement training, and prediction guidance/feature mimicking—are not closely related or unified. This makes the claim of a "general compression framework" somewhat overclaimed. The results show that while the proposed method achieves approximately a 2x speed increase, it results in a performance drop of 3% to 8%. Furthermore, the effectiveness of the method is inconsistent across different backbone variants and scales, indicating that more experiments are needed to validate its generalization ability.
After discussions among the authors, reviewers, and the AC, some concerns were addressed. Nevertheless, issues concerning technical novelty remain unresolved. The AC discussed this paper with the Senior AC in depth. They concluded that they could not recommend the acceptance of this paper and encouraged the authors to take the reviewers' comments into account to improve the paper for the next venue.
Additional comments from the reviewer discussion
After the rebuttal and discussion period, some concerns about insufficient experimental validation and performance gaps were addressed. However, issues concerning technical novelty/contributions remain unresolved.
Reject