TAPTRv2: Attention-based Position Update Improves Tracking Any Point
Abstract
Reviews and Discussion
This paper proposes TAPTRv2, an improved version of TAPTR, which addresses the Tracking Any Point (TAP) task. TAPTRv2 introduces a novel attention-based position update (APU) operation that leverages key-aware deformable attention to refine point query positions. This operation removes the need for cost-volume computation, which was previously used in TAPTR but had the drawback of contaminating point query content features.
Strengths
- The introduction of the APU operation is a novel improvement over TAPTR. By directly using attention weights to update point query positions, TAPTRv2 avoids the contamination of content features and leads to more accurate tracking results.
- This work simplifies the overall pipeline and makes it more conceptually straightforward. This will help scale up the training process and accelerate the development of more practical TAP algorithms.
- The paper conducts extensive experiments on multiple datasets and provides ablation studies to analyze the effectiveness of each key design.
Weaknesses
- Since the main contribution is simplifying the algorithm's structure, a speed comparison is very important, but the authors overlooked this point.
- Enhance the experiments by including more common datasets (such as PointOdyssey) to assess generalization, and incorporate visual comparisons on long video sequences to demonstrate the stability and accuracy of TAPTRv2.
- Figure 4 does not show the result of CoTracker, which is inconsistent with line 295. It would also be preferable for Figure 4 to show more key points to demonstrate the advantages of the proposed method.
Questions
See Weakness.
Limitations
My main concern is the lack of speed comparison for the proposed TAPTRv2. It is necessary to provide detailed efficiency comparisons for each module, such as the Effective Utilization of Cost-Volume and the APU operation.
Summary of review ratings.
| Reviewer | yaYP | wowB | zD8r | zGVx |
|---|---|---|---|---|
| Rating | Weak Accept | Weak Accept | Borderline Accept | Borderline Reject |
| Confidence | 4 | 4 | 4 | 3 |
Rebuttal - zD8r
We thank the reviewer for the recognition of the novelty, effectiveness, extensive experiments, and the potential of TAPTRv2 in scaling up the training process and accelerating the development of algorithms.
Q1. Main concerns-1: Speed comparison.
A1 We thank the reviewer for this constructive suggestion and apologize for overlooking such an important point. Indeed, this comparison should be an important part of our work, and we will add it to our camera-ready version. Following mainstream works in the object detection field, we use FPS, GFLOPS, and the number of parameters to compare speed, computational efficiency, and resource requirements. As shown in the first table below, TAPTRv2 is faster and has lower resource requirements than TAPTR. More importantly, in downstream tasks it is common to track all pixels in a region rather than just a few scattered points, in which case the number of points to be tracked can reach tens of thousands. Since the cost of computing the cost-volume and the cost-volume aggregation operation in TAPTR grows sharply with the number of tracking points, the advantage of TAPTRv2 becomes more and more pronounced as this number increases. As shown in the second table below, when the number of tracking points reaches 5000 (only 1.9% of the pixels in a 512x512 image), TAPTRv2's advantage in speed and resource consumption becomes much more significant (about 24% faster with 20% fewer computational resource requirements).
| 800 Points | FPS | GFLOPS | #Params |
|---|---|---|---|
| TAPTR | 65.9 | 147.2 | 39.2M |
| TAPTRv2 | 69.1 | 143.4 | 38.2M |
| 5000 Points | FPS | GFLOPS | #Params |
|---|---|---|---|
| TAPTR | 11.8 | 426.8 | 39.2M |
| TAPTRv2 | 14.6 | 354.2 | 38.2M |
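The linear scaling argument above can be made concrete with a back-of-envelope multiply-add count for the cost-map computation alone. This is a minimal sketch under our own illustrative sizes (the feature-map resolution, channel count, and layer count below are assumptions, not the paper's exact configuration):

```python
def cost_volume_madds(num_points, h=64, w=64, c=256, num_layers=6):
    """Rough multiply-add count for computing one (h x w) correlation map
    per tracking point from c-dim features, recomputed in each of the
    num_layers refinement layers. All sizes are illustrative assumptions."""
    return num_points * h * w * c * num_layers

# This term grows linearly with the number of tracked points, so going
# from 800 to 5000 points multiplies this part of the cost by 6.25x.
ratio = cost_volume_madds(5000) / cost_volume_madds(800)
print(ratio)  # 6.25
```

Whatever the exact constants, the point stands: a per-point, per-layer correlation map makes the cost-volume term dominate as the number of tracked points grows.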
Q2. Main concerns-2: The efficiency of APU operation.
A2.1 In fact, in our implementation the APU is only an additional weighted-summation operation (as highlighted by the red flows in Fig. 3). Therefore, its impact on efficiency is negligible.
A2.2 By contrast, the computation of the cost volume and its aggregation in decoder layers have a significant impact on efficiency, especially when the number of tracking points increases, as we have discussed in A1.
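The weighted-summation view of APU in A2.1 can be sketched as follows. This is a minimal NumPy sketch under our own simplifications: `offsets` and the key-aware attention logits stand in for what the deformable-attention module actually predicts, and the names are ours, not the paper's:

```python
import numpy as np

def apu_update(query_pos, offsets, attn_logits):
    """Attention-based position update (sketch).

    query_pos:   (2,) current point-query position.
    offsets:     (K, 2) sampling offsets around the query (deformable attention).
    attn_logits: (K,) key-aware attention logits for the K sampled locations.

    The updated position is the attention-weighted mean of the sampling
    locations -- a single weighted summation, hence negligible extra cost.
    """
    w = np.exp(attn_logits - attn_logits.max())
    w /= w.sum()                        # softmax -> attention weights
    sample_pos = query_pos + offsets    # absolute sampling locations
    return (w[:, None] * sample_pos).sum(axis=0)
```

With uniform weights and symmetric offsets the update leaves the position unchanged; in general the position moves toward the sampled locations the attention deems most similar to the query.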
Q3. Assess the generalization of the improvements.
A3.1 We thank the reviewer for the constructive suggestion. Indeed, including more datasets will help us assess the generalization ability of TAPTRv2 over TAPTR. Here we further include RoboTAP, another challenging dataset with long videos (1000+ frames, about one minute long) developed by DeepMind. The comparison between TAPTR and TAPTRv2 on RoboTAP is shown below. The performance advantage of TAPTRv2 reflects its general improvement. We will include more datasets in our camera-ready version.
| | AJ | < δ^x_avg | OA |
|---|---|---|---|
| TAPTR | 55.6 | 71.2 | 84.6 |
| TAPTRv2 | 57.8 | 71.3 | 85.7 |
A3.2 As suggested by reviewer-wowB, we conduct a statistical significance analysis to verify whether the improvements of TAPTRv2 over TAPTR are significant. We utilize the Friedman test, and the resulting p-value is 0.042. If we take the results in RoboTAP into consideration, the resulting p-value further decreases to 0.024. These statistical results indicate that the improvement of TAPTRv2 over TAPTR is statistically significant, showing its generalization capability.
Q4. Inconsistent names in Fig. 4.
A4 We thank the reviewer for the kind reminder. The inconsistency is indeed a typo, and we will fix it in our camera-ready version.
Thanks for the response, especially for providing the speed comparison experiment. I believe there are no obvious flaws in this paper now. I keep my initial score unchanged.
The paper proposes an improved version of TAPTR, a DETR-based approach for point-based tracking in videos. TAPTR-v2 removes the dependency of TAPTR on the cost-volume, using local window features to define the keys and values to be used within the attention blocks of the DETR's transformer decoder. This simplification offers better computational complexity and improves performance.
Strengths
The paper is well written and motivated, with a deep analysis of the pros and cons of TAPTR and how it can be improved. The analysis in l. 163 regarding the cost-volume, supported by an ablation study, is of great interest.
Based on that, I liked the idea of building an attention block that directly uses the features as keys and values rather than exploring the standard cross-attention method w.r.t. the cost-volume.
The intuition behind the APU block is clear and well presented, with Figure 3 giving a reproducible description of the proposed unit.
Weaknesses
I am a bit confused about the part regarding the feature preparation, in particular with Eqn (1). How is this initialized in the first frame and how is this updated after each frame?
It is my understanding that the content and feature parts are those corresponding to the target points from the previous frame (i.e. t-1), although I am not sure I understood that correctly. If this is the case, I am not sure what happens if only one point is being tracked. Similarly, I wonder if the authors have considered adding as queries all former embeddings for a target point within a feasible window. For example, if a single point x is being tracked, then f_e and l_e would be {f_1, f_2, ..., f_t-1} and so forth for l_e. In such case, how would this combine for a multi-point tracking approach? Please clarify in the rebuttal.
I am a bit concerned about the results in Table 2. While it is to acknowledge the simplification that TAPTRv2 offers w.r.t. TAPTR, the results are pretty much on par with those of the latter. Such marginal differences call for a study on the statistical significance of the results. While I still believe the method deserves its merit, I would like the authors to further explore why the results do not improve over TAPTR.
I miss a computational complexity analysis between TAPTRv2 and TAPTR, as well as the contribution to the complexity of the different alternatives to the attention blocks in both. Where is the gain coming from? What's the computational complexity/performance tradeoff?
I would also like to see a real example of the offsets considered by the method for the APU unit. Some visual examples of which points are selected as neighbors in the example in Fig. 4 would be very interesting. Also, how many neighboring points are considered? I understand that this is 4, but I could not find that in the paper (I wonder if I missed that part), and I wonder whether the authors considered studying a broader set of offsets, or a local grid within the neighboring pixels.
Some proof-reading is needed: l. 152: dot-production → dot-product; l. 250: Table 3 → Table 2?
Questions
All my questions are included above. Overall I find the paper interesting and the questions above would help me set a final rating for the paper.
Limitations
N/A
Dear Reviewer wowB, The anonymized video link for the offsets’ visualization has been permitted by the Area Chair. As required by NeurIPS 2024, the link is provided in a separate comment block titled “Anonymized Video Link.” We apologize for the inconvenience.
Limited by the space of the above rebuttal block, we put some other information here. We apologize for the inconvenience.
Some statements that need further clarification from the reviewer.
Some of the reviewer’s statements are not entirely clear, so we have not yet responded to them. If the reviewer wants to discuss these points in the discussion phase, we would appreciate it if the reviewer could clarify the following questions.
- What does the 'different alternatives to the attention blocks' mean in the third-to-last paragraph?
- What does the 'gain' mean in the third-to-last paragraph?
- I mean the attention using the cost volumes vs the proposed method
- I mean what's the gain in the computational complexity
We thank the reviewer for the kind suggestions and questions, and for recognizing our writing, motivation, deep analysis, and the design of the APU block.
Q1. The initialization of point queries.
A1 We appreciate the reviewer’s thorough review and thank the reviewer for pointing out the unclear parts. We provide a more detailed description of the point-query initialization process here. If the reviewer still has questions, we can discuss them further in the discussion stage.
For simplicity and without loss of generality, we assume that all points are tracked starting from the first frame. Thus, the positions of these points in the first frame are given; otherwise, we would not know which targets to track. For the i-th point, to obtain a feature that describes it, we perform bilinear sampling on the feature map of the first frame at the location of that point, resulting in a content feature f_i. Subsequently, if we want to track this point over the next N-1 frames, f_i and the point's given location l_i will be copied N-1 times and distributed to the following N-1 frames to initialize the content and positional parts of the point-queries for detecting this point in those frames (we no longer need to track the target point in the first frame, since it is already known).
Meanwhile, continuing with the i-th point as an example, the initial content and positional parts of the point-queries that are responsible for detecting the i-th point are not static and will be updated. For instance, once we have completed the detection of the i-th point in the first window (frames 0-7), the content and positional parts of the point-query responsible for detecting the i-th point in the 7-th frame will be used to update the initial content and positional parts of the point-queries responsible for detecting the i-th point in subsequent frames.
Since the tracking of each point is relatively independent, the points interact only within the self-attention block in the decoder. Therefore, the number of points does not affect the initialization process.
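The initialization described above can be sketched as follows. This is a minimal NumPy sketch; the function and variable names are our own, not taken from the paper's code, and the single-scale feature map is a simplification:

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinearly sample an (H, W, C) feature map at a continuous (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, feat.shape[1] - 1)
    y1 = min(y0 + 1, feat.shape[0] - 1)
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * feat[y0, x0] + dx * (1 - dy) * feat[y0, x1]
            + (1 - dx) * dy * feat[y1, x0] + dx * dy * feat[y1, x1])

def init_point_queries(first_frame_feat, point_xy, num_frames):
    """Sample the content feature at the given first-frame position, then
    copy the content/positional parts to the following N-1 frames."""
    f_i = bilinear_sample(first_frame_feat, *point_xy)   # content part
    l_i = np.asarray(point_xy, dtype=float)              # positional part
    content = np.tile(f_i, (num_frames - 1, 1))          # (N-1, C)
    position = np.tile(l_i, (num_frames - 1, 1))         # (N-1, 2)
    return content, position
```

After each processed window, the row corresponding to the last detected frame would overwrite the copies for the remaining frames, mirroring the update step described above.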
Q2. Analysis of the improvement.
A2.1 First, we thank the reviewer for this constructive suggestion. We conducted a statistical significance analysis to verify whether our improvements are statistically significant. We applied the Friedman test to the results in Table 2, and the resulting p-value is 0.042. Since the p-value is less than 0.05, our experimental results are statistically significant. Moreover, if we take the RoboTAP dataset, proposed by DeepMind and commonly used in many TAP methods, into consideration as suggested by reviewer-zD8r, the p-value decreases to 0.024, further verifying the significance. We put the comparison of the main metric (AJ) between TAPTRv2 and TAPTR on the four datasets in the table below for your convenience.
| | DAVIS-Strided | DAVIS-First | Kinetics | RoboTAP |
|---|---|---|---|---|
| Avg. vid. len. | 33.3 | 66.6 | 246.6 | 271.9 |
| TAPTR | 66.4 | 63.0 | 49.0 | 55.6 |
| TAPTRv2 | 66.4 | 63.5 | 49.7 | 57.8 |
| Improvement | 0.0 | 0.5 | 0.7 | 2.2 |
A2.2 As for the marginal improvement on DAVIS-Strided, we believe it may be due to the shorter video lengths compared with the other datasets, as shown above (for details, please refer to Sec. 4.1). This makes it relatively simpler than the other datasets, so the issue of point-query content contamination in TAPTR is less pronounced, leading to the marginal improvement of TAPTRv2 over TAPTR. However, as the table shows, the improvement becomes more and more significant as video length increases, verifying TAPTRv2's superiority.
Q3. Computational complexity.
A3 We thank the reviewer for this constructive suggestion. Since this is a common issue raised by all the reviewers, due to space limitations here, please refer to our responses in zGVx-A3, yaYP-A1, and zD8r-A1 for more detail. We apologize for the inconvenience.
Q4. Visualization of the offsets.
A4 We thank the reviewer for this constructive suggestion, which helps provide a more intuitive understanding of TAPTRv2's behavior. The corresponding anonymized video link is provided in a separate comment to the AC as required by NeurIPS'24, and an illustration of the video is provided in Fig. 1 of the attached pdf in our global rebuttal.
Q5. Choice of sampling points.
A5 We conducted a group of comparative experiments to show the effect of different choices of sampling points in APU, shown below, where 'Normal' denotes the choice in our main paper: 4 neighboring points at each scale of the feature maps. The 'Local grid' yields the worst results, which is expected, because a fixed sampling grid prevents the model from flexibly adjusting its receptive field to accommodate varying target motion amplitudes. Increasing the number of sampling points ('Broader Set') also does not lead to an improvement; we suspect our model’s capacity is insufficient to exploit such a large amount of information.
| | AJ | < δ^x_avg | OA |
|---|---|---|---|
| Local grid | 61.9 | 74.9 | 90.4 |
| Broader Set | 63.1 | 76.0 | 90.8 |
| Normal | 63.5 | 75.9 | 91.4 |
Q6. Proof-reading.
We thank the reviewer for the thorough review; we will correct these typos in our camera-ready version.
I thank the authors for their effort in the rebuttal, which helps me consolidate my rating.
While the response is satisfying, I wanted to clarify an aspect of my review that might have been misunderstood, considering it hasn't been replied to (it might be myself who misunderstood the answer).
In A1 my question was if the self-attention was considering the N frames of the same point, or the multiple-points at a time. My understanding is that it is the former, but in such case I wonder how does the method work at inference for multiple target points. Is inference carried for every target point?
Thank you for your time.
Discussion
Due to space limitations, our description of the decoder in the main paper is relatively abbreviated. We apologize for this and will provide a more detailed description in the camera-ready version. As shown in Fig. 3, the decoder incorporates both 'self-attention' and 'temporal-attention' mechanisms. The 'self-attention' considers the multiple points belonging to the same timestamp, while the 'temporal-attention' considers the N frames of the same point. This decomposition of the 'temporal' and 'multiple-point' dimensions not only clarifies feature processing but also helps reduce computational cost and memory usage.
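The decomposition above can be sketched as follows. This is a minimal single-head NumPy sketch; the shapes, function names, and the absence of projections/heads are our own simplifications, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn(x):
    """Plain single-head self-attention over the second-to-last axis,
    batched over all leading axes."""
    d = x.shape[-1]
    w = softmax(x @ x.swapaxes(-1, -2) / np.sqrt(d))
    return w @ x

def decoder_attentions(x):
    """x: (T, P, C) point-query features for T frames and P points.

    Self-attention mixes the P points within each frame; temporal attention
    mixes the T frames of the same point. Factorizing the two axes costs
    O(T*P^2 + P*T^2) rather than O((T*P)^2) for joint attention over all
    frame-point pairs, which is where the compute/memory saving comes from.
    """
    x = attn(x)                 # self-attention: over points, per frame
    x = x.swapaxes(0, 1)        # (P, T, C)
    x = attn(x)                 # temporal attention: over frames, per point
    return x.swapaxes(0, 1)     # back to (T, P, C)
```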
We thank the reviewer again for the thorough review and valuable suggestions. At the same time, we would like to express our gratitude for the reviewer’s recognition of our efforts.
The paper introduces TAPTRv2, an enhancement of the TAPTR framework, which is akin to a DETR-based point tracking approach. It critically examines the reliance on cost-volume in traditional Tracking Any Point (TAP) challenges and questions its necessity, particularly how it affects the query's content feature across different domains. Leveraging insights into the role of cost-volume, the authors propose an Attention-based Position Update (APU) mechanism to supplant cost-volume. This innovation purportedly alleviates the computational burden associated with cost-volume calculations without compromising the state-of-the-art tracking performance.
Strengths
The paper establishes a straightforward yet potent baseline that significantly broadens the horizon for advancements in the field of point tracking.
The research methodology and rationale are persuasive. By decomposing various structural components, the paper effectively isolates the function of cost-volume and offers a viable alternative, leading to enhanced performance.
Weaknesses
The paper successfully reduces the reliance on cost-volume but does not thoroughly elucidate the tangible benefits of this elimination. Specifically, it would be beneficial to understand the impact on key performance metrics such as Frames Per Second (FPS), Memory Access Cost (MAC), Floating Point Operations (FLOPs), and the number of Parameters. Furthermore, the introduction of an additional Multi-Layer Perceptron (MLP) warrants an analysis of its effect on inference speed.
In Equation 5, the paper presents the Disentangler operation, which is intended to manipulate the attention map. However, for maintaining distribution consistency, it might be more appropriate to normalize the attention weights by dividing by the square root of their dimensionality before applying the MLP. The current approach could potentially disrupt the inherent normalization of the attention mechanism.
Questions
Please refer to the weakness section.
Limitations
yes.
Rebuttal - yaYP
We thank the reviewer for recognizing the simplicity and effectiveness of TAPTRv2 and its contribution to broadening the point tracking field's horizon for advancements. We would also like to thank the reviewer for recognizing the rationale of our research methodology.
Q1. Elucidate the tangible benefits of the elimination of cost-volume.
We thank the reviewer for this constructive suggestion. Following mainstream works in the object detection field, we use FPS, GFLOPS, and the number of parameters to compare the speed, computational efficiency, and resource requirements of TAPTR and TAPTRv2, verifying the benefits of removing the cost-volume. As shown in the first table below, without the additional overhead of computing the cost-volume and aggregating it in the decoder layers, TAPTRv2 is faster and has lower resource requirements than TAPTR. More importantly, in downstream tasks it is common to track all pixels in a region rather than just a few scattered points, in which case the number of points to be tracked can reach tens of thousands. Since the cost of computing and aggregating the cost-volume in TAPTR grows sharply with the number of tracking points, the advantage of TAPTRv2 becomes more and more pronounced as this number increases. As shown in the second table below, when the number of tracking points reaches 5000 (only 1.9% of the pixels in a 512x512 image), TAPTRv2's advantage in speed and resource consumption becomes much more significant (about 24% faster with 20% fewer computational resource requirements).
| 800 Points | FPS | GFLOPS | #Params |
|---|---|---|---|
| TAPTR | 65.9 | 147.2 | 39.2M |
| TAPTRv2 | 69.1 | 143.4 | 38.2M |
| 5000 Points | FPS | GFLOPS | #Params |
|---|---|---|---|
| TAPTR | 11.8 | 426.8 | 39.2M |
| TAPTRv2 | 14.6 | 354.2 | 38.2M |
Q2. Analysis of the extra MLP's effect on inference speed.
A2 We thank the reviewer for raising this good question. In fact, the extra MLP is small, so its effect on inference speed is theoretically almost negligible. We conducted comparative experiments to further verify this; the results are shown in the following table.
| | FPS | GFLOPs | #Params |
|---|---|---|---|
| With Extra MLP | 69.1 | 143.4 | 38.2M |
| Without Extra MLP | 69.1 | 143.4 | 38.2M |
Q3. Disentangler operation may disrupt the inherent normalization of the attention mechanism.
A3 We appreciate the reviewer’s thorough review. Indeed, in our implementation we divide the attention weights by the square root of their dimensionality before sending them to the MLP, which preserves the inherent normalization of the attention mechanism. Eq. 5 should include this division by the square root of the dimensionality; its omission in the paper is a typo.
We will correct this typo in our camera-ready version.
Dear reviewer-yaYP,
Thanks for your recognition and constructive suggestions again. Considering that the deadline is approaching, we would like to kindly remind you to check out our responses and raise the questions that you may still have. This will help us to provide more detailed explanations.
Thanks for your time and effort.
The paper presents TAPTRv2, an improved Transformer-based approach for the Tracking Any Point (TAP) task. Building on TAPTR, which utilizes designs from DEtection TRansformer (DETR), TAPTRv2 addresses a critical issue related to the reliance on cost-volume. This reliance was found to contaminate the point query's content feature, impacting visibility prediction and cost-volume computation. TAPTRv2 introduces an attention-based position update (APU) operation and key-aware deformable attention to improve query position prediction. By eliminating the need for cost-volume aggregation, TAPTRv2 not only simplifies the tracking pipeline but also achieves state-of-the-art performance on several challenging datasets, demonstrating its effectiveness and superiority over previous methods.
Strengths
- The paper identifies a practical challenge in tracking any point and proposes a novel and effective solution with the attention-based position update (APU) operation.
- The paper is well-organized, with a clear description of the problem, the proposed solution, and the experimental setup.
- The inclusion of both in-domain and out-of-domain datasets strengthens the validity of the claims.
Weaknesses
- In Sec. 3.2, the authors suggest that the detrimental effect is due to the different distributions needed for content and position updates but do not provide a rigorous theoretical justification or detailed analysis to support this claim.
- Lack of comparative experiments with alternative methods to achieve weight separation, such as other attention mechanisms or different initialization schemes.
- A more detailed analysis of the computational efficiency and resource requirements of TAPTRv2 compared to other methods would be valuable.
Questions
Please refer to the weaknesses section for details. The motivation raised by the authors seems reasonable, but their approach to the problem needs more explanation, and justifications for their design choices are needed.
Limitations
Authors provide a limitations section at the end of their paper.
Rebuttal - zGVx
We thank the reviewer for recognizing the clear description of the issue we address, the strong validation of our claims, and the novelty and effectiveness of our proposed APU. We respond to your questions below, hoping to address your concerns. If our response still does not address a concern, please bring it up in the discussion section, and we will reply as soon as possible.
Q1. Theoretical justification to support the detrimental effect brought by the different distributions.
We thank the reviewer for this constructive suggestion; we will add the following analysis to our camera-ready version.
A1.1 We measured the distributions of the attention weights for content and position updates, as visualized in Fig. 2 of the attached pdf in our global rebuttal. The two groups of attention weights show a significant difference in distribution, indicating that the attention weights required by content and position updates differ. We will add this visualization to our supplementary material and refer to this analysis in the main paper in our camera-ready version.
A1.2 At the same time, as shown in our ablation studies in Table 3, if we do not disentangle the weights, performance suffers a significant drop, indicating the detrimental effect of not disentangling the weights.
Q2. More comparative experiments with alternative methods for weight separation.
A2.1 Inspired by our analysis in Q1, instead of using an MLP to obtain the separated attention weights, we reduce the "temperature" in the softmax calculation to obtain a group of attention weights with a smoother distribution. We conduct experiments with temperature = 0.2 and 0.5; the results are shown in the following table. Disentangling attention weights using different temperatures indeed helps improve performance, and, consistent with the conclusion drawn in Q1, a smoother distribution (temperature = 0.2) leads to better results. However, it still lags significantly behind disentangling the weights through an MLP.
A2.2 As suggested by the reviewer, we further disentangle the attention weights through an attention mechanism. The results are better than the temperature-based method, but still poorer than the MLP proposed in our paper.
| Weight Separation | AJ | < δ^x_avg | OA |
|---|---|---|---|
| Temperature-0.5 | 59.6 | 73.3 | 89.6 |
| Temperature-0.2 | 59.9 | 74.9 | 89.2 |
| Attention | 61.2 | 74.9 | 89.8 |
| MLP | 63.5 | 75.9 | 91.4 |
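The temperature baseline in A2.1 can be sketched as follows. We use the convention implied above, in which the logits are multiplied by the temperature before the softmax, so a temperature below 1 flattens (smooths) the distribution; this convention is our reading of the rebuttal, not taken from the paper's code:

```python
import numpy as np

def tempered_softmax(logits, temperature=1.0):
    """Scale logits by `temperature` before the softmax; with this
    convention, temperature < 1 flattens (smooths) the attention weights."""
    z = temperature * np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats -- higher means a smoother distribution."""
    return float(-(p * np.log(p + 1e-12)).sum())
```

Comparing entropies for the same logits at temperature 0.2 versus 1.0 confirms that the lower temperature yields smoother weights, matching the trend in the table above.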
Q3. Computational efficiency and resource requirements.
A3 We thank the reviewer for this constructive suggestion. Indeed, this comparison should be a necessary part of our work, and we will add it to our camera-ready version. Following mainstream works in the object detection field, we use FPS, GFLOPS, and the number of parameters to compare speed, computational efficiency, and resource requirements. As shown in the first table below, TAPTRv2 is faster and has lower resource requirements than TAPTR. More importantly, in downstream tasks it is common to track all pixels in a region (e.g. tracking text written on the back of a horse) rather than just a few scattered points, in which case the number of points to be tracked can reach tens of thousands. Since the cost of computing and aggregating the cost-volume in TAPTR grows sharply with the number of tracking points, the advantage of TAPTRv2 becomes more and more pronounced as this number increases. As shown in the second table below, when the number of tracking points reaches 5000 (only 1.9% of the pixels in a 512x512 image), TAPTRv2's advantage in speed and resource consumption becomes much more significant (about 24% faster with 20% fewer computational resource requirements).
| 800 Points | FPS | GFLOPS | #Params |
|---|---|---|---|
| TAPTR | 65.9 | 147.2 | 39.2M |
| TAPTRv2 | 69.1 | 143.4 | 38.2M |
| 5000 Points | FPS | GFLOPS | #Params |
|---|---|---|---|
| TAPTR | 11.8 | 426.8 | 39.2M |
| TAPTRv2 | 14.6 | 354.2 | 38.2M |
Thank you for your response, which has addressed some of my concerns. However, I still have some questions. Regarding the comparison of computational cost and the number of parameters, you have only compared with TAPTR. In your original Table 2, you compared several different methods. I believe you should compare at least 3 to 4 models in terms of computational cost and parameters to determine whether the performance advantage is achieved at the expense of increased computational cost.
Dear reviewer-zGVx,
Considering that the deadline is approaching, we would like to kindly remind the reviewer to raise the questions that the reviewer may still have. This will help us to provide more detailed explanations and clarify our contributions effectively.
Thanks for your time and effort.
Thank you for the author's response. This has addressed most of my concerns. I will increase my score to borderline accept.
Discussion
Q4. Whether the performance advantage is achieved at the expense of increased computational cost.
We apologize for misunderstanding your question (Q3) and conflating it with other reviewers’ questions.
A4.1 First, we thank the reviewer for the constructive suggestion. Indeed, comparing with more methods in terms of efficiency and speed can better reflect the advantage of TAPTRv2. Limited by the deadline of the discussion stage, as shown in the tables below, we compare the performance, computational cost, speed, and number of parameters with the two most widely recognized works in terms of performance and speed: CoTracker (from Meta, using its open-sourced implementation with about 2.2K stars) and PIPs (from CMU, using its open-sourced implementation with about 600 stars). The results show that, although these two methods have fewer parameters than ours, their computational cost is about three times larger, resulting in much slower speeds. These substantial computational costs stem from redundant designs in their multi-layer refinement process, as we discussed in Sec. 1. For example, they recompute the correlation map between each tracking point and every image feature at the beginning of each refinement layer (6 layers in total).
A4.2 Affected by such redundant designs, their computational costs increase more rapidly with the number of points to be tracked. As shown in the second table, when the number of tracking points reaches 5000, CoTracker encounters an out-of-memory (OOM) error. Although PIPs does not experience the OOM error, its computational cost is about five times larger than ours.
A4.3 Although these methods require much more computational cost than TAPTRv2, TAPTRv2 still obtains the best performance. In particular, even when CoTracker is tested in its 'Single' mode, which deliberately tracks a single point at a time and brings much more computational cost, its performance is still inferior to ours. We believe these results show that the performance advantage is not achieved at the expense of increased computational cost.
(Since the other methods are implemented in JAX and have not open-sourced their evaluation code (TAP-Net and TAPIR), or are built on the off-the-shelf RAFT with numerous numpy computations on CPU (MFT), we need more time to measure their computational cost. However, we believe the comparison with CoTracker and PIPs solidly supports our conclusion.)
A4.4 At the same time, we hope that our insight from the analysis of the cost-volume, the novel APU derived from that analysis, and the computational efficiency and performance superiority of TAPTRv2 over TAPTR can be considered nontrivial contributions.
| 800 Points | Performance | FPS | GFLOPS | #Params |
|---|---|---|---|---|
| PIPs | 42.0 | 32.8 | 359.0 | 27.3M |
| CoTracker | 60.7 | 38.2 | 308.6 | 23.0M |
| CoTracker-Single | 62.2 | 2.1 | 873.2 | 23.0M |
| TAPTR | 63.0 | 65.9 | 147.2 | 39.2M |
| TAPTRv2 | 63.5 | 69.1 | 143.4 | 38.2M |
| 5000 Points | Performance | FPS | GFLOPS | #Params |
|---|---|---|---|---|
| PIPs | 42.0 | 6.6 | 1777.2 | 27.3M |
| CoTracker | 60.7 | OOM | OOM | 23.0M |
| CoTracker-Single | 62.2 | OOM | OOM | 23.0M |
| TAPTR | 63.0 | 11.8 | 426.8 | 39.2M |
| TAPTRv2 | 63.5 | 14.6 | 354.2 | 38.2M |
Dear AC, considering that reviewer-wowB requires the visualizations of intermediate results from the model, we provide an anonymized video link to show the required visualizations. The anonymized link will be provided in another comment block that is visible only to you. After obtaining your permission we will make it visible to all reviewers.
We are sincerely grateful to the reviewers for dedicating their time and effort to review our work thoroughly. The constructive suggestions and thoughtful concerns raised by the reviewers are very helpful in improving our camera-ready version of the paper. We will respond to each reviewer’s comments in their respective blocks.
To assist the reviewers and the Area Chair, here we summarize the main recognition of our work provided by each reviewer:
- The clarity of our motivation and insight as well as the deep and rational analysis -- from reviewer zGVx & yaYP & wowB;
- The novelty -- from reviewer zGVx & yaYP & wowB & zD8r, and effectiveness -- from reviewer zGVx & yaYP & zD8r;
- Our efforts for the simplicity and efficiency of TAPTRv2, which will help scale up the training process, accelerate advancement, and broaden the horizon of the point tracking field -- from reviewer yaYP & zD8r;
- Extensive experiments on multiple datasets -- from reviewer zGVx & zD8r, and thorough ablation studies -- from reviewer zD8r.
The paper presents TAPTRv2, an enhancement of the TAPTR model. TAPTRv2 improves upon its predecessor by addressing the reliance on cost-volume through the use of attention-based position updates. All reviewers acknowledge the significance of the problem and the practicality of the solution. Concerns were raised about the theoretical justification and computational costs. The authors addressed several of these issues, particularly the concerns about computational costs, in their rebuttal. Final ratings ranged from Borderline Accept to Weak Accept, with an average score of 5.5. Following further discussions with the reviewers after the rebuttal, the Associate Chair (AC) has observed a consensus leaning towards acceptance and recommends that the paper be accepted.