PaperHub
Rating: 6.0/10 (withdrawn; 5 reviewers; min 5, max 8, std 1.1)
Individual ratings: 5, 8, 6, 6, 5
Confidence: 4.4 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 2.8
ICLR 2025

Emerging Tracking from Video Diffusion

OpenReview · PDF
Submitted: 2024-09-13 · Updated: 2025-01-23
TL;DR

Our work achieves state-of-the-art performance in pixel-level object tracking by incorporating video diffusion model representations.

Abstract

Keywords
Pixel-level object tracking · Temporal correspondence · Diffusion models

Reviews and Discussion

Official Review
Rating: 5

The paper introduces a simple method of extracting diffusion features from short clips to perform the object tracking task. It enables tracking similar objects, overcoming the limitations of the per-frame processing proposed in previous works. Performance is clearly improved on segmentation-, pixel-, and pose-level tracking.

Strengths

  • The technical statements are quite clear and compelling.
  • The PCA analysis seems interesting.
  • The performances are quite improved on three input levels.

Weaknesses

  • The progression from per-frame processing to batch processing seems natural, with no conceptual leap driving the methodology forward.
  • The methodology of using generative models for simple understanding tasks was quite interesting when DIFT was first proposed. However, simply making incremental improvements to these models for marginal performance gains may not fully realize their power.
  • Data statistics are missing: what are the numbers of samples/frames/objects? Why is the dataset claimed as a contribution when it is reused from YouTube-VOS?

Questions

  • I could not tell from Fig. 5 whether the feature visualization is a heatmap or something else. If it is a heatmap ranging from 0 to 1, how can one object be distinguished from another?
Comment

Thank you for your helpful suggestions in strengthening the paper! We are glad that the reviewer finds our work provides quite clear and compelling technical statements with interesting PCA analysis, and also achieves quite improved tracking performance. Our work demonstrates, for the first time, that temporal motions from video diffusion models can solve perception challenges and significantly improve perception performance.

We have revised the paper according to your valuable suggestions and address your concerns below. Please let us know if you'd like any further information.

1. Significance and novelty of our work

Thank you for your question. The major novelty of the paper is to explore the temporal features from video diffusion models in perception tasks. Our work solves perception challenges like tracking similar-looking objects in label propagation, achieving the state-of-the-art performance across various scenarios.

Another reviewer finds that tracking similar-looking objects in label propagation is a very fundamental task in this field. Our work is the first to highlight and solve this task, where prior studies fail. Another reviewer also finds that our work provides new insights into understanding diffusion models and latent representations from another perspective, and can benefit a broad range of readers in the tracking and diffusion communities. Our work is important for the community and worth publishing.

2 Advantages of our work over DIFT [1]

Thank you for your question. We clarify and highlight the advantages of our work over DIFT as follows, which is also discussed in Appendix A.1 of the revised paper.

  • We solve a task that tracks similar-looking objects, which DIFT cannot solve. Another reviewer finds that tracking similar-looking objects in label propagation is a very fundamental task in the field. Since DIFT learns only appearance features, it fails to track similar-looking objects.
  • Our work uses temporal motions learned from video diffusion models in tracking, providing new insights into how motion-based tracking emerges. Our experiments and analysis show that temporal layers in video diffusion models enable motion-aware features necessary for tracking similar-looking objects, which are absent in DIFT.
  • Improved tracking accuracy across various scenarios. Our work significantly outperforms DIFT in tracking performance in various videos, such as those with severe object deformation, achieving 1.9% higher accuracy on DAVIS and 5.3% on YouTube-Similar.

[1] Tang et al. Emergent correspondence from image diffusion. NeurIPS 2023.

3.1 Data statistics of evaluation datasets

Thank you for pointing this out. We summarize the dataset statistics in the following table and include more details in Appendix B.3 of the revised paper.

Dataset | Videos | Frames | Objects
DAVIS | 30 | 2023 | 59
Youtube-Similar | 28 | 839 | 69
Kubric-Similar | 14 | 224 | 28

3.2 Discussions on our contribution that introduces a challenging task for label propagation

Sorry for the confusion. We do not claim the YouTube-Similar dataset as a contribution. Our contribution lies in being the first to highlight and solve the task that tracks similar-looking objects in video label propagation, while prior studies fail.

4 Discussions on the feature visualization

Sorry for the confusion. Pixels are encoded as representation vectors by our diffusion models. To track, we use the full high-dimensional vectors. To visualize, we use PCA for dimensionality reduction. Roughly, similar colors in Figure 5 indicate similar pixel representations. For more details, please refer to Appendix B.4 of the revised paper.
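As an aside for readers who want to reproduce this kind of visualization (a minimal sketch, not the paper's script; the feature array name and shape are assumptions), per-pixel features can be projected onto three PCA components and rendered as an RGB image, so that similar representations map to similar colors:

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_visualize(feat):
    """Project per-pixel features of shape (H, W, C) onto 3 PCA components
    and normalize to [0, 1] so they can be shown as an RGB image."""
    h, w, c = feat.shape
    comps = PCA(n_components=3).fit_transform(feat.reshape(-1, c))  # (H*W, 3)
    comps -= comps.min(axis=0)
    comps /= comps.max(axis=0) + 1e-8
    return comps.reshape(h, w, 3)  # similar pixel representations -> similar colors

# Hypothetical usage: `pixel_features` is an (H, W, C) feature map from the diffusion model.
# rgb = pca_visualize(pixel_features)
```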

Comment

Dear Reviewer,

We sincerely appreciate the time and effort you have dedicated to improving this paper. As we approach the end of the discussion period, we hope our responses have sufficiently addressed the concerns you raised. If there are any questions, please let us know.

Thank you very much.

Official Review
Rating: 8

This paper introduces a method, TED, that extracts motion representations from video diffusion for object-level tracking tasks. By leveraging diffusion representations and existing label propagation methods, the proposed method shows very robust identification in similar-appearance matching. The experiments demonstrate its effectiveness in tracking tasks, achieving SOTA results.

Strengths

  • The paper is well written. It is easy to follow the key idea of the paper.
  • The experiments are comprehensive and the results are quite good, especially the demos, i.e., Fig. 5, on identical-appearance object matching.
  • The findings, along with the accompanying understanding, are interesting. How to deal with similar-looking objects in label propagation is very fundamental in the field.
  • It helps in understanding diffusion models and latent representations from another perspective, and can benefit a broad range of readers in the tracking and diffusion communities.

Weaknesses

  • Several improvements could be made. It would be better if more metrics were provided to understand the proposed method, e.g., the time consumption. The details of the proposed framework could be provided in the supplementary material for readers to replicate the method.

Questions

  • What is the time cost of the method? Considering that a diffusion model often takes time to infer, how much time would it take to track, for example, 100 frames?
  • And if the video sequence is very long, would the results decline?
  • As the method is built on standard U-Net diffusion models, would some inference speed-up methods for diffusion models make TED faster?
Comment

Thank you for your positive review and helpful suggestions! We are glad that the reviewer finds our work investigates a very fundamental task, tracking similar-looking objects in label propagation, with comprehensive experiments and significantly improved results. As you mentioned, our work provides a new understanding of diffusion models and latent representations, and we believe it will benefit a broad range of readers in the tracking and diffusion communities.

We have revised the paper according to your valuable suggestions and address your concerns below. Please let us know if you'd like any further information.

1 Implementation details

Thank you for your suggestions. We have revised the paper to include the implementation details of the proposed method in Section 3.4, Appendix B.1 and pseudocode in Algorithm 1.

2 Computation cost analysis

Thank you for pointing this out. We compare the computation cost of our method with DIFT [1] in the following table. As you suggested, we track a 100-frame video, reporting the average time per frame and the maximum GPU memory. Following DIFT, we introduce two TED versions for real applications: efficient and best. The best version, used as the default model in the paper, includes optional techniques for higher accuracy, such as obtaining an averaged frame representation using a batch of noise as in DIFT, while the efficient version excludes these techniques.

Our TED (efficient) and TED (best) take 121 and 224 seconds to track a 100-frame video, respectively. Our TED (efficient) outperforms DIFT (best) by 1.5% in accuracy with similar speed and slightly higher memory use, while TED (best) achieves higher accuracy at greater computation cost. In real applications, users can choose the version based on their requirements on accuracy and efficiency. For more details and discussions, please refer to Appendix C.1 in the revised paper.

Our work demonstrates, for the first time, that motions learned from video diffusion models can solve perception challenges and achieve state-of-the-art results. As you mentioned, our work offers new understandings, benefiting both tracking and diffusion groups. We believe our method can be further accelerated with future research on diffusion model acceleration as well as advances in computing and resources.

Model | Version | Optional Techniques | Accuracy | Time (s) | FPS | GPU Memory (GB)
DIFT | Efficient | No | 74.7 | 0.73 | 1.37 | 5.53
DIFT | Best | Yes | 75.7 | 1.36 | 0.74 | 9.25
TED (ours) | Efficient | No | 77.2 | 1.21 | 0.82 | 11.65
TED (ours) | Best | Yes | 77.6 | 2.24 | 0.47 | 15.20

[1] Tang et al. Emergent correspondence from image diffusion. NeurIPS 2023.
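For readers who want to reproduce these measurements, the per-frame time, FPS, and peak GPU memory can be collected with a loop like the following (a minimal PyTorch sketch; `track_one_frame` is a hypothetical stand-in for feature extraction plus label propagation on one frame):

```python
import time
import torch

def benchmark_tracking(frames, track_one_frame):
    """Return (seconds per frame, FPS, peak GPU memory in GB) for one tracking run."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.time()
    for frame in frames:            # e.g., a 100-frame video
        track_one_frame(frame)      # feature extraction + label propagation
    torch.cuda.synchronize()
    sec_per_frame = (time.time() - start) / len(frames)
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return sec_per_frame, 1.0 / sec_per_frame, peak_gb
```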

3 Discussions on the long video setting

Thank you for pointing this out. Yes, our tracking performance declines if the video is very long. In the label propagation task, longer videos are harder to track due to error accumulation. Since our work follows the label propagation protocol of prior work, which predicts the label of each frame sequentially, it also faces performance declines on long videos. We believe improving tracking on very long videos is an important topic and leave it for future work.

4 Discussions on model speed-ups

Thank you for your suggestions. Yes, inference speed-up methods could accelerate our work, such as FlashAttention [2], which optimizes GPU memory usage in attention blocks. Our work has the potential for further speed-ups with future research on diffusion acceleration.

[2] Dao et al. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, NeurIPS 2022.
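As an illustration of the kind of drop-in speed-up meant here (a sketch under the assumption that the temporal attention layers use standard scaled dot-product attention; this is not the paper's code), PyTorch's `scaled_dot_product_attention` can dispatch to fused FlashAttention-style kernels on supported GPUs:

```python
import torch
import torch.nn.functional as F

def temporal_attention(q, k, v):
    """q, k, v: (batch, heads, tokens, head_dim). Uses PyTorch's fused attention,
    which can select a FlashAttention-style kernel and reduce memory traffic."""
    return F.scaled_dot_product_attention(q, k, v)

# Hypothetical usage with half-precision tensors on GPU, where fused kernels apply:
# q = k = v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
# out = temporal_attention(q, k, v)
```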

Comment

Thanks for the feedback and my concerns have been addressed. Thus, I will keep my original score.

Comment

Thank you for reviewing our rebuttal. We deeply value your insightful feedback, continued support, and appreciation for this paper!

Official Review
Rating: 6

The authors leverage latent representations from video diffusion models to capture the temporal information for pixel-level tracking. Without additional training, the proposed method improves tracking performance in various video scenarios, even enabling tracking of similar-looking objects where previous methods struggle. The authors also introduce a new benchmark, Youtube-Similar, to evaluate the complex scenario of tracking multiple similar-looking objects in real-world videos. The proposed method, TED, is evaluated on several benchmarks and achieves better tracking results.

Strengths

  1. TED extracts the temporal information of tracking targets from video diffusion models to assist in tracking.
  2. TED introduces a new benchmark, Youtube-Similar, to evaluate the complex scenario of tracking multiple similar-looking objects in real-world videos.
  3. TED achieves good tracking results, especially on the Youtube-Similar benchmark, demonstrating better tracking ability in the complex scenario of multiple similar-looking objects.

Weaknesses

  1. The tracking results of models trained only on ImageNet should be provided for a fair comparison.
  2. The tracking speed and computation cost should be provided.
  3. The authors should discuss the benefits of the temporal information extracted from diffusion compared with other models, such as transformers.

Questions

Please see the weaknesses.

Comment

2 Computation cost analysis

Thank you for pointing this out. We compare the computation cost of our method with DIFT [1] in the following table. We track a 100-frame video as suggested by Reviewer k917, reporting the average time per frame and the maximum GPU memory. Following DIFT, we introduce two TED versions for real applications: efficient and best. The best version, used as the default model in the paper, includes optional techniques for higher accuracy, such as obtaining an averaged frame representation using a batch of noise as in DIFT, while the efficient version excludes these techniques.

Our TED (efficient) outperforms DIFT (best) by 1.5% in accuracy with similar speed and slightly higher memory use, while TED (best) achieves higher accuracy at greater computation cost. In real applications, users can choose the version based on their requirements on accuracy and efficiency. For more details and discussions, please refer to Appendix C.1 in the revised paper.

Our work demonstrates, for the first time, that motions learned from video diffusion models can solve perception challenges and achieve state-of-the-art results. We believe our method can be further accelerated with future research on diffusion model acceleration as well as advances in computing and resources.

Model | Version | Optional Techniques | Accuracy | Time (s) | FPS | GPU Memory (GB)
DIFT | Efficient | No | 74.7 | 0.73 | 1.37 | 5.53
DIFT | Best | Yes | 75.7 | 1.36 | 0.74 | 9.25
TED (ours) | Efficient | No | 77.2 | 1.21 | 0.82 | 11.65
TED (ours) | Best | Yes | 77.6 | 2.24 | 0.47 | 15.20
Comment

Thank you for your positive review and helpful suggestions! We are glad that the reviewer finds our work to have excellent soundness, presentation, and contribution. As you mentioned, our work introduces temporal information from video diffusion models to tracking and proposes a new challenging benchmark for tracking similar-looking objects. Our method achieves good results, especially in the complex scenario of multiple similar-looking objects. Our work demonstrates, for the first time, that temporal motions from video diffusion models can solve perception challenges and significantly improve perception performance.

We have revised the paper according to your valuable feedback and address your concerns below. Please let us know if you'd like any further information.

1 Comparison using the same training dataset

Thank you for your suggestions. Since our video diffusion models cannot be trained only on image datasets like ImageNet, we instead train the image diffusion model from DIFT [1] on the same training dataset as our video diffusion model for comparison. The results in the following table suggest that the improvement in our tracking performance comes from the temporal motions in video diffusion models, which DIFT lacks, rather than from the training data.

The table shows that without temporal modeling, training on additional video data fails to track similar-looking objects, indicated by a low J&Fm of 43.8% on Kubric-Similar. Web-Vid[2] has lower individual image quality[3], such as motion blur and watermarks. Fine-tuning DIFT on Web-Vid even reduces performance. In contrast, our TED achieves significant improvements using video diffusion models and effectively distinguishes similar-looking objects, demonstrating the importance of learning temporal motions from video diffusion models for tracking.

Version | Model | Train Dataset | Kubric-Similar J&Fm / Jm / Fm | Youtube-Similar J&Fm / Jm / Fm | DAVIS J&Fm / Jm / Fm
Original | DIFT | ImageNet | 52.6 / 50.3 / 54.8 | 60.7 / 59.8 / 61.7 | 75.7 / 72.7 / 78.6
Finetune on Web-Vid | DIFT | ImageNet, Web-Vid | 43.8 / 40.1 / 47.4 | 58.9 / 58.3 / 59.4 | 72.9 / 70.1 / 75.1
Ours | TED | ImageNet, Web-Vid | 90.4 / 86.9 / 94.0 | 66.0 / 65.1 / 67.0 | 77.6 / 74.4 / 80.8

[1] Tang et al. Emergent correspondence from image diffusion. NeurIPS 2023.

[2] Bain et al. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval. ICCV 2021.

[3] Guo et al. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning. ICLR 2024.

Comment

3 Advantages of our work in learning temporal features

Thank you for your suggestions. We have cited and compared our method with prior methods [4-13] that learn temporal features using models like transformers. We clarify and highlight the advantages of our work as follows. We have also included the discussions in Appendix A.2 of the revised paper.

  • Better representations obtained by solving a harder generative task. Previous methods are trained on easier tasks that always have shortcuts. For example, mismatched patches in [5,10] or objects in [8,11] with similar appearances can also yield low training loss. In contrast, our video diffusion models are trained to fully reconstruct every pixel from noisy inputs, enabling better representation learning.

  • Advanced temporal attention vs. simple pairwise correlation. During training, prior methods learn temporal features through simple correlations between spatial features across frames [4,5,6,7,9,12,13], which fail to distinguish similar-looking objects. In contrast, our video diffusion model uses temporal attention layers to integrate multiple frames, enabling advanced reasoning in complex scenarios, such as the deer with changing positions in the paper's Figure 1.

  • Significantly improved tracking accuracy. Quantitative results and visualization show that our method significantly improves the tracking performance compared to prior studies[4-13], by more than 3.5% on DAVIS and 6.4% on YouTube-Similar.

[4] Vondrick et al. Tracking emerges by colorizing videos. ECCV 2018.

[5] Wang et al. Learning correspondence from the cycle-consistency of time. CVPR 2019.

[6] Li et al. Joint-task self-supervised learning for temporal correspondence. NeurIPS 2019.

[7] Lai et al. Self-supervised learning for video correspondence flow. BMVC 2019.

[8] Gordon et al. Watching the world go by: Representation learning from unlabeled videos. Arxiv 2020.

[9] Lai et al. Mast: A memory-augmented self-supervised tracker. CVPR 2020.

[10] Jabri et al. Space-time correspondence as a contrastive random walk. NeurIPS 2020.

[11] Xu et al. Rethinking self-supervised correspondence learning: A video frame-level similarity perspective. ICCV 2021.

[12] Qian et al. Semantics meets temporal correspondence: Self-supervised object-centric learning in videos. ICCV 2023.

[13] Li et al. Spatial-then-temporal self-supervised learning for video correspondence. ICCV 2023.

Official Review
Rating: 6

The paper investigates using features from a generative video diffusion model for the task of label propagation in videos. The findings appear to point to using a combination of the image diffusion model and video diffusion model to yield good results.

Strengths

(1) The reported results show improvement over prior works

(2) The motivating toy example is a great way to show the lack of temporal information that the paper is investigating.

Weaknesses

Presentation

The paper is lacking in key details, specifically:

  1. What is the method? The closest the paper comes to actually describing how the label is propagated is probably L312, "recurrently predict labels for subsequent frames". Appendix A.1 is not much more helpful in explaining how tracking is performed; however, it suggests that the actual method is DIFT with just appended video features.
  2. What are the evaluation datasets? The paper mentions DAVIS, Kubric-Similar and Youtube-Similar; however, these datasets are not detailed, especially Kubric-Similar and Youtube-Similar, which seem to be new. Please include details of these datasets and how they were made/curated, including examples. For DAVIS, please indicate clearly which year and which "task" flavour (semi-supervised, probably) is being used.

Contributions

  1. Some of the provided answers do not seem to be well supported. E.g., on L453 (Fig. 7 caption), it is stated that, given the lack of appearance at a high noise level, R_v must capture temporal cues useful for tracking. However, temporal cues must be derived from appearance, which is obfuscated with noise. Similarly, it does not explain why it performs worse when there is no noise; temporal cues are still available. It is also not clear how these experiments were performed given the randomness involved; how many trials does each point in the graph represent? What are the variance bounds?
  2. It is also not clear what the results in Fig. 9 are. Does each point correspond to a video or an average over a dataset? Why does this positive correlation imply that training new video diffusion models will improve tracking? Does this hold for both SVD and I2V models?
  3. The supervised methods reported in Table 2 are rather old and might leave a false impression of the performance gap between supervised and unsupervised methods. Looking at the DAVIS 2020 challenge [1] results (DAVIS 2017 is a subset), it is clear that supervised performance is in the 80+ region.
  4. Given that the contribution of the work is to append video diffusion features to the image features of DIFT, there is arguably a lack of significant learnings or exploration presented, as only 2 video diffusion models are tested. For example, why is I2V better than SVD? Would time-tuned self-supervised features work instead [2]? Why stop at only 2 models? Are there better combinations of image features, as shown in [3]? Why are video features so much worse in isolation?
  5. The performance constraints/costs are not mentioned, but they are usually key considerations in tracking settings, as systems need to be real-time due to safety-critical (e.g. self-driving) or UX (e.g. film editing) concerns. What are the performance characteristics, e.g. FPS and GPU memory, to run this method?

[1] DAVIS 2020 challenge. https://davischallenge.org/challenge2020/leaderboards.html

[2] Salehi et al. Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations, ICCV 2023.

[3] Zhang et al. A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence, NeurIPS 2023.

Questions

Please see the weaknesses section for all questions. It is critical to consider and address the issues with the presentation and to better explain the experimental details.

Comment

Thank you for your helpful suggestions in strengthening the paper! We are glad the reviewer finds that our paper improves tracking performance and provides a great motivating toy example showing the lack of temporal information in prior studies. Our work demonstrates, for the first time, that temporal motions from video diffusion models can solve perception challenges and significantly improve perception performance.

We have revised the paper according to your valuable suggestions and address your concerns below. Please let us know if you'd like any further information.

1.1 Implementation details on label propagation

Thank you for pointing this out. We briefly introduce the tracking process as follows. For more details, please refer to Section 3.4, Appendix B.1 and the pseudocode in Algorithm 1 of the revised paper.

Following prior studies, video label propagation consists of two stages: representation extraction and label prediction.

  • We first obtain the frame representation $R_f$ using video and image diffusion models.
  • For each pixel in the current frame, the label is predicted by aggregating the labels of its similar pixels from previous frames, where pixel similarity is computed using the $R_f$ representation (a minimal sketch of this step follows below).
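A minimal sketch of this label-prediction step (illustrative only, not the paper's implementation; top-k aggregation over a single set of reference pixels is an assumption made for brevity):

```python
import torch
import torch.nn.functional as F

def propagate_labels(ref_feat, ref_labels, cur_feat, k=10):
    """ref_feat: (N_ref, C) reference-pixel features; ref_labels: (N_ref, L) one-hot labels;
    cur_feat: (N_cur, C) current-frame pixel features. Returns (N_cur, L) soft labels."""
    ref = F.normalize(ref_feat, dim=-1)
    cur = F.normalize(cur_feat, dim=-1)
    sim = cur @ ref.t()                          # cosine similarity, (N_cur, N_ref)
    topk_sim, topk_idx = sim.topk(k, dim=-1)     # k most similar reference pixels
    weights = topk_sim.softmax(dim=-1)           # similarities -> aggregation weights
    neighbor_labels = ref_labels[topk_idx]       # (N_cur, k, L)
    return (weights.unsqueeze(-1) * neighbor_labels).sum(dim=1)
```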

1.2 Advantages of our work over DIFT[1]

Thank you for your question. We clarify and highlight the advantages of our work over DIFT[1] as follows.

  • We solve a task that tracks similar-looking objects, which DIFT cannot solve. Another reviewer finds that tracking similar-looking objects in label propagation is a very fundamental task in the field. Since DIFT learns only appearance features, it fails to track similar-looking objects.
  • Our work uses temporal motions learned from video diffusion models in tracking, providing new insights into how motion-based tracking emerges. Our experiments and analysis show that temporal layers in video diffusion models enable motion-aware features necessary for tracking similar-looking objects, which are absent in DIFT.
  • Improved tracking accuracy across various scenarios. Our work significantly outperforms DIFT in tracking performance in various videos, such as those with severe object deformation, achieving 1.9% higher accuracy on DAVIS and 5.3% on YouTube-Similar.

[1] Tang et al. Emergent correspondence from image diffusion. NeurIPS 2023.

2 Details of evaluation datasets

Thank you for pointing this out. As you mentioned, we use DAVIS, YouTube-Similar, and Kubric-Similar for evaluation, with the latter two being newly proposed. For DAVIS, we follow prior studies by evaluating on the 2017 subset, and the task is semi-supervised. For more details and dataset examples, please refer to Section 4.1 and Appendix B.3 of the revised paper.

3.1 Discussions on video diffusion representations at different noise levels

Thank you for bringing this up. A major reason the finding in our paper is surprising is that our features are obtained from diffusion models that learn to reconstruct clean videos from noisy inputs. Diffusion models solve different tasks at different noise levels during training [2]. When the video input is corrupted at high noise levels, video diffusion models are trained to solve the hard task of learning coarse-grained signals in the video, such as motion (e.g., the change of object positions across frames). Therefore, the representation encodes rich motion information that enables tracking similar-looking objects. When the video input is less noisy, the diffusion model is trained to denoise appearance details, where motion features may be less prioritized, leading to a performance decrease at low noise levels. We have revised the paper at lines 452 and 483.
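Concretely, under the standard forward diffusion process (a textbook DDPM formulation that we assume matches the model's noise schedule), the input at noise level $\tau$ is

$x_\tau = \sqrt{\bar{\alpha}_\tau}\, x_0 + \sqrt{1-\bar{\alpha}_\tau}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),$

so at high $\tau$ (small $\bar{\alpha}_\tau$) most appearance information in $x_0$ is destroyed and the denoiser must rely on coarse cues such as motion, whereas at low $\tau$ appearance dominates.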

[2] Choi et al. Perception Prioritized Training of Diffusion Models. CVPR 2022.

3.2 Stable tracking results

Thank you for your question. Following DIFT [1], we use a batch of random noise to obtain an averaged representation for each video, reducing randomness in diffusion models. We conducted the tracking experiments five times and observed a small variance, e.g., 0.1% on the DAVIS dataset. We have revised the paper to include these implementation details and discussions in Appendix B.1.
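A minimal sketch of this averaging step (hypothetical function and argument names; `extract_features` stands in for a forward pass of the diffusion U-Net at timestep $\tau$ that returns intermediate features):

```python
import torch

def averaged_features(latent, alpha_bar_tau, tau, extract_features, n_noise=8):
    """Average diffusion features over several random noise draws to reduce variance.
    latent: clean (video) latent; alpha_bar_tau: cumulative noise-schedule coefficient."""
    feats = []
    for _ in range(n_noise):
        noise = torch.randn_like(latent)
        noisy = alpha_bar_tau ** 0.5 * latent + (1 - alpha_bar_tau) ** 0.5 * noise
        feats.append(extract_features(noisy, tau))
    return torch.stack(feats).mean(dim=0)        # lower-variance frame representation
```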

4 Discussions on video synthesis and tracking results

Thank you for your suggestions. In Figure 9, each point represents a video, showing the correlation between video synthesis quality and tracking results for the I2VGen-XL model. We conducted the experiment you suggested and found that this correlation does not hold across architectures, such as SVD. Therefore, we have removed the results of the original Figure 9 in the revised paper.

Comment

7 Computation cost analysis

Thank you for pointing this out. We compare the computation cost of our method with DIFT [1] in the following table. We track a 100-frame video as suggested by Reviewer k917, reporting the average time per frame and the maximum GPU memory. Following DIFT, we introduce two TED versions for real applications: efficient and best. The best version, used as the default model in the paper, includes optional techniques for higher accuracy, such as obtaining an averaged frame representation using a batch of noise as in DIFT, while the efficient version excludes these techniques.

Our TED (efficient) outperforms DIFT (best) by 1.5% in accuracy with similar speed and slightly higher memory use, while TED (best) achieves higher accuracy at greater computation cost. In real applications, users can choose the version based on their requirements on accuracy and efficiency. For more details and discussions, please refer to Appendix C.1 in the revised paper.

Our work demonstrates, for the first time, that motions learned from video diffusion models can solve perception challenges and achieve state-of-the-art results. We believe our method can be further accelerated with future research on diffusion model acceleration as well as advances in computing and resources. Our work is important to the community and worth publishing.

Model | Version | Optional Techniques | Accuracy | Time (s) | FPS | GPU Memory (GB)
DIFT | Efficient | No | 74.7 | 0.73 | 1.37 | 5.53
DIFT | Best | Yes | 75.7 | 1.36 | 0.74 | 9.25
TED (ours) | Efficient | No | 77.2 | 1.21 | 0.82 | 11.65
TED (ours) | Best | Yes | 77.6 | 2.24 | 0.47 | 15.20
Comment

I thank the Authors for their rebuttal. My questions have been suitably answered, and I support the changes made to the paper. In light of this, I have increased my rating.

Comment

5 Inclusion of new supervised baselines

Thank you for your suggestions. We have reviewed the DAVIS 2020 challenge you suggested and added the new supervised baselines to Table 2 in the revised paper. Our work achieves state-of-the-art performance in the self-supervised field, significantly bridging the gap with supervised methods.

6.1 Difference between I2VGen-XL and SVD model

Thank you for your question. We empirically find I2VGen-XL outperforms SVD in tracking tasks. Their architecture difference is that I2VGen-XL is a text-to-video model while SVD generates videos conditioned on a single image.

6.2 Time-Tuning features [3] do not work in tracking similar-looking objects

Thank you for suggesting this paper. We conducted an experiment that uses Time-Tuning features [3] for the video label propagation task. We find that Time-Tuning features [3] fail to distinguish similar-looking objects, as shown in Figure 10 in Appendix C.3 of the revised paper. This failure arises because Time-Tuning is trained to learn semantic features for the semantic segmentation task, as shown in Figure 3 of the original paper [3], and these features lack the object motions needed for tracking similar-looking objects. We have cited and discussed Time-Tuning [3] in Appendix C.3 of the revised paper.

[3] Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations, ICCV 2023

6.3 Choice of model number

Thank you for your question. Our method is general and compatible with all video diffusion architectures. We choose I2VGen-XL and SVD for tracking because they are two popular video diffusion models and cover text-to-video and image-to-video diffusion types.

6.4 Choice of image features to complement our method

Thank you for your suggestions. We follow [4] and conduct an experiment that introduces additional DINOv2 features to our TED method. We show the tracking results in the following table. The results show that additional DINOv2 features do not further improve the tracking performance for our TED. We have cited and discussed the work [4] in Appendix C.4 of the revised paper.

Model | Features | Kubric-Similar J&Fm / Jm / Fm | Youtube-Similar J&Fm / Jm / Fm | DAVIS J&Fm / Jm / Fm
TED (with DINOv2 features) | ADM, I2VGen-XL, DINOv2 | 90.0 / 86.6 / 93.5 | 65.9 / 65.0 / 66.7 | 77.3 / 74.2 / 80.5
TED (ours) | ADM, I2VGen-XL | 90.4 / 86.9 / 94.0 | 66.0 / 65.1 / 67.0 | 77.6 / 74.4 / 80.8

[4] Zhang et al. A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence, NeurIPS 23

6.5 Video diffusion model in isolation outperforms DIFT in tracking similar-looking objects

Thank you for your question. Video diffusion features alone outperform the state-of-the-art DIFT in tracking similar-looking objects, such as on Youtube-Similar and Kubric-Similar. On the DAVIS dataset, visual appearance is already a strong cue for tracking in many cases, so the appearance features from DIFT achieve high accuracy. Our TED combines appearance features from the image diffusion model and motions from the video diffusion model, enjoying the best of both models and achieving state-of-the-art performance.

Comment

Thank you for reviewing our rebuttal and for increasing your score! We truly appreciate your thoughtful feedback and your kind support of the changes we made to the paper.

Official Review
Rating: 5

The paper uses video diffusion models to extract motion representations, in addition to strong appearance representations, for visual tracking.

Strengths

The paper presents an interesting idea in leveraging visual diffusion for strong appearance correspondence and how it can be used for visual tracking with multiple objects.

Weaknesses

I am not exactly sure how the proposed method differs from the DIFT method; they appear to be the same except for perhaps different training data. I understand that you are leveraging the latent representations from visual diffusion; however, nowhere in the paper do you explicitly give the visual tracking algorithm as a flow chart or pseudocode, which would be helpful. The ablation study refers to t, the timestep in diffusion; however, you also use t for time, which is confusing to the reader.

Questions

How does the proposed method differ from DIFT (Tang 2023)? This is not clear. What is the computational cost of the video diffusion model if we wanted to use this method for real-time tracking, i.e., can tracking be done in real time with this step included? How and in what manner is windowing applied for the video diffusion, and how does this fit into a real-time visual tracking algorithm? Table 2: it would be good if the datasets were the same for a fair comparison. The results for your approach and DIFT appear similar, and sometimes DIFT is better; please comment.
It appears that the method is dependent on the dataset used. I am curious, if the same dataset is used, how does DIFT differ from your algorithm in terms of results?

Comment

3 Implementation details on tracking method

Thank you for pointing this out. We briefly introduce the tracking process as follows. For more details, please refer to Section 3.4, Appendix B.1 and the pseudocode in Algorithm 1 of the revised paper.

Following prior studies, video label propagation consists of two stages: representation extraction and label prediction.

  • We first obtain the frame representation $R_f$ using video and image diffusion models.
  • For each pixel in the current frame, the label is predicted by aggregating the labels of its similar pixels from previous frames, where pixel similarity is computed using the $R_f$ representation.

4 Clearer notation for diffusion timestep

Sorry for the confusion. In the revised paper, we have replaced the diffusion timestep $t$ with $\tau$ to enhance clarity.

5 Computation cost analysis

Thank you for pointing this out. We compare the computation cost of our method with DIFT [1] in the following table. We track a 100-frame video as suggested by Reviewer k917, reporting the average time per frame and the maximum GPU memory. Following DIFT, we introduce two TED versions for real applications: efficient and best. The best version, used as the default model in the paper, includes optional techniques for higher accuracy, such as obtaining an averaged frame representation using a batch of noise as in DIFT, while the efficient version excludes these techniques.

Our TED (efficient) outperforms DIFT (best) by 1.5% in accuracy with similar speed and slightly higher memory use, while TED (best) achieves higher accuracy at greater computation cost. In real applications, users can choose the version based on their requirements on accuracy and efficiency. For more details and discussions, please refer to Appendix C.1 in the revised paper.

Our work demonstrates, for the first time, that motions learned from video diffusion models can solve perception challenges and achieve state-of-the-art results. We believe our method can be further accelerated with future research on diffusion model acceleration as well as advances in computing and resources. Our work is important to the community and worth publishing.

Model | Version | Optional Techniques | Accuracy | Time (s) | FPS | GPU Memory (GB)
DIFT | Efficient | No | 74.7 | 0.73 | 1.37 | 5.53
DIFT | Best | Yes | 75.7 | 1.36 | 0.74 | 9.25
TED (ours) | Efficient | No | 77.2 | 1.21 | 0.82 | 11.65
TED (ours) | Best | Yes | 77.6 | 2.24 | 0.47 | 15.20

6 Discussions on the windowing process

Thank you for pointing this out. Video diffusion models take multiple frames, up to a maximum length $d$, as input. For videos shorter than $d$, we use the full video as the model input. For longer videos, we split them into multiple sequences with overlapping frames. For more details, please refer to Appendix B.1 in the revised paper.

Our method applies to the real-world streaming case, where frames arrive sequentially. When fewer than $d$ frames exist, new frames are merged with the existing ones to form the model input sequence for frame representation. Once $d$ frames are reached, a sliding window of the $d-1$ most recent frames plus the new frame forms the input to the video diffusion models.
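A minimal sketch of this windowing scheme (illustrative only; the one-frame overlap between consecutive windows is an assumption made for the example):

```python
def make_windows(num_frames, d):
    """Split a video of `num_frames` into overlapping windows of at most `d` frames.
    In the streaming case, the final window is the d-1 most recent frames plus the new one."""
    if num_frames <= d:
        return [list(range(num_frames))]
    windows, start = [], 0
    while start + d < num_frames:
        windows.append(list(range(start, start + d)))
        start += d - 1                                       # overlap one frame between windows
    windows.append(list(range(num_frames - d, num_frames)))  # final window covers the tail
    return windows

# Example: make_windows(10, 4) -> [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```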

7 Comment on performance difference between DIFT and ours.

Sorry for the confusion. Our full version uses $R_f$ and outperforms DIFT on all datasets; see Table 2 of the paper.

Comment

Thank you for your helpful suggestions in strengthening the paper! We are glad that the reviewer finds our work proposes an interesting idea that introduces temporal motions from video diffusion models to tracking. Our work demonstrates, for the first time, that temporal motions from video diffusion models can solve perception challenges and significantly improve perception performance.

We have revised the paper according to your valuable suggestions and address your concerns below. Please let us know if you'd like any further information.

1 Difference between our method and DIFT[1]

Thank you for your question. We clarify and highlight the difference between our method and DIFT as follows.

  • We solve a task that tracks similar-looking objects, which DIFT cannot solve. Another reviewer finds that tracking similar-looking objects in label propagation is a very fundamental task in the field. Since DIFT learns only appearance features, it fails to track similar-looking objects.
  • Our work uses temporal motions learned from video diffusion models in tracking, providing new insights into how motion-based tracking emerges. Our experiments and analysis show that temporal layers in video diffusion models enable motion-aware features necessary for tracking similar-looking objects, which are absent in DIFT.
  • Improved tracking accuracy across various scenarios. Our work significantly outperforms DIFT in tracking performance in various videos, such as those with severe object deformation, achieving 1.9% higher accuracy on DAVIS and 5.3% on YouTube-Similar.

[1] Tang et al. Emergent correspondence from image diffusion. NeurIPS 2023.

2 Tracking accuracy improvements in our work DO NOT come from the training data

Thank you for your question. We train the image diffusion model from DIFT on the same training dataset as our video diffusion model. The results in the following table suggest that the improvement in our tracking performance comes from the temporal motions in video diffusion models, which DIFT lacks, NOT from the training data.

The table shows that without temporal modeling, training on additional video data fails to track similar-looking objects, indicated by a low J&Fm of 43.8% on Kubric-Similar. Web-Vid[2] has lower individual image quality[3], such as motion blur and watermarks. Fine-tuning DIFT on Web-Vid even reduces performance. In contrast, our TED achieves significant improvements using video diffusion models and effectively distinguishes similar-looking objects, demonstrating the importance of learning temporal motions from video diffusion models for tracking.

Version | Model | Train Dataset | Kubric-Similar J&Fm / Jm / Fm | Youtube-Similar J&Fm / Jm / Fm | DAVIS J&Fm / Jm / Fm
Original | DIFT | ImageNet | 52.6 / 50.3 / 54.8 | 60.7 / 59.8 / 61.7 | 75.7 / 72.7 / 78.6
Finetune on Web-Vid | DIFT | ImageNet, Web-Vid | 43.8 / 40.1 / 47.4 | 58.9 / 58.3 / 59.4 | 72.9 / 70.1 / 75.1
Ours | TED | ImageNet, Web-Vid | 90.4 / 86.9 / 94.0 | 66.0 / 65.1 / 67.0 | 77.6 / 74.4 / 80.8

[1] Tang et al. Emergent correspondence from image diffusion. NeurIPS 2023.

[2] Bain et al. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval. ICCV 2021.

[3] Guo et al. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning. ICLR 2024.

Comment

Dear Reviewer,

We sincerely appreciate the time and effort you have dedicated to improving this paper. As we approach the end of the discussion period, we hope our responses have sufficiently addressed the concerns you raised. If there are any questions, please let us know.

Thank you very much.

Comment

We sincerely thank all reviewers for their valuable feedback. We address the reviewers' comments below and have revised the paper accordingly (with edits highlighted in blue).

The major novelty of the paper is to explore the temporal features from video diffusion models in perception tasks. We are glad that Reviewer k917 finds that how to deal with the similar appearance of objects in label propagation is very fundamental in the field. Our work is the first to highlight and solve this task, where prior studies fail. Reviewer k917 also finds that our work helps in understanding diffusion models and latent representations from another perspective, and can benefit a broad range of readers in the tracking and diffusion communities. We are also grateful for the reviewers' positive comments and recognition of the following strengths:

  • Improved tracking accuracy over prior studies (Reviewer 5chB, 1z6R, k917, EKWS).
  • Our method enables tracking similar-looking objects in label propagation task (Reviewer 1z6R, k917, EKWS).
  • Our work analyzes the role of temporal information in tracking, such as the motivating toy example (Reviewer 5chB) and object matching analysis in Figure 5 (Reviewer k917, EKWS).
  • Comprehensive experiments (Reviewer k917).
  • Clear and compelling technical statements (Reviewer EKWS).
  • Excellent contribution and soundness (Reviewer 1z6R, k917).
  • Excellent presentation (Reviewer 1z6R) and well-written paper (Reviewer k917).

We believe our work can benefit a broad range of readers in the tracking and diffusion communities, as recognized by Reviewer k917, and is worth publishing. We also thank the ACs for their time and effort in evaluating our submission.

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.