Cross-video Identity Correlating for Person Re-identification Pre-training
Abstract
Reviews and Discussion
This work explores learning the ReID model from a large number of unlabeled videos. The identity associations within the video and across the videos are mined. Extensive experiments under different backbones and downstream ReID settings are conducted.
Strengths
- Learning from unlabeled videos makes sense for improving ReID model.
- Extensive experiments are conducted.
- The performance is impressive.
Weaknesses
-
The technical contribution is not significant. Associating different images is the core of the work. However, the main technique is to associate based on feature similarity, following a spirit of progressive learning. Such a pipeline is not novel and is widely applied in many previous works. For example, some works mine the associations between images in different tracklets based on feature similarity [R1, R2, R3, R4]. Although some of these works learn the model from trimmed videos, they share a similar spirit of progressively associating different tracklets/videos based on feature similarity. Besides, such a pipeline could be sensitive to the hyperparameters, but the paper does not report an ablation study on them.
-
Supervised learning could be employed with the identity association. Why employ self-supervised learning in Sec. 3.3?
-
The proposed method achieves similar performance with SOLIDER in Table 1.
[R1] Tracklet Self-Supervised Learning for Unsupervised Person Re-Identification. AAAI 2020.
[R2] CycAs: Self-supervised Cycle Association for Learning Re-identifiable Descriptions. ECCV 2020.
[R3] Unsupervised Person Re-identification by Deep Learning Tracklet Association. ECCV 2018.
[R4] Progressive Unsupervised Person Re-Identification by Tracklet Association With Spatio-Temporal Regularization. TMM 2020.
Questions
Please refer to the weaknesses above.
Limitations
The limitations as well as the safeguards for dataset and models are discussed in the paper.
We sincerely thank you for your constructive and valuable comments. Your concerns are addressed as follows.
(i) Analyses of technical contribution.
Thank you for raising this concern.
Person Re-ID, a fine-grained visual retrieval task, centers on learning a feature distribution that robustly captures the nuances between person images via feature similarity. As feature similarity is essential to Re-ID, our work, like those referenced, prioritizes it. Technically, however, our work makes the following contributions beyond these works:
(a) By defining a noise concept, we propose a progressive multi-level denoising strategy to effectively seek identity correlations across large-scale videos. Previous unsupervised person Re-ID methods have indeed proposed similar ideas of associating tracklets from different videos, but those methods were designed for small-scale video datasets. Their pipelines are mostly complex and inefficient, making them ill-suited for ultra-large-scale video collections. In contrast, we make it possible to seek identity correlations in large-scale videos via a three-stage progressive denoising strategy and a sliding range & linking relation design.
(b) Meanwhile, we propose an identity-guided self-distillation loss to implement large-scale pre-training. Existing research in the field of person Re-ID pre-training ignores the identity-invariance across different videos, leading to comparatively poor representations for the person Re-ID task. In contrast, we enable the model to learn cross-video identity invariance in large-scale videos through the identity correlation seeking and identity-guided self-distillation techniques we propose.
(ii) Ablation study on the hyperparameters.
Theoretically, the quality of the sought identity correlations is indeed influenced to some extent by these two hyperparameters. However, we did not conduct ablation studies on them for the following two reasons.
(a) A series of previous works have shown that in large-scale pre-training, subtle differences in pre-training data quality do not have a significant impact on model performance.
(b) Conducting ablation studies on these two parameters would mean that for each set of values, we would need to start from scratch with the process of identity correlation seeking and pre-training for the entire large-scale video collection. Given that this process is extremely resource and time-consuming, such ablation studies are beyond our capacity to undertake, and they are difficult to accomplish within such a short rebuttal period.
Following your suggestion, we will attempt to explore the specific impacts of different hyperparameters in our future work. We hope the above response addresses your concerns.
(iii) The reason for employing self-supervised learning.
A series of self-supervised learning works in general domains, such as MoCo, DINO, and MAE, conclude that self-supervised learning mitigates model bias from label bias compared to supervised learning, enabling deeper data understanding and more robust, generalizable representations. Hence, our paper prioritizes self-supervised learning with identity association over supervised approaches.
Importantly, through our analysis and some additional experiments, we can conclude the following two representative advantages of self-supervised learning over supervised learning in our work.
(a) Self-supervised learning leads to better generalization capabilities. To validate this, we conduct a comparative experiment using our proposed self-supervised learning method and a representative supervised person Re-ID method BoT [1]. We use 1,000,000 images from CION-AL as the pre-training dataset, with ResNet50 as the backbone. After pre-training, we directly test the zero-shot performance of each model on Market1501 and MSMT17. The table below demonstrates that the self-supervised pre-trained model surpasses the supervised counterpart in zero-shot performance, suggesting improved generalization from self-supervision.
| Method | Mar mAP | Mar R@1 | MSMT mAP | MSMT R@1 |
|---|---|---|---|---|
| BoT (supervised) | 41.4 | 72.1 | 14.9 | 42.1 |
| Ours (self-supervised) | 46.7 | 74.2 | 18.2 | 49.6 |
(b) When training on large-scale data, our proposed self-supervised learning method incurs significantly lower computational costs than representative supervised learning methods. For example, almost all supervised learning methods for person Re-ID use the ID loss for auxiliary training. For a pre-training dataset with a large number of identity labels, the ID loss functions as an ultra-large-scale multi-classifier, which is theoretically resource-intensive and inefficient to train. Meanwhile, some supervised methods also design complex modules and losses. In contrast, we achieve better representation learning by employing a more efficient training method (a single self-supervised learning loss).
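As a rough, hypothetical illustration of this cost gap (the feature dimension, identity count, and projection size below are illustrative, not the paper's exact configuration), compare the parameter counts of the two output heads:

```python
feat_dim, num_ids, proj_dim = 2048, 1_000_000, 65_536  # hypothetical sizes

# A single linear ID-loss classifier head over ~1M identities (weights + biases):
id_head_params = feat_dim * num_ids + num_ids        # ~2.05e9 parameters
# A DINO-style projection output layer of a self-distillation head:
ssl_head_params = feat_dim * proj_dim + proj_dim     # ~1.34e8 parameters

print(f"ID-loss classifier head: {id_head_params:,} parameters")
print(f"Self-distillation head:  {ssl_head_params:,} parameters")
```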
In summary, for person Re-ID pre-training, we believe that self-supervised learning is faster than and superior to supervised learning.
(iv) Similar performance with SOLIDER.
Employing Swin-T, our CION outperforms SOLIDER by 0.6% mAP on Market1501 and 3.7% mAP on MSMT17. With Swin-S, CION matches SOLIDER, edging ahead by 0.1% on both datasets. However, our CION has a significant advantage over SOLIDER in that it is model-agnostic and can be applied to almost all model architectures, including the 32 models with 10 different architectures shown in this paper. In contrast, SOLIDER is only applicable to structures like vision transformers, and their work includes only 3 models. Meanwhile, our CION demonstrates clearly superior performance on unsupervised person Re-ID in Tab. 2.
Reference
[1] A Strong Baseline and Batch Normalization Neck for Deep Person Re-identification. IEEE Transactions on Multimedia (2019).
Thanks for your responses. All of my concerns are addressed with the solid experimental results.
I agree with reviewer uzbJ. I still think the main contribution of your work is dataset preprocessing. Given the solid experimental results, I vote for acceptance.
Dear Reviewer 5xRM,
We sincerely thank you for your feedback! Thank you for considering the experiments in our paper to be solid and voting in favor of acceptance. Your valuable comments, such as the comparison between self-supervised and supervised learning, have further enhanced our work. We will follow your suggestions to further polish our paper.
Meanwhile, we are very happy to have addressed your concerns. May we kindly ask if you would consider raising your rating to increase the chances of our paper being accepted? Thanks a lot for your contribution to this community!
All the best,
Authors.
The paper introduces a Cross-video Identity-cOrrelating pre-traiNing (CION) framework for person re-identification. CION addresses the limitations of existing instance-level and single-video tracklet-level pre-training methods by leveraging identity-invariance across different videos. The framework uses a progressive multi-level denoising strategy and an identity-guided self-distillation loss to improve representation learning. Extensive experiments demonstrate that CION significantly enhances performance with fewer training samples compared to previous state-of-the-art methods. The authors also contribute a model zoo, ReIDZoo, which includes various pre-trained models.
Strengths
Originality: The paper introduces a novel approach to pre-training for person re-identification by explicitly modeling identity invariance across videos, which is a significant departure from existing methods.
Clarity: The paper is clearly written, with detailed explanations and effective use of visual aids.
Significance: The proposed method achieves state-of-the-art performance with fewer training samples and demonstrates model-agnostic ability, making it highly valuable for both research and practical applications.
Weaknesses
-
Pre-training models should aim for broader generalizability. The current models tend to be human-centered, covering tasks such as re-identification (ReID) and human parsing, while others are more narrowly focused on ReID alone, akin to SemReID [1] or PersonMAE [2]. The authors' experiments concentrate primarily on two datasets, Market-1501 and MSMT17. This limited scope may not sufficiently demonstrate the models' generalizability across varied conditions. To better assess and enhance the generalizability of the models, it is suggested that the authors include results from datasets that introduce challenges such as occlusion (Occluded-Duke) and clothing change (LTCC or PRCC). This expansion would provide a more comprehensive understanding of the model's performance across different real-world scenarios.
-
The proposed method relies on the quality of the initial tracking and identity correlation, which may be affected by the accuracy of the tracking algorithms used.
-
Some methods that should be compared are missing. For example, HAP [3] with ViT-Base achieves better performance on MSMT17.
References
[1] Huang, Siyuan, Yifan Zhou, Ram Prabhakar Kathirvel, Rama Chellappa, and Chun Pong Lau. "Self-Supervised Learning of Whole and Component-Based Semantic Representations for Person Re-Identification." arXiv preprint arXiv:2311.17074 (2023).
[2] Hu, Hezhen, Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Lu Yuan, Dong Chen, and Houqiang Li. "PersonMAE: Person Re-Identification Pre-Training With Masked AutoEncoders." IEEE Transactions on Multimedia (2024).
[3] Yuan, Junkun, Xinyu Zhang, Hao Zhou, Jian Wang, Zhongwei Qiu, Zhiyin Shao, Shaofeng Zhang et al. "HAP: Structure-aware masked image modeling for human-centric perception." Advances in Neural Information Processing Systems 36 (2023).
Questions
See the weaknesses.
Limitations
-
Fine-grained Representation Learning: The current CION framework does not explicitly target fine-grained information extraction, which may limit its ability to learn highly detailed person representations necessary for distinguishing between very similar identities.
-
Noise Removal Challenges: Due to limitations in the feature extractor’s representational capacity, some complex noise may remain unresolved during the denoising process, leading to potential inaccuracies in the automatically derived identity correlations.
-
Correlation Issues with Distant Tracklets: The strategy used to conserve computational resources may struggle to establish correlations between tracklets of the same person that are widely separated across different videos, potentially leading to incomplete identity mapping.
Sincerely thanks for your constructive and valuable comments. The concerns are answered as follows.
(i) Include results from other datasets.
Thanks for your suggestion.
Our work focuses on proposing a superior pre-training framework to enable the model to achieve exceptional performance on the task of person Re-ID.
Most representative works in the field of person Re-ID pre-training, such as LUP [1], LUP-NL [2], PASS [3], ISR [4], and PersonMAE [5], primarily use the Market1501 and MSMT17 datasets to validate the effectiveness of their methods. To conduct a more equitable comparison with these works, we also mainly utilize these two datasets as downstream datasets in this paper.
At the same time, some works, such as SemReID [6], use more challenging datasets to validate the models' generalizability across varied Re-ID conditions. This exploration indeed allows for better validation of the general capabilities of pre-trained models.
Following your suggestion, we include results on the Occluded-Duke and PRCC datasets in the following table. The compared models all use ViT-S as the backbone. The results are derived from our implementation based on the official open-sourced models and code.
| Method | Occ-Duke mAP | Occ-Duke R@1 | PRCC mAP | PRCC R@1 |
|---|---|---|---|---|
| TranSSL | 58.1 | 67.6 | 46.6 | 47.9 |
| PASS | 59.4 | 68.9 | 47.3 | 49.2 |
| CION | 63.2 | 71.8 | 52.4 | 54.7 |
From the table, it can be observed that our CION pre-trained model outperforms the other two pre-trained models significantly on these two challenging datasets. This experimental result demonstrates that our pre-trained model is indeed capable of generalizing well to other more challenging person Re-ID datasets.
Notably, despite efforts to contact the authors for access to the LTCC dataset, no response was received within the rebuttal period. Meanwhile, comparison results for PersonMAE [5] and SemReID [6] are not included since their models and source code are not publicly available. However, our commitment to include results on more datasets and works remains.
(ii) Analysis of the effect of tracking algorithms.
As a common practice, similar to LUP-NL [2] and ISR [4], CION also employs a tracking algorithm to obtain the initial tracklet-level correlations. Theoretically, the more robust the tracking algorithm, the higher the quality of the initial tracklet-level correlations; nevertheless, these correlations inevitably contain some noise.
However, different from most methods that directly ignore the noise in these correlations, we propose a multi-level denoising strategy to progressively improve the quality of these correlations. The results in Fig. 5 of the paper empirically demonstrate the effectiveness of our proposed strategy. Our strategy minimizes the impact of noise in the initial tracklet-level correlations as much as possible, thereby significantly improving the model's performance.
(iii) Uncompared related work.
Thank you for pointing out such an outstanding work. When using ViT-Base, our CION outperforms HAP [7] on Market1501 but falls short on MSMT17. Overall, the two achieve comparable performance. Upon analysis, we believe that our CION holds the following advantages over HAP.
(a) CION does not require extensive human priors during pre-training. HAP explicitly identifies human body regions in images during pre-training by utilizing a robust pose estimation model, which can somewhat affect the efficiency of pre-training. In contrast, our CION does not require such priors.
(b) CION is model-agnostic. CION imposes minimal restrictions on model architecture, and has achieved excellent performance across a range of models, while HAP is only applicable to models like Vision Transformers and cannot be effectively utilized in other model architectures, such as CNNs.
Certainly, HAP is a very meaningful related work, and we will include the results of HAP in our final version.
(iv) Limitations
Thank you for summarizing the Limitations section discussed in depth in our appendix. In our subsequent work, we will focus on addressing fine-grained issues and improving the accuracy of identity correlations.
References
[1] Dengpan Fu, Dongdong Chen, Jianmin Bao, Hao Yang, Lu Yuan, Lei Zhang, Houqiang Li, and Dong Chen. Unsupervised pre-training for person re-identification. In CVPR, pages 14750–14759 (2021).
[2] Dengpan Fu, Dongdong Chen, Hao Yang, Jianmin Bao, Lu Yuan, Lei Zhang, Houqiang Li, Fang Wen, and Dong Chen. Large-scale pre-training for person re-identification with noisy labels. In CVPR, pages 2476–2486 (2022).
[3] Kuan Zhu, Haiyun Guo, Tianyi Yan, Yousong Zhu, Jinqiao Wang, and Ming Tang. Pass: Part-aware self-supervised pre-training for person re-identification. In ECCV, pages 198–214. Springer (2022).
[4] Zhaopeng Dou, Zhongdao Wang, Yali Li, and Shengjin Wang. "Identity-seeking self-supervised representation learning for generalizable person re-identification". In ICCV, pages 15847–15858 (2023).
[5] Hu, Hezhen, Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Lu Yuan, Dong Chen, and Houqiang Li. "PersonMAE: Person Re-Identification Pre-Training With Masked AutoEncoders." IEEE Transactions on Multimedia (2024).
[6] Huang, Siyuan, Yifan Zhou, Ram Prabhakar Kathirvel, Rama Chellappa, and Chun Pong Lau. "Self-Supervised Learning of Whole and Component-Based Semantic Representations for Person Re-Identification." arXiv preprint arXiv:2311.17074 (2023).
[7] Yuan, Junkun, Xinyu Zhang, Hao Zhou, Jian Wang, Zhongwei Qiu, Zhiyin Shao, Shaofeng Zhang et al. "HAP: Structure-aware masked image modeling for human-centric perception." Advances in Neural Information Processing Systems 36 (2023).
After providing more comparisons and many results from different ReID task datasets, I found this paper to be acceptable.
Dear Reviewer 3WDg,
We sincerely appreciate your feedback! Thank you for recognizing our work and considering our paper worthy of acceptance. Your suggestions have effectively further enhanced our work. We will include more comparative results of different ReID tasks in our future version.
If our response and additional experiments have adequately addressed your major concerns, may we kindly ask if you would consider increasing your rating to enhance the likelihood of this paper's acceptance? We are grateful for your contribution to this community!
All the best,
Authors.
The authors propose an unsupervised method for mining identity relations in large-scale video data, incorporating self-supervised learning to pre-train general models (like CNNs and transformers). This approach demonstrates superior performance on person ReID tasks compared to other pre-training models.
Strengths
Large-scale video datasets are an effective training source for the ReID task. The authors propose a straightforward method for pre-labeling instance identities from video streams, taking into account the appearance of pedestrians across multiple videos and the necessary computational complexity. They construct the CION-AL dataset based on the SYNTHPEDES dataset, which contains rich pedestrian information that effectively aids model pre-training.
Under the DINO self-supervised learning framework, the authors differentiate the data source between the teacher and student models, enhancing the student model's representation capabilities through the focus on local pedestrian information.
The authors use various CNN and Transformer networks pre-trained on ImageNet as baselines, conducting extensive experiments on supervised and unsupervised domain adaptation datasets. The results demonstrate that the pre-trained models built using CION-AL are more effective for pedestrian re-identification tasks compared to those built using ImageNet.
Weaknesses
The primary contribution of this paper lies in proposing a method for unsupervised data preprocessing of video streams, utilizing self-supervised learning to pre-train more powerful pedestrian re-identification models. However, the data preprocessing method essentially employs an unsupervised clustering strategy, which is a straightforward pre-labeling approach. The establishment of sample relationships in this manner is often suboptimal. Therefore, it is crucial to demonstrate in Section 4.4 through ablation studies whether it is the identity correlation seeking (i.e., denoising or pre-labeling) or the large scale of the dataset that enhances the model's performance, and quantify their respective contributions.
Additionally, the proof of computational complexity for cross-video denoising in Section 3.2 is insufficient, making the significance of the Sliding Range and Linking Relation unclear.
In Section 3.3, the training process of identity-guided self-distillation primarily differs from DINO in the way local information is provided to the teacher and student data streams. So, the “identity-guided” is still reflected in data preprocessing rather than in the self-distillation process.
In Section 4.2, the MGN fine-tuning has a certain impact on model performance, but the essence of this fine-tuning lacks necessary explanation.
In summary, although the proposed CION-AL dataset, in terms of volume and denoising processing (pre-labeling), enhances general pre-training models, the lack of qualitative analysis experiments on denoising significantly diminishes the significance of the proposed method.
Questions
- Is the contribution of this paper to ReID general pre-training models primarily a new large-scale dataset, or the proposal of a method to utilize identity correlation?
- How can you demonstrate that the proposed method effectively utilizes identity correlation information of pedestrians?
- When using the same dataset, how can you experimentally prove the superiority of the proposed identity correlation seeking method?
Limitations
N/A
We sincerely thank you for your valuable comments. Your concerns are addressed as follows.
(i) Respective contribution analyses.
Thanks for your concern. In Sec. 4.4, the 2nd and 3rd experiments inherently quantify the respective contributions of the identity correlation seeking and the large scale of the dataset.
In the 2nd experiment, we investigate the performance improvements brought by the three stages in identity correlation seeking. The quantitative results are shown in Fig. 5 of the paper. For your easy reference, we summarize them below.
| Seeking Process | Mar mAP | Mar R@1 | MSMT mAP | MSMT R@1 |
|---|---|---|---|---|
| no seeking | 92.1 | 96.3 | 69.9 | 87.6 |
| stage 1 | 92.3 | 96.6 | 71.3 | 88.6 |
| stage 1+stage 2 | 92.7 | 96.9 | 72.4 | 89.1 |
| stage 1+stage 2+stage 3 | 93.3 | 97.3 | 74.3 | 89.8 |
It can be observed that each stage brings a significant improvement over the previous stage. In particular, the complete seeking process (stage 1+2+3) achieves the best mAP scores of 93.3% and 74.3% on Market1501 and MSMT17, respectively, an improvement of 1.2% and 4.4% compared to the no-seeking group. These results quantify the contribution of the identity correlation seeking in enhancing the model's performance.
Meanwhile, in the 3rd experiment, we examine the impact of pre-training dataset scale on model performance improvement. The results in Fig. 6 of the paper reveal the following:
(a) With an increase in dataset scale, there's a significant performance gain. A 10% subset yields mAP of 91.1% on Market1501 and 67.5% on MSMT17, while a 100% set achieves 93.3% and 74.3% respectively.
(b) As dataset scale grows, the rate of performance improvement tapers off. A 40% increase from 10% to 50% improves mAP by 5.3% and 1.5%, while a 40% increase from 50% to 90% only adds 1.2% and 0.6%.
These results quantify the contribution of the large scale of the dataset in enhancing the model's performance.
(ii) Analyses of computational complexity.
Without our proposed cross-video denoising, for N tracklets we need to calculate the distance between the current tracklet and all other tracklets. That is, from the 1st tracklet to the Nth tracklet, we sequentially calculate N-1, N-2, ..., 2, 1 distances, resulting in N(N-1)/2 calculations in total. In theory, when N is very large, the computational complexity is O(N^2).
However, with our proposed cross-video denoising, we only need to calculate the distance between the current tracklet and the other tracklets within the sliding range. Assuming the radius of the range is r, from the 1st tracklet to the Nth tracklet we only need to calculate 2r distances for each, resulting in 2Nr calculations in total. When N is large, the computational complexity is O(N).
The above analysis demonstrates that our proposed strategy slashes the computational complexity from O(N^2) to O(N), substantially cutting costs, which is especially crucial for large-scale cross-video denoising.
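A back-of-the-envelope sketch of this counting argument (the values of N and r below are hypothetical):

```python
def exhaustive_comparisons(n):
    # Every tracklet compared with every other one: n(n-1)/2, i.e. O(n^2).
    return n * (n - 1) // 2

def sliding_range_comparisons(n, r):
    # Each tracklet compared only with the <= 2r tracklets in its window: O(n).
    return 2 * n * r

n, r = 1_000_000, 50  # hypothetical scale
print(exhaustive_comparisons(n))        # 499,999,500,000 comparisons
print(sliding_range_comparisons(n, r))  # 100,000,000 comparisons
```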
(iii) Analyses of identity-guided self-distillation.
Our CION operates in two stages: the first seeks identity correlations across videos, while the second conducts self-distillation using the sought identity correlations.
Indeed, with identity supervision, identity-guided self-distillation has improved DINO's training strategy for person Re-ID pre-training. The core idea is to align global and local features of images from the same person identity, guiding the model to effectively mine identity invariance and learn stronger representations for person Re-ID.
Therefore, it can be concluded that the "identity-guided" is also reflected in the self-distillation process.
(iv) Essence of MGN fine-tuning.
MGN is a feature learning strategy that integrates discriminative information at various granularities for person Re-ID. As an excellent open-source algorithm, using MGN as the fine-tuning algorithm is one of the standard paradigms for validating the performance of Re-ID pre-trained models.
Following standard practice, we pre-train the model with our proposed method before fine-tuning it with the widely-used MGN for downstream tasks. As Tab. 1 shows, our pre-training method brings much better performance to the model.
(v) Primary contribution to Re-ID pre-trained models.
It is an interesting question. We hold the view that the primary contribution of this paper to Re-ID pre-trained models is the proposal of a method utilizing identity correlation. We supplement some experiments to verify this.
| # | Method | Pre-train | Fine-tune | Mar mAP | Mar R@1 | MSMT mAP | MSMT R@1 |
|---|---|---|---|---|---|---|---|
| 1 | DINO | LUPerson(4.2M) | MGN | 91.1 | 96.4 | 66.3 | 86.1 |
| 2 | DINO | CION-AL(3.9M) | MGN | 90.9 | 96.3 | 66.7 | 85.8 |
| 3 | CION | CION-AL(3.9M) | MGN | 93.3 | 97.3 | 74.3 | 89.8 |
The comparison between 1 and 2 shows that for DINO (identity-ignored), dataset changes have negligible effect on performance. Conversely, comparing 2 and 3, CION demonstrates a substantial performance boost, suggesting that pre-training with identity correlation significantly enhances model performance for the same dataset.
(vi) Identity correlation utilization demonstration.
We enable the model to effectively utilize identity correlation information by using our proposed identity correlation seeking and identity-guided self-distillation. As demonstration, Fig. 4 in the paper visualizes the feature distributions of 15 randomly chosen persons, showing that our CION outperforms both the instance-level LUP and tracklet-level LUP-NL methods in identity discrimination, confirming its effective utilization of identity correlation information.
(vii) Experimental proof of the proposed method's superiority.
In the 2nd experiment of Sec. 4.4, we investigate the performance improvements brought by the identity correlation seeking method when using the same dataset. The results are also shown in the 1st table of this response. The significant performance improvement experimentally proves the superiority of the proposed method.
Thanks for your reply. According to your experimental results, the proposed method did bring quite a boost. Although I still think the main contribution of your work is dataset preprocessing, given that this is a simple and effective way to improve performance, I would accept the article.
Dear Reviewer uzbJ,
We sincerely thank you for the response and improved rating! We are very grateful that you consider our work to be simple and effective, and our paper to be acceptable. We will follow your valuable suggestions to further enhance our paper.
All the best,
Authors.
This paper presents a novel framework, CION, to learn identity-invariance from cross-video person images for person re-identification pre-training. In particular, the authors model the identity correlation seeking process as a progressive multi-level denoising problem, with a novel noise concept defined. Also, they propose an identity-guided self-distillation loss to implement better large-scale pre-training.
Strengths
The identity-invariance of the same person across different videos is significantly neglected by single-video tracklet-level methods. This paper presents a cross-video identity-correlating pre-training framework, which explicitly learns identity-invariance from long-range cross-video person images.
Weaknesses
This paper compares the fine-tuning results with two popular identity-ignored pre-training methods, LUP and DINO, in Tab. 4. How does it compare with other identity-ignored pre-training methods?
Questions
1. It is significantly valuable to eliminate false positives of identity tags caused by tracking errors. How is this handled in this paper?
2. It is significantly valuable for the model to learn better identity representations without knowing the identity tags of the same person across different videos. How is this achieved in this paper?
Limitations
The authors have given the limitations of the proposed model, the authors mentioned that the automatically sought identity correlation may still have a certain discrepancy in accuracy when compared to manually annotated correlation, which may introduce a certain degree of ambiguity during the training process. In their subsequent work, they will focus on addressing fine-grained issues and improving the accuracy of identity correlations.
We sincerely thank you for your constructive comments. Your concerns are addressed as follows.
(i) Comparison with other identity-ignored pre-training methods.
Thank you for raising this concern. First, the experiment in Tab. 4 yields the following result: when pre-trained on exactly the same dataset (CION-AL) and fine-tuned with exactly the same downstream algorithm (MGN) and datasets (Market1501 and MSMT17), our identity-guided pre-training method demonstrates a significant performance lead over the two representative identity-ignored pre-training methods (DINO and LUP).
Meanwhile, the results in Tab. 1 also validate the superiority of our method over other identity-ignored methods. For your easy reference, we list the comparison results with these methods below. The results of these methods are all derived from official papers or repositories.
| Method | Pre-training Dataset | Fine-tuning Algorithm | Backbone | Mar mAP | Mar R@1 | MS mAP | MS R@1 |
|---|---|---|---|---|---|---|---|
| LUP | LUPerson(4.2M) | MGN | ResNet50 | 91.0 | 96.4 | 65.7 | 85.5 |
| UPReID | LUPerson(4.2M) | MGN | ResNet50 | 91.1 | 97.1 | 63.3 | 84.3 |
| Ours | CION-AL(3.9M) | MGN | ResNet50 | 92.3 | 97.1 | 70.1 | 87.8 |
| MoCov3 | LUPerson(4.2M) | TransReID | ViT-S | 82.2 | 92.1 | 47.4 | 70.3 |
| TranSSL | LUPerson(4.2M) | TransReID | ViT-S | 91.1 | 95.9 | 66.8 | 85.5 |
| PASS | LUPerson(4.2M) | TransReID | ViT-S | 92.2 | 96.3 | 69.1 | 86.5 |
| Ours | CION-AL(3.9M) | TransReID | ViT-S | 92.8 | 96.5 | 70.3 | 86.9 |
We can observe that, compared to these identity-ignored methods, using the same backbone and the same downstream fine-tuning algorithms, our identity-guided method has achieved a significantly leading performance even with fewer pre-training images (3.9M vs 4.2M).
Therefore, analyzing the two representative experimental results above, it can be concluded that the guidance of identity is crucial for the person Re-ID pre-training.
In fact, comparing with more identity-ignored pre-training methods in Tab. 4 could make the conclusion more convincing. However, due to the limitations of computational resources and the lengthy time required for pre-training (e.g. 8×A100 and 120 hours for PASS ViT-B), as well as some methods such as UPReID not publicly releasing their code, we are unable to conduct pre-training from scratch for these methods within such a short rebuttal period. However, our commitment to make comparisons with these methods remains, and we intend to include comparative results in our final version.
(ii) How to eliminate the false positives of identity tags.
This is an interesting question. In our task, false positives of identity tags mean that images that should not belong to a certain person have been incorrectly assigned that person's identity tag. The single-tracklet denoising strategy is specifically proposed to eliminate these false positives caused by tracking errors. The specifics of this strategy can be found in lines 152 to 162 of the paper. For easier understanding, we demonstrate the core process of this strategy in the following Python pseudocode:
```python
import numpy as np

def denoise_tracklet(initial_samples, threshold):
    # initial_samples: a list (or array) of feature vectors belonging to one
    # tracklet, i.e., the initial collection of images for a certain identity.
    denoised_samples = [np.asarray(s, dtype=np.float64) for s in initial_samples]
    while len(denoised_samples) > 1:
        distances = []
        for i, sample in enumerate(denoised_samples):
            # Leave-one-out centroid: exclude the current sample so that an
            # outlier cannot pull the centroid toward itself.
            other_samples = denoised_samples[:i] + denoised_samples[i + 1:]
            centroid = np.mean(other_samples, axis=0)
            distances.append(np.linalg.norm(sample - centroid))
        max_deviation = max(distances)
        if max_deviation > threshold:
            # Remove the most deviant sample (a likely tracking error) and re-check.
            denoised_samples.pop(distances.index(max_deviation))
        else:
            break
    return denoised_samples
```
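A minimal usage sketch (the feature dimensionality and threshold value below are hypothetical, purely for illustration):

```python
import numpy as np

# Hypothetical example: 6 random 128-D features standing in for one tracklet.
rng = np.random.default_rng(0)
feats = list(rng.standard_normal((6, 128)))
kept = denoise_tracklet(feats, threshold=20.0)  # threshold is illustrative
print(f"{len(kept)} of {len(feats)} samples kept after denoising")
```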
(iii) How to learn better identity representations across videos.
In this paper, the model learns better identity representations through the following two stages without knowing the identity tags of the same person across different videos.
In the first stage, by defining a noise concept that takes into account both intra-identity consistency and inter-identity discrimination comprehensively, we aim to seek identity correlation from cross-video images by modeling it as a progressive multi-level denoising problem. Specifically, we utilize a model that has been preliminarily trained on a small manually annotated dataset as a noise discriminator, and discover the identity tags across different videos using the proposed multi-level denoising strategy.
In the second stage, we propose an identity-guided self-distillation loss to constrain the model to learn identity representations across different videos. Specifically, by leveraging the identity tags discovered in the first stage, we design a training task that aligns both the global and local features of the same person, thereby enabling the model to learn identity-invariant representations.
The specific contents of these two stages can be found in Section 3.2 and Section 3.3 of the paper. We hope this brief overview helps you better understand the process of learning identity representations in our work.
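For intuition only, below is a minimal DINO-style sketch of the second-stage loss, assuming hypothetical temperature values and pre-computed projection logits; it is a simplification rather than the exact loss in the paper:

```python
import torch
import torch.nn.functional as F

def identity_guided_distill_loss(student_logits, teacher_logits,
                                 temp_s=0.1, temp_t=0.04):
    # student_logits: projections of LOCAL crops from image A of identity k.
    # teacher_logits: projections of GLOBAL crops from image B of the SAME
    # identity k, possibly taken from a different video. Aligning the two
    # distributions encourages cross-video identity invariance rather than
    # mere augmentation invariance.
    teacher_probs = F.softmax(teacher_logits.detach() / temp_t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temp_s, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

# Toy call with random logits (batch of 8, 256-D projection space):
loss = identity_guided_distill_loss(torch.randn(8, 256), torch.randn(8, 256))
```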
Thanks for your responses. All of my concerns are addressed with the solid experimental results; I find this paper to be acceptable.
Dear Reviewer hfuf,
We sincerely appreciate your response! Thanks a lot for recognizing our work and considering our paper acceptable. We will follow all your suggestions to further enhance our paper, and we will include more pseudocode to clarify our denoising strategies.
We are more than happy to know that our responses have addressed your concerns. May we kindly ask if you would consider raising the score to increase the likelihood of our paper's acceptance? We are really grateful for your contribution to this community!
All the best,
Authors.
We sincerely appreciate all reviewers' time and efforts in reviewing our paper.
We are glad to find that reviewers generally recognized our contributions including the novelty and significance of our method (Reviewer hfuf, uzbJ, 3WDg, 5xRM), the comprehensive experiments (Reviewer uzbJ, 3WDg, 5xRM) and our well-written paper (Reviewer 3WDg, 5xRM).
We will open-source all the code, models and dataset. We believe that our work will make great and timely contributions to this community.
- We have pioneered learning identity-invariance from cross-video person images for person ReID pre-training and verified its superiority through extensive experiments. This learning paradigm offers a solid path to more powerful ReID models.
- Our model zoo, containing 32 models with 10 architectures, will enable researchers to quickly access high-quality model resources, supporting various ReID task applications and promoting the sustainable development of the technology.
- Our CION-AL dataset is currently the largest in the field with high-quality identity correlations, offering abundant data resources for further research in person-related domains.
We have replied to each reviewer individually to address their concerns. We note that we can continue discussions on the OpenReview system for some time to come. If the reviewers have any questions, we would be happy to continue the discussion.
Once again we thank all reviewers and area chairs!
The paper received all-accept recommendations after the rebuttal. All reviewers acknowledged the good results obtained and weighed the dataset contribution heavily. While the data and the research are about person re-identification, ethics reviewers raised some particular concerns and shared useful suggestions. The authors took these seriously and provided responses and solutions. Overall, the AC agrees with the reviewers and recommends acceptance of this paper for publication.