Cross-Modality Masked Pre-training for Visible-Infrared Person Re-identification
This paper proposes a cross-modality masked pre-training (CMMP) method for visible-infrared person re-identification.
Abstract
Reviews and Discussion
Existing visible-infrared person re-identification (VI-ReID) methods often overlook the disparity between the pre-training data (ImageNet) and VI-ReID data. This paper introduces a novel approach called Cross-Modality Masked Pre-training (CMMP) for visible-infrared person re-identification. Notably, performance gains are achieved through an initial pre-training stage of masking and reconstruction, followed by fine-tuning.
Strengths
- This paper introduces a novel approach to the VI-ReID task, being the first to focus on model pre-training.
- The paper develops a lightweight pre-training method and demonstrates its effectiveness in enhancing VI-ReID.
Weaknesses
While the focus of this paper is innovative, the experimental results indicate weak performance.
Questions
1. In this paper, the pre-training data and the fine-tuning data are identical. If we conduct pre-training on ImageNet and then proceed with fine-tuning, does this approach offer any performance advantage compared to the optimal method?
2. This paper mentions that "When attempting to predict all three cross-modality images simultaneously, the model's performance is notably worse, considerably lower than when predicting other targets." Does this observation suggest that pre-training does not necessarily require infrared image data, and that we could instead use visible and generated images from ImageNet or large ReID datasets for pre-training?
This paper includes AUTHOR CONTRIBUTIONS and ACKNOWLEDGMENTS sections in the main text. Therefore, this paper violates the double-blind reviewing rules.
Strengths
Weaknesses
Questions
Details of Ethics Concerns
This paper presents a mask-based pre-training strategy for visible-infrared person re-identification. Starting from images extended via random channel exchange, the proposed CMMP employs a mask-sharing mechanism to mask three types of images simultaneously. The model is then encouraged to reconstruct the masked regions via a lightweight decoder.
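For concreteness, here is a minimal PyTorch sketch of this pipeline as I understand it from the description; the function names, the 16x16 patch size, and the exact channel-exchange variant are illustrative assumptions rather than the authors' code (the 80% mask rate follows the ratio reported for CMMP).

```python
import torch
import torch.nn.functional as F

def channel_exchange(rgb):
    """Build a generated image by copying one randomly chosen channel into all
    three channels (an assumed variant of random channel exchange)."""
    c = torch.randint(0, 3, (1,)).item()
    return rgb[:, c:c + 1].repeat(1, 3, 1, 1)            # (B, 3, H, W)

def shared_patch_mask(b, h, w, patch=16, ratio=0.8, device="cpu"):
    """Sample one random per-patch mask per image (1 = keep, 0 = masked) and
    upsample it to pixel resolution; the same mask is reused for every modality."""
    gh, gw = h // patch, w // patch
    keep = (torch.rand(b, 1, gh, gw, device=device) > ratio).float()
    return F.interpolate(keep, size=(h, w), mode="nearest")

def cmmp_step(encoder, decoder, rgb, ir):
    """One pre-training step: mask the visible, generated, and infrared images
    with the shared mask; reconstruct only the visible and generated images,
    matching the paper's statement that the infrared image is not a target."""
    b, _, h, w = rgb.shape
    gen = channel_exchange(rgb)
    mask = shared_patch_mask(b, h, w, device=rgb.device)
    _ = encoder(ir * mask)   # infrared shares the mask but yields no reconstruction loss
    loss = 0.0
    for img in (rgb, gen):
        pred = decoder(encoder(img * mask))              # lightweight decoder
        loss = loss + ((pred - img).abs() * (1 - mask)).mean()  # L1 on masked pixels
    return loss
```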
Strengths
Good writing: I can easily follow the main story of this paper, and its organization is clear.
Weaknesses
- Limited novelty: I can hardly identify a significant contribution or inspiring insight for this field. Since both the model architecture and the data generation method are based on existing methods, what is the core value or unique contribution of this paper?
- Lack of citations to recent literature: This paper lacks citations to recent literature, with only one paper from 2021 mentioned in the related work section.
- Outdated experimental comparisons: The experimental results lack comparisons to the latest work. Since this is a submission to ICLR 2024, it is inappropriate to rely solely on experimental comparisons with those prior to 2021.
- Inaccurate statement about SimMIM: The statement that SimMIM only encodes the unmasked parts is incorrect; SimMIM also encodes learnable mask tokens for the masked patches.
- Lack of clarity in explanations: The paper repeatedly mentions "misalignment inherent in the VI-ReID dataset" without providing an explanation of the negative impact it causes or whether any existing work has addressed this issue. Additionally, the descriptions of pretraining and finetuning processes are not clear. It is essential to provide clear and detailed explanations to enhance the understanding of the proposed method.
- Merely being the first to introduce the mask reconstruction paradigm into VI-ReID is not much of a contribution; clarifying the rationale for and the necessity of introducing it into this task is more important and meaningful.
- The writing and reference formatting are sometimes irregular, e.g., ‘Ye et al.Ye et al. (2021)’ in Section 2.2, ‘10 experiments’ in Section 4.1, and ‘288*144 size’ in Section 4.2.
- For the ablation study on the masking strategy, comparisons with more masking strategies (e.g., block-wise masking; see the sketch after this list) should be provided, rather than only the random mask.
- There is a lack of insight into the experimental results; the analysis remains superficial.
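As an illustration of the masking-strategy point above, a hypothetical block-wise alternative (in the spirit of BEiT-style masking; my own sketch, not anything from the paper) could be compared against the purely random mask:

```python
import torch

def blockwise_mask(gh, gw, ratio=0.8, block=4):
    """Mask square blocks of patches until roughly `ratio` of the (gh x gw)
    patch grid is masked; returns a grid with 1 = keep, 0 = masked."""
    keep = torch.ones(gh, gw)
    while (keep == 0).float().mean() < ratio:
        i = torch.randint(0, max(gh - block + 1, 1), (1,)).item()
        j = torch.randint(0, max(gw - block + 1, 1), (1,)).item()
        keep[i:i + block, j:j + block] = 0.0
    return keep
```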
Questions
- Both MAE and SimMIM, the models most relevant to this paper, adopt ViT as the backbone. However, the proposed CMMP model uses a CNN architecture. Could the authors provide the reason for this choice? Did you try applying CMMP to a ViT?
- Overall, the encoding strategy of CMMP seems more similar to SimMIM's. Have the authors attempted an MAE-like strategy that encodes only the unmasked parts (see the sketch after this list)? Given that CMMP has an 80% mask rate, MAE-like encoding could significantly improve training efficiency.
- The paper mentions the statement multiple times -- “Due to the misalignment inherent in the VI-ReID dataset, CMMP focuses its reconstruction efforts solely on the visible image and the generated images after masking”. Could the authors explain what this misalignment refers to and why reconstruction is not performed on the infrared image?
- Could the authors provide a more detailed explanation of the finetuning process? Specifically, what are the differences between pretraining and finetuning in this paper?
- The encoder processing flow should be explained more thoroughly. Figure 2 shows three input images, while Figure 3 depicts only two inputs to the encoder; please clarify this discrepancy and make the depictions consistent.
- It would be advisable to include more figures in the experimental section; compared to tables, graphical presentation offers a more intuitive and direct comparison.
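To make the SimMIM/MAE distinction raised in the second question concrete, here is a minimal token-level sketch; the shapes, names, and encoder interface are illustrative assumptions, not code from CMMP, SimMIM, or MAE.

```python
import torch

def simmim_style_encode(encoder, tokens, keep, mask_token):
    """SimMIM-style: masked patch embeddings are replaced by a learnable mask
    token, and the encoder attends over all N tokens."""
    # tokens: (B, N, D); keep: (B, N) with 1 = visible, 0 = masked
    x = torch.where(keep.bool().unsqueeze(-1), tokens, mask_token.expand_as(tokens))
    return encoder(x)

def mae_style_encode(encoder, tokens, keep):
    """MAE-style: masked tokens are dropped entirely, so at an 80% mask rate
    the encoder sees only ~20% of the sequence (much cheaper self-attention)."""
    b, n, d = tokens.shape
    visible = tokens[keep.bool()].view(b, -1, d)  # assumes an equal keep count per sample
    return encoder(visible)
```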
All three reviewers give consistent negative comments, and the major issues are: (1) limited novelty, since the basic idea is based on existing methods; (2) insufficient experimental results. Moreover, the authors did not provide a rebuttal. After reading the comments, the AC cannot recommend accepting this paper and encourages the authors to take the comments into consideration for a future submission.
Why Not a Higher Score
Please see the detailed meta reviews
Why Not a Lower Score
Please see the detailed meta reviews
Reject