PLIP: Language-Image Pre-training for Person Representation Learning
Abstract
Reviews and Discussion
This paper presents a language-image pre-training framework and a large-scale synthetic image-text dataset for person representation learning. The proposed framework contains three elaborately designed pretext tasks: 1) Text-guided Image Colorization (TIC); 2) Image-guided Attributes Prediction (IAP); and 3) Identity-based Vision-Language Contrast (IVLC). Meanwhile, a large-scale person dataset with image-text pairs (SYNTH-PEDES) is constructed by automatically generating textual descriptions using the proposed Stylish Pedestrian Attributes-union Captioning (SPAC) method. The dataset contains over 4.7M images from 300K identities with 12M descriptions. Finally, the paper utilizes the pre-training framework to pre-train models on SYNTH-PEDES. Extensive experiments demonstrate the significant improvements brought by the pre-trained models to a range of downstream person-centric tasks under various settings. The paper also conducts extensive experiments to verify the superiority of the proposed pre-training framework over other SOTA pre-training methods.
Strengths
- This paper is well-written and easy to follow.
- The proposed pre-training framework considers the key characteristics of persons and effectively utilizes the correlations between images and texts. The learned representations through this framework show good generalization ability on extensive person-centric tasks.
- The SYNTH-PEDES dataset contains a large number of person images along with their corresponding text descriptions. The quality and diversity of the texts are quite good. This dataset can facilitate research on more cross-modal person-related domains.
- This paper conducts comprehensive experiments on many aspects, validates the powerful transfer and domain generalization capabilities of the pre-trained models, and evaluates the quality of the dataset.
Weaknesses
- The paper does not conduct pre-training on the commonly used ViT.
- The entire content of the paper, including the appendix, seems somewhat verbose, totaling about 31 pages. The authors could be more concise in their presentation.
Questions
- As shown in Table 4, for the proposed pre-trained model PLIP, CMPM/C achieves the best results, significantly surpassing the other two downstream algorithms. However, for all other pre-trained models, the LGUR achieves the best results. Can you explain the reason for this?
- In my opinion, the TIC task utilizes the textual descriptions with color words to restore the original color information of the gray-scale images, which could lead to learning comparably fine-grained information. Nevertheless, in some specific scenarios, like encountering a white T-shirt with a blue logo and vaguely described as “a white T-shirt,” could TIC result in learning suboptimal representations?
Limitations
Since the research in this paper involves persons, it inevitably contains some privacy and security issues. However, the authors have discussed these concerns in detail in the “Broad Impact” and “Limitations” sections of the appendix and have taken measures to minimize the risk of privacy breaches. Therefore, I think the authors have adequately addressed the limitations and potential negative societal impact of their work.
Thanks for your valuable comments and appreciation of our work. The concerns are answered as follows.
(i) Pre-training on commonly used ViT.
Because the TIC task of PLIP needs to handle multi-scale image features, we only consider vision transformer variants with a hierarchical structure, such as Swin Transformer, as the backbone of PLIP. To verify the effectiveness, we perform a range of experiments with Swin Transformer as the backbone; the results are shown in Table 1, Table 5, Table 7, Table 9, Table 18 and Table 21 of our paper. These results all demonstrate the effectiveness of our pre-training method on Swin Transformer. Certainly, in our future work, we will explore more universal pre-training methods that impose fewer constraints on the model architecture. We hope the above response has addressed your concerns.
(ii) More concise presentation.
Thanks for pointing this out. Due to the extensive experiments conducted to validate the effectiveness of each component in the proposed framework, as well as experiments on a wide range of downstream tasks, the length of this paper is considerable. Following your suggestion, we will refine our presentation with greater conciseness in the future version.
(iii) The reason for inconsistent improvements.
Thanks for your insightful comment. Because LGUR itself is a high-performing method, the other pre-trained models achieve their best results with it, whereas our PLIP obtains its best results with CMPM/C. The reasons are explained below:
A series of previous works in the field of multi-modal pre-training have demonstrated that pre-trained multi-modal models can achieve excellent performance with simple fine-tuning on downstream tasks, without additional complex model-structure design. We must note that CMPM/C is a very simple algorithm that correlates the global visual and textual representations with only two loss functions, without any additional complex model structure. In contrast, SSAN and LGUR rely on highly complex additional structures for model training. Through large-scale person-centric multi-modal pre-training, PLIP has already learned a very discriminative multi-modal shared space for text-based person Re-ID. Therefore, simply fine-tuning PLIP with CMPM/C gives better results than with SSAN and LGUR.
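To illustrate how lightweight CMPM/C is, a rough sketch of a CMPM-style matching loss on the global embeddings is shown below. It follows the common description of the loss in [110]; the variable names and the symmetric image-to-text/text-to-image formulation are our assumptions, and the identity-classification term of CMPM/C (CMPC) is omitted.

```python
import torch
import torch.nn.functional as F

def cmpm_loss(img_emb, txt_emb, pids, eps=1e-8):
    """Rough sketch of a CMPM-style matching loss (illustrative, not the exact implementation).

    img_emb, txt_emb: (B, D) global embeddings from the two encoders.
    pids: (B,) identity labels; pairs sharing an identity count as matches.
    """
    # Ground-truth match distribution: uniform over same-identity pairs in the batch.
    match = (pids.unsqueeze(0) == pids.unsqueeze(1)).float()           # (B, B)
    q = match / match.sum(dim=1, keepdim=True)

    # Project each embedding onto the normalized embeddings of the other modality.
    txt_norm = F.normalize(txt_emb, dim=1)
    img_norm = F.normalize(img_emb, dim=1)
    p_i2t = F.softmax(img_emb @ txt_norm.t(), dim=1)                   # (B, B)
    p_t2i = F.softmax(txt_emb @ img_norm.t(), dim=1)

    # KL divergence between predicted and ground-truth match distributions.
    loss_i2t = (p_i2t * (torch.log(p_i2t + eps) - torch.log(q + eps))).sum(dim=1).mean()
    loss_t2i = (p_t2i * (torch.log(p_t2i + eps) - torch.log(q + eps))).sum(dim=1).mean()
    return loss_i2t + loss_t2i
```

Fine-tuning then amounts to attaching such losses on top of the pre-trained encoders, with no extra architectural modules.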
(iv) Fine-grained image with a vague text.
Thanks for your valuable comment. Because existing manually annotated datasets lack sufficiently fine-grained texts, our synthetic dataset SYNTH-PEDES inevitably contains some relatively vague texts. We acknowledge that relatively vague texts limit, to a certain extent, the model's ability to distinguish very fine-grained details. However, we must note that our model has a preliminary understanding of the meaning of attributes and colors and can associate them with the related image regions. This ability to distinguish between different parts of the human body already leads to discriminative person representation learning. We believe that a weakness in distinguishing very fine details will not have a significant negative impact on person representation learning. Of course, to further enhance the model's ability to perceive details, one could manually annotate datasets with ultra-fine-grained texts. We will continue to pursue this research direction.
Thank you for your detailed response, which addressed most of my concerns. The community always appreciates the introduction of new datasets to advance research. I also experimented with the PLIP dataset for some ReID tasks and observed impressive performance under the domain generalization setting. Moreover, I cannot entirely agree with Rk98 and RKFb that the novelty of the proposed method is limited: the motivation and implementation details of the work are distinctive, even though they look similar to previous works. However, some abbreviations should be clarified, as indicated by Rk98 and RKFb. As I said, introducing a new high-quality and diverse dataset is always welcome, and I think this work is solid. Thus, I raise my score.
Dear Reviewer,
We sincerely appreciate your response and improved rating! Thank you for recognizing the significance and novelty of our work, especially our proposed dataset. If you (or any other reviewers) have any further questions, we would be more than happy to discuss and respond. Thanks a lot for your contribution to this community again!
All the best,
Authors.
A novel vision-language pre-training framework, termed PLIP, is proposed in this paper for person-centric downstream tasks: image-based Re-ID, text-based Re-ID, attribute recognition, human parsing and person search. To form PLIP, three pre-training tasks are designed: text-guided image colorization, image-guided attributes prediction and identity-based vision-language contrast. The authors also design a new caption model named SPAC to generate stylish textual descriptions for each person image. Using SPAC, they construct a large-scale synthetic person dataset with image-text pairs to validate PLIP's effectiveness. A range of experiments shows that the pre-trained models achieve state-of-the-art results on various downstream tasks, without bells and whistles.
Strengths
- The proposed PLIP is well-motivated, and the combination of three pretext tasks explicitly helps to learn fine-grained cross-modal associations. That is to say, the design of the pretext tasks takes the distinguishing features of this research field into good consideration, and it is simple and effective.
- The paper presents a new dataset called SYNTH-PEDES, which contains a whopping 4.7M person images and 12M textual descriptions, making it by far the largest person dataset with image-text pairs. This dataset will undoubtedly become an important resource in this research field.
- The authors commit to open-sourcing the code, dataset and weights. This open-source practice will have a positive impact on this research field.
- Extensive experiments provide strong support for the paper. Notably, the ablation studies and analyses are very comprehensive, investigating various settings and dataset quality evaluations.
- The improvements that the PLIP models bring to downstream tasks are substantial. Compared with existing pre-trained models, the PLIP models achieve consistently leading performance across a range of tasks and settings. In particular, PLIP brings a significant improvement to unsupervised ReID methods; for example, on the MSMT17 dataset, it improves the mAP of PPLR and ISE by 14.7% and 11.4%, respectively.
Weaknesses
- The proposed PLIP comprises three tasks, which seems somewhat complex. The popular CLIP, for example, only includes one contrastive learning task, whereas the method in this paper has three tasks, which should introduce greater computational overhead. I think it’s worth trying to make the pre-training method simpler.
- Since the identities in SYNTH-PEDES are obtained based on tracklets, and the textual descriptions are also synthetic, this dataset will contain some noisy data.
- There are some typos. For instance, the period in the ninth line should be changed to a semicolon. The authors should carefully review the full text to avoid such typos.
Questions
- For the IAP task, why don’t you use multiple binary classification heads instead of just one multi-class classification head? Please explain.
- How does the TIC task help with learning person representations? I am looking forward to a more detailed explanation.
Limitations
The authors have discussed the limitations and broader impact in detail. I hold the view that they have adequately addressed the limitations and potential negative societal impact of their work.
Thanks for your valuable comments and appreciation of our work. The concerns are answered as follows.
(i) Somewhat complex designs.
Existing general-domain multimodal pre-training techniques do not take person-related characteristics into account, so they are not well-suited for person representation learning. Therefore, we propose a novel language-image pre-training framework termed PLIP, with three elaborately designed pretext tasks. These tasks are clearly effective and do not introduce significant computational overhead.
(a) Effectiveness. The effectiveness of these three tasks has been demonstrated across a series of experiments in this paper, with each task significantly enhancing the performance of the learned representations. For instance, as shown in Table 18, under a fair comparison setup with ResNet50 as the backbone, our PLIP outperforms CLIP by 14.8% and 11.0% in Rank-1 accuracy on the CUHK-PEDES and Market1501 datasets, respectively.
(b) Computational overhead. Under the same experimental conditions, training the ResNet50 (or ViT-Small) for 70 epochs on the entire SYNTH-PEDES dataset using 4×Geforce 3090 GPUs takes approximately 12.4 days for CLIP, 18.1 days for BLIP, and 15.2 days for PLIP. It can be seen that our PLIP achieves significant performance improvement without introducing much additional training cost. Notably, when directly transferring to the CUHK-PEDES dataset, the rank-1 results of BLIP and PLIP are 44.9% and 52.9%, respectively, while the training time of PLIP is 2.9 days less compared to BLIP.
Certainly, exploring how to simultaneously improve the efficiency and performance of pre-training methods is a highly meaningful research topic. Following your suggestion, we will continue to advance this in our future work.
(ii) Noisy data.
Due to the impracticality of manually annotating all images, noisy data is a common issue for almost all large-scale image-text datasets. To keep the amount of noisy data as low as possible, we employed a series of denoising strategies to ensure the quality of the dataset. The specific details of these strategies can be found in Section A.7 of the paper. Meanwhile, we meticulously evaluated the quality of the dataset in Section 9.2 of the paper. The evaluation results indicate that the quality of our dataset is nearly on par with that of manually annotated datasets.
(iii) Typo.
Thanks for pointing this out. Following your suggestion, we will carefully review the full text to avoid such typos.
(iv) About the prediction head of IAP.
Thanks for your insightful question. The IAP task requires predicting the masked words in a textual description. Treating the entire vocabulary as the set of categories, with each word as a separate category, the IAP task amounts to selecting the most fitting word from the whole vocabulary for a given masked position. In essence, this is a multi-class classification task, and therefore we use a common multi-class classification head instead of multiple binary classification heads to predict the words.
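For concreteness, such a head is simply a single linear classifier over the vocabulary applied at the masked positions. The sketch below is an illustrative, simplified version with hypothetical names, not the exact head used in PLIP.

```python
import torch
import torch.nn as nn

class MaskedWordHead(nn.Module):
    """Single multi-class head: pick the most fitting word from the whole vocabulary."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_feats, masked_pos):
        # token_feats: (B, L, D) fused text features;
        # masked_pos: (N, 2) pairs of (batch index, token position) of masked tokens.
        masked_feats = token_feats[masked_pos[:, 0], masked_pos[:, 1]]  # (N, D)
        return self.classifier(masked_feats)                            # (N, vocab_size) logits

# Training then uses a standard cross-entropy over the vocabulary:
# loss = nn.CrossEntropyLoss()(logits, target_word_ids)
```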
(v) About the TIC task.
The pretext task of text-guided image colorization is first proposed in this paper and has proven to be very effective for person representation learning. It forces the model to understand the meaning of colors and attributes and to associate them with the related visual regions, rather than simply memorizing them. This ability to distinguish between different parts of the human body leads to more discriminative person representation learning and guarantees superior performance on many person-centric tasks.
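For intuition, the objective can be sketched as reconstructing the original colors from a grayscale input conditioned on the paired description. The decoder interface and the L1 reconstruction loss below are illustrative assumptions rather than the exact TIC design.

```python
import torch
import torch.nn as nn

def tic_loss(colorizer, rgb_images, text_emb):
    """Illustrative text-guided image colorization objective.

    colorizer:  any decoder mapping (grayscale image, text embedding) -> RGB prediction.
    rgb_images: (B, 3, H, W) original color images serving as reconstruction targets.
    text_emb:   (B, D) embeddings of the paired descriptions (which mention colors).
    """
    # Convert to grayscale with standard luminance weights, then ask the model
    # to restore the colors using only the textual description as a hint.
    weights = torch.tensor([0.299, 0.587, 0.114], device=rgb_images.device).view(1, 3, 1, 1)
    gray = (rgb_images * weights).sum(dim=1, keepdim=True)             # (B, 1, H, W)
    pred_rgb = colorizer(gray, text_emb)                                # (B, 3, H, W)
    return nn.functional.l1_loss(pred_rgb, rgb_images)
```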
Thanks for the authors' response. It addressed my concerns well. In my opinion, this paper opens up a new direction on how to utilize the language modality in learning general person representations for the community. I am confident that its novel solution and constructed large-scale dataset will provide valuable insights for subsequent research, as the authors stated in their response to Reviewer Rk98.
I originally gave this paper an "Accept". After reading the other reviews and the rebuttal, I decided to raise my rating and confidence. I believe that, considering the overall contribution of this work, it is fully deserving of acceptance by NeurIPS. It will be a pioneering work in the field of person representation learning.
We would like to thank the reviewer for their appreciation of our work, and for raising their score! If you (or any other reviewers) have any further questions, we would be more than happy to discuss and respond.
This paper introduces a Language-Image Pre-training framework, termed PLIP, designed to enhance person representation learning. To better adapt to downstream person-centric tasks, three pretext tasks are designed that pay more attention to critical person-related characteristics: Text-guided Image Colorization, Image-guided Attributes Prediction, and Identity-based Vision-Language Contrast. In addition, a large-scale person dataset named SYNTH-PEDES, generated by an image captioner named SPAC, is employed during pre-training. PLIP demonstrates remarkable ability on various downstream person-centric tasks.
Strengths
The authors highlight a significant domain gap that exists between general pre-training technologies and person-centric tasks. To address this gap, the authors suggest a language-image pre-training framework, which represents a meaningful direction.
Weaknesses
- What is the core difference between the proposed pretext tasks and Lapscore [93]? The authors assert that the proposed IAP masks attribute phrases instead of color words, as done in Lapscore. While I agree with this distinction, it seems to be a minor difference based on certain operations.
- Some of the representations are unclear and lack specificity. For instance, the abbreviations in Figure 2 are not explained either in the caption or in the main text; the method/technology represented as Baseline in Table 1 remains unclear.
- The writing should be improved. For example, the last paragraph in Section 2.1 appears to be somewhat absurd; Fig 4 and Fig 12 are the same.
- The experimental evaluation is incomplete, and certain comparisons lack logic and fairness. The proposed framework comprises a methodology and novel datasets, which require separate, meticulous verification. Otherwise, some comparisons may be unfair because the proposed framework is trained on a larger dataset, SYNTH-PEDES. Additionally, there are existing works [1-2] that aim to construct large-scale text-image person datasets, and their performance should be compared with the proposed SYNTH-PEDES dataset.
- The authors propose training an image captioner specifically for annotating person images. However, this begs the question: why not directly utilize existing state-of-the-art MLLM technologies for image annotation? Contemporary MLLMs possess exceptional modeling capabilities. The existing person image captioning methods [1-3] that have been previously proposed are also worth mentioning and discussing.
[1] 2024-CVPR-Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID
[2] 2023-ICCV-Unified pre-training with pseudo texts for text-to-image person re-identification
[3] 2023-ACMMM-Text-based Person Search without Parallel Image-Text Data
Questions
The description in line 206 ("such as the Cross-Modal Projection Matching (CMPM) loss [110] we adopt") seems to violate the blind review rules.
Limitations
N/A
Thanks for your valuable comments. The concerns are answered as follows.
(i) Core difference between PLIP and Lapscore.
In fact, there are many differences between our PLIP and Lapscore.
(a) The different core motivation. The core motivation of Lapscore is to design some modules specifically for the single task of text-based person Re-ID. Unlike Lapscore, our PLIP is not limited to a single task; instead, it is designed to learn general person representations capable of boosting various person-related tasks.
(b) The different core designs. Lapscore designs two color reasoning modules for text-based person Re-ID. The first uses an LSTM to obtain the text embedding, which is then infused into each skip connection of a UNet architecture to achieve color restoration of grayscale images. The second uses MobileNet to extract features from color images, and then employs these features together with a Bilinear Attention Network to predict the masked color words in the text. These complex designs are challenging to apply directly to large-scale pre-training. In contrast, our PLIP not only deeply considers the characteristics of persons, but also makes targeted, simple and efficient representation-learning designs for each task, making it naturally adaptable to large-scale pre-training.
(ii) Some unclear representations.
Thanks for pointing this out. The unexplained abbreviations in Figure 2 are an oversight; specific explanations can be found in our second response to Reviewer Rk98. Notably, the methods used as baselines in Table 1 are actually described in detail in the table caption.
(iii) Writing.
The last paragraph in Section 2.1 discusses the challenge of aligning fine-grained text with image regions, which is a future research direction for us. We will simplify the expression to make it more easily understandable. We will also polish the writing to avoid issues such as repeated figures.
(iv) Experimental evaluation.
We respectfully disagree that the experimental evaluation is incomplete and that certain comparisons lack logic and fairness. In fact, we have already conducted verification experiments for both the method and the dataset.
(a) We conduct comparative experiments on the pre-training method. In Table 18, to verify the superiority of our PLIP, we have pre-trained a series of models on the entire SYNTH-PEDES dataset with different pre-training methods, and compare their transfer performance on downstream datasets. The results show that our PLIP outperforms all other pre-training methods significantly on CUHK-PEDES and Market1501. Considering that the pre-training datasets are identical, these experimental results strongly verify the effectiveness of our method.
(b) We perform extensive ablation studies to verify the effectiveness of the pre-training dataset. In Section A.9.3, through three experiments, we meticulously explore the effectiveness of the various components of our dataset. A series of experimental results all indicate that pre-training on our dataset significantly enhances the learning of more generalized person representations. Considering that the pre-training methods are identical, these experimental results strongly verify the effectiveness of our dataset.
Additionally, thanks for pointing out these related works. Our work achieves comparable or leading performance compared to them, and we will include the comparative results in our future version.
(v) Advantages of specialized person image captioner and a discussion on existing methods.
It is an interesting question. We believe that using a specialized person image captioner to annotate person images has the following clear advantages over using existing MLLMs.
(a) The generated texts have higher quality. MLLMs are trained on extensive general-domain image-text datasets, which include a limited number of person images. The texts generated by MLLMs are often coarse, with low granularity, and are more likely to contain erroneous descriptions, which has been discussed in [4]. These low-quality texts are disadvantageous for model pre-training. However, by training on person-related image-text datasets, our image captioner is able to provide significantly more accurate descriptions of the appearance of persons. These accurate textual descriptions are a guarantee of our dataset's high quality.
(b) The speed of text generation is significantly faster. Existing MLLMs generally have a large number of model parameters and slow inference speeds, which can lead to considerable time costs when generating text for large-scale images. In contrast, our captioner is a small expert model that generates text quickly and is well-suited for generating text for large-scale images.
Additionally, there is a discussion of the image captioning methods.
[1] annotates images by having MLLMs fill in attribute words into human-designed templates. [2] uses CLIP to calculate the similarity between images and various attribute prompts to obtain person attribute information, which is then filled into a given template to produce final textual description. [3] first queries MLLMs for each attribute to obtain attribute information, and then either fills this information into a given template or further processes it with a language model to generate final textual description.
We can observe that these methods essentially leverage existing models to tag images with attribute labels. Since the utilized models, such as CLIP, have not been trained on specialized person data, there is a common issue of less accurate attribute tagging. In contrast, our captioning method is capable of generating more accurate and stylish textual descriptions. We will include a more detailed discussion in the paper.
(vi) Blind review rules.
After checking, we find that this sentence does not violate the blind review rules: the CMPM loss [110] comes from prior published work that we adopt, so citing it does not reveal our identity.
Reference
[4] 2024-CVPR-UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity
Dear Reviewer RKFb:
We thank you for the precious review time and valuable comments. We have provided corresponding responses, which we believe have covered your concerns.
The discussion phase has been ongoing for several days, and we have not yet heard any post-rebuttal response from you.
We would love to convince you of the merits of our paper. We hope to further discuss with you whether or not your concerns have been addressed. Please let us know if you still have any unclear parts of our work.
All the best,
Authors.
This paper introduces a person pre-training framework PLIP, which consists of three pre-text tasks: text-guided image colorization (TIC), image-guided attributes prediction (IAP) and identity-based vision-language contrast (IVLC). Furthermore, a large-scale person dataset SYNTH-PEDES is constructed to facilitate the pretraining. Extensive experiments demonstrate the effectiveness of the proposed method.
Strengths
- Good Writing: This paper exhibits an overall clear storyline. I can easily catch the motivation and the contribution of this work.
- Adequate Experiments: The experiments are comprehensive and validate the importance of pedestrian pre-training data.
Weaknesses
- Unfair Experimental Comparison: My primary concern lies in the fairness of this paper. As mentioned in line 1046, SPAC uses CUHK-PEDES and ICFG-PEDES as training datasets, while the subsequent PLIP model leverages SYNTH-PEDES, constructed using SPAC, as the training corpus. This results in information leakage from CUHK-PEDES and ICFG-PEDES into the PLIP model. Therefore, the claimed zero-shot or domain generalization comparisons are not fair, and such comparisons cannot effectively demonstrate the superiority of the proposed PLIP method.
- Undefined Abbreviations: The abbreviations SIC, VLC, and VAP are not explained in Figure 2, making their significance unclear.
- Lack of Comparison with Latest Works: The compared methods are outdated. For example, the most recent methods referenced in Tables 6 and 7 were proposed in 2022.
- Lack of Novelty: The three pretext tasks proposed in PLIP are not novel. For instance, TIC from LapsCore, IAP from APTM, and IVLC from TBPS-CLIP have already been introduced. Therefore, these tasks are unlikely to provide valuable new insights for subsequent research in this field.
Questions
In the IAP task, the paper employs attribute-phrase masking, which differs from random masking. However, attribute-phrase masking may overlook the fact that certain verbs can reflect relationships between entities. Have the authors conducted an experimental comparison between attribute-phrase masking and random masking?
Limitations
Please refer to Weaknesses and Questions.
Thanks for your valuable comments. The concerns are answered as follows.
(i) Analysis of experimental comparison fairness.
We believe some of our wording may have led to the misunderstanding that our method results in information leakage and unfair comparisons. In fact, our method does not suffer from information leakage, and a series of fair experimental comparisons has also demonstrated the superiority of our approach. The detailed explanations are as follows.
(a) We use the training sets of the CUHK-PEDES and ICFG-PEDES datasets to train our SPAC, ensuring that there is no information leakage from the test sets of these two datasets.
Specifically, in our method, the training sets from these two datasets are used to train our SPAC. Then, we employ SPAC to generate corresponding text descriptions for large-scale person images. Subsequently, we use these synthetic image-text pairs to train our PLIP model. Throughout this process, we do not utilize any information from the test sets of CUHK-PEDES and ICFG-PEDES. Therefore, when we validate the excellent performance of our method on the test sets of these two datasets, there is no issue of information leakage.
(b) A series of experimental results demonstrate the superiority of our method in promoting various person-related tasks on many other datasets.
For example, on the unsupervised person ReID task, using PPLR as the baseline algorithm, our PLIP pre-trained model significantly boosts the mAP by 5.1% on Market1501 and 14.7% on MSMT17. It is worth noting that the Market1501 and MSMT17 datasets are not encountered during our pre-training process, hence there is no issue of information leakage. This further demonstrates that our PLIP can learn highly generalized person representations.
In conclusion, our pre-training framework avoids information leakage, and various fair comparative experiments have also confirmed its superiority. Meanwhile, thanks for pointing out the possible misinterpretations. We will clarify our use of the combined CUHK-PEDES and ICFG-PEDES training sets at line 1046.
(ii) Undefined abbreviations.
Thank you for pointing this out. The undefined abbreviations in Figure 2 are an oversight. In fact, SIC should be TIC (Text-guided Image Colorization), VLC should be IVLC (Identity-based Vision-Language Contrast), and VAP should be IAP (Image-guided Attributes Prediction). We'll fix these and clarify them in the caption.
(iii) Comparison with latest works.
Firstly, the methods in Tables 6 and 7, though from 2022, are top pre-training models in person ReID. Secondly, to include a method in our experiments, it must have open-sourced its models and code, since we assess the impact of various pre-trained models on downstream unsupervised tasks, as in Table 6. Therefore, we compare against the key methods LUP, LUPNL, and PASS in these tables.
Indeed, in many of the experiments, we have compared our PLIP with the latest works, such as SOLIDER (CVPR 2023) in Table 1, and IRRA (CVPR 2023), APTM (ACMMM 2023), RASA (IJCAI 2023) in Table 5. Notably, compared to the SoTA expert models for these different tasks, our PLIP has achieved consistently competitive or leading performance.
(iv) Novelty.
We respectfully disagree that the tasks in our PLIP lack novelty and are unlikely to provide valuable new insights for subsequent research in this field.
It is worth noting that the three pretext tasks we proposed are not simply equivalent to the three works mentioned. In fact, there are significant differences between them.
(a) The core motivation significantly differs from these works. The core motivation of all the works mentioned is to design various expert models specifically for the single task of text-based person Re-ID. However, our motivation is to propose a language-image pre-training framework that advances a series of person-related tasks rather than a single task. Put differently, our PLIP, unlike these works, is not confined to one specific task but is proposed to learn general person representations that can boost a wide range of person-related tasks.
(b) The specific designs and implementation differ from these works. These works use comparably complex structures and loss-function designs to maximize performance on the single task of text-based person Re-ID. Such complex specialized designs are quite time-consuming and challenging to apply directly to large-scale pre-training. However, our PLIP not only deeply considers the characteristics of persons, but also designs targeted, simple and efficient representation-learning tasks, making it naturally adaptable to large-scale pre-training.
Furthermore, our PLIP opens up a new research direction on how to utilize language information to learn general person representations suitable for various person-related tasks. We believe that our work can provide valuable insights for subsequent research from many aspects. For example:
(a) How to generate more detailed fine-grained textual descriptions for images?
(b) How to design more effective multi-modal pre-training tasks for person representation learning?
(c) How can we add more modalities to pre-training? Does this improve model performance?
(d) How to optimize pre-training data: balancing quantity and quality, and maximizing implicit knowledge extraction?
These are the directions our team will pursue. We believe researchers in our field will build on our work to explore and answer these questions.
(v) Masking strategies.
In fact, in the IAP task, we randomly mask both attribute phrases and other non-attribute phrases, with attribute phrases being masked at a higher probability. We believe this approach encourages the model to focus more on extracting attribute information from person images. Meanwhile, we have conducted ablation experiments on different masking strategies in Section A.9.4 of our paper.
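A minimal sketch of this biased masking is given below; the probabilities and the way attribute spans are obtained are placeholders, not the values used in the paper.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_caption(tokens, attribute_spans, p_attr=0.5, p_other=0.1):
    """Randomly mask tokens, favoring attribute phrases.

    tokens:          list of word tokens of one caption.
    attribute_spans: set of token indices belonging to attribute phrases
                     (e.g. produced by an attribute parser).
    p_attr/p_other:  masking probabilities for attribute vs. other tokens
                     (illustrative values only).
    """
    masked, targets = [], []
    for i, tok in enumerate(tokens):
        p = p_attr if i in attribute_spans else p_other
        if random.random() < p:
            masked.append(MASK_TOKEN)
            targets.append((i, tok))   # positions/words the IAP head must recover
        else:
            masked.append(tok)
    return masked, targets
```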
Dear Reviewer Rk98:
We thank you for the precious review time and valuable comments. We have provided corresponding responses, which we believe have covered your concerns.
The discussion phase has been ongoing for several days, and we have not yet heard any post-rebuttal response from you.
We would love to convince you of the merits of our paper. We hope to further discuss with you whether or not your concerns have been addressed. Please let us know if you still have any unclear parts of our work.
All the best,
Authors.
We sincerely appreciate all reviewers' time and efforts in reviewing our paper.
We are glad to find that reviewers generally recognized our contributions including the meaningful and novel pre-training framework (Reviewers cgBx, bKKs, RKFb), the significance of our proposed dataset (Reviewers cgBx, bKKs), the comprehensive experiments (Reviewers cgBx, bKKs, Rk98), the practice of open-source (Reviewer cgBx) and our well-written paper (Reviewers cgBx, bKKs, Rk98).
We will open-source all the code, models and dataset. We believe that our work will make great and timely contributions to this community.
- We have pioneered the introduction of language modality into general person representation learning and verified its effectiveness through comprehensive experiments. This learning paradigm offers a solid path to more powerful general person-centric models.
- Our pre-trained models can directly improve existing person-related downstream methods to a much higher level without bells and whistles. These models will support various person-related task applications and promote the sustainable development of the technology.
- Our SYNTH-PEDES dataset with high-quality person image-text pairs is the largest by now and can facilitate research on more cross-modal person-centric domains.
We have replied to each reviewer individually to address their concerns. We noticed that discussions can continue on the OpenReview system for some time to come. If the reviewers have any questions, we would be happy to continue the discussion.
Once again we thank all reviewers and area chairs!
This paper presents a language-image pre-training framework and a large-scale synthetic dataset for person representation learning, with applications in person re-identification and search tasks.
The initial reviews were mixed. Two of the reviewers recognized the innovation of the framework, the contribution of the datasets, and the strength of the empirical results. However, the other two reviewers raised concerns about potential information leakage during training and the technical novelty w.r.t. Lapscore (Wu et al., ICCV 2023). The authors provided a rebuttal addressing these concerns. Despite this, the two critical reviewers maintained their stance and did not respond further to the rebuttal. Consequently, the final ratings remained polarized.
After a thorough review of the paper, the rebuttal, and the discussion, the AC believes that the main concerns raised by reviewers Rk98 and RKFb were adequately addressed in the rebuttal. The AC did not find compelling support in their reviews or the discussion to warrant rejecting the paper. Therefore, the decision is to recommend acceptance. The authors are encouraged to incorporate the reviewers' feedback in their final version.