ReID5o: Achieving Omni Multi-modal Person Re-identification in a Single Model
Abstract
Reviews and Discussion
For the real-world challenge in Person ReID of queries formed from arbitrary single modalities or modality combinations, which existing methods struggle to handle, this paper proposes a diverse and practical benchmark, ORBench, covering the text, RGB, infrared, colored pencil drawing, and sketch modalities. In addition, the authors propose a new method, ReID5o, for this task, which contains a unified encoder and an expert routing mechanism and supports flexible queries with arbitrary modality combinations. Extensive experiments demonstrate the advancement and practicality of the proposed benchmark and indicate the SOTA performance of ReID5o against previous works.
Strengths and Weaknesses
Strengths:
- The experiments are sufficient to demonstrate the advancement of the proposed benchmark and method.
- The proposed benchmark has great practical and academic significance, which is able to provide a high-quality multi-modal data platform for the community.
- The presentation is easy to follow.
Weaknesses:
See questions and limitations.
Questions
-
Please discuss whether there is a conflict between learning the complementary information of modalities for fusion queries and fully mining the effective information for single-modality queries, and if there is, how to conduct the trade-off?
-
Please report the space complexity, and the time cost of the training stage and inference stage, for different baselines.
-
When applying this foundation model on ReID tasks, would more parameters achieve better performance?
-
The authors removed unclear samples from the constructed dataset, but did not analyze, either through discussion or empirically, the robustness of the proposed model against possible noise in the real world.
Limitations
1. Possible approaches for enhancing the robustness of the model to real-world noise should be discussed in detail in A.7.
2. The computing cost is relatively large, which may limit its faster and wider application.
Final Justification
This work contains a novel and practical task setting, a high-quality dataset that can generalize effectively to practical datasets with noise, and a simple but effective method that outperforms other counterparts. I think it provides a valuable data resource for the whole ReID research community, though it may not perfectly reproduce real-world cases. After carefully reading comments from reviewers and the corresponding rebuttals, I acknowledge the authors' efforts to revise the manuscript and agree that this work could serve as a great start for this new area, so I will raise my score to show my support for it.
Formatting Issues
No
We sincerely thank Reviewer FsmM for the positive and constructive review. We are grateful for your recognition of our sufficient experiments, proposed benchmark, and the presentation. Below, we respond to your thoughtful questions and suggestions in detail.
Q1: "Please discuss whether there is a conflict between learning the complementary information of modalities for fusion queries and fully mining the effective information for single modality queries, and if there is, how to conduct the trade off?"
A: Thank you for your insightful comments.
In fact, there is a potential conflict between them: On one hand, to achieve complementarity in multimodal fusion (e.g., semantic-visual complementarity between texts and sketches), the model needs to learn cross-modal aligned representations, which may lead to the "dilution" of modality-specific information (e.g., unique color textures in paintings or night features in infrared images). On the other hand, over-emphasizing effective information mining for single modalities may widen feature space differences across modalities, hindering alignment during fusion and weakening complementarity.
When designing our method, we took this conflict into account and achieved a trade-off through the following strategies:
-
Multi-modal Tokenizing Assembler. To encode data from different modalities into a shared embedding space, we designed a straightforward multi-modal tokenizing assembler. It consists of five lightweight sub-tokenizers, which are used to independently tokenize RGB, IR, CP, SK, and text data respectively. This operation can, to a certain extent, preserve the differences and uniqueness of the specific information of each modality.
-
Multi-Expert Router. We utilized a shared encoder to extract modality-shared features. Meanwhile, to extract modality-specific features, we designed this multi-expert routing mechanism. This mechanism allocates a respective LoRA-based expert to each modality, promoting the mining of independent and effective information for each modality and ensuring the discriminability of features from different modalities.
-
Feature Mixture. To achieve the fusion and interaction of complementary information between different modalities, we utilized this efficient feature mixture to fuse features from different modal combinations. Its inputs are the independent features of each modality, and its output is the fused multimodal feature.
In our training: for single-modal retrieval, we directly use independent features of each modality for alignment; for multimodal combined retrieval, we use fused features after mixture. This approach balances learning multimodal complementary information and mining effective single-modal information to some extent, as verified by ablation experiments in Table 2.
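To make this concrete, below is a minimal PyTorch-style sketch of the idea of modality-routed LoRA experts on a shared projection plus a small feature mixture; the module names, dimensions, and routing interface here are illustrative assumptions for this response, not the exact ReID5o implementation.

```python
# Minimal sketch (assumed shapes/names, not the exact ReID5o code):
# each modality gets its own LoRA expert on top of a shared linear layer,
# and a small transformer mixes per-modality features for fused queries.
import torch
import torch.nn as nn

class LoRAExpert(nn.Module):
    def __init__(self, dim, rank=4):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)   # low-rank A
        self.up = nn.Linear(rank, dim, bias=False)     # low-rank B
        nn.init.zeros_(self.up.weight)                 # start as a zero residual

    def forward(self, x):
        return self.up(self.down(x))

class RoutedLinear(nn.Module):
    """Shared projection plus one LoRA expert per modality."""
    def __init__(self, dim, modalities=("rgb", "ir", "cp", "sk", "text")):
        super().__init__()
        self.shared = nn.Linear(dim, dim)
        self.experts = nn.ModuleDict({m: LoRAExpert(dim) for m in modalities})

    def forward(self, x, modality):
        return self.shared(x) + self.experts[modality](x)

class FeatureMixture(nn.Module):
    """Fuses the features of an arbitrary modality subset into one query feature."""
    def __init__(self, dim, depth=2, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, feats):                  # feats: list of (B, dim) tensors
        tokens = torch.stack(feats, dim=1)     # (B, n_modalities, dim)
        mixed = self.blocks(tokens)
        return mixed.mean(dim=1)               # single fused descriptor

# usage: fuse text + sketch features for a two-modal query
layer = RoutedLinear(dim=512)
mix = FeatureMixture(dim=512)
f_text = layer(torch.randn(2, 512), "text")
f_sk = layer(torch.randn(2, 512), "sk")
fused = mix([f_text, f_sk])                    # (2, 512)
```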
Additionally, we propose that future research in this area could focus on a "query type-aware mechanism". This would enable the model to dynamically adjust feature extraction focus based on whether the input is a single-modal or multimodal query, balancing the two more precisely. We will pursue this direction in subsequent work.
We will follow your suggestions and add this part of the discussion to the supplement in the future version.
Q2: "Please report the the space complexity, and the time cost of training stage and inference stage in different baselines."
A: Thank you for your suggestion. We report the space complexity (parameters), training time cost, and single inference forward time of each baseline in the table below.
| Method | Params (M) | Training Time (h) | Average Forward Time (ms) |
|---|---|---|---|
| CLIP | 149.6 | 4.79 | 50.37 |
| PLIP | 147.3 | 4.21 | 45.83 |
| IRRA | 189.2 | 7.02 | 50.42 |
| RDE | 152.8 | 7.78 | 50.46 |
| Meta-TF | 127.2 | 3.86 | 47.24 |
| ImageBind | 1200.8 | 18.64 | 152.2 |
| UNIReID | 149.6 | 5.42 | 50.45 |
| AIO | 166.8 | 4.92 | 47.38 |
| ReID5o | 159.2 | 5.99 | 76.13 |
The listed parameters include all module parameters of each baseline model during training, some of which may be discarded during inference. Training time is based on a unified setting of 60 epochs.
For inference time, to ensure efficient and fair comparison, a single forward inference is defined as: the model sequentially infers one sample of RGB, infrared, painted, sketch, and text, then fuses the features of the latter four into a multimodal feature (consistent with the quad-search process in the benchmark). This process was repeated 1000 times, with the average taken as the result.
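For clarity, the measurement loop can be sketched as below; `model.encode`, `model.fuse`, and the sample dictionary are placeholder interfaces assumed only for illustration, not the actual baseline APIs.

```python
# Sketch of the averaged-forward-time measurement described above.
# A CUDA device is assumed; `model.encode(x, modality)` and `model.fuse(feats)`
# are assumed placeholder interfaces, not the actual baseline code.
import time
import torch

@torch.no_grad()
def average_forward_time(model, samples, n_runs=1000, device="cuda"):
    model.eval().to(device)
    times = []
    for _ in range(n_runs):
        torch.cuda.synchronize()
        start = time.perf_counter()
        feats = {m: model.encode(x.to(device), m) for m, x in samples.items()}
        # fuse IR + CP + SK + text into one multimodal query (quad-search setting)
        model.fuse([feats[m] for m in ("ir", "cp", "sk", "text")])
        torch.cuda.synchronize()
        times.append(time.perf_counter() - start)
    return 1000.0 * sum(times) / len(times)    # average time in milliseconds
```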
As shown in the table, ReID5o not only maintains significantly leading retrieval performance but also has advantages in terms of parameter count, training time, and inference speed.
Following your suggestions, we will include comparisons of results in this regard in the supplement of future versions.
Q3: "When applying this foundation model on ReID tasks, would more parameters achieve better performance?"
A: When increasing the parameters of our model (especially the pre-trained multi-modal encoder within it), it can indeed achieve better performance on the ReID task. To verify this, we replaced the original CLIP-B/16 encoder in ReID5o with CLIP-L/14, conducted training under the same settings, and tested its performance on ORBench. The results are shown in the following table.
| Model | Params (M) | MM-1 | MM-2 | MM-3 | MM-4 |
|---|---|---|---|---|---|
| ReID5o (CLIP-B/16) | 159.2 | 58.09 | 75.26 | 82.83 | 86.35 |
| ReID5o (CLIP-L/14) | 443.8 | 62.14 | 78.32 | 85.03 | 87.64 |
Considering that the ReID task is essentially a fine-grained representation learning task, we believe that the performance improvement brought by the increase in the number of parameters is mainly attributed to the stronger and more robust representation ability of larger-scale pre-trained multimodal encoders. This phenomenon has also been demonstrated in many other papers.
Q4: "The authors removed unclear samples from constructed dataset, but didn't literally or empirically analysis the robustness of the proposed model against possible noise in the real world. The possible approaches of enhancing the robustness of the model to real-world noise should be discussed in detail in A.7."
A: Thank you for your insightful comments.
First, please allow us to elaborate on two reasons why we removed unclear samples from the existing datasets:
-
Clear, high-quality RGB samples aid in annotating accurate, feature-rich data across color paintings, sketches, and text. Since we rely on original RGB images to draw and describe individuals, unclear ones hinder the creation of high-quality, appearance-consistent data—making it hard to fully and accurately capture people's appearance features.
-
Many prior studies indicate that high-quality data enhances representation learning, and filtering unclear samples directly improves data quality. For instance, unlike CLIP, which trains on noisy samples, BLIP uses a data filtering mechanism to remove poorly matched image-text pairs—achieving better performance with fewer training samples.
We acknowledge that the aforementioned sample removal measures make our high-quality benchmark somewhat fail to reflect real-world noise, which is discussed in our A.7 Limitation.
However, our cross-dataset generalization experiments in Table 7 validate the robustness of our proposed method, as most datasets here contain obvious noise (e.g., CUHK-PEDES has many rough, identity-mismatched text descriptions). The results show that training on our ORBench yields better generalization than other datasets, with our ReID5o method also exhibiting significant superiority. This indicates that ORBench and ReID5o complement each other to enhance robustness against real-world noise.
Naturally, improving robustness against real-world noise remains an urgent research focus. We believe the following directions are worthy of research:
-
Physics-inspired noise modeling and synthesis. Real-world noise often follows physical laws, but existing data augmentation methods may deviate from the real distribution. We can construct noise models that conform to physical laws to better generate realistic noisy data. For example, by simulating the scattering effect of rain and snow on RGB images, we can enhance the model's robustness to weather-related noise.
-
Dynamic noise-aware multimodal fusion. In real-world multimodal person ReID applications, different modal data are affected by noise to varying degrees. How to dynamically allocate fusion weights based on the uncertainty of modal noise to obtain robust multimodal representations is also a research direction worth exploring.
-
Progressive denoising curriculum learning. By imitating the human learning process, we can first let the model learn basic representations from relatively clean data, and then gradually introduce noisier data. Alternatively, we can design some mechanisms to first recover a clean representation and then perform matching based on it, forcing the model to learn invariant representations in the presence of noise.
Following your suggestion, we will discuss robustness to real-world noise in detail in Section A.7 of our revised version.
Q5: "The computing cost is relatively large, which may limit its faster and wider application."
A: Thank you for your concern regarding the computing cost. We have taken it into account during the model design process, utilizing a lightweight multimodal tokenizing assembler, multi-expert router, feature mixture, and a shared encoder. For the results and discussions on this part, please refer to Q2.
Last but not least, we would like to sincerely thank Reviewer FsmM again for the valuable time and constructive feedback provided during this review.
Thanks for the detailed response from the authors; my concerns are mostly addressed. But considering the questions raised by other reviewers about the dataset construction, I choose to keep my score.
Dear Reviewer FsmM,
We are truly grateful for your recognition of this work. We will certainly incorporate the clarifications and insights discussed into the revised manuscript to ensure better clarity and completeness.
Your support and encouragement mean a lot to us!
Best regards,
The Authors of Submission 5021
This paper focuses on person re-identification and proposes a novel task, omni multi-modal person re-identification (OM-ReID), which has been underexplored. The task enables person image search via any modality or their combinations. Based on this, the authors construct ORBench, the first high-quality five-modality dataset (RGB, infrared, color pencil, sketch, text). In addition, the authors also propose a unified model named ReID5o that can effectively achieve multi-modal combined retrieval. Extensive experiments validate the task's rationality, dataset's quality/generalisability, and method's superiority. Overall, the paper makes solid contributions.
Strengths and Weaknesses
Strengths:
- This paper proposes a novel and underexplored task, omni multi-modal person ReID. Aiming to enable person search via any combination of modalities, this new task holds both research and practical application value.
- This paper presents a high-quality five-modality person ReID dataset, ORBench. With rich diversity, it is the first dataset to feature five high-quality modal representations of the same individual, effectively promoting research development in this field.
- This paper proposes a simple yet effective multi-modal joint learning framework, ReID5o. If open-sourced, this framework would be of some significance for the community to follow up on this new task.
- The authors conducted extensive ablation experiments to verify the effectiveness of each component of the ReID5o method, and performed in-depth evaluations on the high quality and cross-domain generalizability of the proposed dataset.
- The paper is fluently written, with clear elaboration of research motivations and technical details, making it easy for readers to follow.
Weaknesses:
- The RGB and infrared modality data in this paper are filtered from existing datasets. If the authors could independently collect new data for these modalities, it might have greater community influence.
- To ensure data quality, the data construction process in this paper appears to require substantial manual involvement, which somewhat limits the scalability of the dataset.
- The painting styles in ORBench seem somewhat monolithic. Enhancing the diversity of painting styles might better improve generalization ability in real-world scenarios.
Overall, this paper makes significant contributions, with its strengths outweighing the weaknesses.
Questions
- In the process of constructing the ORBench dataset, it seems that you filtered out a large number of low-quality RGB images from existing datasets. What was your motivation for doing so? Because low-quality photos often appear in real-world ReID applications.
- In Table 1, those cross-modal methods also seem to perform well. Considering that these methods only support RGB and text, how did you train and test these methods on your dataset?
- Even though some visualized qualitative comparison examples, such as Figure 7, effectively demonstrate the high quality of ORBench compared to existing multi-modal datasets, could you conduct comparisons using quantitative metrics? This would make the findings more convincing.
I would greatly appreciate it if you could address these concerns in the rebuttal.
Limitations
yes
Final Justification
Thank you for your detailed reply. I have carefully read the comments from other reviewers and the authors' responses, and most of the concerns and questions that I was interested in have been addressed. I believe the contributions of this research are solid, and both the dataset and methodology will have significant influence. I think this paper meets the acceptance criteria of NeurIPS, so I will upgrade my rating to show my support. Of course, I also hope that the authors can add some discussions to the revised version.
Formatting Issues
none
We sincerely thank Reviewer akx3 for the positive and encouraging review. We are grateful for your recognition of our ORBench dataset, ReID5o method, extensive experimental results, and the quality of writing. Below, we respond to your thoughtful questions and suggestions in detail.
Q1: "The RGB and infrared modality data in this paper are filtered from existing datasets. If the authors could independently collect new data for these modalities, it might have greater community influence."
A: The RGB and infrared data of our ORBench are manually filtered from the existing datasets. Then, through significant efforts, we have additionally expanded three modalities: color painting, sketch, and text.
Since collecting a large amount of real-world video data containing both RGB and infrared information of the same person takes a long time and involves some privacy issues, we have decided to use the existing datasets LLCM and SYSU-MM01 for these two modalities, without affecting the core purpose of the research.
We have obtained permission for the use and expansion of these two datasets through written communication with their creators, which complies with relevant regulations and requirements.
Of course, as you mentioned, collecting new RGB and infrared data ourselves will indeed increase the diversity of data in the field. We will further improve our work in this regard in future research.
Q2: "To ensure data quality, the data construction process in this paper appears to require substantial manual involvement, which somewhat limits the scalability of the dataset."
A: Considering that the development of existing technologies, especially those for fine-grained image caption generation and artistic style painting targeting persons, is still immature, we have indeed invested significant manual effort. This is to maximize the quality and diversity of the dataset, minimize noise contamination within it, and ensure that data across various modalities can accurately depict a person from multiple aspects. For further details on the efforts we have made in constructing the dataset, please refer to our response to the second comment from Reviewer h3Ed.
Certainly, as you have mentioned, we also believe that large-scale expansion of the dataset is equally important. It will enable models to learn more robust features and facilitate practical application deployment. We will strive to achieve this in our future research endeavors.
Q3: "The painting styles in ORBench seem somewhat monolithic. Enhancing the diversity of painting styles might better improve generalization ability in real-world scenarios."
A: Certainly, using more different AI painting tools would enhance the diversity of painting styles, which, to a certain extent, would be more conducive to generalization in real-world scenarios. However, most AI art painting tools tend to produce hallucinations and perform poorly when generating paintings based on low-resolution person RGB images. After multiple investigations, we had to select the only AI tool that meets the requirements, namely Doubao Art Painting.
To increase the diversity of paintings, we have made the following efforts: On the one hand, during the painting process, we manually required and supervised the model to paint the same person from three perspectives, thereby introducing perspective diversity; on the other hand, we generated multiple samples for the same person, with 18 color paintings and 18 sketches per person. Due to the inherent randomness in AI generation, this also brings about a certain degree of sample diversity.
Our aforementioned efforts have increased the diversity of the dataset to a certain extent, and the experiments in Table 7 have also demonstrated that our ORBench has better generalization ability compared to other datasets. We will attempt to use more AI painting tools in the future to enhance the diversity of painting styles in the dataset.
Q4: "In the process of constructing the ORBench dataset, it seems that you filtered out a large number of low-quality RGB images from existing datasets. What was your motivation for doing so? Because low-quality photos often appear in real-world ReID applications."
A: The question you raised has been addressed in our response to Reviewer FsmM's fourth question. We have included it below for your convenience.
First, please allow us to elaborate on two reasons why we removed unclear samples from the existing datasets:
-
Clear, high-quality RGB samples aid in annotating accurate, feature-rich data across color paintings, sketches, and text. Since we rely on original RGB images to draw and describe individuals, unclear ones hinder the creation of high-quality, appearance-consistent data—making it hard to fully and accurately capture people's appearance features.
-
Many prior studies indicate that high-quality data enhances representation learning, and filtering unclear samples directly improves data quality. For instance, unlike CLIP, which trains on noisy samples, BLIP uses a data filtering mechanism to remove poorly matched image-text pairs—achieving better performance with fewer training samples.
We acknowledge that the aforementioned sample removal measures make our high-quality benchmark somewhat fail to reflect real-world noise, which is discussed in our A.7 Limitation.
However, our cross-dataset generalization experiments in Table 7 validate the robustness of our proposed method, as most datasets here contain obvious noise (e.g., CUHK-PEDES has many rough, identity-mismatched text descriptions). The results show that training on our ORBench yields better generalization than other datasets, with our ReID5o method also exhibiting significant superiority. This indicates that ORBench and ReID5o complement each other to enhance robustness against real-world noise.
Q5: "In Table 1, those cross-modal methods also seem to perform well. Considering that these methods only support RGB and text, how did you train and test these methods on your dataset?"
A: Considering that RGB, infrared, color painting, and sketch data in the dataset essentially still belong to the visual modality, and these cross-modal methods all adopt a visual-text dual-branch architecture, we have adopted the following simple and direct approach to endow them with more multimodal retrieval capabilities:
-
For RGB, infrared, painted, and sketched modalities, we directly use a visual encoder to extract features from each image sample. Most of these models use pre-trained CLIP as the encoder, and since CLIP has been pre-trained on hundreds of millions of image-text pairs, its visual encoder is not limited to the RGB modality. Therefore, this method is simple and feasible.
-
In the training phase, we similarly take RGB as the anchor of alignment and align the features of the other modalities with the RGB features to learn a unified multimodal space mapping. In the testing phase, for unimodal retrieval, it is only necessary to calculate the cosine similarity between each query and each RGB image in the gallery; for multimodal combined retrieval, we use the sum of the similarity ranking lists of each modal query as the combined similarity ranking list (see the sketch below).
Through the above simple and direct method, we can train and test these cross-modal methods on our ORBench.
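A minimal sketch of this combined retrieval procedure is given below; feature extraction is abstracted away and the shapes and function names are only illustrative assumptions.

```python
# Minimal sketch of the combined retrieval described above: per-modality cosine
# similarities against the RGB gallery are summed to rank the gallery.
# Feature extraction is abstracted away; shapes are illustrative.
import torch
import torch.nn.functional as F

def combined_ranking(query_feats, gallery_rgb_feats):
    """query_feats: dict of modality -> (num_queries, dim); gallery: (num_gallery, dim)."""
    gallery = F.normalize(gallery_rgb_feats, dim=-1)
    total_sim = 0
    for feat in query_feats.values():
        q = F.normalize(feat, dim=-1)
        total_sim = total_sim + q @ gallery.t()        # (num_queries, num_gallery)
    return total_sim.argsort(dim=-1, descending=True)  # combined ranking list
```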
Q6: "Even though some visualized qualitative comparison examples, such as Figure 7, effectively demonstrate the high quality of ORBench compared to existing multi-modal datasets, could you conduct comparisons using quantitative metrics?"
A: Thank you for this question. A quantitative comparison can indeed better illustrate the high quality of our dataset.
Regarding the quantitative comparison of the text modality, in fact, in Figure 4, we verified the high quality of the text modality in our dataset by calculating and comparing the text information content across various datasets. The average Shannon entropy of the text data in our dataset is 5.53, which is higher than that of previous datasets. For example, CUHK-PEDES only has a Shannon entropy of 4.13.
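For reference, a word-level Shannon entropy over a caption corpus can be estimated roughly as in the sketch below; the exact tokenization and units used for Figure 4 may differ from this simplified illustration.

```python
# Sketch of word-level Shannon entropy over a caption corpus (the exact
# tokenization/units used for Figure 4 may differ from this illustration).
import math
from collections import Counter

def shannon_entropy(captions):
    words = [w for c in captions for w in c.lower().split()]
    counts = Counter(words)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(shannon_entropy(["a man in a red jacket", "a woman carrying a black bag"]))
```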
Concerning the quantitative experiment on painting data, in Figure 5, we verified the high quality of the painting data in our dataset through manual evaluation, and it excels in both identity consistency and perspective conformity. The proportions of samples with identity consistency and perspective conformity scoring 3 or higher are 97% and 91.7% respectively.
Considering the current lack of effective automated evaluation methods for assessing the quality of painting data, we acknowledge that it is difficult for us to conduct a direct quantitative comparison of painting quality with other painting datasets at this stage. However, it can be clearly seen from Figure 7 that the painting quality of our dataset is significantly better than that of other datasets.
We guarantee that our dataset is of high quality, and we will make it publicly available to facilitate verification by the research community.
Last but not least, we would like to sincerely thank Reviewer akx3 again for the valuable time and constructive feedback provided during this review.
Thanks for the reply. Considering that the ratings are not very balanced, I also have carefully read the comments from other reviewers and respective responses. Most of the concerns that I am interested in have been addressed. I believe the contributions of this work are solid, and both the dataset and method will have significant influence. In my opinion, this work meets the acceptance criteria of NeurIPS, so I will raise my score to show my support for it.
Dear Reviewer akx3,
We are truly grateful for your recognition of the solid contributions of our work, as well as your belief in the significant influence of our dataset and method. We will continue to polish the content with great care, incorporating all relevant feedback to enhance the quality and rigor of the work.
Your support and encouragement mean a lot to us!
Best regards,
The Authors of Submission 5021
This work addresses the largely overlooked problem of Omni Multi-modal Person Re-identification (OM-ReID), which involves retrieving a person using any combination of five modalities: RGB, infrared, sketch, color pencil, and text. To support this task, the authors present ORBench, a high-quality dataset featuring all five modalities across 1,000 identities. Additionally, they propose ReID5o, a unified framework that employs a multi-modal tokenizing assembler and a multi-expert routing mechanism to facilitate flexible modality fusion and alignment.
Strengths and Weaknesses
Strengths:
1 The paper presents ORBench, a novel five-modality dataset characterized by its high diversity and quality, providing a valuable foundation for advancing research in more practical and versatile multi-modal ReID settings.
2 ReID5o employs a simple yet effective feature mixture strategy with multi-head self-attention and transformer blocks to combine inputs from arbitrary modality combinations. This design facilitates interactive information exchange, allowing the model to harness complementary cues without complex architectural modifications.
Weaknesses:
1 Most modalities in the dataset introduced by this paper are artificially generated rather than captured from real-world sources. Specifically, the Sketch and Color Pencil images are derived from the original RGB images through repeated transformations. This raises questions about the model’s ability to train effectively and generalize well, as the dataset may not fully capture the complexity of real-world multimodal perception.
2 Although the authors introduce a new dataset, ORBench, most of the baseline methods are re-implemented and fine-tuned on this dataset by the authors themselves. However, the paper lacks cross-dataset transfer evaluations on established benchmarks, making it unclear whether ReID5o's superior performance stems from genuine model effectiveness or potential biases inherent in the dataset design.
3 ReID5o incorporates several modality-specific tokenizers, a shared transformer encoder, and a multi-expert routing mechanism, resulting in increased model complexity. However, the paper lacks a thorough discussion of the trade-offs between efficiency and performance during training and inference, as well as the model's scalability under constrained hardware resources—factors that are critical for real-world deployment.
4 To construct multimodal query combinations, the authors randomly sample data from different modalities but of the same identity. This approach can result in overly consistent intra-identity pairs, potentially reducing the realism and difficulty of the evaluation. In particular, when the sketch, color pencil, and text are all derived from the same RGB image, the model may exploit superficial shortcuts instead of truly demonstrating cross-modal generalization capabilities.
Questions
Please refer to Weakness.
Limitations
Please refer to Weakness.
Final Justification
Thank you for the author's response. The author has addressed some of my concerns. However, after carefully reviewing the comments of other reviewers, I agree with reviewer h3Ed regarding the dataset. I also remain skeptical about whether constructing a dataset based on artificially generated modalities, rather than raw image modalities (visible/infrared), can fully capture the complexity of real-world multimodal perception. Therefore, I will maintain my original score.
Formatting Issues
N/A
Thanks for your valuable comments on our work.
First, we would like to explain the research motivation and contributions of our work.
In real-world practical applications, when searching for a target person, the actual witness data used as queries is usually multi-source and multi-modal. Data from different modalities typically focus on different aspects of the target person's features. In practice, the number of modalities in such witness data is often uncertain, which raises a thought-provoking question:
For queries with an uncertain number of modalities, how can we facilitate feature complementarity between any different modal data to achieve an accurate portrayal of the target person and thereby improve the accuracy of the search?
To promote in-depth research on this issue, our work makes the following contributions:
-
We propose a new task with practical value and challenges, namely omni multi-modal person re-identification (OM-ReID). This task requires achieving effective retrieval with varying multi-modal queries and their combinations in the ReID model, which remains largely unexplored in previous research.
-
We construct the first high-quality five-modal person ReID dataset named ORBench, where each person is accurately and comprehensively characterized by data from five modalities. Considering that existing datasets usually contain only a small number of modalities and are of poor quality, we have supplemented three additional modalities of data—namely painted, sketch, and text modalities—based on existing RGB-infrared datasets through extensive manual annotation and intervention. This dataset can serve as a valuable research platform to advance the development of the field.
-
We propose a simple and efficient multi-modal person ReID framework for retrieval with any combination of modalities named ReID5o, serving as a baseline method for the task and dataset. Through a multi-expert routing and multi-modal feature fusion mechanism, this method can effectively achieve feature information complementarity between any modalities, thus obtaining more accurate retrieval results.
We believe that the task, dataset, and method, as an integrated whole, complement each other. They simulate real-world practical application scenarios and can provide a foundation and inspiration for subsequent research on multi-modal person ReID.
Now, we would like to address your specific concerns through the following detailed responses.
Q1: "Most modalities in the dataset introduced by this paper are artificially generated rather than captured from real-world sources."
A: We acknowledge that some modalities in our dataset are artificially generated, but we have taken rigorous measures to ensure all modalities closely align with real-world scenarios and capture their inherent complexity.
-
For RGB and infrared modalities: The data in ORBench are filtered from two existing public datasets, LLCM and SYSU-MM01. These datasets contain RGB and infrared images directly collected from real-world scenes, with diverse distributions in illumination, weather, and scenarios, ensuring the pedestrian appearance features are real and representative of real-world variations.
-
For the text modality: The descriptions are manually annotated by 26 human annotators following strict guidelines, ensuring rich language styles, accurate reflection of real pedestrian attributes, and consistency with real-world semantic expression.
-
For sketch and color pencil modalities: These data are AI-generated with heavy human intervention, not via repeated RGB transformations. Instead, they mimic real creation: AI produces multi-view sketches and drawings by referencing real RGB images (appearance) and detailed text (semantics). Humans refine outputs to match real sketch/pencil traits (e.g., preserving key features, simplifying non-essentials, matching textures) rather than replicating RGB content. This labor-intensive process captures real-world sketch/pencil perception complexity.
Our experiments confirm the dataset's effectiveness and generalization. As shown in Figure 6, the model trained on our dataset leverages multimodal complementarity to achieve accurate retrieval, demonstrating adaptability to real-world multimodal variations. Additionally, Table 7's dataset generalization experiments show our dataset outperforms others, indicating our dataset helps to learn robust representations transferable to real-world complexity.
Q2: "Most of the baseline methods are re-implemented and fine-tuned on this dataset by the authors themselves. The paper lacks cross-dataset transfer evaluations on established benchmarks."
A: Regarding the re-implementation and fine-tuning of baseline methods, it is worth noting that this study focuses on the novel task of person ReID across arbitrary modalities and their combinations, with ORBench being the first five-modality dataset for it. Since both are new, existing baselines lack training/evaluation under such multimodal settings and have no reusable weights. Therefore, to verify the effectiveness of ReID5o within a unified experimental framework, we re-implement existing methods and perform fine-tuning on ORBench. This is a necessary step in verifying the effectiveness of new methods in pioneering work in this field, ensuring fair and relevant comparisons.
Regarding cross-dataset transfer evaluation, we have actually conducted relevant experiments specifically in the paper (see Table 7 for details). We transferred the ReID5o model and baseline methods trained on ORBench to multiple existing cross-modal benchmarks for evaluation. The results clearly show that ReID5o still maintains significant advantages in cross-dataset scenarios, and the models trained on the ORBench dataset significantly outperform those trained on other datasets in terms of transfer performance. This result not only demonstrates the generalization ability of ReID5o but also verifies the rationality of the ORBench dataset, ruling out the impact of dataset design biases on performance.
Q3: "The paper lacks a thorough discussion of the trade-offs between efficiency and performance during training and inference, as well as the model's scalability under constrained hardware resources."
A: Regarding the trade-off between model efficiency and performance, we have actually conducted a systematic consideration in our research and verified it through experiments. The specific explanations are as follows:
In fact, in terms of the trade-off between efficiency and performance, we specifically compared the impact of different multi-expert routing designs on model performance and computational overhead in Table 3.
The experimental results show that the selected trade-off parameters not only achieve the best performance improvement but also only bring a small increase in params and FLOPs (please refer to Table 3 for specific values and experimental explanations). This result directly reflects the importance we attach to the balance between efficiency and performance in the design.
From the perspective of model architecture design, the efficiency of ReID5o is mainly reflected in three aspects:
-
The shared transformer encoder significantly reduces redundant computations in multi-modal scenarios and avoids parameter explosion caused by designing separate encoders for different modalities;
-
The lightweight tokenizer controls additional overhead while ensuring modal adaptability by simplifying the modal feature mapping process;
-
The LoRA-based multi-expert routing mechanism realizes parameter-efficient fine-tuning through low-rank matrix factorization, which significantly reduces the computational complexity of the routing module compared with traditional multi-expert models.
The above designs enable the model to maintain high performance while having good hardware adaptability, and it can still run in environments with limited computing resources.
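As a rough, back-of-the-envelope illustration of the third point, the sketch below compares the per-layer parameter cost of a LoRA expert with that of a separate dense projection under assumed dimensions (hidden size 512, rank 4); the numbers are illustrative only, not the exact ReID5o configuration.

```python
# Back-of-the-envelope parameter count for a LoRA expert versus a full extra
# projection, under assumed dimensions (dim=512, rank=4); numbers are illustrative.
dim, rank, n_modalities = 512, 4, 5

full_per_layer = dim * dim                      # a separate dense projection
lora_per_layer = 2 * dim * rank                 # low-rank A (dim x r) + B (r x dim)

print(f"full projection per modality/layer: {full_per_layer:,} params")
print(f"LoRA expert per modality/layer:     {lora_per_layer:,} params "
      f"({full_per_layer / lora_per_layer:.0f}x fewer)")
print(f"all {n_modalities} experts per layer: {n_modalities * lora_per_layer:,} params")
```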
Q4: "To construct multimodal query combinations, the authors randomly sample data from different modalities but of the same identity, resulting in overly consistent intra-identity pairs. The model may exploit superficial shortcuts instead of truly demonstrating cross-modal generalization capabilities."
A: First, our task focuses on omni multi-modal person ReID, whose core goal is accurate retrieval via any modality combinations. This task inherently requires multi-modal query combinations to maintain identity consistency—a necessity, not an artificial constraint. The aim is to enable the model to leverage complementary information across modalities, constructing more comprehensive pedestrian representations for real-world cross-modal search. This aligns with practical scenarios where users query based on diverse modal data of the same target.
Second, our experimental design rigorously verifies cross-modal generalization. We separated the dataset into 600 training and 400 test identities, ensuring independent division to evaluate performance on unseen data. Additionally, cross-dataset experiments in Table 7 explicitly demonstrate the model learns essential inter-modal associations (not dataset-specific superficial features), confirming strong cross-scenario generalization.
Finally, significant modal differences in our dataset prevent over-reliance on superficial shortcuts. Sketch and color painting modalities include multi-perspective samples with visual characteristics distinct from RGB images; text descriptions use varied styles to ensure richness and diversity. These designs force the model to learn intrinsic inter-modal connections, instead of relying on superficial similar features, thus effectively improving the model's cross-modal generalization ability.
We would like to sincerely thank Reviewer AoA7 again for the valuable time and feedback provided.
Dear Reviewer AoA7,
Thank you once again for your valuable time!
In response to your concerns, we detailed our dataset's realism and generalizability, provided explanations regarding cross-dataset transfer evaluation and the trade-off between model efficiency and performance. We also clarified the rationality of identity-consistent multi-modal query combinations in our task.
We sincerely appreciate your feedback, and hope our detailed responses will address these concerns of yours.
We remain actively engaged in the Author–Reviewer Discussion phase and would be happy to provide further clarification if needed.
Best regards,
The Authors of Submission 5021
Thank you for the author's response. The author has addressed some of my concerns. However, after carefully reviewing the comments of other reviewers, I agree with reviewer h3Ed regarding the dataset. I also remain skeptical about whether constructing a dataset based on artificially generated modalities, rather than raw image modalities (visible/infrared), can fully capture the complexity of real-world multimodal perception.
Dear Reviewer AoA7,
Thank you for your feedback. We are pleased that you have taken note of our in-depth discussions with other reviewers, particularly Reviewer h3Ed, and recognize their viewpoints as constructive. In fact, we fully agree with the core assessment by Reviewer h3Ed regarding the "idealized" nature of our dataset, and we have elaborated on this in detail and committed to making revisions in our response to them.
Taking this opportunity, we would like to provide a more focused explanation regarding the key question you raised—"whether it can fully capture the complexity of real-world multimodal perception"—in conjunction with our response to Reviewer h3Ed's comments.
Artificially Generated Modalities Are Grounded in Real-world Data
We acknowledge that modalities like sketches, color pencil drawings, and text are not "raw" sensor data (e.g., visible/infrared images). However, their generation is deeply rooted in real-world inputs and human perception, ensuring they reflect genuine complexity:
-
Sketches and color pencil modalities: As detailed earlier, these are not arbitrary AI creations. They are generated by referencing real RGB images and human-annotated text descriptions. Human annotators then refine these outputs to mimic how humans actually sketch or draw—simplifying non-essential details, preserving key features, and matching texture characteristics of real hand-drawn art. This process directly mirrors real-world scenarios where sketches/drawings are derived from observed or described subjects.
-
Text modality: These descriptions are not synthetically generated by AI. They are manually annotated by 26 human annotators following strict guidelines, based on direct observation of real pedestrian images. The annotations reflect natural language use and capture the variability of human semantic expression—exactly how people describe others in real life.
In short, these artificially generated modalities are derivatives of real-world data, designed to simulate the types of multimodal inputs that are commonly used in practical ReID scenarios (e.g., querying via witness descriptions or sketches when no surveillance footage exists).
Raw Sensor Modalities Are Foundational, But Multimodal Perception Requires More
The raw visible/infrared modalities in our dataset (from LLCM and SYSU-MM01) are critical—they anchor the dataset in real-world sensor data, capturing variations in illumination, weather, and viewpoint that are hallmarks of real surveillance environments. However, real-world multimodal perception is not limited to raw sensors. For example:
-
Law enforcement often uses text descriptions from witnesses or sketches from forensic artists to identify suspects, even when no visible/infrared footage is available.
-
Cross-modal retrieval (e.g., text-to-image, sketch-to-image) is a core practical need, as queries rarely come in the same modality as gallery data.
Our dataset includes these "non-raw" modalities precisely to address this gap: they are not replacements for raw sensor data but essential complements, enabling the study of how to align these diverse inputs with real sensor data—a challenge that real-world systems must solve.
Our Goal: To Build a Basic Research Platform Rather Than Perfectly Reproduce the Real World
We fully acknowledge that it is difficult for any dataset to fully capture the infinite complexity of the real world. Our core goal is not to perfectly reproduce all the noise and uncertainty in the real world—a goal that is extremely ambitious—but rather to provide an unprecedented, controllable, and high-quality basic research platform for studies on multimodal person ReID.
As we elaborated in our response to Reviewer h3Ed, the current field lacks a large-scale dataset that spans five modalities with highly aligned semantic information across modalities. Our work aims to fill this gap. Such an "idealized" or "clean" setup is a deliberate design choice, with the following purposes:
-
Isolate the Core Scientific Problem: Our primary goal is to enable researchers to focus on the core algorithmic challenge of multimodal feature alignment and fusion. By conducting experiments in a highly aligned "clean" environment, we can effectively eliminate complex interfering factors such as modal mismatches, data noise, and annotation errors that are prevalent in the real world. This allows the success or failure of an algorithm to be more directly attributed to its design, thereby accelerating the exploration of fundamental theories and models.
-
Provide a High-Quality Foundation: In the era of deep learning, a large-scale "clean" dataset with high data quality serves as an ideal starting point for learning robust and general feature representations. When models are pre-trained on our dataset, they can first acquire the most essential and purest correspondences between different modalities. This powerful foundation model can then be fine-tuned to adapt to noisier and more specific real-world scenarios, enabling it to address challenges in practical applications with lower costs and higher efficiency.
-
Establish a Fair Evaluation Benchmark: Prior to our work, the field lacked a unified standard capable of evaluating the performance of five key modalities simultaneously. Our dataset provides the community with a stable, reproducible, and fair arena. All new algorithms can be compared on the same controlled starting line, ensuring the fairness of evaluation. This not only facilitates the objective measurement of technological progress but also sets a clear performance baseline for future research, promoting the healthy development of the entire field.
Experimental Evidence: Our "Idealized" Dataset Can Generalize to "Noisy" Real-World Scenarios
Regarding your concern about the validity of a "dataset constructed based on artificially generated modalities," we would like to re-emphasize the cross-dataset generalization experiment presented in Table 7. This experiment was specifically designed to address the question of whether models trained on an idealized dataset can be applied to the complexity of real-world scenarios.
The experimental results demonstrate that models trained on our ORBench can be transferred to other more "noisy" datasets that are closer to real-world scenarios (such as those containing real hand-drawn sketches or vague text descriptions) and achieve competitive performance. This strongly proves that:
The dataset we constructed can help models learn generalized and robust multimodal features, which can be effectively transferred to more complex and noisy real-world scenarios.
Our Committed Actions
We deeply understand your concerns and fully agree with Reviewer h3Ed's incisive analysis regarding "forensic sketch" versus "viewed sketch." To ensure that the contributions and limitations of our work are clearly communicated to readers, we will keep our promise and, in the revised manuscript:
-
Explicitly elaborate on the "idealized" assumptions adopted in the construction of our dataset.
-
Candidly discuss that our dataset does not "fully capture" all challenges in real-world forensic scenarios, such as significant information loss or errors caused by memory biases.
-
Clearly position our work as providing a "starting point" and "foundation" for the field, rather than an "endpoint" that solves all practical problems.
We believe that, as noted by Reviewer akx3, this work, as a high-quality five-modal benchmark, contributes by providing the community with a solid starting point. We hope that by clarifying the boundaries of our dataset and demonstrating its generalization ability through experimental evidence, we can effectively address your concerns.
Thank you again for your valuable comments. These discussions will undoubtedly make our paper more rigorous and refined.
Best regards,
The Authors of Submission 5021
The paper introduces a new dataset for multi-modal person re-identification, ORBench, and a novel multi-modal reidentification framework, ReID5o.
ORBench is constructed by combining images from two existing public person re-ID datasets, SYSU-MM01 [46] and LLCM [57], and extending these with additional multi-modal annotations: pencil sketches, color sketches and text descriptions. The pencil and color sketches are created from the original images using (commercial) image-to-image AI software tools. The text descriptions have been manually created by paid workers. Additional manual curation has been applied to remove unreliable samples, resulting in 1000 people, 45k RGB, 45k text, 26k IR and 18k color/grayscale sketches. Compared to other Re-ID datasets, ORBench is unique in its inclusion of all of these modalities together, and its large amount of color and grayscale images.
ReID5o is a multi-modal transformer trained to retrieve a person from a reference set using multi-modal observations/descriptions of that person. The multi-modal inputs are first tokenized by modality-specific encoders, and then processed by a transformer. The transformer uses low-rank (LoRA) decomposition for feature extraction, where the LoRA transformation is specific to the input modality associated with each token. Finally, a Feature Mixing stage fuses information from all provided input modalities to produce a single descriptor. The paper presents experiments to show ReID5o outperforms various baseline methods on ORBench. It also shows strong dataset generalization of ORBench-trained models compared to models trained on other datasets, and includes ablation studies on ReID5o components, and a study on the strength of the different modalities.
Strengths and Weaknesses
Strengths
- The proposed ORBench dataset combines more diverse modalities (rgb, ir, text, grayscale & color pencil) compared to other ReID datasets, allowing new multi-modal settings to be tested.
- A new multi-modal architecture is proposed, ReID5o, which combines existing insights in a sensible manner. A single ReID5o model can perform retrieval on a single modality, a few modalities, or all modalities. It outperforms multiple other multi-modal retieval methods on ORBench.
- An extensive ablation study is included to validate the components of ReID5o, and also inspect the impact of the different modalities in ORBench.
- Additional experiments show improved generalization of the ORBench-trained model compared to models trained on other datasets; the paper also includes qualitative results to support the findings. Overall, the evaluation on the new dataset is thorough.
- In general the paper is clear and easy to follow.
Weaknesses
Methodological details are unclear
-
the proposed Feature Mixture module line 218: "For each multimodal features Z^mod, we perform a sequential combinatorial traversal."; I don't understand this line. As I understand it, Z^mod here is the feature vector from a single modality (since mod \in {R,I,C,S,T}). I don't understand how a single feature can be processed combinatorially.
-
As I understand it, the network is used to generate feature vectors from the (multi-modal) inputs; I assume the setup is as in a standard image retrieval task: the network turns a query into a descriptor, and compares it to descriptors of an available reference dataset. I did not find this explicitly stated in Method section 3. Instead, the section refers to "identity classification", which made me think the network was trained to classify a fixed set of person identities known at train time (which would not generalize to new identities). Only at the experimental setting was it made explicit that training identities and test identities are different, hence I believe this is not a classification task but a metric learning task. If it is indeed a retrieval task, then the used retrieval procedure should be stated explicitly (e.g. Euclidean or cosine comparison, any top-k results voting, etc.). Also, the size of the resulting feature used for retrieval (from Section 3.4, I believe) should be stated for reproducibility.
Unclear how laborious creating the dataset was
Papers that introduce new datasets often represent a significant engineering and data curation effort, and sharing such data represents a valuable service to the research community. In return, as reviewers we should expect less methodological novelty for such papers.
At first glance, this paper appears to introduce a new large dataset representing significant labour that allows the community to better study some real-world phenomena. But upon closer inspection, the new dataset mostly combines and extends existing data sources. The novel annotations that have been added are generated by "off the shelf" AI models.
The most laborious effort seems to be the text descriptions. It is stated that 45k text descriptions were created, one from each RGB image, by hired workers (line 119), but Figure 7 shows pretty similar description formats. Did the annotators use a tool to help them with this, and did this not introduce other biases and repeated phrasing? Might the workers not produce lower quality descriptions for the last 5k compared to the first 5k descriptions? More details on the annotation setup should be provided. I also wonder why AI tools were used for the drawing modalities, but no LLMs were used to generate the text descriptions.
It might sound pedantic, but understanding the effort to build such a dataset is important to judge the significance and value of the dataset towards the community, and also how critical to review the dataset's realism and design choices.
This brings me to my next concern:
How realistic is the dataset, and have meaningful choices been made?
I am missing this higher-level motivation for adding new modalities as has been done in this work:
- What real-world setting(s) are exactly being simulated?
- Were these AI-generated annotations added simply because they appear to look good, or do these types of sketches reflect the intended real-world use cases?
- In the intended use cases, can we expect all modalities to have the same level of detail and completeness? What kind of variance within and across modalities can we expect?
Currently, it is unclear what practical use cases the dataset exactly targets, and if the new annotations represent these use cases well, also in terms of data variance. My overall impression is that the main motivation for the design choices is pragmatic: the experiments show the new dataset improves generalization to other ReID datasets.
There is some discussion on the realism in the appendix (A.7 "Limitation"), but I believe the paper should be explicit in its goals, assumptions, and simplifications when introducing the data in the paper itself.
Ethical concerns
First off, let me acknowledge that the authors address several concerns in the appendices:
- Appendix A.5 "Broader Impact Discussion" discusses the risks of such surveillance technology, and encourages researchers using their data to follow the law; it is also explained that the data is released only for research purposes, not commercial use.
- Appendix A.6 "Controlled Release" indicates that the authors plan to release the data in a controlled manner, requiring users to "adhere guidelines and restrictions" to access the data and code.
Nevertheless, I have several concerns that might require closer inspection by an ethics expert:
-
Privacy of people in the benchmark. The paper relies on already public benchmarks, and does not by itself make claims about the privacy of the individuals. However, by extending and updating the benchmark, and publishing it at a highly visible venue such as NeurIPS, I believe the paper should make explicit how it established that the used data follows all guidelines. This may require explicitly stating how the used public benchmarks protected privacy, and how those policies are consistent with the NeurIPS Ethics Guidelines. If those public datasets did not make clear how such privacy concerns were addressed, then extending those datasets might not be suited for a NeurIPS publication. I find the current argument in the appendices A.5 & A.6 too weak: all responsibility to avoid privacy violations is delegated to the end users of the dataset and models complying with the law. But the paper does not make any attempt itself to check for such violations.
- Bias and discrimination. This research is clearly about identifying human subjects, yet I did not find any discussion of potential bias in the dataset, risks of discrimination, risk of predicting protected categories, etc. These areas (bias & fairness, discrimination, predicting protected categories in surveillance) are explicitly mentioned in the ethics guidelines, but not discussed. I would expect, at a minimum, an indication of what kind of population the data represents and what type of variation among people it contains. This too might require checking the information from the original public datasets, and here too I wonder whether those datasets should be extended in a NeurIPS paper if their compliance with its guidelines cannot be verified. Furthermore, I am not sure whether the human text annotations would make it possible to look for "protected categories" as the ethics guidelines state. Could this research enable surveillance technology that retrieves individuals of a particular ethnicity, age, or disability using a simple text query?
- Answers to questions in the NeurIPS checklist do not align with what can be read in the paper submission
- Question 12 "Licenses of existing assets" -> authors answered "Yes", but there is no mention of the license of the datasets on which their proposed ORBench builds. There is also no (intended) license mentioned for ORBench.
- Question 13 "New assets" -> the authors answered "[N/A]". The paper does introduce new assets by extending the dataset with new annotations. None of the guidelines to Question 13 have been considered: No details on the license of the dataset, no discussion on how consent was obtained from people in the dataset, no anonymization.
- Questions 14 & 15: the authors reply "Our paper does not involve crowdsourcing nor research with human subjects", but the paper is about having descriptions of human subjects; also, the paper states that external manual labour was used to generate textual descriptions (line 119) and to perform a public evaluation of the dataset quality (line 348).
I am not saying these necessarily are ethics violations, but my impression is that the answers do not acknowledge what the checklist and NeurIPS guidelines are for. I believe it would be good if an ethics expert reviews the submission.
Questions
See my discussed Weaknesses for more detail:
- Technical: how are the features exactly fused in the Feature Mixture module?
- What are the use cases / motivation for adding the selected modalities?
- How laborious was creating this dataset, and what design choices were considered?
- To what extent have you actively checked if the public datasets, on which ORBench is built, comply with the NeurIPS ethical guidelines?
Limitations
Some limitations are discussed in the Appendix, which is good. Still, I list several ethical concerns in the Weaknesses above, which also touch on societal issues. I would encourage the authors to discuss bias & fairness of the data, and whether this research could lead to discrimination, e.g., by searching for people of a particular race, sex, age, etc.
Final Justification
The discussion in the rebuttal phase has answered some of my questions, but also confirmed some of my concerns, so I stick to the lower ratings on significance and originality: some limitations and key discussions are missing from the original submission. The methodology is also not particularly innovative, but it does appear effective.
I can still see arguments for a reject: (1) a lot of necessary updates to the text have been identified and promised, but this could warrant a new review round to see if they are properly implemented; (2) the involved expert ethical reviewers might still find the rebuttal insufficient, in which case I believe the paper should also be rejected and resubmitted.
Nevertheless, I have updated my rating one step now, as I expect the necessary improvements will address my main concerns sufficiently.
Formatting Issues
No concerns
Thanks for your valuable comments. Now we respond to your concerns as follows.
Methodological details are unclear
Q1: "How are the features exactly fused?"
A: Thanks for identifying an ambiguity.
To clarify, the combinatorial traversal is performed not on individual modal features but on the entire set of multimodal query features. Specifically, we generate all possible combinations from single-modal to quad-modal (i.e., the $C_4^1$, $C_4^2$, $C_4^3$, and $C_4^4$ combinations of the four query modalities, 15 in total). These are then concatenated and processed by the Feature Mixture module.
The example in Lines 221-225 (IR+Text fusion) illustrates this process for one dual-modal combination. We will revise the text to resolve this confusion.
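To make the traversal concrete, here is a minimal sketch of the idea under our own assumptions; it is not the authors' actual implementation. Names such as `FeatureMixture`, `traverse_combinations`, and the modality keys are hypothetical, and the mixer is reduced to a single linear projection over a zero-padded concatenation.

```python
# Hypothetical sketch of the combination traversal described above.
from itertools import combinations
import torch
import torch.nn as nn

QUERY_MODALITIES = ["ir", "cp", "sk", "txt"]  # infrared, color painting, sketch, text
DIM = 512  # CLIP-B/16 feature size mentioned in the answer below

class FeatureMixture(nn.Module):
    """Toy mixer: projects a concatenation of up to four modal features back to DIM."""
    def __init__(self, dim=DIM, max_modalities=4):
        super().__init__()
        self.max_modalities = max_modalities
        self.proj = nn.Linear(dim * max_modalities, dim)

    def forward(self, feats):
        # feats: list of (B, DIM) tensors for one modality combination.
        # Zero-pad the missing slots so every combination shares the same projection.
        padded = feats + [torch.zeros_like(feats[0])] * (self.max_modalities - len(feats))
        return self.proj(torch.cat(padded, dim=-1))

def traverse_combinations(query_feats, mixer):
    """query_feats: dict mapping modality name -> (B, DIM) tensor.
    Returns fused features for every non-empty modality combination (15 in total)."""
    fused = {}
    for r in range(1, len(QUERY_MODALITIES) + 1):          # single- to quad-modal
        for combo in combinations(QUERY_MODALITIES, r):
            fused[combo] = mixer([query_feats[m] for m in combo])
    return fused

# Usage example:
# feats = {m: torch.randn(8, DIM) for m in QUERY_MODALITIES}
# fused = traverse_combinations(feats, FeatureMixture())   # 15 fused query features
```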
Q2: "Detailed explanations on task setup, retrieval procedure and feature size."
A: The task is essentially a fine-grained multimodal image retrieval task that generalizes to unseen identities (not included in the training set). This is a metric learning task rather than an identity classification task.
In training, we follow common practice by using training-set identity labels to guide the representation learning, with two losses:
- The primary loss is the similarity distribution matching (SDM) loss, a typical metric learning loss that aligns cross-modal features by minimizing the KL divergence between cross-modal similarity distributions and normalized label-matching distributions.
- A supplementary identity classification loss serves as an auxiliary objective to enhance feature discriminability, using identity labels for multi-class classification of each multimodal sample.
In inference, the classification layer is entirely discarded. Retrieval is performed by computing cosine similarity between the query feature and all gallery features.
For the feature size, we use CLIP-B/16 as the encoder (consistent with prior methods) with its default size of 512.
As suggested, these details will be explicitly stated in the revised version.
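For concreteness, the sketch below illustrates the two-loss training setup and the cosine-similarity retrieval step as described above. It is a hedged approximation, not the paper's confirmed implementation: the temperature, epsilon, function names, and tensor shapes are assumptions, and the SDM term simply follows the KL-divergence description given in the answer.

```python
# Hedged sketch of the training objectives and inference step described above.
import torch
import torch.nn.functional as F

def sdm_loss(query_feats, gallery_feats, labels, temperature=0.02, eps=1e-8):
    """Similarity Distribution Matching: KL divergence between the cross-modal
    similarity distribution and the normalized label-matching distribution."""
    q = F.normalize(query_feats, dim=-1)
    g = F.normalize(gallery_feats, dim=-1)
    sim = q @ g.t() / temperature                        # (B, B) cross-modal similarities
    p = F.softmax(sim, dim=1)                            # predicted matching distribution
    match = (labels.unsqueeze(1) == labels.unsqueeze(0)).float()
    t = match / (match.sum(dim=1, keepdim=True) + eps)   # normalized label-matching distribution
    return (p * (torch.log(p + eps) - torch.log(t + eps))).sum(dim=1).mean()

def id_loss(logits, labels):
    """Auxiliary identity classification loss (the classifier is discarded at inference)."""
    return F.cross_entropy(logits, labels)

def retrieve(query_feat, gallery_feats, top_k=10):
    """Inference: rank gallery items by cosine similarity to the (possibly fused) query feature."""
    q = F.normalize(query_feat, dim=-1)
    g = F.normalize(gallery_feats, dim=-1)
    return (q @ g.t()).topk(top_k, dim=-1).indices
```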
Unclear how laborious creating the dataset was
It took months of intensive effort to ensure the quality and diversity of ORBench; it is not a crude combination or extension of existing resources. Key workflow details:
- Manual filtering of existing infrared-RGB datasets. Some datasets (notably LLCM) contain noisy samples with extremely poor imaging quality, where facial features are barely distinguishable to the naked eye—unsuitable for accurate annotation and lacking sufficient learnable features. Initial attempts at automatic filtering via resolution thresholds failed, as resolution does not strictly correlate with facial clarity. Thus, we adopted manual screening: for each identity, all images were reviewed individually, and low-quality samples were removed.
- Manual fine-grained text annotation for RGB images. We developed strict annotation guidelines specifying requirements for objective, unbiased and explicit appearance descriptions, followed by training annotators (including trial sessions) to ensure compliance. The task was completed by 26 annotators over approximately one month. Our team then rigorously reviewed all annotations, with inappropriate descriptions returned for revision.
- AI-assisted generation of paintings and sketches with human oversight. AI assistance was chosen for two reasons: manual drawing demands high expertise, and large-scale manual creation is time-prohibitive. After testing multiple tools, we selected Doubao's artistic painting tool. The workflow involved: manually selecting the clearest RGB images per viewpoint; adding detailed manual text descriptions; feeding these to the AI to generate color paintings; and having human reviewers inspect outputs (with substandard ones recreated). Sketches were derived via style conversion from these paintings. This hybrid approach ensured efficient production of high-quality data.
Q3: "Did annotators use tools, and did this introduce biases or repeated phrasing?"
A: Annotators strictly followed guidelines for manual annotation without tools. Specialized reviewers mitigated potential biases. While some repetition is inevitable, the 26-annotator team enhanced linguistic diversity, reducing redundancy.
Q4: "Might later descriptions be lower quality than earlier ones?"
A: Specialized reviewers manually evaluated all annotations, with substandard entries returned for revision—effectively preventing quality degradation.
Q5: "Why use AI for drawings but not LLMs for text?"
A: Text descriptions require detailed references to RGB images, which pure LLMs cannot process. MLLMs, while capable of image description, struggle with low-resolution person images and fine-grained details, failing to generate satisfactory results. Thus, we opted for manual annotation for text. For drawings, AI (with human oversight) efficiently produced high-quality outputs, making it the practical choice.
How realistic is the dataset, and have meaningful choices been made?
Q6: "What real-world setting(s) are exactly being simulated?"
A: Our five-modal dataset and task setup align with practical needs, explicitly simulating real-world scenarios. A key example is public safety: when searching for a target, investigators often use multi-modal information—such as low-resolution surveillance footage, witness descriptions, and sketches/portraits derived from these descriptions. Effectively leveraging their complementarity to characterize targets for accurate identification remains a highly practical research topic. Additionally, real-world scenarios naturally involve variable numbers of modalities (from single to multiple). Our work addresses retrieval under arbitrary modal combinations, enhancing practical relevance.
Q7: "Were these AI-generated annotations added just to look good, or do such sketches reflect intended real-world uses?"
A: These modalities were included based on practical value rather than superficial visual appeal. In real scenarios, after witnesses provide textual descriptions, more precise drawings or sketches are often created to improve search accuracy, with subsequent retrieval combining all available modalities. Our dataset faithfully reproduces such practical cases.
Q8: "In intended use cases, will all modalities have the same detail and completeness? What variance can we expect within and across them?"
A: In intended use cases, variations in detail and completeness across modalities are inherent—this is a core challenge in real-world multimodal ReID, which our dataset deliberately incorporates.
Regarding intra-modal variance: RGB/infrared data exhibit fluctuations in color, contour, and clarity due to lighting and scene conditions; sketches/paintings show variability from AI generation randomness and multi-perspective design; texts vary in descriptive content due to stylistic differences among the 26 annotators.
Inter-modal differences appear in information type and granularity (Figure 7): visual modalities focus on spatial features (RGB with color, infrared with temperature, paintings more abstract), while texts are semantic-symbolic with lower detail granularity.
These differences reflect real-world realities and highlight the challenges our dataset presents to practical algorithms.
Ethical concerns
Q9: "Privacy of people in the benchmark"
A: We clarify that both the original benchmarks and our extended data strictly adhere to the guidelines, with explicit privacy safeguards.
For SYSU-MM01, we verified its creators obtained written privacy licenses from all captured pedestrians, explicitly permitting use of their images for scientific research and academic publication.
For LLCM, its creators used MTCNN for facial blurring to anonymize personal identifiers. We confirmed this meets standard anonymization criteria, aligning with NeurIPS' privacy risk minimization requirements.
Both datasets have usage agreements with privacy protection provisions; access requires strict adherence and formal signing. Additionally, we obtained written permission from both datasets' creators to modify and extend them.
Q10: "Bias and discrimination"
A: Our data comes from public datasets complying with local laws, with measures ensuring fairness and avoiding bias or discrimination. Sourced from Sun Yat-sen and Xiamen Universities, it covers their campus communities, with balanced age and gender distribution. The datasets contain no biases or specific protected groups and comply with NeurIPS guidelines. Our human text annotations, under strict review, are objective and unbiased, unable to identify protected categories. Moreover, due to rigorous data review and compliance, our work will not retrieve individuals by ethnicity, age, or disability.
Q11: "Answers to questions in the NeurIPS checklist do not align with what can be read in the paper submission"
A: For Question 12: We communicated in writing with dataset creators, obtained permission for data use and extension, and strictly complied with their licenses. Additionally, we've specified the research license (CC BY-NC-SA 4.0) on OpenReview. We will also follow your suggestion to explicitly note the licenses in the supplement.
For Question 13: Since we did not directly submit the asset (e.g., via an anonymized URL) for this submission, we marked [N/A]. For further details, please see earlier statements.
For Question 14&15: We initially understood these two questions to refer to large-scale, direct human experiments (e.g., distributing mass questionnaires). As our research is unrelated, we marked [N/A]. We confirm strict adherence to relevant guidelines and that all workers were paid adequately. For further details, please see earlier statements.
We would like to sincerely thank Reviewer h3Ed again for the valuable time and feedback.
Dear Reviewer h3Ed,
Thank you very much for your thoughtful review of our submission!
Following your helpful suggestions, we elaborated on the details of the method, the intensive efforts made in dataset construction, and the higher-level motivation. We have also thoroughly discussed relevant ethical concerns and proposed measures for privacy protection and preventing technical abuse.
Your insights were instrumental in improving the clarity and completeness of our work.
We are actively participating in the Author–Reviewer Discussion phase and would be happy to provide further clarification if needed.
Best regards,
The Authors of Submission 5021
Thank you for your extensive answers, and the update on my ethical considerations.
Rebuttal: "Methodological details are unclear" and "Unclear how laborious creating the dataset was"
Thank you for clarifying, I am satisfied by the rebuttal's answers to these questions.
Rebuttal: "How realistic is the dataset, and have meaningful choices been made?"
Here I still see a major problem though.
Authors: " In real scenarios, after witnesses provide textual descriptions, more precise drawings or sketches are often created to improve search accuracy, with subsequent retrieval combining all available modalities."
This is what I expected, but I couldn't fully confirm this use case from the submission. This exactly brings me to the next statement in the rebuttal:
Authors: "Our dataset faithfully reproduces such practical cases."
This is what I don't agree with (and thus why I asked): If the setting is to use witness descriptions and police drawings from those witness accounts, then I expect AI generated sketches from the RGB images do not properly represent the intended practical use case.
See for example [Klare'10], where Figure 1 illustrates the difference between "viewed sketches" (those made when viewing (a picture of) a subject) versus good/bad-quality "forensic sketches" (based on witness reports). Using this terminology, your stated goal as I understand it is to match "forensic sketches"; "real-world scenarios only involve forensic sketches" (quote from [Klare'10]). But forensic sketches often do not really look like the corresponding people in the real photographs! Your AI-based sketches are much closer to the human-made "viewed sketches" of [Klare'10], capturing nuanced facial properties beyond what an eyewitness report would (I expect eyewitness accounts would also yield inaccurate/incomplete/ambiguous text descriptions, by the way).
This is also why I asked about the variances that you intend to capture: your dataset unfortunately does not capture real-world variations between forensic sketches and actual surveillance camera footage, as far as I can tell.
I still see a risk that a system developed and tested on your dataset does not necessarily translate to good performance in real-world settings with real witness sketches, which also has risks. In my view the paper submission, and unfortunately also the rebuttal, still lacks reflection on this aspect by stating that the "dataset faithfully reproduces such practical cases".
Let me be clear that I appreciate that a lot of effort went into creating the sketches, as you clarified for my other questions. I also understand that obtaining real-world "forensic sketches" for such datasets is not really feasible. I therefore completely understand that AI-based techniques have been employed. My impression is that your approach presents an okay starting point, adding "viewed sketches" to large-scale multi-modal datasets for the research community, since currently even that is missing.
Overall, the paper should be explicit about what it does and does not achieve, and should not overclaim how well it reproduces reality.
[Klare'10]: Klare, Brendan, Zhifeng Li, and Anil K. Jain. "Matching forensic sketches to mug shot photos." IEEE Transactions on Pattern Analysis and Machine Intelligence 33.3 (2010): 639-646.
Rebuttal: "Ethical concerns"
I see there are now separate discussions with two expert Ethical Reviewers. I will adjust my evaluation of this aspect primarily on their assessment of the concerns. I will leave my thoughts on this aspect in a comment on the thread of the first Ethical Reviewer, to keep the discussion contained.
Dear Reviewer h3Ed,
We sincerely thank you for your insightful feedback and for engaging in this constructive discussion. Thanks for raising a crucial point regarding the reality of our dataset, specifically concerning the distinction between our AI-generated sketches and real-world "forensic sketches." This is a valuable observation, and we appreciate the opportunity to clarify our position and detail the revisions we will make to the manuscript.
Our dataset is indeed constructed based on idealized assumptions, making it difficult to fully reflect the phenomenon of mismatches between queries and targets in real-world scenarios.
First and foremost, we agree with the reviewer's core assessment. Our dataset was constructed based on an assumption that is indeed idealized.
Specifically, we premised our data generation on the following condition: "Given comprehensive information from multiple eyewitnesses, it is possible to form a complete and accurate textual description of the target person. Subsequently, with the witness's assistance for refinement, a precise sketch or color painting of the target can be created."
We concede that this represents a best-case scenario. As the reviewer correctly points out, drawing a parallel to the terminology in the excellent reference [Klare'10], our generated sketches more closely resemble "viewed sketches" (high-fidelity depictions) rather than "forensic sketches," which are often incomplete or inaccurate due to the fallibility of human memory.
In fact, our dataset reproduces an idealized practical case. However, we believe it still holds significant value. Its primary contribution is to provide a large-scale, high-quality, five-modality dataset where the query modalities (infrared, sketch, color painting, text) are tightly aligned with the gallery RGB images. To our knowledge, such a foundational resource for studying multi-modal feature alignment in a controlled environment was previously missing in the field.
Given the practical constraints of dataset construction, such idealized assumptions are necessary: building a dataset that fully simulates real-world application scenarios would require the participation of thousands of people.
The primary reason for adopting this idealized approach was feasibility. As the reviewer astutely understands, creating a large-scale dataset of true "forensic sketches" paired with corresponding surveillance images is a monumental undertaking. It would require the participation of thousands of individuals (as mock witnesses and artists), introducing immense logistical, financial, and ethical challenges. Such an effort, while valuable, was beyond the scope of our current resources. Our AI-driven approach, therefore, represents a pragmatic and necessary starting point to enable research in this multi-modal space.
The cross-dataset evaluation experiments in Table 7 of the paper provide evidence that our "clean" dataset can generalize well to "noisy" real-world scenarios.
A key concern raised by the reviewer is that "a system developed and tested on your dataset does not necessarily translate to good performance in real-world settings." This is a valid risk. However, we have evidence to suggest that models trained on our dataset can, to a certain extent, generalize to datasets that better reflect real-world noise and variability.
We would like to draw the reviewer's attention to Table 7 in our manuscript, which presents cross-dataset evaluation results. In these experiments, a model trained on our dataset was tested on existing text-to-image and sketch-to-image Re-ID datasets. Many of these datasets were collected manually and contain significant misalignment and noise between query and gallery items, thus better reflecting the challenges of "forensic sketches" or ambiguous text descriptions. Our model achieved competitive performance, suggesting that the feature representations learned on our "clean" dataset are robust enough to provide a strong foundation that can handle the domain shift present in more realistic, noisy data.
Dear Reviewer h3Ed,
Thank you sincerely for your continued dedication and valuable time invested in reviewing our manuscript. We greatly appreciate the thoughtful feedback you've provided throughout this process.
In response to your remaining concerns, we have conducted in-depth discussions on how our dataset simulates real-world scenarios, with particular attention to cases involving imperfect matches between queries and targets. Additionally, we have prepared a detailed and careful response addressing the ethical considerations you raised regarding potential bias and misuse.
Your insights have been instrumental in helping us enhance the clarity and comprehensiveness of our work. We would include these discussions into the revised manuscript to strengthen its quality.
As we remain actively engaged in the Author–Reviewer Discussion phase, we are eager to confirm whether our responses have adequately addressed your concerns. Please do not hesitate to let us know if further clarification would be helpful.
Best regards,
The Authors of Submission 5021
Dear authors, Thank you once again for your detailed responses, and your willingness to discuss and reconsider.
I agree that the dataset is still valuable in its current form, and also that Table 7 provides evidence of the dataset's ability to generalize to noisy real-world settings.
I believe your proposed updates to the text (which outline the intended real-world task, the design considerations, and the pros/cons of the selected approach towards achieving the intended task), would strengthen the paper by providing more context, and insight into the challenges faced by the research field. I have no further questions on this point.
Dear Reviewer h3Ed,
We would like to extend our sincerest gratitude for your thorough review and your willingness to engage in such a detailed, multi-round discussion. Your insightful comments have been instrumental in improving our work.
We are delighted to know that our responses have resolved your concerns, particularly regarding the ethical considerations and the simulation of real-world scenarios. We also value your recognition of our earlier clarifications on methodology details and dataset construction efforts.
We sincerely appreciate the trust you have placed in us. As promised, we will meticulously revise the manuscript to incorporate the proposed changes, ensuring the text is phrased with the nuance and reflection you suggested.
Thank you once again for your dedication and constructive guidance.
Best regards,
The Authors of Submission 5021
We will explicitly state in the revised manuscript that our dataset does not fully capture the variance and imperfections of real-world scenarios.
To be perfectly clear, we agree with the reviewer's conclusion. We will explicitly state in the revised manuscript that our dataset does not fully capture the variance and imperfections of real-world scenarios. Specifically, we will:
- Revise the dataset creation section to explicitly state the idealized assumption and introduce the "viewed sketch" vs. "forensic sketch" dichotomy, citing [Klare'10]. We will clarify that the identity consistency across modalities in our dataset is intentionally high, and our query modalities are, in essence, high-quality "viewed sketches" and their equivalents, not "forensic sketches."
- Add a dedicated paragraph in the Limitations section to discuss the points raised above, clearly delineating what our dataset does and does not achieve concerning real-world forensic applications. We will acknowledge that the challenge of matching low-quality, ambiguous, or partially inaccurate "forensic sketches" is a critical but different problem that our dataset does not directly address.
We concur with the reviewer's positive framing that our work "presents an okay starting point." This was precisely our intention. By providing the community with a large-scale, high-quality, multi-modal benchmark, we aimed to establish a foundation. We believe our dataset can catalyze research in several ways: as a powerful pre-training resource, as a testbed for developing ideal feature alignment algorithms, and as a baseline from which future research can explicitly tackle the domain gap between "viewed" and "forensic" data. We hope our work encourages future efforts to build upon this by introducing more realistic noise and variations.
Once again, we are grateful for the reviewer's detailed and constructive feedback, which will undoubtedly strengthen our paper. We hope our response and proposed revisions have adequately addressed their concerns.
Best regards,
The Authors of Submission 5021
The paper received two accept ratings, one borderline accept, and one borderline reject. Reviewers praised the dataset contribution while criticizing its realism. Reviewers also raised ethical concerns during the review. The authors provided a rebuttal, which addressed most of those concerns (generalization to real-world data, etc.). The AC acknowledges that the ethical issue exists across the whole person ReID community and is not specific to this paper. The dataset contribution can facilitate more research along this direction: applying more advanced deep learning frameworks to traditional problems. Thus the AC recommends accepting the paper.
The open-source ReID5o project code (https://github.com/Zplusdragon/ReID5o_ORBench) lacks the training strategy and the training code. I attempted to reproduce the results using the training code of the IRRA [1] or RDE [2] methods, but the final results obtained differ significantly from those presented in the paper. The mAP of the single-modal setting is only around 51, while in the paper it is around 58, a difference of nearly 7 percentage points. The bi-modal, tri-modal, and quad-modal results also differ by approximately 2 percentage points.
Please make the complete training strategy and training code publicly available. Thanks a lot!
[1] Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval.
[2] Noisy-Correspondence Learning for Text-to-Image Person Re-identification.