Empowering Visible-Infrared Person Re-Identification with Large Foundation Models
Abstract
Reviews and Discussion
The authors aim to tackle the challenge of lacking detailed information in the infrared modality by employing foundation models. Their proposed method includes an Incremental Fine-tuning Strategy (IFS) and Modality Ensemble Retrieving (MER). These techniques enhance the representation of the infrared modality through automatically generated textual descriptions, thereby lowering the cost of text annotations and boosting the performance of cross-modality retrieval.
Strengths
1) The authors explore a viable solution to enhance VI-ReID performance using readily available foundation models.
2) The solution is well-conceived, and the experiments are comprehensive.
3) The paper is well-structured, featuring clear diagrams and lucidly presented ideas.
4) The appendix provides detailed information about the methodology, and the extensive experiments effectively validate the proposed approach.
Weaknesses
- The main content lacks a description of the data generation process. It is recommended to replace the baseline description with details of the data generation process.
- The task setting should be introduced in the introduction, which can adequately support the rationale for using text enhancement in cross-modality retrieval tasks, improving the quality of the paper.
- The font size in Figure 2 and the tables needs adjustment.
- Several writing errors in the paper need correction.
Questions
The results of YYDS in Tables 3 and 4 do not align with those reported in the original paper. The authors need to provide a further explanation for this discrepancy.
Limitations
The authors are encouraged to add a description of limitations to their paper.
We appreciate the positive feedback regarding the clear architecture figure and the feasibility and soundness of our method. We also appreciate the constructive criticism aimed at improving certain aspects of our writing.
Q1: Explain the discrepancy between the YYDS results in Tables 3 and 4 and those reported in the original paper.
A1: We primarily explored the performance of existing models on the proposed automatically expanded tri-modality datasets, at the mainstream image resolution of 144×288, and under the newly proposed task setting. Consequently, YYDS was tested on our data and setting, yielding results that differ from those in the original paper. Extensive experiments demonstrate that our method performs better and is more robust under the proposed task.
Q2: The main content lacks a description of the data generation process.
A2: Thanks for the suggestion. We will add an introduction to the generator fine-tuning process for text generation in the main text to ensure coherence and readability.
Q3: The task setting should be introduced in the main text.
A3: Thanks for the suggestion. We will add a detailed explanation of the task setting to the introduction.
Q4: The font size in Figure 2 and the tables needs adjustment. Several writing errors in the paper need correction.
A4: We will correct the noted writing errors and adjust the font size in Figure 2 and the tables to enhance readability.
Q5: Authors are encouraged to add descriptions of limitations to their papers.
A5: The limitations were initially detailed in Appendix D.
- The quality of the generated text can indeed affect model performance, particularly when the quality of the source images used for generation or the capabilities of the generators are suboptimal.
- However, even on the challenging LLCM and the lower-resolution RegDB, where the generated descriptions are not completely accurate, our method still achieves improved performance. This demonstrates the robustness of our method against inaccuracies in descriptions.
- To provide more valuable insights to the community, we will also add a discussion of potential ways to improve the quality of the generated text, such as a progressive generation strategy and image augmentation for fine-tuning VI-ReID-specialized description generators.
After carefully reading this rebuttal, I raise my score and am inclined to accept this paper.
- This paper investigates a feasible solution to empower the VI-ReID performance with off-the-shelf foundation models. The solution is reasonable.
- This paper is sufficiently innovative and insightful for VI-ReID.
- The experiments in this paper are sufficient and reproducible, and I look forward to the authors' open-source release.
We deeply appreciate your positive feedback. It is gratifying to see our method for enhancing VI-ReID with foundation models recognized as both innovative and feasible. We will release our code and data to ensure that our work can be reproduced and can contribute to the VI-ReID research community.
This paper proposes a text-enhanced VI-ReID framework driven by Foundation Models (TVI-FM). VI-ReID often lags behind RGB-based ReID due to the inherent differences between modalities, particularly the absence of information in the infrared modality. This paper enriches the representation of the infrared modality by integrating automatically generated textual descriptions. Extensive experiments on three expanded cross-modal re-identification datasets demonstrate significant improvements in retrieval performance.
Strengths
This paper is a good attempt to use textual information from heterogeneous modalities to enhance cross-modal retrieval performance. This paper is methodologically sound, clearly presented, and able to provide the following contributions:
a). The proposed text-enhanced VI-ReID framework driven by Foundation Models (TVI-FM) enriches the representation of infrared modality with the automatically generated textual descriptions, reducing the cost of text annotations and enhancing the performance of cross-modality retrieval.
b). This paper develops an Incremental Fine-tuning Strategy (IFS) to employ LLM to augment textual descriptions and incorporate a pre-trained LVM to extract textual features, leveraging modality alignment capabilities of LVMs and feature-level filters generated by LVMs to enhance infrared modality with information fusion and modality joint learning.
c). Extensive experiments demonstrate that the proposed method improves retrieval performance on three expanded cross-modality re-identification datasets, paving the way for utilizing LLMs in downstream data-demanding tasks.
Weaknesses
a). Some key elements in the appendix should be included in the main text, such as obtaining a multimodal model capable of generating text from two visual modalities and the definition of the new task.
b). The introduction is somewhat lengthy and verbose, making the method appear repetitive. It should be simplified to refine the key ideas, avoiding repetition in the method overview.
c). There are some grammatical errors, and the tenses are inconsistent. The authors should further strengthen the correctness of their writing.
Questions
According to the description in Task Settings, the method in this paper utilizes text information from heterogeneous modalities of the same individual during testing, which aligns more closely with real-world conditions. This appears to be a new test setting, and the authors should further elaborate on what constitutes "real-world conditions" and discuss its plausibility.
Limitations
The structure of the paper requires adjustments, particularly in refining details within specific methods. Additionally, the insights should be clarified and made more understandable.
We are grateful for the positive recognition of the soundness and clear presentation of our framework, and we also appreciate your detailed comments aimed at improving our writing. We believe our revisions will address your suggestions. Thank you for the valuable feedback guiding these improvements.
Q1: How testing settings align with "real-world conditions"?
A1: Humans perceive objects as visible images, and eyewitness descriptions are based on these perceptions. These descriptions, rich in information complementary to infrared modalities, serve as auxiliary clues for retrieval. Given the variability in eyewitness descriptions of the same target, our task setting allows any description of visible images of the same identity to be used with infrared features for retrieval, mimicking the varied visual perceptions of human eyewitnesses.
Q2: Some key elements in the appendix should be included in the main text.
A2: Thanks for the suggestion. We will add details of the proposed task setting and an introduction to the generation process into the main text to ensure coherence and readability.
Q3: The introduction is somewhat lengthy and verbose, making the method appear repetitive.
A3: We will streamline the introduction as suggested to reduce redundancy and focus more on the key ideas and innovations of our approach.
Q4: There are some grammatical errors, and the tenses are inconsistent.
A4: We will make revisions to correct all grammatical errors and ensure consistent use of the simple present tense throughout the document.
Q5: The structure of the paper requires adjustments, particularly in refining details within specific methods. Additionally, the insights should be clarified and made more understandable.
A5: We will adjust the structure of the paper and add detailed motivation, rationale, and insights into the confusing parts of the methodology, ensuring that the details and insights of our method are clearly communicated and easily understandable.
I am satisfied that this rebuttal adequately addresses the concerns in the review. Most of the writing issues are also well addressed based on the author's rebuttal. After considering the authors' responses and the feedback from other reviewers, I have decided to raise my evaluation and endorse the acceptance of this paper.
(1) The methods presented in the paper achieved favorable results with comprehensive experiments.
(2) The paper is comprehensible with clear motivation and ideas. The proposed method is also considered interesting by the other reviewers with sufficient innovation.
Based on the above points, I give a score of STRONG ACCEPT.
Thanks for your encouraging review. We are pleased to know that our responses have addressed your concerns. Your recognition of the competitive results and innovation of our method further motivates us. We will make our code and data available to ensure reproducibility and to facilitate further development in the field.
This paper incorporates a pretrained multimodal language vision model (LVM) to extract textual features and incrementally fine-tune the text encoder to minimize the domain gap between generated texts and original visual images. Meanwhile, to enhance the infrared modality with text, this paper employs LLM to augment textual descriptions. Furthermore, the authors introduce modality joint learning to align features of all modalities. Additionally, a modality ensemble retrieving strategy is proposed to consider each query modality for leveraging their complementary strengths to improve retrieval effectiveness and robustness. Adequate experiments verify the validity of the method.
Strengths
The authors utilize existing multimodal models to enhance cross-modal retrieval performance. The method is sound, and the experiments adequately demonstrate the effectiveness of the proposed approach. PROS:
- This paper leverages Language Vision Models (LVMs) to automatically generate textual modality, which enriches the representations of the infrared modality and reduces the cost of human annotation.
- The proposed Incremental Fine-tuning Strategy (IFS) and Modality Ensemble Retrieving (MER) can improve the robustness and accuracy of existing VI-ReID systems by complementing the infrared modality with information from generated text.
- The experimental results show that the proposed method achieves a significant gain over the SOTA.
Weaknesses
- This paper is not easy to follow. The method is not introduced clearly, and some key steps are not detailed. The motivation and advantages of each module in the paper should be clarified further to enhance understanding.
- It's important to outline the limitations of existing text-assisted ReID methods like YYDS and highlight the differences and advantages of this paper's method in comparison.
- The authors should provide further clarification about the motivation behind the new testing setting.
- There appears to be an imbalance in the content distribution among several modules. It is recommended that the authors adjust these accordingly.
- The authors should thoroughly discuss the challenges that remain unresolved by current text-assisted VI-ReID methodologies. Furthermore, they need to clearly outline how their approach addresses these issues comprehensively.
- The quality of the generated text descriptions may affect the model's performance; if the generated texts are inaccurate, retrieval performance may degrade.
- How are the voting scores obtained? They should be expressed with a formula.
- There are some typos in this paper: 1) "... utilizing textwe employ..." in line 131 should be "... utilizing text, we employ..."; 2) "conbined with" in line 179 should be "combined with".
Questions
See Weaknesses.
Limitations
This paper focuses only on how the proposed method works but does not analyze why it is effective. In addition, the limitations of the method are not discussed.
We are grateful for your recognition of the adequate experiments and the soundness and competitive performance of our work. We also appreciate the constructive comments on the motivation, rationale, and content balance, which are valuable for improving our writing.
Q1: Motivation and advantages of each module.
A1: For better presentation, we adjusted the framework as shown in Figure 1 of the submitted rebuttal PDF. LLM augmentation is now part of "Text Generation," and an elaboration of the proposed method is included in the rebuttal PDF for better understanding. The motivation and advantages of each module are as follows:
- Text Generation. Existing methods rely on fixed manual annotations for training, which incurs labor and time costs and makes them sensitive to text variation. Our method uses fine-tuned language vision models to automatically generate text and employs LLM-based random rephrasing to create dynamic descriptions for training, enhancing our framework's robustness against text variation.
- Incremental Fine-tuning Strategy (IFS) includes a Fusion Module and Modality Joint Learning to integrate text into the infrared modality for improved cross-modal retrieval. Existing methods require prior information such as pre-defined color vocabularies to align complementary text-visual information, and use complicated architectures with many additional parameters for information extraction and fusion, causing potential information loss. Our method creates fusion features at the feature level through arithmetic operations and fine-tunes the LVM with an end-to-end ReID loss to jointly align the semantics of all modalities without prior information, mitigating the fusion-visible discrepancy and achieving more accurate cross-modal retrieval.
- Modality Ensemble Retrieving (MER). Features of different modalities focus on distinct information. Motivated by this, we make full use of the features from all query modalities to form ensemble representations, boosting retrieval accuracy and robustness in challenging retrieval cases (a minimal sketch follows below).
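For illustration, a minimal Python sketch of the MER idea described above; the function name, the score-summation rule, and the dictionary layout are our assumptions rather than the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def ensemble_retrieve(query_feats: dict, gallery_feats: torch.Tensor) -> torch.Tensor:
    """query_feats maps a modality name ('infrared', 'text', 'fusion') to a (D,) feature;
    gallery_feats holds (G, D) visible gallery features.
    Returns gallery indices ranked by the summed per-modality cosine similarity."""
    gallery = F.normalize(gallery_feats, dim=-1)
    scores = torch.zeros(gallery.size(0), device=gallery.device)
    for feat in query_feats.values():
        scores = scores + gallery @ F.normalize(feat, dim=0)  # cosine score from this modality
    return scores.argsort(descending=True)
```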
Q2: Comparison between existing text-assisted ReID methods like YYDS and this paper's method.
A2:
YYDS:
- Manually collected descriptions for images.
- Prior information (e.g., pre-defined color vocabularies) for aligning complementary text-visual information; a complicated architecture with many additional parameters for information extraction and fusion, causing potential information loss.
- Fixed auxiliary descriptions for training, leading to sensitivity to text variation.
Our framework:
- Automatically generated text from visible and infrared images.
- A feature-level fusion module without additional parameters. An end-to-end ReID loss fine-tunes the LVM, guiding semantic alignment across all modalities without prior information, significantly mitigating the fusion-visible discrepancy and achieving more accurate cross-modal retrieval.
- LLM-based random rephrasing to create dynamic descriptions for training, improving robustness against textual variation.
Q3: Motivation behind the new testing setting.
A3: Humans perceive objects as visible images, and eyewitness descriptions are based on these perceptions. These descriptions, rich in information complementary to infrared modalities, serve as auxiliary clues for retrieval. Given the variability in eyewitness descriptions of the same target, our task setting allows any description of visible images of the same identity to be used with infrared features for retrieval, mimicking the varied visual perceptions of human eyewitnesses.
Q4: Imbalance of content distribution among several modules.
A4: For a better content balance, we will reorganize the content on the task definition and text generation, and add more details for each module.
Q5: Challenges unresolved by current text-assisted VI-ReID methodologies. How the proposed approach addresses these issues comprehensively?
A5:
- Sensitivity to text variation. Our method utilizes LLM-based random rephrasing to create dynamic text for training, significantly enhancing robustness against text variation.
- Difficulty integrating text-infrared information and aligning fusion and visible features. By fine-tuning the LVM with an end-to-end ReID loss to align semantics across all modalities, we simultaneously achieve text-vision alignment for better infrared compensation and fusion-visible alignment for more accurate retrieval.
Q6: The quality of generated text descriptions may affect the retrieval performance.
A6:
- The quality of the generated text affects model performance when the quality of the source images used for text generation or the capabilities of the generators are suboptimal.
- However, even on the challenging LLCM and the lower-resolution RegDB, where the generated descriptions are not completely accurate, our method still achieves improved performance. This demonstrates the robustness of our method against inaccuracies in descriptions.
Q7: The limitations are not introduced.
A7: We discussed the limitation regarding the impact of text quality in Appendix D. A more detailed discussion and directions for future improvements will be added in the revision.
Q8: Express the voting scores with a formula.
A8: The scores are defined as follows.
The score equivalently represents the cosine similarity between the visible feature and a concatenated infrared-text-fusion feature, increasing the feature dimension. This higher-dimensional space increases the distance between identities, enhancing identity discrimination.
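One formalization consistent with this description, assuming all sub-features are L2-normalized (the symbols and the summation form are our illustration, not necessarily the paper's exact formula):

$$ s(q, g) \;=\; \sum_{m \in \{\mathrm{ir},\,\mathrm{txt},\,\mathrm{fus}\}} \cos\!\big(f^{q}_{m},\, f^{g}_{\mathrm{vis}}\big) \;\propto\; \cos\!\big([f^{q}_{\mathrm{ir}};\, f^{q}_{\mathrm{txt}};\, f^{q}_{\mathrm{fus}}],\; [f^{g}_{\mathrm{vis}};\, f^{g}_{\mathrm{vis}};\, f^{g}_{\mathrm{vis}}]\big), $$

where $[\cdot\,;\cdot]$ denotes concatenation.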
Q9: There are some typos in this paper.
A9: We will revise all mentioned and other found typos as suggested.
This paper addresses a new problem by applying foundation models to VI-ReID tasks and offers a feasible solution for the field. The proposed approach is both innovative and effective, as demonstrated by extensive experiments. In the initial version, the reviewers have provided some suggestions to improve the writing of the manuscript. I believe the authors have provided a good rebuttal to address the concerns. The overall structure and clarity of the paper would be greatly improved in the final version. Given above strengths and the authors' rebuttal, this paper should be accepted. It is worth sharing with the community for utilizing large foundation models in specific down-stream tasks. It would be a good start for this field.
We are grateful for your insightful comments and your recognition of the foundation models' innovative application in the VI-ReID tasks. As suggested, we will further enhance our manuscript and are committed to sharing code and data with the community to foster further research.
Visible-infrared person re-identification often underperforms due to the significant modality differences, primarily caused by the absence of detailed information in the infrared modality. This paper investigates a feasible solution to empower the VI-ReID performance with off-the-shelf foundation models by proposing a text-enhanced VI-ReID framework to compensate for the missing information in the infrared modality.
Strengths
The figures and tables in this paper are detailed, and the proposed method is logical and well-founded. This method reduces the cost of manual labeling, effectively addresses the issue of missing infrared modality information, and offers some insights for this community. Extensive experiments demonstrate that the proposed method improves retrieval performance on three expanded cross-modality re-identification datasets, paving the way for utilizing LLMs in downstream data-demanding tasks. The new test setting proposed in the paper seems to be a composed image retrieval with some real-world applications.
Weaknesses
- The structure of the article needs adjustment. Some content in the appendix, such as Datasets Expansion and Task Settings, should be moved to the main text to enhance readability.
- Compared with the previous text enhancement method YYDS, the advantages of this paper should be further explained.
- There are some spelling errors in the paper (line 131). It is recommended that the authors conduct a thorough check.
Questions
The comparison settings in Table 4 show significant differences. For instance, in Tri-SYSU-MM01, YYDS presents the I+T->R results, while in Tri-RegDB and Tri-LLCM experiments, YYDS shows the I->R results. The rationale behind these settings needs further discussion.
Limitations
The method utilizes text information to complement the infrared modalities. This approach's reliance on the accuracy of text information generation may pose a limitation, which the authors should briefly discuss.
We are grateful for your positive recognition of the detailed tables, clear figures, and the soundness of our framework. We also appreciate your constructive comments and will revise and clarify the suggested points to improve the quality of the paper writing.
Q1: Some content in the appendix should be moved to the main text to enhance readability.
A1: As suggested, we will add details of the proposed task setting and an introduction to the generation process into the main text to ensure coherence and readability.
Q2: Compared with the previous text enhancement method YYDS, the advantages of this paper should be further explained.
A2:
YYDS
- Manually collected descriptions for images.
- Prior information (e.g., a pre-defined color vocabulary) for aligning complementary text-visual information; a complicated architecture with many additional parameters for information extraction and fusion, causing potential information loss.
- Fixed auxiliary descriptions for training, leading to sensitivity to text variation.
Our framework
- Automatically generated text from visible and infrared images.
- A feature-level fusion module without additional parameters. An end-to-end ReID loss fine-tunes the LVM, guiding semantic alignment across all modalities without prior information, significantly mitigating the fusion-visible discrepancy and achieving more accurate cross-modal retrieval.
- LLM-based random rephrasing to create dynamic descriptions for training, improving robustness against textual variation.
Q3: There are some spelling errors in the paper (line 131).
A3: We will conduct a thorough review and correct spelling errors throughout the document to ensure professionalism and clarity.
Q4: Varied comparison settings in Table 4, especially the differences in results for YYDS in Tri-SYSU-MM01 vs. Tri-RegDB and Tri-LLCM.
A4: We will standardize the experimental setups for YYDS across all tests to use the "I+T->R" configuration. Both YYDS and our method employ joint text and infrared sample retrieval, whereas other methods solely use infrared queries.
Q5: Briefly discuss the reliance on the accuracy of text information generation, which may pose a limitation.
A5:
- The quality of the generated text can indeed affect model performance, particularly when the quality of the source images used for generation or the capabilities of the generators are suboptimal.
- However, even on the challenging LLCM and the lower-resolution RegDB, where the generated descriptions are not completely accurate, our method still achieves improved performance. This demonstrates the robustness of our method against inaccuracies in descriptions.
The rebuttal resolves my doubts. Compared to existing methods, this work has significant advantages and innovations. Consequently, I would like to argue for acceptance and raise my rating to Strong Accept.
- I think the contributions of this paper are sufficient. The extended dataset in this paper is very helpful for research in this field.
- The methodology of this paper is sound. Extensive experiments also verify its validity.
- This paper is a new exploration of VI-ReID that can provide new insights into the field.
Thanks for your support and comprehensive feedback! We greatly appreciate your acknowledgment of our method’s innovations and the interests of the proposed expanded datasets. We will share our data and code to further contribute to the community’s development.
To address the performance gap of visible-infrared re-identification relative to visible-only ReID, the authors propose a novel text-enhanced VI-ReID framework driven by Foundation Models (TVI-FM), which enriches infrared representations with automatically generated textual descriptions. This framework incorporates a pretrained multimodal language vision model (LVM) to extract and fine-tune textual features, minimizing the domain gap between texts and visual images. Additionally, modality joint learning and an ensemble retrieving strategy are introduced to align and leverage features from all modalities, enhancing retrieval effectiveness and robustness.
Strengths
The paper presents an interesting approach to enriching the limited features of infrared imagery by incorporating textual information, effectively extracted and fused using a strategy based on Language Vision Models (LVMs) and Large Language Models (LLMs). This proposed method stands out as it leverages advanced models to bridge the gap between modalities, enhancing the overall effectiveness of Visible-Infrared Person Re-identification (VI-ReID). The detailed analysis of intra- and inter-class distributions is particularly noteworthy, providing valuable insights into the data characteristics and the impact of the proposed approach. This statistical evaluation underscores the robustness and depth of the methodology, highlighting its potential to significantly improve the performance of VI-ReID systems. Furthermore, the results demonstrate good performance on the adopted dataset, outperforming recent state-of-the-art solutions. This achievement validates the effectiveness of the proposed framework. The paper's findings contribute meaningfully to the field and offer a promising direction for future research in enhancing VI-ReID using advanced textual and visual fusion techniques.
Weaknesses
The presentation of the methodology lacks sufficient clarity, making it challenging to follow the authors' logic. In particular, Section 3.2, which discusses the incremental fine-tuning strategy, is especially confusing. The connection between the textual descriptions and the proposed architecture shown in Figure 2 is not clearly established. Including notations for the features as they are outputted at different steps would greatly improve understanding.
Section 3.2.2 is overly complex and difficult to read. The sentence, “we employ the fine-tuned LVM in Section A to generate textual description,” is misleading because it references a Section A that does not exist within the document. This ambiguity needs to be addressed to avoid reader confusion. Additionally, the explanation of how features are fused is not clearly articulated, adding to the overall lack of coherence in this section.
Furthermore, there are numerous language errors throughout the text that must be corrected. These mistakes detract from the readability and professionalism of the paper, making it harder to take the work seriously. Overall, the paper requires significant revisions to improve its clarity and readability.
Questions
Section 3.2.2 needs a careful revision to present the solution in a clearer way.
How has the text data been aligned with the image modality during pre-training? Which text is used?
Is N_{sum} different from N_i? If so, how? If not, why use this notation?
What is the difference between MES and MER in the ablation study?
Limitations
Although the limitations are addressed in Appendix D, this section is underdeveloped and lacks depth. The discussion fails to provide a clear and comprehensive strategy for improving the quality of the textual information to enhance overall performance. It is unclear at which specific step in the methodology this improvement should be implemented. More detailed explanations and actionable insights are needed to understand how enhancing textual quality can positively impact the system’s effectiveness. Clarifying these points would significantly strengthen the paper’s contribution and provide a more robust framework for future research.
We appreciate your recognition of the soundness, competitive performance, and adequate experiments of our method. Our method is a novel exploration of applying foundation models to downstream data-intensive multimodal tasks. It uses LVM-generated text to enrich infrared representations and employs an end-to-end ReID loss to fine-tune the LVM text encoder, minimizing the discrepancy between visible and fusion features and thereby achieving competitive performance across three expanded VI-ReID datasets. We are grateful for the feedback on the clarity and logical coherence of our presentation. In response, we will refine our manuscript to highlight contributions, correct misleading references, and improve readability, especially in Section 3.2 and the description of the fusion process. We have adjusted the framework, moving LLM augmentation to "Text Generation," and have added elaborations of each module, as detailed in Figure 1 of the submitted rebuttal PDF.
Q1: Suggestions for clear presentation of Section 3.2, especially fusion process.
A1: The Incremental Fine-tuning Strategy (IFS) includes a Fusion Module and Modality Joint Learning, integrating text into the infrared modality for improved cross-modal retrieval.
Fusion Module. As shown in Figure 2 of the rebuttal material, the fusion process is defined as

$$ f_{\mathrm{fus}} = f_{\mathrm{ir}} + \Delta t, \qquad \Delta t = t_{\mathrm{vis}} - t_{\mathrm{ir}}, $$

where $f_{\mathrm{ir}}$ and $f_{\mathrm{vis}}$ are the infrared and visible features; $t_{\mathrm{vis}}$ and $t_{\mathrm{ir}}$ are the text features of the visible and infrared image; and $\Delta t$ and $\Delta f$ are the text and visual complementary information for the infrared modality, respectively.
The visible feature decomposes into the infrared feature and its complementary feature, $f_{\mathrm{vis}} = f_{\mathrm{ir}} + \Delta f$, and similarly $t_{\mathrm{vis}} = t_{\mathrm{ir}} + \Delta t$. Using the foundation model's basic text-visual alignment capability, the visible and infrared text features are roughly equivalent to the visible and infrared features, respectively ($t_{\mathrm{vis}} \approx f_{\mathrm{vis}}$, $t_{\mathrm{ir}} \approx f_{\mathrm{ir}}$). Thus, $\Delta t$ is roughly equal to $\Delta f$, and we can create the fusion feature by adding $\Delta t$ to $f_{\mathrm{ir}}$, which makes it roughly equivalent to the visible feature.
Modality Joint Learning. Freezing the visual encoders, we fine-tune the LVM text encoder with a ReID loss to jointly align semantics across all modalities. The loss includes a cross-entropy loss $\mathcal{L}_{\mathrm{ce}}$ and a weighted regularized triplet loss $\mathcal{L}_{\mathrm{wrt}}$:

$$ \mathcal{L}_{\mathrm{reid}} = \mathcal{L}_{\mathrm{ce}} + \mathcal{L}_{\mathrm{wrt}}. $$
The fusion feature is the combination of the frozen infrared feature and the text complementary feature. With IFS, we further align the text and visual complementary features and thus further optimize the fusion features, reducing the fusion-visible discrepancy and improving cross-modal retrieval accuracy.
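For illustration, a minimal PyTorch-style sketch of this parameter-free, feature-level fusion; the tensor names and the final normalization are our assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def fuse_infrared_with_text(ir_feat: torch.Tensor,
                            vis_text_feat: torch.Tensor,
                            ir_text_feat: torch.Tensor) -> torch.Tensor:
    """Compensate the (frozen) infrared feature with the text complementary information.

    The complementary information is the difference between the visible-text and
    infrared-text features; adding it to the infrared feature yields a fusion
    feature that approximates the visible feature.
    """
    delta_t = vis_text_feat - ir_text_feat   # text complementary information
    fusion_feat = ir_feat + delta_t          # parameter-free, feature-level fusion
    return F.normalize(fusion_feat, dim=-1)  # L2-normalize for cosine-based retrieval
```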
Q2: Misleading reference of Section A that does not exist within the document.
A2: The referenced content is detailed in Appendix A, which describes the fine-tuning process of the two modality-specific text generators. We will add an explanation of this process in the main text and correct the reference accordingly to ensure better coherence.
Q3: How the text data has been aligned with image modality in pre-train? Which text?
A3:
The CLIP text encoder possesses text-image alignment capability, benefiting from pre-training on large-scale image-text pairs (WebImageText) via contrastive learning:

$$ \mathcal{L}_{\mathrm{i2t}} = -\frac{1}{N} \sum_{k=1}^{N} \log \frac{\exp\big(\mathrm{sim}(I_k, T_k)/\tau\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{sim}(I_k, T_j)/\tau\big)}, $$

where $I_k$ and $T_k$ are the $k$-th image-text feature pair in a batch, $\tau$ is the temperature parameter, and $N$ denotes the number of image-text pairs in the batch. Using this basic capability of CLIP, the visible-text and infrared-text features are roughly aligned with their respective visual features, and this alignment is further adapted to the VI-ReID task in the subsequent optimization.
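For reference, a minimal PyTorch-style sketch of this contrastive objective; the function name and the fixed temperature are simplifications (CLIP learns a logit scale during pre-training):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats: torch.Tensor,
                          text_feats: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """image_feats, text_feats: (N, D) paired features from one batch."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature       # (N, N) similarity matrix
    targets = torch.arange(image_feats.size(0), device=logits.device)
    # Matching pairs lie on the diagonal; average the image-to-text and text-to-image terms.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```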
Q4: Is N_{sum} different from N_i? If so, how? If not, why using this notation?
A4: They are equal, because each fusion feature is created from the corresponding infrared feature. We use the two notations to distinguish the features of different modalities more clearly, so the number of features of each modality carries a subscript matching that modality.
Q5: Difference between MES and MER in ablation.
A5: MES (Modality Ensemble Searching) and MER (Modality Ensemble Retrieving) describe the same module in Section 3.3. We will unify all names of this module to MER.
Q6: Strategy for improving the quality of generated text.
A6: To improve generated text quality, we can apply several strategies during the generator fine-tuning process (Appendix A):
Better data for Generator Fine-tuning: During the generator fine-tuning process, we can apply augmentations (e.g., flipping, brightness adjustments) to obtain more diverse images, making text generation more robust to varying image quality (a minimal example pipeline is sketched after this list).
Stronger Generative LVM: Use advanced LVMs with more parameters, thereby capturing more useful visual information in the images and generating higher-quality textual descriptions.
Progressive Generation Strategy: Fine-tune generators on images with attribute annotations to focus more on fine-grained attributes rather than sentence modeling, then use LLMs to reorganize them into high-quality descriptive sentences.
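For illustration, a minimal torchvision-style augmentation pipeline for the first strategy; the specific transforms and parameter values are our assumptions, not the paper's actual setup:

```python
from torchvision import transforms

# Illustrative augmentations to diversify the images used for captioner fine-tuning.
caption_finetune_augmentations = transforms.Compose([
    transforms.Resize((288, 144)),                         # common VI-ReID input resolution
    transforms.RandomHorizontalFlip(p=0.5),                # flipping
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # brightness/contrast adjustments
    transforms.ToTensor(),
])
```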
Q7: How enhanced text quality boosts system effectiveness?
A7: During training, high-quality text allows the model to learn better text-vision correspondences, maximizing the utilization of complementary information to create fusion features. During retrieval, high-quality text provides accurate information for infrared compensation, enabling more accurate cross-modal retrieval.
Thank you for taking the time to review our paper and for providing valuable feedback. We have submitted our rebuttal and have attempted to address your concerns, particularly regarding the clarity of Section 3.2 and the fusion process. If there are any remaining issues or further questions, please let us know, and we will actively work to resolve them. We would greatly appreciate it if you could update your score based on the feedback from other reviewers and our response.
The rebuttal covers only some of the weaknesses in a satisfactory way. The clarity of the sections and the methodology is still a weakness which, in my opinion, undermines reproducibility. I still think that this paper is not at the NeurIPS level, but since some of my concerns have been cleared, I would raise my score to weak reject.
Thank you for your feedback and for considering our responses to your previous concerns. We understand the importance of reproducibility and clarity in our methodology.
Therefore, we will release all the assets of our work, ensuring other researchers can fully replicate our experimental results and validate our findings.
- Framework Code and Extended Datasets: We will release the complete code for our framework, including both training and testing, along with the trained model weights. This package will include documentation and detailed explanations to facilitate understanding and ease of use. Additionally, we will release the text components of the three extended VI-ReID datasets, including the original and augmented text, enabling comprehensive replication of our results and further exploration building upon our framework.
- Captioners for Data Expansion: We will release the fine-tuned weights and the text-generation code for the modality-specific captioners, which are designed to generate text descriptions for visible and infrared images. These will validate the feasibility and reliability of our data expansion method, which can also be applied to other VI-ReID datasets.
We will also refine our manuscript based on the suggestions. This includes reorganizing our content for better coherence and adding explanations to better highlight each module's insights and connections. These refinements will further enhance the understanding of our framework for other researchers.
Moreover, our work is a new exploration of employing foundation models like LLMs and LVMs to enhance traditional VI-ReID tasks. The extended tri-modality datasets also offer significant benefits for VI-ReID research. We believe that the explanations of our work, along with the code and data, will support and inspire subsequent text-related VI-ReID works by researchers in the community, including the reviewers here who have expressed interest in our methods and expanded datasets.
Thank you again for your constructive feedback. We hope that our response addresses your concerns. The code, weight, and data will be released.
Thanks for your positive feedback on our rebuttal and for indicating your intention to raise the score. In our last response, we clarified that we will release our code and data to ensure reproducibility and will improve the clarity of our manuscript for better understanding, following your suggestions. We are not sure whether this addresses your concerns, and we noticed that the score has not yet been updated. If there are any remaining concerns, please let us know. We sincerely appreciate your expertise and the dedicated effort in reviewing our work.
We thank all reviewers for their positive feedback on the clear diagrams (R#88cp, R#BZVu), methodological feasibility (R#hNfL, R#88cp, R#EHC5), competitive performance (R#EHC5, R#368Y), comprehensive experiments (R#88cp, R#EHC5), and the interesting approach and detailed visualization analysis (R#368Y). We hope this rebuttal allows R#EHC5 and R#368Y to update their scores. The code will be released.
Five experts in the field reviewed this paper. Their recommendations are Weak Reject, Strong Accept, Accept, Strong Accept, and Strong Accept. Overall, the reviewers appreciated the paper because it addresses a new problem (applying foundation models to VI-ReID tasks), proposes an innovative approach, and shows effectiveness in its comprehensive experimental evaluations. Based on their feedback and the authors' satisfactory rebuttal that addresses the initial concerns, I recommend it for acceptance. The reviewers raised some valuable issues and concerns in the Weaknesses that should be addressed in the final camera-ready version of the paper. In particular, the overall structure and clarity of the paper should be significantly improved in the final version. The motivation, details, and advantages of each module in the paper should be clarified to enhance understanding. The task setting should also be introduced from the outset, which can adequately support the rationale for text enhancement in cross-modality retrieval. Finally, it is important to outline the limitations of state-of-the-art text-assisted ReID methods to justify your method. The authors are encouraged to make the necessary changes to the best of their ability. We congratulate the authors on the acceptance of their paper!