PaperHub
Overall: 6.8/10
Poster · 4 reviewers
Min 4 · Max 5 · Std 0.4
Ratings: 4, 4, 4, 5
Confidence: 3.3
Novelty: 2.8 · Quality: 2.5 · Clarity: 2.5 · Significance: 2.8
NeurIPS 2025

UGG-ReID: Uncertainty-Guided Graph Model for Multi-Modal Object Re-Identification

OpenReview · PDF
Submitted: 2025-05-08 · Updated: 2025-10-29
TL;DR

We propose the Uncertainty-Guided Graph model (UGG-ReID), a framework for multi-modal object re-identification that effectively alleviates the interference of noisy data.

Abstract

Keywords
Multi-Modal Object Re-Identification, Uncertainty-Guided Graph, Gaussian Patch-Graph Representation, Mixture of Experts, Uncertainty-Guided Routing

Reviews and Discussion

Official Review
Rating: 4

This paper introduces a novel framework to address the challenges of local noise and modality misalignment in multimodal ReID. The key ideas are twofold: GPGR quantifies epistemic uncertainty in local and global features and models their structural dependencies, and UGMoE dynamically selects reliable experts based on sample-level uncertainty to improve cross-modal interaction. Extensive experiments on five public multi-modal ReID datasets show that UGG-ReID outperforms strong baselines.

Strengths and Weaknesses

Strengths:

  1. The paper tackles a critical challenge in multi-modal Re-ID: fine-grained local noise and inter-modal inconsistencies, which degrade robustness.
  2. It achieves state-of-the-art performance in mAP and Rank-1 across all datasets, and performs especially well under artificially injected Gaussian noise, showing strong robustness and generalization.

Weaknesses:

  1. There are several instances where full terms and their abbreviations (e.g., Gaussian Patch-Graph Representation (GPGR), Uncertainty-Guided Mixture of Experts (UGMoE), Mixture of Experts (MoE)) are repeatedly introduced or inconsistently used throughout different sections. This redundancy and inconsistency negatively affect readability and make it harder for readers to follow the paper's core ideas. Terms like “uncertainty-guided” and “modal-specific” are used frequently without consistent clarification. These should be explained clearly in the introduction.
  2. While the paper introduces uncertainty modeling via GPGR and UGMoE, similar concepts have already been explored in prior work. DeMo also leverages expert selection and attention mechanisms for multi-modal object Re-ID, and uncertainty-guided fusion has been studied in the broader multi-modal learning literature (e.g., MAP, PDA). Although mechanisms like Gaussian modeling, KL divergence, MoE routing are introduced, the distinction between UGG-ReID and these approaches is not clear.

Questions

According to the weaknesses, my questions are:

  1. The authors should improve the paper's clarity, including full terms and their abbreviations, repeatedly introduced parts, and terms without explanations.
  2. Please provide a clearer conceptual and technical comparison to prior works. This could be done through a comparison table in the methodology section that breaks down components across methods (e.g., whether uncertainty is modeled locally vs. globally, how routing is performed, whether expert selection is dynamic, etc.). Also, consider explicitly clarifying which aspects of your framework are new and which are extensions of known paradigms.
  3. The authors are encouraged to supplement the Limitations section.

Limitations

No

Final Justification

I keep my score after reading the rebuttal. The authors should further polish the paper to make it acceptable.

Formatting Concerns

N/A

Author Response

Response to Reviewer 9LdQ

We thank the reviewer for the useful comments and suggestions.

Q1. The authors should improve the paper's clarity, including full terms and their abbreviations, repeatedly introduced parts, and terms without explanations.

A1. In the final version, we have carefully revised the full text: unifying terminology, supplementing undefined abbreviations, removing duplicate descriptions, and further clarifying key concepts to improve overall readability and academic rigor.

Q2. Please provide a clearer conceptual and technical comparison to prior works. This could be done through a comparison table in the methodology section that breaks down components across methods (e.g., whether uncertainty is modeled locally vs. globally, how routing is performed, whether expert selection is dynamic, etc.). Also, consider explicitly clarifying which aspects of your framework are new and which are extensions of known paradigms.

A2. For conceptual comparison: first, for Gaussian modeling, we exploit a Gaussian-based random graph for object representation, whose nodes are described via Gaussian distributions to represent the uncertainty of image patches under noise, while previous works generally use a deterministic graph model whose nodes are represented via a feature vector. We design a Gaussian Patch-Graph Representation (GPGR) to quantify aleatoric uncertainties for global and local features while modeling their relationships. To our knowledge, this work is the first attempt to exploit a random patch-graph model for the object ReID problem. Second, for Mixture-of-Experts, we design the Uncertainty-Guided Mixture of Experts (UGMoE) strategy, which lets different samples select experts based on their uncertainty and also utilizes an uncertainty-guided routing mechanism to strengthen the interaction between multi-modal features, effectively promoting modal collaboration.
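To make the uncertainty-guided routing idea concrete, here is a minimal sketch of how such a router could be wired. This is a hypothetical illustration, not the authors' implementation: the function name, the temperature-style use of uncertainty, and the load-balancing penalty are all assumptions.

```python
import torch
import torch.nn.functional as F

def uncertainty_guided_route(feats, uncertainty, w_gate, k=2):
    """Hypothetical uncertainty-guided top-k expert routing sketch."""
    logits = feats @ w_gate                              # (batch, n_experts)
    # Assumed mechanism: higher sample uncertainty flattens the gate
    # distribution (temperature scaling), spreading load across experts.
    logits = logits / (1.0 + uncertainty).unsqueeze(-1)
    scores = F.softmax(logits, dim=-1)
    top_v, top_i = scores.topk(k, dim=-1)                # k experts per sample
    gates = top_v / top_v.sum(dim=-1, keepdim=True)      # renormalize gate weights
    # Simple load-balancing penalty: minimized when mean usage is uniform.
    balance = scores.shape[-1] * (scores.mean(dim=0) ** 2).sum()
    return gates, top_i, balance

torch.manual_seed(0)
feats = torch.randn(8, 64)        # 8 samples, 64-dim fused features
uncertainty = torch.rand(8)       # per-sample uncertainty scores
w_gate = torch.randn(64, 4)       # router weights for 4 experts
gates, idx, balance = uncertainty_guided_route(feats, uncertainty, w_gate)
```

Under this sketch, higher-uncertainty samples receive a flatter gate distribution; the paper's actual UGMoE routing (Eq. 12, Section 3.3.1) may combine the uncertainty and gate scores differently.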

For technical comparison: first, we further analyze the impact of different uncertainty modeling approaches on the performance of multi-modal object ReID [1] [26] [32]. As shown in the table below, EUAR [32] and EAU [1] are able to perceive inter-sample uncertainty, while MAP [26] introduces a more comprehensive uncertainty modeling mechanism to quantify the uncertainty of local cues. Although these methods take uncertainty into account, they neglect either the modeling of uncertainty in local cues or the structural relationships between local regions. Second, we perform comparisons under different MoE strategies: we substitute in two existing MoE methods [27] [32]. Compared with DeMo [27], UGMoE better exploits the diversity of samples by introducing uncertainty modeling, and compared with EUAR [32], UGMoE further strengthens the interaction between different modalities.

Table. Component-wise comparison of different methods on RGBNT201.

| Method | Unc. | MoE | Local | Global | Graph | mAP | R-1 |
|---|---|---|---|---|---|---|---|
| EUAR [32] | | | | | | 74.1 | 77.6 |
| EAU [1] | | | | | | 75.6 | 80.3 |
| MAP [26] | | | | | | 76.8 | 78.2 |
| DeMo [27] | | | | | | 79.7 | 81.8 |
| UGG-ReID (Ours) | | | | | | 81.2 | 86.8 |

Q3. The authors are encouraged to supplement the Limitations section.

A3. Thanks to the reviewer for your suggestions. We will add a Limitations section to the final manuscript to supplement the current limitations. While our framework employs uncertainty-guided learning to enhance robustness against local noise, it may still struggle under extreme conditions where local cues are heavily corrupted or missing. Additionally, its performance across modalities has not been extensively studied, which limits its applicability in some real-world deployments.

Official Review
Rating: 4

This paper presents UGG-ReID, a novel multimodal person re-identification framework designed to address two key challenges: (1) the suppression of local noise interference, and (2) the effective fusion of heterogeneous modality features. The proposed approach introduces two main innovations: a Gaussian Patch Graph Representation (GpGR) that models feature uncertainty through Gaussian distributions and captures structural relations via graph propagation; and an Uncertainty-Guided Mixture-of-Experts (UGMoE) strategy, which dynamically assigns samples to specialized experts based on estimated uncertainty, thereby enhancing cross-modal interaction. Extensive evaluations on five public multimodal ReID benchmarks demonstrate that UGG-ReID achieves state-of-the-art performance, particularly exhibiting superior robustness under noisy conditions. Comprehensive ablation studies and robustness analyses further validate the effectiveness of each core component.

Strengths and Weaknesses

Strengths:

  1. The paper conducts comprehensive experiments across five representative multimodal ReID benchmarks, covering both pedestrian and vehicle scenarios, which demonstrates the generalizability and effectiveness of the proposed approach.
  2. The work presents a novel perspective by incorporating Gaussian distribution modeling into local feature representation and leveraging sample-level uncertainty to guide cross-modal interaction, showing good originality.
  3. The paper is well-structured and logically presented. The progression from problem formulation to method design, experimentation, and result analysis is clear and easy to follow.
  4. The proposed method achieves strong performance, especially under noisy conditions, and outperforms state-of-the-art approaches, indicating its robustness to modality inconsistency and noise interference.

Weaknesses:

  1. The mathematical exposition of the Gaussian graph convolution and uncertainty-based routing is limited. There is no analysis of convergence, complexity, or theoretical guarantees, which reduces the theoretical depth of the work.
  2. The core components—Gaussian modeling and Mixture-of-Experts—are not entirely novel, and the paper lacks sufficient discussion on their integration, boundary advantages, or potential synergy, which would have strengthened the methodological contribution.
  3. Some equations and notations are insufficiently explained, which could cause confusion for readers. For instance, the variable $\mathcal{L}^m_s$ is introduced but never clearly utilized in the final loss formulation.
  4. While the UGMoE module is claimed to dynamically route samples to experts based on uncertainty, no visualization or statistical evidence is provided to support how samples are distributed across experts, nor is the actual benefit of such routing quantitatively analyzed.
  5. The paper mentions training on a single RTX 4090 GPU but lacks details on model size or inference latency. Given the use of graph convolutions and MoE, which are generally computation-intensive, the method may face challenges in real-world deployment scenarios.
  6. The noise robustness evaluation relies primarily on synthetic noise (e.g., Gaussian, arbitrary), which may not fully reflect real-world challenges such as occlusion, illumination changes, or viewpoint shifts.

Questions

  1. In Fig. 3, as the magnitude of Gaussian noise increases, UGG-ReID shows a greater performance drop compared to DeMo, which seems to suggest that UGG-ReID may not be more robust than DeMo. Could you clarify your view on this observation?
  2. In Section 3.3.1, the Uncertainty-Guided Experts Network introduces a new loss function. However, this loss does not appear to be included in the overall training objective. Could you provide an explanation for this omission?
  3. Could the authors elaborate on the theoretical motivation for using Gaussian distributions to model local uncertainties? Why is this choice particularly appropriate in the context of multimodal ReID?
  4. Have the authors considered extending UGG-ReID to other multimodal tasks such as cross-modal retrieval? If not, could the authors discuss the potential challenges or limitations that may arise in such scenarios?
  5. To what extent is the performance of UGG-ReID sensitive to key hyperparameters, such as the number of experts in the UGMoE module or the construction parameters in the GpGR? A sensitivity analysis or ablation could help clarify this aspect.

Limitations

yes

Final Justification

please see the comments.

Formatting Concerns

none.

Author Response

Response to Reviewer 9cgN

We thank the reviewer for the useful comments and suggestions.

Q1. In Fig. 3, as the magnitude of Gaussian noise increases, UGG-ReID shows a greater performance drop compared to Demo, which seems to suggest that UGG-ReID may not be more robust than Demo. Could you clarify your view on this observation?

A1. We also observe a performance degradation relative to DeMo after adding Gaussian noise. This phenomenon essentially stems from the design mechanism of UGG-ReID: it models node uncertainty with Gaussian distributions and propagates neighborhood information through the graph structure, so it captures perturbations in local features more sensitively. When Gaussian noise is injected, this noise directly acts on the uncertainty distributions we model. This is not a lack of robustness but reflects the model's high sensitivity to input changes, in line with our design intention that it be adaptive in real, complex scenarios. Additionally, UGG-ReID consistently outperforms DeMo [27] at all injected noise intensities. Under real noise conditions (e.g., severe glare interference in WMVEID863), our method outperforms DeMo [27] by 3.8% mAP and 3.6% R-1 in Table 2 of this paper, confirming its superior robustness and discriminative power in complex conditions.

Q2. In Section 3.3.1, a new loss does not appear to be included in the overall training objective. Could you provide an explanation for this omission?

A2. We introduce an auxiliary loss $\mathcal{L}^m_s$ to constrain global features in Section 3.3.1; its absence from the training objective is a wording omission. In fact, this loss is included in the overall training objective: the $\mathcal{L}^m_c$ in Eq. (13) should be written as $\mathcal{L}^m_{c,s}$, which denotes the sum of $\mathcal{L}^m_c$ and $\mathcal{L}^m_s$. In addition, we perform an ablation analysis of this loss in the Supplementary Material (Table 5). We apologize for the misrepresentation and will make corrections in the final version.

Q3. Could the authors elaborate on the theoretical motivation for using Gaussian distributions to model local uncertainties?

A3. Yes, we adopt Gaussian distributions to model local features primarily to capture the inherent uncertainty and variability within each image patch. By representing each local patch as a Gaussian, we encode both its mean and its variance, the latter reflecting the reliability or confidence of the local information. The resulting Gaussian Patch-Graph Representation (GPGR) can quantify aleatoric uncertainties for local features while modeling their relationships. To our knowledge, this work is the first attempt to exploit a random patch-graph model for the fine-grained local noise problem.
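The per-patch Gaussian idea described above can be sketched as a small module that predicts a mean and a variance for each patch. Everything here (the module name, layer shapes, and the variance-based uncertainty score) is an illustrative assumption rather than the paper's exact GPGR head.

```python
import torch
import torch.nn as nn

class GaussianPatchHead(nn.Module):
    """Sketch: map each patch feature to a Gaussian (mean, variance)."""
    def __init__(self, dim):
        super().__init__()
        self.mu = nn.Linear(dim, dim)        # mean head
        self.log_var = nn.Linear(dim, dim)   # log-variance head (keeps var > 0)

    def forward(self, patches):
        mu = self.mu(patches)
        var = self.log_var(patches).exp()
        # Reparameterization: sample a feature while keeping gradients.
        z = mu + var.sqrt() * torch.randn_like(mu)
        # Per-patch uncertainty score: average predicted variance.
        uncertainty = var.mean(dim=-1)
        return z, mu, var, uncertainty

head = GaussianPatchHead(dim=64)
patches = torch.randn(2, 128, 64)   # (batch, patches, dim)
z, mu, var, unc = head(patches)
```

A patch whose appearance is corrupted (glare, occlusion) would ideally be assigned a large variance, so downstream modules can discount it.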

Q4. Why is this choice particularly appropriate in the context of multimodal ReID? Have the authors considered extending UGG-ReID to other multimodal tasks, such as cross-modal retrieval? If not, could the authors discuss the potential challenges or limitations that may arise in such scenarios?

A4. Thank you for your question. The framework's multi-modal design is highly versatile, for two key reasons: (1) The GPGR module effectively enhances the fine-grained information expression of the current modality through a graph modeling strategy, an infrastructure that can be directly migrated to other multi-modal tasks. (2) The UGMoE module realizes task-aware modal fusion by modeling sample uncertainty, and its design is highly adjustable. While we currently focus on the ReID task, the framework's ideas can also be applied to other multi-modal applications, such as cross-modal retrieval. For such tasks, we could further refine UGMoE to enhance semantic alignment and compress inter-modal distances. However, challenges may arise during this extension, such as increased modality-specific discrepancies and limited stability in uncertainty estimation. We will investigate these issues more thoroughly in future work.

Q5. To what extent is the performance of UGG-ReID sensitive to key hyperparameters?

A5. We analyze the effects of the hyperparameters $C$ and $k$ on the UGMoE module in the Supplementary Material (Table 6). We further analyze the effect of the number of local nodes $n$ in GPGL and the number of layers $L$ in GPGCN on RGBNT201, as shown in Table 3 below. For the nodes $n$, we observe that $n=128$ achieves excellent results: too few nodes cannot cover rich local information, while too many introduce redundancy and noise, interfering with graph structure learning. For the layers $L$, GPGCN works best at $L=2$: a network that is too shallow may fuse local structures insufficiently, while one that is too deep may cause over-smoothing, weakening the expression of local discriminative features.

Table 3. Results of the analysis on hyperparameters $n$ and $L$ of the GPGR on RGBNT201.

| Nodes $n$ | mAP | R-1 | R-5 | R-10 |
|---|---|---|---|---|
| 32 | 79.3 | 83.7 | 91.3 | 93.9 |
| 64 | 80.1 | 82.9 | 90.0 | 94.0 |
| 96 | 81.4 | 84.6 | 90.4 | 92.6 |
| 128 | 81.2 | 86.8 | 92.0 | 94.7 |
| 160 | 79.8 | 83.6 | 90.2 | 91.9 |

| Layers $L$ | mAP | R-1 | R-5 | R-10 |
|---|---|---|---|---|
| 1 | 80.1 | 85.3 | 91.3 | 93.9 |
| 2 | 81.2 | 86.8 | 92.0 | 94.7 |
| 3 | 79.1 | 84.6 | 90.8 | 93.3 |
| 4 | 78.4 | 82.5 | 90.3 | 92.8 |
| 5 | 76.3 | 80.7 | 89.1 | 91.7 |

Q6. The mathematical exposition of the Gaussian graph convolution and uncertainty-based routing is limited. There is no analysis of convergence, complexity, or theoretical guarantees, which reduces the theoretical depth of the work.

A6. Thanks for your suggestion. The core of Gaussian graph convolution is to aggregate information over each node's neighborhood. Here, we perform the Gaussian graph convolution from both a mean and a variance perspective; theoretically, the outputs of the Gaussian graph convolutional layer also obey a Gaussian distribution. The main complexity of our Gaussian graph convolution involves two parts, i.e., mean-based aggregation and variance-based aggregation, so the overall complexity is roughly twice that of a regular patch-graph convolution. We will include this analysis in our final version.
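Under the assumption of independent Gaussian nodes, the mean/variance aggregation described in A6 can be written in a few lines. This is a sketch of the general principle, not the authors' exact GPGCN layer: means are mixed with a row-normalized adjacency $A$, and variances with the elementwise square of $A$, so the outputs remain Gaussian.

```python
import torch

def gaussian_graph_conv(mu, var, adj):
    """One Gaussian graph-convolution step (sketch, independence assumed).

    mu'  = A @ mu           (mean of a linear combination of Gaussians)
    var' = (A * A) @ var    (variance, since Var(sum a_i x_i) = sum a_i^2 Var(x_i))
    Roughly twice the cost of a deterministic graph convolution.
    """
    a = adj / adj.sum(dim=-1, keepdim=True).clamp_min(1e-8)  # row-normalize
    mu_out = a @ mu
    var_out = (a * a) @ var
    return mu_out, var_out

mu = torch.tensor([[1.0], [3.0]])   # two 1-d node means
var = torch.tensor([[1.0], [1.0]])  # unit variances
adj = torch.ones(2, 2)              # fully connected 2-node graph
mu_out, var_out = gaussian_graph_conv(mu, var, adj)
# Means are averaged; variances shrink because a_i^2 < a_i for a_i < 1.
```

Averaging unreliable neighbors thus reduces per-node variance, which matches the intuition of graph propagation stabilizing noisy local patches.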

Q7. The core components—Gaussian modeling and Mixture-of-Experts—are not entirely novel, and the paper lacks sufficient discussion on their integration, boundary advantages, or potential synergy, which would have strengthened the methodological contribution.

A7. First, for Gaussian modeling, we exploit a Gaussian-based random graph for object representation, whose nodes are described via Gaussian distributions to represent the uncertainty of image patches under noise, while previous works generally use a deterministic graph model whose nodes are represented via a feature vector. We design a Gaussian Patch-Graph Representation (GPGR) to quantify aleatoric uncertainties for global and local features while modeling their relationships. To our knowledge, this work is the first attempt to exploit a random patch-graph model for the object ReID problem. Second, for Mixture-of-Experts, we design the Uncertainty-Guided Mixture of Experts (UGMoE) strategy, which lets different samples select experts based on their uncertainty and also utilizes an uncertainty-guided routing mechanism to strengthen the interaction between multi-modal features, effectively promoting modal collaboration.

Q8. Provide support for how samples are distributed across experts, and quantitatively analyze the actual benefit of such routing.

A8. We quantitatively analyze the router gate scores and uncertainty average scores of all test samples for each modality on WMVEID863 in Table 3 below. The results show that the model can adaptively assign experts according to the uncertainty of samples, achieving more robust collaborative learning.

Table 3. Results of all test samples' average uncertainty and expert gate score for each modality on WMVEID863.

| Modality | Score | Expert 1 | Expert 2 | Expert 3 | Expert 4 |
|---|---|---|---|---|---|
| R | Unc. | 0.31 | 0.27 | 0.30 | 0.12 |
| R | Gate | 0.07 | 0.12 | 0.07 | 0.74 |
| N | Unc. | 0.24 | 0.24 | 0.27 | 0.24 |
| N | Gate | 0.28 | 0.26 | 0.18 | 0.28 |
| T | Unc. | 0.28 | 0.26 | 0.29 | 0.18 |
| T | Gate | 0.17 | 0.21 | 0.14 | 0.48 |

Q9. Please provide details on model size and inference latency.

A9. We evaluate the model size and inference latency against SOTAs on RGBNT201. As shown in Table 4, UGG-ReID reaches an inference speed of 371.4 FPS while keeping a relatively low parameter count of 103.2M and a computational cost of 35.0G FLOPs. This speed is only slightly lower than that of the lighter DeMo [31], and is higher than most mainstream approaches, including MambaPro and EDITOR. Notably, despite its high efficiency, UGG-ReID still achieves excellent retrieval results.

Table 4. Comparison of model efficiency with state-of-the-art methods on RGBNT201.

| Method | Params (M) ↓ | FLOPs (G) ↓ | FPS ↑ | mAP ↑ | R-1 ↑ |
|---|---|---|---|---|---|
| TOP-ReID [20] | 324.5 | 35.5 | 398.9 | 72.2 | 75.2 |
| EDITOR [21] | 119.3 | 40.8 | 335.1 | 66.7 | 68.7 |
| PromptMA [23] | 107.9 | 36.2 | 343.5 | 78.4 | 80.9 |
| MambaPro [18] | 74.8 | 52.4 | 243.2 | 78.9 | 83.4 |
| DeMo [31] | 98.8 | 35.1 | 403.6 | 79.7 | 81.8 |
| IDEA [22] | 91.7 | 43.7 | 299.5 | 80.2 | 82.1 |
| UGG-ReID (Ours) | 103.2 | 35.0 | 371.4 | 81.2 | 86.8 |

Q10. The noise robustness evaluation relies primarily on synthetic noise (e.g., Gaussian, arbitrary), which may not fully reflect real-world challenges such as occlusion, illumination changes, or viewpoint shifts.

A10. We evaluate our method not only under synthetic noise but also on WMVEID863, which captures real-world multi-modal noise conditions, including glare and occlusion. Our method achieves improvements of 2.8% mAP and 3.6% R-1 over the second-best models in Table 2 of this paper, indicating its effectiveness in handling real-world interference.

Comment

Dear Reviewers,

The NeurIPS 2025 author-reviewer discussion will be closed on August 6, 11:59 pm AoE. Please read the responses, respond to them in the discussion, and discuss points of disagreement.

Best, Your AC

Comment

I appreciate the authors' responses to my concerns. They have clarified the design rationale behind the model's sensitivity to synthetic noise and provided strong evidence of its robustness on real-world noisy data. These responses have addressed my main points, and I will keep my score at 4.

Comment

Thanks for your time and effort. We're pleased that our explanations have addressed your concerns. Your insights are highly valuable in improving the quality of our paper.

Official Review
Rating: 4

This paper focuses on the valuable task of multimodal object re-identification. The authors target two under-explored challenges: learning robust representations against local noise and integrating different modalities. To address them, the authors propose GPGR, which estimates uncertainty through Gaussian distributions and patch-graph representations, and UGMoE, which guides samples to experts of low uncertainty. Further experiments demonstrate the effectiveness of UGG-ReID.

Strengths and Weaknesses

Pros:

  1. The targeted multimodal object re-identification task is valuable in real-world applications.
  2. The paper is well-written and easy to follow.
  3. The proposed UGG-ReID achieves significant performance improvements over SOTAs.

Cons:

  1. The novelty of the method is limited from my perspective. The paper mainly focuses on dealing with local noise, which has been researched by several previous works; strategies like patch-graph and uncertainty estimation have been widely applied before. Besides, a branch of related works that focuses on multimodal noise [1-3] in ReID has not been discussed or compared by the authors.
  2. I appreciate that the authors try to deal with noise through MoE strategies. However, I am concerned about whether the balance between experts can be maintained. Although a balancing loss has been applied in Eq. 12, I still worry that a few experienced experts will deal with most samples with high certainty. I wonder if the authors can give a visualization of the router to show whether the experts are balanced.
  3. The authors introduce a quite complicated framework in this paper, including patch-wise Gaussian distribution estimation, patch-graph calculation, MoE, etc. Such a framework raises concerns about high computational cost. I wonder if the authors can provide wall-clock time costs of UGG-ReID against SOTAs to prove that the method is efficient.

[1] Learning With Twin Noisy Labels for Visible-Infrared Person Re-Identification

[2] Robust Object Re-identification with Coupled Noisy Labels

[3] Noisy-Correspondence Learning for Text-to-Image Person Re-identification

Questions

Please refer to the Cons section.

I will consider raising my score if the authors can:

  1. explain the novelty of the method, and discuss or compare with the mentioned related works.
  2. provide visualization of routers to show different experts are balanced.
  3. provide wall-clock time costs of UGG-ReID against SOTAs.

Limitations

yes

Final Justification

Considering the authors have solved my major concerns, I decide to raise my score to 4 in the final decision.

Formatting Concerns

none

Author Response

Response to Reviewer 4UnX

We thank the reviewer for the useful comments and suggestions.
Q1. The paper mainly focuses on dealing with local noise, which has been researched by several previous works; strategies like patch-graph and uncertainty estimation have been widely applied before. Besides, a branch of related works that focuses on multimodal noise [1-3] in ReID has not been discussed or compared by the authors.

[1] Learning With Twin Noisy Labels for Visible-Infrared Person Re-Identification.

[2] Robust Object Re-identification with Coupled Noisy Labels.

[3] Noisy-Correspondence Learning for Text-to-Image Person Re-identification.

A1. Yes, patch-graph and uncertainty estimation have been studied to deal with local noise in previous works [1] [26] [32]. However, the key difference is that we deal with local patch nodes by leveraging a novel Gaussian-based random graph learning model, which provides a unified model integrating the advantages of both the patch-graph model and the uncertainty estimation strategy. To our knowledge, this work is the first attempt to exploit a random patch-graph model for the object ReID problem.

In addition, we compare our method with previous works on multi-modal noise [1-3] in ReID as follows. First, previous works [1] [2] focus on label noise in cross-modal alignment, whereas we focus on visual (patch) noise in multi-modal fusion. Second, work [3] focuses on label noise at the annotation level, emphasizing data filtering and hierarchical sample training; in contrast, we focus on the robustness of the patch modeling phase, especially sensing input perturbations (e.g., occlusion, blurring noise). Works [1-3] are better suited to datasets with unclean labels, while UGG-ReID is better at uncertainty modeling of object representations in real, complex scenarios. In the current version, we mainly discuss multi-modal ReID methods and multi-modal methods related to uncertainty modeling; in the final version, we will include the above works [1-3].

Q2. I appreciate that the authors try to deal with noise through MoE strategies. However, I am concerned about whether the balance between experts can be maintained. Although balancing loss has been applied in Eq. 12, I still worry that few experienced experts will deal with most samples with high certainty. I wonder if the authors can give a visualization of the router to show whether the experts are balanced.

A2. Empirically, our method can obtain relatively balanced results between different experts. Here, we show the router gate scores and uncertainty average scores of all test samples for each modality on WMVEID863 in Table 1. We observe that N and T modalities show a more balanced expert activation distribution, while Expert 4 in the R modality has a heavy weight when the uncertainty is low. These results indicate that the routing mechanism dynamically adjusts expert allocation according to different modalities and sample uncertainty.

Table 1. Results of all test samples' average uncertainty and expert gate score for each modality.

| Modality | Score | Expert 1 | Expert 2 | Expert 3 | Expert 4 |
|---|---|---|---|---|---|
| R | Unc. | 0.31 | 0.27 | 0.30 | 0.12 |
| R | Gate | 0.07 | 0.12 | 0.07 | 0.74 |
| N | Unc. | 0.24 | 0.24 | 0.27 | 0.24 |
| N | Gate | 0.28 | 0.26 | 0.18 | 0.28 |
| T | Unc. | 0.28 | 0.26 | 0.29 | 0.18 |
| T | Gate | 0.17 | 0.21 | 0.14 | 0.48 |

Q3. Provide wall-clock time costs of UGG-ReID against SOTAs.

A3. We evaluate the inference speed of each method on RGBNT201, using Frames Per Second (FPS) as the evaluation metric. As shown in Table 2, UGG-ReID reaches an inference speed of 371.4 FPS while maintaining a relatively low parameter count of 103.2 million and a computational cost of 35.0G FLOPs. This speed is only slightly lower than that of the lighter DeMo, and is significantly higher than most mainstream approaches, including MambaPro and EDITOR. Notably, despite its high efficiency, UGG-ReID still achieves strong retrieval results.

Table 2. Wall-clock efficiency and accuracy comparison with SOTA methods on RGBNT201.

| Method | Params (M) ↓ | FLOPs (G) ↓ | FPS ↑ | mAP ↑ | R-1 ↑ |
|---|---|---|---|---|---|
| TOP-ReID [20] | 324.5 | 35.5 | 398.9 | 72.2 | 75.2 |
| EDITOR [21] | 119.3 | 40.8 | 335.1 | 66.7 | 68.7 |
| PromptMA [23] | 107.9 | 36.2 | 343.5 | 78.4 | 80.9 |
| MambaPro [18] | 74.8 | 52.4 | 243.2 | 78.9 | 83.4 |
| DeMo [31] | 98.8 | 35.1 | 403.6 | 79.7 | 81.8 |
| IDEA [22] | 91.7 | 43.7 | 299.5 | 80.2 | 82.1 |
| UGG-ReID (Ours) | 103.2 | 35.0 | 371.4 | 81.2 | 86.8 |
Comment

I appreciate the authors' explanation of my concerns and consider raising my score to 4 for the following reasons:

  1. The authors discuss the difference between the proposed UGG-ReID and mentioned Re-ID methods.

  2. The authors demonstrate the balance between experts in MoE through additional experiments.

  3. The authors demonstrate the efficiency of UGG-ReID through wall-clock time experiments.

Comment

Thank you for your valuable feedback. We're glad our explanation has addressed your concerns. Your insights are greatly appreciated and help us further improve our work.

Comment

Dear Reviewer 4UnX, our discussion session will close in the next 2 days. If the authors' feedback has addressed your concerns, please give your final rating. If not, please give your detailed comments.

Best, Your AC

Official Review
Rating: 5

The paper proposes an Uncertainty-Guided Graph model for multi-modal object re-identification named UGG-ReID. UGG-ReID tackles two main challenges: local noise from occlusions or frame loss and effective multi-modal data fusion. It introduces Gaussian Patch-Graph Representation (GPGR) to quantify and structure epistemic uncertainty at global and local levels, enhancing feature robustness. Additionally, an Uncertainty-Guided Mixture of Experts (UGMoE) dynamically routes samples to appropriate experts based on uncertainty, improving fusion and reducing noise effects. Extensive experiments on five datasets demonstrate state-of-the-art performance and strong robustness in noisy conditions, confirming the method’s effectiveness.

Strengths and Weaknesses

The strength of this paper lies in its performance and the idea of uncertainty; compared to previous works, it achieves significant performance improvements. The weakness may lie in its unclear motivation and introduction, which left me somewhat confused about why these parts need to be introduced. Meanwhile, the lack of visualization results also harms the paper: although it focuses on uncertainty, no visualization results or analysis are given to show that these modules work in this way.

Questions

  1. The motivation and Figure 1 are somewhat confusing. How do you define the local noise and how it influenced the final retrieval performance? It will be better to give some examples or a better explanation.

  2. Using graph nodes is not fresh in ReID. HOReID [1] is a typical strategy that uses a graph network to solve the missing-information problem. What is the advantage of this work? Meanwhile, I think it would be more specific and interesting to use graph message passing to achieve interaction between different modalities.

  3. In Figure 2, the R, N, and T features are not treated equally in the UGMoE, yet in the paper the authors seem to consider them equally. So, how does UGMoE work?

  4. Are there any visualization results that can be given to show the difference in the uncertainty and expert selection? I think it will be useful to demonstrate the effectiveness of each module.

[1] Wang G, Yang S, Liu H, et al. High-order information matters: Learning relation and topology for occluded person re-identification[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020: 6449-6458.

Limitations

Yes

Final Justification

I think the authors' response has solved my main concerns. Meanwhile, these results and discussion should be included in the final manuscript.

Formatting Concerns

No

Author Response

Response to Reviewer 6275

We thank the reviewer for the useful comments and suggestions.
Q1. The motivation and Figure 1 are somewhat confusing. How do you define the local noise and how it influenced the final retrieval performance? It will be better to give some examples or a better explanation.

A1. (1) Here, local noise refers to disturbances in local regions of the object that commonly occur in multi-modal ReID, as shown in Fig. 1 (a). For example, in thermal infrared images, intense thermal radiation may overexpose local areas (such as lights and the front of the vehicle), destroying structural information. In visible-light images, uneven lighting, background occlusion, or equipment sunshades can blur or obscure critical areas (e.g., windows, license plates). Such local noise disrupts patch-level semantic learning across modalities, which in turn degrades the final retrieval performance. (2) To mitigate this problem, we design the GPGR module to quantify local uncertainty and extract stable structural relationships, yielding robust fine-grained representations, as shown in Fig. 1 (b). In addition, we propose a UGMoE strategy to guide modal interaction: it computes the overall uncertainty of each sample to dynamically select experts and employs an uncertainty-guided routing mechanism to strengthen the interaction between multi-modal data, as shown in Fig. 1 (c).
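To make the idea of per-patch uncertainty concrete: one common way to realize it (used in probabilistic-embedding methods generally; the function and weight names below are illustrative, not from the paper) is to predict a mean and a variance for each patch feature and sample a stochastic embedding via the reparameterization trick, with the variance serving as the local-uncertainty score. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_patch(feat, w_mu, w_sigma):
    """Map one patch feature to a Gaussian embedding.

    The mean head gives the embedding; the variance head gives a
    per-patch (local) uncertainty estimate."""
    mu = feat @ w_mu
    log_var = feat @ w_sigma
    sigma2 = np.exp(log_var)  # exponentiate to keep variance positive
    # Reparameterization trick: sample a stochastic embedding
    z = mu + np.sqrt(sigma2) * rng.standard_normal(mu.shape)
    return mu, sigma2, z

d = 8
feat = rng.standard_normal(d)                 # one patch feature
w_mu = rng.standard_normal((d, d)) * 0.1      # toy mean-head weights
w_sigma = rng.standard_normal((d, d)) * 0.1   # toy variance-head weights
mu, sigma2, z = gaussian_patch(feat, w_mu, w_sigma)
uncertainty = sigma2.mean()                   # scalar local-uncertainty score
```

Patches whose `uncertainty` is high (e.g. overexposed or occluded regions) can then be down-weighted in later aggregation.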

Q2. (1) Using graph nodes is not fresh in the ReID. HOReID [1] is a typical strategy that uses the graph network to solve the missing information. What is the advantage of this work? (2) Meanwhile, I think it will be more specific and interesting if using the graph passing to achieve interaction between different modalities.

[1] Wang G, Yang S, Liu H, et al. High-order information matters: Learning relation and topology for occluded person re-identification[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020: 6449-6458.

A2. (1) Compared with existing graph-based ReID methods, the key difference is that we exploit a Gaussian-based random graph for object representation: each node is described by a Gaussian distribution that captures the uncertainty of a noise-affected image patch, whereas previous works generally use deterministic graph models whose nodes are plain feature vectors. The main advantage of our method is as follows: we design a Gaussian Patch-Graph Representation (GPGR) to quantify aleatoric uncertainties for global and local features while modeling their relationships. GPGR can further alleviate the impact of noisy data and effectively reinforce modal-specific information. To our knowledge, this work is the first attempt to exploit a random patch-graph model for the object ReID problem. (2) We appreciate your insightful suggestion. Leveraging graph message passing to model interactions across different modalities is an interesting direction, and in future work we plan to further explore multi-modality integration under our random-graph framework to improve multi-modal interaction.
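As a rough illustration of what distinguishes a random graph with Gaussian nodes from a deterministic one, here is a minimal sketch of one round of message passing over Gaussian nodes, assuming precision-weighted aggregation (our own simplification for illustration, not the paper's exact GPGR update): neighbors with lower variance, i.e. more certain patches, contribute more to the update.

```python
import numpy as np

def gaussian_message_pass(mu, var, adj):
    """One round of message passing over Gaussian graph nodes.

    mu:  (n, d) node means; var: (n,) node variances;
    adj: (n, n) 0/1 adjacency with self-loops.
    Neighbors with lower variance (higher precision) contribute more."""
    prec = 1.0 / var                        # precision = inverse variance
    w = adj * prec[None, :]                 # weight edges by neighbor precision
    w = w / w.sum(axis=1, keepdims=True)    # row-normalize the edge weights
    mu_new = w @ mu                         # precision-weighted mean update
    var_new = 1.0 / (w @ prec)              # crude variance update
    return mu_new, var_new

adj = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)
mu = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
var = np.array([0.1, 1.0, 0.5])             # node 0 is the most certain
mu2, var2 = gaussian_message_pass(mu, var, adj)
```

A deterministic graph would propagate only `mu`; carrying `var` alongside is what lets a noisy patch (high variance) be automatically down-weighted by its neighbors.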

Q3. In Figure 2, the R, N, and T features are not considered equally in the UGMoE. In the paper, the author seems to consider them equally. So, how does UGMoE work?

A3. The UGMoE module in Fig. 2 shows the workflow for the T modality only, as an illustrative example. In practice, all modalities are treated equally in the proposed UGMoE framework: each modality has its own specific experts plus the top-k shared experts of the other modalities, both guided by the corresponding uncertainty estimates. For example, Fig. 2 (b) shows only the T modality together with the top-k shared experts of the R and N modalities.
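The routing step described above can be sketched as follows. This is a minimal interpretation, not the paper's implementation: gate logits are penalized by per-expert uncertainty, then the top-k experts are kept and their gate weights renormalized (all names and the penalty form are our assumptions).

```python
import numpy as np

def ug_route(gate_logits, uncertainty, k=2):
    """Uncertainty-guided routing sketch: down-weight experts with high
    estimated uncertainty, then keep only the top-k of them."""
    scores = gate_logits - uncertainty          # penalize uncertain experts
    exp = np.exp(scores - scores.max())
    gate = exp / exp.sum()                      # softmax gate over experts
    topk = np.argsort(gate)[::-1][:k]           # indices of selected experts
    weights = np.zeros_like(gate)
    weights[topk] = gate[topk] / gate[topk].sum()  # renormalized top-k weights
    return weights, topk

gate_logits = np.array([0.2, 1.0, 0.5, 0.9])    # raw router scores
uncertainty = np.array([0.1, 0.8, 0.2, 0.1])    # per-expert uncertainty
weights, selected = ug_route(gate_logits, uncertainty, k=2)
```

Note how expert 1, despite the highest raw logit, is dropped once its high uncertainty is subtracted, which matches the qualitative behavior the authors describe.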

Q4. Are there any visualization results that can be given to show the difference in the uncertainty and expert selection? I think it will be useful to demonstrate the effectiveness of each module.

A4. Yes, we have visualized Class Activation Maps (CAM) of our method on the WMVEID863 dataset in the Supplementary Material, as shown in Fig. 7. They show that our method focuses on key areas even when local features are strongly disturbed. During the rebuttal, we conducted additional visualizations comparing the main modules, which show significant improvements in feature selection and regional focus. These visualization experiments demonstrate the effectiveness of each module, and we will include them in the final version.

Table. Results of all test samples' average uncertainty and expert gate score for each modality.

| Modality | Score | Expert 1 | Expert 2 | Expert 3 | Expert 4 |
|---|---|---|---|---|---|
| R | Unc. | 0.31 | 0.27 | 0.30 | 0.12 |
| R | Gate | 0.07 | 0.12 | 0.07 | 0.74 |
| N | Unc. | 0.24 | 0.24 | 0.27 | 0.24 |
| N | Gate | 0.28 | 0.26 | 0.18 | 0.28 |
| T | Unc. | 0.28 | 0.26 | 0.29 | 0.18 |
| T | Gate | 0.17 | 0.21 | 0.14 | 0.48 |

In addition, we quantitatively analyze the average uncertainty and expert gate score of all test samples for each modality in the above table. The results show significant differences in uncertainty and expert allocation across modalities. For example, in the R modality, Expert 4 is assigned a higher weight due to its lower uncertainty, while the other experts receive lower weights; in contrast, the uncertainty and routing-weight distributions of the N modality are more balanced. This indicates that the model can adaptively assign experts according to the characteristics of each modality, achieving more robust collaborative learning.
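As a quick sanity check on the claimed inverse relation between uncertainty and gate weight, one can correlate the two rows for the R modality from the table above (values copied verbatim; the check itself is ours, not the authors'):

```python
import numpy as np

# R-modality values from the rebuttal table.
unc_R = np.array([0.31, 0.27, 0.30, 0.12])   # per-expert uncertainty
gate_R = np.array([0.07, 0.12, 0.07, 0.74])  # per-expert gate score

# Pearson correlation: strongly negative if lower uncertainty
# indeed attracts higher gate weight.
r = np.corrcoef(unc_R, gate_R)[0, 1]
```

The correlation comes out strongly negative, consistent with the authors' reading that the router concentrates weight on the least uncertain expert in the R modality.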

Comment

Thanks for your constructive feedback. I think the author's response has solved my main concerns. Meanwhile, this result and discussion should be updated in your final manuscript.

Comment

Thank you for your time and effort. We are glad to have addressed your concerns and remain open to any further discussion. The results and corresponding discussion will be included in the final manuscript.

Final Decision

This paper addresses the Multi-Modal Object Re-Identification task and, following the rebuttal phase, received one clear acceptance rating along with three borderline-positive scores.

In the rebuttal, the authors addressed most of the novelty concerns, including additional expert-related experiments and a mathematical/theoretical analysis of multi-modality. Overall, the AC agrees with the reviewers that the paper offers significant strengths for fine-grained Re-ID, such as its analysis of uncertainty and graph models for this task, and its clear performance improvements.

The AC noted some shortcomings in the paper and understands the borderline-but-positive ratings. The AC requested that the authors incorporate the key experiments mentioned in the rebuttal into the final version to further strengthen the paper.