PaperHub
Rating: 7.3/10 · Poster · 4 reviewers
Scores: 5, 4, 5, 4 (min 4, max 5, std 0.5)
Confidence: 4.5
Novelty: 3.0 · Quality: 3.5 · Clarity: 3.3 · Significance: 3.5
NeurIPS 2025

Learning Source-Free Domain Adaptation for Visible-Infrared Person Re-Identification

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29

Abstract

Keywords
Source-Free Domain Adaptation; Visible-Infrared Person Re-Identification; Residual Transfer Learning

Reviews and Discussion

Review
Rating: 5

This paper presents SVIP, a novel Source-Free Domain Adaptation (SFDA) framework for Visible-Infrared Person Re-Identification (VI-ReID). The core innovation lies in adapting a pre-trained source model to an unlabeled target domain without access to either source data or target labels, thereby effectively addressing privacy concerns and real-world deployment constraints.

Strengths and Weaknesses

Strengths:

  1. The three-pronged approach (SGCL/RTL/SCCA) is novel and well-motivated for SFDA in cross-modal ReID. The design elegantly leverages source model priors without data access.
  2. The method addresses critical real-world constraints such as privacy protection and computational efficiency, which are often overlooked in existing VI-ReID literature. Moreover, the inclusion of weather robustness experiments (fog, rain, snow) enhances the practical relevance of the work.

Weaknesses:

Although the model's robustness under various weather conditions has been demonstrated, its performance under more extreme conditions, such as image degradation caused by temperature-induced camera noise, remains unexplored and warrants further investigation.

Questions

1. Equations (2)–(3) introduce an EMA momentum parameter (η = 0.1) to stabilize the cluster centers. While this choice appears reasonable, it would be helpful to briefly discuss the sensitivity of the model to different η values. Does performance degrade significantly when η is increased or decreased?

2. Section 4.1 states that SVIP can incorporate any VI-ReID method as the source model. In Figure 6, stronger baselines (e.g., MMM) seem to yield better adaptation performance. Could the authors confirm whether this trend holds consistently? Such evidence would further underscore the plug-and-play nature and generality of SVIP.

3. Figure 3 qualitatively demonstrates that SCCA produces more compact clusters in the t-SNE visualization. To support this, could the authors provide a quantitative comparison for “SGCL + RTL” versus the complete SVIP pipeline? For instance, the average L2 distance between matched visible-infrared pairs in the target domain.

4. Figures 7 and 8 present qualitative retrieval results. Are there any common patterns among the failure cases (e.g., extreme pose variation, occlusion, background clutter)? A brief analysis could help clarify the limitations and potential directions for future improvement.

Limitations

As noted in Appendix E, the performance of SVIP partially depends on the generalizability of the source model, which may pose challenges when encountering large domain gaps between source and target domains.

Final Justification

All my concerns have been resolved by the authors, and in light of the feedback from other reviewers, I am in favor of accepting this work.

Formatting Issues

None

Author Response

Q1: Equations (2)–(3) introduce an EMA momentum parameter (η = 0.1) to stabilize the cluster centers. While this choice appears reasonable, it would be helpful to briefly discuss the sensitivity of the model to different η values. Does performance degrade significantly when η is increased or decreased?

A1: Thank you for the insightful comment. In our framework, the EMA momentum parameter η = 0.1 in Equations (2)–(3) is employed to update the memory bank representations of cluster centers in a temporally smooth manner. This helps suppress short-term fluctuations caused by noisy or uncertain pseudo-labels and provides a more stable reference for downstream learning.

To evaluate the sensitivity of this parameter, we conducted experiments by varying η within a reasonable range. We observed that the model maintains stable performance across different values, indicating robustness to this hyperparameter. Specifically: 1) When η = 0.3, the updates become overly smooth, causing the cluster centers to lag behind the evolving feature space. This leads to slightly slower convergence and weaker response to recent representation shifts. 2) When η = 0.01, the cluster centers react too quickly to transient feature variations, which introduces minor instability during early training, especially when pseudo-label quality is still low. However, this does not significantly degrade final performance.

Based on these observations, we set η = 0.1 as it provides a balanced trade-off between temporal stability and responsiveness, and we use this value consistently in all experiments.
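To make the update concrete, here is a minimal NumPy sketch of an EMA memory-bank update of this kind. The exact form of Equations (2)–(3) is not reproduced in this thread, so the convention below (previous centers weighted by η) is an assumption, chosen so that a larger η yields smoother, slower-reacting centers, matching the sensitivity described above; the function name and the unit-norm step are likewise illustrative.

```python
import numpy as np

def ema_update_centers(old_centers, new_centers, eta=0.1):
    # Assumed update form: previous centers weighted by eta, so a
    # larger eta gives smoother, more slowly reacting centers --
    # consistent with the eta = 0.3 vs. eta = 0.01 behavior above.
    updated = eta * old_centers + (1.0 - eta) * new_centers
    # Keep centers on the unit hypersphere (a common convention for
    # contrastive memory banks; also an assumption here).
    return updated / np.linalg.norm(updated, axis=1, keepdims=True)
```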


Q2: Section 4.1 states that SVIP can incorporate any VI-ReID method as the source model. In Figure 6, stronger baselines (e.g., MMM) seem to yield better adaptation performance. Could the authors confirm whether this trend holds consistently? Such evidence would further underscore the plug-and-play nature and generality of SVIP.

A2: We thank the reviewer for highlighting this important point. As stated in Section 4.1, SVIP is designed to be compatible with any VI-ReID method as the source model, reflecting its plug-and-play nature.

To validate this, we conducted additional experiments using different source models, including both moderate (e.g., OTAL) and stronger (e.g., MMM) VI-ReID baselines. As shown in Figure 6, we observe a consistent trend: better-performing source models generally lead to improved adaptation results under our SVIP framework.

This is expected, as stronger source models encode richer discriminative and structural knowledge, which our SGCL, RTL, and SCCA modules can more effectively transfer to the target domain. These results confirm the generality and flexibility of SVIP, and we appreciate the reviewer’s suggestion to emphasize this trend.


Q3: Figure 3 qualitatively demonstrates that SCCA produces more compact clusters in the t-SNE visualization. To support this, could the authors provide a quantitative comparison for “SGCL + RTL” versus the complete SVIP pipeline? For instance, the average L2 distance between matched visible-infrared pairs in the target domain.

A3: Thank you for the insightful suggestion. To further support the qualitative t-SNE visualization presented in Figure 3, which highlights the effectiveness of the proposed SCCA module in producing more compact feature clusters, we conducted an additional quantitative analysis. Specifically, we compare the “SGCL + RTL” variant (i.e., without SCCA) against the full SVIP pipeline (with SCCA) by computing the average L2 distance between matched visible-infrared image pairs in the target domain. Feature embeddings are extracted from each model variant, and matched pairs are determined based on pseudo-labels.

Under the Basic Setting with RegDB, the “SGCL + RTL” variant yields an average L2 distance of 0.812, whereas the full SVIP model reduces this to 0.623. These results provide quantitative evidence that the inclusion of SCCA significantly enhances cross-modal feature alignment, yielding more compact and semantically consistent embeddings. This aligns well with the visual trends observed in Figure 3, thereby validating the effectiveness of SCCA beyond qualitative inspection.
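As a rough illustration of how such a compactness metric could be computed, the sketch below assumes "matched pairs" means all cross-modal sample pairs sharing a pseudo-label; the function and variable names are illustrative, not from the paper's code.

```python
import numpy as np

def mean_matched_l2(vis_feats, ir_feats, vis_labels, ir_labels):
    # For each pseudo-identity present in both modalities, average the
    # L2 distances over all visible-infrared sample pairs, then average
    # across identities.
    vis_labels = np.asarray(vis_labels)
    ir_labels = np.asarray(ir_labels)
    dists = []
    for lbl in set(vis_labels.tolist()) & set(ir_labels.tolist()):
        V = vis_feats[vis_labels == lbl]   # (n_v, d) visible features
        R = ir_feats[ir_labels == lbl]     # (n_r, d) infrared features
        d = np.linalg.norm(V[:, None, :] - R[None, :, :], axis=-1)
        dists.append(d.mean())
    return float(np.mean(dists))
```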


Q4: Figures 7 and 8 present qualitative retrieval results. Are there any common patterns among the failure cases (e.g., extreme pose variation, occlusion, background clutter)? A brief analysis could help clarify the limitations and potential directions for future improvement.

A4: We thank the reviewer for the thoughtful comment. As suggested, we have carefully examined the failure cases illustrated in Figures 7 and 8, which present representative retrieval examples under both the Basic Setting and the Weather-Degraded Setting. In particular, most failure cases are attributed to modality-specific domain gaps. These conditions hinder the model’s ability to extract modality-invariant and identity-discriminative features. Such issues are especially pronounced when infrared samples contain limited structure or are strongly affected by weather conditions.

While our method is robust under typical scenarios, these challenging edge cases reveal that subtle cross-modal semantic discrepancies remain difficult. Future directions to address these limitations may include: 1) Foreground modeling to suppress background noise; 2) Modality-aware feature disentanglement; 3) Structure-aware attention mechanisms; 4) Leveraging auxiliary priors such as pose estimation or vision-language alignment. We believe these enhancements could further strengthen robustness in real-world, unconstrained VI-ReID scenarios.

Comment

I sincerely thank the authors for their clear and thoughtful point-by-point response. My initial concerns have been effectively addressed, and I remain confident in my decision to accept this submission.

Comment

We sincerely thank you for your thoughtful feedback and are glad to hear that our responses effectively addressed your concerns. We greatly appreciate your support and recommendation for acceptance.

Review
Rating: 4

This paper addresses the problem of Source-Free Domain Adaptation (SFDA) in the context of Visible-Infrared Person Re-Identification (VI-ReID), where source domain data is unavailable due to privacy or deployment constraints. To tackle the challenges of modality gap, noisy pseudo-labels, and structural misalignment, the authors propose SVIP, a novel framework composed of three core modules: 1) Source Guided Contrastive Learning (SGCL), which uses dual clustering from both source and target encoders to provide reliable supervisory information; 2) Residual Transfer Learning (RTL), which aligns intermediate features via a residual distillation loss; 3) Structural Consistency Guided Cross-modal Alignment (SCCA), which mines cross-modal pairs in the target domain by leveraging structural priors from the source model. Extensive experiments on multiple VI-ReID benchmarks, including varying weather conditions, show that SVIP consistently outperforms state-of-the-art SFDA and unsupervised VI-ReID methods. The approach is plug-and-play and readily applicable in real-world scenarios.

Strengths and Weaknesses

This paper demonstrates high technical quality, with a well-structured methodology and solid empirical validation. Ablation studies and visualizations provide clear evidence of each module’s contribution.

This work could provide slightly more detail on how general the method is to different backbone architectures, and could elaborate on the failure cases to identify current boundaries of the method's robustness.

Questions

Please clarify the following points to further strengthen the paper:

I: The implementation uses AGW as the backbone feature extractor. Could the authors clarify whether SVIP is compatible with other VI-ReID architectures? Have any preliminary tests been performed to confirm generalizability across backbones?

II: In SGCL, the confidence weight γ (Eq. 6) balances the influence between source and target cluster predictions. Could the authors provide an intuitive explanation (or a short plot) of how γ evolves over training? This may help understand the dynamics of source vs. target reliance.

III: The framework relies on DBSCAN clustering in both source and target models to generate pseudo-labels. While the use of memory banks and EMA improves stability, could the authors comment on whether the number of clusters fluctuates significantly across epochs? A brief mention of clustering consistency (e.g., variance in cluster count) would help assess robustness.

IV: In VI-ReID, the identities in the source and target domains typically do not overlap. Could you please clarify whether the proposed SVIP method assumes any partial overlap between the identity classes of the two domains, or is it completely agnostic to the label space?

Limitations

The authors provide a candid discussion of potential limitations, including the dependence on source model generalizability and challenges posed by large domain gaps (Appendix E). These are appropriate and well contextualized. No critical omissions were found.

Final Justification

I appreciate the author's response, which provided ample clarification and addressed my concerns. Therefore, I'm willing to maintain my rating.

Formatting Issues

N/A

Author Response

Q1: The implementation uses AGW as the backbone feature extractor. Could the authors clarify whether SVIP is compatible with other VI-ReID architectures? Have any preliminary tests been performed to confirm generalizability across backbones?

A1: We thank the reviewer for the insightful question. Although our implementation adopts the AGW baseline as the default backbone due to its widespread use and strong performance in VI-ReID tasks, we emphasize that the proposed SVIP framework is architecture-agnostic. All core modules in SVIP, including SGCL, RTL, and SCCA, are designed to operate on feature embeddings, not on architecture-specific outputs. These modules make no assumptions about the internal structure of the backbone and can work with any feature extractor that outputs modality-invariant embeddings for visible and infrared images. In preliminary tests, we have applied SVIP with other backbones such as ResNet-50 and OSNet. These experiments show consistent adaptation trends, confirming that SVIP generalizes well across different backbone architectures. Further results will be included in the supplementary material.


Q2: In SGCL, the confidence weight γ (Eq. 6) balances the influence between source and target cluster predictions. Could the authors provide an intuitive explanation (or a short plot) of how γ evolves over training? This may help understand the dynamics of source vs. target reliance.

A2: We appreciate the reviewer’s insightful question. In SGCL, the confidence weight γ defined in Equation (6) is designed to adaptively balance the contributions of the source and target cluster predictions based on their respective confidence levels, which are measured via entropy. This mechanism is especially crucial in the context of unsupervised domain adaptation, where the quality of pseudo-labels from the target domain varies throughout training.

At the beginning of training, the target model, having not yet adapted to the new domain, generally produces uncertain predictions with high entropy. In contrast, the source model retains relatively stable and confident predictions, reflected in lower entropy. As γ is calculated based on the relative entropy of source and target predictions, this results in higher γ values (typically around 0.7–0.8) in the early stages. Consequently, the loss function gives greater emphasis to the more reliable source-side predictions during this period.

As training progresses, the target model gradually adapts to the target domain, and its predictions become more confident. This improvement reduces the entropy of the target-side pseudo-labels, thereby decreasing the γ value. Eventually, γ tends to stabilize around 0.5, indicating a more balanced reliance between source and target predictions. This smooth evolution of γ over time effectively captures the dynamic shift from source-guided supervision in the early phase to target-adaptive learning in the later stages. To provide a more intuitive understanding of this process, we will include a visualization of the γ values across training epochs in the appendix. This plot will clearly demonstrate how the model transitions from strong dependence on the source model to increasing confidence in the target model as learning progresses.
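Since Equation (6) is not reproduced in this thread, the following sketch shows one plausible entropy-ratio form of γ that reproduces the dynamics described above (γ around 0.7–0.8 when target predictions are uncertain, decaying toward 0.5 as the two entropies equalize); it is an assumption, not the paper's exact formula.

```python
import numpy as np

def entropy(p, eps=1e-8):
    # Mean Shannon entropy of a batch of probability vectors.
    p = np.clip(p, eps, 1.0)
    return float(-(p * np.log(p)).sum(axis=-1).mean())

def confidence_weight(p_src, p_tgt):
    # Uncertain (high-entropy) target predictions push gamma up,
    # emphasizing the source side early in training; as the target
    # model gains confidence, gamma decays toward 0.5.
    h_src, h_tgt = entropy(p_src), entropy(p_tgt)
    return h_tgt / (h_src + h_tgt + 1e-8)
```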


Q3: The framework relies on DBSCAN clustering in both source and target models to generate pseudo-labels. While the use of memory banks and EMA improves stability, could the authors comment on whether the number of clusters fluctuates significantly across epochs?

A3: We appreciate the reviewer’s insightful comment regarding clustering stability. In SVIP, DBSCAN clustering is performed independently for both source and target models in each epoch. While some fluctuation in the number of clusters is expected due to evolving feature representations, we observe that such variation is limited in practice, for several reasons:

  • Memory Banks and Exponential Moving Average (EMA) (Section 3.2, Eq. 2–3) ensure smooth updates of feature embeddings across epochs, preventing sudden cluster shifts.
  • The SGCL module uses confidence-weighted fusion of clustering results, reducing the impact of minor variations in cluster counts.
  • Instead of using full probability distributions, which would require aligned cluster dimensions, our framework applies feature-level contrastive learning using cluster centers (Eq. 7), making it robust to cluster-number mismatches.

In summary, although DBSCAN is non-parametric and adaptive, SVIP is resilient to its dynamic behavior, and clustering consistency is implicitly stabilized by design. We will report the average and variance of cluster counts over training in the supplementary material.
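A minimal sketch of how the cluster-count statistics mentioned above could be logged, using scikit-learn's DBSCAN; the eps and min_samples values are placeholders, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_count_stats(features_per_epoch, eps=0.6, min_samples=4):
    # Track the number of DBSCAN clusters across epochs and summarize
    # the fluctuation as a simple clustering-consistency check.
    counts = []
    for feats in features_per_epoch:              # one (n, d) matrix per epoch
        labels = DBSCAN(eps=eps, min_samples=min_samples,
                        metric="cosine").fit_predict(feats)
        counts.append(len(set(labels) - {-1}))    # label -1 marks noise
    counts = np.asarray(counts)
    return counts.mean(), counts.var()
```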


Q4: In VI-ReID, the identities in the source and target domains typically do not overlap. Could you please clarify whether the proposed SVIP method assumes any partial overlap between the identity classes of the two domains, or is it completely agnostic to the label space?

A4: Thank you for raising this important question. We confirm that SVIP does not assume any identity overlap between the source and target domains. The method is fully label-space agnostic, designed specifically for realistic VI-ReID scenarios where identity sets across domains are non-overlapping due to surveillance environment differences.

Key points ensuring label-space independence:

  • SGCL aggregates clustering outputs from source and target models using entropy-based fusion, without any identity alignment.
  • RTL promotes feature-level consistency between the two models, enforcing alignment in intermediate representations, not in class labels.
  • SCCA discovers pseudo cross-modal pairs based on structural similarities within the target domain only, guided by the source model’s relational structure, not by matching identities.

These design choices enable SVIP to work in fully unsupervised and identity-disjoint settings, as demonstrated in our experiments on SYSU-MM01 → RegDB and RegDB → SYSU-MM01, where the source and target identity sets are strictly non-overlapping (see Table 1 and Section 4.3).

Comment

I appreciate the author's response, which provided ample clarification and addressed my concerns. Therefore, I'm willing to maintain my rating.

Comment

Thank you for taking the time to review our response. We appreciate your thoughtful engagement and your willingness to maintain your rating.

Review
Rating: 5

This paper proposes SVIP for Source-Free Domain Adaptation (SFDA) in Visible-Infrared Person Re-Identification (VI-ReID). Unlike existing methods that require access to source data during adaptation, SVIP relies solely on a pre-trained source model and unlabeled target domain data, addressing both privacy and deployment constraints in practical scenarios. The method introduces three key mechanisms: Source Guided Contrastive Learning (SGCL) for more reliable pseudo-supervision, Residual Transfer Learning (RTL) for structural knowledge distillation, and Structural Consistency Guided Cross-modal Alignment (SCCA) for effective cross-modal pair mining. Experiments on three benchmarks and under multiple weather conditions show that SVIP significantly outperforms state-of-the-art baselines.

Strengths and Weaknesses

Strengths: This work addresses an important and underexplored problem, i.e. SFDA for VI-ReID, which has strong practical relevance, especially in privacy-sensitive and resource-constrained environments. The formulation and empirical results make a compelling case for the practical deployment of the method.

Weaknesses: While the contributions are solid, some design choices (e.g., hyperparameter sensitivity and impact of source model quality) could benefit from further clarification. Besides, a more quantitative analysis of failure cases and feature distributions would strengthen the empirical section.

Questions

(1) Given the focus on real-world deployment (e.g., privacy, computation), it would be helpful to understand the computational cost of adapting SVIP to a new target domain. Could the authors report the typical adaptation time or GPU hours for one domain setting, and how it compares with conventional UDA methods requiring source data?

(2) In the SCCA module, pseudo cross-modal pairs are selected based on two constraints, including a similarity threshold T = 0.5 (Eq. 11). How sensitive is the alignment performance to this threshold? Including a brief justification for this value or analysis of its impact would improve transparency.

(3) Equations (4)–(5) and (12) involve the use of a temperature parameter τ in Softmax-based similarity computations. The implementation section fixes τ to 0.05, but its influence on model behavior (e.g., distribution smoothness, sensitivity to confidence fusion) is not discussed. Could the authors briefly comment on how τ was selected and whether SVIP is robust to its variation?

Limitations

The paper clearly acknowledges limitations such as performance dependency on the generalizability of the source model and potential challenges under large domain gaps (Appendix E). These are reasonable and well-contextualized. Further discussion on failure modes (e.g., retrieval errors) could complement this.

Final Justification

All my concerns have been addressed, and I have therefore raised my rating.

Formatting Issues

None

Author Response

Q1: Given the focus on real-world deployment (e.g., privacy, computation), it would be helpful to understand the computational cost of adapting SVIP to a new target domain. Could the authors report the typical adaptation time or GPU hours for one domain setting, and how it compares with conventional UDA methods requiring source data?

A1: We thank the reviewer for raising this important point. Since SVIP is designed for real-world deployment, we pay close attention to computational efficiency in addition to accuracy. SVIP performs adaptation using only the unlabeled target-domain data and a fixed source model, without accessing any source data. This design naturally reduces computational costs and avoids privacy concerns associated with source data sharing.

To clarify the adaptation cost, we compare SVIP with a representative conventional UDA method (i.e., UAN) that requires both source and target data, under the SYSU-MM01→RegDB setting. All experiments are conducted on a single NVIDIA RTX 3090 GPU. Specifically:

  • SVIP completes adaptation in approximately 2.3 hours, with a peak GPU memory usage of 10.2 GB.
  • In contrast, UAN requires 6.0 hours and 14.1 GB of memory.
  • Moreover, SVIP operates without any access to source data, while UAN must load and process both source and target datasets.

These results demonstrate that SVIP reduces adaptation time by more than 50% and significantly lowers GPU memory consumption. This efficiency can be attributed to two key design choices:

  • SVIP eliminates the need to load or augment source data during training.
  • The source model remains frozen, such that only the target model is updated.

These properties make SVIP particularly well-suited for privacy-sensitive and resource-constrained deployment environments.


Q2: In the SCCA module, pseudo cross-modal pairs are selected based on two constraints, including a similarity threshold T = 0.5 (Eq. 11). How sensitive is the alignment performance to this threshold? Including a brief justification for this value or analysis of its impact would improve transparency.

A2: We thank the reviewer for the thoughtful question regarding the similarity threshold T = 0.5 used in the SCCA module (Eq. 11). This threshold plays an important role in filtering reliable pseudo cross-modal pairs during structural consistency alignment.

We select T = 0.5 based on empirical observations to balance two competing factors: precision and coverage. A lower threshold tends to include more sample pairs but may introduce noisy matches, while a higher threshold improves pair quality but may exclude useful pairs, especially in the early stages of training.

To assess the sensitivity of our method to this parameter, we conducted an ablation study under the SYSU-MM01→RegDB (V2T) setting by varying T across the range [0.3, 0.4, 0.5, 0.6, 0.7]. The results are summarized as follows:

  T      Rank-1    mAP
  0.3    87.41%    77.92%
  0.4    89.06%    80.34%
  0.5    90.13%    81.26%
  0.6    89.48%    80.71%
  0.7    87.22%    78.15%

These results indicate that:

  • Overly low thresholds lead to noisy supervision due to inclusion of unreliable pairs.
  • Overly high thresholds result in insufficient training signals due to overly strict filtering.

Therefore, T = 0.5 provides a favorable balance between pair quality and quantity, and we use it consistently across all experiments to ensure stable and reliable performance.
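For illustration, a sketch of the threshold test on cosine similarities; the thread names T = 0.5 as one of two constraints in Eq. 11, and since the second constraint is not specified here it is omitted from this sketch.

```python
import numpy as np

def mine_cross_modal_pairs(vis_feats, ir_feats, T=0.5):
    # Cosine similarity between all visible/infrared embeddings; keep
    # index pairs exceeding the threshold T. The paper's second
    # selection constraint is not reproduced in this thread.
    v = vis_feats / np.linalg.norm(vis_feats, axis=1, keepdims=True)
    r = ir_feats / np.linalg.norm(ir_feats, axis=1, keepdims=True)
    sim = v @ r.T                  # (n_vis, n_ir) similarity matrix
    return np.argwhere(sim > T)    # rows of (vis_idx, ir_idx) pairs
```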


Q3: Equations (4)–(5) and (12) involve the use of a temperature parameter τ in Softmax-based similarity computations. The implementation section fixes τ to 0.05, but its influence on model behavior (e.g., distribution smoothness, sensitivity to confidence fusion) is not discussed. Could the authors briefly comment on how τ was selected and whether SVIP is robust to its variation?

A3: We thank the reviewer for the helpful question about the temperature parameter τ, which controls the sharpness of the Softmax distribution in both confidence estimation (Eqs. 4–5) and contrastive learning (Eq. 12).

We chose τ = 0.05 based on empirical observations. To evaluate the sensitivity of our model to this parameter, we vary τ in the range [0.01, 0.03, 0.05, 0.07, 0.10] and test performance on the SYSU-MM01→RegDB (V2T) setting. The results are as follows:

  τ      Rank-1    mAP
  0.01   88.74%    78.62%
  0.03   89.65%    80.73%
  0.05   90.13%    81.26%
  0.07   89.12%    79.83%
  0.10   89.12%    79.83%

These results indicate that SVIP maintains relatively stable performance across a wide range of τ values, demonstrating robustness to this hyperparameter. The best performance is achieved at τ = 0.05, which offers a good balance between confidence sensitivity and training stability. We therefore fix τ = 0.05 in all experiments and will include this analysis in the supplementary material for completeness.
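As a small illustration of the role of τ, the sketch below shows a generic temperature-scaled softmax over similarity scores; it is not the paper's exact Eqs. (4)–(5) and (12).

```python
import numpy as np

def softmax_with_temperature(sim, tau=0.05):
    # Smaller tau sharpens the distribution over similarity scores;
    # larger tau smooths it.
    z = sim / tau
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Example: with scores [0.9, 0.7, 0.1], tau = 0.05 concentrates almost
# all probability mass on the first entry, while tau = 1.0 stays flat.
```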

Comment

Thank you for the response. It has addressed all my concerns, and I recommend accepting the paper.

Comment

Thank you for your thoughtful feedback. We are pleased to hear that your concerns have been fully addressed, and we sincerely appreciate your recommendation for acceptance.

Review
Rating: 4

The paper proposes SVIP, a source-free domain adaptation framework for visible-infrared person re-identification that exploits the prior knowledge embedded in the source model to guide target domain adaptation. SVIP consists of three mechanisms: (1) Source Guided Contrastive Learning (SGCL), which uses the discriminative feature space of the fixed source model to perform contrastive learning on unlabeled target data; (2) Residual Transfer Learning (RTL), which learns residual mappings to adapt the target model's representations while preserving the source model's knowledge; and (3) Structural Consistency Guided Cross-modal Alignment (SCCA), which enforces structural consistency between visible and infrared modalities to identify reliable cross-modal pairs for alignment. Experiments on benchmark datasets demonstrate that SVIP significantly improves target domain performance and outperforms prior unsupervised VI-ReID methods under source-free settings.

Strengths and Weaknesses

This paper is well-motivated for source-free domain adaptation in visible-infrared person re-identification, introducing novel mechanisms (SGCL, RTL, SCCA) that effectively leverage the frozen source model to guide target adaptation without requiring source data. However, the paper lacks discussion on the capability of the source model, and also misses some qualitative visualizations explaining why some cases fail with other methods but succeed with the proposed method.

Questions

1. Could the authors provide the performance of the source model directly evaluated on the target domain without adaptation? Additionally, how much improvement does the proposed method achieve compared to this baseline? This comparison would help clarify the effectiveness of the adaptation process.

2. It would be helpful if the authors could include qualitative visualizations that illustrate typical failure cases of existing methods and explain why the proposed method succeeds in these scenarios. Could such visual comparisons be added?

Limitations

Yes.

Final Justification

I appreciate the author's detailed response, which has addressed my concerns. Accordingly, I will revise my rating.

Formatting Issues

None

Author Response

Q1: Could the authors provide the performance of the source model directly evaluated on the target domain without adaptation? Additionally, how much improvement does the proposed method achieve compared to this baseline? This comparison would help clarify the effectiveness of the adaptation process.

A1: Thank you for your insightful question. To evaluate the effectiveness of the proposed domain adaptation process, we report the performance of the source model directly evaluated on the target domain without any adaptation, denoted as Source Only in Tables 1 and 2. This baseline represents the scenario where a model trained on the source domain is applied to the target domain without any further learning or adjustment. As expected, due to the significant domain and modality gaps in visible-infrared person re-identification, the source model alone performs poorly across all target domains, clearly justifying the necessity of domain adaptation.

In the basic domain adaptation setting (Table 1), we observe that the Source Only model yields substantially lower Rank-1 accuracies. For example, when adapting from SYSU-MM01 to RegDB (V2T), the source model achieves only 31.26% Rank-1 accuracy, while our method SVIP achieves 90.10%, yielding a remarkable improvement of 58.84%. Similarly, for SYSU-MM01 to LLCM (V2I), the source model obtains 12.47%, while SVIP reaches 40.50%, showing an improvement of 28.03%. In the LLCM to RegDB setting, the source-only performance is 24.51%, whereas SVIP achieves 90.20%, with a significant gain of 65.69%. These consistent and substantial improvements across different adaptation scenarios confirm the effectiveness of our framework in leveraging the pretrained source model to boost target-domain performance under the source-free setting.

Furthermore, we evaluate our method in more challenging conditions in the weather-degraded setting (Table 2), where environmental corruptions such as fog, frost, rain, and snow are introduced. In the foggy version of LLCM-W, the source model achieves only 8.20% Rank-1 accuracy, whereas SVIP improves this to 36.13%, yielding a gain of 27.93%. On the snow condition of RegDB-W, the source model achieves 11.55%, and SVIP reaches 46.05%, improving by 34.50%. These results further demonstrate the robustness and adaptability of our approach under adverse conditions.

In summary, the proposed SVIP consistently and significantly outperforms the source model baseline across various domains and weather settings. This clearly verifies the effectiveness of the adaptation process and highlights the importance of exploiting the source model’s prior knowledge through our carefully designed mechanisms for cross-modal domain adaptation.


Q2: It would be helpful if the authors could include qualitative visualizations that illustrate typical failure cases of existing methods and explain why the proposed method succeeds in these scenarios. Could such visual comparisons be added?

A2: We sincerely appreciate the reviewer’s constructive suggestion. Overall, existing methods tend to produce incorrect matches in scenarios involving large modality discrepancies or complex background clutter, primarily due to inadequate cross-modal alignment and limited structural guidance. In contrast, our method effectively addresses these challenges by incorporating source-guided structural priors through the SCCA module and applying confidence-weighted supervision via SGCL. These mechanisms enable more robust and semantically aligned representation learning. The qualitative comparisons in the supplementary material clearly demonstrate that SVIP achieves more accurate and consistent retrieval results in challenging cases where prior approaches often fail.

Final Decision

Four experts in the field reviewed this paper. Their recommendations are 2 Borderline Accepts and 2 Accepts. This paper presents an innovative source-free domain adaptation (SFDA) method for visible–infrared person ReID, which is relevant in privacy-sensitive and resource-constrained application scenarios. Overall, the reviewers appreciated the paper because the proposed SFDA method integrates the SGCL, RTL, and SCCA components, which are well motivated and allow the source model to be leveraged to guide adaptation across the visible and infrared modalities. Reviewers also highlighted the strong empirical results of the proposed method, noting substantial improvements over SOTA UDA and SFDA methods on SYSU-MM01, RegDB, LLCM, and synthetic weather-perturbation data. The ablation studies on hyperparameters, e.g., the similarity threshold and temperature parameter, were also viewed positively.

In the discussions and rebuttal, reviewers noted the missing discussion on source model capability and the lack of qualitative visualizations and analysis of failure cases, and requested an analysis of computational complexity and sensitivity to thresholds. These concerns were well addressed in the rebuttal, and two of the reviewers raised their ratings. The paper makes a decent contribution to adaptation for cross-modal ReID, and the empirical improvements are substantial. The weaknesses are minor and were largely addressed during the rebuttal. Based on the reviewers’ feedback and the authors’ satisfactory rebuttal that addresses most concerns, I recommend it for acceptance. The reviewers raised some important issues and concerns in the Weaknesses sections that should be addressed in the final camera-ready version of the paper. The authors are encouraged to make the necessary changes to the best of their ability.