PaperHub
Rating: 7.6/10
Poster · 3 reviewers
Ratings: 4, 5, 5 (min 4, max 5, std 0.5)
Confidence: 4.0
Novelty: 3.3 · Quality: 3.7 · Clarity: 3.7 · Significance: 3.3
NeurIPS 2025

Ditch the Denoiser: Emergence of Noise Robustness in Self-Supervised Learning from Data Curriculum

Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

We show that self-supervised models like DINOv2 can develop strong noise robustness without any explicit denoiser at downstream fine-tuning or inference, by leveraging a data curriculum and a denoised regularization loss during pretraining.

Abstract

Self-Supervised Learning (SSL) has become a powerful solution to extract rich representations from unlabeled data. Yet, SSL research is mostly focused on clean, curated and high-quality datasets. As a result, applying SSL on noisy data remains a challenge, despite being crucial to applications such as astrophysics, medical imaging, geophysics or finance. In this work, we present a fully self-supervised framework that enables noise-robust representation learning without requiring a denoiser at inference or downstream fine-tuning. Our method first trains an SSL denoiser on noisy data, then uses it to construct a denoised-to-noisy data curriculum (i.e., training first on denoised, then noisy samples) for pretraining an SSL backbone (e.g., DINOv2), combined with a teacher-guided regularization that anchors noisy embeddings to their denoised counterparts. This process encourages the model to internalize noise robustness. Notably, the denoiser can be discarded after pretraining, simplifying deployment. On ImageNet-1k with ViT-B under extreme Gaussian noise ($\sigma=255$, SNR = 0.72 dB), our method improves linear probing accuracy by 4.8% over DINOv2, demonstrating that denoiser-free robustness can emerge from noise-aware pretraining. The code is available at https://github.com/wenquanlu/noisy_dinov2.
Keywords
Self-Supervised Learning, Curriculum Learning, Noise, Robustness, Representation Learning

Reviews and Discussion

Review (Rating: 4)

The paper offers a method that enables off-the-shelf self-supervised models like DINOv2 to robustly learn representations from noisy images in a self-supervised way. A denoiser is trained and then used to generate noisy-denoised image pairs that build a curriculum for post-training DINO. The later parts of the curriculum add a teacher-based regularizer to force convergence. The denoiser is jettisoned after post-training.

The idea is not that different from the use of the decoder half of a diffusion model, or denoising quantification as in Syed & Mirza. The idea is built with visible increments, like showing DINO's weakness in a data-scarce setting, medium improvements from data addition, and finally a new paradigm for gradually building noise tolerance. Experiments are reflective of this progression.

Strengths and Weaknesses

The strengths include not having to denoise during inference, and the tailored use of a foundation model.

The paper does not use several foundation models, making the claim above a little weak. The noise is random and may vary across runs, something that the experiments at larger scale (data size) avoid, raising a question about scaling.

Questions

How important will the temperature of the curriculum be?

Limitations

The noise added is ...added. Multiplicative or mixed-source noise often plagues data.

Final Justification

I was convinced of the usefulness of the idea for real world data, but had given a lower score because of some impediments to the method really working well in the wild. The authors have clarified some, and admitted some, which puts a better definition on the employment of the method.

Formatting Issues

None.

Author Response

We sincerely thank Reviewer 5arj for the positive review and helpful suggestions. Your comments have helped us better clarify important methodological details and highlight the robustness of our approach. Below, we would like to address each of your questions and concerns:

Question 1: How important will the temperature of the curriculum be?

Thank you for the thoughtful question. Our curriculum does not explicitly involve a temperature parameter. Instead, we adopt a discrete two-phase schedule, transitioning from denoised (easier) to noisy (harder) data. We experimented with gradually increasing the proportion of noisy samples during training but found that this neither accelerated convergence nor improved final performance.
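To make the two-phase schedule concrete, below is a minimal sketch of how such a discrete denoised-to-noisy switch could look in code (PyTorch; the datasets, restart epoch, and training step are illustrative placeholders, not our exact implementation):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the denoised and noisy image sets (shapes are illustrative only).
denoised_set = TensorDataset(torch.rand(64, 3, 32, 32))
noisy_set = TensorDataset(torch.rand(64, 3, 32, 32))

RESTART_EPOCH, TOTAL_EPOCHS = 30, 200  # hypothetical schedule; the restart epoch is tuned in practice

for epoch in range(TOTAL_EPOCHS):
    # Phase 1 (easier): denoised images; Phase 2 (harder): the original noisy images.
    dataset = denoised_set if epoch < RESTART_EPOCH else noisy_set
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    for (images,) in loader:
        pass  # the DINOv2 student/teacher update would run here
```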

Currently, the softmax temperatures used in the DINOv2 joint embedding loss are set to the default schedule for both training stages. The teacher temperature follows a warm-up phase and then remains constant. The student temperature is held constant. We are aware that, in the knowledge distillation literature [1], temperature can be dynamically learned to maximize the distillation loss, thereby increasing learning difficulty and forming an effective curriculum. While in our paper the learning difficulty is primarily controlled by the presence of noise, we think exploring the incorporation of dynamic temperature into our research is also highly worthwhile. We appreciate your insight that this is an interesting direction for future exploration.
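For reference, a hedged sketch of the warm-up-then-constant teacher temperature schedule described above; the specific values are common DINO-style defaults used purely as placeholders, not necessarily the exact configuration of our runs:

```python
import numpy as np

def teacher_temperature(epoch, warmup_epochs=30, warmup_temp=0.04, final_temp=0.07):
    """Linear warm-up of the teacher softmax temperature, then constant (placeholder values)."""
    if epoch < warmup_epochs:
        return float(np.interp(epoch, [0, warmup_epochs], [warmup_temp, final_temp]))
    return final_temp

STUDENT_TEMP = 0.1  # student softmax temperature held constant throughout training (placeholder)
```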

[1] Curriculum Temperature for Knowledge Distillation, Li et al., AAAI 2023
 

Comment 1 on the weakness: The noise is random, and may vary across many runs, something that the experiment at larger scales (data size) avoids, raising a question over scaling.

Thank you for raising this concern, and we apologize for the lack of clarity. We would like to clarify that the noise does not vary across runs in any of our experiments. We generate the noisy dataset once prior to training and do not apply noise on-the-fly during data loading. This setup is intended to closely simulate real-world scenarios where one starts with a fixed noisy dataset. We will make sure to explicitly clarify this detail in Section 4.1 (Noise Addition) of our revised paper.
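As a small illustration of this fixed-dataset setup, the sketch below applies Gaussian noise once, offline, with a fixed seed and writes the corrupted images to disk, so every run reads identical noisy samples (NumPy/Pillow; the paths and sigma are placeholders, not our exact script):

```python
from pathlib import Path

import numpy as np
from PIL import Image

SIGMA = 100  # Gaussian noise level on the 0-255 scale (placeholder)
rng = np.random.default_rng(0)  # fixed seed: the noisy dataset is identical across runs

src, dst = Path("imagenet100/clean"), Path("imagenet100/gaussian100")
for path in src.rglob("*.JPEG"):
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0
    noisy = np.clip(img + rng.normal(0.0, SIGMA / 255.0, img.shape), 0.0, 1.0)
    out = dst / path.relative_to(src)
    out.parent.mkdir(parents=True, exist_ok=True)
    Image.fromarray((noisy * 255).astype(np.uint8)).save(out)
```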
 

Comment 2 on the limitation: The noise added is …added. Multiplicative or mixed source often plagues data.

We fully agree with your valid concern that multiplicative or mixed-source noise often plagues data. We would like to clarify that many of the noise types tested in this paper, such as shot noise and speckle noise, are not additive. In particular, speckle noise is a multiplicative noise, as shown in its formulation in Appendix A.5, Equation (9).

To address the concern regarding mixed noise sources, we evaluated our method on Poisson-Gaussian noise, which is one of the most common mixed noises in imaging. The Poisson-Gaussian noise is formulated as $x = \frac{\operatorname{Poisson}(\tilde{x} \cdot \lambda)}{\lambda} + \mathcal{N}\left(0, \left(\frac{\sigma}{255}\right)^2\right)$, where $\tilde{x}$ is the clean image. We set $\lambda = 3$ and $\sigma = 100$, using the default 100 epochs to train the N2N denoiser, and 200 epochs to train DINOv2 ViT-S on ImageNet-100. The linear probing accuracies are shown in the table below:

Comparison of linear probing accuracies on Poisson-Gaussian mixed noise with 200 epochs fixed training budget

Method          Accuracy
DINOv2          53.8
DINOv2 w/ NC    63.2
DINOv2 w/ NCT   65.1
N2N + DINOv2    65.8

The results are fully consistent with Figure 1 and Table 5 in the paper, where DINOv2 w/ NC significantly outperforms the DINOv2 baseline, and DINOv2 w/ NCT closely matches N2N + DINOv2. This demonstrates that our methods are effective under mixed noise conditions and generalize to more complex scenarios. We will include these results in our revised paper to highlight the robustness of the methods under mixed noise.
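For concreteness, a minimal NumPy sketch of the Poisson-Gaussian corruption defined above (images assumed in [0, 1]; lambda and sigma as in the formula; illustrative only, not the exact generation script):

```python
import numpy as np

def poisson_gaussian(clean, lam=3.0, sigma=100.0, rng=None):
    """clean: float array in [0, 1]; lam: Poisson scaling; sigma: Gaussian std on the 0-255 scale."""
    rng = rng or np.random.default_rng(0)
    shot = rng.poisson(clean * lam) / lam               # signal-dependent (Poisson) component
    read = rng.normal(0.0, sigma / 255.0, clean.shape)  # signal-independent (Gaussian) component
    return np.clip(shot + read, 0.0, 1.0)

noisy = poisson_gaussian(np.random.rand(224, 224, 3))
```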
 

We appreciate your thoughtful feedback and hope we have addressed all of your questions and concerns. We remain engaged and willing to provide further clarification during the ongoing discussion.

Comment

We sincerely hope our responses have addressed all of your questions and clarified any ambiguities. Your feedback has helped us greatly strengthen our paper, particularly in clarifying the temperature schedule, the noisy dataset curation details, and demonstrating the robustness of our method under mixed noises. As the discussion period is drawing to a close, we would be grateful to hear your final thoughts. If there is anything that remains unclear or requires additional discussion, we would be glad to provide further details promptly.

Thank you again for your positive and valuable review!

Review (Rating: 5)

This paper proposes a self-supervised training approach aimed at enhancing DINOv2's performance in high-noise scenarios using curriculum learning. The authors provide thorough analysis and comparisons, demonstrating that the two-stage (Clean + noise) self-supervised training yields superior performance compared to single-stage training (using solely noisy data). The primary experimental setting involves training on synthetic noisy images and evaluating on synthetic noisy images. A notable advantage is that this method eliminates the need for a denoiser at test time, while the final results remain robust on both clean and noisy data.

Strengths and Weaknesses

  1. Novel setting: SSL model training tailored for noisy images.

  2. Extensive experiments: Evaluated across classification and instance-level recognition tasks, and applied to multiple SSL methods beyond DINOv2.

  3. Well-written paper: Clear presentation and good organization.

Questions

  1. Critical point: Performance of DINOv2 trained on clean images evaluated on the noisy test set? The paper primarily reports DINOv2(clean) performance on clean images. However, the performance drop of DINOv2(clean) on the noisy test set is crucial for establishing this work's significance.
    Suggestion: Supplement Tables 2 and 3 with these results.

  2. The inclusion of denoised images inherently enhances training set expressiveness.
    Please show results when training on:

    • Solely noisy images vs.
    • Mixed noisy + denoised images
  3. Significant performance gap exists between the author-trained DINOv2-s (79.4% ImageNet accuracy) and the official DINOv2-small (81.1%). I understand computational constraints may limit training.
    Key question: Can the proposed method effectively fine-tune a well-pretrained official DINOv2 model? Enhancing the robustness of widely-used official weights against noisy data would be highly impactful.

  4. Urgently needed: Evaluation on real-world noisy datasets (e.g., Hendrycks & Dietterich, "Benchmarking Neural Network Robustness to Common Corruptions and Perturbations", arXiv:1903.12261, 2019).

  5. Can PCA visualizations on noisy images be shown for different models? (Similar to the style used in the original DINOv2 paper).

  6. Releasing code would be beneficial to the community.

Limitations

The authors have honestly acknowledged one limitation I previously noted: the lack of testing on real-world noisy datasets. However, one critical limitation remains unaddressed: The paper does not analyze whether the proposed method can effectively fine-tune well-pretrained models (e.g., off-the-shelf DINOv2 checkpoints) rather than training exclusively from scratch.

Final Justification

I am satisfied with the authors' response. Their revisions and clarifications have thoroughly resolved my questions.

Formatting Issues

The article is well-structured.

Author Response

We sincerely thank Reviewer 1LLb for the positive feedback, and for raising a series of very insightful questions. Below, we would like to give detailed responses to each of your suggestions and questions.

Question 1: Performance of DINOv2 trained on clean images evaluated on the noisy test set?

Thanks for raising the necessity of this baseline; we fully agree that it is crucial for establishing this work's significance. To address this, we conducted further evaluation to supplement Table 2 with results for both ImageNet-1k and ImageNet-100, as shown below. We observe significant drops in accuracy (>20 points for Gaussian-100, >40 points for Gaussian-255) when evaluating the clean-pretrained DINOv2 on the noisy test set. This demonstrates that SSL models trained exclusively on clean data cannot generalize to noisy images in downstream tasks; it is therefore important to research training methods that produce noise-robust models, which is the main goal of our paper. We will update Table 2 and plan to include the additional results for Table 3 in our revised paper.

Dataset                     Training Noise   Method         Clean   Noisy
ImageNet-1k (100 epochs)    Gaussian-100     DINOv2 Clean   79.0    58.1
ImageNet-1k (100 epochs)    Gaussian-255     DINOv2 Clean   79.0    20.5
ImageNet-100 (1000 epochs)  Gaussian-50      DINOv2 Clean   81.4    74.3
ImageNet-100 (1000 epochs)  Gaussian-100     DINOv2 Clean   81.4    61.3
ImageNet-100 (1000 epochs)  Gaussian-255     DINOv2 Clean   81.4    34.5
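As a brief note on the protocol used in this table (and throughout the paper), linear probing trains only a linear classifier on top of the frozen SSL backbone. A minimal PyTorch sketch, with hypothetical names and hyperparameters rather than our exact recipe:

```python
import torch
import torch.nn as nn

def linear_probe(backbone, feat_dim, num_classes, loader, epochs=10, device="cuda"):
    """Freeze the SSL backbone and train only a linear classification head (illustrative)."""
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad = False  # the feature extractor stays frozen
    head = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.SGD(head.parameters(), lr=0.01, momentum=0.9)
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = backbone(images)  # e.g., [CLS] token embeddings
            loss = nn.functional.cross_entropy(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```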

 
 

Question 2: Show results when training on mixed noisy + denoised images

We understand the reviewer’s valid suspicion that incorporating denoised images could enhance the training set’s expressiveness. In practice, we found mixing clean (denoised) and noisy images during training often leads to degraded performance. This is primarily due to reduced representation alignment between noisy and denoised samples, which hinders the model’s ability to learn consistent features. The table below shows the linear probing accuracies of DINOv2 ViT-S on the noisy ImageNet-100 test set, when trained for 200 epochs on 100% noisy data (from Table 1(a)) versus on 50% noisy + 50% denoised data. We observe a significant drop of 7.4 points in accuracy when noisy and denoised data are randomly mixed during training.

Training Setting                   Accuracy
100% Gaussian-100                  55.4
50% Gaussian-100 + 50% denoised    48.0

In Appendix B.4, we comprehensively tested the performance of DINOv2 when trained and evaluated on a mixed noisy + clean dataset. As seen in Table 7, introducing even a small amount of clean images (e.g., 2%, 10%) during training destabilizes the process and slightly degrades performance. The above results show that naive mixing is suboptimal, and reinforce the necessity of a staged curriculum learning approach that separates noisy and clean (denoised) images.
 

Question 3: Can the method fine-tune a well-pretrained official DINOv2 model?

We sincerely appreciate your understanding of the computational constraints. DINOv2-small in the original DINOv2 paper is based on the closed-source LVD-142M dataset, which contains 142 million images. In contrast, our model is trained on ImageNet-1k with only 1 million images, which helps explain the performance gap. Additionally, the official DINOv2 models use a patch size of 14, resulting in more parameters and higher input resolution compared to our model, which uses a patch size of 16.

Yes, the proposed method, in particular NCT, can effectively fine-tune a well-pretrained DINOv2 model against noisy data. While our method is primarily designed for pretraining from scratch, we demonstrate that our denoised-regularized loss can significantly improve upon the vanilla DINOv2 fine-tuning baseline. We use the official ViT-B/14 weights and fine-tune them on ImageNet-1k with Gaussian-255 noise for 4 epochs due to the limited rebuttal time window. Two fine-tuning strategies are employed: (1) fine-tune directly using the original DINOv2 loss; (2) fine-tune using our NCT loss, where a copy of the pretrained weights serves as the denoised teacher. The linear probing accuracies on both noisy and clean test sets are shown in the table below:

Test set       DINOv2 ViT-B (official)   Finetune w/ DINOv2 loss (baseline)   Finetune w/ NCT loss
Gaussian-255   29.1                      39.2                                 47.2
Clean          84.5                      48.0                                 58.2

We observe that the NCT loss achieves significantly higher accuracy on both the noisy target set (+8.0) and the clean set (+10.2) compared to the original loss. This shows the regularization not only helps convergence on the noisy set, but also reduces distributional shift from the clean set. We acknowledge that tuning on very noisy data will inevitably reduce clean performance due to the significant distributional gap, but researching how to improve the noise robustness of pretrained models without sacrificing clean accuracy remains a promising direction for future work.

We would like to emphasize that, due to rebuttal time constraints, the above results were produced out of the box without any cross-validation or hyperparameter tuning. We believe that extending the number of training epochs would further improve performance. We would also like to highlight that fine-tuning the official DINOv2 weights is non-trivial: only the backbone weights are released, while the iBOT and DINO heads, which are crucial for fine-tuning, are not. Thus, we first train the heads on clean data for 75k steps with the backbone frozen, before fine-tuning on noisy data. These results likely represent a lower bound, and we believe that access to the official iBOT and DINO heads, which are trained on large-scale datasets, would substantially improve fine-tuning performance.
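To illustrate what "a copy of the pretrained weights serves as the denoised teacher" means, below is a conceptual sketch of the teacher-guided regularization; the exact loss form and weighting in the paper may differ, and a cosine-alignment term is used here purely for illustration:

```python
import torch
import torch.nn.functional as F

def nct_regularizer(student, denoised_teacher, noisy_images, denoised_images):
    """Pull the student's embeddings of noisy images toward the frozen teacher's
    embeddings of their denoised counterparts (illustrative anchoring term)."""
    with torch.no_grad():
        target = denoised_teacher(denoised_images)  # anchor embeddings from the frozen copy
    pred = student(noisy_images)                    # student embeddings of the noisy views
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()

# total_loss = dinov2_loss + lambda_reg * nct_regularizer(...)  # lambda_reg is a placeholder weight
```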
 

Question 4: Evaluation on real-world noisy datasets

Thanks for the valuable suggestion. The source “Benchmarking Neural Network Robustness to Common Corruptions and Perturbations” (i.e., ImageNet-C) also utilizes synthetic noise injection processes. In our work, we actually employ the code base of ImageNet-C to create noisy datasets.

As acknowledged in Lines 354–356 of Section 5 (Limitations), a key challenge is the scarcity of large-scale real-world noisy datasets with reliable labels, which hinders standardized evaluation. As a promising next step, we are actively exploring the curation of such a benchmark to facilitate more realistic and comprehensive assessment. We fully agree that such a benchmark allows a fairer and more practical assessment of our method.  
 

Question 5: PCA Visualizations

Thanks for the great suggestion. We agree that PCA visualizations provide an intuitive and vivid representation of a model’s learned features. However, due to the current NeurIPS rebuttal policy, which prohibits any links, we cannot post images here. Instead, we will describe the PCA visualizations verbally.

We strictly follow the style in the original DINOv2 paper to generate PCA visualizations. Under Gaussian-255 noise, we observe that the DINOv2 baseline generates very noisy visualizations, as it struggles to separate salient objects (i.e., birds) from the background. In contrast, NC and NCT generate visualizations that clearly separate the salient objects with sharp boundaries. Notably, objects of the same class are consistently represented using similar color patterns across different images, indicating semantically aligned features. This helps explain why NC and NCT substantially outperform the DINOv2 baseline in downstream tasks. Moreover, from the visualizations, we can infer that NC and NCT will outperform DINOv2 not only in classification tasks, but also in dense tasks like segmentation and depth estimation. We will ensure to include these visualizations in the Appendix of our revised paper.  
 

Question 6: Releasing code

We have included all source code in the zipped supplementary materials, along with detailed instructions and links to download trained weights, which we will ensure are released to the community. We fully agree that open-sourcing the code and weights is beneficial to the scientific advancement of the community.
 

We truly appreciate the feedback and believe the changes have made our submission much stronger. We hope we've covered all your questions, and we will remain available for conversation until the discussion period ends.

Comment

Regarding the response to Question 1:

The response has significantly enhanced the paper's significance. As a reader, I was previously unaware of the substantial performance degradation that Gaussian-100 imposes on DINOv2. This observation aligns with phenomena noted in other domains. Could the authors provide citations to other studies documenting this phenomenon to further substantiate their claim? I believe this warrants a modest upward adjustment in the score.

Regarding the response to Question 3:

The response clarifies a limitation of the proposed method: its difficulty when applied to fine-tuning official pre-trained weights. However, the performance drop observed on the clean dataset is primarily attributable to external factors (e.g., computational constraints, missing partial training weights), not to the authors' method itself. Consequently, I do not consider this sufficient grounds for lowering the score.

Comment

We greatly appreciate your thoughtful follow-up and careful consideration. We fully agree that providing citations to other studies can substantiate our claim.

To further support our claim regarding the vulnerability of DINOv2 models trained on clean data when evaluated on noisy inputs, we first provide additional results and then discuss relevant findings from prior literature.

In our previous rebuttal response to Question 1, although the DINOv2 backbones were pretrained on clean images, the linear heads were fine-tuned on noisy images with the same noise distribution as the test set during linear evaluation. This simulates a common downstream scenario where a frozen feature extractor is adapted to a new domain via task-specific heads.

To isolate the impact of training exclusively on clean data, we now benchmark a stricter setting in which both the backbone and linear heads are trained solely on clean images, and evaluation is performed on noisy test sets. The results are shown in the table below:

Dataset                     Training Noise   Method         Clean   Noisy
ImageNet-1k (100 epochs)    Gaussian-100     DINOv2 Clean   79.0    40.7
ImageNet-1k (100 epochs)    Gaussian-255     DINOv2 Clean   79.0    2.2
ImageNet-100 (1000 epochs)  Gaussian-50      DINOv2 Clean   81.4    60.0
ImageNet-100 (1000 epochs)  Gaussian-100     DINOv2 Clean   81.4    22.3
ImageNet-100 (1000 epochs)  Gaussian-255     DINOv2 Clean   81.4    3.1

We observe an even more severe performance drop (>20 points for Gaussian-50, >35 points for Gaussian-100, >70 points for Gaussian-255) in this setting compared to our earlier setup. This is expected, as the model's representations are directly evaluated under substantial domain mismatch without any adaptation, leading to significantly worse generalization. The results reinforce our central claim: models trained solely on clean data do not generalize to noisy scenarios.

These observations are also fully consistent with prior work. Chhipa et al. (2023) [1] benchmarked the robustness of various SSL models under ImageNet-C degradations. The study adopts the same setting as our table above: SSL backbones pretrained on clean images, with linear heads fine-tuned on clean data, and evaluation performed on corrupted ImageNet-C images. As shown in the 'gaussian_noise' plot in Figure 2, the error rates increase drastically for all SSL models as noise severity increases. At severity level 5 (Gaussian-97), the error rates of all evaluated models (e.g., SwAV, DINO, SimSiam, SimCLR, BYOL, Barlow Twins) are greater than 90 with ViT-S/8 and ResNet50. Our results with ViT-S/16 (accuracy 22.3, error 77.7) and ViT-B/16 (accuracy 40.7, error 59.3) show slightly lower error rates, likely due to stronger backbones (i.e., DINOv2) and better fine-tuning, but the degradation trend remains similarly severe.

The same trend is also observed for other downstream tasks. In Vanyan et al. (2023) [2], as shown in the middle plot of Figure 4, applying Gaussian-40 noise significantly reduces the mIoU of clean-trained DINOv2 on a semantic segmentation task from 0.59 to 0.36, a 39% relative drop. Notably, Gaussian-40 is a relatively mild corruption compared to the higher noise levels studied in our work (Gaussian-50, 100, and 255). The performance degradation is even more severe for other SSL models, such as DINO and MAE.

Consequently, our results and discussion reflect the widespread vulnerability of clean-pretrained SSL models to noise, even for state-of-the-art models like DINOv2, underscoring the significance of our research. We will supplement Section 2 (Related Works) with this literature in our revised paper, and we believe these additional experiments have further strengthened our contributions.

We sincerely appreciate your constructive feedback, and we will remain actively engaged in the remaining discussion period.  
 

[1] Can Self-Supervised Representation Learning Methods Withstand Distribution Shifts and Corruptions?, Chhipa et al., ICCV Workshop, 2023

[2] Analyzing local representations of self-supervised vision transformers, Vanyan et al., 2023

Comment

I am satisfied with the authors' response. Their revisions and clarifications have thoroughly resolved my questions.

Comment

We appreciate your confirmation that our responses have addressed your questions. Thank you again for your valuable feedback and support!

Review (Rating: 5)

This paper presents a novel, fully self-supervised framework to train noise-robust representation models without needing a denoiser during inference or fine-tuning. The central problem addressed is that state-of-the-art Self-Supervised Learning (SSL) methods, like DINOv2, perform poorly when pretrained on noisy data, a common issue in real-world applications such as medical imaging and astrophysics. The proposed solution involves a curriculum learning strategy where an SSL model is first trained on a denoised version of the dataset (created using an auxiliary SSL denoiser) and then continues training on the original noisy data. This "denoised-to-noisy" curriculum encourages the model to internalize noise-robust features, allowing the denoiser to be discarded after pretraining. Experiments on ImageNet under various synthetic noise conditions show that the proposed methods significantly improve performance over standard SSL training on noisy data and often match or exceed a pipeline that explicitly uses a denoiser.

Strengths and Weaknesses

Strengths:

  1. The work addresses the critical and under-explored challenge of applying SSL to noisy, uncurated datasets, which is a frequent scenario in many scientific and industrial domains.
  2. The proposed noise curriculum (NC) is an intuitive and simple yet powerful strategy. It effectively builds noise robustness directly into the representation model without requiring complex architectural changes or a persistent denoiser module.
  3. The authors conduct extensive experiments across multiple datasets (ImageNet-100, ImageNet-1k) , noise types (Gaussian, Shot, Speckle), and severity levels. The method's effectiveness is demonstrated on different downstream tasks (classification, instance recognition) and its generalizability is tested on a wide range of SSL models (SimCLR, MoCo v3, iBOT, etc.).
  4. A key advantage is that the denoiser is only used during the pretraining phase and can be discarded for downstream tasks. This simplifies deployment and reduces computational overhead and latency during inference.

Weaknesses:

  1. To my understanding, the framework's success hinges on the initial step of training a self-supervised denoiser that can produce a reasonably clean version of the dataset. The performance of the entire pipeline is therefore dependent on the effectiveness of the chosen denoiser, which might be a challenge for unusual or extremely high levels of noise.
  2. The timing of the switch from denoised to noisy training is a crucial hyperparameter that currently requires manual tuning. While the paper shows the method is robust within a certain range, an automated or adaptive strategy for this transition would enhance the method's practicality.
  3. The proposed method introduces additional computational steps to the pretraining pipeline: training an SSL denoiser and then using it to process the entire dataset. While it saves computation at inference, this increases the upfront resource requirement for pretraining.
  4. The experiments are benchmarked using synthetic noise types. While this allows for controlled and reproducible evaluation, real-world noise can be more complex. For example, in medical imaging, the noise in CT sinograms can be viewed as Poisson noise; however, in the image domain, it is difficult to define what the noise is. The translation of these impressive results to real-world noisy datasets is not yet fully demonstrated.

Questions

  1. The performance of your framework depends on the initial SSL denoiser. How does the performance of DINOv2 w/ NC degrade if a less effective denoiser is used? Is there a performance threshold for the denoiser below which the curriculum provides diminishing or no returns?

  2. The restart epoch is a key hyperparameter. Have you explored any methods to automate the selection of this epoch? For instance, could the transition be triggered dynamically based on the convergence of the model on the denoised data?

  3. Do you have plans to evaluate your framework on such real-world datasets, and how do you anticipate its performance will translate from i.i.d. synthetic noise to more complex, structured real-world noise?

  4. The analysis on clean validation sets is very insightful. You suggest that DINOv2 w/ NC/NCT outperforms the N2N + DINOv2 baseline because explicit denoising leads to information loss. Could an alternative or complementary explanation be that the two-stage curriculum acts as a powerful regularizer, forcing the model to learn more general features by being exposed to both clean (denoised) and noisy distributions?

  5. For the DINOv2 w/ NCT results on ImageNet-1k, the denoised teacher used for regularization is trained for a full 100 epochs, while the student model it guides only sees the denoised data for 30 epochs before switching to noisy training. This seems to create a significant gap between the teacher and the student's initial state in the second training phase. Could you elaborate on why this design choice was more effective than using a teacher from the 30-epoch checkpoint, and how embedding alignment is preserved?

Limitations

Yes, the authors have adequately addressed the major limitations of their work in Section 5. A constructive discussion of limitations is critical, and the authors handle this well.

The paper correctly identifies and discusses two key limitations:
    • Dependency on denoiser quality: The authors acknowledge that the framework assumes a reasonably effective self-supervised denoiser can be trained on the noisy data. They note that this is a fair assumption in many cases but remains an "open dependency."
    • Manual curriculum scheduling: The paper points out that the current design requires manual tuning to determine the optimal point to switch from denoised to noisy pretraining.

The authors also implicitly acknowledge the use of synthetic noise as a necessary choice for controlled, reproducible assessment. To further strengthen this section, I would suggest also adding a brief discussion on the increased upfront computational cost of the pretraining pipeline (training a denoiser and processing the dataset) as a trade-off for the gains in inference efficiency and robustness.

Final Justification

The additional experimental evidence and theoretical insights the author provided will strengthen the revised manuscript significantly. I maintain my positive assessment of this work.

Formatting Issues

NA

Author Response

We sincerely thank Reviewer BtzQ for the thoughtful and highly positive review. Below, we would like to address each of your questions in detail:

Question 1: How does the performance of DINOv2 w/ NC degrade if a less effective denoiser is used?

Thanks for raising this practical question. To investigate it, we conducted additional studies on how the number of denoiser training epochs impacts the final SSL model’s performance. We use the number of training epochs as a quantifiable proxy for denoiser effectiveness. The evaluation follows our default setting: 200 epochs of DINOv2 ViT-S training with Gaussian-100 noisy data. The results are shown in the table below:

Performance of DINOv2 w/ NC on ImageNet-100 with the N2N denoiser trained for varying epochs

DINOv2 Baseline   1 epoch   5 epochs   100 epochs (from Table 1(a))
55.4              66.1      67.4       68.1

We observe that even training the N2N denoiser for just 1 epoch leads to a substantial improvement of DINOv2 w/ NC over the DINOv2 baseline (a 10.7-point increase), while being only 2 points below that of a well-trained denoiser (100 epochs). Although the 1-epoch denoiser, like any very weak denoiser, still produces many undesirable artifacts, the curriculum still yields substantial improvements in downstream performance. This highlights the robustness of our curriculum, which has considerable tolerance to the denoiser's quality. A denoiser would have to be unreasonably bad (e.g., training diverged or collapsed) to provide no return in performance. We will include this result as an additional ablation study in Appendix B of our revised paper. Thank you again for the insightful suggestion; we believe this investigation further strengthens our contribution.
 

Question 2: Any method to automate the selection of the restart epoch?

Thank you for raising this important point. We agree that the restart epoch is a key hyperparameter. In our experiments, we fix the total training budget (e.g., 200 epochs), which makes automated selection of the restart point non-trivial: restarting too early leaves insufficient time to benefit from clean supervision, while restarting too late limits the model’s capacity to adapt to noisy data. The noise level, model size and learning rate schedule can also impact the convergence speed in both stages. In practice, we heuristically select the restart point by beginning from the midpoint and adjusting based on performance trends observed in preliminary runs.

That said, if the training budget were unconstrained, we believe a dynamic transition based on model convergence, such as monitoring linear probe accuracy on the denoised data, would be a highly promising heuristic. We view this as a valuable direction for future work and will mention it as a potential improvement in Section 5.  
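As one hypothetical instantiation of such a convergence-triggered transition (not something evaluated in the paper), the switch could fire once a probe metric on the denoised data stops improving:

```python
def should_switch(probe_acc_history, patience=3, min_delta=0.1):
    """Switch from denoised to noisy data once linear-probe accuracy on the denoised
    set has improved by less than `min_delta` points over the last `patience` epochs."""
    if len(probe_acc_history) <= patience:
        return False
    recent_gain = probe_acc_history[-1] - probe_acc_history[-1 - patience]
    return recent_gain < min_delta

# Example: accuracy has plateaued at 53.0, so the curriculum would switch to noisy data.
print(should_switch([40.0, 48.0, 52.0, 53.0, 53.0, 53.0, 53.0]))  # True
```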
 

Question 3: Any plans to evaluate on real-world datasets? And how do you anticipate its performance will translate from synthetic to real noise?

We thank the reviewer for this important question. Indeed, as we acknowledged in Section 5, our current benchmark is limited to synthetic noise due to the scarcity of large-scale, labeled noisy datasets that support standard downstream evaluation. However, we fully agree that demonstrating robustness on real-world noise is critical for practical impact.

Yes, we do have plans to evaluate on real-world datasets. As a promising next step, we are actively exploring the curation of a large benchmark that contains real-world noisy images with accurate labels. Our goal is to include a diverse collection of noise types across domains such as biomedical imaging, astronomy, and remote sensing, where noise is often complex and structured. We believe this would provide a more rigorous testbed for evaluating our methods.

Furthermore, we are confident that the performance gains observed on synthetic noise will translate to real-world settings. First, our method has been validated on a wide spectrum of synthetic noise types (Gaussian, shot, speckle), noise levels (SNR from 9.2 dB to 0.3 dB), and multiple architectures (e.g., SimCLR, MoCo v3, DINOv2, iBOT). The consistent improvements across these settings indicate that the curriculum-based robustness is not tightly coupled to any particular noise assumption. Second, as shown in our response to Reviewer 5arj (Comment 2), our method continues to perform well under mixed noise (i.e., Poisson-Gaussian), which is a common real image degradation. This supports the hypothesis that the robustness induced by our denoised-to-noisy training strategy can transfer to more complex noise distributions.
 

Question 4: Could the two-stage curriculum acts as a powerful regularizer, forcing the model to learn general features?

Thank you for the insightful observation. We fully agree that, in addition to mitigating information loss from explicit denoising, the two-stage denoised-to-noisy curriculum can indeed act as a powerful regularizer. By exposing the model sequentially to denoised (low-entropy) and noisy (high-entropy) data distributions, the training process encourages the emergence of more general and robust features, as the model must reconcile representations across varying input conditions. This interpretation aligns well with our motivation for the curriculum design and supports our findings in Section 4.4, where DINOv2 w/ NC and NCT generalize better than N2N + DINOv2 on clean validation sets. We will incorporate this valuable perspective into Section 4.4 of our revised version to provide a more complete discussion.  
 

Question 5: Why is the denoised teacher at 100 epochs a better regularizer than that at 30 epochs? And How is the embedding alignment preserved?

Thanks for pointing out this confusion; here is a more detailed explanation. On lines 295-296, we explained that "the denoised teacher at this early stage lacks sufficiently strong representations to serve as an effective regularizer." This is because we empirically observed that the linear probing accuracy of the denoised teacher at 30 epochs is only 53.5 when tested on the denoised dataset. This means 53.5 would be an approximate upper bound for DINOv2 w/ NCT if the 30-epoch denoised teacher were used, because DINOv2 w/ NCT consistently converges to, or very slightly outperforms, its anchor denoised teacher's accuracy, as shown in Table 1(a) and line 285. To push for stronger performance and raise the ceiling, we instead employ the denoised teacher at 100 epochs (accuracy = 57.2) to regularize the noisy training. Indeed, our experimental results show a substantial improvement of NCT over both NC and the DINOv2 baseline, and we expect its accuracy to reach 57.2 given enough training epochs.

As for how the embedding alignment is preserved, on line 298 we explained that "the frozen teacher and the DINOv2 backbone are derived from the same training run and pass through the same 30-epoch state." This means that at epoch 30, the frozen teacher and the trainable teacher share identical weights. Recent SSL literature [1][2][3] shows that self-supervised models evolve their representations in a coarse-to-fine manner: broad semantic clusters emerge early, while later training improves fine-grained separability. Consequently, although we continue to train the denoised teacher to the full 100 epochs, its embeddings remain largely aligned with the backbone's 30-epoch state, as the coarse semantic structure (e.g., principal axes in embedding space) is preserved despite continued training. Our empirical results also agree with this: NCT solidly surpasses NC, demonstrating that the 100-epoch denoised teacher can act as an effective regularizer during noisy training. We will add more discussion to Section 4.3 of our revised manuscript.

[1] Reverse Engineering Self-Supervised Learning, Ben-Shaul et al., NeurIPS 2023

[2] On the Stepwise Nature of Self-Supervised Learning, Simon et al., ICML 2023

[3] Understanding Learning Dynamics of Neural Representations via Feature Visualization at Scale, Kuntala et al., NeurIPS Workshop 2023  
 

Comments on Computational Cost

Thank you for your suggestion. We agree that our pipeline introduces a slight increase in upfront computational cost due to training a denoiser and preprocessing the dataset. However, this additional cost is minimal compared to the full pretraining and downstream application stages. The denoising step only requires a single round of inference, and training the denoiser is highly efficient, as demonstrated by our ablation studies in Question 1. Moreover, in industry settings, inference cost increasingly becomes the primary bottleneck. A core contribution of our work is that it does not increase inference cost, which makes it particularly well-aligned with real-world deployment needs. Overall, we believe that the small upfront cost is a worthwhile trade-off for the substantial gains in inference efficiency and robustness. We will include this important discussion in Section 5 of our revised paper. Thank you again for pointing it out.  
 

We sincerely thank you for your feedback. Your suggestions have significantly helped us strengthen our submission, and we will remain available during the remaining discussion period.

Comment

Your responses maintain the high quality of the original submission while addressing the practical concerns raised. The additional experimental evidence and theoretical insights you've provided will strengthen the revised manuscript significantly. I maintain my positive assessment of this work. The combination of practical importance, technical soundness, and thorough experimental validation makes this a valuable contribution to the SSL community.

Comment

We greatly appreciate your careful evaluation and continued support. Your feedback has been instrumental in refining our paper. Thank you once again!

Final Decision

This paper presents a self-supervised method for representation learning that doesn't need a denoiser. This addresses a common issue of poor performance when training with noisy data. The three reviewers of the paper were clearly in favor of accepting it. The authors also provided much additional information and experiments in the rebuttal that the reviewers found helpful in confirming their initial responses. The authors may wish to incorporate these post-discussion results and clarifications in the revision to improve the paper.