Denoising as Adaptation: Noise-Space Domain Adaptation for Image Restoration
Abstract
We propose a novel domain adaptation method in the noise space for image restoration, guided by a diffusion loss that leverages auxiliary conditional inputs.
Reviews and Discussion
- This work proposes to address domain adaptation in the noise space for image restoration
- The authors propose channel-shuffling and residual-swapping contrastive learning strategies to overcome the shortcut learning issue
- Experiments on image deraining, denoising, and deblurring validate the effectiveness of the method
Strengths
- Generalization is a challenging and vital problem in image restoration. The paper works toward generalization using noise-space adaptation.
- The work is well motivated. In particular, Figure 1 shows that the condition of a diffusion model can serve as a proxy to discriminate clean distributions, which I find interesting.
- The paper is well-written, and the method is clearly described.
Weaknesses
- The paper seems to lack direct evidence that the diffusion loss is effective. That is, what is the performance if we replace the diffusion loss with a perceptual loss, a GAN loss, or nothing? I think this validation is important.
- I suggest adding some visual results of the other compared methods to Figures A6-A8.
- Can you provide more details about the diffusion model used? What is the impact of the introduced diffusion loss on training dynamics (e.g., loss and accuracy on the validation set)? Can the diffusion model be extremely simplified (e.g., to a single MLP layer as in [1])? [1] Autoregressive Image Generation without Vector Quantization
Questions
Please see the weaknesses above.
W1. Direct evidence that the diffusion loss is effective
We are grateful for the reviewer's valuable suggestions on our validation experiments. Our settings did in fact follow such a validation protocol to evaluate whether the proposed diffusion loss is effective. For example, in Tables 1-3, 'Vanilla' denotes the restoration model trained only with L1 loss on synthetic datasets; 'PixelDA' and 'CyCADA' (mainstream domain adaptation methods) denote the restoration model trained with L1 loss on synthetic datasets and a GAN loss on real-world datasets; 'Ours' denotes the restoration model trained with L1 loss and the proposed diffusion loss. To make the validation settings more intuitive, we have highlighted the training loss of each comparison method in Table 1 of the updated paper. Additionally, we have added a new experiment that replaces the diffusion loss with a perceptual loss. For your convenience, all quantitative results of these validations are reported below.
| Metric | L1 loss | L1 loss + GAN loss | L1 loss + Perceptual loss | L1 loss + Diffusion loss |
|---|---|---|---|---|
| PSNR | 26.58 | 30.81 | 27.27 | 34.71 |
| SSIM | 0.6132 | 0.8067 | 0.6429 | 0.9202 |
From the experimental results, we have two main observations: (1) L1 loss and perceptual loss can effectively minimize average pixel and feature differences for image restoration. However, restoration models trained on synthetic images with these conventional loss functions still suffer a significant drop in performance when applied to real-world domains. (2) The GAN loss involves a discriminator distinguishing between real and synthetic images, pushing the generator to create more realistic outputs. However, GAN-based methods tend to be computationally demanding and unstable due to the need to train multiple networks and the complexity of the cycle-consistency loss. By contrast, we propose a new noise-space solution that preserves low-level appearance across different domains within a compact and stable framework.
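To make the comparison above concrete, the following is a minimal PyTorch sketch of one joint training step under this setting: a supervised restoration loss on synthetic pairs plus the auxiliary diffusion loss on restored outputs from both domains. All names here (`restorer`, `diffusion`, `q_sample`) and the conditioning interface are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def joint_train_step(restorer, diffusion, opt, x_syn, gt_syn, x_real):
    """One training step: supervised loss on synthetic pairs + diffusion loss.

    `diffusion` is assumed to expose a DDPM-style forward process
    (`q_sample`) and to predict the injected noise given a timestep and
    conditioning images; these names are illustrative.
    """
    out_syn = restorer(x_syn)    # restored synthetic image (GT available)
    out_real = restorer(x_real)  # restored real image (no GT available)

    # Supervised restoration loss on the synthetic domain only.
    loss_res = F.l1_loss(out_syn, gt_syn)

    # Diffusion loss: noise the synthetic GT to a random timestep and ask the
    # diffusion model to predict the noise, conditioned on both restored
    # outputs. The prediction error is small only if the conditions lie in
    # the clean-image distribution, so its gradient adapts the restorer.
    t = torch.randint(1, 1000, (gt_syn.size(0),), device=gt_syn.device)
    noise = torch.randn_like(gt_syn)
    noisy_gt = diffusion.q_sample(gt_syn, t, noise)
    cond = torch.cat([out_syn, out_real], dim=1)
    loss_diff = F.mse_loss(diffusion(noisy_gt, t, cond), noise)

    loss = loss_res + loss_diff
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```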
W2. Adding some visual results of other compared methods
Thanks for your kind suggestion. We have added the visual results of the other compared methods to Figures A6-A8 in the updated paper. Please refer to the paper for more details.
W3. Can you provide training dynamics and complexity analysis of the diffusion model?
Absolutely. To clearly demonstrate the impact of the introduced diffusion loss during training, we have visualized the related training-dynamics metrics in Fig. A10 (left) of the updated paper. It is easy to see that the restoration model trained only with L1 loss on the synthetic dataset tends to overfit quickly and performs poorly on the real-world validation set. By contrast, the diffusion loss effectively guides the restoration model to adapt to the real-world domain in a multi-step denoising manner, consistently improving restoration performance on the real-world validation set.
Moreover, we agree that the reviewer's suggestion of validating diffusion models of different complexities is an interesting exploration. We have tried this and show the results in Fig. A10 (right) of the updated paper. Specifically, we classify the diffusion models into three types based on the channel widths of each encoder layer: Diffusion-T: [32, 32, 64, 64], Diffusion-S: [32, 64, 128, 128], and Diffusion-B: [64, 128, 256, 512] (the one exploited in this paper).
As we can observe, real-world restoration performance improves further as the complexity of the diffusion model increases, i.e., from Diffusion-T to Diffusion-B. We also provide a deeper analysis of the diffusion loss in restoration tasks compared with MAR [1]: MAR models the per-token probability distribution using a small MLP as the diffusion model, trained jointly with the AR model for efficient image generation. In particular, its tokens are small in size and represent high-level semantic features, and the trained diffusion model is also used during the inference sampling stage. By contrast, our diffusion model serves the low-level image restoration problem: it directly adapts the restoration results at the dense, full-resolution pixel level, requiring accurate discrimination of rich image textures. The diffusion model is discarded after training, incurring no extra computational cost during restoration inference. Therefore, in the context of adapting the restoration model to the real-world domain, the diffusion model cannot be extremely simplified. The above discussion has also been added to the updated paper.
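As an illustration of how these variants differ only in capacity, here is a small, self-contained sketch that builds a toy convolutional encoder from each width list and counts its parameters. The real diffusion U-Net has skip connections, timestep embeddings, and a decoder that this sketch omits; only the width lists come from the rebuttal.

```python
import torch.nn as nn

# Per-encoder-layer channel widths of the three diffusion-model variants.
DIFFUSION_VARIANTS = {
    "Diffusion-T": [32, 32, 64, 64],
    "Diffusion-S": [32, 64, 128, 128],
    "Diffusion-B": [64, 128, 256, 512],  # configuration used in the paper
}

def build_encoder(widths, in_ch=3):
    """Toy downsampling encoder whose capacity scales with the width list."""
    layers, ch = [], in_ch
    for w in widths:
        layers += [nn.Conv2d(ch, w, kernel_size=3, stride=2, padding=1), nn.SiLU()]
        ch = w
    return nn.Sequential(*layers)

for name, widths in DIFFUSION_VARIANTS.items():
    n_params = sum(p.numel() for p in build_encoder(widths).parameters())
    print(f"{name}: {n_params / 1e3:.1f}K encoder parameters")
```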
Thank you for the detailed rebuttal. The authors address most of my concerns. I raise my score to 6.
Thank you for raising the rating and for your encouraging comments on our work. We greatly appreciate your recognition of our interesting and well-motivated designs for taming the diffusion model in domain adaptation, as well as the clarity of our proposed method. Your invaluable suggestions will be carefully incorporated into the final version to further enhance the quality of our work.
Best Regards, Authors
- While traditional domain-adaptation methods have focused on adaptation in the feature or pixel space, this paper emphasizes adaptation in the noise space by leveraging the diffusion process.
- The training strategy using the diffusion process includes channel shuffling and residual-swapping contrastive learning to prevent shortcut learning. Additionally, during inference, only the restoration model is used, independently of the diffusion model, making this approach memory-efficient as well.
Strengths
- Domain adaptation in the noise space using the diffusion model process is innovative.
- Although this paper focuses on general restoration tasks such as denoising, deraining, and deblurring, it also potentially explains why the training method using synthetic data struggles to effectively address real data in the denoising problem, and why joint training with both types of data has been ineffective.
- The training strategy to prevent shortcuts is a novel contribution to the restoration task. In fact, shortcut phenomena can occur not only in restoration problems but also in tasks that involve predicting high-frequency components. While it raises the question of whether this strategy can be applied to general CNN networks as well as diffusion models, the paper demonstrates its effectiveness.
Weaknesses
- The performance of the comparison methods used in the experiments is severely degraded. For example, the methods used for comparison in Table 1 are either outdated or inappropriate. Although relatively recent methods such as Ne2Ne and MaskedD were used, since these methods do not target denoising with real data, it is questionable whether the results obtained by training them on real-world data are meaningful.
- Additionally, since the method presented in the paper uses extra data along with synthetic data for each task, it would be more appropriate to compare it with methods such as PNGAN[1*], which use the same training data, as seen in Table 1.
- The experiments in Table 4 demonstrate that the noise sampling range of [1, 1000] is appropriate. Although the model does not require the diffusion process during inference, we believe this range imposes a significant burden during training. For context, DnCNN[2*] can achieve a PSNR of approximately 35 dB on the SIDD test dataset with simple L1-only training when trained directly on real data.
[1*] Cai, Yuanhao, et al. "Learning to generate realistic noisy images via pixel-level noise-aware adversarial training." Advances in Neural Information Processing Systems 34 (2021): 3259-3270.
[2*] Zhang, Kai, et al. "Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising." IEEE transactions on image processing 26.7 (2017): 3142-3155.
Questions
- This paper utilizes U-Net and Uformer for simplicity. In my opinion, the key contribution lies in applying the paper's novel learning strategy to general restoration networks. Therefore, I am curious whether the proposed strategy also performs well on SOTA methods such as Restormer or NAFNet[1*], and whether these SOTA methods, when trained with both real-world and synthetic data, can surpass the performance on real-world data.
- Based on the SIDD test set, NAFlow[2*], NeCA[3*], and sRGB-Flow[4*] show much lower performance compared to the restoration achieved with DnCNN using synthetic datasets. According to the claim, the proposed method should be able to bridge the gap between synthetic and real data, thus delivering better performance. It would also be helpful if the paper provided performance results using DnCNN[5*] or other commonly used denoising networks to further validate the effectiveness of the approach.
[1*] Chen, Liangyu, et al. "Simple baselines for image restoration." European conference on computer vision. Cham: Springer Nature Switzerland, 2022.
[2*] Kim, Dongjin et al. "sRGB Real Noise Modeling via Noise-Aware Sampling with Normalizing Flows." The Twelfth International Conference on Learning Representations. 2024.
[3*] Fu, Zixuan, Lanqing Guo, and Bihan Wen. "sRGB Real Noise Synthesizing with Neighboring Correlation-Aware Noise Model." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
[4*] Kousha, Shayan, et al. "Modeling srgb camera noise with normalizing flows." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[5*] Zhang, Kai, et al. "Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising." IEEE transactions on image processing 26.7 (2017): 3142-3155.
I have a follow-up question, as the revised paper raises some questions. What does $\mathcal{L}_{res}$ in Table 1 refer to (the noise2noise approach? the Ne2Ne approach? or something else)? Also, is it the same as Eq. 4 in your main manuscript?
Thanks for your questions. $\mathcal{L}_{res}$ denotes the image restoration loss. It is the same as the second term of Eq. 4 in the main manuscript. To avoid misunderstanding, we have unified and updated the symbols with a consistent representation in the paper.
Then, what is the exact formula for the image restoration loss? What I am asking is whether the target image is the ground truth. The caption in the table says “w/o GT,” which makes this confusing.
If ground truth is not used, the image restoration loss might vary depending on each method (for example, Ne2Ne uses a self-supervised loss in a noise2noise manner). However, this does not seem to be clearly explained.
What is the exact formula for the L1 loss in your approach in the main text?
The exact formula for the image restoration loss is the classical Charbonnier loss, presented in Line 188 of the paper. Moreover, "w/o GT" means that the GT of the real-world dataset is unavailable during training. Instead, the model is trained using only the paired synthetic dataset (comprising both degraded and GT images) and the degraded real-world images, following the standard configuration of domain adaptation. These settings are also marked in the "Train Data" row of Table 1. For the self-supervised methods, the degradation observations of the real-world datasets are used.
For your convenience, we present the exact formula for the restoration loss as follows:

$$\mathcal{L}_{res} = \sqrt{\lVert \hat{I} - I_{gt} \rVert^2 + \epsilon^2},$$

where $\hat{I}$ is the restored image, $I_{gt}$ is the ground-truth image, and $\epsilon$ is a constant used in all the experiments.
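For readers who want to plug this in directly, here is a one-function PyTorch sketch of the Charbonnier loss. The value of `eps` below is a common default and an assumption on our part, since the exact constant is not restated here.

```python
import torch

def charbonnier_loss(pred: torch.Tensor, target: torch.Tensor,
                     eps: float = 1e-3) -> torch.Tensor:
    """Charbonnier loss: a smooth, differentiable variant of the L1 loss.

    `eps` is a small constant; 1e-3 is a common default, assumed here.
    """
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()
```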
If the reviewer has any additional questions or concerns, we would be more than happy to address them. Please feel free to leave a comment here. Thank you.
In the previous rebuttal version, does the "L1 loss" refer to this Charbonnier loss? Lastly, let me clarify my main concern, as mentioned earlier: "the image restoration loss might vary depending on each method (for example, Ne2Ne uses a self-supervised loss in a noise2noise manner)."
When experimenting with Ne2Ne, did you use $\mathcal{L}_{res}$? If the training data is real, didn't you use the loss proposed in the original Ne2Ne method? In my opinion, the performance gain in methods like Ne2Ne or ZeroShot-N2N is less about the network architecture itself and more about the uniqueness of the loss function. It seems that the loss proposed in the Ne2Ne paper should be used.
Additionally, based on my experience, with real noise and the SIDD test benchmark, Ne2Ne or ZeroShot-Noise2Noise can typically achieve PSNRs of around 33-34 dB.
Thanks for your questions. When experimenting with Ne2Ne, we indeed used the original loss proposed in its paper. It takes a form similar to $\mathcal{L}_{res}$, i.e., a pixel-wise difference between the output and the target. To make the presentation more rigorous, we have relabeled the training loss of Ne2Ne as "$\mathcal{L}_{Ne2Ne}$", meaning the original loss proposed by that method.
Additionally, in response to your concerns regarding the performance of Ne2Ne, we have thoroughly revisited the experiments and made every effort to retrain the Ne2Ne method on the SIDD dataset. However, the results remain around 26 dB. Regarding the PSNR values of 33-34 dB mentioned in your comments, could you kindly clarify if these were evaluated on the SIDD raw data?
Thank you for your thoughtful and valuable response.
I also revisited the additional experiments regarding the Noise2Noise manner and now agree with the author’s explanation. I apologize for providing incorrect feedback earlier.
Returning to the main content: while I now understand the validity of the authors' experimental approach, I still believe the descriptions are not sufficiently rigorous. Although the approach clearly has novelty, the structure of the paper made it challenging to understand.
Specifically, for a future version, I would suggest a clearer distinction and description of the training and inference stages, supported by an appropriate figure. Additionally, as I mentioned earlier, the loss notation is another aspect that requires attention.
Nevertheless, the novelty is undeniable, and since my concerns have been addressed, I will adjust my score to 5 for now. Further adjustments may be made depending on other reviewers’ questions and comments.
Thank you once again for your response.
We appreciate your comments and are grateful for the raised score. In the final version, we will definitely incorporate all your suggestions into the paper, especially the structure reorganization and the training/testing clarification. The revised paper will be updated within the next 1-2 days.
As we continue the discussion with the other reviewers, please kindly follow the updates; we will summarize their comments and our responses here for your reference.
Thanks again for your insightful comments.
Dear Reviewer vDAP,
Thank you again for your invaluable suggestions.
As promised, we have revised the paper based on your previous feedback. Specifically, we have included detailed workflows for the training and inference stages in Algorithm 1 and Algorithm 2 in the Appendix (page 27). Additionally, we have clarified the loss notations in the caption of Table 1.
In the final version, we will ensure that all your suggestions are fully incorporated into the paper. We would greatly appreciate it if you could consider raising your score, should we have adequately addressed your concerns. Thank you once again for your time and effort in reviewing our paper.
Best Regards, Authors
I have read the revised version of your feedback and deeply appreciate your consideration in reflecting on my earlier comments.
While it would have been ideal to update the figures in the main paper, I understand the constraints related to space limitations and other factors.
One point that stood out to me during the review process was the shared concern about the PolyU dataset. However, I strongly believe that the issue lies more in the noise characteristics of the PolyU dataset rather than the proposed model itself. It’s possible that similar gains could have been achieved using the SIDD Plus dataset, which would have provided more justification, given the difference in hardware configurations compared to SIDD.
In conclusion, I have updated my score from an initial 3 to 6. While I initially assigned 3 due to concerns with the manuscript’s structure and experiments, after the rebuttal discussion, I found the concept of this paper compelling. If the paper had been better written, I might have considered assigning an 8, which is unfortunate.
Thank you again for your efforts and for engaging in this discussion.
Dear Reviewer vDAP,
We sincerely appreciate you raising your score to 6 and we are truly encouraged by your insightful comments.
We are also grateful that most of the other reviewers gave positive scores. Reviewer kUVz stated that most of the concerns have been adequately addressed, but maintained a score of 5 due to the "modest gain" on the fully unseen PolyU dataset.
In our response, we share a similar perspective to the one reflected in your comments. First, we emphasized that achieving perfect generalization to a fully unseen dataset remains an exceptionally challenging and unsolved problem in the research community. Nonetheless, our method demonstrated an observable performance improvement (+1~2dB) compared to existing approaches.
Moreover, the PolyU dataset’s low noise density and relatively simple noise distribution inherently limit the potential for significant improvement, especially in fully unseen scenarios. To further evaluate our method’s generalization capabilities, we also considered the DND dataset, which features a more complex noise distribution and a broader variety of scenes. This dataset provides a more robust benchmark for assessing generalization. Our experiments on the DND dataset revealed that our method outperformed the comparison methods by a significant margin (+3~4dB), highlighting the promising generalization capability of our restoration model trained using the proposed domain adaptation strategy.
We sincerely believe that the quality of our paper has significantly improved thanks to your professional suggestions. While we regret that the current revisions did not lead to an increase in your score to 8, we deeply appreciate your feedback and will ensure that all your suggestions are thoughtfully incorporated into the final version.
Thank you once again for taking the time and effort to review our paper.
Best Regards, Authors.
W1. Clarification of the selected comparison methods
Previous domain adaptation works rarely focused on image restoration. To the best of our knowledge, our work represents the first attempt at addressing domain adaptation in the noise space for various image restoration tasks. Therefore, we mainly considered classical domain adaptation methods and self-supervised methods as our comparison baselines. In particular, we selected MaskedD for comparison since that work proposes masked training to enhance the generalization performance of denoising networks. The authors of MaskedD also claimed that "the proposed approach exhibits better generalization ability than other deep learning models and is directly applicable to real-world scenarios". It basically shares the same goal as our work. For more rigorous clarification, we have highlighted the comparison settings and training data of different works in "A1.2 Comparison Setting" of the Appendix.
Moreover, as mentioned in your subsequent comments, our work contributes a novel and general domain adaptation strategy for image restoration, which cannot be replaced by current self-supervised methods. The novelty of taming a diffusion model to achieve this goal has also been acknowledged by other reviewers. Thus, we further validated our method on commonly used and SOTA restoration models, such as DnCNN, Restormer, and SwinIR, on the SIDD test set. The quantitative evaluations of these comparison methods are reported below. In particular, 'Vanilla' denotes training the model on the paired synthetic datasets, and 'Ours' denotes training with the proposed domain adaptation strategy.
| Metric | DnCNN (Vanilla) | DnCNN (Ours) | Restormer (Vanilla) | Restormer (Ours) | SwinIR (Vanilla) | SwinIR (Ours) |
|---|---|---|---|---|---|---|
| PSNR | 25.98 | 30.46 | 24.10 | 33.98 | 30.86 | 34.85 |
| SSIM | 0.5911 | 0.7637 | 0.5194 | 0.9183 | 0.7544 | 0.9153 |
As we can observe, all restoration models trained on synthetic datasets failed to generalize well to the real-world dataset. By incorporating the proposed domain adaptation training strategy, the real-world performance of these models gains significant improvements, demonstrating the favorable generalization and scalability of our method. These new comparison experiments have been added to the updated paper.
W2. Considering more comparison methods such as PNGAN
Thanks for suggesting this comparison method. We agree that PNGAN is an excellent self-supervised work tailored to the image denoising task, and we have included it in our paper. However, we must honestly acknowledge that our experimental results do not surpass PNGAN, which achieves a PSNR of 37.24 dB when training a U-Net on its generated noise. This outcome may be anticipated, as our proposed method does not incorporate any prior knowledge of specific restoration tasks, aiming instead to provide a general solution for a wide range of restoration tasks. By contrast, self-supervised methods do not perform well across different tasks and show limited generalization. Our experiments also demonstrated that, with a more advanced restoration network, the proposed method can outperform solid self-supervised learning methods such as C2N and AP-BSN (Table 5).
We would like to clarify the difference between domain adaptation and self-supervised learning: domain adaptation transfers knowledge from one domain to another with a different data distribution, improving performance in new, unseen environments. Self-supervised learning, on the other hand, learns from unlabeled data by generating pseudo-labels or exploring the target distribution from the data itself. Both approaches reduce the reliance on large amounts of labeled data but address different challenges: domain adaptation focuses on bridging domain gaps, while self-supervised learning leverages the data's inherent structure.
W3. The noise sampling range imposes a significant burden during training; analysis compared with DnCNN
We would like to address your concern from the following two aspects.
- Compared with previous domain adaptation and self-supervised methods, our method does not introduce a significant burden during training. In Table 4, the experiments verified that a larger noise sampling range effectively solves the overfitting problem. For a clear comparison, we show the total GPU hours (on an NVIDIA A100 40G GPU) of training different denoising methods below. As we can observe, our method is affordable, and the whole training can be finished in one day on 4 GPUs. Additionally, as described in Section 3.2 and Appendix A4.1, our method can be extended to the unpaired-condition case by relaxing the diffusion model's input to images from other clean datasets (see the sketch after this list). The shortcut issue is then potentially eliminated, since trivial solutions such as matching pixel similarity between input and condition no longer exist. This extension keeps the channel-shuffling layer but dispenses with the residual-swapping contrastive learning. As a result, the training cost can be further reduced, as shown in "Ours_ex".
| Training Cost | Vanilla | DANN | DSN | PixelDA | CyCADA | Ne2Ne | MaskedD | Ours | Ours_ex |
|---|---|---|---|---|---|---|---|---|---|
| GPU hours | 35 | 60 | 130 | 40 | 60 | 80 | 160 | 95 | 66 |
- This work targets a challenging domain adaptation problem in image restoration, rather than overfitting a restoration model to a paired degraded-clean dataset. It explores how to use labeled synthetic images to guide the restoration of unlabeled real-world images, demonstrating promising potential for generalization in practical applications. As noted above, most classical restoration models trained on synthetic datasets fail to generalize well to real-world datasets. By incorporating the proposed diffusion loss in the noise space, our method bridges this domain gap effectively and is friendly to different types of restoration models.
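As referenced in the first bullet, here is a minimal sketch of the unpaired-condition extension ("Ours_ex"). The function name, tensor layout, and the reading that the diffusion model's denoising target becomes the unpaired clean image are all assumptions for illustration, based on our summary of Appendix A4.1.

```python
import torch

def unpaired_diffusion_batch(out_syn, out_real, clean_unpaired):
    """Sketch of the 'Ours_ex' extension: the diffusion model's denoising
    target is a clean image from an unrelated dataset, so a pixel-matching
    shortcut between condition and target cannot exist. The channel-shuffling
    layer is kept; residual-swapping contrastive learning is dropped.
    """
    cond = torch.cat([out_syn, out_real], dim=1)  # conditions from both domains
    perm = torch.randperm(cond.size(1))           # channel-shuffling layer
    return clean_unpaired, cond[:, perm]          # (denoising target, condition)
```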
Q1. Can the proposed method help SOTA models with generalization performance?
Exactly. As mentioned by the reviewer, different from previous self-supervised restoration methods tailored for specific tasks, our contribution is to build a general domain adaptation strategy for various restoration tasks. It requires no prior knowledge of noise distribution or degradation models and is compatible with different restoration networks (as validated in Fig. 7 of the paper, including several variants of U-Net and Uformer). Following your suggestion, we further validated the proposed method with more SOTA restoration models such as Restormer and SwinIR. We also replicated the quantitative evaluation results on the SIDD test set here for your convenience.
| Metric | DnCNN (Vanilla) | DnCNN (Ours) | Restormer (Vanilla) | Restormer (Ours) | SwinIR (Vanilla) | SwinIR (Ours) |
|---|---|---|---|---|---|---|
| PSNR | 25.98 | 30.46 | 24.10 | 33.98 | 30.86 | 34.85 |
| SSIM | 0.5911 | 0.7637 | 0.5194 | 0.9183 | 0.7544 | 0.9153 |
Q2. A further validate the effectiveness of the approach on DnCNN.
Thanks for your suggestion. We have included the above recommended works in our paper. Moreover, we also exploit the commonly used denoising network DnCNN to further validate the effectiveness of our approach. The results demonstrate our method is able to significantly improve the generalization ability of various restoration models. Please refer to the quantitative evaluations in Q1.
This paper introduces an approach to improve the generalization of image restoration models by addressing the domain gap between synthetic and real-world data using noise-space adaptation with diffusion models. The method utilizes a diffusion loss, which is optimized during joint training with the restoration network. Key strategies, such as channel-shuffling and residual-swapping contrastive learning, are employed to prevent shortcut learning, ensuring the model effectively restores images rather than relying on trivial cues. The diffusion model is then discarded, leaving a robust restoration network. Experimental results show the method outperforms existing domain adaptation techniques.
Strengths
- The idea of incorporating diffusion loss in domain adaptation for image restoration is novel.
- The proposed method can be applied to a variety of restoration tasks, including denoising, deblurring, and deraining, with demonstrated performance gains in the experiments.
- The use of channel-shuffling and residual-swapping to prevent shortcut learning is well-motivated, and ablation studies show that this technique is crucial for performance improvement.
- The paper is well-written and easy to understand.
Weaknesses
- Although ground-truth images from the real dataset are not used for training, degraded images from the same dataset are presented during training, which blurs the distinction between the training and inference stages. To better validate the generalization ability of the proposed framework, driven solely by the inference stage, evaluation on fully unseen datasets (e.g., Nam or PolyU datasets for denoising) should be included.
- The authors do not provide an analysis of training time, which seems to be a key limitation of the proposed approach. There should be a detailed comparison of training complexity with other models, not just comparisons of inference performance.
- Ablation studies on the sensitivity to hyperparameters (e.g., beta, gamma) are lacking, particularly as the authors state that these values were chosen empirically.
- The performance gains compared to the Vanilla baseline (trained only with synthetic datasets), while present, are relatively small for the deraining and deblurring tasks. Especially for the deblurring task, the gain is less than 0.2 dB.
Questions
Please refer to the Weaknesses for the main questions.
- How does the model perform on synthetic degradations? Are there significant performance trade-offs due to the domain-adaptation process?
- Would it be possible to provide quantitative metrics (e.g., PSNR) alongside the visual results (e.g., in Figures 4 and 5) to better illustrate the restoration performance?
- The paper seems to mainly demonstrate the performance on denoising task, which shows the highest performance gain. Why does the proposed idea fit substantially better for denoising?
W1. Evaluation on fully unseen datasets
Thanks for helping us enrich the validation experiments. In the previous manuscript, we showed visual results on fully unseen datasets such as DND and 'Real-Internet' in Fig. 8 and Figs. A6-A8. To further validate the generalization ability of the proposed framework, we have followed your suggestion and evaluated performance on the PolyU dataset. The updated results are shown in the following table. All evaluated methods are trained on the labeled synthetic dataset and the real-world SIDD dataset (degradation observations only).
| Metric | DANN | CyCADA | Ne2Ne | MaskedD | Ours |
|---|---|---|---|---|---|
| PSNR | 33.64 | 33.86 | 32.69 | 33.91 | 34.80 |
| SSIM | 0.8001 | 0.8092 | 0.7609 | 0.8187 | 0.8994 |
W2. An analysis of training time
As suggested, we provide a detailed comparison of the training cost of different methods, reported in the following table. For a clear comparison, we show the total GPU hours (on an NVIDIA A100 40G GPU) of training different denoising methods:
| Training Cost | Vanilla | DANN | DSN | PixelDA | CyCADA | Ne2Ne | MaskedD | Ours | Ours_ex |
|---|---|---|---|---|---|---|---|---|---|
| GPU hours | 35 | 60 | 130 | 40 | 60 | 80 | 160 | 95 | 66 |
We also provide a detailed analysis here: compared with previous domain adaptation and self-supervised methods, our method does not introduce a significant burden during training. As we can observe, our method is affordable, and the whole training can be finished in one day on 4 GPUs. Additionally, as described in Section 3.2 and Appendix A4.1, our method can be extended to the unpaired-condition case by relaxing the diffusion model's input to images from other clean datasets. The shortcut issue is then potentially eliminated, since trivial solutions such as matching pixel similarity between input and condition no longer exist. This extension keeps the channel-shuffling layer but dispenses with the residual-swapping contrastive learning. As a result, the training cost can be further reduced, as shown in "Ours_ex".
W3. Ablation studies on the sensitivity to hyperparameters
We agree that the suggested experiments are necessary, so we provide ablation studies on the sensitivity to hyperparameters below. In particular, we report the PSNR (dB) on the SIDD denoising test set. The results show that moderate values of these two hyperparameters achieve the best performance, which means carefully balancing the roles of the restoration model and the diffusion model is crucial during their joint training.
| beta (rows) \ gamma (cols) | 1 | 5 | 10 |
|---|---|---|---|
| 0.1 | 34.14 | 34.23 | 34.07 |
| 0.2 | 34.20 | 34.39 | 34.11 |
| 0.5 | 34.09 | 33.94 | 33.75 |
W4. Analysis of the performance differences across different tasks
For image deraining (Table 2), the improvement achieved by our method over the vanilla model is apparent: a +1.35 dB gain in PSNR. For the image deblurring task (Table 3), we acknowledge that the +0.19 dB improvement is not as noticeable as in other tasks; however, the performance still surpasses the classical domain adaptation methods and state-of-the-art self-supervised learning methods.
We also provided a discussion analyzing the different improvements across restoration tasks as one of our limitations in Section 4.4. The related content is reproduced here for your convenience: the natural mission of the diffusion model is to predict the noise mixed into the input, which is usually sampled from a high-frequency distribution. Diffusion models excel at capturing and modeling these small-scale variations due to their ability to learn fine-grained details through the denoising process. Thus, higher improvements are observed in image denoising and deraining, which typically involve high-frequency noise. By contrast, the low-frequency degradation in blurred images, which consists of smooth, gradual changes in intensity, is less salient to diffusion models. This type of degradation affects larger regions of the image and requires the model to correct broad, sweeping distortions rather than fine details. As a result, diffusion models may struggle to fully restore images with low-frequency degradations compared to those with high-frequency noise. We leave this as future work.
Q1. How the model performs on synthetic degradations
We provide the evaluation results below. Specifically, we report the PSNR differences between our method and the vanilla model trained only on synthetic data. As we can see, our method performs well on synthetic datasets without significant performance degradation while showing promising improvements on the real-world dataset. Further analysis: our domain adaptation strategy simultaneously trains the image restoration network with the L1 loss (minimizing the restoration difference on synthetic pairs) and the diffusion loss (minimizing the domain gap between synthetic and real data). Therefore, our method can effectively perform real-world adaptation while avoiding a performance trade-off between the two domains.
| Metric | Urban100 σ=15 | Urban100 σ=25 | Urban100 σ=50 | CBSD68 σ=15 | CBSD68 σ=25 | CBSD68 σ=50 | SIDD (Real) |
|---|---|---|---|---|---|---|---|
| PSNR gain (dB) | -0.36 | -0.32 | -0.26 | -0.20 | -0.12 | -0.13 | +8.13 |
Q2. Quantitative metrics alongside the visual results
Thanks for this kind suggestion. We have added the quantitative metrics alongside the visual results in Fig. 4 and Fig. 5. Please refer to the updated paper for more details.
Q3. Why does the proposed idea fit substantially better for denoising?
As discussed in W4, we have provided an analysis of the different performances across tasks. We would like to further summarize it from two aspects: (1) First, the diffusion model is essentially a denoiser. Its natural mission is to predict the noise mixed into the input, which is usually sampled from a high-frequency distribution. Diffusion models excel at capturing and modeling these small-scale variations due to their ability to learn fine-grained details through the denoising process. Thus, the involvement of other types of high-frequency noise in the conditions would mislead the diffusion model and yield a strong loss. (2) Second, the L2 loss used in training diffusion models is often more sensitive to high-frequency components. It encourages the model to focus on accurately reconstructing fine details, and thus performs well in scenarios with high-frequency noise.
Thank you for your detailed and thoughtful response. While most of my initial concerns have been adequately addressed, the issue regarding the vague distinction between the training and inference stages remains only partially resolved.
In particular, the observed performance discrepancy on unseen datasets raises concerns. Unlike SIDD, which shows a significant improvement (+4-6 dB), the framework achieves only a modest gain on the fully unseen PolyU dataset (+1-2 dB). This observation strongly suggests that the adaptation process relies heavily on the noise distributions encountered during training, thereby limiting the practical applicability and generalizability of the proposed framework.
As a result, I have decided to retain my original score, as the effectiveness in broader, unseen scenarios appears constrained. I hope my feedback provides valuable guidance for improving this aspect in future work.
Best regards, Reviewer KUVz
Thank you for your valuable feedback. We would like to highlight that generalizing to a fully unseen dataset presents a significantly greater challenge compared to the classical domain adaptation problem. The PolyU dataset differs substantially from the SIDD dataset in terms of noise density and scene diversity. Consequently, we kindly disagree with the characterization of a +1-2 dB performance improvement on the fully unseen dataset as a modest gain. In fact, this improvement surpasses the performance of previous methods.
If possible, could you kindly provide references to methods that achieve a significant gain on such a fully unseen dataset? We would like to compare our approach with them in the following experiments.
Best regards, Authors.
I agree that achieving a +1–2 dB performance improvement is indeed challenging. However, my primary concern lies in the relatively low performance gain observed on fully unseen datasets, particularly when compared to the results on SIDD, where the framework had access to noisy images during the training stage. This discrepancy suggests that the framework's domain adaptation capability is not adequately demonstrated by the experiments presented in the main paper, as they primarily focus on highlighting performance gains within the SIDD dataset. Further investigation into the framework's generalization ability across diverse noise distributions would strengthen the paper's claims.
Best regards, Reviewer KUVz
Dear Reviewer kUVz,
Thank you again for your valuable suggestion regarding the generalization performance on fully unseen datasets. As mentioned in our previous responses and experiments, while our method outperforms comparison methods on the PolyU dataset, the improvement space is inherently limited due to its relatively low noise density.
To further evaluate our generalization performance, we have considered other fully unseen datasets, such as the DND dataset. The DND dataset presents a more complex noise distribution and encompasses a wider variety of scenes, making it a suitable benchmark for assessing generalization. The quantitative results from this evaluation are listed as follows.
| Metric | DANN | CyCADA | Ne2Ne | MaskedD | Ours |
|---|---|---|---|---|---|
| PSNR | 27.74 | 26.33 | 29.37 | 33.70 | 37.10 |
| SSIM | 0.6913 | 0.6447 | 0.6989 | 0.8605 | 0.9298 |
From the experimental results, we observe that our method significantly outperforms the comparison methods by a large margin. This highlights the promising generalization capability of the restoration model trained with the proposed domain adaptation strategy.
If you have any additional questions or concerns, we would be delighted to address them. Please feel free to leave a comment at any time. Thank you once again for your thoughtful review and constructive feedback.
Best Regards, Authors.
Dear Reviewer kUVz,
Thank you for your valuable feedback and insightful comments. We are glad that most of your concerns have been adequately addressed. As the discussion period ends soon, we kindly ask if our responses have effectively addressed your questions on the generalization of this work.
To assist your review process, we have compiled a summary of our responses along with related comments from other reviewers: in the previous rebuttal, we explained that the PolyU dataset's low noise density and relatively simple noise distribution inherently limit the potential for significant improvement, especially in fully unseen scenarios. This conclusion was also supported by Reviewer vDAP, who strongly believed the issue lies more in the noise characteristics of the PolyU dataset than in the proposed model itself.
Furthermore, we have added a new generalization experiment on the DND dataset. This dataset provides a more robust benchmark for assessing generalization due to its complex noise distributions and a broader variety of scenes. Experiments on the DND dataset revealed that our method outperformed the comparison methods by a significant margin (+3~4dB), highlighting the promising generalization capability of the restoration model trained using the proposed domain adaptation strategy.
Should you have any further inquiries or require additional clarification, we would be more than happy to assist you. We greatly appreciate your insightful review and valuable feedback.
Best Regards, Authors.
Dear Reviewer kUVz,
We sincerely thank you once again for your insightful feedback. Perfectly generalizing a network's restoration capacity to fully unseen domains is indeed a meaningful and practical direction. However, given the significant domain gap, it remains an unsolved and open problem in the low-level vision community. To the best of our knowledge, no existing methods have achieved this ideal target.
Our work represents the first attempt to address domain adaptation specifically in the noise space for image restoration. We are also grateful that this contribution and its novelty have been recognized by you and other reviewers. Moreover, the performance comparison on the suggested fully unseen dataset highlights our noticeable performance gain over other methods.
Nevertheless, we honestly acknowledge that the performance improvement on the SIDD dataset is relatively higher than that on the PolyU dataset. This discrepancy is perhaps expected, as the PolyU dataset has lower noise density and a simpler distribution, which inherently limits the potential for significant improvement compared to the SIDD dataset, particularly in fully unseen settings.
In the final version of our paper, we will ensure that your invaluable suggestions are fully incorporated. If you have any additional questions or concerns, we would be more than happy to address them. Please feel free to leave a comment at any time. Thank you for your thoughtful review and constructive feedback!
Best regards, Authors.
This paper presents an image restoration approach that utilizes diffusion models to address domain adaptation issues. The method adjusts restored results in the pixel-wise noise space, resulting in significant improvements in low-level visual appearance while operating within a compact and stable training framework. To avoid shortcut learning, the method employs channel index shuffling and a residual-swapping contrastive learning strategy. Experimental results demonstrate this approach outperforms existing feature-space and pixel-space domain adaptation methods across various tasks, such as image denoising, deraining, and deblurring.
Strengths
- The method enhances image restoration performance by effectively bridging the gap between synthetic and real-world data.
- The approach can be easily integrated with various restoration networks, demonstrating improved performance even with more complex architectures.
Weaknesses
- The writing in this paper is not very clear, and I found the third section particularly confusing regarding how the proposed method leverages the mapping learned from synthetic datasets to apply to real datasets. There must be some form of "information leakage" involved; this is a proactive "leakage" that helps bridge the gap between different domains. However, the authors do not provide a clear explanation of this process, and it appears that channel shuffling plays a key role in this function. A more explicit clarification from the authors would be beneficial.
- The paper lacks runtime numbers for the proposed method.
Questions
Could the authors give concrete examples of situations where their proposed method fails to deliver the expected results?
I am impressed by the significant improvements reported by the authors. I wonder how the results would compare if the model were trained directly on a paired real dataset and then tested on corresponding real data.
I expect a direct and clear explanation from the authors regarding how the knowledge is "leaked" from the synthetic domain to the real domain in the proposed method.
W1. An explanation of the "information leakage" involved in the training stage
We sincerely thank the reviewer for pointing out this interesting and insightful perspective of proactive "leakage". We would like to answer and discuss it from the following two aspects:
- Our motivation for this work derives from an interesting observation: the prediction error of a conditional diffusion model relies on the quality of its conditions (as shown in Fig. 1 (a)). Therefore, guided by the back-propagated diffusion loss, the restoration network is optimized to provide "good" conditions that minimize the diffusion model's noise prediction error, aiming for a clean image distribution. During this joint training, the synthetic GT serves as the denoising target of the diffusion model, which potentially offers realistic textures to help adapt the degraded real-world images into the clean distribution. In other words, the clean knowledge/information "leaked" by the diffusion model's input (in a multi-step denoising manner) plays an important role in bridging the gap between different domains.
- Generally, the ground-truth images and the images restored by a restoration model, whether from synthetic or real-world domains, should reside within a common distribution of high-quality, clean images. However, the appearance of the synthetic GT and the restored real-world data is unrelated, leading the diffusion model to exploit a shortcut, overfitting its denoising capability by relying solely on the paired synthetic data. To this end, we design crucial strategies (the channel-shuffling layer and residual-swapping contrastive learning, sketched below) to implicitly blur the boundary between conditioned synthetic and real data and prevent the model from relying on easily distinguishable features. As a result, the proactive "leakage" from the clean distribution to degraded images works effectively throughout the training process, consistently improving restoration performance on real-world images.
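To make the two anti-shortcut strategies concrete, here is a minimal PyTorch sketch of one way they could be realized. The exact tensor layout, channel grouping, and the way negatives enter the contrastive objective are assumptions on our part, not the paper's verbatim implementation.

```python
import torch

def shuffle_condition_channels(cond_syn, cond_real):
    """Channel-shuffling layer (sketch): randomly permute the concatenated
    condition channels so the diffusion model cannot tell which channels
    come from the synthetic vs. real restored image by position alone."""
    cond = torch.cat([cond_syn, cond_real], dim=1)  # e.g., (B, 6, H, W)
    perm = torch.randperm(cond.size(1), device=cond.device)
    return cond[:, perm]

def residual_swapped_negatives(x_syn, out_syn, x_real, out_real):
    """Residual-swapping (sketch): build 'bad' conditions by exchanging the
    predicted residuals between domains. Each negative keeps one domain's
    input but carries the other domain's residual, so it no longer looks
    like a faithful restoration; a contrastive term can then push the
    diffusion loss to stay low on true conditions and high on these."""
    res_syn = x_syn - out_syn      # residual removed from the synthetic input
    res_real = x_real - out_real   # residual removed from the real input
    neg_syn = x_syn - res_real     # synthetic input with the real residual
    neg_real = x_real - res_syn    # real input with the synthetic residual
    return neg_syn, neg_real
```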
W2. Runtime for the proposed method
As suggested, we have added the runtime of the comparison methods as follows. All methods are tested on an NVIDIA A100 40G GPU.
| Task | Resolution | Method | Time (s) |
|---|---|---|---|
| Denoise | 256x256 | Ours | 0.0015 |
| Denoise | 256x256 | CyCADA | 0.0015 |
| Denoise | 256x256 | Ne2Ne | 0.0025 |
| Denoise | 256x256 | MaskedD | 0.2418 |
| Derain | 512x512 | Ours | 0.0020 |
| Derain | 512x512 | NLCL | 0.0139 |
| Derain | 512x512 | Restormer | 0.1561 |
| Deblur | 670x764 | Ours | 0.0022 |
| Deblur | 670x764 | SelfDeblur | 0.0147 |
| Deblur | 670x764 | VDIP | 0.0102 |
Compared with previous methods, our work can achieve promising restoration performance on real-world datasets while showing favorable efficiency. This advantage stems from the flexibility of the proposed domain adaptation strategy, which only requires the conditional diffusion model during the training stage. The diffusion model is discarded after training, incurring no extra computational cost in restoration inference.
Q1. Could the authors show some failure cases?
Sure, a failure case has been added to the updated Appendix; please refer to Fig. A9 for more details. In this case, our method fails to restore images with challenging distortions such as strong noise and out-of-distribution noise. These real-world degradations induce an extremely large gap relative to the synthetic dataset, making it difficult for the learning model to adapt the restored results into the clean domain. The same challenge also occurs with the other comparison methods. It could potentially be addressed by incorporating more powerful diffusion models, and we leave this as future work.
Q2. The performance of a model trained directly on a paired real dataset
Following your suggestion, we trained the denoising model directly on a dataset with real paired samples; its performance (an upper bound) on the SIDD test set is reported below. As we can observe, there is a clear domain gap between the synthetic and real-world datasets. As a result, restoration models trained on synthetic images with conventional loss functions cannot escape a dramatic drop in performance when applied to real-world domains. By contrast, our method effectively bridges this gap by adapting the model in the noise space using the proposed diffusion loss, which only requires the degradation observations of real-world images (without their paired clean labels).
| Metric | Trained on Syn | Trained on Real | Ours |
|---|---|---|---|
| PSNR | 26.58 | 39.18 | 34.71 |
| SSIM | 0.6132 | 0.9547 | 0.9202 |
Q3. A direct and clear explanation of "information leakage"
As discussed in W1, we have explained how the knowledge is "leaked" from the synthetic domain to the real domain. Please refer to W1 for more details. We appreciate it again for your insightful perspective.
The authors' rebuttal addresses most of my concerns and I raise my score to 6. However, as pointed out by Reviewer vDAP, the paper's structure makes it difficult to follow. I share this view and believe that the presentation of the contributions, method flow, and technical clarity still requires significant improvement and major revisions. While I have slightly raised my score, it is worth noting that the paper needs substantial revision to clearly and directly highlight its core contributions and innovations. If the AC or SPC ultimately decides to reject the paper, I have no objection.
Thank you for raising the rating. We are happy most of your concerns have been addressed. Your invaluable suggestions will be carefully incorporated into the final version to further enhance the quality of our work.
Regarding your suggestion on technical clarity, we have added the detailed workflows of the training and inference stages in Algorithm 1 and Algorithm 2 of the Appendix (page 27). Additionally, clear explanations of the "information leakage" are presented in Section A5.2 (page 27). Please refer to the updated paper for more details.
If the reviewer has any additional questions or concerns, we would be more than happy to address them. Please feel free to leave a comment here. Thank you.
Best Regards, Authors
Thanks for the feedback. I keep my score as marginally above the acceptance threshold.
Dear Reviewer s2ny,
We sincerely appreciate your constructive comments and are encouraged by your positive score on our work. Your suggestions have contributed to significantly enhancing the quality of our paper.
Best Regards, Authors
The core of the method is a two-model setup: a restoration network and a diffusion model. The restoration network is trained to clean degraded images, while the diffusion model acts as a noise-space guide, enforcing an alignment between synthetic and real-world restorations and the clean target distribution. The authors validate their approach on tasks such as image denoising, deblurring, and deraining, showing it consistently outperforms both feature-space and pixel-space domain adaptation methods.
Strengths
The strengths of this work lie in the following:
- A novel approach for iteratively using a diffusion model as a proxy model for denoising in noise space
- The authors introduce channel-shuffling and residual-swapping strategies, ensuring robust generalization.
- The diffusion model is used only during training
- The final restoration model achieves SOTA results across multiple restoration tasks.
Weaknesses
For the de-raining results, the authors should show examples where the background is not empty, i.e., where there is more content in the image than the gray background in Figure 5.
Questions
Can the method be trained with multiple tasks simultaneously since it is technically (low-level) task agnostic?
W1. Show results where the background is not empty in de-raining images
Thanks for your suggestion. We have updated the de-raining results where the background is not empty. Besides, we also provided more de-raining results under various scenes in Fig. 8, Fig. A4 and Fig. A8. Please refer to the updated paper for more details.
Q1. Can the method be trained with multiple tasks simultaneously since it is technically (low-level) task agnostic?
We believe our work can be extended into such an application since the proposed framework offers a general and flexible adaptation strategy applicable beyond specific restoration tasks. It requires no prior knowledge of noise distribution or degradation models, and thus differs from self-supervised restoration methods tailored to one type of low-level task. We would like to leave this “all-in-one” network as our future work.
Dear Reviewer 5su7,
Thank you again for your review. We appreciate your recognition of our novelty and SOTA performance. We truly hope our rebuttal has addressed your questions and concerns, such as showing results with richer content and the discussion of an "all-in-one" restoration model. As the discussion phase is nearing its end, we would be grateful to hear your feedback and to know whether you still have any concerns we could address.
It would be appreciated if you could raise your score if we have addressed your concerns. Thank you again for your effort in reviewing our paper.
Best Regards, Authors
Thank you for addressing my concerns. I agree with the other reviewers that there is clearly novelty and impact, but portions of the writing made it challenging to follow. I maintain my rating.
Dear Reviewer 5su7,
We sincerely appreciate your recognition of the clear novelty and impact of our work. In the final version, we will ensure that all your suggestions are fully incorporated into the paper.
For your suggestions on the paper writing, we have reorganized the structure and highlighted the proposed contributions in the paper. Specifically, we have added the detailed workflows of the training and inference stages in Algorithm 1 and Algorithm 2 of the Appendix (page 27). Additionally, clearer explanations of bridging the domain gap have been presented in Section A5.2 (page 27). Please refer to the updated paper for more details.
Thank you once again for your time and effort in reviewing our paper.
Best Regards, Authors
This submission addresses image restoration tasks via domain adaptation in the noise space and proposes strategies to mitigate related shortcut-learning issues.
The authors performed exceptionally well in their rebuttal, addressing most concerns by presenting additional evaluation results and explanations. Although three reviewers initially leaned towards a negative view, they increased their scores after extended discussion with the authors.
After reviewing the paper, the rebuttal, and the resulting discussions, the AC believes that the overall strengths outweigh the weaknesses and recommends acceptance. Employing domain adaptation in the noise space can be considered a valuable contribution to learning-based image restoration. For the camera-ready version, the authors should incorporate all key results presented in the rebuttal and significantly improve the presentation.
Additional Comments on Reviewer Discussion
The paper received reviews from five reviewers: four borderline accepts and one borderline reject.
Most reviewers recognise the significance of employing domain adaptation in the noise space, which can be considered a valuable contribution to learning-based image restoration and influenced the decision. The main concerns raised relate to: generalisation abilities with respect to unseen data, manuscript structure & writing quality and the comprehensiveness of experimental evaluation. The reviewer discussion helped to alleviate a subset of concerns raised.
Accept (Poster)