Transfer Learning for Diffusion Models
Abstract
Reviews and Discussion
This paper introduces a new framework, the Transfer Guided Diffusion Process (TGDP), for transferring a pre-trained diffusion model from a source domain to a target domain. The authors connect the score function of the target domain to the score function of the source domain through a guidance term related to the density ratio between the two domains (written out below in generic notation). They use a classifier to estimate the density ratio and extend TGDP to a conditional version. In addition, a cycle regularization term and a consistency regularization term are proposed to improve performance.
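The core relation behind this guidance term can be stated compactly. The following is our paraphrase in generic notation (not necessarily the paper's exact symbols): if $p_t$ and $q_t$ denote the source and target noisy marginals under a shared forward kernel $p(\mathbf{x}_t|\mathbf{x}_0)$, then

$$
q_t(\mathbf{x}_t) = p_t(\mathbf{x}_t)\,\mathbb{E}_{p(\mathbf{x}_0|\mathbf{x}_t)}\!\left[\frac{q(\mathbf{x}_0)}{p(\mathbf{x}_0)}\right]
\quad\Longrightarrow\quad
\nabla_{\mathbf{x}_t} \log q_t(\mathbf{x}_t) = \nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t) + \nabla_{\mathbf{x}_t} \log \mathbb{E}_{p(\mathbf{x}_0|\mathbf{x}_t)}\!\left[\frac{q(\mathbf{x}_0)}{p(\mathbf{x}_0)}\right],
$$

so the target score is the source score plus the gradient of the log expected density ratio, which is what the guidance network estimates.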
Strengths
- This paper introduces a novel training paradigm for transferring a pre-trained diffusion model from the source domain to the target domain in a more efficient and effective way. The guidance term proposed in this paper is a technique worthy of reference and further in-depth study by researchers.
- The paper is well-written and easy to understand. The theoretical analysis of the robustness is interesting and well described, which helps to understand the proposed techniques.
- Experimental results show the effectiveness of the proposed method.
Weaknesses
- The part on the Extension to the Conditional Version could be more detailed. The authors could discuss what Lemma 3.2 becomes in the conditional version.
- This paper is missing a comparison with articles related to transfer learning for diffusion models. In the experiment section, the authors only compare TGDP with a vanilla diffusion model and a fine-tuned generator.
- Besides the Gaussian mixture simulations and the benchmark electrocardiogram (ECG) data, the authors could provide more experimental results on other datasets.
Questions
- Is there a writing mistake in line 162? Should it be hard to sample from p(x0|xt) rather than q(x0|xt)?
- What is Lemma 3.2 in the conditional version?
- Did you find other articles related to transfer learning for diffusion models, and can you compare other methods with TGDP?
Limitations
A limitation of this study is the lack of empirical validation of TGDP's performance on vision-language tasks.
We thank the reviewer for the comments and suggestions. We appreciate the time you spent on the paper. Below, we address your concerns and comments.
Q: The authors could discuss what Lemma 3.2 becomes in the conditional version. What is Lemma 3.2 in the conditional version?
A: Thank you very much for this kind reminder. Yes, the key idea behind Lemma 3.2 is that the conditional expectation is the optimal solution to the least-squares regression problem. It extends directly to conditional generation, i.e., the optimal guidance network equals the conditional expectation of the density ratio given the noisy sample and the label. We give the lemma and its proof here for the sake of completeness, and will add them to the revised version.
Lemma: For a neural network $h_{\boldsymbol{\psi}}$ parameterized by $\boldsymbol{\psi}$, define the objective $\mathcal{L}_{\text{guidance}}(\boldsymbol{\psi}) = \mathbb{E}_{p(\mathbf{x}_0, \mathbf{x}_t, y)}\left[\left\|h_{\boldsymbol{\psi}}\left(\mathbf{x}_t, y, t\right)-{q(\mathbf{x}_0, y)}/{p(\mathbf{x}_0, y)}\right\|_2^2\right]$; then its minimizer satisfies $h_{\boldsymbol{\psi}^*}\left(\mathbf{x}_t, y, t\right)=\mathbb{E}_{p(\mathbf{x}_0|\mathbf{x}_t, y)}\left[{q(\mathbf{x}_0, y)}/{p(\mathbf{x}_0, y)}\right]$.
The proof is straightforward and very similar to the unconditional version. Note that the objective function can be rewritten as
$$
\begin{aligned}
\mathcal{L}_{\text{guidance}}(\boldsymbol{\psi}) :&= \mathbb{E}_{p(\mathbf{x}_0, \mathbf{x}_t, y)}\left[\left\|h_{\boldsymbol{\psi}}\left(\mathbf{x}_t, y, t\right)-\frac{q(\mathbf{x}_0, y)}{p(\mathbf{x}_0, y)}\right\|_2^2\right] \\
&= \int_{\mathbf{x}_t} \int_{y} \left\{\int_{\mathbf{x}_0} p(\mathbf{x}_0|\mathbf{x}_t,y) \left\|h_{\boldsymbol{\psi}}\left(\mathbf{x}_t, y, t\right) - \frac{q(\mathbf{x}_0, y)}{p(\mathbf{x}_0, y)}\right\|_2^2 d\mathbf{x}_0 \right\} p(\mathbf{x}_t|y)\, p(y)\, dy\, d\mathbf{x}_t \\
&= \int_{\mathbf{x}_t} \int_{y} \left\{ \left\|h_{\boldsymbol{\psi}}(\mathbf{x}_t, y, t)\right\|_2^2 - 2 \left\langle h_{\boldsymbol{\psi}}(\mathbf{x}_t, y, t), \int_{\mathbf{x}_0} p(\mathbf{x}_0|\mathbf{x}_t, y)\, \frac{q(\mathbf{x}_0,y)}{p(\mathbf{x}_0,y)}\, d\mathbf{x}_0 \right\rangle \right\} p(\mathbf{x}_t|y)\, p(y)\, dy\, d\mathbf{x}_t + C \\
&= \int_{\mathbf{x}_t} \int_{y} \left\|h_{\boldsymbol{\psi}}(\mathbf{x}_t, y, t) - \mathbb{E}_{p(\mathbf{x}_0 |\mathbf{x}_t, y)}\left[\frac{q(\mathbf{x}_0, y)}{p(\mathbf{x}_0, y)}\right] \right\|_2^2 p(\mathbf{x}_t|y)\, p(y)\, dy\, d\mathbf{x}_t + C',
\end{aligned}
$$

where $C$ and $C'$ are constants independent of $\boldsymbol{\psi}$. Thus the minimizer $\boldsymbol{\psi}^* = \underset{\boldsymbol{\psi}}{\arg \min}\ \mathcal{L}_{\text{guidance}}(\boldsymbol{\psi})$ satisfies $h_{\boldsymbol{\psi}^*}\left(\mathbf{x}_t, y, t\right)=\mathbb{E}_{p(\mathbf{x}_0|\mathbf{x}_t, y)}\left[{q(\mathbf{x}_0, y)}/{p(\mathbf{x}_0, y)}\right]$.

**Q**: *This paper is missing a comparison with articles related to transfer learning for diffusion models. Did you find other articles related to transfer learning for diffusion models, and can you compare other methods with TGDP?*

**A**: Thank you very much for this question. As far as we know, [1,2,3] explore approaches to fine-tuning diffusion models. They focus on methods that either significantly reduce the number of tunable parameters or introduce regularization terms to alleviate overfitting on image data. Since fine-tuning only a subset of weights in diffusion models often yields results that are worse than or comparable with methods that fine-tune all weights, we compared our method to full-weight fine-tuning, which we believe serves as a strong baseline.

[1] Moon, T., Choi, M., Lee, G., Ha, J., Lee, J., Kaist, A. Fine-tuning Diffusion Models with Limited Data. NeurIPS 2022 Workshop on Score-Based Methods.

[2] Xie, E., Yao, L., Shi, H., Liu, Z., Zhou, D., Liu, Z., Li, J., Li, Z. DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 4207-4216.

[3] Zhu, J., Ma, H., Chen, J., Yuan, J. DomainStudio: Fine-Tuning Diffusion Models for Domain-Driven Image Generation using Limited Data. ArXiv, abs/2306.14153, 2023.

**Q**: *Besides the Gaussian mixture simulations and the benchmark electrocardiogram (ECG) data, the authors could provide more experimental results on other datasets.*

**A**: Thank you very much for this question. We refer to the "general" response and Table 1 in the attached file.

**Q**: *Typo in line 162.*

**A**: Thank you very much for your kind reminder. We have corrected it in the revised version.

I thank the authors for their response. I will keep my original rating.
This paper addresses the transfer learning problem for diffusion models, specifically adapting pre-trained diffusion models to downstream datasets, particularly when the data size is small. Traditional parameter-efficient fine-tuning methods often use pre-trained models as parameter initialization, selectively updating parameters based on prior knowledge. These methods, however, can be suboptimal and not robust across different downstream datasets. The authors propose a novel approach called Transfer Guided Diffusion Process (TGDP), which aims to transfer pre-trained diffusion models more effectively. Instead of fine-tuning, TGDP treats the knowledge transfer process as guided by the pre-trained diffusion model. This method involves learning a domain classifier and using its gradient to guide the estimated score function from the source to the target domain, complemented by additional regularizations in practical implementation. Experimental results demonstrate that TGDP outperforms traditional fine-tuning methods, achieving state-of-the-art performance on both synthetic and real-world datasets.
Strengths
- Provides a novel perspective on knowledge transfer for pre-trained diffusion models.
- Theoretical foundation of the proposed method is well-constructed.
- The paper is well-organized with clear presentation.
Weaknesses
- Scope of Title:
- The title "Transfer Learning" is too broad. The paper focuses solely on (few-shot and supervised) domain adaptation, where the upstream and downstream tasks are similar in nature. General transfer learning encompasses a wider range of tasks, including those with different downstream tasks from the upstream one. For instance, transferring pre-trained text-to-image models to controllable generation or text-to-video generation are broader and more complex tasks that fall under transfer learning.
- The paper does not address transfer learning across different label spaces. Even within the same generation tasks, domain adaptation requires identical label spaces for source and target domains, which is not always practical. For example, transferring pre-trained conditional generation models on ImageNet to other fine-grained datasets with different label spaces presents a more challenging problem than the domain adaptation addressed in this paper.
- Empirical Results:
- The benchmarks used are insufficient. The authors conduct experiments primarily on synthetic datasets and a single real-world dataset (ECG). This limited scope is inadequate for demonstrating the method's effectiveness. More experiments on various modalities and datasets, such as the DomainNet dataset, which is a benchmark for domain adaptation, are necessary to showcase the generalization ability of the proposed method.
- The provided analyses are insufficient. While several ablation studies on simulations are included in Appendix C.2, they are not comprehensive. Essential missing analyses include (1) examining the main guidance term alone and (2) combining it with each regularization term. These studies are crucial for understanding each term's effectiveness. Additionally, ablation studies should also be conducted on real-world datasets, not just synthetic ones.
Questions
- Consistency Regularization:
- The consistency regularization requires optimizing a gradient term, but the paper lacks details on how to optimize this term, especially when second-order gradients are involved. Could the authors elaborate on the optimization process for this term?
- Effectiveness of TGDP:
- In Table 1, although TGDP significantly outperforms the baselines, its performance is highly dependent on the target data size. This seems contradictory to two points mentioned by the authors: (a) the training of the domain classifier is not significantly affected by the target data size (Figure 2), and (b) the training of the guidance network with the main term does not require target data (Equation 8). How do the authors explain this discrepancy?
- Explanation of Figure 3:
- Figure 3 shows that the visualizations of Finetune Generator and TGDP appear similar, yet their performance differs significantly. Could the authors provide more insights and explanations about the figure and the reasons behind this performance difference?
Limitations
Yes.
We thank the reviewer for the comments and suggestions. We appreciate the time you spent on the paper. Below we address the concerns and comments that you have provided.
Q: The title "Transfer Learning" is too broad. The paper focuses solely on (few-shot and supervised) domain adaptation, where the upstream and downstream tasks are similar in nature. General transfer learning encompasses a wider range of tasks, including those with different downstream tasks from the upstream one. For instance, transferring pre-trained text-to-image models to controllable generation or text-to-video generation are broader and more complex tasks that fall under transfer learning.
A: Thank you very much for this question. Actually, our proposed framework is general enough to deal with more complex tasks, e.g., fine-tuning text-to-image or text-to-video generation models. These models can be viewed as conditional generative models with text embeddings as the condition.
Q: The paper does not address transfer learning across different label spaces. Even within the same generation tasks, domain adaptation requires identical label spaces for source and target domains, which is not always practical. For example, transferring pre-trained conditional generation models on ImageNet to other fine-grained datasets with different label spaces presents a more challenging problem than the domain adaptation addressed in this paper.
A: Thank you very much for this question. Our framework (conditional guidance) does not assume identical label spaces for the source and target domains. In the ECG experiments, the label set of the target domain is a subset of that of the source domain. When the source and target domains contain different class labels, our framework is still applicable. To generate an unseen class, the key problem is to choose a particular class from the source domain such that we can borrow more information from the source domain when generating this unseen class in the target domain. The coupling between source and target classes can be learned by solving a static optimal transport problem.
We have shown that our framework is general enough to handle homogeneous transfer learning as well as heterogeneous transfer learning; a more in-depth design (e.g., a coupling solved by static optimal transport, sketched below) can be left to future work.
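A minimal sketch of how such a class coupling could be computed, assuming per-class mean embeddings are available. All names and shapes here are illustrative assumptions, not from the paper; the one-to-one assignment is a special case of static optimal transport:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative per-class mean embeddings (random stand-ins): 10 source
# classes and 4 unseen target classes, each in a 128-dim feature space.
rng = np.random.default_rng(0)
src_means = rng.standard_normal((10, 128))
tgt_means = rng.standard_normal((4, 128))

# Cost matrix of pairwise Euclidean distances between class means.
cost = np.linalg.norm(tgt_means[:, None, :] - src_means[None, :, :], axis=-1)

# One-to-one assignment minimizing total cost (a special case of static OT).
tgt_idx, src_idx = linear_sum_assignment(cost)
coupling = dict(zip(tgt_idx.tolist(), src_idx.tolist()))  # target class -> borrowed source class
print(coupling)
```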
Q: More experiments on various modalities and datasets are necessary to showcase the generalization ability of the proposed method.
A: Thank you very much for this suggestion. We refer to the "general" response and Table 1 in the attached file.
Q: While several ablation studies on simulations are included in Appendix C.2, they are not comprehensive.
A: Thank you very much for this suggestion. We refer to the "general" response and Figure 1 in the attached file.
Q: The consistency regularization requires optimization of a gradient term. Could the authors elaborate on the optimization process for this term?
A: Thank you very much for this question. Optimizing through a gradient term is widely used in meta-learning methods. In practice there is only a one-line difference to implement it, i.e., calling "torch.autograd.grad(..., retain_graph=True)" to retain the computational graph for the backpropagation into the weights of the guidance network, which can be found at line 249 in density_ratio_guidance.py. We will add additional clarifications to the revised version.
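A minimal, self-contained sketch of this mechanism (hypothetical network and shapes, not the paper's actual code). The point is that building the inner gradient with `create_graph=True` (which implies retaining the graph) keeps that gradient differentiable, so the final backward pass computes second-order gradients with respect to the guidance network's weights:

```python
import torch

torch.manual_seed(0)

# Hypothetical guidance network and noisy samples (shapes are illustrative).
guidance_net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1)
)
x_t = torch.randn(16, 2, requires_grad=True)

h = guidance_net(x_t)  # per-sample guidance output

# Inner gradient dh/dx_t. create_graph=True keeps this gradient itself
# differentiable, so the loss below can backpropagate *through* it into
# guidance_net's weights (a second-order gradient).
grad_h = torch.autograd.grad(h.sum(), x_t, create_graph=True)[0]

# A consistency-style penalty on the gradient term (zero target is a placeholder).
loss = (grad_h ** 2).mean()
loss.backward()  # populates second-order grads in guidance_net.parameters()
print(guidance_net[0].weight.grad.shape)
```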
Q: Effectiveness of TGDP.
A: Thank you very much for this question. Both claims are correct. (The training of the domain classifier is not significantly affected by the number of samples in the target domain, since the domain classifier can achieve more than 90% accuracy learning from only 10 samples in the target domain. And the training of the guidance network with the main term does not require target data.) The key points are: 1) the density ratio estimator may not be accurate in some regions, since the coverage will not be large enough when the number of training samples is 10; 2) the two regularization terms depend on the number of samples in the target domain.
Q: Explanation of Figure 3.
A: Thank you very much for your question. Table 2 demonstrates improvements in the diversity and fidelity of TGDP against the two baseline methods. For each sub-figure in Figure 3, we obtain the embedding for T-SNE using the same encoder, learned directly on target samples. If we changed the encoder to the classifier learned by each method, we might see a big difference between the Finetune Generator and TGDP.
Thank you for your response. However, some of my concerns remain unresolved.
- Q1: The title "Transfer Learning" is too broad. One typical advantage of fine-tuning and other transfer learning methods (compared with score shaping) is the ability to adapt to a new downstream task. For example, Stable Diffusion is a text-to-image generation model, and [1-2] transfer it to controllable generation and text-to-video generation, respectively. Could the authors provide further details about how TGDP handles these tasks?
- Q7: Explanation of Figure 3. Why does using the same encoder learned on target samples lead to similar T-SNE visualizations, although the diversity and fidelity are improved? Are there any results following "change the encoder to the learned classifier"?
[1] Adding Conditional Control to Text-to-Image Diffusion Models, ICCV 2023
[2] Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models, CVPR 2023
We thank the reviewer for the comments and suggestions. We appreciate the time you spent on the paper. Below we address the concerns and comments that you have provided.
Q1: The title "Transfer Learning" is too broad. One typical advantage of fine-tuning and other transfer learning methods (compared with score shaping) is the ability to adapt to a new downstream task. For example, Stable Diffusion is a text-to-image generation model, and [1-2] transfer it to controllable generation and text-to-video generation, respectively. Could the authors provide further details about how TGDP handles these tasks?
Thank you very much for your question. Given a source distribution whose condition consists of text prompts (using the terminology from [1]), a pre-trained diffusion model can be trained on the source distribution. Given a target distribution whose condition is a task-specific condition, the models in [1,2] can be fine-tuned with the noise-matching objective, a standard form of which is shown below.
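For concreteness, a standard form of the conditional noise-matching objective, written in generic DDPM-style notation (our notation, not necessarily that of [1,2]):

$$
\mathcal{L}_{\text{noise}}(\theta) = \mathbb{E}_{t,\,(\mathbf{x}_0, c),\;\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I})}\left[\left\|\boldsymbol{\epsilon}_{\theta}\!\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\; c,\; t\right) - \boldsymbol{\epsilon}\right\|_2^2\right].
$$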
In our work, we propose to directly estimate the guidance term rather than fine-tune via the noise-matching objective. We can still use a domain classifier to estimate the density ratio, with the embedding of the condition included as an input.
Moreover, we would like to discuss some similarities and differences between our work and [1,2]. Directly fine-tuning the diffusion model on data from the target domain is similar to the consistency regularization proposed in our work, while [1,2] have a more in-depth design for the architecture and achieve great results on vision-language tasks. However, with limited data from the target distribution, direct fine-tuning may not achieve good enough performance, which we verify in a two-dimensional Gaussian setting.
Q7: Explanation of Figure 3. Why does using the same encoder learned on target samples lead to similar T-SNE visualizations, although the diversity and fidelity are improved? Are there any results following "change the encoder to the learned classifier"?
Thank you very much for your question. When computing FID/Diversity, we use the same encoder to get the feature map, which is the common practice for fair comparison. Therefore, we use the same feature map for T-SNE visualizations, which are generated by the same encoder. For the downstream task, we train a separate encoder for classification.
Thanks for your response.
For Q1, I appreciate the potential of TGDP to adapt to more complex tasks and the discussion of their similarities and differences. But I am still confused about the technical details. Could the authors provide more details about the design of the condition embedding in the case when the source and target conditions differ? In this case, should the embedding stand for the source condition or the target condition? Since [1] is also applicable to the case with limited data, I expect the authors to provide a comprehensive solution or discuss it in the limitations section.
For Q7, I understand which encoder to use, but I am still concerned about why Figure 3(b) and Figure 3(c) look similar (even though there is an improvement in both diversity and FID). Could the authors provide more details?
[1] Adding Conditional Control to Text-to-Image Diffusion Models, ICCV 2023
We thank the reviewer for the comments and suggestions. We appreciate the time you spent on the paper. Below we address the concerns and comments that you have provided.
Q: For Q1, I appreciate the potential of TGDP to adapt to more complex tasks and the discussion of their similarities and differences. But I am still confused about the technical details. Could the authors provide more details about the design of the condition embedding in the case when the source and target conditions differ? In this case, should the embedding stand for the source condition or the target condition? Since [1] is also applicable to the case with limited data, I expect the authors to provide a comprehensive solution or discuss it in the limitations section.
A: Thank you very much for your question. In [1], text prompts and a task-specific condition can have different modalities, e.g., an image condition. We believe TGDP is still general enough to deal with this case. We propose a possible way in the following:
- When the task-specific condition has the same data modality as the text prompts, we can use the same pre-trained language model to generate the embeddings for the data from the source domain and for the data from the target domain.
- When the task-specific condition has a different data modality from the text prompts, e.g., an image condition, we rely on an aligned encoder, e.g., ALIGN [Jia et al. 2021] in vision-language tasks, to generate the embeddings for the data from the source domain and for the data from the target domain.
Then, the domain classifier is used to distinguish the source embeddings from the target embeddings, thereby estimating the density ratio; a minimal sketch follows.
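A minimal sketch of the classifier-based density-ratio trick on condition embeddings. The dimensions, names, and prior-correction factor are our illustrative assumptions (the correction assumes the classifier sees n_src source and n_tgt target examples overall), not the paper's actual code:

```python
import torch
import torch.nn as nn

# Binary domain classifier on 128-dim embeddings: outputs label 1 for
# target-domain inputs and 0 for source-domain inputs.
clf = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def train_step(z_src, z_tgt):
    """One gradient step on a batch of source/target embeddings."""
    logits = clf(torch.cat([z_src, z_tgt]))
    labels = torch.cat([torch.zeros(len(z_src), 1), torch.ones(len(z_tgt), 1)])
    loss = bce(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def density_ratio(z, n_src, n_tgt):
    """Estimate q(z)/p(z) as d/(1-d), corrected for source/target imbalance."""
    with torch.no_grad():
        d = torch.sigmoid(clf(z))
    return (d / (1.0 - d)) * (n_src / n_tgt)

# Usage example with random stand-in embeddings:
train_step(torch.randn(32, 128), torch.randn(8, 128))
print(density_ratio(torch.randn(4, 128), n_src=32, n_tgt=8))
```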
The approach in [1] addresses limited data scenarios by using zero convolution layers (i.e., 1×1 convolution layers with both weight and bias initialized to zero), which mitigates instability during fine-tuning. This is distinct from our methodology, which leverages the smaller sample complexity of the classifier/density ratio estimator compared with the diffusion model. We will include a discussion of the similarities and differences with [1] in the revised version.
Q: For Q7, I understand which encoder to use, but still concerned why Figure 3(b) and Figure 3(c) looks similar (even if there is an improvement over both diversity and FID). Could the authors provide more details?
A: Thank you very much for your question. We observe a significant difference in the distribution patterns of the purple and black classes. In Figure 3(c), the black class is more concentrated than in Figure 3(b), suggesting that the synthetic data may lie on the correct manifold. However, given that T-SNE reduces a 128-dimensional feature space to 2 dimensions, we must be careful not to overinterpret these visualizations. The numerical results provide a more reliable basis for conclusions.
[Jia et. al. 2021]: Jia, C., Yang, Y., Xia, Y., Chen, Y., Parekh, Z., Pham, H., Le, Q.V., Sung, Y., Li, Z., & Duerig, T. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. ICML 2021.
Thanks for your response; my concerns are addressed. I appreciate the contribution of TGDP in terms of motivation, novelty, and theoretical soundness, but I still expect the scope to be made clearer (since the title is "Transfer Learning"), and believe more discussion of TGDP in other fine-tuning scenarios could make the paper better. All things considered, I raise my score to 5, i.e., a borderline score.
Hello authors of the paper,
It would be great if you could respond to reviewer UUwS's most recent comments before the author/reviewer discussion period closes (at Aug 13, 11:59pm AoE).
Thanks, Your AC
We thank AC for the kind reminder. We appreciate the time you spent on the paper.
The paper proposes an approach for transfer learning based on score-based generative models and density-ratio estimation. The authors show that in order to transfer a trained diffusion model from a source to a target domain, all that is needed is the expectation of density ratios of source and target domains. This expectation can be learned using neural networks reducing the problem of learning a generative model on the target domain to a regression/classification task.
Strengths
- As far as I can judge, the proposed method for transfer learning is both original and significant. It should make a good contribution to the research community.
- In addition to the core of the method, the proposed regularizers that improve efficiency are in my opinion conceptually convincing.
- The paper is well written and easy to follow.
- The experimental evaluation on the ECG data is in my opinion well executed and convincing.
Weaknesses
- While the method is in my opinion theoretically convincing, the experimental section and ablations are a bit lackluster which decreases confidence in the method (even though I acknowledge the complexity of the task itself).
- The first experiment on a 2D Gaussian illustrates the performance of the method in comparison to baselines decently, but it is in my opinion ultimately too trivial to be really meaningful (since the density ratios of the 2 Gaussian mixtures can be easily learned, because they are well separated, I would expect that a method based on learning density ratios outperforms baselines here). The ablation in Figure 4 in the appendix is conducted too superficially. It consists of scatterplots, no real quantitative analysis, and doesn't evaluate all relevant loss terms of the model.
- The evaluation on the ECG data is not fully clear, since meta-information, such as sampling rate, and other details are not given (as far as I can tell).
- In my opinion, a quantitative evaluation on at least 2-3 more high-dimensional real-world data sets should have been conducted in addition.
Questions
- Figure 1 shows the data as a scatterplot, which makes it impossible to see the learned density. Since the true densities are known, a representation using kernel density estimates would be better to see whether the learned model aligns with the truth.
- For readers who do not know what a "12-lead ECG data" is, the supplement should contain this information (at least a rough overview). How does 12-lead ECG data look? Are they one-dimensional time series? At what rate are they sampled?
- Task 4.2.2 is a bit unclear. The authors should describe in more detail what the goal is and what exactly they are doing (even though one can infer it from the text).
- It would have been nice to see the ablation for all combinations that contain the main guidance term, and in particular the guidance term alone (without the regularizers). In this state, it is not clear what, e.g., the cycle or consistency regularization contributes separately.
- How are the hyper-parameters chosen?
Limitations
- A statement on limitations regarding empirical evaluation exist.
- Possible negative societal impact, e.g., regarding generation of deepfakes using transfer learning, is acknowledged.
We thank the reviewer for the comments and suggestions. We appreciate the time you spent on the paper. Below we address the concerns and comments that you have provided.
Q: The first experiment on a 2D Gaussian illustrates the performance of the method in comparison to baselines decently, but it is in my opinion ultimately too trivial to be really meaningful (since the density ratios of the 2 Gaussian mixtures can be easily learned, because they are well separated, I would expect that a method based on learning density ratios outperforms baselines here).
A: Thank you very much for this question. We agree that the two-dimensional Gaussian mixture setup is a relatively easy task and serves as a proof-of-concept numerical example to validate the proposed method. To further demonstrate the effectiveness of our method, we have conducted experiments on real-world ECG datasets, and on image datasets in the "general" response above and Table 1 in the attached file therein. Additionally, we would like to highlight a challenge associated with the two-dimensional Gaussian example: when two distributions differ significantly, the density ratio estimator struggles to accurately estimate the density ratio. Due to the well-separated nature of the distributions, the density ratio at some points can be extremely large. One key takeaway from our work is that our method is robust to the magnitude of the estimated density ratio.
Q: The ablation in Figure 4 in the appendix is conducted too superficially. It consists of scatterplots, no real quantitative analysis, and doesn't evaluate all relevant loss terms of the model. It would have been nice to see the ablation for all combinations that contain the main guidance term, and in particular the guidance term alone. In this state, it is not clear what, e.g., the cycle or consistency regularization contributes separately.
A: Thank you very much for this suggestion. We refer to the "general" response and Figure 1 in the attached file.
Q: The evaluation on the ECG data is not fully clear, since meta-information, such as sampling rate, and other details are not given (as far as I can tell). For readers who do not know what a "12-lead ECG data" is, the supplement should contain this information (at least a rough overview). How does 12-lead ECG data look? Are they one-dimensional time series? At what rate are they sampled?
A: Thank you very much for this suggestion. A 12-lead ECG (electrocardiogram) records the heart's electrical activity from 12 different perspectives, yielding 12-dimensional time series data. We use the data from the PTB-XL and ICBEB2018 datasets at a sampling frequency of 100 Hz, i.e., 100 samples per second. We have included this information in Section 4.2 of the revised version.
Q: How are the hyper-parameter chosen?
A: Thank you very much for this question. The guidance term is computed on data from the source distribution, while the two regularization terms are computed on data from the target distribution. To choose good regularization weights, we must take the numbers of samples in the source and target distributions into consideration. Therefore, we initially set the two weights to be equal and rely on grid search to determine them.
Q: A quantitative evaluation on at least 2-3 more high-dimensional real-world data sets should have been conducted in addition.
A: Thank you very much for this suggestion. We refer to the "general" response and Table 1 in the attached file.
Q: Task 4.2.2 is a bit unclear. The authors should be more detailed what the goal is and what they are exactly doing (even though one can infer it from the text).
A: Thank you very much for this suggestion. We include a paragraph in section 4.2.2 to describe the ECG classification task in the revised version.
Thank you for the response to my review and comments.
I believe the method would make a good contribution to the community, but a sparse and unrevised experimental section prohibits a higher score. Having read the reviews of the other reviewers, I agree that the method clearly needs more evaluation.
This paper introduces a novel approach called the Transfer Guided Diffusion Process (TGDP) for transferring knowledge from a pre-trained diffusion model in the source domain to a target domain with limited data. The authors present the methodology for transferring knowledge, including the formulation of the guidance network and its learning process. They also extend TGDP to a conditional version for joint data and label distribution modeling.
Strengths
- TGDP offers a new perspective on transferring knowledge from pre-trained diffusion models to new domains with limited data. The whole framework is innovative and reasonable.
- The paper provides a solid theoretical basis for TGDP, including proofs for the optimality of the approach.
Weaknesses
- More experiments on real-world datasets are required to further validate the effectiveness of the proposed framework.
- The computational cost of training the guidance network and the diffusion model should be discussed.
- The paper primarily validates TGDP on Gaussian mixture simulations and ECG datasets. Its performance on other types of data or domains is not empirically tested or discussed.
Questions
In what scenarios does the proposed framework work best?
Limitations
Authors have discussed limitations well.
We thank the reviewer for the comments and suggestions. We appreciate the time you spent on the paper. Below we address the concerns and comments that you have provided.
Q: The paper primarily validates TGDP on Gaussian mixture simulations and ECG datasets. Its performance on other types of data or domains is not empirically tested or discussed. More experiments on real-world datasets are required to further validate the effectiveness of the proposed framework.
A: Thank you very much for your comments. We refer to the "general" response and Table 1 in the attached file.
Q: The computational cost of training the guidance network and the diffusion model should be discussed.
A: Thank you very much for this question. We provide the computational cost of training in Table 2 on page 9. The number of parameters in the guidance network (domain classifier) is much smaller than that of the diffusion model, which reduces the computational cost compared with fine-tuning-based methods.
Q: In what scenarios does the proposed framework work best?
A: Thank you very much for this question. In this work, we prove that using the guidance term (or its conditional counterpart for conditional generative models), we can generate samples from the target distribution with the pre-trained source generative model. Therefore, the performance of our method depends significantly on the accuracy of the estimated guidance network, particularly the estimated density ratio between the target and source distributions. When this density ratio is accurately estimated (especially when the relative magnitudes are accurately captured), our method achieves its best performance.
Thanks for the detailed response. My concerns have been properly addressed and I decide to raise my score to 7.
We thank the reviewer for the valuable comments and suggestions. We appreciate the time you spent on the paper. We summarize the positive feedback that we received as follows:
Motivation and Novelty: The whole framework is innovative and reasonable (9nph); the proposed method for transfer learning is both original and significant, and should make a good contribution to the research community (53Zj); provides a novel perspective on knowledge transfer for pre-trained diffusion models (UUwS); this paper introduces a novel training paradigm, and the proposed guidance term is a technique worthy of reference and further in-depth study by researchers (brAM).
Theoretical soundness: The paper provides a solid theoretical basis for TGDP, including proofs for the optimality of the approach (9nph); the proposed regularizers that improve efficiency are conceptually convincing (53Zj); the theoretical foundation of the proposed method is well-constructed (UUwS); the theoretical analysis of the robustness is interesting and well described, which helps to understand the proposed techniques (brAM).
Writing Quality: The paper is well written and easy to follow (53Zj); the paper is well-organized with clear presentation (UUwS); The paper is well-written and easy to understand (brAM).
Effectiveness: The experimental evaluation on the ECG data is in my opinion well executed and convincing (53Zj); Experimental results show the effectiveness of the proposed method (brAM).
Next, we address the common weaknesses and concerns raised by the reviewers.
Firstly, all reviewers suggested providing additional experimental results on various data types and real-world datasets beyond the electrocardiogram (ECG) benchmark. Therefore, we conducted a preliminary experiment using an image dataset to verify the effectiveness of the proposed method. Given the time constraints and the computational cost associated with training diffusion models on large datasets (e.g., DomainNet) or fine-tuning/evaluating conditional diffusion models on vision-language tasks, we selected the MNIST to one-channel Street View House Numbers (SVHN) transfer for preliminary verification. Our method achieves a lower Fréchet inception distance (FID) than the two baseline methods, as shown in Table 1 in the attached file. Although the digit dataset is relatively simple, we believe these experimental results still demonstrate the effectiveness of the proposed method.
Moreover, Reviewers 53Zj and UUwS point out that the ablation study on the effectiveness of the two regularization terms is not comprehensive enough. The key reason we did not provide the ablation for all combinations of terms is that we considered the consistency regularization (which is close to direct fine-tuning on target data) an intuitive solution for the transfer setting, while the guidance term and the cycle regularization represent the key contributions of our proposed method. Therefore, in the ablation study, we verified that using the intuitive term alone does not yield a sufficiently effective solution. To better illustrate the contributions of the three terms, we provide additional ablation studies in Table 1 in the attached file.
Finally, we summarize our individual responses to each reviewer below. We appreciate the questions raised by the reviewers, which have helped improve the clarity of our paper.
Reviewer 9nph: We clarified the computational advantages of training the guidance network compared to fine-tuning the diffusion model. Additionally, we discussed the scenarios in which our method performs well.
Reviewer 53Zj: We enhanced the clarity of our paper by including detailed information about the 12-lead ECG data, the experimental setup, and the preliminary theoretical insights behind choosing the hyperparameters for the regularization terms.
Reviewer UUwS: We clarified the applicability of the proposed framework, demonstrating its use for transferring text-to-image/video diffusion models and transferring knowledge across different label spaces. We also provided additional insights on Table 1 and Figure 3.
Reviewer brAM: We improved the clarity of the conditional version of our framework by providing the necessary lemma along with the proof.
We believe these revisions address the concerns raised by the reviewers and have improved the overall clarity and strength of our paper. Should you have any further questions, please do not hesitate to let us know.
Sincerely,
TGDP Authors.
Thanks to the authors and reviewers for engaging in discussion. Overall, the four reviewers leaned toward acceptance (1 accept, 2 weak accept, 1 borderline accept), with some lingering concerns about making the experimental results more thorough (reviewer 53Zj) and the discussion more thorough (reviewer UUwS). In their response, the authors provided some additional experimental results, although these appear preliminary and somewhat limited in scope. I would encourage the authors to make the experiments more thorough (now without the tight turnaround of the author/reviewer discussion period) and, as well as possible, to address reviewer UUwS's remaining concerns within the manuscript text.