Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation
Abstract
Reviews and Discussion
The authors extend traditional image domain adaptation to modality adaptation for unsupervised semantic segmentation in real-world multimodal scenarios. They use a text-to-image diffusion model with strong generalization capabilities and propose Diffusion-based Pseudo-Label Generation (DPLG) and Label Palette and Latent Regression (LPLR) to correct pseudo-labels and obtain high-resolution features. State-of-the-art performance on depth, infrared, and event modalities demonstrates the effectiveness of the method.
Strengths
- The paper is well-organized and easy to understand.
- For the first time, the authors extend the adaptation between image domains to the adaptation between modalities.
- The proposed MADM significantly outperforms existing methods on the adaptation of three different modalities.
- Figures 3 and 5 clearly demonstrate the effect of DPLG; they visualize the influence of noise injection on pseudo-label generation.
Weaknesses
- The paper lacks visualization results obtained by feeding the latent features into the VAE decoder. Such results would provide a better understanding of LPLR.
- The specific layers of the three multiscale features extracted from the UNet decoder are not clearly stated in the paper.
Questions
- Add the LPLR visualization results and analyze them.
- Details of the framework implementation need to be supplemented.
Limitations
- Add the LPLR visualization results and analyze them.
- Details of the framework implementation need to be supplemented.
To Reviewer Qg6E
Thank you for the insightful and very positive comments. In the following, we provide our point-by-point response and hope it helps address your concerns. We also look forward to further discussion to resolve any remaining issues.
Q1: Add the LPLR visualization results and analyze them.
A1: Thank you for your suggestion. We have visualized LPLR under different iteration steps in Figure 2 of the attached PDF. "Regression" and "Classification" in Figure 2 denote the output of the VAE decoder and segmentation head, respectively. Our proposed LPLR leverages the up-sampling capability of a pre-trained VAE decoder in a recycling manner. As the model converges, the regression results transform from blurry to progressively clearer states, presenting more details compared to the classification results. This assists the segmentation head in producing more accurate semantic segmentation results. We will include these results in the revision.
Q2: The specific layers of the three multiscale features extracted from the UNet decoder are not clearly stated in the paper.
A2: Thanks for your careful proofreading. We extract the multiscale features from the outputs of the 5th, 8th, and 11th blocks of the denoising UNet decoder. We will clarify this in the revision.
The authors have addressed my concerns. I keep my score.
We would like to extend our heartfelt thanks for the time and effort you have invested in reviewing our submission. Your insights have been instrumental in enhancing the quality of our work. Following your constructive feedback, we have diligently answered all your questions and presented the LPLR visualization in the rebuttal and attached pdf. We believe these changes have addressed your concerns and further strengthened our research. We understand that the reviewing process is demanding and time-consuming. Thank you once again for your dedication to the review process. We are hopeful for the opportunity to refine our work further based on your feedback.
This paper introduces text-to-image diffusion models to enhance generalization across different modalities. The proposed MADM includes two key components: Label Palette and Latent Regression (LPLR) and Diffusion-based Pseudo-Label Generation (DPLG). The method alleviates issues related to pseudo-labeling instability and low-resolution feature extraction within TIDMs. Experimental results show that MADM achieves state-of-the-art results across three different modalities.
Strengths
- The topic is interesting and valuable. By leveraging the powerful pre-trained Text-to-Image Diffusion Models, the method effectively reduces discrepancies across different modalities (e.g., image, depth, infrared, and event).
- The writing style is concise and easy to understand, and the paper is logically clear and well-organized.
Weaknesses
- The proposed modules are commonly used in generative fields, e.g., LPLR, which converts labels into the latent space, and DPLG, which utilizes features extracted from the TIDM to generate pseudo-labels; the method thus lacks novelty.
- The approach is time-consuming, and there is no complexity analysis in Table 1.
- The paper does not discuss the method's performance with different data volumes. The size of the dataset often significantly impacts performance improvement. It is recommended to test with varying data volumes (e.g., from 500 to 10,000) across different modalities to validate the enhancement and discuss the results, which would be beneficial for practical applications.
- Minor issues: It should be Table 2 in Lines 260 and 267, Page 8.
Questions
I like the idea of utilizing text-to-image diffusion models to enhance the generalization across different modalities. However, the significant time consumption and absence of experiments testing different data volumes lead me to downgrade the recommendation.
Limitations
Yes
To Reviewer 4TqZ
Thank you for the insightful and positive comments. In the following, we provide our point-by-point response and hope it helps address your concerns. We also look forward to further discussion to resolve any remaining issues.
Q1: The proposed modules lack novelty.
A1: Thanks. We address your concerns from two main perspectives.
1) Extension of TIDMs to UMA: For the first time, this work uniquely extends TIDMs to the UMA task across a broader range of visual modalities, introducing a novel perspective to this domain. While TIDMs have been used in various dense prediction tasks, their application has predominantly been limited to the RGB modality. Applying TIDMs directly to our UMA problem encounters significant issues, i.e., unstable and inaccurate pseudo-labeling and the lack of fine-grained feature extraction, as illustrated in Figure 3 and Table 2. These challenges have not been explored, despite their significance. Our work pioneers the extension of TIDMs to unsupervised semantic segmentation in other visual modalities, paving a new path for TIDMs in the UMA problem.
2) Novel Techniques of DPLG and LPLR: Our DPLG and LPLR techniques are novel, and effectively address or alleviate two severe issues that have not been explored before, yielding significant improvements over previous approaches.
LPLR addresses the lack of fine-grained features in the UMA task, which has been entirely ignored by previous approaches. Latent diffusion models pre-trained solely on continuous RGB images cannot handle discrete segmentation labels well, so previous methods [1, 2] discard the VAE decoder and use an additional classifier on features from the low-resolution latent space. However, the spatial size of this latent space is much smaller than the original input size, leading to a significant loss of the spatial details crucial for semantic segmentation. Our LPLR converts discrete labels into the RGB format, allowing them to be regressed by the VAE decoder to obtain high-resolution features and finer details. Table 3 shows that LPLR achieves a significant +1.85% average improvement across all modalities over previous methods.
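For illustration, below is a minimal PyTorch-style sketch of the label-palette round trip, using a Stable Diffusion VAE from the diffusers library. The checkpoint name, the random palette colors, and the omitted latent scaling factor are illustrative assumptions, not the exact MADM implementation.

```python
import torch
from diffusers import AutoencoderKL  # pre-trained Stable Diffusion VAE (assumed backend)

NUM_CLASSES = 19
# Hypothetical palette: one RGB color per class, scaled to [-1, 1] as the VAE expects.
palette = torch.rand(NUM_CLASSES, 3) * 2 - 1                          # (C, 3), illustrative colors only

vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-2-base", subfolder="vae").eval()    # assumed checkpoint

@torch.no_grad()
def labels_to_latent_target(label_map: torch.Tensor) -> torch.Tensor:
    """Map discrete labels (B, H, W, long) to palette RGB and encode them with the VAE encoder.
    The resulting low-resolution latent serves as a regression target (latent scaling omitted)."""
    rgb = palette[label_map].permute(0, 3, 1, 2)                      # (B, 3, H, W)
    return vae.encode(rgb).latent_dist.mean

@torch.no_grad()
def latent_to_labels(pred_latent: torch.Tensor) -> torch.Tensor:
    """Decode a predicted latent back to full-resolution RGB, then assign each pixel to the
    nearest palette color to recover a discrete, high-resolution segmentation map."""
    rgb = vae.decode(pred_latent).sample                              # (B, 3, H, W)
    dist = (rgb.unsqueeze(1) - palette.view(1, NUM_CLASSES, 3, 1, 1)).pow(2).sum(dim=2)
    return dist.argmin(dim=1)                                         # (B, H, W)
```

The key point is that the pre-trained VAE decoder, reused in a recycling manner, performs the 8× upsampling, so the segmentation head can be supervised with full-resolution detail.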
DPLG addresses the issue of noisy and unstable pseudo-label generation in previous approaches like [3]. The significant distribution gap between images and other modalities results in noisy and inaccurate vanilla pseudo-labels generated by self-training methods as shown in Figure 5 and Table 3. DPLG utilizes the unique task property by injecting a certain amount of noise into the target modality for accurate pseudo-label generation. This injection aligns the latent space more closely with the data distribution encountered during the pre-training phase, fostering more robust and accurate semantic interpretation and pseudo-label generation. Figure 5 and Table 2 demonstrate that DPLG improves vanilla pseudo-label generation by a significant +3.39% on the infrared dataset.
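For illustration, a minimal sketch of the noise-injection step behind DPLG is given below, using the DDPM scheduler from diffusers. Here `teacher_unet`, `vae`, `seg_head`, and the noise level `t_dplg` are hypothetical placeholders rather than our exact pipeline.

```python
import torch
from diffusers import DDPMScheduler

# Assumed scheduler config taken from a Stable Diffusion checkpoint.
scheduler = DDPMScheduler.from_pretrained(
    "stabilityai/stable-diffusion-2-base", subfolder="scheduler")

@torch.no_grad()
def dplg_pseudo_labels(teacher_unet, vae, seg_head, target_img, t_dplg=100):
    """Generate pseudo-labels for an unlabeled target-modality image after injecting
    a moderate amount of diffusion noise into its latent (DPLG-style, schematic)."""
    z0 = vae.encode(target_img).latent_dist.mean               # clean target-modality latent
    noise = torch.randn_like(z0)
    t = torch.full((z0.shape[0],), t_dplg, device=z0.device, dtype=torch.long)
    # Closed-form forward diffusion: z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * noise,
    # which moves the latent toward the noisy distribution seen during TIDM pre-training.
    zt = scheduler.add_noise(z0, noise, t)
    feats = teacher_unet(zt, t)                                 # schematic call: UNet features on the noised latent
    return seg_head(feats).argmax(dim=1)                        # pseudo-label map
```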
We appreciate your thoughtful comments and will include these clarifications and results in the revision.
[1] DDP: Diffusion Model for Dense Visual Prediction. In ICCV, 2023.
[2] Unleashing Text-to-Image Diffusion Models for Visual Perception. In CVPR, 2023.
[3] DAFormer: Improving Network Architectures and Training Strategies for Domain-adaptive Semantic Segmentation. In CVPR, 2022.
Q2: The approach is time-consuming.
A2: We appreciate your feedback on the computational costs of our proposed MADM. The following table presents a detailed comparison of training time per iteration, number of iterations, total training time, parameters, and performance across various methods in the event modality, including our MADM and its distilled variant.
While MADM does exhibit a higher training time per iteration, the strong visual prior derived from TIDMs requires far fewer iterations for adaptation, resulting in the lowest total training time among the compared methods. Moreover, MADM achieves a substantial performance improvement, with an MIoU of 56.31%, surpassing the other methods. Recognizing the trade-off in parameter count, we have leveraged the MADM model as a teacher to perform a secondary round of self-training. This distills the knowledge embedded in MADM into a more compact DAFormer model, MADM (Distilled), which retains a high MIoU of 54.03% while reducing the parameter count to 85M and adding only 1.3 hours of training time. A minimal sketch of this distillation step is provided after the table below.
Our distilled model demonstrates that it is possible to maintain high performance with reduced computational costs, addressing the concerns raised regarding the parameters and efficiency of MADM.
| Method | Training time/Iter. (seconds) | Iteration | Total training time (hours) | Params (million) | MIoU |
|---|---|---|---|---|---|
| DAFormer | 0.36 | 40k | 4.0 | 85 | 33.55 |
| PiPa | 1.12 | 60k | 18.7 | 85 | 43.28 |
| MIC | 0.48 | 40k | 5.3 | 85 | 46.13 |
| Rein | 1.25 | 40k | 13.9 | 328 | 51.86 |
| MADM | 1.38 | 10k | 3.8 | 949 | 56.31 |
| MADM (Distilled) | 0.46 | 10k | 1.3 | 85 | 54.03 |
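For completeness, the secondary self-training used to obtain MADM (Distilled) can be sketched roughly as below; `madm_teacher`, `daformer_student`, and the 0.9 confidence threshold are illustrative placeholders rather than our exact training recipe.

```python
import torch
import torch.nn.functional as F

def distill_step(madm_teacher, daformer_student, optimizer, target_img, ignore_index=255):
    """One illustrative distillation step: the frozen MADM teacher provides pseudo-labels
    on target-modality images, and the compact DAFormer student is trained on them."""
    with torch.no_grad():
        conf, pseudo = madm_teacher(target_img).softmax(dim=1).max(dim=1)
        pseudo[conf < 0.9] = ignore_index            # drop low-confidence pixels (assumed threshold)
    loss = F.cross_entropy(daformer_student(target_img), pseudo, ignore_index=ignore_index)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```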
Q3: The paper does not discuss the performance with different data volumes. It is recommended to test with varying data volumes to validate the enhancement and discuss the results.
A3: Thanks. Per your suggestion, we train our method with 10%, 25%, and 50% of the total target samples in the event modality. Here, the "Baseline-100%" column indicates the performance of the MADM model without DPLG and LPLR, trained on all target samples. The results in the following table indicate that our proposed MADM consistently outperforms the baseline across all tested data volumes. Moreover, MADM remains robust and effective even when the dataset size is relatively small. We will include these results in the revision.
| Method | Baseline-100% | MADM-10% | MADM-25% | MADM-50% | MADM-100% |
|---|---|---|---|---|---|
| MIoU | 52.27 | 53.21 | 53.69 | 54.55 | 56.31 |
Q4: Minor issues: It should be Table 2 in Lines 260 and 267, Page 8.
A4: Thanks for your careful proofreading. We will fix this typo in the revision.
All my concerns have been addressed. I will raise my score to 6.
We would like to extend our heartfelt thanks for the time and effort you have invested in reviewing our submission. Your insights have been instrumental in enhancing the quality of our work. Following your constructive feedback, we have diligently answered all your questions and conducted more convincing experiments (data volumes comparison) in the rebuttal. We believe these changes have addressed your concerns and further strengthened our research. We understand that the reviewing process is demanding and time-consuming. Thank you once again for your dedication to the review process. We are hopeful for the opportunity to refine our work further based on your feedback.
This paper proposes Modality Adaptation with text-to-image Diffusion Models (MADM). MADM leverages pre-trained text-to-image diffusion models to enhance cross-modality capabilities, comprising two main components: diffusion-based pseudo-label generation to improve label accuracy and a label palette with latent regression to ensure fine-grained features. Experimental results show MADM achieves SOTA performance across various modalities.
Strengths
- Overall, the paper is well-written.
- Label Palette and Latent Regression are well-designed.
- The ablation studies show the effectiveness of the method.
Weaknesses
- The use of text-to-image Diffusion Models introduces higher training and inference costs, as well as an increase in model parameters. The authors need to discuss the fairness of these costs compared to other methods.
- Overall, using text-to-image Diffusion Models for semantic segmentation is not novel, as many works have applied diffusion models for dense prediction tasks, and pseudo-label generation is also commonly used.
Questions
N/A
Limitations
Yes
To Reviewer uNLv
Thank you for the insightful and positive comments. In the following, we provide our point-by-point response and hope it helps address your concerns. We also look forward to further discussion to resolve any remaining issues.
Q1: Discuss the fairness of the costs compared to other methods.
A1: We appreciate your feedback on the computational costs of our proposed MADM. The following table presents a detailed comparison of training time per iteration, number of iterations, total training time, parameters, and performance across various methods in the event modality, including our MADM and its distilled variant.
While MADM does exhibit a higher training time per iteration, the strong visual prior derived from TIDMs requires far fewer iterations for adaptation, resulting in the lowest total training time among the compared methods. Moreover, MADM achieves a substantial performance improvement, with an MIoU of 56.31%, surpassing the other methods. Recognizing the trade-off in parameter count, we have leveraged the MADM model as a teacher to perform a secondary round of self-training. This distills the knowledge embedded in MADM into a more compact DAFormer model, MADM (Distilled), which retains a high MIoU of 54.03% while reducing the parameter count to 85M and adding only 1.3 hours of training time.
Our distilled model demonstrates that it is possible to maintain high performance with reduced computational costs, addressing the concerns raised regarding the parameters and efficiency of MADM.
| Method | Training time/Iter. (seconds) | Iteration | Total training time (hours) | Params (million) | MIoU |
|---|---|---|---|---|---|
| DAFormer | 0.36 | 40k | 4.0 | 85 | 33.55 |
| PiPa | 1.12 | 60k | 18.7 | 85 | 43.28 |
| MIC | 0.48 | 40k | 5.3 | 85 | 46.13 |
| Rein | 1.25 | 40k | 13.9 | 328 | 51.86 |
| MADM | 1.38 | 10k | 3.8 | 949 | 56.31 |
| MADM (Distilled) | 0.46 | 10k | 1.3 | 85 | 54.03 |
Q2: Using text-to-image Diffusion Models for semantic segmentation is not novel, as many works have applied diffusion models for dense prediction tasks, and pseudo-label generation is also commonly used.
A2: Thanks. We address your concerns from two main perspectives.
1) Extension of TIDMs to UMA: For the first time, this work uniquely extends TIDMs to the UMA task across a broader range of visual modalities, introducing a novel perspective to this domain. While TIDMs have been used in various dense prediction tasks, their application has predominantly been limited to the RGB modality. Applying TIDMs directly to our UMA problem encounters significant issues, i.e., unstable and inaccurate pseudo-labeling and the lack of fine-grained feature extraction, as illustrated in Figure 3 and Table 2. These challenges have not been explored, despite their significance. Our work pioneers the extension of TIDMs to unsupervised semantic segmentation in other visual modalities, paving a new path for TIDMs in the UMA problem.
2) Novel Techniques of DPLG and LPLR: Our DPLG and LPLR techniques are novel, and effectively address or alleviate two severe issues that have not been explored before, yielding significant improvements over previous approaches.
LPLR addresses the lack of fine-grained features in the UMA task, which has been entirely ignored by previous approaches. Latent diffusion models pre-trained solely on continuous RGB images cannot handle discrete segmentation labels well, so previous methods [1, 2] discard the VAE decoder and use an additional classifier on features from the low-resolution latent space. However, the spatial size of this latent space is much smaller than the original input size, leading to a significant loss of the spatial details crucial for semantic segmentation. Our LPLR converts discrete labels into the RGB format, allowing them to be regressed by the VAE decoder to obtain high-resolution features and finer details. Table 3 shows that LPLR achieves a significant +1.85% average improvement across all modalities over previous methods.
DPLG addresses the issue of noisy and unstable pseudo-label generation in previous approaches like [3]. The significant distribution gap between images and other modalities results in noisy and inaccurate vanilla pseudo-labels generated by self-training methods as shown in Figure 5 and Table 3. DPLG utilizes the unique task property by injecting a certain amount of noise into the target modality for accurate pseudo-label generation. This injection aligns the latent space more closely with the data distribution encountered during the pre-training phase, fostering more robust and accurate semantic interpretation and pseudo-label generation. Figure 5 and Table 2 demonstrate that DPLG improves vanilla pseudo-label generation by a significant +3.39% on the infrared dataset.
We appreciate your thoughtful comments and will include these clarifications and results in the revision.
[1] Yuanfeng Ji, Zhe Chen, Enze Xie, Lanqing Hong, Xihui Liu, Zhaoqiang Liu, Tong Lu, Zhenguo Li, Ping Luo. DDP: Diffusion Model for Dense Visual Prediction. In ICCV, 2023.
[2] Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, Jiwen Lu. Unleashing Text-to-Image Diffusion Models for Visual Perception. In CVPR, 2023.
[3] Lukas Hoyer, Dengxin Dai, Luc Van Gool. DAFormer: Improving Network Architectures and Training Strategies for Domain-adaptive Semantic Segmentation. In CVPR, 2022.
We would like to extend our heartfelt thanks for the time and effort you have invested in reviewing our submission. Your insights have been instrumental in enhancing the quality of our work. Following your constructive feedback, we have diligently answered all your questions and conducted more convincing experiments (complexity analysis) in the rebuttal. We believe these changes have addressed your concerns and further strengthened our research. We understand that the reviewing process is demanding and time-consuming. Thank you once again for your dedication to the review process. We are hopeful for the opportunity to refine our work further based on your feedback.
This paper proposes an interesting task: adapting image segmentation knowledge to other input modalities, such as depth, infrared, and event. This is beneficial for nighttime applications.
Strengths
- The task is promising for applications at nighttime.
- The label palette is novel and can be generalized to more tasks.
Weaknesses
- The presentation needs to be improved.
  - The meaning of "Unsupervised Modality Adaptation" is unclear from the introduction section.
  - Although the authors try to illustrate why they use a pretrained Text2Image Diffusion Model (TIDMs), and why they propose DPLG and LPLR, some motivations are not well supported by experiments or other accepted papers. For example, in Lines 45-47, "Although TIDMs are not trained on other visual modalities, their large-scale samples and unification through texts enable them to adapt to a broader distribution of domains." What is the meaning of "a broader distribution of domains"? Which prior is provided by TIDMs that results in this convenience?
- Lacking discussions on the motivation of critical components. For example, see the third question below.
- Experiments do not well support the potential applications (see question 4).
Typos:
- Line 134, Sec.1 --> Fig. 1
Questions
- Since the training dataset of TIDMs mainly consists of RGB images, why can TIDMs robustly extract features across modalities?
- What's the motivation behind the single-step diffusion operation (Line 146)?
- Why does injecting noise into the latent code produce more accurate pseudo labels?
- As stated in Line 28, taking depth or other modalities is valuable in nighttime perception. But there are no experiments showing this. For example, given a nighttime dataset, comparing the performance with the input of RGB images or depth images (adapted by MADM).
Limitations
The limitations are briefly stated in Sec. 5.
To Reviewer x1By
Thank you for the insightful and positive comments. In the following, we provide our point-by-point response and hope it helps address your concerns. We also look forward to further discussion to resolve any remaining issues.
Q1: The meaning of "Unsupervised Modality Adaptation" is unclear from the introduction section.
A1: Thanks. We apologize for any confusion caused by the unclear explanation. To clarify, "Unsupervised Modality Adaptation" refers to the adaptation of a model from a labeled source image modality to an unlabeled target modality, such as depth, infrared, or event data. This concept was initially explained in lines 132-134 of the manuscript. To further address your concerns, we will provide a more detailed explanation when we first introduce this concept in the revision.
Q2: Some motivations are not well supported by experiments or other accepted papers. What is the meaning of "a broader distribution of domains"? Which prior is provided by TIDMs that results in this convenience?
A2: Thanks. The phrase "a broader distribution of domains" refers to the model’s adaptability to different visual modalities beyond its primary design of generating RGB images from text. The prior we refer to is derived from the large-scale pretraining data on which TIDMs are trained. This extensive pretraining provides TIDMs with a robust understanding of high-level visual concepts, enabling their application to various domains, such as semantic matching [1], depth estimation [2], and 3D awareness [3]. As demonstrated in Table 2 of our manuscript, our baseline method performs comparably with SoTA techniques, highlighting the potential of leveraging this advanced visual prior for the Unsupervised Modality Adaptation (UMA) problem.
However, two significant challenges remain. First, significant modality discrepancies hinder robust and high-quality pseudo-label generation during self-training. Second, TIDMs extract features in an 8× downsampled latent space, which limits the acquisition of high-resolution features. To address these issues, we propose LPLR and DPLG, which are designed to mitigate the identified problems, as detailed in our manuscript.
In the revised version, we will include the above discussion to enhance the motivations.
[1] SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching. In CVPR, 2024.
[2] Repurposing Diffusion-based Image Generators for Monocular Depth Estimation. In CVPR, 2024.
[3] Probing the 3D Awareness of Visual Foundation models. In CVPR, 2024.
Q3: Why can TIDMs robustly extract features across modalities?
A3: Thanks. Through extensive pre-training on diverse and large-scale datasets, TIDMs have demonstrated a remarkable capacity to generate various objects with distinct attributes, such as cars of different types and colors. While TIDMs are not explicitly trained on other modalities, they can capture the fundamental characteristics of objects, such as shape and spatial orientation (e.g., a car's position on the road). These characteristics remain consistent across different modalities, enabling TIDMs to achieve high-level visual intelligence. As a result, TIDMs can identify and extract essential characteristics of objects even when encountering modalities that were not part of their pre-training data. This ability to generalize across modalities showcases the strength and versatility of TIDMs on feature extraction in various tasks.
Q4: What's the motivation behind the single-step diffusion operation?
A4: Thanks. As evidenced by the experimental results in [1], a single diffusion step effectively removes noise for dense visual prediction. Additionally, single-step diffusion significantly reduces inference costs compared to multi-step diffusion. Considering these factors, we adopt the single-step diffusion operation, following the approach in [2, 3].
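For illustration, a minimal sketch of such single-step feature extraction is given below. The checkpoint name, the frozen setting, and the coarse hook granularity (whole up-blocks rather than the specific 5th/8th/11th decoder blocks mentioned in our response to Reviewer Qg6E) are assumptions for this sketch rather than our exact implementation.

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel

MODEL = "stabilityai/stable-diffusion-2-base"                       # assumed backbone checkpoint
vae = AutoencoderKL.from_pretrained(MODEL, subfolder="vae").eval()
unet = UNet2DConditionModel.from_pretrained(MODEL, subfolder="unet").eval()

features = {}
def _cache(idx):
    def hook(module, inputs, output):
        features[idx] = output                                      # keep each decoder block's output
    return hook

for i, block in enumerate(unet.up_blocks):                          # coarse-grained taps, for illustration only
    block.register_forward_hook(_cache(i))

@torch.no_grad()
def extract_multiscale_features(image, text_emb):
    """One UNet pass at t = 0 (no noise added), in the spirit of the single-step setting above."""
    z = vae.encode(image).latent_dist.mean
    t = torch.zeros(z.shape[0], dtype=torch.long)
    unet(z, t, encoder_hidden_states=text_emb)                      # output itself is unused; we read the hooks
    return [features[i] for i in sorted(features)]
```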
[1] DDP: Diffusion Model for Dense Visual Prediction. In ICCV, 2023.
[2] Unleashing Text-to-Image Diffusion Models for Visual Perception. In CVPR, 2023.
[3] Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models. In CVPR, 2023.
Q5: Why does injecting noise into the latent code produce more accurate pseudo labels?
A5: In the pre-training of TIDMs, the objective is to estimate noise from latent inputs containing various noise levels. By injecting noise into the latent code, we effectively simulate this noisy distribution. This simulation aligns the latent space more closely with the data distribution encountered during the pre-training phase. Such alignment fosters a more robust and accurate semantic interpretation, which in turn enhances the quality of the generated pseudo-labels. This shares a similar spirit with other applications of diffusion models, such as text-to-3D [1], where injecting extra noise into the data can improve denoising quality and yield better pseudo-labels.
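For concreteness, the forward-noising relation underlying this argument can be written in standard DDPM notation (our restatement here, with $\bar{\alpha}_t$ the cumulative noise schedule, not a formula from the paper):

$$
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),
$$

so choosing a moderate $t$ moves the clean target-modality latent $z_0$ toward the noisy distribution that the UNet was pre-trained to denoise.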
[1] DreamFusion: Text-to-3D using 2D Diffusion. In ICLR, 2023.
Q6: Given a nighttime dataset, comparing the performance with the input of RGB images or depth images (adapted by MADM).
A6: The FMB-Infrared dataset includes both the image and infrared modalities in daytime and nighttime scenes. Using our proposed MADM, we adapt from Cityscapes daytime RGB images to the nighttime image modality and the nighttime infrared modality, respectively. The following table and Figure 1 in the attached PDF show that the infrared modality has a clear advantage in the "Person" class, owing to obvious thermal differences and good suppression of light interference. We will include these results in the revision.
| Modality | Sky | Building | Person | Pole | Road | Sidewalk | Vegetation | Vehicle | Traffic Sign | MIoU (avg) |
|---|---|---|---|---|---|---|---|---|---|---|
| RGB | 88.85 | 68.14 | 64.79 | 25.80 | 89.09 | 32.43 | 70.32 | 84.13 | 7.27 | 58.98 |
| Infrared | 87.94 | 82.40 | 82.69 | 21.50 | 76.21 | 26.50 | 76.61 | 83.80 | 16.69 | 61.59 |
We would like to extend our heartfelt thanks for the time and effort you have invested in reviewing our submission. Your insights have been instrumental in enhancing the quality of our work. Following your constructive feedback, we have diligently answered all your questions and conducted more convincing experiments (nighttime comparison) in the rebuttal and attached pdf. We believe these changes have addressed your concerns and further strengthened our research. We understand that the reviewing process is demanding and time-consuming. Thank you once again for your dedication to the review process. We are hopeful for the opportunity to refine our work further based on your feedback.
To further address your concerns, we will provide a more detailed explanation when we first introduce this concept in the revision.
UMASS is the task of this paper, yet it is first introduced in the Method section. I think this is unfriendly to most readers in the Computer Vision community.
This extensive pretraining provides TIDMs with a robust understanding of high-level visual concepts, enabling their application to various domains, such as semantic matching [1], depth estimation [2], and 3D awareness [3].
I briefly read the mentioned references:
- SD4Match [1] takes an image as the input.
- Marigold [2] only uses the VAE encoder to compress the depth image; the UNet is fine-tuned.
- In [3], the authors demonstrate that DINOv2 yields better features than Stable Diffusion. See their introduction:
We find that recent self-supervised models such as DINOv2 [60] learn representations that encode depth and surface normals, with StableDiffusion [69] being a close second.
Therefore, the question remains unresolved: when the input is depth or another visual modality rather than RGB, why can the UNet of Stable Diffusion recognize it and provide correct features?
While TIDMs are not explicitly trained on other modalities, they can capture the fundamental characteristics of objects, such as shape and spatial orientation (e.g., a car's position on the road).
Can you show some evidence? How good are the features Stable Diffusion learns for modalities it has never seen before?
Considering these factors, we adopt the single-step diffusion operation, following the approach in [2, 3].
I briefly read VPD [2]. I couldn't find any mention of the use of one-step diffusion in the manuscript.
such as text-to-3D [1], where injecting extra noise into the data can improve denoising quality and yield better pseudo-labels.
DreamFusion [1] obtains pseudo-labels with supervision from Stable Diffusion, not by injecting extra noise.
I must remind: the author should answer the reviewer's questions carefully and provide the correct references.
To Reviewer x1By
We are truly grateful for your continued engagement and valuable feedback on our submission. Your willingness to communicate with us further is greatly appreciated and provides us with an opportunity to clarify any misunderstandings and enhance our research even more.
Q1: UMASS is the task of this paper, yet it is first introduced in the Method section. I think this is unfriendly to most readers in the Computer Vision community.
A1: We apologize for the oversight in the initial presentation of the UMASS concept. We will revise the manuscript to introduce this key concept earlier in the paper to ensure clarity and accessibility for our readers.
Q2: Therefore, the question remains unresolved: when the input is depth or another visual modality rather than RGB, why can the UNet of Stable Diffusion recognize it and provide correct features? Can you show some evidence? How good are the features Stable Diffusion learns for modalities it has never seen before?
A2: (1) TIDMs trained on large-scale data can learn very general high-level semantics and concepts, allowing them to generate images in unseen scenarios by combining different semantics/concepts, as demonstrated in many works (e.g., DALL-E, Parti, SD). For instance, SD3 [1] can successfully generate "a hybrid creature that is a mix of a waffle and a hippopotamus", as shown on page 14. Although such specific scenes have never been encountered during training, TIDMs can understand and disentangle the high-level semantics, such as those of a waffle and a hippopotamus, and creatively imagine their combination. This enables them to seamlessly merge these elements into a coherent and realistic image. In our case, while TIDMs may not have seen data from other modalities, such as depth images, they can still grasp the high-level semantics shared across modalities, like object shapes in RGB and depth images, and may also interpret the semantic combinations from different modalities.
(2) Importantly, similar to the approach used in Marigold, we fine-tune the UNet for adaptation instead of directly extracting features from frozen TIDMs. In our work, adaptation to other visual modalities is achieved through self-training. This fine-tuning further enhances the TIDMs' ability to understand high-level semantics and establish their combinations across different modalities. Similar evidence can be seen in ControlNet [2]: TIDMs can adapt to different modalities by integrating and fine-tuning an additional ControlNet initialized with the same parameters as the TIDM. Specifically, TIDMs can generate corresponding RGB images when text and another modality are taken as conditions, where the reference modality inputs can be sketches, normal maps, depth maps, human poses, etc., none of which the TIDMs have seen during pre-training. For example, the third column of Fig. 7 in [2] demonstrates that, after fine-tuning with depth data, the TIDM understands the high-level semantics of the depth modality and successfully generates images that satisfy the depth conditions.
[1] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. In ICML, 2024.
[2] Adding Conditional Control to Text-to-Image Diffusion Models. In ICCV, 2023.
Q3: I briefly read VPD [2]. I couldn't find any mention of the use of one-step diffusion in the manuscript.
A3: In Section 3.2 of VPD, the authors state, "Note that we simply set t = 0 such that no noise is added to the latent feature map," and further clarify, "It is also worth noting that our method is not a diffusion-based framework anymore, because we only use a single UNet as a backbone (see Figure 1 to better understand the differences)." Also, `outs = self.unet(latents, t, c_crossattn=[c_crossattn])` in line 102 of https://github.com/wl-zhao/VPD/blob/main/depth/models_depth/model.py indicates the use of one-step diffusion in VPD.
Q4: DreamFusion [1] obtains pseudo-labels with supervision from Stable Diffusion, not by injecting extra noise.
A4: Since TIDMs cannot generate images that are spatially consistent across viewpoints, and thus cannot directly optimize the NeRF model, DreamFusion leverages the generative capabilities of TIDMs for a different purpose: to supervise the training of the NeRF model through its diffusion process, not through its final output.
In DreamFusion, Gaussian noise is added to the 2D images rendered by the NeRF, and the pre-trained TIDM is asked to predict that noise. Successful prediction, where the predicted noise matches the injected noise, signifies that the NeRF model has internalized the statistical priors of the TIDM.
Our DPLG shares a conceptual alignment with this approach. TIDMs are essentially trained to predict noise from noise-injected inputs, so adding noise to the inputs in our proposed DPLG enables TIDMs to provide more robust prior information, thereby making better use of the pre-trained knowledge.
We would like to express our sincere gratitude to all of the four reviewers for the valuable and constructive feedback provided on our manuscript. Your insights have been instrumental in enhancing the quality and clarity of our work.
In the attached PDF, we include additional visualizations as requested by Reviewers x1By and Qg6E, which we believe will clarify our experimental findings.
Thank you for your valuable time and insights. We are open to further discussion and look forward to your continued guidance.
Dear authors,
This draft has received 1 Accept, 1 Weak Accept, and 2 Borderline ratings. One of the reviewers has expressed concerns. After carefully reviewing the comments, the draft is being accepted for publication. The authors are encouraged to update the draft according to the suggestions and concerns raised by the reviewers.
For example,
- Regarding the concern raised by x1By, the claim "Although TIDMs are not trained on other visual modalities, their large-scale samples and unification through texts enable them to adapt to a broader distribution of domains" needs proper citation or experimental verification. Reviewer x1By further noted that SD4Match (and the other works) referenced in the rebuttal do not properly answer the question of why a TIDM never trained on other modalities (e.g., depth) can still adapt to a broader distribution of domains. We encourage the authors to answer this question properly in the final draft.
- Qg6E's Q1 should at least be addressed in the supplementary material, and Q2 should be incorporated into the main draft.
- Improve the draft to clarify novelty (4TqZ) and concerns raised by other reviewers.
- The authors have indicated that the code will be released; we take this to mean that both the training and evaluation code will be released (along with the experimental settings used in the paper) when the paper is accepted. This is vital for the community to reproduce the results and make comparisons.
Congratulations.
Regards,
AC