Building a Special Representation for the Chinese Ancient Buildings in Diffusion models.
Abstract
Reviews and Discussion
This paper proposes to use diffusion models to generate images of Chinese ancient buildings. The authors adopt pinyin and LoRA to fine-tune the text encoder of the diffusion model. The experimental results show that the pinyin expression is better than the English expression and that LoRA is better than prompt tuning.
Strengths
The experimental results show that pinyin and LoRA are effective.
Weaknesses
- Lack of novelty: LoRA is an existing approach.
- Lack of contribution: the use of pinyin is trivial.
Questions
What is the technical novelty of this paper?
This paper aims to build a representation for Chinese ancient buildings with diffusion models. The authors first collect pictures of the buildings to form a dataset. They then develop the representation by using Pinyin sequences as the prompt input, with fine-tuning and LoRA strategies applied to the diffusion model. Finally, they design experiments to demonstrate strong performance on Chinese ancient building generation with the learned prompt. This could benefit specific areas such as Chinese ancient-style building generation and downstream tasks in this community, if any.
Strengths
The proposed benchmark can benefit specific areas such as Chinese ancient-style building generation and downstream tasks in this community, if any.
Weaknesses
- The benchmark is novel and took great effort, but the learned representation and the prompt design are relatively simple. Although the diffusion model is not trained on such specific data, the pre-trained diffusion model can still generate realistic Chinese ancient buildings (with detailed caption keywords like ‘realistic’). The paper should provide such detailed comparisons.
- The paper lacks qualitative and quantitative comparisons using well-defined metrics such as FID or CLIPIQA to validate the necessity of this benchmark and of the special representation.
- Why the Pinyin sequence? Is there a more effective prompt? The authors should provide a more detailed explanation.
Questions
Why the Pinyin sequence? Is there a more effective prompt? The authors should provide a more detailed explanation.
In this paper, the authors propose to generate Chinese ancient buildings with diffusion models, which is a very interesting topic. The authors take pre-trained diffusion models and fine-tune them for this specific area using LoRA. In addition, the authors propose to use Pinyin as prompts for this specific topic. Furthermore, the authors collect a dataset of Chinese ancient buildings with about 1200 high-resolution images, which will benefit the community. The visual comparisons in the experimental results show that the proposed generation methods perform well.
Strengths
I see several strengths in this paper: (1) The topic is meaningful and interesting. I appreciate the authors' work on this topic, including the attempt to apply novel technology (e.g., diffusion and LoRA) and the experimental results. (2) The collected dataset seems useful for the related community. (3) The authors introduce pinyin as prompts for diffusion models, which is promising.
Weaknesses
(1) I think the big issue with this paper is the writing; I even suspect that some places are incomplete. For example, what do the symbols ",,," and "..." mean in the first paragraph? There are also many typos and grammatical mistakes. For example, the sentence “It si similar to the CLIP-diffusion model” should presumably be “It is similar to the CLIP-diffusion model”. I suggest that the authors proofread the paper carefully to avoid these.
(2) The authors should check the format and completeness of the references. For example, it is incorrect to cite a reference as Shen et al. (2023); it should be (Shen et al., 2023). In addition, references are missing in some places, such as the sentence “Big models in both of CV and NLP area, like ChatGPT and Stable Diffusion” in the INTRODUCTION.
(3) Figures should be cited as Fig. 5 or Fig. 6 rather than by a bare index. For example, “the visualized results5”, “in different perspectives in 5” and “by SD in 6” in Section 4.3.
(4) The experimental results contain only visualizations. The authors should add more quantitative comparisons.
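As one illustration of the kind of quantitative comparison suggested in (4), a Fréchet-distance score of the sort that underlies FID could be computed roughly as follows. This is a minimal sketch in Python with NumPy, not the authors' evaluation protocol: the function name `frechet_distance` and the random arrays standing in for image features are assumptions made for illustration. A real FID evaluation would use Inception-v3 features extracted from real and generated images.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_samples, feature_dim), with feature_dim > 1.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b. For positive semi-definite inputs these
    # eigenvalues are real and non-negative up to numerical noise, so the
    # real parts are clipped at zero before taking the square root.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = float(np.sum((mu_a - mu_b) ** 2))
    cov_term = float(np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
    return mean_term + cov_term


# Hypothetical stand-ins for features of real and generated images.
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 8))
fake_feats = rng.normal(loc=0.5, size=(500, 8))

same = frechet_distance(real_feats, real_feats)
shifted = frechet_distance(real_feats, fake_feats)
```

A lower value means the two feature distributions are closer. In this toy setup the shifted set scores well above the identical set, which is the behaviour such a metric would need in order to separate, for example, pinyin-prompted from English-prompted generations.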
Questions
Using pinyin sounds reasonable, and the visualization results show its effectiveness. However, what about other prompts?
This paper introduces a novel application of diffusion models for generating images depicting Chinese ancient buildings, incorporating pinyin and LoRA for fine-tuning the text encoder within the diffusion model framework. The experimental findings assert the superiority of pinyin over English expression and highlight the efficacy of LoRA compared to prompt tuning.
However, a consensus emerges among reviewers regarding the paper's perceived lack of novelty. The utilization of LoRA is identified as an existing approach, diminishing the technical contribution of the proposed methodology. While some reviewers acknowledge the authors' commendable efforts in dataset creation and the introduction of pinyin prompts, the overall sentiment is that the paper falls short of providing a significant advancement in the field. Despite reviewers' appreciation for certain aspects, the absence of a rebuttal from the authors leaves the identified concerns unaddressed, contributing to the unanimous negative review scores.
Why not a higher score
The authors did not provide a rebuttal, and none of the final review scores were positive.
Why not a lower score
N/A
Reject