Multi-Image Zero-Shot Subject Generation for Visual Storytelling
Abstract
Reviews and Discussion
- The paper formulates the task of generating images for a short story. The challenge is to generate a set of images that are consistent: a) with their respective captions, and b) with respect to the subject across the generated set. While the first condition can be satisfied by existing methods such as Stable Diffusion, the authors build on it to propose a solution that satisfies both conditions.
- The proposed solution cleverly uses EDICT-based diffusion inversion to optimize the initial noises. For both conditions, the authors propose losses, which need to be balanced (a rough sketch of this loop follows the summary bullets).
- The authors also create a dataset to evaluate their method.
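For concreteness, here is a minimal sketch of the noise-optimization loop as I understand it from the summary above. The callables `generate`, `consistency_loss`, and `adherence_loss`, and the weight `lam`, are illustrative placeholders rather than the paper's actual components.

```python
import torch

def optimize_initial_noises(noises, captions, generate, consistency_loss,
                            adherence_loss, steps=20, lam=1.0, lr=0.1):
    """Optimize each image's initial noise so the decoded images share a
    consistent subject while staying faithful to their captions.

    `generate(z, caption)` decodes a noise with the frozen diffusion model,
    `consistency_loss(images)` scores cross-image subject consistency, and
    `adherence_loss(images, captions)` scores per-image text adherence.
    All three are placeholders, not the paper's API."""
    noises = [z.clone().requires_grad_(True) for z in noises]
    opt = torch.optim.Adam(noises, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        images = [generate(z, c) for z, c in zip(noises, captions)]
        # Two losses, one per condition, balanced by `lam`.
        loss = consistency_loss(images) + lam * adherence_loss(images, captions)
        # In the paper, EDICT's exact invertibility is what keeps the memory
        # cost of backpropagating through the sampler constant.
        loss.backward()
        opt.step()
    return [z.detach() for z in noises]
```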
Strengths
- The problem setting is novel and useful for storytelling. A slight modification, using GPT to insert captioned images into existing stories, could enable this method to create consistent images for any given story.
- Cleverly employed EDICT for noise optimization with constant memory cost (like GAN inversion) to obtain consistent images.
- Proposed method is simple and intuitive.
Weaknesses
- The evaluation baselines could be stronger. For example, one possible solution is to learn few-shot generation for a concept (e.g., a monster) by scraping images of the same monster from the internet and then using them together with the captions. Although this incurs the overhead of scraping images, a few qualitative comparisons would be good, along with a quantitative evaluation if possible.
- The benchmark dataset needs to be evaluated more thoroughly for the coherence and formatting of the stories. The authors could explain more about the dataset curation process, in particular whether it was necessary to exclude any generated examples and why.
Questions
- How does the method scale with the number of images generated? A histogram of the results in Tables 1 and 2, broken down by the number of images, could help analyse this.
The paper presents a method for generating multiple images with a consistent subject. The method is similar to textual inversion in that it learns a textual embedding via gradient descent. However, the learning objective is different: it minimizes the distance between the CLIP image features of the generated images and their mean CLIP image feature. In addition, to retain text adherence, the authors propose adding CLIP image-text similarity as an objective. The paper also provides a benchmark of 111 narrated stories generated by GPT-4 for evaluation. In experiments, the paper compares against several baselines, including Stable Diffusion without any consistency regularization and ELITE, a one-shot subject-personalization method for text-to-image models, conditioned on the image of the first frame. Experimental results show that the proposed method outperforms the baselines in terms of CLIP image-image and image-text consistency, as well as in human evaluation.
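For reference, a rough sketch of how such an objective could be computed. It assumes an openai/CLIP-style model exposing `encode_image`/`encode_text`; the exact distance, normalization, and the weight `lam` are my assumptions rather than the paper's Equation (6).

```python
import torch
import torch.nn.functional as F

def consistency_objective(images, text_tokens, clip_model, lam=1.0):
    """Sketch of the objective described above: pull each generated image's
    CLIP feature toward the mean image feature while keeping image-text
    similarity high. `images` are preprocessed image tensors, `text_tokens`
    are the tokenized captions, and `lam` is an assumed balancing weight."""
    img_feats = F.normalize(clip_model.encode_image(images), dim=-1)      # (N, D)
    txt_feats = F.normalize(clip_model.encode_text(text_tokens), dim=-1)  # (N, D)

    # Subject consistency: distance of each image feature to the normalized mean feature.
    mean_feat = F.normalize(img_feats.mean(dim=0), dim=-1)
    consistency = (1.0 - img_feats @ mean_feat).mean()

    # Text adherence: negative CLIP image-text similarity for each caption.
    adherence = -(img_feats * txt_feats).sum(dim=-1).mean()

    return consistency + lam * adherence
```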
Strengths
- While many papers have focused on "personalization", i.e., generating consistent subject images given a set of subject reference images, this paper tackles a different problem: generating consistent subjects without subject reference images. This is an interesting, and potentially important, problem with diverse applications, including storyboard generation.
- The proposed method sounds reasonable, though it is a straightforward extension of EDICT with mean CLIP image vectors as a similarity target.
- The proposed method outperforms some baselines, though the win-rate margin appears small (e.g., 32:32:36).
Weaknesses
- While the proposed objective (Equation (6)) makes sense, it is an incremental extension of EDICT.
- Lack of an ablation study:
  - Which value of the balancing hyperparameter is used in the experiments, and how is it validated?
  - How does the generation quality change with respect to this hyperparameter?
  - Why is the normalized mean CLIP feature used when the individual CLIP features are not normalized in Equation (6)? (A small toy example of this point follows the weaknesses list.)
- While the proposed method outperforms the baselines, the improvement is rather marginal.
  - The qualitative results shown in the paper (Figure 4, Figure 5) are far from good (e.g., the aliens may be more coherent than in the baselines, but they are nowhere close to being consistent; the same goes for the monsters in Figure 5).
  - Details of the human evaluation are not provided. For example, how many answers are collected for the same example? What is the variance? How are ties handled?
  - Missing baselines: similarly to the process done in ELITE, the authors should include another baseline based on DreamBooth or DreamBooth-LoRA that fine-tunes the model on the first-frame image.
- The paper lacks an analysis of the dataset. Why is this a good dataset? For example, is the dataset reflective of realistic visual storytelling? How diverse are the stories? While there may be no standard metrics to answer these questions, these are points the authors should have thought about and addressed.
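To illustrate the normalization question in the ablation item above, here is a tiny toy example; it is my own reconstruction from the question, not the paper's Equation (6). Mixing unnormalized per-image features with a normalized mean makes the distance scale-dependent, whereas normalizing both sides removes that dependence.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
feats = torch.randn(4, 512) * 10.0                  # stand-in for unnormalized CLIP image features
mean_norm = F.normalize(feats.mean(dim=0), dim=-1)  # normalized mean feature, as in the question

# Unnormalized features against a normalized mean: the gap scales with the feature norm.
mixed_gap = (feats - mean_norm).norm(dim=-1)
# Normalizing both sides removes the scale dependence.
matched_gap = (F.normalize(feats, dim=-1) - mean_norm).norm(dim=-1)
print(mixed_gap)
print(matched_gap)
```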
Questions
Please see the weaknesses section.
The paper introduces a novel task called Multi-Image Zero-Shot Subject Generation (MZSG) aimed at generating a sequence of images that share a consistent subject based on a set of related textual captions. The authors propose an initial method, VisuTeller, to tackle this task. VisuTeller iteratively optimizes image sets to maintain visual content uniformity while adhering to the text conditions. The paper also provides a benchmark consisting of over 100 caption sets that correspond to short stories with repeated subject concepts.
Strengths
The paper is original in defining a new task, MZSG, which fills a gap in the existing literature on image generation. The task itself is a creative combination of text-to-image synthesis and subject consistency.
The proposed method, VisuTeller, is well-designed and the authors provide a benchmark to evaluate its performance. The method does not require retraining or fine-tuning of existing models, which adds to its quality.
The paper is well-structured and easy to follow.
The work has the potential to contribute to the field of visual storytelling and multi-image generation. The benchmark provided has the potential to serve as a standard for future research in this area.
Weaknesses
While the task is novel, the methodology employs existing techniques in a new setting. A discussion on how VisuTeller significantly diverges from or improves upon existing methods would strengthen the paper.
The technical contribution is limited; both textual inversion and DOODL come from previous works.
Only the CLIP score is used as an evaluation metric. Incorporating other metrics such as ImageReward and HPS could provide a more comprehensive evaluation.
The benchmark, although a good start, is relatively small with just over 100 caption sets. Expanding this could provide a more robust evaluation. Additionally, since the benchmark is mainly generated by Large Language Models (LLMs), the quality of the benchmark should be further justified.
Questions
Why do you use the spherical distance to compute the distance between the CLIP features? (A short sketch of the distinction follows these questions.)
The paper mentions memory limitations for running the proposed algorithm. Could you specify the memory cost for backpropagation for a single image? Are any memory optimization strategies like gradient checkpointing used?
Are there plans to expand the benchmark dataset for more comprehensive evaluations?
How does VisuTeller handle ambiguous or conflicting textual prompts?
Could you provide more details on the computational efficiency of VisuTeller?
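Regarding the spherical-distance question above, a brief illustration of the distinction (my own sketch, not the paper's implementation): for unit-normalized features, the spherical (geodesic) distance is simply the arccosine of cosine similarity, so the two are monotonically related and differ mainly in their gradient behaviour near the extremes.

```python
import torch
import torch.nn.functional as F

def cosine_similarity(a, b):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    return (a * b).sum(dim=-1)

def spherical_distance(a, b):
    # Great-circle distance on the unit hypersphere: arccos of cosine similarity.
    return torch.arccos(cosine_similarity(a, b).clamp(-1.0, 1.0))

a, b = torch.randn(512), torch.randn(512)
print(cosine_similarity(a, b).item(), spherical_distance(a, b).item())
```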
The authors define a problem of zero-shot subject generation, which involves generating multiple images while maintaining consistency between them. They optimize a token to represent the shared subject of the generations which they then condition a T2I model on.
Strengths
- The problem is interesting and relevant given the recent focus on generative models
- The method does not require specialized datasets or retraining of models
- The way the method defines consistency is elegant
- The results look good
- The paper is easy to read. I had no trouble following all of the formulas and details. At first glance, it seems to have sufficient detail for replication.
Weaknesses
- The method is specialized towards diffusion models (via DOODL). It would be more interesting and impactful if the method also applied to non-diffusion approaches to image generation.
- The technical novelty and level of interest for a general ML venue like ICLR are not very high. The paper is better suited to a CV-specific conference.
Questions
- Is the Stable Diffusion model used in the comparisons the same model that is used in VisuTeller?
- Why use spherical distance vs. cosine similarity (which seems to be more common for comparing CLIP embeddings)?
- The Table 2 results are confusing: multiple questions were asked of the raters, so how do those translate into Table 2? Based on the examples in the paper, I would expect the performance difference between the paper's method and Stable Diffusion to be much higher than reported.
- Will the benchmark dataset be made publicly available?