PaperHub
Overall: 7.0 / 10 (Spotlight, 4 reviewers)
Ratings: 7, 7, 6, 8 (min 6, max 8, std 0.7)
Confidence: 4.5 | Correctness: 3.3 | Contribution: 3.0 | Presentation: 3.3
NeurIPS 2024

StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

OpenReview | PDF
Submitted: 2024-04-30 | Updated: 2024-11-06
TL;DR

Generating long-range image and video with consistent characters, based on Consistent Self-Attention and Motion Predictor.

Abstract

Keywords
Consistent character generation, Diffusion model, Image generation, Video generation, Transition prediction

Reviews & Discussion

Review 1 (Rating: 7)

This paper presents StoryDiffusion, a framework based on diffusion models designed to generate consistent images or videos. StoryDiffusion comprises two key components. The first is a novel self-attention mechanism, named Consistent Self-Attention, which enhances the consistency between generated images and can augment pre-trained diffusion-based text-to-image models without additional training. The second component is a Semantic Motion Predictor module, which predicts motion conditions in semantic space to generate videos with smooth transitions and stable subjects. The combination of these two components in StoryDiffusion offers a lightweight, training-free solution to the problem of content consistency in generated images and videos, while maintaining high controllability over the generated content.

Strengths

  1. The article is logically structured and easy to understand.
  2. The authors provide clear code examples, making the proposed methods easy to reproduce.
  3. The Consistent Self-Attention mechanism can augment pre-trained diffusion-based models without requiring additional training.
  4. The Semantic Motion Predictor module extends the method to address video consistency issues, ensuring smooth transitions and stable subjects in generated videos.
  5. The paper includes extensive experimental results demonstrating the effectiveness of StoryDiffusion.
  6. The framework offers a lightweight and efficient approach to generating consistent visual content.

Weaknesses

  1. Although the authors introduce methods like IP-Adapter in the related work and introduction sections, the paper's focus should be on story generation. Therefore, it lacks comparisons with similar story generation works, such as [1], [2], [3].
  2. The paper repeatedly emphasizes that the proposed modules are plug-and-play. However, the main text does not provide corresponding ablation studies to offer more substantial examples.
  3. The paper claims that StoryDiffusion is a lightweight method with minimal data and computational cost, but it lacks a detailed analysis of time and space overhead.
  4. The quantitative metrics are limited, mostly based on the CLIP score. Additionally, the user study involves a small number of participants and lacks detailed explanations of the setup.

[1] Avrahami, Omri, et al. "The Chosen One: Consistent Characters in Text-to-Image Diffusion Models." arXiv preprint arXiv:2311.10093 (2023).

[2] Tewel, Yoad, et al. "Training-Free Consistent Text-to-Image Generation." arXiv preprint arXiv:2402.03286 (2024).

[3] Jeong, Hyeonho, Gihyun Kwon, and Jong Chul Ye. "Zero-shot generation of coherent storybook from plain text story using diffusion models." arXiv preprint arXiv:2302.03900 (2023).

Questions

The motivation and method introduction are very clear. Here are some experiment-related questions and suggestions corresponding to the weaknesses that need the authors' responses:

  1. Could you provide a qualitative comparison with newer story generation tasks such as [1], [2], [3]? These works also rely solely on text prompts rather than input images like IP-Adapter, making them more relevant to your method. Additionally, since [2] also modifies self-attention to maintain content consistency, please explain the differences between Consistent Self-Attention and their approach.
  2. Given that the paper repeatedly emphasizes that the proposed modules are plug-and-play, there should be relevant experiments in the main text to substantiate this claim.
  3. It would be beneficial to have a clear comparison demonstrating the superiority of StoryDiffusion in terms of inference time and space overhead.
  4. For objective metrics, consider including FVD. For the user study, increasing the number of participants would help avoid the collected samples having a cluster bias.

Limitations

The authors have addressed limitations and societal impacts well. They acknowledge ethical concerns, noting that StoryDiffusion, like other generative methods, could be misused to create false information. They call for clear responsibilities and stronger legal and technical supervision to ensure proper use.

In the appendix, they identify two main limitations: minor inconsistencies in subject details, such as clothing, which may require detailed prompts, and challenges in generating very long videos due to difficulties in stitching images with significant differences. Future work will explore these areas further.

Author Response

We wish to express our sincere gratitude to the reviewer for the thorough review and constructive feedback. Following the suggestion of the reviewer, we provide detailed responses in the hope of addressing the reviewer's concerns.

Q1: Additional Comparison Methods.

We are thankful to the reviewer for the feedback. We have run extra experiments and the visualization results are included in Figure 2 of the uploaded PDF in the rebuttal. We will cite these three papers and include comparisons with them in the revised paper. Compared with them, our method performs better and is more flexible with less inference time as it requires only a single generation pass. In contrast, "The Chosen One" requires time-consuming LoRA self-training for each example; "Zero-shot Generation of Coherent Storybook" requires first generating the images and then performing Iterative Coherent Identity Injection embedding; "Consistory" iteratively calculates the segmentation mask during the diffusion process to improve consistency and can also be considered concurrent work with ours due to the close timing.

Q2: Experiments about the "plug-and-play" characteristics.

We are very appreciative of the reviewer's feedback. We have additionally implemented our StoryDiffusion on SD1.5 and SD2.1 and put the results, alongside those of SDXL, into Figure 1 of the rebuttal PDF. Our method maintains good performance when integrated into different models, which demonstrates its "plug-and-play" characteristics.

Q3: Efficiency analysis.

Based on the reviewer's suggestions, we performed an efficiency analysis of consistent image generation and transition video generation:

(1) Our consistent self-attention is training-free, which means we are more efficient than models that need to be trained on large datasets (IP-Adapter or PhotoMaker) or tuned at inference time (The Chosen One). We have listed below the average time required to generate a 1024x1024 image, the number of trainable parameters, and the amount of training data used. Our inference time is similar to that of the comparison methods, but our method does not require extensive additional pre-training, which demonstrates that it is efficient.

Method           Inference Time   Trainable Parameters   Finetune Dataset
IP-Adapter       6s               22M                    10M
PhotoMaker       5s               110.3M                 0.3M
StoryDiffusion   8s               0                      0

(2) For video, our inference time and parameter count are comparable to those of other methods, while we achieve significantly better performance.

Method           Inference Time   Parameters
SEINE            20s              0.9B
SparseCtrl       94s              1.9B
StoryDiffusion   27s              2.0B

Q4: “For objective metrics, consider including FVD. For the user study, increasing the number of participants”

(1) We sincerely thank the reviewer for the feedback. We have expanded the quantitative experiments accordingly, following the reviewer's suggestions. First, we have supplemented the FVD comparison results:

Method           FVD (lower is better)
SparseCtrl       429
SEINE            321
StoryDiffusion   271

(2) Additionally, to further address the reviewer's concerns, we have also added the FID results calculated on the frames:

Method           FID (lower is better)
SparseCtrl       181
SEINE            140
StoryDiffusion   109

(3) Furthermore, for the user study, following the reviewer's suggestions, we have increased the number of participants and organized a study with 79 participants. Our survey also included participants from a diverse range of professions and knowledge backgrounds to avoid cluster bias.

User preference (video methods):  SparseCtrl 5.9%, SEINE 9.6%, StoryDiffusion 84.5%
User preference (image methods):  IP-Adapter 20.8%, PhotoMaker 10.9%, StoryDiffusion 68.3%
Comment

Thank you for your detailed response. Your reply addressed most of my concerns, so I will raise my score. I also suggest including this content in the final version.

Comment

Special thanks to the reviewer for the thorough review! We will include the content you mentioned in the revised version of the paper.

Review 2 (Rating: 7)

The paper proposes a diffusion-based generative model designed to create subject-consistent images and videos that correspond to given story texts. The paper presents both qualitative and quantitative results showing enhanced subject consistency in generated images and videos compared to existing work.

Strengths

  • The paper proposes a training-free and pluggable consistent self-attention module that enhances the consistency of characters in a sequence of generated images.
  • To effectively model the large movements of characters, the paper proposes a semantic Motion Predictor for predicting the transition between two images in semantic space.
  • Experimental results show that the proposed model surpasses existing methods in generating subject-consistent images and videos.

Weaknesses

  • The Method section lacks clarity:
    • The process of splitting the story text into several prompts as mentioned in Figure 2 is not described in the main text.
    • In Section 4.1, the formation of story text from character prompts and activity prompts also lacks clarity.
    • In L182 and Equation 6, it is unclear whether the text embedding T represents the text associated with frame F_s or F_e, or something else.
  • According to sec 3.2, it seems like the proposed model independently generates transition videos for every two consecutive frames, which could be time-intensive and potentially result in less smooth transitions.
  • The main paper should include the necessary/essential training details instead of leaving all implementation details in the appendix.
  • In Figure 6, for the ablation study on sampling rates, it is hard to tell the effect of different sampling rates. It seems that for all sampling rates, the model fails to maintain consistency? In the second and third columns, the dog no longer wears the blue vest. Also, according to Algorithm 1, in the actual implementation, the module uses a tile size W when sampling the other tokens. However, the value of the tile size is not specified or discussed. The impact of varying the tile size "W" on consistency is also not discussed.

Questions

  • According to Algorithm 1, a tile size W is used for solving the OOM issue. I'm curious whether reducing the sampling rate while maintaining the tile size across the entire batch can be considered an alternative approach? Have the authors experimented with such a configuration? If so, why was the current implementation with a specific tile size preferred?
  • In L217, how is the CLIP score for character similarity computed? Are any segmentation or bounding boxes applied for computing the Image-to-Image CLIP score?
  • According to Appendix B, the paper builds upon Stable Diffusion, which uses a VAE to encode images into latents of shape BxHxWxC. In this context, how does the paper define an image token?
  • In Equation 5, l -> L.

Limitations

The limitation section has adequately stated most of the limitations. However, the time efficiency of video generation could be discussed more. Generating transitions independently for every two frames could significantly increase generation time. How does this approach compare to other baseline methods like SparseCtrl and SEINE in terms of overall video generation time?

Author Response

Firstly, we would like to thank the reviewer for acknowledging our efforts and contributions. Additionally, we deeply appreciate the reviewer's constructive feedback and would like to reply to the question raised by the reviewer in detail.

Q1: "The Method section lacks clarity"

We are very thankful to the reviewer for pointing out these detailed issues. We will revise our paper to address the clarity issues raised by the reviewer:

a. We will add the submodule to Figure 2 to illustrate how we use GPT-4 to split the story text into several prompts.

b. We will add the format of prompts to Section 4.1 for clarity. The format of our character prompts is "[adjective] [group or profession] [wearing clothing]" and the format of activity prompts is “[action] [location or object]”.

c. We will add a statement clarifying that T in Line 182 and Equation 6 refers to the prompt corresponding to the previous image.

Q2: "could be time-intensive and potentially result in less smooth transitions."

We would like to address the reviewer's concerns through the following aspects:

(1) Regarding the time consumed, we compared the average generation time of our method with comparable methods for a 16-frame 512x512 video. Our StoryDiffusion's average inference time is comparable to SEINE and significantly better than SparseCtrl, demonstrating that our method is not time-consuming compared to previous approaches.

Method           Inference Time
SparseCtrl       94s
SEINE            20s
StoryDiffusion   27s

(2) For the smooth transition, although our method generates transitions for every two consecutive keyframes, as demonstrated in the videos uploaded in the supplementary materials, our method can produce smooth long videos. The reason may be that our method learns from a large number of video clips, making each video clip coherent on its own. Additionally, we acknowledge that long video generation is a very challenging problem in the "Limitations" section of the paper, which may not be completely resolved within a single work. Since the video files in our supplementary materials have already demonstrated effectiveness, we will further explore how to establish connections between multiple short videos to make the transition smoother in our future work.
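For readers who want a concrete picture of the pipeline described in this response, here is a minimal, purely illustrative Python sketch (not the authors' code; `generate_transition` is a hypothetical stand-in for the Semantic Motion Predictor based video model) of how a long video is assembled from per-pair transition clips:

```python
# Illustrative sketch only: assemble a long video by generating a short
# transition clip for every pair of consecutive keyframes and concatenating.
def assemble_long_video(keyframes, generate_transition, frames_per_clip=16):
    clips = []
    for start, end in zip(keyframes[:-1], keyframes[1:]):
        clip = generate_transition(start, end, n_frames=frames_per_clip)
        # After the first clip, drop the leading frame so the shared keyframe
        # is not duplicated at clip boundaries.
        clips.append(clip if not clips else clip[1:])
    return [frame for clip in clips for frame in clip]
```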

Q3: "the dog no longer wears the blue vest."

We sincerely thank the reviewer for pointing out the ambiguity. The "blue vest" appears only in the first image just because the corresponding prompt contains "blue vest," while the other prompts do not. We will add the prompts to Figure 6 (a) of the main paper to prevent misunderstanding.

Q4: "Why is the current implementation with a specific tile size preferred?"

We sincerely thank the reviewer for the feedback on the tile size. Overall, we use a fixed tile size for stability and robustness. To better illustrate this, we conducted an additional experiment, which we present below. Increasing the tile size and reducing the sampling rate within certain limits indeed enhances consistency. However, beyond a certain point, further increases result in decreased character consistency. Additionally, we may not need such a large number of images at once in most cases. Therefore, we chose to use a slightly smaller tile size and a larger sampling rate after considering multiple factors. We appreciate the feedback and will make revisions in the method section to better help readers understand this point.

Sampling Rate   Tile Size   Character Similarity
0.3             3           86.39
0.5             3           88.37
0.7             3           89.26
0.2             8           89.50
0.1             16          86.35
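As a rough illustration of the tiling idea (an assumption-laden sketch, not a reproduction of Algorithm 1; `attn_fn` stands in for the batch-wise consistent attention), a large batch of images can be processed in tiles of W images so that cross-image token sharing, and hence memory use, stays bounded per tile:

```python
import torch

def tiled_consistent_attention(q, k, v, attn_fn, tile_size=3):
    """q, k, v: (B, N, C) token projections for B images; attn_fn is a
    batch-wise consistent attention function applied tile by tile."""
    outputs = []
    for start in range(0, q.shape[0], tile_size):
        sl = slice(start, start + tile_size)
        # Images share sampled tokens only within their own tile of W images,
        # which bounds the extra key/value length and the peak GPU memory.
        outputs.append(attn_fn(q[sl], k[sl], v[sl]))
    return torch.cat(outputs, dim=0)
```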

Q5: "Are any segmentation or bounding boxes applied?"

Yes, in the process of calculating character similarity, we first use the SOTA background removal method RMBG-1.4 to eliminate the background. Then, we use only the images of the foreground characters for calculating character similarity. We sincerely thank the reviewer for pointing this out and will add the relevant details to our paper for clarity.
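As a concrete illustration of this metric (a sketch under assumptions: `remove_background` is a placeholder for RMBG-1.4, and the CLIP checkpoint name is only an example, not necessarily the one used in the paper), character similarity can be computed as the mean pairwise image-to-image CLIP cosine similarity over the foreground crops:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def character_similarity(images, remove_background):
    # 1) Remove backgrounds so only the foreground character is compared.
    foregrounds = [remove_background(img) for img in images]
    # 2) Encode the foreground crops with the CLIP image encoder.
    inputs = processor(images=foregrounds, return_tensors="pt")
    feats = clip.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    # 3) Average the off-diagonal pairwise cosine similarities.
    sim = feats @ feats.T
    n = sim.shape[0]
    return ((sim.sum() - sim.trace()) / (n * (n - 1))).item()
```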

Q6: "How does the paper define an image token?"

We greatly appreciate the reviewer for pointing out this issue. In the attention modules of the UNet, the 3-dimensional HxWxC latent feature in VAE space is flattened into a 2-dimensional sequence of shape HWxC. We refer to each 1xC vector in this HW-length sequence as an image token. We will add the definition of "image token" at the beginning of the Methods section to avoid any confusion.
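To make the definition above concrete, here is a tiny PyTorch illustration (the shapes are examples only) of how a spatial feature map becomes a sequence of image tokens before self-attention:

```python
import torch

latent = torch.randn(2, 320, 64, 64)          # (B, C, H, W) UNet feature map
tokens = latent.flatten(2).transpose(1, 2)    # (B, H*W, C) sequence of tokens
print(tokens.shape)                           # torch.Size([2, 4096, 320])
# tokens[b, i] is one "image token": a 1 x C vector at spatial position i.
```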

Comment

Thank you for your response! I would like to maintain my score as it reflects the current status of the paper.

Comment

Also thank you for your response! We are pleased to receive your valuable review!

Review 3 (Rating: 6)

Aiming to generate story-based images or videos, this paper introduces StoryDiffusion and proposes the following methods:

  1. Consistent self-attention, which is a training-free way of modifying existing self-attention to maintain consistency between frames.
  2. Semantic motion predictor, which is an additional module that predicts accurate motions for the generated video.

Strengths

  • This paper tackles maintaining consistency and similarity between batches, which is necessary for real-world applications.
  • It maintains consistency in a simple but effective way, with consistent self-attention working without training.

Weaknesses

  • The random selection of tokens in consistent self-attention is not well explained. What are the advantages over other baselines despite the randomness?

  • There are some attention mechanisms that achieve consistency across frames within batches. In particular, some works (e.g., [A], [B], [C], and [D]) heuristically pick frames to attend to (e.g., the first frame, the frame before or after a given frame, or frames in a rotational manner), possibly due to the complexity. It seems that this paper lacks such discussion or comparison.

  • The novelty seems questionable. It appears to be a post-processing technique or trick for Stable Diffusion that increases consistency between frames.

  • Typo: L129 has a "." in the middle of the sentence.

[A] Wu, Jay Zhangjie, et al. "Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[B] Khachatryan, Levon, et al. "Text2video-zero: Text-to-image diffusion models are zero-shot video generators." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[C] Hong, Susung, et al. "DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation." arXiv preprint arXiv:2305.14330 (2023).

[D] Geyer, Michal, et al. "Tokenflow: Consistent diffusion features for consistent video editing." arXiv preprint arXiv:2307.10373 (2023).

Questions

  1. I'm curious if the consistent self-attention is robust across random seeds, since it picks tokens randomly. Again, it would be great to see comparisons with other token schemes.
  2. The Semantic Motion Predictor seems to require video datasets. What is the main motivation for not fully fine-tuning the model with the datasets but only the module?

Limitations

The authors have addressed the limitations but they haven't discussed the potential social impact.

Author Response

Firstly, we sincerely thank the reviewer for the thorough review and constructive feedback. We also appreciate the positive rating provided by the reviewer. After carefully reading the comments, we hope to address the reviewer's concerns in the following:

Q1: "(whether) consistent self-attention is robust across random seeds, and comparisons with other token schemes."

We are very thankful for the feedback and hope to address the reviewer's concern from three aspects: intuition and insights, the robustness of the method, and the quantitative evaluation of the method as below:

(1) The insight behind random sampling is to avoid reinforcing fixed character poses across different images, which would likely reduce the controllability of the text prompts over the generated images. This trend is also reflected in the newly added experiment shown in the table in part (2). The sampling ratio is introduced to achieve the best trade-off between character consistency and text controllability. Besides, a lower sampling ratio effectively reduces the computational cost of self-attention, reducing GPU memory consumption.

(2) To further address the reviewer's concerns, we add a quantitative evaluation for the impact of the random sampling ratio and the comparison with the Image Grid method. As shown in the table, while the non-random GridSampling method indeed results in better character consistency, the controllability of the text prompt is reduced. More visualization results are also shown in Figure 3 in the uploaded file in the rebuttal. In summary, the introduction of the sampling ratio is to achieve the best trade-off between character consistency and text controllability.

Method                   Character Similarity   CLIP Score
RandSample (Rate 0.3)    86.39%                 57.14%
RandSample (Rate 0.5)    88.37%                 57.11%
RandSample (Rate 0.7)    89.26%                 56.96%
GridSample (Rate 0.5)    89.29%                 56.53%

(3) All reported results in the paper are average values over three different random seeds. We will add more details in the revision to make this clearer. Under this setting, our proposed method still achieves consistently higher performance than the selected baseline methods.

Q2: "Novelty, ... (Consistent Self-Attention) appears to be a post-processing technique"

We respectfully remind the reviewer that our contributions are two-fold: (1) a new block to generate a consistent sequence of images and (2) a novel semantic motion predictor to convert the sequence of images into dynamic videos of unbounded length. The novelty of each point is explained below for your reference:

(1) We hope to clarify that Consistent Self-Attention is not a post-processing trick. It is used to modify the UNet architecture and operate together with the generation process. Even though there are some self-attention mechanisms in video editing tasks, the purpose and focus of our consistent self-attention are different as mentioned in Q3.

(2) We propose a new framework for long video generation with high dynamics. We modify an image model with CSA in a training-free manner, and the semantic motion predictor transforms the sequence of images into videos with meaningful and dynamic motions. Compared to traditional text-to-video approaches that treat the entire process as a black box, our two-stage method offers better controllability, because it allows dynamic adjustment of keyframes to modify video results, aligning with the demands of film production.

Based on the above justification, we hope the reviewer could re-assess the contributions of the proposed framework.

Q3: "Discussion with other attention mechanisms"

We thank the reviewer for raising these extra works. Some of the papers have already been discussed in the main paper (e.g., Tune-A-Video). The major differences between StoryDiffusion and the remaining works are summarized below and will be added in the revision. The attention mechanisms used in the mentioned papers are designed for highly similar consecutive video frames and adopt specific strategies for selecting keys and values to enhance the coherence of the video: Tune-A-Video uses the first frame and the previous frame as references for selection and requires additional fine-tuning of the "q linear" to align queries; Text2Video-Zero uses the first frame as keys and values; DirecT2V uses a temporal rotational selection of keys and values; TokenFlow uses a frame set with fixed frame intervals as keys and values.

Differently, our method aims to generate consistent characters with large dynamics while allowing strong text controllability. The generated images can vary from each other to a large extent without direct motion continuity between adjacent images. In this case, these strategies for selecting keys and values are not applicable. Instead, consistent self-attention emphasizes the mutual interaction between images within a batch to enhance consistency, rather than simply selecting some frames as keys and values for unidirectional interaction.
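To illustrate the batch-wise mutual interaction described above, here is a minimal PyTorch sketch (our own simplification, not the authors' implementation; it assumes single-head attention on a batch of at least two images, and `sampling_rate` is an illustrative parameter name): each image attends to its own tokens plus tokens randomly sampled from the other images in the batch.

```python
import torch
import torch.nn.functional as F

def consistent_self_attention(q, k, v, sampling_rate=0.5):
    """q, k, v: (B, N, C) projected tokens for a batch of B >= 2 images."""
    B, N, C = k.shape
    outputs = []
    for i in range(B):
        # Tokens from all the other images in the batch.
        other_k = torch.cat([k[j] for j in range(B) if j != i], dim=0)
        other_v = torch.cat([v[j] for j in range(B) if j != i], dim=0)
        # Randomly sample a fraction of those tokens (mutual, not one-way).
        n_sample = int(sampling_rate * other_k.shape[0])
        idx = torch.randperm(other_k.shape[0], device=k.device)[:n_sample]
        # Image i attends to its own tokens plus the sampled shared tokens.
        k_i = torch.cat([k[i], other_k[idx]], dim=0)[None]
        v_i = torch.cat([v[i], other_v[idx]], dim=0)[None]
        outputs.append(F.scaled_dot_product_attention(q[i][None], k_i, v_i))
    return torch.cat(outputs, dim=0)  # (B, N, C)
```

Replacing the random `idx` with a fixed stride would give a GridSample-style variant like the one compared in the table above.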

Q4: What is the main motivation for not fully fine-tuning the model with the datasets but only the module?

The main motivation for freezing the spatially related model parameters is to save the computational cost. As we have the first frame and the last frame as prior knowledge, this practice significantly saves the training cost with acceptable video generation performance.

Q5: haven't discussed the potential social impact.

We kindly remind the reviewer that the potential social impact has been included in the Broader Impact paragraph in Sec. 5. (Line 275-279) We will try to make it clearer in the revision.

Comment

I appreciate the authors' response. Since it addresses most of my concerns, I would like to increase my score. Also, it would be great to see the experiment on the random sampling rate in the final revision.

Comment

We also deeply appreciate the reviewer's highly constructive feedback! We will diligently revise our paper according to the reviewer's suggestions. The experiments on random sampling rates will be added to the final version.

Review 4 (Rating: 8)

This paper proposes StoryDiffusion, a video generation model encapsulating two contributions:

  • First, the authors show how to generate a sequence of images that are self-consistent (e.g., consistent appearance for characters, consistent attire, etc.) but obey different prompts. This is done using ordinary image diffusion models by generating these images in a batch and having tokens attend to randomly selected tokens in other images in the batch. The authors show that their approach does not require any additional training.
  • Second, these images (which might be generated following a “story” or sequence of text prompts) can be filled in with higher frequency video frames. The approach is to train a Stable Diffusion model with an AnimateDiff temporal module to use CLIP features of these keyframes (as well as linearly interpolated ones) as control signals.
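A rough sketch of the interpolation idea in the second bullet (my own illustration; the tensor shape and n_frames value are assumptions, and the conditioning interface of the actual video model is not shown):

```python
import torch

def interpolated_clip_conditions(feat_start, feat_end, n_frames=16):
    """feat_start, feat_end: (D,) CLIP image features of two consecutive keyframes."""
    weights = torch.linspace(0.0, 1.0, n_frames)
    # One semantic-space condition per frame: a linear blend of the two
    # keyframe embeddings, moving from the start frame to the end frame.
    return torch.stack([(1 - w) * feat_start + w * feat_end for w in weights])
```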

The authors show in evaluations that their approach beats previous papers like IP-Adapter and Photo Maker on character consistency and outperforms SEINE and SparseCtrl on video generation.

Strengths

I very much enjoyed reading this paper. It has strong and compelling results and attacks a timely/important problem. The consistent generation solution is particularly simple conceptually and does not require any training, and the second part of the paper builds on top of AnimateDiff, which content creators should be able to use very easily.

Weaknesses

I do not see any weaknesses that would cause me to vote against acceptance. That being said, the sampling ablation is not very serious (just a handful of cherry-picked examples that don't clearly show any differences relative to the parameter being ablated). More generally, it is not so clear why the authors prefer to sample tokens rather than use all of them. Is this for computational reasons? If I am right, the proposed attention should scale linearly in the batch size (so it is unclear why this would be problematic and necessitate sampling).

Questions

  • There is some detail provided for the semantic motion predictor in the Appendix that I recommend moving to the main paper (it reads somewhat too abstractly without this detail).
  • What does plug-and-play / hot-pluggable mean and how is it different from training-free?

Limitations

Yes

Author Response

First, we are extremely grateful that the reviewer recognizes our efforts and contributions. At the same time, we also appreciate the reviewer's constructive feedback and would like to respond specifically to the following aspects.

Q1: Improve ablation experiments of Random Sampling.

Special thanks for your insightful feedback. We also agree that a more systematic quantitative evaluation of the sampling strategy would make the paper better. Thus, we have run some quantitative evaluations as shown below:

Random Sample Rate      0.3      0.5      0.7
Character Similarity    86.39%   88.37%   89.26%
CLIP Score              57.14%   57.11%   56.96%

Insights & Analysis: The initial reason for using random sampling is to avoid reinforcing fixed character poses across different images, which would likely reduce the controllability of the text prompts over the generated images, as visualized in Figure 3 in the rebuttal PDF. This trend is also reflected in the newly added experiment shown in the table above. The sampling ratio is introduced to achieve the best trade-off between character consistency and text controllability. Besides, a lower sampling ratio does effectively reduce the computational cost of self-attention and the GPU memory consumption, but this is not the primary purpose. There are also many engineering measures that address memory consumption, such as the tiling method demonstrated in the supplementary materials, or saving sampled tokens for reuse when inferring other images.

Q2: Transferring semantic motion predictor details from the appendix to the main paper.

We are very thankful for the reviewer's suggestions. We will adjust the structure of the article by moving the implementation details section to the first subsection of the experiments chapter to improve the readability of the paper in the revision.

Q3: Meaning of the word "plug-and-play".

Thanks to the reviewer for pointing out this issue. The difference between plug-and-play and training-free is that plug-and-play implies compatibility with multiple different models, whereas training-free merely emphasizes that no training is required, without indicating whether the method can be extended. We state in the paper that our consistent self-attention (SA) is plug-and-play because our method is implemented based on the properties of self-attention.

In our experiments with consistent self-attention, we found that it works effectively across multiple models, such as Stable Diffusion 1.5, Stable Diffusion 2.1, and Stable Diffusion XL. Therefore, we believe that consistent SA is plug-and-play. We have presented the experimental results in Figure 1 of the uploaded rebuttal PDF.
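As an illustration of what "plug-and-play" means in practice (a hedged sketch assuming the diffusers attention-processor API; `ConsistentSelfAttnProcessor` is a hypothetical stand-in shown as a no-op subclass, and the model ids are only examples), the same self-attention replacement can be registered on different Stable Diffusion backbones without retraining:

```python
import torch
from diffusers import AutoPipelineForText2Image
from diffusers.models.attention_processor import AttnProcessor2_0

class ConsistentSelfAttnProcessor(AttnProcessor2_0):
    """Hypothetical placeholder; a real version would mix in tokens sampled
    from the other images in the batch, as in the earlier sketch."""

def make_consistent(pipe):
    procs = {}
    for name, proc in pipe.unet.attn_processors.items():
        # "attn1" blocks are self-attention; cross-attention ("attn2") is untouched.
        procs[name] = ConsistentSelfAttnProcessor() if "attn1" in name else proc
    pipe.unet.set_attn_processor(procs)
    return pipe

for model_id in ["runwayml/stable-diffusion-v1-5",
                 "stabilityai/stable-diffusion-2-1",
                 "stabilityai/stable-diffusion-xl-base-1.0"]:
    pipe = AutoPipelineForText2Image.from_pretrained(model_id, torch_dtype=torch.float16)
    make_consistent(pipe)  # the same code path works for all three backbones
```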

Author Response (General)

We would like to sincerely thank all the reviewers for their thorough reviews, and we are deeply grateful for their recognition of our efforts in this work. After carefully reading the review comments, we are pleased to respond to the issues raised by each reviewer.

We also uploaded a PDF file containing three figures to visually support our rebuttal explanations, as mentioned in the corresponding responses.

Final Decision

The paper received one Strong Accept, two Accepts, and one Weak Accept. The reviewers appreciated the problem setting as well as the technical approach (in particular the proposed Consistent Self-Attention). While there were concerns regarding the comparisons and clarity of the method, these were addressed in the rebuttal stage, and all reviewers recommend acceptance to varying degrees. I agree with the reviewers and also recommend Accept.

Based on the interest in the problem and the potential usefulness of the approach, I recommend this paper for a Spotlight. In the final paper, it would be good to include all the discussion from this rebuttal stage, and a more extensive qualitative or even quantitative comparison against the methods mentioned by Reviewer uTuy would be appreciated.