PaperHub

ICLR 2024 · Poster · 4 reviewers
Average rating: 6.5/10 (individual ratings: 8, 6, 6, 6; min 6, max 8, std 0.9)
Average confidence: 3.8

Seer: Language Instructed Video Prediction with Latent Diffusion Models

Submitted: 2023-09-19 · Updated: 2024-03-06

Abstract

Keywords

video prediction, multi-modality generation, diffusion model, language-vision model

Reviews and Discussion

Official Review (Rating: 8)

This paper introduces the Seer model, a Language-Instructed Video Prediction with Latent Diffusion approach, for the text-conditioned video prediction (TVP) task. It enhances the U-Net architecture by incorporating a spatial-temporal attention module and introducing a Frame Sequential Text (FSText) Decomposer. This paper conducts experiments on three different datasets, demonstrating the effectiveness of the proposed method.

Strengths

  • The authors propose a Frame Sequential Text Decomposer, which extends the capabilities of pretrained text-to-image (T2I) diffusion models along the temporal axis.
  • Seer achieves superior performance on three datasets: Something Something V2 (SSv2), Bridgedata, and EpicKitchens-100.

Weaknesses

  • While the paper repeatedly claims to have designed a sample- and computation-efficient model, the only comparison provided in the appendix is in Table 8, focusing on temporal attention. It would be beneficial to include more computational comparisons to validate this claim.
  • The writing could be improved for better clarity. Additionally, the positioning of figures could be optimized for easier readability; consider rearranging Figures 3 and 4.
  • It would be better to enhance the paper to include an analysis of failure cases and generated video examples for a more comprehensive evaluation.

Questions

Please refer to weaknesses.

Comment

Weakness 2: The writing could be improved for better clarity. Additionally, the positioning of figures could be optimized for easier readability; consider rearranging Figures 3 and 4.

Thank you for your valuable feedback. In response to your concerns, which were also raised by Reviewer shhp and Reviewer xJcW regarding the clarity of our paper, we have undertaken comprehensive revisions to enhance readability in four respects:

  1. We have streamlined our motivation section, minimizing detailed discussions on robot learning, and have refocused the introduction to underscore the main contributions of Seer.
  2. The mistaken claim that text-conditioned video prediction is more challenging than text-to-video generation has been clarified in the related work section.
  3. To improve the understanding of our method, we have highlighted key components of Seer in the method figure and reorganized the flowchart for better clarity.
  4. Figures have been strategically repositioned to enhance overall paper readability.

These revisions have been diligently incorporated into our paper to address the concerns raised, with the aim of significantly improving the overall readability of our work.

Weakness 3: It would be better to enhance the paper to include an analysis of failure cases and generated video examples for a more comprehensive evaluation.

Thank you for your constructive suggestion. We have expanded our qualitative analysis of failure cases, incorporating a dedicated section in Appendix D.7 of our revised paper. Figure 18 in the revised paper showcases instances where Seer struggles to capture appearance consistencies, such as the color of the card in the previous frame and the cover of the book in the reference environment. These challenges are attributed to the difficulty Seer faces in forecasting newly occurring concepts and maintaining temporal consistency within complex backgrounds. In the context of the Epic-Kitchens-100 dataset, Seer tends to predict camera pose movements and generate novel views of the environment, rather than accurately learning human motion. This difficulty arises due to the prevalent scene transitions in Epic-Kitchens. Further insights into these failure cases are discussed in detail in the new Appendix D.7 section.

Comment

Thanks for your response. I have no other concerns.

Comment

Weakness 1: While the paper repeatedly claims to have designed a sample- and computation-efficient model, the only comparison provided in the appendix is in Table 8, focusing on temporal attention. It would be beneficial to include more computational comparisons to validate this claim.

Thank you for your valuable suggestion. We have included an additional comparison involving our ablated setting and another baseline method, VideoFusion [1], which shares a network design based on the Stable Diffusion U-Net. Notably, in our proposed Seer setting, the 2D spatial layers of the U-Net are frozen during fine-tuning. In contrast, the ablated setting keeps all 2D spatial layers trainable. We assessed GPU memory consumption and training time for the different models, utilizing a single 80GB NVIDIA A800 GPU during the training phase. The results in Appendix Table 8 of the revised paper reveal that the proposed Seer setting with frozen 2D layers exhibits approximately half the GPU memory consumption (24.7GB) compared to Seer without frozen 2D layers (39.2GB) and VideoFusion (45.0GB). To ensure a fair comparison of training speed among the various settings on a single A800 GPU, we conducted additional assessments of models with GPU memory usage exceeding 90%, as detailed in Appendix Table 9 of the revised paper. The results in that table reveal a notable advantage in computational efficiency for the proposed Seer setting with frozen 2D layers, exhibiting reduced training time (3.10 sec/iter) compared to Seer without frozen 2D layers (6.89 sec/iter) and VideoFusion (7.63 sec/iter). These findings underscore the heightened computational efficiency of Seer compared to key video baseline models.

Appendix Table 8. Training time and GPU memory consumption of the models on a 16-frame video

| Model | Training time | GPU memory |
|---|---|---|
| Seer w/ 2D frozen (Ours) | 0.75 sec/iter | 24.7 GB |
| Seer w/o 2D frozen | 0.96 sec/iter | 39.2 GB |
| VideoFusion | 1.07 sec/iter | 45.0 GB |

Appendix Table 9. Training time and GPU memory consumption of the models on a 16-frame video (>= 90% GPU memory usage)

| Model | Training time | GPU memory |
|---|---|---|
| Seer w/ 2D frozen (Ours) | 3.10 sec/iter | 72.9 GB |
| Seer w/o 2D frozen | 6.89 sec/iter | 75.8 GB |
| VideoFusion | 7.63 sec/iter | 78.7 GB |
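To make the freezing scheme behind these numbers concrete, here is a minimal sketch (editorial illustration, not the authors' released code), assuming a PyTorch-style U-Net whose temporal layers carry "temporal" in their parameter names; the actual module names in Seer may differ. Because frozen weights keep no gradients or optimizer state, this is where most of the memory saving in the table comes from.

```python
import torch
import torch.nn as nn

def freeze_spatial_layers(unet: nn.Module):
    """Freeze every parameter except those in the (assumed) temporal layers; return the trainables."""
    trainable = []
    for name, param in unet.named_parameters():
        is_temporal = "temporal" in name      # assumption: temporal layers carry this tag
        param.requires_grad = is_temporal
        if is_temporal:
            trainable.append(param)
    return trainable

# Toy stand-in for an inflated U-Net block: a frozen 2D conv plus a trainable temporal layer.
block = nn.ModuleDict({
    "spatial_conv": nn.Conv2d(4, 4, 3, padding=1),
    "temporal_attn": nn.MultiheadAttention(4, 1, batch_first=True),
})
params = freeze_spatial_layers(block)
optimizer = torch.optim.AdamW(params, lr=1e-4)  # optimizer state only for the temporal parameters
```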

References:

[1] Zhengxiong Luo, et al. "VideoFusion: Decomposed diffusion models for high-quality video generation." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10209–10218, 2023.

Official Review (Rating: 6)

The authors propose Seer, a video prediction model created by extending a pretrained text-to-image StableDiffusion model and finetuning additional elements. When producing a denoised output, the per-frame convolutions are retained from the pretrained text-to-image StableDiffusion model; however, their outputs are subsequently updated by a temporal attention layer that can condition on reference/initialization frames as well as past temporal frames. This serves to visually learn temporal consistency. Secondly, the authors propose learning and using per-frame sub-instruction embeddings as conditioning information. Ultimately this helps to guide each frame's generation with respect to sub-portions of the input instruction. The authors perform experimentation on text-video datasets, and thorough ablations over different design choices of their approach.
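For readers who want a concrete picture of the temporal conditioning summarized above, the following is a minimal sketch (not the authors' code) of attention along the frame axis in which each frame attends to the reference frames and to a window of preceding frames; the parameters n_ref and window are illustrative assumptions.

```python
import torch

def temporal_attention(x, n_ref=2, window=4):
    """x: (batch, frames, tokens, dim) latent features; returns the same shape."""
    b, f, t, d = x.shape
    q = k = v = x.permute(0, 2, 1, 3).reshape(b * t, f, d)   # attend along the frame axis

    # Mask: frame i sees the reference frames [0, n_ref) and past frames within `window`.
    idx = torch.arange(f)
    past = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)
    mask = past | (idx[None, :] < n_ref)                      # (f, f) boolean, True = allowed

    attn = torch.einsum("bqd,bkd->bqk", q, k) / d ** 0.5
    attn = attn.masked_fill(~mask, float("-inf")).softmax(dim=-1)
    out = torch.einsum("bqk,bkd->bqd", attn, v)
    return out.reshape(b, t, f, d).permute(0, 2, 1, 3)

# Example: 16 frames of 8x8 latent tokens with 64 channels
feats = torch.randn(1, 16, 64, 64)
print(temporal_attention(feats).shape)  # torch.Size([1, 16, 64, 64])
```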

Strengths

There are multiple works that have proposed generating video conditioned on text, particularly how to do so leveraging models that are pretrained to do text-to-image generation. A large contribution of this work, in this reviewer's opinion, is the thorough exploration into what design decisions are important for the text-video generation task while making limited adjustments to a pretrained text-to-image model. In other words, what are the important considerations that enable text-to-image models to cheaply extend to modeling videos. The design decisions (such as what type of temporal attention is useful) are interesting insights. This paper also focuses on natural (or photo-realistically simulated) videos, which is appreciated. This reviewer makes the following additional, specific keyword comments:

  • Originality: this work is not particularly original, in that there are many efforts that try to extend a pretrained text-to-image diffusion model with minimal adjustments to perform text-to-video generation. The architecture design decisions to extend static image diffusion models were reused from prior work, with additional novel components in the form of temporal window attention and per-frame subinstruction conditioning.
  • Quality: the quality of the work was high due to the experimentation on natural-looking videos, and the ablation insights.
  • Clarity: the paper was not the easiest to read, with certain design decisions described using unnecessarily confusing prose. It often took multiple reads to understand what should have been an easy-to-describe design decision. Furthermore, the figures were small (to the point of illegibility on printed paper) and often not particularly intuitive to understand. I recommend the authors highlight the components of their method with larger and more intuitive figures (beyond arrows and flow-charts). For example, in Figure 4, the output of part (a) is sub-instructions, but there is also an arrow that leads back with a "step 1" label, confusing the reader on how the flow of computation should be interpreted. There is a similar issue in Figure 5, where deciphering the steps and how the arrows should be interpreted is naturally confusing (there are arrows pointing up as well as down, and all steps are shown simultaneously). Perhaps such figures that depict multiple steps can be split into larger visuals with each step independently shown for clarity.
  • Significance: this work is of mediocre significance; it builds off of existing text-to-image models and past text-to-video approaches, but adds additional insight into what might be important to consider for future efforts along this direction (such as improving temporal consistency and fine-grained conditioning).

Weaknesses

This reviewer believes that learning text-grounded subinstructions for conditioning is an interesting and fruitful direction forward for solving fine-grained video generation, particularly for long-form video generation. The authors therefore make useful initial efforts towards this goal. However, the authors generate subinstructions by 1) initializing subinstructions as the global embeddings and 2) performing cross attention with the global embeddings. Therefore, it appears that subsets of the same information within a global embedding are selected for each subinstruction embedding, similar to performing attention on the global embedding. Is it possible for a model to learn proper subgoals/subinstructions in such a way, when it is constrained to a manipulation of global information? Intuitively, for a global instruction of "cook a pasta dinner", subinstructions might naturally be about "boiling water" and "heating up sauce", but such information may not inherently be extractable from a global embedding of "cook a pasta dinner". This reviewer would be more convinced if the global text were expanded to include intermediate info, such as by thinking "step by step" via a pretrained LLM, which is then converted into subinstructions per timestep. The current formulation does not seem conceptually convincing beyond being a slightly more advanced attention head, which may be inherently limiting as an approach. Additional investigations into what the subinstructions may be encoding in terms of textual information would be helpful.

The second weakness is with the length of video generation. The authors state that they only generate 12 frames for SSv2 and Epic100, and 16 frames for BridgeData. Let us focus on SSv2 for the sake of example, but the point generally holds true. SSv2 videos are stated to have an FPS of 12. Then, essentially the Seer model finetuned on SSv2 is only able to generate 1 second of video. Put in this context, can enough happen within one second where finegrained details such as subinstructions are necessary? On the other hand, if the authors are downsampling video clips by sampling frames from the entire video equally spaced out by time, then essentially Seer is only learning to produce intermediate snapshots of a given natural video. Only for very short sequences would the result of Seer then be treated as a smooth, natural video.

Text-Video Prediction (TVP) was motivated by the authors as a way to help robots follow human instructions (specifically, how to map abstract language instructions to policy actions). Indeed, the authors motivate it by stating that humans employ an imagining-planning approach, where future trajectories are imagined in their minds. This reviewer agrees with the authors, but notes that an additional step needs to be performed to ground generated text-conditioned videos to actual policies; given the motivations, it would be stronger to demonstrate an example (even synthetic) where the videos generated by Seer were able to help policy learning either in a robot, simulated environment, or game.

Questions

  1. Why were 2 reference frames used for Something Something-v2, but not for Bridgedata or EpicKitchens100? Is there something special about SSv2, or were the generated results simply empirically worse for 1-reference frame SSv2 (and why might this be the case for this particular dataset)?
  2. Seer was demonstrated for 12 and 16 frame generations, but how does such a model perform for long-form generation? Can the final frame(s) of an output sequence be used as a reference to generate subsequent frames using the same text prompt to create a longer video? Or would the downsampling assumptions used prevent that from being meaningful (since it might generate 12 frames spanning an entire video all at once)? It would strengthen the work to showcase and evaluate longer-form video generation.
  3. Are there ways to probe, textually, the subinstructions learned to understand what they are encoding?
  4. The paper makes repeated reference to "initializing" the subinstruction embeddings to be CLIP global embeddings. However, at the end of Section 4 on Page 6, a loss equation is provided. Is this an additional objective that is utilized during optimization of the FSText decomposer during finetuning, or are the subinstructions explicitly initialized to the global CLIP embeddings before performing self-attention, cross-attention, and temporal attention? This distinction is important.
  5. Given that the global CLIP embedding is used for initialization, as well as for cross-attention, there does not seem to be any additional textual or contextual data to utilize for subinstructions. Then, subinstructions seem to be generated purely as a combination of information captured in the specific words of the global instruction; to this reviewer, the FSText decomposer therefore acts as a form of attention network over the existing global embedding. Can the authors highlight how this is conceptually different, given that the information generated in each subinstruction is simply a subset of the global instruction? Furthermore, this approach seems highly dependent on the initial global instruction utilized. Are there other choices for the text embedding that could be ablated over?
Comment

Question 6: Why were 2 reference frames used for Something Something-v2, but not for Bridgedata or EpicKitchens100?

This comment is regarding the choice of reference frames in the training datasets. The Something Something-v2 (SSv2) dataset, which concerns real-world human behaviors, contains various unexpected cases in real-world video such as object occlusions, scene changes, and camera view movements. Initially, we were concerned that Seer might face challenges in learning from such intricate scenes when provided with only one reference frame and lacking additional background information. Hence, SSv2 was chosen as our primary training dataset, and we employed the initial setting of using 2 reference frames for generating 12-frame-long videos on SSv2.

To our surprise, Seer demonstrated robust performance even with only one reference frame in the SSv2 dataset in our follow-up test in Appendix Figure 14 of the revised paper. This result prompted us to extend our experiments to the Egocentric human kitchen video dataset EpicKitchens100, adopting the 1 reference frame condition. Remarkably, Seer exhibited competitive performance compared to other baseline methods in this setup.

Question 7: How does such a model perform for long-form generations? Can the final frame(s) of an output sequence be used as a reference to generate subsequent frames using the same text prompt to create a longer video?

Thank you for your valuable suggestion. In this paper, our primary goal is task-level video prediction with language guidance, and most downsampled video clips inherently have sufficient length to cover task-level behavior. Intuitively, appending an additional temporal upsampler network, which is compatible with most video diffusion models including Seer, could further aid Seer in expanding the number of video frames.
Moreover, considering the suggestion about using the final frame of the output sequence as a reference for generating the next sequence, this approach could offer the advantage of sampling longer video sequences aligned with a long sequence of subinstructions. Therefore, we see potential benefits in exploring the expansion of macroscopic-level global instructions into multi-step intermediate guidance via a Large Language Model (LLM) for long-sequence video prediction. This direction is certainly worth further exploration in our future work.

Regarding your suggestion on using the same text for long video generation, we have conducted an additional experiment on long video prediction using a 12-frame SSv2 fine-tuned Seer model. The results of this experiment are presented in Appendix D.6 Figure 19 of the revised paper. These findings demonstrate that using the same sub-instruction to generate the next sequence of output can maintain coherent motion, persisting until the target object is no longer present or the current motion is in a terminated state. While this strategy enables the generation of coherent movements in the output sequence, it is distinct from upsampling video clips along the frame axis. Based on these outcomes, we believe that expanding the generated subinstruction embeddings along the frame axis during the training stage represents a promising direction for enabling Seer to generate longer videos in multiple steps.
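As an editorial illustration of the rollout strategy discussed above (reusing the last generated frames as references for the next chunk while keeping the same text), here is a hedged sketch; `sample_chunk` is a hypothetical wrapper around a fine-tuned Seer sampler, not a function from the paper's code.

```python
def rollout(sample_chunk, init_frames, text, n_chunks=3):
    """init_frames: list of reference frames; returns the concatenated frame list."""
    video = list(init_frames)
    n_ref = len(init_frames)
    for _ in range(n_chunks):
        # Predict the frames that follow the current last n_ref frames, reusing the same text.
        future = sample_chunk(ref_frames=video[-n_ref:], text=text)
        video.extend(future)
    return video

# Toy usage with a dummy sampler that returns placeholder frames:
dummy = lambda ref_frames, text: [f"frame_{i}" for i in range(10)]
print(len(rollout(dummy, ["ref_0", "ref_1"], "push the cup from left to right")))  # 32
```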

Comment

Weakness 4: The Seer model finetuned on SSv2 is only able to generate 1 second of video; can enough happen within one second where finegrained details such as subinstructions are necessary? If the authors are downsampling video clips by sampling frames from the entire video, then Seer is only learning to produce intermediate snapshots of a given natural video.

Thank you for your constructive comment. Many existing approaches, such as Make-A-Video and Video Diffusion Model, extend video length by incorporating Upsampler Networks. However, considering our primary focus on task-level video prediction achieved through downsampling of video clips and emphasizing computation efficiency, we choose not to adopt the Upsampler Network in our work. Furthermore, like most video diffusion models, Seer is compatible with various Upsampler Networks. While we haven't incorporated it in our current work, the integration of Upsampler Networks with Seer could be a feasible direction for future study, particularly when adapting to longer video datasets.

On the other hand, the frame rate (FPS) after downsampling in our SSv2 training data varies from 1.67 FPS to 12 FPS. Given that most human behaviors in the SSv2 scenario involve one-step motions (e.g., "boiling water"), we have empirically observed that a minimum of 2 frames for a one-second clip is generally sufficient to depict coherent motion in the majority of cases within the SSv2 dataset.

Weakness 5: Additional steps need to be performed to ground generated text-conditioned videos to actual policies; given the motivations, it would be stronger to demonstrate an example (even synthetic) where the videos generated by Seer were able to help policy learning either in a robot, simulated environment, or game.

Thank you for your suggestion. We have incorporated additional visualizations of Seer's performance on robot benchmarks, including MetaWorld, RLBench, and Cliport, in Appendix D.5 of our revised manuscript. In this setup, commencing with the Seer model fine-tuned on SSv2, we froze the FSText decomposer and conducted further fine-tuning on a diverse dataset comprising SSv2, Bridgedata, 236 MetaWorld video clips, 785 CliPort video clips, and over 2000 video clips from RLBench. This fine-tuning process spans 80k steps with a reduced learning rate of 4e-5.

The results of these experiments are presented in Appendix Figure 18 of our revised paper, illustrating Seer's adaptability to a multi-task environment that includes human video, robot manipulation, and simulation scenarios. Importantly, Seer maintains robust temporal alignment with task descriptors across these diverse video domains.

Furthermore, to investigate whether Seer can help policy learning, we choose UniPi [1] as our baseline in the simulated robotics environment Meta-World [2]. UniPi generates videos from the initial state and infers actions from adjacent frames via a pre-trained inverse-dynamics model. Specifically, we distill a policy model from the videos generated by Seer and the actions labeled by the pretrained inverse-dynamics model. We choose 3 tasks and use 10 in-domain videos for each task to fine-tune Seer. We then compare:

a. Distilled policy from the dataset generated by the Fine-tuned Seer model (1000 generated videos for each task).

b. Distilled policy from the 10 in-domain videos for each task.

The results are as follows:

| Success rate | a. 1k videos generated from Seer | b. 10 ground-truth videos |
|---|---|---|
| button-press-topdown-v2 | 0.45 | 0.4 |
| drawer-close-v2 | 0.1 | 0 |
| drawer-open-v2 | 0.05 | 0 |

We find that, compared to distilling a policy directly from the given 10 in-domain videos, fine-tuning Seer on them and generating more videos performs better, because it yields more scalable data of comparable quality. Therefore, the videos generated by Seer can indeed help policy learning in simulated robotics tasks.
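To make the distillation setup above concrete, here is a hedged sketch (not the authors' code) of the UniPi-style pipeline: generate videos with the fine-tuned Seer, pseudo-label actions between adjacent frames with a pretrained inverse-dynamics model, then train a policy by behavior cloning. `generate_video`, `inverse_dynamics`, and `policy` are hypothetical callables, and the MSE loss is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def distill_policy(generate_video, inverse_dynamics, policy, prompts, steps=1000, lr=1e-4):
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for step in range(steps):
        frames = generate_video(prompts[step % len(prompts)])   # (T, C, H, W) generated rollout
        actions = inverse_dynamics(frames[:-1], frames[1:])     # (T-1, action_dim) pseudo-labels
        pred = policy(frames[:-1])                              # behavior cloning on generated data
        loss = F.mse_loss(pred, actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```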

[1] Du, Y., Yang, S., Dai, B., Dai, H., Nachum, O., Tenenbaum, J. B., ... & Abbeel, P. (2023). Learning universal policies via text-guided video generation. In Thirty-seventh Conference on Neural Information Processing Systems.

[2] Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, C., & Levine, S. (2020). Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning (pp. 1094-1100). PMLR.

Comment

Weakness 1: The figures were small (to the point of illegibility on printed paper) and often not particularly intuitive to understand. I recommend the authors highlight the components of their method with larger and more intuitive figures (beyond arrows and flow-charts).

We appreciate the reviewer's suggestions regarding the clarity of our paper. After a thorough review, we have carefully examined the figures in the method section. To enhance the emphasis on crucial components of our proposed method, we have made explicit highlights in Figures 4 and 5. We hope these revisions address the concerns regarding clarity.

Weakness 2: The authors generate subinstructions by 1) initializing subinstructions as the global embeddings and 2) performing cross attention with the global embeddings. Therefore, it appears that subsets of the same information within a global embedding are selected for each subinstruction embedding, similar to performing attention on the global embedding. It would be more convincing if the global text were expanded to include intermediate info, such as by thinking "step by step" via a pretrained LLM.

We appreciate the reviewer's insightful feedback and thorough suggestions. Overall, language instructions can be categorized into macroscopic-level and microscopic-level instructions. Macroscopic-level instructions involve abstract narrations of multi-step tasks, such as "cook a pasta dinner." On the other hand, microscopic-level instructions focus on one-step actions, like "boiling water" within the broader context of "cook a pasta dinner." In response to these valuable suggestions, we emphasize that our primary focus is on microscopic-level action generation. Specifically, the FSText decomposer is designed to transform one-step instructions into sequential fine-grained subinstructions. Through the alignment of visual tokens at each time step, these subinstructions effectively plan a continuous motion trajectory. Consequently, these subinstructions can be considered equivalent to low-level robotic instructions, providing specific values such as "rotate the hand by 30 degrees, then move forward 1/32 of the distance between the hand and the target object."
Furthermore, we agree with the reviewer's suggestion regarding macroscopic-level global instructions. We acknowledge that these multi-step global instructions can be expanded into intermediate information through a pre-trained Large Language Model (LLM). This intermediate information can then be systematically converted into one-step instructions for each state, ultimately serving as the foundation for generating low-level subinstructions for video prediction. Recent approaches such as SayCan [1] have attempted to ground language in robotic tasks by translating high-level abstract knowledge of macroscopic tasks into procedures for performing complex and temporally extended instructions. Based on the above discussion, we see the value in exploring this direction further in our future work.

References:

[1] Ahn, Michael, et al. "Do as i can, not as i say: Grounding language in robotic affordances." arXiv preprint arXiv:2204.01691 (2022).

Question 3: Additional investigations into what the subinstructions may be encoding in terms of textual information would be helpful. Are there ways to probe, textually, the subinstructions learned to understand what they are encoding?

We appreciate the reviewer's observation and understand the challenges posed by the limitations of CLIP in generating readable textual content from encoded embeddings. In the learning process of FSText decomposer, where subinstructions are acquired from transition distributions along the temporal dimension, we have noted that attempting to generate text-to-image content solely based on subinstructions without temporal awareness results in content that is not easily interpretable by humans.

Consequently, to indirectly explore the semantic expression of subinstructions, we have included visualizations of videos generated with the guidance of constant clones of subinstructions from the third or twelfth frame along the video axis. These visual samples are presented in Figure 9 of Section 5.6 and Figure 16 in Appendix Section D.3. Through these visual representations, we aim to illustrate that subinstructions indeed provide proximate semantic guidance for the motion trajectory at each time step.

Comment

Question 8: The paper makes repeated reference to "initializing" the subinstruction embeddings to be CLIP global embeddings. Is this an additional objective that is utilized during optimization of the FSText decomposer during finetuning, or are the subinstructions explicitly initialized to the global CLIP embeddings before performing self-attention, cross-attention, and temporal attention?

Thank you for your comment. We clarify that the additional objective at the end of Section 4 is utilized during the initialization stage of the FSText decomposer in Figure 4 of the paper. This stage is performed prior to the diffusion process.
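As an editorial illustration of such an initialization stage, the sketch below regresses the decomposer's per-frame outputs toward the global CLIP text embedding before diffusion fine-tuning begins. The MSE form and the `fstext_decomposer` callable are assumptions for illustration only; the exact objective is the one given in Section 4 of the paper.

```python
import torch
import torch.nn.functional as F

def init_loss(fstext_decomposer, frame_tokens, global_embed):
    """frame_tokens: (F, L, D) learnable per-frame tokens; global_embed: (L, D) CLIP text embedding."""
    sub_instructions = fstext_decomposer(frame_tokens, global_embed)   # (F, L, D)
    target = global_embed.unsqueeze(0).expand_as(sub_instructions)     # anchor each frame to the global text
    return F.mse_loss(sub_instructions, target)
```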

Question 9: Can the authors highlight how this is conceptually different, given that the information generated in each subinstruction is simply a subset of the global instruction? Furthermore, this approach seems highly dependent on the initial global instruction utilized. Are there other choices for the text embedding that could be ablated over?

Thank you for your comments. We would like to clarify that the information generated in the subinstructions is not simply an expanded subset of the global instruction; rather, it reflects fine-grained guidance on the transition of states at the microscopic level. For instance, consider the global instruction "moving object from left to right" as shown in Appendix Figure 15. With the temporal alignment of motion in training video samples, the generated subinstruction embeddings can decompose the high-level abstract information of language into low-level instructions at continuous time steps, containing low-level control information such as relative position or angle. In this case, the subinstruction at the 1/32 time step of the video can be considered equivalent to the microscopic-level instruction "moving right by 1/32 of the image width". This is where the FSText Decomposer assumes a crucial role, facilitating the transformation of global language instructions into coherent subinstruction guidance along the temporal sequence, leading to the prediction of continuous motion.

Official Review (Rating: 6)

The authors propose a novel method for text-conditioned video prediction (TVP) that utilizes pretrained CLIP embeddings for language and inflates pretrained U-Nets from static image generation models to produce a language- and image-conditioned next-frame prediction model. They perform experiments on Something-Something-v2, the Berkeley Bridge dataset, and Epic-Kitchens data, demonstrating improved video prediction in comparison to competing methods on FVD and KVD. They also perform human studies and report that their model shows an improvement on the prediction task in comparison to existing baselines.

Strengths

  1. The paper is well-written and easy to follow.
  2. The method utilizes pretrained models thereby requiring less data than training from scratch.
  3. The experiments are thorough and performed on a variety of benchmarks
  4. The authors also perform human studies to validate the quantitative claims made in the paper

Weaknesses

  1. Improve the motivation for the paper: The first few paragraphs of the introduction focus on robot learning, but the paper completely eschews this domain. It would be useful to motivate the paper from a different perspective that does not involve robotics. I appreciate that there are experiments conducted on the bridge dataset but I still believe this is primarily not a robotics paper as there are no resultant policies.
  2. Difficulty in comparison to text-to-video generation: Playing devil’s advocate here, one could argue that performing text-to-video generation involves learning both the initial state distribution along with the transition distribution, while prediction involves only learning the transition distribution. Therefore it is unclear which is harder and it would be useful to avoid making such claims.

Questions

  1. How does sequence length of the video along with number of steps of prediction affect performance?
  2. What is the inductive bias that forces the Text Decomposer to produce coherent sub-instructions? It's unclear why it would segment the instructions into substeps on its own.
Comment

Weakness 1: The first few paragraphs of the introduction focus on robot learning, but the paper completely eschews this domain. It would be useful to motivate the paper from a different perspective that does not involve robotics.

Thank you for your thoughtful suggestion. We acknowledge that we overemphasized robot learning in our introduction section. The primary goal of this paper is to introduce a text-conditioned video prediction (TVP) model to predict task-level videos based on a specific task descriptor that describes behavioral movements in both the real world and robot environments. We have duly noted your feedback and have revised the motivation in the introduction section to accurately reflect the primary focus of our research.

Weakness 2: Difficulty in comparison to text-to-video generation: One could argue that performing text-to-video generation involves learning both the initial state distribution along with the transition distribution, while prediction involves only learning the transition distribution. Therefore, it is unclear which is harder and it would be useful to avoid making such claims.

Thank you for bringing this to our attention. We acknowledge the concern that claiming text-conditioned video prediction is inherently more challenging than text-to-video generation tasks may not be appropriate. In response to this feedback, we have revised the description in the introduction section of the paper to accurately represent the nature of our task.

Question 3: How does sequence length of the video along with number of steps of prediction affect performance?

Thank you for your comments. We have included additional analysis regarding the correlation between sampling steps and the quality of video predictions, particularly for two sequence lengths (12 and 16), in Appendix Section B.3 of the revision. Specifically, we conducted an assessment for both 12-frame and 16-frame video predictions using various DDIM sampling steps (10, 20, 30, 40, 50, 60 DDIM steps). All generated outputs were sampled using a 12-frame SSv2 fine-tuned model. The comparative results are illustrated in Appendix Figure 10 of the revision and the following table. Notably, the 16-frame curve experiences a steeper decline from DDIM steps 10 to 20 compared to the 12-frame curve. As both curves progress beyond DDIM step 30, they tend to stabilize and show marginal gains. These results collectively demonstrate that increasing DDIM sampling steps notably improves video quality for longer video sequences when the sampling steps are in the lower range (10 to 30 DDIM steps). However, the improvement in generated video quality is limited once the number of DDIM steps exceeds 30.

Additionally, the qualitative results for the 16-frame video are illustrated in Appendix Figure 12 of the revision. Although the generated 16-frame video shows degraded quality compared to the 12-frame video, both the 16-frame and 12-frame videos exhibit coherent motion aligned with the language instruction. Since 16 frames is not the expected sequence length for a 12-frame model, the generation quality of the 16-frame video could be further improved by training a dedicated 16-frame Seer model.

Appendix Figure 10 results

FVD:

| DDIM steps | seq len = 12 | seq len = 16 |
|---|---|---|
| 10 | 280.7 | 772.8 |
| 20 | 129.8 | 448.2 |
| 30 | 112.9 | 355.4 |
| 40 | 101.3 | 359.3 |
| 50 | 101.8 | 348.7 |
| 60 | 105.2 | 346.2 |

KVD:

| DDIM steps | seq len = 12 | seq len = 16 |
|---|---|---|
| 10 | 0.73 | 2.00 |
| 20 | 0.16 | 1.11 |
| 30 | 0.12 | 0.56 |
| 40 | 0.08 | 0.63 |
| 50 | 0.08 | 0.47 |
| 60 | 0.06 | 0.48 |
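A hedged sketch of how such a sweep can be organized (editorial illustration, not the authors' evaluation code): sample with different numbers of DDIM steps and record FVD/KVD for each setting. `sample_videos` and `compute_fvd_kvd` are hypothetical wrappers around the Seer sampler and an I3D-based metric implementation.

```python
def ddim_sweep(sample_videos, compute_fvd_kvd, real_videos, text_prompts,
               steps_grid=(10, 20, 30, 40, 50, 60), num_frames=12):
    """Return {ddim_steps: (fvd, kvd)} for each entry in steps_grid."""
    results = {}
    for steps in steps_grid:
        fake = sample_videos(text_prompts, num_frames=num_frames, ddim_steps=steps)
        results[steps] = compute_fvd_kvd(real_videos, fake)
    return results
```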
Comment

Question 4: What is the inductive bias that forces the Text Decomposer to produce coherent sub-instructions?

In our work on generating subinstruction embeddings, the identity mapping of the global instruction, the distillation of frozen 2D text-to-image guidance, and visual temporal alignment in the 3D temporal layers collectively contribute to the generation of coherent subinstructions. Specifically, based on the above three constraints, the optimization of the FSText decomposer follows a two-stage process. In the first stage, training forces the sub-instructions to have a context distribution close to the global instruction, which is interpretable by the pre-trained text-image conditional U-Net.
In the second stage, the learnable initial tokens are transformed into optimized conditional embeddings via the FSText decomposer. Through the end-to-end training of the diffusion model, these optimized conditional embeddings are constrained by the frozen pre-trained text-to-image conditioning module of the U-Net to generate related content within the image space. The training video samples provide supervision of temporal alignment within the trainable 3D layers during the fine-tuning period. The combination of the supervised visual-temporal signal from the temporally aligned 3D layers and the guidance distillation from the frozen text-to-image conditioning ensures the optimization of these subinstructions during the diffusion process.
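To visualize the kind of module being described, here is a minimal, assumption-laden sketch (not the released FSText decomposer): per-frame tokens are initialized from the global instruction embedding, then refined by text-sequential self-attention, cross-attention to the global CLIP embedding, and causal temporal attention along the frame axis. Layer counts, dimensions, and normalization are simplified away.

```python
import torch
import torch.nn as nn

class FSTextSketch(nn.Module):
    def __init__(self, dim=768, n_frames=12, n_tokens=77, heads=8):
        super().__init__()
        self.frame_tokens = nn.Parameter(torch.zeros(n_frames, n_tokens, dim))
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, global_embed):                      # global_embed: (n_tokens, dim)
        f = self.frame_tokens.shape[0]
        # Initialize every frame's tokens from the global instruction embedding.
        x = self.frame_tokens + global_embed.unsqueeze(0)
        x = x + self.self_attn(x, x, x)[0]                # text-sequential attention within each frame
        g = global_embed.unsqueeze(0).repeat(f, 1, 1)
        x = x + self.cross_attn(x, g, g)[0]               # condition on the global CLIP embedding
        xt = x.transpose(0, 1)                            # (n_tokens, n_frames, dim)
        causal = torch.triu(torch.ones(f, f, dtype=torch.bool), diagonal=1)
        x = x + self.temporal_attn(xt, xt, xt, attn_mask=causal)[0].transpose(0, 1)
        return x                                          # per-frame sub-instruction embeddings

sub = FSTextSketch()(torch.randn(77, 768))                # -> torch.Size([12, 77, 768])
```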

Official Review (Rating: 6)

The authors propose a novel text-conditioned video prediction model, "Seer". Seer leverages a pretrained text-to-image stable diffusion model, extending it along the temporal axis to generate video. The key contributions of the authors are a novel Frame Sequential Text Decomposer module for frame-aligned sub-instructions and a computation-efficient spatial-temporal attention mechanism. FSText dissects global textual instructions into temporally aligned sub-instructions for precise integration into video frames, enabling the effective use of prior knowledge embedded in pretrained T2I models. Seer's architecture allows for high-fidelity, coherent video generation that aligns with textual instructions while fine-tuning only a few layers on a small dataset. The model demonstrates superior video prediction performance on 3 egocentric task datasets, achieving significant improvements in Frechet and Kernel video distances and parameter efficiency compared to previous models.

Strengths

  1. The authors propose and combine several ways to make text-conditional video generation computationally tractable: reusing LDMs pre-trained on text-to-image tasks, and introducing a novel spatiotemporal attention that combines the benefits of bidirectional and directed temporal attention. I think the proposed spatiotemporal attention method is quite well thought out and explained, and the empirical results and ablations suggest that it is effective for various task-based video prediction tasks.

Weaknesses

  1. The design choices behind FSText are explained by the authors in Sec 4.3, but the overly complicated design choices could limit the utility of Seer outside of highly specific stepwise instruction videos. Did the authors perform any ablation studies that justified the various design choices in FSText (e.g. using cross-attention, using directed temporal attention, starting with the CLIP text embedding, etc.)?

  2. For the ablations, could these be reported on more than one dataset? Results on one particular task in SSv2 are quite convincing, but I am wondering if there could be dataset biases that seem to justify these design choices. For example, BridgeData is a robotic task dataset rather than a human one, so maybe that could be a good candidate to ensure results are not cherry-picked.

Questions

In Sec 1. Introduction: 'existing text-conditioned video generation tasks (use) text specifying the general content, such as “a person is skiing”, while our aim in the TVP task is to use the text as a “task descriptor” in robotics, such as “tipping a beer can”'. Could the authors please explain how using text as a task descriptor differs from text specifying general content? Does the VLM model encode the intent of the text in the former, and if so, are there significant differences across the two representations (e.g. CKA value)? How does it relate to the frame-aligned sub-instructions in the FSText decomposer?

Comment

Weakness 1: Did the authors perform any ablation studies that justified the various design choices in FSText?

Thank you for your suggestion. We have included additional ablation results for the FSText network design in Appendix Table 11 of the revision. To ensure a fair comparison within the same network depth, we opted to replace the ablated component with a Cross-Attention layer. Notably, the Cross-Attention layer plays a pivotal role in the FSText decomposer network for global instruction projection, and our experimentation revealed that removing this layer induces a vanishing gradient problem during training. Consequently, we did not perform an ablation study on the Cross-Attention layer in the FSText decomposer. As presented in Appendix Table 11, our proposed configuration, featuring appended Text-Sequential-Attention layers and Temporal Attention layers, outperforms the other two settings. This outcome underscores the critical functionality of the Text-Sequential-Attention layers and Temporal Attention layers in capturing textual contextual information and modeling temporal dependencies, both of which contribute effectively to the performance improvement.

Appendix Table 11. Components in FSText Decomposer (SSv2)

| Components in FSText | FVD | KVD |
|---|---|---|
| Temporal-Atten. | 125.8 | 0.13 |
| Text-Seq-Atten. | 127.7 | 0.14 |
| Text-Seq-Atten. + Temporal-Atten. (Ours) | 112.9 | 0.12 |

Weakness 2: For the ablations, could these be reported on more than one dataset?

Thank you for your constructive advice. We have incorporated additional ablation results in Appendix Table 12 and Table 13, specifically including the BridgeData dataset, in our revision. In these results, we consistently observe performance improvements with Seer across various settings. Notably, similar to the ablation results on SSv2, the additional fine-tuning "cross+temp-attn." leads to significant degradation of Seer performance, as observed in the results presented in Appendix Table 13. These findings underscore the robustness of the Seer design, reaffirming its consistent performance across different datasets.

Appendix Table 12. Ablation results of temporal attention in Inflated 3D U-Net (BridgeData)

| Temporal Attention | FVD | KVD |
|---|---|---|
| bi-direct. | 284.5 | 0.71 |
| directed. | 258.0 | 0.64 |
| autoreg. | 261.5 | 0.83 |
| win-auto. (Ours) | 246.3 | 0.55 |

Appendix Table 13. Ablation results of fine-tune setting in Inflated 3D U-Net (BridgeData)

| Fine-tune setting | Use FSText | FVD | KVD |
|---|---|---|---|
| temp-attn. | No | 410.7 | 0.97 |
| cross+temp-attn. | No | 319.9 | 1.01 |
| cross+temp-attn. | Yes | 2058.4 | 9.43 |
| temp-attn. | Yes | 246.3 | 0.55 |
Comment

Question 3: Could the authors please explain how using text as a task descriptor differs from text specifying general content? How does it relate to the frame-aligned sub-instructions in the FSText decomposer?

Thank you for your feedback. Our work places a primary emphasis on temporal movement. Unlike text that specifies general content and tends to guide the generation of similar scenes with inertial motion along the frame axis, our approach centers around a task descriptor. This descriptor guides the generation of a sequence of behaviors, necessitating the decomposition of a one-step global behavioral instruction into multi-step, fine-grained motion guidance. Compared to our method with task descriptor guidance, most text-to-video and text-video-editing methods, such as Make-A-Video [1], MagicVideo [2], Align your Latents [3], and Tune-A-Video [4], focus more on the alignment between text-specified content and the overall scene generation of frames with temporal consistency, but neglect fine-grained motion guidance at the task level.

References:

[1] Uriel Singer, et al. "Make-A-Video: Text-to-video generation without text-video data." arXiv preprint arXiv:2209.14792, 2022.

[2] Daquan Zhou, et al. "MagicVideo: Efficient video generation with latent diffusion models." arXiv preprint arXiv:2211.11018, 2022.

[3] Andreas Blattmann, et al. "Align your Latents: High-resolution video synthesis with latent diffusion models." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22563-22575, 2023.

[4] Jay Zhangjie Wu, et al. "Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7623-7633, 2023.

AC Meta-Review

The paper is concerned with video generation. In particular, it proposes to leverage text-to-image generative models, e.g. Stable Diffusion, and expand them to video. The authors do so by identifying sub-instructions for each frame's generation, thus identifying what is special about video generation compared to image generation, while leveraging current advances in image-only generative models.

The proposed spatio-temporal approach is well justified and computationally efficient. Although extending text-to-image generation models to video isn't novel, the reviewers find that the authors do a great job at empirically justifying their architectural choices, and as such find the work highly instructive for the community. Further, most reviewers appreciate the breadth of the experiments -- three video datasets -- and the strength of the presented results.

On the other hand, there are several minor concerns with the work. Some reviewers find that the manuscript could be improved for clarity of exposition. Further, the motivation of the paper is about robotics applications, but the results aren't on robotics benchmarks. In addition, although the approach performs well metric-wise, the generation horizon is no more than a second into the future.

Why not a higher score

The paper has received mostly borderline accept ratings. It is a great systems paper for an important problem with a thorough empirical investigation; however, the novelty isn't strong and the generation horizon can be improved.

Why not a lower score

The paper has received positive ratings: 3 x borderline accept and 1 x accept. All reviewers appreciate the approach, the benefits of reusing text-to-image models, and the empirical thoroughness of the justification of the algorithmic design decisions. Further, the method is evaluated on three different datasets with strong results. The reservations raised by the reviewers are minor. Hence, the paper is accepted to ICLR 2024.

Final Decision

Accept (poster)