PaperHub
Rating: 4.0/10 (withdrawn; 4 reviewers; lowest 3, highest 5, std 1.0)
Individual ratings: 3, 3, 5, 5
Confidence: 4.0 · Correctness: 2.3 · Contribution: 2.3 · Presentation: 2.0
ICLR 2025

Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion

Submitted: 2024-09-27 · Updated: 2024-12-02

Abstract

Keywords
video generation, video synthesis, computer vision, diffusion models, autonomous driving, controllable video generation

Reviews and Discussion

Official Review
Rating: 3

This paper proposes Ctrl-V, a method for bounding-box-controlled video generation. The proposed method is split into two parts. The first part is a generator that synthesizes images of bounding-box sequences. The second part takes the bounding-box images as input to condition a pre-trained video model. The results demonstrate that the bounding-box images can be used as a ControlNet-style conditioning signal for video generation.
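To make the described pipeline concrete, here is a minimal sketch of the two-stage inference flow; the object and method names (`bbox_generator.sample`, `box2video.sample`, etc.) are hypothetical placeholders, not the authors' actual API.

```python
# Illustrative sketch of the two-stage flow summarized above; all names are
# hypothetical placeholders, not the authors' actual interface.

def generate_controlled_video(first_frame, first_bbox_frame, last_bbox_frame,
                              bbox_generator, box2video, num_frames=25):
    # Stage 1: a diffusion model predicts the full sequence of bounding-box
    # frames (boxes rendered as RGB images) from the first/last bbox frames.
    bbox_frames = bbox_generator.sample(
        cond_frames=[first_bbox_frame, last_bbox_frame], num_frames=num_frames)

    # In the "teacher-forced" evaluation setting discussed later, bbox_frames
    # would instead be the ground-truth bounding-box trajectory.

    # Stage 2: a pre-trained image-to-video diffusion model generates the
    # video; the bbox frames enter through a ControlNet-style branch.
    return box2video.sample(
        init_image=first_frame, control_frames=bbox_frames,
        num_frames=num_frames)
```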

Strengths

  • Writing: The paper is well written and easy to follow.
  • Simple approach: The proposed two-stage method seems simple and easy to implement. Moreover, the method is generally sound.

Weaknesses

  • No video results: There is no supplementary website with video results. This makes the submission incomplete, as the most important part of a video-generation work is the generated videos themselves. It’s impossible to judge the results at all, especially regarding temporal consistency.
  • Generalizability: The method is trained on driving data and evaluated on that, even though the method is using a pre-trained video model trained on diverse data. It’s not clear how well this method generalizes, especially because the denoising U-Net is fine-tuned. All previous works demonstrate generalizability to any input, which is a clear advantage over the proposed method.
  • Missing comparisons: Why does Tab. 2 with the bounding box control metrics not include other trajectory-conditioned video generation methods? It’s not clear why there is only this one made-up baseline instead of adapting previous trajectory-conditioned video generation approaches to the same setup.
  • Unclear metric scores: The average precision scores in Tab. 3 are not clear. The caption says that prior works don’t evaluate driving datasets. So are these methods adapted to these datasets or do the test datasets differ here? Because it sounds like different evaluation sets were used for different methods.
  • No insights provided: The approach is very simple. Create bounding-box images with one network, then use ControlNet to condition the generation on these bounding-box images. If an approach is that simple, without some interesting twist or super convincing results, the paper needs insights. There is no ablation in the paper. For example: How important is it to use bounding-box images? What happens if just the coordinates and sizes are used as conditioning? There is no clear motivation for why the proposed method is the best way to solve this issue.

Questions

I currently rate this paper below the acceptance threshold. Video results are missing, hence it’s impossible to judge the results. Furthermore, only driving datasets are shown while all previous works show generalizability to any kind of scene.

I would like the authors to address the following questions:

  • Why are there no video results?
  • Does the method transfer bounding box conditioning to OOD prompts?

I am open to adjusting my rating based on the rebuttal.

Comment

Dear Reviewer,

Thank you for taking the time to provide your valuable feedback on our work. We appreciate your insights.

In response to your comments, we have made the following updates and would like to provide the following clarifications:

  1. We have added video demos generated by our model, accessible at this link. Also, we included a 19-page appendix in our supplementary materials, which also features 11 pages of video samples embedded in the PDF. Due to the submission size limit, it was not feasible to integrate it into the main document.
  2. Regarding the model's generalizability, we would like to clarify that our primary focus is achieving precise motion control in video generation, which is particularly important for applications such as autonomous vehicles (AVs). Although our approach utilizes an off-the-shelf general-purpose model, which suggests potential adaptability to other domains with sufficient fine-tuning data, our specific interest for this particular work lies within the AV domain.
  3. Regarding the comment on the missing comparisons, to the best of our knowledge, this is the only study that conducts bounding box prediction directly in pixel space, which is one of the novel aspects of our model. To clarify, Table 2 specifically evaluates the quality of the bounding-box sequence predictions, not the generated videos. Therefore, it would not be meaningful to compare these results with trajectory-conditioned video generation methods, as the metrics in Table 2 are tailored to bounding-box prediction performance.
  4. Regarding the comment on the unclear metric scores, it is true that the mAP evaluations are not directly comparable, as they are conducted on different datasets and with different evaluation methods. Due to the way bounding boxes are tracked and utilized in our work, the calculation of average precision differs. The details of these evaluation differences are provided in lines 448-456. However, since they can still serve as a reference for comparison, we chose to include these numbers. Additionally, we have provided detailed explanations of these mAP evaluations in the appendix of the supplementary material (Appendix B.3 and Appendix B.4).
  5. Regarding the comment on no insights provided, we acknowledge that our core idea may seem straightforward, but its novelty lies in its departure from conventional approaches. Specifically, prior work processes bounding box information as strings and extracts encodings using sequential or transformer-based models (e.g., Boximator). Moreover, existing methods overlook bounding-box trajectory prediction; they either:
    • Provide the bounding-box trajectory as input (e.g., TrackDiffusion)
    • Only focus on objects’ locations in the first and last frames, ignoring intermediate-frame bounding boxes (e.g., Boximator)

     In contrast, our approach introduces a distinct innovation: leveraging a diffusion model to predict bounding-box trajectories directly in pixel space (a minimal rendering sketch follows this list). By bridging bounding-box encoding and pixel space, our method unlocks two key advantages:
    • Scalability: It enables conditional generation of an unlimited number of bounding boxes.
    • Improved bounding box prediction: As demonstrated in Table 2, predicting bounding box trajectories in pixel space outperforms sequential methods such as Trajenglish.

     To our knowledge, this is the first work to explore bounding box trajectory prediction with diffusion models in pixel space, making our contribution novel and noteworthy.

  6. Regarding the comment on our work lacking an ablation study, we would like to point out that it includes several ablations:
    • Table 1: We compare the generation quality of our model against a fine-tuned SVD model that does not condition on bounding boxes.
    • Table 2: We ablate the Box2Video model by isolating the BBox-Generator to compare its performance with Trajenglish.
    • Table 3: We evaluate the Box2Video model using teacher-forced bounding box trajectories to assess its ability to generate videos that adhere to the provided bounding box trajectories.
  7. Regarding the comment on transferring to out-of-distribution prompts outside the scope of bounding boxes, we have not tested this capability of our model, as this paper focuses solely on using bounding boxes to control object motion.
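As referenced in point 5 above, here is a minimal sketch of what rendering a bounding-box trajectory "in pixel space" could look like. The specific color coding below is invented for illustration only and is not the authors' exact scheme (their rendering details are described in their Appendix B.1).

```python
from PIL import Image, ImageDraw

# Hypothetical color coding for illustration only: object type -> fill color,
# track ID -> border color.
TYPE_FILL = {"car": (40, 90, 200), "pedestrian": (200, 60, 40)}

def track_id_to_border(track_id):
    # Map a track ID to a repeatable RGB border color.
    return ((37 * track_id) % 256, (91 * track_id) % 256, (53 * track_id) % 256)

def render_bbox_frame(boxes, size=(512, 320)):
    """boxes: list of dicts with keys x1, y1, x2, y2, track_id, obj_type."""
    frame = Image.new("RGB", size, (0, 0, 0))
    draw = ImageDraw.Draw(frame)
    for b in boxes:
        draw.rectangle(
            [b["x1"], b["y1"], b["x2"], b["y2"]],
            fill=TYPE_FILL.get(b["obj_type"], (128, 128, 128)),
            outline=track_id_to_border(b["track_id"]),
            width=3)
    # A sequence of such frames forms a bounding-box "video" that a standard
    # image VAE can encode like any other RGB frames.
    return frame
```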

We hope these responses have addressed all your concerns. We are also ready to discuss any issues that remain unclear. If we have cleared up the confusion, we hope you will consider raising your score for our work.

Best regards.

Comment

Dear Reviewer,

To evaluate our model's generative capabilities in out-of-distribution scenarios, we tested its performance using manually crafted car crash prompts, which were absent from our training dataset. The accompanying demo (accessible via the link) simulates a collision by positioning two agents to intersect from opposing directions, resulting in a mid-scene crash. Notably, the generated clip shows the colliding cars merging into a single entity and driving off-frame, illustrating the model's behavior when faced with an out-of-distribution input.

Thank you once more for your input. We appreciate your feedback and welcome any additional suggestions.

Comment

Thanks for the response. First of all, I don't think it is good to submit a video paper without video results and then buy yourself more time to produce the visual results for the rebuttal. Video results always have to be in the initial submission so reviewers can properly assess the work in the first round. The video results lack comparisons to existing works, even just a simple ControlNet baseline for visual comparison. Also, it is not clear whether the bounding box conditioning inputs are from the test set.

Comment

Dear Reviewer,

Thank you for your feedback. We appreciate your comments and would like to address your concern regarding our video paper submission.

At submission time, we assumed that laying out video frames in the supplementary material was a sure way to have reviewers look at our generation results, and it did not occur to us to use external links, as we did not know whether they would be viewed at all. At the suggestion of multiple reviewers, we now understand that an external web page of our video results is a much better way to showcase our generations.

We appreciate you bringing up the question of whether we were buying time to generate visualizations. We take this very seriously and would like to address it with utmost conviction and complete transparency.

Our project has a public demo website and repository that have been up for more than 5 months, containing the same set of visualizations plus even more results. We did not pull any tricks to buy time for visualization generation. It was simply a regrettable choice to use frames of images as our visualization. All reviewers are welcome to verify after the paper is out of anonymous review.

We are ready to prove our statements as soon as allowed after the review period. Meanwhile, we hope the reviewer can treat this as the unfortunate choice it was, keep a fair eye on our visualization, and judge our work on the quality of our method and contribution.

Comment

I recommend against mentioning public websites, as it invites reviewers to break anonymity, even if you only mention doing it "after the paper is out of anonymous review."

Could you please address the concerns about missing visual comparisons that I mentioned? Currently, there are no proper comparisons based on videos.

Comment

The requested videos (comparisons of the fine-tuned SVD baseline vs. Ctrl-V) have been added to our anonymous repository at the bottom.

The bounding box conditioning inputs are from validation sets specified in Appendix B.3 Dataset Configuration.

Official Review
Rating: 3

This paper presents a bounding-box (bbox) controllable video generation method called "Ctrl-V." Ctrl-V employs a two-stage pipeline: it first generates a bbox sequence based on both the first and last frames, and then the diffusion model Box2Video generates a video conditioned on the sequence of generated bboxes. To evaluate the performance of this new problem formulation, a new benchmark has been introduced.

Strengths

This paper proposes an innovative video generation method that involves generating a sequence of bounding boxes. Compared to other approaches, it demonstrates superior generation quality.

Weaknesses

  1. The figures in the paper require significant improvement. The references to the figures are unclear. For Figure 1, the information presented is sparse. Figure 2 is difficult for readers to interpret due to unclear connections. Additionally, Figure 3's purpose is not well-defined within the context of the paper.
  2. The experimental section lacks validity. For Table 1, there is no information regarding the last column for Multi-view on the nuScenes dataset. The meaning of "Ctrl-V + BBox Generator + Box2Video" is ambiguous, as it appears that Ctrl-V is already integrated with the BBox Generator and Box2Video.

Questions

Where is the corresponding description for the figures in this paper?

Comment

Dear Reviewer,

Thank you for your valuable feedback. We appreciate your input and will revise our manuscript accordingly. Specifically, we will add figure references in the text and modify the notation in Table 1 to “Ctrl-V: BBox Generator + Box2Video”.

To further illustrate our model’s performance, we have provided video demos generated by our model, available at link. These demos offer additional insights into our model’s capabilities. There is also supplementary material with details of our method’s implementation.

Given that your concerns appear to be primarily formatting-related, we would appreciate clarification on two points:

  • Are there any substantive or technical issues we should address beyond formatting?
  • If we resolve all formatting issues, would this potentially influence your evaluation scores?

Thank you again for your time and expertise. Best regards.

Comment

Dear Reviewer,

Thank you for your valuable feedback. We have revised our manuscript according to your suggestions. We would appreciate any further feedback you may have and look forward to the opportunity to continue refining our work.

Best regards

Comment

Dear authors, I am glad to see your hard work on this rebuttal. However, there are still several issues in the current paper.

  • I suggest highlighting the changed sentences in your paper (e.g., using \textcolor{blue}{}) rather than simply adding them. It is difficult to identify which parts you have added in this rebuttal without reading the entire paper again.
  • Regarding Figure 2, while you have provided more explanations for each element, why not add these explanations directly to the figure? Additionally, I cannot locate the "red vectors" mentioned in the figure.
  • ...

Although the authors have shown a great attitude in this rebuttal, and the amount of work is sufficient for this paper, readability is a basic requirement for any paper. In summary, I believe this paper still falls below ICLR's acceptance standards. Since there is no option for a score of 4, I will maintain my current score.

Comment

Dear Reviewer,

Thank you for your insightful feedback. We greatly appreciate the time you invested in our work.

In response to your comments, we:

  • Rewrote Section 3 in its entirety
  • Rewrote portions of Section 4
  • Highlighted all rebuttal changes in red
  • Created a global post summarizing revisions

We kindly request your review of the revised manuscript. Please share any further suggestions.

Official Review
Rating: 5

This paper presents an approach to generating driving-scene videos by utilizing an existing pre-trained image-to-video diffusion model, Stable Video Diffusion. Notably, all the bounding box conditionings are encoded into RGB (in contrast to feature representations), and the off-the-shelf VAE is used to map the conditionings to a latent representation. The proposed method first interpolates bounding boxes placed on the first frame and the last frame (and optionally other frames?) to generate a sequence of bounding-box conditionings, and then realistic driving videos are generated with a ControlNet-based conditioning mechanism. The authors demonstrate reasonable generation quality.
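To illustrate the "bounding boxes encoded into RGB and passed through the off-the-shelf VAE" idea concretely, here is a minimal sketch assuming the Hugging Face diffusers `AutoencoderKL` interface; the checkpoint name and preprocessing are illustrative, not the authors' exact setup.

```python
import torch
from diffusers import AutoencoderKL

# Off-the-shelf latent-diffusion VAE; the checkpoint name is illustrative.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

@torch.no_grad()
def encode_bbox_frame(bbox_rgb):
    # bbox_rgb: (1, 3, H, W) float tensor in [0, 1] containing the rendered
    # bounding-box image (boxes drawn as colored rectangles on black).
    x = bbox_rgb * 2.0 - 1.0                      # scale to [-1, 1]
    latents = vae.encode(x).latent_dist.sample()  # (1, 4, H/8, W/8)
    return latents * vae.config.scaling_factor
```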

Strengths

  • The authors demonstrate that bounding box conditioning can be encoded with the off-the-shelf VAE, which is interesting.
  • Investigating the ability to generate driving scene videos with existing image-to-video diffusion models (i.e., SVD) is interesting.
  • The quality of generated videos seems to be reasonable, although the authors did not provide samples of generated videos (in a video format) in the supplementary materials.

Weaknesses

Overall, paper clarity is lacking, and significant revision might be needed to meet the standard conference paper quality. Since I am not specifically an expert in driving video generation, it was difficult for me to precisely understand the technical contribution and soundness of the paper. (Hence, I set the confidence score as 3.) Here are some suggestions to improve clarity:

  • Generated videos (in a video format) are not provided in supplementary materials. Since this paper works on video generation, it is important to provide samples of generated videos in a video format.
  • The figure of the overall framework (Figure 2) is not referred to in the paper. Since there is no explanation related to this figure (for instance what is c', k?), it is difficult for me to judge the correctness.
  • Some equations are missing. For instance, the authors mentioned Equation 3.3, but the equation does not exist in the paper.
  • I would suggest defining all the notations in the authors' papers instead of directing the readers to other papers for the definition of notations (for instance, in section 3.1).
  • In section 4.2, the authors set up baselines, but it is not clear to me exactly what these baselines are. For instance, what is "Teacher-forced Box2Video generation"? Exactly how is it different from the proposed approach? I would appreciate it if the authors could explain all the baselines in detail.
  • The motivation for the input conditioning seems unclear. Could the authors provide reasons why they experiment with first/last-frame conditioning, 2D/3D conditioning, first/three-frame conditioning, and trajectory conditioning in the last frame?
  • Are there indeed no baselines from previous works that can be compared with the authors' works? How is the authors' method superior to relevant works? Since I am not familiar with driving scene generation, I will rely on other reviewers' judgments for this. But in general, I would like to see how relevant baselines (even if they are not directly applicable to your problem setting) are inferior to your approach.

Questions

I would appreciate it if the authors could revise the paper to improve clarity. Please see above for some suggestions.

Comment

Dear Reviewer,

Thank you for taking the time and effort to review our manuscript. We have carefully considered your feedback and revised the manuscript accordingly. Below, we provide detailed responses to address your concerns and clarify the points you highlighted.

  1. Regarding the comment on video demos, we have provided a new link containing several sample clips generated by our model.

  2. Regarding the notation and reference for Figure 2 (Pg 5), we have updated the figure and its caption. Additionally, we have included references to the figure in the main text.

  3. Regarding the missing equation in Section 3.3 (Pg 6), we have resolved the issue.

  4. Regarding the definition of the scaling equations for EDM in Section 3.1 (Pg 4), we have included them in Appendix E (Pg 32); for reference, the general form of the EDM preconditioning is recalled after this list.

  5. Regarding the experimental setup in Section 4.2 (Pg 7), our generation pipeline consists of two models: the BBox generator and the Box2Video model. The BBox generator creates plausible bounding-box trajectories based on the provided conditioning bounding-box frames. This trajectory is then used as input to the Box2Video model to condition the video generation. In the "Teacher-Forced Box2Video Generation" scenario, instead of using the BBox generator to create bounding box trajectories, we use the ground-truth bounding-box trajectories provided by the dataset.

  6. Regarding the comment on the motivation for input conditioning, our goal is to achieve precise control of generated objects by using a sequence of bounding boxes. While users can provide entire bounding box sequences, these can be hard to create. We sought to simplify this process by creating the BBox Generator model, which only requires users to provide the beginning and ending frames. The number of initial frames used is chosen to test how our model responds with and without knowledge of the bounding boxes’ momenta (i.e., single-frame bounding boxes only contain position data, whereas using 3 initial frames allows the model to deduce speed and acceleration information, etc.). As suggested by reviewer EsMH, we also included experiments where the last-frame bounding boxes are absent (Appendix C.8, Pg 28). We believe these experimental setups are crucial for testing our model’s ability to predict and control vehicle motions in AV videos.

    In response to the comment on the types of conditioning (e.g., 2D/3D bounding box frames), it is worth noting that some of the datasets we use contain 3D bounding boxes, while others do not. We utilized whichever bounding box data was available in each dataset. Additionally, to the best of our knowledge, this is the first work to explore 3D bounding box conditioning in video generation. Conditioning on 3D bounding boxes in pixel space is very straightforward, which makes this approach a novel contribution.
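For readers unfamiliar with EDM (as noted in point 4 above), the general form of the EDM preconditioning of Karras et al. (2022), on which SVD's training objective builds, is given below; the exact constants used in the paper's Appendix E may differ.

```latex
D_\theta(x;\sigma) \;=\; c_\mathrm{skip}(\sigma)\,x \;+\; c_\mathrm{out}(\sigma)\,
F_\theta\!\bigl(c_\mathrm{in}(\sigma)\,x;\; c_\mathrm{noise}(\sigma)\bigr),
\qquad
c_\mathrm{skip}(\sigma)=\frac{\sigma_\mathrm{data}^{2}}{\sigma^{2}+\sigma_\mathrm{data}^{2}},\quad
c_\mathrm{out}(\sigma)=\frac{\sigma\,\sigma_\mathrm{data}}{\sqrt{\sigma^{2}+\sigma_\mathrm{data}^{2}}},\quad
c_\mathrm{in}(\sigma)=\frac{1}{\sqrt{\sigma^{2}+\sigma_\mathrm{data}^{2}}},\quad
c_\mathrm{noise}(\sigma)=\tfrac{1}{4}\ln\sigma .
```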

评论
  1. Regarding the comment on prior work, several related studies, such as Boximator, TrackDiffusion, and Direct-A-Video, address similar challenges. However, these works differ from ours in task formulation and even more so in implementation and evaluation, with many focusing on text-to-video (T2V) models. These differences make direct and fair performance comparisons challenging, if not impossible.

    Nevertheless, we include the performance metrics of other works alongside our results in Tables 1, 2, and 3 (Pg 7, 8, 10). While some of these comparisons are not directly applicable, they offer a general sense of how our method performs relative to existing approaches.

    We would like to stress that, even without directly comparable metrics from prior work, our work still shows clear novelty in its approach. Previous studies use Fourier embedders to encode bounding boxes, while we propose to integrate them into pixel space.

    Specifically, prior works use Fourier embedders to encode individual objects' bounding box coordinates, concatenating the coordinates of multiple objects in a frame into a string that is then passed to a video generation network for conditioning. However, this approach has several drawbacks:

    • Limited object control: This type of conditioning typically caps the number of objects that can be controlled.
    • Learning challenges: The relationship between the Fourier-embedded bounding box coordinates and the frame is difficult to learn. To mitigate this, Boximator introduces a "self-tracking" technique to ensure objects adhere to bounding box conditioning. However, this technique requires a two-stage training process (discussed in Lines 144–160, Pg 3).

    To the best of our knowledge, our work pioneers the integration of bounding box conditioning into pixel space, leveraging color and shape to embed rich object information. Notably, our approach overcomes the object count limitations of existing models. Additionally, by utilizing ControlNet and pixel-space conditioning, we enable a more intuitive relationship between conditioning signals and generated frames.
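To make the contrast above concrete, here is a minimal sketch of the kind of Fourier (sinusoidal) embedding of box coordinates used in the prior works mentioned; the frequency count and layout are illustrative and not taken from Boximator or TrackDiffusion.

```python
import torch

def fourier_embed_boxes(boxes, num_freqs=8):
    """boxes: (N, 4) tensor of normalized (x1, y1, x2, y2) in [0, 1].
    Returns an (N, 4 * 2 * num_freqs) embedding, one row per object."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=torch.float32)   # (F,)
    angles = boxes.unsqueeze(-1) * freqs * torch.pi               # (N, 4, F)
    emb = torch.cat([angles.sin(), angles.cos()], dim=-1)         # (N, 4, 2F)
    # The per-object embeddings are then stacked into a token sequence whose
    # length grows with the number of objects -- the source of the object cap.
    return emb.flatten(1)
```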

We hope these responses have addressed all of your concerns. We are also happy to discuss any remaining issues that may still be unclear. If we have clarified any confusion, we kindly ask that you consider revising your scores for our work.

Comment

Thank you for the clarification. My main concern is still the lack of clarity. While the authors have addressed some of my concerns in the rebuttal, it is still difficult to understand the whole method and experiments by only reading this paper. For instance, the diagram in Figure 2 could be improved by directly inserting text into the figure. I have read the revised caption, but it was hard to associate the explanation with the figure (for instance, what is a red vector? Is it dark red or light red?). Another example is the lack of explanation of the baselines. Although the authors have clarified this in the rebuttal, it should also be clarified in the paper. I feel the paper's clarity still does not meet the standard of an ICLR paper.

Note that I am not an expert on driving scene generation, but the effective reuse of existing VAE to condition bounding boxes is interesting. To acknowledge this, I raised my score from 3 to 5, where rejection reasons came from lack of organization in the paper.

Comment

Dear Reviewer,

Thank you for your insightful feedback. We greatly appreciate the time you invested in our work.

In response to your comments, we:

  • Rewrote Section 3 in its entirety
  • Rewrote portions of Section 4
  • Reworked Figure 2
  • Highlighted all rebuttal changes in red
  • Created a global post summarizing revisions

We kindly request your review of the revised manuscript. Please share any further suggestions.

Comment

Thank you for the revision. While the clarity of the method section has improved, I am maintaining my current score (5, but leaning toward 4) due to the insufficient organization of the paper.

For instance,

  • The authors should define the problem formulation before discussing the method. What exactly is the video generation conditioned on? While it is mentioned that the video is conditioned on bounding boxes, specific attributes associated with bounding boxes are not explained. Terms such as "object type" and "orientation" appear suddenly in Section 3.5, which was confusing.
  • Bounding box representation seems to be a core contribution. It is recommended that a detailed explanation be provided in the main paper.
  • My concerns about the experiment section, such as the clarity of the baselines, remain unaddressed. For instance, the baselines are introduced without formal explanation, which I believe many people can not understand.
  • Notation misuse: For instance, both $K$ and $N$ are used to represent the frame length. Similarly, $z_{\epsilon}$, defined in Section 3.1, does not appear to be used in subsequent sections.

I firmly believe that the paper requires significant revision that is not within a reasonable range addressable during the ICLR rebuttal. I strongly recommend taking additional time to carefully reorganize the paper.

Comment

Thank you for your detailed feedback and for taking the time to review our revision. We appreciate your acknowledgment of the improved clarity in the method section.

We understand your concerns regarding the organization of the paper and would like to assure you that we have carefully addressed all the points you raised in this revision:

  1. Problem Formulation and Conditioning on Bounding Boxes: We have clearly defined the problem formulation several times in our paper:

    • Abstract (Ln 16-19):

      To achieve this, we 1) use a separate, specialized model to predict object bounding box trajectories given the past and optionally future locations of bounding boxes, and 2) generate video conditioned on these high quality trajectory predictions

    • Introduction (Ln 85-89):

      “Bounding box Motion Generations with Diffusion: We devise a novel diffusion based approach for generating 2D/3D bounding box trajectories at the pixel-level (as 2D videos) based on their initial and final states, and the first frame.“ “2D-bounding box and 3D-bounding box Conditioning: We condition on 2D or 3D Bounding boxes in order to provide a fine-grained control over the generated videos.”

    • The entire Section 3.2 (Ln 208-220):

      “1. First, we generate bounding box frames using our diffusion based bounding box predictor network, the BBox Generator, which is shown on the left side of Figure 2. These frames contain only bounding boxes. They make up a video of moving (or stationary) bounding boxes and it serve as the “skeleton” for the generated video.” “2. Then, we generate a video using our video generator network, Box2Video, shown on the right side of Figure 2, where the bounding boxes frames act as the control signal. The bounding boxes in each frame determine the objects generated in the corresponding frames of the video.

  2. Bounding Box Representation Contribution: While the bounding box representation is an important aspect of our work, it is not our sole contribution. We also introduce the use of a diffusion model for predicting object trajectories, which is thoroughly discussed in the abstract, introduction, experiments, and conclusion sections.

    Additionally, we have provided a detailed explanation of the bounding box representation in the text:

    "We encode information such as track ID, object type, and orientation for each bounding box using a combination of visual attributes, including border color, fill color, and markings. Specifically, the track ID represents a unique identifier for each tracked object across frames, the object type specifies the category of the object (e.g., car, pedestrian), and the orientation indicates the direction the object is facing."

    Details about how these bounding box frames are rendered can be found in Appendix B.1. We chose to include this information in the appendix because space in the main paper is limited, and these details, while helpful, are non-essential to our main contributions. Moreover, we provide multiple visualizations throughout the main paper to ensure clarity (Figures 1, 2, 3).

  3. Baseline Model Explanation: The Stable Video Diffusion (SVD) baseline model was first introduced in detail in the preliminary Section 3.1 (Ln 175-204):

    "In this section, we provide an overview of the Stable Video Diffusion (SVD) (Blattmann et al., 2023a;b) model, due to its importance in our approach. SVD is an image-to-video (I2V) model that employs video diffusion… "

    We have added a new footnote about the baseline model on Line 375-377.

  4. We have fixed the typos in Section 3.1.

We understand your view that significant revisions were needed; however, we believe these changes and clarifications have thoroughly addressed your concerns. We hope the revisions meet your expectations.

Official Review
Rating: 5

This paper presents a controllable video generation pipeline for autonomous driving, where bounding boxes are first generated and then used as conditions for video generation. Experiments on various driving datasets are conducted to evaluate the pipeline's capabilities.

Strengths

  • The paper introduces a novel diffusion-based approach for generating 2D/3D bounding-box trajectories at the pixel level.
  • This paper proposes a two-part controllable video generation method that first generates trajectories and then uses these trajectories to condition video generation, achieving solid performance in autonomous driving scenarios.
  • The experiments cover multiple autonomous driving datasets and provide sufficient quantitative and qualitative results, effectively validating the model’s applicability in autonomous driving contexts.

Weaknesses

  • The primary distinction of the proposed two-part method from one-stage generation is the additional step of bounding-box trajectory generation. However, the paper does not discuss the specific advantages of this intermediate trajectory generation step. Including a comparison would help clarify the benefit of this two-part method.
  • While AP is used to assess bounding-box generation location accuracy, this paper lacks evaluations of trajectory smoothness, rationality, and temporal consistency. These aspects are important in assessing the realism of generated motion trajectories.
  • Methods compared in the paper, such as Boximator, are designed for general scenarios, while this paper only concentrates on driving scenes, making it difficult to ensure fair comparisons. The paper does not clarify whether the proposed method could work in broader contexts beyond autonomous driving.
  • In Table 1, the Teacher-forced method performs worse than the BBox-Generator-and-Box2Video-combination method on KITTI and BDD datasets. Does this imply that the Box2Generator model is not optimal? Discussion on this phenomenon would be valuable.

Questions

  • It would be helpful to see more ablation studies, such as whether using the bounding box of the last frame is necessary and the role of the adapter layer.
  • More specific visualizations or numerical results would be valuable for cases involving occluded objects or objects that appear midway, as mentioned in the model.
  • Discussion on failure cases is insufficient.
  • The two-stage generation method involves two diffusion forward processes. However, there is a lack of discussion regarding the computational overhead and parameter cost associated with this approach.
  • It would be insightful to explore whether the model can handle more control information.

Comment

Dear Reviewer,

Thank you for reviewing our paper and providing constructive feedback. We appreciate your time and effort. We have revised our manuscript. Below, we'd like to address your concerns and clarify the points you've raised.

  1. Regarding the suggestions for an ablation study, we conducted additional experiments (Appendix C.8, Pg 28) to investigate the effect of conditioning on various bounding-box frames on Ctrl-V's video generation capabilities. Notably, we assess Ctrl-V's performance with:

    • No final bounding-box frame
    • Replacement of the final bounding-box frame with a trajectory frame, where central points represent bounding boxes

    Key findings from our ablation study include: a) Complete ground-truth bounding-box trajectory conditioning yields the best generation quality; b) Even minimal conditioning with the final frame’s bounding boxes’ central points substantially enhances quality over the no-final-frame baseline. Additional details, findings, and illustrations supporting this study are available in the supplementary material (Pg 28-30).

  2. Regarding the concern about visualizing occluded objects and newly appearing objects, we supplement our analysis with Figure 25 in Section E.2 (Pg 33). Figure 25 illustrates the model's robustness in handling:

    • New object entry: cars and pedestrians are correctly assigned to bounding boxes
    • Occlusion: cars temporarily obscured during traffic crossover
    • Reappearance: The car re-emerges with preserved appearance after occlusion during crossover

    Moreover, in Section C.6 (Pg 25-26), we examine the model's performance in turning scenarios, marked by abrupt changes in viewpoint, resulting in many instances of new object emergence.

  3. Regarding the suggestion to discuss failure cases, we present a new Section E.1 (Pg 32), which examines the limitations of our model in capturing fine visual details and the challenges of resolution degradation.

  4. Regarding the concern about computational overhead, our pipeline was trained and evaluated on a single A100-80GB GPU. A detailed account of the model's architecture and parameter count is provided in Section A.5 (Pg 16).

  5. Regarding the concerns about handling additional control information, our current work focuses specifically on controlling object motion in video generation via bounding boxes. However, our foundation model, an SVD network, is inherently flexible and can accommodate text and other types of conditioning. Future integration of other conditioning is feasible but falls outside the scope of this study.

Best regards.

Comment

Thanks for the detailed response from the authors; it partly resolves my concerns. I have also read the comments from other reviewers and have decided to raise my score to 5.

For the past two years, I have mainly worked on driving-scene generation, especially on the nuScenes dataset, so it is not hard for me to understand the motivation of this paper given my prior familiarity with the direction. When I put myself in a general reader's shoes, however, I notice that the biggest problem of this paper is its organization. Obviously, the paper's clarity still does not meet the standard of ICLR, as pointed out by other reviewers. The paper requires very significant revision that cannot reasonably be completed during the ICLR rebuttal.

I have noticed that the authors worked really hard to resolve the concerns raised by the reviewers, concerns which should have been addressed before submission. I would suggest that the authors revise this paper carefully and resubmit it to another top conference, since its idea and insight are really good.

Comment

Dear Reviewers,

We appreciate your feedback and have revised our manuscript accordingly:

Writing Revision

  • Revised figures and captions, including Figure 2, with added inline references (Pg 5 & 9). [aa8J, Y7ge]
  • Manuscript revision for writing clarity: we have rewritten the entire Section 3 and revised portions of Section 4 (Pg 4-6). [aa8J, Y7ge]

New Experiments

  • Added new experiment results evaluating performance without last-frame bounding boxes conditioning (Appendix C.8, Pg 28-30). [EsMH]

New Discussions

  • Provided model capacity information (Section A.5 (Pg 16)). [EsMH]
  • Included discussion on failure cases (Section E.1 (Pg 32)). [EsMH]

New Supplementary Materials

  • Supplemented with video demos (https://anongreenelephant.github.io/ctrlv.github.io/). [oRa5, aa8J, Y7ge]
  • Our supplementary materials now include simulated nuScenes intersection scenarios featuring manually added agents on the bird's-eye-view map. These scenarios display the new agents' bounding boxes projected into the ego view and include videos generated by our model highlighting interactions with these agents. You can view these videos via the provided link.
  • We also offer a car crash simulation in our demos (accessible via link) to showcase our model's ability to generate outputs in out-of-distribution scenarios. [oRa5]

We hope the newly provided results alleviate any confusion. We welcome any further feedback you may have and look forward to the opportunity to continue refining our work.

Authors

Comment

In response to suggestions from Reviewers aa8J and Y7ge, we have made substantial revisions to improve the clarity of our manuscript. Specifically, we have completely reworked the presentation of our method in Section 3 (Pg 4-6) and updated Figure 2 (Pg 5), which details our architecture. The revised method explanation uses Figure 2 to guide the reader through the details of our approach. Additionally, we have included a table defining our notation in Appendix E (Pg 32), as well as explanations of the notations used in prior work, which are detailed in the Preliminaries in Section 3.1. All changes made in response to reviewer feedback are marked in red.

In addition, we firmly believe our research makes significant contributions to:

  • Motion Prediction: Predicting object motions using diffusion network
  • Pixel Space Conditioning: Exploring novel bounding box conditioning techniques in pixel space.
  • Multimodal Information Encoding: Encoding object information through color and shapes.

We kindly invite you to review the video demonstrations on the anonymous webpage. These demonstrations highlight the exceptional results achieved by our method, emphasizing its significant contribution to advancing controllable video generation. We would like to emphasize that the autonomous vehicle (AV) research community invests considerable resources in developing highly accurate motion models for vehicles. By integrating our architecture, video generation models can accurately manipulate pedestrian and vehicle movements, addressing a critical need in AV research. This key application motivated our work and informed our architectural design.

We greatly value this opportunity to work with you in enhancing our manuscript. We welcome any further feedback you may have and look forward to continuing to refine our work.

Authors.

Withdrawal Notice

We appreciate the time and effort the reviewers invested in evaluating our manuscript. We are grateful for the constructive feedback provided, particularly from Reviewer EsMH, who acknowledged the significance of our contribution.