E-Motion: Future Motion Simulation via Event Sequence Diffusion
We propose to integrate the strong learning capacity of the video diffusion model with the rich motion information of an event camera as a motion prediction framework.
Abstract
Reviews and Discussion
The paper proposes a novel approach to integrate event-sequences with a video diffusion model for event-based future motion prediction. The authors integrate the learning capacity of video diffusion models with the rich motion information of event cameras to create a motion simulation framework and propose to align the event-sequence diffusion model with the real-world motion via a reinforcement learning process. They demonstrate the effectiveness of their method in various scenarios and highlight its potential in several downstream applications.
Strengths
(1) The paper makes the first attempt to combine event-based sensors with video diffusion models, offering a unique solution for future motion prediction. (2) The paper provides a unique solution to align the pre-trained event-sequence diffusion model with real-world motion via a reinforcement learning process. (3) The paper includes sufficient testing and validation, demonstrating the effectiveness and potential of the proposed method.
Weaknesses
(1) This paper lacks a deep analysis of the relationship between event cameras and future motion estimation tasks. If only the role of high temporal resolution is emphasized, high-speed cameras are also an alternative, and their spatial content is richer. (2) The proposed solution leans towards image-based techniques and fails to exploit the characteristics of events. (3) The rationale for incorporating reinforcement learning remains unclear to me. I hope the authors can provide a more convincing justification. (4) Some experiment settings lack explanations. For example, in Table 2, the specific processes corresponding to 'T' and 'S+T' are not clearly described. Although I can infer from the supplementary materials that they correspond to spatial and temporal attention layers, this is difficult for readers to understand in the main text.
Questions
See Weaknesses.
Limitations
No
Q1: Event V.S. High-Speed Camera
There are three main reasons why event data outperforms high-speed cameras:
(1) Data characteristics. Event cameras record only dynamic intensity changes and have extremely high temporal resolution, which means they capture a wealth of motion information. Compared to RGB data from evenly spaced exposures and the often difficult-to-obtain optical flow data, event data has a natural advantage for the task of predicting future motion. Moreover, owing to its compact structure, event data can be sensed with low latency, which is another strength.
(2) Measurement requirements. For capturing high-temporal resolution motion information with a high-speed camera, we usually need to enhance the lighting condition or highlight our target, since the exposure time of a high-speed camera is very short. Otherwise, the captured images will be blurry and dark, making the content difficult to discern.
(3) Cost. The acquisition and usage costs of high-speed cameras far exceed those of event cameras. Theoretically, the smallest time interval for data generation by an event camera is 1 μs, which corresponds to a maximum temporal resolution of 1e6 frames per second. In practice, many studies, e.g., [1,2], have confirmed that event cameras can easily achieve nearly 1,000 frames per second while maintaining good semantic information. To achieve the same frame rate, the computational and storage costs for high-speed cameras are significantly higher than those for event cameras. Moreover, event cameras have a larger dynamic range than conventional cameras, allowing them to capture object motion effectively even under poor lighting conditions, as shown in Fig. 1 of the uploaded PDF file.
[1] Tulyakov, Stepan, et al. "Time lens: Event-based video frame interpolation." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.
[2] Tulyakov, Stepan, et al. "Time lens++: Event-based frame interpolation with parametric non-linear flow and multi-scale fusion." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
Q2: Event-based Design
In fact, our proposed method takes full advantage of the characteristics of event data. Specifically, as detailed in Section 4.1 (Lines 153-165) of the main text, under High-Temporal Resolution Guided Sampling, we exploit the high temporal resolution and flexibility of event data by segmenting the temporal bin into multiple sections, which contain rich motion information from the events. During the reverse diffusion process, we replace pure Gaussian noise with these high-temporal resolution voxels in each denoising phase. This strategy enables us to achieve a high-temporal resolution diffusion process by utilizing these detailed event representations.
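For intuition, the following is a minimal sketch, not our released code, of this guided-sampling idea; `voxelize`, `denoise_step`, the noise schedule `alphas_cumprod`, and the latent shape are placeholder assumptions.

```python
import torch

def guided_sampling(events, denoise_step, voxelize, alphas_cumprod,
                    num_steps=25, num_sections=3, shape=(25, 4, 40, 64)):
    """events: (N, 4) tensor of (x, y, t, polarity); returns a denoised latent sequence."""
    # 1) Split the conditioning window into several temporal sections with rich motion cues.
    t_min, t_max = events[:, 2].min().item(), events[:, 2].max().item()
    edges = torch.linspace(t_min, t_max + 1e-9, num_sections + 1)
    section_voxels = [
        voxelize(events[(events[:, 2] >= lo) & (events[:, 2] < hi)], shape[1:])
        for lo, hi in zip(edges[:-1], edges[1:])
    ]

    # 2) Reverse diffusion: instead of keeping the prompt positions as pure Gaussian noise,
    #    re-inject the (appropriately noised) high-temporal-resolution voxels at every step.
    x = torch.randn(shape)
    for i in reversed(range(num_steps)):
        alpha_bar = torch.as_tensor(alphas_cumprod[i])
        for k, v in enumerate(section_voxels):
            x[k] = alpha_bar.sqrt() * v + (1 - alpha_bar).sqrt() * torch.randn_like(v)
        x = denoise_step(x, i)   # one reverse step of the event-sequence diffusion U-Net
    return x
```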
Q3: Reinforcement Learning Incorporation
Reinforcement learning is a training strategy that stabilizes and enhances the results generated by diffusion. Diffusion models are trained on decoupled steps of a continuous probability flow. Since the model cannot accurately estimate the score function at each step, errors accumulate across the reverse diffusion steps. RL-based methods can take this accumulated error into consideration because the reward is modeled directly on the final reconstructions. Furthermore, Figure 10 in our Appendix demonstrates the effectiveness of using reinforcement learning for motion alignment. Notably, as shown in Figure 10(b), even extensively pre-trained video diffusion models can yield unstable and flawed outcomes in challenging tasks like motion prediction. After applying motion alignment through RL, however, our model delivers more stable results (Figure 10(c)), highlighting the importance of incorporating RL-based motion alignment.
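As a conceptual illustration (a sketch under assumed interfaces, not our exact training code), the reward is computed only on the final reconstruction and then credited to the whole sampled denoising trajectory, which is how the accumulated error enters the objective; `sample_trajectory`, `compute_fvd`, `compute_ssim`, and `ppo_update` are hypothetical placeholders.

```python
import torch

def motion_alignment_step(model, prompt_voxels, gt_sequence,
                          sample_trajectory, compute_fvd, compute_ssim, ppo_update,
                          w_fvd=1e-3, w_ssim=1.0):
    # Roll out the full reverse-diffusion chain, keeping per-step log-probabilities.
    final_video, log_probs = sample_trajectory(model, prompt_voxels)

    # Trajectory-level reward on the *final* output (FVD and SSIM, cf. Lines 187-190),
    # so errors accumulated over all denoising steps show up in one scalar signal.
    with torch.no_grad():
        reward = -w_fvd * compute_fvd(final_video, gt_sequence) \
                 + w_ssim * compute_ssim(final_video, gt_sequence)

    # Credit the terminal reward to every denoising action and run a clipped PPO update.
    returns = torch.full((len(log_probs),), float(reward))
    return ppo_update(model, log_probs, returns)
```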
Q4: Experiment Settings
Yes, "T" and "S" indeed represent the temporal and spatial attention layers. We have added the relevant annotations to the table. We will carefully review the figures and tables in the main text and make corrections in subsequent versions. Thank you for pointing this out.
Dear Reviewer vmNW
Thanks for your time and effort in reviewing our manuscript and the favorable recommendation. In our previous response, we addressed your remaining concerns directly and comprehensively. We are looking forward to your further feedback on our responses.
Best regards,
The authors
Dear reviewer vmNW,
Thanks for your time and effort in reviewing our paper. In our previous response, we provided detailed and comprehensive explanations to resolve your concerns. We look forward to and would sincerely appreciate your further feedback.
Best regards,
The authors
Dear reviewer,
Your questions seem to have been addressed by the authors. Can you please comment to confirm?
The paper explores video diffusion models on the modality of information captured by event cameras. A Stable Video Diffusion model is fine-tuned on an event stream dataset. On top of the traditional diffusion setup, additional training is performed using FVD and SSIM losses as rewards in a PPO scheme. A method to inject motion priors during inference is further proposed. The method is evaluated in terms of generation quality and on downstream segmentation and object tracking applications.
Strengths
- The paper explores an interesting problem of predicting future motion based on temporally dense settings offered by event cameras.
- The proposal builds on the successes of SVD models and adapts them to work in the event stream domain. Additional techniques such as PPO-based optimisation and guided sampling are also interesting.
- The writing is sufficient, and the presentation of the results is good.
Weaknesses
- While the main motivating point is to benefit from the specifics of event stream data, the processing of it is done by treating it mostly like RGB stream, including VAE and CLIP encoders (B.1 Fig. 5). Does this not abandon the benefits of the event stream data?
- The majority of the metrics used are defined and proposed in the RGB space (FID, SSIM, LPIPS). However, they appear to be applied on top of the event stream data. It is not immediately clear whether this is correct or would signal the results in the same way. Moreover, the results appear to be presented alongside the evaluation done on RGB modality, although these are not directly comparable.
- The main motivation is "future motion estimation"; however, the predictive accuracy is measured in a more perceptual, structural way (SSIM, LPIPS) and not on more "raw" metrics like PSNR or MSE.
Questions
The main point to address in the rebuttal is the discrepancy between the commonly RGB-based metrics like FID, SSIM, LPIPS, and whether it makes sense to apply them to event data.
Limitations
The limitations are discussed.
Q1: Event-specific Design
The authors want to note that the proposed method indeed incorporates event-specific designs. Specifically, during the high temporal resolution guided sampling stage in Section 4.1 of the main text, our method fully leverages the high temporal resolution and flexible sampling capabilities of event cameras. We divide the temporal bin of event data into multiple sections and achieve a high-temporal resolution diffusion process by prompting the model with those high-temporal resolution event representations.
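For clarity, here is a minimal sketch of how one temporal section of raw events can be accumulated into a 3-channel voxel that RGB-pretrained encoders (VAE, CLIP) can consume; the channel layout and normalization below are illustrative assumptions, not our exact recipe.

```python
import torch

def events_to_voxel(events, height, width):
    """events: (N, 4) tensor of (x, y, t, polarity in {-1, +1}) -> (3, H, W) voxel."""
    voxel = torch.zeros(3, height, width)
    x, y = events[:, 0].long(), events[:, 1].long()
    t, p = events[:, 2], events[:, 3]

    flat = y * width + x                                   # flatten (y, x) for index_add_
    t_norm = (t - t.min()) / (t.max() - t.min() + 1e-9)    # timestamps normalized to [0, 1]
    pos, neg = p > 0, p < 0

    # Channels 0/1: per-pixel counts of positive / negative events.
    voxel[0].view(-1).index_add_(0, flat[pos], torch.ones_like(t[pos]))
    voxel[1].view(-1).index_add_(0, flat[neg], torch.ones_like(t[neg]))
    # Channel 2: accumulated normalized timestamps, a coarse motion-recency cue.
    voxel[2].view(-1).index_add_(0, flat, t_norm)
    return voxel
```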
Moreover, owing to large and diverse training datasets and extensive training, the RGB-based modules (VAE and CLIP) have very strong generalization ability. Although they may not outperform some event-based designs on specific datasets, their generalization and robustness prevent them from failing in most scenes. This is also why we retained the original architecture and adjusted certain weights to adapt the SVD to event data. Besides, the authors want to note that it is ineffective to change only some parts of a large generative model, since the diffusion U-Net is trained against the output distributions of the original VAE and CLIP models. We have also validated this experimentally by swapping those modules. The results are shown in the following table, where we feed features from different CLIP models (event-trained or RGB-trained) to the SVD U-Net. Note that all CLIPs are fed with event voxels. Even after further fine-tuning the SVD with plenty of data, the diffusion model using the event-trained CLIP still underperforms.
Table: Experimental results of SVD fed with different CLIP inputs, where both CLIP models receive the same event data for a fair comparison.
| #Prompt | Fine-tuning | CLIP | FVD ↓ | FID ↓ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|---|---|
| U(1,3) | T | RGB-Trained | 1972.91 | 210.89 | 0.69513 | 0.3651 |
| U(1,3) | S+T | RGB-Trained | 1378.92 | 230.18 | 0.78496 | 0.3076 |
| 1 | S+T | RGB-Trained | 1406.24 | 233.58 | 0.78374 | 0.3219 |
| U(1,3) | S+T | Event-Trained | 1646.78 | 298.88 | 0.72779 | 0.3464 |
Moreover, in the future, we will also explore more event-specific designs for the diffusion model, supported by large datasets and an extensive training process.
Q2: RGB-Domain Metrics & Raw Data Metrics
These are good questions for generative models. Here the authors merge the two concerns and answer them together. Our conclusion, stated up front, is that metrics computed in the raw space are inappropriate for measuring and training generative models.
We first note that perceptual metrics (i.e., metrics operating on the low-dimensional data manifold) are essential for generative models. In other words, measuring and training a generative model, especially a diffusion model, is more effective when carried out on the data-distribution manifold, i.e., in the perceptual space [1]. The underlying reason is simple: the target data distribution is usually not continuous in the raw space. If we use a metric such as MSE (as mentioned in the third question), the generated data is pushed toward the ground-truth data in raw space. However, samples that lie near the ground truth in raw space, e.g., blurred or noised versions of ground-truth samples, do not belong to the original data distribution, which contains only clear and meaningful samples. If we can instead find the data manifold, the neighbors of a target sample on that manifold are clear event sequences with similar semantic content. We should therefore use perceptual metrics, which act as a measurement of semantic, high-level similarity.
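To make the distinction concrete, here is a small illustration (not from the paper) using the third-party `lpips` package: MSE compares raw pixels, while LPIPS compares deep features, so with a real frame a mild blur typically keeps the MSE low while noticeably increasing the LPIPS distance. The random tensor below is only a stand-in for a rendered 3-channel event frame in [-1, 1].

```python
import torch
import torch.nn.functional as F
import lpips

perceptual = lpips.LPIPS(net='alex')                     # feature-space (perceptual) distance

gt = torch.rand(1, 3, 64, 64) * 2 - 1                    # stand-in for a ground-truth frame
blurred = F.avg_pool2d(gt, kernel_size=5, stride=1, padding=2)  # a raw-space "neighbor"

print('raw-space distance (MSE)   :', F.mse_loss(blurred, gt).item())
print('perceptual distance (LPIPS):', perceptual(blurred, gt).item())
```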
Moreover, as discussed in the previous question, current RGB-based models have strong generalization ability, which makes perception failures unlikely across various scenarios. We therefore adopt FVD, FID, and LPIPS as our evaluation metrics.
Finally, to address your concern, we also show the quantitative results for MSE and PSNR metrics in the table below. Our method also outperforms SOTA methods in terms of pixel-level metrics in the raw space.
Table: Quantitative comparison between SOTA methods.
| Methods | Modal | MSE ↓ | PSNR ↑ | FVD ↓ | SSIM ↑ | LPIPS ↓ | mIoU ↑ |
|---|---|---|---|---|---|---|---|
| PredRNNv2 | EVT | 0.0306 | 15.143 | 1339.05 | 0.6598 | 0.3388 | 0.166 |
| SimVP | EVT | 0.0210 | 16.778 | 1242.25 | 0.7961 | 0.3371 | 0.213 |
| TAU | EVT | 0.0231 | 16.364 | 1218.03 | 0.7972 | 0.3354 | 0.228 |
| Ours | EVT | 0.0170 | 17.696 | 1055.25 | 0.7998 | 0.3123 | 0.302 |
[1] Song, Yang, et al. "Consistency models." arXiv preprint arXiv:2303.01469 (2023).
Dear Reviewer AE3a
Thanks for your time and effort in reviewing our manuscript. In our previous response, we addressed your concerns directly and comprehensively. We very much look forward to your further feedback on our responses. Let us discuss.
Best regards,
The authors
I thank the authors for their response.
I understand the authors' reasoning for preferring to measure performance in terms of perceptual metrics. However, as stated in the response ("if we can find the manifold of data"), my concern over the use of LPIPS, FVD, etc., is that they rely on a model that has not observed RGB-ified event sequence data in training and thus has not really had a chance to measure or learn such a manifold.
However, the authors have provided MSE and PSNR, which are imperfect as they measure proximity to only a single example. At least they show that results are in the vicinity of a known point, which was not guaranteed with LPIPS and FVD. It is interesting that such metrics correlate and generalise despite the change in distribution. I would encourage including the two additional metrics if possible.
I think my main concerns have been addressed. I have updated my recommendation accordingly.
The authors sincerely appreciate your feedback.
This work focuses on the task of future motion estimation, where the goal is to leverage event-based vision sensors (an alternate modality, compared to traditional vanilla RGB inputs) to predict motion flow in settings useful for robotics and autonomous vehicles. The authors propose a method that leverages stable video diffusion models (pretrained on RGB settings) and adapting them with an event sequence dataset for this specific task. They authors consider two large-scale datasets (VisEvent and EventVOT), showing improvements in FVD and mIoU compared with prior work, and extensions to settings like video object tracking. The paper also discusses ablations of the method, with varying prompts and fine-tuning techniques.
Strengths
+ Motion estimation (both flow and video object tracking) are important sub-tasks for embodied vision applications (robotics, autonomous vehicles).
+ The proposed framework is a sensible extension on traditional RGB-only stable diffusion models, and represents a good early exploration of incorporating recent techniques to this relatively smaller focused area of research.
+ The results indicate promising improvements over ablation variations and key baselines for this task, and the authors examine several different downstream tasks (segmentation, tracking and flow estimation).
Weaknesses
- Additional analysis w.r.t. baselines. It is unclear why some of the metrics show regression in the ablation analysis and comparison tables. For example, in Table 1, TAU and SimVP outperform on FID and aIoU metrics, while the qualitative visuals seem to indicate a substantially different picture (are there examples for which these prior work show more compelling visual results, and if so, what areas of improvement could be identified by this?). Relatedly, another example is in Table 2/3, where it is unclear why some ablations (e.g., removing motion alignment with RL, one of the core technical listed contributions) show improvements over the full approach. If the authors could expand further on this analysis it would be helpful to understanding the overall value and impact of the work relative to the prior work in this space.
- Novelty beyond specific context of event sequence inputs. This specific area for video event understanding is relatively niche, so while novelty is present, it is also limited to this specific context. In particular, the broader ideas around incorporating similar signals for video diffusion (inputs and outputs) have been explored previously, e.g. [A1, A2] ([A1] considers diffusion models for optical flow and monocular depth estimation [related tasks], and [A2] specifically looks at incorporating related depth estimation signals). Additional discussion of such methods (beyond the brief note in L545-547 in supplement section C) would be helpful to better contextualize the broader potential impact of the work beyond this specific domain.
Referenced above:
[A1] The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation, NeurIPS 2023. ([40] in paper, referenced in supplement section C.)
[A2] Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models, 2023.
Questions
Overall, the preliminary rating of the work leans borderline+; the work offers an exploration of a modern set of diffusion tools and related techniques in a relatively underexplored area (which can have some useful applications downstream), but there remain some questions regarding the analysis and broader novelty. If the authors could address the questions and clarification areas listed in the weaknesses section above for the rebuttal, it would be helpful to inform the discussion phase and final rating.
Limitations
The authors provide a discussion of limitations, future work, and impacts in the paper and supplement.
Post-rebuttal update: The rebuttal discussion and additional results help to strengthen the initial impression of the work. Given the reviewer consensus, I am maintaining my rating leaning towards acceptance. (And given that I also believe that the reviewers have adequately addressed concerns of a reviewer who has not updated their review, I am upgrading my rating a bit further since I believe the rating for this work falls between 5-6).
Q1: Additional Analysis w.r.t. Baselines
There are indeed some conflicts between different metrics, e.g., FID, IoU, and FVD, because only FVD can comprehensively evaluate both the spatial and the temporal distribution alignment between the generated samples and the ground truth. FID and IoU only measure per-frame (spatial) distributions, which neglects spatio-temporal consistency. Nevertheless, by reporting all of these metrics we aim to provide a comprehensive evaluation of the different methods. To further resolve your concern, we drew a per-sample metric distribution graph comparing the proposed method with SimVP, which performs best in terms of aIoU. The results are shown in Fig. 5(a) and Fig. 5(b) of the uploaded PDF file. It can be seen that our method outperforms SimVP overall, while SimVP achieves excessively high scores on certain samples, such as the blank scene shown in Fig. 5(c).
As shown in Lines 187-190 of the main text, we utilize FVD and SSIM as the reward metrics for the motion-alignment reinforcement learning process, so it is expected that these trained metrics improve. Moreover, as mentioned above, FVD is the principal metric for evaluating the generation results. Further, we experimentally tried to model the reconstruction reward with a mixture of all metrics. However, the performance was much worse than with the reward we ultimately used, as shown in the following table. This may be because optimizing pixel-level metrics such as MSE and PSNR can conflict with optimizing perceptual metrics, making the mixed objective hard to optimize. Meanwhile, given the nature of the target data distribution, it is more plausible to optimize on the data manifold (perceptual space) than in the raw space. We also refer the reviewer to our reply to Q2 of reviewer AE3a.
Table: Ablation Study of reward metrics
| Reward Metrics | MSE ↓ | PSNR ↑ | FVD ↓ | FID ↓ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|---|---|
| mixture metrics | 0.0240 | 16.198 | 1562.66 | 265.32 | 0.6674 | 0.3463 |
| FVD & SSIM | 0.0170 | 17.696 | 1055.25 | 243.45 | 0.7998 | 0.3123 |
Q2: Related Work
Thanks for pointing out this excellent work. The authors will definitely include more discussion of related works in the main body of the paper in the final version. A draft of this discussion is shown below:
"In recent years, the development of multimodal diffusion technology has advanced rapidly. Researchers are dedicated to applying the powerful generative capabilities of diffusion to different modalities with unique advantages, such as optical flow and depth. Saxena et al. [A1] were the first to apply diffusion models to optical flow and depth estimation. For the characteristics of training data, they introduced infilling, step-rolling, and L1 loss during training to mitigate distribution shifts between training and inference. To address the lack of ground truth in datasets, they also used a large amount of synthetic data for self-supervised pretraining, enabling the diffusion model to acquire reliable knowledge. Chen et al. [A2] utilized the motion information embedded in control signals such as edges and depth maps to achieve more precise control over the text-to-video (T2V) process. They used pixel residuals and optical flow to extract motion-prior information to ensure continuity in video generation. Additionally, they proposed a first-frame generator to integrate semantic information from text and images. Unlike them, we focus on exploring the rich motion information contained in event data and use it to conditionally achieve more precise control over the generation of future motions. Furthermore, we also investigate the significant role of reinforcement learning in video diffusion and the task of motion estimation."
[A1] The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation, NeurIPS 2023.
[A2] Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models, 2023.
Dear Reviewer Yrkm
Thanks for your time and effort in reviewing our manuscript and the favorable recommendation. In our previous response, we addressed your remaining concerns directly and comprehensively. We are looking forward to your further feedback on our responses.
Best regards,
The authors
Dear reviewer,
Your questions seem to have been addressed by the authors. Can you please comment to confirm?
Dear reviewer Yrkm,
Thanks for your time and effort in reviewing our paper. In our previous response, we provided detailed and comprehensive explanations to resolve your concerns. We look forward to and would sincerely appreciate your further feedback.
Best regards,
The authors
Thank you to the authors for your rebuttal - this note is to confirm that I have read it and looked over the additional qualitative figures + graphs you've attached in the pdf. I plan to update + finalize my review after the final reviewer discussion, but overall, the results and rebuttal do help to reinforce my rating leaning towards acceptance, and I do not have any further major questions for the authors at this time.
The authors sincerely appreciate your feedback.
The paper introduces a novel framework that leverages the high temporal resolution of event-based sensors to predict future motion trajectories with unprecedented detail and precision. The authors propose an integration of video diffusion models with event camera data, resulting in an Event-Sequence Diffusion Network. This network is designed to capture the nuances of dynamic scenes and generate video sequences that are both rich in detail and grounded in realistic motion dynamics.
Strengths
Integration of Event Sequences with Video Diffusion Models: The paper presents the first attempt to combine event sequences with a video diffusion model, creating an event-sequence diffusion model capable of estimating future object motion.
Reinforcement Learning for Motion Alignment: The authors propose a method to align the pre-trained event-sequence diffusion model with real-world motion using reinforcement learning techniques, enhancing the fidelity and coherence of the generated motion sequences.
Test-Time Prompt Augmentation: A method is introduced to augment the test-time prompt with high temporal resolution event sequences, which improves the generation performance of the diffusion model.
Extensive Testing and Validation: The authors demonstrate the effectiveness of their approach across various complex scenarios, showcasing its potential for applications in autonomous vehicle guidance, robotic navigation, and interactive media.
Promising Direction for Future Research: The findings suggest a new direction for enhancing the interpretative power and predictive accuracy of computer vision systems, particularly in the context of motion flow prediction.
The paper's contributions are significant as they push the boundaries of motion estimation in computer vision by harnessing the unique capabilities of event-based sensors and integrating them with advanced diffusion models. The proposed framework opens up new possibilities for accurate prediction of dynamic environments, which is crucial for various real-world applications.
Weaknesses
The following are my concerns:
- While the paper demonstrates strong results in controlled scenarios, it may lack evidence of how well the model generalizes to a broader range of real-world conditions, such as various weather effects or low-light environments.
- The paper could benefit from a more detailed discussion on the computational efficiency of the proposed model. Including runtime analysis and resource requirements would provide a clearer picture of the model's practicality for real-time applications.
- Although the paper acknowledges the high temporal resolution of event data, it could delve deeper into the limitations of such data, such as the lack of texture information, and how this might impact the model's performance in complex visual scenes.
- While the paper includes some ablation studies, there could be a more thorough investigation into the contribution of each component of the model. This would provide clearer insights into which aspects are most critical to the model's performance.
- The paper could address how the model performs under noisy conditions or when outliers are present in the event data. This is particularly important given the sensitivity of diffusion models to input quality.
- There is an opportunity to discuss the explainability and interpretability of the model's predictions. Understanding the factors that contribute to the model's decisions could be valuable for applications in autonomous systems.
- Although the paper touches on potential societal impacts, a more detailed discussion on ethical considerations, such as privacy concerns or the potential for misuse, could be beneficial.
- While the paper mentions the availability of source code, providing more detailed instructions on how to reproduce the experiments, including the exact versions of software and hardware used, would enhance the reproducibility of the study.
- The paper focuses on short-term motion estimation. It could discuss the model's capability or limitations in predicting motion over longer time horizons, which is crucial for some applications.
Questions
- Has the paper addressed how the model performs under various conditions such as different lighting, weather, or in the presence of occlusions?
- Does the paper provide evidence of the model's ability to generalize beyond the datasets it was trained on? Are there any specific domains or scenarios where the model might underperform?
- Are there discussions on the computational resources required to run the model, and is it scalable for use in resource-constrained environments?
- Does the paper discuss any potential misuse of the technology, such as in surveillance or other applications that might infringe on individual rights?
- Are there discussions on how the technology could affect different demographic groups differently, potentially exacerbating existing biases?
- Are there any regulatory or compliance issues related to the deployment of such technology, especially in sectors like automotive or robotics?
Limitations
See above
Q1: Performance in Challenging Visibility
Benefiting from the unique characteristics of event cameras, our event-based video diffusion framework can handle future motion estimation issues to some extent. To further address your concern about algorithm performance on challenging visibility scenes, we conducted experiments across various scenarios, as shown in Fig.1 of the uploaded PDF file. Specifically, we first illustrate the poor exposure scene in which a car is passing through a tunnel (Fig.1(a)). The proposed method clearly predicts the car's contour, whereas the contour is difficult to discern from the RGB data and even some frames of the GT event data.
Furthermore, regarding object occlusion, the experimental results in Fig. 1(b) illustrate a scenario where a person is passing behind an occluding object. Our method successfully estimates the future motion of the person despite the occlusion. Similarly, Fig. 1(c) demonstrates that when a bicycle enters an occluded region, our method provides a more accurate prediction of the bicycle's motion afterwards.
Q2: Generalization Ability on Other Datasets
We further validate the generalization ability of the proposed algorithm on other datasets, i.e., CRSOT [1], VisEvent ([54] in the paper), and hs-ergb [2]. The following table presents the test results; the consistent performance of our method indicates strong generalization. Fig. 1 and Fig. 2 in the uploaded PDF demonstrate that our method outperforms existing methods across various scenarios from these datasets.
| Datasets | Methods | Scenarios | FVD↓ | FID↓ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|---|---|
| CRSOT | PredRNNv2 | normal | 1120.2 | 252.8 | 0.809 | 0.453 |
| CRSOT | SimVP | normal | 834.7 | 202.5 | 0.829 | 0.328 |
| CRSOT | TAU | normal | 811.7 | 196.8 | 0.827 | 0.353 |
| CRSOT | ours | normal | 780.6 | 192.6 | 0.853 | 0.290 |
| VisEvent | PredRNNv2 | Poor exposure | 2321.6 | 308.8 | 0.593 | 0.535 |
| VisEvent | SimVP | Poor exposure | 2124.4 | 248.5 | 0.642 | 0.379 |
| VisEvent | TAU | Poor exposure | 2125.0 | 241.9 | 0.651 | 0.363 |
| VisEvent | ours | Poor exposure | 1638.0 | 322.3 | 0.696 | 0.291 |
| hs-ergb | PredRNNv2 | close | 1675.1 | 279.0 | 0.619 | 0.565 |
| hs-ergb | SimVP | close | 1363.9 | 273.1 | 0.655 | 0.458 |
| hs-ergb | TAU | close | 1343.3 | 274.1 | 0.705 | 0.395 |
| hs-ergb | ours | close | 1109.7 | 232.3 | 0.726 | 0.290 |
[1] Zhu, Y., Wang, X., Li, C., et al. "CRSOT: Cross-resolution object tracking using unaligned frame and event cameras." arXiv preprint arXiv:2401.02826 (2024).
[2] Tulyakov, Stepan, et al. "Time lens: Event-based video frame interpolation." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.
Q3: Limitation & Failure Cases
Our method is relatively limited in the following scenarios:
(1) Complex background scenarios. Event cameras may capture incomplete textures in certain situations, leading to poorer prediction performance, as shown in Fig. 3(a) of the uploaded PDF. Complex backgrounds can reduce the clarity of the target object, resulting in worse outcomes, especially in cases where the camera lens is shaking.
(2) Heavily overlapped object scenarios. When objects overlap, their motion becomes quite complex, and due to the edge-focused characteristics of event cameras, understanding such motion is challenging, as shown in Fig. 3(b) and Fig. 3(c). When people overlap, their footsteps often become chaotic, leading to less accurate predictions.
However, we note that such scenes (extreme complexity or heavy occlusion) are also very challenging for traditional RGB-based vision. We expect that in future research, a diffusion model that perceives multiple modalities may help to address these problems.
Q4: Computational Resource & Scalability
The following table compares the computational resources of our method with SOTA methods. The powerful generative capability and high fidelity of diffusion models come at the cost of substantial computational resource consumption. As shown in the table, our parameter count and FLOPs significantly exceed those of traditional models. However, we believe this trade-off is justified by the powerful learning capability of large models in real-world settings. Taking the future motion estimation task as an example, our diffusion-based method significantly surpasses traditional methods in understanding and learning motion.
Regarding a scalable model size for different inference environments, many works [1] indicate that diffusion models can be combined with quantization or other acceleration techniques to speed up the inference process. We will add further discussion in the final version.
| Methods | Params (M) | FLOPs (G) |
|---|---|---|
| PredRNNv2 | 23.9 | 48.92 |
| SimVP | 58.0 | 60.61 |
| TAU | 44.7 | 92.50 |
| ours | 1521.0 | 693.92 |
[1] So, Junhyuk, et al. "Temporal dynamic quantization for diffusion models." Advances in Neural Information Processing Systems 36 (2023).
Q5: Different demographic groups & Potential misuse
Since event sensors have a higher dynamic range and record no color information, we believe they treat different skin tones more equally. Moreover, because event data contains less texture information, individual privacy is better protected. We will add further discussion in the final version.
Q6: Regulatory and Compliance Issues
To deploy the proposed algorithm on a robotic or automotive platform, the system must first be equipped with an event sensor. Moreover, sufficient computational resources are necessary to run the network. We will add further discussion in the final version of the paper. Thanks for the advice.
Q7: Source Code
In the supplementary materials file provided, we include the hyperparameters and core components required for training the network. We will also make the code open-source in the future to facilitate replication of our methods by other researchers.
Dear Reviewer 8wZD
Thanks for your time and effort in reviewing our manuscript. In our previous response, we addressed your concerns directly and comprehensively. We very much look forward to your further feedback on our responses. Let us discuss.
Best regards,
The authors
While the authors have provided insightful rebuttals to my initial comments, I would like to raise a few additional concerns that were not addressed:
- Long-Term Forecasting Duration. Have you forgotten to rebut the concerns about the long-term forecasting duration mentioned in the review comments? This is particularly important for some applications where anticipating motion far into the future is crucial. Could the authors provide more information on how the model performs under longer forecasting durations and whether the model's time complexity increases significantly with longer prediction times?
- Given the unique characteristics of event cameras, I would like to know if the diffusion model has been specifically tailored to leverage these features. Specifically, have there been any improvements or modifications to the diffusion model that are designed to exploit the high temporal resolution and sparse event data produced by event cameras? Are there more efficient diffusion strategies available that you can use?
- The authors mention that their method significantly outperforms traditional methods in understanding and learning motion. However, the computational resources required for their model are considerably higher than those of state-of-the-art methods. Considering that the current experimental setup only predicts up to t=20, is the substantial increase in computational time justified for this task? Moreover, if the prediction horizon is extended, would the algorithm's time complexity increase significantly? Please provide a theoretical analysis to support your claims.
Q1: Long-Term Forecasting Duration.
To address your concerns regarding long-sequence forecasting, we conducted additional experiments to assess the performance of our method in generating extended sequences. Specifically, for long-term forecasting, we evaluated the method in an auto-regressive manner, where previously generated frames are used to predict new frames. The results of these evaluations are presented in the table below. It is evident that the time complexity scales roughly linearly with the number of predicted frames. Moreover, performance drops only slightly for longer sequences. Thus, the proposed method is capable of long-term forecasting.
Table S1. The performance of the proposed method across different prediction time durations.
| Estimation Frames | Test Time (s) | FVD ↓ | FID ↓ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|---|
| 25 | 25.5 | 1055.25 | 243.45 | 0.7998 | 0.3123 |
| 50 | 51.3 | 1114.28 | 250.39 | 0.7932 | 0.3824 |
| 75 | 78.1 | 1148.36 | 257.11 | 0.7834 | 0.4196 |
| 100 | 104.1 | 1295.67 | 281.39 | 0.7807 | 0.4451 |
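A minimal sketch of the auto-regressive rollout described above (the helper names are hypothetical): each diffusion call predicts one window of frames, and the tail of that window is re-encoded as the prompt for the next window, so the wall-clock cost grows roughly linearly with the horizon, consistent with Table S1.

```python
def autoregressive_forecast(model, initial_prompt, predict_window, to_prompt,
                            window=25, horizon=100):
    """Chain diffusion rollouts to forecast `horizon` frames, `window` frames at a time."""
    prompt, frames = initial_prompt, []
    for _ in range(horizon // window):
        chunk = predict_window(model, prompt)   # one diffusion rollout of `window` frames
        frames.extend(chunk)
        prompt = to_prompt(chunk[-3:])          # reuse the last generated frames as prompt
    return frames
```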
Q2: Event-specific Tailored Design
Event-Specific Designs. To harness the unique features of event cameras, we investigated a high-temporal resolution event prompt strategy that uses multiple event voxels with shorter time intervals to guide event generation. Additionally, we conducted ablation studies, the results of which are detailed in Tables 2 and 3 of our manuscript. For your convenience, we have included some of the experimental results below.
Table S3. Ablation studies of the proposed method trained with/without high-resolution prompts, where "U(1,3)" indicates utilizing high-resolution prompts.
| #Prompt | Fine-tuning | FVD ↓ | FID ↓ | SSIM ↑ | LPIPS ↓ | mIoU ↑ | aIoU ↑ |
|---|---|---|---|---|---|---|---|
| 1 | S+T | 1406.24 | 233.58 | 0.78374 | 0.3219 | 0.268 | 0.524 |
| U(1,3) | S+T | 1378.92 | 230.18 | 0.78496 | 0.3076 | 0.252 | 0.525 |
Table S4. Ablation studies of the proposed method inferenced with/without high-resolution (HR) prompts.
| Method | HR Prompt | MA | FVD ↓ | SSIM ↑ | LPIPS ↓ | mIoU ↑ | aIoU ↑ |
|---|---|---|---|---|---|---|---|
| C | × | ✓ | 1119.71 | 0.79597 | 0.3246 | 0.277 | 0.516 |
| D | ✓ | ✓ | 1055.25 | 0.79981 | 0.3123 | 0.302 | 0.522 |
From the tables, we can see that our event prompt strategy improves performance to a large extent.
Diffusion Acceleration. There are several methods available to enhance the efficiency of the diffusion model, such as the DDIM-based sampling strategy [1], various ODE or SDE solvers [2,3], and distillation-based techniques [4,5]. This paper primarily concentrates on validating the effectiveness and feasibility of event sequence diffusion. In our future work, we aim to enhance the efficiency of the method based on your constructive feedback. Thanks for your valuable advice.
Q3: Algorithm's Time Complexity.
The proposed method can extend the prediction length in an autoregressive manner, as evidenced by the response to the first question. The proposed method indeed has a large model size because it leverages the rich pre-trained knowledge of the diffusion model. While the compared methods have smaller sizes, their performance and generalization capabilities are limited, as demonstrated in our experimental validation. Furthermore, computational costs can be reduced through distillation or other acceleration techniques for the reverse integral process. Additionally, there are specific scenarios in which running the model offline with fewer real-time requirements is acceptable, such as mechanical or physical motion simulation and offline reinforcement learning with video diffusion simulation. These scenarios further underline the potential applications of the proposed algorithm. For long-sequence diffusion, we can extend the length in an auto-regressive manner; thus, consumption increases only linearly with generation length, as shown in Table S1 of Q1.
Theoretical Analysis of Diffusion Acceleration. As previously discussed, several techniques can reduce computational costs: (1) ODE/SDE solvers [2,3] aim to find analytical solutions for the reverse integral process and approximate the integral using high-order methods such as Runge-Kutta schemes. (2) Distillation methods [4,5] straighten the diffusion integral trajectories, simplifying the integration process down to even a single step. Additionally, computational resources can be conserved using techniques such as a KV cache for transformer layers [6] and FlashAttention [7]. Thus, we believe that the efficiency of the proposed method can be further enhanced.
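For reference, the deterministic DDIM update ($\eta = 0$) behind [1] takes the standard form below (the generic formula, not a derivation specific to our model):

$$
x_{t-1} \;=\; \sqrt{\bar{\alpha}_{t-1}}\,\underbrace{\frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}}_{\text{predicted } x_0} \;+\; \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_\theta(x_t, t),
$$

where $\bar{\alpha}_t$ is the cumulative noise schedule and $\epsilon_\theta$ is the noise predicted by the U-Net. Because the update is deterministic, it can be evaluated on a much coarser time grid (or replaced by a higher-order ODE step), which is where the sampling-time savings come from.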
We sincerely appreciate your comments and are looking forward to further feedback. We understand that you may be reviewing multiple papers and have a busy schedule. Thanks very much for your time and efforts.
[1] Song, Jiaming, Chenlin Meng, and Stefano Ermon. "Denoising diffusion implicit models."
[2] Dockhorn, Tim, Arash Vahdat, and Karsten Kreis. "Genie: Higher-order denoising diffusion solvers."
[3] Lu, Cheng, et al. "Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps."
[4] Song, Yang, et al. "Consistency models."
[5] Liu, Xingchao, Chengyue Gong, and Qiang Liu. "Flow straight and fast: Learning to generate and transfer data with rectified flow."
[6] Dai, Zihang, et al. "Transformer-xl: Attentive language models beyond a fixed-length context."
[7] Dao, T., et al. "Fast and memory-efficient exact attention with io-awareness, 2022."
Dear reviewer 8wZD,
Thanks for your time and effort in reviewing our paper. In our previous response, we provided detailed and comprehensive explanations to resolve your remaining concerns. We look forward to and would sincerely appreciate your further feedback.
Best regards,
The authors
Dear reviewer 8wZD,
Thanks for your time and effort in reviewing our paper. We have provided detailed and comprehensive explanations in response to your further comments. We look forward to and would sincerely appreciate your further feedback, considering that the discussion deadline is approaching. Let us discuss and clear up any remaining issues.
Best Regards,
The authors
General Response
We thank all reviewers for your time, constructive feedback, and acknowledgment of our work. We believe all concerns have been clearly and directly addressed. Here, we also want to summarize a few key clarifications concerning the contributions of our work.
Our major contribution lies in the pioneering integration of event sequence data with video diffusion, i.e., utilizing the high temporal resolution of event data to accurately predict future object motions in various scenarios.
Specifically, we transform events into 3-channel voxels and fine-tune the spatiotemporal cross-attention layers of the U-net in SVD. During the denoising phase, to fully leverage the high temporal resolution characteristics of event data, we sample events into sub-streams and prompt multiple event voxels to SVD. Compared to the input of a single RGB frame for the original SVD, our method effectively utilizes the motion priors present in event data, leading to more accurate future motion estimation. Furthermore, we introduce motion alignment using reinforcement learning to enhance the stability of both the diffusion training process and the estimations.
Figures 1, 2, and 4 in the uploaded PDF file, as well as the table in the response to Reviewer 8wZD, demonstrate the excellent generalization capability of our method across various scenarios and datasets. Additionally, ablation studies in Table 3 of the main paper and the visual results in Figure 10 in the Appendix further substantiate the necessity of reinforcement learning. For motion estimation tasks that demand high accuracy, the absence of reinforcement learning for aligning with real motion leads to unstable and prone-to-failure diffusion-generated results.
We posit that our contributions will pave the way for advancing event-based diffusion and future motion estimation. Our method raises the SOTA performance of event-based future motion estimation to a higher level, providing a promising benchmark for this community.
Last but not least, we will make the reviews and author discussion public regardless of the final decision. Besides, we will include the newly added experiments and analysis in the final manuscript/supplementary material.
The paper received mixed reviews, leaning towards acceptance. Reviewers commended the importance and originality of the problem addressed, as well as the extensive experimental evaluation and the high quality of the presentation. However, they also noted concerns regarding the method's generalization abilities and the absence of certain details and discussions. In their rebuttal, the authors thoroughly addressed all these concerns. After considering the paper, the reviews, the rebuttal, and subsequent discussions, the area chair decided to accept the paper. The authors are encouraged to incorporate the additional experiments and discussions from the rebuttal into the camera-ready version of the manuscript.