PaperHub
6.6 / 10
Poster · 4 reviewers
Ratings: 4, 4, 4, 2 (lowest 2, highest 4, std. dev. 0.9)
ICML 2025

Impossible Videos

OpenReview · PDF
Submitted: 2025-01-14 · Updated: 2025-07-24
TL;DR

This work presents a comprehensive benchmark of impossible videos and text prompts, aiming to evaluate state-of-the-art video generation and understanding models.

Abstract

Keywords
Videos, Benchmark, Evaluation, Impossible Videos, Counterfactual, Anti-reality

Reviews and Discussion

Review
Rating: 4

The paper introduces IPV-VID, a dataset designed to evaluate video understanding models on "impossible videos", which depict scenarios that violate commonsense. The study evaluates from two perspectives: video understanding and video generation. For video understanding, benchmark tasks such as VideoQA and video-text alignment reveal that state-of-the-art models struggle significantly on "impossible videos," exposing their limitations in temporal reasoning and commonsense knowledge. For video generation, the paper assesses the ability of text-to-video (T2V) models to generate high-quality "impossible videos" and proposes IPV-Score to evaluate their semantic consistency and visual quality. The findings highlight the challenges and opportunities for advancing both video understanding and generation in counterfactual and out-of-distribution scenarios.

Questions for the Authors

  1. For the synthetic data generated by current T2V models, is there a significant performance difference in downstream video understanding tasks between "normal" videos and "impossible" videos? Similarly, for text-guided video generation, are there notable differences in visual quality and prompt-following between "normal" and "impossible" videos?
    Clarifying this would help determine whether the challenges observed are truly due to the "impossibility" of the videos or are influenced by general limitations in synthetic data quality. This could validate or challenge the paper's claims about the unique challenges of "impossible videos."

  2. Can "impossible videos" serve as negative sample guidance to significantly improve the performance of Video-LLM models in reasoning tasks or enhance T2V video generation quality?
    If "impossible videos" can effectively guide reasoning or generation improvements, it would strengthen the practical value of the dataset and benchmark, highlighting its utility beyond evaluation.

Claims and Evidence

Claim 1: Existing video understanding models struggle with "impossible videos" due to their reliance on commonsense and reasoning beyond real-world scenarios.

Evidence: Experimental results show that state-of-the-art video understanding models perform significantly poorly on the IPV-VID dataset, suggesting their limitations in understanding "impossible videos."

Comment: While the results indicate poor performance on "impossible videos," this may not necessarily be attributed to the "impossibility" itself. The performance drop could also stem from the gap between synthetic and real-world data, such as differences in visual quality or semantic consistency. The lack of comparative experiments on similarly scaled "normal synthetic videos" makes it difficult to isolate the effect of "impossibility" as the primary cause of the performance drop. Drawing such a conclusion without addressing this gap might be risky.

Suggestions for Improvement: To better validate the unique impact of "impossibility" on model performance, future studies could include a control group with normal synthetic videos. This would help disentangle the influence of data quality from that of "impossibility."

Claim 2: The IPV-VID dataset provides high-quality "impossible videos" that are semantically consistent and visually realistic.

Evidence: The dataset was generated using T2V models and manually filtered to ensure semantic consistency and visual quality. The authors also proposed the IPV-Score as a metric to evaluate the quality of generated videos.

Comment: Despite the reported quality control, generating high-quality videos remains a significant challenge for current T2V models. For instance, many models struggle with action coherence (e.g., unnatural movements and violations of physical laws), and only a few models, such as Hunyuan and Hailuo, perform reasonably well in this regard. Since the dataset has not been fully released, it is unclear whether all "impossible videos" in the dataset strictly meet the claimed quality standards. Issues such as motion coherence and physical consistency may directly affect the validity of downstream tasks, but these aspects were not thoroughly addressed in the paper.

Suggestions for Improvement: Increasing the dataset's transparency by fully opening sources would enhance its credibility.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

Based on the provided document, there is no explicit mention of formal mathematical proofs or theoretical validations for the claims made. The paper primarily focuses on the construction of the IPV-BENCH benchmark, categorization of impossible video types, and empirical evaluations of existing video understanding and generation models. These aspects are more experimental and empirical in nature rather than relying on theoretical proofs.

Experimental Design and Analysis

The experimental designs and analyses in the paper are well-constructed and align with the goal of evaluating video understanding and generation models using the IPV-BENCH benchmark. For video understanding, the results demonstrate that existing models struggle with "impossible videos," as evidenced by their poor performance on the IPV-VID dataset. This supports the claim that these models face challenges when reasoning beyond real-world scenarios. However, while the findings highlight a significant performance drop, it is worth noting that this may not solely be attributed to the "impossibility" of the videos. The gap between synthetic and real-world data, such as differences in visual quality or semantic consistency, could also play a role. Without comparative experiments on similarly scaled "normal synthetic videos," it remains difficult to isolate the unique impact of "impossibility" on model performance. This limitation in the experimental design underscores the need for a control group to disentangle these factors in future studies.

For video generation, the IPV-VID dataset is presented as a high-quality resource of "impossible videos," with semantic consistency and visual realism ensured through manual filtering and evaluation metrics like the IPV-Score. While these efforts are commendable, challenges inherent to current T2V models, such as issues with action coherence and physical consistency, suggest that the dataset's quality may not fully meet the claimed standards. Additionally, since the dataset has not been fully released, it is difficult to independently verify the quality of the videos or their impact on downstream tasks. A more transparent approach, including full dataset release and quantitative analyses of video quality, would strengthen the dataset's credibility and its utility for benchmarking.

Supplementary Material

Yes, I have read through the supplementary material.

Relation to Existing Literature

The paper’s contributions align with ongoing research in video understanding and generation, extending prior work on real-world datasets like Kinetics and Something-Something by introducing "impossible videos" that challenge models to reason beyond physical laws. This builds on ideas from benchmarks like CLEVRER but applies them to dynamic video data. In video generation, the paper advances text-to-video (T2V) research by focusing on generating semantically consistent "impossible videos," addressing limitations of current T2V models in temporal coherence and complex prompts. The proposed IPV-Score complements existing evaluation metrics like FVD and CLIP alignment. Additionally, the IPV-BENCH benchmark fills a gap in AI evaluation by targeting counterfactual reasoning, contributing to the trend of specialized benchmarks. These efforts position the paper as a significant step in addressing underexplored challenges in video reasoning and generation.

Missing Important References

As far as I can tell, the references seem sufficient.

Other Strengths and Weaknesses

Strengths

  1. Easy for readers to follow.
  2. Interesting.

Weaknesses

  1. The experimental design and analyses could be improved if the authors compared performance between synthetic normal and impossible videos. More analysis could be offered on the differences between synthetic and real videos.
  2. Not fully open-source, and the "impossible videos" data may not meet the standard it is claimed to meet.

Other Comments or Suggestions

  1. To better validate the unique impact of "impossibility" on model performance, future studies could include a control group with normal synthetic videos. This would help disentangle the influence of data quality from that of "impossibility."
  2. Increasing the dataset's transparency by fully opening sources would enhance its credibility.

Author Response

Thanks for the encouraging review and valuable suggestions. We appreciate your acknowledgment of the paper’s novelty and will carefully address your concerns to improve clarity and rigor.

Q1: Disentangling Impossibility vs. Synthetic Data Quality

Disentangling impossibility from synthetic data quality is crucial for strengthening the rigor of our study. We leverage the Multi-choice QA task to investigate this. Due to the limited time during rebuttal, we were unable to generate and annotate a large-scale set of synthetic videos. We conducted a preliminary study on a smaller dataset:

  1. Data Collection: We collect 420 synthetic videos from an existing work [1], generated using HunyuanVideo.
  2. Annotation: We label each video with an action description.
  3. Filtering: Videos were excluded if they:
    • Lacked an explicit event (e.g., landscape scenes).
    • Contained counterfactual content (to keep the normal set distinct from impossible videos).
      After filtering, 200 videos remained.
  4. Task Construction: We instructed GPT-4o to generate multi-choice questions and answers, adapting the prompt from the impossible-video setting (a sketch of this step is shown below, after the results).
  5. Evaluation: Several popular VideoLLMs were evaluated and compared to impossible videos.
| Model | Acc. (Normal Vid) | Acc. (Impossible Vid) |
| --- | --- | --- |
| Video-LLaVA | 70.0 | 25.8 |
| NVILA | 89.5 | 62.2 |
| LLaVA-NEXT | 90.0 | 83.4 |

The results show that models consistently perform better on normal synthetic videos than on impossible videos, indicating that "impossibility" introduces a distinct challenge for video understanding.
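
For illustration, a minimal sketch of how step 4 could be implemented with the OpenAI Python SDK; the prompt wording, function name, and JSON fields are illustrative assumptions rather than the exact pipeline used in the paper.

```python
# Sketch: turn a video's action description into a multi-choice question with
# GPT-4o, mirroring step 4 above. Assumes the OpenAI Python SDK (>= 1.x) and
# an OPENAI_API_KEY; the prompt wording and JSON fields are hypothetical.
import json
from openai import OpenAI

client = OpenAI()

def make_mcq(action_description: str) -> dict:
    """Generate one 4-option multi-choice question about the described action."""
    prompt = (
        "You are constructing a video QA benchmark. Given the action shown in "
        "a video, write one multiple-choice question about that action with "
        "four options (A-D), exactly one of which is correct.\n"
        f"Action description: {action_description}\n"
        'Reply as JSON: {"question": "...", "options": {"A": "...", "B": "...", '
        '"C": "...", "D": "..."}, "answer": "A|B|C|D"}'
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.2,
    )
    return json.loads(resp.choices[0].message.content)

# Example: make_mcq("A man pours water from a kettle into a glass.")
```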

Recent studies [2] suggest that the visual quality gap between synthetic and real-world videos is rapidly shrinking. This further underscores the unique challenges posed by impossible videos, which are independent of synthetic data quality.

We appreciate this valuable suggestion. We will conduct a more comprehensive study in the revised version.

Q2: Code/Data Release

Upon acceptance, we will publicly release all code and data, including the IPV-Bench taxonomy, IPV-TXT prompts, IPV-VID videos, and evaluation protocols. To ensure accessibility, we will also include a detailed Data Usage Instruction in the Appendix.

Regarding concerns about video quality, e.g., action coherence, our human annotation filters out low-quality videos:

  • If the action forms a semantically meaningful impossible phenomenon, we consider it a valid sample.
  • If the action exhibits inconsistencies, artifacts, or unnatural distortions, we classify it as low-quality data and exclude it.

For detailed data filtering criteria, please refer to Response 2 of Reviewer F9mi. Besides, we provide sufficient video examples on the anonymous website linked in the paper abstract for further reference.

Q3: For T2V, any difference in visual quality and prompt-following between normal and impossible videos?

Current T2V models are primarily optimized for normal videos, while impossible videos are often overlooked. Our assumption is that creating impossible videos is more challenging than creating normal ones. To explicitly verify this, we conducted a human evaluation on normal text prompts:

  1. We collected 420 synthetic videos from an existing work [1], generated using the HunyuanVideo model.
  2. We annotated the visual quality and prompt-following of these videos.
  3. During annotation, we excluded 83 prompts that describe impossible phenomena and evaluated the remaining 337 normal prompts.
| Prompt Type | Visual Quality | Prompt Following |
| --- | --- | --- |
| Normal | 92.6 | 66.5 |
| Impossible | 88.9 | 37.2 |

We observe that 1) visual quality for normal prompts is slightly better than for impossible ones; and 2) prompt following for normal prompts is significantly better than for impossible prompts, which further underscores the unique challenges posed by impossible videos.

We appreciate this insightful suggestion. In the revised paper, we will conduct a more comprehensive study to further strengthen our findings.

Q4: Can impossible videos serve as negative samples to improve Video-LLM in reasoning tasks or enhance T2V generation quality?

We recognize the potential in these directions:

  • For Video-LLMs, impossible videos can serve as high-quality training samples to enhance reasoning capabilities. Since these videos introduce "novel" counterfactual knowledge beyond real-world data, they may help models develop stronger reasoning and generalization skills.
  • For T2V models, while the community has focused heavily on physical law adherence, there is limited understanding of how to explicitly improve this. Impossible videos could serve as negative samples, reinforcing a more structured comprehension of physical plausibility.

Both directions present exciting and meaningful research opportunities. We hope the release of Impossible Videos will inspire further exploration in these areas.

[1] The Dawn of Video Generation: Preliminary Explorations with SORA-like Models. arXiv 2024.

[2] Cosmos World Foundation Model Platform for Physical AI. arXiv 2025.

Reviewer Comment

Thank you very much for the sufficient response. I am pleased with the contribution of this work. Therefore, I will increase my score.

Author Comment

Dear Reviewer AP4m,

Thank you so much for your positive feedback and for helping improve our paper!

Best Regards

Review
Rating: 4

This paper introduces the novel concept of "impossible videos" as a challenging testbed for advancing video understanding and generation models. It proposes the IPV-VID benchmark, comprising many anti-reality videos, for evaluating video LLMs on the understanding task. Results in the paper reveal that although video LLMs excel at processing real-world scenarios, they struggle with anti-reality content, which requires deep reasoning rather than simple memorization. Besides, it also proposes the IPV-TXT benchmark, which contains many anti-reality text prompts. These prompts can be used to prompt T2V models to generate the corresponding videos. Results in the paper show that today's T2V models also struggle to generate aligned videos, highlighting their reliance on pattern matching rather than true understanding. The benchmarks and findings not only identify crucial shortcomings in existing approaches but also establish promising directions for future research aimed at developing more robust and generalizable video AI systems.

update after rebuttal

I appreciate that the authors took the time to explain the implementation details and the metrics. These responses have addressed my concerns. I think the benchmarks proposed by the paper are promising for future research on more robust and generalized video AI systems. I will increase my score and support its acceptance.

Questions for the Authors

See the weaknesses in the Other Strengths and Weaknesses section.

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

N/A, no proofs in this paper.

Experimental Design and Analysis

Yes, the experimental designs are valid.

Supplementary Material

Yes, I have read all of the supplementary materials.

Relation to Existing Literature

This paper proposes two benchmarks for evaluating understanding and generation models in the video domain. These benchmarks focus on impossible videos, a field previous benchmarks have ignored. They are very valuable for practitioners to probe whether their models truly understand real-world laws or simply memorize the training set.

Missing Important References

This paper should discuss some other benchmarks regarding both video understanding and generation.

Other Strengths and Weaknesses

Strengths:

  1. The motivation of this paper is compelling and thoughtful.
  2. The paper is easy to follow.
  3. Results in the paper are valuable for the research community to better understand current video understanding and generation tasks.

Weaknesses:

  1. How is the score in the Open-ended QA task calculated?
  2. Calculating the Prompt Following metric requires human involvement, which is time-consuming and costly. Is there any method to calculate this metric without humans?

Other Comments or Suggestions

No

Author Response

Thank you for your encouraging review and valuable suggestions. We appreciate your acknowledgment of the paper’s novelty and will carefully address your concerns to improve clarity and rigor. Below are our detailed responses:

Q1: It would be better to explain more clearly how the score in the Open-ended QA task is calculated.

We appreciate the reviewer’s suggestion and provide a more detailed explanation below.

For evaluating the Open-ended QA task, we employ an LLM-based evaluator that compares model responses against the annotated text explanations in the benchmark. However, we empirically observed that directly instructing the LLM to assign scores led to instability. To address this, we propose a justification-then-score approach:

  1. Justification Step: The evaluator first analyzes key matches or mismatches between the model’s response and the ground truth, providing a textual justification.
  2. Scoring Step: Based on this justification, the evaluator assigns a semantic alignment score on a scale from 0 to 1:
    • 1.0 – Perfect alignment
    • 0.8-0.9 – Good alignment
    • 0.5-0.7 – Partial alignment
    • 0.1-0.4 – Weak alignment
    • 0.0 – No alignment

This justification step is crucial for ensuring fair and stable score assignment. In this work, we employ GPT-4o as the evaluator. In the current version of the paper, we briefly mention the use of GPT-4o in Section 4, line 306. We will include this detailed explanation in the revised version (or supplementary materials) to improve clarity.
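
For illustration, a minimal sketch of this justification-then-score judging with GPT-4o; the prompt text, JSON fields, and function name are assumptions for this example, not the exact evaluator prompt used in the paper.

```python
# Sketch: LLM-as-judge for open-ended QA, scoring semantic alignment in [0, 1].
# The judge is asked to justify first and only then to score, to stabilize the
# rating (see the explanation above); prompt wording and fields are hypothetical.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = ("1.0 perfect alignment; 0.8-0.9 good; 0.5-0.7 partial; "
          "0.1-0.4 weak; 0.0 no alignment.")

def judge(question: str, reference: str, model_answer: str) -> dict:
    prompt = (
        f"Question: {question}\n"
        f"Reference explanation: {reference}\n"
        f"Model answer: {model_answer}\n\n"
        "First write a short justification listing the key matches and "
        "mismatches between the model answer and the reference. Then assign "
        f"a semantic alignment score using this rubric: {RUBRIC}\n"
        'Reply as JSON: {"justification": "...", "score": <float in [0, 1]>}'
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    return json.loads(resp.choices[0].message.content)
```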

We sincerely thank the reviewer for highlighting this point.

Q2: Is there any method to calculate the Prompt Following metric without humans?

This is an insightful question, and we fully agree that an automatic evaluation strategy would enhance the scalability of our benchmark.

In our main paper (Table 4), we report human-annotated results to provide reliable insights into the performance of current T2V models. Additionally, in Appendix B.2, we introduce an automatic evaluation strategy for impossible video generation. Specifically, the Prompt Following metric can be assessed using state-of-the-art Vision-Language Models (e.g., GPT-4o) in conjunction with a carefully designed prompting strategy. Experimental results clearly demonstrate the consistency between human evaluation and automatic evaluation, as shown in Table 6 and Figure 6 in the Appendix.
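
As a rough illustration of such an automatic check (not the exact protocol in Appendix B.2), one could sample a few frames and ask a vision-language judge whether they follow the prompt; the frame count, helper names, and yes/no phrasing below are assumptions.

```python
# Sketch: automatic Prompt Following check with a vision-language judge.
# Assumes OpenCV for frame sampling and the OpenAI SDK for GPT-4o; the number
# of frames and the yes/no prompt are illustrative choices.
import base64
import cv2
from openai import OpenAI

client = OpenAI()

def sample_frames(video_path: str, num_frames: int = 8) -> list[str]:
    """Uniformly sample frames and return them as base64-encoded JPEGs."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
        ok, frame = cap.read()
        if not ok:
            continue
        ok, buf = cv2.imencode(".jpg", frame)
        if ok:
            frames.append(base64.b64encode(buf.tobytes()).decode())
    cap.release()
    return frames

def follows_prompt(video_path: str, text_prompt: str) -> bool:
    content = [{"type": "text", "text":
                f"Prompt: {text_prompt}\nDo these frames, viewed in order, "
                "depict the impossible event described by the prompt? "
                "Answer yes or no."}]
    content += [{"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
                for b64 in sample_frames(video_path)]
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```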

We appreciate the reviewer’s insightful question and will further emphasize this discussion in the revised version.

Q3: This paper should discuss some other benchmarks regarding both video understanding and generation.

We appreciate the reviewer’s comment. In the Related Work section and Table 1, we have comprehensively discussed the relationships and distinctions between IPV-Bench and existing benchmarks across three key areas: video understanding, video generation, and AIGC video detection. This comparison highlights the unique contributions of IPV-Bench and its role in bridging gaps that are not addressed by prior benchmarks.

We will ensure that this discussion is clearly emphasized in the revised version. Thank you for your valuable feedback.

Review
Rating: 4

The paper introduces IPV-BENCH, a novel benchmark designed to evaluate video understanding and generation models from the perspective of impossible videos. It categorizes scenarios that violate physical, biological, geographical, and social laws. The main experimental results reveal that current models have difficulty understanding and generating such videos, pointing to potential directions for improving video models.

update after rebuttal

I appreciate the authors' response and I remain positive.

Questions for the Authors

None

Claims and Evidence

The claims made in the paper are largely supported by the empirical results. However, some claims regarding the models' limitations in understanding impossible videos could be strengthened by more detailed quantitative metrics. For example, the authors claim most video models fall short on impossible videos; specific metrics across categories would strengthen this statement.

Methods and Evaluation Criteria

The methods, including the construction of the IPV-BENCH benchmark and the associated taxonomy, make sense. The evaluation criteria, encompassing the Judgment, Multi Choice, and Open-ended QA tasks, effectively assess the model's capabilities in understanding impossible scenarios. The diverse sources of video data, including synthetic, real, and community-generated content, enhance the robustness of the evaluation.

Theoretical Claims

The paper does not present formal proofs for any theoretical claims but relies on empirical evaluations.

Experimental Design and Analysis

The experimental designs are generally thorough, with well-defined tasks and clear criteria for measuring model performance. However, providing more information about the filtering criteria for selecting videos would enhance the dataset's integrity.

Supplementary Material

The supplementary material presents abundant visualizations of the impossible videos.

Relation to Existing Literature

The paper highlights the gaps in existing benchmarks that do not address impossible or counterfactual videos. It may also relate to literature that studies how humans react to impossible or counterfactual information.

Missing Important References

No

Other Strengths and Weaknesses

The paper provides an interesting perspective on evaluating video models' abilities. The experimental results and analysis provide insights into how to develop video understanding and generation models.

Other Comments or Suggestions

None

Author Response

Thank you for your positive feedback and constructive suggestions. We are grateful for your recognition of our work’s novelty and will incorporate your recommendations to further strengthen the paper. Below, we address your comments in detail.

Q1: Claims about model limitations of understanding impossible videos could be strengthened with category-specific quantitative metrics.

We appreciate the reviewer’s suggestion.

Table 2 of the paper presents impossible video understanding metrics (Multi-choice QA and Open-ended QA) across the four categories of IPV-Taxonomy: Physical, Biological, Social, and Geographical. Notably, the Physical category proves to be the most challenging, yielding the lowest scores across both tasks.

Upon further analysis, we observed that videos in the Physical category often exhibit complex impossible temporal dynamics, requiring sophisticated temporal reasoning. In contrast, videos in other categories can be largely addressed through world knowledge reasoning, which aligns well with the capabilities of large language models (LLMs).

To further investigate this, Table 3 reports an experiment where videos are classified into two groups via human annotation:

  • Spatial – Impossible phenomena identifiable from a single frame.
  • Temporal – Impossible phenomena requiring cross-frame temporal reasoning.

Results indicate that models perform significantly worse on Temporal videos than on Spatial ones, highlighting temporal reasoning as a major bottleneck in understanding impossible videos.

Q2: It would be better to provide more information about the video filtering criteria.

We appreciate the reviewer’s valuable suggestion. The goal of video filtering is to ensure that the selected videos: 1) Maintain high visual quality; 2) Clearly demonstrate impossible phenomena.

Visual Quality Criteria:

  • Accepted: Clear, sharp videos with high aesthetic value and smooth temporal motion.
  • Rejected:
    • Videos with jitter, flicker, blur, large-scale artifacts, or indistinct/distorted foreground objects.
    • Completely static videos with no visible changes.
    • Videos that lack logical coherence and appear visually chaotic.

Impossible Semantics Criteria:

  • The video must clearly depict an impossible, counterfactual phenomenon that cannot occur in the real world. The impossibility should be a salient event, rather than minor visual details that are difficult to perceive.
  • The video should be in a photo-realistic style. Non-realistic styles (e.g., cartoon-style videos) are excluded to avoid confusion in video understanding.

We will include these detailed filtering criteria in the revised paper. Thank you again for your insightful review. Your feedback is helpful for enhancing the rigor and clarity of our work. We are happy to address any further questions.

Review
Rating: 2

The paper introduces IPV-BENCH, a benchmark for evaluating video understanding and generation models using "impossible videos". It includes a taxonomy, a prompt suite (IPV-TXT), and a video dataset (IPV-VID). Evaluations reveal limitations in current models, highlighting the need for improved reasoning and generalization in non-real-world scenarios.

Questions for the Authors

How do you plan to encourage broader adoption of IPV-BENCH as a standard benchmark in the video understanding and generation community?

Claims and Evidence

The claims in the submission are not fully supported by clear and convincing evidence. The paper lacks a detailed release of code and benchmark datasets, which are crucial for reproducibility and validation. Additionally, focusing on "impossible videos" is niche and may not attract widespread adoption, limiting the benchmark's impact. These issues weaken the overall credibility and practical utility of the claims.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria, including the IPV-BENCH benchmark, are relevant for assessing video models on impossible videos.

Theoretical Claims

The paper does not present any theoretical claims or proofs that require verification. It focuses on empirical evaluation and benchmarking of video understanding and generation models.

Experimental Design and Analysis

The paper lacks detailed experimental designs and analyses, particularly in the evaluation of video generation models. The IPV-Score metric is introduced but not thoroughly explained or validated. Additionally, the absence of released code and datasets undermines the reproducibility and soundness of the experiments.

Supplementary Material

Yes. However, the supplementary material contains no useful information.

Relation to Existing Literature

No.

Missing Important References

No.

Other Strengths and Weaknesses

Strengths:

  1. Introduces a novel concept of "impossible videos" to challenge video models.
  2. Constructs a comprehensive taxonomy and benchmark (IPV-BENCH) for evaluation.

Weaknesses:

  1. Lacks release of code and benchmark datasets, limiting reproducibility.
  2. Focuses on non-mainstream scenarios, potentially reducing broad interest and adoption.

Other Comments or Suggestions

  1. Provide detailed supplementary materials, including code and benchmark datasets, to enhance reproducibility.
  2. Highlight practical applications of impossible videos to justify their significance in the research community.

Author Response

Thank you for your thoughtful feedback. We greatly appreciate your insights on our paper. Below, we outline our responses to each point.

Q1: Reproducibility: Code/Dataset Release

We appreciate the reviewer’s concern regarding reproducibility. Upon acceptance, we will publicly release all code and data, including the IPV-Bench taxonomy, IPV-TXT prompts, IPV-VID videos, and evaluation protocols. Additionally, we will provide a detailed Data Usage Instruction in the Appendix to enhance accessibility. We believe that this comprehensive release will maximize the impact of Impossible Videos and foster further research in this area.

Q2: Significance and Applications of Impossible Videos.

We appreciate the reviewer’s valuable suggestion, as it will further enhance the impact of Impossible Videos.

Significance:

  • As highlighted in our paper, this benchmark fills a critical gap in evaluating counterfactual video generation and understanding—an area currently absent in the community. This importance has also been recognized by Reviewers F9mi and AP4m.
  • Evaluating AI robustness and generalization in out-of-distribution scenarios is a well-established challenge, particularly for autonomous systems encountering rare events. As noted by Reviewers F9mi and JPJB, "Impossible Videos" serves as a robustness and generalization benchmark for video understanding and generation models, addressing an overlooked yet crucial aspect of AI evaluation.
  • As Reviewers F9mi and AP4m acknowledged, our work is closely related to broader topics in counterfactual and causal reasoning in AI. "Impossible Videos" provides a valuable case study for counterfactual reasoning in the video domain, encompassing both video understanding and generation.

Applications:

  • "Impossible Videos" can be leveraged to enhance video understanding and generation models by improving their robustness and generalization capabilities.
  • Real-world applications include:
    • Creative industries – Enhancing special effects, game design, advertising, and filmmaking, etc.
    • Industrial safety – Assisting in anomaly detection and risk assessment.
    • Advanced AI assistants – Equipping AI systems with stronger reasoning capabilities for more intelligent decision-making.

Q3: Experimental Details: IPV-Score and Evaluation

We appreciate the reviewer’s feedback on this issue. The computation of the IPV-Score is based on the statistics of Visual Quality and Prompt Following and is defined as:

$$\text{IPV Score} = \frac{\big|\{\text{High Visual Quality}\} \cap \{\text{Good Prompt Following}\}\big|}{\text{Num. of All Videos}}$$

This metric intuitively aligns with our design philosophy: it measures the percentage of videos that satisfy both high visual quality and strong prompt adherence. To improve clarity, we will include this equation along with a detailed textual explanation in the revised paper.
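
For concreteness, a small sketch of how this definition translates to code, assuming per-video boolean annotations; the field names are hypothetical.

```python
# Sketch: IPV-Score as the fraction of videos that are both visually good and
# follow the (impossible) prompt, per the equation above. Field names assumed.
def ipv_score(annotations: list[dict]) -> float:
    """annotations: one dict per video, e.g.
    {"high_visual_quality": True, "good_prompt_following": False}"""
    if not annotations:
        return 0.0
    both = sum(
        1 for a in annotations
        if a["high_visual_quality"] and a["good_prompt_following"]
    )
    return both / len(annotations)

# Example: 3 videos, only one satisfies both criteria -> 1/3
# ipv_score([
#     {"high_visual_quality": True,  "good_prompt_following": True},
#     {"high_visual_quality": True,  "good_prompt_following": False},
#     {"high_visual_quality": False, "good_prompt_following": True},
# ])
```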

Q4: How to Encourage Broader Adoption?

We appreciate this constructive question and will take the following steps to encourage broader adoption:

  • Public Release – Make all data and evaluation code openly available to enhance accessibility.
  • Comprehensive Documentation – Provide detailed instructions for data usage to facilitate easy adoption.
  • Community Engagement – Organize competitions and workshops on Impossible Videos to attract more researchers to this domain.
  • Integration with Existing Toolkits – Collaborate with established toolkits to incorporate IPV-Bench into widely used evaluation suites.
  • Ongoing Benchmark Maintenance – Maintain a leaderboard and regularly update the benchmark to ensure its relevance.

We sincerely hope that these efforts will inspire further research and drive innovation in the video understanding and generation community.

Final Decision

The paper analyses generated videos that defy reality. The analysis covers two aspects: the perspective of video understanding models, and the perspective of video generators, namely whether they can follow unrealistic prompts. The reviewers overall suggest acceptance, as the results are interesting. The paper doesn't propose novel ideas or theoretical analysis; it proposes a new benchmark generated using existing state-of-the-art video generators. The AC believes that the paper has value as of now, but it will very likely need to be revisited in the future as video generators, as well as understanding methods, change.

The decision to accept the manuscript is conditional upon the authors’ commitment to release the benchmark code and data.