PaperHub
Average Rating: 5.2 / 10 · Rejected · 5 reviewers
Ratings: 5, 5, 6, 5, 5 (min 5, max 6, std dev 0.4)
Confidence: 4.0 · Correctness: 2.2 · Contribution: 2.6 · Presentation: 3.0
ICLR 2025

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

Submitted: 2024-09-21 · Updated: 2025-02-05

Abstract

Keywords
World Simulator · Physical Commonsense · Video Generation · Evaluation

Reviews and Discussion

Official Review (Rating: 5)

The paper proposes PhyGenBench and PhyGenEval. PhyGenBench is a benchmark of about 160 text prompts used to evaluate models' video generation ability on physics-related text prompts. PhyGenEval is the evaluation framework for PhyGenBench, used to automatically assess whether generated videos obey physical laws, via GPT-prompted questions and VLM perception.

Strengths

  • The topic is novel and interesting. Evaluating physics in AI-generated videos is really important. This paper is the first one on this topic as far as I know.

  • Experiments show that PhyGenEval aligns more closely with human judgments.

Weaknesses

  • The PhyGenBench is a dataset with 160 text prompts. As a comparison, for the works mentioned in this paper, VideoPhy has 688 prompts with 36.5k human annotations, and DEVIL has more than 800 prompts. Only 160 text prompts may not represent the full complexity of physical laws.

  • In PhyGenEval, the overall score is set on a four-point scale, but even the top-performing video generation model scores only 0.5 on average. That means the model gets a 0 score in more than half of the test cases. This suggests that the evaluation metric might be overly strict, potentially limiting its effectiveness in distinguishing between models. Such stringent scoring could reduce the benchmark’s ability to accurately reflect model performance differences.

  • Since the topic is related to evaluating physics in generative models, I think it is better to add some discussion of physical reasoning benchmarks in the related work, which has been an actively debated topic, such as SuperCLEVR-Physics [1], ContPhy [2], Physion [3], and so on.

[1] Wang, X., Ma, W., Wang, A., Chen, S., Kortylewski, A., & Yuille, A. (2024). Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering. ArXiv. https://arxiv.org/abs/2406.00622

[2] Zheng, Z., Yan, X., Chen, Z., Wang, J., Lim, Q. Z., Tenenbaum, J. B., & Gan, C. (2024). ContPhy: Continuum Physical Concept Learning and Reasoning from Videos. ArXiv. https://arxiv.org/abs/2402.06119

[3] Bear, D. M., Wang, E., Mrowca, D., Binder, F. J., Tung, H., Pramod, R. T., Holdaway, C., Tao, S., Smith, K., Sun, F., Kanwisher, N., Tenenbaum, J. B., Yamins, D. L., & Fan, J. E. (2021). Physion: Evaluating Physical Prediction from Vision in Humans and Machines. ArXiv. https://arxiv.org/abs/2106.08261

Questions

See Weakness above.

Comment

Q1: The PhyGenBench is a dataset with 160 text prompts...

A1: Thank you for your suggestion. We emphasize that PhyGenBench focuses on the most fundamental physical laws and simple scenarios. It undergoes rigorous screening and quality control. Experiments reveal that even for these basic physical scenarios, current video generation models struggle to produce videos that align with physical commonsense. We provide a more detailed explanation in General Response-Part1.

Comment

Q2: In PhyGenEval, the overall score is set on a four-point scale...

A2: We apologize for any confusion caused (we have explained this in the caption of Table 2 in the main text). We provide detailed clarifications below:

  1. Normalization of Scores
    The results presented are normalized scores (out of a maximum score of 3). We emphasize this in the table captions in the paper to ensure clarity.

  2. Distinguishing Capability of the Method
    The method demonstrates a clear distinction between open-source and closed-source models. For example:

    • Among open-source models, the best-performing Vchitect2.0 achieves 0.45, while the best-performing closed-source model, Gen-3, achieves 0.51. This reflects a notable gap of nearly 29 points in total scores, highlighting the performance difference between open-source and closed-source models (see the worked arithmetic after this list).
    • Within open-source models, the total score difference between CogVideo2b and CogVideo5b is also approximately 29 points, showcasing the impact of scaling laws on models' ability to understand physical realism.
  3. Challenges of the Benchmark
    The challenging nature of the benchmark results in relatively low scores across models. Generating physically accurate videos is inherently difficult, as it requires balancing smooth video generation with physical correctness. While current video generation models aim to achieve physical realism, their performance still falls short. We hope that PhyGenBench and PhyGenEval can support the community in further advancing this direction.
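To make the "nearly 29 points" figure concrete, here is a minimal sketch of the arithmetic, assuming the normalized averages are taken over the 160 PhyGenBench prompts, each scored on a raw 0-3 scale (our reading of the setup described above, not the authors' scripts):

```python
# Hypothetical reconstruction of the ~29-point gap; names are ours.
num_prompts = 160      # prompts in PhyGenBench
max_per_prompt = 3     # raw per-prompt scale before normalization

def raw_total(normalized_avg: float) -> float:
    """Convert a normalized average (0-1) back to a raw total score."""
    return normalized_avg * max_per_prompt * num_prompts

gap = raw_total(0.51) - raw_total(0.45)  # Gen-3 vs. Vchitect2.0
print(round(gap))  # -> 29 raw points
```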

Thank you again for your thoughtful suggestions. If you have further questions or need additional clarification, please feel free to contact us.

Q3: Since the topic is related to evaluating the physics...

A3: Thank you for your valuable feedback. We have added the related work for this section in Appendix A. If you have further questions or need additional clarification, please feel free to contact us.

Comment

I would like to thank the authors for these invaluable comments, which have solved my concerns about W.2 and W.3. However, I am still not fully convinced that the proposed benchmark is robust enough to be adopted in the video generation community.

Comment

We would like to extend our heartfelt gratitude for the invaluable suggestions provided for this paper. In light of your feedback, we have diligently provided comprehensive and elaborate explanations.

  1. Reproducibility and Model Replacement: We have discussed the reproducibility of the method in detail in Response to Reviewer PPCZ (Q2), as well as the potential replacement with open models in General Response-Part2. Experimental results confirm that the method is highly reproducible.
  2. Human Evaluation of GPT-4o generated questions: We conducted a human evaluation to assess the questions generated by GPT-4o, which is shown in Response to Reviewer wXRw (Q1) Table 2. The results show that, despite some hallucinations from LLMs, the simplicity of PhyGenBench ensures that the generated questions are highly reliable.
  3. Effectiveness of the Three-Stage Method: In Appendix D.3 (The Component in PhyGenEval on physical commonsense alignment evaluation), we explain that all three stages of our method contribute to the final results. By combining these stages, we effectively reduce the hallucinations that occur when using a single VLM, which also enhances the robustness of the method.
  4. Robustness Discussion in Appendix E: In Appendix E (The robustness of PhyGenBench and PhyGenEval), we further discuss the robustness of our method. When using video quality enhancement modules like VEnhancer that do not affect physical correctness, the results of PhyGenEval remain nearly unchanged, highlighting the robustness of the method.

Therefore, we believe that both PhyGenBench and PhyGenEval demonstrate strong reproducibility and robustness. If you have any further inquiries, we are more than delighted to offer additional clarifications and conduct further experiments.

Comment

As today marks the final day of the discussion period, we would like to sincerely thank you for your thoughtful feedback on our paper. In response, we have addressed each of your points with thorough and detailed explanations. If you have any further questions or would like additional clarifications, we are more than happy to provide further insights and conduct additional experiments. We kindly ask if you would reconsider the score, and we look forward to hearing your thoughts.

Official Review (Rating: 5)

This paper introduces PhyGenBench, a new benchmark for evaluating the physical commonsense capabilities of Text-to-Video (T2V) models, particularly their understanding of intuitive physics. It includes prompts across 27 physical laws within four domains: mechanics, optics, thermal, and material properties. To evaluate performance on this benchmark, the authors propose PhyGenEval, a hierarchical evaluation framework using advanced vision-language models (VLMs) and large language models. Experimental results reveal that current T2V models lack robust physical commonsense, underscoring the gap between these models and true world simulators.

Strengths

  • This paper clearly has great novelty. Its focus on intuitive physics is unique and addresses an important gap in T2V evaluation.
  • PhyGenEval's three-tiered framework (key phenomena detection, order verification, and overall naturalness) thoroughly assesses physical realism.
  • By drawing more attention to the gap in physical commonsense, the benchmark provides great insights into how to improve video generation models toward becoming real world simulators.

Weaknesses

  • The paper includes extensive comparisons to demonstrate PhyGenEval’s effectiveness, suggesting that a two-stage evaluation strategy may align more closely with human judgments for both InternVideo2 and GPT-4o. Line 965 also notes that alternative open-source models achieve a high correlation coefficient with human evaluations. However, it appears that the main results rely on a specific version of GPT-4o, which is not explicitly mentioned. As a benchmark, would future users need to evaluate all baselines and methods on updated versions of GPT-4o to ensure fair comparisons? While the paper suggests that evaluation costs are minimal, I am concerned that this reliance on a specific model version may affect consistency. Have the authors considered using other LVLMs in place of GPT-4o?

  • Certain T2I models may perform poorly on specific prompts. I am not fully convinced that the proposed evaluation method can robustly handle these lower-quality videos.

  • The issue of hallucination in large language models (LLMs) does not appear to be addressed in the evaluation protocol, potentially impacting the reliability of the benchmark. It would be beneficial if the authors considered this factor in their assessment framework.

  • The author promised more human evaluation results in Appendix C.2, but this result seems to be under Appendix C.1; the writing is confusing. Also, between lines 899 and 905, I believe the annotation should be done more rigorously. I expect carefully validated results from human annotators; otherwise, the results can be noisy. Showing the instructions given to the human annotators would be particularly helpful.

Questions

  • What is "the final score is calculated as 0 according to 4.2" (line 292)? Does the example in the figure receive 0 after this physical commonsense evaluation?

  • It seems like the entire evaluation relies on a closed-source LLM: GPT-4o. If, in the future, GPT-4o becomes unavailable, how should people compare results?

  • Some typos, such as "we pue more detailed" (line 410) and "Appendix C.2" (line 418).

Comment

Q1: The paper includes extensive comparisons to demonstrate PhyGenEval’s effectiveness...

A1: Thank you for your thoughtful feedback. We provide our detailed responses below:

  1. Clarification on alternative open-source models.
    The alternative open-source models refer to replacing the GPT-4o used in the Physics Order Verification and Overall Naturalness Evaluation stages with open-source VLMs. For question generation, we consistently use GPT-4o. We have further clarified this point in the paper.

  2. Specification of the GPT-4o version.
    We specify in our anonymous repository that the version of GPT-4o used is gpt4o-0806, and we have explicitly stated this in the paper as well.

  3. Validation of GPT-4o-generated questions.
    We conduct a detailed human evaluation, which demonstrates that the questions generated by GPT-4o are highly reliable. For detailed information, please refer to Response to Reviewer wXRw (Q1).

  4. Replacing GPT-4o with open-source VLMs

    We provide a detailed discussion on replacing GPT-4o with open-source VLMs in General Response-Part2. Notably, even without using GPT-4o, the method achieves reasonably good performance (the Pearson correlation is over 0.7), demonstrating its robustness and flexibility.

Thank you again for your feedback. If you have further questions or need additional clarification, please feel free to contact us.

Comment

Thank you to the authors for their timely feedback! I acknowledge that the authors specified the version of GPT-4o used in their work. However, I still have some concerns: while the paper suggests that evaluation costs are minimal, relying on a specific model version could impact consistency over time. Given the possibility that the GPT-4o version may eventually be retired, would it not be more prudent to use InternVL-Pro (78B) consistently throughout the main text of the paper?

Comment

We thank the reviewer for your valuable questions and provide the following clarifications:

  1. Regarding VLM evaluation, as shown in Table 1 below, we have already evaluated the framework using only open-source VLM models. The details are in General Response-Part2. While the performance is slightly weaker compared to closed-source models, it still achieves a Pearson correlation coefficient of 0.72, demonstrating the high reliability of PhyGenEval. Moreover, we observe that as stronger and larger open-source models are used, the alignment between PhyGenEval and human evaluation improves. This gives us confidence that, with the continued development of open-source models, our framework will not require GPT-4o. We also provide additional evaluation results using only open-source VLMs in Table 2 below. Like Table 2 in the main text, it reflects the gap between different models (e.g., Gen3 performs best, while Lavie performs worst). In the future, we will release a leaderboard and inference code based entirely on open-source VLMs to further strengthen the stability and reproducibility of our approach.

  2. Regarding the generation of questions using LLMs, we have validated the reliability of GPT-4o-generated questions through human evaluation. These questions, like the prompts in PhyGenBench, are part of the benchmark. We will further refine and clean these questions and release them as open-source to enable standardized and accessible testing for the community.

  3. Regarding the use of GPT-4o in the main text, the primary reasons are: i) Accessing the API is more convenient and has a lower entry barrier compared to deploying models. ii) As shown in Table 1 in Response to Reviewer wXRw (Q1), our approach is computationally efficient, making the cost of using the API lower than the GPU resources required for deployment.

Metric | Mechanics (ρ ↑) | Optics (ρ ↑) | Thermal (ρ ↑) | Material (ρ ↑) | Overall (ρ ↑)
PhyGenEval (Open) | 0.62 | 0.65 | 0.64 | 0.73 | 0.72
PhyGenEval (Close) | 0.75 | 0.77 | 0.75 | 0.84 | 0.81

Table 1: Comparison of PCA correlation results (Pearson) using closed-source or open-source models in PhyGenEval.
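The alignment numbers reported above can be reproduced with a computation along these lines; the per-video score lists and the use of scipy are our assumptions, not the authors' released evaluation code:

```python
# Minimal sketch: Pearson correlation between automatic and human scores.
from scipy.stats import pearsonr

def human_alignment(auto_scores, human_scores):
    """Pearson correlation between automatic PCA scores and human ratings."""
    rho, _ = pearsonr(auto_scores, human_scores)
    return rho

# e.g. human_alignment(phygeneval_open_scores, human_pca_scores) -> ~0.72
```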

Model | Size | Mechanics (↑) | Optics (↑) | Thermal (↑) | Material (↑) | Average (↑)
CogVideoX | 2B | 0.42 | 0.48 | 0.48 | 0.45 | 0.46
CogVideoX | 5B | 0.43 | 0.60 | 0.55 | 0.48 | 0.51
Open-Sora | 1.1B | 0.52 | 0.57 | 0.51 | 0.46 | 0.51
Lavie | 860M | 0.38 | 0.49 | 0.43 | 0.40 | 0.43
Vchitect 2.0 | 2B | 0.48 | 0.62 | 0.53 | 0.45 | 0.52
Pika | - | 0.40 | 0.60 | 0.50 | 0.49 | 0.50
Gen-3 | - | 0.46 | 0.63 | 0.56 | 0.52 | 0.55
Kling | - | 0.50 | 0.64 | 0.54 | 0.44 | 0.54

Table 2: Evaluation results of PCA with the proposed PhyGenEval, using open VLMs, on videos generated by several models.

Comment

I appreciate that the authors plan to provide a public leaderboard in the future using open-sourced VLMs, which is a commendable effort. However, I find that there are many moving parts in the evaluation protocols, and several key details are either unclear or not sufficiently stated in the main text of the paper. While I recognize this as an important and promising attempt to evaluate physics for video diffusion models, I believe that my initial score remains appropriate because this is a benchmark paper.

Comment

First: "I find that there are many moving parts in the evaluation protocols." Answer: In response to this comment, we only replaced the LLaVA-Interleave model with the larger InternVL2-Pro and made no other changes.

Second: "Several key details are either unclear ..." Answer: We thank you for your suggestions. In fact, PhyGenEval provides a three-stage evaluation strategy, and we have marked the VLM used in each stage. In addition, we have open-sourced our code and test files at https://github.com/PhyGenBench/PhyGenBench, which we believe is user-friendly. If you have specific questions, please let us know and we will clarify them promptly!

Comment

Q2: Certain T2I models may perform poorly on specific prompts...

A2: Thank you for your thoughtful feedback. We provide detailed responses below:

  1. Filtering difficult prompts in PhyGenBench.
    During the construction of PhyGenBench, we deliberately filter out overly challenging prompts that lead to extreme distortions in generated videos. These prompts make it meaningless to evaluate physical realism. Instead, we focus on retaining prompts that enable the generation of basic scenes while exposing physical issues. This high-quality prompt design allows PhyGenEval to perform efficient and meaningful evaluations.

  2. Incorporation of semantic alignment scores.
    We design a semantic alignment (SA) score to evaluate whether the generated videos align with the prompts. As detailed in Appendix D.2 (Quantitative result about semantic alignment), the results show that due to the high quality of our prompts, all tested models achieve high SA scores in both machine and human evaluations.

  3. Evaluation of low-quality videos.
    We manually filter videos with low SA scores (≤ 1) and identify a total of 50 such low-quality videos. For these videos, we calculate their PCA scores. Since these videos are inherently highly distorted, it is nearly impossible to assess their physical realism. As shown in Table 5 below (after normalization), these low-quality videos also achieve extremely low PCA scores, further demonstrating that PhyGenEval is capable of identifying such low-quality cases effectively.

Thank you again for your valuable feedback. If you have further questions or need additional clarification, please feel free to contact us.

Avg. SA | Avg. PCA
0.42 | 0.12

Table 5: SA and PCA scores on the selected videos; scores are normalized to 0-1.
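A minimal sketch of the filtering step described above is shown below; the field names and the separate raw/normalized scores are hypothetical, following our reading of the response rather than the authors' code:

```python
def low_quality_subset(videos, sa_threshold=1.0):
    """Keep videos whose raw SA score is <= the threshold, then report the
    subset size and its average normalized SA and PCA scores."""
    subset = [v for v in videos if v["sa_raw"] <= sa_threshold]
    avg = lambda key: sum(v[key] for v in subset) / len(subset)
    return len(subset), avg("sa_norm"), avg("pca_norm")
```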

Comment

Q3: The issue of hallucination in large language models...

A3: Thank you for your insightful feedback. As noted in our Response to Reviewer wXRw (Q1), thanks to the extensive world knowledge of GPT-4o, we are able to generate highly reliable questions. We verify this through human evaluation to ensure their accuracy. Human evaluation shows that the questions generated by GPT-4o are physically consistent and highly reliable. Furthermore, we plan to refine this component further and release it as an open-source resource to enable standardized testing for others.

Thank you again for your suggestions. If you have further questions or require additional clarification, please feel free to contact us.

Comment

Q4: The author promised more human evaluation results...

A4: Thank you for your suggestion. We have made the necessary revisions. For the instructions used in human annotations, we provide a detailed explanation in Appendix D.1 (Human evaluation details) and include examples of these instructions in Figure 10. Thank you again for your valuable feedback. If you have further questions or need additional clarification, please feel free to contact us.

Q5: What is "the final score is calculated as 0 according to 4.2..

A5: Yes, the three-stage scores are 0, 1 (only Question_1 is correct), and 0. The final score is calculated as 0 because the average of these scores is 0. In this example, since the egg bounces off the rock like a rubber ball, the human annotation score is also 0.

Q6: It seems like the entire evaluation rely...

A6: Thank you for your feedback. We provide detailed responses below:

  • For the questions generated by GPT-4o, we conduct a thorough human review, as explained in our Response to Reviewer wXRw (Q1). This demonstrates that the generated questions are highly reliable. We plan to further refine and release these questions as an open-source resource to facilitate standardized testing.
  • Regarding GPT-4o's use as a VLM for evaluation, we provide a detailed explanation in our General Response-Part2, where we discuss replacing GPT-4o with open models. The results show that open models can also achieve a Pearson correlation coefficient above 0.7 with human ratings.
  • Finally, we plan to further explore the use of other open-source LLMs and work towards implementing a more end-to-end evaluation approach. Thank you again for your valuable suggestions! If you have further questions or need additional clarification, please feel free to contact us.
Comment

We would like to extend our heartfelt gratitude for the invaluable suggestions provided for this paper. In light of your feedback, we have diligently provided comprehensive and elaborate explanations. If you have any further inquiries, we are more than delighted to offer additional clarifications and conduct further experiments. We eagerly anticipate your response!

Comment

As today is the final day of the discussion period, we would like to express our sincere thanks for your valuable feedback on our paper. In response, we have carefully considered your suggestions and provided detailed explanations. Should you have any further questions or require additional clarifications, we are more than willing to offer further insights and conduct additional experiments as needed. We kindly ask if you could reconsider the score and we look forward to your response.

Official Review (Rating: 6)

The paper discusses the limitations of current text-to-video (T2V) models like Sora in accurately representing intuitive physics, which is essential for creating a universal world simulator. Understanding intuitive physics is foundational for such a simulator, yet T2V models' performance in this area remains under-explored. To address this gap, the authors introduce PhyGenBench, a Physics Generation Benchmark designed to test T2V models' grasp of physical commonsense and provide a comprehensive assessment of models' understanding of physics.

Also, the paper presents PhyGenEval, a new evaluation framework that uses advanced vision-language and large language models in a hierarchical structure to assess physical commonsense accurately. This dual framework allows for large-scale, automated evaluations aligned with human feedback.

Overall, the paper is well-written and proposes a great benchmark for video generation evaluation.

Strengths

The paper is well-written and the contributions are very clear. The proposed benchmark for video generation models focuses on evaluating the physical understanding of video generation models, which is crucial and not well studied. The evaluation strategy is clear and comprehensive.

Weaknesses

I mainly have two questions about the paper.

  1. As PhyGenEval uses VLMs for scoring, I would like to know the effect of different VLMs, for example, GPT-4o among closed-source models, and InternVL-2, LLaVA-Video, and Oryx among open-source models that can understand videos. I'm wondering whether these models can consistently evaluate the generated videos, which may be an interesting question, and I think it should be discussed in the paper to show a more comprehensive understanding of the proposed evaluation pipeline.

  2. As the evaluation pipeline needs a lot of retrieval, I'd like to know the success rate of retrieval with GPT-4o. It is crucial for the overall evaluation and I hope the author can provide more details about how to ensure the retrieval is correct.

Questions

As stated in the weakness section.

Comment

Q1: As PhyGenEval uses VLMs for scoring ... more comprehensive understanding of the proposed evaluation pipeline.

A1: Thank you for your insightful feedback. We provide our detailed responses below:

  1. Directly using Video VLMs is insufficient for evaluating physical realism in videos.
    We test evaluations using InternVideo or GPT-4o independently, as shown in Table 8 in the Appendix. Results indicate that even GPT-4o achieves a Pearson correlation of only 0.21 with human ratings when directly evaluating video physical realism. This demonstrates that current VLMs are not capable of effectively evaluating physical realism in videos on their own.

  2. The design of each stage in PhyGenEval is necessary.
    Based on the characteristics of fundamental physical laws, we design a three-stage evaluation framework. As shown in Table 9 in the Appendix, using a single stage (or a single VLM) or combining two stages results in poorer performance compared to the full three-stage PhyGenEval. This analysis underscores that the proposed multi-stage evaluation pipeline is essential and well-justified.

  3. The pipeline performs well with both open-source and closed-source models.
    We discuss the effectiveness of incorporating open-source and closed-source models in the pipeline in the General Response-Part2. Notably, even without using GPT-4o, the method achieves reasonably good performance, demonstrating its robustness and flexibility.

Thank you again for your valuable suggestions. If you have any further questions or need additional clarification, please feel free to contact us.

Comment

Q2: As the evaluation pipeline needs a lot of retrieval...

A2: Thank you for your thoughtful question. We provide detailed explanations below:

  1. Retrieval accuracy is integrated into the method design.
    In the Key Physical Phenomena Detection and Physics Order Verification stages, we include retrieval operations by designing VLM(I_j, P_r), which checks whether the retrieved image I_j matches the retrieval prompt P_r. This ensures that key phenomena occur at the correct frame. The retrieval accuracy is also factored into the score calculation for these two stages, functioning similarly to a regularization term to account for retrieval correctness.

  2. PhyGenBench emphasizes quality control, leading to high retrieval success rates.
    The careful construction of PhyGenBench ensures that models generate semantically aligned images while exposing issues with physical realism. As a result, retrieval operations achieve a relatively high success rate. We report the average VLM(I_j, P_r) scores for different models in Table 4 below, showing that even Lavie achieves a retrieval score above 0.75, indicating that retrieval accuracy is generally high.

Thank you again for your insightful feedback. If you have further questions or need additional clarification, please feel free to contact us.

Model | Force (↑) | Light (↑) | Heat (↑) | Material (↑) | Overall (↑)
Cogvideo5b | 0.7618 | 0.9013 | 0.8046 | 0.7815 | 0.8193
Gen3 | 0.8353 | 0.9077 | 0.8627 | 0.8114 | 0.8577
Pika | 0.7829 | 0.8777 | 0.7825 | 0.7736 | 0.8107
Lavie | 0.7064 | 0.8328 | 0.7537 | 0.7219 | 0.7596
Vchitect2 | 0.8078 | 0.9034 | 0.8317 | 0.7668 | 0.8327
Keling | 0.8375 | 0.9018 | 0.8319 | 0.7978 | 0.8470
Opensora | 0.8166 | 0.8755 | 0.8528 | 0.7707 | 0.8310
Cogvideo2b | 0.7924 | 0.8255 | 0.7886 | 0.7719 | 0.7971

Table 4: Retrieval accuracy scores of different models.
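To illustrate how the retrieval check described in the response above can act like a regularization term, here is a minimal sketch; combining the two terms with a simple product is our assumption, not the authors' exact formula:

```python
def stage_score(vlm_answer_score: float, retrieval_match_score: float) -> float:
    """Both inputs are assumed to lie in [0, 1]. A poorly matched retrieved
    frame (low VLM(I_j, P_r)) pulls the stage score down, penalizing cases
    where the key phenomenon does not occur at the expected frame."""
    return vlm_answer_score * retrieval_match_score
```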

Comment

Thank you very much for your insightful suggestions on this paper. We have responded to each of these in great detail. If you have any other questions, we are more than happy to provide additional clarification as well as experiments, and look forward to your reply !

Comment

As today marks the final day of the discussion period, we would like to express our sincere gratitude for your insightful feedback on our paper. We have taken the time to thoroughly address your concerns and provide detailed explanations in response. If there are any remaining questions or if further clarifications are needed, we would be more than happy to assist and conduct additional experiments if necessary. We kindly request that you reconsider the score, and we eagerly await your reply.

Official Review (Rating: 5)

In this paper, the authors focus on the evaluation of text-to-video models. To this end, they propose a new benchmark as well as a new evaluation method. Named PhyGenBench, the dataset of prompts evaluates intuitive physics in sub-areas such as mechanics, optics, thermodynamics, and material properties. Alongside this benchmark is PhyGenEval, an automated evaluation pipeline where a VLM is combined with GPT-4o to generate evaluative questions and answers. The authors compare PhyGenEval against human evaluations. They also perform some initial experiments with contemporary video models on their proposed benchmark.

Strengths

  1. Current generative models of video have serious issues with intuitive physics, and the research community needs a good benchmark to evaluate this capability. The proposed benchmark dataset can serve as a very important dataset for the community.

  2. Evaluating intuitive physics can be difficult, and PhyGenEval might be a promising method to automate evaluation without the need for human raters.

  3. The quantitative evaluation of current video models on the benchmark is a strong contribution and shows the need to improve these models in the realm of intuitive physics.

Weaknesses

While PhyGenEval is an interesting approach to automating evaluation on the benchmark, I assert that the approach has some issues and should not be adopted by the community at this moment as a standard evaluation; human raters should be used:

  1. The pipeline is not reliable enough; the PCA correlation results are only around 0.7-0.8.

  2. The pipeline relies on proprietary models such as GPT-4o and may be difficult to reproduce with open models.

Questions

I believe PhyGenBench is an excellent contribution for the research community, but the PhyGenEval as described is problematic for the reasons listed above. Is it possible that PhyGenEval be described as a potential approach for automated evaluation to be iterated on in subsequent research, with PhyGenBench + human evaluation as the main contribution?

Comment

Q2: The pipeline relies on proprietary models such as GPT-4o...

A2: Thank you for your valuable feedback. We have carefully considered your comments and provide our detailed responses below:

  1. First, we emphasize that although our method uses GPT-4o, it is both cost-effective and efficient. As shown in Table 1 of the Response to Reviewer wXRw (Q1), the resource costs for our method are low in terms of both time and computational resources.

  2. Second, we highlight that our method is reproducible. Specifically:

    • For LLaVA-Interleave, we disable the do_sample operation to ensure determinism.
    • For VQAScore, CLIP, and InternVideo, these methods are inherently non-random.
    • For GPT-4o, we use the default parameter configuration and gpt4o-0806 to ensure consistent results.

To demonstrate this, we perform five repeated experiments on Kling, originally scoring 0.49. As shown in Table 3 below, the results indicate that our method is stable and reproducible. Additionally, to facilitate testing by others, we provide the question files generated by GPT-4o for different stages of our evaluation.

In addition, we explore the use of various open-source models to enhance the performance of PhyGenEval. Experiments show that even with fully open-source models, the framework achieves a high correlation with human evaluations (approximately 0.7). We provide a detailed explanation of these results in the General Response (A2).

Thank you again for your suggestions. If you have any further questions or need clarification, please feel free to contact us.

Experiment No. | Result
Experiment 1 | 0.49
Experiment 2 | 0.48
Experiment 3 | 0.48
Experiment 4 | 0.49
Experiment 5 | 0.50
Avg | 0.49
Std | 0.01

Table 3: Results of five replicate PhyGenEval runs on Kling.
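As a quick sanity check, the summary rows of Table 3 can be reproduced from the five run results, assuming the reported Std is the sample standard deviation:

```python
import statistics

runs = [0.49, 0.48, 0.48, 0.49, 0.50]
mean = sum(runs) / len(runs)     # 0.488 -> reported as 0.49
std = statistics.stdev(runs)     # ~0.008 -> reported as 0.01
print(round(mean, 2), round(std, 2))
```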

Comment

Q3: I believe PhyGenBench is an excellent contribution for the research community...

A3: Thank you for your insightful suggestions. We would like to explain why we present PhyGenBench and PhyGenEval together in our paper:

  1. We acknowledge that human evaluation is the most effective method, but it is difficult to scale up. Our goal is to provide PhyGenEval as a means for efficiently testing different video generation models on PhyGenBench.

  2. Thanks to the automated testing provided by PhyGenEval, machine scores can be computed quickly and serve as a valuable reference for human evaluations.

  3. While we recognize that PhyGenEval is not perfect, it achieves a relatively good alignment with human evaluations. We provide a detailed explanation of this in Response to Reviewer PPCZ (Q1).

Thank you again for your valuable feedback. If you have further questions or require additional clarification, please feel free to contact us.

Comment

Q1: The pipeline is not reliable enough...

A1: Thank you for your insightful comments! We have carefully considered your suggestions and provide our responses below:

  1. The tasks in PhyGenBench are inherently challenging.
    The benchmark involves nuanced physical reasoning across diverse domains, which makes evaluation complex even for human reviewers. For instance, scenarios like "A timelapse captures the transformation of arsenic trioxide as it is exposed to gradually increasing temperature at room temperature" require an understanding of arsenic trioxide’s physical and chemical properties at different temperatures and the ability to interpret its gradual changes. This level of complexity poses significant challenges even for highly educated university students.

  2. PhyGenEval already achieves strong human alignment compared to related work.
    While we acknowledge room for improvement, PhyGenEval demonstrates competitive or superior results compared to benchmarks like VideoScore (0.77) and T2V-CompBench (0.5-0.6). With a Pearson correlation of approximately 0.8, PhyGenEval validates its effectiveness in assessing complex physical realism in this domain.

  3. Future Work.
    We acknowledge the potential to refine PhyGenEval further. In future work, we will focus on improving task-specific modules and exploring novel alignment techniques to enhance both accuracy and robustness, making the pipeline more refined and effective.

Thank you again for your valuable suggestions! If you have further questions or require additional clarification, please feel free to contact us.

Comment

We sincerely appreciate your questions and would like to provide further clarifications:

  • First, while our method leverages GPT-4o, it remains reproducible and resource-efficient, as evidenced in Response to Reviewer PPCZ (Q2) and Response to Reviewer wXRw (Q1).

  • Second, our method can fully eliminate the need for GPT-4o by only using Open models:

    • Regarding VLM evaluation, as shown in Table 1 below, we have already evaluated the framework using only open-source VLM models. Details can be found in General Response-Part2. Although the performance is slightly weaker compared to closed-source models, it still achieves a Pearson correlation coefficient of 0.72, demonstrating the high reliability of PhyGenEval. Furthermore, as stronger and larger open-source models are used, the alignment between PhyGenEval and human evaluation improves. This reinforces our confidence that, with the continued development of open-source models, our framework will no longer require GPT-4o. Additionally, we provide further evaluation results using only open-source VLMs in Table 2 below.

      Similar to Table 2 in the main text, these results highlight the performance gap across models (e.g., Gen3 achieves the best performance, while Lavie performs the worst). Moving forward, we will release a leaderboard and inference code fully based on open-source VLMs to further enhance the stability and reproducibility of our framework.

    • Regarding the generation of questions using LLMs, we have validated the reliability of GPT-4o-generated questions through human evaluation, as shown in Response to Reviewer wXRw (Q1). These questions, similar to the prompts in PhyGenBench, are an integral part of the benchmark. We will continue to refine and clean these questions and release them as open-source to enable standardized and convenient testing for the research community.

  • Regarding the use of GPT-4o in the main text, the primary reasons are:
    i) Accessing the API is more convenient and has a lower entry barrier compared to deploying models.
    ii) As shown in Response to Reviewer PPCZ (Q2), our approach is computationally efficient, making the cost of using the API lower than the GPU resources required for deployment.

We sincerely thank you again for your insightful questions. We believe our response sufficiently addresses your concerns regarding GPT-4o. However, if you still have further questions or require additional clarification, we would be happy to provide further explanations at any time.

Metric | Mechanics (ρ ↑) | Optics (ρ ↑) | Thermal (ρ ↑) | Material (ρ ↑) | Overall (ρ ↑)
PhyGenEval (Open) | 0.62 | 0.65 | 0.64 | 0.73 | 0.72
PhyGenEval (Close) | 0.75 | 0.77 | 0.75 | 0.84 | 0.81

Table 1: Comparison of PCA correlation results using closed-source or open-source models in PhyGenEval.

Model | Size | Mechanics (↑) | Optics (↑) | Thermal (↑) | Material (↑) | Average (↑)
CogVideoX | 2B | 0.42 | 0.48 | 0.48 | 0.45 | 0.46
CogVideoX | 5B | 0.43 | 0.60 | 0.55 | 0.48 | 0.51
Open-Sora | 1.1B | 0.52 | 0.57 | 0.51 | 0.46 | 0.51
Lavie | 860M | 0.38 | 0.49 | 0.43 | 0.40 | 0.43
Vchitect 2.0 | 2B | 0.48 | 0.62 | 0.53 | 0.45 | 0.52
Pika | - | 0.40 | 0.60 | 0.50 | 0.49 | 0.50
Gen-3 | - | 0.46 | 0.63 | 0.56 | 0.52 | 0.55
Kling | - | 0.50 | 0.64 | 0.54 | 0.44 | 0.54

Table 2: Evaluation results of PCA with the proposed PhyGenEval, using open VLMs, on videos generated by several models.

Comment

As today is the final day of the discussion, we would like to sincerely thank you for your valuable feedback on our paper. In response, we have carefully addressed your comments with detailed explanations. Should you have any further questions or require additional clarifications, we would be happy to provide more information and conduct further experiments as needed. We kindly ask if you could reconsider the score, and we look forward to your response.

Comment

I would like to thank the authors for their detailed responses to my questions. While I believe the benchmark dataset is a very useful contribution to the community, I share concerns with the other reviewers on the automated approach to evaluation with GPT-4o. I stand by my rating.

Comment

We express our sincere appreciation for the valuable suggestions regarding this paper. In response, we have provided thorough and detailed explanations. If there are any further questions, we are delighted to offer additional clarifications and conduct further experiments. We eagerly look forward to your reply!

Official Review (Rating: 5)

Although T2V models have shown great progress in generating good media-level content, this paper challenges their capability to become a real world simulator. This paper first proposes PhyGenBench, 160 T2V prompts spanning several physics categories, and then proposes a hierarchical framework to evaluate semantic alignment and physics commonsense alignment. It shows that current models, even ones at large scales, struggle with physical commonsense.

Strengths

(1) This paper handcrafts T2V prompts in a fine-grained way. (2) This paper provides a carefully designed pipeline to conduct the evaluation of physics commonsense.

Weaknesses

(1) The paper lacks decent novelty in terms of the benchmark and the evaluator itself. The way it constructs the evaluator heavily relies on several generative models. For example, using GPT-4o to do information extraction and create questions sometimes brings about hallucination. Also, using VLMs in different stages can also lead to hallucination. Since it is a complex pipeline composed of different stages, error propagation might happen.

(2) The comparison with other baselines is unfair and biased. Although the authors acknowledge that alternative auto-evaluators lack robustness, they do not demonstrate whether their own auto-evaluator performs effectively on prompts from different benchmarks as part of a generalization analysis. Typically, like concurrent work, these kinds of auto-evaluators are tailored to specific prompt distributions. Basically, the generalization of reward modeling for world simulators would be enough for another paper.

(3) The number of prompts engaged in this paper is limited, which might be a weak signal for evaluating video generation models as world simulators.

Questions

(1) What is the efficiency of using your auto-evaluator? Could you provide an estimation?

(2) Could you provide some error analysis on the bad cases where PhyGenEval is opposite to the human eval? Maybe this can provide some insight on how to further improve the reward modeling of world simulators.

Comment

Q1: The paper lacks decent novelty in terms of benchmark and evaluator itself...

A1: Thank you for your valuable feedback. We have carefully considered your comments and provide detailed responses below:

1. Novelty and Practical Significance
We are among the first to evaluate the physical realism of AI-generated videos. This topic is both novel and highly practical, as understanding physical realism is fundamental for treating video generation models as world simulators. Notably, Reviewer eSnR and Reviewer QELL highlighted the great novelty of our work, and Reviewer ZEcL and Reviewer PPCZ emphasized the importance of PhyGenBench to the community.

2. Contribution of PhyGenEval Framework
We acknowledge that training an evaluator is a significant contribution, but it is not the only approach. While we do not train a new evaluator, we propose the PhyGenEval framework, which integrates multiple Vision-Language Models (VLMs) to achieve better results. We believe this approach fundamentally aligns with the goal of reducing VLM hallucination in recognizing physical realism. Many excellent works in the community, such as VBench and DEVIL, also build upon existing VLMs. Thus, we argue that our approach represents a novel contribution by demonstrating the utility of this framework.

3. Simplicity of Our Method
Although our method consists of multiple stages, it is not overly complex. We provide resource consumption statistics in Table 1 below, which demonstrate that our evaluation process is both efficient and cost-effective. This efficiency underscores the practicality of our approach. We also add this content in Appendix F.

4. Reliability of GPT-4o in Question Generation
When generating questions for different stages using GPT-4o, we do not simply rely on direct outputs. Instead, we incorporate few-shot examples and carefully designed prompts (a minimal illustrative sketch of this step follows this list). Thanks to GPT-4o's extensive world knowledge, the generated questions are highly reliable.

To validate this, we conducted human annotations. Specifically, we recruited five senior undergraduate students, assigning each question to all five annotators for evaluation. Their task was to assess the physical correctness (0 or 1) of each GPT-generated question. For the Overall Naturalness Evaluation stage, our criteria required that each level of description demonstrate progressive improvements in correctness and distinguishability. As shown in Table 2 below, the results indicate that questions generated by GPT-4o align strongly with physical realism.

To further enhance quality, we plan to refine the question set and release the cleaned dataset publicly for testing.

5. Effectiveness of the Multi-Stage Design
Each stage of our method is carefully designed and necessary. The multi-stage design aims to mitigate VLM hallucinations by analyzing fundamental physical laws and creating a three-stage evaluation framework. As shown in Appendix Table 9, using only one stage (or one VLM) or combining just two stages performs worse than the full three-stage PhyGenEval. This design effectively reduces hallucinations without causing error propagation. Therefore, the multi-stage structure is both essential and effective.
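As referenced in point 4 above, the sketch below illustrates what few-shot question generation with GPT-4o could look like; the prompt wording, model string, and helper names are our assumptions, not the authors' released prompts:

```python
from openai import OpenAI

client = OpenAI()

def generate_stage_questions(physics_prompt: str, few_shot_block: str) -> str:
    """Ask GPT-4o for physics-checking questions about one PhyGenBench prompt."""
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",  # the rebuttal pins the "gpt4o-0806" version
        messages=[
            {"role": "system",
             "content": "Write yes/no questions that check whether a generated "
                        "video of the given prompt obeys the relevant physical law."},
            {"role": "user", "content": few_shot_block + "\n\n" + physics_prompt},
        ],
    )
    return response.choices[0].message.content
```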

Thank you for your suggestions. If you have further questions or require additional clarification, please feel free to reach out.

Stage | Model | bs | Resources | Time | Memory
Key Physical Phenomena Detection | VQAScore | 3 | 1 x A100-80GB | 10min | 72726MiB
Physics Order Verification | LLaVA-Interleave-7B | 1 | 1 x A100-80GB | 2min | 20408MiB
Physics Order Verification | GPT-4o | 8 | 1.4USD | 5min | -
Overall Naturalness Evaluation | InternVideo | 1 | 1 x A100-80GB | 1min | 7766MiB
Overall Naturalness Evaluation | GPT-4o | 8 | 3.1USD | 5min | -

Table 1: Resource consumption of models used in PhyGenEval.

Key Physical Phenomena Detection (↑) | Physics Order Verification (↑) | Overall Naturalness Evaluation (↑)
0.96 | 0.95 | 0.92

Table 2: Human evaluation for GPT-4o generated questions.

Comment

Q3: The number of prompts engaged in this paper are limited...

A3: Please refer to the detailed answer in General Response-Part1

Q4: What is the efficiency of using your auto-evaluator...

A4: We provide resource consumption statistics in Table 1 in the Response to Reviewer wXRw (Q1), which demonstrate that our evaluation process is both efficient and cost-effective. This efficiency underscores the practicality of our approach. We also add this content in Appendix F.

Q5: Could you provide some error analysis on the bad cases...

A5: Thank you for your valuable feedback. We have carefully addressed your comments and provide our responses below:

We visualize some error cases in Error Case Analysis in Appendix E, where both PhyGenEval and comparison methods like DEVIL fail to correctly identify the physical realism of videos. These error cases are often caused by confusing but iconic physical phenomena in the videos that do not align with the correct progression of physical processes (e.g., in the erroneous case of the "burnt bread" experiment, black coloration appears but does not align with the expected phenomenon), leading to misjudgments. However, even in these cases, PhyGenEval remains closer to human ratings compared to other methods.

We plan to focus on addressing this issue in our future work, including but not limited to the following directions:

  • Training our own evaluator
  • Designing a more refined evaluation framework that leverages deeper video features, such as optical flow, to better assess physical realism and avoid being misled by visually smooth videos.

Thank you again for your constructive suggestions. If you have further questions or require additional clarification, please feel free to contact us.

Comment

Q2: The comparison with other baselines is unfair...

A2: Thank you for your valuable feedback and constructive comments. We carefully address your concerns and provide detailed responses below:

1. Scope and Motivation: Our primary objective is to design a benchmark (PhyGenBench) capable of clearly reflecting physical commonsense through simple, explicit prompts. While constructing this benchmark, we observe that existing metrics are inadequate for measuring physical realism, especially when applied to PhyGenBench. This leads us to design PhyGenEval, an evaluation metric tailored to address this shortcoming.

2. PhyGenBench and PhyGenEval appear in pairs: As we emphasize throughout the paper, PhyGenBench and PhyGenEval are designed to be used together, forming a cohesive framework for assessing physical commonsense in video generation models. The focus of this work is not on general-purpose evaluators but on addressing the specific gap in evaluating physical commonsense using a paired benchmark and metric.

3. PhyGenEval outperforms existing metrics on PhyGenBench: The design of PhyGenEval explicitly considers the key physical laws and phenomena incorporated into PhyGenBench. Therefore, it achieves higher consistency with human evaluations on this benchmark compared to other metrics. Specifically, PhyGenEval attains an overall Pearson’s correlation coefficient of 0.81 and Kendall’s Tau of 0.78 on PhyGenBench, significantly outperforming other metrics such as VideoScore (Pearson’s 0.19, Kendall’s 0.17) and DEVIL (Pearson’s 0.18, Kendall’s 0.17).

Thank you again for your thoughtful comments. If you have any further questions or require additional clarifications, please feel free to contact us.

Comment

We sincerely thank you for your insightful suggestions on this paper. In response, we have carefully addressed your points with detailed explanations. Should you have any additional questions, we would be happy to provide further clarifications or conduct additional experiments. We look forward to your feedback!

Comment

Thanks to the authors for addressing most of my concerns. Therefore, I raise my score to 5.

Comment

We sincerely thank each reviewer for their valuable suggestions and questions regarding our paper. We have carefully considered these comments and added necessary experiments and clarifications. Under the reviewers' insightful guidance, we believe the quality of our paper has significantly improved. Below, we summarize the main changes, with new modifications in the updated paper highlighted in red:

  • Revised the caption for Figure 3 to clarify the PhyGenEval process.
  • Updated the caption for Table 2 to avoid confusion.
  • Added more related work in Appendix A.
  • Supplemented additional computational details in Appendix C.2.
  • Included an experimental analysis of large-scale open models in Appendix D.3.
  • Added error case analysis and visualizations for PhyGenEval in Appendix E.
  • Provided resource consumption details for the method in Appendix F.

Thank you again for your constructive feedback. If you have further questions or require additional clarification, please feel free to contact us.

Comment

Q2: Effects of using Open Models in PhyGenEval (e.g. without GPT-4o)

A2: Thank you for your valuable feedback. We provide detailed responses below:

  • Effectiveness without GPT-4o as the VLM: As shown in Table 2 below, we revise Table 10 in the Appendix to rename the original PhyGenEval (Open) as PhyGenEval (Open-S), indicating that it uses small-scale open-source models. The results show that even when using only small-scale open-source models, the method achieves a Pearson correlation coefficient of 0.66, demonstrating the robustness of the PhyGenEval framework.

  • The questions generated by GPT-4o are part of our benchmark. We need to emphasize that the question generation step using GPT-4o is a part of PhyGenBench. We have demonstrated the high quality of the generated questions in Response to Reviewer wXRw (Q1) and both the questions and PhyGenBench have been open-sourced through an anonymous link. Therefore, hallucination issues caused by LLMs are not a concern in PhyGenEval.

  • Exploring larger open-source models: We experiment with larger open-source models by replacing the LLaVA-Interleave-7B in PhyGenEval (Open-S) with InternVL-Pro (78B), denoting this configuration as PhyGenEval (Open-L). Additionally, we ensemble PhyGenEval (Open-L) with PhyGenEval (Open-S), denoting this as PhyGenEval (Open-Ensemble). We supplement the description of the ensemble operation in Appendix C.2.

    Results indicate that, compared to small-scale open-source models, the overall alignment coefficient improves from 0.66 to 0.72, demonstrating that the framework maintains reproducibility even with fully open-source models. We believe that as open-source models continue to advance, their performance in PhyGenEval will improve further. We add this in Appendix D.3.

  • Alignment with human evaluation improves with VLM advancements
    As the capabilities of VLMs used in PhyGenEval improve, we observe an increasing alignment between PhyGenEval and human evaluations. We believe that as open-source models continue to evolve, it will become feasible to use open-source VLMs for the entire workflow, further highlighting the robustness of our method's design.

Thank you again for your valuable suggestions. If you have further questions or require additional clarification, please feel free to contact us.

Metric | Mechanics (ρ ↑) | Optics (ρ ↑) | Thermal (ρ ↑) | Material (ρ ↑) | Overall (ρ ↑)
PhyGenEval (Open-S) | 0.57 | 0.62 | 0.58 | 0.69 | 0.66
PhyGenEval (Open-L) | 0.59 | 0.63 | 0.61 | 0.71 | 0.69
PhyGenEval (Open-Ensemble) | 0.62 | 0.65 | 0.64 | 0.73 | 0.72
PhyGenEval (GPT4o) | 0.63 | 0.57 | 0.68 | 0.77 | 0.71
PhyGenEval (Ensemble) | 0.75 | 0.77 | 0.75 | 0.84 | 0.81

Table 2: Comparison of PCA correlation results using different models, such as GPT-4o or open-source models, in PhyGenEval. PhyGenEval (Ensemble) is the ensemble of PhyGenEval (Open-S) and PhyGenEval (GPT4o).
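The two "Ensemble" rows in Table 2 combine scores from two configurations. A minimal sketch under the assumption of a simple per-video average is shown below; the actual ensemble operation is described in Appendix C.2 and may differ:

```python
def ensemble(scores_a, scores_b):
    """Average per-video scores from two PhyGenEval configurations."""
    return [(a + b) / 2 for a, b in zip(scores_a, scores_b)]
```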


Comment

We sincerely thank all reviewers for their valuable feedback and the time they have dedicated to reviewing our work. Your comments have been extremely helpful in guiding our revisions and improving the paper. Notably, we notice that several reviewers have raised overlapping questions. Therefore, we have consolidated these into two key questions that address multiple reviewers' concerns. Below, we provide detailed answers to these two questions first.

Q1: The number of prompts engaged in this paper are limited...

A1: Thank you for your valuable feedback. We have carefully considered your comments and provide detailed responses below:

1. PhyGenBench covers a broad range of prompts.
We design PhyGenBench starting from the most fundamental physical laws. It comprises 160 carefully crafted prompts across 27 distinct physical laws, spanning four fundamental domains of physics. These prompts ensure comprehensive coverage of key physical commonsense principles.

2. The benchmark focuses on essential and fundamental physical laws.
Even with 160 basic prompts, PhyGenBench effectively exposes significant issues in current models. We focus on the simplest and most common physical scenarios (e.g., involving at most two objects). These basic setups already reveal serious limitations in existing models, demonstrating that our benchmark is sufficient for current testing. As models improve, we plan to expand the benchmark to include more complex scenarios.

3. PhyGenBench undergoes rigorous selection and quality control.
We carefully refine the benchmark by removing overly complex prompts that current models cannot reasonably depict. We assess whether the T2V-generated videos are semantically reasonable to ensure effective evaluation. This process reduces the benchmark to its current size, focusing on scenarios that models can meaningfully generate.

4. PhyGenBench demonstrates high quality.
We support this with both quantitative and qualitative analyses:

  • Quantitative Analysis: In Appendix D.2, we analyze the semantic alignment (SA) scores of videos generated based on PhyGenBench prompts. SA measures whether video generation models can depict the scenarios described in the prompts. Results show that all tested video generation models achieve high SA scores. For example, Kling achieves 0.85 (machine score) and 0.89 (human score), demonstrating the reliability of PhyGenBench.
  • Comparison with Other Benchmarks: In Table 1 below (Also in Appendix B.2), we compare PhyGenBench with benchmarks like VideoPhy in both quantitative and qualitative analyses. Results show that PhyGenBench prompts achieve an average SA score of 0.80, significantly outperforming VideoPhy’s score of 0.63 in human evaluations. This highlights the superior quality of PhyGenBench.

Thank you again for your thoughtful suggestions. If you have any further questions or need additional clarification, please feel free to contact us.

Model | Size | Videophy (↑) | PhyGenBench (↑)
CogVideoX | 5B | 0.48 | 0.78
Vchitect 2.0 | 2B | 0.63 | 0.84
Kling | - | 0.77 | 0.89
Average | - | 0.63 | 0.80

Table 1: Comparison of Semantic alignment scores between PhyGenBench and VideoPhy

AC Meta-Review

The submission introduces a new benchmark for assessing models' understanding of physical commonsense. Reviewers were lukewarm about the submission, and all shared concerns about the heavy use of generative AI during benchmark development. Other concerns include insufficient evaluation and dataset scale. The AC agreed on these issues and encouraged the authors to revise the submission for the next venue.

Additional Comments on Reviewer Discussion

The discussion has been on the use of GenAI in dataset development.

Final Decision

Reject