Stable Cinemetrics : Structured Taxonomy and Evaluation for Professional Video Generation
Abstract
Reviews and Discussion
This article introduces SCINE (Stable Cinemetrics), an evaluation suite specifically designed to assess the quality of video generation from the perspective of professional video production. The evaluation is divided into four major categories: Setup, Event, Lighting, and Camera, each of which contains numerous subcategories, totaling 76 subcategories. Each subcategory was discussed and determined by professional video creators, making the benchmark more professionally grounded than other video evaluation benchmarks.
Strengths and Weaknesses
Strengths:
- The evaluation system was established through a relatively professional process, which is a task with a high barrier to entry.
Weaknesses:
- Although the article claims that the construction process of SCINE is highly professional, the data provided in the supplementary materials suggest otherwise. First, the ranked_videos folder contains two subfolders, good_examples and bad_examples, but the videos in these folders are not paired and their prompts are not the same, which contradicts the claims made in the paper. Additionally, the constructed evaluation prompts are not professional. For instance, in the camera trajectory category of prompts in the supplementary materials, most prompts describe a static camera, which is overly simplistic for assessing camera movement.
- The supplementary materials do not provide the scores for the human-annotated videos, making it impossible to verify the accuracy of the scoring.
- The training process of the evaluation model is simple and entirely based on existing work.
- The main text does not include any visual presentation of videos, nor does it use visual examples to explain the necessity of the 76 categories.
Questions
See Weaknesses.
Limitations
Although a relatively detailed classification is constructed, the prompts for each category do not reflect professionalism and uniqueness. Comparing the professionalism claimed by the authors in the main text with the examples in the supplementary materials raises a suspicion of overclaiming.
Final Justification
Weak Accept.
Formatting Issues
None.
We thank the reviewer for their constructive feedback on our work. Please find below our responses to your questions:
Professional Nature of Prompts
The videos and prompts shown in the Camera Trajectory section of the supplementary material were selected as illustrative positive/negative examples, not as a representative sample of the full prompt set. To clarify, we report below the actual distribution of values from all prompts labeled cinematographer in our SCINE_VISUALS.json.
| Taxonomy Node | Distribution (%) |
|---|---|
| Camera Movement | Push In (33.38), Tracking (20.79), Static (11.90), Zoom In (6.70), Dolly Zoom (7.52), Crane (4.79), Camera Roll (4.79), Arc (3.01), Tilt Down (1.92), Pan Right (1.50), Trucking (1.50), Zoom Out (1.23), Tilt Up (0.96) |
| Camera Angle | Low (30.67), Dutch (21.24), Eye-level (16.96), Ground (11.25), Shoulder (6.02), High (4.12), Hip (3.01), Overhead (2.46), Knee (2.30), Aerial (1.98) |
| Camera Shot Size | Medium-close-up (28.69), Wide (20.47), Close-up (17.45), Medium-Full (13.34), Medium (12.93), Establishing (3.61), Extreme Close-up (2.51), Full (0.50), Medium-wide (0.27), Master (0.23) |
As this shows, our prompt set spans a wide range of values across different nodes. Importantly, these values do not occur in isolation; controls often co-occur, e.g., "push in + low angle" or "push in + dutch angle", yielding a rich and combinatorially diverse testbed for video generative models.
Additionally, as demonstrated in Figure 4, our prompts closely align with real-world screenplays. Figures 25–27 further show that our prompt set offers broad and diverse coverage across all four taxonomies. These findings collectively support that our prompts faithfully reflect the professional intent of our taxonomy.
Additional visual examples
We apologize for this. We were initially restricted by the 100 MB NeurIPS supplementary size limit, and the size of each video is quite large; for example, a video generated by Wan 14B is on average ~7 MB.
Thus, as requested by the reviewer, we have now created a fully anonymous web-page with 200+ videos, presenting side-by-side comparisons across models on SCINE Scripts and Visuals, and detailed pairwise comparisons with categories, questions, and scores.
However, we are unable to share this link with the reviewer, due to the NeurIPS rebuttal policy. Therefore, we have shared this fully anonymized link with the AC; we hope you are able to consider the web-page during your internal discussions. Thank you!
Organization of ranked_videos in Supplementary
Thank you for pointing this out. The ranked_videos directory in the supplementary material was not intended to contain paired examples generated from the same prompts. Rather, it provides representative samples of high- and low-scoring videos across different categories. For example, videos located at good_examples/all_models/Setup__Text Generation are videos that received high scores in the Text Generation node of our taxonomy.
To eliminate any ambiguity, we have now added to the anonymous link both (i) sets of videos produced from the same prompt and (ii) explicit pairwise comparisons.
Scores of Human annotation videos
For each video in the Supplementary, we present the individual annotations and the reason for it being a positive or negative case.
| No. | Folder | Taxonomy Node | Video | Annotator 1 | Annotator 2 | Annotator 3 | Avg | Reason |
|---|---|---|---|---|---|---|---|---|
| 1 | bad_examples | Camera__Trajectory__Camera Movement | wan1b_00748 | 1 | 2 | 1 | 1.33 | Zoom out movement not depicted |
| 2 | bad_examples | Camera__Trajectory__Camera Movement | wan14b_00025 | 1 | 2 | 1 | 1.33 | Generated video does not feature a push-in |
| 3 | bad_examples | Camera__Trajectory__Camera Movement | wan14b_00748 | 1 | 2 | 2 | 1.67 | Zoom out movement not depicted |
| 4 | bad_examples | Lighting__Sources__Artificial_Practicals Light | hunyuan_00501 | 1 | 3 | 1 | 1.67 | Video does not capture the single practical light |
| 5 | bad_examples | Lighting__Sources__Artificial_Practicals Light | stepvideo_00791 | 1 | 2 | 2 | 1.67 | Video does not clearly incorporate fluorescent lights as an artificial source of illumination |
| 6 | bad_examples | Setup__Scene__Geometry | easyanimate5.1_00410 | 1 | 2 | 2 | 1.67 | Video does not incorporate strong diagonal lines in its composition. |
| 7 | bad_examples | Setup__Scene__Geometry | wan1b_00570 | 2 | 1 | 3 | 2 | Diagonal elements like the light trails are not effectively portrayed in the video |
| 8 | bad_examples | Events__Types__Actions | pyramidflow_00834 | 1 | 1 | 1 | 1 | Video does not show the vines interacting with and gripping the tombstone |
| 9 | bad_examples | Events__Types__Actions | pyramidflow_01100 | 1 | 1 | 1 | 1 | The warriors do not enter the holographic arena |
| — | — | — | — | — | — | — | — | — |
| 10 | good_examples | Camera__Trajectory__Camera Movement | ltx_00569 | 5 | 5 | 4 | 4.67 | Camera maintains a static position |
| 11 | good_examples | Camera__Trajectory__Camera Movement | ltx_01076 | 5 | 4 | 5 | 4.67 | Camera maintains a static position |
| 12 | good_examples | Camera__Trajectory__Camera Movement | stepvideo_00452 | 5 | 5 | 3 | 4.33 | Slow zoom in movement is depicted in the video |
| 13 | good_examples | Camera__Trajectory__Camera Movement | wan1b_00569 | 5 | 5 | 4 | 4.67 | Camera maintains a static position. |
| 14 | good_examples | Events__Types__Emotions | easyanimate5.1_00722 | 5 | 4 | 5 | 4.67 | Video visually captures the elderly woman's subtle, soft smile. |
| 15 | good_examples | Events__Types__Emotions | wan14b_00704 | 4 | 5 | 5 | 4.67 | Explicit sense of urgency through the children’s expressions is depicted |
| 16 | good_examples | Events__Types__Emotions | wan14b_01011 | 5 | 3 | 5 | 4.33 | Gardener's startled expression visibly conveyed in the video. |
| 17 | good_examples | Events__Types__Emotions | wan14b_01131 | 5 | 4 | 5 | 4.67 | Subtle expression of nervousness and wariness is portrayed in the video |
| 18 | good_examples | Lighting__Lighting Effects__Reflection | Ray2_00813 | 3 | 4 | 5 | 4 | Long reflections on surfaces are well captured by the video |
| 19 | good_examples | Lighting__Lighting Effects__Reflection | wan14b_00055 | 4 | 5 | 5 | 4.67 | Video showcases hard geometric reflections |
| 20 | good_examples | Setup__Scene__Geometry | wan1b_00473 | 3 | 5 | 5 | 4.33 | Video showcases a structured arrangement with clear horizontal lines. |
| 21 | good_examples | Setup__Scene__Geometry | wan1b_00623 | 5 | 5 | 4 | 4.67 | Vertical and diagonal lines clearly rendered in the composition. |
| 22 | good_examples | Setup__Scene__Geometry | wan1b_01076 | 5 | 3 | 5 | 4.33 | Subjects are symmetrically framed |
| 23 | good_examples | Setup__Text Generation | Minimax_00115 | 5 | 3 | 5 | 4.33 | On-screen text 'The Grand Finale' is rendered well |
| 24 | good_examples | Setup__Text Generation | Minimax_00127 | 3 | 5 | 4 | 4 | On-screen text 'Guiding Light' is rendered well |
| 25 | good_examples | Setup__Text Generation | Minimax_00481 | 5 | 5 | 5 | 5 | Neon signage with text 'EAT' is clearly integrated into the scene. |
Additional annotation scores for more videos across multiple categories are also available at the anonymous webpage link.
Visual examples justifying Taxonomy Categories
We apologize for this. The webpage link now contains visual examples justifying the categories introduced in our taxonomy.
Additionally, the taxonomy is grounded in well-established references from professional filmmaking and cinematography literature:
- A Shot in the Dark by Jay Holben offers a deep dive into lighting practice in film. It covers concepts such as light quality (hard vs. soft), light position, the use of practical lights on set, and the role of the exposure triangle in audience perception.
- Film Directing: Shot by Shot by Steven Katz outlines key visual storytelling principles, including props, makeup, shot sizes, and camera movements, which directly map to our Setup and Camera categories.
- Cinematography: Theory and Practice by Blain Brown discusses visual language elements such as lines, space, contrast, blur, and noise.
- The Visual Story by Bruce Block explores how depth, shapes, story structure, and composition contribute to narrative meaning in film-making.
- The Tools of Screenwriting by Howard and Mabley analyzes how actions, dialogues, and emotions are intentionally crafted during the writing process, informing our Events taxonomy and its sub-nodes.
In addition to these references, we also conducted multiple iterative feedback rounds with professionals to validate the taxonomy’s relevance and completeness. We hope these revisions clarify both the motivation behind the taxonomy and the real-world grounding of each of its 76 categories.
Training process of VLM
While our training setup is simple, the design goal was not novelty but practicality and scalability. Human evaluation is costly and does not scale to the thousands of generations required for rigorous benchmarking. Our lightweight VLM enables scalable, high-quality evaluation with a single forward pass and outperforms much larger baselines.
Specifically, we fine-tune Qwen2.5-VL-7B as a Bradley-Terry-style evaluator, replacing the LM head with a regression head. Unlike traditional autoregressive models that require seq_len forward passes per evaluation, ours requires only one, making it roughly 100× more efficient than a 72B model while also being more accurate.
| Model | Avg FLOPs (in petaFLOPs) | Accuracy |
|---|---|---|
| Qwen-7B | 17.738 | 59.86 % |
| Qwen2.5‑VL‑32B | 81.087 | 59.93 % |
| Qwen2.5‑VL‑72B | 182.44 | 62.50 % |
| Ours | 1.82 | 72.36 % |
Our results show that even state-of-the-art models perform poorly when used directly for evaluation, underscoring the need for task-specific tuning. Additionally, this design enables downstream applications; for example, our evaluator can be plugged in as a learned critic during video model training, offering a path toward fine-grained control optimization.
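For concreteness, below is a minimal, self-contained sketch of a Bradley-Terry-style preference objective with a scalar regression head. It is illustrative only, not our exact training code: the DummyBackbone module is a placeholder standing in for the Qwen2.5-VL encoder, and the shapes are toy values.

```python
# Illustrative sketch (not our exact training code): a Bradley-Terry-style
# preference evaluator that replaces the LM head with a scalar regression head
# and scores a (video, question) pair in a single forward pass.
# `DummyBackbone` is a placeholder standing in for the Qwen2.5-VL encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DummyBackbone(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x):          # x: (batch, seq_len, hidden_dim) encoded tokens
        return self.proj(x)

class PreferenceEvaluator(nn.Module):
    def __init__(self, backbone, hidden_dim):
        super().__init__()
        self.backbone = backbone
        self.score_head = nn.Linear(hidden_dim, 1)   # regression head instead of an LM head

    def forward(self, inputs):
        h = self.backbone(inputs)                    # (batch, seq_len, hidden_dim)
        pooled = h[:, -1]                            # pool the last token's hidden state
        return self.score_head(pooled).squeeze(-1)   # one scalar score per (video, question)

def bradley_terry_loss(score_preferred, score_rejected):
    # maximize P(preferred > rejected) = sigmoid(s_pref - s_rej)
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Toy usage with random tensors standing in for encoded (video, question) pairs.
model = PreferenceEvaluator(DummyBackbone(32), hidden_dim=32)
x_pref, x_rej = torch.randn(4, 16, 32), torch.randn(4, 16, 32)
loss = bradley_terry_loss(model(x_pref), model(x_rej))
loss.backward()
```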
We thank the reviewer again for the feedback on our paper! As the discussion deadline (Aug. 6th) is approaching, we were wondering whether the reviewer had the chance to look at our response and whether there is anything else the reviewer would like us to clarify. We sincerely hope that our response has addressed the concerns, and if so, we would be grateful if the reviewer could consider increasing the score accordingly.
Best, Authors
Thanks to the author for the detailed answer, which basically solved my doubts. I look forward to seeing your work in the open source community one day. I decided to raise my score to 4 points.
Thank you for acknowledging our rebuttal and increasing your score. We will incorporate your suggestions and additional visual examples in our final version. We appreciate your efforts in helping to improve the quality of our work.
The paper introduces Stable Cinemetrics (SCINE), a structured evaluation framework based on professional filmmaking principles and standards. SCINE includes 4 hierarchical taxonomies (Setup, Event, Lighting, and Camera) to evaluate text-to-video (T2V) models. This paper evaluates 10+ models across 20K generated videos with feedback from 80+ professionals.
Strengths and Weaknesses
Strengths:
- Quality and Clarity: This paper, especially its taxonomies, is clearly written and technically sound.
- This paper formally introduces professional filmmaking standards into the T2V evaluation process, an aspect that has been previously explored (e.g., in Wan) but not systematically assessed. The proposed taxonomies and evaluation framework, based on professional feedback, are original contributions.
Weaknesses: The scope of this paper is relatively limited in the following respects:
- Evaluation is limited to controllability (L106): The evaluation taxonomies only evaluate controllability but neglect the overall video quality. There remain fundamental flaws in the generated outputs, such as distortions in human bodies or hands, object movements that violate physical laws, and poor imagery quality. These aspects are critical for professional video generation but are largely overlooked in this paper.
- Evaluation is limited to an independent shot: filmmaking involves more than generating a single independent shot; it requires character consistency across different scenes. Character identity is nearly impossible to control through text alone, meaning that aspects such as character consistency are also overlooked in the evaluation. In application, this issue may be even more important than the aspects evaluated in the paper.
- Evaluation is limited to text-to-video: this paper only evaluates the text-to-video (T2V) aspect because T2V is more suitable to be evaluated by the proposed taxonomies (L73). It remains unclear whether T2V models are the primary workflow used in professional settings. This questions the significance of this work.
- I2V models are also widely adopted by the community due to their greater flexibility in control through reference images. Can the evaluation framework proposed be adapted to evaluate I2V model and pipelines?
Questions
see weaknesses
Limitations
The authors adequately addressed the limitations and potential negative societal impact.
Final Justification
I sincerely appreciate the rebuttal and the effort during the rebuttal period. The new clarifications and experimental results have addressed most of my concerns, and I choose to maintain my current recommendation of 4: Borderline accept.
Formatting Issues
Space below figure 4 seems too small.
We thank the reviewer for their constructive feedback on our work. Please find our responses to your questions:
Clarification on Controllability and Visual Quality.
Thank you for raising this point. We considered this extensively as well. It is important to clarify that cinematic controls represent the "input" dimensions, the specific knobs and dials that professionals prioritize, while video quality represents the "output" fidelity of the model. Correctness in elements such as human anatomy and physical laws is inherently assumed in real-world professional videos. Our initial evaluations indicated a strong correlation between control metrics and video quality. Specifically, we observed that human annotators implicitly factor in visual quality while rating for controls. For example, if a video shows an "interactive" action, but the "interaction" is not grounded in physical laws, humans inherently tend to rate the video low. To validate this empirically, we performed the following experiment at scale:
- Augmented each SCINE Scripts prompt with extra questions evaluating visual quality (VQ) and temporal smoothness (TS).
- Example prompt:
"In an airport during the day, passengers form a line, checking tickets as they prepare to board. Outside the large window, ground crew load suitcases onto a baggage cart, while another airplane taxis slowly across the tarmac in the distance."
- VQ question: "How smooth and artifact‑free is the rendering of the airplane taxiing in the distance?"
- TS question: "Is the motion of the ground crew loading suitcases and the taxiing airplane fluid and natural, with consistent transitions between frames?"
- Collected independent ratings from annotators on all control questions, plus the VQ and TS questions.
- Computed per-model averages and compared Control vs. VQ and TS, with results shown below:
| Model | Ctrl | VQ | TS | Δ(VQ−Ctrl) | Δ(TS−Ctrl) |
|---|---|---|---|---|---|
| Wan 13B | 3.158 | 3.319 | 3.113 | +0.160 | –0.045 |
| Minimax | 3.110 | 3.437 | 3.181 | +0.327 | +0.071 |
| Hunyuan | 2.973 | 3.138 | 2.900 | +0.165 | –0.074 |
| Luma Ray 2 | 2.897 | 3.282 | 2.980 | +0.385 | +0.083 |
| Wan 1B | 2.877 | 2.974 | 2.659 | +0.097 | –0.218 |
| Pika 2.2 | 2.759 | 2.976 | 2.690 | +0.217 | –0.069 |
| Mochi | 2.704 | 2.708 | 2.396 | +0.004 | –0.307 |
| StepVideo | 2.610 | 3.078 | 2.788 | +0.467 | +0.178 |
| EasyAnimate 5.1 | 2.573 | 2.748 | 2.474 | +0.175 | –0.099 |
| Cogvideo 5B | 2.566 | 2.653 | 2.380 | +0.087 | –0.186 |
| Pyramid Flow | 2.339 | 2.528 | 2.234 | +0.189 | –0.105 |
| Vchitect 2.0 | 2.144 | 2.089 | 1.930 | –0.055 | –0.215 |
| LTX Video | 2.056 | 2.271 | 2.045 | +0.215 | –0.011 |
The per-model Control and VQ/TS averages yield a Spearman correlation of 0.93 (p < 1e-5), confirming that VQ and TS are implicitly embedded within control evaluations.
Although video quality and controllability are fundamentally different problems, our findings demonstrate that our primary metric for controllability is an effective proxy for overall model performance, as it implicitly factors in the visual quality of a generated video. However, as models evolve and video quality approaches a plateau, we anticipate that fine-grained controllability will increasingly become the decisive differentiator for professional applications. Our focus on controllability is thus a forward-looking evaluation of this critical frontier.
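For reference, the reported rank correlation can be reproduced directly from the per-model Ctrl and VQ averages in the table above. The short check below is illustrative and uses only the rounded values shown in the table.

```python
# Sanity check of the reported correlation, using only the per-model averages
# from the table above (Ctrl vs. VQ).
from scipy.stats import spearmanr

ctrl = [3.158, 3.110, 2.973, 2.897, 2.877, 2.759, 2.704,
        2.610, 2.573, 2.566, 2.339, 2.144, 2.056]
vq   = [3.319, 3.437, 3.138, 3.282, 2.974, 2.976, 2.708,
        3.078, 2.748, 2.653, 2.528, 2.089, 2.271]

rho, p = spearmanr(ctrl, vq)
print(f"Spearman rho = {rho:.2f}, p = {p:.1e}")   # ~0.93, p < 1e-5
```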
Limitation to Single Shot Evaluation
We agree that filmmaking involves more than generating a single independent shot. That said, our current work specifically focuses on single-shot control because even a single shot involves coordinating multiple elements to translate a creative vision to screen.
We identify a critical gap in the current literature: despite the growing capabilities of generative video models, a principled understanding of what constitutes a well-controlled single shot in a professional setting remains underexplored. Our contribution lies in filling this gap by proposing taxonomies that define and evaluate these elements in a structured way. Further, we believe that if we cannot reliably control a single shot using video generative models, building coherent sequences of shots becomes an even greater challenge. As our evaluations show, current models struggle even in setting up a single shot.
We agree that extending to multi-shot setups is an important future direction. Multi-shot setups will add new dimensions to our taxonomy, where two axes become relevant:
- Intra-shot correctness
- Inter-shot consistency
Our taxonomy already provides the foundation and structure for evaluating both. In a multi-shot setup, each node remains relevant while additional factors such as consistency will link identical nodes across shots.
Consistency of nodes such as character, style, story, and color becomes important in a multi-shot professional filmmaking setup; all of these are already defined in our taxonomies:
- Attributes such as Costume and Hair, associated with character identity, are under the Setup → Subjects axis.
- Style and Color are captured under Set Design and Scene Texture, respectively.
Multiple baseline approaches can then be explored to measure inter-shot consistency across different controls (a minimal sketch follows this list):
- Facial consistency: via facial‐embedding similarity (RetinaFace, FaceNet) [1–3]
- Costume consistency: via DINO feature similarity [4]
- Hair consistency: via perceptual (SSIM, PSNR) + semantic (CLIP-I) metrics [5]
- Color consistency: via sliced Wasserstein/histogram in CIELAB space + palette recoloring and transfer methods [6–8]
- Style consistency: via Contextual Style Descriptor (CSD) [9]
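As an illustration of the embedding-similarity checks listed above (e.g., facial or DINO features), a minimal sketch follows. The embedding extractors themselves (RetinaFace/FaceNet, DINO) are assumed to run upstream, and the function name here is purely illustrative.

```python
# Minimal sketch (illustrative, not the paper's implementation): inter-shot
# consistency as the mean pairwise cosine similarity between per-shot
# embeddings of the same entity (e.g., face embeddings from FaceNet or
# costume crops embedded with DINO). Embedding extraction is assumed to
# happen upstream; here we operate on precomputed vectors.
import numpy as np

def inter_shot_consistency(shot_embeddings: np.ndarray) -> float:
    """shot_embeddings: (num_shots, dim) array, one embedding per shot."""
    e = shot_embeddings / np.linalg.norm(shot_embeddings, axis=1, keepdims=True)
    sim = e @ e.T                                  # pairwise cosine similarities
    iu = np.triu_indices(len(e), k=1)              # upper triangle, no diagonal
    return float(sim[iu].mean())

# Toy usage with random vectors standing in for per-shot face/DINO embeddings.
emb = np.random.randn(4, 512)
print(f"consistency score: {inter_shot_consistency(emb):.3f}")
```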
In conclusion, our current work concentrates on defining a structure of controls in a single-shot setting, an understanding of which is still missing from the generative video literature. By formalizing this space with our structured taxonomy, we lay a solid foundation for future research on multi-shot filmmaking; our framework can be extended with additional dimensions, such as consistency, that are crucial when moving beyond single shots.
[1] Wang et al. "CharaConsist: Fine-Grained Consistent Character Generation."
[2] Serengil et al. "A Benchmark of Facial Recognition Pipelines and Co-Usability Performances of Modules"
[3] Schroff et al. "Facenet: A unified embedding for face recognition and clustering."
[4] He et al. "VTON 360: High-fidelity virtual try-on from any viewing direction."
[5] Zhang et al. "Stable-hair: Real-world hair transfer via diffusion model."
[6] He et al. "Multiscale sliced wasserstein distances as perceptual color difference measures."
[7] Liu et al. "Temporally consistent video colorization with deep feature propagation and self-regularization learning."
[8] Chang et al. "Palette-based photo recoloring."
[9] Somepalli et al. "Measuring style similarity in diffusion models."
T2V and its scope in professional applications
We use text as our input modality as it is the most common source of control across video models today. Further, a single text prompt can combine coarse ideas ("dolly zoom at dusk") with exact parameters (35 mm, 4000K). T2V models enable us to cover the breadth of the entire taxonomy in a single, unified workflow. Furthermore, our control taxonomy is input-agnostic; it defines what a creator needs to steer, regardless of how those controls are fed in.
While T2V models are not the only tools in a professional workflow, they have largely been integrated into creative pipelines. A recent report titled "Darren Aronofsky Launches AI-Driven Storytelling Venture, As Google Unveils New Gen AI Video Tool" states that T2V models have been widely used by highly creative professionals. Furthermore, the "AI in Content Creation 2025" report by Wondercraft states that the rise of T2V workflows is opening new doors; platforms that allow users to start with a written script and quickly generate narrated videos or animated explainers are in high demand. A recent film by the Wall Street Journal, titled "We Tested Google Veo and Runway to Create This AI Film", describes using text prompts to create specific shots. Short films titled "THE CLEANER" and "SWAT" have been created entirely using text prompts as input.
In conclusion, T2V models allow complete traversal of our control taxonomy and align with professional pipelines. Evaluating through this lens therefore covers the taxonomy’s full scope without constraining future extensions to other input modalities.
Extensions to I2V
Yes, our evaluation framework can be extended to I2V pipelines. Below, we propose and perform a simple experiment to validate this.
- Reference Image: Since reference images have widely been used to control character appearance, we base our current experiment on them.
- Build prompts for the respective models:
  - T2I prompt: contains only the character description, conditioned on (Setup -> Subjects) from our taxonomy, e.g., "Male punk with a neon-green mohawk (Subjects -> Hair), nose ring, tattoos (Subjects -> Accessories), black tank-top (Subjects -> Costume), arms crossed."
  - I2V prompt: adds more controls from our taxonomies, e.g., "A punk man leans forward (Events) while filmed from a static low-angle shot (Camera) under warm light (Lighting)."
- Evaluate:
- Control: "Is the clip low-angle while the punk leans forward under warm light?"
- Character Consistency: "Do the mohawk, nose ring and tattoos remain identical to the reference image?"
We create 20 prompts, generate reference images with Stable Diffusion 3.5 Large, generate videos with two I2V models (Hunyuan and Wan-13B), and rate the models on consistency and control. We present their win rates below:
| Metric | Hunyuan | Wan | Tie |
|---|---|---|---|
| Consistency | 3 | 13 | 4 |
| Control | 4 | 11 | 5 |
As shown, our taxonomy and evaluation framework are inherently flexible and can be naturally adapted to additional modalities.
Space below Figure 4.
Thank you for the suggestion; we will improve our presentation in the camera-ready version.
I sincerely appreciate the rebuttal and the effort during the rebuttal period. The new clarifications and experimental results have addressed most of my concerns, and I choose to maintain my current recommendation of 4: Borderline accept.
Thank you for your thoughtful review and for acknowledging the clarifications and additional results in our rebuttal. We are glad that most of your concerns have been addressed. If there are any remaining concerns that are preventing a higher score, we would be very happy to address them to help support a stronger recommendation.
We thank the reviewer again for the feedback on our paper! As the discussion deadline (Aug. 6th) is approaching, we were wondering whether the reviewer had the chance to look at our response and whether there is anything else the reviewer would like us to clarify. We sincerely hope that our response has addressed the concerns, and if so, we would be grateful if the reviewer could consider increasing the score accordingly.
Best, Authors
This paper introduces Stable Cinemetrics (SCINE), a structured evaluation framework for professional video generation that organizes filmmaking principles into four hierarchical taxonomies (Setup, Events, Lighting, and Camera) with 76 fine-grained control nodes. The authors conduct extensive human evaluation with 80+ film professionals assessing 20,000 videos generated by 10+ text-to-video models, revealing that current models struggle most with Events and Camera controls while performing better on Setup and Lighting elements. To enable scalable evaluation, they fine-tune a vision-language model that achieves 72.36% alignment with human judgments, though its testing on same-distribution data raises questions about generalization capabilities. Despite limitations including the absence of visual examples and limited cross-domain testing, SCINE provides a valuable foundation for aligning generative video models with professional production standards.
Strengths and Weaknesses
Strengths:
- Pioneering Professional Video Evaluation Benchmark: Introduces SCINE, an innovative evaluation framework that formalizes filmmaking principles into four disentangled taxonomies (Setup, Event, Lighting, Camera), addressing a critical gap in current T2V model evaluation for professional use.
- Large-Scale Expert Human Annotation: Conducts a comprehensive human study with over 80 film professionals annotating 20K videos, providing high-quality, industry-aligned ground truth for model performance.
Weaknesses:
- Questionable Generalization of Automatic Evaluator: The trained VLM evaluator's generalization is unproven, as it's not validated on real-world videos or new, unseen generative models, which is crucial for a benchmark's universality.
- Quality Issues in Supplementary Examples: "Good case" examples in the supplementary material (e.g., ltx_00569.mp4) are unrealistic, undermining the claimed alignment with professional standards and raising further doubts about the evaluation method's generalization.
- Lack of Open-Source Commitment: The absence of a clear commitment to open-source the proposed data and models significantly hinders reproducibility and wider community adoption of the proposed dataset, thereby diminishing the paper's overall practical impact.
Questions
- Could additional experiments be conducted to verify the generalization ability of the automatic evaluator on videos generated by other models or on real-world videos?
- The paper does not explicitly state whether the data and models will be open-sourced. Given that this is an evaluation benchmark, open-sourcing is crucial for community reproduction and further research. Do the authors have plans for open-sourcing?
- Some videos labeled as 'good cases' in the supplementary materials (e.g., ltx_00569.mp4) clearly do not conform to real-world principles, which raises questions about the rigor of the evaluation criteria. Could the authors provide a more detailed explanation of the judgment criteria for these cases, or consider re-evaluating them?
Limitations
See Weakness above
Final Justification
Most of my concerns are solved.
Formatting Issues
No concern.
We thank the reviewer for their constructive feedback on our work. Please find below our responses to your questions:
Generalization to unseen videos
Thank you for your suggestion. We have now evaluated our VLM‑based evaluator on previously unseen data from Veo 3 Fast, a model released after our submission.
We sample 27 Veo 3 Fast videos (restricted in number by API rate limits), evaluating a total of 100 questions and ensuring an equal distribution across categories. We follow a setting similar to our previous experiments and use human annotation as the ground truth. Below, we present VLM accuracy on the unseen Veo 3 Fast videos:
| Model | Events | Setup | Camera | Lighting | Overall |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 53.6 | 42.5 | 51.0 | 60.0 | 51.8 |
| Qwen2.5-VL-32B | 55.4 | 55.4 | 72.0 | 65.0 | 62.0 |
| Qwen2.5-VL-72B | 64.3 | 60.0 | 82.0 | 74.6 | 70.2 |
| Ours | 79.5 | 78.9 | 78.9 | 74.9 | 78.1 |
As shown above, our model still outperforms existing baselines and generalizes well to out-of-distribution data not seen during training.
Open-Source commitment
We are fully committed to open-sourcing the data and models associated with our work. Following legal approvals and additional safety checks, we will release all the artifacts associated with Stable Cinemetrics.
Quality issues in Supplementary
Thank you for this question. Our evaluation framework assesses each cinematic dimension independently using our four taxonomies: Camera, Lighting, Setup, and Events. A generated video can therefore be good in one aspect while being poor in another.
The presented example, ltx_00569.mp4, is considered a good example only for Camera Movement, which in this case is expected to be static, as specified in the prompt: "A sun-dappled camping ground framed by towering pine trees....The camera static on a tripod, shallow depth of field isolating....mirroring the vertical trunks rising behind them" (full prompt in metadata.csv).
| Taxonomy Node | Evaluation Question | Annotator 1 | Annotator 2 | Annotator 3 | Avg. |
|---|---|---|---|---|---|
| Camera Movement (Static) | Is the camera maintained in a fixed, static position throughout the scene as specified in the prompt? | 5 | 5 | 4 | 4.67 |
As shown above by the annotator ratings, and as seen in the generated video (located in the Camera__Trajectory__Camera Movement folder inside the good_examples folder of the Supplementary), the video correctly maintains a static camera. This is the only reason the video is listed as a good example, and only for Camera Movement.
As you correctly pointed out, the generated video has issues related to the movement of the axe and its corresponding effects. These are evaluated and captured under the Events taxonomy in this case. Below are the evaluation scores:
| Taxonomy Node | Evaluation Question | Annotator 1 | Annotator 2 | Annotator 3 | Avg. |
|---|---|---|---|---|---|
| Events (Actions) | Does the video clearly display the physical action of gripping the axe and the dynamic motion of the swing as described? | 1 | 2 | 3 | 2.0 |
| Events (Causal) | Is the causal sequence from the axe swing leading to the wood splitting, splinters erupting, and the halves falling depicted clearly and coherently in the video? | 1 | 3 | 3 | 2.33 |
These low scores align with your observation that the physical realism of the axe swing is inadequate.
In conclusion:
- ltx_00569.mp4 is a positive example only for Camera Movement (Static) in this case.
- Under Events, the same video would be a negative example, and is treated as such during evaluation.
Concerns about visual examples
We apologize for this. We were initially restricted by the 100 MB NeurIPS supplementary size limit, and the size of each video is quite large; for example, a video generated by Wan 14B is on average ~7 MB.
Thus, as requested by the reviewer, we have now created a fully anonymous web-page with 200+ videos, presenting side-by-side comparisons across models on SCINE Scripts and Visuals, and detailed pairwise comparisons with categories, questions, and scores.
However, we are unable to share this link with the reviewer, due to the NeurIPS rebuttal policy. Therefore, we have shared this fully anonymized link with the AC; we hope you are able to consider the web-page during your internal discussions. Thank you!
We thank the reviewer again for the feedback on our paper! As the discussion deadline (Aug. 6th) is approaching, we were wondering whether the reviewer had the chance to look at our response and whether there is anything else the reviewer would like us to clarify. We sincerely hope that our response has addressed the concerns, and if so, we would be grateful if the reviewer could consider increasing the score accordingly.
Best, Authors
I have read the author's rebuttal, most of my concerns are solved, and I will raise the score.
Thank you for acknowledging our rebuttal and increasing your score. If there are any remaining concerns, we would be very happy to address them.
The paper introduces Stable Cinemetrics (SCINE), an evaluation framework designed to assess the readiness of current video generative models for professional use. SCINE organizes filmmaking principles into four hierarchical taxonomies—Setup, Events, Lighting, and Camera—covering 76 fine-grained control nodes based on industry practices. It creates a benchmark with story-driven and visually rich prompts to mirror professional workflows, enabling automated prompt categorization and targeted question generation for detailed evaluation. A human study involving 10+ models, 20K videos, and 80+ film professionals reveals significant performance gaps, particularly in Events and Camera controls. To support scalable evaluation, the authors developed a vision-language model (VLM) aligned with expert annotations, outperforming existing baselines with a 72.36% accuracy.
Strengths and Weaknesses
Strengths
- Existing benchmarks lack the detailed, shot-level structure and cinematic depth required for professional filmmaking, as exemplified by a simple VBench prompt ("A man is walking") that omits crucial details like character appearance or camera movement. In contrast, SCINE's taxonomy-guided approach enables scalable and future-proof evaluation, adapting as model capabilities evolve beyond the limitations of static, fixed-prompt benchmarks.
- The study involves over 80 film professionals, including independent cinematographers, screenwriters, and an Academy Award-winning Visual Effects Artist. The taxonomy development is driven by the central question: "What controls do professionals require when setting up a shot?" ensuring a practical, industry-relevant framework.
- The taxonomy provides both coarse insights, revealing that models struggle most with Events and Camera, and fine-grained comparisons, such as better performance on shot size over camera framing and natural lighting over artificial lighting.
- Detailed information is provided, including Taxonomy Details, the User Interface used by annotators for evaluations, and instructions given to the LLM for prompt generation, enhancing the transparency and reproducibility of the methodology.
Weaknesses
- Not enough evaluated generated videos with evaluation results are presented. The supplemental material contains some ranked videos, but it is not organized in a way that is easy for readers to interpret. Side-by-side comparisons of different T2V models using the same prompts would significantly improve clarity.
- Certain metrics, such as camera parameters, movement, depth, and contrast, are not subjected to rigorous, objective quantitative assessments. Incorporating precise numerical analysis for camera movement, depth, and contrast, etc., alongside human ratings would create a more balanced and reliable evaluation.
Questions
- Given that SCINE's current evaluation operates at the shot level, how might the framework be extended to evaluate multi-shot sequences?
- What are the most common failure modes of the VLM evaluator? Would its computational cost be a burden?
Limitations
yes
Final Justification
The pilot study in the authors' rebuttal, supplementing the human and VLM-based assessments with quantitative, objective-driven metrics, should improve this manuscript. The other additional information and discussion provided are also helpful. After reading the rebuttal and the other reviewers' opinions, my concerns are addressed. Therefore, I am increasing my rating accordingly. I am also looking forward to seeing the additional visual results, which should be available to the ACs when they are making their decisions.
Formatting Issues
n/a
We thank the reviewer for their constructive feedback on our work. Please find below our responses to your questions:
Quantitative Metrics
We appreciate the reviewer's suggestion to supplement our human and VLM-based assessments with quantitative, objective-driven metrics.
Below, we first explain why such metrics remain challenging to apply directly to state-of-the-art video-generation systems:
- Out‑of‑distribution content — Monocular depth and camera‑pose estimators are trained and evaluated almost exclusively on real‑world data (e.g. KITTI, RealEstate10K). Their behaviour on synthetic videos has not been systematically benchmarked.
- Poor alignment with human annotators — Widely‑used automated video metrics (e.g. FVD) have been shown to poorly correlate with human judgments [1], prompting a shift towards human and VLM‑assisted evaluations [2,3].
To unify evaluations of all of our 76 control nodes, and because professionals are the ultimate judge of the video's quality, we treat professional human raters as the gold standard and train a VLM proxy to approximate them.
In response to the reviewer's suggestion, we carried out baseline experiments that demonstrate how geometry-based computer-vision metrics can serve as an additional, objective signal for future evaluations:
Camera Pose and Depth Estimation
We generate two 81-frame videos (Wan 1B vs. Wan 13B) for the prompt below:
"A figure walks along the beach shore at sunset as the sun hangs low, ... The camera pushes in smoothly on a dolly from an eye-level perspective, ... drama in sharp relief."
Using CUT3R [4], we recover per-frame camera parameters and depth maps, then derive the baseline metrics below (ideal dolly path = best-fit straight line through the 81 camera centres):
| Metric | Wan 1 B | Wan 13 B | Interpretation |
|---|---|---|---|
| Path RMS / max (m) | 0.022 / 0.042 | 0.046 / 0.126 | ~2 × straighter path for 1 B. |
| Total push (m) | 1.705 | 0.596 | 1 B travels further. |
| Linearity (1 = perfect) | 0.965 | 0.846 | Smoother progression for 1 B. |
| Speed μ ± σ (m / f) | 0.026 ± 0.018 | 0.024 ± 0.014 | Similar mean speed; 13 B slightly steadier. |
| Jerk (max) (m / f²) | 0.103 | 0.056 | Larger instantaneous spike in 1 B. |
| Drift roll / pitch / yaw (°) | 2.3 / 4.3 / 2.6 | 2.0 / 15.0 / 1.1 | 13 B tilts up ≈ 15 °, deviating from a true dolly. |
| Depth median μ ± σ (m) | 3.42 ± 0.32 | 4.02 ± 0.96 | 13 B generates a deeper scene. |
| Depth IQR (m) | 0.84 | 2.74 | Narrower depth spread in 1 B. |
Insight: Wan 1B adheres more closely to the intended dolly motion and maintains a tighter depth distribution, whereas Wan 13B introduces significant pitch drift.
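For illustration, the sketch below shows how path metrics of this kind can be derived from per-frame camera centres. The specific definitions used here (best-fit line via SVD, linearity as net displacement over arc length, and an acceleration proxy for the jerk column) are simplifying assumptions and may differ slightly from the exact formulas behind the table.

```python
# Illustrative sketch (assumed definitions): path metrics from per-frame
# camera centres, e.g. as recovered by CUT3R. centres: (T, 3) positions in metres.
import numpy as np

def dolly_path_metrics(centres: np.ndarray) -> dict:
    c = centres - centres.mean(axis=0)
    # Best-fit straight line through the camera centres via SVD.
    _, _, vt = np.linalg.svd(c, full_matrices=False)
    direction = vt[0]                              # principal axis of the path
    proj = c @ direction                           # 1D position along the line
    residual = c - np.outer(proj, direction)       # deviation from the line
    dev = np.linalg.norm(residual, axis=1)

    step = np.diff(centres, axis=0)                # per-frame displacement
    speed = np.linalg.norm(step, axis=1)           # m / frame
    arc_length = speed.sum()
    net_disp = np.linalg.norm(centres[-1] - centres[0])

    return {
        "path_rms_m": float(np.sqrt((dev ** 2).mean())),
        "path_max_m": float(dev.max()),
        "total_push_m": float(abs(proj[-1] - proj[0])),
        "linearity": float(net_disp / (arc_length + 1e-9)),  # 1 = perfectly straight
        "speed_mean": float(speed.mean()),
        "speed_std": float(speed.std()),
        "accel_max": float(np.abs(np.diff(speed)).max()),    # proxy for the jerk column
    }

# Toy usage: a slightly noisy 81-frame push-in along z.
t = np.linspace(0, 1.7, 81)
centres = np.stack([0.01 * np.random.randn(81), 0.01 * np.random.randn(81), t], axis=1)
print(dolly_path_metrics(centres))
```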
Global and Temporal Contrast
Across 9 models, we compute standard contrast metrics on 50 videos per model, each video generated from prompts activating the Setup → Scene → Texture → Contrast branch of our taxonomy.
| Model | Mean RMS | Patch (32x32) Mean | Patch Variance | Dynamic Mean RMS | Dynamic Max RMS |
|---|---|---|---|---|---|
| CogVideo5B | 55.64 | 21.50 | 357.44 | 0.64 | 4.64 |
| Easyanimate5.1 | 60.83 | 32.65 | 469.87 | 0.76 | 3.86 |
| Hunyuan | 58.81 | 15.43 | 267.16 | 0.21 | 1.92 |
| LTXvideo | 50.39 | 22.02 | 264.17 | 0.06 | 0.30 |
| LumaRay2 | 50.86 | 16.88 | 234.12 | 0.28 | 2.00 |
| Minimax | 55.35 | 18.04 | 297.49 | 0.19 | 1.18 |
| Wan14B | 56.89 | 21.38 | 373.43 | 0.15 | 1.11 |
| Wan1B | 60.10 | 25.30 | 410.62 | 0.24 | 1.63 |
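The contrast measurements above can be computed along the following lines. The definitions in this sketch (RMS contrast as the per-frame standard deviation of pixel intensity, 32×32 patch contrast, and dynamic contrast as the frame-to-frame change in RMS contrast) are illustrative assumptions rather than the exact formulas used for the table.

```python
# Illustrative sketch (assumed metric definitions): global RMS contrast,
# 32x32 patch contrast, and a "dynamic" contrast term measured as the
# frame-to-frame change in RMS contrast. frames: (T, H, W) grayscale in [0, 255].
import numpy as np

def contrast_metrics(frames: np.ndarray, patch: int = 32) -> dict:
    rms = frames.reshape(len(frames), -1).std(axis=1)         # per-frame RMS contrast

    h = frames.shape[1] // patch * patch
    w = frames.shape[2] // patch * patch
    tiles = frames[:, :h, :w].reshape(len(frames), h // patch, patch, w // patch, patch)
    patch_std = tiles.std(axis=(2, 4))                         # contrast of each patch

    dyn = np.abs(np.diff(rms))                                 # temporal contrast change
    return {
        "mean_rms": float(rms.mean()),
        "patch_mean": float(patch_std.mean()),
        "patch_var": float(patch_std.var()),
        "dynamic_mean_rms": float(dyn.mean()),
        "dynamic_max_rms": float(dyn.max()),
    }

# Toy usage on random frames standing in for decoded grayscale video.
frames = np.random.rand(16, 256, 448) * 255
print(contrast_metrics(frames))
```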
In summary, our current work benchmarks model capabilities through expert human evaluations, which remain the gold standard, and trains a vision–language model (VLM) to approximate them. Additionally, our initial experiments reveal potential directions for grounding geometry-based metrics in the evaluation of video generative models.
[1] Ge Songwei et al., "On the Content Bias in Fréchet Video Distance"
[2] Wu, Jay Zhangjie, et al. "Towards a better metric for text-to-video generation."
[3] Bansal, Hritik, et al. "VideoPhy 2: Challenging Action-Centric Physical Commonsense Evaluation of Video Generation"
[4] Wang, Qianqian et al., "CUT3R: Continuous 3D Perception Model with Persistent State"
Extension to multi-shot sequences
Evaluating multi-shot sequences is an important future direction. Multi-shot setups will add new dimensions to our taxonomy, where two axes become relevant:
- Intra-shot correctness: the per-shot controls defined and evaluated in our current work;
- Inter-shot consistency: coherence of shots across sequences.
Our taxonomy already provides the foundation and structure for evaluating both. In a multi-shot setup, each node remains relevant while additional factors such as consistency will link identical nodes across shots. Consistency of nodes such as character, style, story, and color becomes important in a multi-shot professional filmmaking setup; all of these are already defined in our taxonomies:
- Attributes such as Costume and Hair, associated with character identity, are under the Setup → Subjects axis.
- Style and Color are captured under Set Design and Scene Texture, respectively.
Multiple baseline approaches can then be adopted to measure inter-shot consistency across different control elements (a minimal color-consistency sketch follows this list):
- Facial consistency: calculated using facial embedding similarity [1] computed with models like RetinaFace [2] and FaceNet [3].
- Costume consistency: measured using visual feature similarity methods such as DINO embeddings as shown in [4].
- Hair consistency: evaluated using perceptual and semantic metrics e.g., SSIM, and CLIP-I, as shown in [5].
- Color consistency: evaluated using sliced Wasserstein distances [6] or histogram comparisons [7] in the CIELAB color space. Discrete color palettes can also be compared using palette-based photo recoloring and color transfer methods [8].
- Style consistency: measured using the Contextual Style Descriptor (CSD) [9], which evaluates style similarity between reference and target sets.
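As referenced above, a minimal sketch of the sliced-Wasserstein color comparison in CIELAB space is shown below. The sampling size, number of projections, and function name are arbitrary illustrative choices, not a prescribed implementation.

```python
# Illustrative sketch of a sliced-Wasserstein color-consistency check in
# CIELAB space between two shots (lower = more consistent palettes).
import numpy as np
from skimage.color import rgb2lab

def sliced_wasserstein_color(frame_a: np.ndarray, frame_b: np.ndarray,
                             n_proj: int = 64, n_pix: int = 4096,
                             seed: int = 0) -> float:
    """frame_a, frame_b: (H, W, 3) RGB images with values in [0, 1]."""
    rng = np.random.default_rng(seed)
    lab_a = rgb2lab(frame_a).reshape(-1, 3)
    lab_b = rgb2lab(frame_b).reshape(-1, 3)
    lab_a = lab_a[rng.choice(len(lab_a), n_pix, replace=False)]
    lab_b = lab_b[rng.choice(len(lab_b), n_pix, replace=False)]

    dirs = rng.normal(size=(n_proj, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)        # random unit directions
    # 1D Wasserstein-1 along each projection = mean |sorted_a - sorted_b|.
    pa = np.sort(lab_a @ dirs.T, axis=0)
    pb = np.sort(lab_b @ dirs.T, axis=0)
    return float(np.abs(pa - pb).mean())

# Toy usage with random frames standing in for two shots of the same scene.
a = np.random.rand(120, 160, 3)
b = np.random.rand(120, 160, 3)
print(f"color distance: {sliced_wasserstein_color(a, b):.3f}")
```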
In conclusion, our current work concentrates on defining a structure of controls in a single-shot setting, an understanding of which is still missing from the generative video literature. By formalizing this space with our structured taxonomy, we lay a solid foundation for future research on multi-shot filmmaking; our framework can be extended with additional dimensions, such as consistency, that are crucial when moving beyond single shots.
[1] Wang, Mengyu, et al. "CharaConsist: Fine-Grained Consistent Character Generation."
[2] Serengil, Sefik et al. "A Benchmark of Facial Recognition Pipelines and Co-Usability Performances of Modules"
[3] Schroff, Florian, et al. "Facenet: A unified embedding for face recognition and clustering."
[4] He, Zijian, et al. "VTON 360: High-fidelity virtual try-on from any viewing direction."
[5] Zhang, Yuxuan, et al. "Stable-hair: Real-world hair transfer via diffusion model."
[6] He, Jiaqi, et al. "Multiscale sliced wasserstein distances as perceptual color difference measures."
[7] Liu, Yihao, et al. "Temporally consistent video colorization with deep feature propagation and self-regularization learning."
[8] Chang, Huiwen, et al. "Palette-based photo recoloring."
[9] Somepalli, Gowthami, et al. "Measuring style similarity in diffusion models."
Failure modes of VLM Evaluator
Please find below the nodes on which the VLM has the strongest and weakest performance, respectively. Scores indicate the agreement of the VLM's choice with the most common preference among human annotators.
Top 10 Nodes
| Node | Score |
|---|---|
| Setup→Scene→Set Design→Environment→Mood | 0.88 |
| Events→Adv.Controls→Rhythm→Pace | 0.82 |
| Setup→Scene→Set Design→Environment→Style | 0.82 |
| Setup→Scene→Set Design→Props→Utility | 0.80 |
| Events→Types→Emotions→Exp.Types→Explicit | 0.79 |
| Setup→Scene→Geometry→Frame→Shapes→Regular | 0.78 |
| Lighting→Lighting Effects→Shadows→Soft | 0.76 |
| Setup→Scene→Geometry→Space→Spatial Loc.→Rel.Pos. | 0.75 |
| Events→Types→Actions→Int.Types→Standalone | 0.75 |
| Events→Types→Emotions→Exp.Types→Explicit | 0.75 |
Bottom 10 Nodes
| Node | Score |
|---|---|
| Setup→Subjects→Makeup | 0.33 |
| Setup→Scene→Set Design→Environment→Background | 0.33 |
| Events→Types→Actions→Portrayed as→Contextual→Background | 0.46 |
| Setup→Subjects→Accessories | 0.53 |
| Lighting→Lighting Effects→Reflection | 0.55 |
| Setup→Scene→Texture→Color Palette | 0.55 |
| Camera→Intrinsics→Exposure→Shutter Speed | 0.57 |
| Lighting→Adv.Controls→Color Gels | 0.57 |
| Setup→Scene→Texture→Blur | 0.57 |
| Lighting→Color Temperature | 0.58 |
Computational cost of VLM
Our VLM, trained as a Bradley-Terry-style evaluator starting from Qwen2.5-VL-7B, adds a regression head in place of the LM head. It needs just one forward pass instead of the seq_len passes required by a traditional autoregressive generative model. This makes our model roughly 100× more efficient than the 72B baseline (1.82 vs. 182.44 petaFLOPs), while also offering better performance (72.36% vs. 62.50%), since it is trained with a preference-modeling objective.
| Model | Avg FLOPs (in petaFLOPs) | Accuracy |
|---|---|---|
| Qwen-7B | 17.738 | 59.86 % |
| Qwen2.5‑VL‑32B | 81.087 | 59.93 % |
| Qwen2.5‑VL‑72B | 182.44 | 62.50 % |
| Ours | 1.82 | 72.36 % |
Additional Visual Results
We apologize for this. We were initially restricted by the 100 MB NeurIPS supplementary size limit, and the size of each video is quite large; for example, a video generated by Wan 14B is on average ~7 MB.
Thus, as requested by the reviewer, we have now created a fully anonymous web-page with 200+ videos, presenting side-by-side comparisons across models on SCINE Scripts and Visuals, and detailed pairwise comparisons with categories, questions, and scores.
However, we are unable to share this link with the reviewer, due to the NeurIPS rebuttal policy. Therefore, we have shared this fully anonymized link with the AC; we hope you are able to consider the web-page during your internal discussions. Thank you!
Thank you for the detailed rebuttal. I appreciate the pilot study on supplementing our human and VLM‑based assessments with quantitative, objective-driven metrics. Other additional information and discussion provided are also helpful. After reading the rebuttal and other reviewers' opinions, my concerns are addressed, and I will increase my rating accordingly. I am also looking forward to seeing the additional visual results.
Thank you for acknowledging our rebuttal and increasing your score. We will incorporate the additional visual results and discussions in the final draft. We appreciate your efforts in helping to improve the quality of our work. If there are any remaining concerns, we would be happy to address them.
This paper introduces Structured Taxonomy and Evaluation for Professional Video Generation. Initially, the reviewers raised concerns about side-by-side comparisons of different T2V models using the same prompts and the generalization of automatic evaluators. However, after the rebuttal, the authors successfully addressed most of these concerns, including clarifications regarding the novelty of the work, and all reviewers provided positive feedback. The AC has carefully reviewed the paper, the reviewer comments, and the rebuttal, and agrees that the paper is well-motivated, clearly written, and supported by thorough experiments. Therefore, the AC recommends acceptance. It is encouraged that the authors incorporate all additional experiments and discussions from the rebuttal into the final version.