WorldJen

An end-to-end multi-dimensional benchmark for generative videos

Karthik Inbasekar · Guy Rom · Omer Shlomovits

research@moonmath.ai

Abstract

Evaluating generative video models remains an open problem. Reference-based metrics such as the Structural Similarity Index Measure (SSIM) and Peak Signal-to-Noise Ratio (PSNR) reward pixel fidelity over semantic correctness, while Fréchet Video Distance (FVD) favors matching texture distributions over physical plausibility. Binary Visual Question Answering (VQA)-based benchmarks such as VBench 2.0 are prone to yes-bias and rely on low-resolution auditors that miss temporal failures. Moreover, their prompts target a single dimension at a time, multiplying the number of videos required while still not guaranteeing reliable results.

WorldJen addresses these limitations directly. Binary VQA is replaced with Likert-scale questionnaires graded by a VLM that receives frames at native video resolution. Video generation costs are contained by adversarially curated prompts, each designed to exercise up to 16 quality dimensions simultaneously. The framework is built around two interlocking contributions. First, a blind human preference study accumulates 2,696 pairwise annotations from 7 annotators, with 100% pair coverage over 50 curated prompts × 6 state-of-the-art video models. With a mean inter-annotator agreement of 66.9%, the study establishes a human ground-truth Bradley–Terry (BT) rating with a three-tier structure. Second, a VLM-as-judge evaluation engine using prompt-specific, dimension-specific Likert questionnaires (10 questions per dimension, 47,160 scored responses) judges the videos and independently reproduces the human-established three-tier BT rating structure. The VLM achieves Spearman ρ̂ = 1.000 (p = 0.0014), which we interpret as full tier agreement with the human results. Three focused ablation studies validate the robustness of the VLM evaluation framework.
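Bradley–Terry ratings of the kind used here can be fit from pairwise preference counts with a simple iterative (minorization–maximization) update. The sketch below is illustrative only: the item names and win counts are toy data, not the study's annotations, and the paper's exact fitting procedure may differ.

```python
def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[(a, b)] is how often item a was preferred over item b.
    Returns strengths normalized to sum to 1 (larger = stronger).
    Uses the classic MM update: p_i = W_i / sum_j [n_ij / (p_i + p_j)].
    """
    items = sorted({m for pair in wins for m in pair})
    p = {m: 1.0 for m in items}
    for _ in range(n_iter):
        new_p = {}
        for i in items:
            # Total wins for item i across all opponents.
            w_i = sum(wins.get((i, j), 0) for j in items if j != i)
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                # Number of comparisons between i and j.
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (p[i] + p[j])
            new_p[i] = w_i / denom if denom else p[i]
        total = sum(new_p.values())
        p = {m: v / total for m, v in new_p.items()}
    return p

# Toy pairwise preferences: A beats B, B beats C, A beats C.
wins = {("A", "B"): 8, ("B", "A"): 2,
        ("B", "C"): 7, ("C", "B"): 3,
        ("A", "C"): 9, ("C", "A"): 1}
p = bradley_terry(wins)
assert p["A"] > p["B"] > p["C"]
```

Tier structures such as the three-tier split reported above can then be read off by clustering the fitted strengths.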

Dimensions


Motion & stability

How naturally and steadily does the video move?

  • Subject consistency: Are subjects stable and consistent across the clip?
  • Scene consistency: Do backgrounds and environments stay coherent?
  • Motion smoothness: Is motion fluid and free of unnatural jitters?
  • Temporal flickering: Are there unwanted flickers or brightness jumps?
  • Inertial consistency: Do movements follow believable inertia and momentum?

Logic & physics

Does the video obey the rules of our world?

  • Physical mechanics: Do objects behave according to believable physics?
  • Object permanence: Do objects persist sensibly when occluded or off-frame?
  • Human fidelity: Are human actions, anatomy, and interactions believable?
  • Dynamic degree: Does the clip convey lively, intentional dynamics?

Instruction adherence

Does the output follow what the prompt asked for?

  • Semantic adherence: Does the video match the meaning and intent of the instruction?
  • Spatial relationship: Are positions, sizes, and orientations consistent with the brief?
  • Semantic drift: Does the content stay on-topic from start to finish?

Aesthetic quality

How pleasing and well-crafted is the video?

  • Composition & framing: Is the shot composed in a clear, engaging way?
  • Lighting & volumetrics: Is lighting balanced and does volume read naturally?
  • Color harmony: Are colors cohesive and intentional?
  • Structural gestalt: Does the whole clip feel coherent and complete?

Video Examples

Playground

Upload one generated MP4 and score it across WorldJen dimensions.

The Playground is an experimental system that reproduces the paper's evaluation pipeline for easy accessibility; stability and correctness issues are possible. If you encounter an issue, please report it to research@moonmath.ai.


VLM vs. Human BT Ratings — 95% Bootstrap CIs (ρ̂ = 1.000, p = 0.0014, perfect rank agreement)

Per-dimension Scoring Breakdown

| Group | Dimension | Veo 3.1 Fast | Kling v2.6 Pro | Wan v2.2 A14B | LTX-2 | Hunyuan v1.5 | Wan 2.1 1.3B |
|---|---|---|---|---|---|---|---|
| Motion & stability | Subject Consistency | 4.59 | 4.59 | 4.23 | 4.12 | 4.52 | 3.94 |
| Motion & stability | Scene Consistency | 4.82 | 4.73 | 4.34 | 4.65 | 4.61 | 4.53 |
| Motion & stability | Motion Smoothness | 4.24 | 4.47 | 4.00 | 4.13 | 4.13 | 3.60 |
| Motion & stability | Temporal Flickering | 4.74 | 4.64 | 4.35 | 4.54 | 4.61 | 4.21 |
| Motion & stability | Inertial Consistency | 3.18 | 3.31 | 3.27 | 3.07 | 2.72 | 3.01 |
| Logic & physics | Physical Mechanics | 3.45 | 3.12 | 3.09 | 3.08 | 2.73 | 2.97 |
| Logic & physics | Object Permanence | 4.41 | 4.40 | 4.07 | 4.09 | 4.30 | 3.70 |
| Logic & physics | Human Fidelity | 4.15 | 3.67 | 3.71 | 3.81 | 3.61 | 2.94 |
| Logic & physics | Dynamic Degree | 4.22 | 4.16 | 4.13 | 4.25 | 3.91 | 3.79 |
| Instruction adherence | Semantic Adherence | 4.53 | 4.39 | 4.32 | 4.05 | 3.90 | 4.02 |
| Instruction adherence | Spatial Relationship | 4.38 | 4.43 | 4.14 | 4.18 | 4.09 | 3.93 |
| Instruction adherence | Semantic Drift | 4.83 | 4.78 | 4.73 | 4.65 | 4.68 | 4.61 |
| Aesthetic quality | Composition & Framing | 4.68 | 4.66 | 4.50 | 4.62 | 4.58 | 4.31 |
| Aesthetic quality | Lighting & Volumetrics | 4.00 | 4.14 | 3.88 | 4.08 | 3.47 | 3.67 |
| Aesthetic quality | Color Harmony | 4.86 | 4.84 | 4.75 | 4.73 | 4.70 | 4.59 |
| Aesthetic quality | Structural Gestalt | 3.98 | 3.90 | 3.59 | 3.60 | 3.61 | 3.24 |

Insights:

  • Physics remains the hardest gap: the lowest averages concentrate in the physics rows (Inertial Consistency, Physical Mechanics).
  • Aesthetic dimensions cluster higher: the Aesthetic quality group skews toward higher Likert bins across models.
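One way to read the per-dimension scores is to collapse the Likert means within a group into a single group score per model. The unweighted mean below is an illustrative aggregation, not necessarily the scheme WorldJen itself uses; the numbers are copied from the Logic & physics rows of the table above.

```python
# Likert means for the "Logic & physics" group, in row order:
# Physical Mechanics, Object Permanence, Human Fidelity, Dynamic Degree.
logic_physics = {
    "Veo 3.1 Fast":   [3.45, 4.41, 4.15, 4.22],
    "Kling v2.6 Pro": [3.12, 4.40, 3.67, 4.16],
    "Wan v2.2 A14B":  [3.09, 4.07, 3.71, 4.13],
    "LTX-2":          [3.08, 4.09, 3.81, 4.25],
    "Hunyuan v1.5":   [2.73, 4.30, 3.61, 3.91],
    "Wan 2.1 1.3B":   [2.97, 3.70, 2.94, 3.79],
}

# Unweighted group mean per model (an assumption for illustration,
# not the paper's documented aggregation rule).
group_mean = {m: sum(v) / len(v) for m, v in logic_physics.items()}

# Rank models best-first on the group score.
ranking = sorted(group_mean, key=group_mean.get, reverse=True)
assert ranking[0] == "Veo 3.1 Fast"
assert ranking[-1] == "Wan 2.1 1.3B"
```

Even under this crude averaging, the top and bottom models on the physics group match the overall tier picture described elsewhere on this page.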

The WorldJen Framework

Phase A: prompt corpus construction, LLM-based judging and enhancement, and prompt curation.
Phase B: questionnaire generation, video generation, human and VLM evaluation, and scoring.

Human rank correlation: WorldJen vs VBench

Each line traces one model's Human Eval, WorldJen BT, and VBench Quality ranks. WorldJen matches the human ranking exactly (ρ̂ = 1.00, p = 0.0014), while VBench Quality reaches only ρ̂ = 0.60 (p = 0.21, not significant at n = 6).

| Model | Human Eval rank | WorldJen BT rank | VBench Quality rank |
|---|---|---|---|
| Veo 3.1 | 1 | 1 | 1 |
| Kling v2.6 Pro | 2 | 2 | 4 |
| Wan A14B | 3 | 3 | 3 |
| LTX-2 | 4 | 4 | 5 |
| HunyuanVideo | 5 | 5 | 2 |
| Wan 1.3B | 6 | 6 | 6 |

Correlation with Human Eval: WorldJen (ρ = 1.00) vs VBench (ρ = 0.60)

Case study

We used WorldJen to verify output quality after kernel changes.

Three before/after video pairs compare outputs before and after the kernel changes.

BibTeX

@misc{inbasekar2026worldjen,
  title        = {{WorldJen}: An End-to-End Multi-Dimensional Benchmark for Generative Video Models},
  author       = {Inbasekar, Karthik and Rom, Guy and Shlomovits, Omer},
  year         = {2026},
  howpublished = {\url{https://worldjen.moonmath.ai}}
}