Evaluating generative video models remains an open problem. Reference-based metrics such as the
Structural Similarity Index Measure (SSIM) and Peak Signal-to-Noise Ratio (PSNR) reward pixel
fidelity over semantic correctness, while Fréchet Video Distance (FVD) favors distributional
textures over physical plausibility. Benchmarks built on binary Visual Question Answering (VQA),
such as VBench 2.0, are prone to yes-bias and rely on low-resolution auditors that miss temporal
failures. Moreover, their prompts target a single dimension at a time, multiplying the number
of videos that must be generated while still not guaranteeing reliable results.
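To make the pixel-fidelity critique concrete, a minimal PSNR computation is sketched below; the frames are synthetic placeholders rather than data from the paper. Because PSNR depends only on the mean squared pixel error, a frame that is pixel-close to the reference scores highly even when its content is semantically wrong.

```python
import numpy as np

def psnr(reference: np.ndarray, generated: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio: higher means smaller pixel-wise error."""
    mse = np.mean((reference.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val**2 / mse)

# Synthetic placeholder frames: a lightly noised copy of the reference
# scores ~42 dB, regardless of whether its content is semantically right.
rng = np.random.default_rng(0)
reference = rng.integers(0, 256, size=(256, 256, 3)).astype(np.float64)
generated = np.clip(reference + rng.normal(0.0, 2.0, reference.shape), 0.0, 255.0)
print(f"PSNR: {psnr(reference, generated):.1f} dB")
```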
WorldJen addresses these limitations directly. Binary VQA is replaced with Likert-scale
questionnaires graded by a VLM that receives frames at native video resolution, and video
generation costs are contained by adversarially curated prompts designed to exercise up to
16 quality dimensions simultaneously.
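A minimal sketch of this grading loop, assuming a hypothetical `query_vlm` callable and simplified aggregation (the paper's actual prompts, judge model, and scoring details are not reproduced here):

```python
from statistics import mean

def grade_video(frames, questionnaires, query_vlm):
    """Average 1-5 Likert answers per quality dimension for one video.

    `questionnaires` maps a dimension name to its list of questions
    (10 per dimension in WorldJen); `query_vlm(frames, question)` is a
    hypothetical callable that shows native-resolution frames plus one
    question to the VLM and returns an integer Likert score in [1, 5].
    """
    return {
        dimension: mean(query_vlm(frames, q) for q in questions)
        for dimension, questions in questionnaires.items()
    }
```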
The framework is built around two interlocking contributions. First, a blind human preference
study accumulates 2,696 pairwise annotations from 7 annotators, with 100% pair coverage over
50 of the curated prompts × 6 state-of-the-art video models. The study reaches a mean
inter-annotator agreement of 66.9% and establishes a human ground-truth Bradley–Terry (BT)
rating with a three-tier structure.
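To illustrate how pairwise preferences become a BT rating, the sketch below fits BT strengths with the standard minorization–maximization update (Hunter, 2004); the win-count matrix is hypothetical, and the paper's own fitting procedure may differ.

```python
import numpy as np

def fit_bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = number of annotations preferring model i over model j.
    Uses the standard minorization-maximization update (Hunter, 2004).
    """
    n = wins.shape[0]
    p = np.ones(n)                  # initial strengths
    total_wins = wins.sum(axis=1)   # W_i: total wins of model i
    n_ij = wins + wins.T            # number of i-vs-j comparisons
    for _ in range(iters):
        denom = np.array([
            sum(n_ij[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            for i in range(n)
        ])
        p = total_wins / denom
        p /= p.sum()                # normalize: BT is scale-invariant
    return p

# Hypothetical win matrix for three models (not the study's 2,696 annotations).
wins = np.array([[0, 14, 18],
                 [6,  0, 12],
                 [2,  8,  0]], dtype=float)
print(fit_bradley_terry(wins))      # normalized strength per model
```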
Second, a VLM-as-judge evaluation engine judges the videos with prompt-specific,
dimension-specific Likert questionnaires (10 questions per dimension, 47,160 scored responses)
and independently reproduces the human-established three-tier BT rating structure. The VLM
achieves Spearman ρ̂ = 1.000 (p = 0.0014) against the human ranking, which is interpreted as
tier-level agreement with the human results.
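As a sanity check on this figure, the Spearman correlation between two BT score vectors can be computed as below; the scores are placeholders, not the paper's values, and the exact one-sided permutation p-value of 1/6! ≈ 0.0014 for a perfect ranking over six models is consistent with the reported p.

```python
import math
from scipy.stats import spearmanr

# Placeholder BT scores for six models (not the paper's values); Spearman
# only depends on the induced ranking, so any order-preserving scores work.
human_bt = [0.31, 0.24, 0.18, 0.12, 0.09, 0.06]
vlm_bt = [0.29, 0.26, 0.17, 0.13, 0.08, 0.07]

rho, _ = spearmanr(human_bt, vlm_bt)
print(f"Spearman rho = {rho:.3f}")

# With n = 6 models and a perfect rank match, the exact one-sided
# permutation p-value is 1/6! ~= 0.0014.
print(f"exact p = {1 / math.factorial(6):.4f}")
```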
Three focused ablation studies validate the robustness of the VLM evaluation
framework.