
**VBVR-Bench** is a comprehensive benchmark for evaluating **video reasoning capabilities**.

To systematically assess model reasoning capabilities, VBVR-Bench employs a dual-split evaluation strategy across 100 diverse tasks:

- **In-Domain (ID):** 50 tasks drawn from the training categories but using unseen parameter configurations and sample instances, testing in-domain generalization.
- **Out-of-Domain (OOD):** 50 entirely novel tasks that measure out-of-domain generalization, testing whether models acquire transferable reasoning primitives rather than relying on task-specific memorization.

Each task consists of 5 test samples, for 500 test samples in total (250 per split), enabling statistically robust evaluation across diverse reasoning scenarios.
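The split arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own scoring code: the constant and function names are hypothetical, and the overall-score formula (an unweighted mean of the ID and OOD averages) is an assumption. That assumption is consistent with the rows listed on this page, where the first score matches the mean of the second and eighth scores to within rounding, but the column headers are not reproduced here, so treat the reading as unconfirmed.

```python
# Split sizes as stated in the benchmark description.
NUM_ID_TASKS = 50
NUM_OOD_TASKS = 50
SAMPLES_PER_TASK = 5


def total_samples() -> int:
    """Total number of test samples across both splits."""
    return (NUM_ID_TASKS + NUM_OOD_TASKS) * SAMPLES_PER_TASK


def overall_score(id_avg: float, ood_avg: float) -> float:
    """ASSUMPTION: overall score is the unweighted mean of the two split averages."""
    return (id_avg + ood_avg) / 2


# (first score, second score, eighth score) copied from three leaderboard rows.
# Interpreting them as (overall, ID average, OOD average) is an assumption.
ROWS = {
    "Human": (0.974, 0.96, 0.988),
    "VBVR-Wan2.2": (0.685, 0.76, 0.61),
    "Kling 2.6": (0.369, 0.408, 0.33),
}


def max_deviation() -> float:
    """Largest gap between the listed first score and the assumed formula."""
    return max(
        abs(overall_score(id_avg, ood_avg) - listed)
        for listed, id_avg, ood_avg in ROWS.values()
    )


if __name__ == "__main__":
    print("total samples:", total_samples())
    for name, (listed, id_avg, ood_avg) in ROWS.items():
        print(f"{name}: listed={listed:.3f} mean={overall_score(id_avg, ood_avg):.3f}")
```

Because both splits contain the same number of tasks, an unweighted mean of the two split averages would equal a mean over all 100 tasks, which is one plausible reason for such a design.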




Human	👤 Reference	0.974	0.96	0.919	0.956	1	0.95	1	0.988	1	1	0.99	1	0.97
VBVR-Wan2.2	⭐ Strong Baseline	0.685	0.76	0.724	0.75	0.782	0.745	0.833	0.61	0.768	0.572	0.547	0.618	0.614
VBVR-Wan2.1	⭐ Strong Baseline	0.592	0.724	0.705	0.71	0.727	0.719	0.784	0.461	0.674	0.592	0.387	0.461	0.387
Sora 2	🔵 Proprietary	0.546	0.569	0.602	0.477	0.581	0.572	0.597	0.522	0.546	0.472	0.525	0.462	0.546
Seedance 2.0	🔵 Proprietary	0.544	0.57	0.593	0.498	0.618	0.514	0.602	0.517	0.643	0.398	0.492	0.427	0.556
VBVR-LTX2.3	⭐ Strong Baseline	0.516	0.58	0.608	0.631	0.529	0.454	0.68	0.453	0.608	0.577	0.409	0.414	0.388
Veo 3.1	🔵 Proprietary	0.48	0.531	0.611	0.503	0.52	0.444	0.51	0.429	0.577	0.277	0.42	0.441	0.404
Runway Gen-4 Turbo	🔵 Proprietary	0.403	0.392	0.396	0.409	0.429	0.341	0.363	0.414	0.515	0.429	0.418	0.327	0.373
Wan2.2-I2V-A14B	🟢 Open-source	0.371	0.412	0.43	0.382	0.415	0.404	0.419	0.329	0.405	0.308	0.343	0.236	0.307
Kling 2.6	🔵 Proprietary	0.369	0.408	0.465	0.322	0.375	0.347	0.519	0.33	0.528	0.135	0.272	0.356	0.359
LTX-2	🟢 Open-source	0.313	0.329	0.316	0.362	0.326	0.34	0.306	0.297	0.244	0.337	0.317	0.231	0.311
CogVideoX1.5-5B-I2V	🟢 Open-source	0.273	0.283	0.241	0.328	0.257	0.328	0.305	0.262	0.281	0.235	0.25	0.254	0.282
HunyuanVideo-I2V	🟢 Open-source	0.273	0.28	0.207	0.357	0.293	0.28	0.316	0.265	0.175	0.369	0.29	0.253	0.25


VBVR-Bench Leaderboard