
**VBVR-Bench** is a comprehensive benchmark for evaluating **video reasoning capabilities**.

To systematically assess model reasoning capabilities, VBVR-Bench employs a dual-split evaluation strategy across 100 diverse tasks:

- **In-Domain (ID):** 50 tasks that fall within the training categories but use unseen parameter configurations and sample instances, testing in-domain generalization.
- **Out-of-Domain (OOD):** 50 entirely novel tasks that measure out-of-domain generalization, testing whether models acquire transferable reasoning primitives rather than relying on task-specific memorization.

Each task has 5 test samples (500 samples across the 100 tasks), enabling statistically robust evaluation across diverse reasoning scenarios.
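To make the dual-split layout concrete, the following is a minimal sketch of how per-sample scores could be rolled up into one score per split. The aggregation rule (average the samples of each task, then average the tasks of each split) is an assumption for illustration; the page does not state how VBVR-Bench aggregates. The function name `split_scores` and the toy task names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Benchmark layout as described above: 2 splits x 50 tasks x 5 samples.
TASKS_PER_SPLIT = 50
SAMPLES_PER_TASK = 5
TOTAL_SAMPLES = 2 * TASKS_PER_SPLIT * SAMPLES_PER_TASK


def split_scores(samples):
    """Aggregate (split, task, score) tuples into one score per split.

    Assumed rule: mean over samples within a task, then mean over tasks
    within a split. This is illustrative, not the benchmark's documented method.
    """
    per_task = defaultdict(list)
    for split, task, score in samples:
        per_task[(split, task)].append(score)

    per_split = defaultdict(list)
    for (split, _task), scores in per_task.items():
        per_split[split].append(mean(scores))

    return {split: mean(task_means) for split, task_means in per_split.items()}


# Toy data with hypothetical task names, far smaller than the real benchmark.
toy = [
    ("ID", "task_a", 1.0),
    ("ID", "task_a", 0.5),
    ("ID", "task_b", 0.0),
    ("OOD", "task_c", 0.2),
    ("OOD", "task_c", 0.4),
]
toy_result = split_scores(toy)
```

Averaging per task first keeps every task equally weighted even if some samples are missing, which is one reason a design like this might be preferred over pooling all samples.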




Human	👤 Reference	0.974	0.96	0.919	0.956	1	0.95	1	0.988	1	1	0.99	1	0.97
VBVR-Wan2.2	⭐ Strong Baseline	0.685	0.76	0.724	0.75	0.782	0.745	0.833	0.61	0.768	0.572	0.547	0.618	0.614
Sora 2	🔵 Proprietary	0.546	0.569	0.602	0.477	0.581	0.572	0.597	0.522	0.546	0.472	0.525	0.462	0.546
Veo 3.1	🔵 Proprietary	0.48	0.531	0.611	0.503	0.52	0.444	0.51	0.429	0.577	0.277	0.42	0.441	0.404
Runway Gen-4 Turbo	🔵 Proprietary	0.403	0.392	0.396	0.409	0.429	0.341	0.363	0.414	0.515	0.429	0.418	0.327	0.373
Wan2.2-I2V-A14B	🟢 Open-source	0.371	0.412	0.43	0.382	0.415	0.404	0.419	0.329	0.405	0.308	0.343	0.236	0.307
Kling 2.6	🔵 Proprietary	0.369	0.408	0.465	0.322	0.375	0.347	0.519	0.33	0.528	0.135	0.272	0.356	0.359
LTX-2	🟢 Open-source	0.313	0.329	0.316	0.362	0.326	0.34	0.306	0.297	0.244	0.337	0.317	0.231	0.311
CogVideoX1.5-5B-I2V	🟢 Open-source	0.273	0.283	0.241	0.328	0.257	0.328	0.305	0.262	0.281	0.235	0.25	0.254	0.282
HunyuanVideo-I2V	🟢 Open-source	0.273	0.28	0.207	0.357	0.293	0.28	0.316	0.265	0.175	0.369	0.29	0.253	0.25
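The column headers did not survive on this page, but the rows above show a regularity that can be checked directly: in every row, the first score column matches the mean of the second and eighth score columns to within rounding. One reading, offered only as an assumption, is that these are the overall, ID, and OOD averages. The sketch below copies those three values per model from the table and measures the largest deviation; the names `ROWS` and `max_deviation` are illustrative.

```python
# (model, first score column, second score column, eighth score column),
# copied from the table above. What the columns mean is an assumption.
ROWS = [
    ("Human", 0.974, 0.96, 0.988),
    ("VBVR-Wan2.2", 0.685, 0.76, 0.61),
    ("Sora 2", 0.546, 0.569, 0.522),
    ("Veo 3.1", 0.48, 0.531, 0.429),
    ("Runway Gen-4 Turbo", 0.403, 0.392, 0.414),
    ("Wan2.2-I2V-A14B", 0.371, 0.412, 0.329),
    ("Kling 2.6", 0.369, 0.408, 0.33),
    ("LTX-2", 0.313, 0.329, 0.297),
    ("CogVideoX1.5-5B-I2V", 0.273, 0.283, 0.262),
    ("HunyuanVideo-I2V", 0.273, 0.28, 0.265),
]


def max_deviation(rows):
    """Largest gap between the first column and the mean of the other two."""
    return max(
        abs(first - (second + eighth) / 2)
        for _model, first, second, eighth in rows
    )


# Values are rounded to three decimals, so gaps up to about 0.0005 are expected.
deviation = max_deviation(ROWS)
print(f"largest deviation: {deviation:.4f}")
```

For these rows the deviation stays below 0.001, which is consistent with an equally weighted average of the two splits but does not confirm how the leaderboard actually computes it.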


VBVR-Bench Leaderboard