VBVR-Bench Leaderboard

**VBVR-Bench** is a comprehensive benchmark for evaluating **video reasoning capabilities**.

To systematically assess model reasoning capabilities, VBVR-Bench employs a dual-split evaluation strategy across 100 diverse tasks:

  • In-Domain (ID): 50 tasks whose categories overlap with the training set but that use unseen parameter configurations and sample instances, testing in-domain generalization.
  • Out-of-Domain (OOD): 50 entirely novel tasks that measure out-of-domain generalization, i.e. whether models acquire transferable reasoning primitives rather than relying on task-specific memorization.

Each task consists of 5 test samples, giving 500 test samples in total (250 ID and 250 OOD) across diverse reasoning scenarios.
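
As a rough illustration of how the dual-split setup could be turned into leaderboard numbers, the sketch below averages per-sample scores within each task and then averages task means within each split. The aggregation scheme, the function name `aggregate_scores`, and the `(task, split, score)` record layout are assumptions made for this example; they are not taken from the VBVR-Bench evaluation code.

```python
from collections import defaultdict
from statistics import mean


def aggregate_scores(samples):
    """Aggregate per-sample scores into ID, OOD, and overall averages.

    `samples` is an iterable of (task_name, split, score) tuples, where
    split is "ID" or "OOD". This layout is hypothetical, not the
    benchmark's actual result format.
    """
    per_task = defaultdict(list)   # task name -> list of sample scores
    task_split = {}                # task name -> "ID" or "OOD"
    for task, split, score in samples:
        per_task[task].append(score)
        task_split[task] = split

    # Mean over the samples of each task (5 per task in VBVR-Bench).
    task_means = {task: mean(scores) for task, scores in per_task.items()}

    # Mean of task means within each split (assumed macro-average).
    by_split = defaultdict(list)
    for task, task_mean in task_means.items():
        by_split[task_split[task]].append(task_mean)

    result = {split: mean(values) for split, values in by_split.items()}
    result["Overall"] = mean(list(task_means.values()))
    return result
```

Averaging per task first weights every task equally. With 5 samples per task this matches a plain average over all samples, but the two would diverge if tasks had different sample counts.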
