VBVR-Bench Leaderboard

VBVR-Bench is a comprehensive benchmark for evaluating video reasoning capabilities.

To systematically assess model reasoning capabilities, VBVR-Bench employs a dual-split evaluation strategy across 100 diverse tasks:

  • In-Domain (ID): 50 tasks drawn from the training categories but evaluated with unseen parameter configurations and sample instances, testing in-domain generalization.
  • Out-of-Domain (OOD): 50 entirely novel tasks that test whether models acquire transferable reasoning primitives rather than relying on task-specific memorization.

Each task has 5 test samples, giving 250 samples per split and 500 in total across diverse reasoning scenarios.
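
The split sizes above can be checked with a short Python sketch. The task and sample counts come from the description; the aggregation in `split_scores` (mean over a task's samples, then mean over the tasks in a split) is only an assumption for illustration, not the official VBVR-Bench scoring protocol, and all function and variable names are hypothetical.

```python
from statistics import mean

# Figures stated in the benchmark description.
TASKS_PER_SPLIT = {"ID": 50, "OOD": 50}
SAMPLES_PER_TASK = 5


def sample_counts():
    """Return the number of test samples per split and overall."""
    counts = {split: n * SAMPLES_PER_TASK for split, n in TASKS_PER_SPLIT.items()}
    counts["Overall"] = sum(counts.values())
    return counts


def split_scores(per_sample_scores):
    """Average scores per task, then average the task means per split.

    per_sample_scores maps (split, task_name) -> list of per-sample scores.
    ASSUMPTION: this macro-average is illustrative only; the leaderboard
    may aggregate differently.
    """
    by_split = {}
    for (split, _task), scores in per_sample_scores.items():
        by_split.setdefault(split, []).append(mean(scores))
    return {split: mean(task_means) for split, task_means in by_split.items()}


if __name__ == "__main__":
    print(sample_counts())
    # Toy scores for three hypothetical tasks, 5 samples each.
    toy = {
        ("ID", "task_a"): [1, 0, 1, 1, 1],
        ("ID", "task_b"): [0, 0, 0, 0, 0],
        ("OOD", "task_c"): [1, 1, 1, 1, 1],
    }
    print(split_scores(toy))
```

Averaging per task before averaging per split weights every task equally, which matches the equal 5-sample design; comparing the resulting ID and OOD numbers would then show how much performance drops on novel tasks.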
