VBVR-Bench Leaderboard
VBVR-Bench is a comprehensive benchmark for evaluating video reasoning capabilities.
To systematically assess model reasoning capabilities, VBVR-Bench employs a dual-split evaluation strategy across 100 diverse tasks:
- In-Domain (ID): 50 tasks that overlap with training categories but use unseen parameter configurations and sample instances, testing in-domain generalization.
- Out-of-Domain (OOD): 50 entirely novel tasks designed to measure out-of-domain generalization, testing whether models acquire transferable reasoning primitives rather than relying on task-specific memorization.
Each task consists of 5 test samples, for 500 test samples in total (250 per split) across diverse reasoning scenarios.
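As a rough illustration of how the dual-split layout (50 ID tasks and 50 OOD tasks, 5 samples each) could be turned into leaderboard-style numbers, the sketch below averages per-sample scores into per-task means and then into split-level and overall scores. The aggregation rule (unweighted mean of per-task means), the function names, and the assumption that each sample score lies in [0, 1] are illustrative assumptions, not the benchmark's documented scoring procedure.

```python
from statistics import mean

# Counts taken from the description above.
NUM_ID_TASKS = 50
NUM_OOD_TASKS = 50
SAMPLES_PER_TASK = 5
TOTAL_SAMPLES = (NUM_ID_TASKS + NUM_OOD_TASKS) * SAMPLES_PER_TASK  # 500


def task_means(task_scores):
    """Return the mean score of each task.

    task_scores maps a (hypothetical) task name to its list of
    per-sample scores, assumed here to lie in [0, 1].
    """
    return [mean(samples) for samples in task_scores.values()]


def split_score(task_scores):
    """Assumed aggregation: unweighted mean of per-task means within one split."""
    return mean(task_means(task_scores))


def overall_score(id_scores, ood_scores):
    """Assumed aggregation: unweighted mean over all tasks from both splits."""
    return mean(task_means(id_scores) + task_means(ood_scores))


# Toy data only: every ID sample scores 1.0 and every OOD sample scores 0.5.
id_scores = {f"id_task_{i}": [1.0] * SAMPLES_PER_TASK for i in range(NUM_ID_TASKS)}
ood_scores = {f"ood_task_{i}": [0.5] * SAMPLES_PER_TASK for i in range(NUM_OOD_TASKS)}

print(split_score(id_scores), split_score(ood_scores), overall_score(id_scores, ood_scores))
```

Because both splits contain the same number of tasks, the overall score under this assumed rule equals the simple average of the ID and OOD split scores; a gap between the two split scores is the quantity of interest when judging transfer to novel tasks.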
⭐ Strong Baseline | 0.974 | 0.569 | 0.919 | 0.956 | 0.782 | 0.745 | 0.833 | 0.988 | 0.768 | 0.572 | 0.547 | 0.618 | 0.614 |