BASKET 🏀 : A Large-Scale Video Dataset for Fine-Grained Skill Estimation
Yulu Pan, Ce Zhang, Gedas Bertasius
UNC Chapel Hill
Accepted to CVPR 2025
We present BASKET, a large-scale basketball video dataset for fine-grained skill estimation. BASKET contains more than 4,400 hours of video capturing 32,232 basketball players from all over the world. Compared to prior skill estimation datasets, our dataset includes a massive number of skilled participants with unprecedented diversity in terms of gender, age, skill level, geographical location, etc. BASKET includes 20 fine-grained basketball skills, challenging modern video recognition models to capture the intricate nuances of player skill through in-depth video analysis. Given a long highlight video (8-10 minutes) of a particular player, the model needs to predict the skill level (e.g., excellent, good, average, fair, poor) for each of the 20 basketball skills. Our empirical analysis reveals that current state-of-the-art video models struggle with this task, significantly lagging behind the human baseline. We believe that BASKET could be a useful resource for developing new video models with advanced long-range, fine-grained recognition capabilities. In addition, we hope that our dataset will be useful for domain-specific applications such as fair basketball scouting, personalized player development, and many others.
Why Basketball?
A popular global sport with 400M+ fans!
Huge participant diversity with lots of video data
Involves many different fine-grained skills (e.g., 3-PT Shooting, Assisting), making the skill estimation task more challenging and interesting
Players have similar visual appearances, requiring models to recognize fine-grained cues rather than rely on scene or background biases
Snapshot of BASKET
Dataset Curation
We first curate full-game videos with detailed, time-stamped player-event transcripts (e.g., "At 3:03 in the first quarter, Steph Curry makes a three-point shot") for 21 basketball leagues over a span of 6 seasons.
For each player, we randomly select 50 action events within the same league and same season.
Lastly, we extract the action clips (~10s) and combine them into a single highlight video. Each highlight video is 8-10 minutes long.
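For intuition, here is a minimal sketch of the clip-extraction-and-concatenation step, assuming each selected event is given as a (game_video_path, start_time) pair; the file names, function names, and the use of the ffmpeg CLI are illustrative assumptions, not the exact tooling used to build BASKET.

```python
import subprocess
from pathlib import Path

CLIP_SECONDS = 10  # each action clip is ~10s long

def extract_clip(game_video: str, start_sec: float, out_path: str) -> None:
    """Cut a ~10s action clip from a full-game video with ffmpeg (stream copy, no re-encode)."""
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start_sec), "-i", game_video,
         "-t", str(CLIP_SECONDS), "-c", "copy", out_path],
        check=True,
    )

def build_highlight(events, out_dir: str, highlight_path: str) -> None:
    """Extract one clip per selected event and concatenate them into a single highlight video."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    clip_paths = []
    for i, (game_video, start_sec) in enumerate(events):  # e.g., 50 events per player
        clip_path = out_dir / f"clip_{i:03d}.mp4"
        extract_clip(game_video, start_sec, str(clip_path))
        clip_paths.append(clip_path)
    # The ffmpeg concat demuxer expects a text file listing the clips in order.
    list_file = out_dir / "clips.txt"
    list_file.write_text("".join(f"file '{p.resolve()}'\n" for p in clip_paths))
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", str(list_file),
         "-c", "copy", highlight_path],
        check=True,
    )
```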
Our dataset offers unprecedented player diversity in terms of nationality, age, gender, race, experience, and skill.
Massive Scale: BASKET features 4,477 hours of video showcasing 32,232 basketball players from across the globe!
Extensive Diversity: Spanning 21 basketball leagues, both professional and amateur, featuring over 7,000 female players and detailed skill level annotations across 20 abilities!
Versatile Applications: BASKET supports advanced video model development and enables domain-specific applications like fair scouting and personalized player development.
Existing skill estimation datasets are typically very small and lack significant participant diversity.
Can we curate a comprehensive video dataset to enhance the understanding and modeling of skill estimation?
Our proposed BASKET dataset significantly surpasses existing skill estimation video datasets in scale and diversity.
An illustration of the fine-grained skill estimation task. Given a long highlight video that captures many plays of a particular player, the model needs to predict the skill level for 20 fine-grained basketball skills. Each skill is rated on a 5-level scale, from “Poor” to “Excellent.”
Task Description:
Input: Highlight video of a player (8-10 mins)
Output: Skill ratings across 20 fine-grained skills
Evaluation Metric:
Top-1 accuracy averaged across 20 skills
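As a concrete reference, the metric can be computed as in the minimal sketch below, assuming predictions and ground-truth labels are stored as integer arrays of shape (num_players, 20); the array and function names are illustrative.

```python
import numpy as np

def basket_accuracy(preds: np.ndarray, labels: np.ndarray) -> float:
    """Top-1 accuracy averaged across the 20 fine-grained skills.

    preds, labels: integer arrays of shape (num_players, 20),
    where each entry is a skill level in {0, ..., 4}.
    """
    per_skill_acc = (preds == labels).mean(axis=0)  # accuracy for each of the 20 skills
    return float(per_skill_acc.mean())              # average over skills
```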
Technical Challenge
Modeling long video inputs with a focus on nuanced skill cues rather than coarse background visuals
Jointly identifying the recurring player (i.e., the player of interest) and estimating that player’s skills
BASKET covers five coarse basketball skill categories and twenty fine-grained skills, focusing on the evaluation of multi-faceted skill understanding of basketball players.
Comparison of various video recognition models on our fine-grained skill benchmark, BASKET. All experiments use uniform video frame sampling at a 224x224 spatial resolution, fine-tuning each model with its best configuration. None of the methods achieves over 30% accuracy, indicating significant room for future improvement.
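To make the baseline setup concrete, below is a minimal PyTorch sketch of uniform frame sampling and a per-skill classification head on top of a generic video backbone; the backbone placeholder, feature dimension, and tensor shapes are assumptions for illustration, not the exact configurations used in our experiments.

```python
import torch
import torch.nn as nn

NUM_SKILLS, NUM_LEVELS = 20, 5  # 20 fine-grained skills, 5 skill levels each

def uniform_frame_indices(num_video_frames: int, num_samples: int) -> torch.Tensor:
    """Indices of frames sampled uniformly across the whole highlight video."""
    return torch.linspace(0, num_video_frames - 1, num_samples).long()

class SkillEstimator(nn.Module):
    """A generic video backbone followed by one classification head per skill."""
    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone  # any video encoder mapping frames -> (B, feat_dim)
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, NUM_LEVELS) for _ in range(NUM_SKILLS)]
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, 224, 224) uniformly sampled frames at 224x224 resolution
        feats = self.backbone(frames)               # (B, feat_dim)
        logits = [head(feats) for head in self.heads]
        return torch.stack(logits, dim=1)           # (B, NUM_SKILLS, NUM_LEVELS)

# Training can sum a cross-entropy loss over the 20 skills, e.g.:
# loss = sum(F.cross_entropy(logits[:, s], labels[:, s]) for s in range(NUM_SKILLS))
```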
Visualization of human study results on BASKET, with subjects grouped by their expertise level.
We conducted a human study on BASKET and compared the results. We group the subjects by their expertise level (i.e., novice, average, expert), and for each group we visualize the mean accuracy and the min/max ranges. The blue dashed line indicates the performance of our best model, VideoMamba. To keep the time needed to complete the study reasonable, each subject was asked to watch videos of 5 uniformly selected players and classify 5 selected skills into 3 skill levels (i.e., “Poor,” “Average,” and “Excellent”). Our VideoMamba baseline, which was not trained on these players, is also tested in this exact setting. Our results highlight the gap between model and human performance, especially for the human subjects with high expertise.
In our human study, human subjects achieve more than 31% higher accuracy than the best-performing SOTA video recognition model!
@InProceedings{BASKET_CVPR25,
author = {Pan, Yulu and Zhang, Ce and Bertasius, Gedas},
title = {BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2025}
}