TL;DR
Figure 1. Our self-supervised GAF learning augmented by two pretext tasks: (1) person flow estimation for local dynamics embedding into GAFs and (2) group-relevant object localization for global context embedding into GAFs. Compared with previous self-supervised methods that utilize only local appearance features, our pretext tasks enhance GAF learning.
This paper proposes Group Activity Feature (GAF) learning without group activity annotations. Unlike prior work, which uses low-level static local features to learn GAFs, we propose leveraging dynamics-aware and group-aware pretext tasks, along with local and global features provided by DINO, for group-dynamics-aware GAF learning. To adapt DINO and GAF learning to local dynamics and global group features, our pretext tasks use person flow estimation and group-relevant object location estimation, respectively. Person flow estimation is used to represent the local motion of each person, which is an important cue for understanding group activities. In contrast, group-relevant object location estimation encourages GAFs to learn scene context (e.g., spatial relations of people and objects) as global features. Comprehensive experiments on public datasets demonstrate the state-of-the-art performance of our method in group activity retrieval and recognition. Our ablation studies verify the effectiveness of each component in our method.
Figure 2. Overview of our network. (a) Image feature extractor: group-relevant objects are inpainted to enhance global feature learning. (b) GAF learning network: image features are fed into the transformer encoder, MLP, and temporal pooling to obtain a GAF G. (c) Pretext tasks: the flow of each person and the locations of group-relevant objects are estimated from G.
Group-relevant objects (e.g., a ball) are inpainted from each video frame using LaMa. DINOv3 then extracts image features I from the inpainted frames, preventing the network from relying on local object appearance and instead forcing global spatial reasoning. Only the last two ViT blocks of DINOv3 are fine-tuned.
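The partial fine-tuning described above (training only the last two ViT blocks while keeping the rest of the backbone frozen) can be sketched as follows. This is a minimal illustration with a toy stand-in backbone, not the authors' actual DINOv3 code; `TinyViT`, its depth, and `freeze_all_but_last` are hypothetical names invented here.

```python
import torch.nn as nn

# Toy stand-in for a ViT backbone: a stack of transformer-style blocks.
# (Hypothetical: the real model would load pretrained DINOv3 weights.)
class TinyViT(nn.Module):
    def __init__(self, depth=12, dim=32):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

def freeze_all_but_last(model, n_trainable=2):
    """Freeze every block except the last n_trainable ones."""
    for blk in model.blocks[:-n_trainable]:
        for p in blk.parameters():
            p.requires_grad = False

vit = TinyViT()
freeze_all_but_last(vit, n_trainable=2)
trainable = [i for i, blk in enumerate(vit.blocks)
             if any(p.requires_grad for p in blk.parameters())]
print(trainable)  # only the last two block indices remain trainable
```

Only the gradients of the last two blocks flow during training, so the pretrained features stay largely intact while the backbone adapts to the inpainted inputs.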
Image features I are fed into a transformer encoder E and MLP layers M to produce per-frame video features V. Temporal pooling over V yields the compact D-dimensional Group Activity Feature (GAF) G representing the entire video clip.
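The shape bookkeeping of this pipeline can be sketched in NumPy. For brevity the transformer encoder E is omitted and only a tiny token-wise MLP (standing in for M) plus the two pooling steps are shown; all shapes and weight names here are illustrative assumptions, not the paper's actual dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: T frames, N tokens per frame, feature dim D.
T, N, D = 8, 16, 32
I = rng.standard_normal((T, N, D))       # image features from the backbone

def mlp(x, W1, b1, W2, b2):
    """Tiny two-layer MLP applied token-wise (stand-in for M in Fig. 2)."""
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

W1, b1 = rng.standard_normal((D, D)), np.zeros(D)
W2, b2 = rng.standard_normal((D, D)), np.zeros(D)

V = mlp(I, W1, b1, W2, b2).mean(axis=1)  # per-frame video features V: (T, D)
G = V.mean(axis=0)                       # temporal pooling -> GAF G: (D,)
print(G.shape)
```

Averaging over the time axis collapses the per-frame features into a single D-dimensional vector, which is what makes G a compact clip-level representation.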
Two self-supervised pretext tasks train the network without group activity annotations: (1) Person flow estimation — per-person optical flow values are estimated from G and from frame features It (auxiliary branch), embedding local dynamics into GAFs. (2) Group-relevant object localization — object coordinates are estimated from G and It, embedding global context.
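A minimal sketch of the two pretext objectives, assuming simple linear regression heads on top of G and mean-squared-error losses; the head weights, person count, and pseudo ground-truth targets below are all hypothetical placeholders (the paper's actual heads and supervision signals may differ).

```python
import numpy as np

rng = np.random.default_rng(0)
D, P = 32, 6                              # GAF dim and number of people (assumed)

G = rng.standard_normal(D)                # clip-level GAF
flow_gt = rng.standard_normal((P, 2))     # pseudo-GT per-person flow (dx, dy)
obj_gt = rng.standard_normal(2)           # pseudo-GT object location (x, y)

W_flow = rng.standard_normal((D, P * 2)) * 0.1
W_obj = rng.standard_normal((D, 2)) * 0.1

flow_pred = (G @ W_flow).reshape(P, 2)    # (1) person flow estimation head
obj_pred = G @ W_obj                      # (2) object localization head

# Self-supervised regression losses: no group activity labels required.
loss_flow = np.mean((flow_pred - flow_gt) ** 2)
loss_obj = np.mean((obj_pred - obj_gt) ** 2)
loss = loss_flow + loss_obj
```

Minimizing the joint loss forces G to encode both per-person motion (local dynamics) and object layout (global context), which is the stated goal of the two pretext tasks.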
Figure 3. Visual comparison of group activity retrieval on the Volleyball dataset (VBD). For an R-set query (top), our method correctly retrieves an R-set video by capturing player motion direction, whereas GAFL retrieves an R-spike video. For an R-spike query (bottom), our method identifies the jumping spiker and blockers via local dynamics features, while GAFL relies on static appearance similarity.
Figure 4. Visual comparison of group activity retrieval on the NBA dataset. For a 3p-succ query, our method correctly retrieves a 3p-succ video by capturing global player interactions after the shot, whereas the method without LO retrieves a 3p-fail-def video.
| Method | VBD Hit@1 | VBD Hit@3 | NBA Hit@1 | NBA Hit@3 |
|---|---|---|---|---|
| B1-Compact [ECCV 2018] | 30.3 | 59.9 | 14.9 | 39.5 |
| B2-VGG19 [ECCV 2018] | 35.4 | 65.0 | 16.8 | 39.8 |
| HRN [ECCV 2018] | 31.2 | 57.6 | 15.5 | 37.1 |
| GAFL [CVPR 2024] | 61.1 | 82.4 | 24.7 | 50.4 |
| ★ Ours | 82.7 | 93.0 | 43.9 | 72.0 |
Table 1. Comparison with state-of-the-art self-supervised GAF learning methods on VBD and NBA (Hit@k, %). Our method (★) achieves the best score in every column.
| Method | Extractor | VBD | NBA |
|---|---|---|---|
| *Whole-image input* | | | |
| DFWSGAR [CVPR 2022] | ResNet-18 | 90.5 | 75.8 |
| SOGAR [IEEE Access 2025] | ViT-Base | 93.1 | 83.3 |
| Flaming-Net [ECCV 2024] | Inception-v3 | 93.3 | 79.1 |
| LiGAR [WACV 2025] | ResNet-18 | 74.8 | 62.7 |
| *Whole-image + person bounding boxes* | | | |
| SAM [ECCV 2020] | ResNet-18 | 86.3 | 54.3 |
| Dual-AI [CVPR 2022] | Inception-v3 | — | 58.1 |
| KRGFormer [TCSVT 2023] | Inception-v3 | 92.4 | 72.4 |
| MP-GCN [ECCV 2024] | YOLOV8x | 92.8 | 78.7 |
| FAGAR [Pattern Recognit. 2025] | YOLOV8x | 85.2 | — |
| ★ Ours | DINOv3 | 93.9 | 73.0 |
Table 2. Comparison with supervised group activity recognition methods on VBD and NBA (MCA, %). Our method achieves the best performance on VBD and competitive performance on NBA. For a fair comparison, all methods use only group activity class labels as manual annotations and only images at inference.
@inproceedings{tezuka2026groupdinomics,
title = {Group-DINOmics: Incorporating People Dynamics into DINO
for Self-supervised Group Activity Feature Learning},
author = {Tezuka, Ryuki and Nakatani, Chihiro and Ukita, Norimichi},
booktitle = {Findings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR)},
year = {2026},
}