TL;DR
Figure 1. Our self-supervised GAF learning augmented by two pretext tasks: (1) person flow estimation for local dynamics embedding into GAFs and (2) group-relevant object localization for global context embedding into GAFs. Compared with previous self-supervised methods that utilize only local appearance features, our pretext tasks enhance GAF learning.
This paper proposes Group Activity Feature (GAF) learning without group activity annotations. Unlike prior work, which uses low-level static local features to learn GAFs, we propose leveraging dynamics-aware and group-aware pretext tasks, along with local and global features provided by DINO, for group-dynamics-aware GAF learning. To adapt DINO and GAF learning to local dynamics and global group features, our pretext tasks use person flow estimation and group-relevant object location estimation, respectively. Person flow estimation is used to represent the local motion of each person, which is an important cue for understanding group activities. In contrast, group-relevant object location estimation encourages GAFs to learn scene context (e.g., spatial relations of people and objects) as global features. Comprehensive experiments on public datasets demonstrate the state-of-the-art performance of our method in group activity retrieval and recognition. Our ablation studies verify the effectiveness of each component in our method.
Figure 2. Overview of our network. (a) Image feature extractor: group-relevant objects are inpainted to enhance global feature learning. (b) GAF learning network: image features are fed into the transformer encoder, MLP, and temporal pooling to obtain a GAF G. (c) Pretext tasks: the flow of each person and the locations of group-relevant objects are estimated from G.
Group-relevant objects (e.g., a ball) are inpainted from each video frame using LaMa. DINOv3 then extracts image features I from the inpainted frames, preventing the network from relying on local object appearance and instead forcing global spatial reasoning. Only the last two ViT blocks of DINOv3 are fine-tuned.
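The partial fine-tuning described above (training only the last two ViT blocks while keeping the rest of the backbone frozen) can be sketched as follows. This is a minimal illustration with a toy stand-in backbone, not the authors' actual DINOv3 code; `TinyViT`, its depth, and `freeze_all_but_last` are hypothetical names invented here.

```python
import torch.nn as nn

# Toy stand-in for a ViT backbone: a stack of transformer-style blocks.
# (Hypothetical: the real model would load pretrained DINOv3 weights.)
class TinyViT(nn.Module):
    def __init__(self, depth=12, dim=32):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

def freeze_all_but_last(model, n_trainable=2):
    """Freeze every block except the last n_trainable ones."""
    for blk in model.blocks[:-n_trainable]:
        for p in blk.parameters():
            p.requires_grad = False

vit = TinyViT()
freeze_all_but_last(vit, n_trainable=2)
trainable = [i for i, blk in enumerate(vit.blocks)
             if any(p.requires_grad for p in blk.parameters())]
print(trainable)  # only the last two block indices remain trainable
```

Only the gradients of the last two blocks flow during training, so the pretrained features stay largely intact while the backbone adapts to the inpainted inputs.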
Image features I are fed into a transformer encoder E and MLP layers M to produce per-frame video features V. Temporal pooling over V yields the compact D-dimensional Group Activity Feature (GAF) G representing the entire video clip.
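The shape bookkeeping of this pipeline can be sketched in NumPy. For brevity the transformer encoder E is omitted and only a tiny token-wise MLP (standing in for M) plus the two pooling steps are shown; all shapes and weight names here are illustrative assumptions, not the paper's actual dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: T frames, N tokens per frame, feature dim D.
T, N, D = 8, 16, 32
I = rng.standard_normal((T, N, D))       # image features from the backbone

def mlp(x, W1, b1, W2, b2):
    """Tiny two-layer MLP applied token-wise (stand-in for M in Fig. 2)."""
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

W1, b1 = rng.standard_normal((D, D)), np.zeros(D)
W2, b2 = rng.standard_normal((D, D)), np.zeros(D)

V = mlp(I, W1, b1, W2, b2).mean(axis=1)  # per-frame video features V: (T, D)
G = V.mean(axis=0)                       # temporal pooling -> GAF G: (D,)
print(G.shape)
```

Averaging over the time axis collapses the per-frame features into a single D-dimensional vector, which is what makes G a compact clip-level representation.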
Two self-supervised pretext tasks train the network without group activity annotations: (1) Person flow estimation — per-person optical flow values are estimated from G and from frame features It (auxiliary branch), embedding local dynamics into GAFs. (2) Group-relevant object localization — object coordinates are estimated from G and It, embedding global context.
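A minimal sketch of the two pretext objectives, assuming simple linear regression heads on top of G and mean-squared-error losses; the head weights, person count, and pseudo ground-truth targets below are all hypothetical placeholders (the paper's actual heads and supervision signals may differ).

```python
import numpy as np

rng = np.random.default_rng(0)
D, P = 32, 6                              # GAF dim and number of people (assumed)

G = rng.standard_normal(D)                # clip-level GAF
flow_gt = rng.standard_normal((P, 2))     # pseudo-GT per-person flow (dx, dy)
obj_gt = rng.standard_normal(2)           # pseudo-GT object location (x, y)

W_flow = rng.standard_normal((D, P * 2)) * 0.1
W_obj = rng.standard_normal((D, 2)) * 0.1

flow_pred = (G @ W_flow).reshape(P, 2)    # (1) person flow estimation head
obj_pred = G @ W_obj                      # (2) object localization head

# Self-supervised regression losses: no group activity labels required.
loss_flow = np.mean((flow_pred - flow_gt) ** 2)
loss_obj = np.mean((obj_pred - obj_gt) ** 2)
loss = loss_flow + loss_obj
```

Minimizing the joint loss forces G to encode both per-person motion (local dynamics) and object layout (global context), which is the stated goal of the two pretext tasks.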
Figure 3. Visual comparison of group activity retrieval on the Volleyball dataset (VBD). For an R-set query (top), our method correctly retrieves an R-set video by capturing player motion direction, whereas GAFL retrieves an R-spike video. For an R-spike query (bottom), our method identifies the jumping spiker and blockers via local dynamics features, while GAFL relies on static appearance similarity.
Figure 4. Visual comparison of group activity retrieval on the NBA dataset. For a 3p-succ query, our method correctly retrieves a 3p-succ video by capturing global player interactions after the shot, whereas the method without LO retrieves a 3p-fail-def video.
| Method | VBD Hit@1 | VBD Hit@3 | NBA Hit@1 | NBA Hit@3 |
|---|---|---|---|---|
| B1-Compact [ECCV 2018] | 30.3 | 59.9 | 14.9 | 39.5 |
| B2-VGG19 [ECCV 2018] | 35.4 | 65.0 | 16.8 | 39.8 |
| HRN [ECCV 2018] | 31.2 | 57.6 | 15.5 | 37.1 |
| GAFL [CVPR 2024] | 61.1 | 82.4 | 24.7 | 50.4 |
| ★ Ours | 82.7 | 93.0 | 43.9 | 72.0 |
Table 1. Comparison with state-of-the-art self-supervised GAF learning methods on VBD and NBA (Hit@k, %). Our method (★) achieves the best score in every column.
| Method | Extractor | VBD | NBA |
|---|---|---|---|
| *Whole-image input* | | | |
| DFWSGAR [CVPR 2022] | ResNet-18 | 90.5 | 75.8 |
| SOGAR [IEEE Access 2025] | ViT-Base | 93.1 | 83.3 |
| Flaming-Net [ECCV 2024] | Inception-v3 | 93.3 | 79.1 |
| LiGAR [WACV 2025] | ResNet-18 | 74.8 | 62.7 |
| *Whole-image + person bounding boxes* | | | |
| SAM [ECCV 2020] | ResNet-18 | 86.3 | 54.3 |
| Dual-AI [CVPR 2022] | Inception-v3 | — | 58.1 |
| KRGFormer [TCSVT 2023] | Inception-v3 | 92.4 | 72.4 |
| MP-GCN [ECCV 2024] | YOLOV8x | 92.8 | 78.7 |
| FAGAR [Pattern Recognit. 2025] | YOLOV8x | 85.2 | — |
| ★ Ours | DINOv3 | 93.9 | 73.0 |
Table 2. Comparison with supervised group activity recognition methods on VBD and NBA (MCA, %). Our method achieves the best performance on VBD and competitive performance on NBA. For a fair comparison, all methods use only group activity class labels as manual annotations and only images at inference.
@inproceedings{tezuka2026groupdinomics,
title = {Group-DINOmics: Incorporating People Dynamics into DINO
for Self-supervised Group Activity Feature Learning},
author = {Tezuka, Ryuki and Nakatani, Chihiro and Ukita, Norimichi},
booktitle = {Findings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR)},
year = {2026},
}