CVPR 2026 Workshop — GAZE 2026

End-to-End Shared Attention Estimation via Group Detection with Feedback Refinement

Chihiro Nakatani1, Norimichi Ukita1, Jean-Marc Odobez2,3

1Toyota Technological Institute, Japan  2Idiap Research Institute, Switzerland  3École polytechnique fédérale de Lausanne (EPFL), Switzerland


TL;DR

Teaser: comparison of shared attention estimation methods

Figure 1. Comparison of previous shared attention estimation methods with ours. (a) Shared attention estimation using simple post-processing over individual attention maps. (b) Direct shared attention estimation without group detection. (c) Our method: end-to-end shared attention estimation via group detection, where shared attention is estimated by integrating individual attention based on detected groups.

Abstract

This paper proposes an end-to-end shared attention estimation method via group detection. Most previous methods estimate shared attention (SA) without detecting the actual group of people focusing on it, or assume that a given image contains a single SA point. These limitations reduce the practical applicability of SA detection and degrade performance. To address them, we propose to achieve group detection and shared attention estimation simultaneously using a two-step process: (i) generating SA heatmaps from individual gaze attention heatmaps and group membership scalars estimated in a group inference; (ii) refining the initial group memberships to account for the initial SA heatmaps, followed by the final prediction of the SA heatmap. Experiments demonstrate that our method outperforms other methods in both group detection and shared attention estimation. Additional analyses validate the effectiveness of the proposed components.

Method

Method overview: network architecture

Figure 2. Overview of our network. Individual attention heatmaps A are first estimated for each person. They are exploited to derive group memberships per group token M and integrated to infer the shared attention heatmap S. Both group memberships and shared attention heatmaps are further refined in a second step.

Step 1

Individual Attention Estimation

For each person, the head bounding box coordinates and the cropped head image are encoded into a person gaze token, which is processed by transformers together with the image tokens to produce per-person tokens P. These tokens are decoded into individual attention heatmaps An.
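As a rough illustration of the token construction described above, the sketch below concatenates normalized head-box coordinates with an embedding of the head crop to form one person gaze token. The shapes, normalization, and the idea that the crop embedding is a flat vector are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch: build a person gaze token from the head bounding
# box and a (precomputed) head-crop embedding. The real model encodes
# these with learned layers; here we only show the information that is
# combined into a single token vector.

def person_gaze_token(bbox, crop_embedding, image_w, image_h):
    """bbox = (x1, y1, x2, y2) in pixels; returns one flat token vector."""
    x1, y1, x2, y2 = bbox
    # Normalize box coordinates to [0, 1] so they are resolution-independent.
    coords = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    return coords + list(crop_embedding)

# Toy 4-dimensional crop embedding for a 640x480 image:
tok = person_gaze_token((64, 48, 128, 96), [0.2, 0.1, 0.0, 0.7], 640, 480)
print(tok)  # [0.1, 0.1, 0.2, 0.2, 0.2, 0.1, 0.0, 0.7]
```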

Step 2

Initial Shared Attention via Group Detection

Learnable group tokens G interact with person tokens through cross-attention to infer group membership coefficients Me,n. The shared attention heatmap for each group is computed as a membership-weighted sum of individual attention heatmaps.
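The membership-weighted sum above can be sketched as follows. This is a pure-Python stand-in for the tensor operation S[e] = Σn M[e][n] · A[n]; the heatmap sizes and the toy values are illustrative assumptions.

```python
# Sketch of Step 2: each group's shared attention heatmap is the
# membership-weighted sum of the individual attention heatmaps.

def shared_attention(memberships, heatmaps):
    """memberships: E x N scores M[e][n]; heatmaps: N heatmaps of size H x W."""
    H, W = len(heatmaps[0]), len(heatmaps[0][0])
    shared = []
    for m_e in memberships:  # one row of membership scores per group token
        s = [[0.0] * W for _ in range(H)]
        for m_en, a_n in zip(m_e, heatmaps):
            for y in range(H):
                for x in range(W):
                    s[y][x] += m_en * a_n[y][x]
        shared.append(s)
    return shared

# Toy example: two people, one group token weighting person 0 at 1.0
# and person 1 at 0.5, on 2x2 heatmaps.
A = [[[1.0, 0.0], [0.0, 0.0]],
     [[0.0, 0.0], [0.0, 2.0]]]
M = [[1.0, 0.5]]
print(shared_attention(M, A)[0])  # [[1.0, 0.0], [0.0, 1.0]]
```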

Step 3

Feedback Refinement

Spatial argmax of the initial SA heatmaps provides peak coordinates that are fed back to refine group memberships M′, enabling the final shared attention heatmap S′ to benefit from both group context and initial SA estimates.
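The peak extraction used in the feedback step can be sketched as a plain spatial argmax over a 2D heatmap. How the resulting coordinates are injected back into the membership refinement is model-specific and not shown here.

```python
# Sketch of Step 3's peak extraction: the (x, y) location of the maximum
# cell in each initial shared-attention heatmap is fed back to refine the
# group memberships M'.

def spatial_argmax(heatmap):
    """Return (x, y) of the maximum cell in a 2D heatmap (list of rows)."""
    best_xy, best_val = (0, 0), float("-inf")
    for y, row in enumerate(heatmap):
        for x, v in enumerate(row):
            if v > best_val:
                best_val, best_xy = v, (x, y)
    return best_xy

S0 = [[0.1, 0.2, 0.1],
      [0.3, 0.9, 0.2],
      [0.1, 0.2, 0.1]]
print(spatial_argmax(S0))  # (1, 1)
```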

Results

Qualitative results on VideoCoAtt dataset
Ground-truth
MTGS-PP
Ours

Figure 3. Visual comparison on the VideoCoAtt dataset. People joining the same group are enclosed within a rectangle of the same color. The estimated shared attention point for each group is visualized as a colored dot. Our method correctly detects groups and localizes shared attention targets where competing methods fail.

Quantitative Results — VideoCoAtt Dataset

                          θIoU = 0.5                       θIoU = 1.0
Method                    θDist=0.05  θDist=0.1  θDist=∞   θDist=0.05  θDist=0.1  θDist=∞
MTGS-PP [NeurIPS 2024]    16.4        20.7       37.5      7.1         8.6        11.3
MTGS-Soc. [NeurIPS 2024]  5.7         7.8        28.6      2.9         3.6        9.8
Gaze-LLE-PP [CVPR 2025]   15.6        19.5       26.3      5.7         6.9        8.2
★ Ours                    32.4        41.0       61.7      12.5        15.9       17.7

Table 1. GroupAP (%) comparison on the VideoCoAtt dataset. Our method (★) achieves the best score in every column, outperforming all baselines across every setting.
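One plausible reading of the matching criterion behind GroupAP, given the two thresholds in the tables, is that a predicted group counts as correct when (i) the IoU between its member set and a ground-truth member set reaches θIoU, and (ii) the predicted shared-attention point lies within θDist (in normalized image coordinates) of the ground-truth point. The exact definitions are assumptions here, not the paper's evaluation code.

```python
# Hypothetical sketch of a per-group GroupAP match test. Under θIoU = 1.0
# the member sets must be identical; θDist = ∞ disables the distance check.

import math

def group_match(pred_members, gt_members, pred_pt, gt_pt,
                theta_iou=0.5, theta_dist=0.1):
    """pred_members / gt_members: sets of person indices;
    pred_pt / gt_pt: (x, y) in normalized [0, 1] image coordinates."""
    inter = len(pred_members & gt_members)
    union = len(pred_members | gt_members)
    iou = inter / union if union else 0.0
    dist = math.dist(pred_pt, gt_pt)
    return iou >= theta_iou and dist <= theta_dist

# A prediction sharing 2 of 3 people (IoU = 2/3) with a nearby point:
print(group_match({0, 1}, {0, 1, 2}, (0.50, 0.52), (0.5, 0.5)))  # True
# The same prediction fails the strict θIoU = 1.0 criterion:
print(group_match({0, 1}, {0, 1, 2}, (0.50, 0.52), (0.5, 0.5),
                  theta_iou=1.0))  # False
```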

Quantitative Results — VideoAttentionTarget Dataset

                θIoU = 0.5                       θIoU = 1.0
Method          θDist=0.05  θDist=0.1  θDist=∞   θDist=0.05  θDist=0.1  θDist=∞
MTGS-PP         30.3        35.6       66.0      5.1         6.3        9.0
MTGS-Soc.       7.6         9.9        37.2      1.5         1.9        4.7
Gaze-LLE-PP     7.5         8.5        12.4      2.9         3.0        3.3
★ Ours          27.2        33.5       52.0      7.5         10.1       12.0

Table 2. GroupAP (%) on the VideoAttentionTarget dataset. Our method achieves the best performance under strict group detection criteria (θIoU = 1.0).

Quantitative Results — ChildPlay Dataset

                θIoU = 0.5                       θIoU = 1.0
Method          θDist=0.05  θDist=0.1  θDist=∞   θDist=0.05  θDist=0.1  θDist=∞
MTGS-PP         7.8         13.9       25.4      4.6         5.2        5.6
MTGS-Soc.       0.8         1.8        9.1       0.7         0.7        1.3
Gaze-LLE-PP     5.8         8.1        14.1      2.4         2.8        4.4
★ Ours          9.0         15.6       36.3      2.1         2.1        2.4

Table 3. GroupAP (%) on the ChildPlay dataset. Our method achieves best performance under lenient criteria (θIoU = 0.5), especially at θDist = ∞ (36.3% vs. 25.4% for the best baseline).

BibTeX

@inproceedings{nakatani2026sagd,
  title     = {End-to-End Shared Attention Estimation via Group Detection
               with Feedback Refinement},
  author    = {Nakatani, Chihiro and Ukita, Norimichi and Odobez, Jean-Marc},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision
               and Pattern Recognition Workshops (CVPRW)},
  year      = {2026},
}

Acknowledgements

This work was conducted in collaboration with Prof. Jean-Marc Odobez at the Idiap Research Institute and École polytechnique fédérale de Lausanne (EPFL), Switzerland.