TL;DR
Figure 1. Difference between previous and our shared attention estimation methods. (a) Shared attention estimation using simple post-processing over individual attention maps. (b) Direct shared attention estimation without group detection. (c) Our method: end-to-end shared attention estimation via group detection, where shared attention is estimated by integrating individual attention based on detected groups.
This paper proposes an end-to-end shared attention estimation method via group detection. Most previous methods estimate shared attention (SA) without detecting the actual group of people focusing on it, or assume that there is a single SA point in a given image. These limitations restrict the practical applicability of SA detection and hurt performance. To address them, we propose to achieve group detection and shared attention estimation simultaneously using a two-step process: (i) the generation of SA heatmaps from individual gaze attention heatmaps and group membership scalars estimated in a group inference step; (ii) a refinement of the initial group memberships that accounts for the initial SA heatmaps, followed by the final prediction of the SA heatmap. Experiments demonstrate that our method outperforms other methods in both group detection and shared attention estimation. Additional analyses validate the effectiveness of the proposed components.
Figure 2. Overview of our network. Individual attention heatmaps A are first estimated for each person. They are exploited to derive group memberships per group token M and integrated to infer the shared attention heatmap S. Both group memberships and shared attention heatmaps are further refined in a second step.
For each person, head bounding box coordinates and the cropped head image are encoded as person gaze tokens, processed by transformers along with image tokens to produce per-person tokens P. These are decoded into individual attention heatmaps An.
Learnable group tokens G interact with person tokens through cross-attention to infer group membership coefficients Me,n. The shared attention heatmap for each group is computed as a membership-weighted sum of individual attention heatmaps.
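The membership-weighted aggregation above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the shapes (N people, E group tokens, H×W heatmaps) and names are assumptions:

```python
import numpy as np

def shared_attention(A, M):
    """Aggregate individual attention heatmaps into per-group
    shared attention heatmaps.

    A: (N, H, W) individual attention heatmaps, one per person.
    M: (E, N) membership coefficients M[e, n] of person n in group e.
    Returns S: (E, H, W) with S[e] = sum_n M[e, n] * A[n].
    """
    return np.einsum('en,nhw->ehw', M, A)

# Toy example: 3 people, 2 group tokens, 4x4 heatmaps.
A = np.random.rand(3, 4, 4)
M = np.array([[0.9, 0.8, 0.1],   # group 0: mostly persons 0 and 1
              [0.1, 0.2, 0.9]])  # group 1: mostly person 2
S = shared_attention(A, M)
assert S.shape == (2, 4, 4)
```

People with high membership in a group thus dominate that group's SA heatmap, while weakly associated people contribute little.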
Spatial argmax of the initial SA heatmaps provides peak coordinates that are fed back to refine group memberships M′, enabling the final shared attention heatmap S′ to benefit from both group context and initial SA estimates.
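The feedback step can likewise be sketched; `sa_peaks` below is a hypothetical helper (not the paper's code) that extracts the spatial argmax of each group's initial SA heatmap:

```python
import numpy as np

def sa_peaks(S):
    """Spatial argmax of each group's shared attention heatmap.

    S: (E, H, W) shared attention heatmaps.
    Returns an (E, 2) array of (row, col) peak coordinates, which
    the refinement step would feed back to update group memberships.
    """
    E, H, W = S.shape
    flat = S.reshape(E, -1).argmax(axis=1)  # flat index of each peak
    return np.stack([flat // W, flat % W], axis=1)

S = np.zeros((2, 4, 5))
S[0, 1, 3] = 1.0  # group 0 peak at (1, 3)
S[1, 2, 0] = 1.0  # group 1 peak at (2, 0)
peaks = sa_peaks(S)  # rows: [1, 3] and [2, 0]
```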
Figure 3. Visual comparison on the VideoCoAtt dataset. People joining the same group are enclosed within a rectangle of the same color. The estimated shared attention point for each group is visualized as a colored dot. Our method correctly detects groups and localizes shared attention targets where competing methods fail.
| Method | θIoU=0.5, θDist=0.05 | θIoU=0.5, θDist=0.1 | θIoU=0.5, θDist=∞ | θIoU=1.0, θDist=0.05 | θIoU=1.0, θDist=0.1 | θIoU=1.0, θDist=∞ |
|---|---|---|---|---|---|---|
| MTGS-PP [NeurIPS 2024] | 16.4 | 20.7 | 37.5 | 7.1 | 8.6 | 11.3 |
| MTGS-Soc. [NeurIPS 2024] | 5.7 | 7.8 | 28.6 | 2.9 | 3.6 | 9.8 |
| Gaze-LLE-PP [CVPR 2025] | 15.6 | 19.5 | 26.3 | 5.7 | 6.9 | 8.2 |
| ★ Ours | **32.4** | **41.0** | **61.7** | **12.5** | **15.9** | **17.7** |

Table 1. GroupAP (%) comparison on the VideoCoAtt dataset. Our method outperforms all baselines across every setting. The best score per column is in bold.
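A sketch of what a single GroupAP match test might look like, assuming member-set IoU for θIoU and Euclidean distance between SA points (in normalized image coordinates) for θDist. This is illustrative only, not the benchmark's exact code; the function name and criterion are assumptions:

```python
def is_match(pred_members, gt_members, pred_pt, gt_pt,
             theta_iou=0.5, theta_dist=0.1):
    """Assumed match criterion for one predicted/ground-truth group pair:
    - IoU between predicted and ground-truth member sets >= theta_iou
      (theta_iou = 1.0 requires an exact member match);
    - distance between predicted and ground-truth SA points <= theta_dist
      (theta_dist = float('inf') disables the localization check).
    """
    inter = len(pred_members & gt_members)
    union = len(pred_members | gt_members)
    iou = inter / union if union else 0.0
    dist = ((pred_pt[0] - gt_pt[0]) ** 2
            + (pred_pt[1] - gt_pt[1]) ** 2) ** 0.5
    return iou >= theta_iou and dist <= theta_dist

# Predicting {0, 1} for a true group {0, 1, 2}, SA point slightly off:
ok = is_match({0, 1}, {0, 1, 2}, (0.50, 0.52), (0.50, 0.50))  # True: IoU = 2/3, dist = 0.02
```

Under this reading, the θDist = ∞ columns score group detection alone, while θIoU = 1.0 demands perfectly recovered group memberships.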
| Method | θIoU=0.5, θDist=0.05 | θIoU=0.5, θDist=0.1 | θIoU=0.5, θDist=∞ | θIoU=1.0, θDist=0.05 | θIoU=1.0, θDist=0.1 | θIoU=1.0, θDist=∞ |
|---|---|---|---|---|---|---|
| MTGS-PP | **30.3** | **35.6** | **66.0** | 5.1 | 6.3 | 9.0 |
| MTGS-Soc. | 7.6 | 9.9 | 37.2 | 1.5 | 1.9 | 4.7 |
| Gaze-LLE-PP | 7.5 | 8.5 | 12.4 | 2.9 | 3.0 | 3.3 |
| ★ Ours | 27.2 | 33.5 | 52.0 | **7.5** | **10.1** | **12.0** |
Table 2. GroupAP (%) on the VideoAttentionTarget dataset. Our method achieves the best performance under strict group detection criteria (θIoU = 1.0).
| Method | θIoU=0.5, θDist=0.05 | θIoU=0.5, θDist=0.1 | θIoU=0.5, θDist=∞ | θIoU=1.0, θDist=0.05 | θIoU=1.0, θDist=0.1 | θIoU=1.0, θDist=∞ |
|---|---|---|---|---|---|---|
| MTGS-PP | 7.8 | 13.9 | 25.4 | **4.6** | **5.2** | **5.6** |
| MTGS-Soc. | 0.8 | 1.8 | 9.1 | 0.7 | 0.7 | 1.3 |
| Gaze-LLE-PP | 5.8 | 8.1 | 14.1 | 2.4 | 2.8 | 4.4 |
| ★ Ours | **9.0** | **15.6** | **36.3** | 2.1 | 2.1 | 2.4 |
Table 3. GroupAP (%) on the ChildPlay dataset. Our method achieves best performance under lenient criteria (θIoU = 0.5), especially at θDist = ∞ (36.3% vs. 25.4% for the best baseline).
@inproceedings{nakatani2026sagd,
title = {End-to-End Shared Attention Estimation via Group Detection
with Feedback Refinement},
author = {Nakatani, Chihiro and Ukita, Norimichi and Odobez, Jean-Marc},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition Workshops (CVPRW)},
year = {2026},
}
This work was conducted in collaboration with Prof. Jean-Marc Odobez at the Idiap Research Institute and École polytechnique fédérale de Lausanne (EPFL), Switzerland.