To our knowledge, this is the first dataset to achieve microsecond-level synchronization of bilateral wristband EMG, egocentric RGB, external RGB-D, and optical motion-capture data, addressing the lack of cross-modal alignment in existing datasets.
A Graph Transformer based markers2mano reconstruction pipeline reduces the invalid-frame rate from 12.7% to 3.6%, with an average marker alignment error of only 4.3 mm; complete wrist joint-angle annotations are included.
We build a unified benchmark covering three tasks (EMG-to-pose, Vision-to-pose, and EMG+Vision fusion) and three generalization settings: cross-gesture, cross-user, and a combined split.
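The three generalization settings above can be sketched as follows. This is an illustrative reconstruction, not the dataset's official split code: the held-out counts and the rule for the combined split are assumptions.

```python
import numpy as np

# Hypothetical sketch of the three generalization splits.
# Subject/gesture counts follow the text; held-out sizes are illustrative.
rng = np.random.default_rng(0)
subjects = np.arange(41)   # 41 subjects
gestures = np.arange(60)   # 60 gesture categories

held_out_gestures = set(rng.choice(gestures, size=12, replace=False))
held_out_subjects = set(rng.choice(subjects, size=8, replace=False))

def split(subject_id, gesture_id, mode):
    """Return 'train' or 'test' for one trial under a given split mode."""
    if mode == "cross-gesture":
        return "test" if gesture_id in held_out_gestures else "train"
    if mode == "cross-user":
        return "test" if subject_id in held_out_subjects else "train"
    if mode == "combined":
        # assumed rule: test only on unseen (user, gesture) pairs
        unseen = (subject_id in held_out_subjects
                  and gesture_id in held_out_gestures)
        return "test" if unseen else "train"
    raise ValueError(mode)
```

The point of such splits is that cross-gesture measures transfer to unseen movements, cross-user measures transfer to unseen anatomy and electrode placement, and the combined split stresses both at once.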
Overall System Architecture (See Figure 1 in the Paper)
41 subjects (23 male, 18 female; average age 24) and 60 gesture categories (30 single-hand, 30 bimanual), covering diverse hand-movement patterns.
16-channel EMG (2 kHz), 120 Hz IMU, 60 fps egocentric RGB, 30 fps external RGB-D, and 120 Hz raw optical motion-capture data.
Annotations include 20 finger joint angles and 2 wrist joint angles, with an average marker alignment error of 4.3 mm; MANO mesh parameters are also provided.
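The marker alignment error quoted above is, in the usual sense, a mean Euclidean distance between captured markers and the corresponding points on the reconstructed mesh. A minimal sketch of that metric, on synthetic arrays (the function name and data are illustrative):

```python
import numpy as np

def mean_marker_error(markers_mm, mesh_points_mm):
    """Mean per-marker Euclidean distance; both arrays have shape (N, 3)."""
    return float(np.linalg.norm(markers_mm - mesh_points_mm, axis=1).mean())

# Synthetic example: every reconstructed point offset by (1, 1, 1) mm.
markers = np.zeros((5, 3))
recon = np.full((5, 3), 1.0)
print(mean_marker_error(markers, recon))  # sqrt(3) ≈ 1.732 mm
```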
Centered on high-precision time synchronization, the system integrates EMG, inertial, vision, and motion-capture modalities, tightly aligning hand muscle activity with visual pose and providing a high-quality data foundation for multimodal hand-perception research.
8-channel surface EMG per wrist at a 2 kHz sampling rate, covering the major forearm muscle groups; a lightweight (≤50 g) design that is comfortable and unrestrictive to wear, supporting long continuous recording sessions.
A head-mounted wide-angle RGB camera provides the egocentric hand view, while an external ZED 2i RGB-D camera captures global 3D scene information; together the two cameras give complete coverage of the hands and their environment.
Soft synchronization based on host timestamps, combined with linear interpolation, precisely aligns all modalities in time; the synchronization error is below 1 ms, ensuring cross-modal consistency.
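The soft-synchronization idea can be sketched in a few lines: every stream carries host timestamps, and lower-rate streams are resampled onto a common reference clock by linear interpolation. The stream rates follow the text; the signal itself and the function name are illustrative, not the project's actual code.

```python
import numpy as np

def resample(timestamps_s, values, target_times_s):
    """Linearly interpolate a 1-D signal onto target timestamps."""
    return np.interp(target_times_s, timestamps_s, values)

t_emg = np.arange(0, 1, 1 / 2000)   # 2 kHz EMG clock (reference)
t_imu = np.arange(0, 1, 1 / 120)    # 120 Hz IMU clock
imu = np.sin(2 * np.pi * t_imu)     # synthetic IMU channel

# One interpolated IMU sample per EMG tick.
imu_on_emg_clock = resample(t_imu, imu, t_emg)
print(imu_on_emg_clock.shape)
```

Linear interpolation keeps the alignment error bounded by the timestamp jitter plus the curvature of the signal between samples, which is why a sub-millisecond host-clock error budget is sufficient for these rates.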









jfeng@tsinghua.edu.cn

jzhou@tsinghua.edu.cn
Core members are from the Department of Automation, Tsinghua University, with over 20 years of research experience in computer vision and pattern recognition, and have led multiple national research projects.
The team combines expertise in biomedical engineering, human-computer interaction, and embedded systems, covering the full chain from hardware acquisition to algorithm research.
The team has built multiple hand-pose datasets and has mature pipelines for data collection, annotation, and standardization.
Dozens of top-venue papers published on hand pose estimation and EMG decoding, with the underlying technologies widely applied in AR/VR and prosthetic control.
Tsinghua University
