Action Recognition

The introduction of depth sensors has had a major impact on research in visual recognition.

By providing 3D information, these cameras enable view-invariant and robust representations of the observed scenes and human bodies (see Figure 1).

Human body parts can be detected and localized in 3D more accurately and more efficiently from depth maps than from their RGB counterparts.


Action Recognition from Skeletons and Depth Maps

Even when the 3D structure of the body parts is available, the articulated and complex nature of human actions makes action recognition difficult.

One way to handle this complexity is to divide it into the kinetics of individual body parts and to analyze actions based on part-level descriptors.

We propose a learning method based on joint sparse regression that uses structured sparsity to model each action as a combination of multimodal features from a sparse set of body parts.

To represent the dynamics and appearance of the parts, we employ a heterogeneous set of depth-based and skeleton-based features.

The structure of the multimodal, multipart features is built into the learning framework through the proposed hierarchical mixed norm, which regularizes the structured features of each part and applies sparsity between parts, favoring group feature selection.
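The idea of sparsity between parts can be illustrated with a small sketch. It is not the hierarchical mixed norm proposed in the paper. It is a simpler group-lasso-style penalty (Euclidean norm within each part, summed across parts), swapped in for illustration. The part names, weights, and function names below are all hypothetical.

```python
import math

def l2_norm(values):
    """Euclidean norm of one part's weight vector."""
    return math.sqrt(sum(v * v for v in values))

def part_group_penalty(weights_by_part):
    """Sum of per-part L2 norms: L2 inside a part, L1 across parts.

    Simplified stand-in for a structured mixed norm; NOT the exact
    hierarchical mixed norm of the paper.
    """
    return sum(l2_norm(ws) for ws in weights_by_part.values())

def group_soft_threshold(values, lam):
    """Proximal step for the penalty above, applied to one part.

    A part whose norm is at most lam is zeroed out as a whole, which is
    how the penalty selects a sparse set of parts.
    """
    norm = l2_norm(values)
    if norm <= lam:
        return [0.0] * len(values)
    scale = 1.0 - lam / norm
    return [scale * v for v in values]

def active_parts(weights_by_part, tol=1e-8):
    """Names of parts that still carry non-zero weights."""
    return sorted(p for p, ws in weights_by_part.items() if l2_norm(ws) > tol)

# Hypothetical weights for four body parts (illustrative values only).
weights = {
    "left_hand": [3.0, 4.0],
    "right_hand": [0.3, 0.4],
    "torso": [0.0, 0.0, 0.0],
    "left_leg": [6.0, 8.0],
}

# Shrink every part with the same threshold; weak parts vanish entirely.
shrunk = {p: group_soft_threshold(ws, 1.0) for p, ws in weights.items()}
selected = active_parts(shrunk)
penalty = part_group_penalty(weights)
```

In this toy setting, the weakly weighted part is removed as a whole rather than coefficient by coefficient, which is the group feature selection behaviour described above.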

Our experimental results demonstrate the effectiveness of the proposed learning method: it outperforms other methods on all three tested datasets and saturates one of them by achieving perfect (i.e., 100%) accuracy.

NTU RGB+D Action Dataset

Recent approaches to depth-based human activity analysis have achieved outstanding performance and demonstrated the effectiveness of 3D representations for action classification.

Currently available depth-based and RGB+D-based action recognition benchmarks have a number of limitations, including a lack of training samples, distinct class labels, camera views, and subject variety.

At ROSE-Lab, we collected a large-scale dataset for RGB+D human action recognition with more than 56 thousand video samples and 4 million frames, captured from 40 distinct subjects.

Our dataset contains 60 different action classes including daily actions, mutual actions, and medical conditions.

In addition, we propose a new recurrent neural network structure that models the long-term temporal correlation of the features of each body part and uses it for better action classification.

Experimental results show the advantages of applying deep learning methods over state-of-the-art hand-crafted features on the suggested cross-subject and cross-view evaluation criteria for our dataset.
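The two evaluation criteria can be pictured as two ways of partitioning the same samples: by who performs the action, or by which camera recorded it. The sketch below is a minimal illustration under assumed field names (`subject_id`, `camera_id`) and made-up ID lists. The actual training and testing IDs are defined by the dataset's official protocol and are not reproduced here.

```python
from typing import NamedTuple

class Sample(NamedTuple):
    """Hypothetical record for one video sample."""
    name: str
    subject_id: int
    camera_id: int

def split_by(samples, field, train_values):
    """Put a sample in the training set if its `field` is in train_values."""
    train = [s for s in samples if getattr(s, field) in train_values]
    test = [s for s in samples if getattr(s, field) not in train_values]
    return train, test

# Toy samples; names and IDs are illustrative, not taken from the dataset.
samples = [
    Sample("clip_a", subject_id=1, camera_id=1),
    Sample("clip_b", subject_id=1, camera_id=2),
    Sample("clip_c", subject_id=2, camera_id=1),
    Sample("clip_d", subject_id=2, camera_id=3),
]

# Cross-subject: no performer appears in both training and testing.
cs_train, cs_test = split_by(samples, "subject_id", train_values={1})

# Cross-view: no camera appears in both training and testing.
cv_train, cv_test = split_by(samples, "camera_id", train_values={2, 3})
```

Splitting on a single field keeps the held-out factor (subject or view) completely unseen during training, which is what makes each criterion a test of generalization.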

The introduction of this large-scale dataset will enable the community to apply, develop, and adapt various data-hungry learning techniques for depth-based and RGB+D human activity analysis.

To download our dataset, please visit


[1] Jun Liu, Amir Shahroudy, Gang Wang, Ling-Yu Duan, Alex C. Kot, "SSNet: Scale Selection Network for Online 3D Action Prediction", in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[2] Jun Liu, Gang Wang, Ping Hu, Ling-Yu Duan, Alex C. Kot, "Global Context-Aware Attention LSTM Networks for 3D Action Recognition", in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[3] Jun Liu, Amir Shahroudy, Dong Xu, Alex C. Kot, Gang Wang, "Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates", IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017.

[4] Jun Liu, Gang Wang, Ling-Yu Duan, Kamila Abdiyeva, Alex C. Kot, "Skeleton-Based Human Action Recognition with Global Context-Aware Attention LSTM Networks", IEEE Transactions on Image Processing (TIP), 2018.

[5] Jun Liu, Amir Shahroudy, Dong Xu, Gang Wang, "Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition", in European Conference on Computer Vision (ECCV), 2016.

[6] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, Gang Wang, "NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis", in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[7] Amir Shahroudy, Gang Wang, Tian-Tsong Ng, "Multi-Modal Feature Fusion for Action Recognition in RGB-D Sequences", in 6th International Symposium on Communications, Control and Signal Processing (ISCCSP14), May 2014, Athens, Greece.

[8] Amir Shahroudy, Tian-Tsong Ng, Qingxiong Yang, Gang Wang, "Multimodal Multipart Learning for Action Recognition in Depth Videos", to appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).