Human action recognition in videos is one of the most challenging tasks in computer vision. One important issue is how to design discriminative features for representing spatial context and temporal dynamics. Here, we introduce a path signature feature to encode information from intra-frame and inter-frame contexts. A key step towards leveraging this feature is to construct the proper trajectories (paths) for the data steam. In each frame, the correlated constraints of human joints are treated as small paths, then the spatial path signature features are extracted from them. In video data, the evolution of these spatial features over time can also be regarded as paths from which the temporal path signature features are extracted. Eventually, all these features are concatenated to constitute the input vector of a fully connected neural network for action classification. Experimental results on four standard benchmark action datasets, J-HMDB, SBU Dataset, Berkeley MHAD, and NTURGB+D demonstrate that the proposed approach achieves state-of-the-art accuracy even in comparison with recent deep learning based models.
Yang, W., Lyons, T., Ni, H., Schmid, C., Jin, L. and Chang, J. (2017) arXiv:1707.03993