State-of-the-art methods achieve high accuracy for 2D multi-person pose estimation in images using multi-stage deep neural networks. However, using these models in real-time applications may be impractical, not only because they are computationally intensive but also because they suffer from flickering, from an inability to capture temporal correlations among video frames, and from image degradation. To tackle these problems, we extend pose estimation to motion capture in interactive applications. To do so, we propose TensorPose, a novel deep neural network for pose estimation that uses a streamlined architecture and tensor decomposition to improve processing time. We introduce an architecture for markerless motion capture that combines Convolutional Neural Networks with sparse optical flow and Kalman filters. We also apply this architecture in a multi-user environment, based on the Holojam framework, in which simultaneous collaborative experiences can be created.

The Deep Learning Model

In our model, we replace conventional convolution operations with successive pointwise and regular convolutions in a reduced space. The proposed modifications are analogous to applying a tensor decomposition, more specifically a higher-order singular value decomposition (HOSVD). We also add temporal coherence, which is absent from previous works because they do not consider the relationship between processed frames. To track the detected persons in videos and 3D captures, we use sparse optical flow together with a Kalman filter that smooths the movement.
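To give a sense of why working in a reduced space lowers the cost, the following minimal sketch compares the number of weights in a regular convolution with that of a factorized layer. It assumes one plausible layout, which is not confirmed by the text above: a pointwise (1x1) convolution that reduces the input channels, a regular k x k convolution in the reduced space, and a pointwise convolution that restores the output channels. The function names and the ranks `r_in` and `r_out` are hypothetical, and biases are ignored.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a regular k x k convolution (biases ignored)."""
    return c_in * c_out * k * k


def factorized_params(c_in: int, c_out: int, k: int, r_in: int, r_out: int) -> int:
    """Weights in an assumed pointwise -> k x k -> pointwise factorization.

    r_in and r_out are hypothetical reduced channel counts (the "ranks").
    """
    reduce = c_in * r_in           # 1x1 convolution: c_in -> r_in channels
    core = r_in * r_out * k * k    # k x k convolution in the reduced space
    expand = r_out * c_out         # 1x1 convolution: r_out -> c_out channels
    return reduce + core + expand


if __name__ == "__main__":
    # Illustrative sizes only; they are not taken from the model described here.
    full = conv_params(c_in=128, c_out=128, k=3)
    small = factorized_params(c_in=128, c_out=128, k=3, r_in=32, r_out=32)
    print(full, small, round(full / small, 2))
```

With these illustrative sizes the regular layer has 147456 weights and the factorized one has 17408, roughly 8.5 times fewer. The saving depends entirely on how small the ranks can be made without hurting accuracy, which this sketch does not address.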

In this work, we also propose an architecture that uses markerless motion capture for real-time applications. Such an architecture makes it possible to create environments for developing shared and distributed applications. With the collected information, our architecture composes an environment for developing applications that involve motion tracking, using WebGL or a game engine. This software infrastructure can be used to implement multi-user interaction and positional tracking of people and objects, creating a complete sense of immersion in a tangible space that bridges the gap between the physical and virtual spaces. In other words, users in different physical environments, using various devices, can share and explore a single, integrated virtual environment.