Advances in computing power and artificial intelligence have made applications such as augmented reality/virtual reality (AR/VR) and smart factories possible. In smart factories, robots interact with workers, and AR/VR devices are used for skill transfer. To enable these applications, a computer needs to recognize the user's hand and body movements, the surrounding objects, and their interactions. In this regard, machine perception of hands and objects is the first step toward human and computer integration, because personal activity is largely represented by the interaction between hands and objects. For this perception task, vision sensors are widely used in industrial applications since visual information provides non-contact input signals. For these reasons, computer vision-oriented machine perception has been researched extensively. However, due to the complexity of object space and hand movement, machine perception of hands and objects remains a challenging problem.
Recently, deep learning has produced groundbreaking results in the computer vision domain, addressing many challenging problems and significantly improving performance on many tasks. The success of deep learning algorithms depends on the learning strategy and on the quality and quantity of the training data. Therefore, in this thesis, we tackle machine perception of hands and objects from four aspects: learning the underlying structure of 2D data, fusing surface and volume representations of 3D objects, developing an annotation tool for mechanical components, and using thermal information of bare hands. More broadly, we improve machine perception of interacting hands and objects by developing learning strategies and a framework for large-scale dataset creation.
For the learning strategy, we use a conditional generative model, which learns the conditional distribution of the data by minimizing the gap between the data distribution and the model distribution for hands and objects. First, we propose an efficient conditional generative model for 2D images that can traverse the latent space given a conditional vector. Subsequently, we develop a conditional generative model for 3D space that fuses volume and surface representations and learns the association of functional parts. These methods improve machine perception of objects and hands not only in 2D images but also in 3D space. However, the performance of deep learning algorithms is positively correlated with the quality and quantity of the datasets, which motivates us to develop a large-scale dataset creation framework.
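In schematic form, with notation chosen here purely for illustration rather than taken from a specific chapter, this conditional objective can be written as
\[
\min_{\theta} \; \mathbb{E}_{c \sim p(c)} \Big[ D\big( p_{\mathrm{data}}(x \mid c) \,\|\, p_{\theta}(x \mid c) \big) \Big],
\]
where $c$ is the conditional vector, $p_{\mathrm{data}}$ is the data distribution, $p_{\theta}$ is the model distribution, and $D$ is a divergence such as the KL divergence, or the divergence implicitly minimized by an adversarial loss.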
To leverage the learning strategies of deep learning algorithms, we develop annotation tools that can establish large-scale datasets of objects and hands, and we evaluate existing deep learning methods with extensive performance analysis. For object dataset creation, we establish a taxonomy of mechanical components and a web-based annotation tool. Using this framework, we create a large-scale mechanical components dataset and benchmark seven machine perception algorithms for 3D objects on it. For hand annotation, we propose a novel data curation method for pixel-wise hand segmentation dataset creation, which uses thermal information and hand geometry to identify and segment hands from objects and backgrounds. We also introduce a data fusion method that combines thermal information with RGB-D data for machine perception of hands interacting with objects.
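As a rough illustration of the thermal-based curation idea (not the thesis implementation; the temperature range, depth cutoff, and function name below are assumptions made for this sketch), a thermal image registered to the RGB-D frame can be thresholded around skin temperature and intersected with a depth constraint to produce a candidate hand mask, which geometric filtering would then refine:

```python
import numpy as np

def candidate_hand_mask(thermal_c, depth_m,
                        skin_temp_range=(30.0, 37.0),
                        max_hand_depth=1.2):
    """Rough hand-candidate mask from registered thermal and depth images.

    thermal_c : HxW array of per-pixel temperatures in degrees Celsius,
                assumed registered to the RGB-D frame.
    depth_m   : HxW array of depths in meters.
    The temperature range and depth cutoff are illustrative defaults,
    not values taken from the thesis.
    """
    warm = (thermal_c >= skin_temp_range[0]) & (thermal_c <= skin_temp_range[1])
    near = (depth_m > 0) & (depth_m <= max_hand_depth)
    # Pixels that are both skin-warm and within arm's reach of the camera
    # are kept as hand candidates; hand-geometry priors (e.g., connected
    # components and shape constraints) would refine this mask further.
    return warm & near
```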
Funding
NRI-1637961
IIP-1632154
FW-HTF 1839971
OIA-1937036
Donald W. Feddersen Chaired Professorship from Purdue School of Mechanical Engineering