
Vision-based Hand Gesture Recognition Using Deep Learning Approaches

Posted on: 2022-08-15    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Adam Ahmed Qaid Mohammed    Full Text: PDF
GTID: 1488306734471854    Subject: Computer Science and Technology
Abstract/Summary:
Hand gestures are a significant part of daily life for many individuals. They represent an intuitive and natural form of non-verbal communication with humans as well as machines. The automatic recognition and interpretation of human hand gestures from visual input is among the most active fields of computer vision research, driven by challenging scientific problems and by the growing interest in interactive applications that demand accurate hand gesture recognition methods. Vision-based hand gesture recognition remains extremely challenging, however, due to issues such as the small size of the hand, gesture complexity, self-occlusion, background clutter, and motion speed. Recent advances in consumer depth cameras, which provide complementary depth information and skeletal representations, have alleviated some tasks previously considered very hard and have spurred the development of many novel methods. Even so, it is still difficult to extract discriminative features from gesture data that adequately represent the different dependencies of hand gestures. Moreover, most previous hand gesture recognition methods rely heavily on hand-engineered low-level features or, more recently, on complicated and very deep neural networks.

This dissertation therefore aims to advance research on vision-based hand gesture recognition using deep learning. To this end, we develop several deep learning models that address different challenges in recognizing hand gestures from visual modalities. We focus on deep learning methods that deliver superior performance compared with state-of-the-art approaches while maintaining lightweight model sizes and fast processing times.

First, we address the problem of recognizing hand gestures in cluttered environments by designing a composite system that localizes the hand in an image and recognizes the gesture performed. The system passes the input image through a trained hand detector to extract hand regions, then processes them with a convolutional neural network (CNN) to classify the gesture. The keys to the system are a deep-learning-based one-stage hand detector, which significantly reduces the computational time for extracting hand regions, and a lightweight CNN subnetwork that accurately recognizes hand gestures. In addition, we collect a large dataset to train the hand detector, compensating for the lack of sufficient annotated hand detection data. We propose a stage-wise training strategy: the hand detector is first trained to localize and extract hand regions accurately, and the trained detector is then applied to unlabeled hand gesture data to generate training inputs for the CNN model. The effectiveness of the system is demonstrated through experiments on several hand gesture recognition datasets captured under different acquisition conditions.

Second, we present a static hand gesture recognition method based on deep multimodal learning from RGB-D images. The key idea is to exploit the complementary information among modalities and to construct lightweight CNN models that maintain high accuracy while reducing network size and, consequently, the resources required for training and inference. We propose two lightweight CNN models and explore different multimodal fusion strategies, including feature fusion at intermediate and late layers.
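To make these fusion strategies concrete, the following is a minimal PyTorch sketch of intermediate and late fusion over two lightweight CNN streams. The stream architecture, channel widths, and class count are illustrative assumptions and do not correspond to the actual networks developed in the dissertation.

# A minimal sketch of intermediate vs. late RGB-D fusion (assumed architectures).
import torch
import torch.nn as nn

class LightweightStream(nn.Module):
    """A small CNN backbone for one modality (RGB or depth)."""
    def __init__(self, in_channels):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global average pooling -> (B, 32, 1, 1)
        )
    def forward(self, x):
        return self.features(x).flatten(1)  # (B, 32)

class IntermediateFusion(nn.Module):
    """Concatenate mid-level features from both streams, then classify jointly."""
    def __init__(self, num_classes=10):  # num_classes is a placeholder
        super().__init__()
        self.rgb = LightweightStream(3)
        self.depth = LightweightStream(1)
        self.classifier = nn.Linear(32 + 32, num_classes)
    def forward(self, rgb, depth):
        fused = torch.cat([self.rgb(rgb), self.depth(depth)], dim=1)
        return self.classifier(fused)

class LateFusion(nn.Module):
    """Each stream predicts on its own; the class scores are averaged."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.rgb = LightweightStream(3)
        self.depth = LightweightStream(1)
        self.rgb_head = nn.Linear(32, num_classes)
        self.depth_head = nn.Linear(32, num_classes)
    def forward(self, rgb, depth):
        return 0.5 * (self.rgb_head(self.rgb(rgb)) + self.depth_head(self.depth(depth)))

if __name__ == "__main__":
    rgb = torch.randn(2, 3, 64, 64)    # dummy RGB batch
    depth = torch.randn(2, 1, 64, 64)  # dummy depth batch
    print(IntermediateFusion()(rgb, depth).shape)  # torch.Size([2, 10])
    print(LateFusion()(rgb, depth).shape)          # torch.Size([2, 10])

Intermediate fusion lets a joint classifier see both modalities' features at once, whereas late fusion keeps each stream independent until the final class scores are combined; the trade-off between the two is exactly what the exploration described above measures.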
We demonstrate that multimodal feature fusion significantly improves recognition accuracy by allowing the model to learn complementary features from each modality, while the lightweight backbones keep the multimodal models small and efficient.

The two works above address hand gesture recognition from still images, where the gesture class is inferred from a single image. The third work of this dissertation focuses on dynamic hand gesture recognition from skeleton data, where a sequence of skeleton inputs corresponds to one gesture class. The difficulty of recognizing dynamic hand gestures lies in their spatial and temporal variability: the same gesture can differ in speed, shape, duration, and completeness. We address this problem with a deep convolutional long short-term memory (DConvLSTM) architecture that implicitly learns discriminative spatiotemporal features from skeleton data. The model employs multilayer ConvLSTM units to capture the hierarchical, multiscale spatial and sequential information of dynamic hand gestures while preserving fast inference and a lightweight model size. We also investigate the impact of stacking multiple recurrent layers to attain higher levels of abstraction over the input skeleton sequences. The DConvLSTM model takes only raw skeleton sequences as input and is applicable to a variety of datasets, regardless of dataset size, hand model, or choice of sensing technology.

Finally, the last contribution of this dissertation further improves the extraction of spatiotemporal features from skeleton sequences by developing a deep multi-model ensemble network for high-accuracy skeleton-based hand gesture recognition. Specifically, to establish effective feature extraction and accurate gesture recognition, we propose an architecture of four subnetworks, three of them spatiotemporal feature classifiers, to leverage their varied capabilities for extracting and classifying skeleton sequences. Through late feature fusion, the features extracted by each subnetwork are combined in an additional fusion classifier. Each subnetwork is trained to perform gesture recognition using only skeleton joints. To ensemble the different subnetworks efficiently, we propose an optimized weighted ensemble method that finds the optimal ensemble weights and aggregates the models' predictions so as to minimize the total expected prediction error; a minimal sketch of this weighting scheme follows the abstract. Experiments demonstrate that ensembling multiple models significantly increases recognition rates and yields better results than single-model approaches.

In summary, this dissertation contributes to research on human-machine interaction using hand gestures by addressing many issues found in existing methods. The proposed methods introduce innovative deep learning models that extract discriminative features from different input modalities for accurate hand gesture recognition. The resulting trained models not only improve on the performance of existing methods but also maintain high efficiency. Experimental results on various real-world hand gesture recognition datasets demonstrate the feasibility and improved performance of the proposed methods compared with previous approaches. We anticipate that this work will further promote research on human-machine interaction and its applications in various real-life scenarios.
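As a concrete illustration of the optimized weighted ensemble described above, the sketch below finds convex combination weights for several subnetworks' validation-set predictions by minimizing the ensemble's negative log-likelihood. The optimizer, loss, shapes, and data are assumptions made for illustration, not the dissertation's exact formulation.

# A minimal sketch of weighted-ensemble optimization (assumed formulation).
import numpy as np
from scipy.optimize import minimize

def ensemble_nll(weights, probs, labels):
    """Negative log-likelihood of the weighted ensemble prediction."""
    # probs: (n_models, n_samples, n_classes); weights: (n_models,)
    mixed = np.tensordot(weights, probs, axes=1)  # (n_samples, n_classes)
    return -np.mean(np.log(mixed[np.arange(len(labels)), labels] + 1e-12))

def optimize_weights(probs, labels):
    """Find ensemble weights on the probability simplex minimizing the NLL."""
    n_models = probs.shape[0]
    w0 = np.full(n_models, 1.0 / n_models)  # start from a uniform ensemble
    constraints = {"type": "eq", "fun": lambda w: w.sum() - 1.0}
    bounds = [(0.0, 1.0)] * n_models        # keep it a convex combination
    result = minimize(ensemble_nll, w0, args=(probs, labels),
                      bounds=bounds, constraints=constraints, method="SLSQP")
    return result.x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Fake validation predictions from 4 subnetworks on 100 samples, 14 classes.
    probs = rng.dirichlet(np.ones(14), size=(4, 100))
    labels = rng.integers(0, 14, size=100)
    print("optimal ensemble weights:", np.round(optimize_weights(probs, labels), 3))

Constraining the weights to the probability simplex keeps the fused output a valid class distribution, and subnetworks whose predictions reduce the validation error most naturally receive the largest weights.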
Keywords/Search Tags: Human-Computer Interaction, Deep Learning, Hand Gesture Recognition, Spatiotemporal Feature Extraction, Ensemble Learning