Visual interaction, one of the prominent research topics in the domain of human-computer interaction, allows a user to communicate with electronic devices in a novel and intuitive way. Although many existing approaches can interact with different electronic devices, the task remains challenging for several reasons: differing invariants, environment complexity, processing time, the lack of a proper framework, accuracy, and security. It is therefore imperative to develop a proper method of interaction for real-world applications using hand actions, including visual interaction with wearable devices. In addition, hand action classification is an important field for adding smart functionality to modern electronic devices, because it offers interactive and innovative methods of communication. Most importantly, a visual interaction application must recognize human hand actions accurately to interact properly with wearable devices.

In this thesis, a novel architecture is developed based on You Only Look At CoefficienTs (YOLACT), a real-time instance segmentation approach, to segment the user from the background; a face recognition-based security network (FRB-SN) for user identification; and a temporal relation network (TRN) for hand action understanding. Six main hand actions are considered: swipe left, swipe right, swipe up, swipe down, zoom in, and zoom out. Each model is trained separately: YOLACT on a segmented version of the 20BN-Jester dataset, the FRB-SN on the VGGFace2 dataset, and the TRN on the segmented action videos created with the trained YOLACT model. At test time, YOLACT first segments the user from the given image sequence; the sequence is then passed to the FRB-SN model to detect and recognize the user; finally, the trained TRN model predicts the corresponding user action to interact with the graphical user interface.

Several experiments were conducted to measure the performance of the proposed model, demonstrating the efficiency and reliability of the framework for visual interaction in real time. The training and testing of the architecture are described in detail, and various other techniques for visual interaction, image segmentation, and action recognition are also reviewed. The 20BN-Jester dataset, which contains hand action videos with diverse backgrounds and orientations, is adapted to train the YOLACT and TRN models. The main purpose of YOLACT is to remove unnecessary information from the input image, so that only the user extracted from the foreground of the image is permitted to manipulate the system. The extracted frames are then given to the FRB-SN network to recognize the user's face before the sequence is passed to the action recognition model, which identifies the corresponding hand action. Experiments on the proposed framework demonstrate its efficiency, superiority, and frame rate from different aspects: the visual interaction system runs at 7-10 FPS, and the action recognition model achieves an accuracy of 97.65%.
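The three-stage pipeline described above can be sketched as follows. This is a minimal illustrative sketch only: the class names (`UserSegmenter`, `FaceRecognizer`, `ActionClassifier`), the stubbed predictions, and the authorization check are assumptions standing in for the trained YOLACT, FRB-SN, and TRN models, not the thesis implementation.

```python
# Hypothetical sketch of the segment -> identify -> classify pipeline.
# All three classes are placeholders for the trained models.

ACTIONS = ["swipe left", "swipe right", "swipe up", "swipe down",
           "zoom in", "zoom out"]

class UserSegmenter:
    """Stands in for YOLACT: masks the background, keeping the user."""
    def segment(self, frame):
        return {"frame": frame, "mask": "user-only"}  # placeholder mask

class FaceRecognizer:
    """Stands in for FRB-SN: identifies the segmented user."""
    def __init__(self, authorized_ids):
        self.authorized_ids = set(authorized_ids)
    def identify(self, segmented_frame):
        return "alice"  # placeholder identity

class ActionClassifier:
    """Stands in for TRN: classifies the hand action from a sequence."""
    def predict(self, segmented_frames):
        return "swipe left"  # placeholder prediction

def interact(frames, segmenter, recognizer, classifier):
    # Stage 1 (YOLACT): strip the background from every frame.
    segmented = [segmenter.segment(f) for f in frames]
    # Stage 2 (FRB-SN): verify the user before granting control.
    user = recognizer.identify(segmented[0])
    if user not in recognizer.authorized_ids:
        return None  # unrecognized users cannot manipulate the system
    # Stage 3 (TRN): map the segmented sequence to one of the six actions.
    return classifier.predict(segmented)

action = interact(
    frames=["frame_%d" % i for i in range(8)],
    segmenter=UserSegmenter(),
    recognizer=FaceRecognizer(authorized_ids=["alice"]),
    classifier=ActionClassifier(),
)
print(action)  # -> "swipe left"
```

The key design point the sketch captures is the ordering: segmentation runs first so that both the security check and the action classifier operate only on the foreground user, which is how the framework keeps background people from triggering commands.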