
Research On Cross-modal Speech Recognition System

Posted on: 2022-11-20    Degree: Master    Type: Thesis
Country: China    Candidate: S H Wang    Full Text: PDF
GTID: 2518306779996199    Subject: Automation Technology
Abstract/Summary:
In human-robot interaction, a critical link is the robot receiving human speech and performing the corresponding actions according to the spoken command. Interacting with humans through the "communication" they are accustomed to brings great convenience. In this thesis, we propose a spoken term triplet detection model with visual grounding, which recognizes content triplets consisting of a subject object, an action, and a patient object; these triplets can be delivered to the robot so that it acts according to the speech.

In previous research, training a speech model to an ideal accuracy often requires a large amount of manual textual labels, which are very difficult to obtain. In response to this problem, we devise a framework consisting of two modules, i.e., a video network and a speech network. In the video network, we first exploit I3D and Mask R-CNN to extract action and object features, respectively. Two XGBoost classifiers are then adopted to identify subject and patient objects according to the manipulator's actions. With the obtained subject objects, actions, and patient objects, we construct content triplets as soft labels to train the speech network. In the speech network, we integrate multiple sequence networks into a multi-head attention module to model the interconnections among spoken terms. In particular, the residuals of the sequence networks as well as the output of the attention module are combined by a BiGRU to make predictions. Experiments conducted on the MPII Cooking 2 dataset, which we extended with paired speech, show that the proposed MASN model is able to use visual grounding to represent speech labels and outperforms competing methods.

Finally, we extend the spoken term triplet detection model with visual grounding and propose a robot instruction manipulation framework based on spoken term triplet detection. The framework mainly includes six modules, among them a sentence segmentation module, a spoken term triplet detection module, an instruction conversion module, an object retrieval module, and a path planning module. It recognizes human speech and lets the robot operate following the spoken command. To verify the framework, we deploy it on a UR10e robot. The experimental results show that the robot can quickly perform the corresponding operations after hearing human speech.
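As a rough illustration of the multi-head attention step in the speech network, the sketch below computes scaled dot-product self-attention over a sequence of spoken-term feature vectors. The feature dimension, head count, and random projection weights are illustrative assumptions, not the thesis's actual parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads=4, seed=0):
    """Scaled dot-product self-attention over spoken-term features.

    X has shape (seq_len, d_model); the output keeps the same shape.
    Random projections stand in for learned parameters here.
    """
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0
    d_k = d_model // num_heads
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                  for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(num_heads):
        sl = slice(h * d_k, (h + 1) * d_k)
        # Attention weights relate every spoken term to every other term.
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V[:, sl])
    return np.concatenate(heads, axis=1)
```

In the thesis's architecture this output would be combined with the sequence networks' residuals and fed into a BiGRU; this fragment only shows the attention computation itself.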
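In a minimal form, the instruction conversion module above could map a detected (subject, action, patient) triplet to a robot command. The action names and command codes below are hypothetical placeholders, not the framework's actual instruction set.

```python
# Hypothetical action-to-command table; real commands would depend
# on the robot controller's instruction set.
ACTION_TO_COMMAND = {
    "pick": "GRASP",
    "place": "RELEASE",
    "cut": "SAW_MOTION",
}

def triplet_to_instruction(triplet):
    """Convert a detected content triplet into a robot instruction dict."""
    subject, action, patient = triplet
    command = ACTION_TO_COMMAND.get(action, "NOOP")
    return {"actor": subject, "command": command, "target": patient}
```

For example, the triplet `("hand", "pick", "knife")` would become an instruction with command `GRASP` and target `knife`, which downstream modules (object retrieval, path planning) could then ground to a physical pose.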
Keywords/Search Tags: spoken term, keyword detection, triplet detection, multi-head attention, visual grounding, robot instruction