
Research On Cross-modal Speech Recognition System

Posted on: 2022-11-20    Degree: Master    Type: Thesis
Country: China    Candidate: S H Wang    Full Text: PDF
GTID: 2518306779996199    Subject: Automation Technology
Abstract/Summary:
In human-robot interaction, a critical link is the robot receiving human speech and performing the corresponding actions according to the spoken command. Interacting with humans through the "communication" they are accustomed to brings great convenience. In this thesis, we propose a spoken term triplet detection model with visual grounding, which recognizes content triplets consisting of a subject object, an action, and a patient object; these triplets can be delivered to the robot so that it acts according to the speech.

In previous research, training a speech model to an ideal accuracy often requires a large amount of manual textual labels, which are very difficult to obtain. In response to this problem, we devise a framework consisting of two modules, i.e., a video network and a speech network. In the video network, we first exploit I3D and Mask R-CNN to extract action and object features, respectively. Two XGBoost classifiers are then adopted to identify subject and patient objects according to the manipulator's actions. With the obtained subject objects, actions, and patient objects, we construct content triplets as soft labels to train the speech network. In the speech network, we integrate multiple sequence networks into a multi-head attention module to model the interconnections among spoken terms. In particular, the residuals of the sequence networks as well as the output of the attention module are combined by a BiGRU to make predictions. Experiments conducted on the MPII Cooking 2 dataset, which we extended with paired speech, show that the proposed MASN model is able to use visual grounding to represent speech labels and outperforms competing methods.

Finally, we extend the spoken term triplet detection model with visual grounding and propose a robot instruction manipulation framework based on spoken term triplet detection. The framework mainly includes six modules, among them a sentence segmentation module, a spoken term triplet detection module, an instruction conversion module, an object retrieval module, and a path planning module. It recognizes human speech and lets the robot operate following the spoken command. To verify the framework, we deploy it on a UR10e robot. The experimental results show that the robot can quickly perform the corresponding operations after hearing human speech.
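As a rough illustration of the multi-head attention step in the speech network, the sketch below computes scaled dot-product self-attention over a sequence of spoken-term feature vectors. The feature dimension, head count, and random projection weights are illustrative assumptions, not the thesis's actual parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads=4, seed=0):
    """Scaled dot-product self-attention over spoken-term features.

    X has shape (seq_len, d_model); the output keeps the same shape.
    Random projections stand in for learned parameters here.
    """
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0
    d_k = d_model // num_heads
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                  for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(num_heads):
        sl = slice(h * d_k, (h + 1) * d_k)
        # Attention weights relate every spoken term to every other term.
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V[:, sl])
    return np.concatenate(heads, axis=1)
```

In the thesis's architecture this output would be combined with the sequence networks' residuals and fed into a BiGRU; this fragment only shows the attention computation itself.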
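In a minimal form, the instruction conversion module above could map a detected (subject, action, patient) triplet to a robot command. The action names and command codes below are hypothetical placeholders, not the framework's actual instruction set.

```python
# Hypothetical action-to-command table; real commands would depend
# on the robot controller's instruction set.
ACTION_TO_COMMAND = {
    "pick": "GRASP",
    "place": "RELEASE",
    "cut": "SAW_MOTION",
}

def triplet_to_instruction(triplet):
    """Convert a detected content triplet into a robot instruction dict."""
    subject, action, patient = triplet
    command = ACTION_TO_COMMAND.get(action, "NOOP")
    return {"actor": subject, "command": command, "target": patient}
```

For example, the triplet `("hand", "pick", "knife")` would become an instruction with command `GRASP` and target `knife`, which downstream modules (object retrieval, path planning) could then ground to a physical pose.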
Keywords/Search Tags: spoken term, keyword detection, triplet detection, multi-head attention, visual grounding, robot instruction