Research On Speech Keyword Spotting Technology Based On Deep Learning

Posted on: 2022-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: H B Hu
GTID: 2518306326493064
Subject: Master of Engineering
Abstract/Summary:
With the development of deep learning, keyword spotting methods based on deep learning have become more popular than traditional keyword retrieval methods based on large vocabulary continuous speech recognition. In traditional keyword retrieval methods, each component is trained independently, so the overall performance of the system cannot be fully exploited; continuous speech recognition is also slow and requires more storage space. An End-to-End keyword spotting system overcomes these shortcomings well, so this thesis studies the End-to-End ASR-free keyword spotting framework, and improves and implements it. The main work is as follows:

1. An End-to-End ASR-free keyword spotting system is studied and implemented. The system takes a keyword in text form and the audio to be searched as input, and directly outputs 1 or 0 to indicate whether the keyword appears in the audio. It consists of three parts: a text encoder, implemented as a character-level language model, produces the vector representation of the keyword; an acoustic encoder, implemented as a recurrent neural network, produces the vector representation of the audio; and a keyword spotting model, implemented as a feed-forward network, takes the keyword vector and the audio vector as input and computes the final result. Experiments on the AISHELL and RASC863 corpora show an overall accuracy of 62.5% and an ATWV of -82.3406. Although there is a performance gap with traditional keyword retrieval methods, the amount of keyword-labeled data required and the training time are reduced by 90% and 80%, respectively, compared with traditional keyword retrieval systems.

2. Building on the End-to-End ASR-free keyword spotting system, and to address the acoustic encoder's limited memory and insufficient encoding ability for long speech sequences, an attention-based End-to-End keyword spotting model is used to improve the system's performance. First, the acoustic encoder is replaced by an acoustic module implemented as a Bidirectional Long Short-Term Memory network. Then, to obtain the keyword vector, the convolutional neural network in the text query encoder is removed. Finally, an attention mechanism is used to extract the keyword information from the input speech signal. Experiments show that the attention-based keyword spotting model performs much better: accuracy and ATWV improve by 21.6% and 49.7%, respectively, over the baseline system.

3. Building on the attention-based End-to-End keyword spotting model, and to address the insufficient feature extraction ability and slow computation of the Long Short-Term Memory network in the acoustic module, a temporal convolutional neural network and a self-attention mechanism are used to improve the model. These two network structures are combined with the Bidirectional Long Short-Term Memory network to obtain different acoustic modules. The experimental results show that the model with the temporal convolutional neural network achieves the best recognition performance. Compared with the attention-based keyword spotting model, its accuracy and ATWV show relative improvements of 11.7% and 67.1%, respectively.

4. Building on the attention-based End-to-End keyword spotting model, and to address the low utilization of label information, auxiliary tasks are constructed using the Connectionist Temporal Classification criterion, and the End-to-End keyword spotting model is trained in a multi-task fashion to improve the acoustic encoder's ability to extract semantic information and further improve model performance. This thesis proposes two ways to construct auxiliary tasks with the Connectionist Temporal Classification criterion. The first builds, on the output of the acoustic encoder, an auxiliary keyword recognition task with time and location information. The second builds, on the output of the acoustic encoder, an auxiliary continuous speech recognition task. Both use the Connectionist Temporal Classification loss function for multi-task training. Experimental results show that the model with the continuous speech recognition auxiliary task achieves the best performance, with accuracy and ATWV of 98% and -6.4892, respectively.
Keywords/Search Tags: keyword spotting, End-to-End, attention mechanism, Connectionist Temporal Classification
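The abstract reports negative ATWV values alongside positive accuracy, which can look odd at first. The sketch below shows how the term-weighted value is commonly computed under the NIST Spoken Term Detection convention, and why false alarms can push it below zero. It is an illustration only: the abstract does not describe the thesis's scoring setup or the scale of its reported values, and the names `KeywordCounts`, `term_weighted_value` and `atwv` are hypothetical. The weight 999.9 is the usual NIST setting, assumed here.

```python
from dataclasses import dataclass

# Assumed NIST STD weighting: (C/V) * (1/Pr_kw - 1) with C/V = 0.1, Pr_kw = 1e-4.
BETA = 999.9


@dataclass
class KeywordCounts:
    """Detection counts for one keyword (hypothetical helper type)."""
    n_true: int      # reference occurrences of the keyword
    n_correct: int   # detections that match a reference occurrence
    n_spurious: int  # detections with no matching reference (false alarms)


def term_weighted_value(counts, speech_seconds, beta=BETA):
    """TWV for one keyword: 1 - (P_miss + beta * P_fa)."""
    p_miss = 1.0 - counts.n_correct / counts.n_true
    # Non-target trials are approximated as one trial per second of speech.
    n_non_target = speech_seconds - counts.n_true
    p_fa = counts.n_spurious / n_non_target
    return 1.0 - (p_miss + beta * p_fa)


def atwv(per_keyword, speech_seconds, beta=BETA):
    """Average TWV over keywords that occur at least once in the reference."""
    scored = [c for c in per_keyword if c.n_true > 0]
    if not scored:
        raise ValueError("no keyword has a reference occurrence")
    total = sum(term_weighted_value(c, speech_seconds, beta) for c in scored)
    return total / len(scored)
```

Under this definition a perfect system scores 1, a system that outputs nothing scores 0, and a system that finds every occurrence but also raises many false alarms scores below 0, because each false alarm is weighted by roughly a thousand.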