
Deep Learning For Spoken Term Detection

Posted on: 2018-05-07
Degree: Master
Type: Thesis
Country: China
Candidate: Y M Zhuang
Full Text: PDF
GTID: 2518305897476634
Subject: Computer Science and Technology
Abstract/Summary:
Since deep learning was proposed by Hinton in 2006, it has become a hot topic for researchers, and deep learning methods have been widely adopted in industry. Deep learning methods can make reasonable predictions about unknown results from known external information, and they offer strong nonlinear mapping, self-learning, and fault-tolerance abilities. Speech is one of the fields where deep learning is most applicable, and spoken term detection is an important topic within it. The purpose of spoken term detection is to detect whether specific keywords occur in continuous speech. The technology is used in a variety of scenarios, such as data retrieval, data mining, and command and control. Among mainstream methods, the GMM-HMM based Keyword-Filler model is an effective approach to unrestricted keyword detection, and replacing the GMM with a DNN to compute the emission probabilities of the HMM can further reduce the error rate by more than 20%. Although deep learning improves performance this significantly, classic deep neural network based spoken term detection methods still have many defects, such as poor modeling of long-term dependence and difficulty distinguishing between similar keywords.

This thesis studies the application of deep learning to spoken term detection and tries to address the shortcomings of the classical methods. It does so in two areas: preprocessing and the detection algorithm. In preprocessing, voice activity detection is very important because it directly determines the quality of the speech signal that the detection algorithm has to process. This thesis adopts a multi-task deep neural network model combined with a multi-frame prediction technique to improve the accuracy of voice activity detection. During training, voice activity detection is the main task and speech enhancement is an auxiliary task. The current speech frame is predicted together with multiple preceding and following frames, and the final prediction is computed by a fusion function. On the English test set, compared with the classical deep learning baseline, the multi-frame prediction model achieves a relative accuracy improvement of 17.9%, and the fragmentation problem is reduced by 4.1%.

For the detection algorithm, the LSTM-CTC model, an effective method for modeling long-term dependence, is introduced into the spoken term detection system, and a phoneme search algorithm is proposed that exploits the sparsity of the LSTM-CTC output. In the proposed method, the LSTM-CTC acoustic model computes the posterior probability of each frame and the corresponding CTC lattice is generated. An edit distance based phoneme search algorithm is then run on the CTC lattice to compute the keyword score of the test audio. Finally, the decision is made by comparing the score with a dynamic threshold. Experiments were designed to test the performance of the proposed system. On the English test set, the proposed system achieves a 29% relative reduction in equal error rate (EER) and a 12% improvement in figure of merit (FOM), and it also outperforms the classical method in detecting similar keywords.
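As a rough illustration of an edit distance based phoneme search, the sketch below scores a keyword against a frame-level label sequence. It is a simplified stand-in under stated assumptions, not the thesis's algorithm: it searches a single best-path CTC decoding rather than a CTC lattice, it uses a fixed threshold rather than a dynamic one, and the function names, the blank index, the integer phoneme IDs, and the score normalization are all hypothetical.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank label (hypothetical)


def ctc_greedy_collapse(frame_labels: Sequence[int], blank: int = BLANK) -> List[int]:
    """Best-path CTC decoding: merge repeated labels, then drop blanks."""
    phones: List[int] = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            phones.append(label)
        prev = label
    return phones


def min_substring_edit_distance(keyword: Sequence[int], phones: Sequence[int]) -> int:
    """Smallest Levenshtein distance between `keyword` and any contiguous
    stretch of `phones` (approximate substring matching)."""
    m = len(keyword)
    # Column for an empty phoneme stream: every keyword phoneme is missing.
    prev = list(range(m + 1))
    best = prev[m]
    for p in phones:
        cur = [0]  # a match may start at any position at no cost
        for i in range(1, m + 1):
            substitution = prev[i - 1] + (0 if keyword[i - 1] == p else 1)
            extra_phone = prev[i] + 1      # stream phoneme not in the keyword
            missing_phone = cur[i - 1] + 1  # keyword phoneme not in the stream
            cur.append(min(substitution, extra_phone, missing_phone))
        best = min(best, cur[m])
        prev = cur
    return best


def keyword_score(keyword: Sequence[int], frame_labels: Sequence[int]) -> float:
    """Score in [0, 1]; 1.0 means the keyword's phonemes occur exactly.
    `keyword` must be non-empty."""
    phones = ctc_greedy_collapse(frame_labels)
    return 1.0 - min_substring_edit_distance(keyword, phones) / len(keyword)


def detect(keyword: Sequence[int], frame_labels: Sequence[int],
           threshold: float = 0.8) -> bool:
    """Fixed-threshold decision; the threshold value is an arbitrary placeholder."""
    return keyword_score(keyword, frame_labels) >= threshold
```

Because CTC outputs are mostly blank with sparse phoneme spikes, the collapsed sequence is short, which keeps a search of this kind cheap compared with the number of frames. A lattice-based version would consider several phoneme hypotheses per spike instead of only the top one.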
Keywords/Search Tags: Deep Neural Network, Spoken Term Detection, Long Short-Term Memory, Multi-task, Speech Recognition