
Spoken Keyword Spotting Method And System Design Based On CRNN-CTC

Posted on: 2022-01-09; Degree: Master; Type: Thesis
Country: China; Candidate: H K Yan; Full Text: PDF
GTID: 2518306569972659; Subject: Communication and Information System
Abstract/Summary:
With the rapid development of deep learning, the performance of spoken keyword spotting (KWS) has improved greatly. However, because of the high complexity of the languages themselves and the lack of annotated corpora, KWS has not been fully studied for many minority languages such as the Hakka dialect, and few speech intelligence applications exist for them. This thesis studies KWS for the Hakka dialect spoken in the Ganzhou area of Jiangxi Province. It proposes a KWS method that uses a Convolutional Recurrent Neural Network (CRNN) with Connectionist Temporal Classification (CTC). The effectiveness of the method is first verified on Mandarin, the method is then applied to the Hakka dialect, and finally a KWS system is built.

The main work of this thesis is as follows:

1. A CRNN-CTC based KWS method is proposed, which combines a Convolutional Neural Network (CNN) with RNN-CTC. Experiments on the AISHELL-2 Mandarin public corpus show that, on the 12-keyword and 20-keyword tasks, the proposed CRNN-CTC method achieves false reject rates of 4.82% and 5.38%, respectively, at 0.5 false alarms per keyword per hour. These are relative decreases of 38.83% and 58.81% compared with the RNN-CTC method. In addition, the training time is shorter.

2. In view of the pronunciation characteristics of Mandarin, different modeling units are compared systematically: Chinese characters, tonal syllables, words, and initials with tonal finals. The experimental results show that using all initials and tonal finals as the modeling unit gives the best performance on Mandarin KWS.

3. A Hakka speech corpus of about 447 hours is collected in this thesis, and the CRNN-CTC based method is extended to Hakka. Considering the language characteristics of the Hakka dialect, the thesis explores whether Hakka and Mandarin share the same optimal modeling unit, and analyzes in detail the reasons for the performance gap between Hakka and Mandarin. The experimental results on the collected Hakka corpus show that, among the above four modeling units, the Hakka dialect performs best with tonal syllables, which differs from Mandarin. On the Hakka KWS tasks of 12 keywords and 50 keywords, the proposed method achieves false reject rates of 12.43% and 11.88%, respectively, at 0.5 false alarms per keyword per hour.

4. To improve the purity of the Hakka speech corpus, a speech sample reliability evaluation indicator based on weighted edit distance is designed on top of the proposed CRNN-CTC KWS method. The indicator is weighted according to the number of false alarms and false rejects of the keywords, and it combines the model decoding outputs from different training epochs. It is then used to screen out the samples most likely to have noisy labels.

5. A KWS system that can process multiple speech inputs concurrently is built. The system exposes an API request interface and has two working modes: offline non-real-time keyword spotting and online real-time keyword spotting. In testing, the system processes about 60 seconds of speech data per second on a general-purpose computer without a GPU.
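The reliability indicator in item 4 can be illustrated with a minimal Python sketch. The abstract does not give the thesis's exact formula, so everything here is an assumption made for illustration: the function names `weighted_edit_distance` and `sample_unreliability`, the per-token cost dictionaries (standing in for weights derived from false alarm and false reject counts), the substitution cost, and the plain average over epochs.

```python
from typing import Dict, List, Sequence


def weighted_edit_distance(
    reference: Sequence[str],
    hypothesis: Sequence[str],
    insert_cost: Dict[str, float],
    delete_cost: Dict[str, float],
    default_cost: float = 1.0,
) -> float:
    """Levenshtein distance with per-token insertion and deletion costs.

    Assumption: an inserted token (present only in the decoded hypothesis)
    is treated like a false alarm, and a deleted token (present only in the
    label) like a false reject. The cost tables are hypothetical inputs.
    """
    n, m = len(reference), len(hypothesis)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    # First column: delete every reference token.
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + delete_cost.get(reference[i - 1], default_cost)
    # First row: insert every hypothesis token.
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + insert_cost.get(hypothesis[j - 1], default_cost)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            ref_tok, hyp_tok = reference[i - 1], hypothesis[j - 1]
            d = delete_cost.get(ref_tok, default_cost)
            ins = insert_cost.get(hyp_tok, default_cost)
            # Assumed substitution cost: free on a match, else the larger
            # of the two single-edit costs.
            sub = 0.0 if ref_tok == hyp_tok else max(d, ins)
            dp[i][j] = min(
                dp[i - 1][j] + d,        # delete ref_tok
                dp[i][j - 1] + ins,      # insert hyp_tok
                dp[i - 1][j - 1] + sub,  # match or substitute
            )
    return dp[n][m]


def sample_unreliability(
    reference: Sequence[str],
    epoch_outputs: List[Sequence[str]],
    insert_cost: Dict[str, float],
    delete_cost: Dict[str, float],
) -> float:
    """Average the weighted distance between the label and the decoded
    outputs saved at several training epochs (assumed combination rule).
    A higher score marks a sample as more likely to carry a noisy label."""
    if not epoch_outputs:
        raise ValueError("at least one epoch output is required")
    distances = [
        weighted_edit_distance(reference, hyp, insert_cost, delete_cost)
        for hyp in epoch_outputs
    ]
    return sum(distances) / len(distances)
```

Under these assumptions, samples would be ranked by `sample_unreliability` and the highest-scoring ones reviewed or dropped; a sample whose label disagrees with the decoded output at most epochs scores higher than one that disagrees only early in training.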
Keywords/Search Tags: spoken keyword spotting, connectionist temporal classification, Hakka, modeling units, speech samples screening