
Research On Adaptation Methods In Deep Learning Based Speech Recognition Systems

Posted on: 2021-01-20    Degree: Doctor    Type: Dissertation
Country: China    Candidate: J Pan    Full Text: PDF
GTID: 1368330602994196    Subject: Electronics and information
Abstract/Summary:
Speech is the fastest and most convenient mode of human-computer interaction, and speech recognition technology is an important part of artificial intelligence. With the progress of deep learning, speech recognition has achieved accuracy close to that of humans in most scenarios, but in special scenarios such as speakers with dialects and accents, complex environmental noise, and professional domains, recognition accuracy declines significantly, harming the user experience. Adaptation technology is one of the effective means of improving recognition accuracy in such scenarios, and it has therefore long been a research hotspot in the field of speech recognition. Compared with adaptation in traditional speech recognition systems, adaptation in deep learning based systems faces very large numbers of model parameters together with relatively small amounts of adaptation data, which makes it a challenging task. Aiming at these problems, the dissertation carries out research on online adaptation of acoustic models, offline adaptation of acoustic models under low resources, unsupervised offline adaptation of acoustic models, and adaptation of language models, and applies the research results to real speech recognition systems. The research work is based on key special topics of the National Key R&D Program of the Ministry of Science and Technology undertaken by Iflytek Co., Ltd.: "Natural Interaction Intention Understanding and Intelligent Input Based on Big Data" (Subject Number: 2016YFB1001303) and "Speech Recognition and Intent Understanding in Unknown Scenes" (Subject Number: 2018AAA0102204).

The research contents of the dissertation include the following.

The online adaptation technology of acoustic models is studied. Online adaptation of acoustic models requires extremely high real-time performance, and the scarcity of adaptation data limits the adaptation effect. To address these problems, the dissertation proposes an online adaptation method for acoustic models based on the attention mechanism. A pre-trained speaker recognition model is used to extract embeddings for a large number of speakers, and after clustering these embeddings serve as external memory units. The attention mechanism then quickly selects the embeddings closest to the current speech segment and combines them to obtain a speaker embedding for the current speech frame. Furthermore, we introduce the fixed-size ordinally forgetting encoding mechanism and propose a multiple-gated-connections mechanism, a speaker classification auxiliary objective function, and residual-vector speaker embeddings, which further improve the performance of online adaptation. Experiments on two representative Chinese and English speech recognition datasets show that our methods achieve a significant improvement in online adaptation of acoustic models without substantially increasing the computational complexity of speech recognition.
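As an illustration of the attention-based selection over the external speaker-embedding memory described above, the following sketch computes a per-frame speaker embedding by attending over clustered embeddings and appends it to the acoustic features. It is a minimal sketch under stated assumptions, not the dissertation's implementation: the single-head dot-product attention, the dimensions, and the names (SpeakerMemoryAttention, memory, query_proj) are assumed for illustration only.

    import torch
    import torch.nn.functional as F

    class SpeakerMemoryAttention(torch.nn.Module):
        """Minimal sketch: attend over clustered speaker embeddings (an external
        memory) to build a per-frame speaker embedding for online adaptation."""

        def __init__(self, acoustic_dim, embed_dim, memory):
            super().__init__()
            # memory: (num_clusters, embed_dim) cluster centroids of embeddings
            # extracted by a pre-trained speaker recognition model (assumed given).
            self.register_buffer("memory", memory)
            self.query_proj = torch.nn.Linear(acoustic_dim, embed_dim)

        def forward(self, frames):
            # frames: (batch, time, acoustic_dim) acoustic features
            queries = self.query_proj(frames)                        # (B, T, E)
            scores = torch.matmul(queries, self.memory.t())          # (B, T, C)
            weights = F.softmax(scores / self.memory.size(1) ** 0.5, dim=-1)
            spk_embed = torch.matmul(weights, self.memory)           # (B, T, E)
            # The speaker embedding is concatenated to the acoustic features
            # and fed to the acoustic model as an auxiliary input.
            return torch.cat([frames, spk_embed], dim=-1)

In the dissertation, this core step is further combined with fixed-size ordinally forgetting encoding, multiple gated connections, an auxiliary speaker classification loss, and residual-vector embeddings; the sketch shows only the attention-based lookup itself.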
The offline adaptation technology of acoustic models under low resources is studied. Offline adaptation with little data easily overfits and generalizes poorly, so the dissertation proposes a speaker code method based on multi-task learning and an adaptation method based on singular value decomposition and vector quantization. We first analyze the traditional adaptation method based on the speaker code and point out its shortcomings. We then introduce an additional speaker classification learning target to perform multi-task learning on the speaker code vector, improving the method's generalization to new speakers. Next, we extend the speaker code vector into a speaker code matrix to enhance the adaptation effect, and initialize the speaker code matrix by singular value decomposition. To further compress the number of adaptation parameters, we introduce vector quantization and combine the quantization process with the adaptation process to reduce the loss in performance caused by quantization. On real speech recognition datasets, both methods achieve good results under low-resource conditions.

The unsupervised offline adaptation technology of acoustic models is studied. Unsupervised offline adaptation of acoustic models suffers a serious performance degradation compared with supervised adaptation. We first propose a method that uses text confirmed by the user during human-computer interaction to improve the accuracy of automatic labeling of adaptation data, and then propose an acoustic confidence measure based on a confirmation model. By designing a variety of statistical features for training the confirmation model, we can directly determine whether the current word is recognized correctly, and the correlation between the confidence and the recognition accuracy is significantly strengthened. This confidence measure makes the selection of adaptation data more accurate and improves the accuracy of automatic labeling. We then step outside the constraints of traditional unsupervised adaptation and propose an unsupervised adaptation method based on meta-learning, which directly takes performance on the test set as the training target, so that the meta-learned model achieves better performance on the test set. Experimental results show that the proposed methods greatly improve the effect of unsupervised adaptation of acoustic models.

The adaptation technology of language models is studied. The adaptation data of language models is sparse, and effective adaptation methods are lacking. We propose an N-gram language model adaptation method based on words modified by users. By mining user keywords from user correction behavior and dynamically boosting those keywords, we can adapt N-gram language models effectively and rapidly, with a low false-trigger rate and a significant improvement in the recognition accuracy of user keywords. Because it is hard to adapt neural network language models without domain information, we propose an adaptation method for neural network language models based on unsupervised clustering, which partitions the training text by unsupervised clustering and trains category-specific language models. These models share hidden layers to alleviate the sparseness of the training data. During decoding, the output probabilities of the multiple category-specific language models are dynamically weighted to improve the reliability of the language model's output probability. The effectiveness of the proposed methods is verified on real speech recognition datasets.
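To make the dynamic weighting of the category-specific language models concrete, the following sketch shows the interpolation step at decoding time: each category-specific model's next-word distribution is mixed according to a per-context category weight. This is a minimal sketch of the general idea, not the dissertation's implementation; the function and variable names (interpolate_lm_probs, category_posterior) and the assumption that the category weights are given are illustrative only.

    import numpy as np

    def interpolate_lm_probs(category_probs, category_posterior):
        """Minimal sketch: combine category-specific LM next-word distributions.

        category_probs:     (num_categories, vocab_size) distributions from the
                            category-specific language models.
        category_posterior: (num_categories,) weight of each category for the
                            current decoding context (assumed given).
        Returns the interpolated next-word distribution.
        """
        weights = np.asarray(category_posterior, dtype=np.float64)
        weights = weights / weights.sum()               # ensure a proper mixture
        return weights @ np.asarray(category_probs)     # (vocab_size,)

    # Example: two category LMs over a toy three-word vocabulary.
    probs = [[0.7, 0.2, 0.1],
             [0.1, 0.3, 0.6]]
    print(interpolate_lm_probs(probs, [0.8, 0.2]))      # leans toward category 0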
Based on the above research work, the application of adaptation technology in real deep learning based speech recognition systems is introduced. For speech input method scenarios, the dissertation designs the architecture of acoustic model adaptation in the speech recognition cloud service, including an acoustic model adaptive training service module and an acoustic model adaptive decoding module. At the same time, the dissertation designs a "self-repairing" rapid adaptation function for the language model, so that the system can quickly learn and improve after a speech recognition error is corrected by the user.
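The "self-repairing" function builds on the user-correction-driven N-gram adaptation described earlier: keywords mined from user corrections receive a score boost during later decoding. The following is a minimal sketch of that boosting idea; the log-domain bonus, its value, and the names (boosted_lm_logprob, user_keywords) are assumptions for illustration, not the system's actual implementation.

    import math

    def boosted_lm_logprob(word, base_logprob, user_keywords, bonus=math.log(5.0)):
        """Minimal sketch: add a log-domain bonus to the N-gram score of words
        mined from user corrections, leaving other words untouched.
        A moderate bonus helps keep the false-trigger rate low."""
        if word in user_keywords:
            return base_logprob + bonus
        return base_logprob

    # Example (hypothetical): the user corrected an output to the phrase "xun fei",
    # so that keyword is boosted on later utterances.
    user_keywords = {"xun fei"}
    print(boosted_lm_logprob("xun fei", -8.2, user_keywords))  # boosted score
    print(boosted_lm_logprob("hello", -3.1, user_keywords))    # unchanged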
Keywords/Search Tags:speech recognition, deep learning, acoustic model, language model, adaptation, unsupervised, attention mechanism, meta-learning