
Research On Efficient Chinese Speech Recognition For Data Scarcity Scenarios

Posted on: 2023-08-29    Degree: Master    Type: Thesis
Country: China    Candidate: H Xu    Full Text: PDF
GTID: 2568306911985959    Subject: Computer Science and Technology
Abstract/Summary:
Automatic speech recognition (ASR) is one of the key technologies in artificial intelligence for realizing human-computer interaction. It enables computers to understand human language, simplifies the process of human-computer interaction, and makes information transfer between humans and machines more efficient. With the development of IoT technology and the popularity of smart devices, ASR has been widely applied in many areas of production and daily life. The rapid development of deep learning has greatly improved the accuracy of ASR, and ASR systems based on deep neural networks can be divided into hybrid models built on Hidden Markov Models and independent end-to-end models. Compared with hybrid models, end-to-end models offer a simplified model structure, joint training, direct output, and no need for forced alignment of data. However, building a highly accurate ASR system in low-resource scenarios remains challenging due to the difficulty of data collection and the lack of linguistic knowledge. To address the limited performance of end-to-end models under low-resource conditions, this thesis starts from both text and speech data and proposes algorithms to improve Chinese end-to-end ASR performance. The main work of this thesis is summarized as follows.

(1) End-to-end speech recognition uses a single network to jointly train the acoustic model and the language model, so its language modeling capability is limited by the smaller amount of paired "speech-text" data. To address this problem, this thesis proposes a Decoupling Recognition and Transcription (DRT) framework for Chinese end-to-end ASR, which decomposes speech recognition into two subtasks, recognition and transcription, by explicitly separating the acoustic model and the language model. In the recognition task, an end-to-end acoustic model is constructed using Wav2vec 2.0 and connectionist temporal classification (CTC). In the transcription task, a Pinyin-Character Mixture Language Model (PCMLM) based on self-supervised learning is proposed. The framework uses both Pinyin and Chinese characters as modeling units: it first recognizes the speech as a mixed sequence of Pinyin and Chinese characters with the acoustic model, and then transcribes the Pinyin in the mixed sequence into Chinese characters with the PCMLM to obtain the final result. By independently optimizing the language model on large-scale external text data, the proposed method achieves a word error rate of 3.4% on a public speech recognition corpus, outperforming other algorithms.

(2) To address the domain shift problem in low-resource scenarios, this thesis proposes a Gradual Self-Training (GST) algorithm based on semi-supervised learning. GST is an iterative self-training method that makes full use of large-scale unlabeled speech data through pseudo-labeling, avoiding the performance degradation that ASR systems suffer across domains due to factors such as accent or noise. It also uses the PCMLM to improve and evaluate the quality of the pseudo-labels generated in each iteration. During iteration, training samples are selected in order of pseudo-label confidence from high to low, which implements an easy-to-hard training process and makes the cross-domain adaptation of the acoustic model smoother. As a result, GST enables the model to adapt to a very different target domain compared with other self-training methods. Based on these two points, the proposed semi-supervised GST algorithm can be applied to cross-domain transfer tasks with a large domain shift.
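As a rough illustration of the two-stage decoding in the DRT framework of (1), the following Python sketch assumes a pretrained CTC acoustic model loaded via the Hugging Face transformers library (the model name and its mixed Pinyin/character vocabulary are placeholders, not the thesis's trained model) and a hypothetical `pcmlm` object whose `transcribe` method maps remaining Pinyin tokens to Chinese characters:

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Stage 1 (recognition): a Wav2vec 2.0 + CTC acoustic model.
# NOTE: the checkpoint name and the mixed Pinyin/character vocabulary
# are illustrative assumptions, not the thesis's actual model.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

def recognize(waveform, sample_rate=16000):
    """Decode speech into a mixed sequence of Pinyin and Chinese characters."""
    inputs = processor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)        # greedy CTC decoding
    return processor.batch_decode(ids)[0]     # e.g. "wo3 们 zai4 学习"

def transcribe(mixed_sequence, pcmlm):
    """Stage 2 (transcription): the PCMLM converts the remaining Pinyin
    tokens into Chinese characters using context. `pcmlm` is hypothetical."""
    return pcmlm.transcribe(mixed_sequence)   # e.g. "我们在学习"
```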
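The gradual self-training loop of (2) can be sketched as below; the confidence-threshold schedule and the helpers `decode_with_confidence`, `pcmlm_rescore`, and `train` are illustrative stand-ins for the thesis's components, not its exact procedure:

```python
def gradual_self_training(model, labeled_data, unlabeled_audio,
                          thresholds=(0.9, 0.8, 0.7, 0.6)):
    """Iterative self-training sketch: admit pseudo-labels from high to
    low confidence so the acoustic model adapts easy-to-hard across the
    domain gap. The threshold schedule and helpers are assumptions."""
    train_set = list(labeled_data)
    for tau in thresholds:                        # one round per threshold
        pseudo = []
        for audio in unlabeled_audio:
            text, conf = model.decode_with_confidence(audio)
            text = pcmlm_rescore(text)            # PCMLM improves pseudo-labels
            if conf >= tau:                       # keep only confident samples
                pseudo.append((audio, text))
        model = train(model, train_set + pseudo)  # retrain on the enlarged set
    return model
```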
(3) To address the high response latency, large memory footprint, and deployment difficulties of large networks, this thesis proposes a model compression method based on knowledge distillation to reduce the parameters of both the acoustic model and the language model. The method mainly reduces the number of Transformer layers in the model and introduces a variety of losses to ensure better knowledge transfer from the large teacher model to the small student model. Experimental results demonstrate that the proposed method significantly compresses the model size and improves prediction speed while maintaining high accuracy.
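A minimal sketch of a layer-reduced distillation objective in the spirit of (3), assuming a softened-output KL term plus a hidden-state matching term; the loss weights, temperature, and layer mapping are assumptions rather than the thesis's exact recipe:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      temperature=2.0, alpha=0.5):
    """Combine two losses for teacher-to-student knowledge transfer:
    (a) KL divergence between temperature-softened output distributions,
    (b) MSE between hidden states, subsampling teacher layers to match
    the shallower student. Weights and mapping are illustrative."""
    # (a) soft-label loss on softened distributions
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # (b) hidden-state loss: map every k-th teacher layer to a student layer
    k = len(teacher_hidden) // len(student_hidden)
    mse = sum(F.mse_loss(s, t)
              for s, t in zip(student_hidden, teacher_hidden[k - 1::k]))
    return alpha * kl + (1 - alpha) * mse
```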
Keywords/Search Tags: ASR, Semi-Supervised Learning, Self-Supervised Learning, Self-Training, Knowledge Distillation