
Research and Implementation of an End-to-End Long-term Speech Recognition Model Based on RNN-Transducer

Posted on: 2022-04-01
Degree: Master
Type: Thesis
Country: China
Candidate: Z R Li
Full Text: PDF
GTID: 2518306338470544
Subject: Electronic Science and Technology
Abstract/Summary:
With the rapid development of computer science and technology, the demand for free human-computer interaction is increasing. As one of the key technologies for intelligent human-computer interaction, automatic speech recognition has quickly become a research hotspot. Driven by deep learning, end-to-end speech recognition systems are gradually outperforming traditional algorithms while reducing the complexity of the recognition pipeline. However, end-to-end speech recognition still faces several problems: (1) the language modeling capability of end-to-end models is insufficient; (2) the models generalize poorly and lack robustness on long-term speech; (3) the models have a large number of parameters and high time and space complexity. To address these problems, this paper studies long-term speech recognition and speech model compression. The main work is as follows:

1. RNN-Transducer model with a fused language model. End-to-end speech recognition models cannot effectively integrate a language model for joint optimization, and their language modeling ability is insufficient. To address this, the paper proposes an RNN-Transducer model with a fused language model. First, a language modeling auxiliary task is added to the RNN-Transducer prediction network, and multi-task joint optimization is used to assist model training. Then, knowledge distillation is used to transfer external linguistic knowledge to the language model in the prediction network, so that the language model is integrated into the RNN-Transducer during training and the language modeling ability of the model improves further. Experiments demonstrate that the proposed algorithm learns text information better while preserving end-to-end training, and reduces the character error rate (CER) by about 1%.

2. Long-term speech recognition algorithm. To address the poor robustness of the model in long-term speech recognition scenarios, the paper proposes a recognition algorithm for long-term audio. First, a cross-sentence context module is proposed to retain the semantic information of the historical context across sentences, so that the model can learn conversation-level context and perform better on long-term speech. Then, a training method that initializes the hidden layer state is used to simulate long-term speech during training, which improves recognition accuracy. Experiments show that the proposed algorithm achieves excellent recognition accuracy on synthesized long-term speech data, and the difference in CER between the short-utterance and long-utterance test sets does not exceed 1.00%. This effectively improves the generalization ability and robustness of the model in long-audio recognition scenarios.

3. Mutual-learning sequence-level knowledge distillation. To address the large number of parameters and high computational complexity of speech recognition models, the paper proposes mutual-learning sequence-level knowledge distillation for model compression. Building on knowledge distillation, the method lets multiple student models with different structures learn from one another. This introduces diversity among the models and exploits their structural differences so that they complement each other, which transfers richer and more accurate information from the teacher model to the student models and further improves student performance. Experiments show that the proposed algorithm effectively reduces the number of model parameters and the computational complexity while maintaining speech recognition performance, achieving a good balance between the two.

In summary, this paper proposes a feasible, robust, and fast speech recognition method that effectively alleviates three problems of end-to-end models: insufficient language modeling ability, poor robustness and generalization in long-term speech recognition, and a large number of parameters with high computational complexity. Finally, based on this research, a speech recognition demonstration system is designed and implemented.
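The abstract does not give the training objective for the first contribution, so the following is only a minimal sketch of how a transducer loss, an auxiliary language modeling loss on the prediction network, and a distillation term from an external language model might be combined. The function names, the weights `lm_weight` and `kd_weight`, and the use of KL divergence for distillation are all assumptions made for illustration, not the thesis's actual formulation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Auxiliary LM loss: negative log-probability of the true next token."""
    return -math.log(softmax(logits)[target_index])


def kl_divergence(teacher_logits, student_logits, temperature=1.0):
    """Distillation term: KL(teacher || student) on softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def joint_loss(transducer_loss, lm_logits, next_token, teacher_logits,
               lm_weight=0.1, kd_weight=0.1, temperature=1.0):
    """Hypothetical multi-task objective (weights are assumed, not from the thesis).

    transducer_loss: the RNN-Transducer loss for the utterance, computed elsewhere.
    lm_logits:       next-token logits from the prediction network's LM head.
    teacher_logits:  next-token logits from an external language model.
    """
    lm_loss = cross_entropy(lm_logits, next_token)
    kd_loss = kl_divergence(teacher_logits, lm_logits, temperature)
    return transducer_loss + lm_weight * lm_loss + kd_weight * kd_loss


if __name__ == "__main__":
    loss = joint_loss(
        transducer_loss=2.5,
        lm_logits=[1.0, 0.2, -0.5],
        next_token=0,
        teacher_logits=[2.0, 0.1, -1.0],
    )
    print(f"joint loss: {loss:.4f}")
```

In this sketch the distillation term vanishes when the prediction network's distribution matches the external language model, so the extra terms only push the prediction network toward better next-token predictions and leave the transducer loss itself unchanged.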
Keywords/Search Tags:automatic speech recognition, end-to-end model, connectionist temporal classification, rnn-transducer, knowledge distillation