Research And Implementation Of End-to-End Long-term Speech Recognition Model Based On RNN-Transducer

Posted on:2022-04-01

Degree:Master

Type:Thesis

Country:China

Candidate:Z R Li

GTID:2518306338470544

Subject:Electronic Science and Technology

Abstract/Summary:

With the rapid development of computer science and technology, the demand for natural human-computer interaction is increasing. As one of the key technologies for intelligent human-computer interaction, automatic speech recognition has quickly become a research hotspot. Driven by deep learning, end-to-end speech recognition systems are gradually outperforming traditional algorithms while simplifying the recognition pipeline. However, end-to-end speech recognition still faces several problems: (1) end-to-end models have insufficient language modeling capability; (2) the models generalize poorly and lack robustness on long-term speech; (3) the models have a large number of parameters and high time and space complexity. To address these problems, this thesis studies long-term speech recognition and speech model compression. The main work is as follows:

1. RNN-Transducer model with a fused language model. End-to-end speech recognition models cannot effectively integrate a language model for joint optimization, and their language modeling ability is insufficient. To address this, the thesis proposes an RNN-Transducer model with a fused language model. First, a language modeling auxiliary task is added to the RNN-Transducer prediction network, and multi-task joint optimization is used to aid model training. Then, knowledge distillation is used to transfer external linguistic knowledge to the language model in the prediction network, so that the language model is integrated into the RNN-Transducer during training and the language modeling ability of the model improves further. Experiments demonstrate that the proposed algorithm learns text information better while preserving end-to-end training, and reduces the character error rate (CER) by about 1%.

2. Long-term speech recognition algorithm. To address poor model robustness in long-term speech recognition scenarios, the thesis proposes a long-term audio speech recognition algorithm. First, a cross-sentence context module is proposed to retain the semantic information of the historical context across sentences, so that the model can better learn conversation-level context and improve long-term speech recognition performance. Then, a training method that initializes the hidden layer state is used to simulate long-term speech during training, which improves recognition accuracy. Experiments show that the proposed algorithm achieves excellent recognition accuracy on synthesized long-term speech data, and the difference in CER between the short and long utterance test sets does not exceed 1.00%, which effectively improves the generalization ability and robustness of the model in long-term audio recognition scenarios.

3. Mutual-learning sequence-level knowledge distillation. To address the large number of parameters and high computational complexity of speech recognition models, the thesis proposes mutual-learning sequence-level knowledge distillation for model compression. Building on knowledge distillation, multiple student models with different structures learn from one another. This introduces diversity among the models and lets them exploit their structural differences to complement each other, so that richer and more accurate information is transferred from the teacher model to the student models and student performance improves further. Experiments show that the proposed algorithm effectively reduces the number of model parameters and the computational complexity while maintaining speech recognition performance, achieving a good balance between the two.

In summary, this thesis proposes a feasible, robust, and fast speech recognition method that effectively alleviates three problems of end-to-end models: insufficient language modeling ability, poor robustness and generalization in long-term speech recognition, and a large number of parameters with high computational complexity. Finally, based on this research, a speech recognition demonstration system is designed and implemented.
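
The first contribution combines three training signals: the transducer loss, a language modeling auxiliary task on the prediction network, and distillation from an external language model. The abstract does not give the exact objective, so the following is only a minimal sketch of one common way such a joint loss can be written for a single prediction step. The weights `lambda_lm` and `lambda_kd`, the `temperature`, and all function names are assumptions made for illustration, and the transducer loss is passed in as a precomputed number rather than derived here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target token under softmax(logits)."""
    return -math.log(softmax(logits)[target_index])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def joint_loss(transducer_loss, pred_logits, teacher_logits, next_token,
               lambda_lm=0.3, lambda_kd=0.3, temperature=2.0):
    """Hypothetical joint objective for one prediction-network step.

    transducer_loss: precomputed RNN-Transducer loss (assumed given).
    pred_logits:     next-token logits from the prediction network.
    teacher_logits:  next-token logits from an external language model.
    next_token:      index of the ground-truth next token.
    """
    # Auxiliary language modeling task on the prediction network.
    lm_loss = cross_entropy(pred_logits, next_token)
    # Distillation: match the temperature-softened teacher distribution.
    teacher = softmax(teacher_logits, temperature)
    student = softmax(pred_logits, temperature)
    kd_loss = kl_divergence(teacher, student)
    return transducer_loss + lambda_lm * lm_loss + lambda_kd * kd_loss


if __name__ == "__main__":
    loss = joint_loss(
        transducer_loss=1.2,
        pred_logits=[2.0, 0.5, -1.0],
        teacher_logits=[2.5, 0.2, -0.8],
        next_token=0,
    )
    print(f"joint loss: {loss:.4f}")
```

In this sketch the distillation term vanishes when the prediction network already matches the teacher, so the weights only control how strongly the two auxiliary signals pull on the prediction network relative to the transducer loss.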

Keywords/Search Tags:

automatic speech recognition, end-to-end model, connectionist temporal classification, rnn-transducer, knowledge distillation

Related items

1	Research On Connectionist Temporal Classification In Speech Recognition
2	Chinese Speech Recognition System Based On CLDNN Hybrid Model
3	Design Of End-to-end Amdo Tibetan Speech Recognition System Based On Deep Learning
4	Research On End-to-End Speech Recognition Method Based On Self-Attention Mechanism
5	Knowledge Distillation For Speech-assisted Lip Reading
6	ASR Research Based On CTC
7	Research And Application Of Deep Learning Based Continuous Speech Recognition
8	Research On End-to-End Simultaneous Speech Translation Based Transformer Transducer
9	The Design And FPGA Verification Of End-to-end Mandarin Speech Recognition Based On CNN
10	Research On Speech Emotion Recognition Algorithm Based On Deep Learning