
Research In End-to-end Automatic Speech Recognition Technology

Posted on: 2020-07-09
Degree: Master
Type: Thesis
Country: China
Candidate: Z F Jiang
Full Text: PDF
GTID: 2428330572473547
Subject: Computer technology
Abstract/Summary:
Language is the most important channel of interpersonal communication, and it will inevitably become an important bridge for human-computer interaction and integration. Automatic speech recognition (ASR) is the process by which a computer transcribes human speech signals into written text. Since the 1970s, ASR has been an important research topic in the machine learning community. Traditional ASR technology is still in use, but its recognition pipeline is cumbersome and difficult to optimize. End-to-end models based on deep neural networks address these problems and have become a research hotspot and development direction in the field. In this context, this thesis studies end-to-end speech recognition technology within a deep neural network framework.

Firstly, this thesis analyzes two current end-to-end ASR models: CTC and the attention-based model. It points out two problems with the status quo: 1. CTC assumes that output units are independent of one another, which is unreasonable for speech recognition, where outputs depend closely on context; 2. the attention mechanism allows irregular alignment between input and output, whereas in speech recognition the alignment between input and output is usually strictly monotonic. Therefore, this thesis proposes an end-to-end ASR model that combines CTC and the attention mechanism, and validates the improvement it brings on a recognition task using the open-source English speech dataset Librispeech.

Secondly, a novel end-to-end speech recognition model with an encoder-decoder structure is proposed, named Recurrent neural network Adaptive Mapping (RAM). RAM treats speech recognition as a sequence-to-sequence mapping problem and trains on pairs of input and target sequences end to end. It introduces a "blank label" to achieve adaptive alignment of input and output, which is similar to CTC, but it does not make the output independence assumption. The probability of the label sequence is then calculated by marginalizing over all possible blank labels. Experimental results on the Librispeech speech recognition task show that, without an additional language model, the RAM-based recognition system is competitive with other end-to-end models.

Finally, the proposed RAM model is adapted to the characteristics of Mandarin speech signals, and its validity for Mandarin speech recognition is verified on the open-source Chinese dataset AISHELL-1. In addition, a transfer learning method is introduced: a model pre-trained on a large English dataset serves as a prior model and is then transferred to Mandarin recognition, which both makes training more efficient and improves performance.
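To make the "blank label" and marginalization ideas concrete, the following is a minimal sketch in plain Python of the standard CTC forward algorithm, which sums the probability of every frame-level alignment (with blanks inserted) that collapses to a given label sequence. It illustrates ordinary CTC, which keeps the output independence assumption; it is not the RAM model, whose details the abstract does not give. The function names, the `blank=0` convention, and the interpolation weight `ctc_weight` are assumptions made for illustration. The weighted sum in `joint_loss` is one common way to combine a CTC objective with an attention objective and may differ from the combination used in the thesis.

```python
import math


def log_sum_exp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf entries."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_log_likelihood(log_probs, target, blank=0):
    """Log-probability of `target` under standard CTC.

    log_probs[t][k] is the log-probability of symbol k at frame t.
    target is a list of label ids that contains no blanks.
    """
    # Extended sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in target:
        ext += [label, blank]
    num_frames, num_states = len(log_probs), len(ext)

    # alpha[s]: log-probability of all partial alignments ending in state s
    alpha = [-math.inf] * num_states
    alpha[0] = log_probs[0][blank]
    if num_states > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, num_frames):
        new_alpha = [-math.inf] * num_states
        for s in range(num_states):
            candidates = [alpha[s]]              # stay in the same state
            if s >= 1:
                candidates.append(alpha[s - 1])  # advance by one state
            # Skip over a blank only between two different labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])
            new_alpha[s] = log_sum_exp(*candidates) + log_probs[t][ext[s]]
        alpha = new_alpha

    # A valid alignment ends on the last label or on the trailing blank
    if num_states == 1:
        return alpha[0]
    return log_sum_exp(alpha[-1], alpha[-2])


def joint_loss(ctc_nll, attention_nll, ctc_weight=0.3):
    """Assumed multi-task objective: weighted sum of the two losses."""
    return ctc_weight * ctc_nll + (1.0 - ctc_weight) * attention_nll
```

As a check on the marginalization, take two frames with uniform probability 0.5 over {blank, a} and the target "a". Three alignments collapse to "a" (a-blank, blank-a, a-a), each with probability 0.25, so `ctc_log_likelihood` returns log 0.75.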
Keywords/Search Tags:ASR, deep learning, end-to-end, recurrent neural network, encoder-decoder network