
Research On Chinese Speech Recognition Technology Based On BPE And Transformer

Posted on:2020-02-06
Degree:Master
Type:Thesis
Country:China
Candidate:X Y Luan
GTID:2428330590474474
Subject:Cyberspace security
Abstract/Summary:
Speech recognition is a precondition for human-computer voice interaction, and it is receiving increasing attention from researchers. End-to-end acoustic modeling based on Connectionist Temporal Classification (CTC) has become one of the mainstream methods, but choosing the basic output unit for CTC prediction remains a design challenge. Recognition units are generally chosen on the basis of phonetic knowledge, but they can also be generated in a data-driven manner. Units determined in the latter way may have no clear phonetic meaning, yet they can still achieve good performance. In addition, speech recognition systems usually include a language model, and traditional approaches often use an n-gram language model. With the development of deep learning, finding an optimized strategy or network structure that improves the language model is also of considerable research value.

In this context, this thesis explores acoustic modeling and language modeling techniques for automatic speech recognition. On the one hand, it proposes a new set of modeling units in combination with CTC theory; on the other hand, it explores a new neural network structure for the language model, with the aim of improving overall speech recognition performance.

First, this thesis applies the idea of the Byte Pair Encoding (BPE) algorithm to the acoustic model, improving recognition performance by selecting more suitable recognition units. Because CTC does not require a label for each frame of the input speech signal, a CTC acoustic model can select output units larger than phonemes, such as vowels and syllables. The BPE algorithm iteratively merges the most frequently occurring pair of adjacent elements in the text and adds each merged element to the set of sub-word units; in this way it automatically learns a suitable set of recognition units and a way to decompose the target sequence.

In addition, this thesis uses a Transformer network to decode the syllable sequence output by the acoustic model into text. Compared with the n-gram model, the Transformer network is better able to capture long-distance dependencies within a sentence, so it can make fuller use of context information and has a greater advantage in syllable-to-text conversion. Experimental comparison shows that the system with the improved language model performs better. Moreover, compared with a Recurrent Neural Network (RNN), the Transformer allows greater computational parallelism, which makes it well suited to language modeling tasks. Combining BPE-based acoustic modeling with Transformer-based language modeling significantly improves Chinese speech recognition accuracy.
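
The BPE merging step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `merge_pair` and `learn_units`, the toy pinyin-like corpus, the `+` joiner for merged units, and the rule of stopping once no pair occurs at least twice are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one merged unit."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + "+" + seq[i + 1])  # "+" joiner is an assumption
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_units(sequences, num_merges):
    """Learn up to `num_merges` merge rules by repeatedly merging the most
    frequent adjacent pair. Returns (merges, segmented corpus, unit set)."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in corpus:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        # Most frequent pair; ties go to the pair that was seen first.
        best = max(pair_counts, key=pair_counts.get)
        if pair_counts[best] < 2:  # assumed stopping rule: nothing worth merging
            break
        merges.append(best)
        corpus = [merge_pair(seq, best) for seq in corpus]
    units = {unit for seq in corpus for unit in seq}
    return merges, corpus, units


# Hypothetical toy corpus of syllable sequences (not data from the thesis).
toy_corpus = [
    ["ni", "hao"],
    ["ni", "hao", "ma"],
    ["ni", "hao", "a"],
    ["hao", "ma"],
]
merges, segmented, units = learn_units(toy_corpus, num_merges=2)
print(merges)
print(segmented)
```

On this toy corpus only the pair ("ni", "hao") is frequent enough to merge, so the learned unit set mixes the merged unit with the remaining base syllables. In a real system the number of merges would be a tuned hyperparameter that controls the size of the CTC output layer.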
Keywords/Search Tags:speech recognition, BPE, CTC, Transformer