Research On End-to-End Speech Recognition Methods Based On Language Model

Posted on: 2021-02-06; Degree: Master; Type: Thesis
Country: China; Candidate: K R Lv
GTID: 2428330626958948; Subject: Software engineering
Abstract/Summary:
In recent years, with the rapid development of artificial intelligence technology, more and more voice interaction products and services have entered daily life, serving millions of households in smarter ways. As one of the key technologies for realizing and improving intelligent human-computer interaction, speech recognition has been a research hotspot for several decades. Before the rise of deep learning, Gaussian mixture models and Hidden Markov Models were widely used as effective acoustic models. However, a traditional speech recognition system is composed of multiple modules, which makes unified optimization of the whole model difficult and the workflow cumbersome. In the era of big data, these traditional techniques can no longer meet the demand for more efficient speech recognition systems and intelligent interaction.

With the development of deep learning, end-to-end models based on deep neural networks have gradually become a new research trend. End-to-end speech recognition reduces the entire recognition system to a single network architecture that takes audio as input and produces text labels as output. This greatly simplifies the construction of speech recognition systems, reduces the information lost in passing data between model components, and improves overall recognition performance, so it has become a research focus in the field. This thesis analyzes the two current mainstream end-to-end speech recognition models and, to address their shortcomings, proposes an improved method that fuses a language model into them. The main work of the thesis is as follows:

(1) To address the shortcomings of traditional speech recognition models, an end-to-end speech recognition model based on deep neural networks, DCNN-BGRU-CTC, was implemented. The model design borrows from the VGGNet network structure, which works well in image recognition. Two-dimensional convolutions extract features directly from the speech spectrogram, which to some extent alleviates the partial loss of feature information caused by the heavy reliance on empirical design in traditional acoustic feature extraction. Several consecutive small convolution kernels replace each larger kernel, which reduces the number of model parameters and increases the expressive power of the CNN, helping it extract richer and more discriminative features. Experiments on an open-source speech dataset validate the effectiveness of the model.

(2) End-to-end models that use Chinese characters as acoustic modeling units converge slowly during training and lack language modeling ability. To address this, the thesis proposes using a smaller acoustic modeling unit, pinyin with tone, and adding a language model based on an improved Transformer for decoding. The experimental results show that these changes improve the recognition performance of the acoustic model and of the language model individually.

(3) It is difficult to integrate a language model into the training process of an acoustic model, and the two cannot easily be optimized jointly. To address this, a new end-to-end speech recognition algorithm that incorporates a language model is proposed, so that the language model participates in both the training and the testing phases of the acoustic model and corrects, to a certain extent, the errors produced by the CTC-based speech recognition system. After a matrix operation, the CTC output is used as the input for training the language model, which makes the system truly end-to-end.

(4) To simplify follow-up work and make the method easy for others to use, the proposed end-to-end speech recognition algorithm with the fused language model was implemented in a streamlined form, and a Web site based on the Django framework was built that supports both offline and online recognition of voice files. The site was also used to test the practicability of the proposed algorithm.
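As a rough illustration of what "correcting the errors generated by the CTC-based system" starts from, the sketch below shows standard greedy (best-path) CTC decoding over tonal-pinyin units: take the best label per frame, merge consecutive repeats, then drop the blank symbol. This is not the thesis's implementation. The vocabulary, the frame scores, the blank index and the function name `ctc_greedy_decode` are all assumptions made up for the example.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


def ctc_greedy_decode(frame_scores, vocab, blank=BLANK):
    """Greedy CTC decoding: per-frame argmax, merge repeats, drop blanks."""
    # 1. Pick the highest-scoring label index for every frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    # 2. Merge runs of identical consecutive labels into one label.
    collapsed = [label for label, _ in groupby(best)]
    # 3. Remove blanks and map the remaining indices to modeling units.
    return [vocab[label] for label in collapsed if label != blank]


# Toy vocabulary of tonal-pinyin units (illustrative only).
vocab = ["<blank>", "ni3", "hao3"]

# Made-up per-frame scores over the vocabulary, one row per audio frame.
frame_scores = [
    [0.10, 0.80, 0.10],  # ni3
    [0.20, 0.70, 0.10],  # ni3 again (merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # hao3
    [0.60, 0.10, 0.30],  # blank
]

print(ctc_greedy_decode(frame_scores, vocab))
```

The resulting pinyin sequence is the kind of output that a separate language model would then convert into Chinese characters; how the thesis feeds the CTC output into its Transformer-based language model is not specified in the abstract and is not reproduced here.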
Keywords/Search Tags:Speech recognition, CTC, Language model, VGGNet, GRU