Research On End-to-end Speech Recognition Based On Convolutional Neural Networks

Posted on: 2022-01-10
Degree: Master
Type: Thesis
Country: China
Candidate: Y K Zhang
GTID: 2518306563962049
Subject: Electronics and Communications Engineering
Abstract/Summary:
In recent years, with the rapid development of deep learning theory, many technologies have been successfully applied to the field of speech recognition. As a key technology of deep learning, the Convolutional Neural Network (CNN) has achieved good performance in building speech recognition systems by virtue of its weight sharing and local connectivity. Furthermore, the emergence of end-to-end speech recognition addresses the cumbersome pipeline and inconsistent optimization of traditional methods, and further improves the application potential of CNNs in speech recognition. However, combining a CNN with the end-to-end mechanism raises some problems: existing input features do not meet actual needs, and the traditional CNN processing method weakens the independent information of the speech signal in the time domain and the frequency domain. Aiming to improve the end-to-end speech recognition performance of CNNs, this thesis studies the input features and the front-end processing network of the acoustic model, and mainly completes the following work:

(1) Research on the structure of the end-to-end acoustic model based on CNN. This thesis focuses on an end-to-end acoustic model implemented with the Connectionist Temporal Classification (CTC) framework. With a CNN as the input network, the traditional acoustic feature FBank is organized into a form suitable for CNN input. Because FBank features are highly compressed, three CNN models based on shallow pooling, middle pooling, and deep pooling are designed. The experimental results show that the deep pooling model performs best, with an error rate of 28.14%, which is 4.83% lower than that of the shallow pooling method.

(2) Research on the input features of end-to-end acoustic models based on CNN. Because traditional features rely too heavily on prior knowledge, they can lose frequency-domain information and cannot give full play to the feature extraction ability of a CNN in the end-to-end framework. This thesis introduces the spectrogram feature, which contains almost all the frequency-domain information of the speech signal, and applies it to each of the three network models. The experimental results show that the spectrogram feature works best with the middle pooling model, where the error rate reaches 27.52%, which is 2.20% lower than the best result with the FBank feature.

(3) Research on how a CNN processes speech feature maps. Since the traditional CNN processing method weakens the independent information of the speech signal in the time and frequency domains, this thesis proposes a phased processing scheme in the time-frequency domain, which not only retains the one-dimensional characteristics of each speech frame but also takes into account the context information between frames. The scheme is implemented with one-dimensional CNNs and, according to the processing order, is divided into a time-frequency processing method and a frequency-time processing method. The experimental results show that the frequency-time processing method is more suitable, with an error rate of 25.92%, which is 5.77% lower than the best result using the traditional CNN processing method.
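The phased frequency-time processing idea can be illustrated with a minimal pure-Python sketch. It is not the thesis's model: the function names (`conv1d_valid`, `frequency_then_time`), the kernels, and the toy spectrogram are all hypothetical, and a real network would use learned multi-channel kernels with nonlinearities and pooling between stages. The sketch only shows the order of operations: a one-dimensional filter is first applied along the frequency axis inside each frame, then a second one-dimensional filter is applied along the time axis across frames.

```python
def conv1d_valid(signal, kernel):
    """Slide `kernel` over `signal` without padding (cross-correlation)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def frequency_then_time(spectrogram, freq_kernel, time_kernel):
    """Apply a 1-D filter along frequency, then a 1-D filter along time.

    `spectrogram` is a list of frames; each frame is a list of
    frequency-bin values (hypothetical toy layout).
    """
    # Stage 1: filter along the frequency axis, one frame at a time,
    # so each frame is treated as an independent 1-D signal.
    freq_out = [conv1d_valid(frame, freq_kernel) for frame in spectrogram]

    # Stage 2: for every remaining frequency channel, filter across
    # frames to bring in context from neighbouring frames.
    n_bins = len(freq_out[0])
    columns = [[frame[b] for frame in freq_out] for b in range(n_bins)]
    time_out = [conv1d_valid(col, time_kernel) for col in columns]

    # Transpose back to the frames-by-bins layout.
    n_frames = len(time_out[0])
    return [[col[t] for col in time_out] for t in range(n_frames)]


# Toy input: 3 frames, 4 frequency bins each (made-up values).
spectrogram = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 3.0, 4.0, 5.0],
    [3.0, 4.0, 5.0, 6.0],
]
features = frequency_then_time(
    spectrogram,
    freq_kernel=[0.5, 0.5],   # averages adjacent frequency bins
    time_kernel=[1.0, -1.0],  # difference between adjacent frames
)
print(features)
```

Because both stages here are purely linear, swapping their order would give the same output in this toy case. The processing order only changes the result once nonlinear activations or pooling sit between the two stages, as they would in an actual CNN.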
Keywords/Search Tags: Convolutional Neural Network, End-to-end speech recognition, Connectionist Temporal Classification, Spectrogram