With advances in artificial intelligence research and the continuing accumulation of large speech corpora, speech recognition has developed rapidly. Neural networks have been applied extensively to speech recognition, and end-to-end speech recognition has recently become a hot topic in artificial intelligence research. However, owing to the complexity of real application scenarios and of speakers' pronunciation characteristics, end-to-end speech recognition models for Chinese achieve relatively low accuracy. To address these problems, we take Chinese pronunciation characteristics into consideration and optimize the structure of the current mainstream end-to-end speech recognition model, with the aim of improving both the recognition performance and the training efficiency of the end-to-end framework for Chinese. First, we design a baseline experiment that combines a Gaussian Mixture Model (GMM)-Hidden Markov Model (HMM) acoustic model, a lexicon, and an N-gram language model. Because speech signals are sensitive to context, we take the preceding and following phonemes of the current phoneme into account when building the tri-phone acoustic model. Because speaking style differs between speakers, we also adopt speaker adaptation techniques in GMM-HMM modeling to increase the recognition accuracy of the baseline. Then, to address the low accuracy of the end-to-end framework on Chinese, we adopt a partially end-to-end structure and apply it to speech recognition based on Connectionist Temporal Classification (CTC). Because the LSTM-CTC end-to-end model has drawbacks such as high computational complexity and long training time, we propose an improved model, Projection Long Short-term Memory (PLSTM), to speed up model training. Because the long-term dependence in speech is not only in the forward direction, we use bidirectional Long Short-term Memory (Bi-LSTM) combined with CTC, instead of a unidirectional LSTM or RNN, which helps improve accuracy. Finally, we conduct our experiments on the AISHELL speech database and use speed-perturbed training data to avoid overfitting while training the Bi-LSTM. Compared with the baseline, both the accuracy and the speed of the model are significantly improved.
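To make the motivation for the projection layer concrete, the following is a minimal sketch of how a projection reduces the parameter count of a single LSTM layer. It is not the configuration used in this work: the function names and the sizes (40 input features, 1024 cells, 256 projected units) are hypothetical, and the count assumes one bias vector per gate and no peephole connections.

```python
def lstm_params(input_size: int, cell_size: int) -> int:
    """Parameters of one plain LSTM layer (assumed: one bias per gate, no peepholes)."""
    # Each of the 4 gates sees the input and the recurrent state (size = cell_size).
    per_gate = cell_size * (input_size + cell_size) + cell_size
    return 4 * per_gate


def lstmp_params(input_size: int, cell_size: int, proj_size: int) -> int:
    """Parameters of one LSTM layer with a recurrent projection (same assumptions)."""
    # The recurrent state fed back to the gates is the projected output (size = proj_size).
    per_gate = cell_size * (input_size + proj_size) + cell_size
    # One extra matrix maps the cell output down to proj_size.
    projection = cell_size * proj_size
    return 4 * per_gate + projection


if __name__ == "__main__":
    # Hypothetical sizes, chosen only for illustration.
    plain = lstm_params(40, 1024)
    projected = lstmp_params(40, 1024, 256)
    print(f"plain LSTM: {plain} parameters")
    print(f"projected LSTM: {projected} parameters")
    print(f"ratio: {projected / plain:.2f}")
```

Under these assumed sizes the projected layer has roughly a third of the parameters of the plain layer, because the dominant recurrent weight matrices shrink from cell_size x cell_size to cell_size x proj_size.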