| Voice interaction is a comparatively simple way of transmitting signals between humans and machines: people can operate a machine using everyday speech. Continued breakthroughs in speech recognition technology therefore let machines recognize human voices more accurately, which not only improves the relationship between humans and machines but also simplifies operations that were originally complex and makes operating the machine more efficient. At present, however, most intelligent voice products on the domestic market support Putonghua well, while recognition accuracy is unsatisfactory for users who speak only the Xiangyang dialect. To improve the accuracy of Xiangyang dialect speech recognition, this paper studies a human-computer interaction system for Xiangyang dialect speech recognition. The main work of this paper is as follows: (1) A Xiangyang dialect corpus is constructed, consisting mainly of common traffic navigation sentences and characteristic words used in the daily communication of Xiangyang people. Speech was collected from recruited volunteers. After format unification, segmentation, and data augmentation, 37,760 items of speech data were obtained, and each item was annotated with its text label. (2) GMM-HMM (Gaussian Mixture Model-Hidden Markov Model) and DNN-HMM (Deep Neural Network-Hidden Markov Model) acoustic models are constructed. First, the speech data are preprocessed and features are extracted; the corresponding text is then used to generate the language model, and the acoustic model is trained. In the experiments, as the acoustic model was successively optimized and retrained, the recognition accuracy of the GMM-HMM model improved steadily, reaching 86.72%. Finally, the DNN-HMM model is trained and tested on the basis of the GMM-HMM model. The results show that the DNN-HMM 
model achieves higher recognition accuracy than the GMM-HMM model on Xiangyang dialect speech, at 88.93%. (3) To reduce the complexity of the Xiangyang dialect speech recognition system, an end-to-end Xiangyang dialect speech recognition framework based on connectionist temporal classification (CTC) is designed. Chinese syllables, that is, Chinese pinyin, are used as modeling units. The acoustic model converts the raw speech signal into a phonetic (pinyin) sequence, which a language model then converts into Chinese characters. An acoustic model based on CNN-CTC was designed first and then improved by adding a Long Short-Term Memory (LSTM) network, which has strong capacity for modeling context, to obtain the CNN-LSTM-CTC acoustic model. The experimental results show that the CNN-LSTM-CTC acoustic model performs better and achieves higher accuracy in end-to-end speech recognition of the Xiangyang dialect. In addition, because no research on Xiangyang dialect speech recognition appears in the existing public literature, this paper is a preliminary study of the topic. Therefore, to evaluate the Xiangyang dialect speech recognition models constructed in this paper, the same Xiangyang dialect test set is also run through a Putonghua speech recognition system. Comparing the recognition accuracies verifies the effectiveness of the Xiangyang dialect speech recognition system. (4) On the basis of the above research, a human-computer interaction system for Xiangyang dialect speech recognition is built in Python. The system provides speech recognition and corpus collection functions, and testing shows that all of its functions work normally. |
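
To illustrate the CTC-based step in item (3), where the acoustic model emits a per-frame distribution over pinyin syllables that is then collapsed into a syllable sequence, the following is a minimal sketch of greedy CTC decoding. It is not the paper's implementation: the tiny vocabulary, the choice of index 0 as the blank symbol, and the frame probabilities are all hypothetical values chosen for illustration, and the abstract does not say which decoding strategy the system actually uses.

```python
from typing import List, Sequence

BLANK = 0  # assumption: index 0 is reserved for the CTC blank symbol


def ctc_greedy_decode(
    frame_probs: Sequence[Sequence[float]],
    vocab: Sequence[str],
    blank: int = BLANK,
) -> List[str]:
    """Collapse per-frame outputs into a label sequence (greedy CTC decoding).

    1. Take the most probable label in every frame (the best path).
    2. Merge consecutive repeats of the same label.
    3. Drop blank symbols.
    """
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]

    decoded: List[str] = []
    prev = None
    for idx in best_path:
        # A label is emitted only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != prev and idx != blank:
            decoded.append(vocab[idx])
        prev = idx
    return decoded


# Hypothetical pinyin vocabulary and acoustic-model output (7 frames).
VOCAB = ["<blank>", "xiang1", "yang2", "hua4"]
FRAME_PROBS = [
    [0.10, 0.80, 0.05, 0.05],  # xiang1
    [0.10, 0.70, 0.10, 0.10],  # xiang1 (repeat, merged)
    [0.90, 0.05, 0.03, 0.02],  # blank
    [0.10, 0.10, 0.70, 0.10],  # yang2
    [0.20, 0.10, 0.60, 0.10],  # yang2 (repeat, merged)
    [0.85, 0.05, 0.05, 0.05],  # blank
    [0.10, 0.05, 0.05, 0.80],  # hua4
]

if __name__ == "__main__":
    print(ctc_greedy_decode(FRAME_PROBS, VOCAB))
```

In a full pipeline, the resulting pinyin sequence would be passed to the language model for conversion into Chinese characters. Greedy decoding is the simplest option; beam search over the same frame probabilities is a common alternative.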