Research On Multi-accent Chinese Speech Recognition Approaches Based On Time Convolution Network

Posted on: 2021-02-20 | Degree: Master | Type: Thesis
Country: China | Candidate: J Zhang
GTID: 2518306497966839 | Subject: Software engineering
Abstract/Summary:
With the widespread application of computers and the continuous development of artificial intelligence, people hope that computers can understand human language and interact with them more naturally. Speech recognition has therefore become an important research topic in the speech field. However, when speakers with different accents (multi-accent speech) interact with intelligent speech devices, the variety of accents poses a challenge to the speech recognition system. With the continuous development of deep learning, mainstream neural networks such as convolutional neural networks and recurrent neural networks have achieved good results in accent recognition systems. However, convolutional neural networks perform worse than recurrent neural networks on sequential tasks, while recurrent neural networks are more difficult to train. This thesis therefore uses a temporal convolutional network to build the acoustic model for multi-accent speech, and uses accent sentence embedding and multi-task learning to improve the accuracy and generalization of the model. In the research and experiments of this thesis, the accents considered are mainly those of Beijing, Shanghai, Guangzhou and Chongqing.

The thesis focuses on feature extraction methods for multi-accent Chinese speech recognition and on the construction and optimization of multi-accent acoustic models based on temporal convolutional networks. The main research work is as follows:

(1) To optimize the speech input of the multi-accent acoustic model, the thesis uses a multiple kernel learning method to fuse the mel cepstrum coefficient feature with a supervised accent sentence embedding feature. The accent sentence embedding is obtained as a weighted average of speech frame embeddings. The speech frame embeddings are extracted following the idea of the continuous bag-of-words model, in which the target speech frame is predicted from its context. Experimental results on the aishell dataset show that, when the two types of features are extracted for the four accents (Beijing, Shanghai, Guangzhou and Chongqing) and used together as the input of the acoustic model, the average accuracy of accent recognition reaches 73.72%, which is 5.17% higher than the average accuracy obtained with the mel cepstrum coefficient as a single input feature. With the supervised accent sentence embedding as a single input feature, the average accuracy is 72.02%, which is 0.98% and 3.02% higher than that of the semi-supervised and unsupervised single-input features, respectively.

(2) To improve the ability of ordinary convolutional neural networks to handle sequence tasks, the thesis uses a temporal convolutional network to build a multi-accent acoustic model for the four accents (Beijing, Shanghai, Guangzhou and Chongqing). Building on the ordinary convolutional network, this network introduces causal convolution and dilated convolution to handle sequence modeling. Experimental results on the aishell dataset show that the average accuracy of the acoustic model based on the temporal convolutional network for the four accents reaches 76.45%, an average improvement of 4.65% over the deep neural network and hidden Markov model. On the aidatatang dataset, the average accuracy of the acoustic model reaches 75.11%, an average improvement of 4.23% over the deep neural network and hidden Markov model.

(3) To address the weak generalization ability of single-task accented speech recognition, the thesis uses a multi-task learning method: on top of the single accent recognition task, an accent classification task is added to classify the four accents (Beijing, Shanghai, Guangzhou and Chongqing), and the accuracy of accent recognition is improved through parameter sharing. The multi-accent classifier serves as the auxiliary task in the multi-task learning method, and different weight parameters are set for the main task and the auxiliary task. Experimental results show that the average accuracy of the accent classifier trained with the temporal convolutional network on the aishell dataset for the four accents reaches 84.26%, a relative improvement of 22.06%, 12.09% and 3.92% over the Gaussian mixture model classifier, the deep neural network classifier and the recurrent neural network classifier, respectively. On the aidatatang dataset, the average accuracy reaches 80.52%, a relative improvement of 24.76%, 14.15% and 3.87%, respectively.
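Point (2) above relies on causal and dilated convolution. As a rough illustration only, and not the thesis's implementation, the following minimal pure-Python sketch shows the two ideas on a one-dimensional sequence. The function names `causal_dilated_conv1d` and `receptive_field`, the zero-padding convention, and the toy inputs are assumptions made for this example.

```python
def causal_dilated_conv1d(x, weights, dilation=1):
    """Causal dilated 1-D convolution over a list of numbers (illustrative sketch).

    y[t] = sum_k weights[k] * x[t - k * dilation], with x treated as 0
    before t = 0, so y[t] never depends on samples later than t.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation  # look back k * dilation steps
            if idx >= 0:            # implicit zero padding on the left
                acc += w * x[idx]
        y.append(acc)
    return y


def receptive_field(kernel_size, dilations):
    """Number of input steps seen by one output of a stack of causal layers,
    assuming one convolution of the given kernel size per dilation."""
    return 1 + (kernel_size - 1) * sum(dilations)


if __name__ == "__main__":
    frames = [1.0, 2.0, 3.0, 4.0]  # toy stand-in for a feature sequence
    print(causal_dilated_conv1d(frames, [1.0, 1.0], dilation=2))
    print(receptive_field(2, [1, 2, 4]))
```

Stacking layers whose dilation grows (for example 1, 2, 4) enlarges the receptive field quickly without looking at future frames, which is the property that lets a convolutional model cover long speech contexts.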
Keywords/Search Tags: accent embedding, accent recognition, temporal convolutional network, multi-task learning