
Research On Multi-dimensional Speech Recognition Technology Based On Multi-task Neural Network

Posted on: 2021-04-26
Degree: Master
Type: Thesis
Country: China
Candidate: T Y Feng
Full Text: PDF
GTID: 2428330614465693
Subject: Signal and Information Processing
Abstract/Summary:
Since the beginning of the 21st century, information technology has developed rapidly. Under the tide of artificial intelligence, realizing simple, fast, and smooth human-machine interaction has become a central goal for researchers. Speech is an important channel of human-machine interaction, and speech recognition is the key technology of human-machine voice interaction. In recent years, researchers have devoted great effort to speech recognition and achieved rich results.

Speech signals in real environments are complex mixed signals that contain rich semantic information, speaker-related information (such as identity and emotion), and environmental information; this multi-dimensional information is the premise of smooth human communication. However, most current studies in speech recognition focus on a single task, such as speech content recognition, and can hardly recognize the multi-dimensional information contained in speech signals simultaneously, as humans do. Such single-dimensional speech recognition models ignore the human brain's ability to process multi-dimensional speech information and discard the correlations among the multi-dimensional information in speech signals, which hinders machines from understanding the true meaning of speech and cannot meet the requirements of intelligent human-computer interaction. Therefore, to make speech recognition technology more anthropomorphic and intelligent, our team proposes the simultaneous recognition of multi-dimensional information in speech signals, which realizes multiple speech recognition tasks by making full use of the rich multi-dimensional information in speech signals and the correlations among different recognition tasks. Building on our team's previous research, this thesis studies the simultaneous recognition of the speaker's gender, emotion, and identity from the aspects of classification model construction and feature extraction. The main work and contributions of this thesis are summarized as follows:

(1) By combining the multi-task learning (MTL) mechanism with the recurrent neural network (RNN) structure, this thesis makes full use of the rich multi-dimensional information in speech signals to build a multi-dimensional speech recognition model that simultaneously identifies the speaker's gender, emotion, and identity. The model adopts Mel-frequency cepstral coefficient (MFCC) features as recognition parameters and builds a multi-task neural network with attribute-dependent layers. It learns the common features shared among recognition tasks and the unique features of each task through the shared RNN layer and the fully connected attribute-dependent layers, respectively. In addition, it leverages the MTL mechanism to adjust the weight of each task's loss in the total loss function according to the characteristics of the speech database, and finally outputs the results of the three recognition tasks simultaneously. Experimental results on two speech databases show that the proposed MTL-RNN multi-dimensional speech recognition model achieves average recognition rates 3.01% and 5.09% higher than the single-dimensional speech recognition model, respectively, with significant improvement in all three recognition tasks. Moreover, the proposed model is robust to language and speaker personality factors and has a certain anti-noise capability. These results not only demonstrate the feasibility of multi-dimensional recognition but also show that the tasks are clearly correlated, so that multi-dimensional recognition is an important way to improve the performance of single-dimensional tasks.

(2) Since the MTL-RNN multi-dimensional speech recognition model adopts MFCC features, part of the speech information is removed by the filtering and transformation operations during feature extraction, whereas multi-dimensional speech recognition requires using as much of the multi-dimensional information in the speech signal as possible. Therefore, this thesis further combines the convolutional neural network (CNN) structure with a feature fusion method to improve feature extraction, and constructs a multi-dimensional speech recognition model based on CNN and feature fusion. Specifically, the features extracted by the CNN and the MFCC features are fused into complementary features that make full use of the multi-dimensional information in the speech signal. The fused features are then fed into the multi-task recurrent neural network classifier to recognize the speaker's identity, gender, and emotion. Experimental results on two speech databases show that the average recognition rate of the proposed model is 3.59% and 6.01% higher than that of the single-dimensional speech recognition model, respectively, and 0.85% and 0.99% higher than that of the MTL-RNN model, respectively. The model improves the recognition rate in all three tasks and has better anti-noise performance, which proves the effectiveness of fused features in multi-dimensional speech recognition.
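The shared-layer / attribute-dependent-layer design and the weighted multi-task loss described in contribution (1) can be sketched in miniature as follows. This is an illustrative numpy sketch, not the thesis's implementation: the feature dimension, shared-layer size, class counts, and loss weights are all assumed placeholder values, and a single linear layer stands in for the RNN sharing layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, label):
    return -np.log(probs[label] + 1e-12)

# Hypothetical dimensions: a 39-dim MFCC feature vector as input,
# a 64-dim shared representation, and three task-specific heads.
D_IN, D_SHARED = 39, 64
N_GENDER, N_EMOTION, N_IDENTITY = 2, 6, 10

W_shared = rng.normal(scale=0.1, size=(D_IN, D_SHARED))
heads = {
    "gender":   rng.normal(scale=0.1, size=(D_SHARED, N_GENDER)),
    "emotion":  rng.normal(scale=0.1, size=(D_SHARED, N_EMOTION)),
    "identity": rng.normal(scale=0.1, size=(D_SHARED, N_IDENTITY)),
}
# Per-task loss weights, tuned per speech database in the thesis;
# the values here are placeholders.
loss_weights = {"gender": 1.0, "emotion": 1.5, "identity": 1.0}

x = rng.normal(size=D_IN)                 # one utterance-level feature vector
labels = {"gender": 1, "emotion": 3, "identity": 7}

shared = np.tanh(x @ W_shared)            # shared layer: features common to all tasks
total_loss = 0.0
for task, W_head in heads.items():
    probs = softmax(shared @ W_head)      # attribute-dependent (task-specific) layer
    total_loss += loss_weights[task] * cross_entropy(probs, labels[task])

print(total_loss)
```

At training time, minimizing this weighted total loss updates the shared layer from all three tasks at once, which is what lets correlated tasks reinforce each other.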
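The feature-fusion step of contribution (2), which combines CNN-derived features with MFCC features before the multi-task classifier, can be illustrated with a minimal concatenation sketch. The feature sizes and the per-stream L2 normalization are assumptions for illustration; the thesis does not necessarily fuse features this exact way.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-utterance features: a 39-dim MFCC vector and a
# 128-dim embedding produced by a CNN from the spectrogram.
mfcc_feat = rng.normal(size=39)
cnn_feat = rng.normal(size=128)

def l2_normalize(v, eps=1e-12):
    # Scale each stream to unit norm so neither dominates the fusion.
    return v / (np.linalg.norm(v) + eps)

# Concatenate the two streams into one complementary feature vector;
# this fused vector would feed the multi-task RNN classifier.
fused = np.concatenate([l2_normalize(cnn_feat), l2_normalize(mfcc_feat)])
print(fused.shape)
```

The fused vector keeps the learned CNN features alongside the hand-crafted MFCCs, so information discarded by the MFCC filtering pipeline can still reach the classifier through the CNN stream.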
Keywords: Multi-dimensional speech recognition, Multi-task learning, Recurrent neural networks, Convolutional neural networks, Feature fusion