
Research On Deep Learning For Speech Emotion Recognition

Posted on: 2019-06-20
Degree: Master
Type: Thesis
Country: China
Candidate: F M Zhu
Full Text: PDF
GTID: 2428330596460566
Subject: Signal and Information Processing

Abstract/Summary:
As we all know, voice is one of the most ideal and natural ways for human-computer interaction, and voice communication with machines has basically been achieved. However, machines often ignore the rich emotional information carried in the voice, and are far from being as natural and friendly as human communication. Enhancing the experience of human-computer interaction therefore requires the assistance of speech emotion recognition. In recent years, deep learning has achieved great success in various fields. This thesis mainly studies speech emotion recognition based on deep learning, and several improved algorithms for speech emotion recognition are proposed. The main work and innovations of this thesis are as follows:

(1) The research background, meaning, and significance of speech emotion recognition are discussed, and the relevant research history and current state are summarized from the four major aspects of the field: speech emotion description models, speech emotion databases, speech emotion features, and speech emotion recognition algorithms.

(2) All the processes before classification are described, including the preprocessing of speech signals and the extraction of key features such as short-term energy, short-term zero-crossing rate, formants, and Mel-frequency cepstral coefficients (MFCCs). The extraction of global statistical characteristics from the speech emotion feature parameters is introduced. Finally, the commonly used algorithms for feature dimension reduction are introduced, and the principal component analysis used in the experiments for whitening and dimension reduction is described in detail. All of this provides data support for the follow-up experiments (a sketch of this pipeline follows the abstract).

(3) Pattern recognition, machine learning, and the connections between them are introduced, and the machine learning algorithms commonly used in speech emotion recognition are studied in detail, including K-nearest neighbor, softmax regression, support vector machine, sparse representation, and neural networks, to provide comparison algorithms for the ones proposed in this thesis (a baseline sketch follows the abstract). This thesis also studies the advantages of deep learning in representation learning and some popular deep learning structures to provide theoretical support for the subsequent chapters.

(4) An improved stacked auto-encoder structure is proposed for emotion recognition, exploiting the robustness of denoising auto-encoders and the sparseness of sparse auto-encoders. The structure includes two layers: the first layer uses a denoising auto-encoder to learn hidden features of a larger dimension than the input features, the second layer applies a sparse auto-encoder to learn sparse features, and a classifier performs the classification at the end (a sketch of this structure follows the abstract). In the training process, layer-wise pre-training is used to initialize all parameters of the network, after which the whole network is fine-tuned. The experiments show that the improved stacked auto-encoder achieves a better recognition rate than stacked denoising auto-encoders or stacked sparse auto-encoders. In addition, on the CASIA sub-database, the structure is much better than the K-nearest neighbor algorithm: the recognition rate is increased by 53.7%, and it is 29.8% higher than sparse representation, 14.28% higher than the traditional support vector machine, and 1.9% higher than the artificial neural network. The recognition rate of this structure is 1.64% higher than the artificial neural network on the self-recorded database.
(5) A recurrent neural network (RNN) that integrates the attention mechanism is proposed. This structure combines the RNN's advantage of learning sequential features with the attention mechanism's benefit of learning the weights of features, so it can learn better deep weighted features from simple manual features. The structure includes four layers (a sketch follows the abstract): the first layer uses a bidirectional RNN to learn the time dependence of the input; the second layer uses a unidirectional RNN to obtain deeper features; the third layer uses an attention layer to learn the weights of the features, and the weighted features are fused to improve the representative capability; the fourth layer uses a fully connected network to learn the weighted features from the third layer, and finally the outputs are sent to the classifier for classification. The first part of the experiments, on the CASIA_A database, shows that the average recognition rate of this structure is 88.19%, which is 4%~5% higher than that of the plain RNN structure, and it clearly improves the recognition rates of happiness and anger. The second part of the experiments, on the CASIA_B database, shows that the average recognition rate of this structure is 5.71% higher than a stacked auto-encoder structure proposed by others, and it also improves the recognition rate of each individual emotion compared with that work.
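A minimal sketch of the pre-classification pipeline from (2), assuming 16 kHz mono input; the frame/hop sizes, n_mfcc=13, the set of global statistics, and the 40-dimensional PCA target are illustrative assumptions, not the thesis's exact settings (formant extraction is omitted):

```python
import numpy as np
import librosa
from sklearn.decomposition import PCA

def frame_level_features(path, sr=16000, frame=400, hop=160):
    y, _ = librosa.load(path, sr=sr)
    # Short-term energy per frame.
    frames = librosa.util.frame(y, frame_length=frame, hop_length=hop)
    energy = (frames ** 2).sum(axis=0)
    # Short-term zero-crossing rate.
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame, hop_length=hop)[0]
    # Mel-frequency cepstral coefficients.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=frame, hop_length=hop)
    n = min(len(energy), len(zcr), mfcc.shape[1])
    return np.vstack([energy[None, :n], zcr[None, :n], mfcc[:, :n]])  # (15, n_frames)

def global_statistics(feats):
    # Utterance-level statistics over the frame-level feature contours.
    return np.concatenate([feats.mean(axis=1), feats.std(axis=1),
                           feats.max(axis=1), feats.min(axis=1)])

# PCA with whitening for dimension reduction, as used in the experiments.
# X would be the (n_utterances, n_statistics) matrix of global statistics:
# pca = PCA(n_components=40, whiten=True).fit(X)
# X_reduced = pca.transform(X)
```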
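A hypothetical comparison of the classical baselines named in (3), using scikit-learn stand-ins (sparse representation has no stock scikit-learn classifier and is omitted); the synthetic X and y stand in for real PCA-reduced features and emotion labels:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))    # stand-in for PCA-reduced features
y = rng.integers(0, 6, size=200)  # stand-in labels for six emotions

baselines = {
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "softmax regression": LogisticRegression(max_iter=1000),
    "SVM (RBF)": SVC(kernel="rbf", C=1.0),
}
for name, clf in baselines.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```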
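A PyTorch sketch of the two-layer structure from (4): an over-complete denoising auto-encoder, a sparse auto-encoder on top, and a classifier, with layer-wise pre-training followed by fine-tuning. The dimensions, additive Gaussian corruption, and KL sparsity penalty weight are assumptions; the abstract does not fix them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoisingAE(nn.Module):
    def __init__(self, in_dim, hid_dim, noise=0.1):
        super().__init__()
        self.noise = noise
        self.enc = nn.Linear(in_dim, hid_dim)  # hid_dim > in_dim (over-complete)
        self.dec = nn.Linear(hid_dim, in_dim)

    def forward(self, x):
        # Corrupt the input during training; reconstruct the clean input.
        x_in = x + self.noise * torch.randn_like(x) if self.training else x
        h = torch.sigmoid(self.enc(x_in))
        return self.dec(h), h

class SparseAE(nn.Module):
    def __init__(self, in_dim, hid_dim, rho=0.05):
        super().__init__()
        self.rho = rho
        self.enc = nn.Linear(in_dim, hid_dim)
        self.dec = nn.Linear(hid_dim, in_dim)

    def forward(self, x):
        h = torch.sigmoid(self.enc(x))
        return self.dec(h), h

    def kl_sparsity(self, h):
        # KL divergence between target activation rho and mean activation.
        rho_hat = h.mean(dim=0).clamp(1e-6, 1 - 1e-6)
        return (self.rho * torch.log(self.rho / rho_hat)
                + (1 - self.rho) * torch.log((1 - self.rho) / (1 - rho_hat))).sum()

def pretrain(ae, data, epochs=50, lr=1e-3, sparse_weight=0.0):
    # Layer-wise pre-training: reconstruction loss, plus the sparsity
    # penalty when training the sparse layer.
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    for _ in range(epochs):
        recon, h = ae(data)
        loss = F.mse_loss(recon, data)
        if sparse_weight:
            loss = loss + sparse_weight * ae.kl_sparsity(h)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Usage sketch (X: tensor of whitened feature vectors, 40-dim assumed):
# dae = DenoisingAE(in_dim=40, hid_dim=120); pretrain(dae, X)
# dae.eval(); _, h1 = dae(X)
# sae = SparseAE(120, 60); pretrain(sae, h1.detach(), sparse_weight=0.1)
# clf = nn.Linear(60, 6)  # then fine-tune dae.enc, sae.enc, clf jointly
```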
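A sketch of the four-layer attention RNN from (5): a bidirectional RNN over the frame-level feature sequence, a unidirectional RNN for deeper features, an attention layer that learns per-frame weights and fuses the weighted features, and a fully connected layer before the classifier. The choice of GRU cells and all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class AttentionRNN(nn.Module):
    def __init__(self, feat_dim, hid_dim, n_emotions):
        super().__init__()
        self.birnn = nn.GRU(feat_dim, hid_dim, batch_first=True, bidirectional=True)
        self.rnn = nn.GRU(2 * hid_dim, hid_dim, batch_first=True)
        self.attn = nn.Linear(hid_dim, 1)  # one attention score per frame
        self.fc = nn.Linear(hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, n_emotions)

    def forward(self, x):                        # x: (batch, frames, feat_dim)
        h, _ = self.birnn(x)                     # (batch, frames, 2*hid_dim)
        h, _ = self.rnn(h)                       # (batch, frames, hid_dim)
        w = torch.softmax(self.attn(h), dim=1)   # per-frame weights
        fused = (w * h).sum(dim=1)               # weighted fusion over time
        return self.out(torch.relu(self.fc(fused)))  # logits for the classifier

# Smoke test: 8 utterances of 300 frames, 15-dim frame features, 6 emotions.
model = AttentionRNN(feat_dim=15, hid_dim=128, n_emotions=6)
logits = model(torch.randn(8, 300, 15))
print(logits.shape)  # torch.Size([8, 6])
```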
Keywords/Search Tags:speech emotion recognition, stacked auto-encoders, recurrent neural network, attention