
Speech Emotion Recognition Based On Deep Learning

Posted on: 2022-04-27    Degree: Master    Type: Thesis
Country: China    Candidate: Y K Song    Full Text: PDF
GTID: 2518306557469444    Subject: Electronics and Communications Engineering
Abstract/Summary:
As an important branch of affective computing, speech emotion recognition has been studied in depth by many scholars because speech is fast, convenient, and natural to collect. In the past, most scholars used traditional machine learning methods to realize speech emotion recognition. With the rise of deep learning in recent years, however, deep learning methods have shown excellent performance in many fields and have become the mainstream approach. The main work of this thesis is to study how to use neural networks to extract the emotional information in speech, to select the speech features that are conducive to emotion recognition, and to improve the performance of speech emotion recognition. The main contributions are as follows:

(1) This thesis proposes a speech emotion recognition method based on a spatial and channel attention mechanism. An utterance contains multiple speech segments, and different segments contribute differently to emotion recognition. The speech signal is therefore divided into multiple audio segments, and the deep emotional information of each segment is extracted with the pre-trained VGG16 model. A spatial and channel attention mechanism then selects the features that are favorable to speech emotion recognition and assigns them higher weights, improving recognition performance. Using this attention mechanism, recognition rates of 58.98%, 36.12%, and 71.31% were obtained on the eNTERFACE'05, AFEW, and IEMOCAP databases, respectively, improvements of 3.1%, 9.8%, and 13.14% over traditional speech emotion recognition methods.

(2) This thesis proposes a speech emotion recognition method based on a kernel-selective attention mechanism. The kernel-selective attention mechanism uses convolution kernels of various sizes to extract the emotional information in speech. Different convolution kernels have different receptive fields and therefore extract different speech emotion features of differing importance, so different weights are assigned to the features extracted by different kernels. Within the kernel attention mechanism, convolution kernels are used to combine information linearly across channels, realizing information exchange between channels; this also increases the nonlinear expressive ability of the feature map while keeping its size unchanged, which helps deeper layers of the network extract higher-level emotional features. Convolution kernels of several sizes are used: as the kernel size increases, the receptive field grows, more information can be extracted, and better local features are obtained, while smaller kernels capture finer details, enabling accurate recognition of speech emotion. Simulation experiments on the three databases with the kernel-selective attention mechanism obtained recognition rates of 58.42%, 35.22%, and 68.46%, respectively, improvements of 2.54%, 8.9%, and 10.64% over traditional speech emotion recognition methods.

(3) Because speech emotion has temporal characteristics, this thesis proposes a speech emotion recognition method based on spatio-temporal features. The speech signal is converted into a spectrogram, which has not only spatial structure but also temporal structure across frames. Targeting these spatial and temporal characteristics, a speech emotion recognition method based on parallel spatio-temporal features is first proposed; the network structure is then improved to obtain a method based on cross spatio-temporal features. In previous studies, convolutional neural networks have extracted rich spatial features from images; however, for speech emotion recognition, the emotion carried by speech is correlated over a period of time, and a convolutional neural network cannot extract these temporal features, so a recurrent neural network is used to extract them. The spatial and temporal features extracted by the two networks are concatenated to obtain joint spatio-temporal emotion features for recognizing speech emotion. This parallel spatio-temporal method obtained recognition rates of 57.68%, 33.90%, and 65.11% on the three databases, respectively. However, simple feature concatenation cannot realize information exchange between the spatial and temporal features. To address this defect, the cross spatio-temporal method is proposed, which strengthens the information exchange between the different features. Compared with the parallel spatio-temporal method, it achieved recognition rates of 58.61%, 38.10%, and 70.89% on the three databases, improvements of 0.93%, 4.2%, and 5.78%, respectively.
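As a minimal sketch of the spatial-and-channel reweighting idea in contribution (1): the abstract does not specify how the attention weights are computed, so this example assumes global average pooling for the channel descriptor, a softmax gate over channels, and a sigmoid gate over spatial positions; the function name and these design choices are illustrative only.

```python
import numpy as np

def channel_spatial_attention(feat):
    """Reweight a (C, H, W) feature map, e.g. one produced by a
    VGG16-style backbone, with channel then spatial attention."""
    C, H, W = feat.shape
    # Channel attention: global average pooling gives one descriptor
    # per channel; a softmax turns descriptors into channel weights.
    chan_desc = feat.mean(axis=(1, 2))                        # (C,)
    chan_w = np.exp(chan_desc) / np.exp(chan_desc).sum()      # (C,)
    feat = feat * chan_w[:, None, None]
    # Spatial attention: average over channels gives a per-position
    # descriptor; a sigmoid turns it into a spatial weight map.
    spat_desc = feat.mean(axis=0)                             # (H, W)
    spat_w = 1.0 / (1.0 + np.exp(-spat_desc))                 # (H, W)
    return feat * spat_w[None, :, :]
```

The output has the same shape as the input, so the reweighted map can be fed to any downstream classifier unchanged; emotionally informative channels and positions simply carry larger magnitudes.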
Keywords/Search Tags:Speech Emotion Recognition, Feature Selection, Attention Mechanism, Convolutional Neural Network, Recurrent Neural Network, Information Exchange