The Research On Speech Emotion Recognition Based On Contextual Position Enhancement And Weighted Space

Posted on: 2021-03-02
Degree: Master
Type: Thesis
Country: China
Candidate: H R Hang
GTID: 2518306107952759
Subject: Electronics and Communications Engineering
Abstract/Summary:
Emotion recognition is an important application of human-computer interaction. Speech is a primary medium of human communication and a carrier of information, and it is the most natural way for people to interact with computers. As a branch of emotion recognition, speech emotion recognition analyzes subtle emotional changes in speech to infer a speaker's psychological state, and it has practical value in areas such as customer service feedback systems, criminal interrogation, teaching management, and medical services. In recent years, artificial intelligence and deep learning have been applied to many problems in the speech field, and speech emotion recognition has achieved considerable success. In practice, however, real-world speech is corrupted by environmental noise, consists of varied speech fragments, and contains many silent segments, all of which interfere with recognition. Speech emotion recognition therefore remains a difficult task. To address noise in the speaker's environment, silent segments, and the uneven distribution of emotions in the corpus, this thesis proposes a speech emotion recognition model based on speech context and headspace keyframes. The main contributions of this thesis are as follows:

(1) A position enhancement method based on speech context is proposed. The method aims to make the model attend to how contextual information affects the overall emotion, and to strengthen the influence of context on the parts of the speech affected by interference, by combining adaptive learning with the temporal memory of a recurrent neural network. Specifically, the encoder of the Transformer is modified with adaptive learning of context positions and the introduction of recurrent neural network features, which guides the network to actively learn how emotion is carried through the context of the speech and thereby improves emotion recognition performance.

(2) A headspace enhancement method with a weight space is proposed to select among the position features and the original features obtained in the previous step. The model uses the head space of the Transformer encoder to weight the key frames of a speech segment. This strengthens the influence of local speech frames on the emotion category of the whole utterance, and so reduces the effects of uneven emotion distribution in the corpus and of noise on the recognition of the whole utterance.

Finally, with both methods applied to the network, the improved Transformer encoder reaches a WAR of 70.3%, a UAR of 70.9%, and an F1 value of 70.0% on the IEMOCAP dataset. Compared with the baseline model, which applies only the Transformer encoder to speech emotion recognition, WAR and UAR increase by 2.0% and 6.6%, respectively. Compared with the best-performing comparable method, UAR and F1 value increase by 1.5% and 0.7%, respectively.
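
The abstract reports both WAR and UAR without defining them. The sketch below assumes their usual reading in the speech emotion recognition literature: WAR as recall weighted by class size (equal to overall accuracy) and UAR as the plain mean of per-class recall. The function name `war_uar` and the toy labels are hypothetical and are not taken from the thesis. It may help show why the two figures diverge when emotions are unevenly distributed in a corpus.

```python
from collections import Counter


def war_uar(y_true, y_pred):
    """Return (WAR, UAR) for two equal-length sequences of class labels.

    Assumed definitions (not stated in the abstract):
      WAR = recall weighted by class size, i.e. overall accuracy
      UAR = unweighted mean of per-class recall
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")

    # Number of utterances per true class.
    support = Counter(y_true)
    # Number of correctly classified utterances per true class.
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    war = sum(hits.values()) / len(y_true)
    uar = sum(hits[c] / support[c] for c in support) / len(support)
    return war, uar


# Toy, made-up labels: "neutral" dominates, as in an imbalanced corpus.
y_true = ["neutral"] * 6 + ["happy"] * 2 + ["sad"] * 2
y_pred = ["neutral"] * 6 + ["happy", "neutral"] + ["neutral", "neutral"]

war, uar = war_uar(y_true, y_pred)
print(f"WAR={war:.3f} UAR={uar:.3f}")  # prints "WAR=0.700 UAR=0.500"
```

In this toy case the majority class is always right, so WAR looks healthy, while the missed minority classes pull UAR down. Under these assumed definitions, a gain in UAR suggests better recognition of the less frequent emotions.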
Keywords/Search Tags:speech emotion recognition, context location enhancement, headspace weighting, Transformer network