
Speech Emotion Recognition Research Based On Deep Learning

Posted on: 2021-12-09
Degree: Master
Type: Thesis
Country: China
Candidate: D Y Li
GTID: 2518306308472774
Subject: Control Science and Engineering

Abstract/Summary:
With the advancement of science and technology and the deepening of deep learning research, speech emotion recognition has gradually found widespread application in daily life. Most current work treats speech emotion recognition as a simple classification task: a variety of acoustic features are first extracted manually and assembled through feature engineering, and a classification network is then trained with deep learning to recognize emotion categories. This thesis instead sets out to explore the variant and invariant information in emotional speech. Starting from the source-filter model of speech production, it investigates how emotion information is expressed in speech and constructs a speech emotion space; it then builds the network structure best suited to learning emotion information from the structure of that space; finally, attention mechanisms are used to optimize the model so that the key parts of the speech signal are extracted and exploited. The main research contents of this thesis are as follows.

1. Studying speech emotion features based on the source-filter model of speech production. From a large number of acoustic feature parameters, those that reasonably express emotional invariance are screened according to their physical meaning, and their effectiveness is verified. Through comparative analysis, interference from the speaker, the linguistic content and other non-emotional information is removed at the input level as far as possible, while emotion-related information is retained. Two types of speech emotion space are constructed: a global-feature emotion space represented by the spectrogram, and a time-series emotion space composed of the fundamental frequency, MFCCs and their statistical values (a feature-extraction sketch follows the abstract).

2. Studying network structures suited to the speech emotion space. On top of the emotion spaces built from the selected speech parameters, deep networks suited to mining speech emotion information are constructed. From the perspective of network-structure modelling, different convolutional neural networks are used for the emotion space formed by the temporal characteristics of speech and for the spectrogram emotion space, whose input characterizes both the utterance level and the frame level. Two types of network are obtained, a spectrum network and a time-series network; by comparison, the combined convolutional spectrum network that takes the spectrogram as input achieves the better recognition rate.

3. Studying the extraction of the key parts of speech emotion information. Not every part of an utterance reflects emotional information, and the important emotional cues are diverse, so how to extract and exploit these critical parts is a key research question. This thesis proposes a series of methods for extracting the importance of speech emotion in the time domain, the frequency domain and the high-level feature space: one class acts on the original spectrogram input, while the other learns and uses local importance in the high-level space of the network. Using traditional methods, a self-attention mechanism, non-random Dropout, channel attention and their combinations, the network model is optimized and the system accuracy is improved over the baseline network. The best result is obtained by combining the self-attention method applied to the original spectrogram input with attention in the high-level layers of the network (a channel-attention sketch also follows the abstract).
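The feature-extraction sketch referenced in item 1: a minimal Python example of building the two emotion spaces described above (a log-magnitude spectrogram as the global-feature space; F0, MFCCs and their statistics as the time-series space). It assumes librosa and a hypothetical 16 kHz mono file "utterance.wav"; the frame sizes, MFCC count and F0 search range are illustrative choices, not the exact configuration used in the thesis.

```python
# Sketch of the two speech emotion spaces (illustrative parameters).
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file

# Global-feature emotion space: log-magnitude spectrogram of the utterance.
spec = librosa.amplitude_to_db(
    np.abs(librosa.stft(y, n_fft=512, hop_length=160)), ref=np.max
)                                                  # shape: (freq_bins, frames)

# Time-series emotion space: fundamental frequency, MFCCs and their statistics.
f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr, frame_length=1024, hop_length=160)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=512, hop_length=160)

n = min(f0.shape[-1], mfcc.shape[-1])              # align frame counts defensively
frame_feats = np.vstack([f0[np.newaxis, :n], mfcc[:, :n]])   # per-frame sequence
stats = np.concatenate([frame_feats.mean(axis=1),  # utterance-level statistics
                        frame_feats.std(axis=1)])
```

The per-frame matrix would feed a time-series network, while the spectrogram would feed a convolutional spectrum network.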
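The channel-attention sketch referenced in item 3: a small PyTorch example of re-weighting the high-level feature maps of a spectrogram CNN with a squeeze-and-excitation-style gate. The network depth, channel counts and class count are placeholders; the thesis's actual architecture and its combination with self-attention on the input spectrogram are not reproduced here.

```python
# Minimal sketch of channel attention inside a spectrogram classifier (illustrative).
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Re-weights convolutional channels with a learned, input-dependent gate."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),              # squeeze: global average per channel
            nn.Flatten(),
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                         # excitation: per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.gate(x).unsqueeze(-1).unsqueeze(-1)
        return x * w                              # emphasize emotion-relevant channels

class SpectrogramNet(nn.Module):
    """Toy spectrum network: two conv blocks, channel attention, global pooling."""
    def __init__(self, n_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
            ChannelAttention(64),
        )
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, n_classes))

    def forward(self, spec: torch.Tensor) -> torch.Tensor:   # spec: (batch, 1, freq, time)
        return self.head(self.features(spec))

logits = SpectrogramNet()(torch.randn(2, 1, 257, 200))       # e.g. two log-spectrograms
```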
Keywords/Search Tags:speech emotion recognition, source-filter model, spectrogram networks, attention mechanism