
Cross-corpus Speech Emotion Recognition Based On VGFCC Feature And Composite Network

Posted on: 2021-06-18    Degree: Master    Type: Thesis
Country: China    Candidate: Y R Liu    Full Text: PDF
GTID: 2518306113951399    Subject: Information and Communication Engineering
Abstract/Summary:
Endowing machines with emotional computing is essential for realizing true artificial intelligence. Speech is the easiest and fastest way for people to communicate, and the emotional information it carries helps reveal the meaning behind the words, so speech emotion recognition has become a hot research topic. Most past research was based on a single speech database and has largely matured. In real applications, however, the training corpus and the testing corpus often differ in language, speech type, speaker, environment, cultural background, and so on, which motivates research on cross-corpus speech emotion recognition. This thesis focuses on two parts: feature extraction and the recognition model.

Traditional emotional speech feature extraction assumes that the signal is short-term stationary, but in practice the speech signal changes with time. To address this, the thesis decomposes the emotional speech signal with the variational mode decomposition (VMD) algorithm, which handles nonlinear, non-stationary signals well. The modes of different frequencies are recombined, passed through a Gammatone filter bank, log-compressed, and transformed with the discrete cosine transform; statistical parameters are then computed to obtain a new emotional speech spectral feature, VGFCC (a sketch of this pipeline follows below). Since a single feature cannot fully represent emotional information, the new spectral feature is combined with prosodic features, which express the basic characteristics of speech, and nonlinear features, which describe speech emotion from the perspective of chaos, to obtain a global feature.

Experiments are carried out on the German emotion database recorded by the Berlin University of Technology, the Chinese emotion database built by the Digital Audio and Video Research Center of Taiyuan University of Technology, and the Chinese emotion database recorded by the Institute of Automation of the Chinese Academy of Sciences. The classifier is a kernel extreme learning machine optimized by the artificial bee colony algorithm. Compared with prosodic features, nonlinear features, and two traditional spectral features, Mel-frequency cepstral coefficients (MFCC) and Gammatone frequency cepstral coefficients (GFCC), the results show that the proposed feature is an effective emotional speech feature that distinguishes different emotions well. Compared with single-feature recognition, the global feature improves the recognition rate. Feature-level fusion makes information complementary, but it also introduces redundancy, so for certain emotions the global feature's recognition rate is lower than a single feature's, while the overall average recognition rate improves.
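To make the feature pipeline concrete, the following is a minimal Python sketch of the VMD-Gammatone-DCT chain described above. It is an illustration under stated assumptions, not the thesis code: the `vmdpy` package is assumed for variational mode decomposition, SciPy's `gammatone` filter design stands in for the thesis filter bank, and the mode count, filter spacing, and cepstral order are placeholder values.

```python
import numpy as np
from scipy.signal import gammatone, lfilter
from scipy.fft import dct
from vmdpy import VMD  # assumed implementation: pip install vmdpy

def vgfcc(signal, fs, K=5, n_filters=24, n_ceps=13):
    # 1) VMD splits the non-stationary signal into K band-limited modes.
    modes, _, _ = VMD(signal, alpha=2000, tau=0.0, K=K, DC=0, init=1, tol=1e-7)
    # 2) Recombine the modes of different frequencies.
    recombined = modes.sum(axis=0)
    # 3) Gammatone filter bank (log-spaced centre frequencies as a simple
    #    stand-in for ERB spacing).
    centres = np.geomspace(100.0, 0.9 * fs / 2, n_filters)
    energies = []
    for fc in centres:
        b, a = gammatone(fc, 'fir', numtaps=256, fs=fs)
        band = lfilter(b, a, recombined)
        energies.append(np.mean(band ** 2))
    # 4) Logarithm, then discrete cosine transform to the cepstral domain.
    return dct(np.log(np.asarray(energies) + 1e-10), norm='ortho')[:n_ceps]
```

In the thesis, statistical parameters over such cepstra, fused at the feature level with the prosodic and nonlinear features, form the global feature vector.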
The recognition model is also critical to speech emotion recognition performance. This thesis proposes a composite network: stacked sparse autoencoders combined with a kernel extreme learning machine (SSAE-KELM). First, the original features are pre-trained without supervision by the stacked sparse autoencoder network; the network is then fine-tuned with the data labels using the back-propagation algorithm. The reconstruction yields deep features that better match the sparsity of the brain and carry more discriminative emotional information. Finally, the kernel extreme learning machine optimized by the artificial bee colony algorithm identifies and classifies the emotions (a sketch of the composite pipeline appears at the end of this abstract).

To apply the theoretical research in practice, the thesis carries out cross-corpus speech emotion recognition experiments on the same three databases: the German emotion database recorded by the Berlin University of Technology, the Chinese emotion database built by the Digital Audio and Video Research Center of Taiyuan University of Technology, and the Chinese emotion database recorded by the Institute of Automation of the Chinese Academy of Sciences. Global features are extracted from each corpus. Since the only emotions common to the three databases are "sad", "angry", and "happy", the research focuses on these three classes. The classifiers include three shallow learners: support vector machine, extreme learning machine, and kernel extreme learning machine; and three composite networks: stacked sparse autoencoders-support vector machine, stacked sparse autoencoders-extreme learning machine, and stacked sparse autoencoders-kernel extreme learning machine. Three groups of experiments are designed: single-corpus, mixed-corpus, and cross-corpus. The results show that the SSAE-KELM composite network achieves good recognition performance and effectively improves the cross-corpus recognition rate.
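The composite network can be summarized in code. Below is a minimal sketch, assuming PyTorch for the stacked sparse autoencoder (greedy layer-wise pre-training with a KL sparsity penalty, then supervised fine-tuning) and a closed-form NumPy kernel extreme learning machine. Layer sizes and the (C, gamma) kernel parameters are illustrative stand-ins; the thesis searches them with the artificial bee colony algorithm.

```python
import numpy as np
import torch
import torch.nn as nn

class SSAE(nn.Module):
    """Stacked sparse autoencoder; the classification head is used only
    during supervised fine-tuning."""
    def __init__(self, dims=(120, 64, 32), n_classes=3):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1))
        self.decoders = nn.ModuleList(
            nn.Linear(dims[i + 1], dims[i]) for i in range(len(dims) - 1))
        self.head = nn.Linear(dims[-1], n_classes)

    def encode(self, x, upto=None):
        for enc in self.encoders[:upto]:
            x = torch.sigmoid(enc(x))
        return x

def pretrain(model, X, epochs=50, rho=0.05, sparsity_weight=1.0):
    """Greedy layer-wise training with a KL sparsity penalty; X is assumed
    scaled to [0, 1] so the sigmoid reconstruction is sensible."""
    for i, (enc, dec) in enumerate(zip(model.encoders, model.decoders)):
        opt = torch.optim.Adam(
            list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
        with torch.no_grad():
            inp = model.encode(X, upto=i)   # output of the frozen lower layers
        for _ in range(epochs):
            h = torch.sigmoid(enc(inp))
            recon = torch.sigmoid(dec(h))
            rho_hat = h.mean(0).clamp(1e-6, 1 - 1e-6)
            kl = (rho * torch.log(rho / rho_hat)
                  + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()
            loss = nn.functional.mse_loss(recon, inp) + sparsity_weight * kl
            opt.zero_grad(); loss.backward(); opt.step()

def finetune(model, X, y, epochs=100):
    """Supervised fine-tuning of the whole encoder stack by back-propagation."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        loss = nn.functional.cross_entropy(model.head(model.encode(X)), y)
        opt.zero_grad(); loss.backward(); opt.step()

def kelm_fit_predict(Z_train, y_train, Z_test, C=100.0, gamma=0.1, n_classes=3):
    """Closed-form kernel ELM with an RBF kernel: beta = (I/C + K)^-1 * T."""
    def rbf(A, B):
        return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    K = rbf(Z_train, Z_train)
    T = np.eye(n_classes)[y_train]          # one-hot target matrix
    beta = np.linalg.solve(np.eye(len(K)) / C + K, T)
    return rbf(Z_test, Z_train) @ beta      # class scores; argmax gives labels
```

After pre-training and fine-tuning, the deep features `model.encode(X).detach().numpy()` replace the raw global features as the KELM input, mirroring the SSAE-KELM pipeline described above.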
Keywords/Search Tags: Speech Emotion Recognition, Cross-Corpus, Variational Mode Decomposition, Feature-level Fusion, Composite Network, Stacked Sparse Autoencoders, Kernel Extreme Learning Machine