
Research Of Speech Emotion Recognition Based On Deep Spatio-Temporal Representation

Posted on: 2020-05-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zheng
Full Text: PDF
GTID: 2428330578471052
Subject: Computer application technology

Abstract/Summary:
Speech is the most natural and direct means of communication between people, and it carries rich emotional information. The goal of speech emotion recognition is to enable a computer, during human-computer interaction, to judge a person's emotional state from the speech signal, and even to supervise, assist, and guide the person's work. Making the computer recognize speech emotion automatically is therefore an important and challenging task.

Earlier approaches to emotion recognition first extracted hand-crafted, emotion-related features from speech according to expert experience, applied feature selection to reduce the dimensionality of the emotion feature vector, and then trained a classifier for speech emotion recognition. Such shallow machine-learning methods depend heavily on manual feature engineering, which restricts the performance of speech emotion recognition. With the wide application of deep learning across many fields, how to use deep learning effectively to extract high-dimensional speech emotion representations has become a research hotspot.

This thesis focuses on feature extraction from emotional speech signals and on modeling for speech emotion recognition. From the perspective of cognitive science, combined with deep learning methods, a framework is presented that automatically extracts high-dimensional temporal, spatial, and spatio-temporal features. It largely avoids the reliance on manually extracted features, extracts the emotional information in speech effectively, and enriches the emotional representation of the speech signal. The main research contents are as follows:

Firstly, to address the problem of selecting emotion-related acoustic features in existing speech emotion systems, a deep-learning-based fusion method for high-level temporal and spatial features, grounded in cognitive science, is proposed. Combining the advantages of a fully convolutional network (FCN) and a bidirectional LSTM (BLSTM), the FCN-Attention-BLSTM model performs temporal and spatial feature extraction and classification prediction. The new feature-extraction method effectively models the characteristics of speech emotion. Experiments on the Chinese Natural Audio-Visual Emotion Database (CHEAVD) and the IEMOCAP corpus show that the proposed model improves both unweighted accuracy (UA) and weighted accuracy (WA) by a large margin: compared with other existing speech emotion recognition algorithms, UA and WA increase by 4.6% and 6.4% on the CHEAVD database, while on the two sub-databases of IEMOCAP, UA increases by 3.9% and 0.5% and WA increases by 4.7% and 1.6%, respectively.

Secondly, to address the slow training and recognition speed of the FCN-Attention-BLSTM model, a dilated causal convolutional neural network is proposed for extracting temporal features. It resolves the heavy GPU memory consumption and slow computation caused by feeding speech through the BLSTM. Compared with the FCN and attention-based BLSTM model, at a cost of 1%-2% in UA and WA, the single-utterance test time for speech emotion recognition drops from 2.8s-3.5s to 1.9s-2.1s, and the test speed increases by a factor of 5 to 7.
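The attention mechanism applied to the BLSTM outputs can be sketched as follows. This is a minimal NumPy illustration of attention-weighted pooling, which collapses a variable-length sequence of frame-level features into a single utterance-level emotion vector; the function name, shapes, and random parameters are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def attention_pool(H, W, v):
    """Attention-weighted pooling over frame-level features.

    H: (T, d) sequence of BLSTM hidden states (one row per speech frame)
    W: (d, d) projection and v: (d,) scoring vector -- learned in practice.
    Returns a single (d,) utterance-level representation.
    """
    scores = np.tanh(H @ W) @ v          # (T,) relevance score per frame
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax over frames
    return alpha @ H                     # attention-weighted sum of frames

rng = np.random.default_rng(0)
H = rng.standard_normal((50, 8))         # 50 frames, 8-dim features
W = rng.standard_normal((8, 8))
v = rng.standard_normal(8)
utt = attention_pool(H, W, v)
print(utt.shape)                         # (8,)
```

Unlike mean pooling, the softmax weights let the model emphasize the frames that carry the most emotional information, which is the motivation for attention here.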
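The dilated causal convolution that replaces the BLSTM's recurrence can be sketched as follows. This is a simplified single-channel NumPy version (the thesis's actual network stacks many channels and layers); the function name and kernel values are illustrative assumptions.

```python
import numpy as np

def dilated_causal_conv(x, w, dilation):
    """1-D dilated causal convolution.

    y[t] depends only on x[t], x[t-d], x[t-2d], ... (no future samples),
    so the output can be computed in parallel over t, unlike a recurrence.
    x: (T,) input sequence, w: (k,) kernel; the left edge is zero-padded.
    """
    k, T = len(w), len(x)
    y = np.zeros(T)
    for t in range(T):
        for i in range(k):
            j = t - i * dilation          # index into the past only
            if j >= 0:
                y[t] += w[i] * x[j]
    return y

x = np.arange(8, dtype=float)
w = np.array([1.0, 1.0])                  # kernel of size 2
y = dilated_causal_conv(x, w, dilation=2)
# y[t] = x[t] + x[t-2] once the receptive field is filled
```

Stacking layers with dilations 1, 2, 4, ... grows the receptive field exponentially with depth, which is what allows a convolutional network to cover long temporal context while avoiding the per-step memory and latency costs of a BLSTM.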
Keywords/Search Tags: Speech emotion recognition, Spatio-Temporal representation, BLSTM, FCN, Attention, SeriesNet