
Research On Speech Emotion Recognition Based On Spatiotemporal Feature Fusion

Posted on: 2022-12-12    Degree: Master    Type: Thesis
Country: China    Candidate: C K Zheng    Full Text: PDF
GTID: 2518306779489144    Subject: Computer Software and Application of Computer
Abstract/Summary:
Speech emotion recognition is an important direction in the field of artificial intelligence and is widely used in smart healthcare, vehicle driving, human-machine dialogue, and other scenarios. However, speech emotion recognition still suffers from insufficient accuracy and poor robustness, mainly for three reasons: 1) human emotions are abstract, which makes them difficult to distinguish; 2) human emotions can only be detected at certain moments during speech; 3) speech data samples with emotion labels are usually limited. Therefore, this thesis designs a speech emotion recognition method based on spatiotemporal feature fusion, which includes:

1. To address the problems of insufficient speech data samples and low recognition accuracy, a speech emotion recognition model based on spatiotemporal feature fusion (3D-DACRNN) is proposed. Since the parameters of the AlexNet network are learned from image datasets, they cannot fully represent the spatial information of speech data and contain no temporal information. This thesis therefore extracts the spatial information of the speech spectrogram through a dilated convolutional network (Dilated-CNN), adds a bidirectional long short-term memory network (BLSTM) to extract time-series information, and performs spatiotemporal feature fusion. To address the problem that speech contains many emotion-independent features, the three channels of the logarithmic Mel spectrogram are used as input to reduce the influence of emotion-independent factors, and an attention mechanism is added to select the time-domain segments with large emotional weight.

2. Since it is difficult for the 3D-DACRNN model to distinguish emotions with high similarity, a speech emotion recognition model based on ViT-CRNN multi-feature fusion is proposed, which fuses the features extracted by the Vision Transformer (ViT) with those extracted by a CRNN. First, ViT can extract deep global temporal features and is better suited to transfer learning tasks than CNNs, so it can better cope with the shortage of speech emotion data samples. Then, a CRNN is used to learn more comprehensive spatial features from the original speech. Finally, the spatial features and the deep temporal features are spliced so that the two feature vectors complement each other and improve the recognition rate of the model.

In summary, this thesis proposes a speech emotion recognition model based on spatiotemporal feature fusion and a ViT-CRNN-based multi-feature fusion model, which improve the performance of the speech emotion recognition system and its ability to capture fine-grained emotional features. On the IEMOCAP dataset, the UAR of 3D-DACRNN and the UAR of ViT-CRNN improve by 4.1% and 6.3%, respectively.
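As a rough illustration of the first model, the sketch below wires a dilated CNN, a BLSTM, and additive attention into one spatiotemporal network. It assumes PyTorch, a 3-channel log-Mel input (static, delta, delta-delta), 4 emotion classes, and illustrative layer sizes; the thesis's exact 3D-DACRNN configuration is not reproduced here.

```python
# Minimal sketch of a dilated-CNN + BLSTM + attention fusion model.
# Channel widths, the 4 emotion classes, and the frequency-only pooling
# are illustrative assumptions, not the thesis's exact settings.
import torch
import torch.nn as nn

class SpatioTemporalSER(nn.Module):
    def __init__(self, n_mels=64, n_classes=4, hidden=128):
        super().__init__()
        # Dilated CNN: spatial features from the 3-channel log-Mel spectrogram
        # (static + delta + delta-delta stacked as image channels).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1, dilation=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=2, dilation=2), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=4, dilation=4), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((2, 1)),  # pool frequency only, keep time resolution
        )
        # BLSTM: temporal modelling over the frame axis of the CNN feature map.
        self.blstm = nn.LSTM(input_size=64 * (n_mels // 2), hidden_size=hidden,
                             batch_first=True, bidirectional=True)
        # Additive attention over time steps: emotion-salient frames receive
        # larger weights before utterance-level pooling.
        self.attn = nn.Linear(2 * hidden, 1)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                              # x: (batch, 3, n_mels, time)
        f = self.cnn(x)                                # (batch, 64, n_mels/2, time)
        b, c, m, t = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, t, c * m)  # (batch, time, feat)
        h, _ = self.blstm(f)                           # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)         # (batch, time, 1)
        utterance = (w * h).sum(dim=1)                 # attention-weighted pooling
        return self.classifier(utterance)

if __name__ == "__main__":
    model = SpatioTemporalSER()
    dummy = torch.randn(2, 3, 64, 300)                 # 2 utterances, 300 frames each
    print(model(dummy).shape)                          # torch.Size([2, 4])
```

The dilation pattern (1, 2, 4) is one common way to widen the receptive field over the spectrogram without extra pooling; the attention-weighted sum replaces plain average pooling so that frames with little emotional content contribute less to the utterance-level feature.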
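For the second model, the following is a minimal sketch of ViT-CRNN-style fusion, assuming torchvision's ViT-B/16 as the global feature extractor and a small hypothetical CRNN branch. The 224x224 input size, feature widths, and the concatenate-then-classify fusion head are assumptions for illustration rather than the thesis's exact design.

```python
# Minimal sketch of ViT + CRNN feature fusion: a global ViT feature and a
# local CRNN feature are spliced (concatenated) before classification.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

class ViTCRNNFusion(nn.Module):
    def __init__(self, n_classes=4, hidden=128):
        super().__init__()
        # ViT branch: deep global features; this is the natural place for
        # transfer learning (e.g. weights="IMAGENET1K_V1" for pretraining).
        self.vit = vit_b_16(weights=None)
        self.vit.heads = nn.Identity()          # keep the 768-d class-token feature
        # CRNN branch: local spatial features plus recurrence over frames.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.gru = nn.GRU(input_size=64 * 56, hidden_size=hidden,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(768 + 2 * hidden, n_classes)

    def forward(self, spec):                    # spec: (batch, 3, 224, 224)
        g = self.vit(spec)                      # (batch, 768) global feature
        c = self.conv(spec)                     # (batch, 64, 56, 56)
        b, ch, m, t = c.shape
        c = c.permute(0, 3, 1, 2).reshape(b, t, ch * m)
        _, h = self.gru(c)                      # h: (2, batch, hidden)
        local = torch.cat([h[0], h[1]], dim=1)  # (batch, 2*hidden)
        fused = torch.cat([g, local], dim=1)    # splice the two feature vectors
        return self.classifier(fused)

if __name__ == "__main__":
    model = ViTCRNNFusion()
    dummy = torch.randn(2, 3, 224, 224)         # log-Mel spectrograms resized to 224x224
    print(model(dummy).shape)                   # torch.Size([2, 4])
```

Concatenation keeps both feature vectors intact so the classifier can weigh the ViT's global representation against the CRNN's local one; more elaborate fusion (e.g. attention over the two branches) would slot in where the `torch.cat` is.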
Keywords/Search Tags: Speech Emotion Recognition, Transfer Learning, Dilated Convolutional Network, Long Short-Term Memory Network, Vision Transformer, Feature Fusion