
Research On Key Technologies Of Multimodal Emotion Recognition Based On Speech Signals

Posted on: 2024-09-20    Degree: Master    Type: Thesis
Country: China    Candidate: K Liu    Full Text: PDF
GTID: 2568306923972759    Subject: Electronic information
Abstract
In recent years, with the enrichment of computer software resources and the continuous improvement of hardware, speech technologies such as Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER) have developed rapidly. As an important module of human-computer interaction, Intelligent Voice Interaction (IVI) has become an indispensable auxiliary tool in daily life. People expect voice-interaction machines to observe and understand in a human-like way, and to generate corresponding emotional states of their own so as to respond more accurately to the questions people raise. How to use these technologies to mine the deeper emotional value of voice data is therefore particularly important.

Although emotion can be understood and expressed through speech alone, people communicate with each other by integrating information from different modalities. If a person's emotional state is judged from speech alone, the result may be one-sided and limited in accuracy. Considering the complementarity between text and speech, the proposed approach first uses ASR to obtain text information from speech, then fully mines and fuses the emotional features in the text and the speech, and finally performs multimodal emotion recognition. The main contributions of this thesis are summarized as follows:

(1) To address the long-distance dependency problem of feature sequences in ASR, a network model combining convolutional neural networks (CNN) and multi-head attention is designed. In this model, the CNN extracts local feature information, while the multi-head attention captures global information and weights the output according to each feature's contribution, effectively alleviating the long-distance dependency problem and improving the accuracy of the model.

(2) To address the difficulty of balancing the number of attention heads against the subspace dimension in multi-head attention, a multi-branch fusion network combining dilated convolutional networks (DCNN) and multi-head attention (DAMBFN) is designed. The model is divided into multiple parallel branches, and the features in each branch first pass through a DCNN, which effectively expands the receptive field of the convolutional layer. Because the dilation rate of the DCNN differs across branches, the multi-head attention in each branch operates over a different receptive field. The attention in each branch focuses only on information within its own receptive field rather than on the full global context, which greatly reduces the amount of information per branch; the model can therefore use a larger number of attention heads and obtain better performance.

(3) To address the difficulty of aligning emotional features across modalities, and the long-distance dependencies across modalities, when modeling speech and text time series, a dual-flow cross-modal feature fusion network (DCFFN) based on interactive attention and self-attention is designed. The model introduces an interactive attention that fuses text and speech emotional features across modalities, which removes the need to align the feature information of the two modalities, and the full complementarity of the two kinds of emotional features improves the model's robustness in noisy environments. At the same time, the model uses self-attention so that each modality obtains its own global context information. The combination of the two attention mechanisms makes full use of the emotional feature information and significantly improves the classification accuracy of the model. Minimal sketches of the DAMBFN branch structure and the DCFFN attention scheme are given below.
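As a rough illustration of the DAMBFN idea, the following PyTorch sketch builds parallel branches in which a dilated 1-D convolution precedes multi-head attention. The layer sizes, the dilation rates (1, 2, 4), and the averaging used to fuse the branches are illustrative assumptions, not the thesis's exact configuration.

```python
# A minimal sketch of DAMBFN-style parallel branches, assuming PyTorch.
import torch
import torch.nn as nn

class DilatedAttentionBranch(nn.Module):
    """One branch: a dilated 1-D convolution widens the receptive field,
    then multi-head attention weights features inside that field."""
    def __init__(self, dim: int, heads: int, dilation: int, kernel: int = 3):
        super().__init__()
        padding = (kernel - 1) * dilation // 2  # preserve sequence length
        self.conv = nn.Conv1d(dim, dim, kernel, dilation=dilation, padding=padding)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); Conv1d expects (batch, dim, time)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.attn(h, h, h)
        return out

class DAMBFNSketch(nn.Module):
    """Parallel branches with different dilation rates; averaging the
    branch outputs is an assumed fusion rule for this sketch."""
    def __init__(self, dim: int = 256, heads: int = 4, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            DilatedAttentionBranch(dim, heads, d) for d in dilations
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.stack([b(x) for b in self.branches]).mean(dim=0)
```

Likewise, the DCFFN's pairing of interactive (cross-modal) attention with per-modality self-attention can be sketched as below; the feature dimensions and the additive combination of the two attention outputs are again assumptions.

```python
# A minimal sketch of DCFFN-style interactive + self-attention, assuming PyTorch.
import torch.nn as nn

class InteractiveAttentionSketch(nn.Module):
    """Cross-modal fusion: text features query speech features and vice
    versa, so the two sequences need no explicit time alignment."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.text_to_speech = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.speech_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_speech = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, speech):
        # Interactive attention: each modality attends to the other.
        t_cross, _ = self.text_to_speech(text, speech, speech)
        s_cross, _ = self.speech_to_text(speech, text, text)
        # Self-attention gives each modality its own global context.
        t_self, _ = self.self_text(text, text, text)
        s_self, _ = self.self_speech(speech, speech, speech)
        return t_cross + t_self, s_cross + s_self
```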
(4) For multimodal emotion classification, a model based on a Reweighted BiGRU (ReBiGRU) is designed. The model multiplies the state vectors of the BiGRU's forward and backward hidden layers with the output vector to obtain a weighting factor for the emotional features at each time step, and then reweights the output vector accordingly. The model thus fully exploits the emotional feature information hidden in the forward and backward hidden layers and effectively suppresses emotional features that contribute little to the output, improving the classification performance of the model, as sketched below.
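A minimal sketch of the reweighting idea, assuming PyTorch: each time step's weight comes from the dot product between the concatenated final forward/backward hidden states and that step's BiGRU output. The softmax normalization, the weighted-sum pooling, and the classifier head are assumptions added to keep the sketch self-contained.

```python
# A minimal ReBiGRU-style sketch, assuming PyTorch; pooling and head are assumed.
import torch
import torch.nn as nn

class ReBiGRUSketch(nn.Module):
    """Reweight BiGRU outputs by their similarity to the final
    forward/backward hidden states, then classify the pooled vector."""
    def __init__(self, dim: int = 256, hidden: int = 128, classes: int = 4):
        super().__init__()
        self.gru = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, h_n = self.gru(x)                       # out: (B, T, 2H), h_n: (2, B, H)
        state = torch.cat([h_n[0], h_n[1]], dim=-1)  # final fwd+bwd states: (B, 2H)
        # Weighting factor per time step: dot product of state and output.
        weights = torch.softmax((out * state.unsqueeze(1)).sum(-1), dim=1)  # (B, T)
        pooled = (out * weights.unsqueeze(-1)).sum(dim=1)                   # (B, 2H)
        return self.fc(pooled)
```

For example, `ReBiGRUSketch()(torch.randn(8, 50, 256))` returns class logits of shape (8, 4); steps whose outputs align poorly with the final hidden states receive small weights and contribute little to the pooled representation.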
Keywords/Search Tags: Multi-head Attention, Dilated Convolution, Multi-branch Fusion Network, Cross-modal Feature, Interactive Attention, Reweighted BiGRU