
Feature Fusion Based On Main-auxiliary Network For Speech Emotion Recognition

Posted on: 2022-08-07    Degree: Master    Type: Thesis
Country: China    Candidate: D S Hu    Full Text: PDF
GTID: 2518306542980799    Subject: Electronics and Communications Engineering
Abstract/Summary:
Speech is the most convenient and immediate medium of human communication, and the emotional information it carries plays an important role in that communication. The field of artificial intelligence has long pursued the goal of making machines speak, think, and feel like humans, and research on speech emotion recognition advances this goal step by step. Deep learning has been applied successfully to speech emotion recognition, mainly to extract salient, more general emotional features and to build emotion classification models. Besides the establishment of a suitable database, speech emotion recognition consists chiefly of feature extraction and classification modeling. This thesis improves the classification model for different types of features and proposes a feature fusion algorithm. The main contributions are as follows:

(1) Acoustic emotion features are extracted frame by frame from the speech signal, and segment-level features are generated by segmentation, which accounts for the way speech emotion changes over time. A Bidirectional Long Short-Term Memory (BLSTM) model is first used to model the segment features, but this model has two main shortcomings: BLSTM learns only local features within each time step and struggles to capture the global, contextual information of the emotional speech signal; and encoding the emotional information using only the BLSTM output at the final time step causes a certain amount of information loss. To address these problems, this thesis proposes a speech emotion recognition model based on an SA-BLSTM-ASP (Self-Attention Bidirectional Long Short-Term Memory with Attentive Statistics Pooling) network. A self-attention module placed before the BLSTM computes the relationships between different positions of the input segment-feature sequence, strengthening the network's ability to learn global features. At the BLSTM output, the attentive statistics pooling method proposed in this thesis is applied: the attention mechanism focuses on the more salient emotional segments of the input speech, while the statistics pooling captures the long-term variation of the emotional signal. Their combination enhances the BLSTM network's ability to extract salient deep segment features and improves the performance of the speech emotion recognition system.

(2) A Convolutional Neural Network with Global Average Pooling (CNN-GAP) structure is designed to process the mel spectrogram of the speech signal, in which the horizontal axis represents time and the vertical axis represents frequency. By designing large convolution kernels along the time axis and the frequency axis respectively, the network extracts the temporal and frequency characteristics of the mel spectrogram and, from them, salient emotional features. After the last convolution layer, global average pooling replaces the fully connected layer, which reduces overfitting and improves the performance of speech emotion recognition.

(3) The deep segment features extracted by the SA-BLSTM-ASP network and the deep mel-spectrogram features extracted by the CNN-GAP network are fused through a main-auxiliary network. In current deep learning practice, features learned by different networks are mostly fused by direct concatenation. Although this method has achieved some success, simply concatenating heterogeneous features as the network input ignores their differences in dimensionality and scale, as well as the different physical meanings of each feature type, which adversely affects recognition results. To solve these problems, this thesis proposes fusing different categories of features by means of a main-auxiliary network. First, the segment features are fed into the SA-BLSTM-ASP network, which serves as the main network, to extract deep segment features. Then, the mel spectrogram is fed into the CNN-GAP network, which serves as the auxiliary network, to extract deep mel-spectrogram features. Finally, the deep mel-spectrogram features are used to assist the deep segment features, and the two are fused in the main-auxiliary form.

Extensive experiments on the IEMOCAP and eNTERFACE'05 datasets verify the effectiveness of both the proposed SA-BLSTM-ASP speech emotion recognition model and the proposed main-auxiliary feature-fusion speech emotion recognition model. Compared with the benchmark models, the recognition results are greatly improved.
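The abstract does not give the pooling equations themselves. As a purely illustrative NumPy sketch (the parameter shapes w, b, v and all dimensions below are hypothetical stand-ins, not taken from the thesis), attentive statistics pooling reduces a sequence of frame-level features to an attention-weighted mean and standard deviation, while global average pooling reduces each channel of a convolutional feature map to a single value:

```python
import numpy as np

rng = np.random.default_rng(0)

def attentive_statistics_pooling(h, w, b, v):
    """Pool frame-level features h of shape (T, D) into one utterance-level
    vector of shape (2*D,): an attention-weighted mean concatenated with an
    attention-weighted standard deviation."""
    e = np.tanh(h @ w + b) @ v                # one attention score per frame, (T,)
    a = np.exp(e - e.max())
    a /= a.sum()                              # softmax over the time axis, (T,)
    mu = (a[:, None] * h).sum(axis=0)         # weighted mean, (D,)
    var = (a[:, None] * (h - mu) ** 2).sum(axis=0)
    return np.concatenate([mu, np.sqrt(np.maximum(var, 1e-12))])

def global_average_pooling(fmap):
    """Replace a fully connected layer by averaging each channel of a
    (C, F, T) feature map down to one number, yielding a (C,) vector."""
    return fmap.mean(axis=(1, 2))

T, D, C = 50, 8, 16                           # hypothetical sizes
h = rng.standard_normal((T, D))               # stand-in for BLSTM outputs
w = rng.standard_normal((D, D))               # attention parameters (illustrative)
b = rng.standard_normal(D)
v = rng.standard_normal(D)
fmap = rng.standard_normal((C, 12, 20))       # stand-in for the last conv layer's output

print(attentive_statistics_pooling(h, w, b, v).shape)  # (16,)
print(global_average_pooling(fmap).shape)              # (16,)
```

Note how the pooled standard deviation carries the long-term variation of the signal that a plain mean would discard, which matches the motivation the abstract gives for combining attention with statistics pooling.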
Keywords/Search Tags: recurrent neural network, self-attention mechanism, attentive statistics pooling, convolutional neural network, main-auxiliary network