Speech emotion recognition is an active research topic in artificial intelligence. A speech emotion recognition model generally comprises three modules: emotional speech data, emotional features, and a classifier, among which the emotional features are a key component of the whole model. The emotional features in common use fall into two categories: acoustic features and spectrograms. However, these are shallow representations of the speech signal and cannot capture its deeper characteristics. Deep learning can address this problem: combining feature extraction with deep learning yields deep features with stronger representational power. At the same time, as the variety of speech features has grown, researchers have found that fusing multiple features makes the representation more comprehensive and thereby improves the recognition performance of the system. Acoustic features are one-dimensional representations of the speech signal that carry its time-domain or frequency-domain information, while the spectrogram is a two-dimensional representation that carries time-frequency information. Fusing acoustic and spectrogram features can therefore exploit their complementarity to improve the emotional representation ability of the fused features.

To this end, this paper studies deep feature extraction models and a feature fusion algorithm. The main contributions are as follows.

First, to address the problem that acoustic features represent the speech signal only at a shallow level, a multi-task deep neural network for acoustic deep feature extraction is constructed. On top of a deep neural network, the model builds a classification task and a self-learning task that are trained simultaneously. The classification task is, in essence, emotion recognition, realized by setting the output label to the emotion category. The essence of the self-learning task is
feature reconstruction, realized by setting the network's target to the input feature itself. The two tasks have two different losses; the network back-propagates through a joint loss and trains the classification and self-learning tasks at the same time. The resulting hidden-layer representation, namely the proposed acoustic deep feature, both represents the original feature well and carries additional emotional information.

Second, to address the problem that conventional deep spectrogram features represent some emotions weakly and yield low recognition rates, a spectrogram deep feature extraction model based on a binary-loss-assisted convolutional neural network is constructed. By examining the confusion matrices of deep spectrogram features extracted on different speech databases, it is found that every database contains two or three weak emotions whose low recognition rates in turn lower the overall recognition rate. Targeting these weak emotions, corresponding binary-classification loss functions are therefore added to the network to assist training, which improves the recognition performance of the resulting deep spectrogram features on the weak emotions.

Third, in view of the complementarity between acoustic and spectrogram features, a feature fusion algorithm based on multi-kernel principal component analysis is proposed. Multi-kernel learning first constructs a multi-kernel mapping space: kernel functions map the features into a higher-dimensional space, in which the acoustic and spectrogram features can be mapped and fused so as to combine the advantages of the two feature types and improve the performance of the fused features. PCA dimensionality reduction is then carried out in this space to obtain the fused features and to solve the problem of the excessive dimensionality introduced by feature fusion.

Experiments are carried out on the EMODB, SAVEE, and CASIA emotional speech databases. The
results show that the IS09MT feature extracted by the multi-task deep neural network outperforms both the original acoustic feature (IS09) and the IS09DNN feature extracted by a plain deep neural network. Compared with the MSP feature extracted by a basic convolutional neural network, the EMSP deep spectrogram feature extracted by the binary-loss-assisted convolutional neural network improves the recognition performance on the weak emotions. Finally, the IS09MT-EMSP-MKPCA feature obtained by the multi-kernel principal component analysis algorithm performs significantly better than IS09MT-EMSP and the single features.
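The joint-loss training of the classification and self-learning tasks can be sketched as follows. This is a minimal NumPy illustration, not the thesis's implementation: the input dimension (384, matching the IS09 feature set), the hidden size, the number of emotion classes, and the equal loss weighting `alpha = 0.5` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 384-dim IS09 acoustic feature, 7 emotion classes.
D_IN, D_HID, N_CLASSES = 384, 64, 7

# Shared encoder; its hidden layer is the "acoustic deep feature".
W_enc = rng.normal(0, 0.05, (D_IN, D_HID))
W_cls = rng.normal(0, 0.05, (D_HID, N_CLASSES))  # classification head
W_rec = rng.normal(0, 0.05, (D_HID, D_IN))       # reconstruction (self-learning) head

def forward(x):
    h = np.tanh(x @ W_enc)   # shared hidden layer = deep feature
    logits = h @ W_cls       # classification task output
    x_hat = h @ W_rec        # self-learning output (reconstructs the input)
    return h, logits, x_hat

def joint_loss(x, y, alpha=0.5):
    """Joint loss = alpha * cross-entropy + (1 - alpha) * reconstruction MSE."""
    _, logits, x_hat = forward(x)
    z = logits - logits.max(axis=1, keepdims=True)       # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(y)), y].mean()         # classification loss
    mse = ((x_hat - x) ** 2).mean()                      # self-learning loss
    return alpha * ce + (1 - alpha) * mse

x = rng.normal(size=(8, D_IN))       # batch of 8 acoustic feature vectors
y = rng.integers(0, N_CLASSES, 8)    # emotion labels
loss = joint_loss(x, y)
print(float(loss))
```

Back-propagating this single scalar updates the shared encoder from both tasks at once, which is what lets the hidden layer retain the input's content while absorbing emotional discriminability.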
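The binary-loss-assisted idea can be sketched in the same spirit. The weak-emotion indices, the auxiliary loss weight `LAMBDA`, and the one-vs-rest logistic formulation of each auxiliary binary loss are assumptions for illustration; the abstract does not specify the exact form used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
N_CLASSES, BATCH = 7, 16
WEAK = [2, 5]     # hypothetical indices of the weakly recognised emotions
LAMBDA = 0.3      # weight of each auxiliary binary loss (assumption)

logits = rng.normal(size=(BATCH, N_CLASSES))  # stand-in for CNN output logits
y = rng.integers(0, N_CLASSES, BATCH)         # ground-truth emotion labels

def softmax_ce(logits, y):
    """Main multi-class cross-entropy loss."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

def binary_aux(logits, y, k):
    """One-vs-rest logistic loss for class k ("is this sample emotion k?")."""
    s = logits[:, k]
    t = (y == k).astype(float)
    # numerically stable binary cross-entropy on raw scores
    return np.mean(np.maximum(s, 0) - s * t + np.log1p(np.exp(-np.abs(s))))

main = softmax_ce(logits, y)
total = main + LAMBDA * sum(binary_aux(logits, y, k) for k in WEAK)
print(float(main), float(total))
```

Because each auxiliary term is non-negative, the joint objective pushes extra gradient through exactly the classes the confusion matrices flag as weak, without changing the main classifier's structure.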
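A minimal sketch of the multi-kernel PCA fusion, assuming a weighted sum of RBF kernels (one per feature stream) followed by a standard kernel-PCA projection; the kernel types, equal weights, bandwidths, feature dimensions, and the number of retained components are all illustrative choices, not the thesis's configuration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
acoustic = rng.normal(size=(n, 384))  # e.g. acoustic deep features (hypothetical dims)
spectro = rng.normal(size=(n, 128))   # e.g. spectrogram deep features

def rbf(X, gamma):
    """RBF kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Multi-kernel mapping: weighted sum of one kernel per feature stream.
K = 0.5 * rbf(acoustic, 1e-3) + 0.5 * rbf(spectro, 1e-2)

# Centre the kernel matrix in the implicit feature space.
one = np.ones((n, n)) / n
Kc = K - one @ K - K @ one + one @ K @ one

# Kernel PCA: top eigenvectors of the centred kernel give the fused feature.
vals, vecs = np.linalg.eigh(Kc)
idx = np.argsort(vals)[::-1][:10]                        # keep 10 components
alphas = vecs[:, idx] / np.sqrt(np.maximum(vals[idx], 1e-12))
fused = Kc @ alphas                                      # n x 10 fused features
print(fused.shape)
```

The PCA step in kernel space is what keeps the fused representation compact, addressing the dimensionality blow-up that naive concatenation of the two feature streams would cause.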