Emotion is a rich psychological behavior of human beings and has long been a research hotspot in many scientific fields. Speech is the most natural means of communication between people: it carries not only the content to be delivered but also rich emotional cues, and it has therefore become an important medium for emotion research. In speech emotion recognition, speech is the carrier of emotion; by analyzing the speech signal, the speaker's emotional state can be inferred, which makes human-computer interaction more humanized. In this field, the extraction of emotional feature parameters and the training of classification models are currently the main research directions, and their quality directly affects the recognition rate of the whole system. Building on deep learning, this thesis proposes a speech emotion recognition method based on deep and shallow feature fusion in a convolutional neural network (CNN) and a speech emotion recognition method based on bottleneck feature fusion in a deep neural network (DNN). The specific research work is as follows.

(1) A large number of related studies in the field of speech emotion recognition are reviewed, and several theories and commonly used speech emotion recognition methods from the literature are reproduced in simulation. The related technologies of speech emotion recognition and the commonly used classification models are also introduced in detail, laying the groundwork for the subsequent research.

(2) Acoustic features commonly used for emotion recognition include spectral features, prosodic features, voice quality features, and fusions of these features. Such features tend to describe only the time domain or only the frequency domain, yet the correlations between the two domains in the speech signal play a key role in speech emotion recognition. As a visual representation of the speech signal, the spectrogram expresses the time-frequency characteristics of speech and also reflects characteristics of the speaker. A new convolutional neural network is proposed that fuses deep and shallow features extracted from spectrograms to obtain more discriminative emotional features, and the popular transfer learning approach is used to train and test the network. The experimental results show that, compared with a traditional convolutional neural network, the proposed network with deep and shallow feature fusion achieves a higher speech emotion recognition rate.
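To make the fusion idea concrete, a minimal PyTorch sketch is given below for illustration only; the layer sizes, the assumed 1 x 128 x 128 spectrogram input, and the six emotion classes are assumptions rather than the exact architecture used in the thesis. A pooled shallow feature map is concatenated with a deeper feature vector before the classifier, which is the essence of deep and shallow feature fusion.

import torch
import torch.nn as nn

class DeepShallowFusionCNN(nn.Module):
    """Toy CNN that fuses shallow and deep convolutional features
    extracted from a spectrogram (1 x 128 x 128 input assumed)."""

    def __init__(self, num_classes=6):
        super().__init__()
        # Shallow branch: early convolution keeps fine time-frequency detail.
        self.shallow = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),            # -> 16 x 64 x 64
        )
        # Deep branch: further convolutions build more abstract features.
        self.deep = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),            # -> 32 x 32 x 32
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),    # -> 64 x 1 x 1
        )
        self.shallow_pool = nn.AdaptiveAvgPool2d(1)  # -> 16 x 1 x 1
        # The classifier sees the concatenation of both feature vectors.
        self.classifier = nn.Linear(16 + 64, num_classes)

    def forward(self, x):
        s = self.shallow(x)                      # shallow feature map
        d = self.deep(s)                         # deep feature map
        s_vec = self.shallow_pool(s).flatten(1)  # shallow descriptor
        d_vec = d.flatten(1)                     # deep descriptor
        fused = torch.cat([s_vec, d_vec], dim=1) # deep + shallow fusion
        return self.classifier(fused)

# Example: a batch of 4 single-channel spectrograms.
logits = DeepShallowFusionCNN()(torch.randn(4, 1, 128, 128))
print(logits.shape)  # torch.Size([4, 6])

In the sketch, adaptive average pooling reduces each feature map to a fixed-length descriptor so that features from different depths can be concatenated directly.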
(3) When convolutional neural networks and spectrograms are used for speech emotion recognition, the parameter settings in each layer of the network have a large impact on the final recognition rate, and the optimal values of these parameters are difficult to find in practice, which limits further improvement of the recognition rate. In recent years, DNNs have been applied more and more widely in the field of speech recognition. A DNN with a bottleneck layer is therefore designed to extract bottleneck features of the speech signal; the network concentrates the emotional information of the speech in the bottleneck layer, so that the bottleneck features capture the emotional content. By varying the position of the bottleneck layer, bottleneck features from different depths are extracted and fused, and a support vector machine model is used to classify the emotions. The experimental results show that the speech emotion recognition rate of the proposed method is improved to some extent.
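A minimal sketch of the bottleneck-feature pipeline follows, again for illustration only: the feature dimensions, the single training step, and the randomly generated placeholder data are assumptions. In the method described above, bottleneck features taken from different layer positions would be fused (for example, concatenated) before the SVM; the sketch shows a single bottleneck for brevity.

import torch
import torch.nn as nn
from sklearn.svm import SVC

class BottleneckDNN(nn.Module):
    """Feed-forward DNN whose narrow hidden layer (the "bottleneck")
    compresses frame-level acoustic features; the bottleneck
    activations are later reused as emotion features."""

    def __init__(self, in_dim=39, bottleneck_dim=32, num_classes=6):
        super().__init__()
        self.front = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.bottleneck = nn.Linear(256, bottleneck_dim)  # bottleneck layer
        self.back = nn.Sequential(
            nn.ReLU(),
            nn.Linear(bottleneck_dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.back(self.bottleneck(self.front(x)))

    def bottleneck_features(self, x):
        """Return the bottleneck activations used as emotion features."""
        with torch.no_grad():
            return self.bottleneck(self.front(x))

# Placeholder data standing in for real acoustic features and labels.
X = torch.randn(200, 39)
y = torch.randint(0, 6, (200,))

# Step 1: train the DNN on the emotion labels (one update shown for brevity).
model = BottleneckDNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = nn.CrossEntropyLoss()(model(X), y)
opt.zero_grad(); loss.backward(); opt.step()

# Step 2: extract bottleneck features and train an SVM classifier on them.
feats = model.bottleneck_features(X).numpy()
svm = SVC(kernel="rbf").fit(feats, y.numpy())
print(svm.score(feats, y.numpy()))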