
Research on Speech Emotion Recognition Based on the Two-Layer CNN-LSTM

Posted on: 2022-01-23  Degree: Master  Type: Thesis
Country: China  Candidate: N N Da  Full Text: PDF
GTID: 2518306515463984  Subject: Communication and Information System
Abstract/Summary:
With the rapid development of social science, computer science, and artificial intelligence, speech emotion recognition has advanced rapidly. Emotion is an important component of human-computer interaction: for natural interaction between humans and machines, an intelligent system must recognize the user's emotional state, and speech carries particularly rich emotional information. As society develops, higher requirements are being placed on existing speech emotion recognition technology. To achieve accurate and efficient speech emotion recognition and to improve the generalization ability of the recognition model, this thesis studies a combined model for speech emotion recognition based on a two-layer convolutional neural network (CNN) and a long short-term memory (LSTM) neural network. The main research contents are as follows:

1. This thesis studies speech emotion recognition based on a model combining a convolutional neural network with a long short-term memory network. Building on previously used single-modal emotion recognition models, a two-layer model is proposed. To explore the influence of features on emotion recognition, the algorithmic principles and development status of neural networks are introduced, and different speech emotion features, such as the energy spectrum, zero-crossing rate, fundamental frequency, and spectrogram, are compared. The spectrogram is found to carry the richest feature information, so it is used in the experiments. The existing single-modal speech emotion recognition framework is then improved, and a two-layer speech emotion recognition model is proposed. Because deepening the network impedes the backward flow of gradient information and makes training difficult, a Highway network is used to optimize the dual-sequence LSTM network. Experiments show that the combined model of an independent two-layer 2D CNN, a dual-sequence LSTM, and a Highway network significantly improves emotion recognition accuracy. A novel data preprocessing mechanism is also used: nearest-neighbor interpolation resolves the variable lengths of different audio signals. Experiments show that interpolation outperforms more typical methods such as truncation and padding, which lose information and increase computational cost. The improved model is evaluated on the IEMOCAP corpus and provides more accurate predictions than existing emotion recognition algorithms.

2. To realize automatic speech emotion recognition and improve the model's accuracy and generalization ability, this thesis uses an attention mechanism to optimize the two-layer 2D CNN-LSTM combined model, studying how deep learning can automatically discover emotion-related features in speech. The results show that a deep recurrent neural network can not only learn short-term frame-level acoustic features related to emotion, but also aggregate these features over an appropriate time span into a compact utterance-level representation. In addition, this thesis proposes a novel strategy for feature pooling over time that uses local attention to focus on the regions of the speech signal that are more emotionally prominent. This improved model is likewise evaluated on the IEMOCAP corpus and provides more accurate predictions than existing emotion recognition algorithms.
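
The combined architecture described in point 1 can be illustrated concretely. Below is a minimal sketch of one plausible configuration: a two-layer 2D CNN front end over the spectrogram, an LSTM over the resulting frame sequence, and a Highway layer before classification. All layer sizes, and the reading of "dual-sequence LSTM" as a bidirectional LSTM, are illustrative assumptions, not the thesis's exact configuration.

```python
# Hedged sketch: two 2D conv blocks -> bidirectional LSTM -> Highway -> classifier.
# Sizes (128 mel bins, 300 frames, 4 classes) are assumed for illustration.
import torch
import torch.nn as nn

class Highway(nn.Module):
    """Highway layer: gated mix of a transform and the identity,
    which eases gradient flow in deeper stacks."""
    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))             # transform/carry gate
        h = torch.relu(self.transform(x))
        return t * h + (1 - t) * x

class CnnLstmHighway(nn.Module):
    def __init__(self, n_mels=128, hidden=128, n_classes=4):
        super().__init__()
        self.cnn = nn.Sequential(                   # two 2D conv blocks
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # each pooled frame becomes a vector of channels * remaining freq bins
        self.lstm = nn.LSTM(32 * (n_mels // 4), hidden,
                            batch_first=True, bidirectional=True)
        self.highway = Highway(2 * hidden)
        self.classify = nn.Linear(2 * hidden, n_classes)

    def forward(self, spec):                        # spec: (batch, 1, n_mels, n_frames)
        feat = self.cnn(spec)                       # (batch, 32, n_mels/4, n_frames/4)
        feat = feat.permute(0, 3, 1, 2).flatten(2)  # (batch, time, channels * freq)
        out, _ = self.lstm(feat)
        h = self.highway(out[:, -1])                # last-time-step readout
        return self.classify(h)

model = CnnLstmHighway()
logits = model(torch.randn(8, 1, 128, 300))         # batch of 8 spectrograms
print(logits.shape)                                 # torch.Size([8, 4])
```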
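
The preprocessing step in point 1 resizes variable-length inputs rather than truncating or padding them. Below is a minimal sketch of that idea using nearest-neighbor interpolation along the time axis; the function name and the 300-frame target length are illustrative assumptions.

```python
# Hedged sketch: equalize spectrogram lengths via nearest-neighbor interpolation,
# so utterances of different durations map to the same fixed shape.
import torch
import torch.nn.functional as F

def resize_spectrogram(spec: torch.Tensor, target_frames: int = 300) -> torch.Tensor:
    """Resize a (freq_bins, time_frames) spectrogram to a fixed number of
    time frames using nearest-neighbor interpolation."""
    # F.interpolate expects (batch, channels, width) for 1-D resizing,
    # so treat frequency bins as channels and time as width
    spec = spec.unsqueeze(0)                        # (1, freq_bins, time_frames)
    spec = F.interpolate(spec, size=target_frames, mode="nearest")
    return spec.squeeze(0)                          # (freq_bins, target_frames)

# Usage: two utterances of different lengths map to the same shape
short = torch.randn(128, 180)                       # 128 mel bins, 180 frames
long_ = torch.randn(128, 420)
print(resize_spectrogram(short).shape, resize_spectrogram(long_).shape)
# torch.Size([128, 300]) torch.Size([128, 300])
```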
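
The temporal pooling in point 2 can be sketched as follows. The thesis describes local attention over emotionally prominent regions; shown here, as an approximation, is a simple soft-attention pooling in which learned weights decide how much each frame contributes to the utterance-level vector. The scoring network is an assumption, not the thesis's exact formulation.

```python
# Hedged sketch: attention-weighted pooling of frame-level features into one
# utterance-level vector, so salient frames outweigh a plain mean or last step.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # small MLP scores each frame's emotional salience (assumed form)
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(),
                                   nn.Linear(dim, 1))

    def forward(self, frames):                      # frames: (batch, time, dim)
        weights = torch.softmax(self.score(frames), dim=1)   # (batch, time, 1)
        return (weights * frames).sum(dim=1)        # (batch, dim)

# Usage: pool 75 frame-level features into one 256-d utterance representation
pool = AttentionPool(256)
utterance = pool(torch.randn(8, 75, 256))
print(utterance.shape)                              # torch.Size([8, 256])
```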
Keywords/Search Tags: speech emotion recognition, neural network, spectrogram, attention mechanism