| In addition to basic semantic content, speech signals carry rich emotional information. As an important branch of affective computing, speech emotion recognition is of great significance for achieving more natural human-computer interaction. Extracting feature sets rich in emotion-relevant information from speech signals is a key step in the recognition process and plays a crucial role in improving performance. Speech emotion features are usually extracted from two perspectives. First, traditional manual acoustic features are obtained by stacking frame-level speech features or computing statistics over them. Such features contain a large amount of redundant information unrelated to emotion classification, which degrades recognition performance. Second, the spectrogram, which reflects how the speech spectrum changes over time, is an important carrier of speech information. However, spectrogram-based emotion recognition still leaves room for improvement, because the information in spectrograms has not been fully mined. To enrich the features used in speech emotion recognition, this paper explores solutions for both feature extraction perspectives. The main research content is as follows: (1) Based on the Biogeography-Based Optimization algorithm, feature selection and optimization are carried out for manual acoustic features, which contain considerable redundant information. The purpose is to compress the feature dimension and improve recognition performance. First, the original Biogeography-Based Optimization algorithm is improved for the feature selection task: a feature selection model for speech emotion classification is established, and the classification results of a support vector machine are used as the basis for iterative optimization. Second, during the optimization process, the original manual acoustic feature set is randomly
divided into multiple subsets, which serve as the initial solution set for the feature optimization problem. Then, the Biogeography-Based Optimization algorithm models feature selection as natural selection during species migration, and the partition of the feature set is refined over successive iterations to obtain the optimal acoustic feature subset. (2) Multi-dimensional speech emotion features are extracted from spectrograms using neural networks. To obtain more comprehensive emotional features of speech, a multi-dimensional emotion feature extraction method based on broadband and narrowband spectrograms is proposed. First, a broadband spectrogram with high temporal resolution and a narrowband spectrogram with high frequency resolution are used as model inputs. Then, information focused on emotion classification is extracted through a convolutional module that integrates attention mechanisms. Next, the temporal information in the convolutional feature maps is further mined by a bidirectional long short-term memory (BiLSTM) network. As a result, the emotional features extracted from spectrograms achieve better recognition performance. (3) The effectiveness of the proposed methods is verified on two mainstream emotion corpora, Emo-DB and IEMOCAP. First, the proposed feature selection algorithm for manual acoustic features is verified. The results demonstrate that the algorithm significantly improves emotion recognition performance while reducing the feature dimension: recognition performance on the two corpora improved by 9.09% and 6.69%, respectively, while the feature dimension was compressed to 4.4% and 9.5% of the original. Second, the performance of spectrogram features in emotion recognition was verified experimentally. A series of comparative experiments shows that the spectrogram features extracted in this paper have better recognition performance, with the
recognition performance on the two corpora reaching 91.24% and 71.88%, respectively. Finally, the emotion recognition performance of fusing the two types of features was further explored, and experiments showed that the fused features achieved better recognition performance than either feature type alone. |
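
The feature selection idea in item (1) can be illustrated with a minimal sketch. It uses the standard Biogeography-Based Optimization (BBO) scheme with a linear migration model, not the improved variant developed in this work, whose details the abstract does not give. Each habitat is a binary mask over the features, and the initial population is a set of random subsets, as described above. The function name `bbo_feature_selection`, its parameters, and the toy fitness function are assumptions made for illustration; in particular, `toy_fitness` is only a stand-in for the support vector machine classification result that the abstract uses as the optimization criterion.

```python
import random

# Hypothetical toy setup: pretend features 0, 3 and 7 are the informative ones.
INFORMATIVE = (0, 3, 7)


def toy_fitness(mask):
    """Stand-in for SVM accuracy: reward informative features, penalize size."""
    return sum(mask[i] for i in INFORMATIVE) - 0.1 * sum(mask)


def bbo_feature_selection(n_features, fitness, pop_size=20, generations=50,
                          mutation_rate=0.02, seed=0):
    """Standard binary BBO sketch; each habitat is a 0/1 mask over features."""
    rng = random.Random(seed)
    # Initial solutions: random subsets of the original feature set.
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Rank habitats from most to least suitable.
        population.sort(key=fitness, reverse=True)
        # Linear migration model: good habitats emigrate more, immigrate less.
        emigration = [(pop_size - rank) / pop_size for rank in range(pop_size)]
        immigration = [1.0 - mu for mu in emigration]
        new_population = []
        for i, habitat in enumerate(population):
            child = habitat[:]
            for d in range(n_features):
                if rng.random() < immigration[i]:
                    # Roulette-wheel choice of the habitat that emigrates.
                    source = rng.choices(population, weights=emigration)[0]
                    child[d] = source[d]
                if rng.random() < mutation_rate:
                    child[d] = 1 - child[d]  # bit-flip mutation
            new_population.append(child)
        # Elitism: overwrite the child of the worst-ranked habitat with the
        # best mask found so far, so the best fitness never decreases.
        new_population[-1] = best[:]
        population = new_population
        best = max(population, key=fitness)
    return best


best_mask = bbo_feature_selection(n_features=12, fitness=toy_fitness)
selected = [i for i, bit in enumerate(best_mask) if bit]
print("selected features:", selected)
print("fitness:", round(toy_fitness(best_mask), 2))
```

To move from this toy to the setting described in the abstract, `toy_fitness` would be replaced by a function that trains and evaluates a support vector machine on the features selected by the mask; the migration and mutation logic would stay the same.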