
Research On Speech Emotion Recognition Based On Discriminative Feature Mining And Multi-level Knowledge Distillation

Posted on: 2024-06-25  Degree: Master  Type: Thesis
Country: China  Candidate: H Q Sun  Full Text: PDF
GTID: 2568307142951819  Subject: Software engineering
Abstract/Summary:
Speech emotion recognition (SER) is an important research direction in the field of computer speech processing, which aims to recognize the emotional state of speakers by analyzing speech features. With the continuous development of speech processing technology, SER has become a popular research topic, and its applications are wide-ranging, including voice navigation systems, speech-based diagnosis systems, and intelligent voice customer service systems. Its research results can also provide important support for other research directions in speech processing. To improve recognition accuracy, this thesis proposes discriminative feature mining and multi-level knowledge distillation for SER from two perspectives: extracting discriminative features and constructing a lightweight, robust pre-trained model. The main work is as follows:

1) In a speech signal, not all emotional features contribute equally; moreover, feature clusters from different emotion classes may overlap one another, a phenomenon known as feature confusion. To address this problem, a discriminative feature representation method for SER is proposed, which integrates a cascaded attention network with an adversarial joint loss strategy and resolves the confusion by placing greater emphasis on emotions that are difficult to classify correctly. First, log-Mels together with their deltas and delta-deltas are extracted as 3D features, which effectively reduces the interference of external factors. Next, a cascaded attention network extracts effective emotional features: spatiotemporal attention selectively locates the targeted emotional regions in the input features, and within these regions, self-attention with head fusion captures the long-distance dependencies of the temporal features. Finally, an adversarial joint loss strategy is proposed to separate emotional embeddings of high similarity using hard triplets generated in an adversarial fashion. (Both the feature extraction and the joint loss are sketched after this paragraph.) The proposed method is evaluated on the IEMOCAP, CASIA, and EMODB corpora, obtaining a weighted accuracy (WA) of 82.68% (IEMOCAP), 67.08% (CASIA), and 91.58% (EMODB), and an unweighted accuracy (UA) of 82.67% (IEMOCAP), 67.08% (CASIA), and 88.76% (EMODB), which is superior to state-of-the-art approaches.
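As a rough illustration of the 3D input, the following is a minimal sketch assuming `librosa` and 16 kHz mono audio; the function name, mel resolution, and file path are illustrative rather than the thesis's exact configuration.

```python
# Minimal sketch: 3-channel (static, delta, delta-delta) log-Mel input.
# Assumes librosa and a 16 kHz mono recording; n_mels is an assumption.
import librosa
import numpy as np

def extract_3d_logmel(path, n_mels=128):
    y, sr = librosa.load(path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                # static log-Mel spectrogram
    delta = librosa.feature.delta(log_mel)            # first-order derivative
    delta2 = librosa.feature.delta(log_mel, order=2)  # second-order derivative
    # Stack the three maps as channels: shape (3, n_mels, frames)
    return np.stack([log_mel, delta, delta2], axis=0)

features = extract_3d_logmel("sample.wav")  # hypothetical file
print(features.shape)  # e.g. (3, 128, T)
```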
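The joint loss can likewise be sketched as cross-entropy combined with a triplet loss over hard triplets; here the triplets come from simple batch-hard mining rather than the adversarial generation used in the thesis, and the margin and weight `alpha` are assumptions.

```python
# Minimal sketch of a joint loss: cross-entropy + batch-hard triplet loss.
# Batch-hard mining stands in for the thesis's adversarial triplet generation.
import torch
import torch.nn.functional as F

def joint_loss(embeddings, logits, labels, margin=0.3, alpha=0.5):
    ce = F.cross_entropy(logits, labels)

    # Pairwise distances between L2-normalised emotion embeddings.
    emb = F.normalize(embeddings, dim=1)
    dist = torch.cdist(emb, emb)

    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)

    # Hardest positive: farthest same-class sample (diagonal excluded).
    # Hardest negative: closest different-class sample.
    # Assumes each emotion class appears at least twice in the batch.
    hardest_pos = dist.masked_fill(~same | eye, float('-inf')).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float('inf')).min(dim=1).values

    triplet = F.relu(hardest_pos - hardest_neg + margin).mean()
    return ce + alpha * triplet
```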
2) In recent years, the performance of SER has improved significantly. However, most algorithms are both trained and tested on clean speech, and achieving good performance in noisy conditions remains a challenging task. In addition, pre-trained models suffer from long training times and high computational overhead. To address both problems, a discriminative feature representation method within a knowledge distillation framework is proposed. The method transfers the emotion features that a teacher network learns from a clean corpus to a structurally simpler student network that takes a noise-corrupted corpus as input: the clean-speech features learned by the teacher model serve as the learning target, and the student model approximates them from noisy speech input. Within this framework, a distilled version of the teacher model, distil wav2vec-2.0, is first proposed, which greatly reduces the number of parameters and the inference time while preserving performance: the number of transformer blocks is halved, and the student's blocks are initialized with the even-layer weights of the teacher model. Second, knowledge from multiple layers of the teacher network is selected to guide the output of single layers of the student network. (Both steps are sketched after this paragraph.) Finally, to evaluate the effectiveness of the proposed method, experiments are conducted on the IEMOCAP corpus with the NOISEX-92 noise bank. The results show an average absolute gain of 18.23% in UA over the baseline system across all noise types, a competitive result.
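The construction of the student can be sketched with the Hugging Face `transformers` API, assuming `facebook/wav2vec2-base` (12 transformer blocks) as the teacher; whether the thesis counts 'even' layers from zero or from one is not stated, so the indexing below is an assumption.

```python
# Minimal sketch: half-depth wav2vec 2.0 student initialised from the
# teacher's even-indexed transformer blocks (indexing is an assumption).
from transformers import Wav2Vec2Config, Wav2Vec2Model

teacher = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

config = Wav2Vec2Config.from_pretrained("facebook/wav2vec2-base")
config.num_hidden_layers = teacher.config.num_hidden_layers // 2  # 12 -> 6
student = Wav2Vec2Model(config)

# Reuse the teacher's convolutional feature encoder and projection.
student.feature_extractor.load_state_dict(teacher.feature_extractor.state_dict())
student.feature_projection.load_state_dict(teacher.feature_projection.state_dict())

# Copy teacher blocks 0, 2, 4, ... into student blocks 0, 1, 2, ...
for i, layer in enumerate(student.encoder.layers):
    layer.load_state_dict(teacher.encoder.layers[2 * i].state_dict())
```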
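The multi-level distillation objective might then look like the following sketch, under the assumption that each student layer is matched, via an MSE distance, to the mean of the two teacher layers it replaces; the thesis's exact layer grouping and distance function may differ.

```python
# Minimal sketch: teacher encodes clean speech, student encodes the noisy
# version, and each student layer is pulled toward a group of teacher layers.
import torch
import torch.nn.functional as F

def multilevel_distill_loss(teacher, student, clean_wave, noisy_wave):
    with torch.no_grad():
        t_out = teacher(clean_wave, output_hidden_states=True)
    s_out = student(noisy_wave, output_hidden_states=True)

    t_hidden = t_out.hidden_states  # 13 states for a 12-layer teacher
    s_hidden = s_out.hidden_states  # 7 states for a 6-layer student

    loss = 0.0
    for i in range(1, len(s_hidden)):  # student layers 1..6
        # Teacher layers 2i-1 and 2i are averaged as the target for layer i.
        group = torch.stack(t_hidden[2 * i - 1 : 2 * i + 1])
        loss = loss + F.mse_loss(s_hidden[i], group.mean(dim=0))
    return loss / (len(s_hidden) - 1)
```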
Keywords/Search Tags: Speech emotion recognition, Discriminative feature, Knowledge distillation, Pre-trained model, Wav2vec-2.0