
Speech Emotion Recognition With Deep Learning Techniques And Data Augmentation

Posted on: 2024-11-10    Degree: Doctor    Type: Dissertation
Country: China    Candidate: H Y Zhang    Full Text: PDF
GTID: 1528307361986989    Subject: Computer Science and Technology
Abstract/Summary:
Speech emotion recognition (SER) identifies the emotional state of speakers by analyzing speech signals and has a wide range of applications in human-computer interaction, disease diagnosis, monitoring, fatigue detection, public security, and other fields. With the development of psychology, neuroscience, pattern recognition, and related disciplines, SER research has achieved remarkable results, but it still faces data scarcity, uneven distribution of emotion samples, weak feature representation, and poor robustness and generalization of acoustic models. This study proposes corresponding solutions to these problems, mainly including:

1. To address the small scale and weak feature representation of single-domain datasets, a specialized deep learning model, AHPCL, and a beam feature set with strong representation ability are proposed. The core design principle of AHPCL is to map the raw data into two distinct transformation spaces, enabling a comprehensive representation. In addition, a high-quality 233-dimensional beam feature set was derived by extracting rhythmic and spectral features. AHPCL achieves accuracies of 86.73%, 87.92%, and 62.71% on the EMODB, CASIA, and SAVEE datasets, respectively; compared with peer models, it stands out for its performance, underscoring its robustness.

2. To address the scarcity of speech emotion data resources, a novel deep learning model, HMN, tailored to multi-domain datasets is proposed, along with the MDI technique for data construction. HMN employs BiLSTM and BiGRU to extract contextual emotion information comprehensively, introduces convolutional operations to extract spatial emotion information, and uses multi-operation processing to emphasize the raw emotional data. When multi-domain datasets are built with the MDI technique, samples from different domains are merged by emotion category, and the union of the emotion categories defines the new dataset (a minimal merging sketch follows item 3 below). An accuracy of 82.48% on multi-domain datasets demonstrates the model's robustness against interference.

3. To address the limited diversity and imbalanced emotion categories of speech emotion datasets, a deep learning model, MA-CapsNet, suited to noisy and imbalanced data is proposed, along with the NDA and MSC methods for constructing noisy datasets. The key to MA-CapsNet is its dynamic routing algorithm, which extracts the most salient pose information and improves the model's robustness to target position and angle. Single signal-to-noise-ratio, multiple signal-to-noise-ratio, and balanced datasets are constructed using the NDA, MSC, and resampling techniques (a noise-mixing sketch also follows below). The model achieves a highest accuracy above 94.30%, demonstrating excellent robustness and generalization.
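As a rough illustration of the MDI-style dataset construction described in item 2, the sketch below pools samples from several single-domain corpora and takes the union of their emotion categories as the label set of the merged dataset. The list-of-pairs data layout and the function name merge_domains are illustrative assumptions, not details from the dissertation.

from typing import Any, List, Tuple

def merge_domains(domains: List[List[Tuple[Any, str]]]) -> Tuple[List[Tuple[Any, str]], List[str]]:
    """Pool samples from all domains; the union of the per-domain
    emotion categories defines the label set of the merged dataset."""
    merged = [sample for domain in domains for sample in domain]
    categories = sorted({label for _, label in merged})
    return merged, categories

# Example: two toy "domains" with partially overlapping emotion sets.
emodb_like = [("wav_a", "anger"), ("wav_b", "sadness")]
casia_like = [("wav_c", "anger"), ("wav_d", "surprise")]
samples, label_union = merge_domains([emodb_like, casia_like])
print(label_union)  # ['anger', 'sadness', 'surprise']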
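The NDA-style noisy-dataset construction in item 3 presumably mixes noise into clean speech at prescribed signal-to-noise ratios; the sketch below shows the standard way of doing this, and the function mix_at_snr and its interface are assumptions for illustration, not the dissertation's actual implementation.

import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so that the speech-to-noise power ratio equals
    `snr_db` decibels, then add it to the clean signal."""
    # Tile or truncate the noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    # Solve 10 * log10(p_speech / (k**2 * p_noise)) = snr_db for the gain k.
    k = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + k * noise

# Example: corrupt a 1-second tone with white noise at 5 dB SNR.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 16000))
noisy = mix_at_snr(clean, rng.standard_normal(8000), snr_db=5.0)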
4. A method for defining the reproducibility of deep learning models is proposed, along with a reproducible deep learning model named SpeechNet. Reproducibility here refers to the ability of a model to produce consistent results across multiple rounds of testing on a fixed training and testing set; a model is considered repeatable when the number of consistent results is at least half of the total number of test runs (a sketch of this criterion appears after the abstract). The modules in SpeechNet cooperate to capture temporal and spatial information effectively and to establish hierarchical and associative relationships among features, making the model more robust and reliable. On the ESD dataset, the highest accuracy, perfect reproducibility, and perfect correct repeatability were 96.83%, 85.51%, and 85.17%, respectively, indicating good performance and reproducibility. Compared with peer models, the proposed method is more competitive.

5. To address the inability of single-task learning to represent speech emotion information accurately, a multi-task learning model, DTCN, is proposed. DTCN uses hard parameter sharing, residual modules, causal convolution, and dilated convolution to enhance its parallel processing capability and its handling of time-series data. It is also more flexible in handling historical information, avoiding the commonly encountered gradient vanishing and exploding problems (a sketch of such a shared trunk with task-specific heads appears after the abstract). The highest accuracies achieved in multi-task learning covering emotion, speaker, and gender are 92.36%, 90.84%, and 97.34%, respectively, at relatively low computational cost. This indicates the strong recognition capability of the proposed method in multi-task learning, where the tasks assist one another.

In short, to address the challenges faced by speech emotion recognition, such as limited data resources, imbalanced emotion categories, weak feature representation, and the robustness, generalization, and reproducibility of acoustic models, several solutions were proposed, including data augmentation, resampling, feature fusion, and the construction of effective acoustic models. Experimental results demonstrate that the proposed techniques achieve competitive performance compared with similar approaches.
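The repeatability criterion defined in item 4 can be made concrete with a short check: over several test rounds on a fixed split, count how many rounds produce the modal (most common) result and compare that count with half the number of rounds. The per-round representation of a "result" below (a tuple of per-sample predictions) is an assumption; the dissertation may compare results at a different granularity.

from collections import Counter
from typing import List, Sequence

def is_repeatable(round_results: List[Sequence[int]]) -> bool:
    """The model is repeatable when the number of rounds producing the
    modal result is at least half the total number of test rounds."""
    keys = [tuple(r) for r in round_results]  # hashable per-round results
    modal_count = Counter(keys).most_common(1)[0][1]
    return modal_count >= len(keys) / 2

# Example: 4 of 5 rounds agree, so the >= 1/2 criterion is met.
rounds = [(1, 0, 2), (1, 0, 2), (1, 0, 2), (1, 1, 2), (1, 0, 2)]
print(is_repeatable(rounds))  # True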
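Item 5's DTCN combines hard parameter sharing with residual, causal, and dilated convolutions. The sketch below shows a generic shared dilated-causal-convolution trunk with separate emotion, speaker, and gender heads; the layer sizes, depth, and the name TinyDTCN are illustrative assumptions, not the actual DTCN architecture.

import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1-D causal convolution: left-pad so no output frame sees the future."""
    def __init__(self, ch_in, ch_out, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(ch_in, ch_out, kernel_size, dilation=dilation)

    def forward(self, x):  # x: (batch, channels, time)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

class ResidualBlock(nn.Module):
    """Dilated causal convolution wrapped in a residual connection."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = CausalConv1d(channels, channels, 3, dilation)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.conv(x))

class TinyDTCN(nn.Module):
    """Shared dilated-causal trunk (hard parameter sharing) with
    separate heads for the emotion, speaker, and gender tasks."""
    def __init__(self, feat_dim=40, channels=64,
                 n_emotions=7, n_speakers=10, n_genders=2):
        super().__init__()
        self.inp = nn.Conv1d(feat_dim, channels, 1)
        # Exponentially growing dilations widen the temporal receptive field.
        self.trunk = nn.Sequential(*[ResidualBlock(channels, 2 ** i) for i in range(4)])
        self.heads = nn.ModuleDict({
            "emotion": nn.Linear(channels, n_emotions),
            "speaker": nn.Linear(channels, n_speakers),
            "gender":  nn.Linear(channels, n_genders),
        })

    def forward(self, x):  # x: (batch, feat_dim, time)
        h = self.trunk(self.inp(x)).mean(dim=-1)  # average-pool over time
        return {task: head(h) for task, head in self.heads.items()}

# Example: a batch of 2 utterances, 40 features, 100 frames.
logits = TinyDTCN()(torch.randn(2, 40, 100))
print({task: out.shape for task, out in logits.items()})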
Keywords/Search Tags:Speech emotion recognition, data augmentation, resampling, feature fusion, acoustic model, deep learning