
Research On The Speech Emotion Recognition Fusing Articulatory And Acoustic Features

Posted on: 2020-02-02  Degree: Doctor  Type: Dissertation
Country: China  Candidate: G F Ren  Full Text: PDF
GTID: 1368330596985592  Subject: Electronic Science and Technology
Abstract/Summary:
Along with the rapid development of artificial intelligence technology, people have put forward higher requirements for human-computer interaction technology: it is hoped that intelligent products able to recognize human emotions can provide a smooth human-computer interface. Speech emotion recognition has therefore become a hotspot in the field of artificial intelligence. In order to make computers better perceive and understand human emotional states, and to make communication between machines and humans smoother, it is necessary to make full use of multimodal signals (audio signals, facial expressions and kinematic data of the articulators) to analyze and study emotional speech. Research results on the kinematic information of articulators in emotional speech can be applied in speech rehabilitation training and computer-aided language learning. At the same time, research on articulatory-to-acoustic conversion benefits emotional speech signal processing technologies such as the production, recognition and synthesis of emotional speech. It is therefore of great significance and application value to study speech emotion recognition fusing articulatory and acoustic features, with applications to the articulation mechanism of emotional speech and to human-computer interaction technology.

This dissertation focuses on speech emotion recognition fusing articulatory and acoustic features, covering the design of an emotional speech data set, feature extraction and analysis, articulatory-to-acoustic conversion, feature fusion and speech emotion recognition. Firstly, a performance-based bimodal emotional speech data set with articulatory and acoustic data has been established. Secondly, extraction and analysis of articulatory-acoustic features in emotional speech have been carried out. Thirdly, an articulatory-to-acoustic conversion algorithm based on PSO-LSSVM has been proposed to achieve the conversion from articulatory features to F2 and 12-dimensional MFCC features. Finally, a hybrid fusion method based on the Deep Boltzmann Machine has been proposed, and the fused features have been applied to speech emotion recognition. The main contents and innovations of the research are as follows:

(1) A bimodal emotional speech data set containing acoustic and articulatory data has been established. Based on research into the establishment methods and contents of traditional emotional databases containing articulatory data, a Mandarin emotional data set (covering anger, happy, sad and neutral emotions) has been recorded using a performance-based (acted) approach. A comprehensive fuzzy evaluation model combining subjective and objective criteria has then been used to evaluate and select the acoustic data, while the Root Mean Square Error has been used to select the articulatory data. Finally, in line with people's daily communication habits, an effective Mandarin emotional speech database containing vowels, words and sentences has been built and used in the follow-up studies.

(2) Breaking the limitation of traditional single-syllable studies, disyllable-level and sentence-level emotional speech has been studied based on articulatory-acoustic features. Combining the characteristics of a tonal language, disyllabic words and sentences were taken as the research objects, the influence of emotion changes on the articulatory and acoustic features of emotional speech was analyzed, and the correlation between them was studied. Before feature extraction, speaker normalization based on the Procrustes transformation was carried out on the articulatory data; the normalized data eliminate the physiological differences between speakers. It is found that the more syllables an utterance contains, the more significant the influence of emotion on the articulatory features, and this influence is more significant than the influence of emotion on the acoustic features. Meanwhile, with the increase in the number of syllables, the velocities of the tongue root and the left and right corners of the mouth were more significantly affected by emotion. The analysis of articulatory and acoustic features in sentence-level and word-level emotional speech showed that polysyllabic speech carries more emotional information than monosyllables or vowels. In addition, there is a strong correlation between the velocities of the tongue and lips and acoustic features such as formants and fundamental frequency, and the stronger the expression of emotion, the stronger the correlation.
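As a rough illustration of the speaker normalization described in (2), the sketch below aligns one speaker's articulatory sensor coordinates to a reference speaker with a Procrustes transformation. The sensor layout, array shapes and function names are illustrative assumptions, not the dissertation's actual procedure.

```python
# Minimal sketch of Procrustes-based speaker normalization for articulatory
# (e.g. EMA) sensor data.  The choice of a reference template and the (N, 2)
# coordinate layout are assumptions for illustration only.
import numpy as np

def procrustes_align(points: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Align one speaker's sensor coordinates (N x 2) to a reference template
    by removing translation, scale and rotation differences."""
    # Remove translation: centre both point sets on their centroids.
    p = points - points.mean(axis=0)
    r = reference - reference.mean(axis=0)
    # Remove scale: normalise each set to unit Frobenius norm.
    p /= np.linalg.norm(p)
    r /= np.linalg.norm(r)
    # Optimal rotation from the SVD of the cross-covariance matrix.
    u, _, vt = np.linalg.svd(p.T @ r)
    rotation = u @ vt
    return p @ rotation  # speaker data mapped into the reference frame

# Example: map a second speaker's mean sensor configuration onto speaker 1's
# frame before extracting articulatory features (tongue/lip positions, velocities).
speaker1 = np.random.rand(6, 2)   # 6 sensors, (x, y) midsagittal coordinates
speaker2 = np.random.rand(6, 2)
normalized2 = procrustes_align(speaker2, speaker1)
```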
(3) An articulatory-to-acoustic feature conversion model for emotional speech based on the PSO-LSSVM algorithm has been proposed. Combining the results of the articulatory-acoustic feature analysis, a GMM model and the PSO-LSSVM algorithm have been used to achieve articulatory-to-F2 and articulatory-to-MFCC conversion respectively, and the conversion model has been analyzed theoretically. The converted features have been compared with the true acoustic features, and the comparison showed that the converted features are effective.

(4) A multimodal hybrid fusion method based on the Deep Boltzmann Machine has been proposed and applied to a speech emotion recognition system. A hybrid fusion network model for speech emotion recognition has been established, and its theoretical analysis and formula derivation have been carried out, combined with Random Forest and Support Vector Machine classifiers. The experimental results showed that the recognition results of hybrid fusion are better than those of single-modality emotion recognition, and better than those of feature-level fusion of acoustic and articulatory features. At the same time, comparative recognition results with KNN, SVM and RF classifiers showed that using Random Forest as the recognizer is more effective than SVM or KNN.
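As a rough, self-contained illustration of the articulatory-to-acoustic conversion in (3), the sketch below fits an LS-SVM regressor with an RBF kernel and tunes its hyperparameters. A simple random candidate search stands in for the particle swarm optimization, and all data shapes, search ranges and names are assumptions rather than the dissertation's actual configuration.

```python
# Rough sketch of LS-SVM regression for articulatory-to-acoustic conversion
# (one output dimension, e.g. F2 or one MFCC coefficient).  The RBF kernel,
# the hyperparameter ranges and the random stand-in for PSO are illustrative.
import numpy as np

def rbf_kernel(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def lssvm_fit(X, y, gamma, sigma):
    """Solve the standard LS-SVM linear system for dual weights alpha and bias b."""
    n = len(y)
    K = rbf_kernel(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]          # bias b, dual coefficients alpha

def lssvm_predict(X_train, X_new, b, alpha, sigma):
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

# Toy search over (gamma, sigma): each candidate pair is scored by validation
# error; a full PSO would additionally track velocities and personal/global bests.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))      # articulatory features (positions, velocities)
y = rng.normal(size=200)            # target acoustic feature (e.g. F2)
X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]
best = None
for gamma, sigma in zip(rng.uniform(0.1, 100, 20), rng.uniform(0.1, 10, 20)):
    b, alpha = lssvm_fit(X_tr, y_tr, gamma, sigma)
    err = np.mean((lssvm_predict(X_tr, X_va, b, alpha, sigma) - y_va) ** 2)
    if best is None or err < best[0]:
        best = (err, gamma, sigma)
```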
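The following sketch only illustrates the overall shape of the hybrid fusion pipeline in (4): unimodal decision scores are combined with a joint feature representation and fed to a Random Forest recognizer. The Deep Boltzmann Machine that learns the joint articulatory-acoustic representation is not implemented here; a plain feature concatenation stands in for it, and all dimensions, labels and data are synthetic.

```python
# Schematic hybrid (feature-level + decision-level) fusion with a Random Forest
# recognizer.  A concatenation replaces the DBM-learned joint representation,
# so this is only a sketch of the surrounding fusion/classification logic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 400
acoustic = rng.normal(size=(n, 24))       # e.g. MFCC statistics per utterance
articulatory = rng.normal(size=(n, 18))   # e.g. tongue/lip position and velocity statistics
labels = rng.integers(0, 4, size=n)       # anger / happy / sad / neutral
tr, te = slice(0, 300), slice(300, None)

# Decision-level branch: unimodal classifiers produce per-class scores.
svm_ac = SVC(probability=True).fit(acoustic[tr], labels[tr])
svm_ar = SVC(probability=True).fit(articulatory[tr], labels[tr])

def hybrid_features(idx):
    # Feature-level branch: joint representation of both modalities
    # (plain concatenation standing in for the DBM joint features).
    joint = np.hstack([acoustic[idx], articulatory[idx]])
    # Append unimodal decision scores to form the hybrid input.
    return np.hstack([joint,
                      svm_ac.predict_proba(acoustic[idx]),
                      svm_ar.predict_proba(articulatory[idx])])

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(hybrid_features(tr), labels[tr])
accuracy = rf.score(hybrid_features(te), labels[te])
```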
Keywords/Search Tags: Speech emotion recognition, Articulatory features, Articulatory-acoustic analysis, Feature conversion, Hybrid fusion