
Research On Human Voice Emotion Analysis

Posted on: 2023-04-22  Degree: Master  Type: Thesis
Institution: University  Candidate: Dejoli Tientcheu Touko Landry  Full Text: PDF
GTID: 2568307103991859  Subject: Communication and Information System
Abstract/Summary:
Emotion analysis technology has been applied in intelligent electronic equipment, medical care systems, and several other fields. Voice emotion analysis refers to the use of various technologies to detect and analyze the emotions latent in voice data, focusing on both the speech and nonspeech parts of voice signals. Voice emotion recognition faces two main challenges. First, data are scarce: the number of available and reliable datasets is limited. Second, mixed emotions may appear in human voice signals; for example, listeners may perceive conflicting information when the speech in a voice signal expresses a different emotion than the nonspeech sounds occurring in the same signal. These differences motivate our investigation of robust recognition frameworks.

Recently, deep learning algorithms have been successfully applied to speech emotion recognition (SER). Various model structures have shown a good ability to extract high-level emotion features and have achieved good accuracy on existing datasets. However, when these systems are used in real-time human-computer interaction, accuracy drops drastically. One reason is that most studies of human emotion recognition systems are based on acted, scripted datasets recorded in controlled environments, where background noise is almost absent and subjects try to emphasize individual emotions. Another reason for the performance degradation is that existing models do not take into account emotional nonspeech sounds or vocal bursts, which contain valuable information about the speaker's emotional state.

The main work of this paper is to bring speech emotion recognition and nonspeech emotion recognition into the same recognition framework. Given the high correlation between nonspeech emotion recognition and traditional speech emotion recognition, the paper provides a rationale for a dual-process emotion analysis scheme that considers the particularities of both speech and nonspeech emotional sounds. The emotion analysis in this paper relies on acoustic information (without considering words) and optimizes emotion recognition by analyzing the acoustic and spectral features of audio samples. The steps taken are as follows:

1. Built a multi-language dataset, ASVP-ESD (Audio, Speech, and Vision Processing lab - Emotional Sound Database), which contains 12635 audio samples from both genders, covering both speech (5090) and nonspeech (7545) emotional sounds across 13 emotion categories.

2. Designed a dual-process emotion recognition scheme based on speech/nonspeech classification. In the first stage, a linear regression algorithm is used to classify speech and nonspeech sound types. In the second stage, a densely connected convolutional network is combined with a recurrent network to extract local and global features and time-dependent features, respectively; distinct feature patterns across emotions are then detected through a simple logistic-regression attention mechanism to perform emotion recognition. This design acknowledges that, in addition to the speech modality carrying acoustic and lexical (word) information about emotional states, humans also use nonspeech sounds to express emotions. The two-stage recognition can effectively establish the correlation between speech and nonspeech emotional sounds expressed in the human voice (e.g., laughter represents happiness). A sketch of the two stages is given after this list.

3. Analyzed the performance of the proposed model on existing publicly available speech-based datasets. Speaker-independent experiments indicated that the proposed architecture matched the latest published results on conversational datasets (such as IEMOCAP with 4 emotions) and achieved reasonable results on single-speaker datasets (such as BERLIN EMODB and RAVDESS, with 7 and 8 emotions respectively). Using the two-stage recognition strategy, the model achieved a promising accuracy on the ASVP-ESD dataset with its 13 emotions, exceeding the traditional single-stage approach by 2% weighted accuracy (WA) on the test set.
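The dual-process design in step 2 can be illustrated with a minimal PyTorch sketch. This is not the thesis implementation: the log-mel-spectrogram input, layer sizes, dense-block depth, and the module names SpeechNonspeechClassifier, DenseBlock, and EmotionRecognizer are assumptions made only for illustration, and stage 1 is written here as a simple linear/sigmoid classifier over time-averaged acoustic features.

import torch
import torch.nn as nn

class SpeechNonspeechClassifier(nn.Module):
    """Stage 1 (illustrative): linear classifier over time-averaged acoustic features."""
    def __init__(self, feat_dim=40):
        super().__init__()
        self.linear = nn.Linear(feat_dim, 1)

    def forward(self, feats):                        # feats: (batch, time, feat_dim)
        pooled = feats.mean(dim=1)                   # average features over time
        return torch.sigmoid(self.linear(pooled))    # probability of nonspeech

class DenseBlock(nn.Module):
    """Minimal densely connected convolutional block: each layer sees all earlier outputs."""
    def __init__(self, in_ch=1, growth=16, layers=3):
        super().__init__()
        self.convs = nn.ModuleList()
        ch = in_ch
        for _ in range(layers):
            self.convs.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(),
                nn.Conv2d(ch, growth, kernel_size=3, padding=1)))
            ch += growth
        self.out_ch = ch

    def forward(self, x):
        for conv in self.convs:
            x = torch.cat([x, conv(x)], dim=1)       # dense connectivity via channel concat
        return x

class EmotionRecognizer(nn.Module):
    """Stage 2 (illustrative): dense CNN (local features) + GRU (temporal features) + attention pooling."""
    def __init__(self, n_mels=40, n_emotions=13, hidden=128):
        super().__init__()
        self.dense = DenseBlock(in_ch=1)
        self.reduce = nn.Conv2d(self.dense.out_ch, 32, kernel_size=1)
        self.gru = nn.GRU(32 * n_mels, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)         # per-frame attention scores
        self.classifier = nn.Linear(2 * hidden, n_emotions)

    def forward(self, spec):                         # spec: (batch, 1, n_mels, time)
        h = self.reduce(self.dense(spec))            # (batch, 32, n_mels, time)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)
        h, _ = self.gru(h)                           # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)       # attention weights over time
        context = (w * h).sum(dim=1)                 # attention-weighted frame summary
        return self.classifier(context)              # emotion logits

As a usage illustration, a batch of spectrograms of shape (batch, 1, 40, time) passed through EmotionRecognizer() yields emotion logits of shape (batch, 13), one score per emotion category.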
Keywords/Search Tags: Voice emotion recognition, speech & nonspeech emotional sound, Densely Connected Convolutional Neural Network, Recurrent Neural Network, Attention mechanism