
Research On The Invariance Of Speech Emotion Based On Deep Learning

Posted on: 2022-02-20
Degree: Master
Type: Thesis
Country: China
Candidate: C Wang
Full Text: PDF
GTID: 2558306914962019
Subject: Electronic and communication engineering
Abstract/Summary:
With the continuing development of science and technology, research on speech emotion has gradually received attention. Speech emotion research helps promote the understanding of speech emotion, and it is also useful for speech recognition and speech synthesis. In addition, speech emotion recognition has a wide range of practical application scenarios, such as voice assistants, call centers, and online classrooms. However, the current performance of speech emotion recognition is not ideal. This is partly because most existing studies treat speech emotion as a generic pattern recognition task and have not examined the essence of speech emotion. Speech emotion is affected by speaker and content information. Therefore, to improve the performance of speech emotion recognition, it is necessary to study the invariance of speech emotion, that is, the emotional information in speech that remains unchanged relative to the varying speaker and content. The invariance of speech emotion can be divided into three parts according to the levels of the speech emotion recognition system. The main work of this thesis in each part is described in turn below:

1) Research on the expression of speech emotion invariance in the spectrogram, excitation source, and vocal tract information. The expression of speech emotion invariance is closely related to the input space, so it is necessary to study emotion invariance in each input space. In this thesis, a homomorphic signal processing method is used to decompose speech into two parts: the excitation source and the vocal tract. Through a parameter search over the convolution kernel, the structured representation of speech emotion in the excitation source and the vocal tract is explored, and the best excitation-source and vocal-tract models are combined through model fusion, achieving performance gains. In addition, based on the relativity of speech emotion, this thesis uses a long-term difference method to study the expression of emotion invariance in the spectrogram, demonstrating that speech emotion is affected by content information.

2) Speech emotion information extraction based on a mixture of multiple methods. Because of the complexity of speech emotion, this thesis uses a variety of methods to extract emotion information. Starting from convolutional neural networks, and based on the different meanings of the time and frequency axes of the spectrogram, a wide-ranging parameter search over the convolution kernel is conducted; multi-scale information is then extracted through model fusion, multi-scale blocks, and related techniques; finally, a Transformer is added to the model, achieving better performance.

3) A speech emotion understanding network based on emotion attributes is designed, and the influence of sequential information processing methods and learning strategies on the understanding of speech emotion information is studied. The later stages of the model and the learning strategies affect the understanding of speech emotion information. This thesis first explores the impact of several sequence-processing networks and ways of utilizing temporal information on the understanding of emotional information; second, by using continuous emotion labels and simulating human-brain subsystems, emotion attributes are introduced as auxiliary tasks for multi-task learning, which significantly improves performance; finally, by adjusting the distribution of training data within each batch, a better understanding of speech emotion invariance information is achieved.
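To make the decomposition in part 1) concrete, below is a minimal sketch of homomorphic (cepstral) separation of a single speech frame into vocal-tract and excitation-source components. The frame length, lifter cutoff, and function names are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch of homomorphic decomposition via cepstral liftering (assumed
# parameters; not the thesis code). Low quefrencies ~ vocal-tract envelope,
# the residual ~ excitation source.
import numpy as np

def homomorphic_split(frame, lifter_cutoff=30):
    """Split one windowed speech frame into vocal-tract and excitation log-spectra."""
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)      # log-magnitude spectrum
    cepstrum = np.fft.irfft(log_mag)                 # real cepstrum
    lifter = np.zeros_like(cepstrum)
    lifter[:lifter_cutoff] = 1.0                     # keep low quefrencies
    lifter[-lifter_cutoff + 1:] = 1.0                # symmetric half of the lifter
    vocal_tract_log = np.fft.rfft(cepstrum * lifter).real   # smoothed envelope
    excitation_log = log_mag - vocal_tract_log               # residual = source
    return vocal_tract_log, excitation_log

# Example on a placeholder 512-sample windowed frame (e.g. 32 ms at 16 kHz)
frame = np.hanning(512) * np.random.randn(512)
envelope, source = homomorphic_split(frame)
```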
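For the kernel parameter search in part 2), the following sketch (an assumed layout, not the thesis architecture) illustrates shaping convolution kernels differently along the frequency and time axes of a spectrogram and fusing the two branches; the kernel sizes, channel counts, and class count are placeholders.

```python
# Sketch of frequency-oriented vs. time-oriented convolution kernels on a
# spectrogram input of shape (batch, 1, freq_bins, time_frames).
import torch
import torch.nn as nn

class TimeFreqConv(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.freq_branch = nn.Conv2d(1, 16, kernel_size=(9, 3), padding=(4, 1))  # tall kernel: spans frequency
        self.time_branch = nn.Conv2d(1, 16, kernel_size=(3, 9), padding=(1, 4))  # wide kernel: spans time
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(32, n_classes)

    def forward(self, spec):
        f = torch.relu(self.freq_branch(spec))
        t = torch.relu(self.time_branch(spec))
        h = torch.cat([self.pool(f), self.pool(t)], dim=1).flatten(1)  # fuse branches
        return self.fc(h)

logits = TimeFreqConv()(torch.randn(8, 1, 128, 300))  # dummy mel-spectrogram batch
```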
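For part 3), the sketch below shows one way continuous emotion attributes (e.g. valence and arousal) can serve as an auxiliary regression task alongside categorical emotion classification in a multi-task setup; the GRU encoder, layer sizes, attribute count, and loss weight are assumptions for illustration, not the thesis model.

```python
# Multi-task sketch: shared encoder, categorical emotion head, and an
# auxiliary emotion-attribute regression head trained with a weighted sum of losses.
import torch
import torch.nn as nn

class EmotionMultiTaskNet(nn.Module):
    def __init__(self, feat_dim=80, hidden=128, n_classes=4, n_attributes=2):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.cls_head = nn.Linear(2 * hidden, n_classes)      # categorical emotion
        self.attr_head = nn.Linear(2 * hidden, n_attributes)  # valence/arousal etc.

    def forward(self, x):
        out, _ = self.encoder(x)          # x: (batch, time, feat_dim)
        pooled = out.mean(dim=1)          # average over time
        return self.cls_head(pooled), self.attr_head(pooled)

model = EmotionMultiTaskNet()
cls_loss, attr_loss = nn.CrossEntropyLoss(), nn.MSELoss()
x = torch.randn(8, 300, 80)               # dummy feature batch
y_cls = torch.randint(0, 4, (8,))
y_attr = torch.rand(8, 2)
logits, attrs = model(x)
loss = cls_loss(logits, y_cls) + 0.5 * attr_loss(attrs, y_attr)  # weighted multi-task loss
loss.backward()
```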
Keywords/Search Tags: speech, emotion, deep learning, invariance