
Emotion Recognition Using User Speech

Posted on: 2022-06-28    Degree: Master    Type: Thesis
Country: China    Candidate: J N Geng    Full Text: PDF
GTID: 2518306323467044    Subject: Data Science (Computer Science and Technology)
Abstract/Summary:
As one of the most common natural interaction modes, speech plays an important role in human-computer interaction and has attracted growing research attention with the development of the Internet of Things. Most researchers focus on speech recognition, speech-to-text conversion, and related directions, all of which recognize and process the semantics of speech. Speech, however, is a complex high-level behavior: beyond semantics, it carries rich additional information such as emotion. Speech emotion recognition does not attend to the specific semantic content of an utterance; instead, it identifies the emotion carried by the speech from the way the speech varies.

The main challenge of speech emotion recognition lies in individual differences in speech expression and speech content. For the same emotion, different people have different expressions and habits, and it is difficult for a classification method to adapt to the characteristics of every speaker. Moreover, the set of emotion categories is limited while the combinations of speech content are effectively unlimited; this asymmetry poses a great challenge for a classification model when extracting emotional features. The selection and extraction of speech features and the design of the classifier are therefore the key components that determine the result of speech emotion classification.

For feature selection, the thesis introduces the main speech features and describes how they are computed, and finally selects the Mel spectrum coefficients and the Mel-frequency cepstral coefficients (MFCC). These two feature types integrate the characteristics of speech data in both the time domain and the frequency domain and are widely used in speech processing.

For classifier design, motivated by the excellent performance of deep convolutional networks in image recognition, the thesis first designs a model built from convolutional neural networks, CNNSpeech. Second, considering the long-range contextual effects in speech and the uncertainty of emotional label expression, the RawSeeSpeech model uses a Transformer encoder to extract long-range emotional features from speech. Finally, to further reduce the distance between samples of the same emotion, a center loss function is introduced, yielding the SeeSpeech model.

SeeSpeech not only achieves high classification accuracy but also, through the joint decision of the center loss and the Softmax cross-entropy loss, reduces the intra-class distance and increases the inter-class gap, making the model independent of the speaker. For real environments, the noise robustness of the model is first increased by adding noise to the speech during training, and the data are then denoised with band-pass filtering and wavelet filtering to improve classification accuracy on noisy data.

In the experiments, the best classification results are obtained with the Mel spectrum coefficients and the SeeSpeech model, reaching a classification accuracy of 94%. Cross-validation also indicates that the results are independent of the speaker. Finally, the thesis shows that in an actual Internet of Things scenario, SeeSpeech achieves an accuracy of 82% in the accuracy test, and the running-performance experiments demonstrate that the model can run on thin IoT devices, giving it a wide range of application scenarios.
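The abstract does not include extraction code; the following is a minimal sketch of computing the two feature types it names (Mel spectrum and MFCC) with the librosa library. The file name and all parameter values are illustrative assumptions, not the thesis settings.

```python
# Sketch: Mel spectrogram and MFCC extraction with librosa.
# "utterance.wav" and all parameters are assumed values.
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file

# Mel spectrogram: time-frequency energy on the Mel scale, converted to dB.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                     hop_length=160, n_mels=64)
log_mel = librosa.power_to_db(mel)

# MFCC: cepstral coefficients derived from the Mel spectrogram.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)

print(log_mel.shape, mfcc.shape)  # (n_mels, frames), (n_mfcc, frames)
```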
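For the RawSeeSpeech idea of using a Transformer encoder to capture long-range emotional cues, a minimal PyTorch sketch follows; all dimensions and layer counts are assumed values, not the thesis configuration.

```python
# Sketch: a Transformer encoder over per-frame speech features, so each
# frame attends to the whole utterance. All sizes are assumed values.
import torch
import torch.nn as nn

d_model = 128                                   # per-frame feature size
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

frames = torch.randn(8, 200, d_model)           # (batch, frames, features)
context = encoder(frames)                       # same shape; each frame now
                                                # sees long-range context
emotion_vec = context.mean(dim=1)               # pooled utterance embedding
```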
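The joint objective described for SeeSpeech (Softmax cross-entropy plus a center loss that pulls same-class embeddings toward a learned class center) can be sketched as below. The class count, feature dimension, and weight `lam` are assumed hyperparameters.

```python
# Sketch: center loss combined with Softmax cross-entropy.
# num_classes, feat_dim, and lam are assumed values.
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        # One learnable center per emotion class.
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features: torch.Tensor, labels: torch.Tensor):
        # Mean squared distance from each embedding to its class center.
        return ((features - self.centers[labels]) ** 2).sum(dim=1).mean()

num_classes, feat_dim, lam = 6, 128, 0.01
ce_loss = nn.CrossEntropyLoss()
center_loss = CenterLoss(num_classes, feat_dim)

features = torch.randn(8, feat_dim)            # embeddings from the encoder
logits = torch.randn(8, num_classes)           # classifier outputs
labels = torch.randint(0, num_classes, (8,))

# Joint decision: cross-entropy widens the inter-class gap, while the
# center term shrinks the intra-class distance.
loss = ce_loss(logits, labels) + lam * center_loss(features, labels)
```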
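Finally, the band-pass filtering step used for denoising can be sketched with a Butterworth filter from SciPy. The pass band (80 Hz to 4 kHz, roughly the range of voiced speech) and the filter order are assumed values, not those used in the thesis.

```python
# Sketch: band-pass denoising with a Butterworth filter.
# The pass band and order are assumed values.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(signal: np.ndarray, sr: int,
             low: float = 80.0, high: float = 4000.0, order: int = 5):
    # Design in second-order sections for numerical stability, then apply
    # forward and backward for zero phase distortion.
    sos = butter(order, [low, high], btype="bandpass", fs=sr, output="sos")
    return sosfiltfilt(sos, signal)

sr = 16000
noisy = np.random.randn(sr)        # stand-in for a noisy 1 s utterance
clean = bandpass(noisy, sr)
```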
Keywords/Search Tags: Internet of Things, Speech Emotion Recognition, Mel-frequency Cepstral Coefficient, Convolutional Neural Network, Multi-Head Attention Mechanism