Speech carries and transmits a great deal of information in interpersonal communication. With the continual iteration of computer hardware and deep learning algorithms, intelligent speech products keep emerging. However, most of them currently lack the ability to perceive human emotion. Speech is an important medium for human-computer interaction; enabling a machine to recognize the emotional information conveyed by the speaker and to give positive feedback would make intelligent speech products more user-friendly and improve the user experience. Although research on speech emotion recognition for Mandarin and English has gone on for more than 20 years, its results are rarely applied in real life, mainly because of low recognition rates and poor model generalization. Improving model performance remains a challenging task. Tibetan is one of China's important national languages and is used by more than 7 million people in the country. So far, only a handful of emotion recognition studies have been carried out on Tibetan speech. With the growing use of human-computer interaction technology, research on emotion recognition for Tibetan speech is necessary.

This paper first presents the significance and value of studying emotion recognition of Tibetan speech, reviews the current state of related research at home and abroad, analyzes recent work on emotion recognition for Tibetan speech, and outlines the main work and chapter arrangement of this paper. It then covers the basic knowledge of speech emotion recognition, including speech emotion description models, emotion databases, preprocessing, feature extraction, and recognition models, and gives the general framework of the speech emotion recognition model used in this paper. The primary research content of this paper is as follows:

(1) The Tibetan
speech emotion data set TSED is constructed. First, by analyzing and comparing the similarities and differences among Tibetan, Mandarin, and English in pronunciation, phonics, and intonation, the paper argues for the necessity and feasibility of constructing a Tibetan speech emotion data set. Then the specific process of constructing TSED is described. TSED consists of 6000 Tibetan utterances covering 5 emotions, recorded by 12 speakers. Finally, based on an analysis of TSED, the extraction principles and waveforms of the features required by the later experiments are introduced. All subsequent experiments in this paper are carried out on TSED.

(2) A Tibetan speech emotion recognition model based on 1-2DCNN-BiGRU-MHAT is built. First, the overall structure of 1-2DCNN-BiGRU-MHAT is introduced. Second, the structure and function of each submodule are described. Then the experimental settings are given. Finally, three sets of experiments are conducted. Experiment 1 performs emotion recognition of Tibetan speech with convolutional neural networks such as VGGNet and ResNet; the highest recognition rate, 79.19% on VGG13, remains below 80%, which indicates the need to design a network specifically for emotion recognition of Tibetan speech. Experiment 2 verifies that each submodule of the 1-2DCNN-BiGRU-MHAT network improves model performance. Experiment 3 verifies the effectiveness of the 1-2DCNN-BiGRU-MHAT network: after adding the 1DCNN, 2DCNN, and MHAT modules to the BiGRU network, the recognition rate reaches 84.50%, an increase of 21.67 percentage points.

(3) A Tibetan speech emotion recognition network based on multi-feature fusion, 1-2DCNN-Transformer Encoder, is built. First, the main structure of the Transformer encoder module is introduced. Second, the 1-2DCNN-Transformer Encoder network, composed of a 1-D convolution layer and a Transformer encoder module modified by a 2-D convolution layer, is introduced. Then some methods of
feature fusion are introduced. Finally, four sets of experiments are conducted to verify the effectiveness of the proposed 1-2DCNN-Transformer Encoder network. Experiment 2 selects features with good classification performance. Experiment 3 shows, through multiple experiments, that the best recognition rate, 87.50%, is obtained with the fusion feature MFCC260; compared with a single feature, MFCC260 improves the recognition rate by 2.33 percentage points. Experiment 4 demonstrates the effectiveness of MFCC260 through two sub-experiments: (1) comparing MFCC260 with the classical fusion feature sets InterSp09 and eGeMAPS shows that MFCC260 performs better on the 1-2DCNN-Transformer Encoder network; (2) applying MFCC260 to the 1-2DCNN-BiGRU-MHAT network improves the recognition rate by 1.67 percentage points, which shows that the fusion feature MFCC260 can be applied to other networks and can improve their performance. Overall, the results show that the 1-2DCNN-Transformer Encoder network achieves a recognition rate of 87.50% on TSED with the MFCC260 fusion feature.
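
As a quick consistency check on the figures quoted above, the following minimal Python sketch recovers the baseline rates implied by the reported gains. The constant and function names are illustrative, not taken from the paper. The derived values are implied by subtraction only and are not reported figures. The even split of TSED across speakers and emotions is an assumption; the abstract does not state how the 6000 utterances are distributed.

```python
# Figures quoted in the abstract (recognition rates in %, gains in percentage points).
BIGRU_MHAT_RATE = 84.50        # 1-2DCNN-BiGRU-MHAT on TSED
GAIN_OVER_BIGRU = 21.67        # gain over the plain BiGRU network
TRANSFORMER_RATE = 87.50       # 1-2DCNN-Transformer Encoder with MFCC260
GAIN_OVER_SINGLE = 2.33        # gain of MFCC260 over a single feature

TSED_UTTERANCES = 6000
TSED_SPEAKERS = 12
TSED_EMOTIONS = 5


def implied_baseline(rate: float, gain_pp: float) -> float:
    """Return the rate before a gain of `gain_pp` percentage points."""
    return round(rate - gain_pp, 2)


# Baselines implied by the reported gains (derived, not reported).
bigru_baseline = implied_baseline(BIGRU_MHAT_RATE, GAIN_OVER_BIGRU)
single_feature_rate = implied_baseline(TRANSFORMER_RATE, GAIN_OVER_SINGLE)

# ASSUMPTION: utterances are evenly split; the abstract does not say so.
per_speaker = TSED_UTTERANCES // TSED_SPEAKERS
per_emotion = TSED_UTTERANCES // TSED_EMOTIONS

print(f"Implied plain BiGRU rate:       {bigru_baseline:.2f}%")
print(f"Implied single-feature rate:    {single_feature_rate:.2f}%")
print(f"Utterances per speaker (even):  {per_speaker}")
print(f"Utterances per emotion (even):  {per_emotion}")
```

Under these assumptions, the plain BiGRU network would sit at about 62.83% and the best single feature at about 85.17%, which is consistent with the gains being stated in percentage points rather than as relative improvements.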