
Research On Several Key Technologies In Cross-corpus Speech Emotion Recognition

Posted on: 2017-04-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X R Zhang
Full Text: PDF
GTID: 1318330515958318
Subject: Signal and Information Processing
Abstract/Summary:
Speech emotion recognition (SER) is an active research topic in affective computing, pattern recognition, signal processing, and human-machine interaction. Its central task is to classify speech signals into emotional states such as anger, fear, disgust, and happiness. Many effective SER methods have been proposed in recent years, but most of them focus on a single speech corpus. In many practical applications, however, the training corpus may differ greatly from the testing corpus: the two may come from different languages, speakers, cultures, and data distributions, and may differ in size. These are typical cross-corpus SER problems. The field therefore needs algorithms that keep pace with related techniques in feature extraction, feature selection, feature fusion, and classifier design. Guided by the technical characteristics of SER, this dissertation studies several key technologies in cross-corpus speech emotion recognition. The main contributions are as follows:

1. A cross-corpus speech emotion classification method based on a Student's t-mixture model with an infinite number of components (iSMM) is proposed, which can directly and effectively recognize various kinds of speech emotion samples. Compared with the traditional Gaussian mixture model (GMM), a Student's t-mixture emotion model can handle the sample outliers that exist in the emotion feature space and remains robust to atypical emotion test data. To cope with the high data complexity of high-dimensional feature spaces and the shortage of training samples, a global latent space is added to the emotion model. This makes the number of components unbounded and yields the iSMM emotion model, which automatically determines the best number of components at low complexity for classifying diverse emotion feature data. Experiments on two acted databases (DES, EMO-DB) and one spontaneous database (FAU Aibo Emotion Corpus), all with high-dimensional feature samples and diverse data distributions, show that iSMM maintains better recognition performance than the compared methods, verifying its validity and its generalization to outliers and high-dimensional emotion features from various corpora.

2. Building on KNN, kernel methods, the feature-line centroid method, and the LDA algorithm, the dissertation proposes the LDA+Kernel-KNNFLC method for speech emotion recognition. To reduce the large amount of computation caused by the prior sample characteristics, the kernel KNN learning method is improved with a centroid-distance criterion for the learning samples, and LDA is applied to optimize the emotional feature vectors, ensuring stable recognition of emotional information while avoiding dimensional redundancy. For cross-corpus research, the dissertation focuses on the overfitting of the boundaries between emotion categories within an individual database. By relearning the feature space, the proposed classifier optimizes the degree of differentiation between emotional feature vectors and is well suited to speech emotion recognition in combination with the LDA method. The corpus used in the simulation experiments contains 120-dimensional global statistical features. Multiple comparative analyses are carried out over dimension-reduction schemes, emotion classifiers, and dimension parameters; the results show that LDA+Kernel-KNNFLC improves the inter-class performance of SER markedly under identical conditions.
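To make the "infinite components" idea of contribution 1 concrete, the following is a minimal sketch, not the thesis model itself: scikit-learn has no Student's t-mixture, so a truncated Dirichlet-process Gaussian mixture (BayesianGaussianMixture) stands in here to illustrate how superfluous components are switched off automatically; the data are random placeholders for real emotion features, and the thesis model additionally uses heavy-tailed t components for outlier robustness.

```python
# Sketch: automatic model-order selection, a stand-in for the iSMM idea.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Toy "emotion feature" data: two clusters plus a few outliers
# (hypothetical stand-in for DES / EMO-DB acoustic features).
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 10)),
    rng.normal(5.0, 1.0, size=(200, 10)),
    rng.normal(0.0, 8.0, size=(10, 10)),   # outliers
])

# Truncated Dirichlet-process mixture: starts with many components and
# lets the variational posterior switch off the superfluous ones.
dpgmm = BayesianGaussianMixture(
    n_components=20,
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full",
    max_iter=500,
    random_state=0,
).fit(X)

# Components keeping non-negligible weight approximate the "best" model order.
print("effective components:", int(np.sum(dpgmm.weights_ > 0.01)))
```

Similarly, a hedged sketch of the pipeline shape in contribution 2: the kernelized feature-line-centroid (Kernel-KNNFLC) refinement is not available off the shelf, so a plain KNN classifier stands in after the LDA projection, and the 120-dimensional global statistical features are simulated.

```python
# Sketch: LDA projection followed by a nearest-neighbour rule
# (plain KNN as a stand-in for Kernel-KNNFLC).
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Simulated stand-in for 120-D global statistical features, 4 emotion classes.
X, y = make_classification(n_samples=300, n_features=120, n_informative=20,
                           n_classes=4, random_state=0)

clf = make_pipeline(
    StandardScaler(),
    LinearDiscriminantAnalysis(),       # projects to at most n_classes - 1 dims
    KNeighborsClassifier(n_neighbors=5),
)
print("mean CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```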
3. To improve and expand the feature categories under cross-database conditions, an auditory selective-attention model is proposed that simulates the hearing characteristics of the human ear and can effectively detect changes of emotional features in the spectrogram. Meanwhile, the Chirplet transform is adopted to exploit its advantage of matching frequency features to the signal and to extract emotional information in the time domain. In cross-corpus SER there may be mismatches between the trained acoustic models and the test utterances; phonetically, these stem from the noise conditions, speaking styles, and speaker traits that can differ across corpora, and they cause a drastic degradation in recognition performance. The auditory attention model proves very effective for detecting variational emotion features in this work: the selective-attention mechanism extracts salient gist features that relate to the expected performance in cross-corpus testing. The experimental results show that, with a prototypical classifier, the proposed feature extraction approach improves cross-corpus recognition accuracy by 9% and is insensitive to the choice of database.

4. Based on Deep Belief Nets (DBN) from the field of deep learning, a feature-level fusion method for cross-corpus SER is proposed. Following the preceding feature-abstraction research, the emotional traits hidden in the speech spectrogram are captured as image features and fused with the traditional emotion features. Multi-scale feature fusion is a current technical difficulty in cross-corpus SER. First, based on spectrogram analysis with the STB/Itti model, new spectrogram features are extracted from the color, brightness, and orientation channels, respectively; then the DBN21 and DBN22 networks fuse the traditional and spectrogram features, which enlarges the feature subset and strengthens its ability to characterize emotion. Experiments on the ABC database and a Chinese corpus show that, compared with traditional speech emotion features, the new feature subset yields an obvious improvement in cross-corpus recognition.
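The core mechanism of contribution 4, feature-level fusion, can be sketched as follows. This is a minimal illustration under stated assumptions: Deep Belief Nets are not in scikit-learn, so an MLP stands in for the DBN stack, and all feature matrices are random placeholders for the real spectrogram and acoustic extractors.

```python
# Sketch: feature-level fusion by concatenation, then a single classifier
# (MLP as a stand-in for the DBN21/DBN22 fusion networks).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n = 400
acoustic = rng.normal(size=(n, 120))     # e.g. 120-D global statistical features
spectrogram = rng.normal(size=(n, 64))   # e.g. Itti-style color/brightness/orientation gists
y = rng.integers(0, 4, size=n)           # hypothetical 4-class emotion labels

fused = np.hstack([acoustic, spectrogram])   # feature-level fusion: one wider subset
clf = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=300,
                    random_state=0).fit(fused, y)
```

The design point is that fusion happens before classification, so the classifier sees one enlarged feature subset rather than two separate decision streams.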
5. The model adaptation problems caused by speaker and language variation in SER are studied. Building on the cross-database work of the previous sections, adaptive speech emotion recognition methods are investigated systematically, with concrete experimental analyses and comparisons. First, existing approaches to adaptive emotion recognition from speech signals are reviewed. Then, feature adaptation is studied further for the case of additive speaker-induced feature distortion, and the influence of speaker changes is modeled with two popular statistical approaches: the GMM and the Student's t-distribution. Adaptive schemes are then used to obtain feature functions, including spectrogram features, and a small amount of online data is used for their rapid optimization. Finally, the proposed approaches are verified on databases in four languages: German, English, Chinese, and Vietnamese. The experimental results demonstrate improved speaker adaptation, especially when a large number of unknown speakers are present, and show that the features can be optimized quickly with only a few online data. The influence of different languages on the emotional features is also discussed.
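As a final illustration of the feature-adaptation setting in contribution 5, here is a minimal sketch of one common baseline, per-speaker feature normalization, which removes additive and scale shifts before recognition. This is not the thesis's GMM / Student's-t adaptation itself; the function and data below are hypothetical, and the idea is that a few "online" utterances per unknown speaker suffice to estimate the statistics.

```python
# Sketch: feature-space speaker adaptation via per-speaker z-normalization.
import numpy as np

def speaker_normalise(X: np.ndarray, speaker_ids: np.ndarray) -> np.ndarray:
    """Standardise each feature dimension within each speaker's data."""
    X_adapted = np.empty_like(X, dtype=float)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        mu = X[mask].mean(axis=0)
        sigma = X[mask].std(axis=0) + 1e-8   # guard against zero variance
        X_adapted[mask] = (X[mask] - mu) / sigma
    return X_adapted

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 120))              # placeholder emotion features
speakers = rng.integers(0, 5, size=100)      # placeholder speaker labels
X_adapted = speaker_normalise(X, speakers)
```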
Keywords/Search Tags: speech emotion, cross-corpus, Student's t-distribution, spectrogram feature, selective attention mechanism, deep belief nets, feature adaptation