
Research On Feature Extraction And Classification Of Speech Emotion Recognition

Posted on: 2016-02-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y X Sun
Full Text: PDF
GTID: 1108330479493536
Subject: Computer application technology
Abstract/Summary:
With the rapid development of electronic technology, humans are no longer satisfied with human-computer interaction (HCI) through keyboards, mice, and switches. Touch-screen technology has developed rapidly and made HCI more convenient and diverse, yet we still hope that HCI can become more humane, intelligent, friendly, and vivid, which requires computers to possess thinking and perception similar to those of humans. Enabling the computer to understand human emotion is clearly an important step toward this goal. Speech is an important medium of human communication and the most basic way to convey information; moreover, speech sensors are mature and acquiring speech signals does not require the cooperation of the parties involved. Speech emotion recognition is therefore very important for HCI.

The goal of speech emotion recognition is to enable a computer to understand the emotion carried by a speech signal; a computer that understands human emotional thinking can offer more humane and more sophisticated functions. Speech emotion recognition is a typical pattern recognition problem that usually consists of three key steps: feature extraction, dimensionality reduction, and classification. This dissertation studies all three steps. The main contributions are:

(1) Proposing novel weighted spectral features based on local Hu moments (HuWSF) for speech emotion recognition. Features strongly influence recognition results, and Mel-frequency cepstral coefficients (MFCC) are the most commonly used features in speech emotion recognition. However, MFCC considers neither the relationship among neighboring Mel-filter coefficients within a frame nor the relationship among Mel-filter coefficients of neighboring frames, which can discard useful information in the spectrogram. This dissertation therefore presents weighted spectral features based on local Hu moments. The idea is motivated by the observation that spectrogram energy varies drastically for emotions such as anger and happiness, while it changes only slightly for emotions such as sadness and fear; this affects the local energy distribution of the spectrogram along both the time and frequency axes. To describe this local energy distribution, Hu moments are computed over local regions of the spectrogram: they measure how strongly energy is concentrated around the energy centroid of a local region and vary significantly with emotion type. Experiments on EmoDB, SAVEE, and CASIA validate the effectiveness of the proposed features for speech emotion recognition.
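To make the idea concrete, here is a minimal sketch of computing Hu-moment descriptors over local regions of a spectrogram, written in Python with NumPy. It is an illustration under assumptions, not the dissertation's implementation: only the first two Hu invariants are computed, the block size is arbitrary, and the weighting scheme that gives HuWSF its name is omitted.

```python
import numpy as np

def hu_moments(patch):
    # First two Hu invariant moments of a 2-D spectrogram patch.
    h, w = patch.shape
    y, x = np.mgrid[:h, :w].astype(float)
    m00 = patch.sum() + 1e-12                  # total energy of the patch
    xc = (x * patch).sum() / m00               # energy centroid (x)
    yc = (y * patch).sum() / m00               # energy centroid (y)

    def eta(p, q):                             # normalized central moment
        mu = ((x - xc) ** p * (y - yc) ** q * patch).sum()
        return mu / m00 ** (1.0 + (p + q) / 2.0)

    h1 = eta(2, 0) + eta(0, 2)                 # spread of energy around the centroid
    h2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4.0 * eta(1, 1) ** 2
    return np.array([h1, h2])

def local_hu_features(spectrogram, block=(8, 8)):
    # Tile the (freq x time) spectrogram and stack per-block Hu moments.
    bh, bw = block
    feats = []
    for i in range(0, spectrogram.shape[0] - bh + 1, bh):
        for j in range(0, spectrogram.shape[1] - bw + 1, bw):
            feats.append(hu_moments(spectrogram[i:i + bh, j:j + bw]))
    return np.concatenate(feats)
```

High-arousal emotions such as anger concentrate energy within each block differently from low-arousal emotions such as sadness, which is what these per-block invariants are meant to capture.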
(2) Presenting a speech emotion recognition method that uses semi-supervised feature selection with speaker normalization. Feature selection is the most widely used dimensionality reduction approach in speech emotion recognition, but most existing methods neither preserve the manifold structure of the data nor exploit the information provided by unlabeled data, so they cannot select a good feature subset. This dissertation presents a semi-supervised feature selection method that preserves both the manifold structure and the category structure of the data while using the information provided by unlabeled samples. To further handle the fact that the manifold of speech data is influenced by factors such as emotion, speaker, and sentence, a new speaker normalization method is also proposed; it achieves good normalization even when only a small number of samples per speaker is available, so it can be used in most real applications of speech emotion recognition. Experiments on EmoDB, SAVEE, and CASIA validate the effectiveness of the proposed semi-supervised feature selection with speaker normalization.
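As a generic baseline that conveys the idea of speaker normalization, the sketch below z-scores each feature dimension per speaker, which already removes much speaker-dependent bias. It is an assumed stand-in, not the dissertation's proposed method, which is specifically designed to remain reliable with few samples per speaker.

```python
import numpy as np

def per_speaker_zscore(X, speaker_ids):
    # X: (n_samples, n_features) feature matrix; speaker_ids: (n_samples,) IDs.
    # Normalize each speaker's features to zero mean and unit variance.
    Xn = np.empty_like(X, dtype=float)
    for s in np.unique(speaker_ids):
        idx = speaker_ids == s
        mu = X[idx].mean(axis=0)
        sd = X[idx].std(axis=0) + 1e-8         # guard against constant features
        Xn[idx] = (X[idx] - mu) / sd
    return Xn
```

With very few samples per speaker, these per-speaker mean and variance estimates become unreliable; that is exactly the case the dissertation's normalization method targets.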
(3) Presenting an ensemble Softmax regression model for speech emotion recognition (ESSER). Many speech emotion recognition methods have been proposed, among which ensemble learning is an effective one; however, existing ensembles still face the curse of dimensionality and can hardly guarantee the diversity of their base classifiers. To overcome these problems, ESSER applies feature extraction methods with very different principles to generate the subspaces for the base classifiers, so that their diversity is ensured and the curse of dimensionality is partly avoided. Once the diversity of the base classifiers is ensured, the performance of the ensemble depends largely on the ability of each base classifier, so ESSER adopts Softmax as the base classifier: Softmax has shown its superiority in speech emotion recognition, and it outputs the probability that a test sample belongs to each class, which lets the ensemble make full use of this uncertainty information (a minimal sketch of such an ensemble appears after contribution (4) below). Experiments validate the proposed approach in terms of speech emotion recognition performance.

(4) Proposing a new speech emotion recognition method based on manifold learning and sparse representation classification. Many sparse-representation-based methods have been proposed, but they either use the raw data directly as the dictionary or train a dictionary for each class independently, so the class labels of the training data are not fully exploited. To overcome this defect, a new dictionary learning method based on dimensionality reduction is proposed: the reduced representation carries classification information and has a much lower dimension than the raw data, so the dictionary becomes more discriminative and sparse representation classification runs faster. However, the features extracted from speech signals change with many factors, such as speaker, speaking style, and content, so the distances between some samples within a class can be very large; minimizing these distances would heavily distort the optimization objective of existing dimensionality reduction methods, which therefore cannot be used directly for dictionary learning. To address this, we propose a new supervised dimensionality reduction method that considers the locality of the data when computing the within-class matrix, the between-class matrix, and the data manifold. Furthermore, to highlight the manifold of speech emotion, self-tuning point-to-point distances are used to represent the relationships among samples. Finally, to fully exploit the information extracted by the new supervised dimensionality reduction method, a new weighted sparse representation classification is presented, whose coefficients are weighted by the same self-tuning point-to-point distances. Experiments on EmoDB, SAVEE, and CASIA validate the proposed method.
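The following is a minimal sketch of an ESSER-style ensemble (see contribution (3)), using scikit-learn: each base Softmax classifier (multinomial logistic regression) is trained on a subspace produced by a feature extraction method with a different principle, and the predicted class probabilities are averaged. The particular projections chosen here (PCA, LDA, random projection) are illustrative assumptions, not necessarily the dissertation's choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.random_projection import GaussianRandomProjection
from sklearn.linear_model import LogisticRegression

class SoftmaxEnsemble:
    # Each base classifier sees a subspace built on a different principle,
    # which keeps the base classifiers diverse and the dimension low.
    def __init__(self, n_components=30):
        self.projections = [
            PCA(n_components=n_components),
            LinearDiscriminantAnalysis(),
            GaussianRandomProjection(n_components=n_components, random_state=0),
        ]
        self.classifiers = []

    def fit(self, X, y):
        self.classifiers = []
        for proj in self.projections:
            Z = proj.fit_transform(X, y)       # LDA uses y; PCA/RP ignore it
            clf = LogisticRegression(max_iter=1000)   # multinomial (Softmax)
            self.classifiers.append(clf.fit(Z, y))
        return self

    def predict(self, X):
        # Average the per-class probabilities, then take the most likely class.
        probas = [clf.predict_proba(proj.transform(X))
                  for proj, clf in zip(self.projections, self.classifiers)]
        return self.classifiers[0].classes_[np.mean(probas, axis=0).argmax(axis=1)]
```

For contribution (4), the sketch below shows plain sparse representation classification: a test vector is coded over a dictionary of training atoms by an l1-regularized least-squares problem, and the class whose atoms give the smallest reconstruction residual wins. The self-tuning distance weights and the learned, dimension-reduced dictionary of the dissertation are omitted; here the dictionary is simply the raw training data.

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(x, D, labels, alpha=0.01):
    # D: (n_dims, n_atoms) dictionary whose columns are training samples;
    # labels: (n_atoms,) class of each atom; x: (n_dims,) test vector.
    lasso = Lasso(alpha=alpha, max_iter=5000)
    lasso.fit(D, x)                            # min ||x - D c||^2 + alpha * ||c||_1
    c = lasso.coef_
    classes = np.unique(labels)
    residuals = [np.linalg.norm(x - D[:, labels == k] @ c[labels == k])
                 for k in classes]
    return classes[int(np.argmin(residuals))]
```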
Keywords/Search Tags: Speech emotion recognition, Spectral feature, Feature selection, Softmax, Ensemble classifier, Sparse representation classification