
Research On Varying And Clustering Based Emotion Robust Speaker Recognition

Posted on: 2009-04-20    Degree: Doctor    Type: Dissertation
Country: China    Candidate: D D Li    Full Text: PDF
GTID: 1118360242483027    Subject: Computer Science and Technology
Abstract/Summary:
Speaker recognition (SR) is the process of automatically identifying a speaker from the individual information contained in speech signals. As a form of biometrics it is regarded as one of the most natural and has broad prospects for application.

One of the greatest challenges for traditional speaker recognition applications is coping with transient changes in the speaker's state. Different emotional states affect a speaker's speech production mechanism in different ways and lead to acoustic changes in his or her voice. These changes are a major source of error in speaker recognition applications. Building on a study of current advances in emotional speech processing and recent methods for improving the recognition of emotionally stressed speakers, this thesis addresses the task of speaker recognition across emotional expression and presents an efficient solution together with a framework for a speaker recognition system that handles expressive speech. The main contributions of the work are as follows:

1. A thorough survey of the effect of expressive speech on speaker recognition. We examine the influence of text dependence, text affectivity, the amount of expressive data, modeling algorithms, scoring schemes, the universal background model, and related factors. We summarize the current challenges, propose guidelines for constructing an emotion-insensitive speaker recognition system, and outline the framework of a varying and clustering based emotion robust speaker recognition model.

2. A rule-based feature modification approach. Based on an analysis of expressive speech, the proposed method learns prosodic feature transformation rules from a small number of content-matched source-target pairs. Features carrying emotion information are derived from the readily available neutral features by applying these modification rules, and the converted features are trained together with the neutral features to build the speaker models. We also study the effect of each parameter on transferring emotional information in speech utterances for affective speaker authentication. Our results show that the combined prosodic modifications successfully add new emotional coloring to neutral speech.

3. A feature transformation method based on model parameter shifts. All models are derived from a common background model, which induces a correspondence between Gaussian components across models. The system examines how model parameters shift between the background model and the emotion-dependent models and applies this transformation to map features from a common emotion-independent feature space into a specific emotional domain. Different emotional variants of the generated features are then used to build the speaker models. The gain in performance demonstrates the promise of the proposed approach.

4. An expressive speech cluster-modeling algorithm. Different kinds of affective speech exhibit different characteristics in various vocal features, which shifts the spatial distribution of the speaker models. Based on the hypothesis that the affective state of speech is characterized by different intonation, stress, or rhythm patterns produced by changes in F0 and intensity, the proposed approach clusters together speech that shares the same trend of prosodic transformation and builds multiple models of the clustered emotional speech for a given speaker, so that training and testing utterances are better matched, as sketched below.
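The cluster-modeling idea of contribution 4 can be illustrated with a minimal sketch. The prosodic trend descriptors (relative shifts in mean F0, F0 range, and intensity), the use of K-means for clustering, and all parameter values below are illustrative assumptions, not the thesis's actual implementation.

```python
# Minimal sketch of cluster-based multi-modeling of expressive speech,
# assuming per-utterance MFCC frames and precomputed prosodic statistics.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def prosodic_trend(utt, neutral_ref):
    """Describe an utterance by how its prosody deviates from the speaker's
    neutral reference: relative shifts in mean F0, F0 range and intensity
    (all assumed to be precomputed scalars)."""
    return np.array([
        (utt["f0_mean"] - neutral_ref["f0_mean"]) / neutral_ref["f0_mean"],
        (utt["f0_range"] - neutral_ref["f0_range"]) / neutral_ref["f0_range"],
        (utt["intensity"] - neutral_ref["intensity"]) / neutral_ref["intensity"],
    ])

def train_cluster_models(utterances, neutral_ref, n_clusters=3, n_mix=64):
    """Cluster utterances that share the same prosodic-transformation trend
    and train one GMM per cluster for this speaker."""
    trends = np.stack([prosodic_trend(u, neutral_ref) for u in utterances])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(trends)

    models = {}
    for c in range(n_clusters):
        frames = np.vstack([u["mfcc"] for u, l in zip(utterances, labels) if l == c])
        models[c] = GaussianMixture(n_components=n_mix, covariance_type="diag").fit(frames)
    return models

def score_utterance(models, mfcc):
    """At test time, score against every cluster model and keep the best
    match, so the test utterance is compared to the closest training cluster."""
    return max(m.score(mfcc) for m in models.values())
```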
5. A frame-level score reweighted normalization. When a speaker is misclassified, it is usually not because a non-target speaker scores well, but because the true speaker's model scores poorly. We develop and evaluate a frame likelihood transformation method that reweights each frame likelihood with the probability density function of the target model's rank. The proposed method strengthens the contribution of frames that score the target speaker higher than the impostors and optimizes the final likelihood accumulated over the whole test utterance. This improves on traditional utterance-level likelihood normalization in that it allows likelihood normalization to be applied successfully to the speaker identification task; a sketch of the idea follows.
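As an illustration of contribution 5, the following is a minimal sketch of rank-based frame reweighting, assuming GMM speaker models with a per-frame score_samples method (as in scikit-learn's GaussianMixture). How the rank density is estimated and turned into a per-frame weight is an assumption made for illustration; the abstract only states that frame likelihoods are reweighted with the probability density function of the target model's ranks.

```python
# Minimal sketch of frame-level, rank-based likelihood reweighting.
import numpy as np

def frame_loglikes(models, frames):
    """Per-frame log-likelihood of every model; shape (n_models, n_frames)."""
    return np.stack([m.score_samples(frames) for m in models])

def rank_pdf(loglikes, target_idx, n_models):
    """Empirical pdf of the target model's rank (0 = best) over the frames,
    e.g. estimated on development data for the claimed speaker."""
    ranks = (loglikes > loglikes[target_idx]).sum(axis=0)   # rank per frame
    pdf = np.bincount(ranks, minlength=n_models).astype(float)
    return pdf / pdf.sum()

def reweighted_score(models, target_idx, frames, pdf):
    """Accumulate the target model's frame log-likelihoods, reweighting each
    frame by the estimated density of its target rank, so reliable frames
    contribute more to the utterance-level score."""
    ll = frame_loglikes(models, frames)
    ranks = (ll > ll[target_idx]).sum(axis=0)
    weights = pdf[ranks]                       # weight of each frame's rank
    return float(np.sum(weights * ll[target_idx]))
```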
Keywords/Search Tags: Speaker Recognition, Expressive Speech, Emotional Cluster, Prosodic Features