Research On Feature Extraction And Model Algorithm For Speaker Recognition

Posted on: 2018-05-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Zhang
Full Text: PDF
GTID: 1318330542455001
Subject: Control Science and Engineering
Abstract/Summary:
With the advent of the cloud computing and big data era, the demand for multi-modal, natural human-computer interaction in mobile environments, including voice interaction for service robots, dialogue systems and security authentication systems, places higher performance requirements on current speaker recognition technology. Future data analysis will pay more attention to user behavior, and the granularity of analysis will become finer. Speech is the most effective and natural biometric for communication, and with the rich variety of terminal storage and processing devices, its collection and preservation are becoming more and more convenient. Intelligent speech interaction and security verification are therefore attracting increasing attention. Speaker recognition is defined as the use of a machine to recognize persons from their voice. This dissertation first summarizes the research significance and development of speaker recognition and presents the current research hotspots and open problems. Building on previous research, it then focuses on voice activity detection (VAD), discriminative feature extraction and recognition models. The main contributions are as follows:

1. A hierarchical framework for robust VAD and speech enhancement was proposed, composed of three blocks: the speech enhancement block, the feature extraction and voting block, and the training/classification block. A modified Wiener filter was used for noise reduction and performed better than the ordinary Wiener filter in all tested noisy conditions. Several discriminative features and a well-trained SVM were then employed in a voting paradigm to identify speech and non-speech segments. Finally, the proposed method was objectively evaluated in four kinds of noise at various signal-to-noise ratios (SNRs) and compared with several other VAD techniques to confirm its effectiveness. Theoretical analysis and experimental results showed that the proposed algorithm performs well under a variety of noisy conditions. Across the different noise types, the speech hit rate can reach 97.8% at an SNR of 20 dB.

2. Two discriminative feature extraction methods were proposed. First, a new discriminative feature based on non-uniform filters was proposed. The relationship between frequency components and individual speaker characteristics was analyzed and quantified using the F-ratio. The new feature was extracted by non-uniform sub-band filters designed according to adaptive frequency warping in different frequency bands. Experimental results demonstrated that the proposed feature led to a noticeable improvement in speaker recognition rate and contained more discriminative information. Second, an extraction method based on single-syllable words was proposed. It finds multiple representative frames of a single word in an utterance and combines their feature vectors into a word feature vector, which takes the syllable structure and pronunciation characteristics of Chinese into account. This word feature was then used for codebook calculation and model matching in the training and recognition processes. Simulation experiments indicated that this method can describe the continuity between adjacent speech frames and achieve good performance, reflecting typical characteristics while avoiding the disturbance of transition information.

3. A novel method combining ensemble learning with k-nearest neighbor (kNN) was proposed for speaker recognition. Its advantages over conventional methods are simplicity and good generalization ability. Experimental analysis showed that the proposed BagWithProb scheme works more effectively than other schemes, and that the annular-regions stratified sampling algorithm yields a large speedup with little loss in accuracy: a speedup factor of 5.1 to 5.5 was obtained with limited impact on the recognition rate. The BagWithProb scheme had an optimizing property and reached the best average performance. With 15 s of training data, the average frame recognition rate based on the BagWithProb scheme can reach 94.1%. Because the final recognition result depends on the judgement of all frames, errors in individual frames have little effect on the overall classification.

4. Two approaches to deep belief network (DBN)-based speaker recognition were considered. In the first approach, the DBN was used for feature modeling: in the training stage, conventional spectrum features were fed into the DBN, and in the testing stage, the trained DBN was used as the classifier for the recognition task. Experimental results showed that, across different types of test corpus, with a 5 s test corpus and 4 hidden layers, the average recognition rate can reach 97.13%, which demonstrated that DBN-based speaker recognition is feasible and effective. In the second approach, the DBN was used in the feature extraction stage. Bottleneck features were extracted directly from the DBN architecture with conventional spectrum features as input and were then used as input to a traditional recognition model. With a 6 s test corpus, the Bottleneck + GMM algorithm achieved an average recognition rate of 99.04%, which suggested that the Bottleneck algorithm can achieve very good performance on clean speech. When the length of the training corpus was limited, the average recognition rate of Bottleneck + GMM was 5.35% higher than that of MFCC + GMM. Theoretical analysis and experimental results verified that combining Bottleneck features with a traditional model performed consistently better than using MFCC features, which indicated that Bottleneck features can effectively capture speaker characteristics even though they lack the clear physical interpretation of the auditory MFCC feature.

Overall, the theoretical analysis and experimental results suggested that deep belief networks are a promising and efficient way to significantly improve speaker recognition systems.
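
The F-ratio analysis mentioned in item 2 can be illustrated with a small sketch. The abstract does not give the dissertation's exact formulation, so the code below assumes the commonly used definition: for each frequency band, the variance of the per-speaker means divided by the average within-speaker variance. The function name `f_ratio`, the input layout and the toy data are hypothetical and chosen only for illustration.

```python
from statistics import mean, pvariance


def f_ratio(frames_by_speaker):
    """Per-band F-ratio under the assumed textbook definition.

    frames_by_speaker maps a speaker label to a list of frames, where each
    frame is a list of per-band values (for example sub-band energies).
    Returns one F-ratio per band: between-speaker variance of the speaker
    means divided by the mean within-speaker variance.
    """
    speakers = list(frames_by_speaker)
    n_bands = len(frames_by_speaker[speakers[0]][0])
    ratios = []
    for band in range(n_bands):
        # Collect this band's values separately for every speaker.
        per_speaker = [
            [frame[band] for frame in frames_by_speaker[s]] for s in speakers
        ]
        speaker_means = [mean(values) for values in per_speaker]
        between = pvariance(speaker_means)  # spread of the speaker means
        within = mean([pvariance(values) for values in per_speaker])
        ratios.append(between / within if within > 0 else float("inf"))
    return ratios


# Toy data (invented): band 0 separates the two speakers, band 1 does not.
demo = {
    "speaker_a": [[1.0, 2.0], [1.2, 1.0], [0.8, 3.0]],
    "speaker_b": [[3.0, 2.1], [3.2, 0.9], [2.8, 3.0]],
}

if __name__ == "__main__":
    for band, value in enumerate(f_ratio(demo)):
        print(f"band {band}: F-ratio = {value:.3f}")
```

Under this assumed definition, a band with a high F-ratio varies more between speakers than within a speaker, which is the kind of dependency a non-uniform filter bank could exploit by allocating finer resolution to such bands.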
Keywords/Search Tags: Speaker Recognition, Speaker Identification, Voice Activity Detection, Gaussian Mixture Model, Discriminative Feature Extraction, Robust, Ensemble Strategy, Deep Belief Network, Bottleneck Feature