
Research On Speaker Recognition Based On Sparse Representation Of Frame Level And Segment Level

Posted on: 2017-01-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Wang
GTID: 1318330536480976
Subject: Computer Science and Technology
Abstract/Summary:
Great progress has been made in speaker recognition, yet a gap remains between theory and practical application, so improving recognition performance is still an important research topic. The recognition algorithms used in computer auditory systems cannot match the human auditory system, for which speaker recognition is a simple task. It is therefore valuable to understand the mechanisms underlying the human ear and to build automatic auditory algorithms on that knowledge. In signal processing, sparse representation mimics auditory properties of the ear and is robust to noise. Accordingly, building on sparse representation models, which have attracted much attention in signal processing and neuroscience, and starting from the construction of the sparse representation dictionary, this thesis studied short-term and long-term signal features to construct frame-level and segment-level dictionaries, respectively, together with speaker recognition methods based on these two kinds of dictionaries. The main work of this thesis is as follows.

First, after the speech signal was decomposed over the frame-level dictionary, the sparse representation coefficient vector of each frame was quantized, and the quantized coefficients, which correspond to atomic activity, were used to reflect the response of auditory neurons to the input speech. The atomic activity of a sequence of speech frames over time was then taken as a whole to express the speaker's information. The kurtosis, a higher-order statistic, was used to describe the distribution of atomic activity and served as the speaker recognition model. This avoids a limitation of the Gaussian Mixture Model (GMM), which cannot effectively describe the distribution of a speaker's atomic activity when too little data have been collected from that speaker.

Second, to extract speaker features that are robust under noisy conditions, discriminative dictionary learning was used to combine the learned frame-level speech dictionary with a noise dictionary into a joint dictionary. After the features of each frame of a noisy utterance were sparsely decomposed over the joint dictionary, the coefficients on the noise dictionary were discarded and the coefficients on the speech dictionary were kept as that frame's feature. Max pooling and average pooling were then applied to the sparse representation coefficient vectors of the frame sequence to obtain global speaker features, on which recognition was performed.

Third, a discriminative learning criterion was introduced into the construction of the segment-level dictionary to increase the differences between the dictionaries of different classes in multi-class speaker recognition, and thereby to improve recognition performance. Dictionary learning took two aspects into account. On the one hand, the sub-dictionary of each class was required to represent that class's training samples well, so that the reconstruction error was minimized. On the other hand, a Fisher discrimination criterion was imposed on the sparse representation coefficients so that their within-class scatter was minimized and their between-class scatter maximized, yielding a more discriminative dictionary. In addition, the mean of the sparse representation coefficients of the training samples was retained as prior information. In the decision stage, these means and the reconstruction error together served as the basis for recognition.

Fourth, when speech is sparsely decomposed over the dictionary, the sparse representation coefficients exhibit a contiguous block (group) structure, and coefficients within the same block are correlated. The Block Sparse Bayesian algorithm was therefore used for the sparse decomposition, making full use of the information from multiple speaker classes and of the prior correlations among the coefficients, which improved speaker recognition performance.
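A minimal sketch of the first work item (atomic activity summarized by kurtosis), assuming a sparse code per frame is already available. The abstract states neither the quantization rule nor exactly how kurtosis forms the model, so the reading below is an assumption: coefficient magnitudes are uniformly quantized with a hypothetical step `step`, each atom's quantized activity over the frame sequence is summarized by its sample kurtosis, and the resulting per-atom vector is the speaker model. The function names (`quantize`, `kurtosis`, `atomic_activity_model`, `nearest_speaker`) and the Euclidean scoring are illustrative, not taken from the thesis.

```python
import math
from typing import Dict, List, Sequence


def quantize(coeffs: Sequence[float], step: float = 0.5) -> List[int]:
    """Map coefficient magnitudes to integer activity levels.

    `step` is a hypothetical quantization step, not a value from the thesis.
    """
    return [int(abs(c) // step) for c in coeffs]


def kurtosis(values: Sequence[float]) -> float:
    """Sample kurtosis m4 / m2**2 (Pearson form, not excess kurtosis)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0.0:
        # Assumption: an atom whose activity never changes is mapped to 0.0.
        return 0.0
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2)


def atomic_activity_model(frames: Sequence[Sequence[float]],
                          step: float = 0.5) -> List[float]:
    """Per-atom kurtosis of quantized activity over a frame sequence.

    frames[t][k] is the sparse coefficient of atom k in frame t.
    """
    activity = [quantize(frame, step) for frame in frames]  # frames x atoms
    n_atoms = len(frames[0])
    return [kurtosis([row[k] for row in activity]) for k in range(n_atoms)]


def nearest_speaker(models: Dict[str, List[float]], probe: List[float]) -> str:
    """Hypothetical scoring: the enrolled model closest in Euclidean distance."""
    return min(models, key=lambda name: math.dist(models[name], probe))
```

For example, an atom that fires strongly in one frame out of four gets kurtosis 7/3, while an atom that never fires is mapped to 0.0 by the convention in `kurtosis`.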
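For the second work item, the sketch below codes each noisy frame over a joint dictionary, keeps only the speech-dictionary coefficients, and pools them over the utterance. The abstract names neither the sparse solver nor whether pooling acts on signed coefficients or magnitudes; this sketch assumes plain matching pursuit with unit-norm atoms and pools magnitudes. The toy dictionaries stand in for learned ones, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def matching_pursuit(x: Sequence[float], atoms: Sequence[Sequence[float]],
                     n_iter: int) -> Vector:
    """Greedy matching pursuit; every atom is assumed to have unit L2 norm."""
    residual = list(x)
    coef = [0.0] * len(atoms)
    for _ in range(n_iter):
        # Inner product of the current residual with every atom.
        corr = [sum(a * r for a, r in zip(atom, residual)) for atom in atoms]
        best = max(range(len(atoms)), key=lambda j: abs(corr[j]))
        if abs(corr[best]) < 1e-12:
            break  # nothing left that the dictionary can explain
        coef[best] += corr[best]
        residual = [r - corr[best] * a for r, a in zip(residual, atoms[best])]
    return coef


def speech_features(frames: Sequence[Sequence[float]],
                    speech_atoms: Sequence[Sequence[float]],
                    noise_atoms: Sequence[Sequence[float]],
                    n_iter: int = 4) -> List[Vector]:
    """Code each frame over the joint dictionary [speech | noise]; keep the speech part."""
    joint = list(speech_atoms) + list(noise_atoms)
    n_speech = len(speech_atoms)
    return [matching_pursuit(f, joint, n_iter)[:n_speech] for f in frames]


def pool(features: Sequence[Sequence[float]]) -> Tuple[Vector, Vector]:
    """Max pooling and average pooling of coefficient magnitudes over all frames."""
    dims = range(len(features[0]))
    max_pool = [max(abs(f[k]) for f in features) for k in dims]
    avg_pool = [sum(abs(f[k]) for f in features) / len(features) for k in dims]
    return max_pool, avg_pool
```

Energy that the coder attributes to noise atoms is simply dropped; when speech and noise atoms are strongly correlated, some leakage in either direction is still possible.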
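The Fisher discrimination part of the third work item can be illustrated by the quantity it penalizes. The abstract gives no formula, so this sketch assumes the trace form tr(S_W) − tr(S_B) used in Fisher discrimination dictionary learning (FDDL), omitting that method's extra regularization term, and evaluates it on fixed sparse codes. Learning the dictionary itself (alternating updates of codes and sub-dictionaries) is not shown, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence

Vector = List[float]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vec(vectors: Sequence[Sequence[float]]) -> Vector:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def class_means(codes: Sequence[Vector], labels: Sequence[str]) -> Dict[str, Vector]:
    """Mean sparse code per class (assumed to be the retained prior information)."""
    groups: Dict[str, List[Vector]] = defaultdict(list)
    for code, label in zip(codes, labels):
        groups[label].append(code)
    return {label: mean_vec(group) for label, group in groups.items()}


def fisher_term(codes: Sequence[Vector], labels: Sequence[str]) -> float:
    """tr(S_W) - tr(S_B) of the sparse codes; smaller means more discriminative."""
    means = class_means(codes, labels)
    overall = mean_vec(codes)
    within = sum(sq_dist(c, means[y]) for c, y in zip(codes, labels))
    counts = Counter(labels)
    between = sum(counts[y] * sq_dist(m, overall) for y, m in means.items())
    return within - between
```

Codes that cluster tightly by speaker give a strongly negative value, while the same codes with mismatched labels give a positive one, which is why minimizing this term pushes the learned dictionary toward discriminative codes.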
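The decision stage of the third work item combines reconstruction error with the stored mean codes. A minimal sketch, assuming the FDDL-style score e_i = ||y − D_i x_i||² + w·||x − m_i||², where x is the sparse code of the test sample y over all class sub-dictionaries, x_i is its part on class i's sub-dictionary D_i, m_i is the stored mean code of class i, and w is a hypothetical weight. How the thesis actually combines the two terms is not stated in the abstract; `classify` and `reconstruct` are illustrative names.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def reconstruct(atoms: Sequence[Sequence[float]], coef: Sequence[float]) -> Vector:
    """Linear combination sum_k coef[k] * atoms[k]."""
    dim = len(atoms[0])
    return [sum(c * atom[d] for c, atom in zip(coef, atoms)) for d in range(dim)]


def classify(y: Vector, code: Vector,
             class_atoms: Dict[str, List[Vector]],
             class_mean_codes: Dict[str, Vector],
             weight: float = 0.5) -> Tuple[str, Dict[str, float]]:
    """Pick the class with the smallest reconstruction error plus mean-code distance.

    `code` is the sparse code of `y` over all class sub-dictionaries concatenated
    in the iteration order of `class_atoms`; `weight` is a hypothetical trade-off.
    """
    scores: Dict[str, float] = {}
    start = 0
    for label, atoms in class_atoms.items():
        part = code[start:start + len(atoms)]  # coefficients on this class's atoms
        start += len(atoms)
        recon_err = sq_dist(y, reconstruct(atoms, part))
        prior_err = sq_dist(code, class_mean_codes[label])
        scores[label] = recon_err + weight * prior_err
    return min(scores, key=scores.get), scores
```

Using both terms means a test sample must be explained well by one speaker's sub-dictionary and also have a code resembling that speaker's typical code.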
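The fourth work item relies on Block Sparse Bayesian learning, which infers block-sparse coefficients together with the correlation inside each block; a faithful implementation does not fit in a short sketch. The block below illustrates only the block-sparsity idea, using a different and simpler algorithm, block matching pursuit, in which each step activates one whole block of contiguous atoms. It does not model intra-block correlation and is not BSBL. The block layout, the assumption that atoms within a block are orthonormal, and the function names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def block_matching_pursuit(x: Sequence[float],
                           blocks: Sequence[Sequence[Sequence[float]]],
                           n_iter: int) -> List[Vector]:
    """Greedy block selection: each step activates one whole block of atoms.

    blocks[g] is a list of atoms; atoms inside one block are assumed orthonormal,
    so D_g D_g^T r is the projection of the residual r onto block g.
    """
    residual = list(x)
    coef = [[0.0] * len(block) for block in blocks]
    for _ in range(n_iter):
        proj = [[dot(atom, residual) for atom in block] for block in blocks]
        energy = [sum(p * p for p in block_proj) for block_proj in proj]
        g = max(range(len(blocks)), key=energy.__getitem__)
        if energy[g] < 1e-24:
            break  # residual is orthogonal to every block
        for k, atom in enumerate(blocks[g]):
            coef[g][k] += proj[g][k]
            residual = [r - proj[g][k] * a for r, a in zip(residual, atom)]
    return coef
```

For instance, for x = (3, 4, 0.1, 0) with blocks {e0, e1} and {e2, e3} of standard basis vectors, one iteration returns both coefficients 3 and 4 of the first block, whereas one step of element-wise matching pursuit would keep only the 4.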
Keywords/Search Tags: speaker recognition, sparse representation, kurtosis statistics, Fisher discrimination, Block Sparse Bayesian