
Research on Speaker Recognition Based on Prosodic Features and GMM-UBM

Posted on: 2015-03-12
Degree: Master
Type: Thesis
Country: China
Candidate: Q Tong
Full Text: PDF
GTID: 2268330431950008
Subject: Circuits and Systems
Abstract/Summary:
Text-independent speaker recognition is an important research direction in speech signal processing. Because it is widely used for authentication and information retrieval in public security, the military, finance and other fields, speaker recognition technology has advanced steadily through the efforts of research institutions around the world. To track the state of the art in speech technology, the National Institute of Standards and Technology (NIST) began organizing speaker recognition evaluations (SRE) in 1996. The NIST evaluations are regarded as the highest-level benchmark in the field. They define multiple evaluation tasks and give every participant unified multi-channel telephone and radio speech recorded in a variety of environments, together with unified test rules and standards, so that speech technology methods can be studied under different environments and conditions. One of the NIST SRE tasks is speaker recognition on long utterances, which aims to study how the high-level information in the speech signal can be used for recognition.

In addition to Mel-Frequency Cepstral Coefficients (MFCC), high-level features of the voice are also used as characteristic parameters in speaker recognition. These features are often tied to the text content, so extracting text-independent high-level features from the speech signal is a focus of current research. This thesis studies in depth how to extract high-level prosodic features from the speech signal and combine them with the Gaussian Mixture Model (GMM). The main work of the thesis consists of three points:

(1) The thesis first introduces the common methods for extracting pitch from the speech signal: the Autocorrelation Function (ACF), the Circular Average Magnitude Difference Function (CAMDF), and the cepstrum method. Because the accuracy of pitch extraction has a great influence on system performance, the thesis puts forward an improved method based on the CAMDF and the cepstrum, and compares the four methods experimentally. In these experiments the improved method outperforms each of the other three in Root Mean Square Error (RMSE), accuracy of pitch extraction, and Gross Error Rate (GER).

(2) The thesis shows experimentally that prosodic features differ between speakers and, based on these differences, proposes a text-independent speaker verification method built on the super-segment prosodic feature and GMM-UBM-MAP. Experiments show that the equal error rate (EER) of the system based on the super-segment prosodic feature reaches 17.77%.

(3) The short-term feature parameters (MFCC) reflect the vocal tract characteristics of the speaker, while the super-segment prosodic feature is based on pitch and reflects the speaker's sound-source characteristics. Since the two describe the speaker from different angles, they are complementary, and fusing them can improve the performance of a speaker recognition system. The thesis puts forward a fusion method based on suspicion distance. The experimental results show that, compared with the common equal-weight addition method and the common empirically weighted linear fusion method, it gives a certain improvement in both the DET curve and the EER. A study of different positive and negative fusion ranges shows that selecting a suitable suspicion range, especially a suitable positive range, improves the performance of the system. Experiments show that the EER was reduced from 5.92% to 4.95%, a relative improvement of 16.39% over the main system.
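As a rough illustration of the simplest of the pitch extraction methods named in point (1), the autocorrelation function, the sketch below estimates the pitch of a single voiced frame from the strongest autocorrelation peak. It is not the improved CAMDF and cepstrum method of the thesis, whose details the abstract does not give. The function name `acf_pitch`, the 60 to 400 Hz search range, the 8 kHz sampling rate and the synthetic test frame are all assumptions made for the example.

```python
import numpy as np

def acf_pitch(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate the pitch (Hz) of one voiced frame from its autocorrelation peak.

    f_min and f_max bound the search range; they are illustrative defaults,
    not values taken from the thesis.
    """
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()  # remove any DC offset
    # Full autocorrelation; keep only the non-negative lags.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Convert the frequency range into a lag range (in samples).
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(acf) - 1)
    # The lag of the strongest peak in the range is taken as the pitch period.
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return sample_rate / best_lag

# Synthetic 40 ms "voiced" frame: a 200 Hz fundamental plus one harmonic.
sr = 8000
t = np.arange(int(0.04 * sr)) / sr
frame = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
pitch = acf_pitch(frame, sr)
print(f"estimated pitch: {pitch:.1f} Hz")
```

A practical extractor would also need a voiced/unvoiced decision and some smoothing across frames, which this sketch leaves out.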
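The equal error rate quoted above is the operating point at which the miss rate equals the false-alarm rate. The sketch below shows one simple way to estimate it from verification scores, and checks the arithmetic behind the reported relative improvement. The function names and the toy scores are assumptions for illustration only; the figures 5.92% and 4.95% are the only values taken from the abstract.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by sweeping the threshold over all observed scores."""
    target = np.asarray(target_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, impostor]))
    # Miss rate: fraction of target trials rejected (score below threshold).
    miss = np.array([(target < th).mean() for th in thresholds])
    # False-alarm rate: fraction of impostor trials accepted.
    false_alarm = np.array([(impostor >= th).mean() for th in thresholds])
    # Take the threshold where the two error rates are closest.
    i = int(np.argmin(np.abs(miss - false_alarm)))
    return (miss[i] + false_alarm[i]) / 2.0

def relative_reduction(baseline, improved):
    """Relative reduction of an error rate with respect to the baseline."""
    return (baseline - improved) / baseline

# Toy scores (hypothetical, not from the thesis).
eer = equal_error_rate([2.0, 1.5, 1.0, 0.2], [0.5, -0.3, -1.0, -1.2])
print(f"toy EER: {eer:.2%}")

# EER values reported in the abstract: 5.92% before fusion, 4.95% after.
rel = relative_reduction(5.92, 4.95)
print(f"relative EER reduction: {rel:.2%}")
```

The second calculation gives (5.92 - 4.95) / 5.92, which is about 16.39%, in agreement with the figure reported in the abstract.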
Keywords/Search Tags: super-segment prosodic feature, text-independent speaker recognition, pitch, system fusion, suspicion distance