Research On Key Techniques Of Speaker Recognition In Network

Posted on: 2012-05-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Jiang
GTID: 1118330362450150
Subject: Artificial Intelligence and information processing
Abstract/Summary:
The objective of speaker recognition is to have a computer identify a specific person by his or her voice. As one of the important biometric recognition technologies, speaker recognition is widely applied in identity authentication, human-computer interaction, public security, information security, financial services, and so on. With the rapid development of networks in the past few years, huge numbers of multimedia files containing useful speech information have appeared on the internet, where speaker recognition is urgently needed.

In traditional application environments, current speaker recognition systems perform well, but they suffer severe performance degradation on the internet. The main cause is the complexity of the internet environment: (1) Speech from more than one speaker is recorded in a single channel, together with non-speech audio, and the speech signal is usually compressed and stored in multimedia files. (2) There is a large amount of out-of-set data, which leads to many false alarms. (3) After being compressed and decompressed several times, a speech file may exist in several codec versions, and mistaken acceptances may increase when training and testing use different audio codecs. (4) There is not enough training data for target speakers, which causes inadequate training of speaker models and degrades performance.

This thesis focuses on the key techniques of speaker recognition in the internet environment and provides support for network applications. The main research covers a speech normalization method, similarity measures for speaker clustering, a speaker verification method with very low false acceptance, a codec compensation method for speaker models, and a speaker model for sparse training data. The contributions of the thesis are:

1.
A network speech normalization method is proposed. Converting multimedia data streams into speaker-specific feature sequences is the first task of speaker recognition on the internet. The main steps of the method are: (1) decoding the compressed audio data in real time and storing the data of each channel independently; (2) extracting robust features from the uncompressed data; (3) calculating the similarity between the segments of multichannel audio and removing redundant audio information; (4) dividing the audio stream into segments, each containing a single audio scene; (5) removing the non-speech from all segments. Experimental results show that this method can convert a network multimedia stream into single-speaker feature sequences in real time with high performance.

2. A general Kullback-Leibler distance metric is proposed for speaker clustering. A speaker recognition system tends to classify correctly when given a long segment, so speaker clustering is used to increase the length of the test segments. Similarity measures play an important part in speaker clustering. Unfortunately, conventional measures such as the Kullback-Leibler divergence and the generalized likelihood ratio perform poorly when the input utterances have different lengths. A novel similarity measure, the general Kullback-Leibler distance metric (GKLDM), is proposed to solve this problem. When the utterance is modeled by a Gaussian distribution, the relationships between GKLDM and the conventional similarity measures are analyzed. When it is modeled by a Gaussian mixture distribution, GKLDM has no closed-form solution, and a method is proposed to obtain an upper bound for GKLDM with low computational complexity.

3. A speaker verification method with very low false acceptance is proposed. There is a large amount of out-of-set test data on the network, resulting in many false alarms.
A speaker verification method is proposed to reduce these false alarms. A verification step, which filters the GMM-UBM results and rejects the wrong ones, is added to the GMM-UBM framework. Three verification methods were investigated, based respectively on a coarse-grained analysis window, a non-target competing model, and a statistic vector of change status. Experiments were conducted with a large quantity of test data from network multimedia, and the results show that the proposed method decreases false alarms substantially.

4. A codec model compensation method is proposed. Network multimedia uses many audio codecs, and the mismatch between the codecs of the training data and of the network test data may increase mistaken acceptances. A method that compensates for coding mismatch is proposed to solve this problem. It first learns the deviation between the distribution of the training features and that of the test features, and then compensates the model with this deviation. Experimental results show that the proposed method effectively reduces the mistaken acceptances caused by codec mismatch.

5. A model compensation method for sparse training data is proposed. Training data on the network is sparse, which leads to inadequate training of speaker models and a decline in the performance of speaker recognition systems. A model compensation method is proposed to address this problem. Considering that there is a shift between each target GMM-based model and the UBM, a low-dimensional affine space named the shift space is found, each basis vector of which represents one regularity of this kind of shift. First, the shift of each model is transformed into a point in the shift space, named the shift factor. Next, the coordinates of the shift factor are learned from the GMM mixtures that are insensitive to the amount of training data. They are then used to compensate the other GMM mixtures and improve the model's ability to describe the characteristics of the speaker's voice.
The training method for the parameters of the shift factor is presented, and some properties of the shift factor are analyzed. With the proposed method, a clear reduction in equal error rate is obtained when the training data are sparse.

We investigated some key techniques of speaker recognition in networks and proposed corresponding solutions. The work in this thesis will promote the application of speaker recognition in complicated network environments.
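
For reference, the conventional Kullback-Leibler divergence mentioned under contribution 2 has a closed form between two Gaussians, while between two Gaussian mixtures it does not and is usually bounded instead. The sketch below is not the GKLDM proposed in the thesis, whose definition the abstract does not give. It only illustrates the conventional measure under stated assumptions: diagonal covariances, mixtures with the same number of components paired by index (a natural pairing when both models are adapted from the same UBM), and hypothetical function names. The mixture bound shown is the standard matched-pair bound that follows from the log-sum inequality, not the bound derived in the thesis.

```python
import math


def kl_diag_gauss(mean_p, var_p, mean_q, var_q):
    """Closed-form KL(p || q) for two Gaussians with diagonal covariance.

    Each argument is a list of per-dimension values; variances must be > 0.
    """
    if not (len(mean_p) == len(var_p) == len(mean_q) == len(var_q)):
        raise ValueError("all mean/variance vectors must have the same length")
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total


def symmetric_kl(mean_p, var_p, mean_q, var_q):
    """Symmetrized divergence KL(p || q) + KL(q || p)."""
    return (kl_diag_gauss(mean_p, var_p, mean_q, var_q)
            + kl_diag_gauss(mean_q, var_q, mean_p, var_p))


def gmm_kl_matched_bound(gmm_f, gmm_g):
    """Upper bound on KL(f || g) for two GMMs with index-paired components.

    Each GMM is a list of (weight, mean, var) tuples with positive weights.
    Assumption: component i of f corresponds to component i of g.
    """
    if len(gmm_f) != len(gmm_g):
        raise ValueError("both mixtures must have the same number of components")
    bound = 0.0
    for (wf, mf, vf), (wg, mg, vg) in zip(gmm_f, gmm_g):
        bound += wf * (math.log(wf / wg) + kl_diag_gauss(mf, vf, mg, vg))
    return bound


if __name__ == "__main__":
    # Two unit-variance 1-D Gaussians whose means differ by 1.
    print(kl_diag_gauss([0.0], [1.0], [1.0], [1.0]))
    print(symmetric_kl([0.0], [1.0], [1.0], [1.0]))

    # Two 2-component mixtures that differ only in the first component mean.
    f = [(0.5, [0.0], [1.0]), (0.5, [3.0], [1.0])]
    g = [(0.5, [1.0], [1.0]), (0.5, [3.0], [1.0])]
    print(gmm_kl_matched_bound(f, g))
```

The bound costs one closed-form Gaussian divergence per component pair, so it is linear in the number of mixtures, which is one way to see why a bound can be attractive when the exact mixture divergence has no closed form.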
Keywords/Search Tags: speaker recognition in network, speech normalization method, general Kullback-Leibler distance metric, very low false acceptance, shift factor
Related items