Robust Speaker Recognition Based On Sparse Coding

Posted on: 2017-05-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y N Xie
Full Text: PDF
GTID: 1108330485980242
Subject: Computer application technology
Abstract/Summary:
Speaker recognition, also known as voiceprint identification, is a technique for identifying a speaker by his or her voice. Because speech is easy and inexpensive to record, the technique is widely used in authentication, security, military, forensic, and other fields, and it has broad application prospects. For decades, research institutions and companies around the world have invested substantial manpower and resources in this research, which has given strong impetus to the development of speaker recognition. Many speaker recognition methods have now moved from the laboratory into applications. However, the complexity of real environments places higher demands on robustness, real-time performance, accuracy, and stability. Meeting these demands requires breakthroughs in several key issues of speaker recognition, especially voice activity detection, feature extraction, and speaker modeling.

Although current speaker recognition achieves good performance in clean environments, its performance degrades drastically in noisy environments, which keeps the technology from real-world applications. Focusing on the robustness of speaker recognition to environmental noise, this dissertation applies sparse coding techniques to every stage of speaker recognition, including voice activity detection, speech feature extraction, and speaker modeling, and provides systematic solutions that address noise robustness and improve the recognition rate. The main contributions of this dissertation are as follows.

First, we theoretically analyze the ability of two sparse coding methods to model environmental noise, laying the foundation for applying sparse coding. Sparse coding can model noise in two ways. The first models noise as the residual, without training a noise dictionary; its theoretical noise model is Gaussian white noise.
Its inherent assumption is that speech is sparse over the speech dictionary while noise is not. White noise is dense over any dictionary and meets this requirement well. The second trains a dictionary to model noise; its assumption is that speech and noise are each sparse over their own dictionary but not over the other's. We theoretically analyze the reconstruction error of the two methods. The results show that they have the same lower bound but different upper bounds. When the noise is also sparse, the second method exploits prior knowledge through the noise dictionary, which makes its reconstruction error lower than that of the first.

Second, because voice activity detection is easily affected by noise, we propose a noise-robust voice activity detection method. Voice activity detection is the first step in speaker recognition. It reduces the amount of data that later algorithms must process and thereby improves recognition efficiency. Although current methods take the influence of noise into account, they can only handle conditions in which the noise is known. When the noise varies or is unstable, their performance degrades drastically. The proposed method first uses a Gaussian mixture model to identify the type of noise, and then selects the appropriate noise dictionaries and concatenates them with the speech dictionary to form a large dictionary for sparse decomposition. Noisy speech is represented over the large dictionary, and the sparse representation over the speech dictionary is then used to distinguish speech from nonspeech. Experimental results show that the proposed method achieves excellent performance in complex noisy environments because it can perceive the noise and adapt to it.

Third, based on sparse coding, we propose two noise-robust feature extraction methods.
Feature extraction is an important step in speaker recognition. On the one hand, the features need to be discriminative; on the other hand, they should be immune to noise. The first feature extraction method uses the perceptual minimum variance distortionless response technique; the feature builds on the shifted delta cepstrum algorithm and successfully integrates long-term information about the speaker’s voice. The extracted features not only perform well in clean environments but also outperform current mainstream features under noise and channel mismatch. Experimental results on the Y database and the ROSSI database show that the new features effectively improve the robustness of the recognition system under noise and channel distortion. The second method represents noisy speech over the speech dictionary, reconstructs the speech from the sparse representation, and finally computes Mel cepstrum features for model training and recognition. Because sparse coding can model noise with either the residual error or a noise dictionary, the reconstructed signal excludes the noise, and speech features that are immune to noise can be obtained.

Finally, a speaker recognition framework with two-stage sparse decomposition is proposed. At present, the common approach is to form one large dictionary from all speaker dictionaries. Although this approach is discriminative to a certain degree, it has two problems. On the one hand, the large dictionary contains too many atoms, which lowers recognition efficiency; on the other hand, there is too much competition among classes, which dilutes the competitiveness of the true speaker's dictionary.
In the first stage, the proposed method decomposes the speech to be recognized over each speaker's dictionary, computes the reconstruction residuals, and sorts them to select a subset of dictionaries that contains the true speaker. In the second stage, the dictionaries in this subset are concatenated into a large dictionary, the speech to be recognized is sparsely decomposed over it again, and the resulting decomposition is used to compute scores and recognize the speaker. By decomposing the test speech over each speaker's dictionary, the first stage removes a large number of unrelated speaker dictionaries, which reduces the time complexity of the algorithm; the second stage ensures high accuracy by using a discriminative method. Experimental results show that the proposed method not only increases recognition speed but also improves accuracy.
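
The two-stage idea described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the greedy orthogonal matching pursuit solver, the per-speaker residual score used in the second stage, the toy orthonormal dictionaries, and all names (`omp`, `two_stage_identify`, `k`, `shortlist`) are assumptions chosen only to make the example small and self-contained.

```python
import numpy as np


def omp(D, x, k):
    """Greedy orthogonal matching pursuit: approximate x with at most k atoms of D."""
    residual = x.copy()
    support = []
    sol = np.zeros(0)
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break
        support.append(j)
        # Re-fit all selected atoms jointly by least squares.
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
    coef = np.zeros(D.shape[1])
    if support:
        coef[support] = sol
    return coef


def two_stage_identify(dicts, x, k=2, shortlist=2):
    """Return (predicted speaker index, shortlisted speaker indices)."""
    # Stage 1: decompose x over each speaker dictionary separately and
    # keep the speakers with the smallest reconstruction residuals.
    stage1 = [float(np.linalg.norm(x - D @ omp(D, x, k))) for D in dicts]
    candidates = sorted(np.argsort(stage1)[:shortlist].tolist())

    # Stage 2: concatenate only the shortlisted dictionaries and decompose again.
    big = np.hstack([dicts[i] for i in candidates])
    coef = omp(big, x, k)

    # Assumed scoring rule: residual when x is rebuilt from one speaker's
    # block of coefficients only; the smallest residual wins.
    scores = {}
    start = 0
    for i in candidates:
        n = dicts[i].shape[1]
        scores[i] = float(np.linalg.norm(x - dicts[i] @ coef[start:start + n]))
        start += n
    return min(scores, key=scores.get), candidates


# Toy data (hypothetical): four speakers, each owning two axes of an 8-dim space.
eye = np.eye(8)
dicts = [eye[:, 2 * s:2 * s + 2] for s in range(4)]
# Test vector dominated by speaker 2's atoms, with small off-speaker components.
x = np.array([0.1, 0.0, 0.0, 0.05, 1.0, 0.5, 0.0, 0.02])

speaker, candidates = two_stage_identify(dicts, x, k=2, shortlist=2)
print("shortlist:", candidates, "predicted speaker:", speaker)
```

In this sketch the first stage costs one small decomposition per speaker, while the second stage works on a dictionary whose size depends only on `shortlist`, which is the source of the speed-up the abstract describes. Real speaker dictionaries would be learned from speech features and would not be orthonormal.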
Keywords/Search Tags: speaker recognition, sparse coding, noise robustness, sparse decomposition