Research On Speaker Recognition Technology Based On Voiceprint Information Space

Posted on: 2013-05-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: E Y Wang
GTID: 1228330377951699
Subject: Signal and Information Processing

Abstract/Summary:
As speaker recognition technologies mature, researchers have begun to focus on the practical problems that arise when these technologies are deployed in real applications. More and more solutions are being introduced to meet different practical requirements and to improve recognition performance. How to extract representative voiceprint features and how to build accurate speaker models remain the key problems in current research. A voiceprint feature is a kind of suprasegmental information that is spread over the whole speech signal but is not uniformly contained in each detailed cue of that signal. The carriers of voiceprint features originate from different information spaces, reflecting different interpretations of speaker-dependent information. In this dissertation, we define a voiceprint information space as all the speaker-dependent information that a given kind of carrier can capture. We explore the phonetic voiceprint space, the temporal voiceprint space, the frequency voiceprint space and the deep structured feature space. In each space, we focus on obtaining an effective representation of the voiceprint and on establishing a modeling method.

Firstly, we build a multilingual-coding-based speaker recognition system in the phonetic voiceprint space. Phonetic segments contain not only textual information but also speaker-dependent information, which makes them an effective carrier for voiceprint features. In this part of the work, we extract and apply the voiceprint features that reside in this space. To obtain them, a set of phonetic patterns is used to reveal the speaker-dependent information; extracting speaker information with phonetic patterns works like a coding process in this phonetic space. Furthermore, multiple sets of phonetic patterns are introduced to make the phonetic voiceprint space more complete.
As in the traditional MLLR-SVM system, we use MLLR transforms to represent the voiceprint features obtained from the phonetic patterns for each speech segment. Because the sets of phonetic patterns are used in parallel to acquire speaker information, we call this method the multilingual-coding-based MLLR-SVM speaker recognition system. Several combination strategies are also applied to gather speaker information from the different phonetic voiceprint spaces in order to improve performance.

Secondly, speaker-dependent information is contained in variable speech realizations, which include speech segments from different communication channels and segments produced under different emotional states. Since different speech realizations are produced at different times, we say that the voiceprint features in these speech segments belong to the temporal voiceprint space. Under such variability, a speaker recognition system can suffer a severe loss of performance. Traditionally, researchers use joint factor analysis (JFA) and nuisance attribute projection (NAP) to address the problem. In this dissertation, we approach it with unsupervised adaptation, which updates the model parameters whenever new training data becomes available and is effective because it captures voiceprint features in the temporal space. In contrast to model-based methods, we introduce a score-based unsupervised method with hard and soft decision strategies. By defining a prior score distribution and a score confidence, we finally obtain an unsupervised score normalization method. This method achieves good performance and reduces the computational cost.

Thirdly, there are intrinsic correlations among different frequency bands. These correlations reflect not only textual information but also speaker-dependent information. We define this information as coming from the frequency voiceprint space, and we evaluate the performance of a speaker recognition system in this space.
Covariance matrices are introduced to describe the voiceprint across frequency bands. Because covariance is difficult to estimate, we provide two stable estimation methods. By analogy with the traditional mean supervector, we construct a covariance super-matrix to represent the voiceprint. To measure the similarity of these information carriers, two distance metrics are given. Finally, with support vector machine and linear inner-product classifiers, we set up a speaker recognition system in the frequency voiceprint space that performs as well as traditional mean supervector systems.

Finally, we explore voiceprint features in the deep structured space. In current research, both features and modeling methods can be explained by shallow structures. A deep structure can reveal information that is constructed by more than two or three layers of nonlinear nodes. In this dissertation, we look for the voiceprint in the deep structured space with deep neural networks. Training a deep neural network involves two steps. The first is pretraining, an unsupervised feature-expansion method with deep structures. The expanded feature, which comes from the deep structured feature space, contains more general and abstract information. At this step, the feature cannot distinguish speaker-dependent information from speaker-independent information, because information from the voiceprint space is treated the same as information from other spaces. We therefore introduce the second step, fine-tuning, to separate voiceprint features from the rest. We provide two constraints to achieve this: a sparse coding method and a speaker-distance-based method. To verify the effectiveness of the voiceprint in the deep structured space, and to avoid interference from other information, we use TIMIT as the database in our experiments. Preliminary results show that the voiceprint in the deep structured space gives much better performance than the traditional method.
The proposed system can also be combined with the baseline system to obtain a significant improvement.
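
The score-based unsupervised adaptation with hard and soft decision strategies, described in the part on the temporal voiceprint space, can be illustrated with a small sketch. It is not the dissertation's implementation: the two Gaussian prior score distributions, their parameters, the 0.5 threshold and the function names `score_confidence`, `adaptation_weight` and `update_mean` are all assumptions made for illustration. The sketch computes a confidence that a trial score belongs to the target speaker, turns it into a hard (0 or 1) or soft (fractional) weight, and uses that weight to update a running score statistic.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def score_confidence(score, target=(2.0, 1.0), impostor=(-2.0, 1.0),
                     prior_target=0.5):
    """Posterior probability that a score comes from a target trial.

    `target` and `impostor` are (mean, std) pairs of the assumed prior
    score distributions; the default values are purely illustrative.
    """
    p_target = prior_target * gaussian_pdf(score, *target)
    p_impostor = (1.0 - prior_target) * gaussian_pdf(score, *impostor)
    return p_target / (p_target + p_impostor)


def adaptation_weight(score, mode="soft", threshold=0.5):
    """Weight given to a new trial when adapting without labels.

    Hard decision: the trial counts fully or not at all.
    Soft decision: the trial counts in proportion to its confidence.
    """
    confidence = score_confidence(score)
    if mode == "hard":
        return 1.0 if confidence >= threshold else 0.0
    return confidence


def update_mean(mean, total_weight, score, weight):
    """Weighted running mean of scores, updated with one new trial."""
    new_total = total_weight + weight
    if new_total == 0.0:
        return mean, total_weight
    return mean + weight * (score - mean) / new_total, new_total


if __name__ == "__main__":
    mean, total = 0.0, 0.0
    for s in [3.1, -2.4, 0.2, 2.7]:
        w = adaptation_weight(s, mode="soft")
        mean, total = update_mean(mean, total, s, w)
        print(f"score={s:+.1f} weight={w:.3f} running_mean={mean:.3f}")
```

With the soft strategy, ambiguous trials (scores near the crossover of the two assumed distributions) contribute only partially, whereas the hard strategy either accepts or discards them outright. A running statistic of this kind could then be used to normalize later scores.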
Keywords/Search Tags: speaker verification, support vector machine, maximum likelihood linear regression, covariance super-matrix, deep neural network