With a focus on speaker recognition, this thesis explores speech segments with high speaker discriminative power, characteristic features with high discriminative capability, and other speaker discrimination issues. The major efforts and contributions are:

1. A speaker recognition framework that accounts for speaker discrimination power is proposed. Discrimination power is studied in both the time domain and the feature domain. In the time domain, uninformative segments are filtered out and segments with high speaker discriminative power are retained, their impact on performance is analyzed, and their effect on speaker recognition under various conditions is summarized. In the feature domain, enhancement, restoration, and projection are applied to acoustic features, which improves the speaker discrimination of the features and thereby the system performance.

2. For noisy recording environments, speaker discrimination capability is described by phone posterior probabilities, according to which speech segments with high discrimination power are selected. The thesis adopts a deep learning approach to learn the different patterns of phonemes and noise by means of posterior probabilities. The posterior probability serves as a prior on the importance of each speech segment, so that the acoustic features extracted from it are emphasized or de-emphasized accordingly. The results show a relative improvement of 21.0% in an environment with an SNR of 18 and, more importantly, the method also works in clean environments.

3. For the loss of discrimination power caused by transmission-channel distortions such as clipping, a non-linear feature restoration method is proposed to recover the damaged feature patterns. The impact of clipping on speaker recognition is studied systematically for the first time. The thesis proposes a distribution-based detection method for clipping, with which clipped segments are separated from normal speech so that the discrimination power of each can be studied separately. A non-linear restoration method based on DNN models is proposed, which decreases the EER by 28.4% when the enrollment and test speech are clipped by 90%.

4. A joint optimization of features and models is proposed to weaken the emotion-discriminative information and, as a result, improve speaker discrimination power. Emotional features are transformed into the neutral space to weaken the effect of emotion, and the transformation itself is optimized during this process: the transformed enrollment features are used to update the speaker models, which are in turn used to re-optimize the transformation. This iterative method weakens the discriminative power for emotion while increasing speaker discrimination. The experiments show that the proposed method reduces the average EER by about 17.5% across four different emotions.
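The posterior-based weighting in contribution 2 can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes that a frame classifier (not shown) already outputs one posterior vector per frame with a dedicated noise class, and that a frame's importance is simply the posterior mass assigned to the phone classes. The names `frame_weights`, `weighted_mean`, and `noise_index` are hypothetical.

```python
from typing import List, Sequence


def frame_weights(posteriors: Sequence[Sequence[float]], noise_index: int) -> List[float]:
    """Return one importance weight per frame.

    Assumption (illustrative only): the weight is the posterior mass that the
    classifier assigns to phone classes, i.e. 1 - P(noise | frame).
    """
    weights = []
    for frame in posteriors:
        total = sum(frame)
        if total <= 0:
            raise ValueError("each posterior vector must have positive mass")
        weights.append(1.0 - frame[noise_index] / total)
    return weights


def weighted_mean(features: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Pool frame-level feature vectors, emphasizing high-weight frames."""
    norm = sum(weights)
    if norm == 0:
        raise ValueError("all frames have zero weight")
    dim = len(features[0])
    return [
        sum(w * frame[d] for frame, w in zip(features, weights)) / norm
        for d in range(dim)
    ]


# Toy data: the second frame is mostly noise and carries outlier features.
posteriors = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]  # columns: [phone, noise]
features = [[1.0, 2.0], [10.0, 20.0], [1.0, 2.0]]

weights = frame_weights(posteriors, noise_index=1)
pooled = weighted_mean(features, weights)  # dominated by the speech-like frames
```

In this toy case the noisy frame receives a weight of about 0.1, so the pooled vector stays close to the two speech-like frames rather than to the plain average of all three.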
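The idea behind a distribution-based clipping detector, as mentioned in contribution 3, can be sketched as follows. The abstract does not give the thesis's actual criterion, so this is only one plausible reading: hard clipping piles many samples onto the peak amplitude, whereas unclipped audio has very few samples there. The function names, the relative tolerance, and the decision threshold are all assumptions chosen for illustration.

```python
import random
from typing import List, Sequence


def peak_mass(samples: Sequence[float], tolerance: float = 1e-6) -> float:
    """Fraction of samples whose magnitude lies within a relative
    `tolerance` of the largest magnitude in the signal."""
    if not samples:
        raise ValueError("samples must not be empty")
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return 0.0  # silent signal: nothing sits at a clipping level
    threshold = peak * (1.0 - tolerance)
    return sum(1 for s in samples if abs(s) >= threshold) / len(samples)


def is_clipped(samples: Sequence[float], max_peak_mass: float = 0.05) -> bool:
    """Flag a segment as clipped when too much of its amplitude
    distribution is concentrated at the peak (threshold is an assumption)."""
    return peak_mass(samples) > max_peak_mass


def hard_clip(samples: Sequence[float], level: float) -> List[float]:
    """Simulate hard clipping by saturating samples at +/- level."""
    return [max(-level, min(level, s)) for s in samples]


# Synthetic stand-in for a speech segment: seeded Gaussian noise.
rng = random.Random(0)
segment = [rng.gauss(0.0, 1.0) for _ in range(5000)]
clipped_segment = hard_clip(segment, 1.0)

flag_clean = is_clipped(segment)            # only the single peak sample sits at the maximum
flag_clipped = is_clipped(clipped_segment)  # a large share of samples is saturated
```

A detector of this kind only separates clipped from normal segments; the restoration step described in the thesis is a separate DNN-based model and is not sketched here.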