Phoneme Category Based Short Utterance Speaker Recognition

Posted on: 2013-04-29  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Y X Fang  Full Text: PDF
GTID: 1268330422460321  Subject: Computer Science and Technology
Abstract/Summary:
Speaker recognition is the task of determining a person's identity from his or her voice. Conventionally, a large amount of audio data is required to perform speaker recognition. In real life, difficulties in acquiring speech data and variations in speech quality can limit the speech that is available. In such situations, it becomes crucial to use the available data, long or short, effectively. Recently, research in speaker recognition has turned towards Short Utterance Speaker Recognition (SUSR), which seeks new methodologies to improve recognition performance when utterances are short. However, most methods define short utterances to be around 10 seconds long. Only recently have short utterances been defined as around 3 seconds long. The shortest utterance length reported in the literature is 2 seconds, reaching a minimum Equal Error Rate (EER) of 21.98%.

We strive to find an effective way to recognize a speaker from test utterances of 3 seconds or less. We take Chinese as our reference language. In our quest for a solution, we present the following innovative research ideas:

1) We propose text-independent speaker recognition for short utterances. In short utterances, variations in speech can degrade speaker recognition performance. Although text-dependent speaker recognition can help to solve this problem, speech recognition is not feasible on segments as short as a few seconds. Therefore, we suggest using a rudimentary phoneme recognizer to exploit knowledge of speech units, keeping SUSR text-independent while still using the underlying speech information.

2) We propose to use phoneme sequences rather than continuous speech for speaker recognition on short utterances. Since phonemes are the smallest meaningful units of sound, phoneme sequences add useful knowledge to the recognition process while preserving the idiosyncrasies of a speaker.

3) To achieve the above goals, we suggest the use of phoneme categories. Phoneme categories exploit knowledge of speech by grouping similar sounds under one category. This not only solves the problem of sparse data in less frequent categories but also makes the distribution of phonemes across categories fairly even. On this basis we propose the Phoneme Category Based SUSR (PCBSUSR) method.

4) To design the phoneme categories, we propose to study the phonetic and phonological properties of phonemes. To confirm our idea of using phoneme categories for Short Utterance Speaker Recognition, we develop Vowel Categories (VCs) based on the articulation properties of vowels.

5) To measure the performance of combinations of phonemes (vowels and consonants), we propose designing Syllable Categories (SCs), which are the most natural combination of vowels and consonants. We design Consonant Categories (CCs) and combine VCs and CCs to study and devise SCs by considering the syllable structure of Standard Chinese.

We test our method by training Universal Background VC, CC and SC Models and performing recognition on 3-second, 2-second and 1-second sequences of VCs, CCs and SCs obtained from test utterances. The results show that important speaker information is present in speech units as small as phonemes and syllables. We conclude from our results that Syllable Categories are the best choice for speaker recognition. Vowel Categories also perform very well in our proposed SUSR. According to our results, however, consonants are not a feasible choice for SUSR.

Comparing the minimum EER with existing SUSR systems for 2-second test utterances, our experimental results (based on the Gaussian Mixture Model–Universal Background Model (GMM-UBM)) give a relative EER reduction of 54.50% and an absolute EER reduction of 11.8% using one database, and a relative EER reduction of 6.73% and an absolute EER reduction of 1.48% using another database.
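For readers unfamiliar with the metric, the following is a minimal sketch of how an EER and a relative EER reduction are commonly computed. It is an illustration only, not the dissertation's evaluation code: the function names, the threshold sweep, and the toy scores are assumptions, and the baseline is assumed to be the 21.98% EER quoted above for 2-second utterances.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER by sweeping thresholds over the observed scores.

    genuine:  scores of trials where the claimed speaker is the true speaker
    impostor: scores of trials where the claimed speaker is someone else
    A trial is accepted when its score is >= the threshold.
    """
    best_gap, best_eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        # False acceptance rate: impostor trials that pass the threshold.
        far = sum(s >= t for s in impostor) / len(impostor)
        # False rejection rate: genuine trials that fall below the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


def relative_reduction(baseline_eer, absolute_reduction):
    """Relative EER reduction in percent, given EERs in percentage points."""
    return absolute_reduction / baseline_eer * 100


# Toy scores (hypothetical, not from the dissertation).
toy_eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])

# Second-database figures from the abstract, against the assumed baseline.
second_db = round(relative_reduction(21.98, 1.48), 2)
```

Under that baseline assumption, the second call reproduces the 6.73% relative reduction reported for the second database from its 1.48% absolute reduction.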
Keywords/Search Tags: Short Utterance Speaker Recognition, Phonemes, Vowel Categories, Consonant Categories, Syllable Categories