Latent Semantic Analysis In Language Identification

Posted on: 2011-07-10
Degree: Master
Type: Thesis
Country: China
Candidate: T Jin
Full Text: PDF
GTID: 2178360308955285
Subject: Signal and Information Processing
Abstract/Summary:
Language identification is the process by which a computer determines the language of a given utterance, and it is an important research direction in speech recognition. With the deepening of economic globalization, it has found wide application in fields such as daily life, national defense, military affairs and public security.

Generally speaking, each language has its own relatively independent phone set, prosody, vocabulary, syntax and grammar. These differences between languages make language identification possible. According to how the speech is modeled, there are two mainstream categories of language identification: methods based on acoustic models and methods based on language models. A language-model-based method first transforms the given speech into a phone sequence using speech recognition technology, and then identifies the language by discriminating between the phonotactic rules of different languages. This approach has the advantages of robust performance and good scalability.

This thesis presents a systematic study of language identification under the phonotactic, language-model-based framework. First, we built a complete language identification system, from the phone recognizer through to language modeling. Then, by mining the latent semantic structures in different statistical language models, we reduced the computational complexity and improved the system performance. The detailed work is as follows:

Firstly, we compared lattice and 1-best string output structures in phone recognition and showed that lattices yield more detailed results. We also constructed a new kernel function, which greatly improved the accuracy of language identification.

Secondly, in the system where phone recognition is followed by a support vector machine, we applied two keyword selection methods to select more discriminative features from the feature vector of each utterance, addressing the problem that these feature vectors are high-dimensional and sparse. This also effectively improved the efficiency of language identification.

Thirdly, based on the bag-of-words idea from information retrieval, we introduced two latent semantic analysis methods into language modeling and generated new robust and representative latent semantic features for training more discriminative language models. This greatly alleviated the high-dimensionality problem and showed promising performance.
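The bag-of-words idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes plain SVD-based latent semantic analysis over phone bigram counts, uses NumPy, and runs on invented toy phone sequences. The function names (phone_ngrams, count_matrix, lsa_project, cosine) and the phone labels are hypothetical. Probabilistic latent semantic analysis, which appears in the keywords, is not shown.

```python
from collections import Counter

import numpy as np


def phone_ngrams(phones, n=2):
    """Return the phone n-grams (as tuples) found in one phone sequence."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]


def count_matrix(utterances, n=2):
    """Build an utterance-by-n-gram count matrix (a bag of phone n-grams)."""
    counts = [Counter(phone_ngrams(u, n)) for u in utterances]
    vocab = sorted(set().union(*counts))
    index = {gram: j for j, gram in enumerate(vocab)}
    X = np.zeros((len(utterances), len(vocab)))
    for i, c in enumerate(counts):
        for gram, k in c.items():
            X[i, index[gram]] = k
    return X, vocab


def lsa_project(X, rank):
    """Project the rows of X onto the top-`rank` right singular vectors."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    basis = Vt[:rank].T          # shape: (number of n-grams, rank)
    return X @ basis, basis


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy data (assumption): two "languages" with disjoint phone inventories.
utterances = [
    "a b a b a b".split(),
    "b a b a b a".split(),
    "c d c d c d".split(),
    "d c d c d c".split(),
]

X, vocab = count_matrix(utterances, n=2)   # sparse, one column per bigram
Z, basis = lsa_project(X, rank=2)          # dense, low-dimensional features

same = cosine(Z[0], Z[1])        # utterances sharing a phone inventory
different = cosine(Z[0], Z[2])   # utterances with disjoint inventories
```

In this toy setting the two utterances built from the same phones land on nearly the same direction in the latent space, while utterances with disjoint inventories stay close to orthogonal. On real data the count matrix has far more columns than the reduced representation, which is the sense in which latent features ease the high-dimensionality problem; a classifier such as a support vector machine could then be trained on the rows of Z.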
Keywords/Search Tags: Language Identification, Lattice, N-Gram, Support Vector Machine, Keyword Selection, Latent Semantic Analysis, Probabilistic Latent Semantic Analysis