Latent Semantic Analysis In Language Identification

Posted on:2011-07-10

Degree:Master

Type:Thesis

Country:China

Candidate:T Jin

GTID:2178360308955285

Subject:Signal and Information Processing

Abstract/Summary:

Language identification is the process by which a computer determines the language of a given utterance, and it is an important research direction in speech recognition. With the deepening of economic globalization, it has found wide application in fields such as daily life, national defense, military affairs and public security.

Generally speaking, each language has its own relatively independent phone set, prosody, vocabulary, syntax and grammar. These differences between languages make language identification possible. According to how the speech is modeled, there are two mainstream categories of language identification: methods based on acoustic models and methods based on language models. A language-model-based method first transforms the given speech into a phone sequence using speech recognition technology, and then identifies the language by discriminating between the phonotactic rules of different languages. This approach has the advantages of robust performance and good scalability.

This thesis presents a systematic study of language identification under the phonotactic framework based on language models. First, we built a complete language identification system, from the phone recognizer to language modeling. Then, by mining the latent semantic structures in different statistical language models, we made progress on reducing the computational complexity and improving the system performance. The detailed work is as follows:

Firstly, we compared two output structures of phone recognition, the lattice and the 1-best string, and showed that the lattice yields more detailed results. We also constructed a new kernel function, which greatly improves the accuracy of language identification.

Secondly, in the system of phone recognition followed by a support vector machine, we applied two keyword selection methods to select more discriminative characteristics from the feature vectors of each utterance. This addresses the problem that the feature vectors are high-dimensional and sparse, and it also effectively improved the efficiency of language identification.

Thirdly, based on the bag-of-words idea from information retrieval, we introduced two latent semantic analysis methods into language modeling and generated new, robust and representative latent semantic features for training more discriminative language models. This greatly alleviated the high-dimensionality problem and showed promising performance.
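The bag-of-words view of phone sequences described above can be illustrated with a small sketch. It is not the thesis's implementation: the toy phone strings, the bigram order, the function names (`phone_ngrams`, `build_count_matrix`, `lsa_project`) and the choice of plain SVD-based latent semantic analysis are all assumptions made for illustration, and the probabilistic variant named in the keywords is not shown.

```python
from collections import Counter

import numpy as np


def phone_ngrams(phones, n=2):
    """Return the phone n-grams (as tuples) found in a phone sequence."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]


def build_count_matrix(utterances, n=2):
    """Build a (terms x utterances) n-gram count matrix, bag-of-words style."""
    counts = [Counter(phone_ngrams(u, n)) for u in utterances]
    vocab = sorted(set().union(*counts))
    index = {term: row for row, term in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(utterances)))
    for col, utterance_counts in enumerate(counts):
        for term, freq in utterance_counts.items():
            matrix[index[term], col] = freq
    return matrix, vocab


def lsa_project(matrix, k):
    """Map each utterance (column) to a k-dimensional latent vector via SVD."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(s))
    # Rows of the result are utterances; columns are latent dimensions.
    return (s[:k, None] * vt[:k, :]).T


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy phone strings, invented for illustration only (hypothetical data).
utterances = [
    "a b a b a b".split(),
    "a b a b b a".split(),
    "c d c d c d".split(),
    "c d d c c d".split(),
]
matrix, vocab = build_count_matrix(utterances, n=2)
features = lsa_project(matrix, k=2)
```

In this toy setting the 7-dimensional sparse bigram counts are reduced to 2-dimensional dense vectors, and utterances that share phonotactic patterns end up close together in the latent space. Such low-dimensional vectors could then be passed to a classifier such as a support vector machine.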

Keywords/Search Tags:

Language Identification, Lattice, N-Gram, Support Vector Machine, Keyword Selection, Latent Semantic Analysis, Probabilistic Latent Semantic Analysis

Related items

1	Audio Scene Recognition Based On Probabilistic Latent Semantic Analysis
2	Latent Semantic Analysis-based Spam Filtering System Design And Realization
3	Research On Text Sentiment Analysis Based On Support Vector Machine
4	Research And Apply On Patient Record Text Mining Based On Latent Semantic Analysis
5	The Implementation And Research Of The Probabilistic Latent Semantic Analysis Model In The Search Engine's Business Text Classification System
6	Research On LYNC Instant Message Filtering Based On Latent Semantic Index
7	The Study Of Latent Semantic-Based Personalized Search Key Technology
8	The Application Of Cross-Language Information Retrieval Based On Latent Semantic Analysis
9	Research On Local Semantic Concept Representation Based Image Scene Classification Technology
10	Probabilistic Latent Semantic Analysis Method Based On Dynamic Threshold Model