
Language Model Based On Data Cluster

Posted on: 2011-08-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Chu
Full Text: PDF
GTID: 2178360308461335
Subject: Pattern Recognition and Intelligent Systems

Abstract/Summary:
The statistical language model, which has developed since the 1980s, has been widely used in many fields, such as speech recognition, information retrieval, machine translation, handwriting recognition, and automatic Chinese word segmentation. The traditional statistical language model is the N-gram model, which captures only the relations between adjacent words and carries no semantic information. In recent years, as the amount of text has grown, corpora have become larger than ever, and data sparseness has become the key problem facing statistical language models; solving it would substantially improve system performance. The main research in this thesis focuses on building more effective class-based language models to alleviate data sparseness.

The purpose of this thesis is to study word clustering algorithms for class-based language models. The main work focuses on the following aspects.

First, it introduces the basics of language modeling: the traditional n-gram model, smoothing algorithms, and language model evaluation, including the back-off algorithm for smoothing parameters, data expansion, and class-based language models.

Second, it introduces the traditional word clustering algorithm, which uses mutual information as the objective function and a greedy algorithm to maximize the likelihood. Because the greedy algorithm easily falls into local optima, it is a suboptimal clustering algorithm. The thesis then presents a clustering algorithm based on semantic similarity that takes adjacent contextual information into account. Comparative experiments and analysis show that this method captures more semantic information than the traditional one.

Finally, a clustering algorithm based on topic analysis is proposed. It uses LDA (Latent Dirichlet Allocation) to obtain each word's probability distribution over the topics and uses that distribution as the word's feature vector for clustering.
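The class-based decomposition that motivates this line of work can be sketched as follows: once words are grouped into classes, the bigram probability is approximated as P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) · P(w_i | c_i), so rare word pairs borrow statistics from their classes. The toy corpus, the fixed word-to-class map, and the unigram-style denominators below are illustrative assumptions for a minimal sketch, not the thesis's actual formulation.

```python
from collections import defaultdict

def train_class_bigram(sentences, word2class):
    """Estimate a class-based bigram model:
    P(w_i | w_{i-1}) ~= P(c_i | c_{i-1}) * P(w_i | c_i).
    word2class is assumed given (e.g. produced by a clustering step)."""
    class_bigram = defaultdict(int)   # counts of (c_prev, c_cur) pairs
    class_count = defaultdict(int)    # unigram class counts
    word_count = defaultdict(int)     # unigram word counts
    for sent in sentences:
        classes = [word2class[w] for w in sent]
        for w, c in zip(sent, classes):
            word_count[w] += 1
            class_count[c] += 1
        for c1, c2 in zip(classes, classes[1:]):
            class_bigram[(c1, c2)] += 1

    def prob(w_prev, w_cur):
        c1, c2 = word2class[w_prev], word2class[w_cur]
        p_class = class_bigram[(c1, c2)] / max(class_count[c1], 1)
        p_word = word_count[w_cur] / max(class_count[c2], 1)
        return p_class * p_word

    return prob
```

Because unseen word bigrams can still receive probability mass through their (seen) class bigram, this decomposition directly mitigates the data sparseness problem described above.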
This feature reflects the words' distribution over the global topics; compared with the semantic-similarity method, it captures long-span semantic information. With this algorithm, words in the same class show strong topical relations, which makes the clustering more meaningful and effective.
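The idea of clustering words by their topic distributions can be sketched as below: each word's feature vector is its probability over LDA topics (assumed here to be pre-computed by an LDA toolkit), and words are grouped by cosine similarity between those vectors. The seed-word assignment scheme and the toy two-topic vectors are simplifying assumptions for illustration; the thesis does not specify this exact clustering procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic-distribution vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cluster_by_topic(word_topic, seed_words):
    """Assign every word to the seed word whose LDA topic
    distribution it resembles most (a nearest-seed sketch,
    standing in for a full k-means-style clustering)."""
    clusters = {s: [s] for s in seed_words}
    for w, vec in word_topic.items():
        if w in seed_words:
            continue
        best = max(seed_words, key=lambda s: cosine(vec, word_topic[s]))
        clusters[best].append(w)
    return clusters
```

On a toy vocabulary where "bank" and "money" load on one topic and "river" and "water" on another, the words group by topic rather than by adjacency, which is the long-span behavior the abstract describes.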
Keywords/Search Tags: data cluster, N-gram language model, LDA (Latent Dirichlet Allocation), latent-topic-based word cluster language model