
Language Model Based On Data Cluster

Posted on: 2011-08-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Chu
Full Text: PDF
GTID: 2178360308461335
Subject: Pattern Recognition and Intelligent Systems

Abstract/Summary:
The statistical language model, which has developed since the 1980s, has been widely used in many fields, such as speech recognition, information retrieval, machine translation, handwriting recognition, and automatic Chinese word segmentation. The traditional statistical language model is the N-gram model, which captures only the relations between adjacent words and carries no semantic information. In recent years, as the amount of text has grown, corpora have become larger than ever, and data sparseness has become the key problem facing statistical language models; solving it would substantially improve system performance. The main research in this thesis focuses on building more effective class-based language models to alleviate data sparseness.

The purpose of this thesis is to study word clustering algorithms for class-based language models. The main work focuses on the following aspects.

First, it introduces the basics of language modeling: the traditional n-gram model, smoothing algorithms, and language model evaluation, including the back-off algorithm for smoothing parameters, data expansion, and class-based language models.

Second, it introduces the traditional word clustering algorithm, which uses mutual information as the objective function and a greedy algorithm to maximize the likelihood. Because the greedy algorithm easily falls into local optima, it is a suboptimal clustering algorithm. The thesis then presents a clustering algorithm based on semantic similarity that takes adjacent contextual information into account. Comparative experiments and analysis show that this method captures more semantic information than the traditional one.

Finally, a clustering algorithm based on topic analysis is proposed. It uses LDA (Latent Dirichlet Allocation) to obtain each word's probability distribution over the topics and uses that distribution as the word's feature vector for clustering.
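The class-based decomposition that motivates this line of work can be sketched as follows: once words are grouped into classes, the bigram probability is approximated as P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) · P(w_i | c_i), so rare word pairs borrow statistics from their classes. The toy corpus, the fixed word-to-class map, and the unigram-style denominators below are illustrative assumptions for a minimal sketch, not the thesis's actual formulation.

```python
from collections import defaultdict

def train_class_bigram(sentences, word2class):
    """Estimate a class-based bigram model:
    P(w_i | w_{i-1}) ~= P(c_i | c_{i-1}) * P(w_i | c_i).
    word2class is assumed given (e.g. produced by a clustering step)."""
    class_bigram = defaultdict(int)   # counts of (c_prev, c_cur) pairs
    class_count = defaultdict(int)    # unigram class counts
    word_count = defaultdict(int)     # unigram word counts
    for sent in sentences:
        classes = [word2class[w] for w in sent]
        for w, c in zip(sent, classes):
            word_count[w] += 1
            class_count[c] += 1
        for c1, c2 in zip(classes, classes[1:]):
            class_bigram[(c1, c2)] += 1

    def prob(w_prev, w_cur):
        c1, c2 = word2class[w_prev], word2class[w_cur]
        p_class = class_bigram[(c1, c2)] / max(class_count[c1], 1)
        p_word = word_count[w_cur] / max(class_count[c2], 1)
        return p_class * p_word

    return prob
```

Because unseen word bigrams can still receive probability mass through their (seen) class bigram, this decomposition directly mitigates the data sparseness problem described above.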
This feature reflects the words' distribution over the global topics; compared with the semantic-similarity method, it captures long-span semantic information. With this algorithm, words in the same class show strong topical relations, which makes the clustering more meaningful and effective.
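The idea of clustering words by their topic distributions can be sketched as below: each word's feature vector is its probability over LDA topics (assumed here to be pre-computed by an LDA toolkit), and words are grouped by cosine similarity between those vectors. The seed-word assignment scheme and the toy two-topic vectors are simplifying assumptions for illustration; the thesis does not specify this exact clustering procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic-distribution vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cluster_by_topic(word_topic, seed_words):
    """Assign every word to the seed word whose LDA topic
    distribution it resembles most (a nearest-seed sketch,
    standing in for a full k-means-style clustering)."""
    clusters = {s: [s] for s in seed_words}
    for w, vec in word_topic.items():
        if w in seed_words:
            continue
        best = max(seed_words, key=lambda s: cosine(vec, word_topic[s]))
        clusters[best].append(w)
    return clusters
```

On a toy vocabulary where "bank" and "money" load on one topic and "river" and "water" on another, the words group by topic rather than by adjacency, which is the long-span behavior the abstract describes.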
Keywords/Search Tags: data cluster, N-gram language model, LDA (Latent Dirichlet Allocation), latent-topic-based word cluster language model