Research On Text Categorization Based On LDA

Posted on:2011-06-27

Degree:Master

Type:Thesis

Country:China

Candidate:Z L Song

Full Text:PDF

GTID:2178360305470878

Subject:Computer application technology

Abstract/Summary:

Automatic text classification is a research focus and a core technology in information retrieval and data mining. It has received extensive attention and developed rapidly in recent years, and it is one of the key technologies in information retrieval, machine learning, and natural language processing. In recent years, machine learning has been applied to automatic text categorization. A text classification system comprises text representation, feature dimension reduction, a classification method, and performance evaluation. In this paper, Latent Dirichlet Allocation (LDA) is used to build a generative probabilistic model of the text corpus. This avoids the loss of classification performance caused by feature extraction methods and, at the same time, overcomes the loss of semantic links between words caused by feature filtering methods. The main contributions are as follows:

1. When a text corpus is high-dimensional and large-scale, traditional dimension reduction algorithms show their limitations. This paper presents a text categorization algorithm based on LDA. Within the discriminative framework of the support vector machine (SVM), LDA is used to build a generative probabilistic model of the text corpus. Parameters are estimated with Gibbs sampling, a Markov chain Monte Carlo (MCMC) method, and word probabilities are represented under the model. While modeling the corpus, a latent topic-document matrix associated with the corpus is constructed and used to train the SVM. Classification experiments on Chinese and English corpora verify the effectiveness and superiority of the LDA-based text categorization method.

2. The learning of model parameters is very sensitive to the number of topics and to the initial values of the topic distribution. To address this problem, the paper presents an algorithm based on DBSCAN for selecting the optimal number of topics. The paper borrows the idea of computing sample density from DBSCAN (density-based spatial clustering of applications with noise) to measure the correlation between topics. In the automatic text classification system, this selection algorithm is used to find the optimal number of topics of the LDA model for a text corpus. Compared with two other methods, experimental results show that the proposed method can automatically find the optimal topic structure with relatively few iterations, without manual tuning of the number of topics.
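The estimation step in the first contribution can be illustrated with a minimal sketch of collapsed Gibbs sampling for LDA. This is not the thesis's implementation: the function name `lda_gibbs`, the symmetric priors `alpha` and `beta`, their default values, and the toy corpus are all assumptions made for illustration. The returned matrix has one row of topic proportions per document, which is the kind of latent topic-document representation that could then be passed to an SVM as features.

```python
import random


def lda_gibbs(docs, vocab_size, n_topics, alpha=0.1, beta=0.01,
              n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: list of documents, each a list of integer word ids.
    Returns theta, a list with one row of topic proportions per document.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                 # topic counts per document
    topic_word = [[0] * vocab_size for _ in range(n_topics)]   # word counts per topic
    topic_total = [0] * n_topics                               # total tokens per topic

    # Assign every token to a random topic to start the chain.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]

                # Sample a new topic and restore the counts.
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed topic proportions for each document.
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    return theta


if __name__ == "__main__":
    # Hypothetical toy corpus: word ids 0-1 and 2-3 tend to co-occur.
    toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 0, 1], [3, 2, 2]]
    for row in lda_gibbs(toy_docs, vocab_size=4, n_topics=2):
        print([round(p, 3) for p in row])
```

Each row of the result sums to 1, so every document is represented by `n_topics` features rather than by one feature per vocabulary word. The sketch fixes the number of topics in advance; it does not attempt the DBSCAN-based selection described in the second contribution.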

Keywords/Search Tags:

Text categorization, Latent Dirichlet allocation (LDA), Gibbs sampling, optimal number of topics selection

Related items

1	The Research And Implementation About Parallel Latent Dirichlet Allocation
2	Research On Fast Gibbs Sampling Topic Inference Algorithms For Topic Models
3	Design And Implementation Of A Text Recommender System Of Social Network Based On Latent Dirichlet Allocation
4	Research Of Semantic Community Detection Based On Quantizing Progress
5	Theme Of Model-based Expert Retrieval And Mining
6	Research On Text Retrieval Based On Topic Analysis
7	Classification Algorithm For Social Text Stream
8	Research On Classification Algorithm Of Scientific Papers Based On Topic Model
9	Latent Dirichlet Allocation: Hyperparameter selection and applications to electronic discovery
10	Research On Semi-supervised Topic Model For Text Classification