Research On Text Categorization Based On LDA

Posted on: 2011-06-27
Degree: Master
Type: Thesis
Country: China
Candidate: Z L Song
Full Text: PDF
GTID: 2178360305470878
Subject: Computer application technology
Abstract/Summary:
Automatic text classification is a research focus and a core technology in information retrieval and data mining. It has received extensive attention and developed rapidly in recent years, and it is a key technology in information retrieval, machine learning and natural language processing. In recent years, researchers have begun to apply machine learning to automatic text categorization. A text classification system comprises text representation, feature dimension reduction, a classification method and performance evaluation. In this thesis, Latent Dirichlet Allocation (LDA) is used to build a generative probabilistic model of the text corpus. This avoids the loss of classification performance caused by feature extraction methods and, at the same time, overcomes the loss of semantic links between words caused by feature filtering methods. The main contributions are as follows:

1. When a text corpus is high-dimensional and large-scale, traditional dimension reduction algorithms show their limitations. This thesis presents a text categorization algorithm based on LDA. Within the discriminative framework of the support vector machine (SVM), LDA is used to build a generative probabilistic model of the text corpus. The model parameters are estimated with Gibbs sampling, a Markov chain Monte Carlo (MCMC) method, and the word probabilities are represented by the model. While modeling the corpus, a latent topic-document matrix associated with the corpus is constructed and used to train the SVM. Classification experiments on Chinese and English corpora are conducted to verify the effectiveness and superiority of the LDA-based text categorization method.

2. The learning of the model parameters is very sensitive to the number of topics and to the initial values of the topic distributions. To address this problem, the thesis presents an algorithm for selecting the optimal number of topics based on DBSCAN (density-based spatial clustering of applications with noise). The thesis borrows the idea of computing sample density from DBSCAN to measure the correlation between topics. In the automatic text classification system, this selection algorithm is used to find the optimal number of topics of the LDA model for a text corpus. Compared with two other methods, experimental results show that the proposed method automatically finds the optimal topic structure with relatively few iterations and without manual tuning of the number of topics.
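The first contribution (estimating LDA with Gibbs sampling and building a topic-document matrix for the classifier) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count and the toy corpus are all assumptions chosen for illustration, and the sampler is the standard collapsed Gibbs sampler for LDA. Each row of the returned matrix is a document's estimated topic mixture, which could serve as a low-dimensional feature vector for an SVM.

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns the document-topic matrix.

    alpha, beta, n_iters and seed are illustrative assumptions,
    not values taken from the thesis.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    # Count tables maintained by the collapsed sampler.
    doc_topic = [[0] * n_topics for _ in docs]        # tokens in doc d with topic k
    topic_word = [[0] * V for _ in range(n_topics)]   # times word v has topic k
    topic_total = [0] * n_topics                      # tokens assigned to topic k

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k_old = assignments[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][k_old] -= 1
                topic_word[k_old][v] -= 1
                topic_total[k_old] -= 1
                # p(z = k | all other assignments) is proportional to
                # (n_dk + alpha) * (n_kv + beta) / (n_k + V * beta).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + V * beta)
                    for k in range(n_topics)
                ]
                k_new = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k_new
                doc_topic[d][k_new] += 1
                topic_word[k_new][v] += 1
                topic_total[k_new] += 1

    # Smoothed estimate of each document's topic mixture.
    return [
        [(doc_topic[d][k] + alpha) / (len(doc) + n_topics * alpha)
         for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]


# Hypothetical toy corpus: two finance-like and two sports-like documents.
docs = [
    "stock market price trade stock bank".split(),
    "bank price market invest trade".split(),
    "team match goal player team coach".split(),
    "player coach goal match league".split(),
]
theta = lda_gibbs(docs, n_topics=2)
for row in theta:
    print([round(p, 3) for p in row])
```

Each row of `theta` has one entry per topic and sums to 1, so a corpus with a vocabulary of many thousands of words is reduced to `n_topics` features per document. The choice of `n_topics` is exactly the parameter that the second contribution aims to select automatically.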
Keywords/Search Tags: Text categorization, Latent Dirichlet allocation (LDA), Gibbs sampling, Selection of the optimal number of topics