Document Clustering Method Based On LDA Topic Model

Posted on: 2013-11-11
Degree: Master
Type: Thesis
Country: China
Candidate: J L Dong
Full Text: PDF
GTID: 2248330371992594
Subject: Computer application technology
Abstract/Summary:
With the development of information technology and the Internet, the volume of online information is expanding rapidly. How to quickly select target information from vast amounts of unorganized text has therefore become an active research topic in natural language processing. Text clustering is one of the key technologies in natural language processing, and its difficulty lies mainly in two problems: first, how to improve clustering quality; second, how to describe the clustering result. This thesis studies both problems and proposes a document clustering method based on the LDA topic model.

The main work of this thesis covers the following three aspects:

Firstly, the thesis analyses the key technologies of text clustering in domestic and international research, such as text modeling, feature extraction, text clustering, statistical topic models and identification methods, and summarizes the advantages, disadvantages and research progress of these technologies.

Secondly, the thesis introduces the LDA topic model into text clustering. The topic model generates a document-latent topic model from a statistical perspective. We combine the generated model with the traditional TF-IDF word model, adding latent topic knowledge to the word model in order to mine the internal semantic knowledge of texts more deeply and improve clustering quality.

Lastly, using the generated latent topic-word model and the feature word collection, together with their probability distributions, the thesis proposes a topic identification method for clustering results based on the LDA topic model, which improves the visualization and comprehensibility of the results.

Experimental results on Chinese and English corpora show that our method outperforms the traditional clustering method based on the word model: clustering quality increases by 4% to 10%, and the identification of clustering results is more accurate. The text clustering method based on the LDA topic model is therefore reasonable and effective.
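
The abstract does not state how the topic model and the TF-IDF word model are combined, nor how cluster topics are chosen. The following Python sketch is therefore only one plausible reading, not the thesis's actual method. It assumes that the per-document topic distributions (`thetas`) and the per-topic word distributions (`phi`) have already been estimated by some LDA implementation. The linear blend weight `lam`, the function names and the toy data are all hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per tokenized document."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append({w: (c / len(d)) * math.log(n / df[w])
                        for w, c in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def combined_similarity(tfidf_a, tfidf_b, theta_a, theta_b, lam=0.5):
    """Assumed blend: lam * word-level cosine + (1 - lam) * topic-level cosine."""
    topics_a = dict(enumerate(theta_a))
    topics_b = dict(enumerate(theta_b))
    return (lam * cosine(tfidf_a, tfidf_b)
            + (1 - lam) * cosine(topics_a, topics_b))

def label_cluster(member_thetas, phi, top_n=3):
    """Assumed labelling rule: take the topic with the highest mean weight
    over the cluster's documents and return its most probable words."""
    k = len(phi)
    mean_theta = [sum(t[i] for t in member_thetas) / len(member_thetas)
                  for i in range(k)]
    best = max(range(k), key=lambda i: mean_theta[i])
    words = sorted(phi[best], key=phi[best].get, reverse=True)[:top_n]
    return best, words

# Toy data (hypothetical): three tokenized documents, two latent topics.
docs = [["stock", "market", "price"],
        ["market", "trade", "price"],
        ["football", "match", "goal"]]
thetas = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]           # document-topic
phi = [{"market": 0.4, "price": 0.3, "stock": 0.2, "trade": 0.1},
       {"football": 0.4, "goal": 0.35, "match": 0.25}]  # topic-word

tfidf = tfidf_vectors(docs)
sim_related = combined_similarity(tfidf[0], tfidf[1], thetas[0], thetas[1])
sim_unrelated = combined_similarity(tfidf[0], tfidf[2], thetas[0], thetas[2])
finance_label = label_cluster([thetas[0], thetas[1]], phi, top_n=2)
sports_label = label_cluster([thetas[2]], phi, top_n=2)
```

A similarity of this kind could be passed to any standard clustering algorithm. Setting `lam=1.0` reduces it to the plain word-model baseline that the abstract compares against, so the topic contribution can be switched off for comparison.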
Keywords/Search Tags: Document clustering, Latent Dirichlet Allocation model, Topic identification