The Text Categorization And Structure Of Theme Words Network Based On Topic Models

Posted on: 2016-05-11
Degree: Master
Type: Thesis
Country: China
Candidate: C J Zhang
GTID: 2348330503988258
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, the volume of unstructured text is growing exponentially. How to quickly and accurately retrieve the information people want from such huge amounts of data has become an urgent problem. Text classification algorithms have been used for automatic text sorting, digital library services and the organization of retrieval results. However, traditional classification algorithms ignore the correlations between terms within a text and the semantics of words, so this thesis studies improvements to these traditional classification algorithms. In addition, inspired by the rapid development of complex network theory, we build a topic network after topic classification of the texts and mine the relationships between topics, so that people can access information far more efficiently.

This thesis proposes a new classification method for unstructured text based on the LDA topic model, focusing on the deficiencies and limitations of traditional text categorization algorithms when they face large-scale, high-dimensional text categorization. First, the LDA model is used to obtain the topic distribution hidden in each text and the distribution of words within each topic. Then the similarity of two texts is calculated separately in the "text-theme" feature space and in the "theme-words" feature space using the KNN classification algorithm, and the two similarities are combined by linear weighting to classify the text. Experiments show that the algorithm achieves better classification results on both Chinese and English corpora.

Next, using the concept of a "window", a new representation is derived for the multi-label texts in the corpus. By analyzing the new multi-label texts with the Labeled LDA topic model, we obtain a new "theme-words" distribution, which captures the joint probability of two words under a hidden topic and the semantic distance between the two words. From this "theme-words" distribution we build the text topic network, and analysis of this network gives a deeper understanding of the topics.
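The classification step described in the abstract (similarity in two feature spaces, linearly weighted, then a KNN vote) can be sketched roughly as follows. This is only an illustrative sketch, not the thesis implementation: the abstract does not say which similarity measure is used or how a text is represented in the "theme-words" space, so cosine similarity, the weight `lam`, the dictionary keys `"topic"` and `"word"`, and all function names here are assumptions. Each text is assumed to already carry two vectors produced by some LDA implementation.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def combined_similarity(doc_a, doc_b, lam=0.5):
    """Linearly weight the similarities from the two feature spaces.

    doc["topic"]: vector in the "text-theme" space (assumed: per-text topic proportions).
    doc["word"]:  vector in the "theme-words" space (assumed representation).
    lam:          hypothetical weight in [0, 1] given to the "text-theme" similarity.
    """
    topic_sim = cosine(doc_a["topic"], doc_b["topic"])
    word_sim = cosine(doc_a["word"], doc_b["word"])
    return lam * topic_sim + (1.0 - lam) * word_sim


def knn_classify(query, training, k=3, lam=0.5):
    """Return the majority label among the k training texts most similar to query.

    training is a list of (doc, label) pairs.
    """
    ranked = sorted(
        training,
        key=lambda item: combined_similarity(query, item[0], lam),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

In such a sketch the weight `lam` would normally be tuned on held-out data; setting it to 1.0 or 0.0 falls back to using only one of the two feature spaces.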
Keywords/Search Tags: Text Classification, LDA Topic Model, Topic Words, Labeled LDA Topic Model, Complex Networks