The Research Of Probabilistic Topic Models And Their Application In Relational Text Classification

Posted on: 2012-06-16
Degree: Master
Type: Thesis
Country: China
Candidate: P P Liang
GTID: 2218330338457016
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of web technology, the volume of online information in text form has grown explosively. How to organize and manage this text information efficiently, and how to help users find the information they need accurately and quickly, have become challenging problems in the field of information science. Text classification, which aims to categorize large amounts of text, is an effective approach to organizing and managing this information. At the same time, since topic models can uncover the underlying semantic structure of document collections, applying topic models to text classification is an effective way to improve the performance of text classification methods.
Currently, the supervised topic model sLDA, which is based on LDA, and traditional text classification algorithms assume that each document is independent of the others. In practice, links may exist among documents. For example, research papers extracted from online databases such as DBLP and C-DBLP can be connected through their citations to form a document network, and web pages can be associated through hyperlinks. When links among documents play an important role in determining document attributes, and the documents themselves do not contain enough text for prediction or classification, the performance of existing supervised topic models and text classification methods (e.g., SVM, Naive Bayes) declines.
iTopicModel uses multivariate Markov Random Fields to construct document networks and models the links among documents and their text information in a unified way. iTopicModel can handle both directed and weighted text information networks. In this thesis, we propose a novel probabilistic topic model called SRTM, which jointly models the links among documents, the text of the documents, and the labels associated with the documents.
We first use a classical linear regression model for prediction and give the joint distribution of SRTM. We then give a method for estimating the parameters of SRTM by maximizing the log-likelihood of the joint distribution, as well as a method for predicting the labels of unlabelled documents that are not contained in the original training dataset. Finally, we extend SRTM by using a generalized linear model to draw the documents' labels, so that the model can accommodate a variety of label types. Experiments on the Cora research paper dataset and a movie review dataset show that SRTM outperforms existing supervised topic models on document networks.
We also apply iTopicModel to the text classification task and propose a text classification algorithm based on it, called TC-iTM. TC-iTM uses the probabilities with which the labelled documents are assigned to each topic to determine the category that each topic represents. It then classifies unlabelled documents using the probabilities with which those documents are assigned to each topic, together with their text information. Experimental results on the Cora research paper dataset and the DBLP dataset show that TC-iTM outperforms state-of-the-art text classification approaches when links among documents are critical to the categories of the documents.
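The topic-to-category idea behind TC-iTM can be illustrated with a small sketch. This is not the thesis's algorithm: it assumes the per-document topic probabilities have already been inferred by some topic model, it ignores the text-information term that TC-iTM also uses, and the function names `map_topics_to_categories` and `classify`, as well as the toy data, are hypothetical.

```python
from collections import defaultdict


def map_topics_to_categories(theta_labeled, labels, n_topics):
    """Assign each topic the category whose labelled documents
    place the most total probability mass on that topic.

    Assumption: theta_labeled[i][k] is the probability that labelled
    document i is assigned to topic k (inferred elsewhere).
    """
    mass = [defaultdict(float) for _ in range(n_topics)]
    for theta, label in zip(theta_labeled, labels):
        for k in range(n_topics):
            mass[k][label] += theta[k]
    return [max(m, key=m.get) for m in mass]


def classify(theta_doc, topic_category):
    """Score each category by summing the document's probabilities
    over the topics mapped to it, then return the best category.

    Simplification: only topic probabilities are used here; the
    document's own text information is not taken into account.
    """
    scores = defaultdict(float)
    for k, p in enumerate(theta_doc):
        scores[topic_category[k]] += p
    return max(scores, key=scores.get)


# Toy data (hypothetical): three labelled documents, two topics.
theta_labeled = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]]
labels = ["ML", "ML", "DB"]

topic_category = map_topics_to_categories(theta_labeled, labels, n_topics=2)
prediction = classify([0.3, 0.7], topic_category)
print(topic_category, prediction)
```

In this toy setup the first topic is dominated by documents labelled "ML" and the second by the document labelled "DB", so an unlabelled document that leans toward the second topic is assigned "DB".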
Keywords/Search Tags: topic model, document network, prediction, linear regression, text classification