The Research Of Probabilistic Topic Models And Their Application In Relational Text Classification

Posted on:2012-06-16

Degree:Master

Type:Thesis

Country:China

Candidate:P P Liang

Full Text:PDF

GTID:2218330338457016

Subject:Computer system architecture

Abstract/Summary:

With the rapid development of web technology, the amount of online information in text form is growing explosively. How to organize and manage this text efficiently, and how to help users find the information they need quickly and accurately, have become challenging problems in the field of information science. Text classification, which aims to categorize large volumes of text, is an effective approach to organizing and managing this information. At the same time, since topic models can uncover the underlying semantic structure of document collections, applying topic models to text classification is an effective way to improve the performance of text classification methods.
Currently, the supervised topic model sLDA, which is based on LDA, and traditional text classification algorithms assume that each document is independent of the others. In practice, links may exist among documents. For example, research papers extracted from online databases such as DBLP and C-DBLP can be connected through citations to form a document network, and web pages can be associated through hyperlinks. When links among documents play an important role in determining the attributes of documents, and the documents themselves do not contain enough text for prediction or classification, the performance of existing supervised topic models and text classification methods (e.g., SVM and Naive Bayes) declines.
iTopicModel uses multivariate Markov Random Fields to construct document networks and models the links among documents and their text information in a unified way. iTopicModel can deal with both directed and weighted text information networks. In this thesis, we propose a novel probabilistic topic model called SRTM, which models the links among documents, the text information of documents, and the labels associated with documents in a unified way.
We first use a classical linear regression model for prediction and give the joint distribution of SRTM. We then give a method for estimating the parameters of SRTM by maximizing the log-likelihood of the joint distribution, together with a method for predicting the labels of unlabelled documents that are not contained in the original training dataset. Finally, we extend SRTM by using a generalized linear model to draw documents' labels, so that the model can accommodate a variety of label types. Experiments on the Cora research paper dataset and a movie review dataset show that SRTM outperforms existing supervised topic models on document networks.
We also apply iTopicModel to the text classification task and propose a text classification algorithm called TC-iTM based on iTopicModel. TC-iTM uses the probabilities with which labeled documents are assigned to each topic to determine the category that each topic represents. It then classifies unlabelled documents using the probabilities with which those documents are assigned to each topic, together with the text information of those documents. Experiments on the Cora research paper dataset and the DBLP dataset show that TC-iTM outperforms state-of-the-art text classification approaches when links among documents are critical to the categories of documents.
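
The topic-to-category idea behind TC-iTM can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes that the per-document topic probabilities have already been estimated (for example by iTopicModel), it leaves out the text information that TC-iTM also uses, and the function names map_topics_to_categories and classify are hypothetical.

```python
from typing import List, Sequence


def map_topics_to_categories(
    theta_labeled: Sequence[Sequence[float]],
    labels: Sequence[int],
    n_categories: int,
) -> List[int]:
    """Assign each topic to the category whose labeled documents put the
    most total probability on that topic (an assumed, simplified rule)."""
    n_topics = len(theta_labeled[0])
    # mass[c][k] = total probability that documents of category c give topic k
    mass = [[0.0] * n_topics for _ in range(n_categories)]
    for theta, label in zip(theta_labeled, labels):
        for k, p in enumerate(theta):
            mass[label][k] += p
    return [
        max(range(n_categories), key=lambda c: mass[c][k])
        for k in range(n_topics)
    ]


def classify(
    theta: Sequence[float],
    topic_category: Sequence[int],
    n_categories: int,
) -> int:
    """Score each category by summing the document's topic probabilities
    over the topics mapped to that category, then pick the best score."""
    scores = [0.0] * n_categories
    for k, p in enumerate(theta):
        scores[topic_category[k]] += p
    return max(range(n_categories), key=lambda c: scores[c])


# Toy data (made up for illustration): 4 labeled documents, 3 topics, 2 categories.
theta_labeled = [
    [0.9, 0.1, 0.0],
    [0.8, 0.1, 0.1],
    [0.1, 0.2, 0.7],
    [0.0, 0.3, 0.7],
]
labels = [0, 0, 1, 1]

topic_category = map_topics_to_categories(theta_labeled, labels, n_categories=2)
predictions = [
    classify(theta, topic_category, n_categories=2)
    for theta in [[0.7, 0.2, 0.1], [0.2, 0.3, 0.5]]
]
```

In this toy example the first topic is mapped to category 0 and the other two topics to category 1, so a document dominated by the first topic is assigned to category 0. Because a link-aware model lets linked documents influence each other's topic probabilities, a rule of this kind can make use of link structure even when a document has little text of its own.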

Keywords/Search Tags:

topic model, document network, prediction, linear regression, text classification

Related items

1	Research Of Hot-Topic-Oriented Subjective And Objected Classification Method For Microblog Text
2	The Research On Topic-oriented Multi-document Summarization
3	Research On Food Complaint Document Classification Based-on Topic
4	Research And Application Of Topic Oriented Text Mining
5	The Text Categorization And Structure Of Theme Words Network Based On Topic Models
6	The Application Of Prediction Model Based On Linear Regression Method And Neural Network In National Economy Data
7	Text Classification Algorithm Based On Chinese And English Topic Space
8	A Biterm Pseudo Document Topic Model For Short Text
9	Research And Application Of Text Classification Model Combining Character Features And Topic Features
10	Research On Short Text Classification Algorithms Based On Topic Model And Convolutional Neural Network