
Research On Supervised Probabilistic Topic Models

Posted on: 2015-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: P Li
Full Text: PDF
GTID: 2298330434952325
Subject: Management Science and Engineering
Abstract/Summary:
Topic extraction and latent semantic analysis have become some of the most important tasks in machine learning and data mining, and they are widely used in machine translation, text classification, and keyword extraction. However, text data is very high-dimensional and has a complex semantic structure, which makes analysis difficult. How to analyze text data and reduce its dimensionality has become one of the most important problems in machine learning.

In recent years, probabilistic topic models based on hierarchical Bayesian models, such as Latent Dirichlet Allocation (LDA), have played an important role in text analysis because they are effective at dimensionality reduction and topic extraction. Standard LDA is an unsupervised model; however, text data may contain a great deal of extra information, such as document labels and authors. This extra information complicates text analysis, but a model that makes full use of it may perform better. How to build supervised models based on label information is therefore a main topic of this thesis and a concern of many machine learning researchers.

Most recently proposed topic models are based on directed graphs, which makes inference problematic and prevents them from building distributed representations of documents. In recent years, models based on undirected graphs, such as the restricted Boltzmann machine (RBM), have been widely used for distributed feature learning on images and speech. One of the main topics of this thesis is building RBM models that use document labels to improve distributed feature extraction in text analysis.

The main contributions of this thesis are:

1. Through a study of multi-label documents and LDA models, this thesis proposes a new Labeled LDA model. In this model, each label has not only a set of local topics but also several background (global) topics. Experimental results show that it can reduce the effect of similarities and dependencies between different topics. Because the labels of a document are mapped to a combination of local topics and shared topics, the model achieves high accuracy when learning from multi-labeled documents. In addition, this model can be viewed as a semi-supervised clustering model that utilizes label information and outperforms other models.

2. Based on a study of LDA and its modifications, this thesis proposes a new LDA model, namely the author&references topic model (ART), which is a combination of DSTM and USTM. The ART model can analyze documents that carry author and reference information. The experimental results show that this model is not only effective at topic extraction and clustering for academic documents, but can also accurately predict the authors of a new document.

3. Based on a study of the RBM, this thesis proposes a new RBM model for distributed topic feature extraction, which learns better features than the standard LDA model. The features learned by the new model also give better performance on multi-label learning tasks.
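
The label-to-topic mapping described in contribution 1 can be illustrated with a small sketch. It is not the thesis's implementation: the topic numbering, the function names `build_topic_layout` and `allowed_topics`, the example labels, and the topic counts are all assumptions made for illustration. It only shows how a document's labels could select a combination of label-specific local topics and shared background topics.

```python
from typing import Dict, List, Sequence


def build_topic_layout(
    labels: Sequence[str], local_per_label: int, n_background: int
) -> Dict[str, List[int]]:
    """Give each label its own block of local topic ids.

    Assumed layout: ids 0..n_background-1 are background (global) topics
    shared by every document; local topics are numbered after them.
    """
    layout: Dict[str, List[int]] = {}
    next_id = n_background
    for label in labels:
        layout[label] = list(range(next_id, next_id + local_per_label))
        next_id += local_per_label
    return layout


def allowed_topics(
    doc_labels: Sequence[str], layout: Dict[str, List[int]], n_background: int
) -> List[int]:
    """Topics a document may use: all background topics plus the local
    topics of each label attached to the document."""
    topics = list(range(n_background))
    for label in doc_labels:
        topics.extend(layout[label])
    return topics


# Hypothetical corpus with three labels, two local topics per label,
# and two background topics.
layout = build_topic_layout(
    ["sports", "politics", "tech"], local_per_label=2, n_background=2
)
doc_topics = allowed_topics(["sports", "tech"], layout, n_background=2)
print(doc_topics)  # [0, 1, 2, 3, 6, 7]
```

In a sampler built on this idea, topic assignments for the words of a document would be restricted to `doc_topics`, so common vocabulary can be absorbed by the background topics while label-specific vocabulary stays in the local ones.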
Keywords/Search Tags: Latent Dirichlet Allocation, supervised learning, distributed features, Restricted Boltzmann Machine