
Study On Topic Model Based Multi-label Text Classification And Stream Text Data Modeling

Posted on: 2016-02-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X M Li
Full Text: PDF
GTID: 1228330467495430
Subject: Computer software and theory

Abstract/Summary:
With the development of Web 2.0, more and more text documents are available online, and analyzing and mining these documents is a significant challenge. Topic models such as latent Dirichlet allocation (LDA) are among the most effective algorithms for analyzing text documents. This dissertation investigates two important problems based on topic models: supervised topic models for multi-label document classification, and online inference algorithms for streaming documents. The main contributions are outlined as follows:

1. Labeled LDA (L-LDA) and Dependency-LDA are two representative supervised topic models for multi-label document classification. However, they suffer from several problems, including poor scalability and a lack of consideration of label dependency, label frequency, and the label frequency of words. To address these problems, we propose the following modifications:

(1) L-LDA constrains each document to sample topics only from its own labels (a minimal sketch of this constraint is given after this list). This degrades classification performance on datasets whose label cardinality (i.e., the average number of labels per document) is small. To this end, we propose a modification, supervised labeled LDA (SL-LDA), which defines a label threshold to describe the weight of labels. The proposed model is scalable and can be applied to both single-label and multi-label datasets. Experimental results show that SL-LDA performs better on both text modeling and classification.

(2) Label dependency knowledge is significant for classification, but L-LDA does not take it into account. To address this, we propose a novel algorithm, the labelset topic model (LSTM), which treats each labelset occurring in the training set as a new label. The labelsets are used to capture label dependency, and they are organized into groups to reduce the computational complexity. Experimental results show that LSTM outperforms traditional algorithms on multi-label classification.

(3) Label frequency knowledge is likewise significant for classification, but L-LDA ignores it. To address this, we develop Frequency-LDA (FLDA), which uses label frequency to shape the Dirichlet prior of the document-label distributions. We further extend FLDA to Dependency-Frequency-LDA (DFLDA) by also considering label dependency. DFLDA assumes a hidden topic level between the label level and the word level, and uses the co-occurrence of topics to describe label dependency. Experimental results show that the proposed models perform better than existing classification algorithms, especially on skewed datasets.

(4) The label frequency of words is usually fully considered in document processing, but L-LDA neglects this knowledge. To address this, we combine L-LDA with the class-feature-centroid (CFC) algorithm and propose a modification, the centroid prior topic model (CPTM). Since CFC considers the label frequency of words, we use normalized CFC vectors (termed the centroid prior) as the prior for the label-word distributions in L-LDA. Under variational Bayes, training CPTM amounts to maximizing an evidence lower bound while minimizing the KL divergence between the centroid prior and the label-word distributions. CPTM is simple and easy to implement. Experimental results show that CPTM performs better in most settings.
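To make the L-LDA constraint concrete, the following is a minimal sketch of one collapsed-Gibbs sweep in which each document resamples topics only from its own label set. The function name, count-array layout, and hyperparameter values (alpha, beta) are illustrative assumptions, not the dissertation's implementation.

```python
import numpy as np

def llda_gibbs_sweep(docs, doc_labels, z, ndk, nkw, nk,
                     alpha=0.1, beta=0.01, rng=np.random.default_rng()):
    """One collapsed-Gibbs sweep for Labeled LDA (illustrative sketch).

    docs[d]       : list of word ids in document d
    doc_labels[d] : list of label/topic ids document d is tagged with
    z[d]          : current topic assignment for each token of document d
    ndk, nkw, nk  : doc-topic, topic-word, and topic-total count arrays
    """
    V = nkw.shape[1]
    for d, words in enumerate(docs):
        allowed = np.asarray(doc_labels[d])    # the L-LDA restriction:
        for i, w in enumerate(words):          # topics come only from d's labels
            k = z[d][i]
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1    # remove current token
            p = (ndk[d, allowed] + alpha) * (nkw[allowed, w] + beta) \
                / (nk[allowed] + V * beta)
            p /= p.sum()
            k = allowed[rng.choice(len(allowed), p=p)]    # resample within labels
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1    # add token back
```

When a document carries few labels, `allowed` is a very short list, which is exactly why low-cardinality datasets give the model so little room to fit; SL-LDA's label threshold is proposed to soften this hard restriction.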
2. Stochastic variational inference (SVI) and hybrid variational-Gibbs (HVG) are two representative online inference algorithms for topic models. They optimize the global parameters of interest by stochastic optimization, but they suffer from problems such as noisy stochastic gradients, sensitivity to learning rates, and expensive computation. This dissertation attempts to solve these problems; furthermore, we investigate the effective expectation propagation (EP) algorithm and develop an online version of it.

(1) The noise of the stochastic gradients in SVI is commonly large, which degrades performance. To address this, we propose a modification, moving average SVI (MASVI). The proposed algorithm reuses the previous noisy terms to smooth the stochastic gradients (see the first sketch after this list). We apply MASVI to LDA, and the experimental results show that MASVI reduces the noise to some degree and performs better than SVI and other existing online algorithms.

(2) SVI is sensitive to its learning rates, which usually require hand-tuning. To solve this problem, we develop a novel algorithm that tunes the learning rate of each iteration adaptively. Our algorithm uses the KL divergence to measure the similarity between the variational distributions under the noisy update and the batch update, and then optimizes the learning rate by minimizing that KL divergence. Experimental results show that our algorithm outperforms the commonly used learning-rate schedules.

(3) The topic sampling cost of HVG scales linearly with the number of topics, so HVG is time-consuming when the number of topics is large. To address this, we propose sparse HVG (SHVG), which reduces the topic sampling cost by exploiting the sparsity of topic models. SHVG regroups the sampling equation into a sparse part and a dense part; for each word token it samples the topic in two stages and uses the alias method to process the dense part (see the second sketch after this list). Experimental results show that SHVG is about 5 to 8 times faster than HVG while also achieving better modeling performance.

(4) EP is an effective batch inference algorithm for topic models. We develop an online version of EP for streaming documents, namely online EP (OEP). OEP offers two options for updating the global parameters of interest: the first uses a stochastic natural-gradient update, and the second uses a window-based EP update. Experimental results show that OEP performs better than the state-of-the-art online algorithms and converges faster in most settings.
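As a first sketch, the moving-average idea behind MASVI can be illustrated by smoothing SVI's noisy per-minibatch estimate of the global variational parameter over a short window before the usual Robbins-Monro update. The class name, window size, and the (tau, kappa) schedule below are assumptions for illustration, not the dissertation's exact algorithm.

```python
from collections import deque
import numpy as np

class MovingAverageSVI:
    """Sketch: smooth SVI's noisy natural-gradient target with a moving average."""

    def __init__(self, lam_init, window=5, tau=1.0, kappa=0.7):
        self.lam = np.asarray(lam_init, dtype=float)  # global variational parameter
        self.recent = deque(maxlen=window)            # previous noisy terms, reused
        self.tau, self.kappa, self.t = tau, kappa, 0

    def step(self, lam_hat):
        """lam_hat: the noisy minibatch estimate plain SVI would use on its own."""
        self.t += 1
        rho = (self.t + self.tau) ** (-self.kappa)    # Robbins-Monro learning rate
        self.recent.append(np.asarray(lam_hat, dtype=float))
        target = np.mean(self.recent, axis=0)         # averaging reduces variance
        self.lam = (1.0 - rho) * self.lam + rho * target
        return self.lam
```

Plain SVI would set `target = lam_hat` directly; averaging the last few noisy terms trades a small amount of bias for a lower-variance update, which is the effect the MASVI contribution exploits.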
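As a second sketch, the dense part of SHVG's two-stage sampler relies on constant-time discrete sampling, for which the abstract names the alias method. Below is a standard sketch of Walker's alias method (O(n) table construction, O(1) per draw); the function names are illustrative, and this shows the general technique rather than SHVG's exact data structures.

```python
import random

def build_alias_table(probs):
    """Walker's alias method: build in O(n), sample in O(1).
    probs must be a normalized discrete distribution."""
    n = len(probs)
    scaled = [p * n for p in probs]
    prob, alias = [1.0] * n, list(range(n))
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l        # s keeps its mass, overflow goes to l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    return prob, alias

def alias_sample(prob, alias):
    """Draw one index: pick a bin uniformly, then take a biased coin flip."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]
```

Because the dense topic-word statistics change only slowly between tokens, an SHVG-style sampler can amortize the table-building cost over many draws, which is consistent with the speedup over HVG reported above.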
Keywords/Search Tags: Topic model, Multi-label, Text classification, Streaming documents, Online learning, Stochastic optimization