
Supervised Latent Dirichlet Allocation Combined With Feature Selection Of Sparsity

Posted on: 2015-06-17
Degree: Master
Type: Thesis
Country: China
Candidate: C H Liang
Full Text: PDF
GTID: 2298330422977173
Subject: Software engineering

Abstract/Summary:
Supervised Latent Dirichlet Allocation (SLDA), like other topic models, can model text corpora and extract latent topic features from the words of a document. In addition, based on the extracted topic features, SLDA can predict the label (response) variables of documents, so it has been used for regression analysis and classification in text mining. However, when constructing its topic-extraction model, SLDA does not consider whether every topic extracted from the documents actually contributes to predicting the document response. It is therefore reasonable to expect that many noisy or redundant topics exist among those extracted by SLDA, which likely reduces its predictive precision.

To remedy this lack of topic selection in SLDA, we introduce LASSO-based regularization into SLDA and propose a new extension, which we call SLDA-FS. SLDA-FS imposes the ℓ1-regularization constraint of LASSO when optimizing the weights of the topic features. Through the variable-selection and parameter-shrinkage properties of LASSO, SLDA-FS shrinks the topic-feature weights while maximizing the posterior expectation, driving the weights toward their sparsest expression, so that some elements of the weight vector become exactly zero. In this way, SLDA-FS performs feature selection and removes irrelevant topics during training and prediction.

In essence, the proposed SLDA-FS is a feature-selection framework based on the SLDA model. To evaluate the performance of SLDA-FS, we present two instantiations of it, one for textual regression problems and one for classification problems in text mining, and infer their parameters with variational inference.
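The sparsity mechanism described above rests on the ℓ1 (LASSO) penalty, whose proximal operator is soft-thresholding: any weight whose magnitude falls below the threshold is set exactly to zero. A minimal sketch (the weight values and threshold here are hypothetical illustrations, not taken from the thesis):

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of the L1 norm: shrinks each weight toward zero
    by lam, and sets weights with |w_i| <= lam exactly to zero.
    This is how LASSO-style shrinkage yields exact sparsity."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

# Hypothetical topic weights: two informative, two near-zero "noise" topics.
weights = np.array([1.8, -0.05, 0.6, 0.02, -1.1])
sparse = soft_threshold(weights, lam=0.1)
# The near-zero weights become exactly 0, removing those topics.
```

The key point is that the zeros are exact, not merely small, which is what allows SLDA-FS to discard irrelevant topics outright rather than just down-weighting them.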
Regarding parameter inference, the introduction of ℓ1 regularization makes the objective function for the topic-feature weights non-differentiable, which complicates the optimization. We therefore apply the Alternating Direction Method of Multipliers (ADMM) to split the original objective into two sub-problems: maximizing a differentiable loss function and solving a LASSO problem. Alternately optimizing these two sub-problems yields an approximate solution of the original objective. We test the two instantiations of SLDA-FS on several real-world datasets and compare their predictive precision with the original SLDA-based models and other relevant models. The experimental results show that SLDA-FS achieves higher precision owing to its feature selection. Finally, we examine the words allocated to the topics selected by SLDA-FS and find that their meanings are indeed relevant to the prediction task.
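The ADMM splitting described above can be illustrated on a plain least-squares LASSO problem, a deliberate simplification of the SLDA-FS objective (the data, penalty, and step size below are illustrative assumptions, not values from the thesis):

```python
import numpy as np

def admm_lasso(X, y, lam=0.5, rho=1.0, n_iter=200):
    """ADMM for: min_w 0.5*||Xw - y||^2 + lam*||z||_1  s.t.  w = z.
    Alternates a differentiable least-squares step (w-update) with a
    LASSO soft-thresholding step (z-update), mirroring the two
    sub-problems the splitting produces."""
    n, d = X.shape
    z = np.zeros(d)
    u = np.zeros(d)                       # scaled dual variable
    A = X.T @ X + rho * np.eye(d)         # cached factor for the smooth step
    Xty = X.T @ y
    for _ in range(n_iter):
        w = np.linalg.solve(A, Xty + rho * (z - u))          # smooth sub-problem
        v = w + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0)  # LASSO sub-problem
        u = u + w - z                                        # dual update
    return z

# Synthetic data: only features 0 and 2 carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
y = X @ np.array([2.0, 0.0, -3.0, 0.0])
z = admm_lasso(X, y)
```

The irrelevant coefficients are driven to (essentially) zero while the informative ones are recovered, which is the behavior SLDA-FS relies on for topic selection.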
Keywords/Search Tags: Topic model, SLDA, Feature Selection, Regularization, Sparsity