
Supervised Latent Dirichlet Allocation Combined With Feature Selection Of Sparsity

Posted on: 2015-06-17
Degree: Master
Type: Thesis
Country: China
Candidate: C H Liang
Full Text: PDF
GTID: 2298330422977173
Subject: Software engineering

Abstract/Summary:
Supervised Latent Dirichlet Allocation (SLDA), like other topic models, can model text corpora and extract latent topic features from the words of a document. In addition, based on the extracted topic features, SLDA can predict the label (response) variables of documents, so it has been used for regression analysis and classification in text mining. However, when constructing its topic-extraction model, SLDA does not consider whether every topic extracted from the documents actually contributes to predicting the document response. It is therefore reasonable to expect that many noisy or redundant topics exist among those extracted by SLDA, which likely reduces its predictive precision.

To remedy this lack of topic selection in SLDA, we introduce LASSO-based regularization into SLDA and propose a new extension, which we call SLDA-FS. SLDA-FS imposes the ℓ1-regularization constraint of LASSO when optimizing the weights of the topic features. Through the variable-selection and parameter-shrinkage properties of LASSO, SLDA-FS shrinks the topic-feature weights while maximizing the posterior expectation, driving the weights toward their sparsest expression, so that some elements of the weight vector become exactly zero. In this way, SLDA-FS performs feature selection and removes irrelevant topics during training and prediction.

In essence, the proposed SLDA-FS is a feature-selection framework based on the SLDA model. To evaluate the performance of SLDA-FS, we present two instantiations of it, one for textual regression problems and one for classification problems in text mining, and infer their parameters with variational inference.
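The sparsity mechanism described above rests on the ℓ1 (LASSO) penalty, whose proximal operator is soft-thresholding: any weight whose magnitude falls below the threshold is set exactly to zero. A minimal sketch (the weight values and threshold here are hypothetical illustrations, not taken from the thesis):

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of the L1 norm: shrinks each weight toward zero
    by lam, and sets weights with |w_i| <= lam exactly to zero.
    This is how LASSO-style shrinkage yields exact sparsity."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

# Hypothetical topic weights: two informative, two near-zero "noise" topics.
weights = np.array([1.8, -0.05, 0.6, 0.02, -1.1])
sparse = soft_threshold(weights, lam=0.1)
# The near-zero weights become exactly 0, removing those topics.
```

The key point is that the zeros are exact, not merely small, which is what allows SLDA-FS to discard irrelevant topics outright rather than just down-weighting them.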
Regarding parameter inference, the introduction of ℓ1 regularization makes the objective function for the topic-feature weights non-differentiable, which complicates the optimization. We therefore apply the Alternating Direction Method of Multipliers (ADMM) to split the original objective into two sub-problems: maximizing a differentiable loss function and solving a LASSO problem. Alternately optimizing these two sub-problems yields an approximate solution of the original objective. We test the two instantiations of SLDA-FS on several real-world datasets and compare their predictive precision with the original SLDA-based models and other relevant models. The experimental results show that SLDA-FS achieves higher precision owing to its feature selection. Finally, we examine the words allocated to the topics selected by SLDA-FS and find that their meanings are indeed relevant to the prediction task.
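The ADMM splitting described above can be illustrated on a plain least-squares LASSO problem, a deliberate simplification of the SLDA-FS objective (the data, penalty, and step size below are illustrative assumptions, not values from the thesis):

```python
import numpy as np

def admm_lasso(X, y, lam=0.5, rho=1.0, n_iter=200):
    """ADMM for: min_w 0.5*||Xw - y||^2 + lam*||z||_1  s.t.  w = z.
    Alternates a differentiable least-squares step (w-update) with a
    LASSO soft-thresholding step (z-update), mirroring the two
    sub-problems the splitting produces."""
    n, d = X.shape
    z = np.zeros(d)
    u = np.zeros(d)                       # scaled dual variable
    A = X.T @ X + rho * np.eye(d)         # cached factor for the smooth step
    Xty = X.T @ y
    for _ in range(n_iter):
        w = np.linalg.solve(A, Xty + rho * (z - u))          # smooth sub-problem
        v = w + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0)  # LASSO sub-problem
        u = u + w - z                                        # dual update
    return z

# Synthetic data: only features 0 and 2 carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
y = X @ np.array([2.0, 0.0, -3.0, 0.0])
z = admm_lasso(X, y)
```

The irrelevant coefficients are driven to (essentially) zero while the informative ones are recovered, which is the behavior SLDA-FS relies on for topic selection.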
Keywords/Search Tags: Topic model, SLDA, Feature Selection, Regularization, Sparsity