Application of Latent Dirichlet Allocation in Online Content Generation
Posted on: 2017-09-09 | Degree: M.S. | Type: Thesis | University: University of California, Los Angeles | Candidate: Yang, Yajia | GTID: 2448390005469358 | Subject: Statistics

Abstract/Summary:
In this paper, I apply latent Dirichlet allocation (LDA) to cluster 100,000 health-related articles from the livestrong.com data set. I first review previous research in topic modeling and then describe how the LDA model is constructed. Instead of using simple word counts as model inputs, I apply part-of-speech (POS) tagging and a term frequency-inverse document frequency (tf-idf) transformation during data preprocessing in order to improve training efficiency and model interpretability. I further discuss the choice of model parameters, the evaluation of model performance, and the visualization of model outputs from a real-world point of view. Finally, I discuss two variations of conventional LDA: parallel LDA and online LDA. In addition to the traditional perplexity measure, I discuss how to use cosine similarity and symmetric Kullback-Leibler divergence to evaluate clustering performance. Three examples of using LDA outputs as building blocks for more complicated machine learning systems are also demonstrated: (1) cascaded LDA for taxonomy building, (2) in-cluster similarity computation, and (3) automatic categorization.

Keywords/Search Tags: LDA
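The abstract names cosine similarity and symmetric Kullback-Leibler divergence as measures for comparing documents within a cluster. The following is a minimal sketch of both measures, not the thesis's own code. It assumes each document is represented by its LDA topic distribution (a list of probabilities summing to 1), and it takes the symmetric divergence to be the sum KL(p||q) + KL(q||p); some authors use the average instead, and the abstract does not say which form the thesis uses. The function names, the `eps` smoothing constant, and the example vectors are all hypothetical.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two topic vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps is an assumed smoothing
    constant that avoids division by zero when q has zero entries."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL, taken here as KL(p || q) + KL(q || p) (an assumption)."""
    return kl_divergence(p, q, eps) + kl_divergence(q, p, eps)


# Hypothetical topic distributions over three topics for three documents.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.05, 0.15, 0.8]

# doc_a and doc_b share a dominant topic, so they should score as more
# similar (higher cosine, lower divergence) than doc_a and doc_c.
print(cosine_similarity(doc_a, doc_b), cosine_similarity(doc_a, doc_c))
print(symmetric_kl(doc_a, doc_b), symmetric_kl(doc_a, doc_c))
```

Under these assumptions, a tight cluster would show high average pairwise cosine similarity and low average pairwise symmetric KL divergence among its documents.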