
Analyzing Clustered Latent Dirichlet Allocation

Posted on: 2017-08-11    Degree: M.S    Type: Thesis
University: Clemson University    Candidate: Gropp, Christopher    Full Text: PDF
GTID: 2468390014975227    Subject: Computer Science
Abstract/Summary:
Dynamic Topic Models (DTM) are a way to extract time-variant information from a collection of documents. The only available implementation of DTM is slow, taking days to process a corpus of 533,588 documents. In order to see how topics - both their key words and their proportional size across all documents - change over time, we analyze Clustered Latent Dirichlet Allocation (CLDA) as an alternative to DTM. This algorithm is built from existing parallel components, using Latent Dirichlet Allocation (LDA) to extract topics within local time periods, and k-means clustering to combine topics from different time periods. This method is two orders of magnitude faster than DTM and allows greater freedom in experiment design. Results show that most topics generated by this algorithm are similar to those generated by DTM at both the local and global level, as measured by the Jaccard index and the Sørensen-Dice coefficient, and that this method's perplexity compares favorably to DTM's. We also explore tradeoffs in CLDA method parameters.
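The two similarity measures named above - the Jaccard index and the Sørensen-Dice coefficient - are both set-overlap metrics that can be applied to the top key words of two topics. The following is a minimal sketch of how such a comparison might look; the function names and the example word sets are illustrative assumptions, not code from the thesis.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index: |A ∩ B| / |A ∪ B|. Ranges from 0 (disjoint) to 1 (identical)."""
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

def dice(a: set, b: set) -> float:
    """Sørensen-Dice coefficient: 2|A ∩ B| / (|A| + |B|). Also in [0, 1]."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical top key words for one LDA topic and one DTM topic.
lda_topic = {"topic", "model", "word", "corpus"}
dtm_topic = {"topic", "model", "word", "document"}

j = jaccard(lda_topic, dtm_topic)  # 3 shared / 5 total = 0.6
d = dice(lda_topic, dtm_topic)     # 2*3 / (4+4) = 0.75
```

Note that Dice always scores at least as high as Jaccard for the same pair of sets, so thresholds for "similar topics" are not interchangeable between the two metrics.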
Keywords/Search Tags: DTM, Latent Dirichlet