
Improving the Usability of Topic Models

Posted on: 2016-08-22  Degree: Ph.D  Type: Thesis
University: Northwestern University  Candidate: Yang, Yi  Full Text: PDF
GTID: 2478390017985871  Subject: Computer Science
Abstract/Summary:
In an age of information abundance, understanding large collections of unstructured textual documents benefits many applications, such as recommendation systems and search engines. Much attention has been given to generative probabilistic topic models of textual collections, which identify topical representations of documents that reduce dimensionality and reveal the documents' statistical structure. Latent Dirichlet Allocation (LDA) is one of the most commonly used topic modeling approaches because it can uncover hidden thematic patterns in textual documents with little supervision.

However, LDA has several limitations that make it difficult for data analysis practitioners to use in practice. First, Gibbs sampling inference for LDA runs too slowly on large datasets with many topics. Second, the topics learned by LDA are sometimes difficult for end users to interpret. Third, LDA suffers from instability, which occurs not only when new data arrives and the model must be updated, but also when the same Gibbs sampling procedure is run multiple times on the same data. All of these limitations undermine the usability of LDA in practice.

This thesis focuses on improving the usability of topic models. We propose a general framework, SC-LDA, for efficiently incorporating different kinds of knowledge into topic models. The knowledge is represented as a set of constraints that shape the topics learned by the LDA topic model. By incorporating knowledge into topic models, users can guide the training process so that the learned topics become more interpretable. The framework also exploits topic models' sparsity to significantly reduce the computational cost of training.
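The kind of topic modeling described above can be sketched with scikit-learn's LDA implementation. The corpus below is a hypothetical toy example, and scikit-learn uses online variational inference rather than Gibbs sampling; SC-LDA itself is not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus; real collections would be far larger.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets rose sharply today",
    "investors traded shares on the market",
]

# Bag-of-words representation.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit a 2-topic LDA model (variational inference, not Gibbs sampling).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # per-document topic proportions

# Each row sums to 1: a low-dimensional topical representation of a document.
print(doc_topics.shape)  # (4, 2)
```

The per-document topic proportions are the dimension-reduced representation the abstract refers to: each document is summarized by its mixture over topics instead of its raw word counts.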
Experiments show that SC-LDA converges much more rapidly than existing baseline methods while maintaining comparable model performance. SC-LDA thus alleviates the first and second limitations of topic models through efficient knowledge integration. We also build a topic model update system, non-disruptive Topic Model Update (nTMU), that employs SC-LDA to address the stability issue of topic models. Evaluation results from both simulation experiments and user studies indicate that our approach significantly outperforms baseline systems in achieving high topic model stability while maintaining high topic model quality.

Overall, this thesis presents user-centric approaches to addressing the usability problems of topic models. We hope this work will help topic modeling practitioners who have encountered these usability problems in practice. Moreover, we hope to interest and inspire machine learning and data mining researchers to pay more attention to developing user-centric data analytics algorithms with improved usability.
Keywords/Search Tags:Topic, Usability, LDA, Data, Documents