
Improving the Usability of Topic Models

Posted on: 2016-08-22  Degree: Ph.D  Type: Thesis
University: Northwestern University  Candidate: Yang, Yi  Full Text: PDF
GTID: 2478390017985871  Subject: Computer Science
Abstract/Summary:
In an age of information abundance, understanding large collections of unstructured textual documents benefits many applications, such as recommendation systems and search engines. Much attention has been given to generative probabilistic topic models of textual collections, which identify topical representations of documents that reduce dimensionality and reveal the documents' statistical structure. Latent Dirichlet Allocation (LDA) is one of the most commonly used topic modeling approaches because it can uncover hidden thematic patterns in textual documents with little supervision.

However, LDA has several limitations that make it difficult for data analysis practitioners to use in practice. First, Gibbs sampling inference for LDA runs too slowly on large datasets with many topics. Second, the topics learned by LDA are sometimes difficult for end users to interpret. Third, LDA suffers from instability, which occurs not only when new data arrives and the model must be updated, but also when the same Gibbs sampling procedure is run multiple times on the same data. All of these limitations undermine the usability of LDA in practice.

This thesis focuses on improving the usability of topic models. We propose a general framework, SC-LDA, for efficiently incorporating different kinds of knowledge into topic models. The knowledge is represented as a set of constraints that shape the topics learned by the LDA topic model. By incorporating knowledge into topic models, users can guide the training process so that the learned topics become more interpretable. The framework also exploits topic models' sparsity to significantly reduce the computational cost of training.
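The kind of topic modeling described above can be sketched with scikit-learn's LDA implementation. The corpus below is a hypothetical toy example, and scikit-learn uses online variational inference rather than Gibbs sampling; SC-LDA itself is not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus; real collections would be far larger.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets rose sharply today",
    "investors traded shares on the market",
]

# Bag-of-words representation.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit a 2-topic LDA model (variational inference, not Gibbs sampling).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # per-document topic proportions

# Each row sums to 1: a low-dimensional topical representation of a document.
print(doc_topics.shape)  # (4, 2)
```

The per-document topic proportions are the dimension-reduced representation the abstract refers to: each document is summarized by its mixture over topics instead of its raw word counts.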
Experiments show that SC-LDA converges much more rapidly than existing baseline methods while maintaining comparable model performance. SC-LDA thus alleviates the first and second limitations of topic models through efficient knowledge integration. We also build a topic model update system, non-disruptive Topic Model Update (nTMU), that employs SC-LDA to address the stability issue of topic models. Evaluation results from both simulation experiments and user studies indicate that our approach significantly outperforms baseline systems in achieving high topic model stability while maintaining high topic model quality.

Overall, this thesis presents user-centric approaches to addressing the usability problems of topic models. We hope this work will help topic modeling practitioners who have encountered these usability problems in practice. Moreover, we hope to interest and inspire machine learning and data mining researchers to pay more attention to developing user-centric data analytics algorithms with improved usability.
Keywords/Search Tags:Topic, Usability, LDA, Data, Documents