
Latent Dirichlet Allocation: Hyperparameter selection and applications to electronic discovery

Posted on: 2016-10-06
Degree: Ph.D.
Type: Thesis
University: University of Florida
Candidate: Pazhayidam George, Clint
GTID: 2478390017979094
Subject: Computer Engineering
Abstract/Summary:
Keyword-based search is a popular information retrieval scheme for discovering relevant documents in a document collection, but it has many shortcomings. Concept or topic search is an alternative to keyword-based search that can address some of these deficiencies and better categorize documents based on their underlying topics. Latent Dirichlet Allocation (LDA) is a popular topic model that is often used to make inferences about the properties of a corpus. LDA is a hierarchical Bayesian model that involves a prior distribution on a set of latent topic variables. The prior is indexed by certain hyperparameters, which have a considerable impact on inference but are usually chosen either in an ad hoc manner or by applying an algorithm whose theoretical basis has not been firmly established. We present a method, based on a combination of Markov chain Monte Carlo and importance sampling, for obtaining the maximum likelihood estimate (MLE) of the hyperparameters. We report the results of experiments on both synthetic and real data. These show that, when making inferences about the topics of the documents in a corpus, the LDA model indexed by the MLE of the hyperparameters performs considerably better than LDA models indexed by default choices of the hyperparameters. Topic models such as LDA have many real-world applications, including document clustering, classification, ranking, and corpus summarization. In this thesis, we apply various topic models to the electronic discovery (e-discovery) problem, which refers to the process of identifying, collecting, discovering, and managing electronically stored information (ESI) for a lawsuit. We perform an empirical study comparing the performance of LDA with that of other topic models in representing ESI and in building binary classification models to solve the document discovery problem of e-discovery. We report the results of this study on several real datasets.
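To illustrate what it means for the prior to be "indexed by hyperparameters", the following is a minimal sketch of the standard LDA generative process with symmetric Dirichlet priors. It is not the estimation method of the thesis. The names `alpha` (document-topic concentration) and `eta` (topic-word concentration), and the functions `sample_dirichlet` and `generate_corpus`, are assumptions chosen for illustration and do not come from the abstract.

```python
import random


def sample_dirichlet(concentration, dim, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma(concentration, 1) draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    if total == 0.0:
        # Guard against numerical underflow for very small concentrations:
        # fall back to a point mass on one randomly chosen coordinate.
        draws = [0.0] * dim
        draws[rng.randrange(dim)] = 1.0
        return draws
    return [d / total for d in draws]


def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, eta, seed=0):
    """Simulate documents from LDA with hyperparameters (alpha, eta) -- assumed names."""
    rng = random.Random(seed)
    # One word distribution per topic, each drawn from Dirichlet(eta).
    topics = [sample_dirichlet(eta, vocab_size, rng) for _ in range(n_topics)]
    corpus, doc_topic = [], []
    for _ in range(n_docs):
        # Per-document topic proportions, drawn from Dirichlet(alpha).
        theta = sample_dirichlet(alpha, n_topics, rng)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]        # latent topic
            w = rng.choices(range(vocab_size), weights=topics[z])[0]  # observed word id
            words.append(w)
        corpus.append(words)
        doc_topic.append(theta)
    return corpus, doc_topic, topics


if __name__ == "__main__":
    for alpha in (0.1, 10.0):
        _, doc_topic, _ = generate_corpus(50, 20, 5, 30, alpha=alpha, eta=0.5, seed=1)
        mean_max = sum(max(t) for t in doc_topic) / len(doc_topic)
        print(f"alpha={alpha}: mean largest topic proportion = {mean_max:.2f}")
```

Under these assumptions, a small `alpha` tends to concentrate each document on one or two topics, while a large `alpha` spreads documents across all topics. This is one way to see why the choice of hyperparameters can have a considerable impact on inference.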
Keywords/Search Tags: LDA, Document, Models, Latent