Latent Dirichlet Allocation: Hyperparameter selection and applications to electronic discovery

Posted on:2016-10-06

Degree:Ph.D

Type:Thesis

University:University of Florida

Candidate:Pazhayidam George, Clint

Full Text:PDF

GTID:2478390017979094

Subject:Computer Engineering

Abstract/Summary:

Keyword-based search is a popular information retrieval scheme for discovering relevant documents in a document collection, but it has many shortcomings. Concept or topic search is an alternative that can address some of these deficiencies and better categorize documents by their underlying topics. Latent Dirichlet Allocation (LDA) is a popular topic model that is often used to make inferences about the properties of a corpus. LDA is a hierarchical Bayesian model that involves a prior distribution on a set of latent topic variables. The prior is indexed by certain hyperparameters, which have a considerable impact on inference but are usually chosen either in an ad hoc manner or by applying an algorithm whose theoretical basis has not been firmly established. We present a method, based on a combination of Markov chain Monte Carlo and importance sampling, for obtaining the maximum likelihood estimate (MLE) of the hyperparameters. We report the results of experiments on both synthetic and real data. These show that, when making inferences about the topics of the documents in a corpus, the LDA model indexed by the MLE of the hyperparameters performs considerably better than LDA models indexed by default choices of the hyperparameters.

Topic models such as LDA have many real-world applications, including document clustering, classification, ranking, and corpus summarization. In this thesis, we apply various topic models to the electronic discovery (e-discovery) problem, which refers to the process of identifying, collecting, discovering, and managing electronically stored information (ESI) for a lawsuit. We perform an empirical study comparing the performance of LDA with that of other topic models in representing ESI and in building binary classification models for the document discovery problem of e-discovery. We report the results of this study on several real datasets.
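The idea of estimating a hyperparameter by combining posterior sampling with importance sampling can be illustrated on a toy problem. The sketch below is not the thesis's algorithm and is not full LDA. It assumes a simplified model in which each document's per-topic counts are observed directly, the topic proportions have a symmetric Dirichlet prior with a single hyperparameter `alpha`, and exact draws from the conjugate posterior stand in for the MCMC step. It relies on the identity that the ratio of marginal likelihoods m(alpha)/m(alpha_ref) equals the posterior expectation, taken at alpha_ref, of the ratio of prior densities, so one set of draws serves every candidate value. All names, the data, the grid, and the reference value are hypothetical.

```python
import math
import random


def log_dirichlet_pdf(theta, alpha):
    """Log density of a symmetric Dirichlet(alpha) at probability vector theta."""
    k = len(theta)
    return (math.lgamma(k * alpha) - k * math.lgamma(alpha)
            + (alpha - 1.0) * sum(math.log(t) for t in theta))


def sample_dirichlet(params, rng):
    """Draw one Dirichlet vector by normalizing independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]


def posterior_draws(counts, alpha_ref, n_draws, rng):
    """Sample each document's topic proportions from its posterior at alpha_ref.

    Assumption: the toy posterior is conjugate, so exact sampling replaces MCMC.
    """
    return [[sample_dirichlet([alpha_ref + c for c in doc], rng)
             for _ in range(n_draws)] for doc in counts]


def log_ratio_estimate(draws, alpha, alpha_ref):
    """Importance sampling estimate of log[m(alpha) / m(alpha_ref)]."""
    total = 0.0
    for doc_draws in draws:
        # Log importance weights: prior density ratio at each posterior draw.
        log_w = [log_dirichlet_pdf(t, alpha) - log_dirichlet_pdf(t, alpha_ref)
                 for t in doc_draws]
        top = max(log_w)  # log-sum-exp shift for numerical stability
        total += top + math.log(sum(math.exp(w - top) for w in log_w) / len(log_w))
    return total


def exact_log_marginal(counts, alpha):
    """Closed-form log marginal likelihood, up to a constant free of alpha.

    Available only because the toy model is conjugate; used here as a check.
    """
    total = 0.0
    for doc in counts:
        k, n = len(doc), sum(doc)
        total += math.lgamma(k * alpha) - math.lgamma(k * alpha + n)
        total += sum(math.lgamma(alpha + c) - math.lgamma(alpha) for c in doc)
    return total


rng = random.Random(0)
counts = [[8, 1, 1, 2], [1, 9, 2, 1], [2, 1, 1, 10]]  # hypothetical topic counts
alpha_ref = 1.5                                        # hypothetical reference value
grid = [0.75, 1.0, 1.5, 2.0, 3.0]                      # hypothetical candidate values

draws = posterior_draws(counts, alpha_ref, 4000, rng)
estimates = {a: log_ratio_estimate(draws, a, alpha_ref) for a in grid}
alpha_hat = max(grid, key=estimates.get)  # grid point with the largest estimate
print("estimated maximizer on the grid:", alpha_hat)
```

In this sketch the grid is kept close to the reference value on purpose: importance weights become heavy-tailed when a candidate value is much smaller than the reference, and the estimate then degrades. In full LDA the posterior is not available in closed form, which is where an MCMC sampler would replace `posterior_draws`.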

Keywords/Search Tags:

LDA, Document, Models, Latent

Related items

1	Influence analysis of some complicated latent variable models
2	Network Statistics and Modeling the World Trade Network: Exponential Random Graph Models and Latent Space Models
3	Latent tree models: An application and an extension
4	Research On Document Clustering Technology Based On Latent Semantic Indexing
5	Study And Implementation On Latent Semantic Space Analysis And Web Document Clustering Based On LDA
6	Research Of Latent Semantic Analysis Based On Paragraph
7	Extensions of latent class trajectory models
8	Research On Deep Bayesian Latent Variable Models And Their Inference Methods
9	Contextual document models for searching the clinical literature
10	A Study of Document-Context Models in Information Retrieval