
Some Research On Bayesian Statistics In Text Mining

Posted on: 2020-08-28 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: H B Zhang | Full Text: PDF
GTID: 1367330596467891 | Subject: Statistics
Abstract/Summary:
With the development of the information age, more and more unstructured text information is constantly appearing, and we need new tools to organize, search, and understand this text in order to extract valuable information. Text mining is an effective tool for solving this series of problems. Within text mining, the most commonly used task is text classification: a supervised learning process that aims to assign documents to one or more predefined categories based on their content. The complexity, diversity, and high dimensionality of text information pose a great challenge to extracting effective features for this task. Probabilistic topic models are effective tools for feature extraction in text mining: they use Bayesian statistical methods to uncover the hidden semantic structure of text and thereby obtain effective features. Text categorization and probabilistic topic models are therefore very interesting research topics in text mining. This thesis not only addresses text classification itself, but also explores feature representation and feature selection with Bayesian nonparametric probabilistic topic models in text classification. The main work is as follows.

(1) The Polya urn model is a basic model widely used in statistics and text mining. Most algorithms for training the model are slow and complicated, which makes it difficult to fit the Polya urn model on big data sets. We propose a new minorization-maximization (MM) algorithm for the maximum likelihood estimation (MLE) of the Polya urn model, in which the surrogate function is based on a simple convex function. We also discuss the convergence of our MM algorithm and prove asymptotic normality of the MLE under non-identically distributed observations. To illustrate the performance of this MM algorithm, we compare it with Newton's method and other MM algorithms. We also apply the Polya urn model to text categorization and compare it with standard text classification methods.

(2) Starting from the natural intrinsic relationship between words, a nonparametric Bayesian graph topic model (GTM) based on the hierarchical Dirichlet process (HDP) is proposed. The HDP allows the number of topics to be selected flexibly, removing the limitation that the number of topics must be given in advance. Moreover, the GTM relaxes the "bag of words" assumption and takes the graph structure of the text into account. The combination of the HDP and the GTM, named HDP-GTM, takes advantage of both. A variational inference algorithm is used for posterior inference, and the convergence of the algorithm is analyzed.

(3) Empirical studies of natural language have shown that word token frequencies follow power-law distributions, a property that standard statistical models fail to capture. The Pitman-Yor process (PYP) is a Bayesian nonparametric model that generates distributions following a power law and can model data with a potentially infinite number of components; it has been widely applied in probabilistic topic modeling. However, existing PYP-based topic models rarely consider the relations between topics, whereas hidden Markov models (HMMs) are among the most popular and successful models for capturing such relations. We propose a probabilistic topic model that combines an HMM with Pitman-Yor priors, and perform posterior inference with variational Bayes (VB) methods.

(4) From the perspective of the construction of the text, a sentence topic model based on the hierarchical Pitman-Yor process is proposed. This topic model takes into account the sentence information that classic topic models often ignore, and it overcomes the "bag of words" assumption. However, posterior inference through the variational Bayes (VB) method is not applicable, because the hierarchical Pitman-Yor process has no stick-breaking representation; we therefore explore a Gibbs sampling method to infer the posterior distribution. We apply this model to topic modeling and text categorization, and compare it with classic topic models.

The conclusions and methods of this thesis enrich the study of Bayesian nonparametric statistics in topic modeling, and help improve the effectiveness of text classification.
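The power-law behavior motivating contribution (3) can be illustrated with the Chinese-restaurant-process view of the Pitman-Yor process: each new token either reuses an existing word type with probability proportional to (count − d) or creates a new type with probability proportional to (θ + d·K). The sketch below is illustrative only, not code from the thesis; the function name and parameter defaults are my own choices.

```python
import random
from collections import Counter

def pitman_yor_crp(n_tokens, d=0.5, theta=1.0, seed=0):
    """Sample a token sequence from a Pitman-Yor Chinese restaurant process.

    d: discount parameter in [0, 1); theta: concentration (theta > -d).
    A draw joins existing word type k with probability proportional to
    (counts[k] - d), or creates a new type with probability proportional
    to (theta + d * K), where K is the current number of types.
    """
    rng = random.Random(seed)
    counts = []   # counts[k] = occurrences of word type k so far
    tokens = []
    for _ in range(n_tokens):
        # total unnormalized mass: sum_k (counts[k]-d) + (theta + d*K)
        #                        = sum(counts) + theta
        r = rng.uniform(0.0, sum(counts) + theta)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c - d
            if r < acc:
                counts[k] += 1   # reuse existing word type k
                tokens.append(k)
                break
        else:
            counts.append(1)     # create a brand-new word type
            tokens.append(len(counts) - 1)
    return tokens

# With d > 0 the type frequencies are heavy-tailed: a few very frequent
# types and a long tail of rare ones, as observed in natural language.
freqs = Counter(pitman_yor_crp(20000, d=0.7, theta=1.0))
ranked = sorted(freqs.values(), reverse=True)
```

Plotting `ranked` on log-log axes shows the approximately linear rank-frequency relationship characteristic of power laws; with `d = 0`, the same construction reduces to the Dirichlet-process CRP, whose tails are notably lighter.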
Keywords/Search Tags: Text mining, text categorization, probabilistic topic model, Bayesian nonparametric, Polya urn, hierarchical Dirichlet process, Pitman-Yor process, graph topic model, variational inference, hidden Markov model