
Probabilistic Generative Models-based Topic Modeling of Text and Its Applications

Posted on: 2011-02-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Q Ding
GTID: 1118330332978362
Subject: Computer Science and Technology
Abstract/Summary:
We are faced with a world of digital information in the 21st century, and text information affects our lives in many ways. Search engines such as Google and Baidu brought the first wave of revolution in how we use text information by helping us locate relevant information. Text mining technologies extract genuine knowledge from text and are bringing a second wave of revolution by helping us understand text information.

Text clustering is one of two major research areas in text mining. It groups texts into clusters according to their contents, with each cluster representing a group of semantically similar texts. With the help of text clustering, understanding a large text collection only requires going through a small number of clusters. The results of text clustering can also serve as inputs to other text processing routines for further analysis. However, traditional text clustering analysis focuses on assigning texts to groups, and the study of the clusters themselves has not attracted much attention. Research on topic models addresses this problem by directly modeling abstract concepts such as topics and topic relationships in a Bayesian approach. The topic modeling approach can effectively deal with "the curse of dimensionality" through dimension reduction, it offers a principled way of modeling complex real-world processes, and its models can be extended to account for various kinds of domain knowledge.

In this dissertation, the author focuses on several key problems in topic modeling and on the application of topic modeling methods in other research areas. The specific issues and contributions of our work include:

Research on topic model design. The introduction of hierarchical topic relationships and DAG-structured topic relationships makes topic modeling a more powerful modeling tool. However, as pointed out in our work, the correlation among the random variables in complex models can often make probabilistic inference algorithms such as Gibbs sampling converge slowly or get trapped in local maxima. To deal with this problem, we proposed a new random process: the nested hierarchical Dirichlet process. Based on this random process, we introduced two hierarchical topic models. As confirmed by our experiments, introducing the new concepts of "sub-topic" and "level mapping" into the model effectively dealt with the inference difficulty caused by correlations among the random variables.

Research on approximate probabilistic methods. Because of the strong coupling among the random variables in topic models, exact inference is intractable. Markov chain Monte Carlo (MCMC) is an approximate inference method widely used in topic modeling research. Fast convergence and the ability to escape local maxima are important when topic modeling is applied to large text collections. However, coupling among the random variables in topic models can make MCMC samplers converge slowly and get trapped in local maxima. The ASM sampler introduced in our work is an adaptive MCMC method. By utilizing all state information of the Markov chain, the ASM sampler can adjust its transition matrix as it moves through the support of the target distribution. Our experiments show that this adaptation can effectively speed up convergence compared with an existing MCMC method.

Application of topic modeling to mobility modeling. Topic modeling is not only a hot research topic in the text analysis community but has also been applied in other research areas. Mobility modeling builds models of user movement patterns in wireless networks; it can deal with the various challenges brought on by the mobile nature of mobile users, such as resource reservation and the design of mobile routing protocols. Traditional mobility models focus on mining sequential patterns. We show that non-sequential patterns can be useful in some scenarios. We introduced the concept of hierarchical mobility patterns, and to the best of our knowledge we are the first to introduce topic modeling into mobility modeling. Our experiment shows that our model, based on the nested Dirichlet process, has better generalization capability than the hidden Markov model, and that the mobility patterns it discovers are easier to interpret.
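
The abstract describes an adaptive MCMC sampler that uses the chain's past states to adjust how it moves, but it does not give the details of the ASM sampler. As a rough illustration of that general idea only, the following sketch implements a simple one-dimensional adaptive random-walk Metropolis sampler in which the proposal width is tuned from the variance of all states visited so far. It is not the ASM sampler from the dissertation; the function names, the Gaussian target, and the scaling constant 2.4 are assumptions chosen for the example.

```python
import math
import random


def log_target(x, mean=3.0, std=2.0):
    """Unnormalized log-density of a 1-D Gaussian (illustrative target only)."""
    return -0.5 * ((x - mean) / std) ** 2


def adaptive_metropolis(log_p, n_samples, x0=0.0, init_scale=1.0, seed=0):
    """Random-walk Metropolis whose proposal width adapts to the chain history.

    Hypothetical helper for illustration; not the dissertation's ASM sampler.
    """
    rng = random.Random(seed)
    x, lp = x0, log_p(x0)
    # Running statistics (Welford's method) over ALL states visited so far.
    count, mean, m2 = 1, x0, 0.0
    samples = []
    for _ in range(n_samples):
        if count > 10:
            # Proposal std proportional to the history std, with a small floor
            # so the proposal never collapses to zero width.
            scale = 2.4 * math.sqrt(m2 / (count - 1)) + 0.05
        else:
            scale = init_scale
        proposal = x + rng.gauss(0.0, scale)
        lp_new = log_p(proposal)
        # Metropolis acceptance test for a symmetric proposal.
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            x, lp = proposal, lp_new
        # Update the running mean and sum of squared deviations with the
        # current state (whether the proposal was accepted or rejected).
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
        samples.append(x)
    return samples


if __name__ == "__main__":
    draws = adaptive_metropolis(log_target, 20000, seed=1)
    kept = draws[2000:]  # discard an initial burn-in segment
    print("estimated mean:", sum(kept) / len(kept))
```

The design choice illustrated here is that the proposal distribution depends on the whole trajectory rather than only on the current state, which is the property the abstract attributes to adaptive samplers in general.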
Keywords/Search Tags: text mining, topic modeling, Bayesian model, approximate probabilistic inference