
Research on Topic Models Combining Internal Features and External Information of Texts

Posted on: 2017-06-07
Degree: Master
Type: Thesis
Country: China
Candidate: L X Liu
Full Text: PDF
GTID: 2348330485992585
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development and wide application of information technology, a great variety of text information has emerged in digitized form, including web pages, blogs, news, books, microblogs and social-network posts, all accumulating at an unprecedented rate. Faced with such huge and rapidly growing text data, mining the knowledge implicit in it both effectively and efficiently is currently a major challenge for computer science. Topic models, also known as probabilistic topic models, can extract the statistical regularities contained in large-scale, high-dimensional sparse data and express them at a low-dimensional, intuitive semantic level. The topics extracted by topic models provide basic services for information retrieval, classification, clustering, text-similarity computation, judging the relevance between texts, and other applications. Topic models are widely used in text mining, automatic document summarization, sentiment analysis, image processing and other fields.

Latent Dirichlet allocation (LDA) is the most representative topic model. Currently, much of the important work on topic models is accomplished by modifying or extending the original LDA model. Jointly considering the internal features of texts and external information, whether to enhance the modeling effect or to serve particular tasks, is a crucial approach to topic modeling. Based on these ideas for improvement, the major studies of the thesis are as follows:

1. To solve the problem that traditional multinomial-based topic models cannot properly capture the phenomenon of word burstiness, a continuous-time topic model for word burstiness, the Dirichlet compound multinomial continuous-time topic model, is proposed, which integrates the temporal information inherent in the corpus. Experiments on the NIPS conference proceedings demonstrate that the model has obvious advantages in generalization performance when the given number of topics is small, and that it can effectively reveal the latent evolution of topics in the corpus.

2. To solve the content-sparsity problem inherent in short texts, the thesis proposes a latent feature biterm topic model, which introduces word vectors into the biterm topic model. Gibbs sampling is employed to estimate the model parameters. Comparative experiments on real-world short texts demonstrate that the model can effectively exploit the rich information in word vectors to further alleviate the content-sparsity problem, yielding significant improvements in topic coherence; overall, its modeling effect has obvious advantages over baseline models.
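For reference, the standard LDA generative process that the thesis's models extend can be written as follows (standard notation from Blei et al., 2003, not taken from the thesis itself): with K topics, D documents, and N_d words in document d,

    \phi_k \sim \mathrm{Dir}(\beta), \quad k = 1,\dots,K
    \theta_d \sim \mathrm{Dir}(\alpha), \quad d = 1,\dots,D
    z_{d,n} \sim \mathrm{Mult}(\theta_d), \quad w_{d,n} \sim \mathrm{Mult}(\phi_{z_{d,n}}), \quad n = 1,\dots,N_d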
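The word-burstiness modeling in contribution 1 builds on the Dirichlet compound multinomial (DCM, or Pólya) distribution, which, unlike the multinomial, makes repeated occurrences of a word within a document more probable once the word has appeared. Shown below is the standard DCM document likelihood on which such models rest, not the thesis's continuous-time extension, whose details are not reproduced here:

    p(\mathbf{x} \mid \boldsymbol{\alpha}) = \int \mathrm{Mult}(\mathbf{x} \mid \boldsymbol{\theta})\, \mathrm{Dir}(\boldsymbol{\theta} \mid \boldsymbol{\alpha})\, d\boldsymbol{\theta} = \frac{n!}{\prod_w x_w!}\, \frac{\Gamma\!\left(\sum_w \alpha_w\right)}{\Gamma\!\left(n + \sum_w \alpha_w\right)} \prod_w \frac{\Gamma(x_w + \alpha_w)}{\Gamma(\alpha_w)}

where x_w is the count of word w in the document and n = \sum_w x_w.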
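As background for contribution 2, the following is a minimal sketch of collapsed Gibbs sampling for the standard biterm topic model (BTM) that the thesis extends; the latent-feature (word-vector) component of the proposed model is the thesis's own contribution and is not reproduced here. Function and variable names are illustrative, not from the thesis.

    import numpy as np
    from itertools import combinations

    def btm_gibbs(docs, V, K=10, alpha=1.0, beta=0.01, n_iter=200, seed=0):
        """Collapsed Gibbs sampling for the standard biterm topic model.

        docs: list of documents, each a list of word ids in [0, V).
        Returns (theta, phi): corpus-level topic proportions and
        topic-word distributions.
        """
        rng = np.random.default_rng(seed)
        # A biterm is an unordered pair of words co-occurring in a short text.
        biterms = [pair for doc in docs for pair in combinations(doc, 2)]

        n_z = np.zeros(K)            # biterm counts per topic
        n_wz = np.zeros((K, V))      # word counts per topic
        z = rng.integers(K, size=len(biterms))
        for b, (wi, wj) in enumerate(biterms):
            n_z[z[b]] += 1
            n_wz[z[b], wi] += 1
            n_wz[z[b], wj] += 1

        for _ in range(n_iter):
            for b, (wi, wj) in enumerate(biterms):
                k = z[b]
                # Remove the biterm's current assignment from the counts.
                n_z[k] -= 1
                n_wz[k, wi] -= 1
                n_wz[k, wj] -= 1
                # P(z_b = k | rest): both words are drawn from topic k.
                denom = 2 * n_z + V * beta
                p = (n_z + alpha) * (n_wz[:, wi] + beta) \
                    * (n_wz[:, wj] + beta) / (denom * (denom + 1))
                k = rng.choice(K, p=p / p.sum())
                z[b] = k
                n_z[k] += 1
                n_wz[k, wi] += 1
                n_wz[k, wj] += 1

        theta = (n_z + alpha) / (n_z.sum() + K * alpha)
        phi = (n_wz + beta) / (n_wz.sum(axis=1, keepdims=True) + V * beta)
        return theta, phi

The (denom)(denom + 1) term in the conditional reflects that each biterm contributes two word tokens drawn from the same topic, which is what lets BTM sidestep the per-document sparsity of short texts.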
Keywords/Search Tags: Topic model, Word burstiness, Short text, Biterm, Gibbs sampling