
Research on Topic Models Combining Internal Features and External Information of Texts

Posted on: 2017-06-07
Degree: Master
Type: Thesis
Country: China
Candidate: L X Liu
Full Text: PDF
GTID: 2348330485992585
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development and wide application of information technology, a great variety of text information has emerged in digitized form, including web pages, blogs, news, books, microblogs and social-network posts, all accumulating at an unprecedented rate. Faced with such huge and rapidly growing text data, mining the knowledge implicit in it both effectively and efficiently is currently a major challenge for computer science. Topic models, also known as probabilistic topic models, can extract the statistical regularities contained in large-scale, high-dimensional sparse data and express them at a low-dimensional, intuitive semantic level. The topics extracted by topic models provide basic services for information retrieval, classification, clustering, text-similarity computation, judging the relevance between texts, and other applications. Topic models are widely used in text mining, automatic document summarization, sentiment analysis, image processing and other fields.

Latent Dirichlet allocation (LDA) is the most representative topic model. Currently, much of the important work on topic models is accomplished by modifying or extending the original LDA model. Jointly considering the internal features of texts and external information, whether to enhance the modeling effect or to serve particular tasks, is a crucial approach to topic modeling. Based on these ideas for improvement, the major studies of the thesis are as follows:

1. To solve the problem that traditional multinomial-based topic models cannot properly capture the phenomenon of word burstiness, a continuous-time topic model for word burstiness, the Dirichlet compound multinomial continuous-time topic model, is proposed, which integrates the temporal information inherent in the corpus. Experiments on the NIPS conference proceedings demonstrate that the model has obvious advantages in generalization performance when the given number of topics is small, and that it can effectively reveal the latent evolution of topics in the corpus.

2. To solve the content-sparsity problem inherent in short texts, the thesis proposes a latent feature biterm topic model, which introduces word vectors into the biterm topic model. Gibbs sampling is employed to estimate the model parameters. Comparative experiments on real-world short texts demonstrate that the model can effectively exploit the rich information in word vectors to further alleviate the content-sparsity problem, yielding significant improvements in topic coherence; overall, its modeling effect has obvious advantages over baseline models.
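For reference, the standard LDA generative process that the thesis's models extend can be written as follows (standard notation from Blei et al., 2003, not taken from the thesis itself): with K topics, D documents, and N_d words in document d,

    \phi_k \sim \mathrm{Dir}(\beta), \quad k = 1,\dots,K
    \theta_d \sim \mathrm{Dir}(\alpha), \quad d = 1,\dots,D
    z_{d,n} \sim \mathrm{Mult}(\theta_d), \quad w_{d,n} \sim \mathrm{Mult}(\phi_{z_{d,n}}), \quad n = 1,\dots,N_d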
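The word-burstiness modeling in contribution 1 builds on the Dirichlet compound multinomial (DCM, or Pólya) distribution, which, unlike the multinomial, makes repeated occurrences of a word within a document more probable once the word has appeared. Shown below is the standard DCM document likelihood on which such models rest, not the thesis's continuous-time extension, whose details are not reproduced here:

    p(\mathbf{x} \mid \boldsymbol{\alpha}) = \int \mathrm{Mult}(\mathbf{x} \mid \boldsymbol{\theta})\, \mathrm{Dir}(\boldsymbol{\theta} \mid \boldsymbol{\alpha})\, d\boldsymbol{\theta} = \frac{n!}{\prod_w x_w!}\, \frac{\Gamma\!\left(\sum_w \alpha_w\right)}{\Gamma\!\left(n + \sum_w \alpha_w\right)} \prod_w \frac{\Gamma(x_w + \alpha_w)}{\Gamma(\alpha_w)}

where x_w is the count of word w in the document and n = \sum_w x_w.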
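As background for contribution 2, the following is a minimal sketch of collapsed Gibbs sampling for the standard biterm topic model (BTM) that the thesis extends; the latent-feature (word-vector) component of the proposed model is the thesis's own contribution and is not reproduced here. Function and variable names are illustrative, not from the thesis.

    import numpy as np
    from itertools import combinations

    def btm_gibbs(docs, V, K=10, alpha=1.0, beta=0.01, n_iter=200, seed=0):
        """Collapsed Gibbs sampling for the standard biterm topic model.

        docs: list of documents, each a list of word ids in [0, V).
        Returns (theta, phi): corpus-level topic proportions and
        topic-word distributions.
        """
        rng = np.random.default_rng(seed)
        # A biterm is an unordered pair of words co-occurring in a short text.
        biterms = [pair for doc in docs for pair in combinations(doc, 2)]

        n_z = np.zeros(K)            # biterm counts per topic
        n_wz = np.zeros((K, V))      # word counts per topic
        z = rng.integers(K, size=len(biterms))
        for b, (wi, wj) in enumerate(biterms):
            n_z[z[b]] += 1
            n_wz[z[b], wi] += 1
            n_wz[z[b], wj] += 1

        for _ in range(n_iter):
            for b, (wi, wj) in enumerate(biterms):
                k = z[b]
                # Remove the biterm's current assignment from the counts.
                n_z[k] -= 1
                n_wz[k, wi] -= 1
                n_wz[k, wj] -= 1
                # P(z_b = k | rest): both words are drawn from topic k.
                denom = 2 * n_z + V * beta
                p = (n_z + alpha) * (n_wz[:, wi] + beta) \
                    * (n_wz[:, wj] + beta) / (denom * (denom + 1))
                k = rng.choice(K, p=p / p.sum())
                z[b] = k
                n_z[k] += 1
                n_wz[k, wi] += 1
                n_wz[k, wj] += 1

        theta = (n_z + alpha) / (n_z.sum() + K * alpha)
        phi = (n_wz + beta) / (n_wz.sum(axis=1, keepdims=True) + V * beta)
        return theta, phi

The (denom)(denom + 1) term in the conditional reflects that each biterm contributes two word tokens drawn from the same topic, which is what lets BTM sidestep the per-document sparsity of short texts.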
Keywords/Search Tags: Topic model, Word burstiness, Short text, Biterm, Gibbs sampling