Research On The Topic Clustering Of Network Short Text

Posted on:2016-04-01

Degree:Master

Type:Thesis

Country:China

Candidate:M Wu

Full Text:PDF

GTID:2348330479954681

Subject:Computer technology

Abstract/Summary:

With the rapid development of the mobile Internet in recent years, a large number of new mobile applications have emerged, and these applications are filled with vast amounts of short text messages. Network short text differs considerably from the long text of traditional web pages. Because a short text contains little content, it carries far fewer semantic features than a long text, and its semantic feature matrix is very sparse. This makes semantic mining of short text much more difficult, and commonly used methods based on word-frequency statistics and the vector space model do not apply well to short text, giving low accuracy. This thesis focuses on the sparsity and context-dependency problems of the short-text semantic matrix: it constructs a non-sparse matrix, uses that matrix to cluster texts by topic, and then describes the resulting clusters accurately.
For long text, topic clustering techniques such as the LDA topic model are already mature. For short text, however, the volume of messages has grown rapidly over a short period as the mobile Internet has developed, and semantic mining is difficult, so short-text clustering technology is still at an early stage and is developing relatively slowly. Xiaohui Yan et al. proposed BTM (Biterm topic model), which introduces word pairs (biterms) into topic modeling on short-text data sets. Building on BTM, this thesis uses CRP to further introduce correlations between word pairs, and proposes the GBTM (gravity Biterm topic model) model.
After the GBTM model generates the document-topic probability distribution, the K-means clustering algorithm is combined with the JS distance used in probabilistic graphical models to compute the semantic distance between texts. Texts that are close to one another are gathered into a cluster, and each cluster corresponds to a topic.
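The clustering step described above can be illustrated with a minimal sketch. It assumes each text is already represented by a document-topic probability vector, takes "JS distance" to mean the square root of the Jensen-Shannon divergence, and updates each centroid as the mean of its member distributions. The function names, the toy vectors, and the fixed initial centroids are illustrative assumptions, not the thesis's implementation.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence (natural logarithm)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding error


def kmeans_js(docs, init_indices, max_iter=100):
    """K-means over probability vectors, using JS distance for assignment.

    docs: list of document-topic distributions (each sums to 1).
    init_indices: indices of the documents used as initial centroids
                  (an assumption of this sketch; any seeding scheme works).
    """
    k = len(init_indices)
    centroids = [list(docs[i]) for i in init_indices]
    labels = None
    for _ in range(max_iter):
        # Assign every document to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: js_distance(d, centroids[c]))
            for d in docs
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Recompute each centroid as the mean of its members,
        # which is again a valid probability distribution.
        for c in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy document-topic vectors over three topics (made-up values).
docs = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.85, 0.05],
    [0.05, 0.90, 0.05],
]
labels, centroids = kmeans_js(docs, init_indices=[0, 2])
print(labels)
```

Taking the arithmetic mean as the centroid keeps the sketch simple; other centroid definitions are possible under a JS-based distance.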
Once clustering is complete, semantically similar texts have been grouped into clusters, and how to describe each cluster intuitively becomes an important problem. This thesis uses the topic-word matrix obtained by inference to select descriptive words for each cluster, and removes topic words that recur across clusters. The clustering results are analyzed and compared with those of the LDA and BTM models, which confirms the feasibility of GBTM on short-text data sets and shows some improvement in clustering accuracy. Finally, the cluster descriptions are post-processed so that the clusters are more clearly distinguished from one another without losing accuracy or concision.
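One possible way to build such cluster descriptions is sketched below. The abstract does not specify how recurring words are removed, so the rule used here (drop any word that appears among the top candidates of more than one cluster) is an assumption, as are the function name, the parameters `n_words` and `n_candidates`, and the toy topic-word matrix.

```python
from collections import Counter


def describe_clusters(topic_word, vocab, n_words=3, n_candidates=6):
    """Pick descriptive words per cluster from a topic-word matrix.

    topic_word: one row per cluster/topic, one probability per vocabulary word.
    vocab: list of words, aligned with the columns of topic_word.
    Words that show up in the candidate list of more than one cluster are
    discarded (a hypothetical filtering rule, not the thesis's method).
    """
    candidates = []
    for row in topic_word:
        # Rank word indices by descending probability within this topic.
        ranked = sorted(range(len(vocab)), key=lambda w: row[w], reverse=True)
        candidates.append([vocab[w] for w in ranked[:n_candidates]])

    # Count in how many clusters each candidate word appears.
    counts = Counter(word for cand in candidates for word in cand)

    # Keep only words unique to one cluster, then truncate to n_words.
    return [[w for w in cand if counts[w] == 1][:n_words] for cand in candidates]


# Toy example with made-up probabilities.
vocab = ["phone", "battery", "screen", "movie", "actor", "good"]
topic_word = [
    [0.30, 0.24, 0.16, 0.02, 0.03, 0.25],
    [0.02, 0.03, 0.05, 0.35, 0.30, 0.25],
]
print(describe_clusters(topic_word, vocab, n_words=2, n_candidates=3))
```

In this toy input the word "good" ranks highly in both clusters, so it is dropped from both descriptions, which leaves only words that separate the clusters.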

Keywords/Search Tags:

short text clustering, topic modeling, GBTM model, Gibbs sampling, cluster description

Related items

1	Research On Short Text Topic Discovery Based On BTM Topic Model
2	Short Text Clustering Method Based On BTM
3	Research On Topic Models Combining Internal Feature And External Information Of Texts
4	Research On Short Text Clustering And Cluster Description Method
5	Research On Fast Gibbs Sampling Topic Inference Algorithms For Topic Models
6	Document Clustering Method Based On LDA Topic Model
7	Research On Topic Modeling For Short Texts Based On Intented Biterms
8	Design And Implementation Of Clustering Software For Expert Research Interests
9	Research On Topic Clustering Methods For Multi-source Texts
10	Research On Topic Model Over Short Texts With Incorporation Of Word Embedding