
Research On The Topic Clustering Of Network Short Text

Posted on: 2016-04-01 | Degree: Master | Type: Thesis
Country: China | Candidate: M Wu | Full Text: PDF
GTID: 2348330479954681 | Subject: Computer technology
Abstract/Summary:
With the rapid development of the mobile Internet in recent years, a large number of new mobile applications have emerged, and these applications are filled with vast amounts of short text messages. Network short text differs greatly from the long text of traditional web pages because it contains far less content. Short text therefore carries far fewer semantic features than long text, and its semantic feature matrix is quite sparse, which makes semantic mining of short text much more difficult. Commonly used methods based on word-frequency statistics and the vector space model are poorly suited to short text and yield low accuracy. The focus of this thesis is to address the sparsity and context dependency of the short-text semantic matrix by constructing a non-sparse matrix, to use that matrix for topic clustering of the texts, and to describe the resulting clusters accurately.

For long text, topic clustering techniques such as the LDA topic model are already mature. For short text, however, the volume of messages has grown rapidly in a short time along with the mobile Internet, and semantic mining is difficult, so short text clustering is still at an early stage and has progressed relatively slowly. Xiaohui Yan et al. proposed the BTM (Biterm Topic Model), which brings word pairs (biterms) into topic modeling of short text data sets. Building on BTM, this thesis uses the CRP to reintroduce correlations between the words in a pair and proposes the GBTM (Gravity Biterm Topic Model).

Once the GBTM model has generated the document-topic probability distribution, the K-means clustering algorithm is combined with the JS distance from probabilistic graphical models to compute the semantic distance between texts. Texts that are close to one another are gathered into a cluster, and each cluster is treated as a topic.
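The clustering step described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: it assumes that "JS distance" means the square root of the Jensen-Shannon divergence, that each text is represented by its document-topic probability vector, and that centroids are updated as the mean of their members. The names `js_distance`, `kmeans_js`, and the toy matrix `theta` are hypothetical.

```python
import numpy as np


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence (natural log)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        # Only terms with a > 0 contribute; b = m is positive there.
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(divergence, 0.0)))


def kmeans_js(doc_topic, k, n_iter=100, seed=0):
    """K-means over document-topic vectors, assigning by JS distance."""
    theta = np.asarray(doc_topic, dtype=float)
    n = theta.shape[0]
    rng = np.random.default_rng(seed)
    # Assumption: initial centroids are k distinct documents chosen at random.
    centroids = theta[rng.choice(n, size=k, replace=False)].copy()
    labels = np.full(n, -1)
    for _ in range(n_iter):
        dist = np.array(
            [[js_distance(row, c) for c in centroids] for row in theta]
        )
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = theta[labels == j]
            if len(members) > 0:
                # The mean of probability vectors is itself a probability vector.
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Toy document-topic matrix (hypothetical values): 4 texts, 3 topics.
theta = np.array([
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.85, 0.05],
])
labels, centroids = kmeans_js(theta, k=2)
print(labels)
```

Using a divergence-based distance rather than Euclidean distance keeps the comparison meaningful for probability vectors; whether the thesis updates centroids by a plain mean is not stated, so that choice is an assumption here.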
Once clustering is complete, semantically similar texts have been grouped into the same cluster, and how to describe each cluster intuitively becomes an important problem. This thesis uses the topic-word matrix inferred by the model to select descriptive words for each cluster, and applies a suitable method to remove topic words that recur across clusters from the descriptions. The clustering results are analyzed and compared with the topic clustering results of the LDA and BTM models. The comparison confirms the feasibility of GBTM on short text data sets and shows that clustering accuracy improves to some extent. Finally, the cluster descriptions are post-processed so that the distinction between clusters is more obvious without losing accuracy or concision.
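One way to picture the cluster-description step is the sketch below. The abstract does not specify the removal method, so this is only an assumed stand-in: take the highest-probability words of each topic from the topic-word matrix, then drop any word that appears in the candidate list of more than one cluster. The function `describe_clusters`, its parameters `top_n` and `pool`, and the toy data are all hypothetical.

```python
from collections import Counter

import numpy as np


def describe_clusters(topic_word, vocab, top_n=2, pool=3):
    """Pick descriptive words per cluster, dropping words shared across clusters."""
    phi = np.asarray(topic_word, dtype=float)
    # Candidate words: the `pool` most probable words of each topic.
    candidates = [
        [vocab[i] for i in np.argsort(-row, kind="stable")[:pool]]
        for row in phi
    ]
    # Count in how many clusters each candidate word appears.
    counts = Counter(word for words in candidates for word in words)
    # Keep only words unique to one cluster, in descending probability order.
    return [
        [word for word in words if counts[word] == 1][:top_n]
        for words in candidates
    ]


# Toy topic-word matrix (hypothetical values): 2 topics, 5 words.
vocab = ["phone", "app", "game", "news", "music"]
phi = np.array([
    [0.40, 0.30, 0.20, 0.05, 0.05],
    [0.05, 0.30, 0.05, 0.40, 0.20],
])
descriptions = describe_clusters(phi, vocab)
print(descriptions)
```

In this toy data "app" ranks highly in both topics, so it is removed from both descriptions, which makes the two clusters easier to tell apart.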
Keywords/Search Tags: short text clustering, topic modeling, GBTM model, Gibbs sampling, cluster description