Research On Topic Clustering Algorithm Based On Topic Models

Posted on: 2018-10-21
Degree: Master
Type: Thesis
Country: China
Candidate: D Zhang
GTID: 2348330518995319
Subject: Information and Communication Engineering
Abstract/Summary:
Internet technology is evolving rapidly, and each day brings a massive amount of new information. Weibo, as a major platform for capturing and sharing information, is highly sensitive to hot topics, which catch people's attention first and become the main focus soon after they are released. For traffic accident disclosure in particular, the timeliness of Weibo lets information spread faster than through traditional news media, so capturing traffic topics on Weibo can help people predict and make decisions about traffic events more quickly and accurately. However, because Weibo text is short, colloquial, and subject to the fast turnover of online language, it is not easy to mine the latent semantic associations within it. This thesis uses traffic-related Weibo text as its research data. After a series of preprocessing operations on more than 450,000 traffic-related texts, a hot topic detection and text clustering system for Weibo traffic content was designed and implemented. The main innovations of this thesis are as follows:

(1) Considering the shortcomings of traditional text models, this thesis chooses the BTM model, which is suited to short text collections, for topic detection. During modeling, the selection of the number of topics for the BTM model is improved from two aspects: topic similarity and topic importance. Topic importance is defined on the principle that the lower the similarity between topics, the better the model. The 10 best topics, representing the most popular topics, are then selected according to how topic importance decays.

(2) Word2vec is used to express sentence vectors through word vectors, and the similarity between a document and a topic is calculated. An algorithm based on the dense characteristics of word2vec is proposed to compute document similarity. By introducing a large vocabulary of high-frequency Internet words, the dimension of the feature vectors is expanded, so that the vectors become denser and each dimension carries a concrete meaning. The information is therefore expressed more comprehensively. The proposed algorithm is also compared with other currently popular algorithms.

(3) The traditional K-means clustering algorithm is improved. The topic distribution of the optimized BTM model is used as the initial centroids of K-means, and the number of clusters is set to the optimal number of topics. The proposed similarity algorithm is used during the clustering iterations, and the stability and accuracy of the clustering results are verified with evaluation indicators.
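The pipeline in innovations (2) and (3) can be illustrated with a minimal sketch. It is not the thesis's implementation: plain cosine similarity stands in for the proposed similarity algorithm, which the abstract does not specify; sentence vectors are simple averages of word vectors; and the seed centroids are assumed to already lie in the same vector space as the documents. The names `sentence_vector`, `cosine`, and `seeded_kmeans` and the toy two-dimensional word vectors are hypothetical.

```python
import math


def sentence_vector(tokens, word_vectors):
    """Average the vectors of the known tokens (assumed stand-in for the
    thesis's sentence representation). Returns None if no token is known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    return [sum(col) / len(known) for col in zip(*known)]


def cosine(a, b):
    """Cosine similarity; 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def seeded_kmeans(points, seeds, max_iter=100):
    """K-means with caller-supplied initial centroids.

    The number of clusters equals len(seeds). Each point is assigned to the
    most similar centroid (cosine similarity, an assumption), and each
    centroid is then recomputed as the mean of its members.
    """
    centroids = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centroids)), key=lambda k: cosine(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy data: hypothetical 2-dimensional "word vectors" for illustration only.
word_vectors = {
    "crash": [1.0, 0.0],
    "collision": [0.9, 0.1],
    "jam": [0.0, 1.0],
    "congestion": [0.1, 0.9],
}
docs = [["crash", "collision"], ["jam", "congestion"],
        ["collision"], ["congestion", "unknown"]]
points = [sentence_vector(d, word_vectors) for d in docs]
seeds = [[1.0, 0.0], [0.0, 1.0]]  # one seed per topic (assumed given)
labels, centroids = seeded_kmeans(points, seeds)
```

Seeding the centroids fixes both the number of clusters and the starting point, which removes the run-to-run variation caused by random initialization in standard K-means; that is one plausible reading of the stability the abstract refers to.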
Keywords/Search Tags: topic detection, BTM topic model, word2vec, similarity computing, text clustering