Research On Topic Clustering Algorithm Based On Topic Models

Posted on: 2018-10-21

Degree: Master

Type: Thesis

Country: China

Candidate: D Zhang

Full Text: PDF

GTID: 2348330518995319

Subject: Information and Communication Engineering

Abstract/Summary:

Internet technology is evolving rapidly, and every day brings a massive volume of information. Weibo, a major platform for capturing and sharing information, is highly sensitive to hot topics, which catch people's attention and become the main focus as soon as they are released. For traffic accident disclosure, the timeliness of Weibo lets information spread faster than through traditional news media, so capturing traffic topics on Weibo helps people predict and make decisions about traffic events more quickly and accurately. However, because Weibo text is short, colloquial, and shaped by fast-changing network language, it is not easy to mine the latent semantic associations within it.

This paper uses traffic-related Weibo text as its research data. After a series of preprocessing operations on more than 450,000 traffic-related texts, a hot topic detection and text clustering system for Weibo traffic content was designed and implemented. The main innovations of this paper are as follows:

(1) Taking into account the shortcomings of traditional text models, this paper chooses the BTM model, which is suited to short text collections, for topic detection. In the modeling analysis, the selection of the number of topics for the BTM model is improved in two respects: topic similarity and topic importance. Topic importance is defined on the principle that the lower the similarity between topics, the better the model. In addition, the 10 best topics, which represent the most popular topics, are selected according to how topic importance decays.

(2) Word2vec is used to express sentence vectors through word vectors, and the similarity between a document and a topic is calculated. An algorithm based on the dense characteristics of word2vec is proposed to compute document similarity. By introducing a high-dimensional vocabulary of high-frequency network words, the dimension of the feature vectors is expanded so that they become denser and each dimension has a concrete meaning, which makes the information expression more comprehensive. The proposed algorithm is also compared with other currently popular algorithms.

(3) The traditional K-means clustering algorithm is improved. The topic distribution of the optimized BTM model is used as the initial centroids of K-means, and the number of clusters is set to the optimal number of topics. During the clustering iterations, the proposed similarity algorithm is used, and the stability and accuracy of the clustering results are verified with evaluation indicators.
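The general idea behind innovations (2) and (3) can be illustrated with a small sketch. It is not the thesis's implementation, and several details are assumptions made only for illustration: the word vectors are a hypothetical 3-dimensional toy table rather than a trained word2vec model, a sentence vector is taken as the plain average of its word vectors, each topic seed is the average vector of a few hypothetical top words, and plain cosine similarity stands in for the proposed similarity algorithm, which the abstract does not specify.

```python
import math

# Hypothetical toy word vectors; real ones would come from a trained word2vec model.
WORD_VECTORS = {
    "crash": [0.9, 0.1, 0.0],
    "accident": [0.8, 0.2, 0.0],
    "jam": [0.1, 0.9, 0.0],
    "congestion": [0.2, 0.8, 0.1],
    "subway": [0.0, 0.1, 0.9],
    "metro": [0.1, 0.0, 0.8],
}


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sentence_vector(tokens):
    """Assumed composition: average the vectors of the known words."""
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not known:
        raise ValueError("no token has a word vector")
    return mean_vector(known)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def seeded_kmeans(doc_vectors, seed_centroids, max_iter=20):
    """K-means whose initial centroids are given (one per topic) and whose
    assignment step picks the most cosine-similar centroid."""
    centroids = [list(c) for c in seed_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in doc_vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [v for v, lab in zip(doc_vectors, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = mean_vector(members)
    return labels, centroids


# Hypothetical tokenized documents and per-topic top words.
docs = [["crash", "accident"], ["jam", "congestion"],
        ["subway", "metro"], ["accident", "jam"]]
topic_top_words = [["crash"], ["jam"], ["subway"]]

doc_vecs = [sentence_vector(d) for d in docs]
seeds = [sentence_vector(t) for t in topic_top_words]
labels, centroids = seeded_kmeans(doc_vecs, seeds)
print(labels)
```

Seeding the centroids from the topics fixes the number of clusters to the number of topics and removes the random initialization that makes standard K-means results vary between runs, which is consistent with the stability goal stated in (3).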

Keywords/Search Tags:

topic detection, BTM topic model, word2vec, similarity computing, text clustering

Related items

1	The Research And Implementation Of Text Similarity Computing Based On Topic Model
2	Research On Hot Topic Detection Methods For Microblog
3	Event Detection From Microblogs Based On Topic Model
4	Research On The Key Technology Of Hot Spot Topic Discovery Based On Microblogging
5	Research On BBS Topic Detection And Tracking
6	Research On Text Clustering Algorithm And Its Application In Topic Detection
7	The Design And Implementation Of The Hot Education News Topic Detection System
8	Sphere Topic Model Based On Word Embedding In Text Clustering Field
9	Research On Topic Detection Method Of Complex Short Text Based On Topic Model
10	Research And Implementation Of Distributed Topic Clustering Technology For Text Flow