Research On Microblog Topic Detection Based On VSM Model And LDA Model

Posted on:2013-01-16

Degree:Master

Type:Thesis

Country:China

Candidate:B Huang

Full Text:PDF

GTID:2248330371996010

Subject:Computer application technology

Abstract/Summary:

In recent years, with the rapid development and wide adoption of Internet technology, both the speed and the volume of information dissemination over the network have reached an unprecedented scale. As a new Internet medium with high penetration among Internet users, microblog has become one of the main sources of information on the Internet. It differs considerably from other network text. First, its content is relatively simple (the main body usually contains fewer than 140 words). Second, it can be posted in real time from mobile phones, instant messaging software and so on, which produces large amounts of data in a short period of time. This data is often huge and messy, so processing it requires considerable work, and it is extremely difficult to find the required information accurately and efficiently.

Topic detection technology merges information that is distributed under the same topic, which greatly reduces the repetition rate of information. It helps users understand the links between different topics and quickly find the information they need most. Although topic detection algorithms based on the traditional VSM (Vector Space Model) have achieved good results and have been widely applied, they show obvious shortcomings when dealing with large-scale short microblog text. First, the traditional VSM makes no special provision for short and sparse microblog data. This leads to inaccurate similarity calculations between texts and thereby lowers the quality of topic detection. Second, the traditional VSM assumes that the more words two documents share, the more similar they are. In fact, the similarity of two documents depends not only on literal word repetition but also on the semantic association of the context.

To address these problems, and in light of the characteristics of microblog, the paper uses the Latent Dirichlet Allocation (LDA) model to extract hidden topic information from the microblog dataset. The topic distribution is obtained by Gibbs sampling and combined with the VSM. Finally, the topics are detected using a multi-layer clustering method. In addition, the author has built a microblog text topic detection system that consists of a text collection and preprocessing subsystem and a topic results description subsystem. Experiments on a real dataset showed that the proposed method reduced the residual error rate, the fault detection rate and the detection cost.
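
As a rough illustration of why combining word-level and topic-level similarity can help with short text, the sketch below computes a VSM cosine similarity over raw term counts and blends it with a cosine similarity over per-document topic distributions. The abstract does not say how the two are combined, so the linear blend, the `weight` parameter, the function names and the example topic distributions are all assumptions made for illustration, not the thesis's method. The topic distributions are supplied by hand here; in practice they would come from an LDA model fitted with Gibbs sampling.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def vsm_vector(tokens):
    """Plain term-count vector (no TF-IDF weighting, to keep the sketch short)."""
    return dict(Counter(tokens))


def combined_similarity(tokens_a, tokens_b, theta_a, theta_b, weight=0.5):
    """Hypothetical blend: weight * word similarity + (1 - weight) * topic similarity.

    theta_a and theta_b are per-document topic distributions (lists of
    probabilities), assumed to come from an already fitted LDA model.
    """
    word_sim = cosine(vsm_vector(tokens_a), vsm_vector(tokens_b))
    topic_sim = cosine(dict(enumerate(theta_a)), dict(enumerate(theta_b)))
    return weight * word_sim + (1 - weight) * topic_sim


# Two short posts about the same event that share no literal words.
post_a = ["earthquake", "rescue", "team"]
post_b = ["quake", "relief", "workers"]

# Made-up topic distributions over two topics, for illustration only.
theta_a = [0.9, 0.1]
theta_b = [0.8, 0.2]

word_only = cosine(vsm_vector(post_a), vsm_vector(post_b))
blended = combined_similarity(post_a, post_b, theta_a, theta_b)
print(word_only, blended)
```

With no shared words, the word-only similarity is zero, while the blended score stays positive because the two posts have similar topic distributions. A clustering step would then operate on the blended score rather than on word overlap alone.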

Keywords/Search Tags:

Microblog, topic detection, LDA model, VSM (vector space model), multi-layer clustering

Related items

1	News Topic Detection Based On LDA Fusion Model And Multi-layer Clustering
2	Approach To Chinese News Topic Detection Based On Multi-Vector Model
3	Research On Multi-Level Topic Clustering Based On Cross Degree
4	Research And Implementation Of Hot Topic Detection On Microblog
5	Micro-blog Hot Topics Detection Method Based On Hybrid Clustering
6	Research And Implementation Of Sentiment Analysis Technology For Different Topics In Microblog Website
7	Research On Hot Topic Detection And Topic Evolution On Microblog
8	Research On Topic Detection And Tracking In Internet Public Opinion
9	Research On Web News Topic Organization And Acquisition System
10	Research On BBS Topic Detection And Tracking