Research On Microblog Topic Detection Based On VSM Model And LDA Model

Posted on: 2013-01-16
Degree: Master
Type: Thesis
Country: China
Candidate: B Huang
Full Text: PDF
GTID: 2248330371996010
Subject: Computer application technology
Abstract/Summary:
In recent years, with the rapid development and spread of Internet technology, the speed and volume of information dissemination have reached an unprecedented scale. As a new Internet medium with high penetration among Internet users, the microblog has become one of the main sources of information on the Internet. It differs from other network text in two ways. First, its content is relatively short (the main body usually contains fewer than 140 words). Second, it can be posted in real time from mobile phones, instant messaging software and so on, which produces large amounts of data in a short period of time. This data is large and noisy, so processing it takes considerable effort, and finding the required information accurately and efficiently is extremely difficult.

Topic detection technology merges the information distributed under the same topic, which greatly reduces the repetition rate of information. It helps users understand the links between different topics and quickly find the information they need most. Although topic detection algorithms based on the traditional VSM (Vector Space Model) have achieved good results and are widely applied, they have obvious shortcomings when dealing with large-scale short microblog text. First, the traditional VSM makes no special provision for short, sparse microblog data, which leads to inaccurate similarity calculations between texts and thereby lowers the quality of topic detection. Moreover, the traditional VSM assumes that the more words two documents share, the more similar they are.
However, the similarity of documents depends not only on literal word repetition but also on the semantic association of the context.

To address this, and in light of the characteristics of microblogs, the paper uses the Latent Dirichlet Allocation (LDA) model to extract hidden topic information from the microblog dataset. The topic distribution is obtained by Gibbs sampling and combined with the VSM. Finally, the topics are detected using a multi-layer clustering method. In addition, the author built a microblog topic detection system consisting of a text collection and preprocessing subsystem and a topic result description subsystem. Experiments on a real dataset showed that the proposed method decreased the residual error rate and the fault detection rate and reduced the detection cost.
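The limitation of word-overlap similarity, and the idea of combining it with a topic distribution, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not say how the LDA topic distribution is combined with the VSM, so the linear weighting and the parameter `lam` are hypothetical, and the topic distributions `theta_a` and `theta_b` are hand-written stand-ins rather than output of Gibbs sampling.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def term_vector(text):
    """Raw term-frequency vector (a plain VSM representation, no TF-IDF)."""
    return dict(Counter(text.lower().split()))


def combined_similarity(text_a, text_b, theta_a, theta_b, lam=0.5):
    """Hypothetical combination: weighted sum of word-level and topic-level
    cosine similarity. The weighting scheme is assumed, not from the thesis."""
    vsm_sim = cosine(term_vector(text_a), term_vector(text_b))
    topic_sim = cosine(dict(enumerate(theta_a)), dict(enumerate(theta_b)))
    return lam * vsm_sim + (1.0 - lam) * topic_sim


# Two short posts about the same event that share no words.
post_a = "earthquake hits coastal city"
post_b = "tremor shakes seaside town"

# Made-up document-topic distributions over two topics (illustrative only).
theta_a = [0.9, 0.1]
theta_b = [0.8, 0.2]

print(cosine(term_vector(post_a), term_vector(post_b)))        # 0.0
print(combined_similarity(post_a, post_b, theta_a, theta_b))
```

With no shared words, the plain VSM similarity is zero even though the posts describe the same event, while the combined score stays well above zero because the two topic distributions are close.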
Keywords/Search Tags: Microblog, topic detection, LDA model, VSM (vector space model), multi-layer clustering