Research On Chinese Micro-Blog Hot Topic Detection And Tracking

Posted on: 2012-02-15
Degree: Master
Type: Thesis
Country: China
Candidate: S P Sun
Full Text: PDF
GTID: 2178330335951457
Subject: Information management
ABSTRACT: Micro-blogging is a new platform for sharing and disseminating information quickly. It is characterized by a huge volume of scattered and diverse information. To help people not only obtain the dispersed information on micro-blogs but also keep up with hot topics and follow messages on the topics that interest them, this thesis studies hot topic detection and tracking for Chinese micro-blogs.

The thesis first analyzes the main characteristics of information and its dissemination on Chinese micro-blogs. Then, building on existing topic detection and tracking technologies for ordinary web pages, it studies the corresponding technologies for Chinese micro-blogs: web crawling, information extraction, hot topic detection, and topic tracking. The main results are as follows:

(1) A web crawling method based on time judgment and breadth-first traversal is proposed. The method adds a time analyzer to the web information acquisition process. The analyzer judges whether a page was generated earlier than the desired time, and that judgment determines whether the page is collected during the breadth-first traversal. This avoids collecting information that is too old to be useful, improves collection efficiency, and keeps the coverage of the collection at an appropriate level.

(2) The SP & HA clustering algorithm, based on the vector space model, is proposed for topic detection. Because language use on micro-blogs is flexible, both texts and topics are represented in the vector space model and processed by the SP & HA clustering algorithm. The topic detection process is divided into three stages: text modeling, preliminary topic detection, and topic combination. In preliminary topic detection, a modified Single-Pass clustering algorithm is used to improve detection efficiency. In topic combination, a modified agglomerative (coalescing) hierarchical clustering algorithm is used to improve detection quality.

(3) The calculation methods for weight, similarity, and heat are modified. A method for calculating feature weights and similarity is proposed that incorporates a semantic similarity table. This method not only reduces the calculation error caused by different expressions of the same meaning, but also improves calculation efficiency. In addition, heat functions for posts (bowen) and comments are proposed to calculate and rank the heat of detected topics and tracked texts. These help present the detection and tracking results in a more reasonable way.

(4) An adaptive topic tracking algorithm based on query vectors is proposed. Because the traditional query-vector-based topic tracking algorithm cannot handle topic drift, the query vectors are continually adjusted during tracking to follow the development of the topic. To make these adjustments more effective, noise is reduced by exploiting web relations and by distinguishing core feature items from non-core feature items.
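The preliminary detection stage in result (2) relies on Single-Pass clustering over vector-space representations. As a rough illustration only, the following Python sketch shows the textbook Single-Pass algorithm, not the modified version proposed in the thesis. The function names, the raw term-frequency weighting, the summed-vector centroid, and the threshold value are all assumptions made for the example. The thesis's own weighting uses a semantic similarity table that is not reproduced here.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dict-like)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass(posts, threshold=0.3):
    """Cluster tokenized posts in one pass over the input.

    Each post joins the most similar existing topic if the similarity
    reaches `threshold` (an assumed value); otherwise it starts a new topic.
    Returns a list of topics, each a dict with a summed term vector
    ("centroid") and the indices of its member posts ("members").
    """
    topics = []
    for i, tokens in enumerate(posts):
        # Assumption: raw term frequency stands in for the thesis's weights.
        vec = Counter(tokens)
        best, best_sim = None, 0.0
        for topic in topics:
            sim = cosine(vec, topic["centroid"])
            if sim > best_sim:
                best, best_sim = topic, sim
        if best is not None and best_sim >= threshold:
            best["centroid"].update(vec)  # Counter.update adds the counts
            best["members"].append(i)
        else:
            topics.append({"centroid": Counter(vec), "members": [i]})
    return topics


if __name__ == "__main__":
    # Hypothetical, already-tokenized posts used only for demonstration.
    demo_posts = [
        ["earthquake", "japan", "tsunami"],
        ["japan", "earthquake", "relief"],
        ["football", "final", "goal"],
        ["goal", "football", "match"],
    ]
    for topic in single_pass(demo_posts):
        print(topic["members"], dict(topic["centroid"]))
```

Because each post is compared only against the topics found so far, the result depends on input order and on the threshold. That is one reason a later merging stage, such as the hierarchical clustering mentioned in result (2), can be useful.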
Keywords/Search Tags: Micro-blog, Topic Detection, Topic Tracking, Web Crawling, Information Extraction