
Research On Topic Based Micro-blog Web Crawler

Posted on: 2015-04-24
Degree: Master
Type: Thesis
Country: China
Candidate: X H Zeng
Full Text: PDF
GTID: 2298330452450763
Subject: Computer application technology
Abstract/Summary:
With the rapid development of Twitter in the USA, domestic micro-blog services have sprung up, and micro-blogging has become increasingly popular among netizens. Network buzzwords born on micro-blogs quickly catch on, a micro-blog network effect has gradually formed, and micro-blogging has become one of the main online activities of Chinese netizens. Because of this network effect, micro-blog topics spread quickly among netizens, so crawling and analyzing micro-blog messages has become an important issue. To support data collection, the major micro-blog sites provide micro-blog APIs, but these always come with a variety of restrictions, cannot meet the need to obtain large amounts of micro-blog data, and return data that is often messy. This thesis therefore proposes a topic-based micro-blog web crawler.
The main work of this thesis includes: analyzing web page analysis techniques and choosing a micro-blog page information acquisition method according to the features of micro-blog pages; exploring and designing in detail a breadth-first search strategy based on "pruning", with particular effort on how to avoid re-collecting the same URLs and how to keep the URL collection dynamically updated; studying short-text topic extraction and multi-keyword matching techniques in order to design a scheme for scraping topically relevant micro-blogs; and finally designing and implementing a prototype topic-based micro-blog web crawler that scrapes and stores micro-blog data in real time.
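The multi-keyword matching step mentioned above can be illustrated with a minimal sketch. The abstract does not say which matching algorithm the thesis uses, so this example assumes a plain case-insensitive substring test; the names `matched_keywords`, `is_on_topic`, and `min_hits` are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List


def matched_keywords(text: str, keywords: Iterable[str]) -> List[str]:
    """Return the keywords found in text (case-insensitive substring match).

    Assumption: simple substring search, not the thesis's actual matcher.
    """
    lowered = text.lower()
    return [kw for kw in keywords if kw and kw.lower() in lowered]


def is_on_topic(text: str, keywords: Iterable[str], min_hits: int = 1) -> bool:
    """Treat a post as topical when at least min_hits distinct keywords occur."""
    return len(set(matched_keywords(text, keywords))) >= min_hits


if __name__ == "__main__":
    topic = ["world cup", "football"]  # hypothetical topic keywords
    posts = [
        "Watching the World Cup final tonight!",
        "New phone arrived today.",
    ]
    for post in posts:
        print(is_on_topic(post, topic), post)
```

Substring matching is used here because it also works for text without whitespace word boundaries, such as Chinese micro-blog posts. For a large keyword set, a dedicated multi-pattern algorithm would likely be a better fit.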
The core issue of this thesis is to design, according to the characteristics of micro-blog data, a breadth-first search strategy based on "pruning" and to apply it to the micro-blog crawler. By using page analysis technology, the crawler avoids the limitations of the micro-blog platform API and collects topical micro-blog data as accurately as possible.
Experimental results were obtained through repeated experiments on the prototype system and compared with the results of an API-based micro-blog crawler and a web-based micro-blog crawler. The conclusion is that the strategy proposed in this thesis can scrape topically relevant micro-blog data: although efficiency decreases, the collected micro-blog data has better topical relevance. This conclusion verifies that the thesis topic is feasible in practical applications.
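A breadth-first crawl with pruning and URL de-duplication can be sketched as follows. This is not the thesis's implementation: the abstract does not state the pruning criterion, so the sketch assumes a page is pruned (its links are not expanded) when it is judged off-topic. The function `pruned_bfs`, its callable parameters, and the in-memory `PAGES` graph are hypothetical stand-ins for real page fetching and parsing.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical in-memory "web": url -> (page text, outgoing links).
PAGES: Dict[str, Tuple[str, List[str]]] = {
    "u/a": ("world cup tonight", ["u/b", "u/c"]),
    "u/b": ("lunch photos", ["u/d"]),
    "u/c": ("world cup goals", ["u/a", "u/e"]),
    "u/d": ("world cup replay", []),
    "u/e": ("weather", []),
}


def pruned_bfs(
    seeds: Iterable[str],
    get_text: Callable[[str], str],
    get_links: Callable[[str], Iterable[str]],
    is_relevant: Callable[[str], bool],
    max_pages: int = 100,
) -> List[str]:
    """Visit pages breadth-first; expand links only from relevant pages."""
    seen = set()      # every URL ever enqueued, used for de-duplication
    queue = deque()   # FIFO frontier gives breadth-first order
    for url in seeds:
        if url not in seen:
            seen.add(url)
            queue.append(url)

    relevant: List[str] = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        if not is_relevant(get_text(url)):
            continue  # prune: an off-topic page contributes no new links
        relevant.append(url)
        for link in get_links(url):
            if link not in seen:  # skip URLs already collected
                seen.add(link)
                queue.append(link)
    return relevant


def demo() -> List[str]:
    return pruned_bfs(
        seeds=["u/a"],
        get_text=lambda u: PAGES[u][0],
        get_links=lambda u: PAGES[u][1],
        is_relevant=lambda text: "world cup" in text,
    )


if __name__ == "__main__":
    print(demo())  # ['u/a', 'u/c']
```

In this toy graph, `u/d` is on-topic but is never reached because its only parent `u/b` is pruned. Pruning therefore narrows the crawl to topical regions at the cost of missing some relevant pages, which is one way to picture the trade-off between efficiency and topical relevance that the abstract reports.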
Keywords/Search Tags:web page analysis, micro-blog crawler, micro-blog crawler strategy, topic correlation analysis