Research And Implementation Of Microblog Crawler

Posted on: 2014-08-09
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Guo
GTID: 2268330425965001
Subject: Software engineering
Abstract/Summary:
With the development of mobile communication networks and Web 2.0 technology, microblogging has gradually become a basic tool for daily communication and entertainment. More and more people use it to spread advertisements, news, topics, and other information. Because microblogs are open and anonymous, they also carry a great deal of negative information, such as rumors, violence, and reactionary content, which makes the guidance and supervision of public opinion very difficult. Research on collecting microblog network data therefore contributes both to modeling and optimizing information dissemination and to monitoring and analyzing public opinion on microblog networks, and it has both research and practical significance.

In this thesis we take Sina microblog as the target site. After surveying the current mainstream crawler frameworks, we design and implement an efficient incremental Sina microblog crawler. Our main work is as follows:

1. Based on the information-extraction requirements, we analyze the information structure of Sina microblog and collect each user's basic information, tags and followed topics, social relationships (friends and fans), and microblogs. We then determine which web information is useful to crawl and design the corresponding database. The crawler simulates a browser to download the home page and converts the page source into a Document Object Model (DOM) tree. It uses XPath expressions to extract data from the page, and uses Hibernate and Spring data persistence technology for storage.

2. In the detailed design stage, the thesis implements an improved technique for filling out forms automatically. We use packet-capture software to crack the encryption protocol of the Sina microblog login, then fill out the login form and simulate a browser logging in to Sina microblog. To make the crawler efficient, we design and implement an incremental multithreaded Sina microblog crawler based on the multi-producer, multi-consumer model. To further improve crawling efficiency, we use the Sina microblog API to assist in collecting users' social information.

3. We study crawling strategies for microblog networks in depth. Because users publish at different frequencies, crawling all users at the same frequency wastes a great deal of bandwidth and network resources. We therefore propose a microblog crawling strategy based on user activity. We use the collected publishing times of each user's microblogs to estimate how active the user is, and use time series analysis to predict how many microblogs the user will publish in the next time period. The more microblogs a user is predicted to publish, the higher that user's activity, and the more frequently the crawler collects from that user. According to our experimental results, coverage and timeliness improve significantly compared with a simple depth-first microblog crawling strategy.
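The activity-based strategy in item 3 can be illustrated with a minimal Python sketch. The abstract does not say which time series method the thesis uses, so simple exponential smoothing stands in here purely as an assumption. The function names, the smoothing factor `alpha`, and the rule that maps a predicted post count to a revisit interval are all hypothetical and are not taken from the thesis.

```python
from datetime import datetime, timedelta


def posts_per_period(timestamps, start, period, n_periods):
    """Count how many posts fall into each of n_periods consecutive windows."""
    counts = [0] * n_periods
    for ts in timestamps:
        index = (ts - start) // period  # timedelta // timedelta gives an int
        if 0 <= index < n_periods:
            counts[index] += 1
    return counts


def predict_next_count(counts, alpha=0.5):
    """Forecast the next period's post count with simple exponential smoothing.

    Assumption: this stands in for the unspecified time series method.
    """
    if not counts:
        return 0.0
    level = float(counts[0])
    for observed in counts[1:]:
        level = alpha * observed + (1 - alpha) * level
    return level


def crawl_interval(predicted, period, min_interval):
    """Revisit a user about once per expected post, within [min_interval, period].

    Assumption: this mapping is illustrative only.
    """
    if predicted <= 0:
        return period
    return max(min_interval, min(period, period / predicted))


if __name__ == "__main__":
    day = timedelta(days=1)
    floor = timedelta(minutes=10)
    active = predict_next_count([4, 6, 8])   # user who posts often
    quiet = predict_next_count([0, 1, 0])    # user who rarely posts
    print("active user interval:", crawl_interval(active, day, floor))
    print("quiet user interval:", crawl_interval(quiet, day, floor))
```

Under these assumptions an active user is revisited several times a day, while a quiet user is revisited only once per period, which is the bandwidth saving the strategy aims for.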
Keywords/Search Tags: Web crawler, XPath extraction, microblog network, data collection