A P2P Based Distributed Microblog Crawler System

Posted on: 2017-04-15
Degree: Master
Type: Thesis
Country: China
Candidate: Y Lu
Full Text: PDF
GTID: 2348330488997104
Subject: Computer technology
Abstract/Summary:
With the development of Internet technology, Microblog has become an important medium for spreading public information in near real time. Governments can use Microblog to monitor public opinion and respond quickly to emergencies, so acquiring Microblog data accurately and efficiently for data analysis is of great significance. Traditional web crawlers cannot acquire the complete Microblog data, and the Weibo API imposes many restrictions on its functionality and on the number of connections.

In view of these problems, this paper presents Chord-Crawler, a distributed Microblog crawler system based on Chord. With some modification, the system can acquire large-scale Microblog data continuously and efficiently. The crawler program combines simulated login technology with traditional web crawler technology and uses a bitmap to remove duplicate tasks. The system is built on the Chord module and assigns tasks according to each Microblog user's province information and a consistent hash function. Communication overhead is reduced by updating the Province-Node table. This paper also presents a dynamic insertion load balancing algorithm, which balances the nodes' loads and noticeably increases the system's efficiency. A simulation experiment shows that the dynamic insertion load balancing algorithm balances the loads efficiently. A comparison with three other structures shows that the system performs well and can provide researchers with adequate data.
Keywords/Search Tags: P2P, Chord, Web Crawler, Load Balance, Simulated Login
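
The abstract names two mechanisms that are easy to picture in code: bitmap-based removal of duplicate tasks, and province-based task assignment through a consistent hash on a Chord ring. The following is a minimal Python sketch of both, not the thesis's implementation. Everything in it is an assumption made for illustration: the names `Bitmap`, `chord_id`, `successor` and `build_province_table`, the 16-bit identifier space, the use of SHA-1 on province and node names, and the reading of the Province-Node table as a plain province-to-node lookup. The dynamic insertion load balancing algorithm is not sketched, because the abstract gives no details about it.

```python
import hashlib
from bisect import bisect_left


class Bitmap:
    """Fixed-size bitmap that remembers which task IDs were already queued.

    Assumption: task IDs are non-negative integers smaller than size_bits.
    """

    def __init__(self, size_bits):
        self.size = size_bits
        self.bits = bytearray((size_bits + 7) // 8)

    def test_and_set(self, index):
        """Return True if the bit was already set; set it in either case."""
        if not 0 <= index < self.size:
            raise IndexError("task ID outside bitmap range")
        byte, mask = index // 8, 1 << (index % 8)
        seen = bool(self.bits[byte] & mask)
        self.bits[byte] |= mask
        return seen


def chord_id(key, m=16):
    """Hash a string onto an m-bit identifier ring using SHA-1."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)


def successor(node_ids, key_id):
    """Return the first node ID at or clockwise after key_id.

    node_ids must be sorted; the lookup wraps around to the smallest ID.
    """
    i = bisect_left(node_ids, key_id)
    return node_ids[i % len(node_ids)]


def build_province_table(provinces, node_names, m=16):
    """Map each province to the node responsible for its hashed ID."""
    ids = {chord_id(name, m): name for name in node_names}
    ring = sorted(ids)
    return {p: ids[successor(ring, chord_id(p, m))] for p in provinces}


if __name__ == "__main__":
    seen_tasks = Bitmap(1_000_000)
    for task_id in (42, 7, 42):
        if seen_tasks.test_and_set(task_id):
            print("skip duplicate task", task_id)
        else:
            print("enqueue task", task_id)

    # Hypothetical node and province names, used only for illustration.
    table = build_province_table(
        ["Beijing", "Guangdong", "Sichuan"],
        ["node-a", "node-b", "node-c"],
    )
    for province, node in table.items():
        print(province, "->", node)
```

In this sketch, caching the province-to-node mapping in a local table means a crawler node can route a newly discovered user without performing a ring lookup each time, which is one plausible way to read the abstract's remark about reducing communication. Because the mapping uses successor lookup on a hash ring, adding a node moves only the provinces that fall into the new node's arc; all other entries stay where they were.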