
Research And Implementation On The Technology Of Distributed Web Crawler Based On The Cloud Platform Of Storm

Posted on: 2016-04-06
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Fu
GTID: 2308330473452533
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, many new business models, such as O2O, have moved online, which has led to the creation of more and more websites and an ever larger volume of information resources. In this vast sea of information, people want to find what they need quickly, which makes search engine technology very important. This poses a new challenge for the web crawler, which is a key component of a search engine.

A traditional stand-alone web crawler cannot keep up with the rapid growth of data, which has led to the development of distributed web crawlers. A distributed web crawler runs on multiple machines and divides the work among them effectively, which increases crawling speed and improves overall crawler performance.

Taking the popular Sina microblog as its subject, this thesis designs and implements a scalable distributed web crawler system based on Storm. Sina microblog is also the data source of the system. The specific work completed in this thesis is as follows:

1. We explain in detail the requirements of the distributed web crawler system, including the goal of the system, the feasibility study, the functional requirements and the performance requirements. In the functional requirements, we divide the system into six modules, including the login simulation module, the URL queue module, the URL optimization module, the webpage download module and the webpage storage module. The requirements of each module are elaborated in detail.
2. We describe in detail the design of the distributed web crawler system for Sina microblog, including the database design and the system architecture. We focus on the overall architecture of the system and give a detailed description of the six modules.
3. We test the distributed web crawler system in terms of functionality and performance and analyze the results.
4. We summarize the thesis and analyze its existing problems and deficiencies. Finally, we propose directions for further study.
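
The abstract names a URL queue module and a URL optimization module and says the work is divided among machines, but it does not describe how either is implemented. The following is only a minimal Python sketch of one plausible reading, not the thesis's design: it assumes that "optimization" includes URL normalization and duplicate removal, and that URLs are split among workers by a stable hash. The names `normalize`, `UrlQueue` and `assign_worker` are hypothetical. In a Storm topology, a fields grouping on the URL would play a role comparable to `assign_worker`.

```python
from collections import deque
from urllib.parse import urlsplit, urlunsplit
import zlib


def normalize(url):
    """Lowercase scheme and host and drop the fragment (an assumed rule)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class UrlQueue:
    """FIFO queue of URLs that rejects URLs it has already seen."""

    def __init__(self):
        self._pending = deque()
        self._seen = set()

    def push(self, url):
        """Enqueue the URL; return False if an equivalent URL was seen before."""
        key = normalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        self._pending.append(key)
        return True

    def pop(self):
        """Return the next URL to download, or None when the queue is empty."""
        return self._pending.popleft() if self._pending else None

    def __len__(self):
        return len(self._pending)


def assign_worker(url, num_workers):
    """Map a URL to a worker index with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(normalize(url).encode("utf-8")) % num_workers
```

A stable hash is used here instead of Python's built-in `hash`, which is randomized per process for strings, so that separate machines would agree on which worker owns a given URL.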
Keywords/Search Tags: distributed system, web crawler, Storm, microblog