Research On Web Crawlers In A Distributed Parallel Environment

Posted on:2016-07-26

Degree:Master

Type:Thesis

Country:China

Candidate:R L Zhu

GTID:2208330470966819

Subject:Computer application technology

Abstract/Summary:

With the development of the Internet, websites and their data have become ever larger and more complex. People demand more information from the Internet than ever before and often depend on search engines to find it. As the data source of a search engine, the web crawler plays an important role. Crawler indicators such as crawling speed, coverage, page rank, indexing, and timeliness directly affect search results.

Meanwhile, demand for deeply integrated information is widespread. Many companies, organizations, and individuals continue to research and develop new crawlers, especially topic-focused crawlers. In an enterprise, information gathered by a web crawler can serve as a data source for a data warehouse, where it is presented along multiple dimensions, and as a source for data mining. For example, a public opinion monitoring system needs to collect relevant information from the Internet. Real estate businesses use crawlers to collect real estate information for analysis and decision-making. Some people also use crawlers to mine information and gather intelligence from the Internet.

However, a traditional crawler running on a single computer struggles to cope with the challenges brought by the rapid growth of information, and it is difficult for it to fetch massive amounts of data quickly and effectively. Distributed technology supports large clusters and massive shared storage. It can use the CPU of every node to increase total computing power, and it offers greater total bandwidth. It fundamentally addresses the crawler's efficiency problem and also reduces IT operating costs, because it relies on inexpensive personal computers instead of expensive servers.

Based on the structure of websites and the principles of web pages, this thesis analyzes the crawler's principles, workflow, crawling strategies, web page parsing methods, and other related theory.
To improve crawling efficiency, the crawler is optimized using the distributed cluster features of Hadoop. A configurable, high-performance, load-balanced, and scalable distributed web crawler prototype system based on Hadoop is designed and implemented. The thesis describes and analyzes the system's architecture, its implementation approach, and the design and implementation of several key modules in the context of distributed cluster technology. It also gives solutions to several key technical issues: the design of the URL queue, removal of duplicates from massive URL sets, multi-threaded parallel crawling, incremental updating of web pages, and parsing of dynamic web pages. Finally, the crawling performance is tested and analyzed.
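
Three of the issues named above (the URL queue, duplicate URL removal, and multi-threaded parallel crawling) can be illustrated with a small single-process sketch. The abstract does not say how the thesis solves them on Hadoop, so everything below is an assumption made for illustration: the names `normalize`, `Frontier`, `crawl`, `FAKE_WEB`, and `fake_fetch_links` are hypothetical, duplicates are detected with an in-memory set of SHA-256 fingerprints, and page fetching is replaced by a lookup in a fake link graph so the example runs without network access.

```python
import hashlib
import queue
import threading
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize(url):
    """Canonicalize a URL so trivially different spellings compare equal."""
    url, _fragment = urldefrag(url)              # drop "#section" fragments
    parts = urlsplit(url)
    path = parts.path or "/"                     # treat an empty path as "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))


class Frontier:
    """Thread-safe URL queue that refuses URLs it has already accepted."""

    def __init__(self):
        self.tasks = queue.Queue()
        self._seen = set()                       # fixed-size URL fingerprints
        self._lock = threading.Lock()

    def add(self, url):
        canonical = normalize(url)
        fingerprint = hashlib.sha256(canonical.encode("utf-8")).digest()
        with self._lock:
            if fingerprint in self._seen:
                return False                     # duplicate: do not enqueue
            self._seen.add(fingerprint)
        self.tasks.put(canonical)
        return True


def crawl(seeds, fetch_links, workers=4):
    """Crawl from the seeds with several threads; return the URLs processed."""
    frontier = Frontier()
    crawled = []
    crawled_lock = threading.Lock()

    def worker():
        while True:
            url = frontier.tasks.get()
            if url is None:                      # sentinel: no more work
                return
            try:
                for link in fetch_links(url):
                    frontier.add(link)           # duplicates are dropped here
                with crawled_lock:
                    crawled.append(url)
            finally:
                frontier.tasks.task_done()

    for seed in seeds:
        frontier.add(seed)
    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    frontier.tasks.join()                        # wait until every URL is processed
    for _ in threads:
        frontier.tasks.put(None)                 # tell each worker to exit
    for thread in threads:
        thread.join()
    return crawled


# Hypothetical link graph standing in for real HTTP fetching and parsing.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://EXAMPLE.com/b#top"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/b": ["http://example.com/a#x"],
}


def fake_fetch_links(url):
    return FAKE_WEB.get(url, [])


if __name__ == "__main__":
    print(sorted(crawl(["http://example.com"], fake_fetch_links)))
```

Storing fingerprints instead of full URLs keeps the memory cost per URL constant, which matters once the URL set becomes massive. In a distributed setting the seen-set and the queue would have to be shared or partitioned across nodes, which this single-process sketch does not attempt.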

Keywords/Search Tags:

distributed cluster, URL queue, parallel crawling, dynamic web page analytic, distributed crawler

Related items

1	Distributed Web Crawler System
2	Design And Implementation Of A Distributed Dynamic Web Crawler System
3	Design And Implement Of Distributed Commodity Information Web Crawler System
4	Vertical Search Engine For Crawling The Web Page Design And Implementation
5	Research And Implementation Of Distributed Internet Information Crawling System For Cyber Security
6	Research On Customized Web Information Crawling And Pushing Techniques
7	Research And Optimization Of Dynamic Web Crawler Based On Webmagic
8	The Video Download Method And Distributed Crawling System Design And Implementation
9	Design And Implementation Of Customized Distributed Web Crawler
10	Research And Implementation Of Distributed Crawler Technology