Research on Web Crawler Technology Based on Distributed Computing

Posted on: 2012-06-23
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Me
Full Text: PDF
GTID: 2178330335455401
Subject: Information and Communication Engineering
Abstract/Summary:
With the development of Internet technology, website construction has matured, and both the number of websites and the volume of information they carry grow year by year. Because people's work and life depend on ever more of this information, search engines have become correspondingly important: they have entered everyday use and influence people's lives more and more. Within a search engine, the web crawler plays an important role and affects the system in many respects.

A web crawler running on a single machine has limited crawling capacity and cannot keep pace with the rate at which links must be refreshed. This motivates crawler technology based on distributed systems: a wide-area, widely distributed system in which many machines cooperate, eliminating the slowdown caused by the dispersion of websites and by slow access to them. In this way the computation over huge volumes of data can be accelerated and the crawler's overall performance improved. The other application of distribution is storage. Storage design is an important part of a web crawler, because how the crawled page data are stored affects the performance of the whole system. Given the enormous data throughput, simple database storage cannot meet the demand, so distributed clustered storage is the preferred solution.

Building on these techniques, this paper develops a web crawler based on the Hadoop distributed system, implemented in Java on the Linux platform. The system offers high crawl speed, wide coverage, good scalability, and strong portability. The paper studies and discusses in detail the overall design framework of the distributed crawler system and the concrete realization of its modules: the complete system architecture, the realization of each module as MapReduce jobs, and the detailed implementation process of every module.

Finally, to verify the characteristics of the distributed crawler, a Hadoop distributed test environment was set up. A detailed test program was designed covering three aspects: functional testing, performance testing, and scalability testing. Following this program, tests were run on real data, and detailed performance parameters of the system were obtained from the test results.
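To make the MapReduce formulation concrete, the following is a minimal sketch in Java, assumed rather than taken from the thesis, of how the fetch step of a distributed crawler can be expressed as a Hadoop job: the input is a text file of URLs, each map task downloads one page, and the (URL, page) pairs form the job output. All class and job names here (CrawlFetch, FetchMapper, "crawl-fetch") are illustrative.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CrawlFetch {

    public static class FetchMapper
            extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String url = line.toString().trim();
            if (url.isEmpty()) {
                return; // skip blank lines in the URL list
            }
            try {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(url).openConnection();
                conn.setConnectTimeout(5000);
                conn.setReadTimeout(5000);
                StringBuilder page = new StringBuilder();
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(conn.getInputStream()))) {
                    String s;
                    while ((s = in.readLine()) != null) {
                        page.append(s).append(' '); // keep one record per line
                    }
                }
                // Emit (url, html); a later job can parse out new links.
                context.write(new Text(url), new Text(page.toString()));
            } catch (IOException e) {
                // A dead or slow host costs only this record, not the job.
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "crawl-fetch");
        job.setJarByClass(CrawlFetch.class);
        job.setMapperClass(FetchMapper.class);
        job.setNumReduceTasks(0); // map-only: fetch and write out directly
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // URL list
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // fetched pages
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Because the URL list is split across map tasks, adding machines to the cluster directly increases fetch throughput, which is the scalability property the tests above measure.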
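The storage side can be illustrated the same way. Below is a minimal sketch, again an assumption rather than the thesis's code, of writing a fetched page into HDFS through the Hadoop FileSystem client, so that crawl output lives in the distributed clustered store instead of a single-machine database; the namenode address and file paths are placeholders.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PageStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // example address
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out =
                     fs.create(new Path("/crawl/pages/example.html"))) {
            // HDFS replicates each block across data nodes, so crawl output
            // survives single-machine failures and scales with the cluster.
            out.write("<html>...</html>".getBytes(StandardCharsets.UTF_8));
        }
    }
}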
Keywords/Search Tags:Distributed computing, Web crawler, Search engine, Hadoop