
Distributed Crawler Based On Hadoop

Posted on: 2018-10-06
Degree: Master
Type: Thesis
Country: China
Candidate: X J Zhou
Full Text: PDF
GTID: 2348330515466755
Subject: Computer Science and Technology
Abstract/Summary:
As computer science advances daily, the explosive growth of information and data poses a great challenge to the accuracy and relevance of search engine results. With the rapid development of network media as a carrier of information, the web crawler, as a tool for information collection, plays an increasingly important role in gathering big data. Collecting massive data challenges a single computer's capacity for both computation and storage. Through research on distributed crawlers, this thesis presents approaches for collecting information, parsing data, and improving crawl accuracy on the Hadoop framework.

Firstly, since Hadoop excels at offline batch processing of big data, this thesis addresses URL traversal in a distributed setting on that basis: the crawler extracts links from pages and visits the web level by level, which allows it to run efficiently on the distributed framework. For the URL-seen test in a distributed setting, this thesis compares the centralized test and shard-based assignment, which are commonly used for distributed URL-seen testing, and proposes a distributed Bloom filter. The distributed Bloom filter keeps the URL-seen records of each partition independent and performs the test offline, which suits the Hadoop framework. This reduces the interaction between the master node and the worker nodes during URL-seen testing, as well as the wasted visits to duplicate pages. On this basis, the thesis proposes a method to detect updates of web documents, avoiding redundant visits to static pages while acquiring updated content in time.

Secondly, to realize focused crawling on the distributed framework, this thesis proposes a focused crawler based on keyword expansion. The expansion uses the TextRank model to extract keywords from relevant documents, and the expanded keyword set strengthens the expressive power of the topic description. To handle tunneling in focused crawling, some irrelevant links are retained so that target pages can still be reached through the tunnel, improving document recall; relevant branches are then given higher priority to be searched more deeply.

Finally, based on Hadoop's input and output formats, the modules of each part of the crawler are designed to implement the crawler on Hadoop. Experimental results show that the distributed crawler is better able to store massive data, and that it parses and processes the data more effectively when handling massive collections.
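The thesis implementation itself is not reproduced in this abstract. As a rough illustration of the per-partition URL-seen test that a Bloom filter performs, the following minimal, self-contained Java sketch checks and records URLs in a bit array; the class name UrlSeenFilter, the filter size, the number of hash functions, and the double-hashing scheme are illustrative assumptions, not the thesis's code.

import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Minimal Bloom filter sketch for a URL-seen test (illustrative, not the thesis implementation). */
public class UrlSeenFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    public UrlSeenFilter(int size, int numHashes) {
        this.size = size;
        this.numHashes = numHashes;
        this.bits = new BitSet(size);
    }

    // Simple FNV-style hash with a per-call seed; stands in for a real hash family.
    private int hash(byte[] data, int seed) {
        int h = seed;
        for (byte b : data) {
            h ^= b;
            h *= 16777619;
        }
        return h;
    }

    // Derive the i-th bit position via double hashing: h1 + i * h2, reduced modulo the filter size.
    private int position(byte[] data, int i) {
        int h1 = hash(data, 0x9747b28c);
        int h2 = hash(data, 0x5bd1e995);
        return Math.floorMod(h1 + i * h2, size);
    }

    /** Returns true if the URL was possibly seen before; false means it is definitely new. */
    public boolean seenOrAdd(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        boolean allSet = true;
        for (int i = 0; i < numHashes; i++) {
            int pos = position(data, i);
            if (!bits.get(pos)) {
                allSet = false;
                bits.set(pos);
            }
        }
        return allSet;
    }

    public static void main(String[] args) {
        UrlSeenFilter filter = new UrlSeenFilter(1 << 20, 5);
        System.out.println(filter.seenOrAdd("http://example.com/a")); // false: first visit
        System.out.println(filter.seenOrAdd("http://example.com/a")); // true: duplicate
    }
}

In the distributed design described above, each partition would keep its own filter of this kind, so the master node need not be consulted for every candidate URL and duplicate pages are filtered locally.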
Keywords/Search Tags: Distributed architecture, web crawler, Bloom filter, focused crawling, Hadoop