The Research And Implementation Of Distributed Topic Web Crawler Based On Nutch

Posted on: 2019-07-25
Degree: Master
Type: Thesis
Country: China
Candidate: X Jing
Full Text: PDF
GTID: 2428330548979587
Subject: Computer Science and Technology
Abstract/Summary:
In an age of information explosion, it is important to know how to use search engines to find useful information accurately. Although users can find the information they care about with a general-purpose search engine, the results also contain much irrelevant information. The topic-specified web crawler is an important component of a topic search engine, so studying it has both theoretical value and practical significance when topic search engines are used to improve the precision of information retrieval. Many big data processing tools, such as Hadoop and Spark, have been developed to handle computation tasks on big data. These tools use a computer cluster to carry out computation tasks that are usually handled by computers with large amounts of memory. After studying topic-specified search technologies, the open source search engine Nutch, and the learning automaton algorithm, this thesis proposes a distributed topic-specified web crawler based on an improved learning automaton algorithm. The crawler modifies the Fetch and Parse modules of Nutch and uses multiple seed URL acquisition strategies to improve the precision, recall, and running efficiency of topic crawling, enabling the crawler to adapt to the web. Finally, a set of simulation experiments was conducted to evaluate the performance of the proposed crawler. The simulations showed that the proposed crawler achieves better precision and efficiency.
Keywords/Search Tags:search engine, topic-specified web crawler, distribution, Nutch, learning automaton