Design And Implementation Of IT-oriented Distributed Topic Crawler

Posted on: 2017-08-25
Degree: Master
Type: Thesis
Country: China
Candidate: W Lu
Full Text: PDF
GTID: 2348330485455636
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, Web information is growing exponentially. A traditional stand-alone, multi-threaded topical crawler is not suited to crawling such a mass of information. Against this background, the emergence of cloud computing offers a good solution to the big data problem, and the distributed platform of the Apache project Hadoop has received widespread attention from industry. Based on an analysis of the topic crawler framework, this thesis examines the crawler's key modules, such as the topic module, the topic relevance discrimination module, and the page download module. A feature selection experiment is then carried out to construct a vector that can represent the topic. For the topic crawling strategy, the thesis first analyzes the respective disadvantages of content-based and link-based crawling strategies and then proposes a combined algorithm based on the Shark-Search and PageRank algorithms. The architecture and workflow of the open source crawler Nutch are analyzed first, the topic discrimination module is then added to Nutch, and finally the relevant modules are tested. The results show that the combined focused crawler algorithm is effective and that the design scheme is feasible.
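The abstract does not give the formula used to combine the content-based and link-based scores, so the following is only a minimal sketch of the general idea, under stated assumptions. The content score here is a plain cosine similarity between a topic term vector and a link's anchor text, which is a simplified stand-in for Shark-Search (the full algorithm also uses inherited parent scores and surrounding context). The function names, the weight `alpha`, and the max-normalization of PageRank are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector from whitespace-separated text."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term vectors (dict-like)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pagerank(links, damping=0.85, iterations=50):
    """Plain iterative PageRank over a dict: page -> list of outgoing links."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new_rank[v] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
        rank = new_rank
    return rank


def combined_priority(relevance, rank, max_rank, alpha=0.7):
    """Hypothetical linear mix of content relevance and normalized PageRank."""
    link_score = rank / max_rank if max_rank > 0 else 0.0
    return alpha * relevance + (1.0 - alpha) * link_score


if __name__ == "__main__":
    topic = term_vector("hadoop distributed crawler nutch")
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    anchors = {"a": "hadoop cluster setup", "b": "cooking recipes", "c": "nutch crawler guide"}
    ranks = pagerank(graph)
    top = max(ranks.values())
    queue = sorted(
        graph,
        key=lambda u: combined_priority(
            cosine_similarity(topic, term_vector(anchors[u])), ranks[u], top
        ),
        reverse=True,
    )
    print(queue)  # URLs ordered by the combined score, highest first
```

A linear mix is used only because it is the simplest way to let a highly linked page with weak anchor text still be visited, and vice versa; the thesis may combine the two signals differently.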
Keywords/Search Tags: Focused Crawler, Feature Selection, Nutch, Search Strategy