Crawling Strategy And Parsing Method Of Focused Crawlers Based On Hadoop Platform

Posted on: 2016-05-18 | Degree: Master | Type: Thesis
Country: China | Candidate: M M Li | Full Text: PDF
GTID: 2348330476955763 | Subject: Software engineering
Abstract/Summary:
With the rapid growth of the Internet and of the Web resources on it in recent years, vertical search engines, which have a focused crawler as their core component, have become popular among users for their relevance and accuracy. Existing crawlers, however, have several problems: (1) crawling strategies consider only link structure or only Web content, although the relation among link contents is also an important factor; (2) most current page-parsing methods are impractical and give unsatisfactory results; (3) most parsing algorithms embedded in crawler systems are hard to extend and are not applicable to mass data processing, whereas the Hadoop platform is now widely used for handling huge amounts of data because of its high reliability and good scalability. Research on Web crawling and parsing methods in a Hadoop environment therefore has high theoretical value and practical significance.

To address these problems, this thesis makes the following three contributions:

(1) Because existing crawling methods are inefficient, this thesis analyses the URL analysis methods of existing focused crawlers and proposes a URL analysis algorithm based on semantic content and link clustering (ALCSC). The algorithm clusters the downloaded URLs on the basis of the vector space model (VSM) and uses the correlation between downloaded URLs and new URLs to improve the precision of the focused crawler. It also takes both Web content and link content into account, so that pages related to the given topic are collected accurately and effectively.

(2) Because most parsing algorithms give unsatisfactory results and are poorly suited to big data environments, this thesis also analyses the advantages and disadvantages of existing Web page parsing algorithms in order to further improve the running speed of crawlers. Based on the observed characteristics of the tag paths of noise text and target text in Web pages, it proposes a new page parsing algorithm built on the DOM tree that applies division, merging and degree reduction in the Map/Reduce model. This strategy not only parses pages effectively but also runs faster on Hadoop, which indicates that it is well suited to cluster computing.

(3) Finally, both algorithms are implemented and verified. The first experiment extends the Heritrix crawler through secondary development to determine the best values of the algorithm's two parameters. A comparison of the proposed algorithm with Best-First Search and Shark-Search shows the accuracy, effectiveness and learning ability of the crawling algorithm. The second experiment tests the parsing algorithm on two data sets and verifies its advantages in precision, recall and F value. Comparing the same algorithm running on Hadoop and in a conventional environment demonstrates its satisfactory parsing results and its applicability to big data environments.
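
The abstract does not spell out the steps of ALCSC, so the following is only a minimal sketch of the general idea behind contribution (1): represent the text attached to downloaded URLs as VSM term vectors, cluster them, and score a new URL by its similarity to the nearest cluster. The function names, the single-pass clustering scheme and the `threshold` value are illustrative assumptions, not the thesis's method.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (a simple VSM representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(vectors, threshold=0.3):
    """Single-pass clustering (an assumption, not ALCSC itself):
    a vector joins the most similar centroid if the similarity reaches
    the threshold, otherwise it starts a new cluster."""
    centroids = []
    for vec in vectors:
        best = max(centroids, key=lambda c: cosine(c, vec), default=None)
        if best is not None and cosine(best, vec) >= threshold:
            best.update(vec)  # summed vector; cosine is scale-invariant
        else:
            centroids.append(Counter(vec))
    return centroids


def score_url(link_text, centroids):
    """Priority of a new URL: similarity of its link text to the nearest cluster."""
    vec = tf_vector(link_text)
    return max((cosine(c, vec) for c in centroids), default=0.0)


downloaded = [
    "hadoop mapreduce cluster tutorial",
    "hadoop cluster setup guide",
    "chocolate cake recipe",
]
centroids = cluster([tf_vector(t) for t in downloaded])
on_topic = score_url("hadoop cluster", centroids)
off_topic = score_url("football scores", centroids)
```

In a crawler, a score of this kind would typically order the URL frontier so that links resembling already downloaded, topic-relevant pages are fetched first.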
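
The tag-path idea in contribution (2) can be illustrated in a similar way. The sketch below is not the thesis's algorithm, whose division, merging and degree-reduction steps are not described in the abstract. It only assumes that noise text (menus, link lists) tends to sit on tag paths with short text nodes, and it mimics a map step that emits (tag path, text) pairs and a reduce step that aggregates them per path. It runs as plain Python, not on Hadoop, and `min_avg_len` is a hypothetical threshold.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Records a (tag path, text) pair for every non-empty text node."""
    VOID = {"br", "img", "hr", "meta", "link", "input"}
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not self.SKIP.intersection(self.stack):
            self.pairs.append(("/".join(self.stack), text))


def map_page(html):
    """Map step: emit (tag_path, text) pairs for one page."""
    parser = TagPathCollector()
    parser.feed(html)
    parser.close()
    return parser.pairs


def reduce_paths(pairs):
    """Reduce step: per tag path, count text nodes and total characters."""
    stats = defaultdict(lambda: [0, 0])
    for path, text in pairs:
        stats[path][0] += 1
        stats[path][1] += len(text)
    return stats


def extract_main_text(html, min_avg_len=40):
    """Keep text whose tag path has a high average text length per node."""
    pairs = map_page(html)
    stats = reduce_paths(pairs)
    keep = {p for p, (n, chars) in stats.items() if chars / n >= min_avg_len}
    return [text for path, text in pairs if path in keep]


para = "Focused crawlers collect pages that are related to a given topic."
page = (
    "<html><body><ul><li><a href='/'>Home</a></li>"
    "<li><a href='/about'>About</a></li></ul>"
    "<div><p>" + para + "</p></div></body></html>"
)
main_text = extract_main_text(page)
```

Because each page is mapped independently and the per-path statistics are simple sums, this shape of computation distributes naturally across a cluster, which is the property the abstract attributes to the Map/Reduce formulation.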
Keywords/Search Tags: Hadoop, Crawling Strategy, Page Parsing, DOM Tree