Research And Implementation Of Web Information Automatically Crawling In Vertical Search

Posted on: 2016-10-25
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Zhang
GTID: 2428330491960150
Subject: Computer Science and Technology
Abstract/Summary:
With the continuous innovation and development of Internet technology, the amount of information on the Internet has grown explosively. Vertical search engines provide users with specialized, comprehensive, high-quality search results. These results depend on a large volume of accurate industry data, and acquiring that data is the job of the vertical web crawler. A vertical web crawler selectively extracts information and URLs from web pages according to a set of extraction rules, then stores the structured information in a database for use by the vertical search engine. This is the main flow of web information crawling.

The pages a vertical web crawler must crawl are numerous, change frequently, and are semi-structured. These characteristics lead to low crawling coverage and to crawled page information that easily becomes outdated. Low crawling coverage affects the comprehensiveness of the vertical search engine's results, outdated page information affects their validity, and incorrectly extracted information affects their accuracy.

By analyzing and summarizing these three issues, this thesis proposes an automatic page discovery mechanism, an automatic page re-access mechanism, and an extraction rule failure alarm mechanism. Together they address the vertical web crawler's problems of crawling coverage, outdated page information, and failed extraction rules. First, the thesis introduces the automatic page discovery mechanism, built as secondary development on the open-source crawler framework Scrapy, which improves the comprehensiveness and accuracy of the vertical web crawler. Second, it introduces the automatic page re-access mechanism, implemented on the Spring and Hibernate frameworks, which ensures the validity and accuracy of the vertical web crawler. Third, it introduces the extraction rule failure alarm mechanism, which monitors the crawling process and raises an alarm automatically when extraction rules fail. Finally, a large number of experiments were carried out to verify the validity and efficiency of the three automated mechanisms.
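
The abstract does not say how the extraction rule failure alarm decides that a rule has failed. As an illustration of the general idea only, and not the thesis's implementation, the following minimal Python sketch assumes that a rule counts as failed when a large share of recently crawled pages yield items with missing required fields. The class name `ExtractionRuleMonitor`, the parameters `window` and `threshold`, and the field names `title` and `price` are all hypothetical.

```python
from collections import deque


class ExtractionRuleMonitor:
    """Hypothetical monitor: tracks recent extraction outcomes for one rule."""

    def __init__(self, required_fields, window=50, threshold=0.5):
        self.required_fields = tuple(required_fields)  # fields each item should carry (assumed)
        self.window = window          # number of recent pages to consider (assumed)
        self.threshold = threshold    # failure ratio that triggers the alarm (assumed)
        self._outcomes = deque(maxlen=window)  # True = extraction looked fine

    def record(self, item):
        """Record one extracted item (a dict); empty or missing fields count as failure."""
        ok = all(item.get(name) not in (None, "", []) for name in self.required_fields)
        self._outcomes.append(ok)

    def failure_ratio(self):
        """Share of failed extractions among the recorded outcomes."""
        if not self._outcomes:
            return 0.0
        return self._outcomes.count(False) / len(self._outcomes)

    def should_alarm(self):
        """Alarm only once the window is full, to avoid reacting to a few bad pages."""
        window_full = len(self._outcomes) == self.window
        return window_full and self.failure_ratio() >= self.threshold


if __name__ == "__main__":
    monitor = ExtractionRuleMonitor(("title", "price"), window=4, threshold=0.5)
    for item in [
        {"title": "A", "price": "10"},
        {"title": "B", "price": "12"},
        {"title": "", "price": None},   # rule no longer matches the page layout
        {"title": None, "price": ""},
    ]:
        monitor.record(item)
    if monitor.should_alarm():
        print("extraction rule may have failed; failure ratio:", monitor.failure_ratio())
```

In a Scrapy-based crawler, one place such a monitor could be called from is an item pipeline's `process_item` method. The sliding window trades detection speed for robustness: a larger window raises fewer false alarms but reacts more slowly when a site changes its layout.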
Keywords/Search Tags:Vertical Web Crawler, Crawling Coverage, Crawling Aging, Extract Rules, Scrapy