Research And Implementation Of Web Information Automatically Crawling In Vertical Search

Posted on:2016-10-25

Degree:Master

Type:Thesis

Country:China

Candidate:J Y Zhang

Full Text:PDF

GTID:2428330491960150

Subject:Computer Science and Technology

Abstract/Summary:

With the continuous innovation and development of Internet technology, the amount of information on the Internet has grown explosively, and vertical search engines provide users with specialized, comprehensive, high-quality search results. Those results depend on a large volume of accurate industry data, and acquiring that data is the job of the vertical web crawler. A vertical web crawler selectively extracts information and URLs from web pages according to a set of extraction rules, then stores the structured information in a database for use by the vertical search engine. This is the main flow of web information crawling.

The pages a vertical web crawler targets are numerous, change frequently, and are semi-structured. These characteristics lead to low crawling coverage, to crawled page information that easily becomes outdated, and to extraction rules that stop working. Low crawling coverage reduces the comprehensiveness of vertical search results, outdated page information reduces their validity, and incorrectly extracted information reduces their accuracy. After analyzing and summarizing these three issues, this thesis proposes an automatic page discovery mechanism, an automatic page re-access mechanism, and an extraction-rule failure alarm mechanism. Together, the three mechanisms address the crawling coverage, outdated page information, and extraction-rule failure problems of the vertical web crawler.

First, the thesis introduces the automatic page discovery mechanism, which is built as secondary development on the open-source crawler framework Scrapy and improves the comprehensiveness and accuracy of the vertical web crawler. Second, it introduces the automatic page re-access mechanism, which is developed on the Spring and Hibernate frameworks and ensures the validity and accuracy of the crawled data. Third, it introduces the extraction-rule failure alarm mechanism, which monitors the crawling process and raises an alarm automatically when extraction rules fail. Finally, a large number of experiments were carried out to verify the validity and efficiency of the three automated mechanisms.
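The abstract does not say how the extraction rules or the failure alarm are implemented, so the following is only a minimal sketch of the general idea, not the thesis's method. It assumes rules are regular expressions keyed by field name and that a rule counts as failed when it misses on more than a set share of the most recent pages. The names `RULES`, `extract`, and `RuleMonitor`, the sample fields, and the thresholds are all hypothetical, and the sketch uses only the Python standard library rather than Scrapy.

```python
import re
from collections import deque

# Hypothetical extraction rules: field name -> regex with one capture group.
RULES = {
    "title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S),
    "price": re.compile(r'<span class="price">(.*?)</span>', re.S),
}
# Hypothetical rule for discovering further URLs to crawl.
LINK_RE = re.compile(r'href="(https?://[^"]+)"')


def extract(html):
    """Apply every rule to one page.

    Returns (item, links). A field maps to None when its rule finds no match.
    """
    item = {}
    for field, pattern in RULES.items():
        match = pattern.search(html)
        item[field] = match.group(1).strip() if match else None
    links = LINK_RE.findall(html)
    return item, links


class RuleMonitor:
    """Tracks recent outcomes per rule and flags rules that miss too often."""

    def __init__(self, window=20, max_miss_rate=0.5):
        self.window = window                # number of recent pages considered
        self.max_miss_rate = max_miss_rate  # assumed alarm threshold
        self.history = {}                   # field -> deque of miss flags

    def record(self, item):
        """Record one extracted item; return the fields whose rule looks broken."""
        alarms = []
        for field, value in item.items():
            misses = self.history.setdefault(field, deque(maxlen=self.window))
            misses.append(value is None)
            # Judge a rule only once a full window of pages has been seen.
            if len(misses) == self.window:
                if sum(misses) / self.window > self.max_miss_rate:
                    alarms.append(field)
        return sorted(alarms)
```

In a crawler loop, each fetched page would go through `extract`, the item would be stored, the links queued, and any field returned by `RuleMonitor.record` reported to an operator. Waiting for a full window before judging a rule is one way to avoid alarms from a single unusual page.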

Keywords/Search Tags:

Vertical Web Crawler, Crawling Coverage, Crawling Aging, Extraction Rules, Scrapy

Related items

1	Design And Development Of Distributed Crawler Based On Scrapy Framework
2	Design And Implementation Of Web Crawler System Based On Scrapy Framework
3	Vertical Search Engine For Crawling The Web Page Design And Implementation
4	The Study And Implementation Of Efficient And Stable Methods For Data Crawling In Vertical Search Engines
5	Research And Application Of Web Crawling Algorithm Based On Semantic Analysis
6	QQ Space Data Research And Analysis Based On Scrapy Crawling
7	Crawling Data Of Electronic Business Platform Based On Scrapy And Construction Of Automatic Question-Answering System
8	Research On Efficient Web Information Crawling Strategy
9	The Design And Implementation Of Data Crawling And Processing Module Of Trendata Data Analysis Platform
10	Research And Application Of WEB Anti-crawling Mechanism