The Study And Implementation Of Efficient And Stable Methods For Data Crawling In Vertical Search Engines

Posted on: 2015-06-05
Degree: Master
Type: Thesis
Country: China
Candidate: J Xiao
Full Text: PDF
GTID: 2308330461957920
Subject: Software engineering
Abstract/Summary:
The development of vertical search engines has satisfied Internet users' needs for big-data and time-sensitive search, and data crawling is a crucial part of such engines. Traditional web crawlers target only the acquisition of web links and page blocks while ignoring page content, so more intelligent and efficient crawling systems have to be built to provide data sources for vertical search engines. Such systems must not only extract and analyse web links but also analyse the structure and content of web pages to extract structured information accurately [Chakrabarti et al., 1999], while ensuring adequate coverage and update rates.

While data crawling targeted at a specific domain can mine deeper information from that domain, several obstacles can be expected, such as site redesigns, traffic balancing, lagging schedule updates, duplicated crawling, and poor timeliness [Ricardo et al., 2007]. We have designed and developed a distributed framework for data crawling [Bing Zhou et al., 2010] to satisfy the differing crawling needs of different businesses. The crawling process consists of task scheduling, task dispatching, data crawling, and data storage (a task-queue sketch follows the numbered list below). The system has been verified to be efficient, stable, and extensible in real-world scenarios.

Based on the data crawling and test monitoring of this system, this thesis presents the results of the following studies and application work:

1. For data crawling, multithreading was used to improve system efficiency, and multiple crawling modes were used to analyse and handle web pages. In addition, automatic client upgrades are supported so that changing requirements do not introduce a heavy workload (see the multithreaded fetching sketch after this list).

2. Functional and performance tests, covering parameter gathering, API testing, and data verification, played an important role in ensuring the efficiency, accuracy, and stability of data crawling.

3. Monitoring management was implemented to visualize the crawling process and to provide an alerting mechanism, which helps detect and resolve problems and thus increases the stability of the crawling framework.
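
The abstract names Redis among its keywords but gives no implementation details, so the following is only a minimal sketch of how the scheduling, dispatching, and storing stages could share a Redis-backed task queue. The queue names, JSON task format, and helper functions are illustrative assumptions, not the thesis's actual schema.

    # Minimal sketch of a Redis-backed crawl task queue (hypothetical
    # queue names and task format; not the thesis's actual schema).
    import json
    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)

    TASK_QUEUE = "crawl:tasks"      # scheduler pushes tasks here
    RESULT_QUEUE = "crawl:results"  # workers push crawled data here

    def schedule_task(url, mode="html"):
        """Task scheduling: enqueue a crawl task for dispatch to clients."""
        r.rpush(TASK_QUEUE, json.dumps({"url": url, "mode": mode}))

    def fetch_task(timeout=5):
        """Task dispatching: a crawler client blocks until a task arrives."""
        item = r.blpop(TASK_QUEUE, timeout=timeout)
        return json.loads(item[1]) if item else None

    def store_result(task, payload):
        """Data storing: hand extracted data to the storage stage."""
        r.rpush(RESULT_QUEUE, json.dumps({"task": task, "data": payload}))

Because Redis lists give atomic blocking pops, many independent crawler clients can share one queue without a separate coordinator, which matches the C/S, distributed design the keywords suggest.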
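Item 1 cites multithreading as the main efficiency lever but does not describe the concrete threading model. The sketch below shows one conventional way to realize it with a thread pool; the pool size, fetch_page helper, and error handling are assumptions for illustration, not the thesis's design.

    # Multithreaded fetch sketch; pool size and fetch_page are
    # illustrative assumptions, not the thesis's actual design.
    from concurrent.futures import ThreadPoolExecutor, as_completed
    from urllib.request import urlopen

    def fetch_page(url, timeout=10):
        """Download one page; a real crawler would add retries,
        politeness delays, and per-mode content extraction."""
        with urlopen(url, timeout=timeout) as resp:
            return resp.read()

    def crawl_all(urls, workers=8):
        """Fetch pages concurrently so slow servers do not block the rest."""
        results = {}
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {pool.submit(fetch_page, u): u for u in urls}
            for fut in as_completed(futures):
                url = futures[fut]
                try:
                    results[url] = fut.result()
                except Exception as exc:
                    results[url] = exc  # record failures for re-scheduling
        return results

Collecting failures alongside successes lets the scheduler re-enqueue failed URLs, which supports the duplicated-crawling and timeliness concerns raised above.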
Keywords/Search Tags: Data crawling, C/S, distributed crawling, Redis, visualization