The Study And Implementation Of Efficient And Stable Methods For Data Crawling In Vertical Search Engines

Posted on: 2015-06-05
Degree: Master
Type: Thesis
Country: China
Candidate: J Xiao
Full Text: PDF
GTID: 2308330461957920
Subject: Software engineering
Abstract/Summary:
The development of vertical search engines has satisfied Internet users' needs for big-data and time-sensitive search, and data crawling is a crucial part of such engines. Traditional web crawlers target only the acquisition of web links and page blocks while ignoring page content, so more intelligent and efficient crawling systems have to be built to provide data sources for vertical search engines. Such systems must not only extract and analyse web links but also analyse the structure and content of web pages to extract structured information accurately [Chakrabarti et al., 1999], while ensuring adequate coverage and update rates.

While data crawling targeted at a specific domain can mine deeper information from that domain, several obstacles can be expected, such as site redesigns, traffic balancing, lagging schedule updates, duplicated crawling, and poor timeliness [Ricardo et al., 2007]. We have designed and developed a distributed framework for data crawling [Bing Zhou et al., 2010] to satisfy the differing crawling needs of different businesses. The crawling process consists of task scheduling, task dispatching, data crawling, and data storage (a task-queue sketch follows the numbered list below). The system has been verified to be efficient, stable, and extensible in real-world scenarios.

Based on the data crawling and test monitoring of this system, this thesis presents the results of the following studies and application work:

1. For data crawling, multithreading was used to improve system efficiency, and multiple crawling modes were used to analyse and handle web pages. In addition, automatic client upgrades are supported so that changing requirements do not introduce a heavy workload (see the multithreaded fetching sketch after this list).

2. Functional and performance tests, covering parameter gathering, API testing, and data verification, played an important role in ensuring the efficiency, accuracy, and stability of data crawling.

3. Monitoring management was implemented to visualize the crawling process and to provide an alerting mechanism, which helps detect and resolve problems and thus increases the stability of the crawling framework.
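
The abstract names Redis among its keywords but gives no implementation details, so the following is only a minimal sketch of how the scheduling, dispatching, and storing stages could share a Redis-backed task queue. The queue names, JSON task format, and helper functions are illustrative assumptions, not the thesis's actual schema.

    # Minimal sketch of a Redis-backed crawl task queue (hypothetical
    # queue names and task format; not the thesis's actual schema).
    import json
    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)

    TASK_QUEUE = "crawl:tasks"      # scheduler pushes tasks here
    RESULT_QUEUE = "crawl:results"  # workers push crawled data here

    def schedule_task(url, mode="html"):
        """Task scheduling: enqueue a crawl task for dispatch to clients."""
        r.rpush(TASK_QUEUE, json.dumps({"url": url, "mode": mode}))

    def fetch_task(timeout=5):
        """Task dispatching: a crawler client blocks until a task arrives."""
        item = r.blpop(TASK_QUEUE, timeout=timeout)
        return json.loads(item[1]) if item else None

    def store_result(task, payload):
        """Data storing: hand extracted data to the storage stage."""
        r.rpush(RESULT_QUEUE, json.dumps({"task": task, "data": payload}))

Because Redis lists give atomic blocking pops, many independent crawler clients can share one queue without a separate coordinator, which matches the C/S, distributed design the keywords suggest.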
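Item 1 cites multithreading as the main efficiency lever but does not describe the concrete threading model. The sketch below shows one conventional way to realize it with a thread pool; the pool size, fetch_page helper, and error handling are assumptions for illustration, not the thesis's design.

    # Multithreaded fetch sketch; pool size and fetch_page are
    # illustrative assumptions, not the thesis's actual design.
    from concurrent.futures import ThreadPoolExecutor, as_completed
    from urllib.request import urlopen

    def fetch_page(url, timeout=10):
        """Download one page; a real crawler would add retries,
        politeness delays, and per-mode content extraction."""
        with urlopen(url, timeout=timeout) as resp:
            return resp.read()

    def crawl_all(urls, workers=8):
        """Fetch pages concurrently so slow servers do not block the rest."""
        results = {}
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {pool.submit(fetch_page, u): u for u in urls}
            for fut in as_completed(futures):
                url = futures[fut]
                try:
                    results[url] = fut.result()
                except Exception as exc:
                    results[url] = exc  # record failures for re-scheduling
        return results

Collecting failures alongside successes lets the scheduler re-enqueue failed URLs, which supports the duplicated-crawling and timeliness concerns raised above.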
Keywords/Search Tags: Data crawling, C/S, distributed crawling, Redis, visualization