The Design And Implementation Of The Deep Web Acquisition System

Posted on: 2013-01-29
Degree: Master
Type: Thesis
Country: China
Candidate: Y Song
Full Text: PDF
GTID: 2248330395973979
Subject: Software engineering
Abstract/Summary:
With the continuous development of the Internet and the explosive growth of its content, people have begun to focus on what lies behind this vast amount of information, a task often called web data mining. As the basis of data mining, information collection techniques (often called web crawling) directly affect the quality of the mining results. However, existing web crawling systems have difficulty adapting to the rapid development of web technology. For example, they provide poor support for gathering dynamic pages built with AJAX and lack an effective response to servers' anti-crawling measures, which seriously reduces the efficiency of the whole system.

In this paper, we present a scheme for developing a deep web crawling system in the Java programming language. The scheme covers gathering-strategy configuration, deep and normal gathering, automatic proxy selection, indexing of the gathered data, and techniques for follow-up processing. We then implement a prototype system and test it by crawling Baidu search results, the Netease news site, and product information from the 360buy online shop.

The system builds on existing web crawling techniques. First, all the related strategies and actions to be performed are integrated into corresponding website templates. Second, deep crawling of the websites is performed by the open-source HtmlUnit module. Third, important fields are extracted from the gathered pages. Finally, all the extracted fields are indexed and written to the database, which completes the process.
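As a rough illustration of the template and field-extraction steps described above, the following Java sketch maps field names to regular expressions and applies them to a downloaded page. The class name `SiteTemplate`, its methods, the sample HTML, and the regex-based approach are all assumptions made for this example; the abstract does not say how the thesis represents templates or extracts fields.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical site template: each field name is bound to a regular
 * expression whose first capture group holds the field value.
 * This is an illustrative sketch, not the thesis's actual design.
 */
class SiteTemplate {
    // Insertion-ordered so extracted fields follow the template's order.
    private final Map<String, Pattern> fieldPatterns = new LinkedHashMap<>();

    /** Registers a field; the regex must contain one capture group. */
    SiteTemplate field(String name, String regex) {
        fieldPatterns.put(name, Pattern.compile(regex, Pattern.DOTALL));
        return this;
    }

    /** Returns the fields found in the page; missing fields are omitted. */
    Map<String, String> extract(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> entry : fieldPatterns.entrySet()) {
            Matcher matcher = entry.getValue().matcher(html);
            if (matcher.find()) {
                fields.put(entry.getKey(), matcher.group(1).trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up product page fragment, used only for demonstration.
        String html = "<h1> Phone X </h1><span class=\"price\">1999</span>";
        SiteTemplate template = new SiteTemplate()
                .field("title", "<h1>(.*?)</h1>")
                .field("price", "<span class=\"price\">(.*?)</span>");
        System.out.println(template.extract(html)); // {title=Phone X, price=1999}
    }
}
```

Keeping the extraction rules in a per-site object is one way to let new websites be supported by adding a template rather than changing crawler code. A production system would more likely use an HTML parser or XPath than regular expressions.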
At the same time, we provide a user-friendly interface to help the user create and manage jobs and search the gathered data easily.

After introducing the design of the system, this paper describes the design and implementation of its main modules: the template management module, normal page downloading module, deep-crawling module, Bloom filter module, auto-proxy module, content indexing module, and field extraction module. Because the whole system is written in Java, it can run on different platforms, such as Microsoft Windows or Linux.

The system incorporates the lightweight HtmlUnit module, which is GUI-free and often used in web testing. With this module, the system supports more websites. Moreover, the system is plug-in oriented, which gives it good extensibility and supports further development. By adding custom templates or using third-party modules, the system can be used for larger-scale gathering. Through these methods, we realized an easy-to-use, highly efficient, and scalable deep web crawling system.
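The abstract lists a Bloom filter module without describing it. A common use of Bloom filters in crawlers is remembering which URLs have already been visited, and the sketch below assumes that role. The class `UrlBloomFilter`, the FNV-1a hash, the double-hashing scheme, and the sizes in the demo are illustrative choices, not the thesis's implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/**
 * Minimal Bloom filter for visited-URL checks (illustrative sketch).
 * It may report false positives but never false negatives.
 */
class UrlBloomFilter {
    private final BitSet bits;
    private final int size;      // number of bits in the filter
    private final int hashCount; // number of bit positions set per URL

    UrlBloomFilter(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** 32-bit FNV-1a hash, with a seed mixed into the offset basis. */
    private static int fnv1a(byte[] data, int seed) {
        int hash = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            hash ^= (b & 0xFF);
            hash *= 16777619;
        }
        return hash;
    }

    /** Derives the i-th bit position by double hashing: h1 + i * h2. */
    private int index(byte[] data, int i) {
        int h1 = fnv1a(data, 0);
        int h2 = fnv1a(data, 0x9E3779B9);
        return Math.floorMod(h1 + i * h2, size);
    }

    /** Marks a URL as seen. */
    void add(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(data, i));
        }
    }

    /** Returns false if the URL was definitely never added. */
    boolean mightContain(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(data, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        UrlBloomFilter seen = new UrlBloomFilter(1 << 16, 4);
        String url = "http://example.com/page"; // placeholder URL
        System.out.println(seen.mightContain(url)); // false: filter is empty
        seen.add(url);
        System.out.println(seen.mightContain(url)); // true
    }
}
```

The filter uses a fixed amount of memory regardless of URL length, at the cost of occasionally skipping a page that was never fetched. The bit count and number of hashes would need tuning to the expected number of URLs.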
Keywords/Search Tags: web crawling, HtmlUnit, AJAX, data mining