The Design And Implementation Of The Deep Web Acquisition System

Posted on:2013-01-29

Degree:Master

Type:Thesis

Country:China

Candidate:Y Song

Full Text:PDF

GTID:2248330395973979

Subject:Software engineering

Abstract/Summary:

With the continuous development of the Internet and the explosive growth of its content, people have begun to focus on what lies behind this vast amount of information, a task often called web data mining. As the basis of data mining, information collection techniques (often called web crawling) directly affect the quality of the mining results. However, existing web crawling systems have difficulty adapting to the rapid development of web technology. For example, they offer poor support for gathering dynamic pages built with AJAX, and they lack an effective response to servers' anti-crawling measures, both of which seriously reduce the efficiency of the whole system.

In this paper, we present a scheme for developing a deep web crawling system in the Java programming language. The scheme covers gathering-strategy configuration, deep and normal gathering, automatic proxy selection, indexing of the gathered data, and techniques for follow-up processing. We then implement a prototype system and test it by crawling Baidu search results, the Netease news site, and product information from the 360buy online shop.

The system builds on existing web crawling techniques. First, all related strategies and the actions to be performed are integrated into templates for the corresponding websites. Second, deep crawling of the websites is performed by the open-source HtmlUnit module. Third, important fields are extracted from the gathered pages. Finally, all extracted fields are indexed and written to the database, which completes the process.
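The last two steps above (field extraction and indexing) can be illustrated with a minimal Java sketch. Everything in it is an assumption made for illustration, not the thesis's actual code: the class name FieldPipelineSketch, the idea that a website template maps field names to regular expressions, and the in-memory inverted index that stands in for the real index and database. In the thesis the page itself would come from HtmlUnit; here it is a hard-coded string.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldPipelineSketch {

    // Hypothetical template format: field name -> regex with one capturing group.
    public static Map<String, String> extractFields(String html, Map<String, Pattern> template) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> rule : template.entrySet()) {
            Matcher m = rule.getValue().matcher(html);
            if (m.find()) {
                // Keep the first match only; whitespace around the value is dropped.
                fields.put(rule.getKey(), m.group(1).trim());
            }
        }
        return fields;
    }

    // Adds every lower-cased word of every field value to an inverted index
    // (word -> ids of the documents that contain it).
    public static void index(Map<String, Set<Integer>> invertedIndex, int docId,
                             Map<String, String> fields) {
        for (String value : fields.values()) {
            for (String token : value.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!token.isEmpty()) {
                    invertedIndex.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
                }
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for a page that the deep-crawling step would have fetched.
        String html = "<html><title> Deep Web Crawler </title>"
                + "<span class=\"price\">42.00</span></html>";

        Map<String, Pattern> template = new LinkedHashMap<>();
        template.put("title", Pattern.compile("<title>(.*?)</title>"));
        template.put("price", Pattern.compile("<span class=\"price\">(.*?)</span>"));

        Map<String, String> fields = extractFields(html, template);
        Map<String, Set<Integer>> invertedIndex = new TreeMap<>();
        index(invertedIndex, 1, fields);

        System.out.println(fields);
        System.out.println(invertedIndex);
    }
}
```

Regular expressions are used here only to keep the sketch dependency-free; a real template would more likely address page elements through the DOM that HtmlUnit exposes.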
At the same time, to help users create and manage jobs and search the gathered data easily, we provide a user-friendly interface.

After introducing the overall design, this paper describes the design and implementation of the system's main modules: the template management module, the normal page downloading module, the deep-crawling module, the Bloom filter module, the auto-proxy module, the content indexing module, and the field extraction module. Because the whole system is written in Java, it runs on different platforms, including Microsoft Windows and Linux.

The system uses the lightweight HtmlUnit module, which is GUI-free and often used in web testing. With this module, the system supports more websites. Moreover, the system is plug-in oriented, which gives it good extensibility and supports further development. By adding custom templates or using third-party modules, the system can be applied to larger-scale gathering. Through the above methods, we have implemented an easy-to-use, highly efficient, and scalable deep web crawling system.
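The abstract names a Bloom filter module without saying what it filters. In crawlers such a filter is commonly used to remember which URLs have already been visited, and the following Java sketch assumes that role. The class name UrlBloomFilter, the seeded FNV-1a-style hash, and the sizes used are illustrative choices, not details taken from the thesis.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

public class UrlBloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits in the filter
    private final int hashCount;  // number of bit positions probed per URL

    public UrlBloomFilter(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // FNV-1a-style 32-bit hash; the seed perturbs the usual offset basis
    // so that two different hash values can be derived from the same input.
    private static int hash(byte[] data, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }

    // Double hashing: the i-th position is (h1 + i * h2) mod size.
    private int position(byte[] data, int i) {
        int h1 = hash(data, 0);
        int h2 = hash(data, 0x9E3779B9) | 1; // force an odd step
        return Math.floorMod(h1 + i * h2, size);
    }

    // Returns true if the URL was definitely not seen before, and records it.
    // Returns false if it was probably seen (false positives are possible).
    public boolean addIfAbsent(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        boolean allSet = true;
        for (int i = 0; i < hashCount; i++) {
            int p = position(data, i);
            if (!bits.get(p)) {
                allSet = false;
                bits.set(p);
            }
        }
        return !allSet;
    }

    // Read-only membership test with the same false-positive caveat.
    public boolean mightContain(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(data, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        UrlBloomFilter seen = new UrlBloomFilter(1 << 16, 4);
        String[] frontier = {
            "http://example.com/a", "http://example.com/b", "http://example.com/a"
        };
        for (String url : frontier) {
            // A crawler would download the page only when addIfAbsent returns true.
            System.out.println(url + " -> fetch: " + seen.addIfAbsent(url));
        }
    }
}
```

A Bloom filter never reports a recorded URL as new, but it can occasionally report a new URL as already seen, so a crawler that relies on it trades a small chance of skipping a page for a fixed, small memory footprint.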

Keywords/Search Tags:

web crawling, HtmlUnit, AJAX, data mining

Related items

1	Research On Algorithm Of Crawling Ajax Dynamic Web Pages Based On User Interface State Changes
2	Design And Implementation Of An Ajax Supported Deep Web Crawler System
3	Research And Implementation Of Web Crawler For URL-Specified Crawling Of Ajax-Based Web Applications
4	Applications Of Data Mining For The Competitive Intelligence System In The Enterprise
5	The Study And Implementation Of Efficient And Stable Methods For Data Crawling In Vertical Search Engines
6	Research And Implementation On Web Page Crawling And Analyzing Techniques For AJAX Script Network
7	Ajax-based Livelihood Platform Development And Design
8	An Approach Based On WSFT Model For Crawling Deep Web
9	Design And Implementation Of Visualization System For Movie Website Data Mining
10	Research On Crawling Model And Stratage Which Is Available Of Crawling Cloud-Computing Products’Data From Rias