
Hadoop-based Automatic Deep Web Data Extraction

Posted on: 2015-03-19
Degree: Master
Type: Thesis
Country: China
Candidate: H J Wang
GTID: 2298330422472136
Subject: Computer software and theory
Abstract/Summary:
The rapid development and widespread application of the Internet have steadily increased the resources it provides. In particular, the volume of information that traditional search engines cannot reach through static links, known as the Deep Web, is growing dramatically. Research on the Deep Web has become a hotspot in Web data management in recent years.

Deep Web information is displayed in the result pages returned after a query is submitted to a specific interface. Only by extracting information from Deep Web query results, integrating these information resources, and storing them in a unified model can we provide users with a better, unified indexing service. The extraction of Deep Web query results is therefore a critical process in a Deep Web data integration system. This thesis focuses on an extraction algorithm that combines DOM tree-based structure and templates. The main contents of this thesis are as follows:

① Several major extraction technologies are compared, with a detailed description of the extraction algorithms based on DOM tree structure and on templates. The techniques are then analyzed and compared in terms of complexity, scope, and degree of automation.

② Integrating the advantages of DOM tree-based and template-based extraction algorithms, a novel approach that combines the two, called the FIME (Filtering, Iterating, Matching, and Extracting) algorithm, is proposed. Before comparing DOM tree structures, FIME preprocesses the pages so that they conform to standard XHTML. Furthermore, FIME removes the tags and some of the attributes that are useless for information extraction, which simplifies the pages and improves the efficiency of the matching module.

③ To address the problem that backtracking over page iterations can cause high complexity in DOM tree-based extraction algorithms, FIME merges the iterations in the input pages before matching, which reduces the time complexity of the matching module.

④ Applying the idea of template-based extraction, in the matching module FIME obtains the position information of the data to be extracted by comparing DOM tree structures and uses it as the wrapper for that website. In the extracting module, it uses the wrapper to automatically extract information from other pages of the same website, instead of repeating the processing, which improves extraction efficiency and the degree of automation.

Since the data returned in Deep Web query result pages may be massive, a single-node extraction algorithm faces a computational bottleneck. The open source distributed system architecture Hadoop is now very stable, and the FIME algorithm is therefore deployed on the Hadoop platform. Experimental results show that the FIME algorithm achieves higher extraction accuracy and efficiency.
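
The four FIME stages can be illustrated with a minimal Python sketch. It is not the thesis implementation, which the abstract does not give. Every name in it (`filter_tree`, `merge_iterations`, `build_wrapper`, `extract`), the choice of noise tags, the rule that only non-leaf siblings count as iterations, and the sample pages are assumptions made for illustration. It also assumes the pages are already well-formed XHTML without namespaces.

```python
import xml.etree.ElementTree as ET

# Assumption: which tags count as "useless for extraction" is a guess.
NOISE_TAGS = {"script", "style"}


def filter_tree(node):
    """Filtering: drop noise elements and all attributes, in place."""
    node.attrib.clear()
    for child in list(node):
        if child.tag in NOISE_TAGS:
            node.remove(child)
        else:
            filter_tree(child)


def signature(node):
    """Structural fingerprint: the tag plus the fingerprints of the children."""
    return (node.tag, tuple(signature(child) for child in node))


def merge_iterations(node):
    """Iterating: keep one copy of each run of identical non-leaf siblings."""
    for child in node:
        merge_iterations(child)
    previous = None
    for child in list(node):
        current = signature(child)
        if len(child) > 0 and current == previous:
            node.remove(child)  # repeated record; the first copy stands for all
        else:
            previous = current


def build_wrapper(a, b, path="."):
    """Matching: paths of nodes whose text differs between two template trees."""
    slots = []
    if (a.text or "").strip() != (b.text or "").strip():
        slots.append(path)
    totals = {}
    for child in a:
        totals[child.tag] = totals.get(child.tag, 0) + 1
    seen = {}
    for child_a, child_b in zip(a, b):
        seen[child_a.tag] = seen.get(child_a.tag, 0) + 1
        step = child_a.tag
        if totals[step] > 1:  # index only where same-tag siblings remain
            step = f"{step}[{seen[child_a.tag]}]"
        slots += build_wrapper(child_a, child_b, f"{path}/{step}")
    return slots


def prepare(page_xml):
    """Parse a page, then filter it and merge its iterations."""
    root = ET.fromstring(page_xml)
    filter_tree(root)
    merge_iterations(root)
    return root


def extract(page_xml, wrapper):
    """Extracting: apply the wrapper paths to a new page from the same site."""
    root = ET.fromstring(page_xml)
    filter_tree(root)
    columns = [[(e.text or "").strip() for e in root.findall(p)] for p in wrapper]
    return list(zip(*columns))


# Hypothetical result pages from one site.
PAGE_A = """<html><head><script>var x = 1;</script></head><body>
<h1>Results</h1><table>
<tr class="row"><td>Dune</td><td>9</td></tr>
<tr class="row"><td>Ubik</td><td>8</td></tr>
</table></body></html>"""
PAGE_B = """<html><head></head><body>
<h1>Results</h1><table>
<tr class="row"><td>Emma</td><td>7</td></tr>
</table></body></html>"""
PAGE_C = """<html><head></head><body>
<h1>Results</h1><table>
<tr class="row"><td>Solaris</td><td>12</td></tr>
<tr class="row"><td>Kindred</td><td>10</td></tr>
<tr class="row"><td>Emma</td><td>7</td></tr>
</table></body></html>"""

WRAPPER = build_wrapper(prepare(PAGE_A), prepare(PAGE_B))
print(WRAPPER)
print(extract(PAGE_C, WRAPPER))
```

In this sketch the wrapper is built once from two sample pages and then reused for every further page, so each `extract` call depends only on its own page and the wrapper. That independence is what would let such calls be spread across the map tasks of a Hadoop job, although how the thesis actually partitions the work is not stated in the abstract.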
Keywords/Search Tags: Deep Web query results, DOM tree, template, FIME algorithm, Hadoop