Hadoop-based Automatic Deep Web Data Extraction

Posted on:2015-03-19

Degree:Master

Type:Thesis

Country:China

Candidate:H J Wang

Full Text:PDF

GTID:2298330422472136

Subject:Computer software and theory

Abstract/Summary:

With the rapid development and widespread use of the Internet, the resources it provides have grown steadily. In particular, the large body of information that traditional search engines cannot reach by following static links, known as the Deep Web, is increasing dramatically. Research on the Deep Web has been a hotspot in Web data management in recent years.

Deep Web information is displayed on result pages only after a query is submitted to a specific interface. Only by extracting information from Deep Web query results, integrating these resources, and storing them in a unified model can we provide users with a better, unified indexing service. The extraction of Deep Web query results is therefore a critical step in a Deep Web data integration system. This thesis focuses on an extraction algorithm that combines DOM tree structure with templates. The main contents of this thesis are as follows:

① Several major extraction technologies are compared, with a detailed description of the extraction algorithms based on DOM tree structure and on templates, and a thorough analysis and comparison of the techniques in terms of complexity, scope, and degree of automation.

② Integrating the advantages of the DOM tree-based and template-based extraction algorithms, a novel approach called the FIME (Filtering, Iterating, Matching, and Extracting) algorithm is proposed. Before comparing DOM tree structures, FIME preprocesses the pages so that they conform to standard XHTML. FIME also removes the tags and some of the attributes that are useless for extracting information, which simplifies the pages and improves the efficiency of the matching module.

③ To address the high complexity that backtracking over page iterations may cause in DOM tree-based extraction algorithms, FIME merges the iterations in the input pages before matching, reducing the time complexity of the matching module.

④ Applying the idea of template-based extraction, in the matching module FIME uses the position information of the data to be extracted, obtained by comparing DOM tree structures, as the model wrapper for the website. In the extracting module, it uses the wrapper to automatically extract information from other pages of the same website, instead of processing each page again, improving extraction efficiency and the degree of automation.

Because the data returned in Deep Web query result pages may be massive, a single-node extraction algorithm hits a computational bottleneck. The open-source distributed system architecture Hadoop is now very stable, so the FIME algorithm is deployed on the Hadoop platform. Experimental results show that the FIME algorithm has higher extraction accuracy and efficiency.
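The abstract does not give FIME's implementation, so the following is only a minimal sketch of what the filtering and iteration-merging steps might look like, using Python's standard `html.parser`. All names here (`Node`, `FilteringParser`, `NOISE_TAGS`, `merge_iterations`, `filter_and_merge`) are hypothetical, the set of "useless" tags is an assumption, and an "iteration" is interpreted as a run of adjacent sibling subtrees with identical tag structure. The matching, wrapper, and Hadoop stages are not covered.

```python
from html.parser import HTMLParser

# Assumption: tags treated as useless for extraction (not listed in the abstract).
NOISE_TAGS = {"script", "style", "noscript", "iframe"}
# Void elements have no closing tag, so they are kept as leaves.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """A DOM node reduced to its tag name and children (attributes dropped)."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []

    def signature(self):
        """Return the tag-only structure of this subtree, e.g. 'li(b,i)'."""
        if not self.children:
            return self.tag
        inner = ",".join(child.signature() for child in self.children)
        return f"{self.tag}({inner})"


class FilteringParser(HTMLParser):
    """Filtering step: build a tag-only tree, skipping noise tags and attributes."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]
        self.skip_depth = 0  # greater than 0 while inside a noise tag

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        node = Node(tag)  # attrs are deliberately ignored
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in VOID_TAGS:
            return
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def merge_iterations(node):
    """Iterating step: collapse runs of structurally identical adjacent siblings."""
    merged = []
    for child in node.children:
        merge_iterations(child)  # merge bottom-up so signatures are comparable
        if merged and merged[-1].signature() == child.signature():
            continue  # repeated record structure: keep the first occurrence only
        merged.append(child)
    node.children = merged


def filter_and_merge(html):
    """Run the filtering and iteration-merging sketches on one page."""
    parser = FilteringParser()
    parser.feed(html)
    parser.close()
    merge_iterations(parser.root)
    return parser.root


if __name__ == "__main__":
    page = (
        "<html><body><script>var x = 1;</script>"
        "<ul class='results'>"
        "<li><b>A</b><i>1</i></li><li><b>B</b><i>2</i></li><li><b>C</b><i>3</i></li>"
        "</ul></body></html>"
    )
    print(filter_and_merge(page).signature())
```

In this sketch the three result records collapse into a single `li(b,i)` pattern, so a later matching step would compare one representative subtree per page rather than every repeated record, which is the kind of saving item ③ describes.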

Keywords/Search Tags:

Deep Web query results, DOM tree, template, FIME algorithm, Hadoop

Related items

1	The Research On Deep Web Interfaces Integration And Query Results Ranking
2	Hadoop-based Geospatial Data Storage And Query Technology
3	The Research And Implementation Of Deep Web Query Results Extraction
4	Research On Source Discovery And Query Results Extraction Of Deep Web
5	Research On Keyword Query Approach Over RDF Data Based On Tree Template
6	Deep Web Query Results Extraction And Annotation
7	Research Of Query Expansion And Search Results Clustering For Web Information Retrieval
8	The Research Of Personalized Categorization Approach For Web Database Query Results
9	Researches On Deep Web Query Interface Determining Technology
10	Study On Deep Web Query Interface Pattern Matching And Query Results Annotation