Automatic wrapper generation for the extraction of search result records from search engines

Posted on:2008-01-06

Degree:Ph.D

Type:Dissertation

University:State University of New York at Binghamton

Candidate:Zhao, Hongkun

GTID:1448390005956929

Subject:Computer Science

Abstract/Summary:

The deep web, which is estimated to be about 500 times larger than the surface web, is extremely under-utilized. Researchers have been working on various issues toward building large-scale deep web applications, which aim to unleash the real power of the deep web. One of the key issues facing large-scale deep web applications is the extraction and understanding of the data returned by deep web sites. To utilize this data, we need to extract the search result records from the search result pages returned by deep web sites; these pages contain both the data of interest and other unrelated content. Data extraction from web pages is generally a very hard problem, and the performance of existing approaches in the literature is far from satisfactory.

This dissertation studies the problem of extracting search result records from pages returned by search engines on both deep web sites and surface web sites. A method that combines the visual content features and the HTML tag structures of the result pages is proposed to generate wrappers for the extraction of search result records. This novel technique achieves significantly better performance than state-of-the-art approaches.

Extracting search result records from categorized result pages requires maintaining the section-record relationships. Major issues such as section boundaries and optional sections make good performance difficult to achieve. We introduce a novel method based on the content properties of search result records and the dynamic properties of sections.

A search result record usually consists of multiple data units. The semi-structured nature of search result records makes data unit extraction a hard problem. Mismatches between the HTML tag structures and the data structure of search result records, as well as optional and disjunctive data units, further limit performance. We introduce a novel directed acyclic graph representation of search result record templates, which can be used to extract data units from search result records. An effective algorithm based on machine learning and statistics that extracts templates from search result records is also presented.
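
To make the idea of a directed acyclic graph template concrete, the following is a minimal illustrative sketch and not the dissertation's algorithm. It assumes a hand-written template, whereas the dissertation learns templates from records. The class name `Template`, the node labels (`title`, `price`, `date`, `snippet`, `url`), and the matching predicates are all hypothetical. Branching edges model disjunctive data units, and an edge that skips a node models an optional data unit.

```python
import re
from typing import Callable, Dict, List, Optional


class Template:
    """Hypothetical DAG template: nodes are data unit types, edges are allowed successors."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Callable[[str], bool]] = {}
        self.edges: Dict[str, List[str]] = {"START": []}

    def add_node(self, name: str, test: Callable[[str], bool]) -> None:
        self.nodes[name] = test
        self.edges.setdefault(name, [])

    def add_edge(self, src: str, dst: str) -> None:
        self.edges[src].append(dst)

    def extract(self, units: List[str]) -> Optional[Dict[str, str]]:
        """Return label -> text if some START-to-END path consumes every unit, else None."""

        def walk(node: str, i: int) -> Optional[Dict[str, str]]:
            for nxt in self.edges[node]:
                if nxt == "END":
                    if i == len(units):
                        return {}
                    continue
                if i < len(units) and self.nodes[nxt](units[i]):
                    rest = walk(nxt, i + 1)
                    if rest is not None:
                        return {nxt: units[i], **rest}
            return None

        return walk("START", 0)


# Assumed example template: title, then (price OR date, both optional), snippet, url.
t = Template()
t.add_node("title", lambda s: bool(s))
t.add_node("price", lambda s: re.fullmatch(r"\$\d+(\.\d{2})?", s) is not None)
t.add_node("date", lambda s: re.fullmatch(r"\d{4}-\d{2}-\d{2}", s) is not None)
t.add_node("snippet", lambda s: len(s.split()) >= 3)
t.add_node("url", lambda s: s.startswith("http"))

t.add_edge("START", "title")
t.add_edge("title", "price")    # disjunctive branch 1
t.add_edge("title", "date")     # disjunctive branch 2
t.add_edge("title", "snippet")  # skip edge: price/date are optional
t.add_edge("price", "snippet")
t.add_edge("date", "snippet")
t.add_edge("snippet", "url")
t.add_edge("url", "END")

with_price = t.extract(
    ["Deep Web Crawling", "$12.50", "A survey of crawling methods", "http://example.com/a"]
)
without_price = t.extract(
    ["Deep Web Crawling", "A survey of crawling methods", "http://example.com/a"]
)
incomplete = t.extract(["Deep Web Crawling", "$5"])
```

Under these assumptions, a record with a price and a record without one are both accepted by the same template, while a record that ends before reaching `url` is rejected (`extract` returns `None`). A depth-first path search is used here only for brevity.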

Keywords/Search Tags:

Search result records, Deep web, Extract, HTML tag structures, Surface web, Data units

Related items

1	Post-Processing Of Deep Web Querying Result
2	A Research On Key Technologies Of Deep Web Data Integration Based On Result Pattern
3	Research On Deep Web Search Interface And Search Result Extraction
4	Improving Web retrieval by mining the HTML tags for keywords and exploring the hyperlink structures of Web pages
5	Study On Search Results Clustering Based On Formal Concept Analysis
6	A Comparative Study Of Explicit Search Result Diversification Algorithms
7	Research And Application Of Video Search Result Analysis And Visualization Method
8	Research On Deep Web Dynamic Search
9	Research On Fusion-Based Methods For Search Result Diversification
10	Data Extract Pattern Mining Research Based On HL7 Electronic Medical Record