Structured data extraction from the Web

Posted on:2007-12-29

Degree:Ph.D

Type:Thesis

University:University of Illinois at Chicago

Candidate:Zhai, Yanhong

GTID:2448390005971327

Subject:Computer Science

Abstract/Summary:

This thesis studies the problem of extracting structured data from Web pages (semi-structured documents). Structured data on the Web are usually data records that are retrieved from underlying databases and displayed in Web pages according to fixed templates. Extracting such data records is very useful because it enables one to obtain and integrate information from multiple sources and to provide value-added services. With more and more companies and organizations disseminating information on the Web, the ability to extract data records from Web pages is becoming increasingly challenging and important.

In this thesis, we study the existing techniques in the area of Web data extraction and analyze their limitations. Two novel approaches are proposed to extract data on the Web, each targeting a different type of data-rich page: (1) automatic extraction based on mining repeated patterns; and (2) wrapper generation based on instance-based learning. In the first approach, given a single page containing multiple data records, structured data are automatically identified, extracted, and put in a database table. To achieve this, the visual information of HTML elements and a tree matching algorithm are used to mine similar patterns, which correspond to data records. To extract data items from the identified data records, a novel partial tree alignment algorithm is devised. In the second approach, given a set of pages, the user marks (labels) the items of interest in a single page, and the system can then begin extracting similar data items from the rest of the pages. A novel similarity measure is proposed to measure the similarity between two data items in terms of their markup encoding.
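
The abstract does not say which tree matching algorithm is used to find similar patterns. As a rough illustration only, the sketch below assumes the classic Simple Tree Matching dynamic program, which counts the largest number of node pairs that can be matched between two ordered tag trees while preserving sibling order. The names `Node`, `simple_tree_matching`, and `normalized_similarity` are hypothetical, the normalization is one common choice rather than the thesis's definition, and visual information is not modeled here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of an ordered tag tree (hypothetical, simplified DOM node)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of the maximum order-preserving matching between trees a and b.

    Assumption: two nodes can match only if their tags are equal, and
    children are matched left to right with no crossing or replacement.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # table[i][j] = best matching between the first i children of a
    # and the first j children of b.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = table[i - 1][j - 1] + simple_tree_matching(
                a.children[i - 1], b.children[j - 1]
            )
            table[i][j] = max(table[i - 1][j], table[i][j - 1], paired)
    return table[m][n] + 1  # +1 for the matched roots


def normalized_similarity(a: Node, b: Node) -> float:
    """Matching size scaled to [0, 1] by the average tree size (an assumption)."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


# Two candidate data records: table rows that differ by one empty cell.
rec1 = Node("tr", [Node("td", [Node("img")]), Node("td", [Node("a")]), Node("td")])
rec2 = Node("tr", [Node("td", [Node("img")]), Node("td", [Node("a")])])

score = normalized_similarity(rec1, rec2)
```

Under these assumptions, adjacent subtrees whose normalized score exceeds a chosen threshold would be treated as instances of the same repeated pattern, that is, as candidate data records. The threshold itself is a tuning parameter and is not given in the abstract.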

Keywords/Search Tags:

Data, Web, Extract

Related items

1	Data Extract Pattern Mining Research Based On HL7 Electronic Medical Record
2	Design And Implementation Of Data Extract And Transform Sub-System Of Business Analysis System In CRBT Platform
3	Research On An Approach For Identifying Code Refactoring Change Patterns
4	3-D Data Processing to Extract Vehicle Trajectories from Roadside LiDAR Dat
5	Seismic Achievement Data ETL Platform Architecture Design And Software System Implementation
6	The Centralizing And Processing Of Data On The Console Of The Business Income Centralized Management System
7	Identification And Analysis Of Extract Method Refactorings
8	Design And Implementation Of National Tax Risk Assessment Information System
9	Study On Integration Technology Of Application For Data Warehouse In Multi-Database Systems
10	The Application Of ETL In Electric Power Information Analyzing System