Structured data extraction from the Web

Posted on: 2007-12-29
Degree: Ph.D.
Type: Thesis
University: University of Illinois at Chicago
Candidate: Zhai, Yanhong
Full Text: PDF
GTID: 2448390005971327
Subject: Computer Science
Abstract/Summary:
This thesis studies the problem of extracting structured data from Web pages (semi-structured documents). Structured data on the Web are usually data records that are retrieved from an underlying database and displayed in Web pages according to patterns defined in fixed templates. Extracting such data records is useful because it enables one to obtain and integrate information from multiple sources and to provide value-added services. With more and more companies and organizations disseminating information on the Web, the ability to extract data records from Web pages is becoming increasingly challenging and important.

In this thesis, we study the existing techniques in the area of Web data extraction and analyze their limitations. Two novel approaches are proposed to extract data on the Web from two different types of data-rich pages: (1) automatic extraction based on mining repeated patterns; and (2) wrapper generation based on instance-based learning. In the first approach, given a single page containing multiple data records, structured data are automatically identified, extracted, and put in a database table. To achieve this, the visual information of HTML elements and a tree matching algorithm are used to mine similar patterns, which correspond to data records. To extract data items from the identified data records, a novel partial tree alignment algorithm is devised. In the second approach, given a set of pages, the user marks (labels) the items of interest in a single page, and the system can then begin extracting similar data items from the rest of the pages. A novel similarity measure is proposed to measure the similarity between two data items in terms of their markup encoding.
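The abstract does not say which tree matching algorithm is used in the first approach. As an illustration only, the sketch below assumes Simple Tree Matching, a standard dynamic-programming algorithm for ordered labeled trees, and shows how two candidate data records (HTML subtrees) could be scored for structural similarity. The names `Node`, `simple_tree_matching`, `tree_size`, and `normalized_similarity` are hypothetical, and the visual information mentioned in the abstract is not modeled here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A minimal ordered tree node standing in for an HTML element (hypothetical)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum order-preserving matching."""
    # Roots with different tags cannot be matched, so their subtrees score 0.
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j] = best matching between the first i children of a
    # and the first j children of b, preserving sibling order.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    # Add 1 for the matched pair of roots.
    return dp[m][n] + 1


def tree_size(node: Node) -> int:
    """Count the nodes in a tree."""
    return 1 + sum(tree_size(child) for child in node.children)


def normalized_similarity(a: Node, b: Node) -> float:
    """Matching size divided by the average tree size, giving a value in [0, 1]."""
    return simple_tree_matching(a, b) / ((tree_size(a) + tree_size(b)) / 2)


# Two table rows that could be candidate data records on the same page.
record_1 = Node("tr", [Node("td", [Node("img")]), Node("td", [Node("a")]), Node("td")])
record_2 = Node("tr", [Node("td", [Node("img")]), Node("td", [Node("a")])])

score = normalized_similarity(record_1, record_2)
```

Under this assumption, subtrees whose normalized score exceeds a chosen threshold would be treated as instances of the same repeated pattern; the threshold itself is a tuning choice not given in the abstract.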
Keywords/Search Tags: Data, Web, Extract