Data Extraction And Integration In HTML Tables

Posted on:2005-09-03

Degree:Master

Type:Thesis

Country:China

Candidate:A Z Wu

GTID:2178360182465893

Subject:Computer technology

Abstract/Summary:

A large amount of information on the Web is formatted as HTML tables, which are mainly presentation-oriented and poorly suited to database applications. In addition, different HTML tables belonging to the same domain of interest may exhibit name conflicts, structure conflicts, and semantic conflicts. Capturing the information in HTML tables semantically and integrating the relevant information is therefore a challenge, with many potential applications, including information retrieval, data warehousing, web mining, and web content summarization and delivery.

HTML tables are more complex than tables in traditional relational databases: cells may span multiple rows and columns, headings may include both row headings and column headings, and headings may be nested. Capturing their attribute-value pairs is therefore not trivial. Using the attributes of cells, we normalize an HTML table by inserting redundant cells into it. From the normalized table, we capture the attribute-value pairs according to the headings and their corresponding data cells. For HTML tables without marked headings, the key to converting them is recognizing their headings. Because authors generally use distinct formatting, such as font size or bold type, to mark headings, we recognize headings by measuring formatting information.

After the HTML tables have been converted into XML, we perform data integration at the schema level, that is, we generate a global schema. By defining which data sources are to be integrated, we produce a list of global concepts and their hierarchies, which together form the global schema of those data sources, expressed as an XML DTD. To eliminate conflicts, we define a lexical semantic set (LSS) for each global concept as the set of all attribute names corresponding to that global concept. Using the LSS, we can eliminate most of the conflicts.
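As an illustration of the normalization and attribute-value capture described above, the following is a minimal sketch and not the thesis's own algorithm. It assumes that the cell attributes in question are `rowspan` and `colspan`, that headings are the first `heading_rows` rows of the table, and that nested headings are joined with `/`. The names `TableCellParser`, `normalize`, `attribute_value_pairs`, and `SAMPLE` are hypothetical.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text and span attributes of each cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # list of rows; each row is a list of cell dicts
        self._cell = None   # cell currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = {
                "text": "",
                "rowspan": int(a.get("rowspan") or 1),
                "colspan": int(a.get("colspan") or 1),
            }

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._cell["text"] = self._cell["text"].strip()
            self.rows[-1].append(self._cell)
            self._cell = None


def normalize(html):
    """Return a rectangular grid in which spanning cells are duplicated."""
    parser = TableCellParser()
    parser.feed(html)
    grid = {}  # (row, col) -> cell text
    for r, row in enumerate(parser.rows):
        c = 0
        for cell in row:
            while (r, c) in grid:  # skip slots filled by an earlier rowspan
                c += 1
            # Insert the redundant copies that the span implies.
            for dr in range(cell["rowspan"]):
                for dc in range(cell["colspan"]):
                    grid[(r + dr, c + dc)] = cell["text"]
            c += cell["colspan"]
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def attribute_value_pairs(grid, heading_rows=1):
    """Pair each data cell with the heading path above it (assumed layout)."""
    records = []
    for row in grid[heading_rows:]:
        record = {}
        for c, value in enumerate(row):
            path = []
            for h in range(heading_rows):
                name = grid[h][c]
                # Drop consecutive repeats created by normalization.
                if not path or path[-1] != name:
                    path.append(name)
            record["/".join(path)] = value
        records.append(record)
    return records


SAMPLE = """
<table>
<tr><th rowspan="2">Name</th><th colspan="2">Score</th></tr>
<tr><th>Math</th><th>Art</th></tr>
<tr><td>Ann</td><td>90</td><td>85</td></tr>
</table>
"""
```

Calling `attribute_value_pairs(normalize(SAMPLE), heading_rows=2)` pairs each data cell with a heading path such as `Score/Math`. Row headings and tables whose headings must first be recognized from formatting are outside the scope of this sketch.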
The remaining nondeterministic cases are resolved by comparing the context of the attribute in the source with the context of each global concept involved in the conflict, to determine which global concept the attribute corresponds to. This approach combines the global concept view expressed as an XML DTD, the LSS of each global concept for eliminating conflicts, and the conflict set and context for resolving nondeterministic cases. With it, we can integrate the HTML tables belonging to a domain of interest by mapping each source schema to the global schema.

In this paper, we discuss HTML and XML, present an algorithm for normalizing HTML tables, study an approach for extracting data, converting it into XML, and integrating it, and give an example that extracts and integrates HTML tables from the Web.
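The LSS lookup and the context comparison described above might look roughly like the following sketch. The abstract does not define how contexts are compared, so the set-overlap score used here is an assumption, as are the example concepts, their LSS entries, and the names `GLOBAL_CONCEPTS` and `map_attribute`.

```python
# Hypothetical global concepts. Each has a lexical semantic set (LSS) of
# source attribute names and a context, assumed here to be a set of
# attribute names that typically appear alongside the concept.
GLOBAL_CONCEPTS = {
    "book_title": {
        "lss": {"title", "name", "book"},
        "context": {"author", "isbn", "publisher"},
    },
    "author_name": {
        "lss": {"author", "name", "writer"},
        "context": {"title", "affiliation"},
    },
    "price": {
        "lss": {"price", "cost"},
        "context": {"currency", "discount"},
    },
}


def map_attribute(attribute, source_context, concepts=GLOBAL_CONCEPTS):
    """Map a source attribute name to a global concept, or None if unknown."""
    name = attribute.strip().lower()
    # Candidates are the concepts whose LSS contains the attribute name.
    candidates = [c for c, info in concepts.items() if name in info["lss"]]
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    # More than one candidate forms a conflict set. Choose the concept whose
    # context shares the most names with the source context (assumed
    # measure). Ties fall to the first candidate in dictionary order.
    context = {w.strip().lower() for w in source_context}
    return max(candidates, key=lambda c: len(context & concepts[c]["context"]))
```

For example, the ambiguous attribute `name` maps to `book_title` when its neighbouring attributes are `isbn` and `publisher`, and to `author_name` when they include `affiliation`. A real system would likely need a richer notion of context than name overlap.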

Keywords/Search Tags:

HTML table, XML, information extraction, information integration, semi-structured data

Related items

1	ClusTex: Using clustering techniques for information extraction from HTML pages containing semi-structured data
2	Research On Key Issues Of Web Information Integration Oriented Web Information Extraction
3	Research On Technology Of Table Information Extraction In Semi-Structured Texts
4	Semi-structured Web Information Extraction Technology And Its Application
5	Web Information Extraction And Integration
6	Research On The Technology Of The Web Employment Information Extraction Based On The HTML
7	Research And Application Of Extraction Method Of Semi-structured Text Information
8	The Implementation And Application Of Extracting Structured Data From Web Pages
9	Research On Semantic Information Extraction For Semi-structured Documents
10	Design And Implementation Of The Core Information Extraction System Of Semi-structured Financial Contract