Data Extraction And Integration In HTML Tables

Posted on: 2005-09-03
Degree: Master
Type: Thesis
Country: China
Candidate: A Z Wu
GTID: 2178360182465893
Subject: Computer technology
Abstract/Summary:
A large amount of the information available on the Web is formatted as HTML tables, which are mainly presentation-oriented and not suited to database applications. In addition, HTML tables belonging to the same domain of interest may exhibit name conflicts, structure conflicts, and semantic conflicts. As a result, capturing the information in HTML tables semantically and integrating the relevant information is a challenge, one with many potential applications, including information retrieval, data warehousing, web mining, and web content summarization and delivery.

In HTML tables, cells may span multiple rows and columns, and headings may include both row headings and column headings and may have a nested structure, so these tables are more complex than the tables in traditional relational databases. Capturing their attribute-value pairs is therefore not trivial. By using the attributes of cells, we can normalize an HTML table by inserting redundant cells into it. Based on the normalized table, we can capture the attribute-value pairs according to the headings and their corresponding data cells. For HTML tables without marked headings, the key to converting them is recognizing their headings. Because authors generally use distinct formatting, such as font size or bold, to mark headings, we can recognize the headings by measuring formatting information.

After the HTML tables have been converted into XML, we need to perform data integration at the schema level, that is, to generate a global schema. By defining which data sources are to be integrated, we can produce a list of global concepts and their hierarchies, which form the global schema of those data sources, expressed as an XML DTD. To eliminate conflicts, we define for each global concept a lexical semantic set (LSS): the set of all attribute names that correspond to that global concept. By using the LSS, we can eliminate most of the conflicts.
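The normalization and attribute-value capture steps described in this abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm. It assumes the cell attributes in question are `rowspan` and `colspan`, that a table has already been parsed into rows of `(text, rowspan, colspan)` tuples, and that only column headings are present. The function names `normalize` and `attribute_value_pairs` and the sample table are hypothetical.

```python
def normalize(rows):
    """Expand spans so every grid position holds a cell value.

    rows: list of rows; each cell is a (text, rowspan, colspan) tuple.
    Returns a rectangular list of lists of text.
    """
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already filled by a cell spanning down from above.
            while (r, c) in grid:
                c += 1
            # Copy the cell into every position it spans (the redundant cells).
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def attribute_value_pairs(grid, header_rows=1):
    """Pair each data cell with the (possibly nested) column heading above it."""
    records = []
    for row in grid[header_rows:]:
        record = {}
        for c, value in enumerate(row):
            # Join nested heading levels, dropping repeats created by spans.
            path = []
            for h in range(header_rows):
                name = grid[h][c]
                if not path or path[-1] != name:
                    path.append(name)
            record[".".join(path)] = value
        records.append(record)
    return records


# Hypothetical table: "Name" spans two rows, "Price" spans two columns.
sample = [
    [("Name", 2, 1), ("Price", 1, 2)],
    [("Retail", 1, 1), ("Wholesale", 1, 1)],
    [("Widget", 1, 1), ("10", 1, 1), ("8", 1, 1)],
]
normalized = normalize(sample)
pairs = attribute_value_pairs(normalized, header_rows=2)
```

In this sketch the nested heading `Price` over `Retail` becomes the attribute name `Price.Retail`, which corresponds naturally to a nested element when the pairs are written out as XML.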
We can solve the nondeterministic problems that remain by comparing the context of the attribute in the source with the context of each global concept involved in the conflict, to determine which global concept the attribute corresponds to. Using this new approach, which combines the global concept view expressed as an XML DTD, the LSS of each global concept for eliminating conflicts, and the conflict set and context for solving nondeterministic problems, we can integrate the data of HTML tables belonging to a domain of interest by mapping the semantic correspondence between each source schema and the global schema.

In this paper, we discuss HTML and XML, present an algorithm for normalizing HTML tables, study an approach to extracting data, converting it into XML, and integrating it, and give an example that extracts and integrates HTML tables on the Web.
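The roles of the LSS and of context can be sketched as follows. This is an illustration under stated assumptions, not the method of the thesis: the global concepts, the attribute names, the `CONTEXT` table, and the overlap-count scoring are all hypothetical, and the abstract does not say how contexts are actually compared.

```python
# Hypothetical LSS for a used-car domain: global concept -> attribute names.
LSS = {
    "price": {"price", "cost", "asking price"},
    "model": {"model", "type"},
    "body_type": {"body", "type"},  # "type" appears in two sets: a conflict
}

# Hypothetical context: attribute names that tend to accompany each concept.
CONTEXT = {
    "model": {"make", "year"},
    "body_type": {"doors", "color"},
}


def map_attribute(name, siblings):
    """Map a source attribute name to a global concept, or None if unknown.

    siblings: the other attribute names in the same source table, used as
    the attribute's context when its name matches more than one LSS.
    """
    key = name.strip().lower()
    candidates = [concept for concept, names in LSS.items() if key in names]
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    # Nondeterministic case: choose the concept whose context shares the
    # most names with the source attribute's context (assumed scoring).
    sibling_keys = {s.strip().lower() for s in siblings}
    return max(candidates, key=lambda c: len(CONTEXT.get(c, set()) & sibling_keys))


unambiguous = map_attribute("Cost", [])
as_model = map_attribute("Type", ["Make", "Year", "Price"])
as_body = map_attribute("Type", ["Doors", "Color"])
```

Here `Cost` maps directly through the LSS, while `Type` belongs to two sets and is resolved only by the attributes around it.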
Keywords/Search Tags: HTML table, XML, information extraction, information integration, semi-structured data