Semantic hierarchies of HTML documents and their applications

Posted on: 2002-11-14
Degree: Ph.D.
Type: Dissertation
University: Brigham Young University
Candidate: Lim, SeungJin
Full Text: PDF
GTID: 1468390011490960
Subject: Computer Science
Abstract/Summary:
Traditional Web query languages for hierarchical data in HTML documents rely on grammar-based HTML parse trees, in which character data and HTML tags are intermingled. It is therefore difficult to query hierarchical data with an existing query language without prior knowledge of the documents' internal tag structures. In addition, the data hierarchy suggested by a parse tree is quite different from the one a user would perceive from the document as rendered in a Web browser.

We propose a new data hierarchy for an HTML document, called a semantic hierarchy, that (i) allows the user to query hierarchical data without prior knowledge of the document's internal tag structures and (ii) provides the document's data hierarchy from a human perspective. We found that semantic hierarchies are useful in solving three problems: (i) automated transformation of HTML documents into XML documents, (ii) structural integration of HTML documents, XML documents, and relational tables, and (iii) change detection between any two given HTML documents. We were able to construct semantic hierarchies at a rate above 70% for 235 real-world HTML documents located at 174 different Web sites.
Keywords/Search Tags: HTML documents, Semantic hierarchies, Hierarchical data, Data hierarchy, Internal tag structures
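
Illustration: the abstract's point about grammar-based parse trees can be made concrete with a small sketch. It is not taken from the dissertation and does not implement semantic hierarchies; it only prints, for a hypothetical page, the tag path that encloses each piece of character data, using Python's standard `html.parser`. The names `TagPathCollector`, `tag_paths`, and `SAMPLE` are assumptions made for this example.

```python
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Record each piece of character data with the tag path enclosing it."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, outermost first
        self.records = []  # (tag path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        # Simplification: void elements such as <br> are not special-cased.
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.records.append(("/".join(self.stack), text))


def tag_paths(html):
    """Return (tag path, text) pairs for every non-blank text node."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.records


# Hypothetical page: a reader perceives "Fruits" as a heading over two items.
SAMPLE = (
    "<html><body>"
    "<p><b>Fruits</b></p>"
    "<table><tr><td>Apple</td></tr><tr><td>Pear</td></tr></table>"
    "</body></html>"
)

for path, text in tag_paths(SAMPLE):
    print(path, "->", text)
```

In this sample, "Fruits" sits under `html/body/p/b` while "Apple" and "Pear" sit under `html/body/table/tr/td`. The parse tree therefore places the heading and its items in sibling subtrees, and a query for the items must name tags such as `table`, `tr`, and `td`, which is the kind of prior knowledge of internal tag structure the abstract refers to.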