
Research And Implementation Of A Tree Structure Based Automatic Web Page Data Extraction Method

Posted on: 2006-11-06 | Degree: Master | Type: Thesis
Country: China | Candidate: Z Li | Full Text: PDF
GTID: 2168360155452956 | Subject: Computer software and theory
Abstract/Summary:
The amount of information currently available on the net in HTML format grows at a very fast pace, so that we may consider the Web the largest "knowledge base" ever developed and made available to the public. However, HTML sites are in some sense modern legacy systems, since such a large body of data cannot be easily accessed and manipulated. The reason is that Web data sources are intended to be browsed by humans, not computed over by applications. As a consequence, extracting data from Web pages and making it available to computer applications remains a complex and relevant task.

Data extraction from HTML is usually performed by software modules called wrappers. A wrapper is a procedure whose purpose is to convert information implicitly stored in Web pages into information explicitly presented in a predefined format for further processing. Wrappers are generated by a wrapper generation system. Generating a wrapper is equivalent to generating a set of extraction rules from sample documents: the input of the wrapper generation system is a set of sample pages, and the output is a set of rules stored in a rule repository.

This paper investigates the wrapper generation problem from a new perspective. Our goal is to fully automate the wrapper generation process, so that it does not rely on any a priori knowledge about the target pages and their contents. In particular, we aim to automate the wrapper generation process to a larger extent and to extract data from pages in data-intensive sites, which are usually generated automatically: data are stored in a back-end DBMS, and HTML pages are produced using scripts. We may formulate the problem studied in this paper as follows: given a set of sample HTML pages belonging to the same class, find the nested type of the source dataset and extract the source dataset from which the pages have been generated.
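To make the problem statement concrete, the following minimal sketch shows the forward direction that a wrapper has to invert: a script renders a nested record into HTML using a fixed template. The record layout, the field names, and the `render_page` function are illustrative assumptions and do not come from the thesis.

```python
from html import escape

# Hypothetical nested source data: each page describes one author with a
# list of books, and each book has an optional edition field.
PAGES = [
    {"author": "John Smith",
     "books": [{"title": "DB Primer", "edition": "2nd"},
               {"title": "XML at Work", "edition": None}]},
    {"author": "Paul Jones",
     "books": [{"title": "HTML Basics", "edition": None}]},
]


def render_page(record):
    """Render one record with a fixed template, as a site script might."""
    items = []
    for book in record["books"]:
        # The edition is optional, so the <i> element appears only sometimes.
        edition = f"<i>{escape(book['edition'])}</i>" if book["edition"] else ""
        items.append(f"<li><b>{escape(book['title'])}</b>{edition}</li>")
    return (f"<html><body><h1>{escape(record['author'])}</h1>"
            f"<ul>{''.join(items)}</ul></body></html>")


html_pages = [render_page(p) for p in PAGES]
```

In this toy case the nested type is a tuple of an author and a list of (title, optional edition) pairs. The wrapper generation problem is to recover that type, and the values in `PAGES`, from `html_pages` alone.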
The algorithm in this paper is based on a matching technique called DTAWE (DOM Tree based Automatic Web data Extraction), which we describe in the following.

Reduce Noise in HTML Pages. To avoid errors and missing tags in the sources, we assume that the HTML code complies with the XHTML specification, a restrictive variant of HTML in which tags are required to be properly closed and nested. To clean the HTML sources, fix errors, make the code compliant with XHTML, and build the DOM trees, the system uses JTidy, a Java port of HTML Tidy, a library for HTML cleaning.

Wrapper Generation System. We formalize the schema-finding problem. A key contribution is the definition of a new class of regular languages, called the prefix mark-up languages, which nicely abstract the typical structures usually found in HTML pages. They are identifiable in the limit, i.e., there exist unsupervised algorithms for their inference from positive examples only. We show that prefix mark-up languages, unlike other classes previously known to be identifiable in the limit, require for the inference a new form of characteristic sample, which has a high probability of being found in a set of randomly sampled HTML pages. This is the most important part of the technique, and it is implemented by the DT-Match algorithm. The input is the DOM trees of two Web pages, one called the wrapper (WRADT) and the other called the sample (SAMDT), and the algorithm matches the two trees. The two roots of the trees are matched using the Distance Finding Station, recursively. Then the two nodes at the same position are matched: if they match, their child nodes are matched in turn; if they do not, the two nodes, together with WRADT and SAMDT, are called a mismatch point. Whenever a mismatch is found, the algorithm tries to solve it by generalizing the wrapper, using suitable generalization operators.

The goal of the Labeller is to analyze the wrapper and a set of sample pages in order to locate inside the common template...
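The position-by-position matching of WRADT against SAMDT described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis's DT-Match implementation: it assumes well-formed XHTML input (parsed here with Python's `xml.etree.ElementTree` rather than JTidy), it models only one generalization operator (turning differing text into a data field), and it does not model the Distance Finding Station. The `match` function and the `FIELD` placeholder are hypothetical names.

```python
import xml.etree.ElementTree as ET

FIELD = "#PCDATA"  # hypothetical placeholder for a generalized data field


def match(wrapper, sample, path="", mismatches=None):
    """Match two element trees node by node, generalizing the wrapper in place.

    Returns the list of mismatch points that were found. Tail text after
    child elements is ignored in this sketch.
    """
    if mismatches is None:
        mismatches = []
    here = f"{path}/{wrapper.tag}"
    if wrapper.tag != sample.tag:
        # Tag mismatch: record the mismatch point and stop descending.
        mismatches.append((here, "tag", wrapper.tag, sample.tag))
        return mismatches
    w_text = (wrapper.text or "").strip()
    s_text = (sample.text or "").strip()
    if w_text != s_text:
        # String mismatch: generalize the wrapper text into a data field.
        if w_text != FIELD:
            mismatches.append((here, "text", w_text, s_text))
        wrapper.text = FIELD
    # The nodes matched, so continue with the children at the same position.
    for w_child, s_child in zip(list(wrapper), list(sample)):
        match(w_child, s_child, here, mismatches)
    if len(wrapper) != len(sample):
        # Differing child counts are only reported, not generalized.
        mismatches.append((here, "children", len(wrapper), len(sample)))
    return mismatches


wradt = ET.fromstring("<html><body><h1>John Smith</h1><p>Books</p></body></html>")
samdt = ET.fromstring("<html><body><h1>Paul Jones</h1><p>Books</p></body></html>")
found = match(wradt, samdt)
generalized = ET.tostring(wradt, encoding="unicode")
```

Here the only mismatch point is the text of `h1`, so the generalized wrapper keeps the shared template and replaces the author name with a field. Handling repeated or optional subtrees would need additional generalization operators, which this sketch leaves out.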
Keywords/Search Tags: Implementation