Font Size: a A A

Research Of A Suffix Tree Based Automatic Wrapper Generation Method

Posted on:2006-06-21Degree:MasterType:Thesis
Country:ChinaCandidate:Y L ZhangFull Text:PDF
GTID:2168360155452995Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
The World Wide Web is a vast and rapidly growing source of information. And most of this information is in semi-structured HTML pages. HTML serves the visual presentation of data in Internet browsers, but does not provide semantic information for the data presented. This form of data is, therefore, inappropriate for computer applications to process automatically. Web information extraction tries to extract data items from Web pages, usually semi-structured, and return them in a structured format so as to automate the Web information processing. Extracting data from Web requires a wrapper that "wrap"around a Web site and extracts data from Web pages. Wrappers are specialized program routines that automatically extract data from Web pages and convert the information into a structured format. Currently, most approaches to wrapper construction are semi-automated: they either need human involvement or have mainly focused on extracting plain-structured data objects with a fixed number of attributes and values, and usally cannot handle nested-structured data objects, whose instances may have variable number of values on their attributes. In this paper, we make a research on a suffix tree based automatic wrapper generation method. The wrapper generated by this method can handle not only data objects with plain structures, but also those with nested structures. Targets of this research are the so-called data intensive HTML pages, which are often dynamically generated. That is, the data is stored in a back-end database, after receiving a user request, a server script program fetches data from the back-end database and fills it into a predefined HTML template and generates the final HTML page. The structures of pages generated by the same script are the same or quite similar. The automatic wrapper generation method provided in this paper includes the following approaches: 1. Automatically remove the noise of Web pages. Many Web pages, especially those from commercial Web sites, usually present the main content data along with some "decoration", such as navigation bars, advertisements, etc. These decoration parts are so-called "noise content". The noise content in web pages decreases the accuracy and speed of wrapper generation process which base on the content of web pages. The noise content in different pages that generated by same template is always almost the same, while the main data content is different with each other. Therefore, the corresponding DOM trees of these pages are similar, and have some subtrees that are completely the same. And the noise content locates in these subtrees. A conception of the noise granularity is proposed in this paper in order to distinguish the noise content from the data content. Based on the discovery that the same parts of noise content usually have a higher level of granularity than those of data content, this paper propose a DOM tree based granularity controllable noise removing algorithm.The main idea of this approach is as follows: First, we parse the HTML pages and convert them into DOM trees. Then we use a tree matching algorithm to match the target and sample trees. During the matching process, we can identify the subtrees that are exactly the same in different DOM trees. If the granularity of the subtrees is higher than or equal to the pre-defined noise granularity, then these subtrees should be identified as noise subtrees and removed from the DOM trees. HTML pages with different types may have different noise granularity. By set appropriate noise granularity for different page types, it can improve the accuracy and efficiency of noise removing. We also use other methods to enhance the efficiency of the main algorithm, such as: DOM tree filtering by some heuristic rules. Using XPath to record the path of noise subtree and locate the noise subtrees. Filtering the random advertisements by a black-list. If we use a default setting for noise granularity, the noise removing process will be fully automated. 2. Automatically discover the nested pattern of HTML pages. The discovery of the nested pattern is the key problem of this paper. If thepage contains several instances of a data object, the HTML tag-structure enclosing data objects may appear repeatedly. Therefore, we can discover the nested pattern of the page by discovering the repeated HTML tag-structures. If we use a special symbol, PCDATA, to denote the arbitrary text enclosed by a pair of tags, the HTML page then becomes a symbol sequence composed of HTML tags and the PCDATA symbol. Symbol sequences can be mapped to strings, so we can use corresponding methods of string processes to discover the nested pattern. A suffix tree is a compacted trie that represents the suffixes of a string, and every substring appears in a path start from the root. The suffix tree is a classic data structure that exposes the internal structure of a string, and it allows many problems on strings to be solved efficiently, such as the problems concerning the identification and location of repeated substrings, etc. The iterative nested pattern discovery procedure is as follows: a) Parse the DOM tree into a well-formed symbol sequence. At the same time, construct the Alphabet of the page, and build the mappings from symbols to integers and Unicode characters. Then map the symbol sequence to a string composed of Unicode characters. b) Build the suffix tree for the string, and use the contiguous repeats detection algorithm to discover the contiguous repeated substrings, which are then verified to discover the repetitive patterns. If there are no repetitive patterns, the procedure stops. c) Substitute the repeated units for the repetitive patterns and go to (b). In this paper, we use a linked list structure of repetitive patterns to record the nested pattern discovery process. The linked list can be converted into a union-free regular expression which can reprensent the nested structure of the page. 3. Automatically discover the optional pattern of HTML pages. The nested pattern discovery process uses a single page to discover the nested patterns. Therefore, it can not discover the optional patterns.This paper propose a novel page tree matching algorithm to discover the optional patterns using a set of sample pages that have already been processed by the nested pattern discovery...
Keywords/Search Tags:semi-structured data, automatic wrapper generation, suffix tree, information extraction, Web page purification
PDF Full Text Request
Related items