Design And Implementation Of Wrapper Generation System Based On Nested-pattern In Web Pages

Posted on:2011-12-16

Degree:Master

Type:Thesis

Country:China

Candidate:X Shen

GTID:2198330335459953

Subject:Software engineering

Abstract/Summary:

As the Web grows, more and more data becomes available on the Internet, and it is convenient to find information of interest. We can send a query to a search engine to obtain that information, but we must then face a huge amount of data. Data on the Internet is presented as HTML, which is semi-structured: it is easy for people to read but hard for a computer to process automatically. If we can extract the useful data from web pages and store it in a database, deep analysis becomes much easier. Extracting useful information from web pages, known as Web Information Extraction and Integration, is therefore important and necessary. Currently, wrapper generation is widely used to extract information from web pages automatically.

In this thesis, we implement a wrapper generation system for Web Information Extraction and Integration. It automatically generates wrappers for web pages that contain nested-structured data. We construct a wrapper for Deep Web pages in four steps:

1. Pre-process the web pages and eliminate noisy data. We propose a new algorithm called ENDW, based on the "Query Keyword" and DOM trees, to ensure the integrity of the useful data.
2. Construct a suffix tree for the given web page using Ukkonen's algorithm. Suffix trees are used to discover all continuous repeated substrings. We treat the HTML code of a web page as a string; after the page is processed in step 1, the noise-free HTML code is the input from which the suffix tree is built.
3. Search the suffix tree for all continuous repeated substrings. On Deep Web pages, the data records displayed are continuous repeated substrings, so nested structure can be discovered from them. In the next step, the regular expression representing the pattern (structure) of the web page is abstracted from these continuous repeated substrings.
4. Generate a regular expression, which serves as the wrapper, that represents the structure of the web pages.
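
The idea behind steps 3 and 4 can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it replaces the Ukkonen suffix tree with a naive scan for consecutive repeats, it handles only a flat (non-nested) record list, and it omits the ENDW noise-elimination step. All function names and the sample page are hypothetical, chosen only for illustration.

```python
import re

def tokenize(html):
    """Turn HTML into a token sequence: tags are kept, text becomes '#TEXT'."""
    tokens = []
    for tag, text in re.findall(r"(<[^>]+>)|([^<]+)", html):
        if tag:
            tokens.append(tag.lower())
        elif text.strip():
            tokens.append("#TEXT")
    return tokens

def find_tandem_repeats(tokens, min_copies=2):
    """Naively list (start, unit_length, copies) for consecutive repeated units.

    Assumption: a brute-force scan stands in for the suffix-tree search
    described in the abstract; it is far slower but easier to follow.
    """
    n = len(tokens)
    found = []
    for start in range(n):
        for length in range(1, (n - start) // min_copies + 1):
            unit = tokens[start:start + length]
            copies = 1
            # Count how many times the unit repeats back to back.
            while tokens[start + copies * length:
                         start + (copies + 1) * length] == unit:
                copies += 1
            if copies >= min_copies:
                found.append((start, length, copies))
    return found

def best_repeat(tokens):
    """Pick the repeat that covers the most tokens (a simple heuristic)."""
    repeats = find_tandem_repeats(tokens)
    return max(repeats, key=lambda r: r[1] * r[2]) if repeats else None

def unit_to_regex(unit):
    """Build a regular expression for one record; text slots become groups."""
    parts = ["([^<]+)" if tok == "#TEXT" else re.escape(tok) for tok in unit]
    return re.compile(r"\s*".join(parts), re.IGNORECASE)

# Hypothetical result page with three data records.
SAMPLE = ("<ul><li><b>Apple</b><i>1.20</i></li>"
          "<li><b>Pear</b><i>0.80</i></li>"
          "<li><b>Plum</b><i>2.50</i></li></ul>")

tokens = tokenize(SAMPLE)
start, length, copies = best_repeat(tokens)
wrapper = unit_to_regex(tokens[start:start + length])
records = wrapper.findall(SAMPLE)
print(records)  # [('Apple', '1.20'), ('Pear', '0.80'), ('Plum', '2.50')]
```

In this sketch the repeated unit found in the token sequence plays the role of a data record, and the regular expression derived from it plays the role of the wrapper. Tags with differing attribute values, and records nested inside other records, would need the more careful pattern abstraction that the abstract describes.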

Keywords/Search Tags:

Web Information Extraction, Deep Web, Noise Elimination, Suffix Tree

Related items

1	Research Of A Suffix Tree Based Automatic Wrapper Generation Method
2	Finding MUMs With Enhanced Suffix Arrays
3	Research On Construction Of Index Structure For Biological Sequences
4	Research Of Web Information Extraction Technology Based On Semantic
5	Research On Automatic Extraction Algorithm Of Internet Web Technology Data
6	Multi-pattern Matching With Wildcards Based On Suffix Tree And Suffix Array
7	An Algorithm Based On Suffix Tree For Identification Of Repeats In DNA Sequence
8	Research On Data Extraction Of Deep Web Based On Visual Information And Tree Match
9	Research Of Finding Maximal Unique Matches In Genome
10	Research On Vietnamese News Topic Recognition Method Based On Suffix Tree Clustering Algorithm