
Design And Implementation Of Wrapper Generation System Based On Nested-pattern In Web Pages

Posted on: 2011-12-16
Degree: Master
Type: Thesis
Country: China
Candidate: X Shen
Full Text: PDF
GTID: 2198330335459953
Subject: Software engineering
Abstract/Summary:
As the Web grows, more and more data becomes available on the Internet, and finding information of interest is convenient. We can send a query to a search engine, but the result is a huge amount of data. Data on the Internet is presented as HTML, which is semi-structured: it is easy for people to read but hard for a computer to process automatically. If the useful data can be extracted from web pages and stored in a database, deeper analysis becomes straightforward. Extracting useful information from web pages, known as Web Information Extraction and Integration, is therefore both important and necessary. Currently, generating a wrapper is a widely used way to extract information from web pages automatically.

In this thesis, we implement wrapper generation for Web Information Extraction and Integration. The system automatically generates a wrapper for web pages that contain nested-structured data. We construct a wrapper for Deep Web pages in 4 steps:

1. Pre-process the web pages and eliminate noisy data. We propose a new algorithm called ENDW, based on the "Query Keyword" and DOM trees, which preserves the integrity of the useful data.

2. Construct a suffix tree for a given web page using Ukkonen's algorithm. Suffix trees are used to discover all continuous repeated substrings. We treat the HTML code of a web page as a string: the noise-free HTML produced by step 1 is the input from which the suffix tree is built.

3. Search the suffix tree for all continuous repeated substrings. In Deep Web pages, the data records displayed on a page appear as continuous repeated substrings, so the nested structure can be discovered from them. These substrings are the basis for the regular expression abstracted in the next step.

4. Generate a regular expression that represents the pattern (structure) of the web pages, and use it as the wrapper.
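Steps 3 and 4 can be illustrated with a small sketch. It is not the thesis's implementation: it replaces the Ukkonen suffix tree with a brute-force scan, handles only a single flat level of repetition rather than nested patterns, and skips noise elimination entirely. The names `TagCollector`, `tag_sequence`, `find_tandem_repeats`, `best_repeat`, and `wrapper_regex` are hypothetical and chosen only for this example.

```python
import re
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects tag names only; text content and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")


def tag_sequence(html):
    """Reduce a page to its sequence of tags, treated as one long string."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tokens


def find_tandem_repeats(tokens, min_copies=2):
    """Brute-force stand-in for the suffix-tree search (assumption, not Ukkonen).

    Returns (start, unit_length, copies) for every run in which the same
    unit of tokens occurs back to back at least min_copies times.
    """
    n = len(tokens)
    repeats = []
    for start in range(n):
        for length in range(1, (n - start) // min_copies + 1):
            unit = tokens[start:start + length]
            copies = 1
            pos = start + length
            while tokens[pos:pos + length] == unit:
                copies += 1
                pos += length
            if copies >= min_copies:
                repeats.append((start, length, copies))
    return repeats


def best_repeat(tokens):
    """Pick the run covering the most tokens; ties prefer the shorter unit."""
    repeats = find_tandem_repeats(tokens)
    if not repeats:
        return None
    return max(repeats, key=lambda r: (r[1] * r[2], -r[1], -r[0]))


def wrapper_regex(tokens):
    """Generalise the best run into prefix(?:unit)+suffix as a flat wrapper."""
    rep = best_repeat(tokens)
    if rep is None:
        return "".join(re.escape(t) for t in tokens)
    start, length, copies = rep
    end = start + length * copies
    prefix = "".join(re.escape(t) for t in tokens[:start])
    unit = "".join(re.escape(t) for t in tokens[start:start + length])
    suffix = "".join(re.escape(t) for t in tokens[end:])
    return f"{prefix}(?:{unit})+{suffix}"


if __name__ == "__main__":
    page = "<ul><li><b>A</b></li><li><b>B</b></li><li><b>C</b></li></ul>"
    tokens = tag_sequence(page)
    print(best_repeat(tokens))
    print(wrapper_regex(tokens))
```

The resulting expression matches the tag sequence of any page with one or more records of the same shape, which is the sense in which a regular expression can serve as a wrapper. The brute-force scan is far slower than a suffix-tree search on long inputs, so it is suitable only for illustrating the idea on small pages.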
Keywords/Search Tags: Web Information Extraction, Deep Web, Noise Elimination, Suffix Tree