Structure Information Extraction- Study And Implementation On Semi-auto Wrapper

Posted on: 2012-09-03
Degree: Master
Type: Thesis
Country: China
Candidate: P C Shang
Full Text: PDF
GTID: 2178330332483809
Subject: Computer application technology
Abstract/Summary:
With the development of Web technology, the Web carries more and more information. Among all kinds of web pages, those containing structured data are particularly important. How to extract the data of interest from these structured pages is an active question in the field of Web data mining, and many methods and principles for extracting structured data now exist.

This thesis introduces two forms of structured web pages and studies an extraction algorithm for each.

The first algorithm extracts the data of interest from list pages containing flat data. A DOM tree is built by parsing the source code of the list page. Data regions are then identified by comparing subtrees within the DOM tree. The main data region is found using the similarity of leaf nodes across different generalized nodes, and the data records are distinguished within the main data region. Finally, the data items are extracted after partial tree alignment.

The thesis improves on the original algorithm: the improved version can distinguish the main data region, where the target data items are located, from the other regions, which improves extraction efficiency.

The second method targets extraction from detail pages and is semi-automatic. A sample page is first chosen manually and its target data items are labeled. An extraction rule is built from the labeled sample and used to extract the remaining pages. When the rule fails to extract the data items from a page, that page is given to the user to label as a new sample, and a new rule is generated. This iteration continues until all pages have been extracted successfully.

Taking the characteristics of real estate information into account, the thesis applies the two structured data extraction methods to pages containing real estate information. Experiments show that the algorithms described in the thesis extract the data items accurately.
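
The semi-automatic procedure for detail pages described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation. Everything in it is an assumption made for illustration: a page is reduced to a mapping from node paths to text, a rule is a mapping from field names to node paths, and the names `apply_rule`, `extract_all`, and `demo_label` are hypothetical. Keeping earlier rules and retrying them on later pages is also an assumption; the abstract only says that a new rule is generated when the current one fails.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical simplified models (assumptions, not taken from the thesis):
Page = Dict[str, str]    # node path -> text content of that node
Rule = Dict[str, str]    # field name -> node path learned from a labeled sample
Record = Dict[str, str]  # field name -> extracted value


def apply_rule(rule: Rule, page: Page) -> Optional[Record]:
    """Extract every field of the rule; return None if any path is missing."""
    record: Record = {}
    for field, path in rule.items():
        if path not in page:
            return None  # the rule does not fit this page
        record[field] = page[path]
    return record


def extract_all(pages: List[Page], label: Callable[[Page], Rule]) -> List[Record]:
    """Apply known rules to each page; ask for a new label when all of them fail."""
    rules: List[Rule] = []
    records: List[Record] = []
    for page in pages:
        record = None
        for rule in rules:
            record = apply_rule(rule, page)
            if record is not None:
                break
        if record is None:
            new_rule = label(page)  # manual labeling step
            rules.append(new_rule)
            record = apply_rule(new_rule, page)
            if record is None:
                raise ValueError("labeled rule does not match its own sample page")
        records.append(record)
    return records


# Made-up detail pages with two different layouts.
pages = [
    {"/html/body/div[1]/h1": "Sunny Apartment", "/html/body/div[2]/span": "500000"},
    {"/html/body/div[1]/h1": "River House", "/html/body/div[2]/span": "750000"},
    {"/html/body/table/tr[1]/td": "Hill Villa", "/html/body/table/tr[2]/td": "900000"},
]

label_calls: List[Page] = []  # records which pages needed manual labeling


def demo_label(page: Page) -> Rule:
    """Stand-in for the user: returns the rule a person would label by hand."""
    label_calls.append(page)
    if "/html/body/div[1]/h1" in page:
        return {"title": "/html/body/div[1]/h1", "price": "/html/body/div[2]/span"}
    return {"title": "/html/body/table/tr[1]/td", "price": "/html/body/table/tr[2]/td"}


records = extract_all(pages, demo_label)
```

In this toy run only the first and third pages are labeled by hand; the second page is handled by the rule learned from the first. A real wrapper would need a more tolerant notion of a rule fitting a page than exact path lookup.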
Keywords/Search Tags: Data Mining, Structured Data Extraction, Flat Data, List Page, Detail Page