Structure Information Extraction- Study And Implementation On Semi-auto Wrapper

Posted on:2012-09-03

Degree:Master

Type:Thesis

Country:China

Candidate:P C Shang

Full Text:PDF

GTID:2178330332483809

Subject:Computer application technology

Abstract/Summary:

With the development of Web technology, the Web carries more and more information. Among all kinds of web pages, those containing structured data are especially important. How to extract the data of interest from these structured pages is a hot topic in the field of Web data mining, and many methods and principles for structured data extraction now exist.

This thesis introduces two forms of structured web pages and studies an extraction algorithm for each.

The first algorithm extracts the data of interest from list pages containing flat data. A DOM tree is built by parsing the source code of the list page. Data regions are then identified by comparing subtrees within the DOM tree. The main data region is found using the similarity of leaf nodes in different generalized nodes, and data records are identified within the main data region. Finally, data items are extracted after partial tree alignment. The thesis improves the original algorithm so that it can distinguish the main data region, which holds the target data items, from the other regions. This improves the efficiency of extraction.

The second method targets extraction from detail pages and is semi-automatic. A sample page is first chosen manually and its target data items are labeled. An extraction rule is built from the labeled sample and used to extract the remaining pages. When the rule cannot extract the data items from some page, that page is given to the user to label as a new sample, and a new rule is generated. This iteration continues until all pages have been extracted successfully.

Based on the characteristics of real estate information, the thesis applies both extraction methods to pages containing real estate information. Experiments show that the algorithms described in the thesis extract the data items accurately.
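
The semi-automatic loop for detail pages can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say what form an extraction rule takes, so the sketch assumes a simple rule that remembers the text immediately to the left and right of each labeled value. The names `induce_rule`, `apply_rule`, `extract_all`, and `label_page`, as well as the `CONTEXT` width, are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumed rule form: field name -> (left delimiter, right delimiter).
Rule = Dict[str, Tuple[str, str]]
CONTEXT = 12  # characters of context kept on each side (arbitrary choice)


def induce_rule(page: str, labels: Dict[str, str]) -> Rule:
    """Build a delimiter rule from one hand-labeled sample page.

    Assumes every labeled value occurs in the page and is surrounded by
    markup; str.index raises ValueError if a value is missing.
    """
    rule: Rule = {}
    for field, value in labels.items():
        start = page.index(value)
        end = start + len(value)
        rule[field] = (page[max(0, start - CONTEXT):start],
                       page[end:end + CONTEXT])
    return rule


def apply_rule(page: str, rule: Rule) -> Optional[Dict[str, str]]:
    """Extract all fields with one rule, or return None if any field fails."""
    record: Dict[str, str] = {}
    for field, (left, right) in rule.items():
        start = page.find(left)
        if start == -1:
            return None
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            return None
        record[field] = page[start:end]
    return record


def extract_all(
    pages: List[str],
    label_page: Callable[[str], Dict[str, str]],
) -> Tuple[List[Dict[str, str]], List[Rule]]:
    """Extract every page, asking for a new labeled sample when all rules fail."""
    rules: List[Rule] = []
    records: List[Dict[str, str]] = []
    for page in pages:
        record = None
        for rule in rules:
            record = apply_rule(page, rule)
            if record is not None:
                break
        if record is None:
            # No existing rule fits: the user labels this page and a new
            # rule is generated from it.
            labels = label_page(page)
            rules.append(induce_rule(page, labels))
            record = dict(labels)
        records.append(record)
    return records, rules
```

Here `label_page` stands in for the manual labeling step; in an interactive tool it would show the page to the user, while in a test it can simply look up prepared labels. The number of rules returned equals the number of pages the user had to label.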

Keywords/Search Tags:

Data Mining, Structured Data Extraction, Flat Data, List Page, Detail Page

Related items

1	Research On Web Data Extraction Based On Web Page Structure
2	The Research And Implementation Of One Kind Of Web Page Filtering Method Based On Real-Time Network Traffic Data
3	Research Of Data Extraction Technology Based On Tag Tree From List Pages
4	The Research On Data-object Oriented Page Based The Big Data
5	Research On WEB Page Structure And Data Extraction Technology
6	Study On Automatic Extraction Of Web Data Based On DOM
7	Research On Mining Structure Of WEB Page For Information Extraction
8	Based On The Theme And Structure Of The Xml Page Data Extraction
9	Research And Implementation Of Chinese Web-page Classification Based On Web Data-mining
10	The Research Of Web Pages Information Extraction Based On Page Structure Analysis Technique