Research On The Technology Of Web Data Extraction

Posted on:2015-03-26

Degree:Master

Type:Thesis

Country:China

Candidate:L J Chang

Full Text:PDF

GTID:2308330464471372

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of the Internet, the Web has become a huge space for sharing information, and the data it holds can be reused in tasks such as data mining and data integration. Web data extraction studies how to extract the data that users are interested in from webpages. This thesis mainly studies how to extract data from two kinds of webpages: list pages and detail pages (content pages).

List pages are pages that contain one or more tables. Automatic extraction from such pages has already been studied, but because their forms and templates vary widely, some problems remain. First, data records are organized in diverse ways, which can cause several real data records to be extracted as a single data record. Second, the existing simple tree matching algorithm considers only tag names, yet many fields within a data record share the same tag name, so two data records can be matched in more than one way. To solve these problems, after mining the data regions, this thesis analyzes each generalized node, which may represent a data record, in order to identify the real data records. Building on the existing simple tree matching algorithm, it also takes the content contained in each node into account, which improves the accuracy of data field extraction.

Unstructured content pages focus on a detailed description of a single object, and this thesis implements a block-based body extraction algorithm for them. The blocking algorithm mines the blocks in a page based on the DOM tree and the page's visual information. A classification learning method is then trained on a training set, and the body block is extracted based on the spatial characteristics of the blocks.

For structured content pages, the attribute values of an object can be extracted automatically by matching two similar pages. However, pages contain noise such as advertisements around the body block, and the advertisements may differ between two similar pages, which degrades the matching algorithm. To address this, the thesis first applies the body block extraction algorithm to each of the two similar pages and then matches only the two extracted body blocks, which improves the accuracy of attribute value extraction from structured pages.
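
The ambiguity that tag-only simple tree matching runs into on list pages, and the effect of also looking at node content, can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how node content is compared, so the `Node` class, the `text_shape` signature, and the one-point content bonus are all assumptions made purely for illustration. The dynamic program itself follows the usual formulation of simple tree matching, in which two trees score 0 if their root tags differ and otherwise 1 plus the best order-preserving matching of their children.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical DOM-like node: a tag name, its own text, and child nodes."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def text_shape(text):
    """Assumed content signature: digit runs become '9', letter runs become 'a'."""
    shape = re.sub(r"\d+", "9", text.strip())
    return re.sub(r"[^\W\d_]+", "a", shape)


def node_score(a, b, use_content):
    """Score for pairing two nodes: 0 if tags differ, 1 if they match,
    plus an assumed bonus of 1 when both texts have the same shape."""
    if a.tag != b.tag:
        return 0.0
    score = 1.0
    if use_content and a.text and text_shape(a.text) == text_shape(b.text):
        score += 1.0
    return score


def tree_match(a, b, use_content=False):
    """Simple tree matching via dynamic programming over the child sequences."""
    root = node_score(a, b, use_content)
    if root == 0.0:
        return 0.0
    m, n = len(a.children), len(b.children)
    best = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = tree_match(a.children[i - 1], b.children[j - 1], use_content)
            best[i][j] = max(best[i - 1][j], best[i][j - 1], best[i - 1][j - 1] + pair)
    return best[m][n] + root


# Two records whose fields all use the same tag; the second lacks a name field.
rec1 = Node("tr", children=[Node("td", "Widget"), Node("td", "$12.99")])
rec2 = Node("tr", children=[Node("td", "$8.50")])

print(tree_match(rec1, rec2))                    # 2.0
print(tree_match(rec1, rec2, use_content=True))  # 3.0
```

With tag names alone, the single `td` of the second record matches either `td` of the first equally well, so the score of 2.0 says nothing about which field it corresponds to. With the assumed content bonus, pairing the two price-like cells scores higher than pairing the price with the name, so the higher score of 3.0 is reached only by the intended alignment.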

Keywords/Search Tags:

Web data extraction, List pages, DEPTA, Content pages, RoadRunner

Related items

1	Features Extraction And Duplicate Pattern Detection Of Web Pages
2	Research On Chinese Blog Pages Recognition And Content Extraction
3	Research Of Data Extraction Technology Based On Tag Tree From List Pages
4	Research On The Technology Of Incremental Web Pages Crawler
5	Automatic Ranking List Extraction From Web Pages Based On Visual And Sematic Information
6	Study And Design Of Information Integration Model Based On Web Pages Content
7	Research On Content Extraction In HTML Web Pages Based Multi-Features
8	Research Of Web Information Extraction Method Based On Multi-feature Mining
9	Research Of Automatic Metadata Extraction From Template Web Pages
10	The Designation And Implementation Of Content-aware System Of News Webpages