Research On The Technology Of Web Data Extraction

Posted on: 2015-03-26
Degree: Master
Type: Thesis
Country: China
Candidate: L J Chang
Full Text: PDF
GTID: 2308330464471372
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, the Web has become a huge space for sharing information. The data it holds can be reused in tasks such as data mining and data integration. Web data extraction studies how to extract the data that users are interested in from web pages. This thesis studies extraction from two kinds of pages: list pages and detail pages.

List pages are pages that contain one or more tables of data records. Automatic extraction from list pages has been studied before, but because their forms and templates vary, problems remain. First, data records are organized in diverse ways, so several real data records may be extracted as a single record. Second, the existing simple tree matching algorithm considers only tag names, yet many fields in a data record share the same tag name, so two data records can be matched in more than one way. To address these problems, after mining the data regions, this thesis analyzes each generalized node, which may represent a data record, in order to identify the real data records. Building on the existing simple tree matching algorithm, it also takes the content of each node into account, which improves the accuracy of data field extraction.

Unstructured content pages give a free-form description of a single object. For them, this thesis implements a block-based body extraction algorithm. A block segmentation algorithm first mines the blocks of the page from its DOM tree and visual information. A classifier is then trained on a training set, and the body block is extracted based on the spatial features of the blocks. For structured content pages, the attribute values of an object can be extracted automatically by matching two similar pages.
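
The idea of extending simple tree matching with node content, described above for list pages, can be illustrated with a minimal sketch. The abstract does not say how content enters the matching score, so the `content_kind` heuristic and the 0.5 bonus below are assumptions made purely for illustration, as are the `Node` class and all function names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A simplified DOM node: tag name, own text, ordered children."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def content_kind(text: str) -> str:
    """Hypothetical content heuristic: a coarse type for a node's text."""
    stripped = text.strip()
    if not stripped:
        return "empty"
    if any(ch.isdigit() for ch in stripped):
        return "numeric"
    return "text"


def content_bonus(a: str, b: str) -> float:
    """Assumed bonus of 0.5 when both texts have the same non-empty kind."""
    kind_a, kind_b = content_kind(a), content_kind(b)
    return 0.5 if kind_a == kind_b and kind_a != "empty" else 0.0


def simple_tree_matching(a: Node, b: Node, use_content: bool = False) -> float:
    """Score of the best top-down, order-preserving matching of two trees.

    With use_content=False this is the classic tag-only recurrence.
    With use_content=True each matched node pair may earn an extra bonus.
    """
    if a.tag != b.tag:
        return 0.0
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best score matching the first i children of a
    # against the first j children of b.
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(
                a.children[i - 1], b.children[j - 1], use_content
            )
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + w)
    score = 1.0  # the two roots match each other
    if use_content:
        score += content_bonus(a.text, b.text)
    return dp[m][n] + score


# Two records from the same list; the second one lacks a title field.
title = Node("span", "Blue Widget")
price_a = Node("span", "$10")
price_b = Node("span", "$12")
record_a = Node("li", children=[title, price_a])
record_b = Node("li", children=[price_b])
```

With tag names alone, the lone `span` in `record_b` matches the title and the price of `record_a` equally well, which is the ambiguity the abstract describes. Under the assumed content bonus, the price-to-price pairing scores higher and the tie disappears.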
However, pages contain noise such as advertisements around the body block, and the advertisements may differ between two similar pages, which degrades the matching. To address this, before matching two similar pages, this thesis applies the body block extraction algorithm to each of them and then matches the two extracted body blocks, which improves the accuracy of attribute value extraction from structured pages.
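
The effect of extracting body blocks before matching can be shown with a small sketch. It is not the thesis's algorithm: the block labels stand in for the output of the trained body block classifier, and page matching is approximated with Python's `difflib.SequenceMatcher` over text lines. The data, the `"body"`/`"ad"` labels, and the function names are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# (label, text) pairs; the label is assumed to come from a body block classifier.
Block = Tuple[str, str]


def all_text(blocks: List[Block]) -> List[str]:
    """Text of every block, including noise such as advertisements."""
    return [text for _, text in blocks]


def body_text(blocks: List[Block]) -> List[str]:
    """Text of the blocks labelled as belonging to the body."""
    return [text for label, text in blocks if label == "body"]


def differing_values(page_a: List[str], page_b: List[str]) -> List[Tuple[str, str]]:
    """Align two pages and return aligned strings that differ.

    Strings shared by both pages are treated as template text;
    strings that differ in place are treated as attribute values.
    """
    matcher = SequenceMatcher(a=page_a, b=page_b, autojunk=False)
    pairs: List[Tuple[str, str]] = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only one-to-one replacements so values can be paired up.
        if op == "replace" and (i2 - i1) == (j2 - j1):
            pairs.extend(zip(page_a[i1:i2], page_b[j1:j2]))
    return pairs


page_a: List[Block] = [
    ("ad", "Buy shoes now"),
    ("body", "Name"), ("body", "Alice"),
    ("body", "Age"), ("body", "30"),
]
page_b: List[Block] = [
    ("ad", "Cheap flights"),
    ("body", "Name"), ("body", "Bob"),
    ("body", "Age"), ("body", "41"),
]

# Matching whole pages reports the differing advertisements as a value pair.
noisy = differing_values(all_text(page_a), all_text(page_b))
# Matching only the body blocks leaves just the attribute values.
clean = differing_values(body_text(page_a), body_text(page_b))
```

In this toy case `noisy` contains the advertisement pair alongside the real attribute values, while `clean` contains only the name and age pairs.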
Keywords/Search Tags: Web data extraction, list pages, DEPTA, content pages, RoadRunner