Research On Data Extraction And Schema Labelling On Deep Web

Posted on:2011-08-24

Degree:Master

Type:Thesis

Country:China

Candidate:J Ma

Full Text:PDF

GTID:2248330395958003

Subject:Computer software and theory

Abstract/Summary:

The Deep Web, which cannot be "seen" by search engines, contains several orders of magnitude more information than the surface Web, and as the number of Deep Web data sources grows, the importance of their data becomes more apparent every day. Deep Web pages are usually returned in response to a submitted query or accessed through a form, and the list of objects displayed in such a page is typically in structured form. Because Deep Web pages account for a large proportion of Web pages, automatically extracting the data records embedded in them benefits not only search engine services but also data integration applications. It would also yield a huge collection of metadata about real-world objects that could be further used for knowledge discovery.

This paper aims to provide a solution that not only extracts data records from Deep Web pages in a fully automatic manner but also labels the attributes of data items.

The solution consists of two parts. The first is the data record locator, which is in charge of extracting data records from the result pages. It employs the Modified Mining Data Record (MMDR) algorithm to locate data records. MMDR uses the tree structure of an HTML page by converting the page into a DOM model, then tries to find a group of similar sub-trees by comparing the sub-trees of a common parent.

The other component is the attribute labeler, which labels the attributes of the data items in a data record. This part is mainly based on the Conditional Random Field, a probabilistic graphical model and one of the most popular methods for assigning label sequences to observation sequences. The features used by the model for attribute labeling are discussed in full. The attribute labeler also uses the Detect and Combine algorithm, which is designed to give the data items a more explicit meaning.

The experiments show that the proposed methods achieve good performance on a manually collected dataset. These approaches can solve the data extraction and schema recognition problem, and provide both theoretical and practical support for data integration on the Deep Web.
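
The sub-tree comparison described for the data record locator can be illustrated with a minimal sketch. The abstract does not specify how MMDR measures similarity, so this example substitutes a simple stand-in: each sub-tree is flattened into its sequence of tag names, and adjacent children of a common parent are compared with `difflib.SequenceMatcher`. The names `Node`, `TreeBuilder`, `parse`, `tag_sequence`, `similarity`, and `find_record_groups`, as well as the 0.8 threshold, are assumptions made for illustration and are not taken from the thesis.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have children; we do not descend into them.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A bare-bones DOM node: tag name, parent link, child list."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified element tree (text and attributes are ignored)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tag_sequence(node):
    """Pre-order list of tag names in the sub-tree rooted at node."""
    seq = [node.tag]
    for child in node.children:
        seq.extend(tag_sequence(child))
    return seq


def similarity(a, b):
    """Stand-in similarity in [0, 1]; not the measure used by MMDR."""
    return SequenceMatcher(None, tag_sequence(a), tag_sequence(b)).ratio()


def find_record_groups(root, threshold=0.8, min_size=2):
    """Return runs of adjacent, structurally similar siblings."""
    groups = []
    stack = [root]
    while stack:
        parent = stack.pop()
        run = parent.children[:1]
        for prev, cur in zip(parent.children, parent.children[1:]):
            if similarity(prev, cur) >= threshold:
                run.append(cur)
            else:
                if len(run) >= min_size:
                    groups.append(run)
                run = [cur]
        if len(run) >= min_size:
            groups.append(run)
        stack.extend(parent.children)
    return groups
```

Calling `find_record_groups(parse(page_html))` on a result page returns lists of sibling nodes that repeat the same structure, which are candidates for data records. A real locator would also need to handle records that span several siblings and to filter out repeated non-record regions such as navigation menus.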

Keywords/Search Tags:

attribute labeling, data extraction, Deep Web, DOM Structure

Related items

1	Research On Extraction Of Attractions Attribute Relations Based On Encyclopedic And Vertical Website Data
2	Research On Labeling Problems In Graph Theory
3	The Research On Incremental Attribute Reduction Algorithm Decision System With Weakly Labeling
4	Research On Automatic Labeling Of Multiple Knowledge Points And Cognitive Verbs In Test Questions Based On Machine Learning
5	Design And Implementation Of The Web Table Data Extraction And Analysis System
6	Research On Incentive Based Data Labeling Technologies And Their Applications
7	Research And Realization Of Labeling Techniques Of Internet Website
8	Sentence Similarity Calculation Based On Semantic Role Labeling
9	Semantic Role Labeling Based On Deep Neural Network
10	Automatic Gating Of Attributes In Deep Structure