
Research On Data Extraction And Schema Labelling On Deep Web

Posted on: 2011-08-24
Degree: Master
Type: Thesis
Country: China
Candidate: J Ma
Full Text: PDF
GTID: 2248330395958003
Subject: Computer software and theory
Abstract/Summary:
The Deep Web, which search engines cannot "see", contains several orders of magnitude more information than the surface web, and as the number of Deep Web data sources grows, the data they hold becomes increasingly important. Deep Web pages are usually returned in response to a submitted query or accessed through a form, and the list of objects displayed in such a page is typically in structured form. Because Deep Web pages account for a large proportion of Web pages, automatically extracting the data records embedded in them benefits not only search engines but also data integration applications. It would also yield a huge collection of metadata about real-world objects that could be further used for knowledge discovery.

This paper aims to provide a solution that not only extracts data records from Deep Web pages in a fully automatic manner but also labels the attributes of the data items. The solution consists of two parts.

The first is the data record locator, which is in charge of extracting data records from result pages. It employs the Modified Mining Data Record (MMDR) algorithm to locate data records. MMDR uses the tree structure of an HTML page, converting the page into a DOM model, and then tries to find a group of similar sub-trees by comparing the sub-trees of a common parent.

The second is the attribute labeler, which labels the attributes of the data items in a data record. This part is mainly based on the Conditional Random Field, a probabilistic graphical model and one of the most popular methods for assigning label sequences to observation sequences. The features used by the model for attribute labeling are discussed in full. The attribute labeler also uses the Detect and Combine algorithm, which is designed to give the data items a more explicit meaning.

Experiments show that the proposed methods achieve good performance on a manually collected dataset. These approaches can solve the data extraction and schema recognition problem, and they provide both theoretical and practical support for data integration on the Deep Web.
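
The sub-tree comparison mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's MMDR algorithm, whose details are not given here; it only assumes that sibling sub-trees under a common parent are compared by the similarity of their tag sequences. The names `Node`, `TreeBuilder`, `parse_html`, `similar_sibling_groups`, and `find_record_groups`, and the 0.8 threshold, are hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag (a small subset, for illustration).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A bare-bones DOM node: tag name, parent, and child elements."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple element tree from HTML; text content is ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def parse_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tag_sequence(node):
    """Pre-order list of tag names in the sub-tree rooted at node."""
    seq = [node.tag]
    for child in node.children:
        seq.extend(tag_sequence(child))
    return seq


def similar_sibling_groups(parent, threshold=0.8, min_size=2):
    """Group runs of adjacent children whose tag sequences are similar."""
    groups, run = [], []
    for child in parent.children:
        if run and SequenceMatcher(
            None, tag_sequence(run[-1]), tag_sequence(child)
        ).ratio() >= threshold:
            run.append(child)
        else:
            if len(run) >= min_size:
                groups.append(run)
            run = [child]
    if len(run) >= min_size:
        groups.append(run)
    return groups


def find_record_groups(node, threshold=0.8):
    """Collect candidate data-record groups from every parent in the tree."""
    found = similar_sibling_groups(node, threshold)
    for child in node.children:
        found.extend(find_record_groups(child, threshold))
    return found


page = (
    "<div><h1>Results</h1><ul>"
    "<li><a>Title A</a><span>10</span></li>"
    "<li><a>Title B</a><span>12</span></li>"
    "<li><a>Title C</a><span>9</span></li>"
    "</ul><p>footer</p></div>"
)
for group in find_record_groups(parse_html(page)):
    print([n.tag for n in group])
```

Comparing only tag sequences keeps the sketch short; a practical locator would likely also need to handle records that span several sibling nodes and to filter out repeated navigation or layout elements.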
Keywords/Search Tags: attribute labeling, data extraction, Deep Web, DOM Structure