Research Of Data Extraction And Result Aggregation Technology For Deep Web

Posted on:2013-01-23

Degree:Master

Type:Thesis

Country:China

Candidate:Q W Yin

Full Text:PDF

GTID:2248330377458786

Subject:Computer application technology

Abstract/Summary:

With the rapid development of computer networks, network resources grow richer by the day. On the one hand, this broadens people's access to information. On the other hand, the disorder of that information makes it difficult for users to find what they need in such a vast volume. Search engines provide retrieval and classification services for network information. However, some resources cannot be indexed by traditional search engines; we call these deep web resources. The term also refers to accessible online web databases. The deep web is increasingly favored because its resources are rich and specialized, are updated automatically and quickly, and cover a wide range of fields. Deep web resources have become an important source of information, so research on data extraction and result aggregation technology for the deep web is of great significance in both theory and practice.

In this paper, we study data extraction and result aggregation for deep web resources. For data extraction, we briefly introduce MDR (Mining Data Records) and summarize the efficiency problems it encounters on deep web pages. Drawing on MDR, we improve it to reduce the complexity of data extraction. The extraction algorithm represents HTML pages as a label tree (a tree of HTML tags); before extraction, we clean and standardize the HTML pages and construct the label tree. We use the structural similarity of the label tree to locate data records. This algorithm is more efficient than tree edit distance and more accurate than the element comparison method, and the extraction results are quite good. However, when the similarity between some data records is low, the extraction algorithm based on label-tree similarity sometimes performs poorly. To solve this problem, we propose a new data record identification algorithm based on incomplete sub-tree matching, which improves on the structural similarity of the label tree.

Result aggregation is mainly about identifying duplicate data records. In this paper, before removing duplicates we sort the records according to attribute weights to reduce the number of comparisons, so that duplicate records are removed quickly and effectively.

Experiments show that the data extraction algorithm based on the structural similarity of the label tree is more effective than MDR. The data record identification algorithm based on incomplete sub-tree matching is better than both MDR and the algorithm based on label-tree structural similarity. Compared with removing duplicate records directly, the algorithm that first sorts the records according to attribute weights is more effective.
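The sort-then-compare idea used for result aggregation can be illustrated with a minimal Python sketch. The abstract states only that records are sorted by attribute weights before duplicates are removed; everything else here is an assumption made for illustration: the attribute names and weight values, the sliding-window comparison, the string similarity measure (`difflib.SequenceMatcher`), and the `window` and `threshold` parameters. It is not the thesis's actual algorithm.

```python
from difflib import SequenceMatcher

# Hypothetical attribute weights (not taken from the thesis):
# a higher weight means the attribute is more discriminating.
WEIGHTS = {"title": 0.6, "author": 0.3, "year": 0.1}


def normalize(value):
    """Lower-case a value and collapse whitespace so trivial differences vanish."""
    return " ".join(str(value).lower().split())


def sort_key(record, weights):
    """Build a sort key from the attributes, highest weight first."""
    ordered_attrs = sorted(weights, key=weights.get, reverse=True)
    return tuple(normalize(record.get(attr, "")) for attr in ordered_attrs)


def similarity(a, b, weights):
    """Weighted average of per-attribute string similarity, in [0, 1]."""
    total = sum(weights.values())
    score = 0.0
    for attr, weight in weights.items():
        left = normalize(a.get(attr, ""))
        right = normalize(b.get(attr, ""))
        score += weight * SequenceMatcher(None, left, right).ratio()
    return score / total


def deduplicate(records, weights, window=3, threshold=0.9):
    """Sort records by weighted key, then compare each record only with
    the last `window` records kept, instead of with every other record."""
    ordered = sorted(records, key=lambda r: sort_key(r, weights))
    kept = []
    for record in ordered:
        neighbours = kept[-window:]
        if any(similarity(record, k, weights) >= threshold for k in neighbours):
            continue  # treated as a duplicate of a nearby record
        kept.append(record)
    return kept


records = [
    {"title": "Deep Web Data Extraction", "author": "Q W Yin", "year": "2013"},
    {"title": "deep web  data extraction", "author": "Q. W. Yin", "year": "2013"},
    {"title": "Automatic Wrapper Generation", "author": "A Author", "year": "2010"},
]
unique = deduplicate(records, WEIGHTS)
for record in unique:
    print(record["title"])
```

Because likely duplicates end up next to each other after sorting, each record is compared with at most `window` neighbours rather than with every other record, which is where the reduction in comparisons comes from. The trade-off in this sketch is that duplicates whose highest-weighted attribute differs early in the string may sort far apart and be missed.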

Keywords/Search Tags:

Deep Web, Data Extraction, DOM, Structure Similarity, Result Aggregation

Related items

1	The Research On Data Extraction Mechanism In Deep Web Based On Result Pattern
2	A Research On Key Technologies Of Deep Web Data Integration Based On Result Pattern
3	Automatic wrapper generation for the extraction of search result records from search engines
4	Research On Deep Web Search Interface And Search Result Extraction
5	Deep Web Data Annotation Based On Result Schema
6	Research On Source Discovery And Query Results Extraction Of Deep Web
7	Research On The Web Structure Data Extraction Based On The Browser And Its Implementation
8	Research And Application Of Pattern Recognition For Data Aggregation
9	A Study On Mechanisms Of Data Aggregation Based On Fitting Analysis
10	Research On Data Extraction And Schame Labelling On Deep Web