
Research Of Data Extraction And Result Aggregation Technology For Deep Web

Posted on: 2013-01-23
Degree: Master
Type: Thesis
Country: China
Candidate: Q W Yin
GTID: 2248330377458786
Subject: Computer application technology
Abstract/Summary:
With the rapid development of computer networks, network resources grow richer by the day. On the one hand, this broadens people's access to information; on the other hand, the disorder of that information makes it difficult for users to find what they need in such a vast volume. Search engines provide retrieval and classification services for network information. However, there is a kind of resource that traditional search engines cannot index, which we call deep web resources: online web databases that can be accessed but are not indexed. The deep web is increasingly favored because its resources are rich and specialized, are updated quickly and automatically, and cover a wide range of fields. Deep web resources have become an important source of information, so research on data extraction and result aggregation technology for the deep web is of great significance in both theory and practice.

In this paper, we study data extraction and result aggregation for deep web resources. For data extraction, we briefly introduce MDR and summarize the inefficiency it exhibits on deep web pages. Taking inspiration from MDR, we improve it to reduce the complexity of data extraction. The extraction algorithm represents HTML pages as label trees; before extraction, we clean and standardize the HTML pages and construct the label tree. We use the structural similarity of label trees to locate data records. This algorithm is more efficient than tree edit distance and more accurate than element-comparison methods, and its extraction results are quite good. However, when the similarity between some data records is low, the extraction algorithm based on label-tree similarity sometimes performs poorly.
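The idea of locating data records by the structural similarity of label trees can be illustrated with a minimal sketch. The abstract does not define the similarity measure, so everything below is an assumption made for illustration: the `Node` class, the Dice coefficient over tag-path multisets, the `threshold` value, and the rule of grouping adjacent siblings are not taken from the thesis. The sketch also assumes the HTML has already been cleaned and is well nested.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag (a small, hypothetical subset).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One node of the label tree: a tag name plus its child nodes."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class LabelTreeBuilder(HTMLParser):
    """Builds a label tree from HTML that is already clean and well nested."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent  # climb back up


def build_label_tree(html):
    builder = LabelTreeBuilder()
    builder.feed(html)
    return builder.root


def tag_paths(node, prefix=""):
    """Multiset of tag paths from the subtree root down to every node."""
    path = prefix + "/" + node.tag
    paths = Counter([path])
    for child in node.children:
        paths.update(tag_paths(child, path))
    return paths


def structure_similarity(a, b):
    """Dice coefficient over the two subtrees' tag-path multisets, in [0, 1]."""
    pa, pb = tag_paths(a), tag_paths(b)
    shared = sum((pa & pb).values())
    return 2 * shared / (sum(pa.values()) + sum(pb.values()))


def candidate_records(parent, threshold=0.8):
    """Group runs of adjacent children whose structure is similar enough."""
    groups, run = [], []
    for child in parent.children:
        if run and structure_similarity(run[-1], child) < threshold:
            if len(run) > 1:
                groups.append(run)
            run = []
        run.append(child)
    if len(run) > 1:
        groups.append(run)
    return groups
```

Each returned group is a run of sibling subtrees that look alike and can be treated as candidate data records. Comparing path multisets takes time linear in the subtree sizes, which is one plausible reading of why a structural measure can be cheaper than tree edit distance.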
To handle data records whose mutual similarity is low, we propose a new data record identification algorithm based on incomplete sub-tree matching, which improves on the structural similarity of label trees. Result aggregation is mainly concerned with identifying duplicate data records. In this paper, before removing duplicates, we sort the records according to attribute weights to reduce the number of comparisons, so that duplicates are removed quickly and effectively.

Experiments show that the data extraction algorithm based on the structural similarity of label trees is more effective than MDR. The data record identification algorithm based on incomplete sub-tree matching outperforms both MDR and the algorithm based on the structural similarity of label trees. Compared with removing duplicate records directly, the algorithm that first sorts the records by attribute weight is more effective.
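Sorting by attribute weight before duplicate removal can be sketched as follows. This is an illustration under stated assumptions, not the thesis's algorithm: the sliding-window comparison (in the style of the sorted-neighbourhood method), the `window` and `threshold` parameters, the use of `difflib.SequenceMatcher` for string similarity, and the sample attribute names are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Lower-case a value and collapse runs of whitespace."""
    return " ".join(str(value).lower().split())


def weighted_similarity(a, b, weights):
    """Weighted average of per-attribute string similarity, in [0, 1]."""
    total = sum(weights.values())
    score = 0.0
    for attr, weight in weights.items():
        x, y = normalize(a.get(attr, "")), normalize(b.get(attr, ""))
        score += weight * SequenceMatcher(None, x, y).ratio()
    return score / total


def remove_duplicates(records, weights, window=3, threshold=0.9):
    """Sort by attributes in descending weight order, then compare each
    record only with the last `window` records kept so far."""
    order = sorted(weights, key=weights.get, reverse=True)
    ordered = sorted(
        records,
        key=lambda r: tuple(normalize(r.get(attr, "")) for attr in order),
    )
    kept = []
    for record in ordered:
        recent = kept[-window:]
        if any(weighted_similarity(record, k, weights) >= threshold
               for k in recent):
            continue  # close enough to a recently kept record: a duplicate
        kept.append(record)
    return kept


# Hypothetical records aggregated from several deep web sources.
records = [
    {"title": "Deep Web Crawling", "author": "Li", "price": "30"},
    {"title": "Data Mining", "author": "Han", "price": "50"},
    {"title": "deep  web crawling", "author": "Li", "price": "30"},
]
weights = {"title": 0.6, "author": 0.3, "price": 0.1}
unique = remove_duplicates(records, weights)
```

Because sorting places likely duplicates next to each other, each record is compared with at most `window` neighbours instead of with every other record, so the number of comparisons grows roughly linearly with the number of records after the sort rather than quadratically.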
Keywords/Search Tags: Deep Web, Data Extraction, DOM, Structure Similarity, Result Aggregation