
Research On Deep Web Oriented Information Extraction And Integration

Posted on: 2010-01-27
Degree: Master
Type: Thesis
Country: China
Candidate: G F Liu
Full Text: PDF
GTID: 2178360275459226
Subject: Computer application technology
Abstract/Summary:
With the development of the World Wide Web and database technology, the Internet is deepening rapidly. A large amount of information is hidden in Web databases, which are collectively called the Deep Web. Users can retrieve this information dynamically by submitting queries through query forms. Because Deep Web resources are distributed across many different Deep Web sites, obtaining information from the Deep Web is inconvenient. Therefore, many researchers and companies have been studying how to integrate Deep Web resources into one system. This thesis studies Deep Web oriented data extraction and integration technology, proposes corresponding algorithms and solutions, and then designs a Deep Web oriented prototype search engine in the last main section. The main work of this thesis is summarized as follows:
(1) Extracting Web Data Objects from the result pages of queries is the first step of Deep Web integration. This thesis proposes an automatic, DOM-based method of Web Data Object extraction, which identifies the Data Regions and Web Data Objects in the following steps: preprocessing the HTML pages, extracting a candidate Web Data Object set, and removing from that set the objects that are not Web Data Objects. The Web Data Objects can then be extracted from the result HTML pages.
(2) It proposes a method of integrating heterogeneous Web Data Objects extracted from different Deep Web sites. The method is based on the vector space model. It integrates heterogeneous Web Data Objects by clustering, and then identifies duplicate Web Data Objects by the discriminability and similarity of their properties in order to eliminate redundancy.
(3) It analyzes how the organization of massive data affects query response speed, and then proposes a method for organizing a huge number of Web Data Objects. Through incremental clustering, Web Data Objects are divided into different clusters according to their own characteristics. All the clusters form a hierarchical structure, which is the basis for quick responses to queries submitted by users.
(4) It designs a Deep Web oriented prototype search engine based on the above work.
Moreover, this thesis designs and performs several experiments on the methods mentioned in the thesis. The experimental results show that these methods are feasible and effective.
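The abstract does not give the thesis's actual algorithms, so the following is only a generic sketch of the two ideas it names: representing extracted records as vectors in a vector space model, and grouping them by incremental clustering. Everything in it is an assumption made for illustration: the function names (`to_vector`, `cosine`, `incremental_cluster`), the use of raw term frequencies and cosine similarity, the single-pass "closest centroid or new cluster" rule, and the threshold of 0.5. It does not reproduce the thesis's discriminability measure or its hierarchical cluster structure.

```python
import math
from collections import Counter


def to_vector(record):
    """Turn a record (dict of attribute -> text) into a term-frequency vector."""
    terms = []
    for value in record.values():
        terms.extend(value.lower().split())
    return Counter(terms)


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def incremental_cluster(records, threshold=0.5):
    """Single-pass clustering: each record joins its most similar cluster,
    or opens a new cluster if no similarity reaches the threshold.

    The threshold is a hypothetical value chosen for this sketch.
    """
    clusters = []  # each cluster: {"centroid": Counter, "members": [record, ...]}
    for record in records:
        vec = to_vector(record)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["centroid"].update(vec)  # add the record's term counts to the centroid
            best["members"].append(record)
        else:
            clusters.append({"centroid": Counter(vec), "members": [record]})
    return clusters
```

Because records are processed one at a time, newly extracted Web Data Objects can be added without re-clustering the whole collection, which is the property that makes an incremental scheme attractive for large result sets. The outcome of a single-pass scheme like this one depends on the order in which records arrive.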
Keywords/Search Tags: Deep Web, Data Integration, Data Extraction, Cluster, Search Engine