Research On Database Discovery And Clustering Of Deep Web

Posted on:2011-08-05

Degree:Master

Type:Thesis

Country:China

Candidate:C Gao

Full Text:PDF

GTID:2178360305484872

Subject:Computer application technology

Abstract/Summary:

Internet resources can be divided into the Surface Web and the Deep Web. The Surface Web refers to resources that traditional search engines can retrieve. The Deep Web consists of resources that traditional search engines cannot retrieve, mainly Web databases. A survey shows that the Deep Web contains about 400 to 500 times as much information as the Surface Web. However, because Web databases cover all domains and are scattered across the World Wide Web, they must be integrated to be used effectively. Because Deep Web integration only deals with Web databases of the same domain, Web databases must first be discovered and then grouped into clusters according to the domain they belong to.

A Web database can be found through its query interface, which is the only entry point for accessing it. A query interface takes the form of a web form, but some web forms are not query interfaces, so the two must be distinguished. Seven heuristic rules are proposed to identify query interfaces, based on previous research results and observations of a large number of web forms. Experimental results show that the F-measure of query interface identification is higher than 0.98.

During integration, the controls of the integrated query interface must be mapped to those of each local query interface. To accomplish this, schema information must be extracted from the query interfaces. There are six major difficulties, and this thesis gives a corresponding solution to each. Experimental results show that the accuracy of query interface schema extraction reaches 94% or above.

The title and keywords attributes of web pages that contain Web databases of the same domain share certain keywords of that domain. Based on this idea, a clustering algorithm based on frequent itemsets is proposed to cluster Web databases. Web pages that share a frequent itemset are clustered together, with the words of that frequent itemset as the cluster label. Experimental results show that the algorithm's F-measure reaches 0.91 or above.
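
The clustering idea described above can be illustrated with a minimal sketch. It is not the thesis's algorithm, whose details the abstract does not give. The function names, the sample pages, the `min_support` and `max_size` parameters, and the rule that assigns each page to its largest frequent itemset are all assumptions made for illustration. Each page is represented by the words taken from its title and keywords attributes.

```python
from collections import defaultdict
from itertools import combinations


def frequent_itemsets(pages, min_support, max_size=2):
    """Map each itemset found in at least min_support pages to the pages containing it."""
    covers = defaultdict(set)
    for page_id, words in pages.items():
        unique = sorted(set(words))
        # Brute-force enumeration; adequate for short title/keyword lists.
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                covers[frozenset(combo)].add(page_id)
    return {items: ids for items, ids in covers.items() if len(ids) >= min_support}


def cluster_pages(pages, min_support=2, max_size=2):
    """Group pages by a shared frequent itemset and label each cluster with its words."""
    frequent = frequent_itemsets(pages, min_support, max_size)
    clusters = defaultdict(list)
    for page_id in sorted(pages):
        candidates = [s for s, ids in frequent.items() if page_id in ids]
        if not candidates:
            continue  # the page shares no frequent itemset with other pages
        # Assumed tie-break: prefer the largest itemset, then the best supported one.
        best = max(candidates, key=lambda s: (len(s), len(frequent[s]), sorted(s)))
        clusters[" ".join(sorted(best))].append(page_id)
    return dict(clusters)


# Hypothetical title/keyword words for four pages that host Web databases.
pages = {
    "p1": ["book", "author", "isbn"],
    "p2": ["book", "author", "title"],
    "p3": ["flight", "airline", "departure"],
    "p4": ["flight", "airline", "return"],
}
clusters = cluster_pages(pages)
print(clusters)
```

With this sample input, the two book pages fall into a cluster labeled "author book" and the two flight pages into a cluster labeled "airline flight". A practical implementation would likely replace the brute-force enumeration with an Apriori-style or FP-growth miner and would need its own policy for pages that match several itemsets.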

Keywords/Search Tags:

Deep Web, query interface identification, schema extraction, Web database, frequent itemset

Related items

1	Research On Key Technologies Of Deep Web Data Crawling
2	Research On Data Source Clustering And Query Interface Conversion Of Deep Web
3	Research On Schema Extraction From Deep Web Query Interface
4	Research On Method Of Deep Web Schema Matching Based On Query Interface
5	The Research Of Web Query Interface Location And Schema Extraction
6	Deep Web Sources Classification And Query Interface Schema Extraction Based On Ontology
7	Research On The Key Technologies About Preprocessing Of Deep Web Integrated Query System
8	Research Into Query Interface Schema Extraction Of Deep Web
9	Research Of Query Interface Integration Mechanism In DWIIS System
10	Research Of Query Interface Integration Mechanism In Dwiis System