Research On Database Discovery And Clustering Of Deep Web

Posted on: 2011-08-05
Degree: Master
Type: Thesis
Country: China
Candidate: C Gao
GTID: 2178360305484872
Subject: Computer application technology
Abstract/Summary:
Internet resources can be divided into the Surface Web and the Deep Web. The Surface Web refers to the resources that can be retrieved by traditional search engines. The Deep Web consists of the resources that cannot be retrieved by traditional search engines, mainly Web databases. A survey shows that the Deep Web contains about 400 to 500 times as much information as the Surface Web. However, because Web databases cover all domains and are distributed all over the World Wide Web, they must be integrated before they can be used effectively. Because Deep Web integration only deals with Web databases of the same domain, Web databases must first be found and then grouped into clusters according to the domain they belong to.

A Web database can be found through its query interface, because the query interface is the only entry point for accessing it. Query interfaces exist in the form of web forms. However, some non-query interfaces also exist in the form of web forms, so the two must be distinguished. Seven heuristic rules are proposed to identify query interfaces, based on previous research results and observations of a large number of web forms. Experimental results show that the F-measure of query interface identification is higher than 0.98.

During integration, the controls of the integrated query interface have to be mapped to those of each local query interface. To accomplish this task, schema information must be extracted from the query interfaces. There are six major difficulties, and this thesis gives a corresponding solution to each of them. Experimental results show that the accuracy of query interface schema extraction can reach 94% or above.

The title and keywords attributes of web pages that contain Web databases of the same domain share certain keywords of that domain. Based on this idea, a clustering algorithm based on frequent itemsets is proposed to cluster Web databases. Web pages that share a frequent itemset are clustered together, with the words of that frequent itemset as the cluster label. Experimental results show that the algorithm's F-measure can reach 0.91 or above.
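
The clustering idea in the last paragraph can be illustrated with a minimal sketch. It is not the thesis's algorithm, whose details are not given in the abstract. It assumes each page has already been reduced to a list of terms taken from its title and keywords attributes, counts term combinations by brute force instead of using a proper frequent-itemset miner, and assigns each page to the largest frequent itemset it contains. The function names, the `min_support` and `max_size` parameters, and the sample pages are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def frequent_itemsets(pages, min_support=2, max_size=2):
    """Return {itemset: support} for term combinations that occur in
    at least min_support pages (brute-force count, illustration only)."""
    counts = Counter()
    for terms in pages.values():
        ordered = sorted(set(terms))  # de-duplicate and fix a canonical order
        for size in range(1, max_size + 1):
            counts.update(combinations(ordered, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


def cluster_pages(pages, min_support=2, max_size=2):
    """Group pages by a shared frequent itemset; the itemset is the label.

    Assumption: a page goes to the largest frequent itemset it contains,
    with ties broken by support and then alphabetically. Pages that
    contain no frequent itemset are collected under the empty label ().
    """
    frequent = frequent_itemsets(pages, min_support, max_size)
    clusters = defaultdict(list)
    for page_id, terms in pages.items():
        term_set = set(terms)
        candidates = [s for s in frequent if term_set.issuperset(s)]
        if not candidates:
            clusters[()].append(page_id)
            continue
        best = max(candidates, key=lambda s: (len(s), frequent[s], s))
        clusters[best].append(page_id)
    return dict(clusters)


# Hypothetical input: page identifier -> terms from title/keywords.
pages = {
    "books-a": ["book", "author", "isbn"],
    "books-b": ["book", "author", "publisher"],
    "cars-a": ["car", "dealer", "price"],
    "cars-b": ["car", "dealer", "mileage"],
    "misc": ["weather"],
}

clusters = cluster_pages(pages)
for label, members in clusters.items():
    print(" ".join(label) or "(unclustered)", "->", members)
```

In this sketch the two book pages end up under the label "author book" and the two car pages under "car dealer". A real implementation would likely use an Apriori- or FP-growth-style miner and would need a policy for pages that match several itemsets.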
Keywords/Search Tags: Deep Web, query interface identification, schema extraction, Web database, frequent itemset