Clustering Analysis Of Deep Web Resources Based On The Query Interface Features

Posted on: 2008-10-04  Degree: Master  Type: Thesis
Country: China  Candidate: Y Han  Full Text: PDF
GTID: 2178360242967300  Subject: Computer application technology
Abstract/Summary:
The rapid development of the Internet has brought us a great deal of information, and the volume is still growing rapidly. The Internet can be divided into two parts: the Surface Web and the Deep Web. Surface Web pages can be found by traditional search engines through their URLs. The Deep Web refers to online databases whose resources can only be obtained through a query interface. Compared with the Surface Web, the Deep Web contains more specialized and higher-quality resources. However, Deep Web data are heterogeneous and dynamic, so they must be integrated before they can be used effectively, and classifying the data sources by domain is a prerequisite for that integration.
The query interface is the only way to access the Deep Web. A query interface is a form, but not every form is a query interface, so this thesis uses a form classifier to filter out the forms that are not query interfaces. Experiments on query interfaces show that their features can represent both the domain and the query capability of a Deep Web data source, so this thesis clusters Deep Web resources on the basis of query interface features.
Query interface clustering differs from ordinary text clustering in that its feature matrix is sparse, so the traditional hierarchical agglomerative clustering algorithm based on similarity distance gives poor results. To solve this problem, the thesis measures similarity with a nonparametric hypothesis test and also improves the similarity objective function. Applying this measure within the traditional hierarchical agglomerative clustering algorithm produces a clustering of the query interfaces and, accordingly, of the Deep Web sources they represent. Hypothesis testing, however, requires a sufficient number of observed values for each event.
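The idea of plugging a nonparametric test into hierarchical agglomerative clustering can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's actual algorithm: the abstract does not name the test, so a chi-square test of homogeneity on attribute counts is assumed here, and the function names, the `max_score` threshold, and the normalization of the statistic by its degrees of freedom are all hypothetical choices made for the example.

```python
from collections import Counter
from itertools import combinations


def chi_square_stat(a: Counter, b: Counter) -> tuple[float, int]:
    """Chi-square homogeneity statistic and degrees of freedom for the
    2 x k table of attribute counts of two non-empty clusters."""
    attrs = sorted(set(a) | set(b))
    total_a, total_b = sum(a.values()), sum(b.values())
    grand = total_a + total_b
    stat = 0.0
    for attr in attrs:
        col = a[attr] + b[attr]  # Counter returns 0 for missing attributes
        for observed, row_total in ((a[attr], total_a), (b[attr], total_b)):
            expected = row_total * col / grand
            stat += (observed - expected) ** 2 / expected
    return stat, max(len(attrs) - 1, 1)


def cluster(interfaces: dict[str, list[str]], max_score: float = 1.0) -> list[list[str]]:
    """Agglomerative clustering: repeatedly merge the pair of clusters whose
    attribute distributions differ least, until no pair is similar enough."""
    # Each cluster is (attribute counts, member interface names).
    clusters = [(Counter(attrs), [name]) for name, attrs in interfaces.items()]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            stat, dof = chi_square_stat(clusters[i][0], clusters[j][0])
            score = stat / dof  # assumed normalization, not from the thesis
            if best is None or score < best[0]:
                best = (score, i, j)
        score, i, j = best
        if score > max_score:  # hypothetical stopping threshold
            break
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(members) for _, members in clusters]


# Toy query interfaces, each described by the labels of its form fields.
demo = {
    "books1": ["title", "author", "isbn"],
    "books2": ["title", "author", "publisher"],
    "cars1": ["make", "model", "price"],
    "cars2": ["make", "model", "year"],
}
print(cluster(demo))
```

With counts this small, the expected cell frequencies are far below what a chi-square test normally needs, which is the kind of observation requirement the abstract refers to.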
Before they are processed, the initial clusters may not satisfy the requirements of hypothesis testing. To solve this problem, this thesis proposes a preprocessing step. First, all query interfaces are filtered by type. Second, the data are grouped according to the inclusion degree between attributes. Third, the groups are filtered by attribute frequency. Finally, only the groups whose observations satisfy the hypothesis-testing requirement are clustered. Loner interfaces are query interfaces that did not pass the interface check or that do not meet the observation requirement. Reclassification is used to handle these loner interfaces: by probabilistic methods, each one is assigned to the cluster from which it most likely came. Clustering followed by reclassification completes the clustering process. Experiments show that this approach yields a better clustering result.
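The reclassification step can be illustrated with a short sketch. The abstract only says "probabilistic methods", so the naive-Bayes-style scoring with Laplace smoothing below is an assumption, and `reclassify`, `vocab_size`, and `alpha` are hypothetical names introduced for the example.

```python
import math
from collections import Counter


def reclassify(
    loner_attrs: list[str],
    clusters: dict[str, tuple[int, Counter]],
    vocab_size: int,
    alpha: float = 1.0,
) -> str:
    """Return the name of the cluster most likely to have produced the loner.

    clusters maps a cluster name to (number of interfaces, attribute counts).
    Scoring is log prior plus smoothed log likelihood of each attribute.
    """
    total_interfaces = sum(size for size, _ in clusters.values())
    best_name, best_logp = "", -math.inf
    for name, (size, counts) in clusters.items():
        attr_total = sum(counts.values())
        logp = math.log(size / total_interfaces)  # cluster prior
        for attr in loner_attrs:
            # Laplace smoothing keeps unseen attributes from zeroing the score.
            logp += math.log((counts[attr] + alpha) / (attr_total + alpha * vocab_size))
        if logp > best_logp:
            best_name, best_logp = name, logp
    return best_name


# Two already-formed clusters and one loner interface (toy data).
clusters = {
    "books": (2, Counter(title=2, author=2, isbn=1, publisher=1)),
    "cars": (2, Counter(make=2, model=2, price=1, year=1)),
}
print(reclassify(["title", "price", "author"], clusters, vocab_size=8))
```

Here the loner shares two attributes with the book cluster and one with the car cluster, so the book cluster receives the higher score.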
Keywords/Search Tags: Deep Web, Clustering Analysis, Nonparametric Hypothesis Tests, Hierarchical Agglomerative Clustering