Clustering Analysis Of Deep Web Resources Based On The Query Interface Features

Posted on:2008-10-04

Degree:Master

Type:Thesis

Country:China

Candidate:Y Han

Full Text:PDF

GTID:2178360242967300

Subject:Computer application technology

Abstract/Summary:

The rapid development of the Internet has brought us a great deal of information, and that information is still growing rapidly. The Internet can be divided into two parts: the Surface Web and the Deep Web. The Surface Web can be reached by traditional search engines through URLs. The Deep Web refers to online databases whose resources can only be obtained through a query interface. Compared with the Surface Web, the Deep Web contains more specialized and higher-quality resources. However, Deep Web data are heterogeneous and dynamic, so they must be integrated before they can be used effectively, and classifying the sources by domain is a prerequisite for that integration.
A query interface is the only way to access the Deep Web. Every query interface is a form, but not every form is a query interface, so this thesis builds a form classifier to filter out forms that are not query interfaces. Experiments on query interfaces show that their features can represent both the domain and the query capability of a Deep Web data source, so this thesis clusters Deep Web resources on the basis of query interface features.
Query interface clustering differs from ordinary text clustering in that the feature matrix of query interfaces is sparse, so the traditional hierarchical agglomerative clustering algorithm based on similarity distance gives poor results. To solve this problem, the thesis uses nonparametric tests to measure similarity and also improves the similarity objective function. Applying this measure within the traditional hierarchical agglomerative clustering algorithm clusters the query interfaces and, in turn, the Deep Web sources they represent. Hypothesis testing, however, places a requirement on the number of observations.
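As a rough illustration of clustering with a test-based similarity, the sketch below runs hierarchical agglomerative clustering over attribute counts and merges the pair of clusters with the smallest chi-square homogeneity statistic. The abstract does not name the nonparametric test or the objective function, so the choice of chi-square, the function names `chi_square_stat` and `cluster_interfaces`, and the stopping rule (a fixed number of clusters `k`) are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def chi_square_stat(a: Counter, b: Counter) -> float:
    """Chi-square homogeneity statistic for two attribute-count vectors.

    Assumption: a smaller value means the two clusters look more alike.
    A fuller treatment would convert the statistic to a p-value, because
    raw statistics with different degrees of freedom are not comparable.
    """
    n_a, n_b = sum(a.values()), sum(b.values())
    total = n_a + n_b
    stat = 0.0
    for attr in set(a) | set(b):
        column = a[attr] + b[attr]
        for observed, row in ((a[attr], n_a), (b[attr], n_b)):
            expected = row * column / total
            stat += (observed - expected) ** 2 / expected
    return stat


def cluster_interfaces(interfaces, k):
    """Greedy agglomerative clustering down to k clusters (hypothetical helper).

    `interfaces` is a list of attribute-name lists, one per query interface.
    Returns a list of clusters, each a sorted list of interface indices.
    """
    clusters = [(Counter(attrs), [i]) for i, attrs in enumerate(interfaces)]
    while len(clusters) > k:
        # Pick the pair of clusters whose attribute distributions differ least.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: chi_square_stat(clusters[p[0]][0], clusters[p[1]][0]),
        )
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return [sorted(members) for _, members in clusters]


if __name__ == "__main__":
    # Invented toy data: three book-search forms and three car-search forms.
    forms = [
        ["title", "author", "isbn"],
        ["title", "author", "publisher"],
        ["title", "author", "isbn", "price"],
        ["make", "model", "year"],
        ["make", "model", "price"],
        ["make", "model", "year", "mileage"],
    ]
    print(cluster_interfaces(forms, 2))
```

Working on pooled attribute counts rather than pairwise distances between sparse vectors is what lets a test statistic stand in for the similarity measure. The toy attribute names are invented and are not taken from the thesis data.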
Before preprocessing, the initial clusters may not satisfy the requirements of hypothesis testing. To solve this problem, this thesis proposes a preprocessing step. First, all query interfaces are filtered by type. Second, the data are grouped according to the inclusion degree between attributes. Third, the groups are filtered by attribute frequency. Finally, only the groups whose observations satisfy the requirements of hypothesis testing are clustered. Loner interfaces are query interfaces that did not pass the interface check or that do not meet the observation requirement. Reclassification is used to handle them: by probabilistic methods, each loner interface is assigned to the cluster from which it most likely came. Clustering first and then reclassifying completes the clustering. Experiments show that this approach yields better clustering results.
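The reclassification step can be pictured with a small sketch that assigns a loner interface to the cluster under which its attributes are most probable. The abstract says only that "probabilistic methods" are used, so the multinomial model with add-one smoothing, the function names `log_likelihood` and `reclassify`, and the parameter `alpha` are assumptions for illustration, not the thesis's actual formulation.

```python
import math
from collections import Counter


def log_likelihood(attrs, cluster_counts, vocab_size, alpha=1.0):
    """Smoothed log-probability of seeing `attrs` in a cluster (assumed model).

    Each attribute is treated as an independent draw from the cluster's
    attribute distribution; `alpha` is an additive smoothing constant so
    that unseen attributes do not give a probability of zero.
    """
    total = sum(cluster_counts.values())
    return sum(
        math.log((cluster_counts[a] + alpha) / (total + alpha * vocab_size))
        for a in attrs
    )


def reclassify(loner_attrs, clusters):
    """Return the name of the cluster most likely to have produced the loner.

    `clusters` maps a cluster name to a Counter of attribute frequencies.
    """
    vocab = set(loner_attrs).union(*clusters.values())
    return max(
        clusters,
        key=lambda name: log_likelihood(loner_attrs, clusters[name], len(vocab)),
    )


if __name__ == "__main__":
    # Invented toy clusters; the names and counts are not from the thesis.
    clusters = {
        "books": Counter(title=3, author=3, isbn=2),
        "cars": Counter(make=3, model=3, year=2),
    }
    print(reclassify(["title", "keyword"], clusters))
```

Because the loner is scored against each finished cluster rather than being clustered itself, it does not need to meet the observation requirement that the hypothesis test imposes on the clusters.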

Keywords/Search Tags:

Deep Web, Clustering Analysis, Nonparametric Hypothesis Tests, Hierarchical Agglomerative Clustering

Related items

1	A Document Clustering Method Based On Affinity Propagation And Agglomerative Hierarchical Clustering
2	Efficient Algorithms for Hierarchical Agglomerative Clustering
3	Study On The Clustering-Based Network Intrusion Detection Methods
4	Application and evaluation of Hierarchical Agglomerative Clustering in Wireless Sensor Networks
5	Compare Analysis Of Document Clustering Algorithm For Large Data Set And The Application In Sense Induction
6	Research On Clustering Of Uncertain Data
7	Study On Change-point Detection Method Based On Hierarchical Clustering Analysis
8	Design and evaluation of clustering criterion for optimal hierarchical agglomerative clustering
9	Research Of Agglomerative Hierarchical Clustering Method Based High-Resolution Remote Sensing Image Segmentation Method
10	Research On The Method Of Linkage-Based Graph Clustering