Research On The Deep Web Data Sources Classification

Posted on: 2011-07-16
Degree: Master
Type: Thesis
Country: China
Candidate: B S Ding
GTID: 2178360305450263
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, a large amount of information is generated and accumulated in daily work and life. At present, the total amount of Web information has exceeded 200,000 TB, and it will continue to grow as online business expands. To make use of these resources, especially Deep Web resources, researchers have turned to Deep Web data integration. Deep Web data source classification, an important part of Deep Web data integration, needs further study.

There are two common approaches to classifying Deep Web data sources: pre-query and post-query. Pre-query classifies data sources based on features of the Deep Web query interface. Post-query classifies data sources based on the results returned after a query is submitted. Because post-query incurs a heavy workload, consumes network bandwidth, and is time-consuming when processing results, this thesis takes the query interface as its starting point and adopts the pre-query approach. The open problems are mainly how to combine the data sources to be classified with their domain knowledge, and how to select or improve the clustering and classification algorithms. Both need further research to improve classification.

This thesis addresses two problems: clustering massive numbers of data sources and classifying newly discovered data sources. The relevant algorithms are modified with the help of a thesaurus dictionary and an ontology so that data sources are classified better.

The main contributions and innovations of this thesis are as follows.

1. An improved clustering algorithm, DWK-means, is proposed. Based on the page-form model, features are extracted from the content text and hyperlinks, and feature extraction on the form is regulated at the same time. After preprocessing, which includes feature standardization and semantic processing with the thesaurus dictionary, the improved K-means clustering algorithm is used to cluster the data sources. K-means is improved because it can produce loose clusters, or several clusters that belong to the same domain and should be merged into one category. DWK-means therefore introduces a postprocessing step that splits loose clusters and merges clusters belonging to the same domain according to hyperlinks. Experiments show that preprocessing improves clustering performance, and that DWK-means overcomes these drawbacks and yields better clustering results.

2. An ontology-based classification algorithm, DWC4.5, is proposed. After the Deep Web data sources have been clustered, this thesis proposes a new method for classifying newly discovered Deep Web data sources. A decision table is built according to the weight of each attribute as determined by the ontology. Because the C4.5 algorithm has weak noise resistance, rough set theory is introduced to improve C4.5 so that it produces a better decision tree for Deep Web classification. Experiments show that building an ontology helps differentiate domain concepts and handle semantic relationships among attributes, and that the ontology-based DWC4.5 algorithm yields better classification results.
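
The split criterion that standard C4.5 applies to a decision table is the gain ratio, and it can be illustrated with a minimal sketch. This is not the DWC4.5 algorithm itself: the rough set step and the ontology-derived attribute weights are omitted, and the decision table, its attribute names (`has_author`, `has_price`), and the domain labels are hypothetical values chosen only for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def gain_ratio(rows, attribute, target="domain"):
    """Gain ratio of `attribute` with respect to `target` (standard C4.5)."""
    labels = [row[target] for row in rows]

    # Group the class labels by the value of the candidate attribute.
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])

    total = len(rows)
    # Expected entropy of the class labels after splitting on the attribute.
    remainder = sum(len(part) / total * entropy(part)
                    for part in partitions.values())
    info_gain = entropy(labels) - remainder

    # Split information penalizes attributes with many distinct values.
    split_info = entropy([row[attribute] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


def best_attribute(rows, attributes, target="domain"):
    """Attribute with the highest gain ratio, i.e. the next tree split."""
    return max(attributes, key=lambda a: gain_ratio(rows, a, target))


# Hypothetical decision table: one row per query interface, with boolean
# attributes for the form fields it exposes and an assumed domain label.
TABLE = [
    {"has_author": True,  "has_price": True,  "domain": "books"},
    {"has_author": True,  "has_price": False, "domain": "books"},
    {"has_author": False, "has_price": True,  "domain": "autos"},
    {"has_author": False, "has_price": False, "domain": "autos"},
]

if __name__ == "__main__":
    for name in ("has_author", "has_price"):
        print(name, gain_ratio(TABLE, name))
    print("split on:", best_attribute(TABLE, ["has_author", "has_price"]))
```

In this toy table `has_author` separates the two domains perfectly while `has_price` carries no information about the domain, so the first attribute would be chosen for the root split. An ontology-based variant would presumably adjust or weight the attributes before this step; how the thesis does so is not specified in the abstract.
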
Keywords/Search Tags: Deep Web Data Sources Classification, Thesaurus Dictionary, Ontology, K-means, DWK-means, C4.5, DWC4.5