
Research On Deep Web Data Sources Classification Based On Semantic

Posted on: 2013-12-23    Degree: Master    Type: Thesis
Country: China    Candidate: W P Liu    Full Text: PDF
GTID: 2248330395455457    Subject: Computer software and theory
Abstract/Summary:
With the rapid development of Internet technology, a large number of Web databases have emerged, forming a huge information resource that provides people with vast amounts of information. According to the "depth" at which information is stored, the entire Web can be divided into two categories: the Surface Web and the Deep Web. Information in the Deep Web is of higher quality and greater quantity than that in the Surface Web, and it has more significant application value. Because Deep Web data is dynamic, hidden, distributed and heterogeneous, integrating Deep Web data interfaces is a great challenge. How to classify Deep Web data sources quickly and efficiently is therefore a key issue, with important practical significance and broad application prospects.

This thesis focuses on a series of key technologies for data source classification. We propose a novel classification model based on a Semantic Tree, an adaptive KNN algorithm based on density, and a weighted Naive Bayesian algorithm, all of which can effectively improve classification accuracy. The main contributions of this work are as follows:

1. Feature extraction from the query interface page is the basis of Deep Web data source classification. A new, effective method for extracting query interface page features is proposed based on the page-form model. An information gain based feature selection method is then used to select features.
2. Because Deep Web data sources are heterogeneous, the same feature may be represented by synonymous or polysemous words in different Deep Web interfaces, and thus lacks a unique semantic interpretation. To address this limitation, a novel classification model based on a Semantic Tree is proposed.
3. To address the limitations of the canonical KNN and Naive Bayesian algorithms, an adaptive KNN algorithm based on density and a weighted Naive Bayesian algorithm are proposed, respectively.
4. Finally, experiments are performed on the real UIUC Web repository dataset. Comparative analysis of the experimental results shows that the Semantic Tree model and the improved classification algorithms proposed in this thesis are effective.
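The abstract names information gain as the feature selection criterion but does not give the details. As a rough illustration only, the following Python sketch ranks terms by information gain, assuming each query interface is reduced to a set of terms (binary presence or absence) and labelled with a domain. The function names, the toy terms and the domain labels are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """Reduction in class entropy from knowing whether `term` is present.

    Assumption: each element of `docs` is a set of terms taken from one
    query interface page, and `labels` holds the matching domain labels.
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def select_features(docs, labels, k):
    """Return the k terms with the highest information gain.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


# Hypothetical toy data: four query interfaces from two domains.
docs = [
    {"title", "author", "isbn"},
    {"author", "publisher", "keyword"},
    {"departure", "arrival", "keyword"},
    {"departure", "date"},
]
labels = ["books", "books", "airfares", "airfares"]

top_terms = select_features(docs, labels, 2)
```

In this toy data, "author" and "departure" each separate the two domains perfectly, while "keyword" appears in both domains and carries no information, so a selector of this kind would keep the former and drop the latter.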
Keywords/Search Tags: Deep Web, Data source classification, Semantic tree, Improved KNN algorithm, Weighted naive Bayesian algorithm