
Research Of Deep Web Classification And Realization

Posted on: 2014-01-22  Degree: Master  Type: Thesis
Country: China  Candidate: Y Li  Full Text: PDF
GTID: 2248330398994453  Subject: Computer application technology
Abstract/Summary:
The web can be divided into the deep web and the visible (surface) web according to the depth at which information is stored. The visible web consists of pages that can be indexed by traditional search engines such as Google and Baidu, while deep web pages can only be reached by filling in a query form and submitting it to a backend server. According to BrightPlanet's investigation, the deep web contains hundreds of times more information than the visible web, and its information is of higher quality and greater professional depth. These characteristics make the deep web well suited to information acquisition and utilization. However, deep web information is widely distributed and large in scale, and the businesses behind it change rapidly, which makes it hard to exploit: a great deal of work is needed before effective data integration and knowledge mining are possible, and accurately classifying data sources drawn from the massive deep web is the most important of these tasks.

This thesis focuses on the classification of the deep web. To integrate deep web data effectively, the first job is to discover the data sources and classify them. At present this is mostly done manually, but manual classification is costly, slow to update, and covers only a limited set of categories. How to overcome these drawbacks by automatically classifying each data source, labeling it with a category, and thereby achieving effective deep web resource integration has therefore become a hot research topic.
Existing research on deep web classification is based on the text features of the query form and assumes that those features are mutually independent, which is inconsistent with the actual situation. Moreover, existing classification algorithms do not take the distribution of the training samples into account: an algorithm that classifies deep web sources well under one distribution may perform poorly under another. To address these deficiencies, this thesis proposes the following improvements.

First, the thesis surveys the fundamentals of the web and of search engines, which provides a theoretical foundation for deep web classification, and studies the existing classification algorithms. It argues that different classification methods should be adopted for different training-sample distributions; that is, the distribution of deep web sources must be considered, and two cases are distinguished: rich and sparse.

In the feature extraction stage, the thesis proposes extracting both the text information and the structure information of a data source's query form. This makes full use of the relations between data sources and categories: different categories exhibit different form structures, while forms belonging to the same category are structurally very similar.

In the case of rich data source interfaces, data mining methods are introduced to mine frequent patterns among the data source features and thus capture the associations between them, overcoming the assumption that the features are independent. On this basis an improved Bayesian classification algorithm is proposed, and its performance is verified on the TEL-8 data set.
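To make the classification setting concrete, the sketch below shows a minimal multinomial Naive Bayes classifier over terms extracted from query-form interfaces. The category names and form vocabularies are invented for illustration; the thesis's improved algorithm additionally mines frequent feature patterns to model associations between features, which this baseline sketch does not reproduce.

```python
from collections import Counter, defaultdict
import math

# Toy training data: each deep-web query form is represented by the terms
# extracted from its search interface (field labels, button text, etc.).
# Labels and terms are illustrative, not from the thesis's TEL-8 experiments.
train = [
    (["title", "author", "isbn", "publisher"], "Books"),
    (["title", "author", "keyword", "year"], "Books"),
    (["artist", "album", "song", "genre"], "Music"),
    (["artist", "song", "label", "year"], "Music"),
]

def fit(samples):
    """Estimate class priors and per-class term counts (for Laplace-smoothed likelihoods)."""
    class_counts = Counter(label for _, label in samples)
    term_counts = defaultdict(Counter)
    vocab = set()
    for terms, label in samples:
        term_counts[label].update(terms)
        vocab.update(terms)
    return class_counts, term_counts, vocab

def predict(terms, class_counts, term_counts, vocab):
    """Pick the class maximizing log P(c) + sum over terms of log P(t|c)."""
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)
        denom = sum(term_counts[c].values()) + len(vocab)  # Laplace smoothing
        for t in terms:
            score += math.log((term_counts[c][t] + 1) / denom)
        if score > best_score:
            best, best_score = c, score
    return best

class_counts, term_counts, vocab = fit(train)
print(predict(["title", "isbn", "publisher"], class_counts, term_counts, vocab))  # → Books
```

The independence assumption criticized in the thesis is visible in `predict`: each term contributes its log-likelihood separately, ignoring co-occurrence between features.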
The new algorithm is compared with the traditional algorithm on recall, precision, and F-measure to demonstrate its effectiveness. In the case of sparse source interfaces, a semantic dictionary is introduced to overcome the feature sparseness caused by having few source interfaces; a concept space model is also introduced, an improved classification algorithm based on KNN is proposed, and its performance is verified under the same experimental conditions.
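The evaluation metrics named above can be computed as follows; this is a generic per-class sketch with made-up label vectors, not the thesis's actual experimental results.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall, and F-measure for one target category."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative gold labels and predictions for the "Books" category.
y_true = ["Books", "Books", "Music", "Books", "Music"]
y_pred = ["Books", "Music", "Music", "Books", "Books"]
print(precision_recall_f1(y_true, y_pred, "Books"))
```

For these toy vectors the "Books" class has 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F-measure all come out to 2/3.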
Keywords/Search Tags: Deep Web, Classification, Bayes, KNN, Semantic, Data Mining, WordNet