Research on the Automatic Identification and Classification of Deep Web Data Sources

Posted on: 2014-02-21; Degree: Master; Type: Thesis
Country: China; Candidate: K Lin; Full Text: PDF
GTID: 2248330398482529; Subject: Computer software and theory
Abstract/Summary:
The Deep Web, also known as the Invisible Web or Hidden Web, is the part of the Web whose content standard search engines such as Google cannot find. By some estimates, the information that ordinary search engines cannot reach accounts for 90% of all information on the Web. According to a BrightPlanet technical white paper, the Deep Web holds about 500 times as much content as the Surface Web, and that content is also more valuable. More than half of Deep Web content is stored in domain-specific databases. The large volume of information on the Surface Web can of course be queried through a general-purpose search engine, but a considerable amount of information lies hidden beneath it where search engines cannot find it. Deep Web data sources also change constantly, and the vast majority of the hidden information exists only in pages generated dynamically in response to a request, so a standard search engine has no way to find it. Because these dynamic pages can be obtained only through a Deep Web query interface, access to Deep Web information is more difficult, and reaching it requires Deep Web data integration. This thesis studies in depth two key problems in Deep Web data integration: the automatic identification of data sources and the classification of their query interfaces.
The main contents include:
(1) Analysis of the form characteristics of ordinary Web forms and Deep Web pages. By merging, adding, and filtering candidate features, the thesis arrives at its form feature extraction scheme, which uses the control values, the number of controls, entries carrying semantic information, and a series of other feature values as classification attributes.
(2) Study of the key problem in Deep Web data integration: the identification and classification of query interfaces. To address the limitations of the Naive Bayes method, a rough set algorithm is used for optimization by reduction. Two rounds of random sampling are used to build a group of classifiers based on the Naive Bayes algorithm, and the rough set algorithm then reduces this classifier group. The optimized group performs the classification, and the individual results are combined by weighted average to give the final classification result. Experimental results show that the classifier group optimized with rough sets and Bayesian classification significantly improves mining performance; it not only improves the robustness of the Bayesian method but also widens the range of problems to which the method can be applied.
(3) Evaluation of identification and classification performance for Deep Web data sources. The algorithm is compared with several other classification methods, such as the C4.5 decision tree and the ID3 algorithm; the recall and precision results show that the method is feasible.
The method adopted in this thesis is based on an analysis of existing related research and on a study of Deep Web data sources. It builds on existing results by improving the algorithm, and its effectiveness is verified on experimental data. Judging from the experimental results, the method is satisfactory.
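The classifier group outlined in item (2) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis. The feature names, the toy data, and the helpers `train_group`, `prune_group`, and `classify` are all hypothetical. Bootstrap sampling stands in for the random sampling step, whose details the abstract does not give, and a simple accuracy threshold stands in for rough set reduction, which is not reproduced here.

```python
import math
import random
from collections import Counter, defaultdict


class NaiveBayes:
    """Categorical Naive Bayes with add-one smoothing."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.total = len(labels)
        self.value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
        self.values = defaultdict(set)            # feature index -> values seen
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.value_counts[(i, label)][value] += 1
                self.values[i].add(value)
        return self

    def predict_proba(self, row):
        log_scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / self.total)
            for i, value in enumerate(row):
                seen = self.value_counts[(i, label)][value]
                # The extra "+ 1" in the denominator reserves mass for unseen values.
                score += math.log((seen + 1) / (count + len(self.values[i]) + 1))
            log_scores[label] = score
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(exp.values())
        return {label: e / norm for label, e in exp.items()}

    def predict(self, row):
        proba = self.predict_proba(row)
        return max(proba, key=proba.get)


def accuracy(model, rows, labels):
    hits = sum(model.predict(r) == y for r, y in zip(rows, labels))
    return hits / len(rows)


def train_group(rows, labels, n_members=9, seed=0):
    """Train each member on a bootstrap sample; weight it by training accuracy."""
    rng = random.Random(seed)
    group = []
    for _ in range(n_members):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        model = NaiveBayes().fit([rows[i] for i in idx], [labels[i] for i in idx])
        group.append((model, accuracy(model, rows, labels)))
    return group


def prune_group(group):
    """Stand-in for rough set reduction: keep members at or above mean accuracy."""
    mean = sum(w for _, w in group) / len(group)
    return [(m, w) for m, w in group if w >= mean - 1e-9]


def classify(group, row):
    """Combine the members' class probabilities by weighted average."""
    total = sum(w for _, w in group) or 1.0
    combined = Counter()
    for model, weight in group:
        for label, p in model.predict_proba(row).items():
            combined[label] += weight * p / total
    return max(combined, key=combined.get)


# Hypothetical form features: (has_password_field, text_inputs, has_search_keyword)
ROWS = ([("no", "few", "yes")] * 4 + [("no", "many", "yes")] * 2
        + [("yes", "few", "no")] * 4 + [("yes", "many", "no")] * 2)
LABELS = ["query"] * 6 + ["other"] * 6

group = prune_group(train_group(ROWS, LABELS))
print(classify(group, ("no", "few", "yes")))
```

Weighting each member by its accuracy means that a member trained on an unlucky sample contributes less to the final vote. A real rough set reduction would select members on the basis of attribute dependency rather than a plain accuracy cut-off.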
In studying the classification of Deep Web data sources, the approach taken, in light of the actual situation, is first to build classifiers with the Bayesian algorithm and then to apply the rough set reduction algorithm to optimize the classifier group. The experiments inevitably have shortcomings, and future research will look further into the related problems and algorithms. Deep Web research still has a long way to go, and many problems remain for researchers to solve one by one...
Keywords/Search Tags: Deep Web, automatic identification of data sources, data source classification, rough set, Bayesian, data mining