Research on Deep Web Query Interface Determining Technology

Posted on: 2010-03-26
Degree: Master
Type: Thesis
Country: China
Candidate: Q H Li
Full Text: PDF
GTID: 2178360275451452
Subject: Computer application technology
Abstract/Summary:
The Deep Web stands in contrast to the Surface Web. As the Internet continues to grow, more and more information is made available through pages that are generated dynamically from back-end databases. Traditional search engines, however, cannot crawl these online databases because of technical restrictions and other reasons, so a large amount of high-value information inside the Deep Web remains invisible to them. Query interfaces are the only entrance to Deep Web databases, and users can obtain information from the Deep Web only by submitting queries through them; correctly judging and identifying query interfaces is therefore essential to accessing Deep Web information. Focusing on the judgment and identification of query interfaces, this thesis carries out the following work. First, it surveys background knowledge on the Deep Web and the state of research in China and abroad, including the concept and value of the Deep Web and the methods used to search it, and then states the research question and direction of the thesis. Second, it collects a variety of forms from different domains, parses the forms into DOM trees, extracts the features of each form, and saves them to a database. Third, it pre-processes the raw data sets, including removal of redundancy and noise, attribute selection, format conversion, and discretization. Finally, it applies several typical classification algorithms to classify and predict the data sets: the C4.5 decision tree, Support Vector Machine, k-Nearest Neighbor, and Naive Bayesian classifiers. Classification and prediction are evaluated using the holdout method with random sampling and 10-fold cross-validation.
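The form-parsing and feature-extraction step can be illustrated with a minimal sketch. The abstract does not list the features the thesis actually uses, so the counts below (text inputs, password inputs, selects, checkboxes, submit buttons, and the form method) are assumptions chosen for illustration, and the names FormFeatureParser and extract_form_features are hypothetical. The sketch uses Python's standard html.parser event stream rather than building a full DOM tree.

```python
from html.parser import HTMLParser


class FormFeatureParser(HTMLParser):
    """Collects simple per-form counts; the feature set is illustrative only."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one feature dict per completed <form>
        self._current = None   # features of the form currently being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "method": (attrs.get("method") or "get").lower(),
                "text_inputs": 0,
                "password_inputs": 0,
                "selects": 0,
                "checkboxes": 0,
                "submit_buttons": 0,
            }
        elif self._current is not None:
            if tag == "select":
                self._current["selects"] += 1
            elif tag == "input":
                # An <input> without a type attribute behaves as a text field.
                kind = (attrs.get("type") or "text").lower()
                if kind == "text":
                    self._current["text_inputs"] += 1
                elif kind == "password":
                    self._current["password_inputs"] += 1
                elif kind == "checkbox":
                    self._current["checkboxes"] += 1
                elif kind == "submit":
                    self._current["submit_buttons"] += 1

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def extract_form_features(html):
    """Return a list with one feature dict for each form found in the page."""
    parser = FormFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.forms
```

Each returned dictionary corresponds to one row of the raw data set. Under this assumed feature set, a form with a password field would presumably look more like a login form than a query interface, which is the kind of signal a classifier could learn from.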
Based on an analysis and comparison of the experimental results, the algorithm with the highest accuracy is chosen to judge and identify Deep Web query interfaces. The thesis concludes by proposing several directions for further research on the topic. Although research on the Deep Web is still at an early stage, continued exploration can be expected to bring further breakthroughs and results.
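The selection procedure described above, scoring each classifier by k-fold cross-validation and keeping the most accurate one, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: a small k-nearest-neighbor classifier and a majority-class baseline stand in for the four algorithms named in the abstract (C4.5, SVM, and Naive Bayes are not implemented here), and all function names are hypothetical.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training rows closest to x (squared Euclidean)."""
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(train_X[i], x))
    nearest = sorted(range(len(train_X)), key=dist)[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def majority_predict(train_X, train_y, x):
    """Baseline: always predict the most frequent training label."""
    return Counter(train_y).most_common(1)[0][0]


def cross_val_accuracy(predict, X, y, k=10, seed=0):
    """Fraction of rows classified correctly when each fold is held out once."""
    correct = 0
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct += sum(predict(train_X, train_y, X[i]) == y[i] for i in fold)
    return correct / len(X)


def pick_best(classifiers, X, y, k=10):
    """Score every classifier with k-fold cross-validation; return the best."""
    scores = {name: cross_val_accuracy(fn, X, y, k) for name, fn in classifiers.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

A call such as pick_best({"knn": knn_predict, "majority": majority_predict}, X, y) returns the name of the highest-scoring classifier together with all scores, where X is a list of numeric feature tuples and y the matching list of labels (for example, query interface or not). Using the same seed for every classifier keeps the folds identical, so the accuracies are compared on the same splits.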
Keywords/Search Tags: Deep Web, Query Interface, DOM, Decision Tree C4.5 Classification Algorithm, 10-Fold Cross-Validation