Research On Deep Web Data Source Selection Method Based On Sampling

Posted on: 2016-11-04
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Qin
Full Text: PDF
GTID: 2208330461984736
Subject: Computer application technology
Abstract/Summary:
With the rapid growth of information on the Internet, the Web holds a vast amount of information for people to use. Deep Web databases, however, are invisible to the user: the information they contain can only be obtained through specific query interfaces. To make full use of the rich and valuable information in the Deep Web and to improve query efficiency, building Deep Web data integration systems has become a focus of current research. Within such a system, Deep Web database selection is an important part of the query processing module.

For the selection of Deep Web data sources, this paper concentrates on three aspects: obtaining data source characteristics through sampling; assessing sampling quality; and sorting and selecting data sources by an overall score computed from the chosen evaluation indexes.

First, building on the random walk sampling approach and addressing the lack of research on keyword attributes, this paper proposes an extension that introduces keyword attributes into sampling and classifies the attributes. In addition, because existing research has not studied categorical attributes that have a tree structure, this paper proposes the concept of tree categorical attributes and gives a sampling procedure for them.

Secondly, based on the original random walk sampling approach, this paper proposes an improved random walk sampling algorithm that saves sampling paths and compares each new sampling path against the saved ones by scanning, so as to avoid repeated submissions. In this way, the approach further improves sampling efficiency.

Thirdly, for sampling evaluation, the paper considers the consistency of information content between the sample and the data source. It introduces a text similarity calculation into the sampling quality evaluation and combines it with the ratio of the sample set to the data source to measure sample bias, further improving the sampling quality evaluation.

Fourthly, using the sample set to evaluate data source quality, this paper defines five evaluation indexes (authority, domain relevance, accuracy, redundancy, and timeliness) and gives quantitative methods and formulas for each. It also improves the similarity calculation by adding semantic similarity to the similarity measure based on Hamming distance. A comprehensive evaluation over the five indexes yields an overall score for each data source, and these scores are then used to sort and select data sources.

Experiments show that the methods proposed in this paper greatly improve on previous methods, further improve the quality and efficiency of sampling, and make the assessment of sample set quality more reliable and effective.
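
The path-saving idea in the second contribution can be illustrated with a small sketch. This is not the thesis's algorithm, whose details the abstract does not give. It assumes a hypothetical `ToyInterface` that stands in for a Deep Web query form with a top-k result limit, hypothetical status labels (`overflow`, `valid`, `empty`), and a fixed attribute order. It also swaps the scan over saved paths described above for a dictionary lookup.

```python
import random


class ToyInterface:
    """Hypothetical stand-in for a query form that returns at most k rows."""

    def __init__(self, rows, k):
        self.rows, self.k, self.submissions = rows, k, 0

    def query(self, path):
        # path is a tuple of (attribute, value) predicates, ANDed together.
        self.submissions += 1
        hits = [r for r in self.rows if all(r[a] == v for a, v in path)]
        if len(hits) > self.k:
            return "overflow", []  # too many matches: caller must drill down
        return ("valid", hits) if hits else ("empty", [])


def sample(interface, domains, n_samples, rng, max_walks=10_000):
    """Random-walk sampler that never submits the same path twice."""
    cache = {}  # path -> (status, rows) for every path already submitted
    samples = []
    for _ in range(max_walks):
        if len(samples) >= n_samples:
            break
        path = ()
        for attr in domains:  # fixed attribute order, random value per step
            path += ((attr, rng.choice(domains[attr])),)
            if path not in cache:  # only unseen paths reach the interface
                cache[path] = interface.query(path)
            status, rows = cache[path]
            if status != "overflow":
                break
        if status == "valid":
            samples.append(rng.choice(rows))
    return samples, cache


# Illustrative data only: a tiny three-attribute source with six records.
domains = {"make": ["ford", "audi"], "color": ["red", "blue"], "year": [2014, 2015]}
rows = [
    {"make": m, "color": c, "year": y}
    for m in domains["make"]
    for c in domains["color"]
    for y in domains["year"]
][:6]
interface = ToyInterface(rows, k=1)
samples, cache = sample(interface, domains, n_samples=20, rng=random.Random(0))
```

Because every submitted path is stored, the number of submissions equals the number of distinct paths visited rather than growing with the number of walks. The trade-off is the memory needed to hold the saved paths and their results.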
Keywords/Search Tags: Deep Web database, Data source sampling, Classification, Sampling quality evaluation, Data source quality assessment, Similarity, Sorting and selection