
Research On Deep Web Data Analysis Based On Stratified Sampling

Posted on: 2016-01-29
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhou
Full Text: PDF
GTID: 2308330464950427
Subject: Management Science and Engineering
Abstract/Summary:
The deep web contains massive amounts of high-quality data, and deep web-oriented data analysis has become a current research focus in this field. Because query interfaces impose restrictions, most existing data analysis methods for the deep web work on samples of the data sets. The samples, in turn, can only be obtained by querying the deep web databases. Compared with computation cost, query cost is the dominant factor when performing data analysis over the deep web. We should therefore consider not only how well the sampled data support the analysis task, but also how to reduce the number of queries. Based on this idea, this thesis studies two kinds of deep web data analysis tasks. The main work is as follows:

(1) Based on the characteristics of form-like deep web query interfaces, we analyze existing sampling strategies and their application to the aggregate estimation task. We discuss the influence of unbiased samples on different analysis tasks. This provides a theoretical basis for the follow-up research.

(2) We investigate the task of clustering over the deep web. To reduce sampling cost and improve clustering accuracy, we propose a clustering method for the deep web based on multi-phase stratification. To reduce the dependence of the stratification on the initial samples, we iterate representative sampling and boundary sampling to select the optimal sample sets, and we estimate the clustering results using the weight information of the stratified samples. Experiments on synthetic and Yahoo data sets show that our method can achieve higher clustering accuracy.

(3) We investigate the task of outlier detection over the deep web. Taking both precision and recall into account, we propose a stratification-based outlier detection method for the deep web. To adapt to the characteristics of outlier detection, we use neighborhood sampling to collect outlier samples, combined with the layer relationships among samples given by the stratified index. We also introduce an uncertainty sampling process to address the resulting uncertainty problem. Experiments on several data sets show that our method performs well with respect to the combined metrics.
Keywords/Search Tags:Stratified Sampling, Deep Web, Data Analysis, Hierarchical Clustering, Outlier Detection
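
As an illustration of the aggregate estimation task mentioned in (1), the following is a minimal sketch of a textbook stratified-sampling estimate of a population mean. It is not the method proposed in the thesis. The function name `stratified_mean`, the stratum labels, and the assumption that each stratum's size is known (for example, from the result count a query interface reports) are all hypothetical and introduced only for this example.

```python
from statistics import mean


def stratified_mean(strata_sizes, strata_samples):
    """Estimate a population mean from per-stratum samples.

    strata_sizes:   dict mapping stratum label -> number of records in that
                    stratum (assumed known; a hypothetical input here).
    strata_samples: dict mapping stratum label -> list of sampled values.
    """
    total = sum(strata_sizes.values())
    if total <= 0:
        raise ValueError("total population size must be positive")

    estimate = 0.0
    for label, size in strata_sizes.items():
        samples = strata_samples.get(label, [])
        if not samples:
            raise ValueError(f"stratum {label!r} has no samples")
        # Each stratum contributes its sample mean, weighted by the
        # fraction of the population that the stratum represents.
        estimate += (size / total) * mean(samples)
    return estimate


if __name__ == "__main__":
    # Hypothetical strata: many cheap items, few expensive ones.
    sizes = {"low_price": 900, "high_price": 100}
    samples = {"low_price": [10, 12, 11], "high_price": [200, 220]}
    print(stratified_mean(sizes, samples))
```

Weighting by stratum size lets a small number of queried records per stratum stand in for the whole stratum, which is one reason stratification is attractive when each query is costly.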