Research On Deep Web Data Analysis Based On Stratified Sampling

Posted on:2016-01-29

Degree:Master

Type:Thesis

Country:China

Candidate:X Zhou

Full Text:PDF

GTID:2308330464950427

Subject:Management Science and Engineering

Abstract/Summary:

The deep web contains massive amounts of high-quality data, and deep web-oriented data analysis has become a current research focus in this field. Because of the restrictions imposed by query interfaces, most existing data analysis methods over the deep web operate on samples of the data sets. These samples, in turn, can only be obtained by querying the deep web databases. Compared with computation cost, query cost is the dominant factor when performing data analysis over the deep web. We should therefore consider not only how well the sampled data support the analysis task, but also how to reduce the number of queries. Based on this idea, this thesis studies two kinds of deep web data analysis tasks. The main work is as follows:

(1) Based on the characteristics of deep web form-like query interfaces, we analyze existing sampling strategies and their application to the aggregate estimation task. We discuss the influence of unbiased samples on different analysis tasks. This provides a theoretical basis for the follow-up research.

(2) We investigate the task of clustering over the deep web. To reduce sampling cost and improve clustering accuracy, we propose a multi-phase, stratification-based clustering method for the deep web. To reduce the dependence of the stratification on the initial samples, we iterate representative sampling and boundary sampling to select the optimal sample sets, and we estimate the clustering results using the weight information of the stratified samples. Experiments on synthetic and Yahoo data sets show that our method achieves higher clustering accuracy.

(3) We investigate the task of outlier detection over the deep web. Taking both precision and recall into account, we propose a stratification-based outlier detection method for the deep web. To adapt to the characteristics of outlier detection, we use neighborhood sampling to collect outlier samples, combined with the layer relationships among samples in terms of the stratification index. In addition, we introduce an uncertainty sampling process to address the resulting uncertainty problem. Experiments on several data sets show that our method performs well with respect to the combined metrics.
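
As a rough illustration of how the weight information of stratified samples can feed an aggregate estimate, the following Python sketch computes a weighted mean over strata. It is a textbook stratified estimator, not the method proposed in the thesis. The function name `stratified_mean`, the input dictionaries, and the assumption that each stratum's size is known in advance are all hypothetical choices made for this example.

```python
from statistics import mean


def stratified_mean(strata_sizes, strata_samples):
    """Estimate a population mean from per-stratum samples.

    strata_sizes:   {stratum: N_h}, number of records in each stratum
                    (assumed known here; over a real deep web source
                    this would itself have to be estimated)
    strata_samples: {stratum: [values]}, values sampled from each stratum
    """
    total = sum(strata_sizes.values())
    estimate = 0.0
    for stratum, size in strata_sizes.items():
        weight = size / total  # stratum weight W_h = N_h / N
        estimate += weight * mean(strata_samples[stratum])
    return estimate


if __name__ == "__main__":
    # Hypothetical strata, e.g. two value ranges of one query attribute.
    sizes = {"low_price": 90, "high_price": 10}
    samples = {"low_price": [1.0, 1.0], "high_price": [11.0, 11.0]}
    print(stratified_mean(sizes, samples))
```

Because each stratum mean is weighted by its share of the population, a small stratum with extreme values does not dominate the estimate even when both strata receive the same number of sampled records.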

Keywords/Search Tags:

Stratified Sampling, Deep Web, Data Analysis, Hierarchical Clustering, Outlier Detection

Related items

1	Research And Application Of Outlier Data Mining Algorithm Based On Deep Forest
2	Research On Distributed Stratified Sampling And Its Application On Object Detection
3	Research On Hierarchical Clustering Algorithm And Parallelization In Massive Data Environment
4	Study On The Algorithms Of Clustering And Outlier Detection Based On Neighborhood
5	Study And Implementation Of Clustering And Outlier Detection Algorithms
6	Research On Outlier Detection For Unbalanced Data
7	Research And Implementation Of Clustering And Outlier Detection Algorithms
8	Research And Application Of Outlier Detection Algorithm
9	Research On Smart Grid Big Data Outlier Detection And Analysis Of Electricity Behavior Based On Density Peaks Clustering Algorithm
10	Based On Clustering Analysis Of The Outlier Detection Research And Its Application In The Audit