Data Mining over Hidden Data Sources

Posted on: 2013-09-07
Degree: Ph.D
Type: Dissertation
University: The Ohio State University
Candidate: Liu, Tantan
Full Text: PDF
GTID: 1458390008482952
Subject: Computer Science
Abstract/Summary:
In recent years, the deep web has become an extremely popular mode of data dissemination. As with any other data source, data mining on the deep web can produce important insights or summaries of results. However, data mining on the deep web is challenging because the databases cannot be accessed directly. Thus, data mining must be performed on samples of the datasets. The samples, in turn, can only be obtained by querying the deep web databases with specific inputs. Furthermore, acquiring data from the deep web is especially time consuming, since deep web queries are executed over a wide area network.

To solve these problems, we proposed efficient sampling methods for mining one or more deep web data sources, targeting a number of distinct data mining problems, including association rule mining, frequent itemset mining, clustering, and differential rule mining.

For differential rule mining and association rule mining on the deep web, we introduced a novel stratified sampling method for verifying interesting rules. Our contributions include a novel greedy stratification approach, which processes the query space of a deep web data source recursively and considers both the estimation error and the sampling cost. We have also developed an optimized sample allocation method that integrates estimation error and sampling cost.

We also proposed a sampling method for data mining on the deep web based on the theory of active learning. For frequent itemset mining on the deep web, a novel active learning based sampling method is proposed to obtain good estimates for 1-itemsets of the output attributes. In our method, a Bayesian network is used to describe the relationship between the input and output attributes, and a risk function is defined to evaluate the loss of the estimation for the support values of 1-itemsets. Our sampling method iteratively selects the data records in the query space that will maximally reduce the risk function.

For clustering on a deep web data source, we developed a stratified hierarchical clustering method. Our work includes three new ideas: a method for stratifying a deep web data source, an algorithm for hierarchical clustering based on stratified sampling, and a two-phase sampling technique, which performs representative sampling in the first phase and sampling focused on the boundary points between clusters in the second phase.

We also proposed a sampling method for k-means clustering over a deep web data source. In our approach, three representative sampling methods are developed, with the goal of achieving a good estimation of the statistics, including proportions and centers, within the sub-spaces of the output attributes.
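
The idea of allocating samples across strata while weighing estimation error against sampling cost can be illustrated with a minimal sketch. It uses the textbook cost-aware allocation rule for stratified sampling (sample size proportional to stratum weight times standard deviation, divided by the square root of per-query cost), which is not necessarily the optimized allocation method developed in the dissertation. The function names, the dictionary keys `weight`, `std`, and `cost`, and the numbers in the usage lines are all hypothetical.

```python
import math


def allocate_samples(strata, budget):
    """Split a total query budget across strata (textbook rule, assumed here).

    Each stratum is a dict with hypothetical keys:
      weight - share of the query space covered by the stratum
      std    - estimated standard deviation of the target in the stratum
      cost   - cost of one deep web query against the stratum
    """
    # Cost-aware allocation: n_h is proportional to W_h * S_h / sqrt(c_h).
    scores = [s["weight"] * s["std"] / math.sqrt(s["cost"]) for s in strata]
    # Scale the scores so that the total spend sum(n_h * c_h) matches the budget.
    spend_per_unit = sum(sc * s["cost"] for sc, s in zip(scores, strata))
    if spend_per_unit == 0:
        return [0] * len(strata)
    scale = budget / spend_per_unit
    # Round down so the allocation never exceeds the budget.
    return [math.floor(scale * sc) for sc in scores]


def stratified_estimate(strata, samples):
    """Weighted mean of per-stratum sample means (each sample list non-empty)."""
    return sum(
        s["weight"] * (sum(xs) / len(xs)) for s, xs in zip(strata, samples)
    )


# Hypothetical strata: the second is less variable and costlier to query.
strata = [
    {"weight": 0.5, "std": 0.4, "cost": 1.0},
    {"weight": 0.5, "std": 0.1, "cost": 4.0},
]
allocation = allocate_samples(strata, budget=100)
estimate = stratified_estimate(strata, [[1, 0, 1, 1], [0, 0]])
```

Under this rule, a stratum receives more queries when it is large or highly variable and fewer when each query against it is expensive, which is the trade-off between estimation error and sampling cost described above.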
Keywords/Search Tags:Data, Deep web, Sampling, Output attributes, Estimation, Over