
The Research On Sampling For Data Mining

Posted on: 2007-10-03
Degree: Master
Type: Thesis
Country: China
Candidate: H T Yu
Full Text: PDF
GTID: 2178360182486602
Subject: Computer software and theory
Abstract/Summary:
With the rapid growth of database scale, directly applying known mining algorithms to large-scale databases incurs high computational costs in both time and space. Sampling is one of the most important methods for mining knowledge from large-scale databases: proper sampling can preserve the accuracy of the results while reducing the computational cost of the algorithms. This dissertation studies sampling methods for data mining. Its major contributions are as follows:

(1) To overcome the problem that known sampling-based mining methods rely heavily on subjective factors, the statistical optimal sample size (SOSS) is introduced into sampling. Sampling-based mining algorithms whose sample size is determined by the statistical optimal sample size can reflect the characteristics of the data distribution, preserve the accuracy of the results, and reduce the sample size at the same time.

(2) A stratified sampling algorithm for extracting classification rules is proposed. Its goal is to retain the main classification rules while reducing the sample size. The algorithm uses the statistical optimal sample size to determine the sample size, and uses stratified sampling to raise the accuracy of classification algorithms on non-uniformly distributed data.

(3) A weighted sampling method for mining frequent itemsets is proposed. The algorithm aims to mine long frequent itemsets from large-scale data and takes both sample quality and sample size into account, so it can largely preserve the frequent itemsets while reducing the data scale.

(4) A new grid clustering method based on random sampling is proposed. It inherits the advantage that grid-based clustering methods scale well to large, high-dimensional data, and it improves clustering accuracy by using random sampling to determine the partition granularity of the grid.

Experiments verify the validity of these algorithms.
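
As an illustration of the stratified sampling idea in item (2), the following is a minimal sketch and not the dissertation's algorithm. It assumes proportional allocation across class labels and takes the target sample size as a plain parameter, since the abstract does not give the formula for the statistical optimal sample size. The names `stratified_sample`, `label_of`, and `sample_size` are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, label_of, sample_size, seed=0):
    """Draw roughly sample_size records, one stratum per class label.

    Assumption: proportional allocation. The abstract does not state
    how the sample is divided among strata, nor how the sample size
    itself is computed, so sample_size is supplied by the caller.
    """
    rng = random.Random(seed)

    # Group the records into strata keyed by class label.
    strata = defaultdict(list)
    for rec in records:
        strata[label_of(rec)].append(rec)

    total = len(records)
    sample = []
    for members in strata.values():
        # Share of the sample proportional to the stratum size,
        # with at least one record so that rare classes are kept.
        k = max(1, round(sample_size * len(members) / total))
        k = min(k, len(members))
        # Simple random sampling without replacement inside the stratum.
        sample.extend(rng.sample(members, k))
    return sample


# Usage: 1000 records, 10% of which carry the label "rare".
records = [(i, "rare" if i % 10 == 0 else "common") for i in range(1000)]
sample = stratified_sample(records, lambda r: r[1], sample_size=100)
```

Because each class is sampled separately, a class that makes up a small fraction of the data keeps its share of the sample instead of depending on chance, which is the property item (2) relies on for non-uniformly distributed data.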
Keywords/Search Tags: Data Mining, Sampling, Statistical Optimal Sample Size (SOSS), Stratified Sampling, Weighted Sampling, Grid Clustering