Data mining, or knowledge discovery in databases, aims at finding unknown and useful information in large masses of data. Data mining is a young but flourishing field: many algorithms and applications exist to mine different types of data and extract different types of knowledge, and both applications and methods have developed at a fast pace. The major kinds of data mining methods include characterization, classification, association rules, clustering, outlier detection, pattern matching, and data visualization.

We study clustering methods in data mining. Clustering results can offer theoretical support for future development planning, and we point out the challenges facing the future development of clustering. We discuss the following clustering methods: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. In addition, for efficiency, we improve the existing K-means clustering algorithm; the improved K-means algorithm is more efficient on sparse data sets.

In data mining, large data sets are increasingly common: they are multidimensional and contain many records. Existing data mining methods are mostly suited to low-dimensional, small data sets; on multidimensional, large data sets their results are not ideal, and their demands on the system are high.

Data mining in large data sets therefore often requires a sampling or summarization step to form an in-core representation of the data that can be processed more efficiently. Uniform random sampling is frequently used in practice, and also frequently criticized, because it can miss important information in the original data set, for example small clusters. Many natural data sets are known to follow a Zipfian distribution, yet uniform random sampling selects every record with the same probability, even though some records may be of more value in the sample than others.
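Sampling records in proportion to a per-record importance score can be sketched with standard library tools as follows; the `value` function here is a hypothetical importance weight, not something the original data sets provide:

```python
import random

def value_biased_sample(records, value, m, seed=0):
    """Draw m records with replacement, each record chosen with
    probability proportional to value(record).  Illustrative sketch:
    `value` is a hypothetical per-record importance score."""
    rng = random.Random(seed)
    weights = [value(r) for r in records]
    return rng.choices(records, weights=weights, k=m)
```

For example, a record weighted nine times as heavily as another appears roughly nine times as often in the sample; the difficulty, of course, is obtaining such a weight in the first place.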
If we knew the value of each record, we could sample by assigning each record a probability proportional to its importance. It is unlikely, however, that we can define the value of each record in a data set and, worse yet, it may be difficult to generalize results obtained from such a sample, because the sample is no longer representative of the data set. Instead, we consider applications in which it is possible to define sets of equivalent records and to use the sizes of these sets to bias the sample while ensuring that it remains representative.

We investigate the use of sampling biased by the density of the data set to speed up general data mining tasks. Biased sampling overcomes this limitation of uniform random sampling: with the same probability of success, it can accomplish a given data mining task with a smaller sample. A serious difficulty, however, is that we have no direct way to bias the sampling toward the interesting points of the data set, since we do not know those points a priori. We observe that knowledge of the probability density function of the underlying data set provides enough information to locate them; what is needed is a density estimator, together with a mapping from density to sampling probability. Density-biased sampling can use any density estimation technique, and the literature suggests a number of methods that build density estimators for multidimensional data sets in different ways. Although any density estimation method would do, kernel density estimation is a particularly good choice: it is very efficient, since finding a kernel density estimator is as cheap as taking a random sample and can be done in one pass over the data set. Kernel density estimation is grounded in statistics and, in particular, in kernel theory.
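As a concrete one-dimensional illustration (our data sets are multidimensional, but the idea is the same), a kernel density estimate with a Gaussian kernel can be computed as follows; this is a sketch, not the estimator used in our experiments:

```python
import math

def gaussian_kde(data, x, h):
    """One-dimensional kernel density estimate at point x.
    h is the bandwidth; the kernel is the standard Gaussian density.
    Averages one scaled kernel bump centered on each data point."""
    phi = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return sum(phi((x - xi) / h) for xi in data) / (len(data) * h)
```

Evaluating the estimate at a query point is a single scan of the data, which is why building such an estimator costs no more than drawing a random sample.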
The exact shape of the kernel function does not greatly affect the approximation results; a Gaussian function or a polynomial works equally well. For simplicity, we chose the Epanechnikov kernel, because it is easy to integrate.

We use density-biased sampling as a data reduction technique to speed up clustering tasks in large multidimensional data sets. In density-biased sampling, the probability that a given point is included in the sample depends on the local density of the data set. We propose a general density-biased sampling technique that can factor in user requirements to sample for properties of interest. To identify the clusters in dense regions, we can oversample the dense regions. To identify all possible clusters, we can oversample the sparser regions, which makes it much more likely to identify sparser or smaller clusters. At the same time, we can ensure that regions that are denser in the original data set remain denser in the sample, so that we do not lose any of the dense clusters. To identify outliers, we could sample the very sparse regions, although we have not investigated this aspect in depth.

To evaluate the accuracy of the proposed technique, we run clustering algorithms on the density-biased sample. To cluster the sample points, we use a hierarchical clustering algorithm. The algorithm is a bottom-up technique: it starts with each point in a separate cluster and at each step merges the closest clusters. It is similar to CURE in that it keeps a set of points, rather than a single object, as the representative of each cluster. The main disadvantage of the hierarchical approach is its quadratic runtime complexity; running a quadratic algorithm even on a few thousand data points can be problematic.
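The bottom-up merging just described, with its quadratic pair scan, can be sketched as follows; this is a naive single-linkage variant for 2-D points, not the CURE-like representative-point scheme itself:

```python
def agglomerative(points, k):
    """Naive bottom-up hierarchical clustering: start with singleton
    clusters and repeatedly merge the closest pair (single linkage)
    until k clusters remain.  Each step scans all cluster pairs,
    so the cost is at least quadratic in the number of points."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```

The doubly nested loop over cluster pairs is exactly the quadratic cost that motivates reducing the data set first.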
A viable approach to overcoming this difficulty is to reduce the size of the data set in a manner that preserves the information of interest to the analysis without harming the overall technique. A hierarchical clustering algorithm that runs on the biased sample is, in effect, an approximation algorithm for the clustering problem on the original data set. In this respect, our technique is analogous to the approximation clustering algorithms proposed by Indyk, which also run on a uniform random sample to obtain an approximate clustering efficiently.
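Putting the pieces together, the density-biased sampling step that precedes the clustering can be sketched in one dimension as follows; the exponent `e` is a hypothetical user parameter (oversampling dense regions when positive, sparse regions when negative), not the exact scheme evaluated in our experiments:

```python
import random

def epanechnikov_kde(data, x, h):
    """1-D kernel density estimate at x with bandwidth h, using the
    Epanechnikov kernel K(u) = 0.75 * (1 - u**2) on |u| <= 1."""
    k = lambda u: 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0
    return sum(k((x - xi) / h) for xi in data) / (len(data) * h)

def density_biased_sample(data, m, h, e, seed=0):
    """Draw m points with probability proportional to density**e.
    e > 0 oversamples dense regions; e < 0 oversamples sparse ones.
    Illustrative sketch combining KDE with weighted sampling."""
    rng = random.Random(seed)
    weights = [epanechnikov_kde(data, x, h) ** e for x in data]
    return rng.choices(data, weights=weights, k=m)
```

Running the hierarchical algorithm on such a sample then yields the approximate clustering described above.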