An Improved Hierarchical Clustering And Outlier-detecting Algorithm And Application On The Data-mining Platform

Posted on: 2003-11-25
Degree: Master
Type: Thesis
Country: China
Candidate: Q N Wu
Full Text: PDF
GTID: 2168360062490770
Subject: Electronic information and computer application technology
Abstract/Summary:
In recent years, the volume of data held by organizations has exploded. How to extract valuable information and knowledge from databases has become a vital research area, known as data mining or knowledge discovery.

Data clustering, an important branch of data mining, is the process of grouping data into classes or clusters so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters. It helps uncover patterns and data distributions that are novel, effective, useful, or understandable.

Outlier detection, an emerging area of data mining related to clustering, aims to discover small patterns in which the data differ notably from the other data objects. It can be used in fraud detection, e.g., by detecting unusual usage of credit cards or telecommunications services. It also has wide application in weather prediction, customer classification, and finance.

The basic idea of hierarchy-based methods is to create and maintain a tree of clusters and sub-clusters according to some criterion that measures the distance between clusters; the procedure continues until some termination condition is satisfied. Hierarchical clustering methods can be further classified as agglomerative or divisive, depending on whether the hierarchical decomposition is formed in a bottom-up or top-down fashion. Most hierarchical clustering methods produce good results when the clusters are compact or spherical in shape, but they do not perform well if the clusters have arbitrary shapes or if outliers are present. A main reason is that most hierarchical clustering methods employ a medoid-based measure as the distance between clusters.

In [GRS98], a novel hierarchical clustering method, CURE, is presented.
This algorithm adopts a middle ground between the centroid-based and all-points-based approaches. Instead of using a single centroid or all points to represent a cluster, it chooses a fixed number of representative points in space; these points capture the geometry and shape of the cluster. The representative points of a cluster are generated by first selecting well-scattered objects in the cluster and then "shrinking" them, that is, moving them toward the cluster center by a specified fraction, the shrinking factor. The shrinking helps dampen the effects of outliers, so CURE is more robust to outliers.

This thesis extends research on hierarchy-based methods. Specifically, we apply this approach to data clustering and outlier detection, with improvements to some of the sampling techniques used in these two areas. Finally, we implement the CURE algorithm in a client intelligent analysis system (CIAS) using data generalization techniques and obtain some valuable results.

Concretely, our main contributions are:
* We survey and discuss the various clustering methods, focusing on hierarchy-based clustering, and improve this kind of algorithm so that it can process outliers and identify clusters with nonspherical shapes and wide variance in size.
* We analyze the CURE algorithm, improve its sampling algorithm, and implement the algorithm in a client intelligent analysis system (CIAS).
* We survey and discuss outlier detection, analyze some of the causes of outliers, and discuss distance-based outlier detection methods in particular. We put forward an idea that speeds up outlier detection by means of partitioning, and we try to obtain a new approach by combining data clustering and outlier detection.
* We discuss the sampling techniques frequently used in clustering and outlier detection. This leads naturally to a density-biased sampling technique, which avoids deficiencies that arise in some applications and achieves better scalability and accuracy than uniform sampling...
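
The representative-point construction that the abstract attributes to CURE (select well-scattered objects, then shrink them toward the cluster center by a fixed fraction) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The function names, the parameters `num_reps` and `shrink`, and the sample data are hypothetical, and the farthest-point heuristic used to pick "well-scattered" objects is an assumption, since the abstract does not say how they are chosen.

```python
import math


def centroid(points):
    """Return the coordinate-wise mean of a non-empty list of points."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def representative_points(points, num_reps, shrink):
    """Pick up to num_reps well-scattered points, then shrink them toward the centroid.

    shrink is the shrinking factor in [0, 1]: 0 leaves the chosen points in
    place, 1 collapses every representative onto the centroid.
    """
    center = centroid(points)
    chosen = []
    candidates = list(points)
    while candidates and len(chosen) < num_reps:
        if not chosen:
            # First pick (assumed heuristic): the point farthest from the centroid.
            best = max(candidates, key=lambda p: math.dist(p, center))
        else:
            # Later picks: the point farthest from its nearest already-chosen point.
            best = max(candidates, key=lambda p: min(math.dist(p, q) for q in chosen))
        chosen.append(best)
        candidates.remove(best)
    # Move each scattered point toward the centroid by the given fraction.
    return [
        tuple(x + shrink * (c - x) for x, c in zip(p, center))
        for p in chosen
    ]


# Hypothetical cluster: four corners of a square plus its middle point.
cluster = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (2.0, 2.0)]
reps = representative_points(cluster, num_reps=4, shrink=0.5)
```

Each representative moves by `shrink` times its distance to the centroid, so the points farthest from the center, which are the ones most likely to be outliers, move the most. This is one way to read the abstract's remark that shrinking dampens the effect of outliers.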
Keywords/Search Tags: Outlier-detecting