Font Size: a A A

Research On Connectivity-based Outlier Detection And Clustering

Posted on:2015-08-20Degree:DoctorType:Dissertation
Country:ChinaCandidate:J Q WanFull Text:PDF
GTID:1228330422971492Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Outlier detection and clustering are important research of data mining. Althoughtheir technology has been very mature, there are still many challenges in sorts ofapplications. This paper hopes to continually promote the development of the basictheory of outlier detection and clustering, and solve the problems of current researchand applications. The main work is: Based on the research of relationship description,the models of outlier detection and clustering are studied, and then their solutionschemes are verified and optimized.Describing the relationship between data is basic research in the field of datamining. In practical applications, the quality of the relationship description oftendirectly affects the outcome of data mining. Therefore, this paper discusses therelationship measure–dissimilarity, and presents some related theories. Euclideandistance is the most commonly used measure of dissimilarity. However, this measuredoes not fully characterize the dissimilarity between data, and even sometimes give awrong description of dissimilarity. Hence, a neighborhood-based andconnectivity-based dissimilarity measures are proposed. The two measures respectivelyintroduce density and connectivity information so that dissimilarity can be reflectedwell. For ease of use, this study analysizes the neighborhood-based measure, and obtainits estimated value in a uniform distribution; via some theoretical derivation,connectivity-based dissimilarity is addressed as the longest edge of a minimumspanning tree path.From the point of view of connectivity, the paper presents an outlier detectionalgorithm based on the k-th most similar neighbor. The proposed method takes theconnectivity of the k-th most similar neighbor as the outlier-ness score. Thisconnectivity just corresponds to the k-th smallest value of transitive closuredissimilarity. It is proved that the the k-th smallest value is also equal to theconnectivity of the k-th merged objects of a Prim minimum spanning tree (MST). Then,the outlier-ness of an arbitrary object is defined as the longest edge of the MST pathbetween it and its k-th most similar neighbor. Besides, outlier detection algorithm alsotakes into account density. So, the proposed algorithm is suited to arbitrary density andshaped data, and exhibit better performance in local outlier and outlying clusterdetection. Connectivity is also a noteworthy factor in clustering. This paper proposes apartitioning model based on connectivity and the corresponding clustering algorithm.This model ensures the minimization of a total loss function in polynomial time,namely to get the theoretically optimal partition. The most important is that theclustering algorithm can group data with arbitrary manifold. In particular, the more adataset satisfies the constraints--intra-cluster connectivity greater than or equal to theinter-cluster connectivity, the better the clustering result. In addition, dividingsub-clusters can greatly reduce the time complexity, and can maintain the theoreticallyoptimal partition in high probability. Especially in image segmentation, theoreticallyoptimal time of the clustering algorithm can be much lower than quadratic, andmaintain good segmentation results.
Keywords/Search Tags:Outlier detection, Clustering, Connectivity, k-th most similar neighbor, k-way partitioning model
PDF Full Text Request
Related items