
Multi-Density Clustering And Outlier Recognition Algorithm Based On Grid Adjacency Relation

Posted on: 2011-08-27
Degree: Master
Type: Thesis
Country: China
Candidate: G X Li
GTID: 2178360305961073
Subject: Computer technology
Abstract/Summary:
Cluster analysis and outlier recognition are important branches of data mining. As these techniques find wider application in scientific research, market analysis, the life sciences, and many other disciplines, their importance is increasingly evident. By studying the adjacency relations between grid units in data space, the thesis proposes a novel clustering and outlier recognition method based on those relations. The research work is as follows.

Based on an analysis of the relation between grid division and the projection diversity of uniformly distributed data, the thesis presents a theorem relating grid division to data projection diversity, and a diversity-based grid division method. The method handles the fractional remainder that arises when the grid division does not come out to an integer. It is simple and practical because it takes the data distribution into account and reduces redundant grid units. To determine the relationship between adjacent units, a diversity function is defined on the distance between centers of mass and on relative density.

Outliers are data objects that deviate from the rest of the data. The thesis presents an outlier recognition algorithm based on grid adjacency relation (GAO), which works according to whether the density of a candidate outlier unit is high or low relative to its neighborhood. Outliers and outlier units are determined by their degree of deviation, which is measured by the relative density and the distance between centers of mass of units. The experimental results show that the algorithm recognizes outliers effectively in multi-density and large data sets, and that it is more efficient than the Cell-based algorithm.

The thesis also proposes a multi-density clustering algorithm based on grid adjacency relation (GAMD), which uses the data distribution characteristics within units as reflected by unit density and center of mass. To determine unit boundaries, the algorithm measures the similarity between units by their relative density and the relative distance between their centers of mass. Clustering and outlier recognition are carried out simultaneously. A goodness-of-fit measure is proposed for evaluating clustering validity. The experimental results show that the algorithm clusters data sets of arbitrary shape and multiple densities effectively. The clustering results do not depend on the order in which data are input or units are processed.
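
The abstract does not give the diversity function or the merging rule, so the following is only a minimal sketch of the general idea: grid the data, summarize each unit by its point count and center of mass, and merge adjacent units that look similar. Everything specific here is an assumption rather than the thesis's definition: the fixed `cell_size`, the function names, the particular `diversity` formula (relative density gap plus center-of-mass distance divided by cell size), and the single `threshold`.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def build_grid(points, cell_size):
    """Assign each point to a uniform grid cell; return {cell_index: [points]}."""
    cells = defaultdict(list)
    for p in points:
        idx = tuple(floor(x / cell_size) for x in p)
        cells[idx].append(p)
    return dict(cells)


def centroid(pts):
    """Center of mass of a non-empty list of equal-weight points."""
    n = len(pts)
    return tuple(sum(coords) / n for coords in zip(*pts))


def adjacent(idx):
    """Yield indices of all cells sharing a face, edge, or corner with idx."""
    for offset in product((-1, 0, 1), repeat=len(idx)):
        if any(offset):
            yield tuple(i + o for i, o in zip(idx, offset))


def diversity(cells, a, b, cell_size):
    """Assumed (not the thesis's) diversity between two non-empty cells:
    relative density gap plus centroid distance in units of cell_size."""
    na, nb = len(cells[a]), len(cells[b])
    rel_density = abs(na - nb) / max(na, nb)
    rel_distance = dist(centroid(cells[a]), centroid(cells[b])) / cell_size
    return rel_density + rel_distance


def merge_cells(cells, cell_size, threshold):
    """Group adjacent non-empty cells whose diversity is below threshold."""
    parent = {c: c for c in cells}

    def find(c):
        # Union-find lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a in cells:
        for b in adjacent(a):
            if b in cells and diversity(cells, a, b, cell_size) < threshold:
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for c in cells:
        groups[find(c)].append(c)
    return list(groups.values())
```

Because the merge criterion is symmetric and the grouping is a union over all qualifying adjacent pairs, the resulting partition in this sketch does not depend on the order of points or cells, which mirrors the order-independence the abstract claims. Groups containing very few points could be treated as outlier candidates, but that is a simplification and not GAO's deviation criterion.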
Keywords/Search Tags:Clustering analysis, Grid division, Adjacent cells, Diversity function, Outlier, Goodness of fit