Study On The Algorithms Of Clustering And Outlier Detection Based On Neighborhood

Posted on: 2018-02-07; Degree: Doctor; Type: Dissertation
Country: China; Candidate: J Y Lu; Full Text: PDF
GTID: 1368330563951024; Subject: Computer Science and Technology
Abstract/Summary:
With the continuous innovation of network information technology, collecting data has become very convenient, and the analysis of data is becoming more and more important. Data mining has therefore become a research hotspot in many fields. Clustering analysis is one of the main tasks of data mining and a focus of the field. Its purpose is to place highly similar data objects in the same group and to assign dissimilar data objects to different groups as far as possible. Cluster analysis covers four problems: obtaining the correct number of clusters, designing a similarity measure between data objects, designing an efficient clustering algorithm, and designing an evaluation function for the clustering result. In general, the number of clusters is affected by the complex distribution structure of the dataset, overlapping samples, noise and other factors. In particular, clusterings of the same data obtained from different fields and perspectives often differ. In practical applications, the similarity measure is affected by missing feature values, categorical features and high dimensionality. At present, it is very challenging to develop scalable and efficient clustering algorithms for large-scale, high-dimensional datasets. Evaluating a clustering result requires considering the number of clusters, the sample size, the shape of the clusters, the compactness within each class, and the separation between classes. Aiming at datasets with complex distribution structure, this dissertation uses neighborhood techniques to study outlier detection, efficient clustering, and the determination of the number of clusters in a dataset. The main innovations of this dissertation are as follows.

(1) An outlier detection algorithm based on the reverse k nearest neighbor is proposed. The algorithm combines the advantages of density-based and distance-based methods to detect outliers. Existing neighborhood techniques are compared and analyzed, namely k nearest neighbor, reverse k nearest neighbor, mutual k nearest neighbor, shared k nearest neighbor and natural nearest neighbor. The distribution and stability of the reverse k nearest neighbor counts of a dataset are analyzed experimentally. The proposed algorithm computes the reverse k nearest neighbor count of each data object and uses this count to estimate the neighborhood density of the object. To further reflect how far a data object lies from the main body of the data, the k nearest neighbor distance is calculated for data objects with the same neighborhood density; the larger this distance, the more likely the object is to be an outlier. The experimental results show that the proposed outlier detection algorithm can effectively find both global and local outliers.

(2) A clustering algorithm based on neighborhood density partitioning is proposed. The algorithm consists of four steps. Firstly, the neighborhood density of the data objects is estimated, and the dataset is divided into a core dataset and a non-core dataset according to a density threshold. Secondly, the core dataset is initially clustered using the minimum spanning tree clustering algorithm. Thirdly, the data objects in the core dataset are prioritized according to the density and compactness of their neighborhoods. Finally, the nearest neighbor algorithm is used to assign the data objects in the non-core dataset to the initial clusters in priority order. The experimental results show that the clustering method based on neighborhood density partitioning can eliminate the effects of noise and of overlap between classes, and that it can identify clusters of different shapes.

(3) A heuristic clustering algorithm based on neighborhood importance is proposed. Firstly, the method constructs a k-neighborhood graph, generates the transition probability matrix from this graph, iterates the matrix under a random walk model, and obtains the eigenvector at convergence, which reflects the importance of the neighborhood of each data object. Secondly, the number of important data objects is determined from the k nearest neighbor distance graph, the correct number of clusters is found using heuristic rules based on the important data objects, and the initial clustering of the dataset is obtained. Finally, the unimportant data objects are assigned to the initial clusters. The experimental results show that the neighborhood importance ranking algorithm can find important data objects, that the heuristic rules can recover the correct number of clusters and the initial clustering, and that the clustering algorithm achieves a better clustering effect.
Keywords/Search Tags: Data Mining, Clustering Analysis, Nearest Neighbor, Heuristic Rule, Outlier Detection
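
The outlier ranking in contribution (1) can be illustrated with a minimal sketch. It is not the dissertation's implementation. It assumes Euclidean distance, a brute-force neighbor search, and the hypothetical function names `knn_indices` and `rank_outliers`. Objects are ordered by ascending reverse k nearest neighbor count (a low count is read as low neighborhood density), and ties in the count are broken by the distance to the k-th nearest neighbor, larger first.

```python
import math

def knn_indices(points, k):
    """For each point, return the indices of its k nearest neighbors,
    nearest first (Euclidean distance, brute force)."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append([j for _, j in dists[:k]])
    return neighbors

def rank_outliers(points, k):
    """Return point indices ordered from most to least outlying.

    Assumption for illustration: a lower reverse-kNN count means a lower
    neighborhood density; among equal counts, a larger k-th neighbor
    distance ranks first.
    """
    knn = knn_indices(points, k)
    # Reverse kNN count: how many other points list i among their k nearest.
    rknn_count = [0] * len(points)
    for nbrs in knn:
        for j in nbrs:
            rknn_count[j] += 1
    # Distance from each point to its k-th nearest neighbor.
    kdist = [math.dist(points[i], points[knn[i][-1]]) for i in range(len(points))]
    return sorted(range(len(points)), key=lambda i: (rknn_count[i], -kdist[i]))

if __name__ == "__main__":
    # Five clustered points and one isolated point (index 5).
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
    print(rank_outliers(pts, k=2))
```

In this toy dataset the isolated point and one cluster corner both have a reverse k nearest neighbor count of zero, so the k-th neighbor distance is what separates them, which is the role the abstract gives to the distance term.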
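
The neighborhood importance step in contribution (3) can be sketched in the same hedged way. The abstract does not give the exact transition matrix or convergence scheme, so this sketch makes several assumptions: each object links to its k nearest neighbors with uniform probability 1/k, the stationary vector is found by power iteration, and a PageRank-style `damping` term is added so that the iteration is guaranteed to converge. The names `knn_graph` and `neighborhood_importance` are hypothetical.

```python
import math

def knn_graph(points, k):
    """Directed k-neighborhood graph: node i points to its k nearest neighbors."""
    graph = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        graph.append([j for _, j in dists[:k]])
    return graph

def neighborhood_importance(points, k, damping=0.85, tol=1e-10, max_iter=1000):
    """Approximate the stationary distribution of a random walk on the kNN graph.

    Assumptions (not taken from the source): uniform 1/k transition
    probabilities and a damping term that mixes in a uniform jump.
    """
    n = len(points)
    graph = knn_graph(points, k)
    score = [1.0 / n] * n
    for _ in range(max_iter):
        new_score = [(1.0 - damping) / n] * n
        for i in range(n):
            share = damping * score[i] / k  # mass sent along each out-edge
            for j in graph[i]:
                new_score[j] += share
        delta = sum(abs(a - b) for a, b in zip(new_score, score))
        score = new_score
        if delta < tol:
            break
    return score

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
    for idx, s in enumerate(neighborhood_importance(pts, k=2)):
        print(idx, round(s, 4))
```

Objects that many others select as a near neighbor accumulate probability mass and score high, while an isolated object receives only the uniform jump term. Selecting the number of important objects and applying the heuristic rules are separate steps that this sketch does not cover.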