
Study On Non-parametric Clustering Based On Natural Nearest Neighborhood

Posted on: 2015-03-10  Degree: Master  Type: Thesis
Country: China  Candidate: J L Huang  Full Text: PDF
GTID: 2268330422472528  Subject: Computer software and theory
Abstract/Summary:
Data mining is the procedure of searching for patterns, rules, and regularities in large volumes of data. In other words, data mining is the process of extracting comprehensible, previously unknown knowledge of potential value from massive amounts of irregular data. Its primary tasks include regression, association rule learning, classification, clustering, and outlier detection. Among these tasks, clustering is a particularly important technique. Clustering is an effective way to explore and understand the relationships among objects: it is regarded not only as a data mining tool for obtaining the distributional information of the data in a database, but also as a preprocessing step for other data mining algorithms. Clustering is an unsupervised pattern recognition method and a significant field of pattern recognition; it discovers the cluster structure of the input data. Clustering has been applied to many data analysis applications, such as computer vision, statistical analysis, image processing, medical information processing, the biological sciences, the social sciences, and psychology. It is also used in business areas such as business management, market analysis, and engineering design. Clusters are collections of objects whose intra-class similarity is high and whose inter-class similarity is low.

The concept of the nearest neighborhood was first proposed in 1951 and has since received wide attention and study. It is widely applied in pattern recognition, machine learning, and data mining. The most famous and fundamental nearest neighborhood concepts are the K-Nearest Neighborhood (K-NN) and the ε-Nearest Neighborhood (ε-NN), as presented by Stevens. At present, many clustering and outlier detection methods rely on K-NN or ε-NN, and many well-known data mining methods were built on them, such as the K-NN classification algorithm and the LOF and INFLO outlier detection algorithms. However, as research on K-NN and ε-NN has deepened, their shortcomings have become apparent: how many neighbors should each object have for a given, unknown dataset? What value of K can reflect the structural features of an unknown dataset? Especially in the current age of big data, with data growing explosively, datasets become ever more complex and unpredictable, and the value of K becomes harder and harder to set when applying K-NN based data mining methods. The application of ε-NN suffers from the same problem: the value of ε often seriously influences the final mining result. Moreover, once ε is fixed, objects in dense regions have more neighbors than objects in sparse regions. For both K-NN and ε-NN, the way neighborhoods are found depends on a parameter set by a person rather than on the characteristics of the dataset; this is the fundamental cause of the above problems.

This thesis introduces the Natural Nearest Neighborhood (3N), a new nearest neighborhood concept, to solve the problems faced by K-NN and ε-NN, and further improves the concept and the search method of 3N after analyzing the original 3N search procedure. The Natural Nearest Neighborhood was presented by Dr. Zou et al. in 2011. The biggest difference between this neighborhood concept and K-NN or ε-NN is that it is scale-free and requires no user-supplied parameter.
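To make the parameter dependence discussed above concrete, the following is a minimal Python sketch of the two classical neighborhood queries. The function names and the toy data are illustrative only and do not come from the thesis; the point is simply that both queries hinge on a user-chosen value (k or ε).

```python
# Minimal sketch contrasting K-NN and eps-NN neighborhood queries.
# Both depend on a user-supplied parameter (k or eps), which is the
# drawback the abstract discusses. Toy data and names are illustrative.
import numpy as np

def knn_neighborhood(data, i, k):
    """Indices of the k nearest neighbors of point i (excluding i itself)."""
    dists = np.linalg.norm(data - data[i], axis=1)
    order = np.argsort(dists)
    return order[1:k + 1]            # order[0] is point i itself

def eps_neighborhood(data, i, eps):
    """Indices of all points within distance eps of point i (excluding i)."""
    dists = np.linalg.norm(data - data[i], axis=1)
    idx = np.where(dists <= eps)[0]
    return idx[idx != i]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(100, 2))
    print(knn_neighborhood(data, 0, k=5))       # always exactly 5 neighbors
    print(eps_neighborhood(data, 0, eps=0.3))   # neighbor count varies with density
```

Note how the ε-query returns different neighborhood sizes in dense and sparse regions, while the K-query forces every point to have exactly k neighbors regardless of the local structure.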
With 3N, natural nearest neighborhoods can be searched without any parameters. 3N obtains the nearest neighborhood of each data object through continuous, adaptive learning on the given dataset, so its ability to reflect the distribution and structural features of the data is better than that of K-NN and ε-NN. The reason 3N is a scale-free nearest neighborhood concept is that the neighborhood size of each data object is unequal under 3N, whereas under K-NN every object has the same neighborhood size. Objects in high-density regions have more natural nearest neighbors than objects in low-density regions, as determined by the distribution and structural features of the given dataset.

In response to the strengths, weaknesses, applicability, and parameter-selection problems of K-NN and ε-NN based data mining algorithms, this thesis proposes a non-parametric clustering algorithm based on the natural nearest neighborhood. First, the natural nearest neighborhood of each data object is found through the improved 3N search method, and the natural feature value (supk) is obtained at the same time. Then, the Minimum Neighborhood Graph (MNG) corresponding to the dataset is constructed according to supk and the natural nearest neighborhood of each data object. Finally, the MNG is used to cluster the given dataset. Our experiments and performance analysis demonstrate that the non-parametric clustering algorithm based on 3N can not only cluster a dataset without any parameter, but also achieves better clustering results than other representative clustering algorithms; furthermore, its range of applicability is wider than theirs. Consequently, the clustering algorithm proposed in this thesis solves the parameter-selection problem very well.
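The following Python sketch illustrates one plausible reading of the adaptive 3N search and the graph-based clustering step outlined above. It is a rough approximation inferred from this summary, not the thesis' exact procedure: the stopping rule, the identification of supk with the final search round, and the use of connected components of the mutual-neighbor graph as the final clusters are all assumptions.

```python
# Rough sketch of an adaptive natural-nearest-neighbor (3N) search and a
# mutual-neighbor-graph clustering step, inferred from the abstract.
# Stopping rule, supk, and graph construction are assumptions, not the
# thesis' actual algorithm.
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import connected_components

def natural_neighbor_search(data):
    """Grow the search round r until every point has been chosen as
    someone's r-th nearest neighbor at least once.

    Returns the final round (used here as supk) and, for each point, the
    set of points with which it shares a mutual (natural) neighbor relation.
    """
    n = len(data)
    dists = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)           # order[i, 0] is i itself
    reverse_count = np.zeros(n, dtype=int)      # times i was picked as a neighbor
    knn = [set() for _ in range(n)]
    r = 0
    while reverse_count.min() == 0 and r < n - 1:
        r += 1
        for i in range(n):
            j = order[i, r]                     # i's r-th nearest neighbor
            knn[i].add(j)
            reverse_count[j] += 1
    natural = [{j for j in knn[i] if i in knn[j]} for i in range(n)]
    return r, natural                           # r plays the role of supk

def mng_clustering(data):
    """Cluster by the connected components of the mutual-neighbor graph."""
    n = len(data)
    supk, natural = natural_neighbor_search(data)
    graph = lil_matrix((n, n))
    for i in range(n):
        for j in natural[i]:
            graph[i, j] = 1
    _, labels = connected_components(graph, directed=False)
    return supk, labels
```

In this reading, points in dense regions accumulate many mutual neighbors before the loop terminates, while points in sparse regions accumulate few, which matches the scale-free behaviour described above; the thesis' actual MNG construction and clustering step may differ.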
Keywords/Search Tags: Data mining, clustering analysis, K-nearest neighborhood, ε-nearest neighborhood, natural nearest neighborhood