
Study On Hierarchical Clustering Based On Natural Neighbor

Posted on: 2017-06-05
Degree: Master
Type: Thesis
Country: China
Candidate: D. D. Cheng
Full Text: PDF
GTID: 2348330503465883
Subject: Computer software and theory
Abstract/Summary:
Data mining is the process of discovering potentially valuable information in large data sets. Its primary tasks include regression, association rule learning, classification, clustering, and outlier detection. Among these, clustering is an important branch of data mining: it partitions a data set into clusters so that objects within a cluster are similar to each other and dissimilar to objects in other clusters. Clustering analysis is not only a data mining tool used to explore the distribution of the data, but also a preprocessing step for other data mining algorithms such as characterization, feature subset selection, and classification. As an unsupervised pattern recognition method, clustering analysis has been applied in computer science (for example, computer vision, image processing, pattern recognition, and machine learning), statistical analysis, the social sciences, and business.

Many different clustering algorithms have been proposed. Among them are hierarchical clustering methods, which are simple and effective at solving practical problems with a hierarchical structure. Chameleon is a representative algorithm: it first constructs a K-Nearest Neighbor (K-NN) graph, then divides the graph into subgraphs that serve as initial subclusters, and finally merges those subclusters. Chameleon can discover clusters with arbitrary shapes, but the value of K used to construct the K-NN graph must be given, and the minimum bisection size and the threshold of the similarity function must also be selected.

In this paper, we introduce a new nearest-neighbor concept, the Natural Neighbor (NaN), and apply it to hierarchical clustering. Its biggest difference from K-NN and ε-NN is that it involves no neighborhood parameter: natural neighbors can be searched without specifying any parameters.
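The parameter-free search described above can be sketched as follows. This is an illustrative reconstruction, not the thesis's exact implementation: the round counter `r` grows until every object has at least one reverse neighbor, or the number of objects without one stabilizes; the resulting neighborhood size `lam` and the function name `natural_neighbors` are assumptions for illustration, and distances are computed by brute force.

```python
import numpy as np

def natural_neighbors(X):
    """Adaptively grow the neighborhood size until every point has a
    reverse neighbor (or the lonely-point count stabilizes); return the
    reached size lam and each point's set of mutual (natural) neighbors."""
    n = len(X)
    # brute-force pairwise distances; column 0 of argsort is the point itself
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    order = np.argsort(d, axis=1)[:, 1:]
    reverse = [set() for _ in range(n)]
    r, prev_lonely = 0, -1
    while r < n - 1:                        # adaptive rounds, no K supplied
        for i in range(n):
            reverse[order[i, r]].add(i)     # i points to its (r+1)-th NN
        lonely = sum(1 for s in reverse if not s)
        if lonely == 0 or lonely == prev_lonely:
            break                           # all covered, or count stabilized
        prev_lonely, r = lonely, r + 1
    lam = r + 1                             # adaptive neighborhood size
    knn = [set(order[i, :lam]) for i in range(n)]
    # natural neighbors of i = points in its lam-NN whose lam-NN contain i
    nan = [{j for j in knn[i] if i in knn[j]} for i in range(n)]
    return lam, nan
```

On two well-separated groups of points, the search stops as soon as every point has been chosen by some other point, and the resulting neighbor relation is symmetric by construction.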
Natural Neighbor obtains each object's neighbors by continuously and adaptively learning from the given data set, so it reflects the distribution and structure of the data better than K-NN and ε-NN. Objects in high-density areas have more natural neighbors than objects in low-density areas.

In this paper, we introduce the natural neighbor into hierarchical clustering and propose a new hierarchical clustering algorithm, Hi-CLUBS. First, it uses natural neighbors to construct a Saturated Neighborhood Graph (SNG), and a new modularity-based graph partitioning algorithm partitions the SNG into initial subclusters. Then the initial subclusters are merged according to the similarity between subclusters. Experiments show that Hi-CLUBS reduces the dependency on parameters and outperforms other methods at discovering clusters with arbitrary shapes.

A new noise-removal-based hierarchical algorithm, HCBNR, is also proposed to handle noise in a data set. HCBNR first removes noise points with a natural-neighbor-based, density-adaptive noise removal method, and then clusters the remaining data with Hi-CLUBS. Experiments comparing HCBNR with DBSCAN and other algorithms show that HCBNR distinguishes noise points quickly and correctly, and accurately discovers the clusters in the data set.
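The density-adaptive noise-removal step can be sketched in the same spirit. The abstract does not spell out HCBNR's exact criterion, so the version below is a hedged illustration: it runs the adaptive natural-neighbor search and flags objects whose natural-neighbor count falls well below the data-set average; the function name `flag_noise` and the 0.5 threshold factor are assumptions, not the thesis's rule.

```python
import numpy as np

def flag_noise(X, factor=0.5):
    """Flag points whose natural-neighbor count falls well below average.
    Illustrative criterion; the thesis's exact rule is not given here."""
    n = len(X)
    # brute-force pairwise distances; column 0 of argsort is the point itself
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    order = np.argsort(d, axis=1)[:, 1:]
    reverse = [set() for _ in range(n)]
    r, prev = 0, -1
    while r < n - 1:                        # adaptive rounds, no K supplied
        for i in range(n):
            reverse[order[i, r]].add(i)     # i points to its (r+1)-th NN
        lonely = sum(1 for s in reverse if not s)
        if lonely == 0 or lonely == prev:   # all covered, or count stabilized
            break
        prev, r = lonely, r + 1
    lam = r + 1                             # adaptive neighborhood size
    knn = [set(order[i, :lam]) for i in range(n)]
    counts = np.array([sum(1 for j in knn[i] if i in knn[j])
                       for i in range(n)])
    return counts < factor * counts.mean()  # True = treated as noise
```

Because the neighborhood size adapts to the data, an isolated outlier ends up with no mutual neighbors and is flagged, while points inside dense clusters keep counts near the average; no density threshold has to be supplied by the user.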
Keywords/Search Tags:Data mining, Clustering Analysis, Natural Neighbor, Hierarchical Clustering, Noise Removal