
Study On Hierarchical Clustering Based On Natural Neighbor

Posted on: 2017-06-05
Degree: Master
Type: Thesis
Country: China
Candidate: D. D. Cheng
Full Text: PDF
GTID: 2348330503465883
Subject: Computer software and theory
Abstract/Summary:
Data mining is the process of discovering potentially valuable information in large data sets. Its primary tasks include regression, association rule learning, classification, clustering, and outlier detection. Among these, clustering is an important branch of data mining: it partitions a data set into clusters so that objects within a cluster are similar to each other and dissimilar to objects in other clusters. Clustering analysis is not only a data mining tool used to explore the distribution of the data, but also a preprocessing step for other data mining algorithms such as characterization, feature subset selection, and classification. As an unsupervised pattern recognition method, clustering analysis has been applied in computer science (for example, computer vision, image processing, pattern recognition, and machine learning), statistical analysis, the social sciences, and business.

Many different clustering algorithms have been proposed. Among them are hierarchical clustering methods, which are simple and effective at solving practical problems with a hierarchical structure. Chameleon is a representative algorithm: it first constructs a K-Nearest Neighbor (K-NN) graph, then divides the graph into subgraphs that serve as initial subclusters, and finally merges those subclusters. Chameleon can discover clusters with arbitrary shapes, but the value of K used to construct the K-NN graph must be given, and the minimum bisection size and the threshold of the similarity function must also be selected.

In this paper, we introduce a new nearest-neighbor concept, the Natural Neighbor (NaN), and apply it to hierarchical clustering. Its biggest difference from K-NN and ε-NN is that it involves no neighborhood parameter: natural neighbors can be searched without specifying any parameters.
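The parameter-free search described above can be sketched as follows. This is an illustrative reconstruction, not the thesis's exact implementation: the round counter `r` grows until every object has at least one reverse neighbor, or the number of objects without one stabilizes; the resulting neighborhood size `lam` and the function name `natural_neighbors` are assumptions for illustration, and distances are computed by brute force.

```python
import numpy as np

def natural_neighbors(X):
    """Adaptively grow the neighborhood size until every point has a
    reverse neighbor (or the lonely-point count stabilizes); return the
    reached size lam and each point's set of mutual (natural) neighbors."""
    n = len(X)
    # brute-force pairwise distances; column 0 of argsort is the point itself
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    order = np.argsort(d, axis=1)[:, 1:]
    reverse = [set() for _ in range(n)]
    r, prev_lonely = 0, -1
    while r < n - 1:                        # adaptive rounds, no K supplied
        for i in range(n):
            reverse[order[i, r]].add(i)     # i points to its (r+1)-th NN
        lonely = sum(1 for s in reverse if not s)
        if lonely == 0 or lonely == prev_lonely:
            break                           # all covered, or count stabilized
        prev_lonely, r = lonely, r + 1
    lam = r + 1                             # adaptive neighborhood size
    knn = [set(order[i, :lam]) for i in range(n)]
    # natural neighbors of i = points in its lam-NN whose lam-NN contain i
    nan = [{j for j in knn[i] if i in knn[j]} for i in range(n)]
    return lam, nan
```

On two well-separated groups of points, the search stops as soon as every point has been chosen by some other point, and the resulting neighbor relation is symmetric by construction.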
Natural Neighbor obtains each object's neighbors by continuously and adaptively learning from the given data set, so it reflects the distribution and structure of the data better than K-NN and ε-NN. Objects in high-density areas have more natural neighbors than objects in low-density areas.

In this paper, we introduce the natural neighbor into hierarchical clustering and propose a new hierarchical clustering algorithm, Hi-CLUBS. First, it uses natural neighbors to construct a Saturated Neighborhood Graph (SNG), and a new modularity-based graph partitioning algorithm partitions the SNG into initial subclusters. Then the initial subclusters are merged according to the similarity between subclusters. Experiments show that Hi-CLUBS reduces the dependency on parameters and outperforms other methods at discovering clusters with arbitrary shapes.

A new noise-removal-based hierarchical algorithm, HCBNR, is also proposed to handle noise in a data set. HCBNR first removes noise points with a natural-neighbor-based, density-adaptive noise removal method, and then clusters the remaining data with Hi-CLUBS. Experiments comparing HCBNR with DBSCAN and other algorithms show that HCBNR distinguishes noise points quickly and correctly, and accurately discovers the clusters in the data set.
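The density-adaptive noise-removal step can be sketched in the same spirit. The abstract does not spell out HCBNR's exact criterion, so the version below is a hedged illustration: it runs the adaptive natural-neighbor search and flags objects whose natural-neighbor count falls well below the data-set average; the function name `flag_noise` and the 0.5 threshold factor are assumptions, not the thesis's rule.

```python
import numpy as np

def flag_noise(X, factor=0.5):
    """Flag points whose natural-neighbor count falls well below average.
    Illustrative criterion; the thesis's exact rule is not given here."""
    n = len(X)
    # brute-force pairwise distances; column 0 of argsort is the point itself
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    order = np.argsort(d, axis=1)[:, 1:]
    reverse = [set() for _ in range(n)]
    r, prev = 0, -1
    while r < n - 1:                        # adaptive rounds, no K supplied
        for i in range(n):
            reverse[order[i, r]].add(i)     # i points to its (r+1)-th NN
        lonely = sum(1 for s in reverse if not s)
        if lonely == 0 or lonely == prev:   # all covered, or count stabilized
            break
        prev, r = lonely, r + 1
    lam = r + 1                             # adaptive neighborhood size
    knn = [set(order[i, :lam]) for i in range(n)]
    counts = np.array([sum(1 for j in knn[i] if i in knn[j])
                       for i in range(n)])
    return counts < factor * counts.mean()  # True = treated as noise
```

Because the neighborhood size adapts to the data, an isolated outlier ends up with no mutual neighbors and is flagged, while points inside dense clusters keep counts near the average; no density threshold has to be supplied by the user.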
Keywords/Search Tags:Data mining, Clustering Analysis, Natural Neighbor, Hierarchical Clustering, Noise Removal