Research On Clustering Algorithms In Data Mining

Posted on: 2015-02-22
Degree: Master
Type: Thesis
Country: China
Candidate: K Pei
Full Text: PDF
GTID: 2298330467463514
Subject: Signal and Information Processing
Abstract/Summary:
Data mining is one of the most active branches of database research and one of the most promising technologies in computer science. It arose from the need to mine useful knowledge from massive amounts of data. Data mining is the process of extracting hidden and potentially useful patterns and rules from large data sets. It draws on statistics, machine learning, neural networks, pattern recognition, information retrieval, artificial intelligence, visualization and many other subjects, and brings together a variety of data analysis techniques.

Cluster analysis is an important area of data mining research. It is an unsupervised learning process: without prior knowledge, clustering divides data into multiple classes according to certain rules and discovers hidden patterns. The basic clustering algorithms can be divided roughly into several kinds, including partitioning methods, hierarchical methods, density-based methods and grid-based methods. Cluster analysis has a wide range of applications in e-commerce, market analysis, document classification, biology and many other fields.

This thesis analyzes and discusses clustering techniques in data mining. First, we briefly introduce the concept of data mining and its common techniques. Then, following the classification of clustering algorithms, we systematically introduce each kind of clustering algorithm and its typical representatives. Next, we give a detailed analysis of k-means, a classical and widely used clustering algorithm, covering its process, its defects and some ideas for improvement. We introduce canopy k-means, a hybrid clustering method that uses canopy clustering to find the initial cluster centers for the traditional k-means algorithm, and conduct experiments to test its performance. After that, we briefly introduce the Hadoop distributed platform and propose a parallel strategy for the canopy and k-means algorithms.

Finally, we present a parallel clustering algorithm for community mining in social networks, a widely used real-world application, and test its performance. Experimental results show that the proposed algorithm is considerably more efficient than both the traditional k-means algorithm and canopy k-means.
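The abstract does not give the thesis's implementation, so the following is only a minimal single-machine sketch of the general canopy k-means idea it describes: a cheap canopy pass chooses the initial centers, and ordinary k-means then refines them. The function names (`canopy_seeds`, `kmeans`), the thresholds `t1` (loose) and `t2` (tight), the use of Euclidean distance and the choice of each canopy's mean as the seed are all assumptions made for illustration, not details taken from the thesis.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def canopy_seeds(points, t1, t2):
    """Return one seed per canopy (assumed thresholds, with t1 > t2).

    Each canopy is started from the first remaining candidate. Points
    within t1 of that candidate form the canopy; points within t2 are
    removed from the candidate list so they cannot start a new canopy.
    """
    if t1 <= t2:
        raise ValueError("t1 (loose) must be larger than t2 (tight)")
    candidates = list(points)
    seeds = []
    while candidates:
        center = candidates.pop(0)
        canopy = [p for p in points if euclidean(p, center) <= t1]
        seeds.append(mean(canopy))
        candidates = [p for p in candidates if euclidean(p, center) > t2]
    return seeds


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [
        min(range(len(centers)), key=lambda i: euclidean(p, centers[i]))
        for p in points
    ]


def kmeans(points, centers, max_iter=100):
    """Plain k-means (Lloyd iterations) starting from the given centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == i]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(mean(members) if members else old)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    seeds = canopy_seeds(data, t1=5.0, t2=3.0)
    centers, labels = kmeans(data, seeds)
    print(len(seeds), labels)
```

In this sketch the number of clusters is not supplied by the user; it falls out of the canopy pass and therefore depends on the chosen thresholds. A Hadoop version of the kind the abstract mentions would distribute both passes across map and reduce tasks, which this sketch does not attempt.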
Keywords/Search Tags: Data Mining, Clustering, k-means, Hadoop, Social Network