Research and Implementation of Large-Scale Data Clustering Technology

Posted on: 2010-03-09
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Qian
Full Text: PDF
GTID: 2208360275982920
Subject: Computer application technology
Abstract/Summary:
With the rapid development of data collection and storage technology, and especially the widespread use of the World Wide Web as a global information system, many application fields are now deluged with data. Data mining, which combines methods from databases, machine learning, statistics, and artificial intelligence, is used to extract useful information and knowledge from these vast amounts of data.

Cluster analysis, an important part of data mining and one of its essential tools, is widely used in fast-growing Internet applications: clustering search results in search engines, grouping people with similar interests in online communities, discovering hot news on news websites, and aggregating related items in question-answering systems. Motivated by these practical Internet applications, this thesis discusses clustering techniques for processing large data sets. Specifically, the following work has been done in this study.

1) Clustering techniques for large data sets are summarized. Based on a thorough study of clustering algorithms, the thesis summarizes the methods that can be applied effectively to large data sets: sequential processing, partitioning, sampling, data summarization, and parallel and distributed computation.

2) A hierarchical clustering algorithm based on MPI parallel computing is proposed and implemented. After a brief introduction to parallel computing and the MPI parallel programming standard, the thesis describes in detail a single-node, serial agglomerative hierarchical clustering algorithm for clustering news web pages, and then proposes a revised parallel version. Experiments demonstrate the correctness and effectiveness of the parallel version.

3) The implementation details and execution steps of canopy-kmeans clustering on the Hadoop platform are studied and presented. First, Google's MapReduce framework and the Hadoop parallel platform are introduced. Then the canopy-kmeans algorithm is implemented in Hadoop.

4) An extensible clustering system is designed and implemented. The design philosophy, framework, flow of execution, sub-module implementation, and key data structures of the system are presented in detail.

The main contributions and innovations of this study are listed below:

1) The thesis summarizes clustering techniques for processing large data sets, based on a thorough study of many clustering methods.

2) Two clustering methods based on parallel and distributed computation are demonstrated. A new serial hierarchical agglomerative clustering method is proposed and then parallelized with MPI, and canopy-kmeans clustering on the Hadoop platform is described clearly.

3) A clustering system is designed. The system offers open interfaces, high extensibility, and low coupling. It provides flexible scheduling and integration of clustering algorithms to satisfy different kinds of requirements, and it introduces a novel configuration method based on JSON, a lightweight alternative to XML.
Keywords/Search Tags:Clustering, Hierarchical Clustering, K-Means, Parallel and Distributed Computing, Large Data Sets
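
The abstract names canopy-kmeans without describing its steps, so the following is only a minimal single-process sketch of the general technique, not the thesis's Hadoop implementation. It assumes the common formulation in which a cheap canopy pass with a loose threshold `t1` and a tight threshold `t2` (with `t1 > t2`) selects initial centers, which then seed an ordinary k-means refinement. The function names, the Euclidean distance, and the threshold values are all illustrative assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t1, t2, distance=euclidean):
    """Group points into (possibly overlapping) canopies.

    t1 is the loose threshold (canopy membership) and t2 the tight
    threshold (points this close to a center stop being candidates).
    Returns a list of (center, members) pairs.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)          # inside the loose canopy
            if d >= t2:
                still_remaining.append(p)  # may still start its own canopy
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


def assign(points, centers, distance=euclidean):
    """Assign every point to its nearest center; returns one list per center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: distance(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, initial_centers, max_iterations=10, distance=euclidean):
    """Plain k-means seeded with the given centers (here: canopy centers)."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iterations):
        clusters = assign(points, centers, distance)
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                dim = len(old)
                new_centers.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                new_centers.append(old)    # keep an empty cluster's center unchanged
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers, distance)


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
canopies = canopy_centers(points, t1=5.0, t2=3.0)
centers, clusters = kmeans(points, [c for c, _ in canopies])
```

In a MapReduce setting, each phase (canopy selection, point assignment, center update) would typically be expressed as map and reduce steps over partitions of the data; how the thesis structures those jobs is not stated in the abstract.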