Research On Rapid Mining Algorithm For Massive Data

Posted on: 2013-05-18
Degree: Master
Type: Thesis
Country: China
Candidate: X F Zhu
Full Text: PDF
GTID: 2248330377455256
Subject: Computer software and theory
Abstract/Summary:
Data mining is the process of extracting implicit, previously unknown, but potentially useful information and knowledge from large volumes of incomplete, noisy, fuzzy, and random data. With the rapid development of IT, people have accumulated hundreds of terabytes of data and more, and extracting useful information from such volumes is a problem that must be addressed. For massive data mining, distributed parallel processing and incremental processing are effective solutions.

Cloud computing is an emerging computational model built on shared infrastructure. It specializes in large-scale data and large-scale computation and is an extension of distributed computing; parallel and distributed processing is the key to it. In this thesis, we combine cloud computing with the incremental mining of association rules as the starting point and put forward new ideas for the rapid mining of massive data.

This thesis describes the definition, functions, steps, and challenges of data mining and analyzes association rule mining algorithms. It also describes the concept, features, forms, and key technologies of cloud computing, with a focus on the Hadoop Distributed File System (HDFS) and the principles behind the MapReduce parallel programming model of Hadoop, a typical cloud computing platform. The research focuses on parallel mining of large frequent itemsets in association rule mining. We propose a rapid incremental association rule mining algorithm based on cloud computing, which we name C-FUP. To improve the efficiency of parallelization, we improve the dataset allocation method of HDFS and design a method named DAMBNP, which allocates the dataset according to the computing performance of the heterogeneous nodes in the cluster.
By analyzing the performance of Hadoop, we find that its capacity for processing a large number of small files is insufficient, so we design a method to solve this problem.

In addition, we design experiments to test the proposed algorithm and methods. The experimental results show that the C-FUP algorithm performs well in incremental association rule mining of massive data and has good scalability and expansibility, and that DAMBNP can effectively improve the efficiency of the C-FUP algorithm on the cloud computing platform. This work is a useful contribution to the rapid mining of massive data.
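
The abstract does not say how DAMBNP measures node performance or how it divides the dataset, so the following is only a minimal sketch of the general idea of performance-proportional allocation, not the thesis's method. It assumes each node has a positive numeric performance score and that the dataset is counted in equal-sized blocks. The names `allocate_blocks` and `node_scores` are hypothetical, and the largest-remainder rounding is an illustrative choice. No HDFS API is used.

```python
from typing import Dict


def allocate_blocks(total_blocks: int, node_scores: Dict[str, float]) -> Dict[str, int]:
    """Split total_blocks across nodes in proportion to their performance scores.

    Hypothetical helper: the scoring scheme and rounding rule are assumptions,
    not details taken from the thesis.
    """
    if total_blocks < 0:
        raise ValueError("total_blocks must be non-negative")
    if not node_scores or any(score <= 0 for score in node_scores.values()):
        raise ValueError("every node needs a positive performance score")

    total_score = sum(node_scores.values())

    # Ideal fractional share of blocks for each node.
    ideal = {node: total_blocks * score / total_score
             for node, score in node_scores.items()}

    # Start from the integer part of each share.
    alloc = {node: int(share) for node, share in ideal.items()}
    leftover = total_blocks - sum(alloc.values())

    # Largest-remainder rounding: give the leftover blocks to the nodes
    # whose ideal share was cut the most by truncation.
    by_remainder = sorted(ideal, key=lambda node: ideal[node] - alloc[node],
                          reverse=True)
    for node in by_remainder[:leftover]:
        alloc[node] += 1

    return alloc


if __name__ == "__main__":
    # A node three times as fast receives three times as many blocks.
    print(allocate_blocks(100, {"fast": 3.0, "slow": 1.0}))
```

Under these assumptions, a faster node receives a proportionally larger share of the data, so all nodes would be expected to finish their part of a parallel job at roughly the same time.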
Keywords/Search Tags: Massive Data, Incremental Mining of Association Rules, Cloud Computing