Research On Parallel Data Mining Based On Hadoop

Posted on: 2018-04-27
Degree: Master
Type: Thesis
Country: China
Candidate: T S Hao
Full Text: PDF
GTID: 2348330536979930
Subject: Computer technology
Abstract/Summary:
With the advent of the "Internet +" era, the amount of data generated is growing exponentially and includes many kinds of unstructured data. The challenge is to find meaningful patterns and rules in massive, changing data in order to solve problems in science and other fields. Data mining integrates statistics, databases, machine learning, artificial intelligence and other fields, but most traditional data mining methods and their improved variants run on a single machine. Because a single machine has limited processing capacity and insufficient memory, these methods are not suitable for large-scale data mining. Hadoop-based parallel data mining has therefore become a new research hotspot.

The Apriori algorithm mines frequent itemsets by layer-by-layer iteration, using a serial self-join followed by pruning. Its drawbacks are that it scans the database repeatedly and generates a large number of candidate sets, which makes it inefficient. The parallel Apriori algorithm based on MapReduce solves the problem of scanning the database multiple times, but it still generates its candidate sets through a serial self-join of the frequent itemsets, and it produces a large amount of intermediate candidate-set data.

This thesis focuses on improving the efficiency of frequent itemset mining with the MapReduce-based Apriori algorithm. It parallelizes the join step and proposes the C_Apriori algorithm for mining frequent itemsets. The algorithm obtains the candidate set Ck+1 from the frequent k-itemsets in parallel through the Map and Reduce processes, so that the whole process of generating frequent itemsets is parallelized and the number of candidate sets produced during iteration is reduced. Based on a time complexity analysis, the thesis shows that C_Apriori can greatly reduce the time spent in the join step when processing large-scale data. Finally, C_Apriori was tested in a Hadoop parallel data mining system designed with HBase. The results show that the improved algorithm is more efficient and achieves excellent speedup on large data sets and at smaller support thresholds. The program has been successfully applied in an intelligent community system.
Keywords/Search Tags:Hadoop, Data mining, Association rules, Apriori, HBase
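
The abstract describes generating the candidate set Ck+1 from the frequent k-itemsets through Map and Reduce steps, but it does not give the details of C_Apriori. The sketch below is therefore not the thesis's algorithm. It is one common way to express the Apriori join-and-prune step in a map/shuffle/reduce shape, simulated in memory with plain Python rather than with the Hadoop API. All function names (`map_phase`, `shuffle`, `reduce_phase`, `generate_candidates`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(frequent_k):
    """Map: key each frequent k-itemset by its first k-1 items.

    Itemsets are sorted tuples, so two itemsets can be joined
    exactly when they share the same (k-1)-item prefix.
    """
    for itemset in frequent_k:
        yield itemset[:-1], itemset[-1]


def shuffle(pairs):
    """Shuffle: group the emitted last items by prefix key."""
    groups = defaultdict(list)
    for prefix, last in pairs:
        groups[prefix].append(last)
    return groups


def reduce_phase(prefix, last_items, frequent_k):
    """Reduce: join itemsets sharing a prefix, then prune.

    A (k+1)-candidate is kept only if every one of its k-item
    subsets is frequent (the Apriori property).
    """
    for a, b in combinations(sorted(set(last_items)), 2):
        candidate = prefix + (a, b)
        k = len(candidate) - 1
        if all(sub in frequent_k for sub in combinations(candidate, k)):
            yield candidate


def generate_candidates(frequent_itemsets):
    """Run map, shuffle and reduce in memory; each reduce call is
    independent, which is what would let a cluster run them in parallel."""
    frequent_k = {tuple(sorted(s)) for s in frequent_itemsets}
    groups = shuffle(map_phase(frequent_k))
    candidates = []
    for prefix, last_items in groups.items():
        candidates.extend(reduce_phase(prefix, last_items, frequent_k))
    return sorted(candidates)


if __name__ == "__main__":
    frequent_2 = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")]
    print(generate_candidates(frequent_2))  # [('A', 'B', 'C')]
```

In this example, the prefix `("A",)` groups the last items B, C and D, so the reducer forms ABC, ABD and ACD. Only ABC survives pruning, because BD and CD are not frequent. Grouping by prefix means no reducer needs to see the full list of frequent itemsets to do the join. The pruning check still needs a lookup of frequent k-itemsets, which is passed in directly here.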