Research On Parallel Data Mining Based On Hadoop

Posted on: 2018-04-27

Degree: Master

Type: Thesis

Country: China

Candidate: T S Hao

GTID: 2348330536979930

Subject: Computer technology

Abstract/Summary:

With the advent of the "Internet +" era, the amount of data generated is growing exponentially and includes many kinds of unstructured data. Finding meaningful patterns and rules in massive, changing data has become a key problem in science and other fields. Data mining integrates statistics, databases, machine learning, artificial intelligence and other fields, but most traditional data mining methods and their improved variants run on a single machine. Because the processing capacity and memory of a single machine are limited, these methods are not suitable for large-scale data mining. As a result, Hadoop-based parallel data mining has become a new research hotspot.

The Apriori algorithm mines frequent itemsets by layer-by-layer iteration, using a serial self-join followed by pruning. Its drawbacks are that it scans the database repeatedly and generates a large number of candidate sets, which makes it inefficient. The parallel Apriori algorithm based on MapReduce solves the problem of the traditional algorithm scanning the database multiple times, but it still generates candidate sets through a serial self-join of the frequent itemsets and produces a large amount of intermediate candidate-set data.

This thesis focuses on improving the efficiency of frequent itemset mining with the MapReduce-based Apriori algorithm. It parallelizes the join step and proposes the C_Apriori algorithm for mining frequent itemsets. The algorithm obtains the candidate set Ck+1 from the frequent k-itemsets in parallel through Map and Reduce processes, so that the whole process of generating frequent itemsets is parallelized and the number of candidate sets in each iteration is reduced. Based on a time complexity analysis, the thesis shows that C_Apriori can greatly reduce the time consumed by the join step when processing large-scale data. Finally, the C_Apriori algorithm was tested in a Hadoop parallel data mining system designed using HBase. The results show that the improved algorithm is more efficient and achieves good speedup on large data sets and with small support thresholds. The program has been successfully applied in an intelligent community system.
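
To make the join step concrete, the sketch below shows one common way to express Apriori candidate generation as a map phase and a reduce phase. It is an illustration only and is not taken from the thesis: the function names (map_prefix, reduce_join, has_infrequent_subset, generate_candidates) and the prefix-keyed grouping are assumptions, the shuffle is simulated with an in-memory dictionary instead of Hadoop, and the actual C_Apriori design may differ.

```python
from collections import defaultdict
from itertools import combinations


def map_prefix(itemset):
    """Map step (hypothetical): key a sorted k-itemset by its first k-1 items."""
    items = tuple(sorted(itemset))
    return items[:-1], items[-1]


def reduce_join(prefix, last_items):
    """Reduce step (hypothetical): join every pair of itemsets sharing a prefix."""
    for a, b in combinations(sorted(last_items), 2):
        yield prefix + (a, b)


def has_infrequent_subset(candidate, frequent):
    """Apriori pruning: every k-subset of a (k+1)-candidate must be frequent."""
    k = len(candidate) - 1
    return any(sub not in frequent for sub in combinations(candidate, k))


def generate_candidates(frequent_k):
    """Build the candidate set C(k+1) from the frequent k-itemsets."""
    frequent = {tuple(sorted(s)) for s in frequent_k}

    # Simulated shuffle: group the map outputs by prefix key.
    groups = defaultdict(list)
    for itemset in frequent:
        prefix, last = map_prefix(itemset)
        groups[prefix].append(last)

    # Each group is independent, so a real job could reduce them in parallel.
    candidates = set()
    for prefix, lasts in groups.items():
        for cand in reduce_join(prefix, lasts):
            if not has_infrequent_subset(cand, frequent):
                candidates.add(cand)
    return candidates


# Frequent 2-itemsets from a toy example.
L2 = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
C3 = generate_candidates(L2)
print(C3)
```

For the toy input, the join produces (A, B, C) and (B, C, D); the second is pruned because its subset (C, D) is not frequent, so only (A, B, C) remains. Keying by prefix means each reducer sees only the itemsets it can actually join, which is one way the join work can be spread across nodes.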

Keywords/Search Tags:

Hadoop, Data mining, Association rules, Apriori, HBase

Related items

1	Research On The Apriori Algorithms For Meteorological Data Association Rules Analysis Based On Cloud Computing
2	The Research Of Quantitative Association Rules Data Mining Based On Hadoop
3	Mining Association Rules Algorithm Analysis Based On Hadoop
4	Research On Association Rules Algorithm Based On Hadoop
5	The Study On The Recommending Methods For Online Travel Websites Association Rules
6	Research On Parallel Acceleration Algorithm Of Association Rules Based On Hadoop
7	Research On Association Rules Mining Methods Of Mass Engineering Data Based On Hadoop
8	Research On A Parallel Data Mining Algorithm Apriori
9	Research Of Parallelized Distributed Association Rules Mining Algorithm Based On Hadoop
10	The Research And Implementation Of Parallel Association Rules Algorithm Based On Cloud Environment Data Mining