Research And Application Of Parallel Data Mining Algorithms Based On MapReduce

Posted on: 2016-05-05; Degree: Master; Type: Thesis
Country: China; Candidate: B L Sun; Full Text: PDF
GTID: 2308330470478054; Subject: Computer technology
Abstract/Summary:
With the arrival of the big data era, data volumes keep growing and data formats are becoming more diverse, so traditional data mining technology faces a major challenge under this load. When the computing resources of a single machine are limited, cloud computing technology, represented by the Hadoop MapReduce parallel computing framework, can break through key resource constraints such as CPU and memory and effectively improve the capacity to process big data. Many traditional data mining algorithms can only be applied to small-scale data. As the data scale grows, they expose performance bottlenecks such as running out of memory and low computational efficiency. Applying MapReduce to data mining and studying the MapReduce parallelization of data mining algorithms makes low-cost, high-performance distributed parallel mining possible. This not only meets the needs of big data analysis but is also of great significance for the sustainable development of data mining.

This thesis first studies in depth the key Hadoop technologies MapReduce and the Hadoop Distributed File System (HDFS), and puts forward a MapReduce-based parallelization model for data mining algorithms. Following this model, it takes linear regression analysis and association rule analysis as its research objects. The main work of this thesis can be summarized as follows:

(1) Build a MapReduce-based parallelization model for data mining algorithms.

(2) In regression analysis, to address the performance bottlenecks of the traditional linear regression algorithm and the locally weighted linear regression algorithm when processing large-scale data, this thesis proposes an improved algorithm, the KNN-LWLR algorithm. A notable feature of the improved algorithm is that it can be parallelized. The parallelization is then implemented on Hadoop, and the relevant performance tests are analyzed.

(3) In association rule analysis, to address the performance bottlenecks of mining frequent itemsets from massive-scale data, this thesis proposes a parallel improvement strategy for the FP-Growth algorithm. Parallel mining of frequent itemsets is implemented on Hadoop. In the output stage, the results for each item are merged and the algorithm outputs only the first K frequent itemsets that include that item, which improves the decision value of the results mined from massive data.

(4) The improved parallel FP-Growth algorithm is applied to Web text mining to mine frequently associated terms in massive-scale Web documents. Multi-node tests on multiple data sets are carried out to analyze the performance of the parallel FP-Growth algorithm.
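
The abstract names the KNN-LWLR algorithm but does not define it. The sketch below assumes one plausible reading: for each query point, a locally weighted linear regression is fitted only on the k nearest training points. It is not the thesis's implementation. The function names (knn_lwlr_predict, map_predict), the one-dimensional features, the Gaussian kernel, and the parameters k and tau are all assumptions made for illustration. Under this reading each query is handled independently, which may be why the abstract says the algorithm can be parallelized; the plain Python loop in map_predict stands in for a Hadoop map phase.

```python
import math


def knn_lwlr_predict(x_query, xs, ys, k=5, tau=1.0):
    """Predict y at x_query from the k nearest training points (1-D features).

    Hypothetical helper: an assumed reading of "KNN-LWLR", not the thesis code.
    """
    # Step 1 (KNN): keep only the k training points closest to the query.
    neighbours = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x_query))[:k]

    # Step 2 (LWLR): accumulate Gaussian-weighted sums for a weighted line fit.
    sw = swx = swy = swxx = swxy = 0.0
    for x, y in neighbours:
        w = math.exp(-((x - x_query) ** 2) / (2.0 * tau ** 2))
        sw += w
        swx += w * x
        swy += w * y
        swxx += w * x * x
        swxy += w * x * y

    if sw == 0.0:
        raise ValueError("all kernel weights underflowed; increase tau")

    # Solve the 2x2 weighted normal equations for intercept and slope.
    denom = sw * swxx - swx * swx
    if abs(denom) < 1e-12:
        # Degenerate neighbourhood (e.g. identical x values): weighted mean.
        return swy / sw
    slope = (sw * swxy - swx * swy) / denom
    intercept = (swy - slope * swx) / sw
    return intercept + slope * x_query


def map_predict(queries, xs, ys, k=5, tau=1.0):
    """Stand-in for a map phase: every query is predicted independently."""
    return [knn_lwlr_predict(q, xs, ys, k, tau) for q in queries]


# Toy data lying exactly on the line y = 2x + 1.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
predictions = map_predict([2.5, 7.25], xs, ys, k=4)
```

Restricting the fit to k neighbours means each prediction touches only a small slice of the training data instead of weighting every sample, which is the kind of saving the abstract attributes to the improved algorithm. How the thesis distributes the neighbour search across Hadoop nodes is not stated in the abstract.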
Keywords/Search Tags: Big Data, MapReduce, Data Mining, Linear Regression, Association Rules