Research And Application Of Parallel Data Mining Algorithms Based On MapReduce

Posted on:2016-05-05

Degree:Master

Type:Thesis

Country:China

Candidate:B L Sun

Full Text:PDF

GTID:2308330470478054

Subject:Computer technology

Abstract/Summary:

With the arrival of the big data era, data volumes keep growing and data formats are becoming more diverse, and traditional data mining technology faces a major challenge under this pressure. When the computing resources of a single machine are limited, cloud computing technology, represented by the Hadoop MapReduce parallel computing framework, can break through key resource constraints such as CPU and memory and effectively improve the capacity to process big data.

Many traditional data mining algorithms can only be applied to small-scale data. As data scale increases, they gradually expose performance bottlenecks such as running out of memory and low computational efficiency. Applying MapReduce to data mining, and studying the MapReduce parallelization of data mining algorithms for low-cost, high-performance distributed parallel mining, not only meets the needs of big data analysis but is also of great significance for the sustainable development of data mining.

This thesis first studies in depth the key Hadoop technologies, MapReduce and the Hadoop Distributed File System (HDFS), and puts forward a MapReduce-based parallelization model for data mining algorithms. Based on this model, it takes linear regression analysis and association rule analysis as its research objects. The main work can be summarized as follows:

(1) A MapReduce-based parallelization model for data mining algorithms is built.

(2) In regression analysis, to address the performance bottlenecks of the traditional linear regression algorithm and the locally weighted linear regression (LWLR) algorithm when processing large-scale data, an improved algorithm, KNN-LWLR, is proposed. A notable feature of the improved algorithm is that it can be parallelized. It is then parallelized on Hadoop, and the results of the relevant performance tests are analyzed.

(3) In association rule analysis, to address the performance bottlenecks of mining frequent itemsets from massive data, a parallel improvement strategy for the FP-Growth algorithm is proposed, and parallel frequent itemset mining is implemented on Hadoop. In the output stage, the results for each item are merged and the algorithm outputs only the first K frequent itemsets that contain the item, which increases the decision value of results mined from massive data.

(4) The improved parallel FP-Growth algorithm is applied to Web text mining to mine frequently associated terms in massive collections of Web documents. Multi-node tests on multiple data sets are carried out to analyze the performance of the parallel FP-Growth algorithm.
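The output stage described in item (3) can be illustrated with a small sketch. This is not the thesis's Hadoop implementation, which the abstract does not show; it is a plain-Python simulation of a map step and a reduce step. The function names (`map_itemset`, `reduce_first_k`, `first_k_per_item`) and the sample data are hypothetical, and the sketch assumes that the "first K" itemsets for an item are the K with the highest support, with ties broken alphabetically.

```python
from collections import defaultdict


def map_itemset(itemset, support):
    """Map step: emit one (item, (support, itemset)) pair per item in the itemset."""
    canonical = tuple(sorted(itemset))
    for item in canonical:
        yield item, (support, canonical)


def reduce_first_k(pairs, k):
    """Reduce step: group the pairs by item and keep k itemsets per item.

    Assumption: "first" means highest support, ties broken alphabetically.
    """
    grouped = defaultdict(list)
    for item, value in pairs:
        grouped[item].append(value)
    return {
        item: sorted(values, key=lambda v: (-v[0], v[1]))[:k]
        for item, values in grouped.items()
    }


def first_k_per_item(frequent_itemsets, k):
    """Run the map step over all frequent itemsets, then merge per item."""
    pairs = (
        pair
        for itemset, support in frequent_itemsets
        for pair in map_itemset(itemset, support)
    )
    return reduce_first_k(pairs, k)


if __name__ == "__main__":
    # Hypothetical frequent itemsets with their support counts.
    frequent = [
        ({"hadoop"}, 8),
        ({"mapreduce"}, 6),
        ({"hadoop", "mapreduce"}, 5),
        ({"hadoop", "hdfs"}, 4),
        ({"hdfs"}, 4),
    ]
    for item, itemsets in sorted(first_k_per_item(frequent, 2).items()):
        print(item, itemsets)
```

Keying the reduce step by item means each item's candidate itemsets can be merged independently, which is the property that would let a real MapReduce job spread this stage across reducers.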

Keywords/Search Tags:

Big Data, MapReduce, Data Mining, Linear Regression, Association Rules

Related items

1	Improved Linear Regression Forecast Algorithm Based On Association Rules
2	The Research And Application Of Data Mining In Mining Rules Of Medical Diagnosis
3	The Applied Research Of Data Mining On Calculator Audit
4	Research On Association Rules Mining In Data Streams And Its Application
5	Research On Data Mining Based Decision Rules And Association Rules
6	Association Rules Mining And Its Applications In Microarray Gene Expression Data
7	Data Mining Techniques And Algorithms For Mining Association Rules
8	Study On Associations Rules's Apriori Algorithm In Data Mining
9	The Research & Implement For Mining Association Rules Of Definite Semanteme
10	Research On Agricultural Product Consumption Early Warning Based On Association Rules Mining