Research And Application Of Data Mining Algorithms Using MapReduce

Posted on:2013-07-03

Degree:Master

Type:Thesis

Country:China

Candidate:L L Du

Full Text:PDF

GTID:2298330362467021

Subject:Computer application technology

Abstract/Summary:

The volume of data that data mining must handle keeps growing. How to extract valuable information from massive data quickly, simply, at low cost, and with high performance and good scalability, and then apply it in production, has become a pressing question. Serial algorithms take a long time to mine large-scale data and cannot mine ultra-large-scale data at all. Traditional parallel computing has made some progress in processing large-scale data for data mining, but for parallel tasks it has the following disadvantages: a low level of abstraction, difficult programming, constraints imposed by hardware or network bandwidth, limited processing power, and a dependence on high-performance computers, which raises cost.

To solve these problems, this thesis studies parallel data mining algorithms built on the MapReduce programming model, which offers a high level of abstraction, simplicity, high scalability, and local storage, and does not require high-performance computers, thereby improving the capability and efficiency of mining massive data. We present the idea of parallel data mining algorithms based on MapReduce, propose parallel Partial Least Squares (PLS) and parallel co-citation correlation algorithms based on MapReduce, and implement these algorithms on Hadoop. It has been shown that they achieve approximately linear speedup and good scalability. The parallel PLS has been used for online near-infrared quality monitoring in the production of TCM (Traditional Chinese Medicine), speeding up regression modeling of near-infrared spectroscopy data. The parallel co-citation correlation algorithm has been applied to product matching, improving the efficiency of similarity calculation over massive numbers of commodities. The main contents of this thesis are as follows:

1. A Hadoop speedup model based on MapReduce and three I/O load factor models are given. The thesis then proves and analyzes how they influence speedup. This work provides a theoretical basis for improving the speedup of parallel data mining algorithms using MapReduce.

2. In industrial settings, regression modeling on near-infrared spectral data suffers from large data volumes, slow modeling, and low efficiency. To solve these problems, a parallel Partial Least Squares algorithm using MapReduce is proposed. It comprises parallel data standardization and parallel computation of principal components. We ran the corresponding experiments on a Hadoop cluster built from ordinary computers. These experiments showed that parallel PLS can process massive near-infrared spectral data accurately, with approximately linear speedup and good scalability. The parallel PLS has been used for online near-infrared quality monitoring in TCM production, where it helped guarantee stable TCM quality.

3. In the e-commerce field, for massive product and sales data, a parallel product co-citation correlation algorithm using MapReduce is studied. Co-citation theory is introduced into e-commerce to measure the correlation between homogeneous or heterogeneous products. Citation-Co-citation is proposed to refine co-citation theory; it improves the accuracy of product correlation and provides a basis for product matching. A parallel co-citation construction method based on MapReduce is given and implemented on the Hadoop cloud computing platform. The experiments showed that this method can process massive product information quickly, achieves approximately linear speedup, is effective at prompting customers, and increases sales volume.
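The abstract does not spell out the co-citation computation, so the following is only a minimal single-process sketch of how pair counting is commonly expressed in the MapReduce style. It assumes that two products are "co-cited" when they appear in the same order; the function names (`map_order`, `reduce_counts`, `cocitation_counts`) and the toy data are hypothetical, and the thesis's Citation-Co-citation refinement is not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def map_order(order):
    """Map step: emit ((a, b), 1) for every unordered product pair in one order."""
    items = sorted(set(order))  # deduplicate and fix the orientation of each pair
    for a, b in combinations(items, 2):
        yield (a, b), 1


def reduce_counts(pairs):
    """Shuffle and reduce step: group emitted values by key and sum them."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def cocitation_counts(orders):
    """Apply the map step to every order, then reduce.

    This is a local stand-in for a Hadoop job, not a distributed implementation.
    """
    emitted = (kv for order in orders for kv in map_order(order))
    return reduce_counts(emitted)


# Hypothetical toy data: each inner list is one customer's order.
orders = [
    ["shirt", "tie", "belt"],
    ["shirt", "tie"],
    ["belt", "shoes"],
]
counts = cocitation_counts(orders)
```

Because each order is mapped independently and the reduce step only sums per-key counts, the work can be split across machines by order and merged by product pair, which is the property that makes approximately linear speedup plausible for this kind of job.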

Keywords/Search Tags:

MapReduce, Hadoop, Data Mining, Near-infrared Spectrum, Partial Least Squares, Co-citation Correlation, Speedup Model

Related items

1	Research Of Frequent Itemsets Mining Algorithm Based On MapReduce Calculation Model
2	Research Of Massive Data Processing And Mining In Database Marketing Based On Hadoop
3	MapReduce-based Graph Mining Research
4	The Research Of Clustering Mining Based On Logistics History Data On The Hadoop
5	Research On Parallel Data Mining Algorithms Based On Hadoop
6	Data Mining Based On Hadoop Platform
7	Research On Algorithm Of Data Mining Based On Hadoop
8	Research And Application Of Hadoop Distributed Clustering Mining Method Based On Virtual Machine
9	Research On Spatial Data Mining Based On Hadoop
10	Design And Implimention Of Data Mining And Migration System Based On Hadoop