Research And Application Of Data Mining Algorithms Using Mapreduce

Posted on: 2013-07-03
Degree: Master
Type: Thesis
Country: China
Candidate: L L Du
Full Text: PDF
GTID: 2298330362467021
Subject: Computer application technology
Abstract/Summary:
The amount of data that data mining must handle keeps growing. How to mine valuable information from massive data quickly, simply, and at low cost, with high performance and good scalability, and how to apply it in production, has become a pressing question. Serial algorithms take a long time to mine large-scale data and cannot handle ultra-large-scale data. Traditional parallel computing has improved the processing of large-scale data in data mining, but for parallel tasks it has the following disadvantages: a low level of abstraction, difficult programming, constraints imposed by hardware or network bandwidth, limited processing power, and a dependence on high-performance computers, which increases cost.

To solve these problems, this thesis studies parallel data mining algorithms built on the MapReduce programming model, which offers a high level of abstraction, simplicity, high scalability, and local storage, and does not require high-performance computers, in order to improve the capability and efficiency of mass data mining. We present the idea of parallel data mining algorithms based on MapReduce, propose parallel Partial Least Squares (PLS) and parallel Co-citation correlation algorithms based on MapReduce, and implement these algorithms in Hadoop. They have been shown to achieve basically linear speedup and good scalability. The parallel PLS has been used in online near-infrared quality monitoring in the production of TCM (Traditional Chinese Medicine), speeding up regression modeling of near-infrared spectroscopy data. The parallel Co-citation correlation algorithm has been applied to product matching, improving the efficiency of similarity calculation over massive numbers of commodities. The main contents of this thesis are as follows:

1. A Hadoop speedup model based on MapReduce and three I/O load factor models are given. The thesis then proves and studies how they influence speedup. This research provides a theoretical basis for improving the speedup of parallel data mining algorithms using MapReduce.

2. In industry, regression modeling on near-infrared spectrum data suffers from large data volumes, slow modeling, and low efficiency. To solve these problems, a parallel Partial Least Squares algorithm using MapReduce is proposed. It comprises parallel data standardization and parallel computation of principal components. We ran related experiments on a Hadoop cluster made up of ordinary computers. These experiments showed that parallel PLS can process massive near-infrared spectral data accurately, with basically linear speedup and good scalability. This parallel PLS has been used in online near-infrared quality monitoring in the production of TCM, and has guaranteed stable TCM quality.

3. In e-commerce, for massive product and sales data, a parallel product Co-citation correlation algorithm using MapReduce is studied. Co-citation theory is introduced to e-commerce to measure the correlation of homogeneous or heterogeneous products. Citation-Co-citation is proposed to optimize co-citation theory, which improves the correctness of product correlation and provides a basis for product matching. A parallel Co-citation creation method based on MapReduce is given and implemented on the Hadoop cloud computing platform. The experiments showed that this method can process massive product information quickly, achieves basically linear speedup, has the effect of reminding customers, and improves sales volume.
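As a rough illustration of how co-citation counting can be phrased as map and reduce steps, the following Python sketch simulates the two phases in memory. It is not the thesis's Hadoop implementation, and it does not reproduce the Citation-Co-citation optimization, whose details the abstract does not give. It assumes that an order (basket) plays the role of the citing document, so two products are co-cited once for every basket that contains both. All function names and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_basket(basket):
    """Map step: emit ((a, b), 1) for each unordered product pair in one basket."""
    items = sorted(set(basket))  # deduplicate and fix the order so (a, b) == (b, a)
    for pair in combinations(items, 2):
        yield pair, 1


def shuffle(mapped):
    """Group emitted values by key, as the MapReduce framework would between phases."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the partial counts for one product pair."""
    return key, sum(values)


def cocitation_counts(baskets):
    """Run map, shuffle, and reduce sequentially over all baskets."""
    mapped = (kv for basket in baskets for kv in map_basket(basket))
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


# Hypothetical sample data: four orders over three products.
baskets = [["tea", "cup", "kettle"], ["tea", "cup"], ["cup", "kettle"], ["tea"]]
counts = cocitation_counts(baskets)
```

Because each basket is mapped independently and each pair is reduced independently, both phases can in principle be spread across cluster nodes, which is the property the abstract relies on for its speedup claims.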
Keywords/Search Tags: MapReduce, Hadoop, Data Mining, Near-infrared Spectrum, Partial Least Squares, Co-citation Correlation, Speedup Model