The Research Of Data Mining Based On Hadoop Platform

Posted on: 2015-09-29
Degree: Master
Type: Thesis
Country: China
Candidate: S S Fei
GTID: 2298330467463753
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid growth of information and data, traditional database systems struggle to meet the needs of big data. The emergence of cloud computing provides an opportunity for mining massive data sets: because of its fast storage capacity and strong computing power, data mining has entered a new era. The Hadoop framework is the most widely used and best developed cloud platform, with the advantages of low cost, reliability, strong scalability, fault tolerance, and high efficiency. The key technologies of Hadoop are the Hadoop Distributed File System (HDFS) and the MapReduce parallel programming model, which provide massive data storage and parallel computing, respectively. The key to solving the problem of massive data mining is to apply traditional data mining techniques and algorithms to the Hadoop platform for parallel processing.

This thesis analyzes the demand for big data mining and designs a data mining system on the Hadoop platform, based on an in-depth study of the Hadoop framework, data mining techniques, and the SPRINT decision tree algorithm. For the algorithm layer of the system, we implement and improve the SPRINT algorithm so that it can mine large data sets in parallel. We then port the improved SPRINT algorithm to the Hadoop framework using HDFS and MapReduce. The improved system eliminates duplicated and unnecessary calculations, which reduces the amount of computation and improves system efficiency; it sorts the continuous and discrete attribute tables, which reduces the split time for discrete attributes; and it introduces a new data structure that meets the needs of MapReduce programming and parallelizes better. Finally, we use MySQL to construct several large data sets and test the efficiency of the system. Experimental results show that the improved system significantly reduces data processing time, and therefore that system efficiency is improved.
Keywords/Search Tags: cloud computing, Hadoop, data mining, SPRINT
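
The abstract does not give the details of the improved algorithm, so the following is only a minimal sketch of the split evaluation in the standard SPRINT algorithm that the thesis starts from, not the author's improved version. It scans one pre-sorted attribute list for a continuous attribute and picks the threshold with the lowest weighted Gini index. The function names `gini` and `best_split` and the sample records are illustrative assumptions. In a MapReduce setting, each attribute list could plausibly be scanned by a separate map task, but that mapping is likewise an assumption here.

```python
from collections import Counter


def gini(counts, total):
    """Gini index of a partition given its per-class record counts."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_split(attribute_list):
    """Find the best binary split for one continuous attribute.

    attribute_list: list of (value, class_label) pairs, already sorted
    by value (SPRINT sorts each attribute list once, up front).
    Returns (threshold, weighted_gini); threshold is None if no split exists.
    """
    n = len(attribute_list)
    below = Counter()                                   # class counts at or below the candidate
    above = Counter(label for _, label in attribute_list)  # class counts above it
    best = (None, float("inf"))

    for i in range(n - 1):
        value, label = attribute_list[i]
        below[label] += 1
        above[label] -= 1
        next_value = attribute_list[i + 1][0]
        if value == next_value:
            continue  # no threshold can separate equal values
        n_below = i + 1
        n_above = n - n_below
        score = (n_below / n) * gini(below, n_below) \
              + (n_above / n) * gini(above, n_above)
        if score < best[1]:
            # Candidate threshold: midpoint between adjacent distinct values
            best = ((value + next_value) / 2, score)
    return best


# Hypothetical sample: (age, risk class)
records = sorted([(23, "high"), (17, "high"), (43, "high"),
                  (68, "low"), (32, "low"), (20, "high")])
threshold, score = best_split(records)
```

Because the list is sorted once and the class counts are updated incrementally, each candidate threshold costs constant work per class instead of a full rescan of the records.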