
Research On Hybrid Classification Algorithm Based On Hadoop Platform

Posted on: 2016-06-01
Degree: Master
Type: Thesis
Country: China
Candidate: J D Qiu
Full Text: PDF
GTID: 2308330464466359
Subject: Computer Science and Technology
Abstract/Summary:
Classification algorithms play an important role in data processing in the field of data mining. Nevertheless, each traditional classification algorithm has its own defects. The ID3 Decision Tree has a simple structure, but it cannot handle data sets with real-valued attributes. The C4.5 Decision Tree fixes this problem of ID3; however, its construction algorithm needs to scan the data set several times, so it is unfit for large volumes of data. Although Naïve Bayes has a solid theoretical basis, it assumes that the attributes are independent of one another, which can lead to biased classification results. With the "Big Data" era approaching, the amount of data is increasing rapidly, and the efficiency of traditional classification algorithms decreases sharply. To address this issue, many scholars have proposed improvements to traditional classification algorithms, which fall into two groups: remedying the defects of the algorithms and parallelizing the traditional algorithms.
Firstly, by constructing an Adaptive Bayes Decision Tree (called A-BDT), this paper addresses the problems that traditional classification algorithms face when processing big data sets. Secondly, we combine A-BDT with the Hadoop platform to improve the efficiency of the algorithm. The specific research work includes:
(1) Building the A-BDT algorithm: To begin with, because the independence hypothesis of Naïve Bayes can lead to biased classification results, we modify the Naïve Bayes algorithm to build an Adaptive Bayes (A-Bayes) algorithm. A-Bayes reduces the effect that the independence hypothesis has on the classification result by adding a strong correlation hypothesis for the data sets and employing a modification factor in its formula.
Furthermore, the ID3 Decision Tree is combined with the A-Bayes algorithm to construct a hybrid classification algorithm, called A-BDT. In A-BDT, the A-Bayes algorithm preprocesses the data sets and fills in missing attribute values, while the ID3 Decision Tree performs the classification of the data sets; the combination is intended to offset the defects of both algorithms. In a serial context, we use A-Bayes, ID3, C4.5, Naïve Bayes and a Neural Network respectively to process the same data set. The experimental results demonstrate that A-Bayes achieves better precision and recall and runs in a shorter time than the other traditional classification algorithms.
(2) Hadoop parallelization of the A-BDT algorithm: When facing massive data sets, traditional classification algorithms can hardly produce classification results in a short time. In this paper we propose to combine A-BDT with the Hadoop platform, employing the MapReduce framework to divide and conquer tasks such as data preprocessing and data set classification. Big tasks that originally ran in a serial context are broken down into several sub-tasks. These sub-tasks are then processed by Map and their results are merged by Reduce, which greatly improves classification efficiency. The experimental data show that A-BDT achieves a satisfactory speed-up ratio on the Hadoop platform.
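The Map/Reduce division described in (2) can be illustrated with a minimal, single-machine sketch in Python. It is not the thesis implementation and does not use the Hadoop API. It only assumes that the per-class and per-attribute counts a Bayes-style classifier needs can be computed independently on each split of the data and then merged. The names map_counts, reduce_counts and run_job, and the toy records, are hypothetical.

```python
from collections import Counter
from functools import reduce


def map_counts(split):
    """Map step (hypothetical): count labels and attribute values in one split.

    Each record is a tuple whose last element is the class label.
    """
    counts = Counter()
    for *attrs, label in split:
        counts[("class", label)] += 1
        for index, value in enumerate(attrs):
            counts[("attr", index, value, label)] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step (hypothetical): merge two partial count tables."""
    return left + right


def run_job(records, n_splits):
    """Split the records, map each split, then reduce the partial results.

    On a real cluster the map calls would run in parallel on different
    nodes; here they run one after another to keep the sketch runnable.
    """
    splits = [records[i::n_splits] for i in range(n_splits)]
    partials = [map_counts(split) for split in splits]
    return reduce(reduce_counts, partials, Counter())


# Toy data, invented for illustration only: (outlook, temperature, label).
records = [
    ("sunny", "hot", "no"),
    ("sunny", "mild", "no"),
    ("rain", "mild", "yes"),
    ("rain", "cool", "yes"),
]
totals = run_job(records, n_splits=2)
```

Because the merge is a plain sum of counts, the result does not depend on how the records are split, which is the property that lets such a task be distributed across Map workers.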
Keywords/Search Tags: Big Data, A-BDT Algorithm, Hadoop Platform, MapReduce, Parallel Environment