Research On Hybrid Classification Algorithm Based On Hadoop Platform

Posted on:2016-06-01

Degree:Master

Type:Thesis

Country:China

Candidate:J D Qiu

Full Text:PDF

GTID:2308330464466359

Subject:Computer Science and Technology

Abstract/Summary:

Classification algorithms play an important role in data processing in the field of data mining. Nevertheless, each traditional classification algorithm has its own defects. The ID3 decision tree has a simple structure, but it cannot handle data sets with real-valued attributes. The C4.5 decision tree addresses this problem in ID3; however, its construction algorithm needs to scan the data set several times, so it is unfit for large volumes of data. Although Naïve Bayes has a solid theoretical basis, it assumes that the attributes of a data set are independent of one another, which can lead to biased classification results. With the approach of the “Big Data” era, the amount of data is increasing rapidly, and the efficiency of traditional classification algorithms decreases sharply. To address this issue, many scholars have proposed methods to improve traditional classification algorithms, which fall into two categories: remedying algorithm defects and parallelizing traditional algorithms.

First, by constructing an Adaptive Bayes Decision Tree (A-BDT), this paper addresses the problems that traditional classification algorithms face when processing big data sets. Second, we combine A-BDT with the Hadoop platform to improve the algorithm's efficiency. The specific research work includes:

(1) Building the A-BDT algorithm: To begin with, we modify the Naïve Bayes algorithm, whose independence assumption can lead to biased classification results, to build an Adaptive Bayes (A-Bayes) algorithm. A-Bayes reduces the effect of the independence assumption on the classification result by adding a strong-correlation hypothesis for the data sets and employing a modification factor in its formula. Furthermore, the ID3 decision tree is combined with the A-Bayes algorithm to construct a hybrid classification algorithm, called A-BDT, in which A-Bayes preprocesses the data sets and fills in missing attribute values, and the ID3 decision tree performs the classification of the data sets, so that the combination compensates for the defects of both algorithms. In a serial context, we use A-Bayes, ID3, C4.5, Naïve Bayes and a neural network to process the same data set. The experimental results demonstrate that A-Bayes achieves better precision and recall and runs in a shorter time than the other traditional classification algorithms.

(2) Hadoop parallelization of the A-BDT algorithm: When facing massive data sets, traditional classification algorithms can hardly produce classification results in a short time. In this paper we propose to combine A-BDT with the Hadoop platform, employing the MapReduce framework to divide and conquer tasks such as preprocessing data and classifying data sets, breaking down big tasks that originally ran serially into several sub-tasks. These sub-tasks are then processed by Map, and their results are merged by Reduce, which greatly improves classification efficiency. The experimental data show that A-BDT achieves a satisfactory speed-up ratio on the Hadoop platform.
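The divide-and-conquer idea described in (2) can be illustrated with a minimal sketch. It is not the thesis's implementation and does not use the Hadoop API: it is plain Python that imitates the Map and Reduce steps in a single process. The toy records and the function names (`split`, `map_counts`, `reduce_counts`, `parallel_style_counts`) are assumptions made for illustration. The sketch gathers the class and attribute-value frequencies that both Bayes-style and decision-tree classifiers rely on.

```python
from collections import Counter
from functools import reduce

# Hypothetical toy records: attribute values followed by a class label.
RECORDS = [
    ("sunny", "hot", "no"),
    ("sunny", "mild", "no"),
    ("rainy", "mild", "yes"),
    ("rainy", "cool", "yes"),
    ("overcast", "hot", "yes"),
]

def split(records, n_parts):
    """Partition the data set into n_parts sub-tasks (round-robin)."""
    return [records[i::n_parts] for i in range(n_parts)]

def map_counts(part):
    """Map step: count class labels and (attribute, value, label) triples in one split."""
    counts = Counter()
    for *attrs, label in part:
        counts[("class", label)] += 1
        for idx, value in enumerate(attrs):
            counts[(idx, value, label)] += 1
    return counts

def reduce_counts(left, right):
    """Reduce step: merge two partial count tables by summing their entries."""
    return left + right

def parallel_style_counts(records, n_parts=2):
    # On a cluster each split would be handled by a separate mapper;
    # here the splits are processed one after another for illustration.
    partials = [map_counts(part) for part in split(records, n_parts)]
    return reduce(reduce_counts, partials, Counter())

totals = parallel_style_counts(RECORDS, n_parts=2)
```

Because the partial counts are merged by simple addition, the merged table is the same regardless of how the records are split, which is what makes this kind of statistic straightforward to parallelize.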

Keywords/Search Tags:

Big Data, A-BDT Algorithm, Hadoop Platform, MapReduce, Parallel Environment

Related items

1	Research On Parallelization Of Clustering Algorithm Based On Heterogeneous Hadoop Platform
2	Research On Parallel Association Rule Mining Algorithm Based On Hadoop Platform
3	The Research And Implementation Of The CCD Apparent Radiance Calibration Algorithm Based On Hadoop Platform
4	Research On Parallel Decision Tree Algorithm Based On Hadoop Platform
5	Research On Association Rules Mining Methods Of Mass Engineering Data Based On Hadoop
6	Research On Parallel Clustering Algorithm Based On Hadoop Cloud Computing Platform
7	Research On Parallel Association Rules Algorithm Based On HADOOP Platform
8	Parallel Algorithm For Multiple Longest Common Subsequence And Application Research On Hadoop Platform
9	English On Design And Implementation Of Network Data Parallel Processing System Based On Hadoop Platform
10	Research On Mining Taxi Pick-up Hotspots Area Based On Big Data Hadoop Platform