Research On Classification Algorithm Based On Massive Data Mining

Posted on:2016-04-08

Degree:Master

Type:Thesis

Country:China

Candidate:J W Tu

Full Text:PDF

GTID:2278330479955442

Subject:Computer application technology

Abstract/Summary:

Classification, one of the most active branches of data mining, is widely used in pattern recognition, image recognition, machine learning, and other fields. It also has a wide range of practical applications in everyday life and industry, such as medical image recognition and spam filtering. However, with the advent of the big data era, data is being produced and accumulated rapidly, and classification tasks over terabytes or even petabytes of data are becoming common. Although massive data makes the data model more complete, it also brings more redundancy and noise and sharply increases the execution time of classification tasks. In this context, accuracy is no longer the sole focus of attention; how to improve an algorithm's efficiency without reducing its existing classification accuracy has become a new focus for algorithm researchers.

Hadoop is an open-source distributed system that draws on the ideas of Google's distributed file system (GFS) and the MapReduce parallel computing framework. It has allowed cluster-based parallel computing to develop and spread rapidly in data processing, and it has set off a wave of migrating data mining algorithms from single-node environments to clusters for parallel execution.

First, this thesis exploits the fact that KNN classifies a test sample using only partial (local) information about it, and uses a clustering algorithm to prune the training samples that are irrelevant to the test sample. This effectively reduces the computational overhead of KNN and improves its efficiency and performance.
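The pruning idea can be illustrated with a minimal single-node sketch. The abstract does not say which clustering algorithm is used or how many clusters are searched per test sample, so everything here is an assumption made for illustration: plain k-means, the hypothetical function names `kmeans` and `knn_pruned`, and the hypothetical parameter `n_probe` (how many of the nearest clusters to search).

```python
import math
import random
from collections import Counter


def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, centroids):
    """Group point indices by the index of their nearest centroid."""
    groups = [[] for _ in centroids]
    for idx, p in enumerate(points):
        nearest = min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
        groups[nearest].append(idx)
    return groups


def kmeans(points, n_clusters, iters=20, seed=0):
    """Plain k-means (an assumed stand-in for the thesis's clustering step)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, n_clusters)
    dim = len(points[0])
    for _ in range(iters):
        groups = assign(points, centroids)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(points[i][d] for i in members) / len(members)
                    for d in range(dim)
                )
    return centroids, assign(points, centroids)


def knn_pruned(query, points, labels, centroids, groups, k=3, n_probe=1):
    """Classify `query` using only training points in its nearest cluster(s).

    Returns the majority label and the number of candidates actually scanned.
    """
    order = sorted(
        (c for c in range(len(centroids)) if groups[c]),
        key=lambda c: dist(query, centroids[c]),
    )
    candidates = [i for c in order[:n_probe] for i in groups[c]]
    nearest = sorted(candidates, key=lambda i: dist(query, points[i]))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0], len(candidates)
```

Clustering is done once, ahead of time; each query then computes distances to the cluster centroids and to the points of the selected clusters only, instead of to the whole training set. Raising `n_probe` trades speed for a lower risk of missing true neighbors that sit just across a cluster boundary.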
Then, with the aid of the MapReduce parallel computing model, a parallel cluster-based KNN classifier is designed and implemented, successfully migrated to a Hadoop cluster, and subjected to a range of performance tests. Second, by decomposing the workflow of the Bayesian classification algorithm into tasks, a parallel Naive Bayes classifier based on the MapReduce model is implemented. In practice, however, data discretization turned out to be the performance bottleneck of the parallelized Naive Bayes classifier. To break this bottleneck, an entropy-based data discretization algorithm is also designed and implemented on the MapReduce model, which makes the parallelized Naive Bayes classifier more efficient when classifying massive data. Experiments show that both the parallelized cluster-based KNN classification algorithm and the parallelized Bayesian classification algorithm (using the parallel discretization method) deliver better performance and good scalability, and to some extent meet the performance requirements of massive data classification.
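As a rough illustration of entropy-based discretization in a map/reduce style, the sketch below counts (value, class) pairs per data split, merges the counts, and picks the single cut point that minimizes the weighted class entropy of the two resulting intervals. This is not the thesis's algorithm, whose details the abstract does not give: the function names (`map_counts`, `reduce_counts`, `best_cut`) are hypothetical, the map and reduce steps are ordinary Python functions rather than Hadoop API calls, and only one binary cut is found, whereas a full discretizer would typically apply the step recursively.

```python
import math
from collections import Counter, defaultdict
from itertools import chain


def map_counts(partition):
    """Map step: emit ((value, label), 1) for each record in one data split."""
    return [((value, label), 1) for value, label in partition]


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per (value, label) key."""
    totals = Counter()
    for key, n in pairs:
        totals[key] += n
    return totals


def entropy(class_counts):
    """Shannon entropy (in bits) of a class-count table."""
    total = sum(class_counts.values())
    return sum(
        -(n / total) * math.log2(n / total) for n in class_counts.values() if n
    )


def best_cut(totals):
    """Return (cut, weighted_entropy) for the best binary split.

    The cut is None if the attribute has fewer than two distinct values.
    """
    by_value = defaultdict(Counter)
    overall = Counter()
    for (value, label), n in totals.items():
        by_value[value][label] += n
        overall[label] += n
    total = sum(overall.values())
    values = sorted(by_value)

    left = Counter()
    best_cut_point, best_entropy = None, float("inf")
    for lo, hi in zip(values, values[1:]):
        left.update(by_value[lo])      # everything <= lo is now on the left
        right = overall - left         # remaining records are on the right
        n_left, n_right = sum(left.values()), sum(right.values())
        weighted = (n_left * entropy(left) + n_right * entropy(right)) / total
        if weighted < best_entropy:
            best_cut_point, best_entropy = (lo + hi) / 2, weighted
    return best_cut_point, best_entropy
```

A usage example on two hypothetical splits of one numeric attribute:

```python
splits = [
    [(1.0, "n"), (2.0, "n")],
    [(8.0, "y"), (9.0, "y"), (3.0, "n")],
]
totals = reduce_counts(chain.from_iterable(map_counts(s) for s in splits))
cut, h = best_cut(totals)
```

Because the reduce step only sums counts, each split can be processed independently, which is the property that makes the counting phase a natural fit for MapReduce.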

Keywords/Search Tags:

Classification, MapReduce, Parallelization, KNN, Naïve Bayes, discrete data

Related items

1	The Research Of MapReduce Implementing Of Text Classification Algorithm Based On Mass Data
2	Research And Application On Naive Bayes Classification Algorithm
3	Research And Development Of Harassing Telephone Identification And Interception Technology
4	Research On Algorithms For Naive Bayes Classification And Its Tools Based On Hadoop
5	Research On Parallelization Of Clustering Algorithm Based On MapReduce
6	Research Of Classification In Data Mining Based On Bayes Technology
7	Research On Bayes Method Based Data Classification
8	Research Of Data Mining Classification Algorithm Based On Cloud Computing And The Solar Wind Data
9	Research On Text Mining Based On MapReduce
10	Research On Improved Naive Bayes Classification Model For Imbalanced E-commerce Review Text