
Research On Classification Algorithm Based On Massive Data Mining

Posted on: 2016-04-08
Degree: Master
Type: Thesis
Country: China
Candidate: J W Tu
GTID: 2278330479955442
Subject: Computer application technology
Abstract/Summary:
Classification, one of the most active branches of data mining, is widely used in pattern recognition, image recognition, machine learning and other fields. It also has a wide range of applications in everyday life and industrial practice, such as medical image recognition and spam filtering. However, with the advent of the big data era, data is produced and accumulated rapidly, and classification tasks over TB-scale or even PB-scale data are becoming common. Although massive data makes the data model more complete, it also brings more redundancy and noise and sharply increases the execution time of classification tasks. In this context, classification accuracy is no longer the sole concern; how to improve the efficiency of an algorithm without reducing its existing classification accuracy has become a new focus for algorithm researchers.

Hadoop is an open-source distributed system modeled on the ideas of Google's distributed file system GFS and the MapReduce parallel computing framework. It has allowed cluster-based parallel computing to develop and spread rapidly in data processing, and it has set off a wave of migrating data mining algorithms from single-node environments to clusters for parallel execution.

First, this thesis exploits the fact that KNN classifies a test sample using only local information around that sample. A clustering algorithm is used to cut away the training samples that are unrelated to the test sample, which effectively reduces the computational overhead of KNN and improves its efficiency and performance.
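The clustering-based pruning idea for KNN described above can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not name the clustering algorithm, so a plain k-means is assumed here, and the function names (`kmeans`, `cluster_knn_predict`) are hypothetical. The neighbor search is restricted to the cluster whose centroid is closest to the test sample.

```python
import math
import random
from collections import Counter


def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_centroid(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))


def kmeans(points, n_clusters, iters=20, seed=0):
    """Plain k-means (an assumed choice of clustering algorithm)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, n_clusters)
    for _ in range(iters):
        assign = [nearest_centroid(p, centroids) for p in points]
        for c in range(n_clusters):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    # Final assignment consistent with the final centroids.
    assign = [nearest_centroid(p, centroids) for p in points]
    return centroids, assign


def cluster_knn_predict(query, points, labels, centroids, assign, k=3):
    """Vote among the k nearest neighbors inside the closest cluster only."""
    target = nearest_centroid(query, centroids)
    candidates = [(dist(query, p), y)
                  for p, y, a in zip(points, labels, assign) if a == target]
    if not candidates:  # fall back to the full training set
        candidates = [(dist(query, p), y) for p, y in zip(points, labels)]
    candidates.sort(key=lambda t: t[0])
    votes = Counter(y for _, y in candidates[:k])
    return votes.most_common(1)[0][0]
```

With this layout, the distance computation per test sample covers one cluster instead of the whole training set. The trade-off is that true neighbors lying just across a cluster boundary may be missed.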
Then, with the help of the MapReduce parallel computing model, a parallel cluster-based KNN classifier is designed and implemented; it was successfully migrated to a Hadoop cluster and put through a range of performance tests. Second, by decomposing the Naive Bayes classification workflow into tasks, we implemented a parallel Naive Bayes classifier based on the MapReduce model. In practice, however, data discretization turned out to be the performance bottleneck of the parallel Naive Bayes classifier. To break this bottleneck, an entropy-based data discretization algorithm was also designed and implemented on the MapReduce model, which makes the parallel Naive Bayes classifier more efficient when classifying massive data. Experiments show that both the parallel cluster-based KNN classification algorithm and the parallel Naive Bayes classification algorithm (using the parallel discretization method) deliver a large performance improvement and good scalability, and to some extent meet the performance requirements of massive data classification.
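To make the entropy-based discretization step more concrete, here is a minimal single-machine sketch. It assumes the common approach of choosing, for one continuous attribute, the cut point that minimizes the weighted class entropy of the two resulting intervals. The abstract does not give the exact algorithm, so the function names (`entropy`, `best_entropy_cut`) and the binary-split formulation are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def best_entropy_cut(values, labels):
    """Return (cut, score): the boundary with the lowest weighted entropy.

    Candidate cuts are midpoints between consecutive distinct sorted values.
    Returns (None, inf) when all values are identical.
    """
    pairs = sorted(zip(values, labels))
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary between equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * entropy(left)
                 + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_cut, best_score = (lo + hi) / 2, score
    return best_cut, best_score
```

In a MapReduce setting, one plausible arrangement (again an assumption) is for mappers to emit records keyed by attribute and for each reducer to run this search for its attribute, so that attributes are discretized in parallel.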
Keywords/Search Tags: Classification, MapReduce, Parallelization, KNN, Naïve Bayes, discrete data