The Research Of Mapreduce Implementing Of Text Classification Algorithm Based On Mass Data

Posted on:2015-01-19

Degree:Master

Type:Thesis

Country:China

Candidate:Y Y Xing

Full Text:PDF

GTID:2348330518971681

Subject:Computer application technology

Abstract/Summary:

Cloud computing has attracted broad interest across the IT industry since 2008. It can be regarded as a product of the development of distributed processing, parallel processing and grid computing. Its key features are concurrency and distribution, and its core is massive data processing. However, cloud computing itself is only a theoretical model; to generate value it needs, beyond hardware, a software platform and, more importantly, parallel programs that run effectively on that platform.
A common issue in the data mining field is the processing of massive data. Many traditional data mining algorithms share a bottleneck: they are suitable only for small-scale input, and their efficiency degrades sharply as the amount of data grows. Cloud computing specializes in mass data processing, so if a traditional data mining algorithm can be parallelized and deployed on a cloud computing platform, this bottleneck can be removed. Whether that works depends on whether the data mining computation can be reasonably parallelized.
The contribution of this thesis is a detailed description of the traditional Naive Bayes algorithm, an analysis of its bottleneck, and a proposed solution for parallelizing it. The thesis then details the MapReduce implementation of the traditional Bayes algorithm on the Hadoop platform. Finally, by comparing experiments that process data with the traditional Bayes algorithm and with its MapReduce version, the thesis shows that parallelization on a cloud platform can reduce the time of large-scale data computation, and it analyzes the influence of several major performance parameters on job running time.
The thesis builds a Hadoop cluster of nine nodes, designs six experiment plans to run the traditional Bayes program and the MapReduce Bayes program, and then analyzes the results. The results show that: 1) compared with traditional serial processing, the MapReduce Bayes classification program is capable of large-scale data computing; 2) the MapReduce Bayes classification program achieves good speed-up; 3) delay time, the number of backups and the memory buffer influence the performance of the MapReduce Bayes algorithm; 4) a single node failure influences job running time. The experimental results show that the plan presented in the thesis is executable and effective. The study offers a feasible MapReduce plan for Bayes classification computing.
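The abstract does not give the thesis's actual mapper and reducer, so the following is only a minimal sketch of how Naive Bayes training can be phrased as map and reduce steps. It runs in a single Python process rather than on Hadoop, and every name in it (map_doc, reduce_counts, train, classify), the key layout, the toy corpus, and the use of Laplace smoothing are assumptions made for illustration.

```python
import math
from collections import defaultdict


def map_doc(label, text):
    """Map step (hypothetical): emit (key, 1) pairs for one labeled document."""
    yield ("doc", label), 1
    for word in text.lower().split():
        yield ("word", label, word), 1


def reduce_counts(pairs):
    """Shuffle + reduce step: sum the values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def train(corpus):
    """Run the map step over every document, then reduce the emitted pairs."""
    pairs = (pair for label, text in corpus for pair in map_doc(label, text))
    return reduce_counts(pairs)


def classify(counts, text):
    """Pick the label with the highest log-posterior (Laplace smoothing assumed)."""
    labels = sorted(k[1] for k in counts if k[0] == "doc")
    vocab = {k[2] for k in counts if k[0] == "word"}
    n_docs = sum(counts[("doc", c)] for c in labels)
    best_label, best_score = None, -math.inf
    for c in labels:
        # Total number of word occurrences seen under label c.
        n_words = sum(v for k, v in counts.items() if k[0] == "word" and k[1] == c)
        score = math.log(counts[("doc", c)] / n_docs)
        for word in text.lower().split():
            hits = counts.get(("word", c, word), 0)
            score += math.log((hits + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = c, score
    return best_label


# Toy corpus, invented for this sketch only.
corpus = [
    ("sports", "the team won the match"),
    ("sports", "a great goal in the match"),
    ("tech", "the cluster runs the hadoop job"),
    ("tech", "hadoop job on a cluster node"),
]
counts = train(corpus)
print(classify(counts, "hadoop cluster"))
```

Because the reduce step only sums counts per key, the map calls are independent of one another and could in principle be spread across cluster nodes, which is the property the parallelization argument above relies on.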

Keywords/Search Tags:

Cloud computing, Bayes, Hadoop, MapReduce

Related items

1	Research And Application On Naive Bayes Classification Algorithm
2	Researches About Cloud Computing And Expolit And Test Hadoop Program
3	Research On Algorithms For Naive Bayes Classification And Its Tools Based On Hadoop
4	The Mapreduce Model In The Hadoop Implementation Of Performance Analysis And Optimization Improvements
5	The Research Of MapReduce Job Scheduling Algorithm Based On The Hadoop Platform
6	Based The Hadoop Platform Job Scheduling Algorithm
7	Research And Improvement Of The MapReduce Framework In Cloud Computing
8	The Design Of The Cloud Computing System Based On Hadoop
9	The Cloud Computing Based On Hadoop Platform And Log Analysis
10	Research On The Application Of Cloud Computing Based On Hadoop