
Research on a MapReduce Implementation of a Text Classification Algorithm for Massive Data

Posted on: 2015-01-19  Degree: Master  Type: Thesis
Country: China  Candidate: Y Y Xing  Full Text: PDF
GTID: 2348330518971681  Subject: Computer application technology
Abstract/Summary:
Cloud computing has attracted broad interest across the IT industry since 2008. It can be regarded as the outcome of the development of distributed processing, parallel processing, and grid computing. Its key characteristics are concurrency and distribution, and its core is massive data processing. However, cloud computing by itself is only a theoretical model. To generate value it needs, in addition to hardware, a software platform and, more importantly, parallel programs that run effectively on that platform.

A common problem in the data mining field is the processing of massive data. Many traditional data mining algorithms share a bottleneck: they are suited only to small-scale input, and their efficiency degrades sharply as the amount of data grows. Cloud computing specializes in mass data processing, so if a traditional data mining algorithm can be parallelized and deployed on a cloud computing platform, this bottleneck can be removed. Whether that succeeds depends on whether the data mining computation can be reasonably parallelized.

This thesis describes the traditional Naive Bayes algorithm in detail, points out its bottleneck, and proposes a parallelization scheme. It then gives details of the MapReduce implementation of the traditional Bayes algorithm on the Hadoop platform. Finally, by comparing experiments in which the traditional Bayes algorithm and its MapReduce version process the data, the thesis demonstrates that parallelization on a cloud platform can reduce the running time of large-scale data computation, and it analyzes the influence of several major performance parameters on job running time.
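As a rough illustration of why Naive Bayes parallelizes well, its training step reduces to counting, which fits the map and reduce pattern directly. The sketch below is not the thesis's Hadoop code: it is a single-process Python approximation, and the names `map_phase`, `reduce_phase`, `train`, and `classify`, the whitespace tokenizer, and the use of add-one (Laplace) smoothing are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def map_phase(label, text):
    """Mapper (hypothetical): emit one document marker and one pair per token."""
    yield (label, None), 1          # (label, None) counts documents per class
    for word in text.lower().split():
        yield (label, word), 1      # (label, word) counts word occurrences per class


def reduce_phase(pairs):
    """Reducer (hypothetical): sum the values for each key.

    On a real cluster the shuffle groups keys across nodes first; here the
    grouping and the summation are collapsed into one local step.
    """
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def train(documents):
    """Run the map and reduce steps over (label, text) pairs and return the counts."""
    pairs = (kv for label, text in documents for kv in map_phase(label, text))
    return reduce_phase(pairs)


def classify(counts, text):
    """Pick the class with the highest log posterior, using add-one smoothing."""
    labels = sorted({label for (label, word) in counts if word is None})
    vocab = {word for (_, word) in counts if word is not None}
    n_docs = sum(counts[(label, None)] for label in labels)
    best_label, best_score = None, -math.inf
    for label in labels:
        # Total number of word occurrences seen for this class.
        n_words = sum(c for (l, w), c in counts.items() if l == label and w is not None)
        score = math.log(counts[(label, None)] / n_docs)
        for word in text.lower().split():
            if word not in vocab:
                continue            # assumption: ignore words never seen in training
            score += math.log((counts.get((label, word), 0) + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Because each mapper call touches only one document and each reducer key is summed independently, the counting work can be spread over as many nodes as there are input splits, which is the property the parallelization relies on.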
The thesis builds a Hadoop cluster of nine nodes, designs six experiment plans that run the traditional Bayes program and the MapReduce-based Bayes program, and analyzes the results. The results show that: 1) compared with traditional serial processing, the MapReduce-based Bayes classification program is capable of large-scale data computing; 2) the MapReduce-based Bayes classification program achieves good speed-up; 3) the delay time, the number of backups, and the memory buffer affect the performance of the MapReduce-based Bayes algorithm; 4) a single node failure affects the running time of jobs. The experimental results show that the plan presented in the thesis is executable and effective. The study offers a feasible MapReduce-based plan for Bayes classification computing.
Keywords/Search Tags:Cloud computing, Bayes, Hadoop, MapReduce