The Design And Implementation Of Data Mining System On Yarn

Posted on:2016-01-06

Degree:Master

Type:Thesis

Country:China

Candidate:H Wang

Full Text:PDF

GTID:2298330467992898

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of information technology, the era of "big data" has arrived. Such data is characterized by high volume, high generation speed, strong timeliness, and high complexity, so processing it is becoming increasingly difficult. People need powerful, general-purpose tools to discover valuable knowledge in huge amounts of data, which can support decision-making and create greater value. Data mining makes it possible to extract knowledge patterns from massive data, and it plays an important role in practical applications. Cloud computing overcomes the limitations that the traditional stand-alone mode faces when handling huge amounts of data, and it provides a convenient way to mine data at that scale.

First, this thesis studied the mainstream open-source cloud computing frameworks in depth and carried out experiments on them. We discussed the limitations of Hadoop 1.0, compared the advantages and disadvantages of MapReduce and Spark, summarized the characteristics of YARN and other resource management frameworks, and explained the significance of Storm for stream-oriented computation. We then set up the cloud computing platform.

Second, we designed the architecture of the data mining system, including the cloud computing layer, and migrated the existing data mining system to the YARN-based platform. On top of MapReduce and Spark, we designed and implemented a variety of parallel data mining algorithms.

Finally, we carried out functional tests of the system, together with performance tests of the cloud computing layer and the real-time processing module. The test results show that all modules of the migrated data mining platform run normally and stably, and that it is possible to build a cloud computing platform that supports multiple computing frameworks and achieves higher resource utilization. MapReduce was designed for batch processing and is suited to offline processing of massive data. Spark is suited to iterative algorithms and interactive data mining, where its performance is also superior to that of MapReduce. Storm maintains a high processing speed while keeping the error rate very low in stream processing, which makes it important for stream processing.

Keywords/Search Tags:

data mining, cloud computing, Hadoop, Spark, YARN, Storm

Related items

1	Research On Parallelization Of Data Mining Algorithm Based On Distributed Platforms Spark And YARN
2	Parallel Research On Data Mining Algorithm Based On YARN And Spark Framework
3	Research And Application Of Cloud Computing Technology In The Power System Bad Data Processing
4	Research On SLA-Aware Energy-Efficient Scheduling Strategy For Hadoop Yarn
5	The Design Of The Cloud Computing System Based On Hadoop
6	The Design And Implementation Of Log Analysis System In Cloud Computing Environment
7	Parallel Data Mining Algorithm Research In Cloud
8	Research And Design Of Data Mining System For TCM Disease Based On Cloud Computing Environment
9	Research On Key Technologies And Application On YARN For High-Performance Computing
10	Research On The Energy-Efficient Hadoop YARN Resource Scheduling Strategy Based On State Matrix