The Design And Implementation Of Data Mining System On Yarn

Posted on: 2016-01-06 | Degree: Master | Type: Thesis
Country: China | Candidate: H Wang | Full Text: PDF
GTID: 2298330467992898 | Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of information technology, a new era of "big data" is arriving. Such data is characterized by high volume, high generation speed, strong timeliness, and high complexity, which makes it increasingly difficult to process. People need powerful, general-purpose tools to find valuable knowledge in huge amounts of data, knowledge that can support decision-making and create greater value. Data mining helps extract patterns and knowledge from massive data, and it plays an important role in practical applications. The emergence of cloud computing overcame the limitations that the traditional stand-alone mode faced when handling huge amounts of data, and it provides a convenient way to mine data at that scale.
First, this thesis studied the mainstream open source cloud computing frameworks in depth, supported by experiments. We discussed the limitations of Hadoop 1.0, compared the advantages and disadvantages of MapReduce and Spark, summarized the characteristics of YARN and other resource management frameworks, and elaborated on the significance of Storm for stream-oriented computation. We also set up the cloud computing platform.
Second, we designed the data mining system architecture, including the cloud computing layer, and migrated the existing data mining system to the YARN-based platform. Based on MapReduce and Spark, we designed and implemented a variety of parallel data mining algorithms.
Finally, we carried out functional tests of the system, as well as performance tests of the cloud layer and the real-time processing module. The test results show that all modules of the migrated data mining platform run normally and stably, and that it is possible to construct a cloud computing platform that supports multiple computing frameworks and achieves higher resource utilization.
MapReduce was proposed for batch processing and is suitable for offline processing of massive data. Spark is suitable for iterative algorithms and interactive data mining, and in these scenarios it outperforms MapReduce. Storm maintains a high processing speed with a very low error rate in stream processing, which makes it important for stream-oriented workloads.
Keywords/Search Tags: data mining, cloud computing, Hadoop, Spark, YARN, Storm