
Design And Implementation Of The Massive Data Computing Platform Based On Spark

Posted on: 2017-06-13
Degree: Master
Type: Thesis
Country: China
Candidate: K Y Jiang
Full Text: PDF
GTID: 2348330488959927
Subject: Software engineering
Abstract/Summary:
Data processing technology combines data storage with data computation; its main goal is the mining and analysis of all kinds of data. In recent years, the UC Berkeley AMP Lab has developed a new framework for massive data processing, Spark, which has gradually come into wide view. It not only improves on the earlier popular Hadoop framework, but also introduces the resilient distributed dataset (RDD) abstraction and a more flexible programming model, providing a simpler and more efficient way to process massive data.

With the advent of the massive-data era, many companies encounter the problem of massive data processing and analysis. Existing massive data processing systems suffer from licensing cost, complex operation, non-customizable algorithms, and non-intuitive result presentation. This thesis proposes a Spark-based platform that runs efficiently to achieve massive data storage and computation. The platform also allows users to upload customized algorithms, which can then run after simple configuration. Built on the Webx framework, the system provides its services through a web site, which reduces the learning cost of traditional command-line operation and makes Spark operation graphical. The system also displays analysis results in various ways, which is convenient for researchers' further study.

Firstly, by analyzing the Spark distributed computation framework and the current development of Web technology in detail, the thesis presents the problems of massive data processing and lists the functional and performance requirements of the system. On this basis, the thesis introduces the Webx framework in detail and uses it to implement the web site that presents all system functions visually. Secondly, by analyzing the parallel computation model, MLlib is used to implement classic data mining algorithms.
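To illustrate the RDD-style programming model the platform builds on, the following is a minimal sketch in plain Python. A toy class stands in for a real Spark cluster; the class and method names mirror a small subset of the Spark RDD API, and the word-count job is purely illustrative, not code from the thesis.

```python
# Toy, single-machine stand-in for a Spark RDD, mimicking a subset of
# the RDD API (map / flatMap / reduceByKey / collect) to show the
# functional, pipeline-style programming model the abstract describes.
class MiniRDD:
    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        # Apply fn to every element, yielding a new dataset.
        return MiniRDD(fn(x) for x in self.data)

    def flatMap(self, fn):
        # Apply fn (which returns an iterable) and flatten the results.
        return MiniRDD(y for x in self.data for y in fn(x))

    def reduceByKey(self, fn):
        # Merge the values of each key with the given reduce function.
        acc = {}
        for k, v in self.data:
            acc[k] = fn(acc[k], v) if k in acc else v
        return MiniRDD(acc.items())

    def collect(self):
        return self.data

# Classic word-count pipeline expressed against the toy API.
lines = MiniRDD(["spark makes big data simple", "big data needs spark"])
counts = (lines.flatMap(str.split)
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b)
               .collect())
print(dict(counts))  # e.g. {'spark': 2, 'makes': 1, 'big': 2, ...}
```

On a real cluster the same pipeline shape would run partitioned across workers, which is what gives the RDD model its simplicity and efficiency for massive data.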
Then, by analyzing the data storage mechanism, a MySQL database combined with the distributed file system HDFS is used to store user and algorithm information. Finally, the Secure Shell connection technology implements the interaction between the front-end web site and the back-end Spark cluster.
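One way the front-end web site could hand a user-uploaded algorithm to the back-end Spark cluster over Secure Shell is sketched below. The host name, jar path, and class name are illustrative assumptions, not values from the thesis; the command is composed but deliberately not executed.

```python
import shlex

def build_submit_command(host, jar_path, main_class, args):
    # Compose (but do not run) an `ssh <host> spark-submit ...`
    # invocation, quoting each remote argument for the shell.
    remote = " ".join(shlex.quote(p) for p in
                      ["spark-submit", "--class", main_class,
                       "--master", "yarn", jar_path, *args])
    return ["ssh", host, remote]

# Hypothetical job submission for a user-uploaded algorithm.
cmd = build_submit_command("spark@master", "/jobs/user123/kmeans.jar",
                           "cn.example.KMeansJob", ["--k", "8"])
print(" ".join(cmd))
```

In a deployment like the one the abstract describes, the web layer would run such a command after the user's simple configuration step, then collect the job's output for display.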
Keywords/Search Tags: Massive Data, Webx, Spark, Visualization