The Design And Implementation Of Collection And Analysis System For Spark Performance Data

Posted on: 2016-08-22; Degree: Master; Type: Thesis
Country: China; Candidate: W Q Wu; Full Text: PDF
GTID: 2308330479491517; Subject: Software engineering
Abstract/Summary:
With the rapid development of information technology, distributed computing has made significant progress. Spark, currently a popular distributed computing framework, is widely used by developers in enterprise applications. While a Spark job is running, developers want to see the performance data of the underlying cluster so that they can understand how the whole cluster is operating and locate system bottlenecks. This makes performance tuning possible, yielding higher computational efficiency and shorter run times. A collection and analysis system for Spark performance data is therefore particularly important for developers carrying out performance optimization.
This thesis describes in detail the design and implementation of a collection and analysis system for Spark performance data. User requirements were analyzed first. Based on those requirements, the system was divided into four functional modules: configuration overview, data collection, data processing, and data analysis. The thesis presents the design and implementation of each module, together with the test schemes and conclusions.
The collection system is built on the Akka framework and follows a distributed master-slave architecture, which enables distributed collection and processing of performance data. The data collection module gathers and stores Spark performance data through the dstat monitoring tool running on the slave nodes, which ensures that the collected data is correct and available in real time. The module is also highly extensible and facilitates the development of new services.
The data processing module parses the Spark driver log and the performance data, and displays the results in several types of statistical charts. The data analysis algorithm distinguishes different cluster states, analyzes idle states and the proportion of unused resources, and converts the differences between performance data from different servers into Euclidean distances in order to analyze load balance. It generates analysis reports automatically, which is valuable to developers carrying out performance tuning.
The main result of this thesis is the collection and analysis software for Spark performance data. The system is currently running well and has collected performance data successfully. It helps developers understand the underlying operating status of a Spark cluster and provides a reference for performance tuning.
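The abstract does not give the details of the load-balance analysis, so the following is only a minimal sketch of the general idea, under stated assumptions: each server is summarized by a vector of utilization values between 0 and 1 (here CPU and memory), imbalance is measured as each server's Euclidean distance from the cluster mean vector, and the unused proportion is one minus the mean utilization. The function names, server names, and sample values are all hypothetical and are not taken from the thesis.

```python
import math


def mean_vector(samples):
    """Component-wise mean of all servers' utilization vectors."""
    vectors = list(samples.values())
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def distances_from_mean(samples):
    """Euclidean distance of each server's vector from the cluster mean.

    A larger distance means the server's load differs more from the
    cluster average (assumed here as the imbalance indicator).
    """
    center = mean_vector(samples)
    return {name: math.dist(vec, center) for name, vec in samples.items()}


def unused_fraction(samples):
    """Proportion of capacity left unused, averaged over all metrics."""
    values = [v for vec in samples.values() for v in vec]
    return 1.0 - sum(values) / len(values)


# Hypothetical utilization samples: [cpu, memory], each in the range 0..1.
samples = {
    "slave-1": [0.80, 0.60],
    "slave-2": [0.20, 0.60],
    "slave-3": [0.50, 0.60],
}

distances = distances_from_mean(samples)
for name, dist in sorted(distances.items()):
    print(f"{name}: distance from cluster mean = {dist:.2f}")
print(f"unused resources: {unused_fraction(samples):.0%}")
```

In this sketch a perfectly balanced cluster gives a distance of zero for every server. A real analysis would likely need to normalize metrics with different units before computing distances.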
Keywords/Search Tags:Distributed, Performance Data, Collection and Process, Data Analysis