Design And Optimization Of Big Data Analysis Platform Based On Spark And HDFS

Posted on: 2019-09-27
Degree: Master
Type: Thesis
Country: China
Candidate: D D Hu
Full Text: PDF
GTID: 2428330590475428
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet, data volumes are growing explosively. The accumulation of massive data places new demands on storage and computation, and a variety of distributed computing and distributed storage systems have emerged in response. The distributed file system HDFS is widely used for its good availability and fault tolerance, while Spark has received extensive attention from academia and industry for its high computing efficiency. Although both Spark and HDFS can be deployed on cheap hardware, their complicated operating procedures and configuration parameters make them difficult for ordinary users. Moreover, Spark takes data locality into account in task scheduling but does not consider node load, which may cause some tasks to run for too long.

To address these problems, this thesis designs a big data analysis platform based on Spark and HDFS that encapsulates the complicated parameter configuration and operating procedures to make the system easier to use. In addition, a data distribution strategy based on node load is proposed to optimize the platform's performance. The main contributions are as follows:

1. A design scheme for a big data analysis platform based on Spark and HDFS is proposed. It encapsulates the complex processing logic and configuration parameters and hides the implementation details inside the platform. Users obtain the computing and storage services provided by the platform through a visual interface. To reduce competition for computing resources and speed up task execution, the platform monitors the resource utilization of each node in real time and adjusts the number of tasks running in parallel in Spark according to the collected utilization data.

2. A data distribution strategy based on node load is proposed to optimize the performance of the platform. The data to be analyzed is distributed to lightly loaded nodes, which indirectly controls Spark's task scheduling: because Spark prefers to run a task where its data resides, tasks end up on the same lightly loaded nodes as their data. This avoids transferring data over the network and improves the performance of the entire platform.

3. Based on these results, a prototype of the Spark-HDFS big data analysis platform is designed and implemented, and functional and performance tests of the platform are carried out. The test results show that the platform handles user requests effectively and that the load-based data distribution strategy improves the performance of big data analysis applications and reduces their execution time.
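The abstract does not give the details of the load-based distribution strategy, so the following is only a minimal sketch of the general idea: rank nodes by a load score and place incoming data on the least-loaded ones. The `NodeLoad` fields, the weights in `load_score`, and the node names are all hypothetical and are not taken from the thesis; the code also does not call any HDFS or Spark API.

```python
from dataclasses import dataclass


@dataclass
class NodeLoad:
    """Hypothetical snapshot of one node's utilization, each value in [0.0, 1.0]."""
    name: str
    cpu: float
    memory: float
    disk_io: float


def load_score(node, w_cpu=0.5, w_mem=0.3, w_io=0.2):
    """Weighted sum of utilizations; the weights are illustrative assumptions."""
    return w_cpu * node.cpu + w_mem * node.memory + w_io * node.disk_io


def pick_target_nodes(nodes, replicas=3):
    """Return the names of the `replicas` nodes with the lowest load score."""
    ranked = sorted(nodes, key=load_score)
    return [n.name for n in ranked[:replicas]]


# Made-up monitoring data for a four-node cluster.
cluster = [
    NodeLoad("node-a", cpu=0.90, memory=0.70, disk_io=0.40),
    NodeLoad("node-b", cpu=0.20, memory=0.30, disk_io=0.10),
    NodeLoad("node-c", cpu=0.50, memory=0.60, disk_io=0.30),
    NodeLoad("node-d", cpu=0.10, memory=0.20, disk_io=0.20),
]

# Choose where two replicas of a new data block would be placed.
targets = pick_target_nodes(cluster, replicas=2)
print(targets)
```

In a sketch like this, the placement decision is the only lever: once the data sits on lightly loaded nodes, a locality-aware scheduler such as Spark's would tend to run the corresponding tasks there without any change to the scheduler itself.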
Keywords/Search Tags: Spark, HDFS, Big Data Analysis Platform, Data Distribution