Design And Optimization Of Big Data Analysis Platform Based On Spark And HDFS

Posted on:2019-09-27

Degree:Master

Type:Thesis

Country:China

Candidate:D D Hu

GTID:2428330590475428

Subject:Software engineering

Abstract/Summary:

With the rapid development of the Internet, data volumes have grown explosively. The accumulation of massive data places new demands on data storage and computation, and a variety of distributed computing and distributed storage systems have emerged. The distributed file system HDFS is widely used for its good availability and fault tolerance, while Spark has received extensive attention from academia and industry for its computing efficiency. Although both Spark and HDFS can be deployed on inexpensive hardware, their complicated operating procedures and configuration parameters make them difficult for ordinary users. Moreover, Spark takes data locality into account in task scheduling but not node load, which may cause some tasks to run for too long. To address these problems, this thesis designs a big data analysis platform based on Spark and HDFS that encapsulates the complicated parameter configuration and operating procedures for ease of use. In addition, a data distribution strategy based on node load is proposed to optimize the platform's performance. The main contributions are as follows:

1. A design scheme for a big data analysis platform based on Spark and HDFS is proposed. It encapsulates the complex processing logic and configuration parameters and hides the platform's internal implementation details. Users obtain the computing and storage services provided by the platform through a visual interface. To reduce competition for computing resources and speed up task execution, the platform monitors the resource utilization of each node in real time and adjusts the number of tasks running in parallel in Spark according to the collected utilization data.

2. A data distribution strategy based on node load is proposed to optimize the performance of the big data analysis platform. The data to be analyzed is distributed to lightly loaded nodes, which indirectly steers Spark's task scheduling so that each task runs on the node that holds its data. This can avoid transferring data over the network and improve the performance of the entire platform.

3. Based on the above research results, a prototype of the Spark-HDFS big data analysis platform is designed and implemented. Functional and performance tests of the platform are also carried out. The test results show that the platform can effectively handle user requests, and that the data distribution strategy based on node load can effectively improve the performance of big data analysis applications and reduce execution time.
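
The abstract does not give the load metric or the placement algorithm, so the following is only a minimal sketch of the general idea of load-aware data placement and load-aware parallelism, not the thesis's actual method. All names (NodeLoad, load_score, pick_target_nodes, parallel_tasks), the choice of CPU, memory, and disk I/O as load indicators, and the weights are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeLoad:
    """Resource utilization of one node, each value in [0, 1] (hypothetical metrics)."""
    name: str
    cpu: float
    memory: float
    disk_io: float


def load_score(node, w_cpu=0.5, w_mem=0.3, w_io=0.2):
    # Weighted sum of utilizations; the weights are illustrative assumptions,
    # not values taken from the thesis.
    return w_cpu * node.cpu + w_mem * node.memory + w_io * node.disk_io


def pick_target_nodes(nodes, replicas=1):
    """Return the names of the `replicas` least-loaded nodes as data targets."""
    ranked = sorted(nodes, key=load_score)
    return [n.name for n in ranked[:replicas]]


def parallel_tasks(node, cores, min_tasks=1):
    """Scale the number of concurrently running tasks by the idle CPU share."""
    idle = max(0.0, 1.0 - node.cpu)
    return max(min_tasks, int(cores * idle))


# Example cluster snapshot (made-up numbers).
nodes = [
    NodeLoad("node-a", cpu=0.90, memory=0.70, disk_io=0.40),
    NodeLoad("node-b", cpu=0.20, memory=0.30, disk_io=0.10),
    NodeLoad("node-c", cpu=0.50, memory=0.50, disk_io=0.50),
]
targets = pick_target_nodes(nodes, replicas=2)
```

In this sketch the data would be written to node-b and node-c, the two nodes with the lowest scores. Because Spark prefers to schedule a task where its input data resides, placing data on lightly loaded nodes is one way the placement decision could indirectly influence scheduling, as the abstract describes.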

Keywords/Search Tags:

Spark, HDFS, Big Data Analysis Platform, Data Distribution

Related items

1	Design And Implementation Of Data Analysis And Modeling Platform Based On R Language Analysis
2	Application Research Of Real-time Data Analysis Based On Spark Computing
3	Query Optimization In Spark SQL For Business Data Of 4G Industry Card Based On HDFS
4	Design And Implementation Of Data Processing And Analysis System Based On Spark
5	Optimizing Big Data Equi-join In Spark And Its Application In Analysis Of Network Traffic Data
6	The Design And Implementation Of Network Data Analysis System Based On Spark Platform
7	Design And Implementation Of Telecom 4G Big Data Platform For Network Optimization Based On Spark
8	Research On Large-scale Handwriting Data Analysis Platform Based On Cloud Computing Architecture And Its Application
9	Research And Implementation Of Data Hybrid Computing Platform Based On Spark
10	Design And Implementation Of NetEase Mobile Big Data Support Platform Based On Spark And Hive