
Research On Big Data Distributed Storage Technology Based On Spark

Posted on: 2022-04-20
Degree: Master
Type: Thesis
Country: China
Candidate: Z Q Li
Full Text: PDF
GTID: 2518306350481764
Subject: Master of Engineering
Abstract/Summary:
With the rapid growth of big data and distributed systems, data volumes are growing explosively, and how to store that data has become the first question the industry must answer. The growth shows up not only in the number of bytes but also in the number of files used to hold the data. HDFS, the distributed file system of the Hadoop family, is one of the most widely used big data storage systems in the industry. HDFS creates an index entry for every file it stores, so when the number of files is large, highly concurrent file access can reduce the overall stability of HDFS. This thesis studies and designs a file storage technology based on the Spark distributed computing platform that reduces the index pressure on HDFS when it stores a large number of files, and designs a file access proxy service on top of that technology.

First, the thesis introduces the Hadoop distributed file system and analyzes its storage principles and characteristics. The analysis finds that the NameNode not only manages and maintains the index and directory information of every file but also handles all requests to HDFS. Under highly concurrent access the NameNode therefore becomes too busy, its performance degrades, and its running stability declines. Hadoop provides the HAR file and SequenceFile schemes to address the problem of storing a large number of files. The study found, however, that both schemes have limitations and are better suited to archiving than to ordinary HDFS file access.

Second, the thesis studies and designs a distributed storage technology that uses Spark as the big data distributed computing platform to reduce HDFS index pressure when a large number of files is stored. Using Spark's real-time computing engine, the scheme scans the files in HDFS in real time, merges them into a large data file, and records the file index information in MongoDB, which reduces the index pressure on the NameNode and improves the operational stability of HDFS.

Third, the thesis studies and designs how the merged files are read and operated on. It proposes a dedicated proxy service that handles client requests. Client access is soft load balanced through the ZooKeeper distributed coordination service, which reduces the pressure on any single proxy node, and the proxy service finally returns the data to the client.

Finally, the thesis compares the number of file index entries before and after merging, and evaluates file system stability when files are operated on through the proxy service under high concurrency. The experimental results verify that, when a large number of files is stored, the Spark-based big data distributed storage technology greatly reduces the number of file index entries and better guarantees system stability.
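The merge-and-index idea in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the index stores a byte offset and length for each original file, uses an in-memory buffer in place of an HDFS data file, and uses a plain dictionary in place of the MongoDB index collection. The function names `merge_files` and `read_file` are hypothetical.

```python
import io
from typing import BinaryIO, Dict, Iterable, Tuple

# Index entry: (byte offset in the merged file, length in bytes).
# Assumption: the abstract does not specify the index layout.
Index = Dict[str, Tuple[int, int]]


def merge_files(files: Iterable[Tuple[str, bytes]], out: BinaryIO) -> Index:
    """Append each small file to one large data file and record where it landed."""
    index: Index = {}
    offset = 0
    for name, data in files:
        out.write(data)
        index[name] = (offset, len(data))
        offset += len(data)
    return index


def read_file(merged: BinaryIO, index: Index, name: str) -> bytes:
    """Look up a file in the index and read only its byte range."""
    offset, length = index[name]
    merged.seek(offset)
    return merged.read(length)


if __name__ == "__main__":
    merged = io.BytesIO()  # stand-in for the large data file on HDFS
    index = merge_files([("a.txt", b"hello"), ("b.txt", b"world!")], merged)
    print(index)
    print(read_file(merged, index, "b.txt"))
```

With this layout the file system holds one entry for the merged file however many small files it contains, while per-file lookups go to the external index, which is the effect the abstract attributes to the design.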
Keywords/Search Tags:Spark, HDFS, File Store, Distributed System