Big Data Processing For Supercomputing Systems

Posted on: 2019-08-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Gao
GTID: 1368330623450477
Subject: Computer Science and Technology
Abstract/Summary:
With the wide application of high-precision sensors and the growing volume of data generated by scientific simulations, researchers need to process more and more data. Analyzing and understanding large-scale data and extracting valuable information from it is a very challenging problem. Nowadays, "big data" has become a popular term for the various challenges of analyzing large volumes of data. MapReduce is one of the most popular programming paradigms for big data analytics. The model hides parallel execution, data movement, and fault tolerance from applications. It therefore greatly reduces the effort needed to use large-scale parallel and distributed systems, and it has been widely adopted in both industry and research.

However, popular big data processing frameworks, such as Hadoop MapReduce and Spark, were originally designed for commodity clusters. Supercomputing systems differ significantly from commodity clusters in storage architecture, interconnect, and software stack. For example, most large supercomputer installations do not provide on-node persistent storage; instead, storage is decoupled into a separate, globally accessible parallel file system. Commodity clusters, on the other hand, usually have local storage. Because of these differences, Hadoop MapReduce and Spark cannot fully utilize the resources of supercomputing systems. Some researchers have designed big data processing frameworks for supercomputing systems, such as MR-MPI; however, MR-MPI still does not use memory efficiently.

Therefore, to achieve high-performance data analysis, we design a new big data processing framework for supercomputers, called Mimir. Mimir is based on the idea of in-memory computing, implemented with supercomputing techniques such as MPI, and is optimized for memory efficiency. As a result, it uses the resources of supercomputers much more efficiently. In addition, we study the load imbalance problem caused by data skew and propose a dynamic repartition method to balance the load. We also propose using work stealing and MPI collective I/O to improve I/O performance. These optimizations are implemented in Mimir. Finally, we carry out a case study on DNA k-mer analysis and design a new k-mer analysis system based on Mimir, called Bloomfish.

In summary, this work studies the key technologies of big data processing on supercomputing systems. We design and optimize a big data processing framework for supercomputers, so the work is of practical significance. At the same time, the technical scheme proposed here offers guidance for supercomputing systems that aim to provide high-performance data analysis services, so it also has theoretical reference value.

The contributions of this work include:

1. We design a new big data processing framework, Mimir, for supercomputing systems based on the MapReduce model. Mimir makes full use of supercomputing systems to achieve high-performance data analysis through pipelined task scheduling, optimized intermediate data management, and MPI-based communication. Compared with MR-MPI, Mimir improves memory efficiency by up to 16 times; compared with Spark, Mimir improves performance by up to 12 times.

2. To solve the load imbalance problem caused by data skew, we propose a dynamic repartition method to balance the load. This method improves performance by up to 5 times and scales to large systems.

3. To improve I/O performance on the globally shared file system, we propose work stealing and MPI collective I/O methods. Experimental results show that these methods improve the performance of reading input files by up to 50% and of writing output files by up to 42%.

4. We design a new DNA k-mer analysis system based on Mimir, called Bloomfish. Bloomfish makes use of the optimizations in Mimir, such as memory usage optimization and parallel I/O optimization, and thus significantly improves the performance of k-mer analysis. Experimental results show that Bloomfish can analyze 24 TB of DNA data in 1.1 hours. In contrast, other systems can take dozens of hours to complete an analysis of the same size; for example, Jellyfish takes 24 hours to analyze 3 TB of DNA data.

5. Our experimental platforms include several supercomputers with different architectures: Tianhe-2 at the National Supercomputing Center in Guangzhou, Mira at Argonne National Laboratory, Comet at the San Diego Supercomputer Center, and Stampede2 at the Texas Advanced Computing Center. These experiments show that the proposed technologies are suitable for a variety of platforms.
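
As a rough illustration of how k-mer counting maps onto the MapReduce model described above, the following is a minimal single-process Python sketch. It is not Mimir's or Bloomfish's API or implementation: the function names (map_kmers, partition, count_kmers) and the num_partitions parameter are hypothetical, the partitions merely stand in for MPI ranks, and no MPI communication, memory optimization, or load rebalancing is modeled.

```python
from collections import Counter, defaultdict
import zlib


def map_kmers(read, k):
    """Map phase: emit (k-mer, 1) for every length-k window of one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def partition(key, num_partitions):
    """Assign a key to a partition with a stable hash.

    Each partition is a hypothetical stand-in for one process (rank).
    """
    return zlib.crc32(key.encode("ascii")) % num_partitions


def count_kmers(reads, k, num_partitions=4):
    """Count k-mers with explicit map, shuffle, and reduce phases."""
    # Shuffle phase: group emitted pairs by the partition that owns the key.
    buckets = defaultdict(list)
    for read in reads:
        for kmer, one in map_kmers(read, k):
            buckets[partition(kmer, num_partitions)].append((kmer, one))

    # Reduce phase: each partition sums the counts of the keys it owns.
    totals = Counter()
    for pairs in buckets.values():
        for kmer, one in pairs:
            totals[kmer] += one
    return dict(totals)


if __name__ == "__main__":
    print(count_kmers(["ACGTAC", "CGTA"], k=3))
```

Because every occurrence of a given k-mer hashes to the same partition, a k-mer that appears very often concentrates work on one partition. That is the kind of data skew the dynamic repartition method above is meant to address.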
Keywords/Search Tags: Big Data Processing, High Performance Computing, Load Balancing, Parallel I/O Optimization, Genome Analysis