Big Data Processing For Supercomputing Systems

Posted on:2019-08-08

Degree:Doctor

Type:Dissertation

Country:China

Candidate:T Gao

GTID:1368330623450477

Subject:Computer Science and Technology

Abstract/Summary:

With the wide application of high-precision sensors and the growing volume of data generated by scientific simulations, researchers need to process ever more data. Analyzing and understanding large-scale data and extracting valuable information from it is a very challenging problem. Nowadays, "big data" has become a popular term for the challenges of analyzing large volumes of data. MapReduce is one of the most popular programming paradigms for big data analytics. The model hides parallel execution, data movement, and fault tolerance from applications. It thus greatly reduces the effort needed to use large-scale parallel and distributed systems, and it has been widely used in both industry and research.

However, popular big data processing frameworks, such as Hadoop MapReduce and Spark, were originally designed for commodity clusters. Supercomputing systems differ significantly from commodity clusters in storage architecture, interconnect, and software stack. For example, most large supercomputer installations do not provide on-node persistent storage; instead, storage is decoupled into a separate, globally accessible parallel file system. Commodity clusters, by contrast, usually have local storage. Because of these differences, Hadoop MapReduce and Spark cannot fully utilize the resources of supercomputing systems. Some researchers have designed big data processing frameworks for supercomputing systems, such as MR-MPI, but MR-MPI still does not use memory efficiently.

Therefore, to achieve high-performance data analysis, we design a new big data processing framework for supercomputers, called Mimir. Mimir is based on the idea of in-memory computing, implemented with supercomputing techniques such as MPI, and is optimized for memory efficiency. As a result, it uses the resources of supercomputers much more efficiently. In addition, we study the load imbalance problem caused by data skew and propose a dynamic repartition method to balance the load. We also propose using work stealing and MPI collective I/O to improve I/O performance. These optimizations are implemented in Mimir. Finally, we carry out a case study on DNA k-mer analysis and design a new k-mer analysis system based on Mimir, called Bloomfish.

In summary, this dissertation studies the key technologies of big data processing on supercomputing systems. We design and optimize a big data processing framework for supercomputers, so the work is of practical significance. At the same time, the proposed technical scheme offers guidance for supercomputing systems that provide high-performance data analysis services, so it also has theoretical reference value. The contributions of this dissertation are:

1. We design Mimir, a new big data processing framework for supercomputing systems based on the MapReduce model. Mimir makes full use of supercomputing systems to achieve high-performance data analysis through pipelined task scheduling, optimized intermediate data management, and MPI-based communication. Compared with MR-MPI, Mimir improves memory efficiency by up to 16 times; compared with Spark, Mimir improves performance by up to 12 times.

2. To solve the load imbalance problem caused by data skew, we propose a dynamic repartition method to balance the load. This method improves performance by up to 5 times and scales to large systems.

3. To improve I/O performance on the globally shared file system, we propose work stealing and MPI collective I/O methods. Experimental results show that these methods improve the performance of reading input files by up to 50% and of writing output files by up to 42%.

4. We design Bloomfish, a new DNA k-mer analysis system based on Mimir. Bloomfish takes advantage of the optimizations in Mimir, such as memory usage optimization and parallel I/O optimization, and thus significantly improves the performance of k-mer analysis. Experimental results show that Bloomfish can analyze 24 TB of DNA data in 1.1 hours, whereas other systems can take dozens of hours to analyze data of the same size. For example, Jellyfish takes 24 hours to analyze 3 TB of DNA data.

5. Our experimental platforms include several supercomputers with different architectures: Tianhe-2 at the National Supercomputing Center in Guangzhou, Mira at Argonne National Laboratory, Comet at the San Diego Supercomputer Center, and Stampede2 at the Texas Advanced Computing Center. These experiments show that the proposed technologies are suitable for a variety of platforms.
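As a rough illustration of how k-mer counting fits the MapReduce model mentioned in the abstract, the following is a minimal single-process Python sketch. It is not the Mimir or Bloomfish implementation and does not use MPI; the function names (`map_kmers`, `partition`, `count_kmers`), the CRC32-based partitioner, and the default of four partitions are all hypothetical choices made for this example. The per-partition load it returns also hints at how a skewed input can overload one partition, which is the kind of imbalance a repartition step would aim to correct.

```python
import zlib
from collections import Counter, defaultdict


def map_kmers(read, k):
    """Map phase: emit a (k-mer, 1) pair for every length-k window of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def partition(key, num_partitions):
    """Choose the partition that owns a key, using a deterministic CRC32 hash.

    This partitioner is an assumption for the sketch, not the one used by Mimir.
    """
    return zlib.crc32(key.encode("ascii")) % num_partitions


def count_kmers(reads, k, num_partitions=4):
    """Count k-mers with an explicit map, shuffle, and reduce structure.

    Returns (counts, loads), where loads[p] is the number of key-value
    pairs that were routed to partition p during the shuffle.
    """
    # Shuffle phase: group the emitted pairs by key inside the owning partition.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for read in reads:
        for kmer, one in map_kmers(read, k):
            partitions[partition(kmer, num_partitions)][kmer].append(one)

    # Reduce phase: sum the values for each key, one partition at a time.
    counts = Counter()
    loads = []
    for part in partitions:
        loads.append(sum(len(values) for values in part.values()))
        for kmer, values in part.items():
            counts[kmer] = sum(values)
    return counts, loads


if __name__ == "__main__":
    counts, loads = count_kmers(["ACGTAC", "CGTACG"], k=3)
    print(dict(counts))
    print(loads)

    # A highly repetitive read sends every pair to a single partition.
    _, skewed_loads = count_kmers(["AAAAAAAA"], k=3)
    print(skewed_loads)
```

In a distributed setting, each partition would live on a different process and the shuffle would become inter-process communication; the sketch keeps everything in one process so that the data flow is easy to follow.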

Keywords/Search Tags:

Big Data Processing, High Performance Computing, Load Balancing, Parallel I/O Optimization, Genome Analysis

Related items

1	Research On Job Runtime Characteristics Based Performance Optimization In Big Data Processing System
2	Research And Implementation Of Task Scheduling Mechanism On A Parallel Computing System
3	Research And Design On An Efficient Load Balancing Algorithm
4	Research On Performance-effective Load Balancing Technology In Distributed Stream Processing Systems
5	Load Balancing Problems For Parallel And Distributed Computing
6	Load Balancing In Parallel Computing Based On LAN
7	MPICH-based Parallel Computing System, Load Balancing Technology
8	Technologies For Energy-efficient And High-Performance GPGPU Computing
9	Parallel Query Processing System On Large-scale RDF Data
10	Research On Key Technologies Of Parallel Optimization For Multi-computing Platforms For Large-scale Applications