A System For Distributed MD Data Analysis Based On Spark

Posted on: 2017-02-11
Degree: Master
Type: Thesis
Country: China
Candidate: T Hou
Full Text: PDF
GTID: 2308330482495670
Subject: Software engineering
Abstract/Summary:
Molecular Simulation is an important computer simulation method in fundamental sciences such as chemistry, physics, and biology. It is a powerful tool for understanding the behavior of natural systems in these fields. The related technologies are developing very fast, and the data generated by Molecular Simulation keeps growing. Analyzing this large volume of output MD data and extracting useful results is the key goal of Molecular Simulation. The traditional analysis methods are provided by the MD software itself and are suitable when the amount of data is modest. Today, however, the output of a Molecular Simulation is on the order of gigabytes or terabytes. With such large volumes of data, the challenge is to handle analytical queries over spatial and temporal relationships, which are often compute-intensive. I/O overhead and CPU overhead then have a major influence on system performance, so traditional systems run very slowly and cannot process the data efficiently.

Apache Spark is a prominent system in the field of big data processing. After six years of development, it has become one of the most popular platforms for distributed computing. The key idea of Spark is to process data in memory, which makes it much faster than other distributed processing systems. Building a system on top of Spark to process large Molecular Simulation data is therefore very helpful for MD data analysis. Although various tools exist to tackle the problem, in this paper we propose, based on this idea, a system for distributed MD data analysis on top of Spark that parallelizes the computation of analytical queries over Molecular Simulation data.

The system consists of three layers: the Apache Spark layer, the MS RDD layer, and the MS Query Processing layer. The Apache Spark layer reads the MD data from the raw data. The MS RDD layer stores the data read by the Apache Spark layer in RDD format. The MS Query Processing layer provides the functionality for executing analytical queries. A caching mechanism is used in our system to improve performance, which significantly reduces the I/O overhead and CPU overhead. At the end of this paper, we validate the system with experiments, and the results show that our system is very effective for MD data analysis.
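The three-layer design with caching can be illustrated with a minimal sketch. This is not the thesis's implementation and does not use Spark: it is a plain-Python stand-in in which an in-memory store plays the role of the MS RDD layer and a thread pool plays the role of parallel query execution. The input format (one atom per line as "frame_id x y z"), the names `FrameStore`, `load_frames` and `run_query`, the two example queries, and the unit atomic masses are all assumptions made for illustration.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical raw MD data: one atom per line, "frame_id x y z".
RAW_LINES = [
    "0 0.0 0.0 0.0",
    "0 2.0 0.0 0.0",
    "1 0.0 0.0 0.0",
    "1 0.0 4.0 0.0",
]

PARSE_CALLS = 0  # counts how many times the raw data is actually parsed


def load_frames(lines):
    """Reader layer (stand-in for the Apache Spark layer): parse raw lines."""
    global PARSE_CALLS
    PARSE_CALLS += 1
    frames = {}
    for line in lines:
        frame_id, x, y, z = line.split()
        frames.setdefault(int(frame_id), []).append(
            (float(x), float(y), float(z))
        )
    return frames


class FrameStore:
    """Storage layer (stand-in for the MS RDD layer): parsed frames are kept
    in memory after the first access, so later queries skip the parsing."""

    def __init__(self, lines):
        self._lines = lines
        self._cache = None

    def frames(self):
        if self._cache is None:
            self._cache = load_frames(self._lines)
        return self._cache


def center_of_mass(atoms):
    """Example per-frame query; unit masses are assumed."""
    n = len(atoms)
    return tuple(sum(a[i] for a in atoms) / n for i in range(3))


def radius_of_gyration(atoms):
    """Example per-frame query; unit masses are assumed."""
    com = center_of_mass(atoms)
    squared = sum(
        sum((a[i] - com[i]) ** 2 for i in range(3)) for a in atoms
    )
    return math.sqrt(squared / len(atoms))


def run_query(store, per_frame_fn):
    """Query layer: apply a per-frame function to all cached frames in parallel."""
    frames = store.frames()
    frame_ids = sorted(frames)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(per_frame_fn, (frames[i] for i in frame_ids)))
    return dict(zip(frame_ids, results))


store = FrameStore(RAW_LINES)
com = run_query(store, center_of_mass)      # first query triggers parsing
rg = run_query(store, radius_of_gyration)   # second query reuses the cache
```

Running two queries against the same store parses the raw lines only once, which is the effect the caching mechanism is meant to achieve: repeated analytical queries pay the read and parse cost a single time. In Spark itself, the comparable step would be persisting the RDD in memory before running several actions on it.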
Keywords/Search Tags: Molecular Simulation, Apache Spark, Big Data, Distributed Computing, High Performance Computing