
Research On Big Data Processing Across Data Centers Based On AWS

Posted on: 2017-04-05
Degree: Master
Type: Thesis
Country: China
Candidate: S S Li
GTID: 2348330566957314
Subject: Computer Science and Technology

Abstract
With the globalization of the economy and advances in computing, business spans the whole world, and big data processing has become a common need of government and enterprise applications. Hadoop is the most mature open-source big data processing framework: it implements the MapReduce programming paradigm, stores massive source data in HDFS, and processes the data in parallel on the computing nodes of a cluster.

In many cases, however, the source data is distributed across multiple data centers (geo-distributed). The traditional single-cluster deployment of Hadoop then loses its performance advantage, because all of the data must first be moved to one data center before processing. Performance becomes heavily dependent on the source data size and the network bandwidth between data centers, wasting time and resources. Existing deployment research focuses on moving all data to a single data center for processing rather than matching the deployment to the situation, so it is often limited by the data size and the inter-data-center transfer capacity, with a severe impact on processing performance. The performance measures considered in this thesis are processing time and cost. In addition, the random behavior of cloud infrastructure affects the reliability and performance of calls to the cloud infrastructure application programming interface (API).

To solve these problems, this thesis first proposes a deployment framework for geo-distributed data processing consisting of a decision-making layer, a mapping layer, and a cloud application layer. The decision-making layer evaluates the predicted performance of the candidate policies generated for geo-distributed data processing and selects the policy with the best predicted performance. The cloud application layer both monitors the performance data that the decision-making layer requires and carries out the policy the decision-making layer selects. The mapping layer translates operations and events between the decision-making layer and the cloud application layer.

Second, the deployment of geo-distributed data processing is divided into three modes: Single Cluster Deployment, Distributed Cluster Deployment, and Multiple Clusters Deployment. A performance evaluation model applicable to geo-distributed data processing is put forward to estimate the performance of each deployment mode; a toy illustration of such an estimate is sketched after this abstract.

Third, based on the predictions of the performance evaluation model, three strategies are proposed to address the performance and reliability of geo-distributed data processing: a cluster deployment strategy for geo-distributed data processing, a job scheduling strategy based on a genetic algorithm, and a fault-tolerant strategy for cloud application interface calls; minimal sketches of the latter two strategies also follow.

Finally, clusters were set up on the AWS EC2 platform to evaluate the proposed performance evaluation model and strategies experimentally. The results show that the model predicts the performance of geo-distributed data processing accurately and that policies based on it greatly reduce the time and cost of data processing, significantly improving both the performance and the reliability of geo-distributed data processing.
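To make the flavor of the performance evaluation model concrete, the sketch below estimates the completion time of Single Cluster Deployment as inter-data-center transfer time plus local processing time. This is a minimal sketch under linear-cost assumptions; the function names, rates, and data volumes are illustrative and are not taken from the thesis.

def transfer_time_s(data_gb, bandwidth_gbps):
    """Seconds to move data_gb between data centers over a bandwidth_gbps link."""
    return (data_gb * 8) / bandwidth_gbps

def single_cluster_estimate(data_gb_per_dc, bandwidth_gbps, process_rate_gb_s, target_dc):
    """Move all remote data to target_dc, then process everything there."""
    move = sum(transfer_time_s(gb, bandwidth_gbps)
               for dc, gb in data_gb_per_dc.items() if dc != target_dc)
    total_gb = sum(data_gb_per_dc.values())
    return move + total_gb / process_rate_gb_s

# Hypothetical inputs: 200 GB in us-east-1, 50 GB in eu-west-1, a 1 Gbps
# link, and a cluster that processes 0.5 GB/s once the data is local.
est = single_cluster_estimate({"us-east-1": 200, "eu-west-1": 50},
                              bandwidth_gbps=1.0, process_rate_gb_s=0.5,
                              target_dc="us-east-1")
print(f"estimated completion: {est:.0f} s")  # 400 s transfer + 500 s processing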
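The job scheduling strategy is built on a genetic algorithm. The toy sketch below shows the shape of such a scheduler: a chromosome assigns each job to a cluster, fitness is the estimated makespan, and one-point crossover plus random mutation evolve the population. The job costs, cluster speeds, and GA parameters are invented for illustration, not drawn from the thesis.

import random

JOB_COST = [30, 10, 25, 40, 15, 20]   # work units per job (hypothetical)
CLUSTER_SPEED = [1.0, 0.8, 1.5]       # work units per second per cluster (hypothetical)

def makespan(assignment):
    """Completion time of the slowest cluster under this job assignment."""
    load = [0.0] * len(CLUSTER_SPEED)
    for job, cluster in enumerate(assignment):
        load[cluster] += JOB_COST[job] / CLUSTER_SPEED[cluster]
    return max(load)

def evolve(pop_size=40, generations=100, mutation_rate=0.1):
    n_jobs, n_clusters = len(JOB_COST), len(CLUSTER_SPEED)
    pop = [[random.randrange(n_clusters) for _ in range(n_jobs)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=makespan)                 # lower makespan is fitter
        survivors = pop[:pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, n_jobs)  # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < mutation_rate:
                child[random.randrange(n_jobs)] = random.randrange(n_clusters)
            children.append(child)
        pop = survivors + children
    return min(pop, key=makespan)

best = evolve()
print(best, makespan(best))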
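For the fault-tolerant strategy for cloud API calls, the abstract attributes failures to the random behavior of the cloud infrastructure. A common way to tolerate such transient failures is retry with exponential backoff and jitter, sketched below. The wrapper is generic; the commented boto3 usage is an illustrative assumption rather than a detail from the thesis.

import random
import time

def call_with_retries(fn, max_attempts=5, base_delay_s=1.0):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:                      # real code would catch narrower errors
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter spreads retries out so a burst
            # of callers does not hammer a flaky API in lockstep.
            delay = base_delay_s * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)

# Hypothetical usage (requires boto3 and AWS credentials):
# import boto3
# ec2 = boto3.client("ec2", region_name="us-east-1")
# reply = call_with_retries(lambda: ec2.describe_instances())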
Keywords/Search Tags:data center, big data, Hadoop, AWS, genetic algorithm