
Performance Optimization For Big Data Processing Systems In The Cloud

Posted on: 2019-04-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Xu
Full Text: PDF
GTID: 1318330542474329
Subject: Computer software and theory

Abstract/Summary:
In recent years, big data has grown explosively and has changed the way people live. Demand is increasing for data processing systems that can perform large-scale analytics and mine the potential value of big data. Because traditional data processing systems are too limited for this purpose, the MapReduce model was proposed to handle big data problems efficiently. In the MapReduce model, data are divided into many small fragments that are distributed to different nodes in the cluster to make data processing more efficient. The model gave rise to open-source big data processing systems such as Hadoop and Spark, which have attracted attention from both industry and academia. These systems need the support of massive physical infrastructure. However, the high economic cost and heavy maintenance burden of physical infrastructure are prohibitive for some small and medium-sized enterprises, which impedes the wider adoption of big data processing systems. With the development of cloud computing, many enterprises and institutions have begun to migrate their big data processing applications into the cloud to take advantage of features such as on-demand self-service and rapid scalability. Cloud computing can also shorten the time needed for data processing and analytics and make big data processing more efficient. Although cloud computing brings much convenience to big data processing, it also brings new challenges and problems. Building on earlier research on big data processing systems in the cloud, this dissertation focuses on performance optimization and carries out research in three areas: the scheduling of virtual clusters for big data processing in the cloud, the data placement strategy of the distributed file system in the cloud, and the task scheduling of big data processing in the cloud.

First, we analyze and model the network requirements of a virtual cluster for big data processing in the cloud. We take resource sharing in the cloud into consideration and model the maximum network performance that a virtual cluster for big data processing can obtain. Based on this model, we study the scheduling problem for virtual clusters in the cloud in order to optimize the network performance of big data processing. We design a heuristic that searches for the optimal solution within an acceptable time. Simulation results show that the heuristic achieves near-optimal results.

Next, we study the problem of data placement for the Hadoop distributed file system in the cloud. Co-location and heterogeneity among virtual machines cause two problems for the Hadoop distributed file system: loss of data reliability and performance degradation. We therefore propose a novel location-aware data block allocation strategy (LDBAS) that mitigates performance degradation and enhances data reliability at the same time. LDBAS enhances data reliability by taking the location of nodes into account when placing data blocks. LDBAS also allocates data blocks by predicting the processing load across different nodes, which improves the data locality of map tasks and the performance of data processing applications. We conducted a series of simulations and real experiments on a Hadoop cluster. The results show that LDBAS can enhance data reliability and reduce the job time of I/O-intensive applications while satisfying the real-time constraint.

Finally, we study the speculative execution of big data processing systems in the cloud. To avoid the impact of stragglers, or slow tasks, in the cluster, MapReduce-like systems often launch speculative backup tasks to shorten the job completion time. Accurate task time estimation and straggler detection are the key to speculative execution. We use the history of task processing in the cluster to support the time estimation of tasks. We then propose combining time estimation based on the global speed with time estimation based on the local phase speed to identify stragglers. This combination detects stragglers more accurately and avoids the resource competition caused by misjudging regular tasks as stragglers. Experimental results show that the new approach helps reduce the total job completion time of big data processing applications in the cloud.

This dissertation focuses on the performance optimization of big data processing systems in the cloud. Considering the characteristics and requirements of cloud computing and big data processing, we study the issues that arise when big data processing applications are migrated into the cloud. A heuristic is proposed to search for the optimal solution to the virtual cluster scheduling problem. A location-aware data placement strategy and a history-based method for speculative task execution are then given to improve the performance of big data processing systems in the cloud. We hope the research in this dissertation can help guide the design of big data processing systems in the cloud.
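
The idea of combining a global-speed estimate with a local phase-speed estimate to identify stragglers can be illustrated with a short Python sketch. This is not the dissertation's implementation: the abstract gives no formulas, so everything here is an assumption made for illustration. That includes the `TaskProgress` fields, the two estimator functions, the median-based threshold, the `slowdown` factor of 1.5, and the rule that a task counts as a straggler only when both estimates agree. The `later_phases_time` field stands in for a history-derived average of the time the remaining phases usually take.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TaskProgress:
    """Hypothetical snapshot of one running task (all fields are assumptions)."""
    task_id: str
    elapsed: float            # seconds since the task started (> 0)
    progress: float           # overall fraction complete, in (0, 1]
    phase_elapsed: float      # seconds spent in the current phase (> 0)
    phase_progress: float     # fraction of the current phase complete, in (0, 1]
    later_phases_time: float  # historical average duration of the phases still to come


def remaining_by_global_speed(t: TaskProgress) -> float:
    """Estimate remaining time from the average speed over the whole task."""
    speed = t.progress / t.elapsed
    return (1.0 - t.progress) / speed


def remaining_by_phase_speed(t: TaskProgress) -> float:
    """Estimate remaining time from the current phase's speed plus history."""
    phase_speed = t.phase_progress / t.phase_elapsed
    return (1.0 - t.phase_progress) / phase_speed + t.later_phases_time


def find_stragglers(tasks: list[TaskProgress], slowdown: float = 1.5) -> list[str]:
    """Return ids of tasks that both estimators consider slow.

    A task is flagged only if each estimate exceeds `slowdown` times the
    median of that estimate across all running tasks. Requiring agreement
    is one simple way to avoid launching backups for regular tasks.
    """
    if not tasks:
        return []
    global_est = {t.task_id: remaining_by_global_speed(t) for t in tasks}
    phase_est = {t.task_id: remaining_by_phase_speed(t) for t in tasks}
    global_limit = slowdown * median(global_est.values())
    phase_limit = slowdown * median(phase_est.values())
    return [
        t.task_id
        for t in tasks
        if global_est[t.task_id] > global_limit and phase_est[t.task_id] > phase_limit
    ]


tasks = [
    TaskProgress("a", 10, 0.5, 5, 0.5, 5),
    TaskProgress("b", 10, 0.5, 5, 0.5, 5),
    TaskProgress("c", 10, 0.4, 5, 0.4, 5),
    TaskProgress("d", 10, 0.1, 10, 0.2, 5),  # slow overall and slow in its current phase
    TaskProgress("e", 20, 0.3, 2, 0.5, 3),   # slow start, but its current phase is fast
]
stragglers = find_stragglers(tasks)
```

In this sample, task `e` looks slow by global speed alone because of its slow start, yet its current phase is progressing quickly, so the combined rule does not select it for a backup task. Task `d` is slow by both measures and is the only one flagged.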
Keywords/Search Tags:Big Data Processing, Cloud Computing, Virtual Cluster Scheduling, Data Placement, Task Scheduling