
Optimizing Data Placement Of MapReduce On Ceph-based Framework Under Load-balancing Constraint

Posted on: 2019-05-16
Degree: Master
Type: Thesis
Country: China
Candidate: Y T Liang
GTID: 2428330566477985
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, users have generated a large amount of data through applications such as social networks, blogs, and multimedia sharing services. The challenge for the research and industrial communities is how to design a low-cost storage system that can cope with this explosion of data. A common solution in production is the distributed object storage system. Ceph, one such system, has been widely adopted as the underlying storage for distributed systems because of its high availability, reliability, and scalability.

In a Ceph deployment composed of heterogeneous clusters, the data placement strategy can greatly affect system performance and load balancing. At present, the placement algorithm used in Ceph focuses only on the load balancing of the system. Through the CRUSH Map, Ceph allows users to customize the weight of each object storage device, but the weight does not reflect the differences between devices in a heterogeneous environment. It indicates only the storage capacity of a device and does not account for its computing capability or for network heterogeneity. By default, data is distributed approximately evenly across the object storage devices. Because the heterogeneity among clusters is not considered, performance eventually degrades and running times become longer.

For the heterogeneous cluster environment, this paper presents an improved Ceph framework. The new framework takes into account both load balancing and cluster heterogeneity, including computing power and network bandwidth. In addition, for a given application, it is critical to determine the initial data allocation and to find the data placement strategy that minimizes the completion time of the whole application in a heterogeneous cluster environment. This paper therefore focuses on how to allocate data after migrating to the proposed Ceph-based framework. MapReduce, a distributed computing framework, is chosen as the use case because it is still widely used and because the presented framework suits applications built on the principle of moving computation rather than data across clusters, as MapReduce is.

Based on the proposed Ceph-based framework and the properties of MapReduce, we formulate a mixed integer linear programming (MILP) model to obtain the optimal data placement. However, because of the high computational complexity of MILP, we also devise an efficient algorithm that obtains near-optimal solutions. The experimental results show that the proposed algorithm achieves up to a 25.6% improvement in system performance compared with the original strategy implemented in Ceph.
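The trade-off described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's MILP model or its heuristic algorithm. It only compares an even split of data with a split proportional to each node's processing rate, subject to a simple cap on any node's share as a stand-in for a load-balancing constraint. The `Node` class, the rates, the 1000 MB input, and the `max_share` parameter are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    rate_mb_s: float  # hypothetical map-phase processing rate of this node


def equal_split(nodes, total_mb):
    """Even placement: every node receives the same amount of data."""
    share = total_mb / len(nodes)
    return {n.name: share for n in nodes}


def capped_proportional_split(nodes, total_mb, max_share=1.0):
    """Place data in proportion to processing rate, but never give a
    single node more than max_share of the total (illustrative
    load-balancing cap). Excess is redistributed among uncapped nodes."""
    cap = max_share * total_mb
    if cap * len(nodes) < total_mb:
        raise ValueError("max_share is too small to place all data")
    alloc = {}
    free = list(nodes)
    remaining = total_mb
    while free:
        rate_sum = sum(n.rate_mb_s for n in free)
        over = {n.name for n in free
                if remaining * n.rate_mb_s / rate_sum > cap}
        if not over:
            # No node exceeds the cap: finish with a proportional split.
            for n in free:
                alloc[n.name] = remaining * n.rate_mb_s / rate_sum
            break
        # Pin the offending nodes at the cap and redistribute the rest.
        for name in over:
            alloc[name] = cap
            remaining -= cap
        free = [n for n in free if n.name not in over]
    return alloc


def makespan(nodes, alloc):
    """Completion time when each node processes only its local data."""
    return max(alloc[n.name] / n.rate_mb_s for n in nodes)


nodes = [Node("osd-a", 10.0), Node("osd-b", 20.0), Node("osd-c", 70.0)]
for label, alloc in [
    ("equal", equal_split(nodes, 1000.0)),
    ("proportional", capped_proportional_split(nodes, 1000.0)),
    ("proportional, cap 50%", capped_proportional_split(nodes, 1000.0, 0.5)),
]:
    print(label, round(makespan(nodes, alloc), 2))
```

In this toy setting the even split is limited by the slowest node, while the rate-aware split shortens the completion time. Tightening the cap gives back part of that gain in exchange for a more balanced placement, which is the tension the thesis's optimization addresses.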
Keywords: distributed system, Ceph, heterogeneity, data placement, load balancing