Font Size: a A A

Load-balanced Placement Strategy For Big Data Storage

Posted on:2016-10-01Degree:MasterType:Thesis
Country:ChinaCandidate:Y J HeFull Text:PDF
GTID:2308330476953484Subject:Software engineering
Abstract/Summary:PDF Full Text Request
With the growth of big data, users have more demands on data storage, e.g. available storage volume, system scalability, reliability, performance and so on. Traditional centralized storage systems are unsustainable to serve billions of gigabytes of data, due to their costs and infrastructures. As a result, distributed storage systems are brought into being.Data placement strategy, deciding which device to hold the data and its replicas,and the way to retrieve them, has been one of the key parts for designing and implementing distributed storage systems. CRUSH(Controlled Replication Under Scalable Hashing) is an effective and highly scalable pseudo-random data distribution function designed for Ceph, a distributed storage system. However, we found that it failed to achieve system load-balance, which can be illustrated from the following two aspects.Firstly, CRUSH does not produce balanced data distribution among storage devices in the system. We observed up to 20% gap of device space utilization. The devices with more data need to handle more read/write requests, leading to imbalanced load and limiting overall performance due to these devices becoming bottlenecks, especially for highly concurrent accesses to large number of data. Secondly, CRUSH provides replica placement rules for evenly distributing replicas of the same data among different failure domains. But the limited rule primitives fail to produce ?ne granularity of control on replica placement. It only allows users to de?ne rules for replicas distributing among failure domains in the same level, instead of multiple ones.In this thesis, we propose two solutions for the above problems. For unbalanced data distribution, we design and implement B-CRUSH algorithm, a balanced data placement strategy based on CRUSH. It intends to relief the situation that devices of more data bearing with heavier load becoming system bottleneck through three steps,adding a new hashing function, designing a new selection algorithm, and introducing an adaptive module for data predistribution process. For limited replica replacement rules, we introduce a new primitive keyword ignore, design and implement new rules for replicas distributing across multiple failure domain levels. This solution can provide ?ne granularity of control on replica placement through de?ning strategies replica by replica, and ensure that current replica will not be placed on the same failure domain as previous ones.Afterwards, we evaluate the two solutions by experiments. According to the results, B-CRUSH algorithm improves balance of data distribution and brings about 10%performance enhancements for random read compared to current CRUSH algorithm.The new placement rules can evenly distribute replicas of the same data across multiple failure domain levels, and provide users with more choices on replica placement strategies.KEY WORDS: big data, distributed storage, data placement, load-...
Keywords/Search Tags:big data, distributed storage, data placement, loadbalance
PDF Full Text Request
Related items