Load-balanced Placement Strategy For Big Data Storage

Posted on:2016-10-01

Degree:Master

Type:Thesis

Country:China

Candidate:Y J He

Full Text:PDF

GTID:2308330476953484

Subject:Software engineering

Abstract/Summary:

PDF Full Text Request

With the growth of big data, users have more demands on data storage, e.g. available storage volume, system scalability, reliability, performance and so on. Traditional centralized storage systems are unsustainable to serve billions of gigabytes of data, due to their costs and infrastructures. As a result, distributed storage systems are brought into being.Data placement strategy, deciding which device to hold the data and its replicas,and the way to retrieve them, has been one of the key parts for designing and implementing distributed storage systems. CRUSH(Controlled Replication Under Scalable Hashing) is an effective and highly scalable pseudo-random data distribution function designed for Ceph, a distributed storage system. However, we found that it failed to achieve system load-balance, which can be illustrated from the following two aspects.Firstly, CRUSH does not produce balanced data distribution among storage devices in the system. We observed up to 20% gap of device space utilization. The devices with more data need to handle more read/write requests, leading to imbalanced load and limiting overall performance due to these devices becoming bottlenecks, especially for highly concurrent accesses to large number of data. Secondly, CRUSH provides replica placement rules for evenly distributing replicas of the same data among different failure domains. But the limited rule primitives fail to produce ?ne granularity of control on replica placement. It only allows users to de?ne rules for replicas distributing among failure domains in the same level, instead of multiple ones.In this thesis, we propose two solutions for the above problems. For unbalanced data distribution, we design and implement B-CRUSH algorithm, a balanced data placement strategy based on CRUSH. It intends to relief the situation that devices of more data bearing with heavier load becoming system bottleneck through three steps,adding a new hashing function, designing a new selection algorithm, and introducing an adaptive module for data predistribution process. For limited replica replacement rules, we introduce a new primitive keyword ignore, design and implement new rules for replicas distributing across multiple failure domain levels. This solution can provide ?ne granularity of control on replica placement through de?ning strategies replica by replica, and ensure that current replica will not be placed on the same failure domain as previous ones.Afterwards, we evaluate the two solutions by experiments. According to the results, B-CRUSH algorithm improves balance of data distribution and brings about 10%performance enhancements for random read compared to current CRUSH algorithm.The new placement rules can evenly distribute replicas of the same data across multiple failure domain levels, and provide users with more choices on replica placement strategies.KEY WORDS: big data, distributed storage, data placement, load-...

Keywords/Search Tags:

big data, distributed storage, data placement, loadbalance

PDF Full Text Request

Related items

1	Research On Theory And Methods Of Data Placement Optimization In Distributed Storage
2	Research On Data Placement For Distributed Storage Systems With Heterogeneous Resources
3	The Research On Data Placement And Schedule Scheme In Large-Scale Multimedia Storage System
4	Research On Strategy Of Data Replica Placement For Geo-distributed Cloud Storage Services
5	Research Of Network Storage Technology And Data Placement Based On Iscsi Protocol
6	Research Of Network Storage Technology And Data Placement Based On ISCSI Protocol
7	Research And Design Of A Network-Based Distributed Storage System
8	The Research On Data Replica Placement In Geo-distributed Cloud Storage System
9	Data Placement Strategy For Data-intensive Applications In Cloud Storage System
10	Research On Key Technologies Of Distributed Storage In Cloud Computing