Research On Data Replication In Data Grid

Posted on: 2009-11-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Qaisar Rasool
Full Text: PDF
GTID: 1118360278961891
Subject: Computer software and theory
Abstract/Summary:
Grid computing is a wide-area distributed computing environment that involves large-scale resource sharing among collaborations, often referred to as Virtual Organizations, of individuals or institutes located in geographically dispersed areas. Data grids are grid infrastructures with specific needs to transfer and manage massive amounts of scientific data for analysis purposes. Scientific applications that deal with huge amounts of data, and that are potential beneficiaries of Data Grid technology, include high energy physics, astronomy, bioinformatics, and earth sciences.

In this thesis we first present a review of current and past research on replication techniques. Specifically, we focus on data replica placement policies proposed for use in the data grid environment. For each replica placement technique, we consider its methodology, objective and results. These strategies differ in the assumptions made regarding the underlying grid topology, user request patterns, dataset sizes and their distribution, and storage node capacities. Other distinguishing features include the data request path and the manner in which replicas are placed on the Grid nodes. Given the diverse and varying characteristics of tree and other architectures, it is difficult to create a common ground for comparing different replication strategies. We therefore classify the topologies into tree and hybrid/P2P architectures and analyze the impact of replica placement policies in each one. A hybrid topology can carry features of both tree and P2P architectures and thus can be used to improve the performance of a replication strategy.

In a multi-tier Data Grid, there is a single source of data, and it is not feasible for one server to fulfill the requests of all the users in the Data Grid. Therefore, data must be replicated to other selected nodes in order to reduce the burden on the master server.
Replication also facilitates load balancing and improves reliability by creating multiple data copies. Transferring a file from a server to a client consumes a huge amount of bandwidth and incurs storage cost. One possible way to reduce the access latency and bandwidth consumption is to replicate data across different sites. However, files in the Grid are large, on the order of 500 MB to 1 GB, so replication to every site is not feasible. Among the many challenges, one, which is also the focus of this thesis, is to find the candidate sites where replicas can be hosted. One way to tackle this problem is to place replicas at sites that satisfy a large number of requests. Another approach is to place replicas at sites that optimize the transfer time. Data-intensive tasks in scientific applications usually take a long time, and therefore considering the storage capacity of sites and their current storage load is also important. We can manage Grid storage resources effectively if we place a replica of a file on a site that has less storage load than its neighbors and whose number of requests for the file is above average. In the thesis, both storage status and file requests are considered before placing a replica at a site. Our approach is dynamic, so it adapts to changes in user behavior and system dynamics.

The main objective of replication in a Grid environment is to enhance data availability by placing replicas in the proximity of users so that user-perceived response time is minimized. For a hierarchical Data Grid, replicas are usually placed in either a top-down or a bottom-up way. We put forward a Two-way replica placement scheme that places replicas of the most popular files close to the requesting clients and less popular files a tier below the Data Grid root. We allow data requests to be serviced by sibling nodes as well as by the parent.

Another interesting issue related to file replica placement in Data Grid is load sharing among replica servers.
Most current techniques select, as candidate nodes for replica placement, those nodes that have the maximum number of access requests for files. However, selecting candidate nodes based on access load and storage load together may result in a more effective load-balancing replication strategy. We propose an approach called Fair-share Replication (FSR) that takes into account both the number of requests and the storage load on the candidate sites before placing any replica in a hierarchical Data Grid.

The simulations of the proposed techniques were carried out using GridNet, which was developed for evaluating replication strategies in Data Grid. The Two-way strategy and Fair-share Replication were tested using parameters from High Energy Physics experiments, and the performance results demonstrate their effectiveness across diverse Data Grid setups in terms of user access patterns, dataset sizes, and server storage capacity constraints.
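
The placement rule stated in the abstract (host a replica on a site whose storage load is lower than that of its neighbors and whose request count for the file is above average) can be illustrated with a minimal Python sketch. This is not the thesis's FSR algorithm or GridNet code. The `Site` class, the `candidate_sites` function, the treatment of "neighbors" as sibling sites, and the comparison against the neighbors' mean load are all assumptions made here for illustration.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Site:
    """Hypothetical storage site; field names are illustrative assumptions."""
    name: str
    capacity_mb: float
    used_mb: float
    requests: dict = field(default_factory=dict)  # file name -> request count

    @property
    def storage_load(self) -> float:
        # Fraction of the site's capacity currently in use.
        return self.used_mb / self.capacity_mb


def candidate_sites(siblings, file_name):
    """Return sites that pass both the request test and the storage-load test.

    Assumption: "neighbors" are the other sites in the same sibling group,
    and "less storage load than its neighbors" means below their mean load.
    """
    if not siblings:
        return []
    avg_requests = mean(s.requests.get(file_name, 0) for s in siblings)
    chosen = []
    for site in siblings:
        others = [s for s in siblings if s is not site]
        if not others:
            continue  # no neighbors to compare against
        neighbor_load = mean(s.storage_load for s in others)
        above_average = site.requests.get(file_name, 0) > avg_requests
        lightly_loaded = site.storage_load < neighbor_load
        if above_average and lightly_loaded:
            chosen.append(site)
    return chosen


sites = [
    Site("A", 10_000, 9_000, {"f1": 50}),  # many requests, but nearly full
    Site("B", 10_000, 2_000, {"f1": 40}),  # many requests, lightly loaded
    Site("C", 10_000, 5_000, {"f1": 5}),   # few requests
]
selected = [s.name for s in candidate_sites(sites, "f1")]
print(selected)
```

In this made-up example only site B passes both tests: site A receives the most requests but is more heavily loaded than its neighbors, and site C's request count is below the group average.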
Keywords/Search Tags: Data replication, Data Grid, Replica placement