
Research On Some Key Techniques Of Data Grid

Posted on: 2011-03-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J H Jiang
Full Text: PDF
GTID: 1118360305453474
Subject: Computer system architecture

Abstract/Summary:
Data grid is one of the most important branches of grid computing and a heavily studied topic in the field. A data grid is a grid whose central functions are data access and data management, and data-intensive jobs are its most important workload. Improving the performance of data-intensive jobs is therefore the key problem in data grid research. Data management, device plug-and-play, and demand-driven dynamic scalability are likewise crucial to that performance.

Data grids have been studied widely. In job scheduling, most scholars have focused on strategies that dispatch a data-intensive job to the grid node that already holds its required data files. In replica replacement, most work adopts strategies based on the access history of each individual file at the local grid node. In spatial data grids, device plug-and-play and dynamic scalability functions still need detailed design and analysis. These four aspects of the data grid are the main research fields of this thesis.

The main contributions and innovative points of this thesis are outlined below:

1. Replica replacement. Replica creation is the key step in replica management, and replica replacement occurs because storage devices have limited capacity. An effective and efficient replacement strategy reduces the frequency of replica creation and thereby improves data grid performance, which makes replica replacement an important research field. Reviewing existing strategies, we found two issues that should be solved. The first is that only the access history of each data file at its local grid node is considered, while the associations among data files are neglected.
The second is that only the accesses to each data file at its local node are considered, while the global situation is ignored. For the first issue, an associated replica replacement algorithm based on the Apriori approach is proposed. The algorithm first mines association patterns from the access history at each grid node, then generates replica replacement decision rules from those patterns; simulation results show the algorithm to be effective. For the second issue, an LFU-Min algorithm is proposed. LFU-Min takes the access history of every data file across the whole data grid and replaces the file that occurs least frequently grid-wide. Compared with the traditional LFU algorithm in OptorSim, simulation again shows it to be more effective.

2. Data-intensive job scheduling. The scheduling strategy for data-intensive jobs is crucial to data grid performance. After analyzing the characteristics of data-intensive jobs, current research on their scheduling is reviewed and two issues are identified. The first is that the Gfarm data grid requires an efficient and effective job management and scheduling strategy. The second is that the potential behavior of jobs in each node's waiting queue should be treated as a factor influencing access cost, since frequent replica replacement changes the replica distribution in each node's storage. For the first issue, a batch-mode, data-aware job scheduling algorithm for data-intensive jobs is proposed after analyzing the Gfarm data grid and LSF; it adopts two different dispatching strategies depending on the size of the batch of data-intensive jobs.
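The LFU-Min idea can be sketched as follows. This is a minimal illustration, not the dissertation's implementation; the class and method names are hypothetical, and the "global" view is modeled as a simple sum of per-node access counts.

```python
from collections import Counter

class LFUMinReplacer:
    """Sketch of a globally informed LFU replacement policy.

    Unlike plain LFU, which counts accesses only at the local node,
    this variant sums the access counts reported by every node and
    evicts the local replica that is least frequently used grid-wide.
    """

    def __init__(self):
        # per-node access histories: node_id -> Counter of file accesses
        self.node_counts = {}

    def record_access(self, node_id, file_id):
        self.node_counts.setdefault(node_id, Counter())[file_id] += 1

    def global_count(self, file_id):
        # total accesses to file_id summed over all nodes
        return sum(c[file_id] for c in self.node_counts.values())

    def choose_victim(self, local_replicas):
        # evict the local replica with the lowest global access count
        return min(local_replicas, key=self.global_count)

replacer = LFUMinReplacer()
for node, f in [("n1", "a"), ("n1", "b"), ("n2", "a"), ("n2", "a"), ("n3", "b")]:
    replacer.record_access(node, f)
# "a" is accessed 3 times grid-wide, "b" only twice, so "b" is evicted
print(replacer.choose_victim(["a", "b"]))  # -> b
```

A plain local LFU at node n1 would see one access each for "a" and "b" and could evict either; the global view breaks the tie in favor of keeping the grid-wide hot file.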
For the second issue, when replica replacement occurs frequently, the replica distribution at each grid node differs between scheduling time and running time, and the potential activities of the data-intensive jobs in the waiting queue are what drive that replacement. To obtain a better access-cost prediction, a job scheduling algorithm with potential behaviors, named ACPB, is proposed in this thesis. To facilitate communication among grid nodes, a simple decentralized replica-report feedback mechanism is provided in the simulation environment. Compared with the traditional access-cost-based scheduling strategy, ACPB is more effective.

3. Plug-and-play in spatial data grids. The spatial data grid is first introduced, existing plug-and-play protocols are then surveyed, and finally a plug-and-play protocol architecture and design is presented in detail. The proposed protocol comprises a device dynamic online/offline protocol, a device access control protocol, a data dynamic online/offline protocol, and a data convergence protocol. Compared with other device plug-and-play protocols, ours extends the traditional design with the data dynamic online, data dynamic offline, and data convergence protocols. In particular, the data convergence protocol gives an effective and efficient way to improve spatial data grid performance.

4. Dynamic scalability in spatial data grids. After surveying dynamic scalability protocols in grid computing environments, we argue that analysts' behavior should be analyzed in order to make effective and efficient scalability decisions. The whole protocol is divided into three layers: an information collection layer, a decision-making layer, and a decision execution layer. To implement the functions of these three layers, an information collection protocol, a decision-making strategy, and a decision execution protocol are designed in detail.
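The three-layer split might be organized as below. This is only a structural sketch under assumed details: the load metric (waiting-queue length), the thresholds, and all class names are illustrative inventions, not the dissertation's protocol.

```python
from collections import namedtuple

# a grid node with a name and a waiting queue of jobs (assumed model)
Node = namedtuple("Node", ["name", "queue"])

class InformationCollector:
    """Information collection layer: gathers a load metric from each node."""
    def collect(self, nodes):
        # hypothetical metric: current waiting-queue length per node
        return {n.name: len(n.queue) for n in nodes}

class DecisionMaker:
    """Decision-making layer: maps collected metrics to a scaling action."""
    def __init__(self, high=10, low=2):
        self.high, self.low = high, low

    def decide(self, metrics):
        avg = sum(metrics.values()) / len(metrics)
        if avg > self.high:
            return "scale-out"  # bring additional nodes online
        if avg < self.low:
            return "scale-in"   # take underused nodes offline
        return "hold"

class DecisionExecutor:
    """Decision execution layer: carries out the chosen action."""
    def execute(self, action):
        return f"executed: {action}"

nodes = [Node("n1", ["j"] * 12), Node("n2", ["j"] * 14)]
metrics = InformationCollector().collect(nodes)
action = DecisionMaker().decide(metrics)
print(DecisionExecutor().execute(action))  # -> executed: scale-out
```

Keeping the three layers as separate objects mirrors the protocol's layering: the collection and execution layers can be swapped out while the decision strategy stays fixed, and vice versa.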
The decision-making strategy is crucial to the dynamic scalability protocol. To make an informed decision, existing decision conditions and their decision rules are taken as a training set from which classification rules are generated. An incoming tuple, with its relevant attributes, is classified against these rules; once the tuple is labeled with a class, the corresponding execution rule is performed.

Finally, the innovative points of this thesis are summarized and future work is outlined.
Keywords/Search Tags:Grid Computing, Data Grid, Replica Replacement, Job Scheduling, Plug and Play, Dynamic Scalability