Font Size: a A A

Research On Fault Tolerance Techniques In Desktop Grids

Posted on:2014-02-23Degree:DoctorType:Dissertation
Country:ChinaCandidate:D P WangFull Text:PDF
GTID:1228330398959104Subject:Computer system architecture
Abstract/Summary:PDF Full Text Request
Until now, the development of distributed computing has led to several large-scale distributed computing models with significant influence. As one of these computing models, grid computing aggregates computing resources, which are geographically dispersed and belong to different domains, into single virtual supercomputer with powerful computing capabilities. Desktop grid is a special kind of grid which targets desktop computing resources. Volunteer computing is a branch of desktop grid, and existing volunteer computing projects have gained rival computing capabilities to contemporary supercomputer. The continuous improvement of the performance of network and desktop computer provides a solid foundation for the further development of desktop grid.In desktop grids, computing resources are mainly non-dedicated and resource availability is codetermined by resource state and resource donating strategy. Compared with dedicated resources, resources in desktop grids have rather short availability intervals. Desktop grid usually adopts fault tolerance measures to ensure task completion and efficient utilization of resource. Task replication and checkpointing technique are two commonly used measures. In fault tolerance measures, there exist some factors having a big impact on system performance, such as replication number and checkpointing strategy.Under the financial support from "the National High Technology Research and Development Program of China" and "the National Natural Science Foundation of China", this work focuses on the researches of related fault tolerance measures and aims to improve the efficiency of resource utilization in desktop grids.The main contributions of this work are as follows.Firstly, it proposes a method which estimates deadline-miss probabilities of tasks based on random sample. In desktop grids, the measure of task replication is adopted to satisfy deadlines of tasks. Dynamically replicating tasks according to deadline-miss probabilities may satisfy task deadline and achieve high efficiency of resource utilization at the same time, which raises requirement of estimating deadline-miss probabilities of tasks. Under the assumption that the loss of available time caused by resource failures can be ignored, whether or not a task misses its deadline is determined by the available time of the resource which hosts the task before its deadline. Through experimental analysis on availability trace data, we find a randomly sampling method. Random sample generated with this method reflects the distribution of period availability time of host very well. Based on such random samples, we use a non-parametric method to estimate deadline-miss probabilities of tasks. Results of simulation experiments show that this estimating method has high prediction accuracy and outperforms other methods in dynamical task replication.Secondly, it presents an algorithm approximating optimal checkpointing strategy for general failure distribution. In desktop grids, hosts follow diverse failure distributions. So, checkpointing strategy used in such context should consider general failure distribution. We theoretically prove that, to maximize the efficiency of resource utilization, two consecutive checkpoint interval lengths must satisfy a particular condition. Based on this property and greedy idea, we give an algorithm approximating optimal checkpointing strategy. This algorithm is equivalent to periodic checkpointing strategy when interfailure intervals follow exponential distributions, and is superior to periodic checkpointing strategy when interfailure intervals follow other distributions.Thirdly, it puts forward an algorithm which approximates optimal checkpointing strategy based on sample of interfailure intervals. It is difficult to get probability function of failure distribution for most hosts in desktop grids. Checkpointing strategy based on probability function can not be used in such context. So, we propose an algorithm based on sample of interfailure intervals. This algorithm adopts the idea of checkpointing frequency, and has polynomial time complexity. In trace driven simulations, this algorithm outperforms periodic strategy in terms of wasted time.Above researches only involve some circumstances in desktop grids. Based on this work, we will consider other circumstances in our future work.
Keywords/Search Tags:desktop grids, volunteer computing, fault tolerance, checkpointingstrategies
PDF Full Text Request
Related items