Research On Dependable Computing Oriented Distributed Fault Detection System

Posted on: 2013-08-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H W Lu
Full Text: PDF
GTID: 1228330362473596
Subject: Computer Science and Technology
Abstract/Summary:
With the continuing development of computer software, hardware, and network technologies, service computing modes keep changing. Driven by the rapid growth of the Internet, open large-scale distributed computing systems such as WAN P2P, grid, and cloud systems have emerged in recent years. Many service systems that are closely tied to economic production and social life rely on these network computing systems, which greatly promote economic development and social progress. When failures or quality of service (QoS) degradations arise in these systems, they cause considerable inconvenience and economic loss. A key issue for the application and development of distributed computing technology is therefore how to guarantee and improve the dependability of computing systems, so that computing services with high availability and reliability can be offered continuously. Industry and academia have accordingly launched related research, investing substantial manpower and material resources. An important means of guaranteeing and improving a system's dependability is to correctly control and tolerate the faults in the system, and an effective fault detection approach based on the states of system entities is an important basis for this. Fault detection includes not only the correct recognition of faults but also effective monitoring of the entities being detected.

This thesis studies several key problems of dependable computing oriented fault detection in large-scale distributed network computing environments. Based on a thorough summary and in-depth discussion of existing related technologies and research results, the architecture of a distributed fault detection system is put forward.
A series of related algorithms, including a distributed self-organizing entity monitoring algorithm, a state message dissemination algorithm, a detection system survivability algorithm, and a fault recognition algorithm, are presented and analyzed. Finally, a self-organizing distributed fault detection prototype system is implemented.

The concrete work and innovations of this thesis include the following aspects.

① Under the dependable computing framework, the problems of fault detection in distributed computing systems are clarified. On this basis, and targeting the application characteristics of open network distributed systems and their fault tolerance requirements, a fault detection system architecture decoupled from the upper-level strategy is designed. An overall framework for distributed fault detection is also built, comprising modules such as state data collection, system entity monitoring, state information dissemination, and fault recognition.

② Traditional centralized or hierarchical monitoring systems cannot adapt well to the characteristics of open large-scale distributed network computing environments, such as the wide distribution range of nodes, the large number of nodes participating in computing, unstable message transmission delay, and uncertain service dependence. Based on the idea of self-organizing networks, a neighbor monitoring method based on the distance between system entities is put forward. This method can effectively reduce the delay of mutual monitoring among neighbors and thus improve monitoring efficiency.

③ Of the two common message transmission methods in networks, flooding and unicasting, the former may cause high network overhead, while the latter may incur high system delay. The advantages and disadvantages of the traditional Gossip protocol in fault detection are analyzed. Based on the idea of the Gossip protocol, a directional message dissemination algorithm, D-Gossip, is designed. D-Gossip reduces the message dissemination uncertainty of traditional Gossip protocols.
It effectively improves the efficiency and coverage of message dissemination and reduces redundant information in the system.

④ In the distributed detection system, nodes are peers of one another, and monitoring domains are formed through self-organization. These two factors give rise to critical nodes in monitoring domains. When such critical nodes leave the system, a large number of nodes are left unmonitored. This leads to partial failure of the distributed detection system and reduces the survivability of the fault detection function. The problem is particularly pronounced in high-churn distributed environments. This thesis therefore designs a series of methods, including adaptive detection, active detection, and neutralization, which effectively address the survivability of the distributed fault detection system.

⑤ For computing services in large-scale distributed systems, the fault sample size is limited, so traditional methods face difficulties in fault classification and identification. The support vector machine (SVM) is introduced into the distributed fault detection system, providing a new approach to fault classification and identification. This thesis studies the key problems of SVM methods in fault classification and identification and gives the basic implementation steps of SVM-based fault identification. Because dependable computing oriented fault detection is a typical multi-value classification problem, the standard SVM method cannot be applied to it directly. A multi-value classification algorithm based on DDAG (Decision Directed Acyclic Graph) is therefore designed. A multi-fault classifier model is built, and its correctness is verified by means of fault injection.

⑥ A dependable computing oriented distributed fault detection prototype system is presented in this thesis. Some key implementation processes of each component in the prototype system are described.
In addition, a series of experiments are performed on the prototype system, and these experiments verify the functions of each component.

In conclusion, the thesis studies and summarizes some key problems of existing fault detection technologies in large-scale distributed dependable computing application environments. A series of algorithms are designed, and theoretical analysis and experimental results demonstrate their correctness. These algorithms can implement dependable computing oriented fault detection for large-scale distributed computing application environments, thus providing a solid basis for decisions on guaranteeing system dependability.
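The DDAG-based multi-value classification mentioned in ⑤ can be illustrated with a minimal sketch. It is not the thesis implementation: the fault class names, the two-dimensional feature vectors, and the helper names (`train_pairwise`, `ddag_predict`) are all hypothetical, and a nearest-centroid rule stands in for each pairwise binary SVM so that the example runs with the standard library only. What it shows is the general DDAG idea: one binary classifier per pair of classes, and at each step the classifier for the first and last remaining candidates eliminates one of them, so k classes need only k-1 evaluations.

```python
from itertools import combinations
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Vector = Sequence[float]
# A pairwise classifier for classes (a, b) returns whichever of a or b it prefers.
PairwiseClassifier = Callable[[Vector], str]


def centroid(points: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def sq_dist(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def train_pairwise(
    samples: Dict[str, List[Vector]]
) -> Dict[Tuple[str, str], PairwiseClassifier]:
    """Build one binary classifier per unordered pair of classes.

    ASSUMPTION: nearest-centroid is a stand-in; a real system would
    train a binary SVM for each pair here.
    """
    centroids = {label: centroid(pts) for label, pts in samples.items()}
    classifiers: Dict[Tuple[str, str], PairwiseClassifier] = {}
    for a, b in combinations(sorted(samples), 2):
        def clf(x: Vector, a: str = a, b: str = b) -> str:
            return a if sq_dist(x, centroids[a]) <= sq_dist(x, centroids[b]) else b
        classifiers[(a, b)] = clf
    return classifiers


def ddag_predict(
    x: Vector,
    labels: Iterable[str],
    classifiers: Dict[Tuple[str, str], PairwiseClassifier],
) -> str:
    """Walk the decision DAG: test first vs. last candidate, drop the loser."""
    candidates = sorted(labels)
    while len(candidates) > 1:
        a, b = candidates[0], candidates[-1]  # a < b, so (a, b) is a valid key
        if classifiers[(a, b)](x) == a:
            candidates.pop()      # b is eliminated
        else:
            candidates.pop(0)     # a is eliminated
    return candidates[0]


# Hypothetical training data: (cpu_load, message_delay) per observed state.
samples = {
    "normal": [(0.2, 0.1), (0.3, 0.2)],
    "cpu_overload": [(0.9, 0.2), (0.95, 0.3)],
    "network_delay": [(0.3, 0.9), (0.2, 0.8)],
    "crash": [(0.0, 0.0), (0.0, 0.05)],
}
classifiers = train_pairwise(samples)

for observation in [(0.88, 0.25), (0.25, 0.8)]:
    print(observation, ddag_predict(observation, samples, classifiers))
```

With four classes the sketch builds six pairwise classifiers but consults only three per prediction. Replacing the nearest-centroid stand-in with trained binary SVMs would leave `ddag_predict` unchanged.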
Keywords/Search Tags: Dependable Computing, Large-Scale Distributed Systems, Fault Detection, Self-Organized Management, Support Vector Machine