Research On Dependable Computing Oriented Distributed Fault Detection System

Posted on:2013-08-12

Degree:Doctor

Type:Dissertation

Country:China

Candidate:H W Lu

Full Text:PDF

GTID:1228330362473596

Subject:Computer Science and Technology

Abstract/Summary:

With the continuous development of computer software, hardware, and network technologies, service computing modes keep changing. Driven by the rapid development of the Internet, open large-scale distributed computing systems such as WAN P2P, grid, and cloud systems have emerged in recent years. Many service systems that are closely linked to economic production and social life rely on these network computing systems, which greatly promote economic development and social progress. Once failures or quality of service (QoS) degradations arise in these systems, they cause considerable inconvenience and economic loss. Therefore, one of the key issues in the application and development of distributed computing technology is how to guarantee and improve the dependability of computing systems, so that computing services with high availability and reliability can be offered continuously. Industrial and academic communities have accordingly launched related research, investing substantial manpower and material resources. An important means of guaranteeing and improving a system's dependability is to correctly control and tolerate the faults in the system, and an effective fault detection approach based on the states of system entities is an important basis for this. Fault detection includes not only the correct recognition of faults but also the effective monitoring of the entities being detected.
This thesis studies several key problems of dependable computing oriented fault detection in large-scale distributed network computing environments. Based on a thorough summary and in-depth discussion of existing related technologies and research achievements, the architecture of a distributed fault detection system is put forward.
A series of related algorithms, including a distributed self-organizing entity monitoring algorithm, a state message dissemination algorithm, a detection system survivability algorithm, and a fault recognition algorithm, are presented and analyzed. Finally, a self-organizing distributed fault detection prototype system is implemented.
The concrete work and innovations of this thesis include the following aspects.
① Under the dependable computing framework, the problems of fault detection in distributed computing systems are clarified. Based on this clarification, and targeting the application characteristics of open network distributed systems and their fault tolerance requirements, a fault detection system architecture decoupled from the upper-layer strategy is designed. An overall framework for distributed fault detection is also built, including modules such as state data collection, system entity monitoring, state information dissemination, and fault recognition.
② Traditional centralized or hierarchical monitoring systems cannot adapt well to the characteristics of open large-scale distributed network computing environments, such as the wide distribution of nodes, the large number of nodes participating in computing, unstable message transmission delay, and uncertain service dependence. Based on the idea of self-organizing networks, a neighbor monitoring method based on the distance between system entities is put forward. This method can effectively reduce the delay of mutual monitoring among neighbors and thus improve monitoring efficiency.
③ Of the two common message transmission methods in networks, flooding and unicasting, the former may cause high network overhead, while the latter may cause high system delay. The advantages and disadvantages of the traditional Gossip protocol in fault detection are analyzed. Based on the idea of the Gossip protocol, a directional message dissemination algorithm, D-Gossip, is designed. D-Gossip reduces the message dissemination uncertainty of traditional Gossip protocols.
It effectively improves the efficiency and coverage of message dissemination and reduces redundant information in the system.
④ In the distributed detection system, nodes are peers of one another, and monitoring domains are formed in a self-organizing manner. These two factors give rise to critical nodes in monitoring domains. Once these critical nodes leave the system, a large number of nodes are left unmonitored. This leads to partial failure of the distributed detection system and thus reduces the survivability of the fault detection function. The problem is particularly pronounced in high-churn distributed environments. Therefore, this thesis designs a series of methods, including adaptive detection, active detection, and neutralization, which effectively address the survivability of the distributed fault detection system.
⑤ For computing services in large-scale distributed systems, the fault sample size is limited, so traditional methods face difficulties in fault classification and identification. The support vector machine (SVM) is introduced into the distributed fault detection system, providing a new approach to fault classification and identification. This thesis studies the key problems of SVM methods in fault classification and identification and gives the basic implementation steps of SVM-based fault identification. The standard SVM method cannot be directly applied to dependable computing oriented fault detection, which is a typical multi-class classification problem. Therefore, a multi-class classification algorithm based on DDAG (Decision Directed Acyclic Graph) is designed. A multi-fault classifier model is built, and its correctness is verified by means of fault injection.
⑥ A dependable computing oriented distributed fault detection prototype system is presented in this thesis. Some key implementation processes of each component in the prototype system are described.
In addition, a series of experiments is performed on the prototype system of the aforementioned detection system. The experiments verify the functions of each component.
In conclusion, the thesis studies and summarizes some key problems of existing fault detection technologies in large-scale distributed dependable computing application environments. A series of algorithms is designed, and the theoretical analysis and experimental results demonstrate their correctness. These algorithms can implement dependable computing oriented fault detection for large-scale distributed computing application environments and thus provide a solid basis for decisions on guaranteeing system dependability.
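The DDAG-based multi-class fault classification mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it only shows the general DDAG evaluation idea, in which one pairwise classifier is consulted per step and the losing class is eliminated, so K classes need K-1 decisions. To stay self-contained, a nearest-centroid rule stands in for each pairwise SVM, and the fault labels, the two features (CPU load and normalized delay), and the sample values are all hypothetical.

```python
from itertools import combinations


def centroid(points):
    """Mean vector of a non-empty list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_pairwise(samples):
    """Build one binary decision function per pair of classes.

    ASSUMPTION: a nearest-centroid rule is used as a stand-in for the
    pairwise SVMs; a real system would train an SVM for each pair.
    """
    centroids = {label: centroid(pts) for label, pts in samples.items()}
    classifiers = {}
    for a, b in combinations(sorted(samples), 2):
        def decide(x, a=a, b=b):
            da, db = sq_dist(x, centroids[a]), sq_dist(x, centroids[b])
            return a if da <= db else b
        classifiers[(a, b)] = decide
    return classifiers


def ddag_predict(classifiers, labels, x):
    """Walk the decision DAG: test first vs. last candidate, drop the loser."""
    candidates = sorted(labels)
    while len(candidates) > 1:
        first, last = candidates[0], candidates[-1]
        winner = classifiers[(first, last)](x)
        if winner == first:
            candidates.pop()       # 'last' is rejected at this node
        else:
            candidates.pop(0)      # 'first' is rejected at this node
    return candidates[0]


# Hypothetical training data: [cpu_load, normalized_delay] per sample.
TRAINING = {
    "cpu_overload": [[0.95, 0.10], [0.90, 0.15]],
    "network_delay": [[0.30, 0.90], [0.35, 0.85]],
    "normal": [[0.30, 0.10], [0.25, 0.15]],
}

CLASSIFIERS = train_pairwise(TRAINING)
```

Calling `ddag_predict(CLASSIFIERS, TRAINING.keys(), [0.9, 0.2])` consults two of the three pairwise classifiers and returns a single label. Replacing `train_pairwise` with a routine that trains real pairwise SVMs would leave `ddag_predict` unchanged, which is the main appeal of the DAG structure.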

Keywords/Search Tags:

Dependable Computing, Large-Scale Distributed Systems, Fault Detection, Self-Organized Management, Support Vector Machine

Related items

1	Study On The Methods Of Fault Detection And Prediction In Non-linear Industrial Processes Based On Support Vector Machine
2	Research On Dependable Monitoring In Large-scale Distributed Systems
3	Researches On Key Issues Of Dependable Middleware Technology In Object-Oriented Distributed Environment
4	Adaptive Scheduling Using Support Vector Machine on Heterogeneous Distributed Systems
5	Failure-Aware Reconfigurable Distributed Virtual Machine for dependable and high productivity computing
6	Research Of Fault Diagnosis Based On Support Vector Machine
7	Fault Diagnosis Method Based On Support Vector Machine
8	Design Of Support Vector Machine Accelerator Based On Reconfigurable Computing Platform
9	Study And Application On Support Vector Machine Classification
10	Research Of Cascading Support Vector Machines Based On Spark