Research On Log-based Trust Management In Large-scale Distributed Software System

Posted on: 2012-09-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Rao
Full Text: PDF
GTID: 1268330392473811
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid progress of computing and communication technologies, information services have become widely used in everyday life. Such services are expected to provide trusted service continuously, 7×24 hours a week. Their infrastructure is usually a distributed system built from a large amount of computing resources, which handles a large number of user requests and stores a large amount of user data. When building a large-scale trusted distributed system, the complexity of system behavior and of the running environment increases dramatically for two reasons: software bugs are difficult to eliminate, and it is difficult to anticipate every situation that could be encountered at runtime. A direct result of this complexity is that the system must keep providing trusted service while its constituent components keep failing, which seriously threatens the 7×24 high-trust service requirement.

Monitoring-Action is a widely used runtime trust guarantee mechanism. It uses fault localization techniques to accurately diagnose the root cause of a system failure, and then takes corresponding actions to guarantee the runtime trust properties of the system. Event logs are widely used in fault localization as an effective abstraction of system behavior. There are two main ways to localize a fault: use a fault model to identify a known fault, or use a normal behavior model to identify abnormal behavior.
However, because of the complexity of system behavior and the system environment, localizing faults with event logs remains challenging. A large number of noisy logs appear during fault modeling, which can lead to false positives and false negatives in fault localization. It is also difficult to extract a normal behavior model in a distributed environment, and low detection efficiency makes it difficult to detect abnormal system behavior at runtime.

To address these challenges, we study fault localization with both the fault model and the normal behavior model methods in typical large-scale distributed software systems. Using Haar wavelet transform, similarity measurement and cluster index techniques, we address noisy event log filtering, continuous fault behavior tracing and normal behavior modeling through different properties of event logs: the time series property, the statistical property and the event property. Our main contributions are:

(1) Time series similarity-based noisy event log filtering

To filter noisy event logs during fault feature extraction, we propose a time series similarity-based noisy log filtering method that filters out the event logs unrelated to the injected fault. We model each specific type of event log as a time series and use the Haar wavelet transform to extract the occurrence pattern of the time series. By comparing the similarity between noisy log templates and the target log time series, we identify the event logs that are unrelated to the injected fault and increase the effectiveness of the fault model.
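The filtering idea in contribution (1) can be illustrated with a small sketch. It is not the dissertation's implementation: the bin width, the number of retained coefficients (`keep`), the Euclidean distance and the `threshold` value are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

def to_time_series(timestamps, start, bin_seconds, num_bins):
    """Count the events of one log type per fixed-width time bin."""
    series = [0.0] * num_bins
    for t in timestamps:
        idx = int((t - start) // bin_seconds)
        if 0 <= idx < num_bins:
            series[idx] += 1.0
    return series

def haar_transform(series):
    """Full orthonormal Haar decomposition, coarsest coefficients first.

    The length of `series` must be a power of two.
    """
    n = len(series)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = list(series)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Coarser detail coefficients are placed in front of finer ones.
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details

def haar_distance(x, y, keep=4):
    """Euclidean distance over the `keep` coarsest Haar coefficients (assumed measure)."""
    cx, cy = haar_transform(x)[:keep], haar_transform(y)[:keep]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx, cy)))

def is_noise(target, noise_templates, threshold=1.0):
    """Treat a series as noise if it is close to any known noise template."""
    return any(haar_distance(target, tpl) <= threshold for tpl in noise_templates)
```

In this sketch, a log type whose occurrence pattern resembles a steady background template (for example a periodic heartbeat message) would be filtered out, while a burst that appears only after fault injection would be kept as a candidate fault feature.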
In experiments on an internal large-scale distributed software system at Alibaba Cloud Computing, we improve fault accuracy to 96% (with a detection time window of 50 seconds) and fault recall to 94% (with a detection time window of 100 seconds).

To improve the efficiency of the filtering process, we propose a skip-list-based cluster index to shorten the filtering time. By clustering the noisy templates with a time series similarity measurement and indexing them with a skip list, we improve the filtering efficiency by 43%.

(2) Event log item state-based continuous fault tracing

To model the fault features outside the detection time window, we propose a companion state tracing mechanism for fault feature extraction. Traditional fault modeling methods use only the event logs observed within a fixed fault feature time window and ignore the fault features outside that window. Because different types of faults differ considerably in propagation time and propagation pattern, fault features can be misidentified, producing false positives and false negatives. By modeling faults as companion state machines, we can identify whether the current event log pattern is caused by a previously detected fault or by a new fault. In experiments on an internal cluster at Alibaba Cloud Computing, we increase the effectiveness of the fault model to as much as 90% (with a detection time window of 6 seconds).

(3) Thread-level event log sequence-based abnormal behavior detection

To detect abnormal behavior from event log templates, we use the event logs within a thread as the templates of normal system behavior, and model abnormal event log detection as a sequence matching problem.
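The thread-level formulation can be sketched as follows. The abstract does not name the sequence similarity measure, so this example assumes the matching ratio from Python's `difflib.SequenceMatcher`; the record layout `(thread_id, event_type)`, the function names and the default threshold are likewise illustrative assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def sequences_by_thread(records):
    """Group time-ordered (thread_id, event_type) records into per-thread sequences."""
    seqs = defaultdict(list)
    for thread_id, event_type in records:
        seqs[thread_id].append(event_type)
    return dict(seqs)

def best_match(sequence, normal_templates):
    """Highest similarity ratio between a sequence and any normal template."""
    return max(
        (SequenceMatcher(None, sequence, tpl, autojunk=False).ratio()
         for tpl in normal_templates),
        default=0.0,
    )

def abnormal_threads(records, normal_templates, threshold=0.95):
    """Thread IDs whose event sequence matches no normal template closely enough."""
    return sorted(
        tid for tid, seq in sequences_by_thread(records).items()
        if best_match(seq, normal_templates) < threshold
    )
```

Comparing every sequence against every template in this way is the expensive step; the cost grows with the number of templates, which is what motivates narrowing the comparison set before matching.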
To improve the efficiency of the traditional cluster-based sequence matching method, we use the cosine similarity of event type feature vectors to estimate sequence similarity, Top-K search to limit the size of the comparison set, and an inverted index to further filter the candidate template set. In experiments on a Hadoop cluster, we improve comparison efficiency by 8.6 times (with a similarity threshold of 0.95 and Top-50).

To further improve the effectiveness of the thread-level event log behavior model, we propose a sub-sequence feature vector-based cluster index method for sequence matching. Because the event type feature vector does not contain temporal information, the similarity it yields does not approximate the original sequence similarity well. We use repeated sub-sequence analysis to divide the original sequence into sub-sequences. We then use the sub-sequence IDs to form the feature vector and use cosine similarity to cluster the original sequences. Because a sub-sequence is a reasonable abstraction of the localized temporal information of the original sequence, this method achieves better matching effectiveness than the event type feature vector-based method. In experiments on a Hadoop cluster, we increase sequence matching accuracy by 15% (with a similarity threshold of 0.90 and Top-40).

In addition to the three contributions above, we built a set of tools named LogAnalyzer for gathering and analyzing event logs in large-scale distributed software systems, which helps system administrators better understand runtime system behavior through event logs.
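The candidate filtering described in contribution (3), combining event type feature vectors, cosine similarity, an inverted index and Top-K selection, might look roughly like the sketch below. It assumes the feature vector is a plain count of event types per sequence; the function names and data layout are hypothetical and are not taken from LogAnalyzer.

```python
import heapq
import math
from collections import Counter, defaultdict

def feature_vector(sequence):
    """Event type feature vector: occurrence count of each event type (assumed form)."""
    return Counter(sequence)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def build_inverted_index(templates):
    """Map each event type to the set of template indices that contain it."""
    index = defaultdict(set)
    for i, tpl in enumerate(templates):
        for event_type in set(tpl):
            index[event_type].add(i)
    return index

def top_k_candidates(sequence, templates, index, k=50, threshold=0.95):
    """Return up to k (similarity, template_index) pairs, best first."""
    query = feature_vector(sequence)
    # Only templates sharing at least one event type with the query are scored.
    candidate_ids = set()
    for event_type in query:
        candidate_ids |= index.get(event_type, set())
    scored = ((cosine(query, feature_vector(templates[i])), i) for i in candidate_ids)
    kept = [(s, i) for s, i in scored if s >= threshold]
    return heapq.nlargest(k, kept)
```

Note that a count-based vector ignores ordering: the sequences `["a", "b", "c"]` and `["c", "b", "a"]` get a cosine similarity of about 1 here. That is the loss of temporal information the abstract refers to, and the reason a sub-sequence-based feature vector is proposed as a refinement.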
Keywords/Search Tags: Large-scale distributed systems, Trust management, Fault localization, Event log analysis