Syslog-based Performance Event Management Within Data Center

Posted on: 2018-01-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S L Zhang
Full Text: PDF
GTID: 1368330566987975
Subject: Computer Science and Technology
Abstract/Summary:
To provide high-speed and reliable Internet services, the data centers in which those services are deployed must themselves be highly reliable. However, some events within a data center impact its performance. They can degrade or even stop Internet services, harm user experience, and in turn lead to a drop in revenue. These events can be classified into human intervention events and system events.

A software upgrade or configuration change (software change) in Web services is one of the most important human intervention events. Traditionally, operators assess the impact of software changes manually, which is error-prone, not scalable, and consumes a lot of human resources. A switch failure is one of the most important system events. Previously proposed methods for dealing with such failures either suffer from low accuracy or consume too many computational resources, and no switch failure prediction method has yet been proposed. Therefore, this dissertation implements event management, i.e., event (switch failure) prediction, event (switch failure) detection, and event (software change) assessment. The research work and contributions can be summarized as follows.

1) A novel system, FUNNEL, for rapid and robust impact assessment of software changes in large Web services, is proposed and implemented. To detect significant changes in performance behavior, FUNNEL adopts the singular spectrum transform (SST) algorithm as its core algorithm, and applies a difference-in-difference (DiD) method to differentiate true causality from random correlation between the performance change and the software change. Evaluation on historical data from real-world services showed that FUNNEL achieved an accuracy of more than 99.7%. Compared with previous methods, FUNNEL's detection delay was 38.02% to 64.99% shorter, and its computation speed was much faster.

2) A novel method, FT-tree, for extracting events from switch syslogs for event management, is proposed and implemented. To extract the failure events represented by syslog messages, FT-tree empirically extracts message templates more accurately than existing approaches, and naturally supports incremental learning. To compare FT-tree with three other template learning techniques, the four methods were evaluated on two years' worth of failure tickets and syslogs collected from switches deployed across 10+ data centers. The experiments demonstrated that FT-tree improved the estimation/prediction accuracy (as measured by F1) by 155% to 188%. In addition, FT-tree substantially improved computational efficiency.

3) A novel syslog-based switch failure prediction framework, PreFix, is proposed and implemented. PreFix aims to determine at runtime whether a switch failure will happen in the near future. Its novel set of features for machine learning (message template sequence, frequency, seasonality, and surge) can efficiently deal with the challenges of noise, sample imbalance, and computation overhead. PreFix was evaluated on a data set collected from real-world data center switches, and achieved an average recall of 61.81% and a false positive ratio of 1.84 × 10^-5.
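
As a rough illustration of the difference-in-difference (DiD) idea mentioned for FUNNEL, the sketch below compares the KPI change on servers that received a software change against the change on servers that did not. It is a generic textbook DiD estimate, not FUNNEL's implementation. The function name `did_estimate`, the choice of treated and control groups, and the latency samples are all hypothetical and are not taken from the dissertation.

```python
from statistics import mean


def did_estimate(treated_before, treated_after, control_before, control_after):
    """Generic difference-in-difference estimate of a change's effect on a KPI.

    Assumption (not from the dissertation): "treated" servers received the
    software change and "control" servers did not; both groups are observed
    over the same time windows before and after the change.
    """
    # KPI shift on the servers that received the software change.
    treated_delta = mean(treated_after) - mean(treated_before)
    # KPI shift on the untouched servers, which captures background trends
    # such as a traffic surge that affects every server.
    control_delta = mean(control_after) - mean(control_before)
    # What remains after subtracting the background trend is attributed
    # to the software change.
    return treated_delta - control_delta


# Hypothetical response-time samples in milliseconds.
treated_before = [100.0, 102.0, 98.0, 100.0]
treated_after = [120.0, 122.0, 118.0, 120.0]
control_before = [100.0, 101.0, 99.0, 100.0]
control_after = [105.0, 106.0, 104.0, 105.0]

effect = did_estimate(treated_before, treated_after, control_before, control_after)
print(f"Estimated impact of the change: {effect:.1f} ms")
```

In this made-up data, both groups slow down after the change, but the treated group slows down by more. Subtracting the control group's shift keeps a change that merely coincides with a global trend from being blamed on the software change.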
Keywords/Search Tags:Data center, Event Management, Syslog, Web service, Switch failure