Automated Runtime Data Analysis for System Reliability Management

Posted on: 2019-02-06
Degree: Ph.D
Type: Thesis
University: The Chinese University of Hong Kong (Hong Kong)
Candidate: He, Pinjia
Full Text: PDF
GTID: 2478390017486917
Subject: Computer Science
Abstract/Summary:
Runtime data are data generated by systems or programs during their execution. Typical runtime data include system logs and Quality-of-Service (QoS) values, which developers widely employ in system reliability management tasks such as anomaly detection, operational issue handling, and performance prediction. However, traditional reliability management methods have become inefficient and error-prone because of the increasing complexity of modern systems and the rapid growth of runtime data volume. In this thesis, we propose automated data analysis methods to effectively utilize runtime data in reliability management tasks.

Firstly, we conduct an evaluation study on existing data-driven log parsing methods. Log parsing is the first step of many log-based reliability management methods. In log parsing, unstructured raw log messages are transformed into structured event sequences. Although log parsing has been widely studied, a comprehensive benchmark and an open-source toolkit are lacking. We implement four representative log parsing methods and evaluate their performance in terms of accuracy, efficiency, and effectiveness on reliability management tasks. We obtain six findings and make these parsing methods open-source for reuse.

Secondly, we propose a parallel log parsing method for large-scale log data analysis. When system logs grow to a large scale, existing log parsing methods fail to complete in reasonable time, which makes log parsing the bottleneck of reliability management tasks. Because timely reliability management is important, an efficient log parsing method that can accurately parse large-scale log data is in high demand. Our proposed parallel log parser, POP, employs specially designed heuristic rules and a clustering algorithm. It is optimized on top of Spark, a large-scale data processing platform.
Thus, POP can employ the computing power of computer clusters and handle large-scale logs efficiently.

Thirdly, we propose an online log parsing method to parse raw log messages in a streaming manner. Most existing log parsing methods focus on offline, batch processing of logs. However, the typical log collection process in modern systems is online, which makes an online log parser more suitable than offline ones. Besides, an online log parsing method can keep updating its parsing model with newly collected log messages. By designing a fixed-depth parse tree, our proposed online log parsing method can efficiently parse log messages in a streaming manner.

Fourthly, we propose an operational issue prioritization method based on hierarchical log clustering. Modern system developers handle issues reported by their users daily. To gain insights into the issues and find solutions, they often need to inspect large volumes of logs generated during system runtime. Our proposed method greatly facilitates the operational issue handling process by clustering similar issues into the same group based on their corresponding log sequences, and by recommending the largest issue groups to developers. Specifically, our method includes a coarse-grained clustering based on the event appearance matrix and a fine-grained clustering based on the event count matrix.

Lastly, we propose a QoS prediction method for Web service recommendation. A typical modern system based on Web services needs to regularly switch its service components based on their QoS values (e.g., response time) to avoid potential system failure and maintain system performance. However, it is difficult for service users to monitor the QoS values of all candidate services. To predict these QoS values accurately, our proposed QoS prediction method applies matrix factorization to the existing sparse QoS values.
The locations of service providers and users are encoded in the matrix factorization model to improve prediction accuracy.

In summary, this thesis targets the design of data-driven techniques on system runtime data to automate labor-intensive reliability management tasks. Extensive experiments on real-world datasets demonstrate the effectiveness of our proposed methods.
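As an illustration of the log parsing and clustering inputs described above, the following is a minimal sketch, not the thesis's parsers or clustering method. It assumes a simple hypothetical heuristic (any token containing a digit is a variable and is masked as `<*>`) to turn raw messages into event templates. It then builds one plausible reading of the two matrices named in the abstract: an event count matrix (how often each event occurs in each log sequence) and an event appearance matrix (whether it occurs at all). The function names and sample log lines are invented for the example.

```python
def to_template(message: str) -> str:
    """Mask variable tokens to obtain an event template.

    Assumption for illustration only: a token containing a digit is a variable.
    """
    tokens = message.split()
    return " ".join("<*>" if any(c.isdigit() for c in t) else t for t in tokens)


def event_matrices(sequences):
    """Build event count and event appearance matrices.

    sequences: list of log sequences, each a list of raw log messages.
    Returns (templates, counts, appearance), where rows correspond to
    sequences and columns to the sorted event templates.
    """
    templates = sorted({to_template(m) for seq in sequences for m in seq})
    index = {t: i for i, t in enumerate(templates)}

    counts = []
    for seq in sequences:
        row = [0] * len(templates)
        for message in seq:
            row[index[to_template(message)]] += 1
        counts.append(row)

    # Appearance is the binarized count: 1 if the event occurs at least once.
    appearance = [[1 if c > 0 else 0 for c in row] for row in counts]
    return templates, counts, appearance


# Hypothetical log sequences, one per reported issue.
logs = [
    [
        "Received block blk_1 of size 512",
        "Received block blk_2 of size 1024",
        "Deleting block blk_1",
    ],
    ["Received block blk_3 of size 256"],
]

templates, counts, appearance = event_matrices(logs)
for name, matrix in (("counts", counts), ("appearance", appearance)):
    print(name, matrix)
```

Under this reading, the appearance matrix discards frequency information and so suits a coarse first grouping, while the count matrix can separate sequences that share the same events but differ in how often they occur.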
Keywords/Search Tags: Runtime data, System, Reliability, Log, Methods, Propose, QoS values, Service