
Onsite Fault Localization and Failure Reproduction for Diagnosing Production System Anomalies

Posted on: 2015-01-19
Degree: Ph.D
Type: Dissertation
University: North Carolina State University
Candidate: Nguyen, Hiep Chi
Full Text: PDF
GTID: 1478390017993784
Subject: Computer Science
Abstract/Summary:
Large-scale shared hosting infrastructures such as multi-tenant cloud computing systems have become increasingly popular by providing computing resources and software services in a cost-effective way. However, such systems are prone to system anomalies (e.g., performance degradation, software hangs, incorrect output) due to their inherent complexity, sharing nature, and large scale. Although application developers perform rigorous testing before deployment, many bugs manifest only during production runs. The traditional way of debugging production-run failures relies mainly on offline bug reproduction. However, reproducing a production-run failure execution offline, outside the production environment, is challenging due to missing environment information (e.g., system variables, configuration files), insufficient hardware resources, and unavailable third-party software. Many system anomalies are still diagnosed and recovered manually, causing long service downtime and significant financial loss.

In this dissertation, we explore onsite techniques that aim to localize the fault and reproduce the production-run failure execution within the original computing environment immediately after the failure occurs. We also investigate techniques that enable debugging production-run failures at the developer's site.

First, we present a black-box online fault localization system that can pinpoint faulty components in a distributed system immediately after a performance anomaly is detected. We propose a robust and efficient algorithm to discover the onset time of abnormal behaviors at different components and to distinguish those abnormal behaviors from dynamically changing behaviors caused by normal workload fluctuations. Faulty components are then pinpointed based on abnormal change propagation patterns and inter-component dependency relationships. We also propose a runtime validation technique that uses resource scaling to further filter out false alarms. We observe that our technique is efficient and lightweight, making it suitable for localizing root-cause components in distributed systems.

Second, online service failures in production computing infrastructures are notoriously difficult to debug once they happen, because software developers often have little information to work with when those failures occur. We design and build a practical and efficient in-situ framework for inferring possible failure paths inside the production environment immediately after a failure is detected. We use virtual machine live cloning to dynamically create a shadow component of the production server and perform guided binary execution exploration on the shadow component to infer how the failure occurred. We leverage both environment data (e.g., input logs, configuration files, states of interacting components) and runtime outputs (e.g., console logs, system calls) to guide the failure path inference.

Third, we design and develop an offline replay debugging system that can synthesize the failure-triggering input and the complete failure execution of a failed production run at the developer's site. The synthesized input and execution may not be exactly the same as those of the original failed run; rather, we view this as a debug-determinism tool that guarantees replaying an execution exhibiting the same failure symptom (e.g., error messages) and containing the same root cause. The failure-triggering input and the complete execution path can then be fed into any interactive debugger (e.g., GDB) for further analysis.
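The first technique localizes the faulty component by comparing when each component's behavior starts to deviate and by restricting attention to components that can propagate the anomaly through dependencies. The sketch below is a minimal illustration of that idea, assuming per-component metric time series have already been collected; the onset detector (a simple sliding-window deviation test), the threshold values, and the earliest-onset ranking are hypothetical simplifications, not the dissertation's actual algorithm.

from statistics import mean, stdev

def onset_time(samples, window=30, k=3.0):
    """Return the index where a metric first leaves its recent normal band."""
    for i in range(window, len(samples)):
        hist = samples[i - window:i]
        mu = mean(hist)
        sigma = stdev(hist) or 1e-9   # guard against a perfectly flat window
        if abs(samples[i] - mu) > k * sigma:
            return i
    return None

def localize(metrics, depends_on, anomalous):
    """metrics: {component: [samples]}; depends_on: {component: upstream components}."""
    # Only the anomalous component and the components it (transitively)
    # depends on are candidate root causes.
    candidates, stack = set(), [anomalous]
    while stack:
        c = stack.pop()
        if c not in candidates:
            candidates.add(c)
            stack.extend(depends_on.get(c, ()))
    onsets = {}
    for c in candidates:
        t = onset_time(metrics[c])
        if t is not None:
            onsets[c] = t
    # The component whose abnormal behavior started earliest is the prime
    # suspect, since abnormal changes propagate from it to its dependents.
    return min(onsets, key=onsets.get) if onsets else None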
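The second technique replays a live-cloned shadow copy of the failed server and searches for an execution whose observable side effects match what the production run actually emitted. The sketch below illustrates only the path-ranking step, assuming some execution-exploration engine has already enumerated candidate paths together with the log lines each path would produce; the scoring function and the data shapes are hypothetical.

def match_score(expected, observed):
    """Count expected log lines that appear, in order, within the observed log."""
    it = iter(observed)
    return sum(1 for line in expected if any(line in seen for seen in it))

def infer_failure_path(candidates, observed_log):
    """candidates: [(path_id, [expected log lines])]; return the best-matching path."""
    # The path whose expected output best explains the observed console log
    # is the most plausible reconstruction of how the failure occurred.
    return max(candidates, key=lambda c: match_score(c[1], observed_log), default=None)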
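The third technique does not try to recover the exact original input; any input that drives the program to the same failure symptom suffices for debugging. The toy search below sketches that "debug determinism" goal, assuming the failed program can be re-executed cheaply and that its error message is the observable symptom; run_program, the byte-flipping mutation, and the fixed budget are placeholders, far simpler than the dissertation's execution synthesis.

import random

def mutate(data: bytes) -> bytes:
    """Flip one random byte (placeholder mutation strategy)."""
    if not data:
        return bytes([random.randrange(256)])
    i = random.randrange(len(data))
    return data[:i] + bytes([random.randrange(256)]) + data[i + 1:]

def synthesize_failing_input(run_program, seed: bytes, symptom: str, budget=10_000):
    """run_program(data) -> error message or None; return an input showing the symptom."""
    candidate = seed
    for _ in range(budget):
        if run_program(candidate) == symptom:
            return candidate          # same symptom and root cause is good enough
        candidate = mutate(seed)      # try another variation of the seed input
    return None

The returned input, together with the inferred execution path, can then be loaded into an interactive debugger such as GDB for step-by-step analysis.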
Keywords/Search Tags: Failure, System, Production, Execution, Fault, Input, Computing, Software