
Onsite Fault Localization and Failure Reproduction for Diagnosing Production System Anomalies

Posted on: 2015-01-19
Degree: Ph.D
Type: Dissertation
University: North Carolina State University
Candidate: Nguyen, Hiep Chi
Full Text: PDF
GTID: 1478390017993784
Subject: Computer Science
Abstract/Summary:
Large-scale shared hosting infrastructures such as multi-tenant cloud computing systems have become increasingly popular by providing computing resources and software services in a cost-effective way. However, such systems are prone to system anomalies (e.g., performance degradation, software hangs, incorrect output) due to their inherent complexity, sharing nature, and large scale. Although application developers perform rigorous testing before deployment, many bugs manifest only during production runs. The traditional way of debugging production-run failures relies mainly on offline bug reproduction. However, reproducing a production-run failure execution offline, outside the production environment, is challenging due to missing environment information (e.g., system variables, configuration files), insufficient hardware resources, and unavailable third-party software. Many system anomalies are still diagnosed and recovered manually, causing long service downtime and significant financial loss.

In this dissertation, we explore onsite techniques that aim to localize the fault and reproduce the production-run failure execution within the original computing environment immediately after the failure occurs. We also investigate techniques that enable debugging production-run failures at the developer's site.

First, we present a black-box online fault localization system that can pinpoint faulty components in a distributed system immediately after a performance anomaly is detected. We propose a robust and efficient algorithm to discover the onset time of abnormal behaviors at different components and to distinguish those abnormal behaviors from dynamically changing behaviors caused by normal workload fluctuations. Faulty components are then pinpointed based on abnormal change propagation patterns and inter-component dependency relationships. We also propose a runtime validation technique that uses resource scaling to further filter out false alarms. We observe that our technique is efficient and lightweight, making it suitable for localizing root-cause components in distributed systems.

Second, online service failures in production computing infrastructures are notoriously difficult to debug once they happen, because software developers often have little information to work with when those failures occur. We design and build a practical and efficient in-situ framework for inferring possible failure paths inside the production environment immediately after a failure is detected. We use virtual machine live cloning to dynamically create a shadow component of the production server and perform guided binary execution exploration on the shadow component to infer how the failure occurred. We leverage both environment data (e.g., input logs, configuration files, states of interacting components) and runtime outputs (e.g., console logs, system calls) to guide the failure path inference.

Third, we design and develop an offline replay debugging system that can synthesize the failure-triggering input and the complete failure execution of a failed production run at the developer's site. The synthesized input and execution may not be exactly the same as those of the original failed run; rather, we view this as a debug-determinism tool that guarantees replaying an execution exhibiting the same failure symptom (e.g., error messages) and containing the same root cause. The failure-triggering input and the complete execution path can then be fed into any interactive debugger (e.g., GDB) for further analysis.
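The first technique localizes the faulty component by comparing when each component's behavior starts to deviate and by restricting attention to components that can propagate the anomaly through dependencies. The sketch below is a minimal illustration of that idea, assuming per-component metric time series have already been collected; the onset detector (a simple sliding-window deviation test), the threshold values, and the earliest-onset ranking are hypothetical simplifications, not the dissertation's actual algorithm.

from statistics import mean, stdev

def onset_time(samples, window=30, k=3.0):
    """Return the index where a metric first leaves its recent normal band."""
    for i in range(window, len(samples)):
        hist = samples[i - window:i]
        mu = mean(hist)
        sigma = stdev(hist) or 1e-9   # guard against a perfectly flat window
        if abs(samples[i] - mu) > k * sigma:
            return i
    return None

def localize(metrics, depends_on, anomalous):
    """metrics: {component: [samples]}; depends_on: {component: upstream components}."""
    # Only the anomalous component and the components it (transitively)
    # depends on are candidate root causes.
    candidates, stack = set(), [anomalous]
    while stack:
        c = stack.pop()
        if c not in candidates:
            candidates.add(c)
            stack.extend(depends_on.get(c, ()))
    onsets = {}
    for c in candidates:
        t = onset_time(metrics[c])
        if t is not None:
            onsets[c] = t
    # The component whose abnormal behavior started earliest is the prime
    # suspect, since abnormal changes propagate from it to its dependents.
    return min(onsets, key=onsets.get) if onsets else None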
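The second technique replays a live-cloned shadow copy of the failed server and searches for an execution whose observable side effects match what the production run actually emitted. The sketch below illustrates only the path-ranking step, assuming some execution-exploration engine has already enumerated candidate paths together with the log lines each path would produce; the scoring function and the data shapes are hypothetical.

def match_score(expected, observed):
    """Count expected log lines that appear, in order, within the observed log."""
    it = iter(observed)
    return sum(1 for line in expected if any(line in seen for seen in it))

def infer_failure_path(candidates, observed_log):
    """candidates: [(path_id, [expected log lines])]; return the best-matching path."""
    # The path whose expected output best explains the observed console log
    # is the most plausible reconstruction of how the failure occurred.
    return max(candidates, key=lambda c: match_score(c[1], observed_log), default=None)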
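The third technique does not try to recover the exact original input; any input that drives the program to the same failure symptom suffices for debugging. The toy search below sketches that "debug determinism" goal, assuming the failed program can be re-executed cheaply and that its error message is the observable symptom; run_program, the byte-flipping mutation, and the fixed budget are placeholders, far simpler than the dissertation's execution synthesis.

import random

def mutate(data: bytes) -> bytes:
    """Flip one random byte (placeholder mutation strategy)."""
    if not data:
        return bytes([random.randrange(256)])
    i = random.randrange(len(data))
    return data[:i] + bytes([random.randrange(256)]) + data[i + 1:]

def synthesize_failing_input(run_program, seed: bytes, symptom: str, budget=10_000):
    """run_program(data) -> error message or None; return an input showing the symptom."""
    candidate = seed
    for _ in range(budget):
        if run_program(candidate) == symptom:
            return candidate          # same symptom and root cause is good enough
        candidate = mutate(seed)      # try another variation of the seed input
    return None

The returned input, together with the inferred execution path, can then be loaded into an interactive debugger such as GDB for step-by-step analysis.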
Keywords/Search Tags: Failure, System, Production, Execution, Fault, Input, Computing, Software