Research On Performance Bug Detection In Datacenter Distributed Systems

Posted on:2019-01-11

Degree:Doctor

Type:Dissertation

Country:China

Candidate:J X Li

Full Text:PDF

GTID:1368330611493109

Subject:Computer Science and Technology

Abstract/Summary:


Large-scale distributed systems have become the backbone of cloud computing and modern applications, and billions of end users rely on their dependability. However, performance bugs in distributed systems can cause severe performance loss, which easily leads to poor user experience and significant economic loss. Because these bugs are complex and diverse, detecting them faces several challenges: the lack of a comprehensive, in-depth understanding of distributed performance bugs, the difficulty of detecting complex distributed performance bugs, and the difficulty of effectively triggering and verifying distributed performance bugs that involve multiple threads.

Due to the lack of understanding of real-world distributed performance bugs (DPbugs), dealing with them remains extremely challenging. Moreover, existing studies on single-machine performance bugs (LPbugs) are not comprehensive and usually focus only on specific performance issues. Therefore, a preliminary, comprehensive empirical study of DPbugs is necessary. In addition, users typically expect performance isolation and high scalability from distributed systems. However, performance cascading bugs (PCbugs) often cause slowdowns in one job to propagate, resulting in global performance degradation or even threatening system availability. Furthermore, developers often need to reproduce bugs in order to study their root causes and guide bug fixing. However, concurrency bugs whose root cause is the wrong execution timing of multiple threads, called multi-thread involved bugs (MTIbugs), may lead to various system problems in consistency, reliability, scalability and performance. Due to their high complexity, they are difficult for developers to manually analyze, reproduce, or trigger.

To address the above problems and challenges, this thesis focuses on three issues in distributed systems: an empirical study of performance bugs, performance cascading bug detection, and multi-thread involved bug triggering. Specifically, the main work and contributions of this thesis are as follows:

(1) An empirical study of performance bugs in distributed systems. To meet the need for a preliminary, comprehensive empirical study of DPbugs, we present TaxPerf, the largest and most comprehensive taxonomy of real-world performance bugs in distributed systems. We study 99 distributed performance bugs from five widely deployed cloud datacenter distributed systems: Cassandra, HBase, HDFS, Hadoop MapReduce and ZooKeeper. We study DPbug characteristics along several axes of analysis, such as root causes, implications, and fix strategies, collectively stored as over 400 classification labels in the TaxPerf database. Overall, TaxPerf can be used as a large-scale DPbug benchmark. It also complements the understanding of LPbugs and can help open up new research directions in combating DPbugs.

(2) Automatically detecting performance cascading bugs in cloud systems. Performance cascading bugs violate the high scalability and performance isolation properties of distributed systems, causing global performance degradation and even threatening system availability. To address this, we present PCatch, a tool that automatically predicts PCbugs by analyzing system execution under small-scale workloads. PCatch has three key components. It uses program analysis to identify code regions whose execution time can potentially increase dramatically with the workload size; it adapts the traditional happens-before model to reason about software resource contention and performance dependency relationships; and it uses dynamic tracking to identify whether the slowdown propagation is contained within one job. Our evaluation on representative distributed systems (Cassandra, Hadoop MapReduce, HBase, and HDFS) shows that PCatch can efficiently and accurately predict PCbugs based on small-scale workload execution.

(3) Automatically triggering multi-thread involved bugs in cloud systems. Multi-thread involved bugs in distributed systems can cause system problems in consistency, reliability, scalability and performance. They are also difficult to manually analyze, reproduce, or trigger. To address this, we build an MTIbug triggering model by analyzing the non-deterministic and multi-threaded properties of MTIbugs. Then, based on the triggering model, we design and implement MTrigger, an automatic MTIbug triggering tool for real-world distributed systems. MTrigger analyzes the competing pairs of an MTIbug and manipulates the timing of the threads and operations related to those pairs in order to trigger the bug. Experiments show that MTrigger can efficiently and accurately reproduce or trigger bugs, which helps developers verify, diagnose and fix them.
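
As a rough illustration of what timing manipulation on a competing pair of operations can look like, the following Python sketch forces one particular interleaving of two threads so that a lost-update bug appears on every run. It is not MTrigger's implementation: the Counter class, the run_forced_interleaving function and the two events are hypothetical names introduced only for this example, and a real tool would have to apply such delays inside a running distributed system.

```python
import threading


class Counter:
    """Shared state with an unsynchronized read-modify-write (hypothetical example)."""

    def __init__(self):
        self.value = 0


def run_forced_interleaving():
    """Force thread B's update into the window between thread A's read and write."""
    counter = Counter()
    a_has_read = threading.Event()     # set once A has read the shared value
    b_has_written = threading.Event()  # set once B has finished its update

    def thread_a():
        local = counter.value          # first operation of the competing pair: read
        a_has_read.set()               # allow B to run inside A's read/write window
        b_has_written.wait()           # delay A's write until B has written
        counter.value = local + 1      # stale write overwrites B's update

    def thread_b():
        a_has_read.wait()              # hold B back until A has read
        counter.value = counter.value + 1
        b_has_written.set()

    threads = [threading.Thread(target=thread_a), threading.Thread(target=thread_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    # Two increments ran, but one update is lost under the forced interleaving.
    print(run_forced_interleaving())
```

Without the two events, the faulty interleaving would occur only occasionally; with them, the order of the competing operations is fixed, which is the general idea behind reproducing a timing-dependent bug on demand.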

Keywords/Search Tags:

Distributed systems, Performance bugs, Empirical study, Performance cascading detection, Bug detection, Performance impact analysis, Bug triggering, System reliability, Cloud computing


Related items

1	Empirical Studies of Performance Bugs and Performance Analysis Approaches for Software Systems
2	Research On Understanding And Detecting Performance Bug In Distributed Systems
3	An Empirical Study On The Impact Of Performance Commitment On The M&A Performance Of Listed Companies
4	Benchmarking Infrastructure-As-A-Service Cloud Systems Automatically With Extensibility
5	Multi-Layer Fault Tolerance Techniques for High Reliability and Performance: Devices, Systems and Data Centers
6	I/O Behavior Analysis Tool Research For High Performance Computing Systems
7	Research On The Adaptive Behaviors Prejudgment Method Of The Cloud Service System Performance Self-Optimization
8	An Empirical Study Of The Relationship Between Software Platform Development Performance,Distribution Performance And Operation Performance
9	Exact Performance Analysis Of MIMO And Massive MIMO Systems With MMSE Receiver
10	Performance Analysis And Evaluation Of Cloud Computing System