
Research On Key Techniques Of Benchmarking And Optimization For Big Data Systems

Posted on: 2018-03-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Xiong
Full Text: PDF
GTID: 1318330533455886
Subject: Computer application technology
Abstract/Summary:
The arrival of the big data era means that novel techniques, products, and systems are constantly emerging. Benchmarking will play an important role in the development of big data systems, much as it drove the flourishing of database systems over the past three decades. Current benchmark research, however, cannot satisfy real industrial needs for big data applications in three respects. First, there is no widely accepted benchmark suite for evaluating different big data systems. Second, traditional approaches cannot accurately characterize the workloads of big data applications, let alone guide performance optimization. Third, in our benchmarking and workload characterization activities, we found that manual tuning is inefficient and cannot achieve good optimization for big data applications. Consequently, we study a set of fundamental issues that need to be solved urgently: (1) how to build a big data benchmark suite; (2) how to characterize the workloads of big data applications and guide optimization; (3) a methodology and tool for automatically tuning HBase configurations.

First, we develop a benchmark suite, SZTS, available at http://cloud.siat.ac.cn/szts.php, designed for big data systems hosting traffic applications. SZTS has three advantages over other big data benchmark suites. (1) SZTS adopts real-world input sets, which capture the real-world characteristics of workloads better than synthetic data. (2) SZTS conducts workload characterization in a cross-layer way, which provides a more comprehensive view than single-layer approaches. (3) SZTS employs clustering to select representative programs and their associated input sets, which satisfies diversity requirements better than the traditional manual selection.

Second, we propose a novel workload characterization methodology using ensemble learning, called Metric Importance Analysis (MIA), to quantify the respective importance of workload metrics or characteristics. Moreover, we develop the MIA-based Kiviat Plot (MKP) and Benchmark Similarity Matrix (BSM) to visualize program behavior (dis)similarity. Compared to the traditional dendrogram based on linkage clustering, MKP and BSM provide more insightful information about (dis)similarity. More importantly, MIA can guide performance optimization. For example, TMI (the amount of I/O caused by temporary data) is the most important factor for DPS (data processing speed); by adjusting the parameter io.sort.factor to reduce TMI, we save 37.5% of execution time.

Finally, we propose an approach for automatically tuning HBase configurations via ensemble learning, called ATH. The key idea is a performance model that takes configuration parameters as inputs and outputs performance metrics such as throughput and latency. A genetic algorithm (GA) then uses this model to search the huge configuration space for an optimum configuration for a given application. We validate ATH on an HBase cluster with 10 nodes using 5 typical applications from YCSB. The experimental results show that ATH improves throughput by 41% on average, and by up to 97%, compared to the default configuration.
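The core of the MIA idea, ranking workload metrics by their importance to a performance target using an ensemble learner (the keywords name Random Forest), can be sketched as follows. The metric names and data here are synthetic illustrations, not the dissertation's actual SZTS metrics or trained model:

```python
# Sketch: rank workload metrics by importance with a Random Forest.
# Metric names and data are synthetic; "tmi" is made the dominant
# driver of the target so its importance should rank highest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
metrics = ["ipc", "cache_miss_rate", "tmi", "disk_util"]  # hypothetical metrics
X = rng.random((200, len(metrics)))
# Target (e.g. a data-processing-speed proxy) depends mostly on "tmi".
y = 5.0 * X[:, 2] + 0.5 * X[:, 0] + rng.normal(0, 0.1, 200)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
ranking = sorted(zip(metrics, model.feature_importances_),
                 key=lambda p: p[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

An importance ranking like this is what lets MIA point an engineer at the metric worth optimizing (here, the TMI/io.sort.factor example from the abstract).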
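The ATH loop, a performance model serving as the fitness function for a genetic algorithm that searches the configuration space, can be sketched as below. The parameter names, ranges, and the stand-in performance model are hypothetical illustrations, not HBase defaults or the dissertation's trained model:

```python
# Sketch of the ATH idea: a GA searches a configuration space, scoring
# candidates with a performance model. The "model" here is a synthetic
# stand-in (peaks at handler_count=150, memstore_mb=256), not a trained
# predictor.
import random

random.seed(0)
# Hypothetical parameter ranges (illustrative only).
PARAMS = {"handler_count": (10, 300), "memstore_mb": (64, 512)}

def perf_model(cfg):
    # Stand-in for a learned throughput model: higher is better.
    return (-((cfg["handler_count"] - 150) ** 2) / 100.0
            - ((cfg["memstore_mb"] - 256) ** 2) / 50.0)

def random_cfg():
    return {k: random.randint(lo, hi) for k, (lo, hi) in PARAMS.items()}

def mutate(cfg):
    child = dict(cfg)
    k = random.choice(list(PARAMS))
    lo, hi = PARAMS[k]
    child[k] = min(hi, max(lo, child[k] + random.randint(-20, 20)))
    return child

def crossover(a, b):
    return {k: random.choice((a[k], b[k])) for k in PARAMS}

pop = [random_cfg() for _ in range(30)]
for _ in range(60):  # generations
    pop.sort(key=perf_model, reverse=True)
    elite = pop[:10]  # keep the best configurations
    pop = elite + [mutate(crossover(random.choice(elite), random.choice(elite)))
                   for _ in range(20)]
best = max(pop, key=perf_model)
print(best)
```

In ATH proper, evaluating `perf_model` on the learned model replaces actually deploying each candidate configuration on the cluster, which is what makes searching the huge space tractable.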
Keywords/Search Tags:Big Data, Benchmarking, Workload Characterization, Optimization, Program Behavior, Clustering, Auto Tuning, Random Forest