Big data storage workload characterization, modeling and synthetic generation

Posted on:2015-11-24

Degree:Ph.D

Type:Thesis

University:University of Illinois at Urbana-Champaign

Candidate:Abad, Cristina Lucia

GTID:2478390017993155

Subject:Computer Science

Abstract/Summary:

A huge increase in data storage and processing requirements has led to Big Data, for which next-generation storage systems are being designed and implemented. As Big Data stresses the storage layer in new ways, a better understanding of these workloads and the availability of flexible workload generators are increasingly important to facilitate the proper design and performance tuning of storage subsystems like data replication, metadata management, and caching.

Our hypothesis is that the autonomic modeling of Big Data storage system workloads through a combination of measurement and statistical and machine learning techniques is feasible, novel, and useful. We consider the case of one common type of Big Data storage cluster: a cluster dedicated to supporting a mix of MapReduce jobs. We analyze 6-month traces from two large clusters at Yahoo and identify interesting properties of the workloads. We present a novel model for capturing popularity and short-term temporal correlations in object request streams, and show how unsupervised statistical clustering can be used to enable autonomic type-aware workload generation that is suitable for emerging workloads. We extend this model to include other relevant properties of storage systems (file creation and deletion, pre-existing namespaces, and hierarchical namespaces) and use the extended model to implement MimesisBench, a realistic namespace metadata benchmark for next-generation storage systems. Finally, we demonstrate the usefulness of MimesisBench through a study of the scalability and performance of the Hadoop Distributed File System name node.

Keywords/Search Tags:

Big data, Storage, Workload, Model

Related items

1	Workload Optimal Scheduling For Large-scale Cloud Data Centers
2	Workload-Aware Hybrid Block Storage System Using Object-based Storage
3	Adaptive Management Of Transaction Workload For Database Systems
4	Multi-dimensional workload analysis and synthesis for modern storage systems
5	Research On Adaptive Workload Prediction Based On Machine Learning
6	Data center resource management with temporal dynamic workload
7	Research On Data Organize Structure Of Evolving Storage Systems
8	Statistical Characterization of Storage System Workloads for Data Deduplication and Load Placement in Heterogeneous Storage Environments
9	Research On A Hierarchy-based Workload Dividing Approach For The Distributed DDM
10	Research On Low Power Data Layout Technologies For Storage Systems