Managing Skew in the Parallel Evaluation of User-Defined Operations

Posted on:2013-01-29

Degree:Ph.D

Type:Thesis

University:University of Washington

Candidate:Kwon, YongChul

GTID:2450390008965590

Subject:Computer Science

Abstract/Summary:

Science and business are generating data at an unprecedented scale and rate due to ever-evolving technologies in computing and sensors. Analyzing big data has become a key skill driving business and science. The challenges in big-data analysis stem not only from the data volume, but also from the diversity of data types to analyze (e.g., text, image, audio, video, and graph) and the various analyses beyond relational algebra that need to be performed (e.g., machine learning, natural language processing, image processing, and graph analysis). The user-defined operation (UDO) is a powerful mechanism to implement complex data processing tasks without changing the core of the parallel data processing engine. Although users can rapidly develop a new data analysis task with UDOs and execute the task in a cluster of computers, achieving high performance is challenging for users, especially those who do not have an extensive background in programming.

This thesis focuses on addressing skew in parallel UDO evaluation. Skew is a problem when there exists a significant variance in the execution time of parallel tasks. In the presence of skew, the benefit of using a parallel system diminishes. Our detailed case study demonstrates that a new data analysis task can be rapidly implemented in a MapReduce-like system, but such an implementation may be prone to skew problems during execution. A skew-resilient implementation is possible but requires significant implementation effort and expertise in programming. We also analyze the skew problem in three real workloads and show that skew is frequent (more than 40% of long-running jobs experience skew).

The thesis proposes two techniques to manage skew in parallel UDO evaluations: SkewReduce and SkewTune. SkewReduce is a static data partition optimization technique for feature-extracting applications that are common in scientific analysis. SkewReduce can improve the application runtime by up to 8x compared with a default MapReduce data partitioning strategy without any code-level optimization. SkewTune is a transparent dynamic skew mitigation technique for MapReduce applications. SkewTune can improve the application runtime by up to 4x compared with the default MapReduce engine without modifying the application source code, without requiring any input from the developer or user, and without causing any side effects during the execution.
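
As a rough illustration of why skew diminishes the benefit of a parallel system, the sketch below compares two hypothetical sets of task runtimes with the same total work. It assumes one task per worker and that a parallel phase ends only when its slowest task ends. The function names and numbers are illustrative assumptions and are not taken from the thesis.

```python
def makespan(task_times):
    """A parallel phase finishes only when its slowest task finishes
    (assumption: every task runs on its own worker)."""
    return max(task_times)


def speedup(task_times):
    """Speedup relative to running all tasks back to back on one worker."""
    return sum(task_times) / makespan(task_times)


# Hypothetical runtimes in seconds; both lists hold 40 seconds of total work.
balanced = [10.0, 10.0, 10.0, 10.0]
skewed = [4.0, 4.0, 4.0, 28.0]  # one straggler dominates the phase

print(f"balanced speedup: {speedup(balanced):.2f}")
print(f"skewed speedup:   {speedup(skewed):.2f}")
```

With four workers, the balanced case reaches the ideal speedup of 4, while the skewed case stays below 1.5 because three workers sit idle waiting for the straggler.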

Keywords/Search Tags:

Skew, Data, Parallel

Related items

1	Massive Data Many Task Parallel Data Framework For GWAS
2	Parameter Estimation And Variable Selection Based On LAD Regression For Skew-t-Normal Data
3	Performance of the Alexander and Govern A statistic under heteroscedasticity imposed on normal data or induced by skew: An empirical study
4	On Skew Energy Of Oriented Graphs
5	Study On Block Skew-Symmetric And Skew-Circulant Matrix
6	Parameters Estimation And Application For The Skew Generalized Error Distribution
7	Ordering Of The Oriented Unicyclic Graphs With Skew Energies
8	Statistical analysis of skew normal distribution and its application
9	The Minimal Skew Energy Of Digraphs
10	Statistical Inference For Mixture Models With Skew-t-normal Data