Research On Iterative Computations For Big Data In The Cloud

Posted on:2013-04-13

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y F Zhang

Full Text:PDF

GTID:1228330467981093

Subject:Computer application technology

Abstract/Summary:

Advances in storage and networking technology have created huge collections of high-volume, high-dimensional data, while cloud computing provides flexible storage and computation power for this big data. Making sense of these data is critical for companies and organizations to make better business decisions, and it even brings convenience to our daily lives. Data mining, machine learning, and applied statistics typically require an iterative refinement process. However, the massive amount of data involved and the potentially numerous iterations required make it challenging to perform data analytics in a timely manner, and they consume considerable cloud resources. Improving the performance of large-scale iterative computation has therefore become a hot research topic in today’s cloud computing, and the research is also meaningful for real-life production.

The traditional MapReduce model proposed by Google was originally designed for batch processing and is not well suited to iterative processing. Recently, a series of state-of-the-art frameworks have been proposed to support large-scale iterative computations. They improve the performance of iterative processing for big data from both the system aspect and the computation model aspect.

Our work, unlike previous work, improves the MapReduce model for iterative processing and takes a further step in iterative computation theory. The contributions of this dissertation are summarized as follows:

(1) iMapReduce. To implement an iterative computation with MapReduce, users are required to design a series of MapReduce jobs, with each iteration corresponding to one or more MapReduce jobs. This batch processing model leads to repeated job/task scheduling and repeated data loading, which limits MapReduce’s performance. To improve the iterative processing performance of MapReduce, we propose iMapReduce.
iMapReduce launches a single job for the whole iterative computation to avoid task scheduling overhead, keeps static data local to avoid static data shuffling overhead, and allows asynchronous map execution to avoid synchronization overhead. Through these optimizations, iMapReduce substantially reduces the running time compared with MapReduce. Experimental results on Amazon EC2 show a 5X speedup over Hadoop.

(2) PrIter. Based on a study of a broad class of iterative algorithms, we propose prioritized execution of iterative computations. Traditional iterative computations update all data points without discrimination. In reality, however, the importance of these data points differs. We identify the dominant data points and perform more updates on them, while ignoring the data points whose effect on algorithm convergence is negligible. We prove the convergence and correctness of prioritized iteration and propose a distributed framework, PrIter, to support it. Our results show that PrIter provides a 50X speedup over Hadoop and a 5X speedup over iMapReduce.

(3) Maiter. Iterative computation usually adopts a synchronous iteration model, which requires the computations on all data in the previous iteration to be completed before the next iteration starts. This limits the processing power of distributed computing, especially in a heterogeneous environment. To support asynchronous iteration, we derive accumulative iterative updates and prove that accumulative iteration converges to the same result when performed asynchronously. We identify sufficient conditions under which an iterative computation can be transformed into an asynchronous accumulative iteration and abstract a computation model, based on which we develop a distributed framework, Maiter, to support asynchronous accumulative iteration.
Our results show that Maiter can achieve an 80X speedup over Hadoop and a 5X-10X speedup over synchronous counterparts.

The achievements described above significantly improve the performance of iterative computation in cloud environments. Our results have been adopted by the UMass Million Book project, the CMU GraphLab project, and the Microsoft Research Daytona project. Moreover, we provide the frameworks as open source, which can promote research on and applications of iterative computation in the cloud.
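
The ideas behind contributions (2) and (3) can be illustrated with a small single-machine sketch. It is not the PrIter or Maiter implementation. PageRank is chosen here only as a familiar example of an iterative algorithm, and the function names, the damping factor `d`, the tolerance `tol`, and the toy graph are all assumptions made for illustration. Each node keeps an accumulated `value` and a pending `delta`. Processing a node folds its delta into its value and pushes a damped share to its out-neighbors, in any order. With `prioritized=True` the node with the largest pending delta is processed first, as a simplified stand-in for prioritized execution.

```python
from typing import Dict, List, Tuple

Graph = Dict[int, List[int]]  # node -> out-neighbors (every neighbor is also a key)


def sync_pagerank(graph: Graph, d: float = 0.8, iters: int = 200) -> Dict[int, float]:
    """Synchronous baseline: every node is recomputed in every iteration."""
    rank = {v: 1.0 - d for v in graph}
    for _ in range(iters):
        new = {v: 1.0 - d for v in graph}
        for u, outs in graph.items():
            for w in outs:
                new[w] += d * rank[u] / len(outs)
        rank = new  # the next round starts only after all nodes are done
    return rank


def accumulative_pagerank(
    graph: Graph, d: float = 0.8, tol: float = 1e-10, prioritized: bool = True
) -> Tuple[Dict[int, float], int]:
    """Asynchronous accumulative sketch: nodes consume pending deltas in any order."""
    value = {v: 0.0 for v in graph}
    delta = {v: 1.0 - d for v in graph}  # initial pending change per node
    updates = 0
    # All deltas are non-negative, so their sum bounds the remaining change.
    while sum(delta.values()) >= tol:
        if prioritized:
            # Hypothetical simplification: a linear scan for the largest delta.
            batch = [max(delta, key=delta.get)]
        else:
            batch = list(graph)  # plain round-robin sweep, updated in place
        for u in batch:
            change, delta[u] = delta[u], 0.0
            if change == 0.0:
                continue
            value[u] += change  # accumulate the pending change
            outs = graph[u]
            for w in outs:
                delta[w] += d * change / len(outs)  # propagate a damped share
            updates += 1
    return value, updates


demo_graph: Graph = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2], 4: [3]}
reference = sync_pagerank(demo_graph)
result, n_updates = accumulative_pagerank(demo_graph, prioritized=True)
print(max(abs(result[v] - reference[v]) for v in demo_graph), n_updates)
```

On this toy graph both update orders should agree with the synchronous baseline up to the chosen tolerance. A distributed system would replace the linear scan with a priority queue or a top-k selection per worker.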

Keywords/Search Tags:

Iterative computation, cloud computing, distributed frameworks, MapReduce, distributed iterative algorithms

Related items

1	Research On Iterative Distributed Data Processing Based On MapReduce
2	Cluster Based Large-scale Distributed Graph Processing System
3	Scalable parallel computing on clouds: Efficient and scalable architectures to perform pleasingly parallel, MapReduce and iterative data intensive computations on cloud environments
4	The Research Of Load Balance Techniques And Accumulative Iterative Algorithms In Large Scale Asynchronous Graph Processing Framework Maiter
5	Research And Application Of Parallel Computation Framework Base On Task Type
6	Parallelization Research On Families Of Gradient Descent And Expectation Maximization Iterative Algorithms
7	Study On Iterative Mapreduce Computation Model For Clustering Analysis
8	Study On Parallel Algorithm Of Large-scale Numerical Calculation In Cloud Computing Environment
9	Localization Algorithm Based On Distributed MDS In Wireless Sensor Network
10	Key Technology Research On Log Storage And Analysis System On Cloud Platform