
Research On Optimization Of Map Reduce For Interactive Analysis On Big Data

Posted on: 2014-06-24  Degree: Doctor  Type: Dissertation
Country: China  Candidate: H Zhao  Full Text: PDF
GTID: 1228330479979655  Subject: Computer Science and Technology
Abstract/Summary:
In recent years, big data has drawn increasingly wide attention as data volumes grow exponentially. Compared with traditional massive data, big data places more emphasis on the potential value of the data, and thus demands more powerful analysis and mining technology. Big data analysis and mining requires massively parallel data processing that is efficient, scalable, and reliable. MapReduce supports large-scale automatic parallelism, highly automated scalability, and transparent, fine-grained fault tolerance, making it well suited to big data analysis and mining; it has become the core technology of this field. MapReduce applications are growing rapidly, and many organizations use MapReduce to meet their application needs, such as cleaning satellite images, building inverted indexes, and analyzing user click streams.

However, MapReduce was originally designed for large-scale batch processing and is now turning towards large-scale interactive applications. Interactive applications differ considerably from batch applications, so the traditional MapReduce system does not adapt well to them. For optimizing interactive applications, the traditional database field offers many mature techniques; compared with MapReduce, however, databases have limited scalability and reliability. The motivation of this study is to apply techniques from traditional data management to extend MapReduce and make it better suited to interactive applications. This dissertation starts from the workflow of the MapReduce framework, studies optimization opportunities at each execution phase, and completes four optimizations adapted to interactive applications. The contributions are summarized as follows.

(1) Optimization of MapReduce job scheduling and execution by global indexing. The optimization targets condition-based jobs, which are very common in large-scale interactive analysis and mining.
The original MapReduce system does not exploit semantic features to optimize this kind of job. Building on current research, we propose a global-index-based job scheduling and execution optimization strategy whose objective is to reduce job execution and scheduling cost. The assumptions are that data partitions are globally ordered on the index attribute and that a global index exists over the partitions. MapReduce is extended with a new phase, condition analysis, which examines whether a job can be optimized; the map-task construction algorithm is then modified to reduce the number of map tasks.

(2) A locality-aware fair task scheduling algorithm. In a large-scale commodity network cluster, network bandwidth is the scarcest system resource. Moving computation to data is a crucial principle, known in MapReduce as data locality optimization, and data locality directly affects computational efficiency. An interactive computing platform is usually shared by many users, so fair sharing of resources is necessary; but absolute fairness seriously hurts data locality in an interactive environment. We therefore propose a flexible fair scheduling algorithm, named K%-fair scheduling, which puts data locality first and then considers fairness. By adjusting K, both data locality and fairness can be well optimized.

(3) A locality-aware task scheduling framework. Task scheduling must consider data locality, but other objectives may also matter, such as job size, job type, input sharing, and waiting time. It is therefore necessary to design a comprehensive task scheduling framework. The framework puts data locality first by dynamically planning where tasks execute based on data locality; if there are multiple candidates when scheduling a task, the candidates are sorted by a composite index function, and the head task is scheduled.

(4) An efficiency-aware job scheduling algorithm. The job scheduling algorithm controls the execution order of all submitted jobs.
We study the case of large bursty loads. Under this load pattern, three factors dominate computing efficiency: data locality, load balancing, and pipelining of resource usage. Without job scheduling, available resources can be allocated to any job, so both data locality and load balance are easily achieved; however, all jobs then contend for limited shared resources, which seriously disrupts the pipeline between different kinds of computing resources. Job scheduling controls job parallelism, which limits the scale of schedulable tasks; in this case, data locality and load balancing are not easily optimized simultaneously. For this situation, we propose a load-balancing-aware job scheduling algorithm under strict data locality.
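The abstract does not spell out the K%-fair algorithm, so the following is only a minimal sketch of one plausible reading: K is taken as the tolerated percentage shortfall below a pool's fair share, and locality is preferred whenever no pool has fallen below that floor. The `Pool` class, `pick_task` function, and this interpretation of K are all assumptions for illustration, not the dissertation's actual design.

```python
from dataclasses import dataclass, field

@dataclass
class Pool:
    name: str
    fair_share: float          # slots this pool is entitled to
    running_share: float       # slots this pool currently holds
    pending: list = field(default_factory=list)  # (task_id, preferred_node)

    def local_task_for(self, node):
        # First pending task whose input data lives on this node, if any.
        for task in self.pending:
            if task[1] == node:
                return task
        return None

    def any_task(self):
        return self.pending[0] if self.pending else None


def pick_task(pools, free_node, k):
    """Pick (pool_name, task_id) for free_node under a K%-fair policy."""
    # Fairness floor: pools more than K% below fair share are served first,
    # regardless of locality.
    starving = [p for p in pools
                if p.running_share < p.fair_share * (1 - k / 100.0)]
    candidates = starving or list(pools)
    # Most under-served pools first.
    candidates.sort(key=lambda p: p.running_share / p.fair_share)
    # Locality first within the allowed candidates ...
    for pool in candidates:
        task = pool.local_task_for(free_node)
        if task is not None:
            return pool.name, task[0]
    # ... then any pending task as a fallback.
    for pool in candidates:
        task = pool.any_task()
        if task is not None:
            return pool.name, task[0]
    return None
```

With a small K the fairness floor dominates and an under-served pool gets the slot even without local data; with a large K the floor never triggers and the scheduler is free to pick whichever pool has a node-local task, illustrating how K trades fairness against locality.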
Keywords/Search Tags: Big data analysis, Cloud computing, Grid computing, Scalable data intensive computing, Distributed and parallel computing, Index, Resource allocation, Scheduling algorithm, MapReduce, Hadoop