| With the development of technology, computers and the Internet affect people’s lives and work in a variety of ways. As this influence grows, the data accumulated in daily life and work is increasing exponentially. Analyzing data helps people understand user behavior and make business decisions, so data is a valuable asset. The large number of data analysis jobs poses a challenge to computing resources. Cloud computing can integrate and manage large amounts of computing resources, so it has become the infrastructure for such jobs. New computing models can process massive data using the resources provided by the cloud. For example, MapReduce performs well in batch processing of big data. It hides the details of parallel computing, allowing developers to focus on algorithms. Job scheduling for big data is characterized by large volumes of data to be processed, geographically distributed computing resources, new programming models, and high demands on economic indicators. Addressing these characteristics, the main contributions of this work are: (1) Data partitions and reduce tasks are unbalanced in MapReduce. If partition sizes do not match the computing capacity of the nodes, the workload of the reduce tasks will be unbalanced. To address this problem, we propose a load-balancing MapReduce framework. The framework increases the number of partitions, estimates the size of each partition, collects statistics on the computing capability of each node, and assigns partitions to reduce tasks dynamically. It balances the load across nodes and shortens the running time of jobs. (2) Current MapReduce scheduling algorithms focus on the running time of jobs and ignore cost. To address this issue, we treat running time and cost as two attributes of user quality of service. We establish a MapReduce job scheduling model with running time and cost as optimization objectives, and solve the model using game theory and a genetic algorithm. 
We implemented a scheduler in Hadoop. When Hadoop is accessed by multiple users, the scheduler allocates resources to users according to their preferences for time or cost. (3) MapReduce users with different priorities have different deadlines and require different amounts of resources. To address this problem, we propose a multi-priority scheduling algorithm for MapReduce based on a queuing network. We treat the resources of the map phase and the reduce phase as two service stations, summarize three patterns of MapReduce algorithms, model MapReduce as a Jackson queuing network, and calculate the resource demand of users with different priorities. The proposed algorithm effectively meets the deadlines defined by users of different priorities when the arrival rate of users changes. (4) When data-intensive workflows are scheduled across geographically distributed data centers, the workload of data migration is often higher than that of data analysis, and bandwidth usage must also be paid for. We model the workflow as a directed acyclic graph (DAG) and simplify it, and map each data migration to a sub-task of the workflow. We take workflow execution time and cost as the optimization objectives and compute an optimized scheduling scheme using a simulated annealing algorithm. The above studies center on job scheduling optimization for big data. They improve the performance of MapReduce and the quality of service, and contribute to the use of big data analysis in cloud computing. |
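
To make the idea behind contribution (1) concrete, the following is a minimal sketch of capacity-aware partition assignment. It is not the framework's implementation: the function names, the greedy largest-first heuristic, and the inputs (estimated partition sizes and relative node processing rates) are assumptions chosen only for illustration. The sketch assumes there are more partitions than reduce tasks, and gives each partition to the node whose estimated finish time would be lowest after taking it.

```python
def assign_partitions(partition_sizes, node_capacities):
    """Assign partitions to nodes with a greedy largest-first heuristic.

    partition_sizes: dict of partition id -> estimated size (hypothetical input)
    node_capacities: dict of node id -> relative processing rate (hypothetical input)
    Returns a dict of node id -> list of partition ids.
    """
    load = {node: 0.0 for node in node_capacities}
    assignment = {node: [] for node in node_capacities}
    # Place the largest partitions first; break size ties by partition id.
    ordered = sorted(partition_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for pid, size in ordered:
        # Choose the node that would finish earliest if it took this partition.
        best = min(
            sorted(node_capacities),
            key=lambda n: (load[n] + size) / node_capacities[n],
        )
        assignment[best].append(pid)
        load[best] += size
    return assignment


def estimated_finish_times(assignment, partition_sizes, node_capacities):
    """Estimated finish time per node: assigned data divided by processing rate."""
    return {
        node: sum(partition_sizes[p] for p in parts) / node_capacities[node]
        for node, parts in assignment.items()
    }


if __name__ == "__main__":
    # Toy input: six partitions, one node twice as fast as the other.
    sizes = {"p0": 40, "p1": 30, "p2": 20, "p3": 10, "p4": 10, "p5": 10}
    capacities = {"fast": 2.0, "slow": 1.0}
    plan = assign_partitions(sizes, capacities)
    print(plan)
    print(estimated_finish_times(plan, sizes, capacities))
```

In this toy input the faster node receives twice as much data as the slower one, so both nodes have the same estimated finish time. Producing more partitions than reduce tasks is what gives the heuristic room to even out the load.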