The development of "Big Data"(Big Data) is an inevitable trend in the future in the field of each,is currently the focus of scientific research.In the face of intensive large data,big data processing platform to meet the needs of people, is also from technicalto marketization.The big Internet companies are already shaping abroad commercial mode;Intensive domestic big data technology also achieved from the Internet banking development, biomedical, and many other fields.It’s high performance can provide more efficient operations,to provide users with more high-quality and service,to seek greater benefits for the enterprise.Hadoop as important in the field of large data processing platform, from the technical architecture and business application has the perfect fusion. Open source Hadoop has good stability, fault tolerance and high efficiency e, has been widely used now.There are two big system Hadoop framework:A blend of Google " one of the three core technologies " graphs,with the distributed computing, to provide high concurrency for upper unstructured HDFS storage service,Good at store large files, does not support small file processing.Map Reduce to solve large-scale data-intensive problem is very flexible,how to allocate and manage large-scale data,and get the optimal solution is the hotspot of research in the field of big data.Map Reduce itself the basic scheduling algorithm with FIFO scheduling, Capacity scheduling and Fair scheduling,the basic algorithm can maintain the normal operation of platform,can’t meet the demand of users,in the complex of big data environment,the workflow task scheduling is a kind of reasonable and effective strategy,is to realize the task assignment time is short and high performance of the system,this article is to a complex job scheduling problem,a detailed study, to satisfy the demand of The Times.The premise of this article is based on calculation model and workflow scheduling strategy execution in the distributed environment, the concept of workflow is introduced into the calculation model of graphs,help us to establish a specific computing services to achieve large data dependencies homework complicated calculation.Work first made a general idea about the present situation of domestic and foreign commercial Hadoop platform,research a lot about workflow and related scheduling algorithm;Secondly MRHD(Map Reduce Hadoop Data-aware) algorithm research graphs roughly calculated under the framework of two phases:Workflow level and graphs level.Workflow level stage mainly responsible for the initialization graphs Workflow DAG figure, and the implementation of real-time dynamic Workflow queue, graphs level is responsible for the operation scheduling mechanism to improve bottom map task a high percentage of data the locality, reduce data on the interactions between the processor, reducing costs;At last, through a large number of experiments to realize MRHD workflow scheduling algorithm of this paper,Solve the problem of the workflow of the underlying operating data localization and shorten theoperation time of operation.In this paper is focused on how to deal with large amount of data rich dependencies set up under the operation of a distributed job scheduling strategy, how to effectively manage graphs under the environment of computing resources and workflow execution time.The innovation point of this paper(1) initialize the graphs of workflow DAG figure, according to the provisions of this article time change rules and homework to adjust the computing power 
This paper focuses on how to build a distributed job scheduling strategy for jobs with large data volumes and rich dependencies, and how to effectively manage computing resources and workflow execution time in the MapReduce environment. The innovations of this paper are as follows:

(1) Initialize the workflow DAG of MapReduce jobs and, according to the time-variation rules defined in this paper and each job's adjusted computing power Pj, determine the urgency degree of each workflow job, that is, its priority assignment; the workflow job queue is then obtained in real-time, dynamically maintained descending-priority order (a sketch of one possible priority computation follows this list).

(2) Partition the workflow's paths into critical and non-critical job queues. Because the running time of each job differs and the cluster's computing resources change in real time, the critical path of the workflow is determined dynamically, which is better than a static critical path planned in advance. Jobs on non-critical paths are then sorted by ascending time for map-task localization; at the same time, to guarantee the overall time of the MapReduce workflow, the scheduler compares each job's waiting time with its data transmission time and executes in whichever way is shorter (see the second sketch after this list).
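A hedged sketch of the priority assignment in innovation (1): the exact urgency formula built from the time-variation rules and the adjusted computing power Pj is defined in the body of the paper, so the ratio used below (remaining work divided by power times remaining slack) is only an assumed stand-in, as are the field names deadline, remaining_work, and name.

```python
# Hedged sketch of innovation (1): rank ready jobs by an urgency value
# recomputed as time passes and as the adjusted computing power P_j
# changes. The urgency formula is an assumed stand-in, not the paper's.
import heapq
import time

def urgency(job, p_j, now):
    slack = max(job["deadline"] - now, 1e-9)     # time left before deadline
    return job["remaining_work"] / (p_j * slack)

def dynamic_workflow_queue(ready, p_j, now=None):
    """Return ready jobs in descending urgency (priority) order; calling
    this repeatedly keeps the queue dynamic as time and P_j change."""
    now = time.time() if now is None else now
    heap = [(-urgency(job, p_j, now), job["name"], job) for job in ready]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```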
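For innovation (2), the sketch below (reusing the Workflow class from the earlier sketch) shows one plausible form of its two decisions: recomputing the critical path from current run-time estimates so it stays dynamic, and choosing between waiting for a data-local slot and transferring the input block, whichever is shorter. The run-time estimates, wait estimate, and bandwidth are assumed to come from the scheduler's monitoring; all function names are illustrative.

```python
# Sketch of innovation (2); wf is a Workflow from the earlier sketch and
# run_time maps every job to its current estimated running time.
def critical_path_length(wf, run_time):
    """Longest path through the DAG under current run-time estimates;
    recomputing it as estimates change keeps the critical path dynamic
    rather than fixed by a static plan made in advance."""
    memo = {}
    def longest_from(job):
        if job not in memo:
            memo[job] = run_time[job] + max(
                (longest_from(c) for c in wf.children[job]), default=0.0)
        return memo[job]
    return max(longest_from(job) for job in run_time)

def place_map_task(local_wait_est, block_size, bandwidth):
    """Wait for a data-local slot only when the expected wait is shorter
    than shipping the input block over the network."""
    transfer_time = block_size / bandwidth
    if local_wait_est <= transfer_time:
        return "wait_for_local_slot"   # preserves map-task data locality
    return "run_remote_now"            # shorter overall completion time
```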