
Research And Implementation Of Distributed Stream Computing Framework

Posted on: 2013-04-05
Degree: Master
Type: Thesis
Country: China
Candidate: X Gu
Full Text: PDF
GTID: 2248330371466322
Subject: Signal and Information Processing
Abstract/Summary:
With the development of data processing technology, new applications based on data analysis keep emerging. Processing massive data with distributed systems is widely discussed and studied in China and abroad. A network topology of multiple homogeneous nodes is well suited to distributed processing of massive data, and improving availability, robustness, scalability, and real-time performance in such a topology has become a research hotspot in distributed parallel computing. As data collection technology improves in many fields, the structure of data sources has also changed. In addition to traditional static and real-time structured data, there is now a large amount of real-time, continuous, unstructured data. Such data enters the processing system as a stream, and its velocity, volume, and direction are difficult to control. Faced with massive data streams, traditional distributed processing struggles to capture the information carried by data "in motion" and to perform complex computation on it in real time. This has prompted further research into new ways of managing distributed stream computing. At present, research on distributed stream computing systems is at an early stage both in China and abroad, and mature results are lacking.
Therefore, drawing on accumulated research and product experience in distributed processing, in fields such as traditional distributed computing, wireless sensor networks, and CDN distributed clusters, and on a thorough analysis of the requirements of stream data processing applications, the author designs and implements a complete stream computing management framework, Rtstream. The thesis studies and optimizes task scheduling algorithms in depth, since they are the key factor in framework performance, and applies improved adaptive algorithms to the framework: a local task scheduling algorithm specific to the framework, and a global task scheduling algorithm, the latter being an active research direction. Simulation and performance tests show that the framework can be flexibly customized to actual application scenarios and that its performance meets expectations. The innovations of this thesis are as follows:
(1) The distributed system designed here is not limited to a specific application scenario or a particular problem. Because data streams and application scenarios take many forms, a system built for a single scenario is neither general nor extensible. This thesis surveys the requirements of data stream applications broadly. Rtstream is a general platform for solving data stream application problems by distributed means. Third-party developers only need to focus on their business requirements: they "instantiate" the framework template, develop the business module for their specific data stream, and plug it into the framework to obtain a distributed stream computing system for their application. Developers do not need to know the implementation details of the underlying framework, including distributed computing concerns such as data transmission, high-concurrency connections, data persistence, load balancing, and task scheduling, all of which are handled by the framework.
To developers, all of this is a "black box".
(2) Innovation in framework design. Most distributed systems uniformly use a multithreaded model to process tasks in order to improve concurrency. After research and testing, the author found that in stream parallel computing over a large-scale network topology, multithreaded task processing on a single node cannot meet high-concurrency demands and instead wastes resources on thread switching and contention for the CPU. Drawing on a study of Nginx, the well-known open-source high-performance web server, the author introduces a single-threaded, non-blocking model into Rtstream framework nodes. Because that model alone cannot exploit multicore processors, the author further proposes a one-loop-per-thread model. In addition, by combining several design patterns, the author addresses classic distributed system problems such as layering, asynchronous communication, high-concurrency processing, and adaptive adjustment of thread pool capacity.
(3) Innovation in system implementation strategies. Distributed systems have relatively mature strategies for data routing, resumable transfer, error retransmission, congestion control, data segmentation, and so on, but these are not entirely suitable for the stream computing model. Based on the characteristics of stream computing, the author improves the traditional strategies in the implementation of each module of a framework node and in the transfer protocol between nodes, and reduces configuration and maintenance costs by basing data routing on ZooKeeper, the distributed coordination service associated with the MapReduce framework.
(4) The thesis proposes a local scheduling model specific to Rtstream nodes. The central scheduling module in each node faces a classic combinatorial optimization problem: given the machine's limited resources, which of the ready task operators is the most appropriate to schedule next.
The thesis uses dynamic knowledge to train and update the scheduling strategy in real time. First, after analyzing the factors that influence local task scheduling, it proposes a weight-based method for calculating task priority that takes into account a variety of energy consumption factors and QoS constraints. Second, it proposes a prediction model for resource bottlenecks on the local node and uses it, through a heuristic scheduling algorithm, to adjust the task priority factors so that the node selects the most suitable task to execute.
(5) The thesis addresses global task scheduling, an NP-complete problem, using heuristic methods from parallel computing, and applies the result to the Rtstream framework. It studies how the ant colony algorithm searches for optimal solutions to combinatorial optimization problems and applies the algorithm to the task scheduling model. By analyzing the actual characteristics of global task scheduling and exercising fine-grained control over the information the ants carry and collect, it analyzes the system's running state, extracts the key information, and reflects it in the heuristic information that governs the ants' choices. Guiding the ants in real time reduces the time needed to search for the optimal solution. In addition, because the reinforced positive feedback strategy raises the likelihood of premature convergence, an inherent weakness of the algorithm, a dynamic pheromone evaporation strategy is used to keep the optimization sound.
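To make the idea of ant colony task scheduling concrete, the following is a minimal Python sketch, not the thesis's actual algorithm. It assumes a simplified model in which each task has a fixed cost, each node has a fixed speed, and the goal is to minimize the makespan (the finish time of the busiest node). The function name `aco_schedule`, its parameters, and the fixed evaporation rate `rho` are all hypothetical; the abstract describes a dynamic evaporation strategy whose details are not given, so a constant rate stands in for it here.

```python
import random


def aco_schedule(task_costs, node_speeds, ants=20, rounds=50, rho=0.3, seed=0):
    """Assign each task to a node, trying to minimise the makespan.

    Hypothetical illustration only: names, parameters and the cost model
    are assumptions, not taken from Rtstream. Task costs must be positive.
    """
    rng = random.Random(seed)
    n_tasks, n_nodes = len(task_costs), len(node_speeds)
    # pheromone[t][n]: learned desirability of placing task t on node n
    pheromone = [[1.0] * n_nodes for _ in range(n_tasks)]
    best_assign, best_span = None, float("inf")

    for _ in range(rounds):
        round_best, round_span = None, float("inf")
        for _ in range(ants):
            load = [0.0] * n_nodes
            assign = []
            for t, cost in enumerate(task_costs):
                # Heuristic information: favour the node on which this
                # task would finish earliest given the current load.
                weights = [
                    pheromone[t][n] / (load[n] + cost / node_speeds[n])
                    for n in range(n_nodes)
                ]
                n = rng.choices(range(n_nodes), weights=weights)[0]
                load[n] += cost / node_speeds[n]
                assign.append(n)
            span = max(load)
            if span < round_span:
                round_best, round_span = assign, span
        if round_span < best_span:
            best_assign, best_span = round_best, round_span
        # Evaporate all trails (fixed rate here, an assumption), then
        # reinforce the best assignment found in this round.
        for t in range(n_tasks):
            for n in range(n_nodes):
                pheromone[t][n] *= 1.0 - rho
            pheromone[t][round_best[t]] += 1.0 / round_span
    return best_assign, best_span


if __name__ == "__main__":
    assignment, makespan = aco_schedule([4.0, 3.0, 2.0, 1.0], [1.0, 1.0])
    print(assignment, makespan)
```

Evaporation is what keeps positive feedback from locking the colony onto an early, mediocre assignment. A dynamic variant could, for example, raise `rho` when the best makespan stops improving, though that rule is only a guess at what the thesis intends.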
Keywords/Search Tags:streaming calculation framework, distributed, network model, heuristic task scheduling, rational ant colony algorithm