Research On Dynamic Data Partitioning Algorithm For Large-scale Streaming Data Online Processing

Posted on: 2016-12-14
Degree: Master
Type: Thesis
Country: China
Candidate: C Li
Full Text: PDF
GTID: 2308330467995850
Subject: Computer system architecture
Abstract/Summary:
Recently, with the rapid development of the Internet of Things and social network applications, streaming data has become an important type of big data. Massive streaming data grows at an explosive rate and is widely used in areas including network monitoring, stock analysis, aerospace, web applications, and meteorological monitoring. Although large-scale batch processing techniques for static data are relatively mature, those for online data are not. Research on large-scale online streaming data processing techniques is attracting more and more attention.

MapReduce is a parallel programming model proposed by Google for processing massive static data. Because of its simplicity, scalability, and fault tolerance, MapReduce has gained significant influence in industry and academia. Its biggest advantage is that it shields users from the underlying details and provides simple programming interfaces, which greatly reduces the difficulty of parallel programming and improves programming efficiency.

However, compared with static data, streaming data presents new challenges for data processing technology because of its own characteristics: 1) The data stream continuously flows into and out of the processing system, which requires continuous processing of online data; 2) Stream data cannot be stored completely and will be permanently lost without timely processing, so real-time or near-real-time data processing must be provided; 3) The rate of streaming data is difficult to control or predict, which requires the system to have dynamic scalability and rapid response capability. Therefore, developing a general and flexible stream processing technology is very necessary.

Research on distributed data stream processing based on the MapReduce model is very active both at home and abroad. Existing research focuses on the continuous updating and timeliness of streaming data.
Some works study distributed data processing technology for online computing models based on memory instead of a distributed file system, which supports continuous high-speed processing of streaming data. However, there are still many deficiencies in existing work: 1) Static models and existing dynamic models lack the ability to change dynamically at run time, or some of the operators in the system cannot be extended. 2) The data transmission mechanism in static models and existing dynamic models depends fully or partially on the existing file system and cannot meet the real-time requirements of stream data processing. 3) The data partitioning strategy in static models and existing models uses static or partially dynamic mapping; although some works replace uniform hashing with consistent hashing, they do not take into account the impact of topology changes on data partitioning.

To address the above shortcomings, this thesis first presents SPATE, a system model adapted to the processing of streaming data; the system supports dynamic scalability of the job topology. On the premise of a flexible and scalable topology, SPATE focuses on solving the problem of parallel processing of multiple MapReduce jobs, and it also designs and improves the data transmission mechanism and the dynamic partitioning mechanism between different operators.

We propose a dynamic online partitioning strategy and the corresponding state migration protocol; we model the distribution of streaming data as an online bin packing problem. We must consider factors such as dynamic topology changes, resource usage, and migration cost. Because the bin packing problem is NP-hard, considering multiple factors greatly increases the complexity of online partitioning. We want the system to adapt and adjust quickly, with the minimum number of nodes and the minimum migration cost, when the flow rate of the streaming data changes abruptly.

We test the partitioning algorithm and verify the scalability of the system with a real data application.
Detailed experiments show that the dynamic online partitioning algorithm proposed in this thesis performs significantly better than traditional ones and achieves the intended goals of the work.
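
To make the online bin packing formulation more concrete, the following Python sketch treats each worker node as a bin with a fixed capacity and each key group of the stream as an item whose size is its currently observed load. It uses the classic first-fit heuristic, which is a stand-in chosen for illustration and not the partitioning algorithm or state migration protocol of the thesis. The names `Node`, `FirstFitPartitioner`, `update`, and `owner`, the capacity units, and the migration counter are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A worker node treated as a bin with a fixed capacity (hypothetical units)."""
    capacity: float
    keys: dict = field(default_factory=dict)  # key group -> current load

    @property
    def load(self) -> float:
        return sum(self.keys.values())


class FirstFitPartitioner:
    """Online first-fit placement of key groups onto nodes (illustrative only)."""

    def __init__(self, capacity: float):
        self.capacity = capacity
        self.nodes = []      # nodes are opened on demand
        self.migrations = 0  # number of key groups moved between nodes

    def _place(self, key, load):
        # First fit: use the first node with enough spare capacity.
        for i, node in enumerate(self.nodes):
            if node.load + load <= node.capacity:
                node.keys[key] = load
                return i
        # No existing node fits, so open a new one.
        self.nodes.append(Node(self.capacity, {key: load}))
        return len(self.nodes) - 1

    def owner(self, key):
        """Return the index of the node holding `key`, or None if unassigned."""
        for i, node in enumerate(self.nodes):
            if key in node.keys:
                return i
        return None

    def update(self, key, load):
        """Record a newly observed load for `key` and return its node index."""
        if load > self.capacity:
            raise ValueError("a single key group exceeds node capacity")
        current = self.owner(key)
        if current is None:
            return self._place(key, load)
        node = self.nodes[current]
        if node.load - node.keys[key] + load <= node.capacity:
            node.keys[key] = load  # still fits: no migration needed
            return current
        # Overflow: move only this key group and count one migration.
        del node.keys[key]
        self.migrations += 1
        return self._place(key, load)


# Usage: three key groups arrive, then the load of one of them rises.
p = FirstFitPartitioner(capacity=10.0)
for key, load in [("a", 6.0), ("b", 5.0), ("c", 4.0)]:
    p.update(key, load)
p.update("c", 5.0)  # "c" no longer fits next to "a" and is migrated
print(len(p.nodes), p.migrations)
```

The sketch keeps the number of open nodes small by reusing existing nodes first, and it limits migration cost by moving only the key group whose load change caused the overflow. A partitioner that also reacts to topology changes, as the abstract requires, would need additional logic that is not shown here.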
Keywords/Search Tags:MapReduce, Streaming Data, Data Transmission Mechanism, Dynamic Online Partitioning