
Research On Techniques Of Incremental Processing For Big Data Based On Hadoop Platform

Posted on: 2015-06-10
Degree: Master
Type: Thesis
Country: China
Candidate: Q S Ding
Full Text: PDF
GTID: 2308330482460333
Subject: Computer system architecture
Abstract/Summary:
In recent years, big data has become a hot topic in science and industry, with broad prospects in research and application; however, it faces challenges of efficiency and availability. Big data is often processed repeatedly with only small changes between runs, and this incremental character suggests that an incremental computing model can improve performance greatly. This thesis therefore studies how to process incrementally changing big data efficiently in a cloud environment, focusing on the storage model, the parallel processing model, and the scheduling strategy, and builds a technical architecture for incremental big data processing based on Hadoop. The main work includes the following.

(1) A big data storage model for incremental computing, which provides the basic guarantee for parallel processing. Incremental storage is implemented on the distributed file system in the Hadoop environment by applying the Rabin fingerprint algorithm to perform content-defined chunking of the data entered by users.
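The content-defined chunking step can be sketched as follows. A production implementation would use a true Rabin fingerprint (polynomial arithmetic over GF(2)); this sketch substitutes a simple polynomial rolling hash to illustrate the same boundary-detection idea, and the window size and mask are illustrative parameters, not the thesis's exact values.

```python
WINDOW = 16            # bytes in the sliding window (assumed)
MASK = (1 << 11) - 1   # ~2 KiB average chunk size (assumed)
BASE = 257
MOD = (1 << 31) - 1
BASE_POW = pow(BASE, WINDOW - 1, MOD)  # for removing the oldest byte

def chunk_boundaries(data: bytes):
    """Return end offsets of content-defined chunks in `data`."""
    h = 0
    boundaries = []
    for i, b in enumerate(data):
        if i >= WINDOW:
            # Slide the window: drop the contribution of the oldest byte.
            h = (h - data[i - WINDOW] * BASE_POW) % MOD
        h = (h * BASE + b) % MOD
        # Declare a chunk boundary when the low bits of the hash are zero.
        if i >= WINDOW and (h & MASK) == 0:
            boundaries.append(i + 1)
    if not boundaries or boundaries[-1] != len(data):
        boundaries.append(len(data))
    return boundaries

def chunks(data: bytes):
    """Split `data` into chunks at content-defined boundaries."""
    out, start = [], 0
    for end in chunk_boundaries(data):
        out.append(data[start:end])
        start = end
    return out
```

The key property is that boundaries depend only on window content, not on byte offsets: if data is inserted at the front of a file, the chunker resynchronizes at the next boundary, so all later chunks (and their fingerprints) are unchanged and need not be re-stored or re-processed.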
By determining which data chunks are unchanged between runs, the incremental processing framework can efficiently reuse the intermediate results of the previous run.

(2) A parallel processing model and algorithm for incremental computing, which improves the efficiency of big data parallel processing. The model centers on the design of incremental Map and incremental Reduce: by checking whether a Map (or Reduce) task is identical to one already handled, the framework reuses the intermediate results stored on the storage server, improving the efficiency of the incremental processing framework's parallel processing model.

(3) A fair scheduling strategy based on load awareness, which makes rational use of resources. It takes the load parameters of each running slave into comprehensive consideration to balance the workload across TaskTrackers, and monitors each slave's workload in real time to decide whether to reassign Mapper and Reducer tasks, so that the various resources in the cluster are used effectively and reasonably.

In summary, this thesis studies incremental processing on the Hadoop platform and puts forward a novel, effective solution to the inefficiency and long running times of the original system; its effectiveness and efficiency are demonstrated by the experimental results.
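The intermediate-result reuse behind the incremental Map phase can be illustrated with a minimal memoization sketch: mapper output is cached per input chunk, keyed by the chunk's fingerprint, so only changed chunks are re-mapped on the next run. The names (`IncrementalMapRunner`, `word_count_map`) are illustrative, not Hadoop API.

```python
import hashlib
from collections import defaultdict

def word_count_map(chunk: str):
    """Example mapper: emit (word, 1) pairs."""
    return [(w, 1) for w in chunk.split()]

class IncrementalMapRunner:
    """Caches mapper output by chunk fingerprint across runs."""
    def __init__(self, mapper):
        self.mapper = mapper
        self.cache = {}       # fingerprint -> cached mapper output
        self.recomputed = 0   # chunks actually mapped in the last run

    def run(self, chunk_list):
        self.recomputed = 0
        results = []
        for chunk in chunk_list:
            fp = hashlib.sha1(chunk.encode()).hexdigest()
            if fp not in self.cache:         # only changed chunks are mapped
                self.cache[fp] = self.mapper(chunk)
                self.recomputed += 1
            results.extend(self.cache[fp])
        return results

def reduce_sum(pairs):
    """Example reducer: sum the counts for each key."""
    totals = defaultdict(int)
    for k, v in pairs:
        totals[k] += v
    return dict(totals)
```

On a second run that appends one new chunk, only that chunk is mapped; the cached outputs of unchanged chunks feed the reduce phase directly, which is the source of the speedup the abstract claims.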
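The load-aware scheduling idea can be sketched as follows: each slave reports resource utilizations, a weighted score summarizes its load, new tasks go to the least-loaded TaskTracker, and nodes above a threshold become candidates for task reassignment. The weights and threshold here are assumptions for illustration, not the thesis's exact parameters.

```python
THRESHOLD = 0.8  # assumed overload cutoff

def load_score(cpu: float, mem: float, io: float,
               w_cpu=0.5, w_mem=0.3, w_io=0.2) -> float:
    """Weighted combination of resource utilizations in [0, 1] (assumed weights)."""
    return w_cpu * cpu + w_mem * mem + w_io * io

def assign_task(trackers: dict) -> str:
    """Pick the least-loaded TaskTracker for the next Mapper/Reducer task."""
    return min(trackers, key=lambda t: load_score(*trackers[t]))

def overloaded(trackers: dict) -> list:
    """TaskTrackers whose load exceeds the threshold (reassignment candidates)."""
    return [t for t in trackers if load_score(*trackers[t]) > THRESHOLD]
```

Real-time monitoring would periodically refresh the utilization tuples and rerun `overloaded` to decide whether queued tasks on a hot node should be moved to an idle one.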
Keywords/Search Tags: Big data, Hadoop, Incremental HDFS, Incremental Map/Reduce, Rabin algorithm