
Research On Techniques Of Incremental Processing For Big Data Based On Hadoop Platform

Posted on: 2015-06-10
Degree: Master
Type: Thesis
Country: China
Candidate: Q S Ding
Full Text: PDF
GTID: 2308330482460333
Subject: Computer system architecture
Abstract/Summary:
In recent years, big data has become a hot topic in science and industry, with broad prospects in research and application; however, it faces challenges of efficiency and availability. Big data is often processed repeatedly with only small changes between runs, and this incremental character suggests that an incremental computing model can improve performance greatly. This thesis therefore studies how to process incrementally changing big data efficiently in a cloud environment, focusing on the storage model, the parallel processing model, and the scheduling strategy, and builds a technical architecture for incremental big data processing based on Hadoop. The main work includes the following.

(1) A big data storage model for incremental computing, which provides the basic guarantee for parallel processing. Incremental storage is implemented on the distributed file system in the Hadoop environment by applying the Rabin fingerprint algorithm to perform content-defined chunking of the data entered by users.
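The content-defined chunking step can be sketched as follows. A production implementation would use a true Rabin fingerprint (polynomial arithmetic over GF(2)); this sketch substitutes a simple polynomial rolling hash to illustrate the same boundary-detection idea, and the window size and mask are illustrative parameters, not the thesis's exact values.

```python
WINDOW = 16            # bytes in the sliding window (assumed)
MASK = (1 << 11) - 1   # ~2 KiB average chunk size (assumed)
BASE = 257
MOD = (1 << 31) - 1
BASE_POW = pow(BASE, WINDOW - 1, MOD)  # for removing the oldest byte

def chunk_boundaries(data: bytes):
    """Return end offsets of content-defined chunks in `data`."""
    h = 0
    boundaries = []
    for i, b in enumerate(data):
        if i >= WINDOW:
            # Slide the window: drop the contribution of the oldest byte.
            h = (h - data[i - WINDOW] * BASE_POW) % MOD
        h = (h * BASE + b) % MOD
        # Declare a chunk boundary when the low bits of the hash are zero.
        if i >= WINDOW and (h & MASK) == 0:
            boundaries.append(i + 1)
    if not boundaries or boundaries[-1] != len(data):
        boundaries.append(len(data))
    return boundaries

def chunks(data: bytes):
    """Split `data` into chunks at content-defined boundaries."""
    out, start = [], 0
    for end in chunk_boundaries(data):
        out.append(data[start:end])
        start = end
    return out
```

The key property is that boundaries depend only on window content, not on byte offsets: if data is inserted at the front of a file, the chunker resynchronizes at the next boundary, so all later chunks (and their fingerprints) are unchanged and need not be re-stored or re-processed.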
By determining which data chunks are unchanged between runs, the incremental processing framework can efficiently reuse the intermediate results of the previous run.

(2) A parallel processing model and algorithm for incremental computing, which improves the efficiency of big data parallel processing. The model centers on the design of incremental Map and incremental Reduce: by checking whether a Map (or Reduce) task is identical to one already handled, the framework reuses the intermediate results stored on the storage server, improving the efficiency of the incremental processing framework's parallel processing model.

(3) A fair scheduling strategy based on load awareness, which makes rational use of resources. It takes the load parameters of each running slave into comprehensive consideration to balance the workload across TaskTrackers, and monitors each slave's workload in real time to decide whether to reassign Mapper and Reducer tasks, so that the various resources in the cluster are used effectively and reasonably.

In summary, this thesis studies incremental processing on the Hadoop platform and puts forward a novel, effective solution to the inefficiency and long running times of the original system; its effectiveness and efficiency are demonstrated by the experimental results.
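The intermediate-result reuse behind the incremental Map phase can be illustrated with a minimal memoization sketch: mapper output is cached per input chunk, keyed by the chunk's fingerprint, so only changed chunks are re-mapped on the next run. The names (`IncrementalMapRunner`, `word_count_map`) are illustrative, not Hadoop API.

```python
import hashlib
from collections import defaultdict

def word_count_map(chunk: str):
    """Example mapper: emit (word, 1) pairs."""
    return [(w, 1) for w in chunk.split()]

class IncrementalMapRunner:
    """Caches mapper output by chunk fingerprint across runs."""
    def __init__(self, mapper):
        self.mapper = mapper
        self.cache = {}       # fingerprint -> cached mapper output
        self.recomputed = 0   # chunks actually mapped in the last run

    def run(self, chunk_list):
        self.recomputed = 0
        results = []
        for chunk in chunk_list:
            fp = hashlib.sha1(chunk.encode()).hexdigest()
            if fp not in self.cache:         # only changed chunks are mapped
                self.cache[fp] = self.mapper(chunk)
                self.recomputed += 1
            results.extend(self.cache[fp])
        return results

def reduce_sum(pairs):
    """Example reducer: sum the counts for each key."""
    totals = defaultdict(int)
    for k, v in pairs:
        totals[k] += v
    return dict(totals)
```

On a second run that appends one new chunk, only that chunk is mapped; the cached outputs of unchanged chunks feed the reduce phase directly, which is the source of the speedup the abstract claims.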
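The load-aware scheduling idea can be sketched as follows: each slave reports resource utilizations, a weighted score summarizes its load, new tasks go to the least-loaded TaskTracker, and nodes above a threshold become candidates for task reassignment. The weights and threshold here are assumptions for illustration, not the thesis's exact parameters.

```python
THRESHOLD = 0.8  # assumed overload cutoff

def load_score(cpu: float, mem: float, io: float,
               w_cpu=0.5, w_mem=0.3, w_io=0.2) -> float:
    """Weighted combination of resource utilizations in [0, 1] (assumed weights)."""
    return w_cpu * cpu + w_mem * mem + w_io * io

def assign_task(trackers: dict) -> str:
    """Pick the least-loaded TaskTracker for the next Mapper/Reducer task."""
    return min(trackers, key=lambda t: load_score(*trackers[t]))

def overloaded(trackers: dict) -> list:
    """TaskTrackers whose load exceeds the threshold (reassignment candidates)."""
    return [t for t in trackers if load_score(*trackers[t]) > THRESHOLD]
```

Real-time monitoring would periodically refresh the utilization tuples and rerun `overloaded` to decide whether queued tasks on a hot node should be moved to an idle one.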
Keywords/Search Tags: Big data, Hadoop, Incremental HDFS, Incremental Map/Reduce, Rabin algorithm