
Research And Implementation On Incremental Data Processing Algorithm Based On Hadoop

Posted on: 2018-05-02
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Feng
Full Text: PDF
GTID: 2348330518499443
Subject: Engineering
Abstract/Summary:
With the continuous development of the Internet industry, a growing number of emerging industries are involved in big data. Mobile terminals, wearable devices, and expanding network coverage contribute tremendous amounts of data every day. Distributed computing not only provides strong support for the analysis and processing of massive data, but also points the way toward optimizing and upgrading mature services such as website data analysis and log analysis. One obvious feature of big data is that most data is never modified; the dominant update operation on a dataset is appending. When dealing with this kind of data, incremental computation therefore becomes a powerful method.

Incoop and HadUP are two incremental computing frameworks that adapt well to users' workloads, but both have notable deficiencies. Incoop is a coarse-grained solution: the initial computation must process every data split and save the result of each map task. In subsequent runs, a map task looks up the fingerprint of its input data in the saved history; if the fingerprint is found, the stored result is reused directly, otherwise the split must be recomputed. HadUP is fine-grained compared with Incoop. Using a fixed-length division scheme, HadUP divides the dataset into segments and chunks, then uses the D-SD algorithm (Deduplication-based Snapshot Differential algorithm) to find the differences between the new and old splits, and finally combines the changed data with the historical results to produce the new output. However, when a modification occurs near the front of the dataset, the fixed-length division shifts, so only a few splits can reuse historical records. In addition, because the D-SD algorithm modifies the underlying MapReduce framework, HadUP's practicality is reduced.

This thesis designs and implements HadInc, an incremental data processing system based on Hadoop. It combines the advantages of Incoop and HadUP by dividing the dataset into finer-grained parts with Content-Defined Chunking, which improves the stability of the division: because chunk boundaries depend on content rather than byte offsets, an insertion or deletion shifts only nearby boundaries. This also allows the system to obtain the modified data at run time and deliver it promptly to external applications, rather than waiting until the end of the entire job. Based on these ideas, HadInc adapts to a wide range of application scenarios, such as many splits changing slightly or a few splits changing severely.

In the evaluation, we analyze the time cost of each step, show how data size and update ratio influence the results of incremental computation, and then describe several optimizations. Finally, to validate HadInc, five test cases were designed: 1) a few splits are changed slightly; 2) many splits are changed slightly; 3) a few splits are changed severely; 4) as the dataset grows from small to large, different chunk sizes are set to validate HadInc's efficiency; 5) real Wikipedia data is used to test HadInc. Across these tests, HadInc performs well in most cases: it handles more complex incremental scenarios while keeping computation efficiency more stable, demonstrating high reliability and wide applicability.
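The split-level reuse that Incoop performs can be sketched as follows. This is a minimal illustration, not Incoop's actual API: the function names, the SHA-1 fingerprint, and the dictionary cache are all assumptions standing in for Incoop's memoization server.

```python
import hashlib

def fingerprint(split: bytes) -> str:
    """Content fingerprint of an input split (SHA-1 as a stand-in)."""
    return hashlib.sha1(split).hexdigest()

def incremental_map(splits, map_fn, cache):
    """Run map_fn per split, reusing cached results for unchanged splits.

    `cache` maps split fingerprints to previously saved map outputs,
    playing the role of Incoop's stored history of map-task results.
    """
    results = []
    for split in splits:
        fp = fingerprint(split)
        if fp in cache:            # fingerprint found: reuse history
            results.append(cache[fp])
        else:                      # new or modified split: recompute
            out = map_fn(split)
            cache[fp] = out
            results.append(out)
    return results

# Word-count-style map over two runs; only the changed split recomputes.
cache = {}
run1 = incremental_map([b"a b", b"c d"], lambda s: len(s.split()), cache)
run2 = incremental_map([b"a b", b"c d e"], lambda s: len(s.split()), cache)
```

Reuse here is keyed purely on a split's content hash, which is why Incoop is coarse-grained: any change anywhere inside a split invalidates that entire split's cached result.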
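Content-Defined Chunking, which HadInc uses to stabilize the division, places a boundary wherever a function of the local bytes hits a fixed pattern, so boundaries move with the content rather than with byte offsets. Below is a toy sketch: a windowed byte sum stands in for the Rabin fingerprint a real chunker would use, and a production version would also enforce minimum and maximum chunk sizes.

```python
def cdc_chunks(data: bytes, window: int = 4, mask: int = 0x0F) -> list:
    """Split `data` at content-defined boundaries.

    A boundary is declared at position i when the sum of the preceding
    `window` bytes, masked to its low bits, equals zero. Because the
    test depends only on local content, an insertion near the front
    shifts only nearby boundaries; later chunks re-align and can be
    matched against history by fingerprint.
    """
    chunks, start = [], 0
    for i in range(window, len(data)):
        if sum(data[i - window:i]) & mask == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])  # trailing remainder
    return chunks

# The chunks always reassemble to the original data.
data = b"the quick brown fox jumps over the lazy dog. " * 20
chunks = cdc_chunks(data)
assert b"".join(chunks) == data
```

By contrast, a fixed-length division (as in HadUP) shifts every chunk after an insertion near the front of the dataset, which is exactly the weakness HadInc's chunking scheme addresses.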
Keywords/Search Tags: Hadoop, MapReduce, Distributed computing, Incremental processing, big data