
Research And Implementation On Incremental Data Processing Algorithm Based On Hadoop

Posted on: 2018-05-02
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Feng
Full Text: PDF
GTID: 2348330518499443
Subject: Engineering
Abstract/Summary:
With the continuous development of the Internet industry, a growing number of emerging industries are involved in big data. Mobile terminals, wearable devices, and expanding network coverage contribute tremendous amounts of data every day. Distributed computing not only provides strong support for the analysis and processing of massive data, but also points the way toward optimizing and upgrading mature services such as website data analysis and log analysis. One obvious feature of big data is that most data is never modified; the dominant update operation on a dataset is appending. When dealing with this kind of data, incremental computation therefore becomes a powerful method.

Incoop and HadUP are two incremental computing frameworks that adapt well to users' workloads, but both have notable deficiencies. Incoop is a coarse-grained solution: the initial computation must process every data split and save the result of each map task. In subsequent runs, a map task looks up the fingerprint of its input data in the saved history; if the fingerprint is found, the stored result is reused directly, otherwise the split must be recomputed. HadUP is fine-grained compared with Incoop. Using a fixed-length division scheme, HadUP divides the dataset into segments and chunks, then uses the D-SD algorithm (Deduplication-based Snapshot Differential algorithm) to find the differences between the new and old splits, and finally combines the changed data with the historical results to produce the new output. However, when a modification occurs near the front of the dataset, the fixed-length division shifts, so only a few splits can reuse historical records. In addition, because the D-SD algorithm modifies the underlying MapReduce framework, HadUP's practicality is reduced.

This thesis designs and implements HadInc, an incremental data processing system based on Hadoop. It combines the advantages of Incoop and HadUP by dividing the dataset into finer-grained parts with Content-Defined Chunking, which improves the stability of the division: because chunk boundaries depend on content rather than byte offsets, an insertion or deletion shifts only nearby boundaries. This also allows the system to obtain the modified data at run time and deliver it promptly to external applications, rather than waiting until the end of the entire job. Based on these ideas, HadInc adapts to a wide range of application scenarios, such as many splits changing slightly or a few splits changing severely.

In the evaluation, we analyze the time cost of each step, show how data size and update ratio influence the results of incremental computation, and then describe several optimizations. Finally, to validate HadInc, five test cases were designed: 1) a few splits are changed slightly; 2) many splits are changed slightly; 3) a few splits are changed severely; 4) as the dataset grows from small to large, different chunk sizes are set to validate HadInc's efficiency; 5) real Wikipedia data is used to test HadInc. Across these tests, HadInc performs well in most cases: it handles more complex incremental scenarios while keeping computation efficiency more stable, demonstrating high reliability and wide applicability.
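The split-level reuse that Incoop performs can be sketched as follows. This is a minimal illustration, not Incoop's actual API: the function names, the SHA-1 fingerprint, and the dictionary cache are all assumptions standing in for Incoop's memoization server.

```python
import hashlib

def fingerprint(split: bytes) -> str:
    """Content fingerprint of an input split (SHA-1 as a stand-in)."""
    return hashlib.sha1(split).hexdigest()

def incremental_map(splits, map_fn, cache):
    """Run map_fn per split, reusing cached results for unchanged splits.

    `cache` maps split fingerprints to previously saved map outputs,
    playing the role of Incoop's stored history of map-task results.
    """
    results = []
    for split in splits:
        fp = fingerprint(split)
        if fp in cache:            # fingerprint found: reuse history
            results.append(cache[fp])
        else:                      # new or modified split: recompute
            out = map_fn(split)
            cache[fp] = out
            results.append(out)
    return results

# Word-count-style map over two runs; only the changed split recomputes.
cache = {}
run1 = incremental_map([b"a b", b"c d"], lambda s: len(s.split()), cache)
run2 = incremental_map([b"a b", b"c d e"], lambda s: len(s.split()), cache)
```

Reuse here is keyed purely on a split's content hash, which is why Incoop is coarse-grained: any change anywhere inside a split invalidates that entire split's cached result.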
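Content-Defined Chunking, which HadInc uses to stabilize the division, places a boundary wherever a function of the local bytes hits a fixed pattern, so boundaries move with the content rather than with byte offsets. Below is a toy sketch: a windowed byte sum stands in for the Rabin fingerprint a real chunker would use, and a production version would also enforce minimum and maximum chunk sizes.

```python
def cdc_chunks(data: bytes, window: int = 4, mask: int = 0x0F) -> list:
    """Split `data` at content-defined boundaries.

    A boundary is declared at position i when the sum of the preceding
    `window` bytes, masked to its low bits, equals zero. Because the
    test depends only on local content, an insertion near the front
    shifts only nearby boundaries; later chunks re-align and can be
    matched against history by fingerprint.
    """
    chunks, start = [], 0
    for i in range(window, len(data)):
        if sum(data[i - window:i]) & mask == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])  # trailing remainder
    return chunks

# The chunks always reassemble to the original data.
data = b"the quick brown fox jumps over the lazy dog. " * 20
chunks = cdc_chunks(data)
assert b"".join(chunks) == data
```

By contrast, a fixed-length division (as in HadUP) shifts every chunk after an insertion near the front of the dataset, which is exactly the weakness HadInc's chunking scheme addresses.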
Keywords/Search Tags: Hadoop, MapReduce, Distributed computing, Incremental processing, big data