
Optimization For The Data Access Mode Of MapReduce In HBase

Posted on: 2013-10-07    Degree: Master    Type: Thesis
Country: China    Candidate: S L Tian    Full Text: PDF
GTID: 2298330422974266    Subject: Computer Science and Technology
Abstract/Summary:
With the development of information technology, the volume of data on the Internet is growing rapidly and its variety is increasing. The world has shifted to a data-centric paradigm, the era of "Big Data". Traditional database management systems cannot meet the requirements for processing big data efficiently: their storage capacity is hard to scale out and their queries become inefficient. More and more enterprises therefore turn to the open-source Hadoop platform and use HBase to store and manage data. Reading data from HBase can be parallelized with the MapReduce framework, and compared with traditional database management the processing speed is greatly improved. However, under this framework the speed of reading data from HBase still cannot match the speed of processing it, mainly because HBase's data-reading interface cannot fully guarantee data locality.
This thesis first introduces the background of big data, including data storage and data processing technology, briefly describes the classification, characteristics and main platforms of cloud computing, and then focuses on the three key Hadoop technologies that are most widely used today: HDFS, MapReduce and HBase. This provides the theoretical basis for analyzing and improving the MapReduce interface of HBase.
After analyzing in detail the task allocation process of the MapReduce framework, the data splitting process, and the workflow of HBase's data-reading interface (Scan), three bottlenecks of MapReduce computing on HBase are identified: 1) tasks cannot be executed entirely on local data; 2) the data within a Region is read serially; 3) the data must be merged to produce each record. To address these difficulties, this thesis proposes an improved method that assigns tasks according to HBase's physical storage units (HDFS Blocks) rather than its logical storage units (Regions). The thesis redesigns the file-reading interface and adopts the locality-first MapReduce scheduling policy proposed by Hua Zhongjie.
Finally, the experiments indicate that the improved interface removes the extra overhead of Scan, greatly strengthens data locality, and reduces the time cost of data access to about 1/10 of that of the original interface, thereby saving working time and improving efficiency.
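For context, the sketch below illustrates the standard Scan-based read path that the thesis takes as its baseline: a map-only job wired up through HBase's TableMapReduceUtil, where each input split corresponds to a Region and rows are fetched through the RegionServer rather than directly from the HDFS Blocks backing the HFiles. The table name and the row-counting mapper are hypothetical placeholders for illustration only, not the thesis's actual workload or its improved interface.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

// Baseline: a MapReduce job that reads an HBase table through the Scan-based
// input format. Splits are derived from Regions, so each map task scans one
// Region serially via the RegionServer; this is the access path whose
// locality the thesis analyzes and improves upon.
public class ScanBasedReadJob {

    // Hypothetical mapper that simply counts rows.
    static class RowCountMapper extends TableMapper<Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
                throws IOException, InterruptedException {
            context.write(new Text("rows"), ONE);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "scan-based-hbase-read");
        job.setJarByClass(ScanBasedReadJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // batch rows per RPC to the RegionServer
        scan.setCacheBlocks(false);  // recommended for full-table MapReduce scans

        // One map task per Region; data is read through the Scan interface,
        // not directly from the HDFS Blocks that store the Region's HFiles.
        TableMapReduceUtil.initTableMapperJob(
                "example_table",        // hypothetical table name
                scan,
                RowCountMapper.class,
                Text.class,
                IntWritable.class,
                job);

        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The thesis's improvement replaces the Region-oriented split logic above with splits based on the HDFS Blocks that physically hold the data, so that map tasks can be scheduled where the blocks reside.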
Keywords/Search Tags: Big Data, data processing, HBase, MapReduce framework, data locality