
Optimization For The Data Access Mode Of MapReduce In HBase

Posted on: 2013-10-07    Degree: Master    Type: Thesis
Country: China    Candidate: S L Tian    Full Text: PDF
GTID: 2298330422974266    Subject: Computer Science and Technology
Abstract/Summary:
With the development of information technology, the volume of data on the Internet is growing rapidly and its variety is increasing. The world has shifted to a data-centric paradigm, the era of "Big Data". Traditional database management systems cannot meet the requirements for processing big data efficiently: their storage capacity is hard to scale out and their queries become inefficient. More and more enterprises therefore turn to the open-source Hadoop platform and use HBase to store and manage data. Reading data from HBase can be parallelized with the MapReduce framework, and compared with traditional database management the processing speed is greatly improved. However, under this framework the speed of reading data from HBase still cannot match the speed of processing it, mainly because HBase's data-reading interface cannot fully guarantee data locality.
This thesis first introduces the background of big data, including data storage and data processing technology, briefly describes the classification, characteristics and main platforms of cloud computing, and then focuses on the three key Hadoop technologies that are most widely used today: HDFS, MapReduce and HBase. This provides the theoretical basis for analyzing and improving the MapReduce interface of HBase.
After analyzing in detail the task allocation process of the MapReduce framework, the data splitting process, and the workflow of HBase's data-reading interface (Scan), three bottlenecks of MapReduce computing on HBase are identified: 1) tasks cannot be executed entirely on local data; 2) the data within a Region is read serially; 3) the data must be merged to produce each record. To address these difficulties, this thesis proposes an improved method that assigns tasks according to HBase's physical storage units (HDFS Blocks) rather than its logical storage units (Regions). The thesis redesigns the file-reading interface and adopts the locality-first MapReduce scheduling policy proposed by Hua Zhongjie.
Finally, the experiments indicate that the improved interface removes the extra overhead of Scan, greatly strengthens data locality, and reduces the time cost of data access to about 1/10 of that of the original interface, thereby saving working time and improving efficiency.
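For context, the sketch below illustrates the standard Scan-based read path that the thesis takes as its baseline: a map-only job wired up through HBase's TableMapReduceUtil, where each input split corresponds to a Region and rows are fetched through the RegionServer rather than directly from the HDFS Blocks backing the HFiles. The table name and the row-counting mapper are hypothetical placeholders for illustration only, not the thesis's actual workload or its improved interface.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

// Baseline: a MapReduce job that reads an HBase table through the Scan-based
// input format. Splits are derived from Regions, so each map task scans one
// Region serially via the RegionServer; this is the access path whose
// locality the thesis analyzes and improves upon.
public class ScanBasedReadJob {

    // Hypothetical mapper that simply counts rows.
    static class RowCountMapper extends TableMapper<Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
                throws IOException, InterruptedException {
            context.write(new Text("rows"), ONE);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "scan-based-hbase-read");
        job.setJarByClass(ScanBasedReadJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // batch rows per RPC to the RegionServer
        scan.setCacheBlocks(false);  // recommended for full-table MapReduce scans

        // One map task per Region; data is read through the Scan interface,
        // not directly from the HDFS Blocks that store the Region's HFiles.
        TableMapReduceUtil.initTableMapperJob(
                "example_table",        // hypothetical table name
                scan,
                RowCountMapper.class,
                Text.class,
                IntWritable.class,
                job);

        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The thesis's improvement replaces the Region-oriented split logic above with splits based on the HDFS Blocks that physically hold the data, so that map tasks can be scheduled where the blocks reside.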
Keywords/Search Tags: Big Data, data processing, HBase, MapReduce framework, data locality