Research On Storage And Retrieval Optimization Of Big Data

Posted on:2019-01-07

Degree:Master

Type:Thesis

Country:China

Candidate:H Xue

Full Text:PDF

GTID:2348330569995566

Subject:Engineering

Abstract/Summary:

HBase is a database that stores massive amounts of unstructured data. It performs well when retrieving by primary key, but retrieval on non-primary keys requires a full-table scan, which is extremely inefficient. HBase also has poor support for SQL query statements. Based on the format of the project's data set, the project's retrieval requirements, and its real-time objectives, this thesis focuses on designing a storage and retrieval framework with HBase at its core and on improving the performance of that framework. Its main work is as follows:

(1) Designed the LBase+IHive storage and retrieval framework. The LBase storage layer combines HBase and Lucene so that both stability and real-time performance are taken into account when the amount of data is large. Data in the storage layer is classified before it is stored: real-time data is stored in Lucene, and historical data is stored in HBase. The IHive retrieval layer combines Hive and Impala. When the amount of data is small, retrieval goes through Impala, which avoids the latency of Hive's MapReduce-based retrieval. When the amount of data is large, retrieval goes through Hive, which avoids the memory overflow and weak fault tolerance of Impala retrieval.

(2) Designed the HBase secondary index. The primary key (RowKey) of an HBase data record is formed by concatenating a prefix field generated by a cyclic program with the IP field and the time field, the two most commonly used fields of the project data set. Based on the RowKey, the index key of the HBase secondary index is designed as a field composed of the minimum prefix value within a single node, a combined-index identifier, and the RowKey, which ensures that the index and the data are logically separated. The secondary index and the corresponding data record are stored in the same table. The index value and the data record value are stored in different column families to ensure that the index and the data are physically separated.

(3) Applied a topology-aware algorithm to the LBase+IHive storage and retrieval framework. The topology-aware algorithm dynamically distributes data copies across the cluster according to their correlation. It reduces the network traffic caused by unnecessary movement of data copies in the MapReduce processing flow, thereby lowering the communication overhead of data copies and reducing latency.

Finally, a prototype system is implemented and a series of performance experiments is carried out on it. The thesis uses a data generator to produce log data, designs a corresponding experiment for each of the improvements, records the data during the experiments, and analyzes the results. The experimental results show that the improvements made in this thesis achieve the project's expected goals for big data storage and retrieval.
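The key layout described in (2) can be illustrated with a minimal Python sketch. The abstract does not give the exact encoding, so the separator, the two-digit prefix width, the number of prefixes, the index identifier, and the column family names `data` and `index` are all assumptions made for illustration, and a plain dictionary stands in for the HBase table.

```python
from itertools import count

SEP = "|"  # assumed field separator; the abstract does not specify one


def make_prefix_generator(num_prefixes=4):
    """Return a function that cycles through fixed-width prefixes.

    Stands in for the 'cyclic program' of the abstract; the number of
    prefixes and the two-digit width are hypothetical.
    """
    counter = count()

    def next_prefix():
        return f"{next(counter) % num_prefixes:02d}"

    return next_prefix


def make_rowkey(prefix, ip, timestamp):
    """RowKey = prefix + IP field + time field."""
    return SEP.join([prefix, ip, timestamp])


def make_index_key(node_min_prefix, index_id, rowkey):
    """Index key = minimum prefix on the node + index identifier + RowKey."""
    return SEP.join([node_min_prefix, index_id, rowkey])


def store(table, rowkey, index_key, record):
    """Write the record and its index entry into the same table,
    each under its own (hypothetically named) column family."""
    table[rowkey] = {"data": dict(record)}
    table[index_key] = {"index": {"ref": rowkey}}


next_prefix = make_prefix_generator(4)
prefixes = [next_prefix() for _ in range(5)]

rowkey = make_rowkey("01", "10.0.0.7", "20190107120000")
index_key = make_index_key("00", "idx_user", rowkey)

table = {}
store(table, rowkey, index_key, {"user": "alice"})
```

In this sketch the index entry and the data record share one table but never share a row key or a column family, which mirrors the logical and physical separation the abstract describes.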

Keywords/Search Tags:

big data, HBase, secondary index, storage and retrieval optimization

Related items

1	Research On GNSS Data Storage And Retrieval Based On HBASE
2	Research On Retrieval Speed Improvement Of HBase Based On Coprocessor Mechanism
3	Research On Secure Index Of HBase Database
4	Research And Development Of Big Data Storage Systems Based On Hbase
5	Research Of Big Data Store Query Technology Based On HBase
6	Design And Implication Of Mini-files Storage System Based On Hbase
7	The Research And Implementation Of Indexing And Query Techniques Based On HBase And In-memory Database
8	Optimization Of Massive Meteorological Structured Data Query Based On HBase
9	Research And Application Of Query Optimization Based On HBase
10	Research On HBase-based Mass Image Storage And Fast Retrieval Technology