
Research On Storage And Retrieval Optimization Of Big Data

Posted on: 2019-01-07
Degree: Master
Type: Thesis
Country: China
Candidate: H Xue
GTID: 2348330569995566
Subject: Engineering
Abstract/Summary:
HBase is a database that stores massive amounts of unstructured data. It performs well for primary key retrieval, but retrieval on non-primary keys requires a full-table scan, which makes it extremely inefficient. HBase also has poor support for SQL query statements. This thesis focuses on designing a storage and retrieval framework built around HBase, and on improving the performance of that framework, according to the format of the project's data set, the project's requirements for data set retrieval, and the project's real-time objectives. Its main work is as follows:

(1) Designed the LBase+IHive storage and retrieval framework. The LBase storage layer combines HBase and Lucene so that both stability and real-time performance can be maintained when facing a large amount of data. Data in the storage layer is classified before it is stored: real-time data is stored in Lucene, and historical data is stored in HBase. The IHive retrieval layer combines Hive and Impala. When the amount of data is small, retrieval goes through Impala, which avoids the latency of Hive's MapReduce-based retrieval. When the amount of data is large, retrieval goes through Hive, which avoids the memory overflow and fault-tolerance limitations of Impala retrieval.

(2) Designed the HBase secondary index. The primary key (RowKey) of an HBase data record is formed by concatenating a prefix field generated by a cyclic program with the IP field and the time field, the two most commonly used fields of the project data set. Based on the RowKey, the index key of the HBase secondary index is designed as a field composed of the minimum prefix value within a single node, a combined-index identifier, and the RowKey, which ensures that the index and the data are logically separated. The secondary index and the corresponding data record are stored in the same table. The index values and the data record values are stored in different column families, which ensures that the index and the data are physically separated.

(3) Applied a topology-aware algorithm to the LBase+IHive storage and retrieval framework. The topology-aware algorithm dynamically distributes data copies in the cluster according to their correlation. It reduces the network traffic caused by unnecessary movement of data copies in the MapReduce processing flow, thereby lowering the communication overhead of data copies and reducing latency.

Finally, a prototype system is implemented in this thesis, and a series of performance experiments is performed on it. The thesis uses a data generator to generate log data, designs a corresponding experiment for each improvement, records the data during the experiments, and analyzes the experimental results. The results show that the improvements made in this thesis achieve the project's expected goals for big data storage and retrieval.
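
The RowKey and index key layout described in item (2) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the `|` separator, the zero-padded field widths, the number of prefixes, the index identifier string, and all function names are assumptions introduced here only to make the concatenation order concrete.

```python
from itertools import count

NUM_PREFIXES = 4  # assumed number of distinct prefixes; not stated in the abstract


def make_prefix_generator(num_prefixes=NUM_PREFIXES):
    """Return a function that cycles through prefixes 00, 01, ... (assumed scheme)."""
    counter = count()

    def next_prefix():
        return f"{next(counter) % num_prefixes:02d}"

    return next_prefix


def make_row_key(prefix, ip, timestamp_ms):
    """Concatenate prefix, IP field, and time field into a RowKey.

    Each IP octet is zero-padded to three digits and the timestamp to
    thirteen digits so that keys compare correctly as strings.
    """
    ip_part = "".join(f"{int(octet):03d}" for octet in ip.split("."))
    return f"{prefix}|{ip_part}|{timestamp_ms:013d}"


def make_index_key(node_min_prefix, index_id, row_key):
    """Build an index key: minimum prefix of the node, index identifier, RowKey."""
    return f"{node_min_prefix}|{index_id}|{row_key}"


if __name__ == "__main__":
    next_prefix = make_prefix_generator()
    row_key = make_row_key(next_prefix(), "10.0.0.7", 1546819200000)
    # "00" stands for the smallest prefix served by the node holding this row;
    # "ip_time" is a hypothetical identifier for a combined IP+time index.
    print(row_key)
    print(make_index_key("00", "ip_time", row_key))
```

In an actual HBase table, the data record and the index entry would be written under these two keys in different column families, as the abstract describes; that storage step is omitted here.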
Keywords/Search Tags: big data, HBase, secondary index, storage and retrieval optimization