
Research On Data Processing Technology Based On HBase

Posted on: 2020-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: J C Sun
Full Text: PDF
GTID: 2428330596468997
Subject: Public Security Technology
Abstract/Summary:
HBase is an important column-oriented database that is well suited to large-scale distributed storage, and data processing technology based on HBase has long been a research hotspot. Data compression and data retrieval are the key technologies of data processing. Data compression saves storage space, reduces data I/O and improves data processing speed, and as data is generated at an ever faster rate, higher requirements are placed on compression technology. At the same time, although HBase has a strong storage advantage, its inherent design gives it poor support for data retrieval, which limits its application scenarios. To address these problems, this thesis studies HBase data compression and retrieval technology. The specific work is as follows.

To solve the problems of high learning cost and low compression efficiency, a sort-based hybrid compression strategy combining column-based compression and sector-based compression is proposed. Firstly, a method for sorting the data in each column is designed according to the characteristics of HBase, to strengthen data compaction. Secondly, the compression algorithms suited to different kinds of data are selected through research, and the XGBoost algorithm, which has excellent generalization characteristics and supports parallel computing, is introduced as the classification algorithm of the compression strategy. Finally, according to the characteristics of the data, the proposed hybrid column-based compression strategy or hybrid sector-based compression strategy is applied to recommend a compression algorithm. Experiments conducted on TPC-DS standard data show that the proposed strategy performs better in terms of compression ratio and compression/decompression time.

To make up for the deficiencies of HBase in full-text search performance, a joint full-text retrieval strategy based on HBase is proposed. The strategy covers three methods: data storage, data indexing and data retrieval. Firstly, a data storage method is designed to quickly import the data and classify it according to different retrieval requirements. Secondly, a data indexing method is designed to generate the inverted index through a text analyzer and store it in the index table. Finally, a data retrieval method is designed: a full-text search request is first queried by ElasticSearch, and the record IDs of the matching results are then passed to HBase to obtain the other attribute values stored under the corresponding row keys. The key issues affecting retrieval performance, such as HBase table structure design, text analyzer construction and the volume of data returned, are discussed. The proposed strategy is verified by experiments on time and space cost and on query performance. The experimental results show that the joint retrieval strategy can greatly improve query efficiency in full-text search while incurring only a small time and space cost.
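The idea of sorting a column and then recommending a compression algorithm for it can be illustrated with a minimal sketch. This is not the thesis's method: the abstract describes an XGBoost classifier that predicts the algorithm, whereas the sketch below simply tries every candidate and keeps the smallest output, which is one plausible way to produce training labels for such a classifier. The codec set (`zlib`, `bz2`, `lzma` from the Python standard library), the newline-separated serialization and the function names `encode_column` and `best_codec` are all assumptions made for illustration.

```python
import bz2
import lzma
import zlib

# Candidate codecs. These come from the Python standard library and are an
# assumption of this sketch; the abstract does not name the algorithms used.
CODECS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def encode_column(values):
    """Serialize one column as newline-separated UTF-8 text (hypothetical layout)."""
    return "\n".join(values).encode("utf-8")


def best_codec(values, sort_first=True):
    """Return (codec_name, compressed_size_in_bytes) for the smallest output.

    Exhaustive trial stands in for the trained classifier described in the
    abstract: it yields the label a classifier would be trained to predict.
    """
    ordered = sorted(values) if sort_first else list(values)
    raw = encode_column(ordered)
    sizes = {name: len(compress(raw)) for name, compress in CODECS.items()}
    name = min(sizes, key=sizes.get)
    return name, sizes[name]


if __name__ == "__main__":
    # Synthetic column with 50 distinct values repeated many times.
    column = [f"store_{i % 50:03d}" for i in range(2000)]
    print("raw bytes:", len(encode_column(column)))
    print("sorted:   ", best_codec(column, sort_first=True))
    print("unsorted: ", best_codec(column, sort_first=False))
```

Calling `best_codec` with `sort_first` set to `True` and then `False` gives a quick way to check, on a given column, whether sorting actually helps; the outcome depends on the data and on the codec.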
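The two-step retrieval flow described above (resolve the full-text query to record IDs in the index, then fetch the remaining attributes by row key) can be sketched in memory. This is an illustrative stand-in under stated assumptions, not the thesis's implementation: a Python dict plays the role of the HBase table, a term-to-row-key map plays the role of the ElasticSearch index, and the class `JointStore`, the function `analyze` and the `limit` parameter are hypothetical names. No real HBase or ElasticSearch client calls are used.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy text analyzer: lowercase, then split on non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


class JointStore:
    """In-memory stand-in: `rows` mimics the data table, `index` the search index."""

    def __init__(self, text_field):
        self.text_field = text_field
        self.rows = {}                 # row key -> dict of column values
        self.index = defaultdict(set)  # term -> set of row keys (inverted index)

    def put(self, row_key, columns):
        """Store a row and index its text field.

        Updates to an existing row key are not de-indexed in this sketch.
        """
        self.rows[row_key] = dict(columns)
        for term in analyze(columns.get(self.text_field, "")):
            self.index[term].add(row_key)

    def search(self, query, limit=10):
        """Step 1: resolve the query to row keys (all terms must match).
        Step 2: fetch the full rows by row key, capped at `limit` results."""
        terms = analyze(query)
        if not terms:
            return []
        keys = set.intersection(*(self.index.get(t, set()) for t in terms))
        return [(key, self.rows[key]) for key in sorted(keys)[:limit]]


if __name__ == "__main__":
    store = JointStore(text_field="body")
    store.put("r1", {"body": "HBase stores wide tables", "author": "a"})
    store.put("r2", {"body": "Full-text search over HBase", "author": "b"})
    print(store.search("search hbase"))
```

The `limit` argument mirrors the point about the volume of data returned: only the row keys that survive the cap trigger a lookup in the data table, so a smaller cap means fewer second-step reads.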
Keywords/Search Tags: Column-oriented storage, HBase, Data compression, Data retrieval