Research On HBase Secondary Index Based On Hash Algorithm

Posted on: 2019-01-13
Degree: Master
Type: Thesis
Country: China
Candidate: Y Xu
Full Text: PDF
GTID: 2428330578472774
Subject: Information Science
Abstract/Summary:
With the continuous improvement of informatization in various fields, people's ability to exchange information through the Internet has increased, the amount of data on the Internet has grown rapidly, and the storage and efficient retrieval of massive data have become problems that need to be solved. Relational databases struggle to handle massive data, which led to the emergence of NoSQL databases. HBase, a typical representative of NoSQL databases, is widely used in industry. Compared with an RDBMS, HBase is more flexible, places no restrictions on data types, is easy to extend, is highly reliable, and is better suited to storing massive unstructured data. However, HBase only provides rowkey-based key-value queries and full table scans, so its data queries are not as flexible as those of an RDBMS. Although the industry has added secondary indexes to HBase, the rowkeys in the open-source HBase secondary index solutions are lengthy, which wastes storage space. HBase is column-oriented and stores the rowkey in every Cell; therefore, the more rows and columns HBase stores, the more storage space is wasted. This thesis first summarizes the related big data technologies and introduces the architecture and principles of the systems in the Hadoop ecosystem, along with the functions of each component. Then, we summarize the HBase rowkey design principles by reading the literature and analyzing the HBase source code. Next, we design a hash-based HBase secondary index to make up for the shortcomings of existing solutions. We use a hash algorithm to map long index rowkeys to 16-byte hash values and use the hash values as rowkeys in the index table, which solves the problem of wasted storage space. Finally, this thesis builds a cluster environment that supports HBase using the Cloudera distribution of Hadoop (CDH) and ZooKeeper, and uses 8 million records crawled from online bookstores as the data source for empirical analysis. The results show that the proposed solution performs almost the same as other secondary index solutions in terms of query performance, but occupies much less storage space. The proposed solution is suitable for cases where the rowkeys exceed 16 bytes. In multi-column join queries, an index rowkey built from multiple column values becomes very long and varies in length, which does not conform to the HBase rowkey design principles. When building a secondary index table on a data table with hundreds of millions of rows, the proposed solution maps long index rowkeys that exceed 16 bytes to 16-byte hash values that comply with the HBase rowkey design principles. Using hash values as the rowkeys of the index table can save a lot of storage space.
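The mapping from a long index rowkey to a 16-byte hash value can be illustrated with a minimal sketch. The abstract does not name the hash algorithm or the layout of the index rowkey, so the following are assumptions made only for illustration: MD5 is used because its digest is exactly 16 bytes, and the helper names, the separator byte, and the sample column values are all hypothetical. How the index table stores the corresponding data-table rowkeys is outside the scope of this sketch.

```python
import hashlib

# Hypothetical separator between the column values of a composite index rowkey.
SEPARATOR = b"\x00"


def build_long_index_rowkey(column_values):
    """Concatenate the indexed column values into one long rowkey (assumed layout)."""
    return SEPARATOR.join(value.encode("utf-8") for value in column_values)


def hash_index_rowkey(long_rowkey):
    """Map a rowkey of any length to a fixed 16-byte value.

    MD5 is an assumption: it is chosen here only because its digest is 16 bytes.
    """
    return hashlib.md5(long_rowkey).digest()


# Sample values are invented for illustration; they are not from the thesis data set.
long_key = build_long_index_rowkey(
    ["Introduction to Distributed Storage Systems", "Example Press"]
)
short_key = hash_index_rowkey(long_key)

print(len(long_key), "bytes before hashing")
print(len(short_key), "bytes after hashing")
```

Because the same column values always hash to the same 16 bytes, an exact-match query can recompute the hash and look up the index row directly. A hash does not preserve the ordering of the original values, so a sketch like this supports equality lookups only, and it does not address hash collisions.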
Keywords/Search Tags: HBase, Secondary Index, Hash algorithm, Complex condition query