Research On Techniques And Systems For Index And Query Optimization Of Big Data

Posted on: 2020-11-11 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: W Ge
GTID: 1368330572995939 | Subject: Computer application technology
Abstract/Summary:
With the development and popularization of information technology, the scale of real-world data has grown explosively, and the world has entered the era of big data. The great value of large-scale data resources is now widely recognized. Like the Internet era before it, the era of big data will bring tremendous changes and development opportunities to human society. These opportunities come with challenges: the application of big data raises many technical problems in storage management and data analysis.

Traditional relational databases cannot meet the needs of distributed storage management and query processing in big data environments because they are difficult to scale horizontally. Traditional SQL relational databases also struggle to process unstructured and semi-structured data effectively. In addition, as computer hardware develops and computing architectures evolve, data indexing and query optimization methods must take the performance and architectural characteristics of new hardware into account.

Research on big data storage management and query technology has attracted widespread attention, and several distributed big data storage management systems have been developed, such as HBase and Facebook's Cassandra. A number of these systems provide good support for big data management and query processing. However, because real-world data are huge in volume and complex and diverse in form, existing big data management technologies and systems cannot fully meet the requirements of practical applications, and data query techniques in particular fall far short of real needs. For example, HBase, a widely used key-value data store with good scalability, can store and manage up to tens of billions of data records. However, HBase provides only a primary key index, so queries on non-primary-key data are extremely inefficient. Compared with traditional database systems, big data query processing in distributed environments faces more technical problems and challenges in indexing methods, query performance, data consistency, system scalability, and system accessibility.

Against this background, this thesis carries out a series of studies on big data storage management and query optimization technology. Specifically, the thesis makes the following contributions:

(1) Non-primary key indexing method with hotscore cache replacement policy. HBase provides only row key indexing and does not support non-key indexing, which makes it insufficient for real-time or near-real-time applications. This thesis proposes a hierarchical secondary indexing model and method for HBase. It builds a permanent layer of secondary index for the non-key columns of an HBase table to speed up queries. It also introduces an in-memory cache layer for the secondary index together with the Hotscore Algorithm, an efficient cache replacement policy for hot index data, to reduce disk access overhead. The method achieves a higher cache hit rate and faster query response time, as well as good scalability.

(2) Range query optimization of big data based on hotscore adaptive partitioning. Range queries on big data are typically skewed: some data are accessed frequently and others are not. Dividing highly correlated data into data slices can improve both the time and space efficiency of range queries. The thesis therefore designs a skiplist-based data slice index structure and a hotscore adaptive partitioning mechanism. The average dutyrate and the hotscore are defined to evaluate how hot each data slice is. The adaptive tuning algorithm continuously adjusts the data partitioning by splitting and merging slices so that it fits the query patterns. Under a limited cache space, the method maximizes the cache hit rate of range queries by adaptively adjusting data fragmentation, thus improving range query efficiency. Hot data are partitioned more finely to improve partitioning precision and adjustment sensitivity, whereas cold data are partitioned at a coarser granularity to reduce storage overhead and query search cost.

(3) Data partitioning optimization based on a correlation-aware model. By studying the characteristics and distribution of range queries on big data, the thesis presents a correlation-aware partitioning model for skewed range queries. It formulates partitioning optimization on continuous correlated data as a geometric step-curve fitting problem, in which the partitioning algorithm aims to minimize the fitting cost deviation. On this basis, the thesis states and proves the following theorem: a partitioning position that matches the query distribution always falls on a range query boundary. Building on the theorem, Range Boundary Based DP Partitioning is designed to derive the optimal partition at a significantly lower computation cost than the baseline algorithm. For further efficiency, Bottom-up Merging Partitioning is proposed, which partitions by bottom-up merging instead of searching.

(4) Based on the above key techniques, the thesis constructs and implements the HiBase system. HiBase improves query efficiency through a hierarchical non-primary-key index. It has become a product of ZTE Corporation and has been applied in domestic banks, achieving remarkable performance gains. The query methods based on the skiplist-based index and data partitioning optimization are also integrated into HiBase.

The experimental results show that the proposed methods perform well for query optimization of big data in distributed environments. HiBase outperforms standard HBase by two orders of magnitude and outperforms Hindex, Huawei's open source system, by one order of magnitude. Compared with transforming range queries into batch point queries on non-primary attributes, the data partitioning algorithm for range queries also improves performance by up to two orders of magnitude.
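
The two-layer secondary index described in contribution (1) can be illustrated with a minimal Python sketch. It is not the HiBase implementation: the class names, the plain dictionaries standing in for the HBase table and the on-disk permanent index layer, and in particular the scoring rule (a hit count that can be decayed) are all assumptions made for illustration. The abstract does not specify how the Hotscore Algorithm computes its scores.

```python
from collections import defaultdict


class HotScoreCache:
    """Bounded in-memory cache that evicts the entry with the lowest score.

    The score used here (a hit count that age() decays) is an illustrative
    assumption, not the Hotscore Algorithm from the thesis.
    """

    def __init__(self, capacity, decay=0.5):
        self.capacity = capacity
        self.decay = decay
        self.entries = {}  # indexed column value -> list of row keys
        self.scores = {}   # indexed column value -> hot score

    def get(self, value):
        if value not in self.entries:
            return None
        self.scores[value] += 1.0  # each hit makes the entry hotter
        return self.entries[value]

    def put(self, value, row_keys):
        if value not in self.entries and len(self.entries) >= self.capacity:
            coldest = min(self.scores, key=self.scores.get)
            del self.entries[coldest]
            del self.scores[coldest]
        self.entries[value] = row_keys
        self.scores.setdefault(value, 1.0)

    def invalidate(self, value):
        self.entries.pop(value, None)
        self.scores.pop(value, None)

    def age(self):
        """Decay all scores so that old hits count less than recent ones."""
        for value in self.scores:
            self.scores[value] *= self.decay


class SecondaryIndex:
    """Secondary index on one non-key column, with a cache layer on top."""

    def __init__(self, column, cache_capacity):
        self.column = column
        self.table = {}                     # row key -> row (stand-in for the HBase table)
        self.permanent = defaultdict(list)  # column value -> row keys (stand-in for the disk layer)
        self.cache = HotScoreCache(cache_capacity)
        self.permanent_reads = 0            # counts simulated disk accesses

    def insert(self, row_key, row):
        self.table[row_key] = row
        value = row[self.column]
        self.permanent[value].append(row_key)
        self.cache.invalidate(value)        # avoid serving a stale cached entry

    def lookup(self, value):
        row_keys = self.cache.get(value)
        if row_keys is None:                # cache miss: read the permanent layer
            self.permanent_reads += 1
            row_keys = list(self.permanent.get(value, []))
            self.cache.put(value, row_keys)
        return [self.table[key] for key in row_keys]


# Usage: index the hypothetical "city" column and query by a non-key value.
index = SecondaryIndex("city", cache_capacity=2)
index.insert("r1", {"name": "a", "city": "Hefei"})
index.insert("r2", {"name": "b", "city": "Hefei"})
rows = index.lookup("Hefei")  # first lookup reads the permanent layer
rows = index.lookup("Hefei")  # second lookup is served from the cache
```

In this sketch, a lookup on a non-key value touches the permanent layer only on a cache miss, and the least-hit entry is evicted first when the cache is full, so frequently queried values tend to stay in memory.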
Keywords/Search Tags:big data processing, distributed storage management, query optimization, range query, index, cache replacement policy, data partitioning