
Distributed Joins And Optimization For BIG Table Based On Database OceanBase

Posted on: 2017-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: Q S Fan
Full Text: PDF
GTID: 2308330485472883
Subject: Software engineering
Abstract/Summary:
In the era of big data, the data sizes of many applications reach the PB, EB, or even ZB scale, especially for flash-sale ("seckill"), rush-purchase, and other exceptionally popular Internet applications. Traditional database systems are inadequate for these applications. Therefore, how to manage and utilize big data has become a problem that industry and academia must pay attention to. Driven by hardware upgrades and application demand, NoSQL databases and distributed database techniques have developed rapidly. These databases are part of the solution to the challenges of massive data management, and they already support some big data applications. For some complex data management tasks, however, such as joins on big tables, their performance is still unsatisfactory.

The join operator is one of the most important operators in a relational database. Ensuring the correctness and availability of join operations in a massive, distributed setting is a very challenging task, especially for read/write splitting databases.

To handle massive data processing, research on distributed databases focuses on scalability. Many systems, such as HBase and Shark, are designed with this goal. For OLTP (transactional) databases, distributing data may lead to a very serious problem: highly inefficient distributed transactions. To solve this problem, read/write splitting databases have in recent years become a branch of distributed database systems. This architecture puts all incremental (update) data on a single node to avoid distributed transactions, and puts the baseline data on multiple nodes to achieve high scalability. OceanBase, a distributed database, is a typical representative of this architecture. The architecture can respond to transactions efficiently, but it also increases the complexity of query operations: every query can be completed only after merging baseline and incremental data.
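The merge step described above can be illustrated with a minimal Python sketch. It is not OceanBase's storage format or API: the in-memory dictionaries, the convention that `None` marks a deleted row, and the names `merged_read` and `merged_scan` are all assumptions made for illustration.

```python
from typing import Dict, Iterator, Optional, Tuple

Row = Dict[str, object]

# Hypothetical stand-ins (assumptions, not OceanBase structures):
#   baseline:    primary key -> full row, spread over many nodes in a real system
#   incremental: primary key -> changed columns, or None if the row was deleted


def merged_read(baseline: Dict[int, Row],
                incremental: Dict[int, Optional[Row]],
                key: int) -> Optional[Row]:
    """Return the current version of one row by overlaying the delta on the baseline."""
    if key in incremental:
        delta = incremental[key]
        if delta is None:
            return None  # deleted since the baseline was written
        # Changed columns win over baseline columns; new rows have no baseline part.
        return {**baseline.get(key, {}), **delta}
    return baseline.get(key)


def merged_scan(baseline: Dict[int, Row],
                incremental: Dict[int, Optional[Row]]) -> Iterator[Tuple[int, Row]]:
    """Scan all live rows in key order, merging both stores row by row."""
    for key in sorted(baseline.keys() | incremental.keys()):
        row = merged_read(baseline, incremental, key)
        if row is not None:
            yield key, row
```

In this sketch every read touches both stores, which hints at why a join over large tables becomes expensive when the two stores sit on different nodes.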
When multiple large tables are joined, the performance of this architecture drops significantly because of the large amount of network traffic.

To optimize the efficiency of large table joins in this architecture, we design and implement a semi-join (SemiJoin) and a distributed sort merge join on two large tables based on OceanBase. Furthermore, we optimize the traditional join algorithm based on the features of the read/write splitting architecture.

The main contributions of this paper are as follows:

1. Design and implement a semi-join algorithm based on the architectural features of OceanBase. An efficient semi-join operator is obtained by speeding up table filtering and reducing network traffic. The efficiency of the optimized algorithm is verified by a series of comparative experiments.
2. Design and implement a distributed sort merge join algorithm based on the architectural features of OceanBase. After dividing the values of the join attribute into multiple ranges, the algorithm performs the sort merge join in each range in parallel. For incremental data in particular, a "maximum range" algorithm is proposed. This algorithm treats baseline and incremental data separately, which greatly improves the efficiency of the join. The efficiency of the optimized algorithm is demonstrated by a series of comparative experiments.
3. Propose the idea of treating baseline and incremental data separately instead of merging them first. This idea applies not only to join operations but also to the design of query engines for systems with the same architecture.
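The general semi-join technique named in the first contribution can be sketched in Python as follows. This is the textbook form of the idea, not the thesis's optimized operator: the two tables are plain lists standing in for data on different nodes, and the names `semi_join`, `semi_join_reduce`, and the `shipped` counter are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple

Row = Dict[str, object]


def semi_join_reduce(right_rows: List[Row], key: str, wanted: Set[object]) -> List[Row]:
    """Runs where the right table lives: keep only rows whose join key was requested."""
    return [r for r in right_rows if r[key] in wanted]


def semi_join(left_rows: List[Row], right_rows: List[Row], key: str,
              left_filter: Callable[[Row], bool]) -> Tuple[List[Row], int]:
    """Join two tables while shipping only the right rows that can match.

    Returns the joined rows and the number of right rows that crossed
    the (simulated) network.
    """
    # 1. Filter the left table locally before anything is sent.
    filtered = [l for l in left_rows if left_filter(l)]
    # 2. Send only the distinct join keys to the other side.
    wanted = {l[key] for l in filtered}
    # 3. The remote side returns just the matching rows.
    reduced = semi_join_reduce(right_rows, key, wanted)
    # 4. Finish with a local hash join on the reduced right table.
    by_key: Dict[object, List[Row]] = defaultdict(list)
    for r in reduced:
        by_key[r[key]].append(r)
    joined = [{**l, **r} for l in filtered for r in by_key.get(l[key], [])]
    return joined, len(reduced)
```

With a selective filter on the left table, the count of shipped right rows falls from the full table size to the number of matching rows, which is the traffic reduction the abstract refers to.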
Keywords/Search Tags:Distributed Database, Query Engine, Query Optimization, SemiJoin, Distributed Sort Merge Join, Parallel Computing