
Research Of Hybrid Row-column Storage Model Based On Hadoop

Posted on: 2016-11-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y F Li
Full Text: PDF
GTID: 1108330503456059
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Information technology has advanced rapidly over the last two decades. With the growth of the Internet and the spread of PCs, mobile devices and other communication terminals, daily data production has reached the terabyte and even petabyte scale, leading the world into a "Big Data" era. This huge amount of data brings new opportunities and challenges to manufacturers, applications and researchers. Effective storage and fast retrieval of big data are currently among the most active research topics.

As an open-source framework for processing big data, Hadoop performs distributed storage and parallel querying over massive amounts of data. It has been adopted by well-known organizations with large user bases and is widely studied by database researchers. Existing research shows that the data storage model is a key factor affecting data access efficiency, and is therefore central to Hadoop's query processing on big data. Several distributed database systems are built on Hadoop, each using a different storage model to manage particular types of data. These storage models support complex data types and can query semi-structured and even unstructured data; however, their performance on structured data remains worse than that of a relational database on a single machine.

There are three commonly used storage models for relational data: row-wise, column-wise, and hybrid row-column. The row-wise model suits write-optimized transactional workloads: it loads data quickly and requires no tuple reconstruction, but it cannot avoid full-table scans and achieves a poorer compression ratio. The column-wise model suits read-optimized analytical workloads: it reads only the columns specified by a query instead of scanning the full table and compresses much better, but data loading and updating are slower, and tuple reconstruction incurs additional network transfer cost, especially in a Hadoop environment. The hybrid row-column model adopts the advantages and avoids most of the disadvantages of both, achieving efficient query processing, yet many issues remain open for study and improvement. For instance, it does not support instant location of specific data items, which is an obstacle to random access and to applying index techniques.

This thesis proposes a page-based hybrid row-column storage structure on Hadoop named PageFile, conducts deeper research by combining it with index technologies, multi-table join optimization and data compression technologies, and improves the performance of query processing on big data in a distributed environment. The major contributions of this thesis are as follows.

(1) Drawing on the strengths and weaknesses of existing storage models on Hadoop, this thesis proposes a page-based hybrid row-column storage model with a storage structure named PageFile, designs algorithms for data loading and tuple reconstruction, and implements queries on a single table. The model inherits from traditional relational databases the practice of dividing each data file into uniform fixed-size pages. It also follows the hybrid model's "first horizontally, then vertically" principle: each horizontal partition is stored in one HDFS file, inside which columns are placed vertically, so that all data within a page belongs to the same column. The page-based structure supports fast locating: a single page can be accessed directly by its sequence number, which can serve as a pointer for building index structures on Hadoop. Meanwhile, the hybrid partitioning avoids scanning columns irrelevant to the query, eliminating the network transfer cost of the tuple reconstruction process and thus improving query performance. Experimental results show that, compared with existing storage models on Hadoop, the page-based hybrid row-column model not only saves disk space but also achieves better query rates on a single table without any obvious penalty on data loading.
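As a rough illustration of the layout just described, the following Java sketch models one PageFile partition. All names here (Page, PageFilePartition, PAGE_SIZE) are invented for exposition and are assumptions, not the thesis's actual data structures.

```java
// Illustrative sketch of the PageFile layout described above; class and
// field names are assumptions, not the thesis's API.
import java.util.ArrayList;
import java.util.List;

/** One fixed-size page; every value in a page belongs to a single column. */
class Page {
    static final int PAGE_SIZE = 64 * 1024;   // fixed page size in bytes (assumed)
    final int sequenceNumber;                  // usable as a pointer by index structures
    final int columnId;                        // the single column stored in this page
    final byte[] data = new byte[PAGE_SIZE];   // packed column values

    Page(int sequenceNumber, int columnId) {
        this.sequenceNumber = sequenceNumber;
        this.columnId = columnId;
    }
}

/**
 * One horizontal partition of a table, stored as a single HDFS file:
 * "first horizontally, then vertically" -- rows are split into partitions,
 * and inside a partition each column occupies its own run of pages.
 */
class PageFilePartition {
    final List<List<Page>> columnPages;  // columnPages.get(c) = pages of column c

    PageFilePartition(int numColumns) {
        columnPages = new ArrayList<>();
        for (int c = 0; c < numColumns; c++) {
            columnPages.add(new ArrayList<>());
        }
    }

    /** Random access: a page is located directly by its sequence number,
     *  which is what lets an index point at data without a full scan. */
    Page pageAt(int columnId, int seqNo) {
        return columnPages.get(columnId).get(seqNo);
    }
}
```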
(2) Index technologies help skip useless data and locate the required items directly, saving scan time and reducing disk I/O so as to speed up query processing. To date there have been only two studies of index technologies on Hadoop, both of which require join information among tables to be acquired beforehand and work only within a limited scope of data. This thesis takes the single-machine RB+-tree and ranged hash bucket structures, combines them with the characteristics of PageFile's distributed storage structure, and proposes the concepts of "multiple RB+-tree index" and "multiple ranged hash index", implementing creation and query algorithms for both together with a theoretical performance evaluation. First, a "multiple index" allows an arbitrary column to own a number of RB+-tree files or ranged hash files. Second, these files can be created and queried in parallel, matching Hadoop's execution model. Finally, the index files are independent of one another, so newly added data arrives with its own index file and leaves existing ones untouched, eliminating index maintenance cost. Moreover, the multiple RB+-tree index extends the index scope to the file level, so unnecessary files can be skipped entirely at query time. Using an index does, however, require additional MapReduce jobs. According to the algorithmic evaluation, the performance of a multiple index depends on the number of index files scanned and the bytes read from each file, with the column's selection rate as the key factor. Experimental results show that the multiple RB+-tree index performs better when the selection rate is below 5% and on equality queries, while the multiple ranged hash index performs better for selection rates between 5% and 30%.
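The sketch below illustrates the "multiple index" idea in Java: one column owns several independent index files, appends only register a new file, and lookups fan out over all files in parallel. The IndexFile interface and all names are assumptions; only the selectivity thresholds in chooseIndex come from the experimental results quoted above.

```java
// Hedged sketch of the "multiple index" concept; names and API are assumed.
import java.util.ArrayList;
import java.util.List;

interface IndexFile {
    /** Returns page sequence numbers whose values fall in [lo, hi]. */
    List<Integer> lookupRange(long lo, long hi);
}

class MultipleIndex {
    private final List<IndexFile> indexFiles = new ArrayList<>();

    /** Appending new data just registers its own index file; existing
     *  files are never rewritten, so maintenance cost is avoided. */
    void addIndexFile(IndexFile f) {
        indexFiles.add(f);
    }

    /** The query fans out over all index files; on Hadoop each lookup can
     *  run as an independent parallel task. Results are page pointers. */
    List<Integer> query(long lo, long hi) {
        List<Integer> pages = new ArrayList<>();
        for (IndexFile f : indexFiles) {          // parallelizable per file
            pages.addAll(f.lookupRange(lo, hi));
        }
        return pages;
    }

    /** Index choice by selection rate, mirroring the thresholds reported
     *  in the abstract: RB+-tree below ~5% (and for equality queries),
     *  ranged hash for 5-30%, full scan otherwise. */
    static String chooseIndex(double selectionRate, boolean equality) {
        if (equality || selectionRate < 0.05) return "multiple RB+-tree index";
        if (selectionRate <= 0.30)            return "multiple ranged hash index";
        return "full scan";  // extra MapReduce jobs no longer pay off
    }
}
```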
(3) Query optimization is an important means of improving data processing rates, and multi-table join queries are among the most common workloads. A multi-table join on Hadoop usually requires either a chain of MapReduce jobs with heavy redundant I/O cost or a single MapReduce job with heavy network transfer cost, and the intermediate results it produces are deleted as soon as the query finishes, so they cannot be reused later. The goal of query optimization is therefore to generate an execution plan that minimizes the costs of repeated I/O, invalid data and network transfer. Each query can be converted into a join graph, whose vertex and edge values classify the join type as star-join, chain-join or hybrid-join (see the classification sketch below). This thesis proposes an Adaptive Multi-Join Optimization (AMJO) strategy with two components. First, AMJO selects suitable join algorithms for each join type, decomposes a complex hybrid-join into simple star-joins and chain-joins, and builds a cost model to determine the best execution order, producing an optimized execution plan. Second, it reuses intermediate results across similar queries under an optimized reuse strategy, which simplifies complex join types and reduces the number of MapReduce jobs. Experimental results show that the execution plans generated by AMJO outperform comparable products on Hadoop, that the index techniques contribute strongly at low selection rates, and that reusing intermediate results clearly saves query time and improves overall query performance.

(4) Data compression is a key database technology for cutting disk usage and accelerating queries, and it is particularly important for compressing big data in distributed environments. The standard approach on Hadoop is direct and simple: compress an entire file with a heavyweight algorithm. Although this yields an excellent compression ratio, queries cannot be executed directly on the compressed data, and decompression is not fast enough. In practice, each compression method suits certain types of data, and even data within the same column should be compressed with a method chosen according to the distribution of its attribute values. This thesis surveys current lightweight compression methods and the state of compression technology on Hadoop, and proposes a Heuristic Data Compression strategy based on Extent (HDCE). HDCE takes the "extent" in the PageFile structure as the compression granularity, chooses a suitable method according to the data type and value distribution within each extent, and can evaluate queries directly on compressed data without decompression. For the storage of compressed data, the thesis presents the on-disk format and the construction of the HDCE decision tree during the data loading phase. For querying compressed data, it describes algorithms for handling filtering, select and join attributes: filtering attributes are computed and tested directly in compressed form, while select and join attributes are decompressed on demand. Experimental results show that HDCE's compression ratio is not as good as some heavyweight compression methods on Hadoop, but its query performance is much better than theirs.
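The join-graph classification used in contribution (3) might look roughly as follows. The degree-based rules here are one plausible reading of "by the values of vertices and edges", not the thesis's exact definition; vertices are tables and edges are join predicates.

```java
// Sketch of classifying a join graph into star-, chain-, or hybrid-join.
// The degree-based rules are an assumption for illustration only.
import java.util.HashMap;
import java.util.Map;

class JoinGraph {
    private final Map<String, Integer> degree = new HashMap<>();
    private int edges = 0;

    /** Register one join predicate between two tables. */
    void addJoin(String tableA, String tableB) {
        degree.merge(tableA, 1, Integer::sum);
        degree.merge(tableB, 1, Integer::sum);
        edges++;
    }

    String classify() {
        int n = degree.size();
        int maxDeg = degree.values().stream().max(Integer::compare).orElse(0);
        // A tree on n vertices has n-1 edges; one hub joined to every other
        // table forms a star, while a path (max degree 2) forms a chain.
        if (edges == n - 1 && maxDeg == n - 1 && n > 2) return "star-join";
        if (edges == n - 1 && maxDeg <= 2)              return "chain-join";
        return "hybrid-join";  // AMJO decomposes this into star and chain parts
    }
}
```

Under these rules, a fact table joined to several dimension tables would be recognized as a star-join, and a customer-orders-lineitem style pipeline as a chain-join; anything mixing the two falls to hybrid-join for decomposition.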
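Likewise, HDCE's per-extent decision in contribution (4) could be sketched as below. The method roster (run-length, dictionary, delta) and the numeric thresholds are illustrative assumptions, chosen only to show how data type and value distribution within one extent can drive the choice.

```java
// Hedged sketch of a per-extent compression decision in the spirit of HDCE;
// the methods and thresholds are assumptions, not the thesis's decision tree.
import java.util.Arrays;

enum Method { RUN_LENGTH, DICTIONARY, DELTA, NONE }

class ExtentCompressor {
    /** Choose a lightweight method for one extent of a numeric column. */
    static Method choose(long[] extentValues) {
        int n = extentValues.length;
        long distinct = Arrays.stream(extentValues).distinct().count();
        boolean sorted = true;
        int runs = 1;
        for (int i = 1; i < n; i++) {
            if (extentValues[i] != extentValues[i - 1]) runs++;
            if (extentValues[i] <  extentValues[i - 1]) sorted = false;
        }
        if (runs <= n / 4)      return Method.RUN_LENGTH;  // long value runs
        if (distinct <= n / 10) return Method.DICTIONARY;  // low cardinality
        if (sorted)             return Method.DELTA;       // small ordered gaps
        return Method.NONE;  // leave the extent uncompressed
    }
}
```

With dictionary or run-length encoding, an equality or range filter can be evaluated on the encoded values themselves, which is what allows filtering attributes to be processed without decompression as described above.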
Keywords/Search Tags: Hadoop, Hybrid row-column storage, PageFile, Multiple index technology, Adaptive Multi-Join Optimization strategy, Heuristic Data Compression strategy based on Extent