
Storage And Querying Optimization For Large Scale Structured Data

Posted on: 2017-03-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Xu
Full Text: PDF
GTID: 1318330566955847
Subject: Computer Science and Technology
Abstract/Summary:
With structured datasets now exceeding the petabyte (PB) level, supporting interactive queries at this scale challenges both relational databases and big data management systems. SQL on Hadoop systems integrate Hadoop with a SQL engine: they use HDFS to manage and organize data in the storage layer, and they provide users with a transparent query interface and a database view in the application layer. They are therefore an important approach to storing and querying large-scale structured data. However, Hadoop was designed for offline batch processing of unstructured and semi-structured data. To execute real-time queries on structured data, both the storage organization and the query analysis mechanism need targeted optimization. Addressing the key technologies of storage and query optimization for SQL on Hadoop, this dissertation focuses on the following four tasks:

1. Data distribution in HDFS. We propose DPA2, a data partitioning mechanism based on affinity analysis. From the way the queries in a query set operate on partitioned sub-tables, DPA2 builds a unified affinity analysis model, generates a relation matrix, derives an optimized data partitioning scheme through matrix transformation and computation, and passes the resulting partitioning logic to the data block placement policy in the storage layer. The evaluation results show that DPA2 significantly improves query performance compared with other methods.

2. File storage organization and structure. We propose KCGS-Store, a columnar storage structure based on group sorting of key columns. Through two core processes, pool partition of the relational table and pool recombination of key columns, KCGS-Store achieves group sorting on multiple key columns and effectively reduces the amount of data read. It reorganizes records using an index of pool numbers, at a small cost in time and storage space. The evaluation results show that KCGS-Store outperforms ORCFile and Parquet in storage space, data loading, and SQL querying.

3. The parallel query engine. We design Thump Query, a parallel query system with a distributed architecture. Its core idea is a two-stage query planning strategy. By adjusting the order in which tasks run and the paths along which data is forwarded, Thump Query reduces the volume of intermediate results and the amount of data transferred during the shuffle process. The evaluation results indicate that the two-stage planning strategy effectively reduces query cost and the network transmission pressure generated by the shuffle process, while improving the concurrency efficiency of the system.

4. Query optimization for heterogeneous datasets. We propose LMCQ, a lightweight multi-source collaborative querying system. LMCQ constructs a domain graph for each query command, with the source system as the unit, uses a cost model to compute the execution plan with minimum cost, and dynamically optimizes the operation method during execution to reduce both the waiting time before subtasks start and the amount of data transmitted. The evaluation results indicate that, compared with other collaborative query mechanisms, LMCQ delivers good query performance and is easy to use. The optimization ratio also rises significantly as data volume and query complexity increase.
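
As a rough illustration of the affinity-analysis idea behind the first task, the sketch below counts how often pairs of sub-tables are accessed by the same query and then groups sub-tables that are frequently accessed together. It is not the DPA2 algorithm: the abstract does not specify the affinity model or the matrix transformation, so the co-access counting, the greedy grouping, the function names `affinity_matrix` and `group_by_affinity`, and the sample query set are all assumptions made for illustration.

```python
from itertools import combinations

def affinity_matrix(queries, n_parts):
    """A[i][j] = number of queries that touch both sub-table i and sub-table j.

    `queries` is a list of lists of sub-table indices (a hypothetical
    representation of which sub-tables each query reads).
    """
    A = [[0] * n_parts for _ in range(n_parts)]
    for touched in queries:
        parts = sorted(set(touched))
        for i in parts:
            A[i][i] += 1          # diagonal: how often sub-table i is used at all
        for i, j in combinations(parts, 2):
            A[i][j] += 1          # off-diagonal: co-access count (symmetric)
            A[j][i] += 1
    return A

def group_by_affinity(A, group_size):
    """Greedy grouping (an assumed heuristic, not the dissertation's method).

    Seed each group with the lowest-numbered unassigned sub-table, then keep
    adding the unassigned sub-table with the highest total affinity to the
    current group members until the group is full.
    """
    unassigned = set(range(len(A)))
    groups = []
    while unassigned:
        seed = min(unassigned)
        unassigned.remove(seed)
        group = [seed]
        while len(group) < group_size and unassigned:
            best = max(sorted(unassigned),
                       key=lambda c: sum(A[c][g] for g in group))
            unassigned.remove(best)
            group.append(best)
        groups.append(group)
    return groups

# Hypothetical query set over four sub-tables (indices 0..3).
queries = [[0, 2], [0, 2], [1, 3], [0, 2, 3], [1, 3]]
A = affinity_matrix(queries, 4)
groups = group_by_affinity(A, 2)
print(groups)
```

In a storage layer, each resulting group could be mapped to a set of co-located data blocks, so that queries touching sub-tables with high mutual affinity read from fewer nodes. How DPA2 actually maps its partitioning scheme to the HDFS block distribution policy is not described in the abstract.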
Keywords/Search Tags: SQL on Hadoop, big data, structured data, storage management, parallel query