Storage And Querying Optimization For Large Scale Structured Data

Posted on:2017-03-27

Degree:Doctor

Type:Dissertation

Country:China

Candidate:T Xu

Full Text:PDF

GTID:1318330566955847

Subject:Computer Science and Technology

Abstract/Summary:

As structured datasets grow beyond the petabyte (PB) level, supporting interactive queries on them challenges both relational databases and big data management systems. A SQL-on-Hadoop system integrates Hadoop with a SQL engine: it uses HDFS to organize and manage data in the storage layer and gives users a transparent query interface and database view in the application layer. It is therefore an important approach to storing and querying large-scale structured data. However, Hadoop was designed for offline batch processing of unstructured and semi-structured data. To execute real-time queries on structured data, both the storage organization and the query analysis mechanism need targeted optimization. This dissertation addresses four key technologies for storage and query optimization in SQL on Hadoop:

1. Data distribution optimization for HDFS. We propose DPA2, a data partitioning mechanism based on affinity analysis. According to the operational relationships among partitioned sub-tables across the query set, DPA2 builds a unified affinity analysis model, generates a relation matrix, derives an optimized data partitioning scheme through matrix transformation and computation, and passes the partitioning logic to the data block distribution policy in the storage layer. The evaluation results show that DPA2 significantly improves query performance compared with other methods.

2. Optimization of file storage organization and structure. We propose KCGS-Store, a columnar storage structure based on group sorting of key columns. Through two core processes, pool partitioning of the relational table and pool recombination of key columns, KCGS-Store achieves group sorting on multiple key columns, effectively reduces the amount of data read, and completes record reorganization using a pool-number index at a small cost in time and storage space. The evaluation results show that KCGS-Store is superior to ORCFile and Parquet in several respects, including storage space, data loading, and SQL querying.

3. Optimization of the parallel query engine. We design Thump Query, a parallel query system with a distributed architecture. Its core idea is a two-stage query planning strategy. By adjusting the order in which tasks run and the paths along which data is forwarded, Thump Query reduces the volume of intermediate results and the amount of data transferred during the shuffle process. The evaluation results indicate that the two-stage planning strategy effectively reduces query cost and the network transmission pressure generated by the shuffle process while improving the concurrency efficiency of the system.

4. Query optimization for heterogeneous datasets. We propose LMCQ, a lightweight multi-source collaborative querying system. LMCQ constructs a domain graph for each query command with the source system as the unit, uses a cost model to compute the execution plan with minimum cost, and dynamically optimizes the operation method during execution to reduce both the waiting time before subtasks start and the amount of data transferred. The evaluation results indicate that LMCQ offers good query performance and is easier to use than other collaborative query mechanisms. The optimization ratio also rises significantly as data volume and query complexity increase.
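To make the affinity analysis idea behind DPA2 more concrete, the following is a minimal sketch under stated assumptions, not the dissertation's algorithm. The abstract does not specify the affinity model or the matrix computation, so this sketch assumes that affinity is simply the number of queries that touch two sub-tables together, and it swaps in a plain threshold-based grouping (union-find) for the unspecified partitioning step. The function names, the sub-table names, and the sample workload are all hypothetical.

```python
from itertools import combinations


def affinity_matrix(queries, tables):
    """Count, for each pair of sub-tables, how many queries access both.

    Assumption: co-access count stands in for the affinity measure,
    which the abstract does not define.
    """
    index = {t: i for i, t in enumerate(tables)}
    n = len(tables)
    matrix = [[0] * n for _ in range(n)]
    for accessed in queries:
        for a, b in combinations(sorted(set(accessed)), 2):
            i, j = index[a], index[b]
            matrix[i][j] += 1
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


def greedy_groups(matrix, tables, threshold):
    """Group sub-tables whose pairwise affinity reaches the threshold.

    Assumption: a simple union-find merge, used here only for
    illustration in place of the unspecified matrix computation.
    """
    parent = list(range(len(tables)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i in range(len(tables)):
        for j in range(i + 1, len(tables)):
            if matrix[i][j] >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, t in enumerate(tables):
        groups.setdefault(find(i), []).append(t)
    return sorted(groups.values())


# Hypothetical workload: each query lists the sub-tables it reads.
tables = ["orders_p0", "orders_p1", "items_p0", "items_p1"]
queries = [
    ["orders_p0", "items_p0"],
    ["orders_p0", "items_p0"],
    ["orders_p1", "items_p1"],
    ["orders_p1", "items_p1", "orders_p0"],
]

m = affinity_matrix(queries, tables)
groups = greedy_groups(m, tables, threshold=2)
print(groups)
```

In this toy workload, sub-tables that are frequently read together end up in the same group, which is the kind of signal a block distribution policy could use to co-locate them. A real partitioning scheme would also need to account for block sizes and load balance, which this sketch ignores.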

Keywords/Search Tags:

SQL on Hadoop, big data, structured data, storage management, parallel query

Related items

1	Research And Implementation Of Compression For Structured Data On Hadoop Platform
2	A Research Of Distributed Storage And Parallel Query Of Spatial Data Based On Hadoop Platform
3	Hadoop-based Geospatial Data Storage And Query Technology
4	Research On The Solution Of Hybrid Storage Based On Hadoop
5	Implemention Of The Massive Telecom Data Distributed Storage And Query System Based On Hadoop
6	Research And Application Of Big Data Migration And Query Based-on Hadoop Platform
7	Research And Implementation Of Xml Data Management System
8	Techniques Of Partition And Query In Data Warehouses Based On Hadoop
9	Research And Implementation Of Non Structured Data Management In Discrete Manufacturing Industry Based On Hadoop
10	Application And Research On Data Storage Of Rail Transit Maintenance Support System Based On Hadoop