An Ad-hoc Query Engine Based On Spark SQL

Posted on: 2019-06-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y Cui
Full Text: PDF
GTID: 2428330590992446
Subject: Software engineering
Abstract/Summary:
Apache Spark SQL enables developers and data analysts to query structured data in Spark programs using standard SQL, bringing the convenience of the relational model to analysts and data scientists. In addition, its internal distributed computing model, the RDD, improves query performance on large-scale datasets. However, Apache Spark is not designed for long-running services, so its built-in DataSource loads data from the underlying storage system on every table scan. Users can explicitly issue a cache command to keep data in memory, but this cache is coarse-grained: the minimum caching unit is the whole table. Furthermore, the cached data is lost once the application restarts or shuts down. Columnar file formats such as Apache Parquet and ORC can accelerate query execution. Nevertheless, query performance on large-scale datasets using Spark remains unsatisfactory. To tackle this problem, we present a file format along with an index structure as a pluggable component of Apache Spark SQL. It enables users to create indexes on tables to accelerate query execution, which meets the demands of ad-hoc queries. It also supports a fine-grained cache, which can flexibly keep "hot data" in memory and evict "cold data" from memory. Furthermore, the ad-hoc query engine provides a compatibility layer that allows users to create and use indexes directly on the Parquet file format. The customized file format, the index structure and the fine-grained cache mechanism together make Spark SQL usable as a long-running service that satisfies users' demands. Compared with the original Spark SQL, it brings an all-around performance boost, and its customized services and ease of use make it more user-friendly.
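To make the idea of a fine-grained cache concrete, the following is a minimal sketch in plain Python, not the thesis's implementation. It assumes the caching unit is a single column chunk identified by a (file, row group, column) key and that cold data is chosen by least-recently-used order; the class name `ChunkCache`, the `loader` callback and the byte-based capacity are all hypothetical choices made for illustration.

```python
from collections import OrderedDict


class ChunkCache:
    """Hypothetical LRU cache of column chunks (illustration only).

    Keys are (file, row_group, column) tuples, so one hot column of one
    row group can stay in memory without caching the whole table.
    """

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._chunks = OrderedDict()  # key -> bytes, coldest entry first

    def get(self, key, loader):
        """Return the chunk for key, calling loader(key) on a miss."""
        if key in self._chunks:
            self._chunks.move_to_end(key)  # a hit marks the chunk as hot
            return self._chunks[key]
        data = loader(key)  # assumed to read the chunk from storage
        self._put(key, data)
        return data

    def _put(self, key, data):
        if len(data) > self.capacity_bytes:
            return  # a chunk larger than the cache is never stored
        self._chunks[key] = data
        self.used_bytes += len(data)
        # Evict the coldest chunks until the cache fits its budget again.
        while self.used_bytes > self.capacity_bytes:
            _, evicted = self._chunks.popitem(last=False)
            self.used_bytes -= len(evicted)

    def keys(self):
        """Cached keys ordered from coldest to hottest."""
        return list(self._chunks)
```

In this sketch a scan would call `cache.get(("part-0.data", 0, "price"), read_chunk)`, where the file name and `read_chunk` are placeholders, so repeated queries over the same column reuse the in-memory chunk while rarely touched chunks are evicted first.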
Keywords/Search Tags: Big data, Distributed System, Database, Query optimization, Index, Spark SQL