Research On Range Query Processing Algorithm For Big Scientific Data

Posted on:2024-04-12

Degree:Doctor

Type:Dissertation

Country:China

Candidate:S Han

Full Text:PDF

GTID:1528307376982419

Subject:Computer software and theory

Abstract/Summary:

The rapid development of information technology and data acquisition technology has given birth to the era of big data, and big scientific data is an important part of it. Scientific data mainly comprises experimental data, positioning data, observation data, and simulation data. It has important applications in many fields, is a significant basis for many decision support systems, and is widely used in scientific statistical analysis. As a preprocessing step for such analysis, range queries supply the underlying data, so efficient range query processing is particularly important for the statistical analysis of scientific data. However, the characteristics of scientific data, such as large volume, multiple dimensions, and sparsity, pose great challenges to range query processing, and existing work cannot solve the problem comprehensively and efficiently. This dissertation studies efficient range query processing algorithms based on query optimization strategies and compression technologies, from the perspectives of time complexity and space complexity respectively. The main research contents and contributions are summarized as follows.

First, the range query processing problem based on chunk-oriented dimension ordering is studied. The dimension order can significantly influence range query performance. Given a range query workload, previous work finds a single globally optimal dimension order that minimizes query processing cost. However, the data distribution and range query patterns often differ from one part of the data to another, so a fine-grained, chunk-oriented optimization of the dimension order can bring further performance improvement. From the perspective of using dimension order to optimize query processing performance, this dissertation first proposes the chunk-oriented dimension ordering problem and gives its formal definition. It then proposes a multi-dimensional array storage method based on two-layer linearization, together with a multi-dimensional range query processing algorithm, and designs a workload-driven chunk-oriented dimension ordering algorithm. To deal with dynamic workloads, a dynamic chunk-oriented dimension reordering algorithm is proposed, which tracks workload trends over time and adjusts the dimension order dynamically to prevent query performance from declining. Experiments on both real-life and synthetic datasets show the effectiveness and efficiency of the range query processing algorithm based on chunk-oriented dimension ordering.

Second, the range query processing problem based on multi-dimensional range filters is studied. The lower bound on the time complexity of traditional range query processing algorithms is Ω(n), because in the worst case they need to read the whole dataset. When n is large, an Ω(n) cost is too expensive. Range filters can be used to avoid reading useless data and thereby reduce the processing cost of range queries. However, existing range filters are all designed for one-dimensional space and are not suitable for multi-dimensional scientific data. From the perspective of designing multi-dimensional range filters to accelerate range query processing, this dissertation proposes a range-filter-based sublinear-time range query processing algorithm. First, a multi-dimensional range filter, Mu1RF, is proposed to filter multi-dimensional range queries. Then, based on Mu1RF, an efficient multi-dimensional range query processing algorithm is proposed, which can obtain sublinear query processing time without accessing the input dataset. Experiments on both real-life and synthetic datasets verify that Mu1RF obtains a smaller false positive rate and less filtering time, and that the proposed range query processing algorithm achieves efficient query performance.

Third, the problem of partitioning multi-dimensional arrays to optimize compression performance while supporting range query processing is studied. The partitioning strategy of a multi-dimensional array greatly affects the compression performance of scientific data, and in turn the range query performance. However, according to the author, the impact of array partitioning on compression performance has not been studied before. This dissertation studies range query processing based on compression and array partitioning, which not only minimizes the space overhead of data storage but also helps improve range query performance by reading less data. It first gives a formal definition of the multi-dimensional array partitioning problem for optimizing compression performance, which is an NP-hard problem. To solve this problem, two heuristic array partitioning algorithms are designed, one for the case of dimension independence and one for the case of dimension dependence. Finally, a range query processing algorithm based on compression and array partitioning is proposed. Experiments on both real-life and synthetic datasets show that the proposed array partitioning algorithms obtain a smaller compression ratio, and that the range query processing algorithm based on compression and array partitioning also obtains efficient range query performance.

Fourth, the problem of compressing multi-dimensional arrays based on combination numbers while supporting range query processing is studied. At present, no compression algorithm provides a quantitative performance guarantee for the lossless compression of scientific data. The compression performance of existing algorithms strongly depends on the nature of the dataset itself, such as the data distribution and the data independence. In the worst case, the space occupied by the compressed data may even exceed that of the original dataset. Therefore, this dissertation studies a multi-dimensional data compression algorithm that is independent of the data distribution and provides quantitative guarantees on compression performance. The compression algorithm uses the tuple numbers and the combination numbers to compress data. This dissertation gives a formal description of the compression and decompression algorithms, and then proposes a range query processing algorithm based on the compression algorithm. Experiments on both real-life and synthetic datasets show that, although the combination-number-based compression algorithm has a longer compression time, it has a smaller compression ratio, and the proposed range query processing algorithm also achieves efficient range query processing.
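The abstract does not spell out how the combination numbers are used. As a rough illustration of the general idea only, the sketch below applies a classical technique from the same family, enumerative coding with the combinatorial number system, to the set of non-empty cells of a sparse array. It is not the dissertation's algorithm: all function names, the row-major linearization, and the choice to encode only cell positions (not cell values) are assumptions made for this example. What it does show is a size bound that is independent of the data distribution: k occupied cells out of n are always encoded as one integer below C(n, k).

```python
from math import comb


def linearize(coord, shape):
    """Row-major index of a coordinate (assumed layout, for illustration)."""
    idx = 0
    for c, s in zip(coord, shape):
        idx = idx * s + c
    return idx


def delinearize(idx, shape):
    """Inverse of linearize."""
    coord = []
    for s in reversed(shape):
        idx, c = divmod(idx, s)
        coord.append(c)
    return tuple(reversed(coord))


def compress_cells(coords, shape):
    """Encode the set of occupied cells as (k, rank), with 0 <= rank < C(n, k)."""
    positions = sorted(linearize(c, shape) for c in coords)
    # Combinatorial number system: rank = sum of C(p_i, i) for i = 1..k.
    rank = sum(comb(p, i + 1) for i, p in enumerate(positions))
    return len(positions), rank


def decompress_cells(k, rank, shape):
    """Recover the occupied cells, in row-major order, from (k, rank)."""
    positions = []
    for i in range(k, 0, -1):
        # Greedy step: find the largest p with C(p, i) <= remaining rank.
        p = i - 1
        while comb(p + 1, i) <= rank:
            p += 1
        positions.append(p)
        rank -= comb(p, i)
    return [delinearize(p, shape) for p in reversed(positions)]


def encoded_bits(k, shape):
    """Bits that suffice to store any rank for k cells in this shape."""
    n = 1
    for s in shape:
        n *= s
    return max(1, (comb(n, k) - 1).bit_length())


def range_query(k, rank, shape, lo, hi):
    """Occupied cells inside the inclusive box [lo, hi]; decodes everything first."""
    return [c for c in decompress_cells(k, rank, shape)
            if all(l <= x <= h for x, l, h in zip(c, lo, hi))]


shape = (4, 4)
cells = [(2, 3), (0, 1), (3, 0)]
k, rank = compress_cells(cells, shape)
print(k, rank, encoded_bits(k, shape))
print(range_query(k, rank, shape, lo=(2, 0), hi=(3, 3)))
```

In this sketch the range query simply decodes every position and then filters, so it says nothing about how a real system would avoid full decompression; it only demonstrates that the encoding is lossless and that its size depends on n and k alone.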

Keywords/Search Tags:

scientific data, range query, dimension order, multi-dimensional range filter, compression algorithm

Related items

1	Algorithm Design For One-dimensional And Spatial Encrypted Data Range Query
2	Study On Indexing And Range Query Processing Techniques For Uncertain Data
3	Algorithm Research On Geometric Range Query
4	P2P Based Research On The Hyperrectangle-Range Query Of High-Dimensional Data
5	Research And Implementation Of Range Query Algorithm Based On Uncertain Data
6	A Two-dimensional Index Structure Based P2P Query Of Multi-dimensional Data
7	Precision Tracking Radar Range Loop Performance Analysis
8	Efficient Update Methods For Multi-dimension Metadata Indexing In Storage Systems
9	Research On Deceptive Jamming Technology Against Imaging Radar In Range Dimension
10	Data Integrity Verification Technology Research And Implementation Of Range Query In Location-based Service