The Research Of High-Dimensional Mass Data Storage Structure And Scheduling Methods For Top-K Query Algorithms

Posted on:2013-02-05

Degree:Master

Type:Thesis

Country:China

Candidate:Q Z Li

Full Text:PDF

GTID:2248330395973275

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

This dissertation studies the top-k query optimization algorithm applied to massive high-dimensional data with suitable data storage and scheduling method. Top-k query alogrithm calculates scores of records in the massive dataset by a given function and returns k results with the highest total score. Top-k query algorithms can be divided into two categories, accurate and approximated top-k query algorithms, considering the accuracy of the query results. In many cases, top-k algorithms do not need to visit all records to obtain query results, so they are quite efficient.Firstly, the existing query optimization techniques for massive data are analyzed, and then the series of top-k algorithms are highlighted. Two accurate top-k query algorithms, TA algorithm and the top-k query algorithm on dominant graph, are implemented in this dissertation. After that, the approximated TA algorithm and TABE algorithm are also introduced. By analyzing the pros and cons of these top-k algorithms, we propose a memory-based FTDT (Fast Threshold Decline Top-k Algorithm) algorithm. It can improve inquiry performance by reducing the updating times and comparing times of threshold value. From experiments, FTDT is better than TA when the dimension of the record is greater than5.Secondly, the dissertation analyzes top-k query algorithm on dominant graph, its layering strategy, data storage structure, and scheduling strategy. And then a new accurate top-k query algorithm-NSDL (Non Strict Dominant Layers) algorithm is proposed. NSDL utilizes the concept of likelihood-maximum-layer to stratify the data set, and organizes records of each layer effectively. In order to improve the top-k query efficiency, the auxiliary table (AT) is used to optimize the schedule of I/O and reduce the I/O cost. Here, FTDT algorithm is used by NSDL when querying data in the first layer.At last, a quicker approximate top-k query algorithm-ANSDL (Approximated Non Strict Dominant Layers) is proposed by only considering the data of first layer in NSDL, without decreasing the accuracy of query obviously.The experiments show that NSDL reduces I/O overhead significantly comparing to DG. And ANSDL further improves the query efficiency comparing to NSDL without reducing the query accuracy obviously.

Keywords/Search Tags:

high-dimensional data storage, scheduling strategy, query optimizationtop-k query

PDF Full Text Request

Related items

1	Study On Indexing And Query Processing Techniques For High Dimensional Data
2	Research Of High-Dimensional Space Query Algorithm Based On Space-Filling Curves
3	Research On Complex Query Processing In Spatiotemporal Databases
4	Studies On Efficient Query Scheduling And Data Acquiring Techniques Of WEB Data
5	Design And Realization Of Optimized Query Strategy About Multi-Tenant Saas Based Application
6	Hadoop-based Geospatial Data Storage And Query Technology
7	The Design And Optimization Of XML Data Query Method Based On The SF&B Compressing Storage Structure
8	Research And Application Of High Dimensional Query Technology
9	Storage Optimized Model Based RDF Data Query Mechanism
10	Research On Machine Learning Enhanced Query Techniques