Font Size: a A A

The Research Of High-Dimensional Mass Data Storage Structure And Scheduling Methods For Top-K Query Algorithms

Posted on:2013-02-05Degree:MasterType:Thesis
Country:ChinaCandidate:Q Z LiFull Text:PDF
GTID:2248330395973275Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
This dissertation studies the top-k query optimization algorithm applied to massive high-dimensional data with suitable data storage and scheduling method. Top-k query alogrithm calculates scores of records in the massive dataset by a given function and returns k results with the highest total score. Top-k query algorithms can be divided into two categories, accurate and approximated top-k query algorithms, considering the accuracy of the query results. In many cases, top-k algorithms do not need to visit all records to obtain query results, so they are quite efficient.Firstly, the existing query optimization techniques for massive data are analyzed, and then the series of top-k algorithms are highlighted. Two accurate top-k query algorithms, TA algorithm and the top-k query algorithm on dominant graph, are implemented in this dissertation. After that, the approximated TA algorithm and TABE algorithm are also introduced. By analyzing the pros and cons of these top-k algorithms, we propose a memory-based FTDT (Fast Threshold Decline Top-k Algorithm) algorithm. It can improve inquiry performance by reducing the updating times and comparing times of threshold value. From experiments, FTDT is better than TA when the dimension of the record is greater than5.Secondly, the dissertation analyzes top-k query algorithm on dominant graph, its layering strategy, data storage structure, and scheduling strategy. And then a new accurate top-k query algorithm-NSDL (Non Strict Dominant Layers) algorithm is proposed. NSDL utilizes the concept of likelihood-maximum-layer to stratify the data set, and organizes records of each layer effectively. In order to improve the top-k query efficiency, the auxiliary table (AT) is used to optimize the schedule of I/O and reduce the I/O cost. Here, FTDT algorithm is used by NSDL when querying data in the first layer.At last, a quicker approximate top-k query algorithm-ANSDL (Approximated Non Strict Dominant Layers) is proposed by only considering the data of first layer in NSDL, without decreasing the accuracy of query obviously.The experiments show that NSDL reduces I/O overhead significantly comparing to DG. And ANSDL further improves the query efficiency comparing to NSDL without reducing the query accuracy obviously.
Keywords/Search Tags:high-dimensional data storage, scheduling strategy, query optimizationtop-k query
PDF Full Text Request
Related items