Font Size: a A A

Research On Top-k Computation In Distributed Environments

Posted on:2017-11-01Degree:MasterType:Thesis
Country:ChinaCandidate:H LuoFull Text:PDF
GTID:2348330491462604Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Top-k computation, as a preference query, is a basic operation in database, which aims at finding the information that users are interested in from a given data set. As an important tool for data analysis, top-k computation is widely applied in various fields, such as web search, e-commerce, data mining, multi-standard decision support systems, etc. With the advent of the big data era, the traditional top-k processing technology has encountered unprecedented challenges, and has been unable to meet the needs of big data analysis. There are three main challenges of top-k computation must be considered under the new environment. Firstly, when the data size reaches TB or PB level, the traditional stand-alone approach is no longer applicable, thus a distributed and parallel computing framework should be considered. Secondly, when facing massive data in a distributed environment, what kind of data partitioning method can enhance the parallel performance and speed of the query? Thirdly, the traditional top-k query requires of a scoring function, while choosing a suitable scoring function is not an easy task.Therefore, In this thesis, the key technologies of data partitioning and parallel algorithms of top-k computation were explored in a distributeu environment. The main contributions are as follows:(1) In a distributed computing framework, a grid-like data partitioning method (GDP) is proposed for weighted top-k query, which divides the original data set into different child data sets and queries on some certain data sets instead of all based on the user's preferences. And for the "empty space" phenomenon in the high dimension, a grid-like and hyperplane data partitioning method (GHDP) is proposed. Compared with the way of data splitting based on angle and the hyperplane, the GDP and GHDP method is simpler as no complex pretreatment coordinate conversion is needed and GHDP is still applicable for high dimensional data set. The experimental results show that the query speed of the GDP and GHDP method is faster than the angle partition method by close to 15%, and have a good scalability. In addition, when the data set has higher dimensions (e.g.d>10), the GHDP get the more accurate query result than the angle partition method.(2) The traditional top-k query requires the user to give a scoring function, while some users are unable to give a reasonable one. To deal with the problem, we studied metric-based top-k dominating query in the distributed environment. Five parallel algorithms have been proposed based on existing single algorithm. Algorithm 1 uses the conclusion that the skyline collection contains the result of top-1 dominating query to accelerate the processing speed. Meanwhile each partition computes skyline in parallelize and the skyline do not need to be k-repeated calculated depending on the dominating relationship of candidate set, so as to speed up the query. Algorithm 2 uses k-skyband to solve the problem of top-k dominating query. Each partition computes k-skyband in parallelize without k-cycle. Based on algorithm 1, algorithm 3 uses ANN to filter the original data and then calculates the skyline. Algorithm 4 is a kind of pruning method based on ANN and k-skyband, which uses ANN to filter unnecessary data set firstly and then computes k-skyband, obtaining the results in the end. Algorithm 5 is a sort-based pruning method, which sorts the data set according to query input set Q and creates an index table which is read in the round-robin way. There is no need to traverse the original data set to calculate the dominating score for each candidate set, so algorithm 5 is fast. The experimental results show that the five parallel algorithms reduce the number of comparisons between data, improving the query efficiency and in most cases algorithm 4 is better.
Keywords/Search Tags:Spark, top-k data partitioning, dominance, big data, k-skyband
PDF Full Text Request
Related items