Research On Top-k Computation In Distributed Environments

Posted on:2017-11-01

Degree:Master

Type:Thesis

Country:China

Candidate:H Luo

Full Text:PDF

GTID:2348330491462604

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

Top-k computation, as a preference query, is a basic operation in database, which aims at finding the information that users are interested in from a given data set. As an important tool for data analysis, top-k computation is widely applied in various fields, such as web search, e-commerce, data mining, multi-standard decision support systems, etc. With the advent of the big data era, the traditional top-k processing technology has encountered unprecedented challenges, and has been unable to meet the needs of big data analysis. There are three main challenges of top-k computation must be considered under the new environment. Firstly, when the data size reaches TB or PB level, the traditional stand-alone approach is no longer applicable, thus a distributed and parallel computing framework should be considered. Secondly, when facing massive data in a distributed environment, what kind of data partitioning method can enhance the parallel performance and speed of the query? Thirdly, the traditional top-k query requires of a scoring function, while choosing a suitable scoring function is not an easy task.Therefore, In this thesis, the key technologies of data partitioning and parallel algorithms of top-k computation were explored in a distributeu environment. The main contributions are as follows:(1) In a distributed computing framework, a grid-like data partitioning method (GDP) is proposed for weighted top-k query, which divides the original data set into different child data sets and queries on some certain data sets instead of all based on the user's preferences. And for the "empty space" phenomenon in the high dimension, a grid-like and hyperplane data partitioning method (GHDP) is proposed. Compared with the way of data splitting based on angle and the hyperplane, the GDP and GHDP method is simpler as no complex pretreatment coordinate conversion is needed and GHDP is still applicable for high dimensional data set. The experimental results show that the query speed of the GDP and GHDP method is faster than the angle partition method by close to 15%, and have a good scalability. In addition, when the data set has higher dimensions (e.g.d>10), the GHDP get the more accurate query result than the angle partition method.(2) The traditional top-k query requires the user to give a scoring function, while some users are unable to give a reasonable one. To deal with the problem, we studied metric-based top-k dominating query in the distributed environment. Five parallel algorithms have been proposed based on existing single algorithm. Algorithm 1 uses the conclusion that the skyline collection contains the result of top-1 dominating query to accelerate the processing speed. Meanwhile each partition computes skyline in parallelize and the skyline do not need to be k-repeated calculated depending on the dominating relationship of candidate set, so as to speed up the query. Algorithm 2 uses k-skyband to solve the problem of top-k dominating query. Each partition computes k-skyband in parallelize without k-cycle. Based on algorithm 1, algorithm 3 uses ANN to filter the original data and then calculates the skyline. Algorithm 4 is a kind of pruning method based on ANN and k-skyband, which uses ANN to filter unnecessary data set firstly and then computes k-skyband, obtaining the results in the end. Algorithm 5 is a sort-based pruning method, which sorts the data set according to query input set Q and creates an index table which is read in the round-robin way. There is no need to traverse the original data set to calculate the dominating score for each candidate set, so algorithm 5 is fast. The experimental results show that the five parallel algorithms reduce the number of comparisons between data, improving the query efficiency and in most cases algorithm 4 is better.

Keywords/Search Tags:

Spark, top-k data partitioning, dominance, big data, k-skyband

PDF Full Text Request

Related items

1	A Key-Value Skew Model Based Dynamic Data Partitioning Algorithm In Spark
2	Research And Application Of MemSql Data Partitioning Strategy For Spark
3	Research On Data Partitioning And Placement Strategy Of Spark Streaming
4	Research On Balanced Graph Partitioning Based On Vertex Cut And Its Implementation On Spark
5	Reseach On K-Skyband Based Top-K Dominating Query Over Distributed Data Streams
6	Parallel Processing System Loop Partitioning And Data Distribution Technology Research
7	Design And Implementation Of Telecom 4G Big Data Platform For Network Optimization Based On Spark
8	Research And Application Of Granularity Clustering Algorithm For Mixed Attribute Data Under Dominance Relation
9	Study And Application Of Knowledge Reduction Algorithms Based On Indiscernibility Relation And Dominance Relation
10	The Research Of Big Data Manipulating Technology Based On Spark