Top-k Skyline Query Algorithm Based On Data Partition In Distributed Environment

Posted on: 2021-04-27; Degree: Master; Type: Thesis
Country: China; Candidate: M Y Yu; Full Text: PDF
GTID: 2428330602473925; Subject: Software engineering
Abstract/Summary:
With the development of the Internet, the era of big data has brought an exponential increase in the amount of data. How to select the data that matches a user's interests from many data sets has become a key research topic. In multi-objective decision-making, the skyline query has become a popular approach to this problem. However, the size of a skyline query's result set cannot be controlled: as the data volume and the number of dimensions increase, the result set grows as well, and it becomes difficult for the user to choose among the results. The top-k skyline query was introduced to address this. It combines the advantages of top-k and skyline queries: a scoring function returns the k data objects that best meet the user's needs, which keeps the result set at a suitable size and avoids the uncontrollable result size of plain skyline queries.

In the era of big data, existing top-k skyline query processing methods suffer from low efficiency and long response times, which makes it difficult for them to handle large-scale data sets. How to process top-k skyline queries in a big data environment has therefore become an urgent problem. MapReduce is a distributed computing framework proposed by Google that addresses computation over large-scale data and offers good fault tolerance and scalability. This thesis therefore studies top-k skyline query processing algorithms in a distributed environment.

First, to address the redundant dominance checks between data points in top-k skyline queries, a top-k skyline query processing algorithm based on data partitioning in the MapReduce environment (Partitioned Top-k Skyline in MapReduce, MR-PKS) is proposed. The algorithm divides the data into regions and transforms the traditional dominance relationship between points into a one-way dominance relationship between regions. It skips comparisons between data points in regions that have no dominance relationship, reduces redundant data, and computes the top-k skyline in parallel on multiple nodes under the MapReduce framework to improve execution efficiency.

Second, because the partition-based top-k skyline algorithm produces regions that are hard to manage and becomes inefficient when partitioning high-dimensional space, a top-k skyline query processing algorithm based on user preference (User Preference based on Data Partition Top-k Skyline in MapReduce, MR-P-PKS) is proposed. The algorithm first partitions the data set into regions according to the user's priority order over the dimensions, and then filters the regions to shrink the data set used in subsequent computation. Next, it uses the one-way dominance relationship between regions to relax the data one dimension at a time in priority order, which effectively reduces the number of comparisons between data points. At the same time, it uses an indifference threshold to reduce the top-k skyline candidate set. This addresses top-k skyline queries in high-dimensional space, reduces computational overhead, and brings the result set closer to what the user asked for.

Finally, to demonstrate the effectiveness of the algorithms, experiments measure the query response time, the number of comparisons between data points, and the effect of the indifference threshold on response time. The experimental results show that the proposed top-k skyline query algorithms in a distributed environment effectively reduce the number of comparisons and the response time, and improve query efficiency.
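
The region-level pruning idea described for MR-PKS can be illustrated with a small single-process sketch. This is not the thesis's implementation and does not use MapReduce. It assumes that smaller values are better in every dimension, that coordinates lie in [0, 1), and that regions are cells of a uniform grid. The names `cell_of`, `cell_dominates`, `top_k_skyline`, the `cells_per_dim` parameter, and the lower-is-better `score` function are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def dominates(p, q):
    """p dominates q: no worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def cell_of(p, cells_per_dim):
    """Grid cell index of point p (assumes coordinates in [0, 1))."""
    return tuple(min(int(x * cells_per_dim), cells_per_dim - 1) for x in p)


def cell_dominates(c1, c2):
    """One-way region dominance: c1's index is strictly smaller in every
    dimension, so every point in c1 dominates every point in c2."""
    return all(a < b for a, b in zip(c1, c2))


def skyline(points):
    """Brute-force skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def top_k_skyline(points, k, score, cells_per_dim=4):
    # Partition step (plays the role of a map phase): group points by cell.
    cells = defaultdict(list)
    for p in points:
        cells[cell_of(p, cells_per_dim)].append(p)

    # Region filter: drop any cell dominated by another non-empty cell,
    # without comparing individual points.
    surviving = [c for c in cells
                 if not any(cell_dominates(other, c) for other in cells)]

    # Local skylines per surviving cell, then a global merge
    # (plays the role of a reduce phase).
    candidates = [p for c in surviving for p in skyline(cells[c])]
    global_skyline = skyline(candidates)

    # Rank skyline points with the user's scoring function (lower is better
    # in this sketch) and return the best k.
    return sorted(global_skyline, key=score)[:k]
```

For example, `top_k_skyline(points, k=2, score=sum)` returns the two skyline points with the smallest coordinate sum. The region filter is safe because a cell is discarded only when some non-empty cell is strictly better in every dimension, so every point it holds is dominated.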
Keywords/Search Tags: multi-objective optimization, top-k skyline, MapReduce framework, dominance relation