Font Size: a A A

Research On Top-k Skyline Query Algorithm For Large Dataset

Posted on:2020-01-07Degree:MasterType:Thesis
Country:ChinaCandidate:L WeiFull Text:PDF
GTID:2428330572479105Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Skyline query algorithm has attracted a lot of attention since proposed,which can describe the general situation of data.Thus,Skyline query algorithm is applied in many practical scenarios,such as multi-rule decision-making,real-time online services and business data analysis.With the rapid development of Internet technology,Skyline query algorithm still has great potentials.However,when the datasets are large and have many dimensions,Skyline query algorithm will fetch back too many results to users.Therefore,the Top-k Skyline query algorithm is proposed,which uses the classical Top-k algorithm to help filter the results of Skyline query.However,the state-of-the-art Top-k Skyline query algorithms still are ine:fficiency for large datasets.First,the algorithms need to create and maintain a special data structure.When the dataset is large,the cost of the construction and maintenance for the special data structures will be very expensive.Second,the time cost of the algorithm is large.The best time complexity of the state-of-the-art algorithms is O(n log n).When the datasets are large,the runtime of the algorithm will be too long.Third,the special data structure of the algorithm leads to a lack of universality.For the datasets with simple data types,building a matching data structure can reduce the time cost of the query algorithm.However,in the face of the datasets with complex data types,it is difficult to find a suitable data structure.Actually,the data types of the datasets are often complex.Here an efficient Top-k Skyline query method called DFTS is proposed,which can perform well for large datasets.DFTS involves three steps.Firstly,the "degree score" function is used to rank the dataset,and a large quantity of objects with low ranking will be filtered out.Secondly,DFTS makes a Skyline query upon the candidates and generates a Skyline subset.Finally,top-k objects with high ranking will be selected from the Skyline subset as the final result.Through these steps,DFTS can significantly reduce the time cost.Moreover,without using a special data structure,DFTS can reduce space cost and be more universal.Then,DFTS is executed on the cluster distributed computing framework,called Spark,that indicate DFTS can well applies to distributed environment.And extensive experimental results show that DFTS can achieve much better performance for large datasets than those state-of-the-art methods.
Keywords/Search Tags:Skyline, Top-k, Apache Spark
PDF Full Text Request
Related items