Research On Distributed Data Query Based On Hadoop

Posted on: 2019-02-11  Degree: Master  Type: Thesis
Country: China  Candidate: Q Yang  Full Text: PDF
GTID: 2428330566974051  Subject: Full-time Engineering
Abstract/Summary:
With the rapid development of the Internet, the volume of data it generates is growing quickly, and finding the data a user needs within massive data sets has become a focus of research. Skyline queries are widely used in fields such as multi-objective decision analysis and data visualization, and can effectively identify a subset of better points in a data set. As data volumes increase, Skyline query algorithms are run on the Hadoop framework so that they can handle Skyline queries in big data environments. However, the size of the Skyline result set grows exponentially with the data dimension, and when the result set is too large it cannot give users precise information. How to select a smaller and more representative set of query results is therefore worth further research.

To address the low efficiency of existing distributed Skyline query algorithms, this thesis optimizes the Skyline query algorithm on the MapReduce framework. The idea is to preprocess the original data set: strong points are selected and used to filter the data, so that most non-Skyline points are removed before the algorithm starts. In addition, following the processing strategy of the hybrid Skyline query algorithm, a time interval is set and the local Skyline is updated within that interval, which reduces repeated comparisons between data points. The experimental results show that the algorithm filters out non-Skyline points in advance and improves running time.

To address the problem of oversized Skyline result sets in big data environments and to obtain more representative Skyline results, a Skyline result set optimization algorithm based on dominating number is proposed on the MapReduce framework. The algorithm defines a method for calculating the dominating number of a data point: while data points are being compared for dominance, the number of points each one dominates is counted dynamically, so that the K Skyline points with the highest dominating numbers can be returned to the user to represent the Skyline result set. The experimental results show that the algorithm effectively controls the size of the Skyline result set and has good time and space performance.
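
The two ideas in the abstract (filtering with strong points before the Skyline computation, and ranking Skyline points by dominating number) can be illustrated with a small single-machine sketch. This is not the thesis's MapReduce implementation. It assumes that smaller values are better in every dimension, that "strong points" are the points with the smallest coordinate sum, and that the dominating number of a point is the count of points it dominates; it also counts dominance in a separate pass rather than dynamically during comparison. All function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """True if p is no worse than q everywhere and strictly better somewhere.

    Assumption: smaller is better in every dimension.
    """
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def prefilter(points: List[Point], n_filters: int = 1) -> List[Point]:
    """Drop points dominated by a few 'strong' filter points.

    Assumption: strong points are those with the smallest coordinate sum.
    Only dominated points are removed, so the Skyline is unchanged.
    """
    strong = sorted(points, key=sum)[:n_filters]
    return [p for p in points if not any(dominates(s, p) for s in strong)]


def skyline(points: List[Point]) -> List[Point]:
    """Block-nested-loop Skyline: keep a window of mutually non-dominated points."""
    window: List[Point] = []
    for p in points:
        if any(dominates(w, p) for w in window):
            continue  # p is dominated, so it cannot be a Skyline point
        # p survives: evict window points that p dominates, then add p
        window = [w for w in window if not dominates(p, w)]
        window.append(p)
    return window


def top_k_by_dominating_number(points: List[Point], k: int) -> List[Point]:
    """Return the k Skyline points that dominate the most points in the data set."""
    sky = skyline(prefilter(points))
    counts = {s: sum(dominates(s, q) for q in points) for s in sky}
    # Ties are broken by the point's coordinates to keep the order deterministic.
    return sorted(sky, key=lambda s: (-counts[s], s))[:k]
```

In a MapReduce setting, a function like `skyline` would typically run once per partition in the map phase and again on the merged local results in the reduce phase; how the thesis partitions the data and schedules the interval-based updates is not stated in the abstract.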
Keywords/Search Tags:Big data, Skyline query, Hadoop, MapReduce, filter, Domination number