Research About Data Allocation Strategy For All-to-all Comparison Problem With Large Data Sets

Posted on: 2018-11-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Gao
Full Text: PDF
GTID: 2348330536465884
Subject: Information and Communication Engineering
Abstract/Summary:
All-to-all comparison problems are a class of big data processing problems that arise widely in bioinformatics, biometrics, data mining, and other fields. Distributed computing on a distributed storage architecture is widely used to solve large-scale computing problems, including all-to-all comparison, because of its high efficiency, reliability, and scalability. It breaks a large problem into many small tasks and schedules them on distributed worker nodes. Its performance, however, depends on the data distribution, task decomposition, and task scheduling strategies. For comparison tasks, inappropriate data distribution and poor data locality greatly reduce overall computing performance, and unbalanced computational loads degrade it further.

This thesis first introduces the background of the problem and the shortcomings of traditional solutions. It then presents an in-depth theoretical study and model of the all-to-all comparison problem, proposes corresponding algorithms, and obtains good computing performance. The main contributions are as follows:

(1) An in-depth theoretical analysis of the all-to-all comparison problem is carried out, and a model of all-to-all comparison is analyzed.

(2) A heuristic data allocation algorithm based on a greedy approach is proposed. Heuristic rules are derived from the theoretical model of the data distribution problem, and the data allocation algorithm is built on these rules. The algorithm ensures 100% data locality for all comparison tasks, improves storage efficiency compared with storing all data files on every node, improves overall computing performance compared with Hadoop's default data allocation strategy, and scales well.

(3) A data allocation algorithm based on graph covering is proposed. This thesis is the first to propose this method for the all-to-all comparison problem. First, the theoretical basis for data allocation with graph coverings is described. Second, it is proved that an optimal solution of the graph covering can be constructed in certain situations, and several sets of optimal solutions are constructed. In addition to ensuring 100% data locality and load balancing for the comparison tasks, the graph-covering-based algorithm achieves better computing performance than the heuristic in the cases where such special solutions exist.
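
The abstract does not give the heuristic rules themselves, so the following is only a minimal sketch of what a greedy data allocation with 100% data locality could look like, not the thesis's algorithm. The function names `greedy_allocate` and `has_full_locality`, the per-node task cap, and the tie-breaking order are all assumptions made for illustration.

```python
from itertools import combinations
from math import ceil


def greedy_allocate(num_files, num_nodes):
    """Assign every comparison pair (i, j) to exactly one node.

    Assumed greedy rule (illustrative, not taken from the thesis):
    among nodes that are still below a per-node task cap, pick the one
    that needs the fewest new file copies, then the one with fewer tasks.
    """
    pairs = list(combinations(range(num_files), 2))
    cap = ceil(len(pairs) / num_nodes)          # load-balancing cap per node
    stored = [set() for _ in range(num_nodes)]  # files placed on each node
    tasks = [[] for _ in range(num_nodes)]      # pairs scheduled on each node

    for i, j in pairs:
        # Only nodes with spare capacity are candidates; at least one
        # always exists because cap * num_nodes >= len(pairs).
        candidates = [n for n in range(num_nodes) if len(tasks[n]) < cap]
        best = min(
            candidates,
            key=lambda n: (len({i, j} - stored[n]), len(tasks[n]), n),
        )
        stored[best].update((i, j))  # copy any missing file to that node
        tasks[best].append((i, j))
    return stored, tasks


def has_full_locality(stored, tasks):
    """Check that every task finds both of its files on its own node."""
    return all(
        i in stored[n] and j in stored[n]
        for n, node_tasks in enumerate(tasks)
        for i, j in node_tasks
    )


if __name__ == "__main__":
    stored, tasks = greedy_allocate(num_files=6, num_nodes=3)
    print("files per node:", [sorted(s) for s in stored])
    print("tasks per node:", [len(t) for t in tasks])
    print("full locality:", has_full_locality(stored, tasks))
```

Because a file is copied to a node whenever a pair is scheduled there, every comparison runs on local data by construction. The cap keeps task counts balanced, and the "fewest new copies" rule tries to keep the total number of stored copies below the `num_files * num_nodes` copies needed when every node holds every file.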
Keywords/Search Tags:distributed computing, big data, all-to-all comparison, data distribution, graph covering