Research About Data Allocation Strategy For All-to-all Comparison Problem With Large Data Sets

Posted on:2018-11-08

Degree:Master

Type:Thesis

Country:China

Candidate:Y J Gao

Full Text:PDF

GTID:2348330536465884

Subject:Information and Communication Engineering

Abstract/Summary:

All-to-all comparison problems are a class of big data processing problems that arise widely in bioinformatics, biometrics, data mining, and other fields. Distributed computing on a distributed storage architecture is widely used to solve large-scale computing problems, including all-to-all comparison, because of its efficiency, reliability, and scalability. It breaks a large problem down into many small tasks and schedules them onto distributed worker nodes. Its performance, however, depends on the data distribution, task decomposition, and task scheduling strategies. For comparison tasks, inappropriate data distribution and poor data locality greatly reduce overall computing performance, and unbalanced computational loads degrade it further. This thesis first introduces the background of the problem and the shortcomings of traditional solutions. It then presents an in-depth theoretical study and model of the all-to-all comparison problem, proposes corresponding algorithms, and obtains good computing performance. The main contributions of this thesis are as follows:

(1) An in-depth theoretical analysis of the all-to-all comparison problem is carried out, and a model of the problem is analyzed.

(2) A heuristic data allocation algorithm based on the greedy approach is proposed. Heuristic rules are derived from the theoretical model of the data distribution problem, and the data allocation algorithm is built on these rules. The algorithm ensures 100% data locality for all comparison tasks, improves storage efficiency compared with storing all data files on every node, improves overall computing performance compared with Hadoop's default data allocation strategy, and scales well.

(3) A data allocation algorithm based on graph covering is proposed. This thesis is the first to apply this method to the all-to-all comparison problem. First, the theoretical basis for data allocation with graph coverings is described. Second, it is proved that an optimal solution can be constructed in certain cases, and several sets of optimal solutions are constructed successfully. Like the heuristic, the graph-covering algorithm ensures 100% data locality and load balancing for the comparison tasks; in addition, in the special cases where such a solution exists, it achieves better computing performance than the heuristic.
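
The idea of allocating data so that every comparison task runs with full data locality can be illustrated with a small sketch. The code below is not the algorithm from the thesis; it is a minimal greedy heuristic written under stated assumptions: files are numbered 0..n-1, each comparison task is one unordered pair of files, and each task goes to the least-loaded node, with ties broken by how few new file copies the node would need. The names `greedy_allocate` and `is_fully_local` are hypothetical.

```python
from itertools import combinations


def greedy_allocate(num_files, num_nodes):
    """Assign every pairwise comparison task to a node (illustrative sketch).

    Returns (nodes, tasks), where nodes[i] is the set of file ids stored on
    node i and tasks[i] is the list of (a, b) pairs that node i will compare.
    """
    nodes = [set() for _ in range(num_nodes)]
    tasks = [[] for _ in range(num_nodes)]

    for a, b in combinations(range(num_files), 2):
        # Assumed greedy rule: least-loaded node first, then the node that
        # needs the fewest new file copies to host this pair.
        def cost(n):
            new_copies = (a not in nodes[n]) + (b not in nodes[n])
            return (len(tasks[n]), new_copies)

        best = min(range(num_nodes), key=cost)
        nodes[best].update((a, b))   # store both files on the chosen node
        tasks[best].append((a, b))   # so the task needs no remote reads

    return nodes, tasks


def is_fully_local(nodes, tasks):
    """True if every task's two input files are stored on its own node."""
    return all(
        a in nodes[i] and b in nodes[i]
        for i, node_tasks in enumerate(tasks)
        for a, b in node_tasks
    )


if __name__ == "__main__":
    nodes, tasks = greedy_allocate(num_files=6, num_nodes=3)
    for i, (stored, assigned) in enumerate(zip(nodes, tasks)):
        print(f"node {i}: stores {sorted(stored)}, runs {len(assigned)} tasks")
    print("100% data locality:", is_fully_local(nodes, tasks))
```

With 6 files and 3 nodes there are 15 comparison tasks, and this rule places 5 on each node. Because a task is only ever assigned to a node that then stores both of its files, data locality holds by construction; how many file copies each node ends up storing depends on the tie-breaking rule, which is where a more careful heuristic or a graph-covering construction would differ.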

Keywords/Search Tags:

distributed computing, big data, all-to-all comparison, data distribution, graph covering

Related items

1	Parallel Massive Data Processing Platform Based On Graph Computing
2	Design And Implementation Of Distributed Graph Computing Engine
3	Research On Data Extraction And Distributed Graph Data Management
4	Research On A Distributed Graph Data Process Mechanism Based On Spark
5	Distributed Data Process In Graph Database
6	Research On Distributed Graph Computing Performance Optimization For Natural Graphs
7	Performance comparison of data distribution management strategies in large-scale distributed simulation
8	Research Of Keywords Covering The Collection Issues On Graph Data
9	Hybrid Graph Query And Graph Computing Engine For Distributed Graph Database
10	Efficient Algorithm For Mining Dense Subgraphs In Uncertain Graph