In recent years, biological information processing has become an active research direction, in which essential protein prediction can effectively and quickly identify essential proteins in protein-protein interaction networks. Essential proteins are crucial to the survival and reproduction of organisms and to drug target selection. At present there are many essential protein prediction algorithms. However, the computational complexity of several of them is very high, so that even calculating small-scale protein-protein interaction networks is difficult. Accelerating these algorithms with a distributed computing framework is a feasible solution. Spark has now become a mature distributed computing framework, but its shuffle performance is low in actual distributed computation. Therefore, studying shuffle optimization strategies for Spark is of great significance. The main research results of this paper are as follows:

(1) In the research on predicting essential proteins, the L-BC indicator has the advantages of considering the local characteristics of networks and reducing running time, while the k-BC indicator distinguishes vertex importance in finer detail. This paper proposes the L1-BC indicator by combining the advantages of these two algorithms. The experimental results show that the prediction accuracy of the L1-BC indicator is better than that of other topological attribute indicators in most cases; compared with the traditional BC indicator, prediction accuracy can be increased by 10% to 50%. To accelerate the calculation of the L1-BC indicator, a parallel algorithm for computing it is implemented on Spark. By using broadcast variables and accumulators, the parallel algorithm effectively avoids memory overflow when computing large-scale networks, and the speedup ratio can reach up to 94.31%.

(2) In the research on optimizing the shuffle performance of
Spark, an adaptive memory allocation algorithm is proposed based on the historical spilling times of tasks. In the proposed algorithm, memory is first borrowed from the tasks that have not spilled. The tasks that have spilled are then assigned weights according to their spilling times. Finally, the freed memory is lent to those tasks according to their weights. Through this adaptive adjustment, the algorithm effectively reduces the total size of memory spilling, improves shuffle performance, and shortens the overall running time of the job. The experimental results show that the algorithm shortens the running time by 11.2% and reduces the size of memory spilling by 8.5% on a skewed dataset.
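The broadcast-variable and accumulator pattern mentioned in result (1) can be illustrated in plain Python (this is a conceptual sketch of the pattern, not the Spark API or the thesis implementation; the toy graph and function names are hypothetical): the network is shared as one read-only copy rather than serialized into every task, and only small per-partition results travel back to the driver for aggregation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy network: an adjacency list shared read-only by all
# tasks, playing the role of a Spark broadcast variable.
broadcast_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [2]}

def partition_degree(partition):
    # Each task reads the shared graph; it never copies or mutates it,
    # so large networks are not duplicated per task.
    return sum(len(broadcast_graph[v]) for v in partition)

partitions = [[0, 1], [2, 3]]
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(partition_degree, partitions))

# Driver-side aggregation, playing the role of an accumulator: only the
# small partial sums are collected, not the per-vertex data.
total_degree = sum(partials)
print(total_degree)  # 4 + 3 = 7
```

In Spark itself the same shape is expressed with `sc.broadcast(...)` for the shared graph and an accumulator (or a final `reduce`) for the merged counts.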
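The three steps of the adaptive memory allocation in result (2) can be sketched as follows (a minimal illustrative sketch, assuming a per-task megabyte budget and a lending fraction; the function and parameter names are hypothetical, not the thesis implementation):

```python
def reallocate(allocations, spill_counts, lend_fraction=0.5):
    """Redistribute per-task memory (MB) by historical spill counts."""
    # Step 1: borrow a fraction of memory from tasks that never spilled.
    pool = 0.0
    new_alloc = {}
    for task, mem in allocations.items():
        if spill_counts.get(task, 0) == 0:
            lent = mem * lend_fraction
            pool += lent
            new_alloc[task] = mem - lent
        else:
            new_alloc[task] = mem
    # Step 2: weight each spilling task by its historical spill count.
    total_spills = sum(c for c in spill_counts.values() if c > 0)
    # Step 3: lend the pooled memory out in proportion to the weights.
    if total_spills > 0:
        for task, count in spill_counts.items():
            if count > 0:
                new_alloc[task] += pool * (count / total_spills)
    return new_alloc

alloc = reallocate({"t1": 100, "t2": 100, "t3": 100},
                   {"t1": 0, "t2": 1, "t3": 3})
# t1 lends 50 MB; t2 receives 50 * 1/4 = 12.5, t3 receives 50 * 3/4 = 37.5
```

The task with three spills (`t3`) ends up with three times the borrowed memory of the task with one spill (`t2`), while the total memory across tasks is conserved.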