In recent years, biological information processing has become an active research direction, in which essential protein prediction can effectively and quickly identify essential proteins in protein-protein interaction networks. Essential proteins are crucial to the survival and reproduction of organisms and to drug target selection. At present there are many essential protein prediction algorithms. However, the computational complexity of several of them is very high, so that even calculating small-scale protein-protein interaction networks is difficult. Accelerating these algorithms with a distributed computing framework is a feasible solution. Spark has now become a mature distributed computing framework, but its shuffle performance is low in actual distributed computation. Therefore, studying shuffle optimization strategies for Spark is of great significance. The main research results of this paper are as follows:

(1) In the research on predicting essential proteins, the L-BC indicator has the advantages of considering the local characteristics of networks and reducing running time, while the k-BC indicator distinguishes vertex importance in finer detail. This paper proposes the L1-BC indicator by combining the advantages of these two algorithms. The experimental results show that the prediction accuracy of the L1-BC indicator is better than that of other topological attribute indicators in most cases; compared with the traditional BC indicator, prediction accuracy can be increased by 10% to 50%. To accelerate the calculation of the L1-BC indicator, a parallel algorithm for computing it is implemented on Spark. By using broadcast variables and accumulators, the parallel algorithm effectively avoids memory overflow when computing large-scale networks, and the speedup ratio can reach up to 94.31%.

(2) In the research on optimizing the shuffle performance of
Spark, an adaptive memory allocation algorithm is proposed based on the historical spilling times of tasks. In the proposed algorithm, memory is first borrowed from the tasks that have not spilled. The tasks that have spilled are then assigned weights according to their spilling times. Finally, the freed memory is lent to those tasks according to their weights. Through this adaptive adjustment, the algorithm effectively reduces the total size of memory spilling, improves shuffle performance, and shortens the overall running time of the job. The experimental results show that the algorithm shortens the running time by 11.2% and reduces the size of memory spilling by 8.5% on a skewed dataset.
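The broadcast-variable and accumulator pattern mentioned in result (1) can be illustrated in plain Python (this is a conceptual sketch of the pattern, not the Spark API or the thesis implementation; the toy graph and function names are hypothetical): the network is shared as one read-only copy rather than serialized into every task, and only small per-partition results travel back to the driver for aggregation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy network: an adjacency list shared read-only by all
# tasks, playing the role of a Spark broadcast variable.
broadcast_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [2]}

def partition_degree(partition):
    # Each task reads the shared graph; it never copies or mutates it,
    # so large networks are not duplicated per task.
    return sum(len(broadcast_graph[v]) for v in partition)

partitions = [[0, 1], [2, 3]]
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(partition_degree, partitions))

# Driver-side aggregation, playing the role of an accumulator: only the
# small partial sums are collected, not the per-vertex data.
total_degree = sum(partials)
print(total_degree)  # 4 + 3 = 7
```

In Spark itself the same shape is expressed with `sc.broadcast(...)` for the shared graph and an accumulator (or a final `reduce`) for the merged counts.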
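The three steps of the adaptive memory allocation in result (2) can be sketched as follows (a minimal illustrative sketch, assuming a per-task megabyte budget and a lending fraction; the function and parameter names are hypothetical, not the thesis implementation):

```python
def reallocate(allocations, spill_counts, lend_fraction=0.5):
    """Redistribute per-task memory (MB) by historical spill counts."""
    # Step 1: borrow a fraction of memory from tasks that never spilled.
    pool = 0.0
    new_alloc = {}
    for task, mem in allocations.items():
        if spill_counts.get(task, 0) == 0:
            lent = mem * lend_fraction
            pool += lent
            new_alloc[task] = mem - lent
        else:
            new_alloc[task] = mem
    # Step 2: weight each spilling task by its historical spill count.
    total_spills = sum(c for c in spill_counts.values() if c > 0)
    # Step 3: lend the pooled memory out in proportion to the weights.
    if total_spills > 0:
        for task, count in spill_counts.items():
            if count > 0:
                new_alloc[task] += pool * (count / total_spills)
    return new_alloc

alloc = reallocate({"t1": 100, "t2": 100, "t3": 100},
                   {"t1": 0, "t2": 1, "t3": 3})
# t1 lends 50 MB; t2 receives 50 * 1/4 = 12.5, t3 receives 50 * 3/4 = 37.5
```

The task with three spills (`t3`) ends up with three times the borrowed memory of the task with one spill (`t2`), while the total memory across tasks is conserved.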