
Research On Parallelization Of Improved AP Algorithm Based On Spark And Its Application In Protein Complexes Identification

Posted on: 2021-03-14
Degree: Master
Type: Thesis
Country: China
Candidate: C Deng
Full Text: PDF
GTID: 2370330626458926
Subject: Software engineering
Abstract/Summary:
Proteins are the basis of biological activity. A single protein can rarely carry out the rich life activities of an organism on its own; proteins usually interact with one another and form protein complexes to achieve specific biological functions. Protein-protein interaction networks have complex structures and large data scales. Accurate and efficient identification of protein complexes is therefore important for understanding the structure of protein-protein interaction networks, for analyzing how cells carry out life activities, and for biomedical research.

Existing approaches to identifying protein complexes fall into experimental methods and computational methods. Experimental methods generally take more time, cost more, and have lower detection efficiency, and computational methods can make up for these shortcomings. Scholars have developed algorithms that automatically mine protein complexes from protein-protein interaction networks, but as these networks grow, the speed of existing algorithms needs to improve. To improve the efficiency of the identification algorithm, this paper combines protein complex identification with Spark.

Spark emerged with the advent of big data and the rapid development of distributed computing frameworks. It is a big data computing framework based on in-memory computing. Its core abstraction, the RDD, reduces disk usage and I/O operations during parallel computing. Compared with other distributed platforms, it has a rich ecosystem and obvious advantages, so it has been widely applied in the big data industry.

An effective way to identify protein complexes is to convert the protein-protein interaction network into a graph and apply a clustering algorithm. The Affinity Propagation (AP) algorithm is a high-precision clustering algorithm, but its time complexity is relatively high, which makes it unsuitable for large-scale protein-protein interaction networks, and the preference value in its similarity matrix affects the clustering result. This paper improves the existing AP algorithm accordingly and proposes the EG-AP algorithm, whose advantage is that it maintains high accuracy. We further parallelize the EG-AP algorithm on the Spark platform to improve its efficiency. The main research work of this paper is as follows:

1) We improve the original AP algorithm and propose the EG-AP algorithm. First, a similarity matrix is constructed according to the relationships between data nodes in the network: the more common neighbors two data nodes share, the higher their similarity. The ECC algorithm and Gene Ontology annotation are used to calculate the similarity between data nodes and construct the similarity matrix. The diagonal values of the similarity matrix, called the preference, affect the clustering result, so the setting of the preference is improved. The traditional AP algorithm sets the preference to a fixed value, which ignores that the preference of a data node should be related to its similarity with the data nodes connected to it. In this paper, the preference of each data node is set to the sum of its similarities with the nodes connected to it, divided by the number of data points, plus the average of all similarities.

2) The EG-AP algorithm is applied to identify protein complexes. F-measure and Sep are used as evaluation metrics on the protein-protein interaction networks of three different species. The experimental results show that the algorithm achieves higher identification accuracy on different datasets, which verifies the effectiveness of the EG-AP algorithm.

3) The AP algorithm is based on iterative operations between matrices, and protein-protein interaction networks are relatively large, so it consumes considerable time. This paper therefore builds a Spark cluster and parallelizes the AP algorithm on it. On these datasets, the running time of the EG-AP algorithm in stand-alone mode and in cluster mode is compared, and the speedup ratio is calculated. The experimental results show that the parallel EG-AP algorithm further improves the efficiency of protein complex identification, which illustrates the effectiveness of the parallel EG-AP algorithm.
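
The similarity and preference steps described in item 1) can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes that ECC refers to the edge clustering coefficient in one common form (shared neighbors of an edge divided by the smaller endpoint degree minus one), it omits the Gene Ontology term entirely, and it reads "the number of data points" as the total number of nodes and "the average value of all similarities" as the mean over connected pairs. The function names `ecc_similarity` and `preferences` are hypothetical.

```python
def ecc_similarity(edges):
    """Edge-clustering-coefficient similarity for each edge of an undirected graph.

    Assumed form: ECC(u, v) = |N(u) & N(v)| / min(deg(u) - 1, deg(v) - 1),
    with 0.0 when the denominator is zero. GO annotation is not modelled here.
    Returns (sim, neighbors): sim maps both (u, v) and (v, u) to the score.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    sim = {}
    for u, v in edges:
        common = len(neighbors[u] & neighbors[v])  # shared neighbors of u and v
        denom = min(len(neighbors[u]) - 1, len(neighbors[v]) - 1)
        score = common / denom if denom > 0 else 0.0
        sim[(u, v)] = sim[(v, u)] = score
    return sim, neighbors


def preferences(sim, nodes):
    """Per-node preference under one reading of the abstract.

    preference(i) = (sum of similarities on edges incident to i) / n
                    + mean of all similarities,
    where n is the total number of nodes (an assumption) and the mean is
    taken over connected pairs only (also an assumption).
    """
    n = len(nodes)
    global_mean = sum(sim.values()) / len(sim) if sim else 0.0
    prefs = {}
    for i in nodes:
        row_sum = sum(s for (a, _), s in sim.items() if a == i)
        prefs[i] = row_sum / n + global_mean
    return prefs


if __name__ == "__main__":
    # Toy network: a triangle A-B-C with a pendant node D attached to C.
    toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
    sim, nbrs = ecc_similarity(toy_edges)
    for node, pref in sorted(preferences(sim, sorted(nbrs)).items()):
        print(node, pref)
```

In this toy network the pendant node D receives a lower preference than the three triangle nodes, which matches the stated intent: a node's preference depends on how similar it is to the nodes connected to it, where a fixed preference would treat all four nodes alike.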
Keywords/Search Tags: AP algorithm, Spark, protein complex, F-measure, Gene Ontology annotation, ECC weighting