Font Size: a A A

Parallel Clustering Algorithm For Large-scale Biological Data Sets

Posted on:2015-02-01Degree:MasterType:Thesis
Country:ChinaCandidate:M C WangFull Text:PDF
GTID:2298330422989405Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Bioinformatics is an interdisciplinary subject which takes computer science, appliedmathematics, information technology and statistics to address biological problems. Dataclustering is one the important methods, and attracts lots of attention recently. However, thescale of biological data sets becomes much larger than before, in which some traditionalsequential clustering algorithms can’t meet the demand of large-scale computing. The paperfocuses on the parallel clustering algorithm for large-scale biological data sets, andimplements two parallel clustering algorithms which handle large-scale data sets effectively.Firstly, we propose the MFC clustering algorithm based on the Markov model. MFCalgorithm includes three steps: Walking, Finding and Inflation. Compared with theTRIBE-MCL algorithm, the MFC algorithm is more resistant to the effect of noise. However,the MFC algorithm simulates the random walks based on the initial probability matrix, so ittakes longer time or more iteration steps to converge. In the experiments, we apply the MFCalgorithm to detection of protein families, and find the algorithm performs well. Moreover,we also use the same data sets to test the TRIBE-MCL algorithm, and results indicate thatthe MFC algorithm has a better performance than the TRIBE-MCL algorithm.Secondly, the MFC algorithm has about O(3) time complexity, so it is difficult tohandle some large-scale data sets. In order to address these challenges, we implement theparallel MFC algorithm on the GPU. Compared with the original MFC algorithm, theparallel MFC algorithm achieves a good accelerating performance and can address thelarge-scale data sets effectively.Finally, we implement the parallel affinity propagation clustering algorithm on thedistributed system. The affinity propagation performs well in many clustering problems;however, it is difficult to address some large-scale data sets because of its time and spacecomplexity. The distributed system can supply the large-scale memory and great computingcapacity, so it is promising to implement the parallel propagation clustering algorithm on it.In the experiments, we applied the parallel affinity propagation to analysis of protein familyand microarray gene data sets, and the algorithm gets a good performance. Compared withthe original affinity propagation algorithm, the parallel affinity propagation achieves a good speedup, and reduces the runtime from hours to several seconds. In the experiments on thecluster computer of Shanghai University, we run the parallel affinity propagation on128cores, and achieve more than100speedup.The implementation of parallel clustering algorithm which can address large-scale datasets effectively makes necessary foundation of algorithms and applications for Proteomicsand Genomics researches.
Keywords/Search Tags:parallel clustering algorithm, high performance computing, detectingprotein family, biological clustering, MPI
PDF Full Text Request
Related items