Parallel Clustering Algorithm For Large-scale Biological Data Sets

Posted on:2015-02-01

Degree:Master

Type:Thesis

Country:China

Candidate:M C Wang

Full Text:PDF

GTID:2298330422989405

Subject:Computer application technology

Abstract/Summary:

PDF Full Text Request

Bioinformatics is an interdisciplinary subject which takes computer science, appliedmathematics, information technology and statistics to address biological problems. Dataclustering is one the important methods, and attracts lots of attention recently. However, thescale of biological data sets becomes much larger than before, in which some traditionalsequential clustering algorithms can’t meet the demand of large-scale computing. The paperfocuses on the parallel clustering algorithm for large-scale biological data sets, andimplements two parallel clustering algorithms which handle large-scale data sets effectively.Firstly, we propose the MFC clustering algorithm based on the Markov model. MFCalgorithm includes three steps: Walking, Finding and Inflation. Compared with theTRIBE-MCL algorithm, the MFC algorithm is more resistant to the effect of noise. However,the MFC algorithm simulates the random walks based on the initial probability matrix, so ittakes longer time or more iteration steps to converge. In the experiments, we apply the MFCalgorithm to detection of protein families, and find the algorithm performs well. Moreover,we also use the same data sets to test the TRIBE-MCL algorithm, and results indicate thatthe MFC algorithm has a better performance than the TRIBE-MCL algorithm.Secondly, the MFC algorithm has about O(3) time complexity, so it is difficult tohandle some large-scale data sets. In order to address these challenges, we implement theparallel MFC algorithm on the GPU. Compared with the original MFC algorithm, theparallel MFC algorithm achieves a good accelerating performance and can address thelarge-scale data sets effectively.Finally, we implement the parallel affinity propagation clustering algorithm on thedistributed system. The affinity propagation performs well in many clustering problems;however, it is difficult to address some large-scale data sets because of its time and spacecomplexity. The distributed system can supply the large-scale memory and great computingcapacity, so it is promising to implement the parallel propagation clustering algorithm on it.In the experiments, we applied the parallel affinity propagation to analysis of protein familyand microarray gene data sets, and the algorithm gets a good performance. Compared withthe original affinity propagation algorithm, the parallel affinity propagation achieves a good speedup, and reduces the runtime from hours to several seconds. In the experiments on thecluster computer of Shanghai University, we run the parallel affinity propagation on128cores, and achieve more than100speedup.The implementation of parallel clustering algorithm which can address large-scale datasets effectively makes necessary foundation of algorithms and applications for Proteomicsand Genomics researches.

Keywords/Search Tags:

parallel clustering algorithm, high performance computing, detectingprotein family, biological clustering, MPI

PDF Full Text Request

Related items

1	Research On Parallel Hierarchical Clustering Algorithm
2	Parallel Clustering Algorithm Based On MapReduce
3	Parallel Computing Acceleration Research And Implementation Of Several Classification And Clustering Algorithms
4	The Research And Realizing Of Parallel Cluster Analysis Algorithm
5	Research On Parallel Non-Intervention Document Clustering Algorithm
6	A minimum spanning tree based clustering algorithm for high throughput biological data
7	Design And Implementation An Of Document Clustering Algorithm Based On The GPU
8	Parallel Design And Implementation Of AP Clustering Algorithms Based On CUDA
9	Research On Parallelization Of Clustering Algorithm Based On Many Cores
10	Research And Implementation Of Mapreduce-based Graph Clustering Algorithm