
An Improvement Of Cluster On Phylogenetic Profiling Method

Posted on: 2012-09-06
Degree: Master
Type: Thesis
Country: China
Candidate: D H Li
GTID: 2218330368996000
Subject: Computer Applications scientific
Abstract/Summary:
With the advent of automated, efficient sequencing techniques, the focus of bioinformatics has shifted to gene analysis and genome annotation. Because of the shortcomings of homology-based methods, non-homology approaches, which classify sequences and analyze their function from attributes of the sequences themselves, are receiving more and more attention.

Phylogenetic profiling is a non-homology annotation method that uses evolutionary information. Since it was proposed by Pellegrini in 1999, many researchers have improved it in three respects: reference genome selection, profile construction, and profile similarity analysis. Phylogenetic profiles take three forms: discrete, continuous, and weight-based. The weight-based form is developed from the continuous one. Its weights make genes that perform well in the sample proteins more prominent and correspondingly weaken genes that are seldom translated in the sample. In this thesis, the weight-based phylogenetic profiling method is used to pre-process the protein data, and hierarchical clustering and K-means clustering are then applied together. Two improvements are made on previous work. First, a distance motivated by bioinformatics background knowledge is used in hierarchical clustering. Second, more information is extracted from the hierarchical clustering result and used as the initial parameters of K-means clustering, which makes K-means more efficient.

Most clustering algorithms adopt Euclidean distance, and because most of the samples they handle lie in Euclidean space, the clustering results are good. The distance adopted here for hierarchical clustering is a new, non-Euclidean one. Compared with Euclidean distance, it reinforces already-known information: it takes into account not only the distance between two samples but also the distance between each sample and the reference group, which makes it possible to handle samples similar to the reference group.

The shortcoming of K-means is that its initial parameters strongly affect the result, and most current improvements focus on the choice of these parameters. Hierarchical clustering and K-means clustering are usually combined so that hierarchical clustering supplies the initial cluster number K. Here, more information is extracted from the hierarchical clustering result so that it also supplies the initial points for K-means.

Finally, the Escherichia coli K12 genome is chosen as the experimental sample to verify the improvements. The results show that, compared with the traditional algorithm, the new algorithm is more accurate and more efficient.
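
The combination of hierarchical and K-means clustering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: the abstract does not give the formula of the bioinformatics distance, so plain Euclidean distance stands in for it, and average linkage, the merge threshold, the toy profiles, and all function names are hypothetical choices made only for this example.

```python
import math


def euclidean(a, b):
    # Placeholder distance (assumption). The thesis uses a non-Euclidean,
    # bioinformatics-motivated distance that the abstract does not specify.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    # Component-wise mean of a list of equal-length vectors.
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def hierarchical(points, threshold, dist=euclidean):
    """Average-linkage agglomerative clustering. Merging stops once the
    closest pair of clusters is farther apart than `threshold`."""
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        total = sum(dist(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def kmeans(points, centers, max_iter=100):
    """Lloyd's algorithm started from the supplied centers."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda k: euclidean(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = centroid(members)
    return labels, centers


def seeded_kmeans(points, threshold):
    # The hierarchical result supplies both K (the number of clusters)
    # and the initial points (the centroid of each cluster).
    groups = hierarchical(points, threshold)
    seeds = [centroid([points[i] for i in g]) for g in groups]
    return kmeans(points, seeds)


# Toy continuous profiles: one row per protein, one column per reference genome.
profiles = [
    [1.0, 0.9, 0.0, 0.1],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.1, 0.0, 0.9, 1.0],
    [1.0, 1.0, 0.0, 0.0],
]
labels, centers = seeded_kmeans(profiles, threshold=0.5)
print(len(centers), labels)
```

Because both K and the starting centers come from the hierarchical step, K-means begins close to a sensible partition rather than from random points, which is the effect the abstract attributes to the combined approach. Here `dist` is a parameter of `hierarchical` so that a different distance could be substituted for the Euclidean placeholder.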
Keywords/Search Tags:Phylogenetic Profiling, Bioinformatics distance, hierarchical cluster, K-means