| Since the Human Genome Project was launched, biological data of all kinds have grown exponentially every year. Useful knowledge must be mined from these data to help explain life, and bioinformatics emerged as an academic discipline to meet this need. The sheer volume of data in bioinformatics poses a great challenge to traditional computer algorithms. Protein homology detection is an important research area in bioinformatics: it uses the homology between protein sequences to classify newly determined protein sequences into protein families of known structure and function, so that the structural and functional characteristics of the new protein can be inferred. Recently, several new methods have been proposed and successfully applied to protein homology detection. However, like any new technology, the existing methods still have many limitations and perform poorly when the similarity between protein sequences is very low, that is, in remote homology detection. This thesis therefore focuses on protein homology detection, and on remote homology detection in particular. The main work of this thesis is as follows:
1. The current state of research on protein remote homology detection is summarized. For clarity, the computational methods are classified as follows: (A) methods that compare proteins on the basis of their sequence information: (A1) based only on protein sequence comparison; (A2) based on protein sequence profiles; (A3) based on information derived from protein structures; (A4) based on machine learning predictors; (A5) based on consensus; and (B) methods that compare protein structures (i.e., using their 3D structures) through structure-versus-structure alignment. Each class of methods is briefly introduced.
2. The affinity propagation clustering algorithm is selected as the method for protein remote homology detection.
Affinity propagation clustering (APC) is an unsupervised clustering algorithm that, compared with traditional clustering methods such as K-means, can process large, complex, multi-attribute data faster. Building on these advantages, this thesis introduces the algorithm into protein remote homology detection. Analysis of the clustering results shows that the algorithm is effective for classifying remote homologous proteins and that its results are better than those of other existing clustering methods.
3. The Rand index is used to validate the clustering results. Because validation by the Rand index is rather coarse, this thesis borrows the functional prediction accuracy used in gene clustering and formulates an accuracy measure for protein remote homology detection, which improves the precision of the clustering validation.
4. The physicochemical properties of the amino acids in a protein are selected as feature values. A similarity measure converts the feature matrix into a similarity matrix, and the affinity propagation clustering algorithm is then applied to generate the clustering results. The properties selected in this thesis are the electron-ion interaction potential, hydrophobicity scale, normalized van der Waals volume, isoelectric point, mean polarity, polarizability parameter, and net charge.
5. The first and second composition vectors of the amino acid composition are selected as feature values. Protein secondary structure is predicted with GOR and Porter, and the secondary structure content, the secondary structure transition content, and the amino acid content corresponding to each secondary structure are then selected as feature values. These features are clustered with the affinity propagation clustering algorithm.
The clustering results are compared with those of K-means. Validation shows that, for the same number of clusters, the clustering results obtained in this thesis are better than the K-means results.
6. A radial basis function (RBF) kernel is used to map the feature matrix into a higher-dimensional space, and the resulting high-dimensional matrix is clustered with the affinity propagation algorithm. The high-dimensional matrices generated from the different feature values are then integrated as heterogeneous data, and clustering the integrated matrix improves the clustering accuracy. |
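
The feature, similarity, and validation steps in items 3, 5, and 6 can be illustrated with a small Python sketch. It is not the implementation used in this thesis. It assumes that the first composition vector is the relative frequency of the 20 standard amino acids, that the RBF similarity takes the common form exp(-gamma * ||x - y||^2), and that the Rand index is the usual fraction of agreeing sample pairs. The function names and the `gamma` parameter are hypothetical, and affinity propagation itself is not implemented here: the matrix returned by `similarity_matrix` is only the kind of pairwise input such an algorithm takes.

```python
import math
from itertools import combinations

# Assumption: features are built over the 20 standard amino acids only.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Relative frequency of each standard amino acid in a sequence."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for aa in seq.upper():
        if aa in counts:  # non-standard symbols are ignored
            counts[aa] += 1
    total = sum(counts.values()) or 1  # avoid division by zero
    return [counts[aa] / total for aa in AMINO_ACIDS]


def rbf_similarity(x, y, gamma=1.0):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def similarity_matrix(features, gamma=1.0):
    """Pairwise RBF similarities between all feature vectors."""
    return [[rbf_similarity(x, y, gamma) for y in features] for x in features]


def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two labelings agree."""
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two labelings of equal length >= 2")
    agree = 0
    pairs = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b  # pair is grouped consistently
        pairs += 1
    return agree / pairs
```

For example, `similarity_matrix([composition(s) for s in sequences])` gives a symmetric matrix with ones on the diagonal, and `rand_index(predicted, reference)` equals 1.0 when the two labelings induce the same partition, regardless of how the clusters are numbered.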