Study On Protein Function Prediction Based On Multi-Sources Integration

Posted on: 2016-02-08
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhang
GTID: 2308330461976546
Subject: Computer software and theory
Abstract/Summary:
With the completion of genome sequencing, biological research has entered the post-genomic era, and one of its key areas is proteomics, which is important for revealing the mechanisms of life activities. Proteins are the main components of cells and underpin life and biological functions. Determining protein function, one of the important directions of proteomics, is significant for understanding biological mechanisms and cell structure, for disease diagnosis, and for crop improvement. Currently, biological experiments are the main way to annotate protein function precisely, but they are costly, time-consuming, and subject to human factors. Therefore, using computational methods on high-throughput protein data to predict protein functions has become a trend in recent years. With the development of gene chips and biological mass spectrometry techniques, a variety of high-throughput protein data have been produced, including gene expression, protein sequence, and protein interaction data. Each data source reflects protein function information from its own perspective, so accurate prediction of protein function depends on integrating these heterogeneous data and using the information in each source effectively.

Protein function prediction is a multi-sample, multi-label problem in which the information of annotated proteins is used to predict the functions of unannotated proteins. A particular function is usually carried out not by a single protein but by a protein complex, which indicates interaction relationships between proteins. Because a protein interaction network contains both annotated and unannotated proteins, graph-based semi-supervised learning can be used to predict functions. This thesis proposes a function prediction method that applies a label propagation algorithm to multiple integrated data sources.
For each data source, an interaction network is built by calculating the similarity between proteins and keeping the larger values. The method integrates the multiple data sources in a naive Bayesian fashion, uses a label propagation algorithm to transmit the functions of annotated proteins to unannotated proteins over several rounds, and finally obtains a score vector over all functions. Cross-validation experiments on yeast datasets show that the method achieves higher average precision and lower coverage, and is superior to single-data-source methods.

Statistical analysis shows that related functions often annotate the same proteins, and that Gene Ontology (GO) terms have annotation correlation: if a protein is annotated with a child GO term, it is also annotated with that term's parent GO terms. Therefore, the thesis uses the Jaccard coefficient to calculate annotation correlation, constructs a function correlation network from it, and integrates this network into the function prediction model to improve prediction accuracy. A doubly indexed matrix is built by combining the function correlation network and the protein interaction network, and a random walk is used to predict protein functions. Experimental results on yeast show that the method has strong classification performance and outperforms other methods that integrate multiple data sources.
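As a rough illustration of the steps summarized above, the following Python sketch combines several similarity networks, propagates labels over the combined network, and computes a Jaccard-based function correlation matrix. It is not the thesis's implementation, and every specific here is an assumption. The combination rule W = 1 - prod(1 - W_k) is one common reading of "naive Bayesian fashion". The update F <- alpha * S * F + (1 - alpha) * Y with symmetric normalization is a standard label propagation variant. The Jaccard coefficient is taken over the sets of proteins annotated with each pair of functions. All function names, the parameters `alpha` and `n_rounds`, and the toy data are hypothetical.

```python
import numpy as np


def integrate_networks(networks):
    """Combine per-source edge weights in [0, 1] as 1 - prod(1 - w_k).

    Assumption: this noisy-OR style rule stands in for the abstract's
    "naive Bayesian fashion"; the thesis may use a different formula.
    """
    stacked = np.stack([np.asarray(W, dtype=float) for W in networks])
    return 1.0 - np.prod(1.0 - stacked, axis=0)


def label_propagation(W, Y, alpha=0.5, n_rounds=20):
    """Iterate F <- alpha * S @ F + (1 - alpha) * Y for n_rounds rounds.

    W: (n, n) symmetric protein network; Y: (n, m) 0/1 annotation matrix
    with all-zero rows for unannotated proteins.
    Returns an (n, m) score matrix: one score vector per protein.
    """
    W = np.asarray(W, dtype=float)
    Y = np.asarray(Y, dtype=float)
    deg = W.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = 1.0 / np.sqrt(deg[deg > 0])
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}
    S = W * inv_sqrt[:, None] * inv_sqrt[None, :]
    F = Y.copy()
    for _ in range(n_rounds):
        F = alpha * (S @ F) + (1.0 - alpha) * Y
    return F


def jaccard_function_correlation(Y):
    """Jaccard coefficient between the protein sets of each function pair.

    Y: (n, m) 0/1 annotation matrix. Returns an (m, m) matrix with a
    zero diagonal (no self-correlation edges).
    """
    Y = np.asarray(Y, dtype=float)
    inter = Y.T @ Y                      # proteins shared by two functions
    counts = Y.sum(axis=0)               # proteins per function
    union = counts[:, None] + counts[None, :] - inter
    with np.errstate(divide="ignore", invalid="ignore"):
        J = np.where(union > 0, inter / union, 0.0)
    np.fill_diagonal(J, 0.0)
    return J


if __name__ == "__main__":
    # Toy data: 4 proteins in a chain, two hypothetical data sources.
    chain = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
    W = integrate_networks([0.6 * chain, 0.5 * chain])
    # Protein 0 has function 0, protein 3 has function 1; 1 and 2 are unannotated.
    Y = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], dtype=float)
    print(label_propagation(W, Y))
    print(jaccard_function_correlation(Y))
```

In this sketch each row of the returned score matrix plays the role of the per-protein score vector over all functions. The function correlation matrix could then be combined with the protein network for a random walk, a step not shown here.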
Keywords/Search Tags:Multiple Data Integration, Protein Function Prediction, Random Walk, Function Correlation, Gene Ontology