Research On Automatic Methods For Predicting Functions Of Biological Sequences

Posted on: 2009-08-04
Degree: Master
Type: Thesis
Country: China
Candidate: Z C Lian
GTID: 2178360242480273
Subject: Computer application technology
Abstract/Summary:
With the development of modern sequencing technology, huge numbers of biological sequences are produced every day. However, obtaining a partial or whole genome sequence is only the first step in studying various biological processes and phenomena. The ultimate goal is to determine the functions of these sequences. Function prediction for biological sequences means designing models or software that analyze sequences of unknown function, using experimental or computational methods, in order to discover their biological functions and significance.

Many experimental methods exist for studying the functions of biological sequences, but they all take a great deal of time and effort and cannot keep up with the volume of data to be annotated. Although the technology keeps advancing, the gap between sequences with known functions and those with unknown functions keeps growing. It is therefore an urgent task to combine knowledge and techniques from bioinformatics and several other disciplines to develop a reliable automatic method for predicting the functions of biological sequences.

Gene Ontology (GO) provides a standardized vocabulary for genes and proteins, and is the basis for unifying and mining gene data. The Gene Ontology project provides a unified, standard, hierarchical set of terms for describing the functional characteristics of gene products. Researchers can use the nomenclature provided by the GO project to annotate the biological functions of sequences. Several methods have been developed to predict the functions of biological sequences based on Gene Ontology and BLAST. Most of these methods rely on graph characteristics and do not consider the characteristics of the sequence alignment.
Moreover, under the strict scoring function, the precision and recall of these methods are clearly lower than those of the Top BLAST method.

Focusing on the problem of predicting the functions of biological sequences, we propose a new design for a BLAST-based GO term annotator that incorporates data mining techniques and makes novel use of rough set theory. Unlike other methods, which depend only on BLAST or on the hierarchical characteristics of Gene Ontology, this method mines the characteristics of BLAST results from a data mining perspective to solve the annotation problem. First, the method uses the BLAST results to find the sequences (object sequences) similar to the unknown sequence (original sequence), and searches the database for the terms annotated to the object sequences. Then, it trains rules over the attributes of each term using a rough set-based algorithm. Finally, it predicts whether each term is a function annotation of the original sequence based on the term's attributes and rules. The experimental results show that the proposed rough set-based method achieves higher precision than conventional methods and makes computational prediction more reliable. We also analyze the shortcomings of the method, identify their causes, and propose an improved scheme.

Further analysis of the results reveals one important reason for the low recall. Because the candidate terms for prediction are obtained from the object sequences found by BLAST, recall is hard to improve when the terms provided by the object sequences cover only a small part of the annotations of the original sequence. To improve recall, we use a minimum covering graph method to supply more candidate terms. The main idea is to infer additional terms related to the given terms from the structural characteristics of the ontology.
The goal of the method is to increase the number of candidate terms by finding a sub-graph that covers all the given terms. First, the method finds the object sequences similar to the original sequence using BLAST, and groups the terms provided by the object sequences by ontology type. Second, for all the terms belonging to the same ontology, we find the ancestors of all the terms and construct the covering graph by combining all the paths from the ancestors to the given terms. The covering graph is minimum when its root is, among all the ancestors, the term furthest from the root of the ontology. All terms covered by the minimum covering graph are treated as candidate terms for prediction.

Based on the above research and analysis, we also propose a novel method for automatically predicting the functions of biological sequences that combines the minimum covering graph method (MCG) with artificial neural networks (ANN). In addition to the minimum covering graph method, new attributes based on the structural characteristics of Gene Ontology are introduced to improve performance. We use a log-likelihood formula to transform the attribute values, and test the relationship between these new attributes and the results. In addition, to reduce computational complexity and to remedy the inability of rough sets to handle continuous data, we use two different artificial neural networks instead of rough sets to predict the functions of biological sequences. The experiments show that the combined method achieves higher precision and recall than the Top BLAST method under the strict scoring function, and overcomes the shortcomings of the rough set approach.
The analysis of the results also shows that the method can be tuned to meet different needs by choosing different parameter values.

Given their different strengths, the rough set-based method is better suited to predicting whether a sequence has one specific function, while the method combining MCG and ANN is better suited to predicting the functions of large numbers of sequences. The proposed methods are shown to be valid and to remedy shortcomings of conventional methods. They not only make electronic annotation more reliable but also reduce the cost of function prediction, which makes them effective supplements to experimental methods. In addition, their performance confirms the feasibility of predicting functions from the characteristics of BLAST results from a data mining perspective, which broadens the outlook for future function prediction research. Of course, the shortcomings of these methods should not be ignored.

Although the methods outperform the Top BLAST method, their recall is still unsatisfactory. We will continue to study function prediction for biological sequences from a data mining perspective and look for more effective attributes. Besides sequence-based information, function prediction can also draw on other information such as protein structures and phylogenetic trees. How to integrate this information to predict the functions of biological sequences is also an important research topic.
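
The minimum covering graph construction described in the abstract can be illustrated with a small sketch. This is not the thesis implementation. It assumes the ontology is given as a hypothetical `parents` mapping from each term to its direct parent terms, that the root of the covering graph is chosen among the ancestors shared by all given terms, and that "furthest from the ontology root" means the longest path from that root. The function names and the toy ontology are invented for illustration.

```python
from collections import deque


def ancestors_or_self(term, parents):
    """Return the term together with every ancestor reachable in the DAG."""
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def minimum_covering_graph(terms, parents):
    """Return (root, covered_terms) for the terms of one ontology.

    Assumption: the root is the deepest ancestor shared by all given terms.
    """
    if not terms:
        raise ValueError("at least one term is required")

    closure = {}

    def up(term):
        # Cache the ancestor set of each term so it is computed only once.
        if term not in closure:
            closure[term] = ancestors_or_self(term, parents)
        return closure[term]

    common = set.intersection(*(up(t) for t in terms))
    if not common:
        raise ValueError("terms share no ancestor; group them by ontology first")

    depth_cache = {}

    def depth(term):
        # Assumption: depth is the longest path from the ontology root.
        if term not in depth_cache:
            ps = parents.get(term, ())
            depth_cache[term] = 1 + max(map(depth, ps)) if ps else 0
        return depth_cache[term]

    # Sorting first makes the choice deterministic when depths tie.
    root = max(sorted(common), key=depth)

    # Keep every term lying on a path from the chosen root to a given term.
    covered = {a for t in terms for a in up(t) if root in up(a)}
    return root, covered


# Toy ontology (hypothetical term names, not real GO identifiers).
toy_parents = {
    "root": [],
    "A": ["root"],
    "B": ["A"],
    "C": ["A"],
    "D": ["B"],
    "E": ["C"],
    "F": ["root"],
}

mcg_root, mcg_terms = minimum_covering_graph(["D", "E"], toy_parents)
print(mcg_root, sorted(mcg_terms))
```

In this toy example the given terms `D` and `E` yield the candidate set `A`, `B`, `C`, `D`, `E`: the covering graph adds the intermediate terms `A`, `B` and `C`, while the ontology root and the unrelated term `F` stay outside it.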
Keywords/Search Tags: Predicting