Predicting Essential Genes Based On Artificial Neural Network

Posted on: 2009-05-03
Degree: Master
Type: Thesis
Country: China
Candidate: H Li
Full Text: PDF
GTID: 2178360242480421
Subject: Computer system architecture
Abstract/Summary:
As genome and sequencing projects have advanced, the focus of research has gradually shifted from accumulating data to interpreting it. Future discoveries in biology will depend increasingly on the ability to combine and relate diverse data from different points of view, rather than on attention to the traditional areas alone. Such research is the foundation not only for understanding life and evolution but also for discovering new drugs and new therapies. Bioinformatics has become a discipline of strategic significance for both the life sciences and computer science, and it will affect medicine, biotechnology, and many other areas of society.

Sequence data is growing rapidly, yet our existing biological knowledge is limited compared with what remains to be explored. In biology and other data-rich disciplines, and particularly in bioinformatics, it is recognized that existing knowledge carries considerable uncertainty: some of it is missing and some of it is wrong. Biologists therefore often have to reason by induction and deduction, for example by considering how to use the data to build models and to discover new biological knowledge or revise existing knowledge. Machine learning methods are suited to fields in which large amounts of data are available but the corresponding theory is still incomplete.

In this paper, we first introduce some background in bioinformatics. We then introduce the concept of essential genes, the significance of studying them, and the state of research in China and abroad. Essential genes are genes whose loss leads to the death of the organism. They are functionally very special and have high theoretical and practical value. At present, however, the study of essential genes relies almost entirely on essential genes identified by experiment in the organisms studied.
Studying new essential genes in other organisms by experiment costs too much time and money, so machine learning approaches can be used to predict essential genes instead. Some researchers have already tried to predict and classify essential genes with decision trees. In this paper, we use information from the essential and non-essential genes of Escherichia coli. We build models with a back-propagation (BP) network, which is theoretically mature and has been applied in many fields, and with a Support Vector Machine, which has been used many times in bioinformatics in recent years. The models are intended to predict and classify the essential genes of other organisms.

This article introduces the basic method, the basic principle, and the merits and shortcomings of the BP network and the Support Vector Machine.

For the BP network model, we discuss parameters such as the learning rate and the number of hidden-layer neurons, and, to avoid overfitting, how to determine the number of training iterations. When distinguishing essential genes from non-essential genes, the numbers of samples in the two classes are imbalanced, which leads to high accuracy on the large class and low accuracy on the small class. The small class is the essential genes, so improving its prediction accuracy is the more meaningful goal. Traditional machine learning methods cannot solve this problem. We use a sampling method that balances the number of samples across classes to improve the accuracy on essential genes.

For the Support Vector Machine model, we discuss the selection of the two parameters C and g. We adopt an automatic search: we enumerate values of C and g at a fixed step size within their ranges, and to build the model we select the pair that gives the highest classification accuracy and the largest margin between the two classes.
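
The automatic search over C and g described above can be sketched as a plain grid search. This is a minimal illustration, not the thesis's implementation: the function names, the power-of-two step, and the exponent ranges are assumptions, and `toy_score` is a made-up stand-in for training a Support Vector Machine and measuring its accuracy. The parameter g is presumably the kernel parameter usually written gamma.

```python
import math
from itertools import product
from typing import Any, Callable, List, Tuple


def power_of_two_grid(lo_exp: int, hi_exp: int, step: int) -> List[float]:
    """Candidate values 2**lo_exp, 2**(lo_exp + step), ..., up to 2**hi_exp."""
    return [2.0 ** e for e in range(lo_exp, hi_exp + 1, step)]


def grid_search(
    score: Callable[[float, float], Any],
    c_values: List[float],
    g_values: List[float],
) -> Tuple[Any, float, float]:
    """Try every (C, g) pair and return (best_score, best_C, best_g).

    `score` may return a single number (accuracy) or a tuple such as
    (accuracy, margin); tuples compare element by element, so ties in
    accuracy are then broken by margin. That tie-break is an assumption
    about how the two selection criteria are combined.
    """
    best = None
    for c, g in product(c_values, g_values):
        s = score(c, g)
        if best is None or s > best[0]:
            best = (s, c, g)
    return best


def toy_score(c: float, g: float) -> float:
    """Hypothetical scorer that peaks at C = 2**3, g = 2**-5.

    In practice this would train a model with (C, g) and return its
    cross-validated accuracy.
    """
    return -((math.log2(c) - 3) ** 2 + (math.log2(g) + 5) ** 2)


# Illustrative ranges only; the abstract does not state the ones used.
c_grid = power_of_two_grid(-5, 15, 2)
g_grid = power_of_two_grid(-15, 3, 2)
best_score, best_c, best_g = grid_search(toy_score, c_grid, g_grid)
```

Stepping through exponents rather than raw values keeps the grid small while still covering several orders of magnitude for each parameter.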
We introduce a weighted Support Vector Machine, which makes the model pay more attention to the important minority class of essential genes and improves their classification accuracy under this uneven distribution.

The main work of this paper is as follows. First, we describe the source of the data set. We collect all Escherichia coli gene sequences from GenBank and remove the genes that experiments have shown to be neither essential nor non-essential, so that the data set contains only experimentally determined essential and non-essential genes. We use the software CodonW to extract information from the sequences of the Escherichia coli essential and non-essential genes, obtaining 35 attributes, which we reduce to 7 after analysis. Second, we construct a model with the BP network. We find that the overall accuracy is relatively high but the accuracy on essential genes is relatively low. We introduce a sampling method to artificially balance the number of samples in the two classes, after which both the overall classification accuracy and the accuracy on essential genes improve. Third, with the Support Vector Machine model, we find that the classification accuracy varies with the choice of kernel function and with the values of g and C. To improve the classification accuracy on essential genes, we introduce the weighted Support Vector Machine. As a result, the classification accuracy on essential genes improves, while the classification accuracy on non-essential genes declines somewhat. Finally, we add another 60 genes of Bacillus subtilis to build a test set in order to test the validity of the model.
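
The two remedies for class imbalance mentioned above, resampling for the BP network and class weighting for the Support Vector Machine, can be sketched as follows. This is a hedged illustration only. The abstract does not say whether over-sampling or under-sampling was used, nor how the class weights were chosen, so random over-sampling and the inverse-frequency weighting below are assumptions, as are all function names.

```python
import random
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Tuple


def oversample_minority(
    samples: Sequence, labels: Sequence[Hashable], seed: int = 0
) -> List[Tuple[object, Hashable]]:
    """Randomly duplicate samples of smaller classes until every class
    has as many samples as the largest class (assumed strategy)."""
    rng = random.Random(seed)
    by_class: Dict[Hashable, list] = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        balanced.extend((x, y) for x in xs + extra)
    rng.shuffle(balanced)
    return balanced


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count), so the
    rarer class gets the larger weight (a common heuristic, assumed here).
    In a weighted SVM such a weight would scale the penalty C per class."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {y: n / (k * c) for y, c in counts.items()}


# Toy data: 2 essential genes against 8 non-essential genes.
toy_labels = ["essential"] * 2 + ["non-essential"] * 8
toy_samples = list(range(10))
balanced = oversample_minority(toy_samples, toy_labels)
weights = balanced_class_weights(toy_labels)
```

Over-sampling changes the training set and leaves the learner untouched, whereas class weights leave the data untouched and change the cost of each kind of error, which is why the first suits the BP network and the second suits the Support Vector Machine.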
Keywords/Search Tags: Predicting