Essential genes, the minimal subset of genes in an organism, are required for survival, development, or fertility. Identifying such genes has long been a central goal of synthetic and systems biology because of its theoretical and practical significance. In previous studies, essential genes were identified primarily by experimental techniques (e.g., single-gene knockouts, RNA interference, and transposon mutagenesis), in which genes are randomly or systematically inactivated and their essentiality is inferred from the effects on the organism. However, genome-wide identification has not been achieved in most organisms (e.g., human) because of the tremendous time and cost required. Computational techniques have therefore been developed to predict essential genes, and they have proven reliable in some fungi and bacteria. But new problems, such as the selection of training sets and features, have emerged as these techniques have become more widely applied. To address these issues, we studied the following three topics in depth. First, we investigated training set selection and determined four criteria for the prediction of essential genes. Second, we developed a new computational model that significantly improves the accuracy and robustness of essential gene prediction.
Finally, on the basis of the first two studies, we predicted human essential genes on a genome-wide scale and applied them to the exploration of novel drug targets.

Part I: In this study, training set selection for essential gene prediction was investigated by reciprocally predicting the essential genes of 21 species with a naïve Bayes classifier.

The results showed that: 1) training set selection greatly influenced predictive accuracy; 2) the training set should contain at least 10% of the total genes to yield accurate predictions; 3) integrated training sets were markedly more stable and accurate than single sets; 4) a rational selection of training sets based on our criteria yielded better performance than random selection.

Conclusion: Four criteria for training set selection were determined: a) essential genes in the selected training set should be reliable; b) the growth conditions under which essential genes are defined should be consistent between the training and prediction sets; c) species used as the training set should be closely related to the target organism; and d) organisms used as training and prediction sets should exhibit similar phenotypes or lifestyles.

Part II: In this study, based on naïve Bayes classifiers, logistic regression, and a genetic algorithm, we developed a novel model called the Feature-based Weighted naïve Bayes Model (FWM), which significantly improved predictive accuracy.

The results showed that: 1) multicollinearity among gene features, together with the diverse and even contrasting correlations between gene features and essentiality across species, had a remarkable impact on prediction; 2) FWM performed better (in accuracy, robustness, and adaptability) than other classifiers, such as support vector machine, naïve Bayes, and logistic regression classifiers; 3) FWM improved predictive accuracy by 2% to 9% compared with the naïve Bayes classifier.

Conclusion: Features for essential gene prediction must be selected with care. Not all features associated with gene essentiality improve prediction precision; on the contrary, selecting inappropriate features can yield a cumbersome predictive model with low classification precision. Beyond essential gene prediction, FWM can serve as an alternative method for other classification tasks (e.g., the prediction of disease genes).

Part III: In this study, based on two types of computational models, we predicted 7,000 essential genes in the human genome. Then, by comparing the identified human essential genes with tumor and pathogen essential genes, we identified 55 and 2,046 potential drug targets for cancers and for other diseases associated with infectious pathogens, respectively.

The results showed that: 1) our predicted essential gene set had a high accuracy (>0.73); 2) human essential genes were significantly enriched for some core biological processes and molecular functions, such as regulation of transcription, macromolecular metabolism, and binding activity; 3) essential genes were over-represented among disease genes, and both disease and essential genes are under stronger purifying selection than other genes.

Conclusion: The identification of human essential genes by means of these two computational models was very reliable. The essential gene set has broad prospects for application in the identification of potential drug targets.

Taken together, using a range of analytical methods from computer simulation, comparative genomics, statistics, data mining, and bioinformatics, we systematically investigated the application of computational methods to the identification of essential genes. The studies provide empirical guidance for the identification of essential genes on a genome-wide scale, and contribute substantially to the knowledge of the minimum gene sets required for living organisms and to the discovery of new drug targets.
We also expect that the catalog of human essential genes may facilitate the functional annotation of all human genes as well as the diagnosis and treatment of human diseases.
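The feature-weighting idea behind the model in Part II can be illustrated with a minimal sketch. This is not the FWM itself: the abstract does not specify how logistic regression and the genetic algorithm produce the weights, so the weights below are hand-set placeholders, and the two toy features, the Gaussian likelihood, and the function names `fit_gaussian_nb` and `predict` are assumptions made only for illustration. The sketch shows how a per-feature weight scales each feature's log-likelihood in a naïve Bayes score, so that a weight of 0 removes an uninformative feature from the decision.

```python
import math

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-feature Gaussian parameters (hypothetical helper)."""
    model = {}
    n = len(y)
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        stats = []
        for j in range(len(X[0])):
            col = [r[j] for r in rows]
            mean = sum(col) / len(col)
            # Small floor keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-6
            stats.append((mean, var))
        model[c] = (len(rows) / n, stats)
    return model

def log_gauss(v, mean, var):
    """Log density of a Gaussian with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)

def predict(model, x, weights):
    """Return the class with the highest feature-weighted log-posterior."""
    scores = {}
    for c, (prior, stats) in model.items():
        score = math.log(prior)
        for w, v, (mean, var) in zip(weights, x, stats):
            score += w * log_gauss(v, mean, var)  # weight scales each feature's evidence
        scores[c] = score
    return max(scores, key=scores.get)

# Toy data (invented): feature 0 separates the classes, feature 1 does not.
# Label 1 = essential, 0 = non-essential.
X = [[0.90, 0.2], [0.80, 0.7], [0.85, 0.4],
     [0.20, 0.3], [0.10, 0.8], [0.15, 0.5]]
y = [1, 1, 1, 0, 0, 0]
model = fit_gaussian_nb(X, y)

# Placeholder weights: equal weights give plain naive Bayes;
# [1.0, 0.0] ignores the uninformative second feature.
label_plain = predict(model, [0.88, 0.5], [1.0, 1.0])
label_weighted = predict(model, [0.12, 0.5], [1.0, 0.0])
```

In a setting like the one described above, the weights would be fitted rather than set by hand, and the training matrix `X` would hold gene features from species chosen according to the four training set criteria of Part I.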