Essential genes are those whose functions are vital to sustaining the life of an organism. There are two approaches to predicting and discovering essential genes. The first is experimental, using techniques such as RNA interference and single-gene knockout. These techniques are time-consuming and expensive, which limits their use for large-scale gene essentiality analysis. Computational methods offer an appealing alternative, predicting essential genes with fewer resources than their experimental counterparts. However, most computational methods used to date work by integrating experimental data and depend heavily on it, so bacterial essential genes are difficult to predict when such data are absent.

To overcome this limitation, we developed a gene essentiality prediction algorithm based on the characteristics of the genes themselves. First, we chose the protein domain as a feature for the prediction algorithm. Experimental verification showed that the protein domain plays an indispensable role in the prediction of essential genes. We then chose 25 species as the datasets, used the genetic distance between the different species together with their protein domains, and designed an essential gene prediction algorithm on this basis. The algorithm was evaluated by multiple rounds of cross-validation on the datasets, and the AUC values were calculated. Of the 25 species, 5 had AUC values above 0.9, 14 had values between 0.75 and 0.9, and 6 had values below 0.75; the lowest value was 0.66. These results show that the algorithm predicts essential genes well for most of the species tested.

Next, we upgraded the essential gene prediction tool Geptop, which is based on features of the gene sequence. Compared with the older version, the new one has the following improvements: (1) the datasets are extended from 19 species to 25; (2) the scoring formula is simplified so that it is easier to understand; (3) the prediction program is optimized to improve efficiency. With these upgrades, the prediction accuracy of Geptop improved: compared with the older version, results improved for 12 species in the datasets. To assess running efficiency, we tested the program on E. coli; the running time fell from 107 minutes to 26 minutes, roughly a fourfold speedup.

Finally, we tried to combine the protein domain-based prediction method with Geptop to obtain better results. Because of limited time, we did not find a way to improve the prediction results with this combination, but we hope our experience will be useful to other researchers.
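
The AUC values reported above can be read as the probability that a randomly chosen essential gene receives a higher prediction score than a randomly chosen non-essential gene. The following is a minimal sketch of that calculation, not the implementation used in this work: the function name `auc_from_scores` and the example labels and scores are hypothetical and chosen only for illustration.

```python
def auc_from_scores(labels, scores):
    """Pairwise (Mann-Whitney) estimate of the AUC.

    labels: 1 for essential genes, 0 for non-essential genes (assumed encoding).
    scores: prediction scores, where a higher score means "more likely essential".
    """
    essential = [s for l, s in zip(labels, scores) if l == 1]
    non_essential = [s for l, s in zip(labels, scores) if l == 0]
    if not essential or not non_essential:
        raise ValueError("both classes must be present to compute the AUC")

    wins = 0.0
    for e in essential:
        for n in non_essential:
            if e > n:
                wins += 1.0   # essential gene correctly ranked above
            elif e == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(essential) * len(non_essential))


# Illustrative data only (not taken from the 25-species datasets).
example_labels = [1, 1, 0, 0]
example_scores = [0.9, 0.4, 0.5, 0.1]
example_auc = auc_from_scores(example_labels, example_scores)
```

In a cross-validation setting, a function like this would be applied to the held-out genes of each fold and the resulting values averaged; an AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.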