
The Research Of Algorithm Predicting Protein-coding Genes In Prokaryotes

Posted on: 2016-08-09
Degree: Master
Type: Thesis
Country: China
Candidate: Z G Hua
Full Text: PDF
GTID: 2180330473452285
Subject: Biomedical engineering
Abstract/Summary:
Gene identification is the first step in extracting information and knowledge from sequenced genomes. Routine experimental methods cannot keep pace with the explosive growth of whole-genome sequence data, so computational prediction has become a necessary method for identifying bacterial genes. In 2003, a prokaryotic gene recognition program, ZCURVE 1.0, was proposed based on the Z curve theory of DNA sequences. In 2006, ZCURVE_V 1.0 was proposed to address viral gene recognition by modifying ZCURVE. With the accumulation of experimental data, the development of machine learning theory, and improvements in computer hardware, it became necessary to upgrade both ZCURVE 1.0 and ZCURVE_V 1.0. We upgraded them to ZCURVE 3.0 and ZCURVE_V 2.0 and developed online services for both. The updated programs can be freely accessed at http://cefg.cn/zcurve/ and http://cefg.cn/zcurve_v/.

Compared with the older version, ZCURVE 3.0 has the following improvements. (1) A support vector machine (SVM) replaces the Fisher linear discriminant. (2) Inspired by the petal pattern in the nucleotide distribution of ORFs, the program generates six groups of negative samples and therefore performs six rounds of SVM discrimination. This greatly reduces false-positive predictions. (3) Second- and third-order Z curve variables are integrated with the zeroth- and first-order variables used originally, increasing the number of variables from 45 to 765. (4) Internal parameters were re-optimized to exclude overlap-related misjudgments.
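The jump from 45 to 765 variables in improvement (3) can be checked with a short sketch. It assumes the usual phase-specific Z curve construction: for each of the three codon positions and each preceding context of k bases, the frequencies of the next base are turned into three components (x, y, z). That gives 3 x 4^k x 3 variables per order k, that is 9 + 36 = 45 for orders 0 and 1, and 9 + 36 + 144 + 576 = 765 for orders 0 to 3. The function name `z_curve_features` and the normalization per phase are illustrative assumptions, not the thesis's actual implementation.

```python
from itertools import product

BASES = "ACGT"

def z_curve_features(seq, max_order=3):
    """Phase-specific Z curve variables of orders 0..max_order for one ORF.

    Hypothetical helper for illustration; ZCURVE itself may normalize
    frequencies differently.
    """
    seq = seq.upper()
    features = []
    for k in range(max_order + 1):        # k = number of context bases
        for phase in range(3):            # codon position where the k-mer starts
            counts, total = {}, 0
            # Count (k+1)-mers that start at this codon phase.
            for i in range(phase, len(seq) - k, 3):
                kmer = seq[i:i + k + 1]
                if all(b in BASES for b in kmer):
                    counts[kmer] = counts.get(kmer, 0) + 1
                    total += 1
            for ctx in map("".join, product(BASES, repeat=k)):
                # Frequency of each base following the context ctx.
                p = {b: (counts.get(ctx + b, 0) / total if total else 0.0)
                     for b in BASES}
                features.append((p["A"] + p["G"]) - (p["C"] + p["T"]))  # x: purine vs pyrimidine
                features.append((p["A"] + p["C"]) - (p["G"] + p["T"]))  # y: amino vs keto
                features.append((p["A"] + p["T"]) - (p["G"] + p["C"]))  # z: weak vs strong H-bond
    return features

orf = "ATGAAACCCGGGTTTGCATAA"
print(len(z_curve_features(orf, max_order=1)))  # orders 0 and 1
print(len(z_curve_features(orf, max_order=3)))  # orders 0 to 3
```

A feature vector of this kind is what a classifier such as an SVM would take as input for each candidate ORF.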
Similarly, the viral system ZCURVE_V 1.0 was updated to version 2.0, with the following improvements. (1) By taking into account the frequencies of adjacent bases across codons, the initial 33 parameters were increased to 45. (2) Based on the petal pattern, six Euclidean discriminations are performed. (3) Through step-by-step debugging of the program, the optimal set of parameters for excluding overlaps was changed.

Results for ZCURVE 3.0 on 337 prokaryotic genomes show that the average accuracy rises to 94.0%, compared with 89.6% for the original version. ZCURVE 3.0 is also competitive with Glimmer 3.02, which has an average accuracy of 93.5% on the same test set, and the additional prediction rate of ZCURVE 3.0 (8%) is lower than that of Glimmer 3.02 (11.3%). Similarly, ZCURVE_V 2.0 was tested on 24 viral genomes and shows a lower average additional prediction rate (5.79%) than ZCURVE_V 1.0 (10.83%). This rate is similar to that of GeneMarkS (5.21%). However, the average sensitivity of ZCURVE_V 2.0 (93.94%) is much higher than that of GeneMarkS (88.95%).

Finally, we also carried out some research on predicting translation initiation sites based on the basic features.
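To make the two evaluation figures above concrete, here is a minimal sketch of how sensitivity and an additional prediction rate could be computed for one genome. The definitions are assumptions, since the abstract does not state them: a prediction counts as matching an annotated gene when strand and stop coordinate agree, sensitivity is the fraction of annotated genes that are matched, and the additional prediction rate is the fraction of predictions that match no annotated gene. The thesis may define either quantity differently. The function `evaluate` and the toy coordinates are hypothetical.

```python
def evaluate(predicted, annotated):
    """Return (sensitivity, additional_prediction_rate) for one genome.

    Genes are (strand, start, stop) tuples. Matching on (strand, stop)
    is an assumed convention, chosen because start sites are harder to
    annotate reliably than stop codons.
    """
    pred_keys = {(strand, stop) for strand, _start, stop in predicted}
    anno_keys = {(strand, stop) for strand, _start, stop in annotated}
    matched = pred_keys & anno_keys
    sensitivity = len(matched) / len(anno_keys) if anno_keys else 0.0
    additional = len(pred_keys - anno_keys) / len(pred_keys) if pred_keys else 0.0
    return sensitivity, additional

# Toy data, not taken from the thesis.
annotated = [("+", 1, 300), ("+", 400, 900), ("-", 1000, 1600), ("+", 2000, 2600)]
predicted = [("+", 10, 300), ("+", 400, 900), ("-", 1000, 1600), ("+", 3000, 3300)]

sens, extra = evaluate(predicted, annotated)
print(f"sensitivity = {sens:.2%}, additional prediction rate = {extra:.2%}")
```

Averaging these per-genome values over a test set would give summary numbers comparable in form to those reported above.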
Keywords/Search Tags:gene finding, ZCURVE, ZCURVE_V, accuracy, additional prediction rate