
Coding Sequence Recognition in Prokaryotes Based on a Hidden Markov Model

Posted on: 2014-12-11
Degree: Master
Type: Thesis
Country: China
Candidate: J Ma
Full Text: PDF
GTID: 2254330398462100
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Objective: This study uses a Hidden Markov Model (HMM) to identify prokaryotic coding sequences and analyzes the factors that influence identification. It aims to deepen the understanding of HMM theory and to provide a research basis for its use in locating pathogenic sites and in mining biological information.

Method: We built three models with the Baum-Welch algorithm: a 100-iteration HMM-gene model, a 10-iteration HMM-gene model, and a 100-iteration HMM-nogene model. The training data were downloaded from the shared resources of the National Center for Biotechnology Information (NCBI). Before training, sequences longer than 20000 bp or shorter than 80 bp were rejected. The training set consists of a randomly selected 2/3 of the coding-region sequences and 2/3 of the non-coding-region sequences. The effect of the number of iterations was assessed by comparing nucleotide-level identification accuracy on a test set of 50 sequences randomly selected from the remaining 1/3 of the sequences. Whole sequences are recognized by comparing with 1 a ratio value derived from the coding-region model and the non-coding-region model. The test set for this step consists of 360 sequences, each randomly selected from the remaining 1/3 of the coding-region and non-coding-region sequences. Specificity, sensitivity, and accuracy are then used to evaluate how well this method identifies E. coli coding sequences. Sequence length and GC content (CG%) are treated as two factors, and logistic regression is used to analyze their effect.

Results: On the same 50 test sequences, nucleotide-level identification accuracy with 100 iterations (65.15%) was substantially higher than with 10 iterations (49.89%). For sequence recognition, the method achieved a specificity of 67.78%, a sensitivity of 73.33%, and an accuracy of 70.56%.
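The three reported sequence-level figures are mutually consistent if the test set is balanced between coding and non-coding sequences. The following is a minimal Python check under that assumption: the 180/180 split and the confusion counts are hypothetical, back-calculated from the reported percentages, and are not stated in the abstract.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)          # coding sequences correctly recognized
    specificity = tn / (tn + fp)          # non-coding sequences correctly rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts (assumption): 180 coding and 180 non-coding test sequences.
TP, FN = 132, 48   # coding sequences: recognized / missed
TN, FP = 122, 58   # non-coding sequences: rejected / falsely recognized

sens, spec, acc = classification_metrics(TP, FN, TN, FP)
print(f"sensitivity={sens:.2%} specificity={spec:.2%} accuracy={acc:.2%}")
```

With equal class sizes, accuracy is simply the mean of sensitivity and specificity, which is why 70.56% lies midway between 73.33% and 67.78%.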
We also found that sequences longer than 1000 bp with a CG% higher than 53% were recognized better; sequences with a lower CG% did not give good results.

Conclusion: HMMs are useful for gene recognition, and sufficient iteration is necessary. Longer sequences with a higher CG% are recognized better. Many problems remain in this work, such as refining the training set, improving the decision method, and taking fuller account of the characteristics of biological sequences; all of these need further study.
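The two-model comparison described in the Method can be illustrated with a short Python sketch. It is not the thesis implementation: the abstract does not define the ratio precisely, so the sketch assumes one common reading, the likelihood ratio P(sequence | gene model) / P(sequence | nogene model) compared with 1. The function names and the toy parameters are hypothetical, and Baum-Welch training is not shown.

```python
import math


def log_likelihood(seq, start, trans, emit):
    """Scaled forward algorithm: log P(seq | model) for a non-empty sequence."""
    n_states = len(start)
    alpha = [start[s] * emit[s][seq[0]] for s in range(n_states)]
    scale = sum(alpha)
    alpha = [a / scale for a in alpha]
    loglik = math.log(scale)
    for symbol in seq[1:]:
        alpha = [
            emit[j][symbol] * sum(alpha[i] * trans[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
        scale = sum(alpha)                 # rescale to avoid underflow on long sequences
        alpha = [a / scale for a in alpha]
        loglik += math.log(scale)
    return loglik


def is_coding(seq, gene_model, nogene_model):
    """Likelihood ratio > 1 is equivalent to a log-likelihood difference > 0."""
    return log_likelihood(seq, *gene_model) > log_likelihood(seq, *nogene_model)


# Toy two-state parameters (assumption, for illustration only; not trained values).
START = [0.5, 0.5]
TRANS = [[0.9, 0.1], [0.1, 0.9]]
GENE_MODEL = (START, TRANS, [
    {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
])
NOGENE_MODEL = (START, TRANS, [
    {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
])

print(is_coding("GCGCGCGGCC", GENE_MODEL, NOGENE_MODEL))
print(is_coding("ATATTTAATA", GENE_MODEL, NOGENE_MODEL))
```

Working with log-likelihoods instead of the raw ratio avoids numerical underflow, which matters for sequences of up to 20000 bp.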
Keywords/Search Tags:Hidden Markov Model, Coding Region Recognition, Prokaryote