
Research On DNA, RNA And Protein Sequence Feature Extraction Method And Its Application

Posted on: 2016-06-29
Degree: Master
Type: Thesis
Country: China
Candidate: F L Liu
Full Text: PDF
GTID: 2180330503451121
Subject: Computer Science and Technology
Abstract/Summary:
With the development of biological sequencing technology, large amounts of DNA, RNA and protein sequence data have been obtained, but the corresponding functional and structural data are growing much more slowly, so machine learning methods are needed to close this gap. The key problem in using machine learning to study structure and function from sequence is how to extract effective sequence features. This research studied DNA, RNA and protein sequence feature extraction methods in depth, proposed 34 feature extraction methods, and applied these features to three important problems in bioinformatics: DNase I hypersensitive site identification, micro RNA precursor identification and DNA binding protein identification.

The research first studied sequence feature extraction methods for DNA, RNA and protein. Machine learning methods require sequence features to be extracted first. However, converting biological sequences of different lengths into fixed-length feature vectors is difficult, and the feature extraction algorithm directly affects the accuracy of the prediction method. To solve this problem, the research proposed three categories of sequence feature extraction methods: nucleotide/amino acid composition based methods, autocorrelation based methods and pseudo-nucleotide/pseudo-amino acid composition based methods. The nucleotide/amino acid composition based methods represent a sequence by its basic composition, namely the statistical properties of its nucleotides or amino acids. Although these methods have achieved some success, they ignore the physicochemical properties of the nucleotides or amino acids and therefore describe the sequence information only weakly. To address this, the research proposed autocorrelation based methods.
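As an illustration of how composition and autocorrelation features map sequences of any length to fixed-length vectors, here is a minimal Python sketch. It is not the thesis's implementation: the function names, the default `k`, the per-residue form of the autocorrelation, and the caller-supplied property table `prop` are all assumptions made for this example.

```python
from itertools import product


def kmer_composition(seq, k=2, alphabet="ACGT"):
    """Return normalized k-mer frequencies in a fixed order.

    The vector always has len(alphabet) ** k entries, regardless of
    how long the input sequence is.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing other symbols, e.g. N
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def autocorrelation(seq, prop, lag=1):
    """Average product of a property value at positions i and i + lag.

    `prop` maps each residue to a numeric physicochemical value and is
    supplied by the caller (hypothetical input for this sketch).
    """
    values = [prop[c] for c in seq]
    n = len(values) - lag
    if n <= 0:
        return 0.0
    return sum(values[i] * values[i + lag] for i in range(n)) / n
```

Two sequences of different lengths, for example `kmer_composition("ACGTACGT")` and `kmer_composition("ACGTTGCAAC")`, both yield 16-dimensional vectors. Computing `autocorrelation` for a fixed set of lags and properties likewise gives a vector whose length does not depend on the sequence.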
To represent the sequences better, the research also took both local and global sequence order information into account and proposed pseudo-nucleotide/pseudo-amino acid composition based methods. It also proposed a feature extraction method based on RNA secondary structure status. Building on these methods, the research developed three sequence feature extraction tools, namely repDNA, repRNA and Pse-in-One, for extracting features from DNA, RNA and protein sequences.

To verify the validity of the proposed feature extraction methods, the research built prediction methods for DNase I hypersensitive site identification, micro RNA precursor identification and DNA binding protein identification using these features. For DNase I hypersensitive site identification, the research extracted three kinds of DNA sequence features: nucleotide composition based, autocorrelation based and pseudo-nucleotide composition based. Because these features have different distributions, ensemble learning is used to combine them, and the final result is obtained by a weighted voting strategy. For micro RNA precursor identification, the research used similar feature extraction methods and an ensemble learning strategy, achieving a prediction accuracy of 86.14% on the dataset. For DNA binding protein identification, the research extracted three kinds of protein sequence features (amino acid composition, autocorrelation and pseudo-amino acid composition) and, using an ensemble learning strategy, achieved a prediction accuracy of 77.96%.
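The weighted voting step mentioned above can be sketched as follows. This is only an illustration of the general idea, assuming one binary classifier per feature type. The function name, the 0.5 threshold and the way weights are chosen are assumptions, since the abstract does not specify them.

```python
def weighted_vote(predictions, weights):
    """Combine binary predictions (0 or 1) from several classifiers.

    Each classifier's vote counts in proportion to its weight; the
    positive class wins when it holds at least half of the total weight.
    """
    if len(predictions) != len(weights):
        raise ValueError("predictions and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    positive = sum(w for p, w in zip(predictions, weights) if p == 1)
    return 1 if positive / total >= 0.5 else 0
```

For example, `weighted_vote([1, 0, 0], [0.6, 0.2, 0.2])` returns 1 because the single positive vote carries most of the weight, whereas an unweighted majority vote over the same predictions would return 0.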
Keywords/Search Tags:DNA sequence feature extraction, RNA sequence feature extraction, protein sequence feature extraction, DNase I hypersensitive site identification, micro RNA precursor identification, DNA binding protein identification