Biological Sequence Model Based On Inter-character Distance And Its Applications

Posted on: 2018-08-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Lv
Full Text: PDF
GTID: 2310330533463130
Subject: Computational Mathematics
Abstract/Summary:
One of the important research topics in biomathematics is to establish mathematical models for biological big data and then to uncover hidden, valuable information through fast model processing and effective analysis. Based on the inter-character (base or amino acid) distances of biological sequences, this thesis establishes several mathematical models using statistical methods and machine learning techniques, and discusses their applications to sequence analysis and essential gene identification.

On the one hand, new inter-base (amino acid) distance sequences are proposed on the basis of the existing character distance sequence. They allow the original biological sequence to be reconstructed easily without any auxiliary conditions. Furthermore, (ordered) precise inter-base (amino acid) distance sequences are presented, and five basic statistical quantities are extracted from them to construct a feature vector that characterizes the original sequence. The Euclidean distance between feature vectors is then used to measure the similarity between biological sequences. Finally, the model is applied to three groups of experiments: a DNA group, comprising 18 eutherian mammals, 23 mitochondrial genomes and 11 exon sequences; a non-coding RNA group, consisting of 19 non-coding RNA sequences; and a proteome group, containing 9 ND5 sequences, 20 FG sequences and 24 transferrin sequences. The phylogenetic trees derived with the MEGA, Phylip and Treeview software agree well with several widely cited studies, which indicates that the proposed method is an effective tool for sequence analysis and comparison.

On the other hand, essential gene identification contributes to the exploration of the origin and evolution of life, and it is also important for the design of drug targets, the treatment of diseases, and the study of the minimal genome in synthetic biology. The presented feature vector is combined with a support vector machine to identify essential genes. First, testing and training sets are designed. Then 10-fold cross-validation is performed on the feature vectors of the essential and non-essential genes of five bacteria to obtain the optimal parameters. Finally, the AUC value (area under the receiver operating characteristic curve) is calculated to evaluate the model. The obtained AUC value is higher than some well-known results, which confirms that the proposed method is a viable alternative for essential gene identification.
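
The distance-based feature construction summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact definition of the inter-character distance sequence or name the five statistical quantities, so everything here is an assumption made for illustration: the distance is taken as the gap between consecutive occurrences of the same character, and the five statistics are taken as count, mean, standard deviation, minimum and maximum. The function names (`inter_character_distances`, `summary_stats`, `feature_vector`, `euclidean`) are hypothetical and are not taken from the thesis.

```python
import math
import statistics


def inter_character_distances(seq, char):
    """Gaps between consecutive occurrences of `char` in `seq`.

    Assumed definition for illustration; the thesis may define it differently.
    """
    positions = [i for i, c in enumerate(seq) if c == char]
    return [b - a for a, b in zip(positions, positions[1:])]


def summary_stats(values):
    """Five illustrative statistics (assumed, not the thesis's stated five)."""
    if not values:
        # A character with fewer than two occurrences has no gaps.
        return [0.0] * 5
    return [
        float(len(values)),          # number of gaps
        statistics.fmean(values),    # mean gap
        statistics.pstdev(values),   # population standard deviation
        float(min(values)),          # smallest gap
        float(max(values)),          # largest gap
    ]


def feature_vector(seq, alphabet="ACGT"):
    """Concatenate the five statistics for each character of the alphabet."""
    vec = []
    for ch in alphabet:
        vec.extend(summary_stats(inter_character_distances(seq, ch)))
    return vec


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.dist(u, v)


if __name__ == "__main__":
    a = feature_vector("ACGTAACGTT")
    b = feature_vector("ACGTTACGAT")
    print(euclidean(a, b))
```

Under these assumptions a DNA sequence maps to a 20-dimensional vector (five statistics for each of four bases), and a smaller Euclidean distance between two vectors is read as greater similarity. For protein sequences the `alphabet` argument would be replaced by the 20 amino acid letters.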
Keywords/Search Tags:Biological sequence, inter-character distance sequence, feature vector, sequence analysis, essential gene identification, support vector machine