| With the rapid development of high-throughput sequencing technology, the volume of biological data has grown substantially. Extracting useful information from these data can not only better reveal the essence of life, but also provide important theoretical support for disease prevention and diagnosis. Starting from the statistical characteristics of DNA and amino acid sequences, this paper analyzes the correlation between nucleotides and amino acids, proposes a novel alignment-free method, the Nucleotide Amino Acid K-mer Vector (NAAKV), and applies this vector to whole-genome evolutionary analysis of bacteria and viruses and to the identification of eukaryotic coding regions. First, the DNA sequence is converted into a pseudo amino acid sequence (PAAS). Second, the types and frequencies of k-strings in the PAAS are calculated and the corresponding feature vector, NAAKV, is constructed. The number of k-string types in a PAAS is much lower than in a standard amino acid sequence, which reduces the dimensionality of NAAKV and improves computational efficiency. In validation, NAAKV was more accurate and efficient than MUSCLE and the classic k-string method in gene classification on five datasets, providing strong support for evolutionary analysis. In addition, a combined model was built by pairing NAAKV with logistic regression, a probabilistic statistical method. Two eukaryotic benchmark datasets, HMR195 and BG570, were selected for five-fold cross-validation. The average AUC values were 0.9813 and 0.9874, respectively, significantly better than those of traditional Bayesian discriminant analysis and the VOSSDFT method. This shows that the NAAKV algorithm proposed in this paper can also be applied to the prediction of eukaryotic coding regions. |
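
The k-string counting step described above can be illustrated with a minimal sketch. It only shows how the types and relative frequencies of overlapping k-strings in a sequence might be turned into a feature vector. The DNA-to-PAAS conversion is not specified in this abstract, so the input strings below are placeholders, and the function names `kmer_frequencies` and `to_vector` are hypothetical, not the authors' implementation.

```python
from collections import Counter


def kmer_frequencies(seq: str, k: int) -> dict[str, float]:
    """Relative frequency of each k-string observed in seq (overlapping windows)."""
    if k <= 0:
        raise ValueError("k must be positive")
    total = len(seq) - k + 1  # number of overlapping windows of length k
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def to_vector(freqs: dict[str, float], vocabulary: list[str]) -> list[float]:
    """Arrange frequencies in a fixed k-string order; unseen k-strings get 0."""
    return [freqs.get(kmer, 0.0) for kmer in vocabulary]


# Placeholder sequences standing in for two PAAS strings (assumption:
# the real conversion from DNA to PAAS is defined in the paper, not here).
paas_a = "ABABCA"
paas_b = "ABCABC"

fa = kmer_frequencies(paas_a, 2)
fb = kmer_frequencies(paas_b, 2)

# Shared vocabulary so both vectors have the same dimensions.
vocab = sorted(set(fa) | set(fb))
va = to_vector(fa, vocab)
vb = to_vector(fb, vocab)
```

Because the vector has one dimension per k-string type, a smaller set of k-string types directly yields a shorter vector, which is the efficiency argument made in the abstract.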