Species Phylogeny Research Based On K-string Sequence Similarity And Machine Learning

Posted on: 2023-12-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R B Tang
GTID: 1520307103976989
Subject: Statistics
Abstract/Summary:
Accurately identifying the categories of biological data (or the similarities between them) has become an indispensable step in sequence analysis. Traditional species phylogenetic analysis uses multiple sequence alignment (MSA) to infer phylogenetic relationships under a series of a priori assumptions, such as the linear arrangement of homologous portions of homologous sequences. However, these assumptions are not always valid for all types of data in the postgenomic era. With the introduction of alignment-free methods, analyzing sequences based on the properties (such as frequency and position) of their substrings (kmers) has become one of the common and effective approaches in phylogenetic research. In addition, mature machine learning methods have been used to help analyze the phylogenetic relationships between species and to identify the recombinant components of sequences.

The rapid spread, mutation and other biological behaviors of HIV-1 strains in the population make the fatality rate of hosts extremely high. In addition, because of the long incubation period, when a host is cross-infected by the virus, recombination within the host becomes frequent, which in turn produces diversity. Inter-subtype genomic recombinants of HIV-1 become circulating recombinant forms (CRFs) after widespread dissemination in the population. In the current HIV database, CRFs are increasingly diverse and their recombination patterns increasingly complex. Therefore, the main research contents of this thesis are: proposing a new method for phylogenetic analysis based on kmer properties; screening, given the specific nature of HIV-1 data, the standard kmers that can express the unique attributes of different HIV-1 strains; and identifying the recombinant components of HIV-1 CRF sequences. The details are as follows:

1. Based on the inner distance distribution of kmer pairs in a sequence, we propose a novel alignment-free method named KINN. The method analyzes the similarity between sequences and infers their phylogenetic relationships by calculating the contribution of kmer pairs to each sequence and then converting the sequences into contribution vectors of kmer pairs. In our examination, we analyzed the detailed and the global inner distance distributions of identical substrings in homologous and non-homologous sequences, and performed sequence similarity analysis based on these distributions. The performance of KINN was evaluated on DNA and protein datasets with different sequence lengths. The results show that the inner distance distribution can distinguish homologous from non-homologous sequences. In addition, current advanced alignment-free methods and KINN, based on the contribution vectors of kmer pairs, were each used to construct phylogenetic trees for the species datasets. The constructed trees were then compared with the reference trees of the datasets by calculating the Robinson-Foulds (RF) distance between them. The results show that KINN can always achieve better performance. In particular, for the HRV protein sequences analyzed by KINN, the RF distance reaches 0 (i.e., the tree is the same as the reference tree).

2. When an HIV-1 strain sequence is obtained, its relationship to the types of existing sequences in the database should first be identified to determine whether it belongs to a new type. If it does, the recombinant components of the new type need to be accurately identified; if it does not, its type in the database needs to be identified efficiently. We therefore screened standard kmers from the HIV-1 reference sequences for varying k, and used them to extract feature vectors of HIV-1 sequences. On a dataset composed of reference sequences and CRF sequences, we built a phylogenetic tree from the frequency features of the standard kmers; the results show that the topology of the tree is consistent with the facts. Furthermore, we analyzed the spatial relationship between CRF sequences and the sequences of their recombinant components (subtypes) based on the frequency and location features of the standard kmers. The results demonstrate that the properties of the screened standard kmers can express the genetic information in sequences of different HIV-1 strains.

3. The recombination behavior among HIV-1 strains fits the application scenario of multi-label machine learning models. The complete identification and prediction of all recombinant components of a CRF sequence generated by recombination is therefore a complex machine learning problem. To identify and completely predict the multiple components of CRF sequences and their chronological numbers in the database, we propose a multi-label learning algorithm based on a voting mechanism. The final predictions are refined by voting on the predictions of three multi-label learning methods, to avoid the bias of a single algorithm. We extracted the frequency and location features of the standard kmers of HIV-1 sequences to capture the uniqueness of pure subtypes and CRF sequences, and applied the method to 7183 HIV-1 sequences, including 5530 pure subtype sequences and 1653 CRF sequences. Experimental results show that the method performs well (up to 99%) in predicting the whole set of labels for HIV-1 CRF sequences. Analysis of the wrongly predicted labels shows that they are in fact incomplete predictions for the sequences, and that the predicted results are very close to the complete actual labels.
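
The voting step described in item 3 can be illustrated with a minimal sketch. The abstract does not state the voting rule, so the per-label majority rule below is an assumption, and the function name `majority_vote`, the `min_votes` parameter and the example label sets are hypothetical.

```python
from typing import Dict, List, Set


def majority_vote(predictions: List[Set[str]], min_votes: int = 2) -> Set[str]:
    """Keep every label predicted by at least `min_votes` classifiers.

    Assumption: a simple per-label majority; the thesis may use a
    different voting rule.
    """
    counts: Dict[str, int] = {}
    for label_set in predictions:
        for label in label_set:
            counts[label] = counts.get(label, 0) + 1
    return {label for label, n in counts.items() if n >= min_votes}


# Hypothetical outputs of three multi-label classifiers for one CRF sequence.
preds = [{"A", "B"}, {"A", "B", "G"}, {"B"}]
voted = majority_vote(preds)  # labels backed by at least two classifiers
```

With three classifiers, `min_votes=2` keeps a label only when most of them agree, which is one simple way to damp the bias of a single algorithm.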
Keywords/Search Tags: Phylogenetic analysis, kmer, multi-label learning, inner distance, HIV-1
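
As a rough illustration of the "kmer" and "inner distance" keywords, the sketch below counts the gaps between consecutive occurrences of identical kmers in a sequence. This is only one plausible reading of an inner distance distribution; the abstract does not define KINN's contribution vectors, and the function names and the example sequence are hypothetical.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, List


def kmer_positions(seq: str, k: int) -> DefaultDict[str, List[int]]:
    """Map each kmer to the sorted list of its start positions in `seq`."""
    positions: DefaultDict[str, List[int]] = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def inner_distance_distribution(seq: str, k: int) -> Counter:
    """Histogram of gaps between consecutive occurrences of identical kmers.

    Assumption: "inner distance" is taken as the difference between the
    start positions of two neighbouring occurrences of the same kmer.
    """
    hist: Counter = Counter()
    for pos in kmer_positions(seq, k).values():
        for a, b in zip(pos, pos[1:]):
            hist[b - a] += 1
    return hist


# Toy sequence: "ACG" starts at positions 0, 3 and 7.
dist = inner_distance_distribution("ACGACGTACG", 3)
```

Two sequences could then be compared through such histograms, for example with a distance between the normalized counts, although the thesis itself works with contribution vectors of kmer pairs.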