Biological Sequence Similarity Study Based On Alignment-Free Technology

Posted on: 2020-11-01
Degree: Master
Type: Thesis
Country: China
Candidate: J Wei
GTID: 2370330620456747
Subject: Software engineering
Abstract/Summary:
Biological sequence alignment analysis is fundamental work in bioinformatics research. It provides an important theoretical basis for sequence similarity analysis and helps to predict the function of unknown sequences, construct phylogenetic trees, and analyze homology. Traditional sequence similarity analysis methods are usually based on alignment algorithms, which have high time and space complexity. With the rapid development of biotechnology, the volume of biological sequence data has grown exponentially, and the resulting massive data sets are a major challenge for existing algorithms. Although many alignment-free methods have been proposed to analyze biological sequences, many alignment-free models still suffer from low efficiency, long running times, and low prediction accuracy. To address these problems in alignment-free sequence similarity analysis, this thesis proposes the following three methods and verifies the validity and reliability of the proposed models in experiments.

1. Traditional K-mer based feature extraction methods usually ignore the position information of each word, so the extracted features cannot represent a sequence comprehensively. A new alignment-free sequence similarity model based on Local Frequency (LF) entropy is proposed. After the K-mers of a sequence are extracted, the LF entropy is calculated from both the word frequency and the position information of each word. The resulting feature vectors are extracted from protein sequences and used for protein clustering. Compared with state-of-the-art methods, the experimental results show that the new model extracts sequence information effectively and improves clustering accuracy.

2. Existing feature extraction methods usually ignore the frequency-domain features of a sequence, so the change information of the sequence cannot be reflected accurately. To address this, and to overcome the information loss caused by the down-sampling of the traditional discrete wavelet transform, a new alignment-free sequence similarity analysis method based on the stationary discrete wavelet transform, called SSAW, is proposed. The method uses a new K-mer mapping: each K-mer is mapped to the complex field via its argument (phase angle), the word frequency is standardized, and the stationary discrete wavelet transform is combined with the complex representation of the K-mers to obtain the feature vector. In clustering and classification applications, compared with the state-of-the-art WFV and K2* models, the experimental results show that SSAW has advantages in accuracy, recall, and F1 score, and in most cases its running time is significantly lower.

3. To address the low accuracy and high time complexity of current Horizontal Gene Transfer (HGT) detection methods, a novel HGT detection method based on SeqRank and Gaussian Similarity (SRGS) analysis is proposed, which uses the SeqRank algorithm and k-means clustering. HGT candidates are predicted by using the proposed Gaussian similarity formula to map the Euclidean distance between the candidate set and the genome into a high-dimensional space. The experimental results show that the model has lower time complexity and better detection results than the other models compared.
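As a rough illustration of two building blocks named in the abstract (K-mer extraction that keeps position information, and a Gaussian similarity computed from Euclidean distance), a minimal Python sketch might look like the following. The abstract does not give the thesis's LF entropy or Gaussian similarity formulas, so this sketch substitutes plain normalized K-mer frequencies and the standard Gaussian (RBF) kernel exp(-d^2 / (2 sigma^2)). The function names, the DNA alphabet, and the sigma parameter are all assumptions made for illustration, not the author's implementation.

```python
from collections import defaultdict
from itertools import product
import math


def kmer_positions(seq, k):
    """Map each k-mer to the list of 0-based start positions where it occurs."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return dict(positions)


def kmer_frequency_vector(seq, k, alphabet="ACGT"):
    """Normalized k-mer frequencies in fixed lexicographic order.

    Assumption: a plain frequency vector stands in for the thesis's
    LF-entropy features, whose formula is not given in the abstract.
    K-mers containing symbols outside `alphabet` get no vector slot.
    """
    positions = kmer_positions(seq, k)
    total = max(len(seq) - k + 1, 1)  # number of k-mer windows
    return [
        len(positions.get("".join(word), [])) / total
        for word in product(alphabet, repeat=k)
    ]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def gaussian_similarity(u, v, sigma=1.0):
    """Standard Gaussian (RBF) kernel on Euclidean distance.

    Assumption: this is the textbook kernel, not necessarily the
    Gaussian similarity formula proposed in the thesis.
    """
    d = euclidean(u, v)
    return math.exp(-(d * d) / (2.0 * sigma * sigma))


if __name__ == "__main__":
    a = kmer_frequency_vector("ACGACGTT", 2)
    b = kmer_frequency_vector("ACGACGTA", 2)
    print(kmer_positions("ACGACGTT", 2))
    print(gaussian_similarity(a, b, sigma=0.5))
```

The position lists returned by kmer_positions are the raw material a position-aware feature such as LF entropy would consume; the similarity is 1 for identical vectors and decays toward 0 as the Euclidean distance grows, with sigma controlling how fast.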
Keywords/Search Tags: alignment-free method, LF entropy, stationary discrete wavelet transform, SeqRank algorithm, Gaussian similarity