
Research And Application Of Biological Sequence Feature Coding Methods

Posted on: 2022-10-10
Degree: Master
Type: Thesis
Country: China
Candidate: L Shen
Full Text: PDF
GTID: 2480306752469274
Subject: Software engineering
Abstract/Summary:
With the rapid development of the life sciences and computer science, bioinformatics has emerged as a discipline. Sequence feature coding, which learns numerical representations of biological sequences, is an important part of bioinformatics. This thesis proposes three sequence feature coding methods and applies them to three bioinformatics tasks. First, a method based on deep hash learning is proposed to predict protein-protein interaction (PPI). Second, a method based on deep learning and an embedded taxonomic tree is proposed to classify bacterial sequences. Finally, Seq Rank is improved to map biological sequences into a one-dimensional feature space, which is used to detect the center sequence for multiple sequence alignment. The main results are as follows:

(1) A method based on deep hash learning for protein-protein interaction prediction. Most existing methods predict interactions by comparing proteins pairwise, which is inefficient. To address this, we propose DHL-PPI, a protein coding model based on deep hash learning. The model encodes each protein sequence into a binary hash code, and PPI is then predicted from the hash codes. The advantage of this method is that PPI discrimination is transformed into a retrieval problem, which avoids both the pairwise comparison of protein sequences and the heavy computation of machine-learning classification and clustering. The experimental results show that DHL-PPI accurately recognizes interacting protein pairs and reduces the time complexity of interaction prediction for batches of proteins.

(2) A method based on deep learning and an embedded taxonomic tree for bacterial sequence classification. Existing deep-learning methods for biological sequence classification have many parameters and require multiple models to be designed and trained. To address this, this thesis proposes BSC-TE, a new method based on deep learning and an embedded taxonomic tree. First, a vector representation of each taxon is generated by node2vec, and the bacterial sequence is encoded into a vector representation by the deep learning model. Then, the genus of the sequence is predicted according to the cosine distance between the vector of the sequence and the vector of each genus, and a bottom-up classification strategy is used to classify the sequence. The experimental results show that BSC-TE classifies bacterial sequences accurately and efficiently at various taxonomic levels. In addition, hierarchical classification can be carried out quickly, which improves the efficiency of bacterial sequence classification.

(3) An algorithm based on Seq Rank for detecting the center sequence in the star alignment strategy. Existing methods have high time complexity, and the center sequences they detect lack representativeness. To address this, our method improves the Seq Rank algorithm for detecting the center sequence in the star alignment strategy. The weights of the sequences are calculated by a random walk model on a bipartite graph, and the sequence with the highest weight is taken as the center sequence. Finally, the center sequence is aligned pairwise with each of the other sequences to complete the multiple sequence alignment. The time complexity of the proposed method is O(NL log2 NL), where N is the number of sequences and L is the average length of the sequences. This greatly reduces the time needed to detect the center sequence in star alignment. The experimental results show that the center sequence detected by our method is better than those detected by the comparison methods, which demonstrates the effectiveness of our method.
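Item (1) treats PPI prediction as retrieval over binary hash codes. The sketch below illustrates only that retrieval step. It assumes, purely for illustration, that two proteins are predicted to interact when the Hamming distance between their codes is within a chosen radius. The names `hamming`, `retrieve_candidates`, the 8-bit codes, and the protein labels are hypothetical and do not come from DHL-PPI.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two integer-encoded hash codes differ."""
    return bin(a ^ b).count("1")


def retrieve_candidates(query_code: int, database: dict, radius: int) -> list:
    """Return names of proteins whose hash code lies within `radius` of the query.

    Assumption: a small Hamming distance is read as "predicted to interact".
    """
    return sorted(
        name for name, code in database.items()
        if hamming(query_code, code) <= radius
    )


# Hypothetical 8-bit hash codes for three proteins.
database = {
    "P1": 0b01101001,
    "P2": 0b01101011,
    "P3": 0b10010110,
}
query = 0b01101000
candidates = retrieve_candidates(query, database, radius=2)
```

Each protein is encoded once, and each comparison is a single XOR plus a bit count. This sketch still scans the whole database; an index over the hash codes would be needed to avoid that.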
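Item (2) predicts the genus from the cosine distance between a sequence vector and each genus vector, then classifies bottom-up. A minimal sketch of that step follows, assuming the embeddings are already available as plain lists and the taxonomic tree is a child-to-parent mapping. The taxon names, the two-dimensional vectors, and the function names are hypothetical placeholders, not outputs of BSC-TE or node2vec.

```python
import math


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def predict_lineage(seq_vec, genus_vecs, parent):
    """Pick the nearest genus, then walk up the tree to get the higher ranks."""
    genus = min(genus_vecs, key=lambda g: cosine_distance(seq_vec, genus_vecs[g]))
    lineage = [genus]
    while lineage[-1] in parent:
        lineage.append(parent[lineage[-1]])
    return lineage


# Hypothetical genus embeddings and taxonomic tree (child -> parent).
genus_vecs = {"genus_A": [1.0, 0.1], "genus_B": [0.1, 1.0]}
parent = {
    "genus_A": "family_X",
    "family_X": "order_P",
    "genus_B": "family_Y",
    "family_Y": "order_Q",
}
lineage = predict_lineage([0.9, 0.2], genus_vecs, parent)
```

Because the higher ranks are read off the tree once the genus is known, a single model and a single nearest-vector search cover every level in this sketch.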
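Item (3) weights sequences by a random walk on a bipartite graph and takes the highest-weighted sequence as the center. The abstract does not say how the graph is built, so the sketch below makes its own assumption: one side holds the sequences, the other holds their distinct k-mers, and the walk alternates between the two sides with uniform transition probabilities. This is not the improved Seq Rank algorithm, and the function names and parameters are hypothetical. In this plain unweighted form the weights converge to values proportional to the number of distinct k-mers per sequence, so the example shows only the mechanics of the walk.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sequence_weights(seqs, k=3, iterations=50):
    """Iterate a sequence -> k-mer -> sequence random walk from a uniform start.

    Assumes every sequence has length >= k, so no probability mass is lost.
    """
    seq_kmers = [kmers(s, k) for s in seqs]
    kmer_seqs = defaultdict(list)  # k-mer -> indices of sequences containing it
    for i, ks in enumerate(seq_kmers):
        for km in ks:
            kmer_seqs[km].append(i)

    weights = [1.0 / len(seqs)] * len(seqs)
    for _ in range(iterations):
        # Step 1: each sequence spreads its weight evenly over its k-mers.
        kmer_mass = defaultdict(float)
        for i, ks in enumerate(seq_kmers):
            for km in ks:
                kmer_mass[km] += weights[i] / len(ks)
        # Step 2: each k-mer spreads its mass evenly over its sequences.
        new_weights = [0.0] * len(seqs)
        for km, mass in kmer_mass.items():
            for i in kmer_seqs[km]:
                new_weights[i] += mass / len(kmer_seqs[km])
        weights = new_weights
    return weights


def center_index(seqs, k=3):
    """Index of the sequence with the highest random-walk weight."""
    weights = sequence_weights(seqs, k)
    return max(range(len(seqs)), key=weights.__getitem__)


seqs = ["ACGTAC", "ACGT", "CGTA"]
weights = sequence_weights(seqs)
center = center_index(seqs)
```

In a star alignment, the sequence at `seqs[center]` would then be aligned pairwise with each remaining sequence.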
Keywords/Search Tags: Sequence Feature Coding, Sequence Classification, Multiple Sequence Alignment, Center Sequence