
Research On Protein Remote Homology Detection Based On Machine Learning Methods

Posted on: 2019-11-11  Degree: Doctor  Type: Dissertation
Country: China  Candidate: J J Chen
GTID: 1360330566497533  Subject: Computer application technology
Abstract/Summary:
Proteins are among the basic building blocks of living systems and participate in almost all life activities. With the development of biological sequencing techniques, a large amount of protein sequence data has accumulated, while protein structure data has grown slowly, leaving a huge gap between the two. Protein remote homology detection is of great significance for predicting protein structure, and it has attracted many researchers who study the problem in depth from different perspectives. Because biological experiments are costly, it has become increasingly urgent and important to predict protein structure and function from sequence using machine learning methods, where the two key scientific issues are how to vectorize a protein sequence and how to train a predictive model. Biological sequences are the "language" of life. Because of the similarity between biological sequences and natural language, this dissertation borrows ideas from natural language processing to vectorize protein sequences and proposes several machine-learning-based methods for protein remote homology detection. The main research consists of four parts.

Firstly, a detection method based on learning to rank (LTR). Most existing ranking methods detect remote homology between a pair of proteins using sequence-alignment-based methods. However, these alignment methods have a high false positive rate, especially for proteins with low sequence similarity, so the constructed feature vectors contain a lot of noise. To address this problem, this dissertation borrows the idea of the "query-document pair" from text retrieval and constructs "query-protein pairs", treating query proteins as query terms and candidate proteins as documents. A feature matrix is built from the similarity scores that several sequence alignment tools assign to the protein pairs. The feature matrix is then used to train an LTR model that re-ranks the candidate homologous proteins. The experimental results show that the proposed model not only corrects false positives among the candidate remote homologs but also improves detection stability.

Secondly, a detection method based on the sequence-order frequency matrix. It has been confirmed that protein vectors carrying evolutionary information can significantly improve the performance of protein homology detection. The commonly used methods for obtaining evolutionary information ignore the dependence among local amino acids and therefore lose much of the evolutionary information in protein sequences. To overcome this shortcoming, this dissertation borrows the idea that sentences with similar semantics contain similar key words. It presents a protein sequence representation based on the sequence-order frequency matrix, treating remotely homologous proteins as sentences with similar semantics and amino acid subsequences as the "key words" of protein sequences. A discriminative model is then trained on this representation. The experimental results show that this method captures more evolutionary information from protein sequences than traditional methods, which also confirms that local amino acid dependence is of great significance for improving protein homology detection.

Thirdly, a detection method based on amino acid embedding and recurrent neural networks. Currently almost all protein feature vectors are handcrafted, but it is difficult to extract complex amino acid patterns from human knowledge of proteins alone, so the resulting feature vectors carry incomplete information. To address this problem, this dissertation borrows the idea of word embedding from natural language processing and presents amino acid embedding, which treats amino acid subsequences as the words of a protein sequence. It then proposes an effective detection method that combines amino acid embedding with a recurrent neural network. The experimental results show that this method outperforms methods based on handcrafted feature vectors.

Fourthly, a detection framework based on ensemble and fusion methods. To combine the advantages of the ranking strategy and the discriminative strategy for protein homology detection, this dissertation first uses the ranking strategy to detect remotely homologous proteins with high reliability, and then re-detects the remotely homologous proteins with low reliability using an ensemble discriminative strategy. This framework can employ a variety of protein vectorization methods to characterize protein sequences from different perspectives, and it integrates the advantages of the ranking and discriminative strategies. The experimental results show that the proposed framework improves detection performance and broadens its range of application.

In summary, this dissertation focuses on protein remote homology detection based on ideas borrowed from natural language processing and proposes several machine learning methods. Finally, detection performance is successfully improved.
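The "query-protein pair" feature matrix described in the first part can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the names `kmer_words`, `kmer_overlap`, and `build_feature_matrix` are hypothetical, and the k-mer overlap score is only a toy stand-in for the scores that real sequence alignment tools would supply. Each row corresponds to one candidate protein and each column to one scorer; a matrix of this shape is what an LTR model would be trained on.

```python
from collections import Counter
from typing import Callable, Dict, List


def kmer_words(sequence: str, k: int = 3) -> List[str]:
    """Split a protein sequence into overlapping k-residue 'words'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_overlap(query: str, candidate: str, k: int = 3) -> float:
    """Toy similarity score (assumption): shared k-mer count divided by
    the smaller k-mer total. A stand-in for a real alignment score."""
    q_counts = Counter(kmer_words(query, k))
    c_counts = Counter(kmer_words(candidate, k))
    shared = sum((q_counts & c_counts).values())
    denom = min(sum(q_counts.values()), sum(c_counts.values()))
    return shared / denom if denom else 0.0


def build_feature_matrix(
    query: str,
    candidates: Dict[str, str],
    scorers: List[Callable[[str, str], float]],
) -> List[List[float]]:
    """One row per candidate protein, one column per scorer."""
    return [[score(query, seq) for score in scorers]
            for seq in candidates.values()]


# Hypothetical toy sequences, not real proteins.
query = "MKVLAAGIV"
candidates = {"cand_a": "MKVLAAGLV", "cand_b": "GGSSPPTTW"}
scorers = [
    lambda q, c: kmer_overlap(q, c, k=2),
    lambda q, c: kmer_overlap(q, c, k=3),
]
features = build_feature_matrix(query, candidates, scorers)
```

In a real pipeline the scorers would wrap the outputs of several alignment tools, and the rows, grouped by query protein and labeled with known homology, would be passed to a learning-to-rank model for re-ranking.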
Keywords/Search Tags: protein remote homology detection, machine learning, natural language processing techniques, ranking method, discriminative method