Research On Sequence Information Extraction Methods And Subcellular Localization Of Proteins Based On SVM

Posted on: 2020-07-31
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Zhang
GTID: 2370330572980087
Subject: Electronic and communication engineering
Abstract/Summary:
With the arrival of the big data and post-genomic era, a large number of protein sequences with unknown function and complex structure have flooded into biological databases. Mining the information in these sequences has become an active research direction in informatics and biology. The role a protein plays in an organism is closely related to its subcellular location, so the prediction of protein subcellular location has become a focus of bioinformatics. Against this background, and with the advancement of "Internet +", traditional experimental methods of acquiring biological data can no longer meet the needs of modern research, and information extraction, processing, and automated localization prediction based on machine learning algorithms play an irreplaceable role. This thesis uses machine learning algorithms to study protein subcellular location prediction. Drawing on the information-processing knowledge of the candidate's major, it addresses two aspects: feature extraction algorithms and classification prediction models.

(1) Building on existing methods, a new pseudo-amino acid composition (PseAAC) algorithm is proposed: nine new features are added to represent protein sequences, and the feature representation model is reconstructed. For feature extraction, following the idea of multi-feature fusion, a new protein feature vector model is constructed by combining autocorrelation coefficients, the entropy density method, and the proposed algorithm, which further enriches the representation of sequence information. A support vector machine (SVM) is selected as the classifier. Finally, leave-one-out cross-validation is performed on two datasets, Gram-positive and Gram-negative, and the results are compared with those of the traditional method to verify the practicality of the new method.

(2) The PsePSSM matrix is introduced to extract features that characterize the likelihood of amino acid mutation during evolution. According to their physicochemical properties, the amino acids are divided into six major categories. To further explore the influence of the local positional information of amino acid residues on the protein sequence as a whole, the sequences are first segmented. The new method combines the improved PseAAC, tripeptide composition, and information extracted from the PsePSSM matrix. To overcome the inherent limitations of a single classifier, the classification model is further optimized: an ensemble classifier is constructed from multiple support vector machines in parallel, and two datasets containing multi-location proteins are selected for verification. The results show that the ensemble classifier further improves prediction performance compared with a single classifier.
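
To make the feature-extraction step more concrete, the following is a minimal sketch of the classic (type 1) pseudo-amino acid composition that the improved algorithm in (1) builds on. It is not the thesis's improved algorithm: the nine additional features are not specified in this abstract and are not reproduced here. The function name `pseaac`, the parameters `lam` and `weight`, and the use of the Kyte-Doolittle hydropathy index as the single physicochemical property are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index, used here as one illustrative property.
# (Assumption: the abstract does not say which properties the thesis uses.)
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _standardize(scale):
    """Rescale a property to zero mean and unit variance over the 20 residues."""
    values = list(scale.values())
    mu, sigma = mean(values), pstdev(values)
    return {aa: (v - mu) / sigma for aa, v in scale.items()}


H = _standardize(HYDROPATHY)


def pseaac(seq, lam=3, weight=0.05):
    """Return a (20 + lam)-dimensional classic PseAAC feature vector.

    The first 20 entries are the normalized amino acid frequencies; the
    last `lam` entries are weighted sequence-order correlation factors.
    """
    # Drop any symbol that is not one of the 20 standard residues.
    residues = [aa for aa in seq.upper() if aa in H]
    n = len(residues)
    if n <= lam:
        raise ValueError("sequence must be longer than lam")

    counts = Counter(residues)
    freqs = [counts[aa] / n for aa in AMINO_ACIDS]

    # theta_k: mean squared property difference between residues k apart.
    thetas = [
        sum((H[residues[i]] - H[residues[i + k]]) ** 2 for i in range(n - k))
        / (n - k)
        for k in range(1, lam + 1)
    ]

    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]
```

A call such as `pseaac("MKKLLPTAAAGLLLLAAQPAMA", lam=3)` yields a 23-dimensional vector whose entries sum to 1. In a multi-feature fusion scheme of the kind described above, such a vector would be concatenated with the other descriptors before being passed to the classifier.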
Keywords/Search Tags: Protein subcellular localization, Feature fusion, Support vector machine, Improved pseudo-amino acid composition, Ensemble classifier