Research On Multi-site Protein Subcellular Localization Prediction Method Based On Fusion Feature And Multi-label Deep Forest Model

Posted on: 2024-02-26
Degree: Master
Type: Thesis
Country: China
Candidate: H R Yang
GTID: 2530306938451634
Subject: Information and Communication Engineering
Abstract/Summary:
Proteins are indispensable organic molecules in living organisms; they maintain normal life processes and are of great value in modern biomedical research. Studies have shown that multi-site proteins have more complex biological functions, so predicting their subcellular localization is important for interpreting biological phenomena and for designing and developing drugs for related diseases. With the rapid development of gene sequencing technology, the explosive growth of multi-site protein data brings both opportunities and challenges to protein subcellular localization prediction. Traditional biochemical experiments can no longer meet the needs of this research, so computational methods based on machine learning have become the mainstream approach. Drawing on bioinformatics, this thesis studies, for the first time, multi-site protein subcellular localization prediction based on a multi-label deep forest model. The main work is as follows:

(1) A method for multi-site protein subcellular localization prediction based on functional amino acid composition and JS divergence is proposed. First, amino acid composition based on a functional matrix (FM-AAC) and JS divergence based on the position-specific scoring matrix (JS-PSSM) are extracted to characterize the protein sequence. FM-AAC reduces irrelevant and redundant information by retaining strongly functional amino acids and discarding weakly functional ones. Compared with KL divergence based on the position-specific scoring matrix (KL-PSSM), JS-PSSM halves the feature dimension and reflects the correlation between any two amino acid attributes more faithfully. Second, the two features are fused serially into AAC-PSSM-JS to enrich the protein information. Finally, a prediction method based on the multi-label deep forest model (MLDF) is studied: the fused features are processed layer by layer to strengthen the feature representation, and the multi-label forest module handles the correlations between labels. 5-fold cross-validation shows that the proposed method is feasible and effective for multi-site protein subcellular localization prediction.

(2) A method for predicting multi-site protein subcellular localization called MLpsl-MLDF is proposed. First, eight feature extraction methods are used to capture the information contained in the protein: functional amino acid composition (FM-AAC), JS divergence analysis (JS-PSSM), amino acid physical property analysis (AA-PHP), pseudo amino acid composition (PseAAC), multi-scale continuous and discontinuous (MCD), encoding based on grouped weight (EBGW), detrended cross-correlation analysis (DCCA) and pseudo position-specific scoring matrix (PsePSSM). The eight features are then fused serially. Second, multi-label linear discriminant analysis (MLDA) is applied to remove the irrelevant and redundant information in the fused feature. Finally, the more discriminative fused feature is fed into the multi-label deep forest model (MLDF) to predict the subcellular localization of multi-site proteins. Under 5-fold cross-validation, the overall actual accuracy (OAA) and overall location accuracy (OLA) of MLpsl-MLDF on the Gram-positive bacterial protein dataset are 99.61% and 99.90%, respectively, higher than those of other existing methods. On the Gram-negative bacterial protein dataset, the OAA and OLA are 93.97% and 98.96%, respectively, which is competitive with other methods. The experimental results show that MLpsl-MLDF can effectively predict the subcellular localization of multi-site proteins.

(3) A method named wMLDA-MLDF is proposed to predict the subcellular localization of multi-site proteins. First, protein features are obtained with the eight feature extraction methods above and fused serially. Second, entropy-weighted multi-label linear discriminant analysis (wMLDA) is used to reduce the dimension of the fused features and weaken the interference of redundant information. The more class labels a multi-label example belongs to, the greater the uncertainty of each label, so entropy, a measure of the uncertainty of a random variable, is used to quantify it. wMLDA makes better use of label information to optimize the fused feature and obtain the optimal features. Finally, the optimal features are fed into the multi-label deep forest model (MLDF) to predict multi-site protein subcellular localization. Under 5-fold cross-validation, the OAA and OLA are 96.12% and 99.19% on the virus dataset and 98.74% and 99.90% on the plant dataset. Relative to other existing methods, these values differ by 1.42% to 17.92% (OAA) and -0.04% to 24.39% (OLA) on the virus dataset, and by 0.84% to 30.64% (OAA) and 0.12% to 28.2% (OLA) on the plant dataset. The results show that wMLDA-MLDF predicts multi-site protein subcellular localization well and further improves prediction accuracy.
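The abstract reports OAA and OLA without stating how they are computed. The sketch below assumes the definitions commonly used in multi-label localization studies: OAA is the fraction of proteins whose predicted location set matches the true set exactly, and OLA is the number of correctly predicted true locations divided by the total number of true locations. The function name `overall_accuracies` and the toy label matrices are illustrative and are not taken from the thesis.

```python
import numpy as np


def overall_accuracies(y_true, y_pred):
    """Return (OAA, OLA) for binary indicator matrices of shape
    (n_proteins, n_locations). Assumes every protein has at least
    one true location, so the OLA denominator is non-zero."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")

    # OAA (assumed definition): a protein counts only if its whole
    # predicted location set equals the true set.
    exact_match = np.all(y_true == y_pred, axis=1)
    oaa = exact_match.mean()

    # OLA (assumed definition): correctly predicted true locations
    # over all true locations, pooled across proteins.
    hits = np.logical_and(y_true, y_pred).sum()
    ola = hits / y_true.sum()

    return float(oaa), float(ola)


# Toy data: 4 proteins, 3 candidate locations (hypothetical values).
y_true = [[1, 0, 0],
          [1, 1, 0],
          [0, 0, 1],
          [0, 1, 1]]
y_pred = [[1, 0, 0],
          [1, 0, 0],   # misses one of two true locations
          [0, 0, 1],
          [0, 1, 1]]

oaa, ola = overall_accuracies(y_true, y_pred)
print(f"OAA = {oaa:.4f}, OLA = {ola:.4f}")
```

Under these assumed definitions, OLA gives credit for partially correct predictions and does not penalize extra predicted locations, while OAA requires an exact match. That is consistent with the OLA values in the abstract being higher than the corresponding OAA values.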
Keywords/Search Tags:multi-site protein subcellular localization, feature extraction, multi-feature fusion, feature dimensionality reduction algorithm, multi-label deep forest model