| With the advent of the post-genome era, membrane protein type prediction has become a research hotspot and an important part of proteomics. As the volume of data grows rapidly, determining membrane protein types by traditional methods such as biological experiments is no longer practical. Starting from the feature representation of the data, this study builds on machine learning: it transforms each membrane protein sequence into a feature vector that can be input into a machine learning algorithm, and it achieves strong performance by using several prediction models and ensemble learning. The main contents of this paper include feature extraction, efficient use of features, selection of an ensemble strategy, construction of a deep learning model, and feature fusion, and we achieve better performance than existing methods. The specific research contents are as follows: 1. Given a protein sequence, we can obtain its amino acid composition information and its evolutionary information. Based on these two kinds of information, we can extract features of the protein and obtain a feature representation that can be input into a machine learning algorithm. The main feature extraction methods are amino acid composition (AAC), dipeptide composition (DIPC), and the position-specific scoring matrix (PSSM). Among them, the PSSM is a powerful feature extraction method, but because of its particular dimensions (its size depends on the sequence length), it must be post-processed before it can be input to traditional machine learning methods. This post-processing inevitably loses some information. In this paper, the deep learning model is combined with the PSSM to achieve good predictive performance without altering the original matrix. 2. The information contained in a protein sequence is extremely complex. Amino acid positions, sequence length, and long-distance dependencies among some 
amino acids all determine the properties of membrane proteins and thus their class. An individual classifier or a single criterion cannot accurately capture the rich intrinsic information in a membrane protein sequence, which limits prediction performance. Ensemble learning draws on the strengths of different classifiers and combines them, which can greatly improve the prediction results. How the classifiers are combined, that is, the choice of ensemble strategy, is critical. We experimented with various ensemble strategies, including the multiplication strategy, the maximum strategy, linear weighting, exponential weighting, logarithmic weighting, and stacking. 3. Each membrane protein in the dataset is represented by a sequence of letters drawn from the 20 amino acids. Sequence lengths range from tens to thousands of residues, and the length distribution is not uniform. Traditional machine learning algorithms are unable to capture long-distance dependencies and internal relationships in such long sequences. We introduce a recurrent neural network so that the original position-specific scoring matrix can be input directly into the deep learning model we constructed, preserving the original structure of the matrix without loss of information. This efficient use of the position-specific scoring matrix allows us to obtain very good prediction performance with only one feature extraction method. |
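
The composition features and the simpler ensemble strategies named above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`aac`, `dipc`, `combine`), the residue ordering, and the exact form of each combination rule are assumptions made for illustration. The abstract does not define the exponential or logarithmic weighting schemes, so they are left out, as are stacking and the PSSM-based recurrent model.

```python
import math
from itertools import product

# The 20 standard amino acids; the ordering is an arbitrary choice for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def aac(sequence):
    """Amino acid composition: 20 relative residue frequencies."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return [residues.count(a) / len(residues) for a in AMINO_ACIDS]


def dipc(sequence):
    """Dipeptide composition: 400 relative frequencies of adjacent residue pairs."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard letters
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return [counts[d] / total for d in DIPEPTIDES]


def combine(probabilities, rule="product", weights=None):
    """Fuse per-classifier class probabilities and return the winning class index.

    probabilities: one list of class probabilities per classifier.
    rule: "product" (multiplication), "max" (maximum), or "linear" (weighted sum).
    weights: per-classifier weights for the "linear" rule (assumed uniform if omitted).
    """
    n_classes = len(probabilities[0])
    if rule == "product":
        scores = [math.prod(p[c] for p in probabilities) for c in range(n_classes)]
    elif rule == "max":
        scores = [max(p[c] for p in probabilities) for c in range(n_classes)]
    elif rule == "linear":
        if weights is None:
            weights = [1.0 / len(probabilities)] * len(probabilities)
        scores = [sum(w * p[c] for w, p in zip(weights, probabilities))
                  for c in range(n_classes)]
    else:
        raise ValueError(f"unknown rule: {rule}")
    return max(range(n_classes), key=scores.__getitem__)
```

In use, `aac` and `dipc` would each give a fixed-length vector (20 and 400 values) for a sequence of any length, which is what lets them feed traditional classifiers directly. `combine` would then be applied to the class probabilities those classifiers output.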