| Understanding the structure and function of proteins is of great practical significance in life science, agriculture, and medicine. Predicting protein secondary structure is an important part of protein structure research, and using machine learning to predict secondary structure from the position-specific scoring matrix (PSSM) is an important method in bioinformatics. To reduce the dependence on high-quality annotated data, it is of both informatics and biological significance to explore semi-supervised learning for mining protein sequence data and predicting secondary structure. The core of such an algorithm is to fully exploit the useful information in unlabeled data and fuse it efficiently with labeled data. At the same time, a well-designed feature engineering method can effectively improve the recognition of protein secondary structure. This paper carries out the following work on this problem: (1) An effective feature representation method is designed for the PSSM. Specifically, taking into account the evolutionary information of different amino acids and the information between adjacent and non-adjacent residues, several PSSM-based feature representations are designed that map the matrix to numerical feature vectors with strong discriminative ability. (2) Because the feature representations produce high-dimensional feature vectors, the role of feature selection in structure prediction is examined. Filter methods based on statistical information are used to remove the redundant and irrelevant features generated during feature representation, and experiments compare their effects on semi-supervised learning performance. (3) Several semi-supervised learning algorithms are introduced, and the ladder network is proposed as the model. It integrates supervised and unsupervised learning and, building on a denoising mechanism, uses connections that bridge the encoder and decoder to realize semi-supervised learning. (4) Comparative experiments on grouped designs of the D8244 and D640 data sets with three different labeled-data ratios show that, under the same external conditions, the ladder network semi-supervised model is more accurate than other classical semi-supervised models. In addition, the parameters and feature representation of the combined model were optimized, and the resulting model performed comparably to the traditional supervised algorithms, support vector machine (SVM) and random forest (RF). The semi-supervised learning algorithm based on the ladder network is therefore of practical use in protein secondary structure recognition, and upfront feature engineering can improve the model's performance. The method proposed in this paper can be applied to the prediction of protein secondary structure, and the research approach also has informatics and biological significance for combining data mining methods with frontier problems in biological science. |
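
As a rough illustration of steps (1) and (2), the following sketch builds sliding-window features from a PSSM and then applies a simple statistical filter. It is not the paper's method: the function names `window_features` and `variance_filter`, the zero padding, the window half-width of 2, and the use of variance as the filter criterion are all assumptions made for this example, and the PSSM is random toy data.

```python
import numpy as np


def window_features(pssm, half_width=2):
    """Concatenate the PSSM rows in a window centred on each residue.

    Hypothetical helper: the window captures information from
    neighbouring residues, as described in step (1).
    """
    length, n_cols = pssm.shape
    # Zero-pad both ends so that terminal residues also get a full window.
    pad = np.zeros((half_width, n_cols))
    padded = np.vstack([pad, pssm, pad])
    window = 2 * half_width + 1
    return np.stack([padded[i:i + window].ravel() for i in range(length)])


def variance_filter(features, threshold=0.0):
    """Keep columns whose variance exceeds the threshold.

    Assumed stand-in for the statistical filter methods of step (2).
    """
    keep = features.var(axis=0) > threshold
    return features[:, keep], keep


# Toy 10-residue PSSM with 20 amino-acid columns (random, not real data).
rng = np.random.default_rng(0)
pssm = rng.integers(-5, 6, size=(10, 20)).astype(float)

X = window_features(pssm, half_width=2)   # one feature row per residue
X_selected, kept = variance_filter(X)     # drop constant columns
print(X.shape, X_selected.shape)
```

Each residue is described by the scores of its five-residue neighbourhood, giving 5 × 20 = 100 features before filtering. The filtered matrix could then be passed to any semi-supervised or supervised classifier.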