
Research Of Protein Subcellular Location Using Machine Learning Algorithms

Posted on: 2020-10-12
Degree: Master
Type: Thesis
Country: China
Candidate: X J Chen
GTID: 2480306314495834
Subject: Master of Agriculture
Abstract/Summary:
Proteins, as basic constituents of living organisms, play an important role in life activities. The function of a protein is closely related to its subcellular location: a protein can perform its role only in a specific subcellular location. Determining the subcellular location with available methods is therefore important for understanding the functions and properties of proteins and for recognizing the interactions between them. With the advent of the high-throughput sequencing era, traditional manual experiments can no longer determine protein subcellular locations efficiently enough to meet the needs of scientific research, which has promoted the development of machine learning for protein subcellular localization prediction. Because a protein sequence carries a large amount of information, sequences belonging to the same subcellular location differ in length, and sequence features are unevenly distributed, classification results are poor when only traditional protein sequence feature extraction algorithms are used. Some related methods that combine features from multiple kinds of biological information obtain better results, but their feature extraction process is relatively complicated and the resulting feature vectors are sparse and high-dimensional, ignoring the correlation between sequence features. This thesis addresses these problems. Its main contributions are as follows.

(1) A bag of words model based on relational expansion was proposed and applied to protein subcellular localization prediction. The spatial position information of the protein sequence words was extracted by introducing a relationship map, and the relationship map was transformed by a Convolutional Neural Network (CNN). Finally, the bag of words features and the relationship map features were merged into the final feature, which was sent to a Support Vector Machine (SVM) for classification. The experimental results show that the proposed relationship map effectively addresses the insufficient discrimination of traditional bag of words features of protein sequences and further improves the accuracy of protein subcellular localization prediction.

(2) A protein sequence feature extraction algorithm based on multi-level sparse coding was proposed. The algorithm combined sparse coding with traditional amino acid composition (AAC) information to extract protein sequence features, and performed dictionary learning and sparse representation on them. The sparse features were pooled at multiple levels according to different dictionary sizes, and Principal Component Analysis (PCA) was used to select the optimal features. Finally, the obtained feature vectors were sent to the SVM for classification. The algorithm was evaluated with the jackknife test, using Sensitivity (Sn), Specificity (Sp), Matthews Correlation Coefficient (MCC) and Overall Accuracy (OA) as the evaluation indices. The experimental results show that the algorithm reflects the sequence features more comprehensively and further improves classification performance.

(3) In addition, to make the research results easy to observe and use, a protein subcellular localization prediction system based on the proposed algorithms was developed for the practical needs of researchers in the field. The feasibility and requirements of the system were analyzed from the perspective of software engineering, and the design principles of the system were elaborated. The overall framework of the system, the division of labor among its modules, the system construction information and the technologies used to build it were also described. Detailed use-case analysis was carried out for modules such as data acquisition, feature extraction and model invocation, and the database tables involved in these modules were designed. Finally, the actual operation process of the system was described in detail with reference to the system's operation interface, and the corresponding pages were shown.
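As an illustration of the evaluation described in contribution (2), the following is a minimal sketch, not the thesis's implementation. It computes plain AAC features, runs a jackknife (leave-one-out) test, and reports Sn, Sp, MCC and OA. Several things here are assumptions: all function names are hypothetical, a 1-nearest-neighbour rule stands in for the SVM so that the example needs only the standard library, the sparse coding, pooling and PCA steps are omitted, and the per-class indices are computed one-vs-rest. The sequences in the demo are made-up toy data.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aac(sequence):
    """Amino acid composition: frequency of each of the 20 standard residues."""
    seq = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not seq:
        raise ValueError("sequence contains no standard residues")
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def nearest_label(x, train):
    """1-nearest-neighbour stand-in for the SVM (assumption, not the thesis's classifier)."""
    def sq_dist(item):
        return sum((a - b) ** 2 for a, b in zip(x, item[0]))
    return min(train, key=sq_dist)[1]

def jackknife_predict(features, labels):
    """Leave each sample out in turn and predict it from all remaining samples."""
    data = list(zip(features, labels))
    return [nearest_label(x, data[:i] + data[i + 1:])
            for i, (x, _) in enumerate(data)]

def class_metrics(y_true, y_pred, cls):
    """One-vs-rest sensitivity, specificity and MCC for class `cls`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    tn = sum(t != cls and p != cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tn / (tn + fp) if tn + fp else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, mcc

def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

if __name__ == "__main__":
    # Toy, made-up sequences and location labels, for illustration only.
    sequences = ["MKKLLLAAAL", "MKRLLLAVAL", "DEDEESSDDE", "EEDDESTDDE"]
    labels = ["membrane", "membrane", "nucleus", "nucleus"]
    predicted = jackknife_predict([aac(s) for s in sequences], labels)
    print("OA:", overall_accuracy(labels, predicted))
    for cls in sorted(set(labels)):
        print(cls, class_metrics(labels, predicted, cls))
```

In the jackknife test each sequence is predicted by a model built from all the other sequences, so every sample is used for testing exactly once. Replacing `nearest_label` with a trained SVM and `aac` with the sparse-coded features would bring the sketch closer to the pipeline the abstract describes.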
Keywords/Search Tags:Protein subcellular localization, Sequence feature extraction, Relationship map, Sparse coding, Support vector machine