Predicting The Proteins Subcellular Localization Based On Physical And Chemical Features Fusion

Posted on: 2018-09-08
Degree: Master
Type: Thesis
Country: China
Candidate: L Y Wang
GTID: 2310330512481825
Subject: Computer Science and Technology
Abstract/Summary:
Proteins are the material basis of life and take part in many life activities, including DNA replication, transcription, translation, metabolism, signal transduction, and cell cycle control; they are a direct manifestation of the phenomena of life. The study of protein function is therefore a hot topic in proteomics research. A protein's subcellular location directly determines its function: a protein must be transported to its proper subcellular location in order to function correctly, and otherwise it can cause dysfunction in the organism, leading to a variety of diseases and even endangering life. The study of protein subcellular localization prediction is therefore very meaningful. In addition, determining protein subcellular locations plays a crucial role in research on cancer pathogenesis and in target-cell drug discovery. Studies of protein subcellular localization have found that more and more proteins exist in, or move between, two or more subcellular locations. Research has consequently shifted from single-site to multi-site protein subcellular localization prediction, which has become a research hotspot in bioinformatics.

The huge and growing number of protein sequences brings great challenges and difficulties to subcellular localization research, so prediction needs to be automated with the help of computer technology. Traditional protein subcellular localization prediction is usually divided into four steps. The first step is to build protein datasets, which provide reliable data for the prediction. The second step, protein feature extraction, is a key step in subcellular localization prediction, and traditional extraction methods limit prediction accuracy. The third step is the selection of a prediction algorithm; choosing a suitable algorithm is the most important step in the research process and directly affects the final prediction results. The fourth step, evaluation of the prediction algorithm, analyzes the evaluation results to determine whether the chosen feature extraction methods and prediction algorithms are suitable, in order to improve prediction accuracy.

This thesis studies protein feature extraction and prediction algorithms for protein subcellular localization. The main work is summarized as follows. First, the thesis adopts protein sequence datasets that contain both single-site and multi-site proteins, namely Virus-mPLoc and Gpos-mPLoc. Second, based on three feature extraction methods, entropy density, pseudo amino acid composition (PseAAC), and amphiphilic pseudo amino acid composition (AmPseAAC), the thesis carries out research in three aspects. The first aspect is an improved characterization method for AmPseAAC, which is compared with PseAAC to evaluate the effectiveness of the improvement. The second aspect is an improved feature fusion rule: starting from the simple feature fusion rule, the first 20 dimensions of AmPseAAC are replaced by the 20 dimensions of entropy density, which is called the special fusion method. The third aspect combines two feature extraction methods, the dipeptide composition model and the amino acid index (AAID) model, to propose a new feature extraction method based on the physical and chemical characteristics of amino acids; the localization prediction results demonstrate the effectiveness of the proposed method. Third, the multi-label k-nearest neighbor algorithm (ML-KNN) is used as the prediction algorithm and, to account for data imbalance, an improved version (wML-KNN) is also used. Fourth, the prediction algorithm is evaluated with five indicators: Hamming loss, one-error rate, coverage, average precision, and accuracy. The evaluation results show that the selected feature extraction methods and prediction algorithms are feasible on the Virus-mPLoc and Gpos-mPLoc datasets and achieve better prediction accuracy.
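As an illustration of four of the evaluation indicators named in the abstract, the following is a minimal Python sketch of Hamming loss, one-error, coverage, and average precision. It assumes the standard multi-label definitions commonly used alongside ML-KNN; the abstract does not give the thesis's exact formulas, so this is not necessarily the author's implementation. The fifth indicator, accuracy, is omitted because the abstract does not say which multi-label variant is meant. All function and parameter names here are hypothetical.

```python
from typing import List, Sequence, Set


def ranks(scores: Sequence[float]) -> List[int]:
    """Rank labels by descending score; the highest-scoring label gets rank 1."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    r = [0] * len(scores)
    for pos, j in enumerate(order, start=1):
        r[j] = pos
    return r


def hamming_loss(true_sets: List[Set[int]], pred_sets: List[Set[int]],
                 n_labels: int) -> float:
    """Fraction of (protein, location) pairs that are labelled wrongly."""
    wrong = sum(len(t ^ p) for t, p in zip(true_sets, pred_sets))
    return wrong / (len(true_sets) * n_labels)


def one_error(true_sets: List[Set[int]],
              scores: List[Sequence[float]]) -> float:
    """Fraction of proteins whose top-ranked location is not a true one."""
    misses = 0
    for t, s in zip(true_sets, scores):
        top = max(range(len(s)), key=lambda j: s[j])
        if top not in t:
            misses += 1
    return misses / len(true_sets)


def coverage(true_sets: List[Set[int]],
             scores: List[Sequence[float]]) -> float:
    """Average number of extra steps down the ranking needed to reach
    every true location (worst true rank minus one)."""
    total = 0
    for t, s in zip(true_sets, scores):
        r = ranks(s)
        total += max(r[j] for j in t) - 1
    return total / len(true_sets)


def average_precision(true_sets: List[Set[int]],
                      scores: List[Sequence[float]]) -> float:
    """For each true location, the share of true locations among all
    labels ranked at or above it, averaged per protein and overall."""
    total = 0.0
    for t, s in zip(true_sets, scores):
        r = ranks(s)
        per_label = (sum(1 for k in t if r[k] <= r[j]) / r[j] for j in t)
        total += sum(per_label) / len(t)
    return total / len(true_sets)
```

Each protein is represented by a set of true location indices, a set of predicted indices (for Hamming loss), and a list of per-location scores (for the ranking-based indicators). Lower is better for Hamming loss, one-error, and coverage; higher is better for average precision. Every protein is assumed to have at least one true location.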
Keywords/Search Tags: protein localization prediction, amphiphilic pseudo amino acid composition, new feature extraction method, feature fusion rules, multi-label k-nearest neighbor algorithm