
The Research On Protein Sequence Feature Extraction And Its Application On Protein Subcellular Location

Posted on: 2014-05-08
Degree: Master
Type: Thesis
Country: China
Candidate: Q M Hu
GTID: 2250330425483750
Subject: Computer Science and Technology
Abstract/Summary:
With the completion of human genome sequencing, protein sequence information has grown explosively, and large numbers of protein sequences have poured into biological databases. The gap between the number of proteins with known sequences and the number with known functions keeps widening, and this imbalance severely restricts proteomics research and the development of new drugs. Protein function is closely connected with subcellular localization: a protein must be transported to a specific organelle before it can perform its normal function. Subcellular localization information can therefore provide useful clues for predicting protein function, and the study of protein subcellular localization has become an important research field in proteomics. The core steps of protein subcellular localization prediction are feature extraction and classification. This thesis focuses on sequence feature coding methods and the design of classification algorithms. The main work comprises the following two points:

First, this thesis presents a new sequence feature extraction method that captures more sequence information. Each protein sequence is transformed into a 232-dimensional numerical vector with three components: the conventional amino acid composition (20 dimensions); amino acid residue position information, represented by a 20-dimensional numeric vector; and amino acid sequence order information (192 dimensions). Extracting the order information directly from triplets of amino acids would bring the curse of dimensionality, since there are 20^3 = 8000 possible triplets. Instead, each amino acid residue is expressed by its corresponding vertical codon, so that every protein sequence can be expressed as three RNA sequences. The order information is then extracted by calculating the probability of each triplet appearing in each RNA sequence. With a nearest neighbor classification algorithm, this method obtains good experimental results when predicting protein subcellular localization on two standard data sets.

Second, to further mine sequence feature information, this thesis presents a new coding method based on the idea of pseudo amino acid composition. The method makes full use of the physicochemical properties of amino acids for predicting protein subcellular location. The amino acids are divided into six categories, and the order information is reflected through an order correlation factor. The SVM algorithm is used to predict protein subcellular location on apoptosis protein, prokaryotic protein, and eukaryotic protein data sets. The results show that the method achieves good results under leave-one-out cross-validation.
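The feature components and the evaluation protocol described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It covers only the parts the abstract states explicitly: the 20-dimensional amino acid composition, triplet frequencies over an RNA sequence, and a nearest neighbor classifier scored by leave-one-out cross-validation. The function names, the Euclidean distance, and the overlapping sliding window for triplets are all assumptions. The mapping from residues to vertical codons and the position-information component are not specified in the abstract, so they are left out. Note that 4^3 = 64 triplets for each of three RNA sequences would account for the stated 192 order dimensions, although the abstract does not spell out this arithmetic.

```python
from itertools import product
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
RNA_BASES = "ACGU"
# All 4^3 = 64 nucleotide triplets, in a fixed order.
TRIPLETS = ["".join(p) for p in product(RNA_BASES, repeat=3)]


def amino_acid_composition(seq):
    """Return the 20-dim vector of residue frequencies for a protein sequence."""
    counts = dict.fromkeys(AMINO_ACIDS, 0)
    for residue in seq:
        if residue in counts:  # skip non-standard symbols such as 'X'
            counts[residue] += 1
    total = sum(counts.values()) or 1  # avoid division by zero
    return [counts[aa] / total for aa in AMINO_ACIDS]


def triplet_frequencies(rna):
    """Return the 64-dim vector of triplet frequencies for one RNA sequence.

    Assumption: triplets are read with an overlapping window of step 1.
    """
    counts = dict.fromkeys(TRIPLETS, 0)
    for i in range(len(rna) - 2):
        triplet = rna[i:i + 3]
        if triplet in counts:
            counts[triplet] += 1
    total = sum(counts.values()) or 1
    return [counts[t] / total for t in TRIPLETS]


def nearest_neighbor_label(query, vectors, labels):
    """Return the label of the training vector closest to `query` (Euclidean)."""
    best = min(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))
    return labels[best]


def leave_one_out_accuracy(vectors, labels):
    """Hold out each sample in turn, classify it with the rest, return accuracy."""
    correct = 0
    for i in range(len(vectors)):
        rest_vectors = vectors[:i] + vectors[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if nearest_neighbor_label(vectors[i], rest_vectors, rest_labels) == labels[i]:
            correct += 1
    return correct / len(vectors)
```

To use the sketch, build one feature vector per protein, for example `amino_acid_composition(seq)` concatenated with the three `triplet_frequencies(...)` vectors, and pass the list of vectors and their location labels to `leave_one_out_accuracy`.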
Keywords/Search Tags: Protein subcellular localization, Pseudo amino acid composition, Feature extraction, Nearest neighbor classifier, Support vector machine