
Study On Some Information Extraction Algorithms In Protein Subcellular Localization Prediction

Posted on: 2015-01-26
Degree: Master
Type: Thesis
Country: China
Candidate: Z X Shi
Full Text: PDF
GTID: 2268330428963219
Subject: Biochemistry and Molecular Biology
Abstract/Summary:
Because the function of a protein is closely related to its subcellular location, machine-learning-based prediction of subcellular localization can provide an important reference for studying the function of a new protein. This thesis focuses on information extraction algorithms for subcellular localization prediction. Its contents are as follows:
(1) An information extraction algorithm based on mining the AAindex database. Starting from the physicochemical properties of amino acids, we scanned the 544 amino acid indexes in the AAindex database using the autocorrelation function and reduced amino acid groups, in order to study systematically how different amino acid indexes, different reduced amino acid groupings, and different information extraction methods affect protein subcellular localization prediction.
(2) An information extraction algorithm based on PSI-BLAST homology alignment. In current research, using PSI-BLAST to build the alignment database involves redundancy and inefficiency. We propose a new method that substitutes the training set itself for the commonly used NR database. This method extracts homology information more efficiently while excluding the interference of redundant data. It greatly improves matching efficiency and achieves higher accuracy in protein subcellular localization prediction.
(3) An information extraction algorithm based on golden ratio segmentation of the protein sequence. Because different segments of a protein sequence contain different amounts of information, the golden ratio is used to divide the sequence. We collect composition information and location information from each segment; the PSSM of the protein sequence is split into sub-matrices by the golden ratio, and statistics on evolution information are collected from each sub-matrix.
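The golden ratio segmentation described in (3) can be illustrated with a minimal sketch. The abstract does not say how the cut point is chosen or how many segments are produced, so the code below assumes a single cut at roughly 0.618 of the sequence length and extracts only amino acid composition from each segment. The function names `golden_split`, `composition`, and `segment_features` are hypothetical and are not taken from the thesis.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order for the feature vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Golden ratio conjugate, about 0.618 (assumed cut fraction).
GOLDEN_RATIO = (5 ** 0.5 - 1) / 2


def golden_split(sequence):
    """Split a sequence once at about 61.8% of its length (an assumption)."""
    cut = round(len(sequence) * GOLDEN_RATIO)
    return sequence[:cut], sequence[cut:]


def composition(segment):
    """Return the frequency of each standard amino acid in the segment.

    Frequencies are relative to the full segment length, so non-standard
    residue letters lower the sum below 1. An empty segment gives zeros.
    """
    if not segment:
        return [0.0] * len(AMINO_ACIDS)
    counts = Counter(segment)
    return [counts[aa] / len(segment) for aa in AMINO_ACIDS]


def segment_features(sequence):
    """Concatenate the composition vectors of the two segments (40 values)."""
    head, tail = golden_split(sequence)
    return composition(head) + composition(tail)


if __name__ == "__main__":
    head, tail = golden_split("ACDEFGHIKL")
    print(head, tail)
    print(len(segment_features("ACDEFGHIKL")))
```

The same cut index could be applied to the rows of a PSSM to obtain the sub-matrices mentioned above; location information would need an additional descriptor that this sketch does not attempt.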
We find that a fusion model integrating the composition, location, and evolution information of the segmented sequence significantly improves subcellular localization prediction accuracy. In addition, based on principal component analysis, we developed a simple feature subset search algorithm that reduces the number of dimensions while markedly improving prediction accuracy.
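As a rough illustration of dimensionality reduction with principal component analysis, the sketch below projects a feature matrix onto its leading principal axes and then tries several component counts, keeping the best-scoring one. The abstract does not describe the thesis's actual feature subset search, so the search loop, the function names `pca_project` and `search_n_components`, and the user-supplied `score_fn` are all assumptions made for illustration. It relies on NumPy.

```python
import numpy as np


def pca_project(features, n_components):
    """Project rows of `features` onto the top principal axes.

    `features` is an (n_samples, n_features) array. The principal axes
    are taken from the SVD of the mean-centered data.
    """
    centered = features - features.mean(axis=0)
    # Rows of vt are principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def search_n_components(features, score_fn, candidates):
    """Return the candidate component count with the highest score.

    `score_fn` is a hypothetical callable that maps a reduced feature
    matrix to a number, for example a cross-validated accuracy.
    """
    best_k, best_score = None, float("-inf")
    for k in candidates:
        score = score_fn(pca_project(features, k))
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(30, 5))
    print(pca_project(data, 2).shape)
```

In practice `score_fn` would wrap whatever classifier and validation scheme is being evaluated; here it is left abstract because the source does not specify one.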
Keywords/Search Tags: Protein, Subcellular localization prediction, Information extraction algorithm, Machine learning