
Research On Protein Subcellular Localization Based On LDA With Noise Ratio And KNN With Membership In Classes

Posted on: 2018-06-24
Degree: Master
Type: Thesis
Country: China
Candidate: Z F Lei
Full Text: PDF
GTID: 2370330518955131
Subject: Computer technology
Abstract/Summary:
The function of a protein is closely related to its localization within the cell. A newly synthesized protein can exert its function only after it has been transported to the correct subcellular site; otherwise, diseases that are difficult to cure, such as cancer and genetic disorders, may result. Obtaining localization information through traditional biological experiments takes a great deal of time and money, so computational methods that find localization information rapidly and accurately in massive collections of protein sequences have become a hot spot in bioinformatics research. Because of the large amount of biological data, the high dimensionality of current feature data, and the randomness, burstiness, and discontinuity of protein creation, the data suffer from noise, class imbalance, and other issues. This thesis therefore proposes a protein subcellular localization method based on linear discriminant analysis (LDA) with noise ratio and K-nearest neighbor (KNN) with membership in classes. The specific work of this thesis covers the following three aspects:

(1) The high-dimensional protein data are processed with linear discriminant analysis. This method uses the Fisher linear discriminant criterion to find a projection direction along which the samples of different classes are separated as well as possible, with the largest between-class distance and the smallest within-class distance. However, noise introduced while mRNA is translated into protein may increase the within-class distance of the spatial distribution of proteins, which greatly degrades the dimensionality reduction. This thesis therefore first introduces the noise ratio into this field and uses it to weight linear discriminant analysis during dimensionality reduction, so that the within-class distance becomes as small as possible and the dimensionality is reduced effectively.

(2) To address the imbalance in the number of biological samples across subcellular sites, this thesis adopts a K-nearest neighbor algorithm with membership in classes. First, the within-class idea is introduced so that a sample is not mistakenly assigned to another class merely because its own class has few samples (which therefore are not selected among the first K neighbors). Second, membership is used to strengthen the relationship between data attributes and so improve classification.

(3) The experiments use two data sets (Gram-negative and Gram-positive proteins) and the Jackknife test. The results show that the work of this thesis clearly improves the accuracy of protein classification. For example, when the Gram-negative data are reduced to 7 dimensions, the classification accuracy is basically stable at 89%.

Finally, based on the above research, this thesis presents a prototype system that provides dimensionality reduction and prediction functions, which is convenient for practical application in the future.
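The class-wise K-nearest-neighbor idea in item (2) and the Jackknife test in item (3) can be illustrated with a minimal sketch. The abstract does not give the thesis's membership formula, so the inverse-mean-distance membership used here is an assumption, and the function names `classwise_knn_predict` and `jackknife_accuracy` and the toy data are hypothetical. The noise-ratio weighting of LDA from item (1) is not reproduced. The Jackknife test is implemented as leave-one-out validation, which is its usual meaning in this field.

```python
import numpy as np


def classwise_knn_predict(X_train, y_train, x, k=3):
    """Classify x by looking at its k nearest neighbors inside each class.

    Because neighbors are searched per class, a small class is never
    crowded out of the neighbor list by a large one.
    """
    scores = {}
    for label in np.unique(y_train):
        members = X_train[y_train == label]
        # Distances from x to the k closest samples of this class only
        # (fewer than k if the class is smaller than k).
        dists = np.sort(np.linalg.norm(members - x, axis=1))[:k]
        # ASSUMPTION: membership is taken as the inverse of the mean
        # distance; the thesis's actual membership function may differ.
        scores[label] = 1.0 / (dists.mean() + 1e-12)
    total = sum(scores.values())
    memberships = {label: s / total for label, s in scores.items()}
    best = max(memberships, key=memberships.get)
    return best, memberships


def jackknife_accuracy(X, y, k=3):
    """Leave-one-out (Jackknife) accuracy of the class-wise classifier."""
    correct = 0
    for i in range(len(X)):
        keep = np.arange(len(X)) != i  # hold out sample i
        pred, _ = classwise_knn_predict(X[keep], y[keep], X[i], k)
        correct += int(pred == y[i])
    return correct / len(X)


# Hypothetical, imbalanced toy data: six samples in class 0, three in class 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [0.2, 0.8],
              [5, 5], [5, 6], [6, 5]], dtype=float)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])

label, memberships = classwise_knn_predict(X, y, np.array([4.5, 4.5]), k=2)
accuracy = jackknife_accuracy(X, y, k=2)
print(label, memberships, accuracy)
```

In a full pipeline the feature vectors would first be projected to a lower dimension (7 for the Gram-negative data in the abstract) before this classifier is applied.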
Keywords/Search Tags:Protein subcellular localization, Noise and noise intensity, Membership degree, Linear discriminant analysis, K-nearest neighbor classifier