Research On Feature Extraction Algorithms Of Membrane Protein Classification

Posted on: 2009-05-01
Degree: Master
Type: Thesis
Country: China
Candidate: B Jiang
Full Text: PDF
GTID: 2178360278456821
Subject: Biomedical engineering
Abstract/Summary:
Since its launch in the early 1990s, the Human Genome Project (HGP) has achieved a great deal through the combined efforts of scientists around the world, and genomics and proteomics have advanced considerably alongside it. With the unprecedented growth of biological data, a large amount of amino acid sequence information is now available, and the number of membrane protein sequences in protein databases is also increasing rapidly. As one of the main components of biomembranes, membrane proteins play a vital role in organisms. Although the phospholipid bilayer forms the basic framework of the biomembrane, membrane proteins carry out most of its functions and provide the material basis for cells to perform their various functions. Moreover, recent research indicates that changes in the structure or function of some membrane proteins are closely related to the onset of human diseases, and the relevant receptor membrane proteins have become important targets for drug design. Therefore, predicting the type of a membrane protein from its primary sequence, in order to obtain information about its higher-order structure and function, is an important and challenging research problem.
However, in the post-genomic era, determining membrane protein types by molecular biology experiments is time-consuming and costly given the enormous amount of sequence information. Such experiments may also run into difficulties that cannot currently be solved, so they are increasingly unable to meet practical needs.
It is therefore increasingly important to develop new bioinformatics tools and to design effective, reliable computational methods that extract feature information from the primary sequences of membrane proteins and support further study of their higher-order structures and functions. This is a key task of bioinformatics in the post-genomic era.
Feature extraction from membrane protein sequences is a basic problem in computational protein classification and a key factor in classification performance. Starting from the primary sequences of membrane proteins, this thesis studies the classification of membrane protein structures and functions, proposes two new feature extraction algorithms, and tests and analyses them on standard datasets. The main work and contributions of this thesis are summarized as follows.
(1) Organizing and constructing membrane protein datasets. For the membrane protein classification problem, this thesis collects relevant standard datasets from the major international public databases and from a large body of published literature, so that the analysis and comparison in the subsequent experiments are scientifically fair. The construction criteria of the standard datasets are then analysed in order to build a more complete dataset from the updated data in Swiss-Prot.
(2) Predicting membrane protein types is fundamental research in the study of membrane protein structure and function, and also provides guidance for related research in biology.
For membrane protein classification, focusing on the sequence correlation among amino acid residues, this thesis uses the k-substring source of diversity method to extract features from membrane proteins, and constructs a new classification model that combines the smallest increment of diversity approach with the weighted-KNN algorithm. Under three typical test methods (self-consistency, jackknife and independent dataset), the prediction accuracy is 99.95%, 86.16% and 98.36% respectively on the standard membrane protein datasets CE2059 and CE2625. The experimental results demonstrate the usefulness of the k-substring source of diversity method for extracting feature information from membrane proteins, and the model achieves a higher overall classification accuracy than existing models.
(3) To obtain a classification model with better prediction accuracy and to mine as much structural and functional information as possible from membrane protein sequences, this thesis further considers the physicochemical properties of amino acid residues and the long-range correlation between them. To predict the membrane protein type, it introduces the position information of amino acids in the sequence and calculates correlation coefficients over multiple amino acid indices, and then constructs a novel classification model that combines these two feature classes with the support vector machine (SVM) algorithm. Under the same three test methods (self-consistency, jackknife and independent dataset), the prediction accuracy is 98.25%, 88.10% and 95.62% respectively on the standard membrane protein datasets CE2059 and CE2625.
Compared with existing models, the combined-feature SVM model shows a marked improvement in prediction performance, indicating that the combined-features approach captures the feature information in membrane protein sequences more fully and that the resulting classification model performs well.
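As a rough illustration of the two feature families described in (2) and (3), the following Python sketch counts overlapping k-substrings, computes an increment of diversity between two k-substring sources, assigns a query sequence to the class with the smallest increment, and computes lagged correlations of an amino acid index along a sequence. It is not the thesis's implementation. It assumes the commonly used definitions D(X) = N ln N - sum_i n_i ln n_i and ID(X, Y) = D(X + Y) - D(X) - D(Y); the Kyte-Doolittle hydropathy scale stands in for whichever amino acid indices the thesis used; all function names are hypothetical; and the weighted-KNN and SVM stages are omitted.

```python
import math
from collections import Counter

# Kyte-Doolittle hydropathy values, used here only as an example amino acid
# index (an assumption; the thesis does not name its indices in the abstract).
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def k_substring_counts(seq, k):
    """Count the overlapping substrings of length k in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def diversity(counts):
    """Measure of diversity D(X) = N ln N - sum_i n_i ln n_i."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(
        n * math.log(n) for n in counts.values() if n > 0
    )


def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y); smaller means more similar sources."""
    return diversity(x + y) - diversity(x) - diversity(y)


def predict_smallest_id(query, class_counts, k):
    """Return the class label whose pooled k-substring counts give the
    smallest increment of diversity with the query sequence."""
    q = k_substring_counts(query, k)
    return min(
        class_counts,
        key=lambda label: increment_of_diversity(q, class_counts[label]),
    )


def index_correlation(seq, index, max_lag):
    """Lagged autocovariance of an amino acid index along the sequence,
    one value per lag 1..max_lag (residues missing from the index are skipped)."""
    values = [index[a] for a in seq if a in index]
    if not values:
        return [0.0] * max_lag
    mean = sum(values) / len(values)
    features = []
    for lag in range(1, max_lag + 1):
        pairs = len(values) - lag
        if pairs <= 0:
            features.append(0.0)
            continue
        cov = sum(
            (values[i] - mean) * (values[i + lag] - mean) for i in range(pairs)
        ) / pairs
        features.append(cov)
    return features
```

In this sketch, class_counts would map each membrane protein type to the k-substring counts pooled over its training sequences. A jackknife test would remove each sequence's counts from its own class before predicting it. The index_correlation vector would be concatenated with composition features before being passed to a classifier such as an SVM.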
Keywords/Search Tags:proteomics, bioinformatics, membrane protein, feature extraction, k-substring, source of diversity, weighted-amino acid composition, amino acid index, correlation coefficient