Prediction Of Cancer Driver Mutations Based On Protein Sequences

Posted on: 2016-06-06
Degree: Master
Type: Thesis
Country: China
Candidate: F Ye
Full Text: PDF
GTID: 2308330461491661
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Bioinformatics applies methods from mathematics, information science, statistics and computer science to biological problems. Cancer bioinformatics focuses on understanding cancer biology from an informatics perspective. It is now widely accepted that human cancer is a disease involving dynamic changes in the genome, and that missense mutations constitute the bulk of human genetic variation. The number of missense mutations identified in cancer genomes has increased greatly as a consequence of technological advances and the reduced cost of next-generation sequencing. However, a high proportion of the amino acid substitutions detected in cancer genomes have little or no effect on tumor progression; these are known as passenger mutations. The remainder, known as driver mutations, contribute substantially to the occurrence and development of tumors. Identifying driver mutations is critically important for understanding the molecular mechanisms of cancer development and progression, which in turn allows more targeted and effective treatment of patients. Many methods have been proposed to address this problem; this work mainly uses machine learning algorithms.

First, the protein sequences must be encoded. Many methods for extracting features from protein sequences have been developed, so feature encoding can draw on a wide range of resources. Using information on physicochemical properties, structure, function and evolutionary properties, we extract features based on the 2-gram encoding method, the 6-letter exchange group method, amino acid residue changes and amino acid residue substitution scores. The more representative the extracted features are, the more accurate the results will be.

Second, the feature vectors obtained by these methods are usually high-dimensional and redundant. Feature selection is a key data pre-processing step in pattern recognition, and its results directly affect classification accuracy and generalization performance. We use the least absolute shrinkage and selection operator (Lasso) to select driver mutation features. Lasso is a feature selection method based on the optimal solution of an L1-norm penalized regression. In addition, we compute a weight for each feature and then select a feature subset by adding features sequentially according to their weights. Lasso not only accurately selects the features that are strongly related to the class label, but also shows comparatively good stability.

Finally, we train models on the selected feature subset using machine learning algorithms. Random forest, rotation forest, support vector machine (SVM) and extreme learning machine (ELM) classifiers were used to predict driver mutations. We compare the performance of these classifiers with one another and with classifiers reported in other articles.
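
As an illustration of the sequence-encoding step described in the abstract, the following is a minimal sketch of 2-gram features computed over both the 20-letter amino acid alphabet and a 6-letter exchange group alphabet. The abstract does not list the exchange groups, so the grouping below (the one conventionally used in the literature) is an assumption, as are all function names, the group letters `a` to `f`, and the choice to normalize counts into frequencies.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# ASSUMPTION: conventional exchange groups; the thesis may define them differently.
EXCHANGE_GROUPS = {
    "a": "HRK",
    "b": "DENQ",
    "c": "C",
    "d": "STPAG",
    "e": "MILV",
    "f": "FYW",
}
RESIDUE_TO_GROUP = {
    residue: group
    for group, members in EXCHANGE_GROUPS.items()
    for residue in members
}


def to_exchange_groups(sequence):
    """Map each residue to its exchange group letter ('x' if unknown)."""
    return "".join(RESIDUE_TO_GROUP.get(residue, "x") for residue in sequence)


def two_gram_features(sequence, alphabet):
    """Return the frequency of every ordered pair of adjacent symbols."""
    pairs = ["".join(p) for p in product(alphabet, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    for i in range(len(sequence) - 1):
        gram = sequence[i:i + 2]
        if gram in counts:  # pairs containing unknown symbols are skipped
            counts[gram] += 1
    total = max(len(sequence) - 1, 1)  # avoid division by zero
    return [counts[p] / total for p in pairs]


def encode(sequence):
    """Concatenate 400 amino acid 2-gram and 36 exchange group 2-gram features."""
    return (
        two_gram_features(sequence, AMINO_ACIDS)
        + two_gram_features(to_exchange_groups(sequence), "abcdef")
    )
```

Calling `encode` on a protein sequence string yields a vector of 436 values under these assumptions. The residue-change and substitution-score features mentioned in the abstract are not covered by this sketch.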
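
For reference, the Lasso estimate referred to in the abstract is usually written as the L1-penalized least-squares problem below. The notation is an assumption and is not taken from the thesis: $X$ is the $n \times p$ feature matrix, $y$ the vector of class labels (driver or passenger), $\beta$ the coefficient vector, and $\lambda \ge 0$ the penalty strength.

```latex
\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^{p}}
  \left\{ \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \right\},
\qquad
\lVert \beta \rVert_1 = \sum_{j=1}^{p} \lvert \beta_j \rvert
```

Larger values of $\lambda$ drive more coefficients exactly to zero, which is what makes Lasso usable for feature selection. One plausible reading of the per-feature weight mentioned in the abstract is $\lvert \hat{\beta}_j \rvert$, but the abstract does not specify this.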
Keywords/Search Tags: driver mutation, feature code, feature selection, random forest, rotation forest