
The Research And Application Of Feature Selection Algorithms In Mass Spectrometry Based Metabolomics Data

Posted on: 2012-04-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhang
GTID: 2218330368487762
Subject: Computer application technology
Abstract/Summary:
As an important part of systems biology, metabolomics is attracting growing interest in the study of life processes. Because metabolomics focuses on the end products of these processes, it directly reflects the metabolic disturbances caused by external factors (infection, medication, surgery, etc.) and internal factors (disease, ageing, etc.), so metabolomics studies can be applied to diagnosing metabolic disturbances in the human body. Biological data are usually characterized by very high dimensionality and a large number of noisy features, so extracting the most crucial information, the information that reflects the underlying regularities of the biological problem, has become a bottleneck in metabolomics studies. Data mining techniques capture the characteristics of the data by building models, which helps in interpreting and analyzing the data. Feature selection algorithms are effective at discovering, from high-dimensional data, the most representative features that describe the distribution of samples. Feature selection is therefore necessary to interpret metabolomics data properly and to identify the most important metabolites. Estimation of distribution algorithms (EDAs) are evolutionary algorithms based on probability models. Owing to their strong performance on optimization problems and the interpretability of their models, EDAs have received increasing interest in recent years. Building on research into and application of EDAs for feature selection problems, this thesis proposes L-EDA, an estimation of distribution algorithm with limited solution capacity. In L-EDA, the capacity of a candidate solution is limited to a small number, and the best candidate solutions to the problem are highlighted.
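To make the idea of an EDA for feature selection concrete, the following is a minimal sketch and not the thesis's L-EDA. It assumes that "capacity" means the number of features in a candidate subset, and it uses a plain PBIL-style update toward the elite frequencies, since the abstract does not specify the probability model in detail. The function `capped_eda`, the toy fitness function, and all parameter values are hypothetical.

```python
import random


def capped_eda(n_features, fitness, max_size=5, pop_size=40, n_best=8,
               n_iter=30, lr=0.3, seed=0):
    """Univariate EDA over feature subsets whose size is capped at max_size.

    Assumption: a PBIL-style update is used here; the thesis's own
    updating strategy is not reproduced.
    """
    rng = random.Random(seed)
    # Start from a uniform model: every feature is equally likely to be picked.
    probs = [max_size / n_features] * n_features
    best_subset, best_score = None, float("-inf")
    for _ in range(n_iter):
        population = []
        for _ in range(pop_size):
            # Sample a subset from the current probability model.
            chosen = [j for j in range(n_features) if rng.random() < probs[j]]
            if len(chosen) > max_size:
                chosen = rng.sample(chosen, max_size)  # enforce the size cap
            if not chosen:
                chosen = [rng.randrange(n_features)]   # avoid empty subsets
            population.append((fitness(chosen), sorted(chosen)))
        # Keep only the best candidate solutions of this generation.
        population.sort(key=lambda item: item[0], reverse=True)
        elite = population[:n_best]
        if elite[0][0] > best_score:
            best_score, best_subset = elite[0]
        # Move each probability toward the feature's frequency in the elite,
        # then clip so that no feature becomes impossible or certain.
        for j in range(n_features):
            freq = sum(j in subset for _, subset in elite) / n_best
            probs[j] = (1 - lr) * probs[j] + lr * freq
            probs[j] = min(max(probs[j], 0.02), 0.98)
    return best_subset, best_score, probs


# Hypothetical toy problem: features 3, 7 and 11 are informative,
# and every selected feature carries a small cost.
informative = {3, 7, 11}


def toy_fitness(subset):
    return len(informative & set(subset)) - 0.1 * len(subset)


subset, score, probs = capped_eda(20, toy_fitness)
print(subset, round(score, 2))
```

In a real metabolomics setting, the toy fitness would be replaced by something like the cross-validated accuracy of a classifier trained on the chosen features. The learned `probs` vector can then be read as a ranking of how relevant each feature appears to be.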
Based on the selected best solutions, a probability model updating strategy with a global baseline is proposed. This makes the updating of the probability model more accurate and makes the algorithm more efficient at discovering the features most related to the problem. The other part of the work is a backward feature elimination method named F-SVM. First, the F statistic from analysis of variance and the support vector machine weight are combined to filter noisy features out of the full feature set. Then the most discriminative features are identified by iteratively building support vector machine models and evaluating the features that remain in the feature set. In a study of epithelial ovarian cancer (EOC) recurrence after surgery, L-EDA, unlike the traditional algorithms it was compared with, succeeded in filtering out irrelevant factors such as chemotherapy and radiotherapy and identified 5 metabolites that reflect the symptoms of EOC; the selected metabolites could assist the clinical diagnosis and treatment of the disease. F-SVM was applied to metabolomics data of liver diseases, where it discovered 22 differential metabolites related to various pathways that could help diagnose liver diseases in clinical use. In the experiments, a K-fold feature selection model was adopted to validate F-SVM's ability both to discover differential features in high-dimensional data and to differentiate samples.
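The two-stage idea behind F-SVM can be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation. The abstract does not say how the F value and the SVM weight are combined, so the sketch simply sums their ranks. The second stage uses SVM-RFE-style elimination, dropping the feature with the smallest absolute weight in each round. The linear SVM is a small Pegasos-style trainer without a bias term, labels are assumed to be -1 or +1, and all function names and parameters are hypothetical.

```python
import random


def f_statistic(values, labels):
    """One-way ANOVA F value of a single feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand = sum(values) / n
    between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
    within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    # The small constant guards against division by zero for constant groups.
    return (between / (k - 1)) / (within / (n - k) + 1e-12)


def train_linear_svm(X, y, cols, lam=0.01, epochs=50, seed=0):
    """Pegasos-style subgradient descent on the hinge loss (no bias term)."""
    rng = random.Random(seed)
    w = [0.0] * len(cols)
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * sum(wj * X[i][c] for wj, c in zip(w, cols))
            w = [(1 - eta * lam) * wj for wj in w]          # regularization step
            if margin < 1:                                   # hinge loss is active
                w = [wj + eta * y[i] * X[i][c] for wj, c in zip(w, cols)]
    return w


def ranks(scores):
    """Rank position of each score (0 = lowest)."""
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    result = [0] * len(scores)
    for pos, j in enumerate(order):
        result[j] = pos
    return result


def f_svm_select(X, y, n_filter, n_keep):
    cols = list(range(len(X[0])))
    # Stage 1 (assumed combination): keep the n_filter features with the
    # highest sum of F-value rank and |SVM weight| rank.
    f_vals = [f_statistic([row[c] for row in X], y) for c in cols]
    w = train_linear_svm(X, y, cols)
    f_rank, w_rank = ranks(f_vals), ranks([abs(v) for v in w])
    cols = sorted(cols, key=lambda c: f_rank[c] + w_rank[c], reverse=True)[:n_filter]
    # Stage 2: retrain on the surviving features and drop the weakest one.
    while len(cols) > n_keep:
        w = train_linear_svm(X, y, cols)
        worst = min(range(len(cols)), key=lambda j: abs(w[j]))
        cols.pop(worst)
    return sorted(cols)


# Hypothetical toy data: features 0 and 1 depend on the class, the rest are noise.
rng = random.Random(1)
y = [1 if i % 2 == 0 else -1 for i in range(40)]
X = [[3 * label + rng.gauss(0, 0.5) if c < 2 else rng.gauss(0, 1) for c in range(8)]
     for label in y]
selected = f_svm_select(X, y, n_filter=4, n_keep=2)
print(selected)
```

To validate a procedure like this with K folds, the selection would be run inside each training fold only, so that the held-out samples never influence which features are chosen.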
Keywords/Search Tags: Metabolomics/Metabonomics, Feature Selection, Estimation of Distribution Algorithms, Support Vector Machine