
Epistasis Network and Machine Learning Methods for the Analysis of Biological Large Dat

Posted on: 2019-01-18
Degree: Ph.D
Type: Thesis
University: The University of Tulsa
Candidate: Parvandeh, Saeid
Full Text: PDF
GTID: 2448390005472040
Subject: Bioinformatics
Abstract/Summary:
This thesis develops epistasis network methods and applies them in two settings. An epistasis network is a gene network whose edges represent statistical interactions between genes that influence an outcome variable, such as depression status or immune response. We used epistasis network feature selection in two ways. In the first part, we developed a novel analysis workflow to predict influenza vaccine response, using reGAIN edges to identify pairs of genes that help predict immune response. In the second part, we incorporated prior knowledge into centrality-based feature selection on the reGAIN network to predict major depressive disorder.

We developed a multi-stage machine learning strategy to build a predictive model of vaccine response using pre-vaccination antibody levels and pre-vaccination gene expression levels. The first step uses a nonlinear regression model to predict day 28 antibody response from pre-vaccination antibody levels. This model explains a significant amount of the variation in post-vaccination response, especially for subjects with high pre-existing antibody levels. For individuals with low pre-vaccination antibody levels, however, a large amount of variation in post-vaccination antibody response remains, and it may be explained by differences in baseline gene expression levels. We therefore used Gaussian mixture modeling to cluster subjects into low, medium, and high responders based on pre-vaccination titers, and then applied a reGAIN gene interaction network feature selection algorithm that finds the best pairs of genes whose co-expression is associated with antibody response within each titer cluster. We used ratios of these gene pairs as predictors in a penalized regression model of antibody response, fitted separately for each cluster of responders. We trained and tested the algorithm on three publicly available data sets of individuals immunized with influenza vaccine. We provide the analysis strategy as an R Shiny application.

We developed two new epistasis-expression centrality methods that incorporate prior knowledge about gene interactions. The first extends our SNPrank (EpistasisRank) method by incorporating a non-constant (gene-wise) prior knowledge vector from the Integrative Multi-species Prediction (IMP) database. The second extends Katz centrality to epistasis-expression networks and generalizes the Katz bias factor to a gene-wise interaction prior knowledge vector. We compare pathway enrichments and nested cross-validation accuracies with and without prior knowledge in epistasis-network centrality feature selection. Using microarray studies of major depressive disorder, we find that including prior knowledge in co-expression and epistasis network centrality improves the enrichment of genes in biologically relevant pathways and improves testing classification accuracy.

Finally, we implemented a nested cross-validation (nested CV) method for feature selection and parameter tuning. We used the ReliefF feature selection method to rank feature importance in the inner folds, and the caret library to tune the classifier parameters. We compare nested CV with the recent private evaporative cooling (private EC) method. Our results suggest that nested CV overfits less than private EC on simulated data but more on real data. Both nested CV and private EC yield similar accuracies.
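The Katz extension described above, in which the scalar bias factor becomes a gene-wise prior knowledge vector, can be illustrated with a small sketch. This is not the thesis implementation (which the abstract associates with R tooling); it is a minimal Python illustration that assumes a symmetric weighted adjacency matrix and the standard Katz fixed point x = alpha * A x + beta, with beta supplied per gene. The function name `katz_with_prior`, the toy network, and the prior values are all hypothetical.

```python
def katz_with_prior(adj, prior, alpha=0.1, tol=1e-10, max_iter=1000):
    """Katz-style centrality with a per-node bias vector (illustrative sketch).

    Iterates x <- alpha * A x + prior until the largest change falls below tol.
    Assumes adj is a symmetric square matrix (list of lists) and that alpha is
    smaller than 1 / (largest eigenvalue of adj), so the iteration converges.
    """
    n = len(adj)
    x = list(prior)
    for _ in range(max_iter):
        new_x = [
            alpha * sum(adj[i][j] * x[j] for j in range(n)) + prior[i]
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_x, x)) < tol:
            return new_x
        x = new_x
    raise RuntimeError("did not converge; alpha may be too large for this network")


# Hypothetical 3-gene path network: gene 1 interacts with genes 0 and 2.
adj = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]

# Constant bias: ranking is driven by network structure alone.
uniform = katz_with_prior(adj, [1.0, 1.0, 1.0])

# Gene-wise bias: a larger (made-up) prior weight on gene 0.
informed = katz_with_prior(adj, [2.0, 1.0, 1.0])

print("uniform prior: ", uniform)
print("gene-wise prior:", informed)
```

With a constant bias the central gene ranks highest; giving gene 0 a larger prior weight moves it to the top, which is the kind of re-ranking a gene-wise prior is meant to allow. How the thesis scales the prior or treats main effects is not stated in the abstract, so those choices are left out here.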
Keywords/Search Tags:Epistasis network, Method, Nested CV, Private EC, Feature selection, Prior knowledge, Pre-vaccination antibody levels, Large