
Several Studies on Feature Selection Algorithms That Incorporate Pathway Information to Identify Relevant Genes

Posted on: 2019-12-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Y Tian
GTID: 1368330548962043
Subject: Probability theory and mathematical statistics
Abstract/Summary:
Feature selection is a powerful machine learning technique for tackling the high dimensionality of gene expression data, in which the number of genes is much larger than the number of samples. Unlike the classic feature selection algorithms, a large number of methods proposed in recent years incorporate the information contained in biological pathways to guide the selection of relevant variables. Such methods are dubbed pathway-based feature selection algorithms. Studies have demonstrated that pathway-based feature selection algorithms are superior to their gene-based counterparts in terms of predictive ability, model stability and biological interpretation.

Depending on what is regarded as a feature, pathway-based feature selection algorithms may be classified into three categories, namely pathway analysis, bi-level selection and pathway-based gene selection. In contrast to gene selection and pathway analysis, bi-level selection selects not only important gene sets but also important genes within those gene sets. According to the order of selection, bi-level selection methods fall into three categories: forward selection, which first selects relevant gene sets and then selects relevant individual genes; backward selection, which takes the reverse order; and simultaneous selection, which performs the two tasks at the same time, usually with the aid of a penalized regression model. Pathway-based gene selection methods may be classified into stepwise forward, weighting and penalty methods according to how the pathway information is incorporated. Stepwise forward methods start from one gene, e.g., the most significantly differentially expressed one, and then add genes gradually, evaluating some statistic at each step, until no further gain in this statistic can be obtained. Weighting methods create a pathway knowledge-based weight for each gene and then combine these weights with the test statistics or the expression values directly to select relevant genes. In penalty methods, an additional penalty term accounting for the pathway structure is added to the objective function being optimized.

Here, we focus on several pathway-based feature selection methods that we have developed: a hybrid method that combines pathway information-based weights with a stepwise forward method called significance analysis of microarray-gene set reduction (SAMGSR) (Dinu et al. 2009); a new means of weighting that combines pathway knowledge-based weights with the gene expression profiles directly; and two bi-level selection algorithms that extend an existing method, the Cox-filter method (Tian et al. 2015), to identify subtype-specific prognostic genes. This work is divided into three parts.

First, we enable the SAMGSR algorithm to account for not only gene membership information but also gene topology information. Although SAMGSR is itself a pathway-based feature selection algorithm, it considers only gene membership information and completely ignores the topological information between genes. To address this, we propose a weighted version of SAMGSR, the weighted SAMGSR algorithm. The algorithm constructs weights from gene-to-gene interaction information and then combines the weights of the genes with their test statistics to select the relevant genes. Using simulations and real-world applications, we have demonstrated that the weighted SAMGSR algorithm is superior to the original SAMGSR algorithm. Knowledge of gene connectivity therefore provides additional useful information for the selection of relevant genes.

Second, we introduce another weighting approach that combines weights directly with gene expression values to generate weighted gene expression values. The selection of relevant genes is then carried out using these weighted gene expression values and a well-known regularized regression model, LASSO. This means of weighting not only incorporates pathway information successfully but also makes it possible to reuse many traditional feature selection algorithms. Using simulated and real-world data, we have demonstrated that weighted gene expression profiles are usually superior to the original gene expression profiles.

Lastly, we propose to use the sign average method (Eng et al. 2013) to produce a pseudo gene that represents the expression level of a whole gene set, and then to apply the Cox-filter algorithm (Tian et al. 2015) successively at the pathway level and the gene level to screen genes and pathways. In the Cox-filter algorithm, a Cox model with patients' survival time as the dependent variable is fitted for each gene to examine the degree of association of the specific gene, the disease subtypes, and their interaction with survival time. Our extensions, referred to herein as the forward Cox-filter algorithm and the backward Cox-filter algorithm, not only select subtype-specific prognostic genes but also incorporate pathway information to facilitate feature selection. Using simulations and non-small cell lung cancer gene expression data, we have demonstrated that the forward Cox-filter algorithm outperforms the backward Cox-filter algorithm and other relevant algorithms in terms of predictive capacity and model stability.

In summary, pathway-based feature selection algorithms deserve more attention and further exploration. Below, we introduce the major results of our studies in detail.

1. The weighted SAMGSR algorithm

The original SAMGSR algorithm consists of two steps. The first step uses significance analysis of microarray-gene set (SAMGS) to select important pathways. In SAMGS, the following functional score is defined:

$$\mathrm{SAMGS}_j = \sum_{i=1}^{|j|} d_i^2, \qquad d_i = \frac{\bar{x}_d(i) - \bar{x}_c(i)}{s(i) + s_0} \qquad (1)$$

where $d_i$ is the SAM statistic (Tusher et al. 2001), calculated for each gene in gene set $j$, and $\bar{x}_d(i)$ and $\bar{x}_c(i)$ are the sample averages of gene $i$ for the diseased and control groups, respectively. The parameter $s(i)$ is a pooled standard deviation estimated by pooling all samples together, $s_0$ is a small positive constant used to offset the small variability in microarray expression measurements, and $|j|$ is the number of genes within gene set $j$. A gene set's significance is estimated using a permutation test in which the class labels are permuted.

For each significant gene set identified by SAMGS, the additional reduction step of SAMGSR orders the genes inside the set $S$ by decreasing magnitude of $d_i$ and, for $k = 1, \ldots, |j|$, partitions $S$ into two subsets: the reduced subset $R_k$, which includes the first $k$ genes, and the residual subset $\bar{R}_k$. The significance level of $\bar{R}_k$ is then evaluated. That is, letting $c_k$ be the SAMGS p-value of $\bar{R}_k$, the iteration stops when $c_k$ exceeds a pre-determined threshold for the first time.

The SAMGSR algorithm does not take topology information into account, so all genes are treated as exchangeable and assigned identical weights. This increases the possibility that a driver gene with a subtle change is mistakenly filtered out. To tackle this drawback of SAMGSR, we propose to combine a weight constructed from connectivity information with the SAMGS statistic. Specifically, for the $G$ genes under consideration, a $G \times G$ adjacency matrix is defined. Its $ij$ component $a_{ij}$ equals 1 if genes $i$ and $j$ are connected, and 0 otherwise. The connectivity weight for gene $i$ and the weighted SAMGS statistic are then defined from this matrix (equations 2 and 3, not reproduced here). In the weighted SAMGSR algorithm, the weighted SAMGS statistics replace their original counterparts in both pathway selection and individual gene selection.

Using two simulations, one multiple sclerosis gene expression dataset and one non-small cell lung cancer gene expression dataset (covering one two-class classification application and one multiple-class classification application), we have demonstrated that the weighted SAMGSR algorithm is superior to the SAMGSR algorithm and other relevant algorithms in terms of predictive ability and model stability.

2. The weighted gene expression profiles

The weighted expression profiles are obtained by combining weights with gene expression values. Specifically, the weighted expression values for gene $i$ are obtained by multiplying the original expression values by $(1 + \mathrm{weight}_i)^a$. The value of $a$ is set to 0.2 here, although it may vary from application to application.

Three different weights are considered. Weight 1 is the number of gene sets to which the specific gene belongs. Weight 2 is the number of genes connected to a given gene according to the protein-protein interaction information retrieved from a canonical pathway knowledgebase. For weight 3, the Pearson correlation coefficients between gene pairs are calculated using the training dataset, and two genes are regarded as connected if the absolute value of their correlation coefficient is greater than a pre-determined cut-off. Weight 3 is then the number of genes connected to a given gene.

We compare these three weights using simulations and non-small cell lung cancer gene expression data. Because weight 3 is calculated from data on the specific disease under study, it usually outperforms the other two weights and the unweighted profiles.

3. Two extensions to the Cox-filter method

In the Cox-filter method, a Cox model is fitted for each gene, and the hazard function of patient $i$ for gene $g$ ($g = 1, \ldots, p$) at time point $t$ is given by

$$\lambda_{ijg}(t) = \lambda_{0g}(t) \exp\left[\beta_{1g} I(j = c_2) + \beta_{2g} X_{ijg} + \beta_{3g} I(j = c_2) \times X_{ijg}\right] \qquad (4)$$

Here, $X_{ij} = (X_{ij1}, \ldots, X_{ijp})^T$ represents the expression values of the $p$ genes under consideration, and $\lambda_{0g}(t)$ is an unknown baseline hazard function at time point $t$. $I(j = c_2)$ is an indicator taking the value 1 if patient $i$ belongs to group 2. Both $\beta_{2g}$ and $\beta_{3g}$ are parameters of interest: $\beta_{2g} \neq 0$ but $\beta_{2g} + \beta_{3g} = 0$ corresponds to a subtype I-specific gene, and $\beta_{2g} + \beta_{3g} \neq 0$ but $\beta_{2g} = 0$ corresponds to a subtype II-specific gene.

This method does not account for any pathway information. To tackle this drawback, we combine the sign average method with the Cox-filter method and propose two bi-level selection algorithms, the forward Cox-filter method and the backward Cox-filter method. To use a sign average to represent the expression level of a specific gene set, all genes are classified into either a hazardous group or a preventive group according to the signs of $\beta_{2g}$ or $\beta_{2g} + \beta_{3g}$ in the Cox-filter models. The sign average $Z_{ijk}$ for patient $i$ of subtype $j$ in gene set $k$ is then calculated as

$$Z_{ijk} = \frac{\sum_{l \in H_{kj}} X_{ijl} - \sum_{l \in P_{kj}} X_{ijl}}{|H_{kj}| + |P_{kj}|} \qquad (5)$$

where $H_{kj}$ and $P_{kj}$ denote the hazardous and preventive genes of gene set $k$ for subtype $j$. At the pathway level, the Cox model becomes

$$\lambda_{ijk}(t) = \lambda_{0k}(t) \exp\left[\beta_{1k} I(j = c_2) + \beta_{2k} Z_{ijk} + \beta_{3k} I(j = c_2) \times Z_{ijk}\right] \qquad (6)$$

The forward Cox-filter algorithm first selects the important pathways by fitting the Cox model in eq. 6 for each pathway, and then selects the important genes by fitting the Cox model in eq. 4 for each gene. The backward Cox-filter algorithm selects genes and pathways in the opposite order. Using two simulations and one non-small cell lung cancer gene expression dataset, we have demonstrated that the forward Cox-filter method outperforms other relevant algorithms, namely the backward Cox-filter method, the Cox-filter method, the COX-TGDR method and separate analyses of the individual subtypes using LASSO.
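
As an illustration only, two of the building blocks described above can be sketched in a few lines of NumPy: the correlation-based weight 3 with the weighted expression values, and the sign average of eq. 5. This is a minimal sketch under stated assumptions, not the implementation used in the dissertation. The function names, the samples-by-genes array layout, and the boolean `hazardous` mask are all hypothetical choices made for this example, and every gene in a set is assumed to have been classified as either hazardous or preventive.

```python
import numpy as np


def correlation_degree_weights(X, cutoff):
    """Weight 3 (sketch): for each gene, count the genes whose absolute
    Pearson correlation with it exceeds `cutoff`.

    X is a (samples x genes) training matrix; this layout is an assumption.
    """
    r = np.corrcoef(X, rowvar=False)       # genes x genes correlation matrix
    connected = np.abs(r) > cutoff         # boolean adjacency matrix
    np.fill_diagonal(connected, False)     # a gene is not its own neighbour
    return connected.sum(axis=1)


def weighted_expression(X, weights, a=0.2):
    """Multiply each gene's expression values by (1 + weight_i) ** a.

    The default a = 0.2 is the value quoted in the text.
    """
    weights = np.asarray(weights, dtype=float)
    return X * (1.0 + weights) ** a


def sign_average(X_set, hazardous):
    """Sign average of eq. 5 for the patients of one subtype (sketch).

    X_set is (patients x genes in the gene set); `hazardous` is a boolean
    mask over those genes, and the remaining genes are taken as preventive,
    so |H| + |P| equals the number of columns of X_set.
    """
    hazardous = np.asarray(hazardous, dtype=bool)
    signs = np.where(hazardous, 1.0, -1.0)  # +1 hazardous, -1 preventive
    return X_set @ signs / X_set.shape[1]
```

Because the hazardous and preventive groups in eq. 5 depend on the subtype, `sign_average` would be called once per subtype with that subtype's patients and mask. The resulting pseudo-gene values could then be passed to any Cox regression routine to fit the pathway-level model of eq. 6.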
Keywords/Search Tags: Feature selection, pathway information, gene expression, classification and prognosis, significance analysis of microarray (SAM) statistic, Cox model, weighting