Optimization Algorithm Of The Maximal Information Coefficient And Its Application In Bioinformatics

Posted on: 2021-03-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D Cao
GTID: 1480306518488334
Subject: Bioinformatics
Abstract/Summary:
Accurate measurement of the correlation between paired variables is the cornerstone of data mining and machine learning. Different correlation measurements suit different types of variable pairs Y-X. The correlation between unordered-unordered pairs can be measured by the χ²-score or mutual information I, unordered-ordered pairs can be measured by the t-score or F-score, and linearly related ordered-ordered variables can be measured by R². However, the χ²-score, I, t-score and F-score are all unbounded measurements, and the significance of the χ²-score, t-score and F-score depends on the degrees of freedom, so their application is limited when the sample distribution is unknown. Although R² is normalized to [0, 1], it cannot capture nonlinear correlation. The maximal information coefficient (MIC) is normalized to [0, 1] and can measure both linear and nonlinear correlation between different types of paired variables. However, as a computationally intensive method, MIC is difficult to estimate. In the AppMIC algorithm proposed by Reshef, the empirical constraint on the maximum number of bins is nx × ny < n^0.6, which may lead to low statistical power and false correlation in small samples. The ChiMIC algorithm developed earlier in our lab controls the number of bins only by a χ² test along the optimization direction, which improves statistical power but does not prevent an excessive number of segments along the equipartitioned direction. In this work, we first developed BackMIC, an optimized estimation algorithm for MIC, and then applied it to feature clustering and feature selection. The main results are as follows:

MIC optimization estimation algorithm BackMIC. Both AppMIC and ChiMIC are based on the premise that one direction is equipartitioned. However, equipartition is neither sufficient nor necessary for MIC estimation. BackMIC was therefore proposed to optimize the estimation of MIC. By replacing the constraint nx × ny < n^0.6 with a χ² test and adding a backtracking process, BackMIC achieves control of the number of bins in both directions and allows unequal partitions in both directions. Compared with three other algorithms on simulated datasets, BackMIC produced more reasonable grid partitions, more accurate MIC estimates, and better statistical power and equitability. The pairwise correlations among 357 variables in the WHO dataset showed that BackMIC can obtain larger MIC values with fewer bins and may give more reasonable explanations for the grid partition, while having a lower false-positive rate and higher sensitivity.

Co-expression network construction based on Pearson and BackMIC for the identification of oncogenes. Weighted Gene Co-expression Network Analysis (WGCNA) constructs gene co-expression modules from gene expression data and identifies key genes according to the correlation between gene modules and phenotypes and the internal connectivity of the modules. The basic assumption of WGCNA is that "genes with similar expression patterns are also functionally similar"; it is an R-type clustering method. The classic WGCNA (denoted WGCNA-P) uses the Pearson correlation coefficient to measure the linear correlation between the expression levels of two genes, and so fails to capture the nonlinear correlation that may exist widely between genes. Considering that the statistical power of BackMIC is lower than that of the Pearson correlation coefficient in some specific linear cases, a new weighted co-expression network construction method, WGCNA-P+M, was developed based on both the Pearson correlation coefficient and BackMIC. A comparison of the two network construction methods on two real datasets, GSE44861 and LIHC, showed that: 1) When the "usefulness" score (U) is used to evaluate module enrichment, WGCNA-P+M has a higher U value, indicating that the modules obtained by WGCNA-P+M are more biologically meaningful. 2) WGCNA-P classified more genes as "invalid genes" in the Grey module. However, GO functional enrichment analysis showed that these genes were significantly enriched in GO terms related to cancer, implying that WGCNA-P may lose some informative genes by ignoring the nonlinear correlation between genes. 3) The top hub gene obtained by WGCNA-P+M has better prediction performance on four classifiers (SVC, DT, RF, KNN). 4) Comparing the survival analysis results and literature reports for the hub genes obtained by the two methods shows that more of the hub genes obtained by WGCNA-P+M are significantly related to overall survival in cancer and have been reported to be related to cancer. In short, the co-expression network based on WGCNA-P+M is more reasonable, and its ability to recognize oncogenes is stronger.

Weighted feature selection based on BackMIC. Feature selection is the key to supervised learning. Redundancy between features is widespread, and the commonly used mRMR algorithm has some disadvantages: relevance and redundancy are not comparable, and the redundancy of a feature subset is simplified to a mean value. In this work, a new algorithm, MICFS-W (BackMIC-based Weighted Feature Selection), is developed, which measures relevance and redundancy with BackMIC and assigns different weights to redundancy according to the relevance between each selected feature and the categorical variable Y. In a comparison of MICFS-W with MIFS, MIFS-U, mRMR and NMIFS, 5-fold cross-validation results on four classifiers showed that MICFS-W obtains higher prediction accuracies with fewer features and has the highest average prediction accuracy.

Optimal feature subset selection considering paired interactions based on BackMIC. In biological data, there are cases in which X1 or X2 alone has nothing to do with phenotype Y, while the interaction of X1 and X2 is closely related to Y. The aforementioned MICFS-W considers only feature redundancy when ranking feature importance, and cannot automatically stop introducing features to obtain the optimal feature subset directly. Here, we first transformed paired features into a single interaction feature according to |X1-X2|; then, based on BackMIC and a redundancy sharing strategy, we developed a new optimal feature subset selection method, BackMIC-Share. The 5-fold cross-validation results on three complex disease datasets with four classifiers showed that the average prediction accuracy of BackMIC-Share was better when paired interactions were considered than when they were not. Moreover, literature reports confirmed that most of the selected interaction genes are closely related to tumorigenesis. Feature interactions should be considered in feature selection.
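
The remark on paired interactions can be illustrated with a minimal Python sketch. It is not BackMIC or BackMIC-Share: as an assumption made only for illustration, it uses plain mutual information on toy data that is already discrete, normalized by log2 of the smaller number of bins as in the MIC definition, and a phenotype that depends only on whether X1 and X2 differ. All function and variable names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information I(X;Y) in bits between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def normalized_mi(xs, ys):
    """I(X;Y) / log2(min(#bins of X, #bins of Y)); 0.0 if either is constant."""
    k = min(len(set(xs)), len(set(ys)))
    if k < 2:
        return 0.0
    return mutual_information(xs, ys) / math.log2(k)


# Toy data (assumption): every (X1, X2) combination occurs equally often,
# and the phenotype Y depends only on whether X1 and X2 differ.
x1 = [0, 0, 1, 1] * 25
x2 = [0, 1, 0, 1] * 25
y = [int(a != b) for a, b in zip(x1, x2)]

# Single interaction feature built from the pair, as |X1 - X2|.
interaction = [abs(a - b) for a, b in zip(x1, x2)]

score_x1 = normalized_mi(x1, y)
score_x2 = normalized_mi(x2, y)
score_interaction = normalized_mi(interaction, y)

print(score_x1, score_x2, score_interaction)
```

On this toy data each single feature scores 0 against Y, while the interaction feature |X1-X2| scores 1, which is the situation the abstract describes. Real expression data would first need a grid partition, which is the step BackMIC optimizes and which this sketch leaves out.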
Keywords/Search Tags:MIC, Gene expression profile, WGCNA, Oncogene, Feature selection, mRMR, Interaction