The Study On Extensively Applicable Gene Set Analyses For Different Data Characters

Posted on: 2020-04-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Q Li
GTID: 1360330614450714
Subject: Biomedical engineering
Abstract/Summary:
Gene set analysis (GSA) is a commonly used multivariate analysis method. Its statistical power is usually higher than that of conventional univariate analysis, so it is widely used in functional enrichment analysis, molecular pathway analysis, and the identification of differentially expressed genes. In recent years, gene set analysis has been used to analyze not only differences in gene expression but also changes in the correlations between genes. Because it improves testing efficiency and helps explain biological phenomena, its range of applications keeps growing. However, most gene set analysis methods adopt either a "self-contained" or a "competitive" test without fully accounting for the fact that the two kinds of test place distinct requirements on the characteristics of the data. In addition, different gene set analysis methods make assumptions about variance, sample size, or other data characteristics. When these assumptions do not match the actual data, the false positive rate of existing methods is no longer controlled. To address the limited applicability and high false discovery rate of existing gene set analysis methods, this paper constructs a self-contained and competitive integrated analysis (SCIA) method based on a prior biological network.

To address the high false positive rate of competitive tests on data with high inter-gene correlations, this paper constructs a self-contained test statistic C and verifies its performance through a series of simulation experiments. On simulated data with correlation coefficients of 0-0.9, the method controlled the false discovery rate at 5%. Its average sensitivity (about 0.35) was not lower than that of ROAST (limma software package) or other self-contained methods (about 0.35), and it was more stable, without fluctuating greatly as the correlation coefficients changed. On data with changes in inter-gene correlations and a sufficient sample size, the method controlled the false discovery rate at 5%, and its sensitivity could be more than twice that of ROAST or other methods. These results show that the method effectively avoids the shortcomings of competitive tests and provides a basis for the subsequent algorithm that integrates self-contained and competitive tests.

To address the high false discovery rate of self-contained tests on data with a high proportion of differentially expressed genes, this paper integrates the self-contained statistic C with new competitive tests by means of a prior biological network to construct the SCIA method, and verifies its performance through a series of simulation experiments. On data with 20%-60% differentially expressed genes, the false discovery rate of SCIA (0.09-0.16) was not higher than that of GSEA and other competitive tests (0.12-0.15), while its sensitivity (0.76-0.82) was significantly higher than that of the other methods (0.53-0.58). On data with inter-gene correlation coefficients of 0-0.9, the method correctly controlled the false positive rate (less than 0.05), as did GSEA and the other methods (less than 0.05). The false positive rate (about 0.1-0.4) was amplified when the correlation was close to 1, and the sensitivity of this method (about 0.25-0.3) was slightly higher than that of the other methods (about 0.2-0.3). These results show that the method avoids the shortcomings of competitive and self-contained tests at the same time. Because it relies on prior biological networks, its power can improve gradually as those networks improve.

Because existing methods cannot handle data with a very small sample size (e.g. n=2), this paper constructs the SCIA-AFC method based on the adjusted fold-change (AFC) method and validates its performance on a series of simulated and real data sets. In the simulation experiments, the sensitivity of the AFC method was about 50% higher than that of the traditional FC method when the false detection rate was below 5%. On real data, the consistency of the AFC method with the gold standard was more than 60%, which is also higher than that of the traditional FC method (about 40%). These results show that the AFC method is better suited than the FC method to the analysis of high-dimensional, small-sample data.

Finally, the SCIA method was used to analyze the expression profiles of two sets of lung squamous cell carcinomas and two sets of miR-1 transfection experiments. First, the method obtained more than 61 GO annotations and KEGG functional pathways that are supported by the existing literature, of which only 7 could be found by the traditional hypergeometric test and the GSEA method. Second, the method obtained more than 40% consistent results when analyzing data of the same type, while the hypergeometric test and GSEA obtained only about 10%-20% consistent results. Third, when different target gene prediction databases were used to analyze the data from the miR-1 transfection experiments, the consistency of the results also exceeded 50%. These results show that the results obtained by this method are accurate and differ from those of traditional methods, that the method can effectively supplement existing gene set analysis methods, and that prior biological information can be used selectively to reduce the impact on the method of false positives in known biological information.

In summary, based on the integration of self-contained and competitive tests, this paper proposes a widely applicable gene set analysis method that can correctly process expression profile data with different characteristics and can also provide in-depth biological interpretation of the results. The method not only obtains a large number of accurate and novel results but also effectively supplements existing methods. It also reduces the impact of different data and different prior biological information on the results, so that the results of different studies can be further integrated. The R language software package "SCIA" based on this research can be downloaded free of charge from GitHub: https://github.com/Yiqun Li HIT/SCIA.
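
The difference between self-contained and competitive tests that the abstract relies on can be illustrated with a minimal Python sketch. This is not the SCIA method, the statistic C, or any published implementation: the function names (`gene_scores`, `self_contained_p`, `competitive_p`), the mean-difference score, the permutation schemes, and the toy data are all assumptions chosen for illustration. The self-contained test permutes sample labels and asks whether the set shows any differential expression; the competitive test resamples genes and asks whether the set is more differentially expressed than the rest.

```python
import random
from statistics import mean

def gene_scores(expr, labels):
    """Per-gene score: absolute difference of the two group means (rows = genes)."""
    scores = []
    for row in expr:
        group1 = [x for x, lab in zip(row, labels) if lab == 1]
        group0 = [x for x, lab in zip(row, labels) if lab == 0]
        scores.append(abs(mean(group1) - mean(group0)))
    return scores

def set_stat(scores, gene_set):
    """Set-level statistic: mean score of the genes in the set."""
    return mean(scores[i] for i in gene_set)

def self_contained_p(expr, labels, gene_set, n_perm=500, seed=0):
    """Self-contained test: permute sample labels, look only at genes in the set."""
    rng = random.Random(seed)
    sub = [expr[i] for i in gene_set]          # genes outside the set are ignored
    idx = list(range(len(sub)))
    observed = set_stat(gene_scores(sub, labels), idx)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if set_stat(gene_scores(sub, shuffled), idx) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def competitive_p(expr, labels, gene_set, n_perm=500, seed=0):
    """Competitive test: compare the set with random gene sets of the same size."""
    rng = random.Random(seed)
    scores = gene_scores(expr, labels)
    observed = set_stat(scores, gene_set)
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(range(len(scores)), len(gene_set))
        if set_stat(scores, random_set) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy data (hypothetical): 20 genes, 6 control + 6 treated samples.
# Every gene is shifted by the same amount, i.e. 100% of genes are
# differentially expressed and the chosen set is not special.
labels = [0] * 6 + [1] * 6
expr = [[g] * 6 + [g + 3] * 6 for g in range(20)]
gene_set = [0, 1, 2, 3, 4]

p_self = self_contained_p(expr, labels, gene_set)
p_comp = competitive_p(expr, labels, gene_set)
print("self-contained p:", p_self)
print("competitive p:", p_comp)
```

On this toy data the self-contained test flags the set because its genes do change, whereas the competitive test does not because the set is no more changed than the background. This is the behaviour behind the abstract's remark that self-contained tests produce many false discoveries when a high proportion of genes are differentially expressed.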
Keywords/Search Tags: Gene set analysis, Competitive method, Self-contained method, Gene set enrichment analysis