Research On Technologies Of Causal Inference For Cancer Genomics Observational Data

Posted on: 2024-10-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y J Liu
Full Text: PDF
GTID: 1524307340476604
Subject: Computer Science and Technology
Abstract/Summary:
The analysis of cancer genomics data plays an essential role in understanding the complex molecular mechanisms of cancer and in driving the development of personalized treatment strategies. However, traditional statistical methods often struggle to reveal potential causal relationships, which limits researchers' in-depth understanding of cancer mechanisms. In contrast, causal inference techniques offer a way to infer causal relationships among variables from observational data, which can help identify key drivers and potential therapeutic targets of cancer. Nonetheless, researchers still face several challenges when applying causal inference techniques to cancer genomics data. Firstly, the high dimensionality and complexity of cancer data make it difficult to construct accurate and stable causal models. Secondly, confounders and heterogeneity in the data make it harder to distinguish true causal relationships from statistical associations. Thirdly, validating the biological significance of causal inference results remains difficult. Applying the results to downstream tasks, such as drug development and personalized therapy, could verify their effectiveness, but the complexity of targeted drug development and inter-patient heterogeneity pose a further challenge. Therefore, this thesis explores the potential of causal inference techniques in cancer genomics research from the perspective of these three challenges, with the aim of promoting a deeper understanding of cancer and advancing the development of precision medicine. The main research contents and contributions of this thesis are as follows:

(1) To address the difficulty of building an accurate and stable feature selection model for identifying cancer biomarker genes from high-dimensional and complex cancer omics data, a causality-inspired least angle nonlinear distributed (CLAND) feature selection method is proposed. Firstly, CLAND uses a class branch and a sample branch to simultaneously eliminate the confounding bias that imbalanced class distribution introduces into feature selection results. In the class branch, CLAND increases the influence of minority classes in the feature selection process through a resampling technique. In the sample branch, CLAND adjusts the weight of each sample based on its distance to samples of the same class and of different classes in the marginal vector feature space. Secondly, CLAND integrates the feature weights extracted from the two branches via a class-adaptive module and selects the features with the highest weights as the feature selection result. Experiments on six cancer datasets with different imbalance ratios show that the cancer biomarker genes identified by CLAND are biologically significant.

(2) To address the problem that ignoring confounders in cancer driver identification studies leads to failure to identify true causality accurately, a model for calculating the causal effect of a mutation on a cancer biological process (CEBP) is proposed. Because samples lack biological activity labels, CEBP first obtains the biological process activity of each sample by the core gene regression method. Secondly, CEBP, which is based on a deep variational autoencoder model, uses an encoder to learn a latent variable representation of observed and unobserved confounders, and a decoder to learn the generation process of the observations by inferring them from the latent variable space. With this representation of confounders, CEBP can estimate the average causal effect without confounding bias. Experiments on ten cancer datasets involving aberrant cell proliferation and epithelial-mesenchymal transition processes demonstrate the significant advantages of CEBP in identifying drivers of cancer biological processes.

(3) To address the challenge of validating the biological significance of causal inference results, this thesis applies these results to drug discovery and cancer precision medicine and constructs a comprehensive database of pharmaco-omics for cancer precision medicine (DBPOM), which is based on an extensive analysis of drug efficacy and side effects. Firstly, to assess the efficacy and safety of drugs precisely, RADE (Reversed and Adverse Drug Effect) is proposed to quantify the potency of a drug by measuring its effect on reversing and enhancing the expression of cancer-related genes. Secondly, a method for calculating the mutant genome similarity between cancer tissues and cancer cell lines is proposed, with the aim of inferring the potential efficacy and side effects of a drug for a specific patient from the drug response data of cell lines. Finally, DBPOM is developed as a web-based service platform. It provides query and comparison functions for detailed efficacy and side effect calculations of 19,406 small molecule compounds and drugs, and it offers clinicians and drug researchers online calculation and analysis services, such as analysis of the biological functions of drugs and formulation of precision medicine strategies, thus facilitating the optimization of cancer therapeutic strategies and the implementation of personalized medicine.
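
The role of confounders in contribution (2) can be illustrated with a minimal sketch. This is not CEBP: the abstract describes a variational autoencoder that learns latent confounders, whereas the example below assumes a single observed binary confounder and uses plain stratification (backdoor adjustment). All names and the toy data are hypothetical; the data are built so that the true effect of the mutation on process activity is 1.0 while the confounder adds 2.0.

```python
from collections import defaultdict
from statistics import mean


def naive_effect(records):
    """Difference in mean outcome between mutated and unmutated samples."""
    treated = [y for _, t, y in records if t == 1]
    control = [y for _, t, y in records if t == 0]
    return mean(treated) - mean(control)


def adjusted_effect(records):
    """Average causal effect by stratifying on the observed confounder z.

    Each stratum's treated-minus-control difference is weighted by the
    share of samples that fall in that stratum.
    """
    strata = defaultdict(list)
    for z, t, y in records:
        strata[z].append((t, y))
    total = len(records)
    effect = 0.0
    for rows in strata.values():
        treated = [y for t, y in rows if t == 1]
        control = [y for t, y in rows if t == 0]
        if not treated or not control:
            raise ValueError("stratum lacks treated or control samples")
        effect += (len(rows) / total) * (mean(treated) - mean(control))
    return effect


def make_toy_records():
    """Hypothetical (z, t, y) records: z = confounder, t = mutation, y = activity.

    Assumed data-generating rule: y = 1.0 * t + 2.0 * z, and the mutation
    is far more common when z = 1, which is what confounds the naive estimate.
    """
    records = []
    for z, n_control, n_treated in [(0, 8, 2), (1, 2, 8)]:
        records += [(z, 0, 2.0 * z)] * n_control
        records += [(z, 1, 2.0 * z + 1.0)] * n_treated
    return records


records = make_toy_records()
naive = naive_effect(records)
adjusted = adjusted_effect(records)
print(f"naive estimate:    {naive:.2f}")
print(f"adjusted estimate: {adjusted:.2f}")
```

On this toy data the naive difference overstates the effect because mutated samples are concentrated in the stratum with higher baseline activity, while the stratified estimate recovers the assumed effect. Stratification only works when the confounder is observed; handling unobserved confounders is what motivates a latent variable model such as the one the abstract describes.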
Keywords/Search Tags: Causal inference, Feature selection, Causal effect estimation, Cancer drug discovery, Cancer precision medicine