Analysis On Risk Prediction Model For Complex Diseases And Data Mining On Genetic Variants

Posted on: 2020-06-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Xu
GTID: 1364330614467820
Subject: Internal Medicine
Abstract/Summary:
In the era of precision health and wellness, risk prediction for complex diseases has become a crucial research area with potentially far-reaching significance for human health. Developing disease risk prediction models based on genetic susceptibility markers is currently the most promising strategy for achieving it. Meanwhile, cancer treatment is at the forefront of precision medicine, and personalized cancer treatment has become a recognized trend. The continuous discovery of cancer susceptibility loci and targeted drug sites has made it possible to build a more reliable database of cancer susceptibility genes and genetic variant loci, which is the key to personalized cancer treatment. On this basis, we carried out the following work.

First, we took 1572 European-American samples as the research object and selected the most significant SNPs from association analysis to predict the smoking behavior of the samples. The prediction model effectively distinguished smoking samples from non-smoking samples, with an AUC of 0.91. However, when the same predictions were made using the 34 susceptibility SNP loci collected from existing GWAS reports, the model failed to distinguish smoking samples from non-smoking samples. We therefore believe that it is feasible to use genetic markers found by GWAS for disease risk prediction, but that the population must first be divided according to its physiological and geographical distribution, and a specific genetic marker set must be sought for each population through genome-wide association analysis. The disease risk prediction algorithm can then be applied to accurately predict the risk of complex diseases or traits.

Second, we applied different machine learning approaches (SVM, random forest (RF), logistic regression and LASSO regression) to establish a reliable predictive model of smoking behavior with the 1572 European-American samples and 3371 African-American (AA) samples. For the data set of this study, the support vector machine model outperformed the random forest model under every parameter setting; the model based on logistic regression was more stable than the one based on LASSO regression and also generalized better; and predictive models built on African-American samples could not be validated in European-American samples. We then combined logistic regression (P < 0.01), LASSO regression (λ = 10^-3) and SVM to construct a smoking behavior prediction model with 500 SNPs, which achieved an AUC of 0.897 on the independent AA test set. We hope that the successful establishment of this machine learning prediction model can provide a reference for risk prediction studies of other complex diseases or phenotypes.

Third, we mined 249 susceptibility genes and 3,074 pathogenic genetic variants related to 20 common cancers from public databases. The pathogenic variants were classified according to a defined standard, and a variant database was then constructed. The database covers almost all common cancers and includes a comprehensive set of cancer-associated susceptibility genes and genetic variants. It makes data screening more efficient for researchers and medical workers, is of value to genetic research on cancer, and can provide guidance for genetic testing, targeted drug development and drug selection in personalized cancer treatment.

In summary, we demonstrate the feasibility of developing disease risk prediction based on genome-wide association studies, and we successfully construct a risk prediction model for complex diseases or traits based on machine learning approaches. In addition, the cancer genetic variant database we constructed is expected to play an important role in the field of precision medicine.
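The abstract reports model quality as AUC (0.91 and 0.897). As a reading aid, the following minimal Python sketch shows how an AUC can be computed from predicted scores: it is the fraction of (smoker, non-smoker) pairs in which the smoker receives the higher score, with ties counted as half. The function name `auc` and the toy labels and scores are illustrative assumptions; they are not the dissertation's code or data.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney identity).

    labels: 1 for a positive sample (e.g. smoker), 0 for a negative sample.
    scores: model outputs, where higher means "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): three smokers, three non-smokers.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(auc(toy_labels, toy_scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why a model that "failed to distinguish" the two groups would be expected to score near 0.5.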
Keywords/Search Tags:risk prediction, genome-wide association analysis, susceptible genetic variation, machine learning, cancer, genetic variation database