Investigation Of SNP Loci And Reliability Analysis Based On Machine Learning And Multilayer Networks

Posted on: 2018-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: D H Wang
GTID: 2370330566498761
Subject: Applied Mathematics
Abstract/Summary:
As genetic sequencing technology becomes more sophisticated, biotechnologists are increasingly concerned with the correlation between genes and diseases. The genome-wide association study (GWAS) method has made great progress in studying this correlation and has found loci that significantly influence phenotype at the molecular level. Because most diseases are multi-gene diseases and adjacent genetic variants may be in linkage disequilibrium, it is difficult to determine the causal relationship between genotype and phenotype. This thesis combines GWAS with machine learning and multilayer networks to improve the reliability of identified pathogenic genetic variants. It studies whole-genome data for Hepatitis B Virus and breast cancer by GWAS. For Hepatitis B Virus, an eXtreme Gradient Boosting (xgboost) model was built to analyze single nucleotide polymorphisms (SNPs). For breast cancer, the xgboost algorithm was used to analyze tumor markers, filter out significant molecular markers, and build a multilayer network model. The thesis contains the following two parts:

(1) We use whole-genome data to analyze drug resistance in the Chinese Hepatitis B Virus population and select a group of SNP loci with significant P values. The xgboost algorithm was then used to further analyze these loci and select the combination of loci that affects Hepatitis B Virus, and the selected SNP loci were verified with GWAS. We found that rs12576054 on the KCNQ1-AS1 gene is a new locus for the Chinese Hepatitis B Virus population.

(2) For the breast cancer data, the xgboost algorithm was used to analyze tumor markers such as genes, miRNA and proteins, yielding three groups of significant molecular markers. The maximal information coefficient was used to measure the strength of the relationships between genes, miRNA and proteins, and a multilayer network including SNP loci was constructed with a threshold of 0.6. Based on the degree and clustering coefficient of the multilayer network, we found that the network structure in normal tissue is denser than in tumor tissue, which indicates that the SNP loci in tumor tissue have influenced the network pathways that regulate proteins, through the expression of oncogenes or the mutation of tumor suppressor genes. By comparing the connected subnetworks of tumor tissue and normal tissue, we found that rs11257188 interacts with the 14.3.3_zeta node in tumor tissue and interdicts the Bax node in normal tissue through the PFKFB3 gene, which is significantly overexpressed in most tumor tissue. We confirmed that genes with large degree in the normal-tissue subnetwork are significantly overexpressed or underexpressed in tumor tissue and normal tissue. The protein nodes with larger degree that are connected to the gene layer are likewise significantly overexpressed or underexpressed in both tumor tissue and normal tissue. PRC1, EBF1 and TGFBR2, which were selected through this method, are associated with breast cancer, confirming that the method can screen effectively for pathogenic genes and increase the reliability of SNP loci.
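The network step described in the abstract (keep pairwise associations at a threshold of 0.6, then compare degree and clustering coefficient) can be illustrated with a minimal sketch. This is not the thesis code: the node names and association scores below are invented placeholders, the scores are assumed to be precomputed (for example, maximal information coefficient values), and treating the threshold as inclusive (`>=`) is an assumption, since the abstract does not say.

```python
from itertools import combinations

THRESHOLD = 0.6  # cutoff stated in the abstract; inclusive comparison is an assumption


def build_network(scores, threshold=THRESHOLD):
    """Build an undirected adjacency map from pairwise association scores.

    `scores` maps (node_a, node_b) to a precomputed association strength.
    Pairs below the threshold contribute their nodes but no edge.
    """
    adjacency = {}
    for (a, b), score in scores.items():
        adjacency.setdefault(a, set())
        adjacency.setdefault(b, set())
        if score >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def degree(adjacency, node):
    """Number of edges attached to `node`."""
    return len(adjacency[node])


def clustering_coefficient(adjacency, node):
    """Fraction of neighbour pairs of `node` that are themselves connected."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return 2.0 * links / (k * (k - 1))


# Hypothetical scores across SNP, gene, miRNA and protein layers (made-up values).
toy_scores = {
    ("snp_1", "gene_A"): 0.72,
    ("gene_A", "gene_B"): 0.65,
    ("gene_A", "protein_P"): 0.81,
    ("gene_B", "protein_P"): 0.70,
    ("mirna_X", "gene_B"): 0.40,   # below threshold, so no edge
    ("mirna_X", "protein_P"): 0.61,
}

network = build_network(toy_scores)
for name in sorted(network):
    print(name, degree(network, name), round(clustering_coefficient(network, name), 3))
```

Running the same functions on a tumor-tissue score table and a normal-tissue score table would give the per-node degree and clustering values that the abstract compares between the two networks.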
Keywords/Search Tags: genome-wide association study, xgboost algorithm, multilayer network, reliability