
The Research On Data Mining Methods For Single Nucleotide Polymorphisms Data And Its Application

Posted on: 2016-03-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Li
GTID: 1368330473467148
Subject: Computer Science and Technology
Abstract/Summary:
The goal of next-generation sequencing technology and genome-wide association analysis is to identify the association patterns between genetic variation, epigenetic modification and complex disease, so that personalized medicine can be carried out on the basis of individual genetic information. Genetic diseases can be divided into single-gene disorders and complex diseases. Single-gene disorders follow Mendelian inheritance laws: changes at important loci in a single gene disrupt gene function and result in genetic disease. In contrast, complex diseases such as cancer and diabetes arise from interactions between SNPs in multiple genes, and their pathogenesis at the molecular level remains unknown. Because SNPs are extremely numerous and widely distributed, genome-wide SNP data have become a class of biomarkers for identifying pathogenic genes related to complex diseases. Linkage disequilibrium (LD) between SNPs means that these data contain a great deal of redundant information. The analysis of complex diseases based on genome-wide SNP data therefore usually consists of two main steps. First, a representative subset is selected from a large number of SNPs in a small sample, in order to reduce redundancy and noise. Second, pathogenic genes are identified from the candidate SNPs in large-scale samples. In this study, we design optimization algorithms and data mining techniques for the analysis of SNP datasets of complex diseases. The main contents of the study are as follows:

1) Informative SNP selection methods based on combinatorial optimization algorithms. To address the challenges of candidate subset construction and genotype reconstruction in informative SNP selection, an intelligent optimization algorithm and an exact algorithm are used separately to design two kinds of methods. The first method applies an optimization criterion that combines two-locus and multilocus LD measures to construct the objective function of Max-Correlation and Min-Redundancy (MCMR). A greedy algorithm then selects the candidate set of informative SNPs under this objective function. Because MCMR optimizes the LD between SNPs, the selection results are highly interpretable and repeated reconstruction is avoided. The second method designs a nearest means classifier (NMC) to avoid repeatedly and exhaustively rebuilding the predictor. NMC directly optimizes the reconstruction accuracy and applies an ant colony algorithm to search the combinatorial space. Although NMC ignores linkage disequilibrium, which is an important biological phenomenon, it is suitable for both genotype and haplotype datasets. The experimental results show that these two types of informative SNP selection strategies are applicable to different situations and that each has certain advantages.

2) Tag SNP selection method based on a multiple ant colony algorithm (MACA). Unlike informative SNP selection, which is measured by reconstruction accuracy, tag SNP selection, which is used in haplotype association studies, is evaluated by haplotype coverage. MACA designs a multiple ant colony algorithm framework to search loci combinations at different granularities. The main idea is that a larger granularity saves running time, while a smaller granularity selects a smaller set of tag SNPs. To improve the search ability, we design a heuristic function with three heuristic factors (coverage, repeatability, margins). Experiments on both simulated and real datasets validate the advantages of the method in the number of tags and in running time.

3) A method for unifying informative SNPs and tag SNPs based on kernel SNP selection. Informative SNPs are selected from all SNPs by information measures, whereas tag SNPs are selected according to haplotype diversity. As a result, the subsets produced by informative SNP selection and by tag SNP selection differ significantly, which confuses biology researchers. In this study, a kernel SNP selection method based on hierarchical clustering (KSHC) is proposed. KSHC first applies relative entropy reduction to formulate the distance measure between clusters. Hierarchical clustering is then conducted to group highly correlated SNPs. After that, the kernel SNPs are selected from every cluster through a top-rank or backward elimination scheme. The basic idea of KSHC is that hierarchical clustering incorporates both information gain and haplotype diversity, so that the proposed approach can achieve unification. Using these kernel SNPs, extensive experimental comparisons are conducted against informative SNPs on haplotype reconstruction accuracy and against tag SNPs on haplotype coverage. The results indicate that kernel SNPs can practically unify informative SNPs and tag SNPs and are therefore adaptable to various applications.

4) Pathogenic gene identification method based on maximum consistency and maximum discrepancy (MCMD). Unlike traditional methods, which focus only on the discrepancy between cases and controls, MCMD guarantees not only the maximum discrepancy between cases and controls but also the maximum consistency among cases. MCMD assumes that, in the ideal situation, cases with the same disease hold the same pathogenic genes, although there may be several pathogenic barcodes among the cases because of the heterogeneity of complex diseases. We therefore assume that the disease pattern among cases should remain stable. Based on this assumption, a greedy algorithm is applied to analyze the epistasis related to breast cancer. After that, we apply an ant colony algorithm to look for multiple pathogenic barcodes in different epistatic combinations of genes and carry out a heterogeneity analysis of breast cancer.
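As a rough illustration of the greedy Max-Correlation and Min-Redundancy idea described in item 1), the following Python sketch selects SNPs one at a time. It is not the dissertation's implementation. The abstract does not give the MCMR objective, so this sketch assumes a simplified score that uses only pairwise squared correlation (r^2) as the LD measure, where the original combines two-locus and multilocus measures. The names `pairwise_r2` and `greedy_mcmr`, and the form "relevance minus redundancy", are hypothetical choices made for this example.

```python
import numpy as np


def pairwise_r2(genotypes):
    """Squared Pearson correlation between SNP columns.

    Assumption: r^2 on 0/1/2 genotype codes stands in for the LD measure;
    the dissertation's combined two-locus/multilocus measure is not given.
    """
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.corrcoef(genotypes, rowvar=False)
    # Monomorphic SNPs have zero variance and produce NaN; treat as no LD.
    return np.nan_to_num(corr) ** 2


def greedy_mcmr(genotypes, k):
    """Greedily pick up to k SNP indices (hypothetical simplified objective).

    score(j) = mean r^2 of j with all other SNPs      (relevance)
             - mean r^2 of j with already chosen SNPs (redundancy)
    """
    r2 = pairwise_r2(np.asarray(genotypes, dtype=float))
    n_snps = r2.shape[0]
    np.fill_diagonal(r2, 0.0)  # ignore self-correlation
    relevance = r2.sum(axis=1) / max(n_snps - 1, 1)

    selected = []
    remaining = list(range(n_snps))
    while remaining and len(selected) < k:
        best, best_score = None, -np.inf
        for j in remaining:
            redundancy = r2[j, selected].mean() if selected else 0.0
            score = relevance[j] - redundancy
            if score > best_score:  # ties keep the lowest index
                best, best_score = j, score
        selected.append(best)
        remaining.remove(best)
    return selected
```

Calling `greedy_mcmr(X, k)` on a samples-by-SNPs array `X` of 0/1/2 genotype codes returns the chosen column indices in selection order. With this score, a SNP in strong LD with an already selected SNP is penalized, so the selection tends to spread across LD blocks rather than pick several SNPs from the same block.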
Keywords/Search Tags: Complex diseases, GWAS, SNP, Data mining, Intelligent algorithm, Systems biology, Epistasis