Research Of Quality Control Methods For SNPs Based On Clustering Algorithm

Posted on: 2014-11-04  Degree: Master  Type: Thesis
Country: China  Candidate: Y L Sun  Full Text: PDF
GTID: 2250330425483701  Subject: Computer Science and Technology
Abstract/Summary:
Single nucleotide polymorphisms (SNPs) are widely used in biological research, serving as third-generation genetic markers. Genome-wide association studies (GWAS) use SNPs as genetic markers in case-control studies to detect and locate genes associated with complex diseases, providing evidence for disease diagnosis, individualized treatment, drug development, and so on. SNP quality is a key factor in GWAS. In practice, SNP data are prone to errors caused by hardware or software problems during the experiment, so quality control of SNPs is necessary.

The main work of this thesis is to find effective SNP quality control methods for GWAS. Three basic parameters measure SNP data quality: genotyping call rate, minor allele frequency (MAF), and Hardy-Weinberg equilibrium (HWE). The current quality control method is a "supervised" expert filter in which the thresholds for these parameters are set manually. To address this, the quality metrics are redefined to be more stringent, and two new quality control methods based on clustering algorithms are proposed.

(1) Quality control method based on a weighted fuzzy kernel clustering algorithm. An SNP dataset has several attributes, and these attributes contribute differently to the normal-SNP cluster and the noise-SNP cluster. The weighted fuzzy kernel clustering algorithm is used to separate normal SNPs from noise SNPs by computing the imbalance between attributes. Compared with other clustering methods, this algorithm is especially suitable for high-dimensional, non-spherical datasets. Results show that this method performs well.

(2) Quality control method based on the shared nearest neighbor (SNN) clustering algorithm. To cope with the high dimensionality of SNP datasets, SNP filtering is done in two steps. First, principal component analysis is used to reduce the data dimension and map the SNPs onto a two-dimensional plane. Second, SNN clustering is run on this plane. SNN can find clusters of different sizes, shapes, and densities in noisy datasets and detects noise SNPs automatically. Experimental results demonstrate the effectiveness of this method.
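
The three basic quality parameters named in the abstract (genotyping call rate, minor allele frequency, and HWE) can be illustrated with a minimal sketch of the conventional threshold-based filter. This is not the thesis's clustering method. The genotype encoding (0, 1, 2 copies of one allele, `None` for a failed call), the function names, the chi-square form of the HWE test, and the threshold values are all assumptions chosen for illustration; the abstract does not specify any of them.

```python
import math


def snp_qc_metrics(genotypes):
    """Return (call_rate, maf, hwe_p) for a single SNP.

    genotypes: list of 0, 1 or 2 (copies of allele B per sample),
    with None marking a failed genotype call. Encoding is an assumption.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    call_rate = n / len(genotypes) if genotypes else 0.0
    if n == 0:
        return call_rate, 0.0, float("nan")

    n_aa, n_ab, n_bb = called.count(0), called.count(1), called.count(2)
    p_b = (n_ab + 2 * n_bb) / (2 * n)      # frequency of allele B
    maf = min(p_b, 1.0 - p_b)              # minor allele frequency
    if maf == 0.0:
        # Monomorphic SNP: the HWE test is undefined.
        return call_rate, maf, float("nan")

    # Genotype counts expected under Hardy-Weinberg equilibrium.
    expected = [n * (1 - p_b) ** 2, 2 * n * p_b * (1 - p_b), n * p_b ** 2]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    hwe_p = math.erfc(math.sqrt(chi2 / 2.0))
    return call_rate, maf, hwe_p


def passes_expert_filter(genotypes, min_call_rate=0.95, min_maf=0.01,
                         min_hwe_p=1e-4):
    """Apply manually chosen thresholds (hypothetical values, not from the thesis)."""
    call_rate, maf, hwe_p = snp_qc_metrics(genotypes)
    # A NaN p-value compares as False, so undefined HWE tests fail the filter.
    return call_rate >= min_call_rate and maf >= min_maf and hwe_p >= min_hwe_p
```

Every threshold here has to be picked by hand, which is the manual step the two clustering-based methods aim to avoid.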
Keywords/Search Tags: Single nucleotide polymorphism, Genome-wide association study, Quality control, Clustering, Principal component analysis