
Development And Implementation Of The Software For Managing Imputed Genotype Data

Posted on: 2021-01-18
Degree: Master
Type: Thesis
Country: China
Candidate: J X Yang
GTID: 2480306050473074
Subject: Master of Engineering

Abstract:
Genome-wide association study (GWAS) is an important research method in human genetics that seeks associations between heritable human variations and diseases or traits. The variations studied are mostly single nucleotide polymorphisms (SNPs). Genotype imputation is an essential technique in GWAS: it increases the SNP density of the genotype data, expands the data set, and reduces information loss. Imputing against the same reference panel gives genotype data from different studies the same set of SNPs, so that results from multiple studies can be combined in a meta-analysis. Imputed genotype data is usually large and comes in a variety of formats, and it requires preprocessing and quality control before analysis. In addition, different GWAS analysis programs require different genotype data formats, so software that manages and manipulates the data is needed to perform format conversion, data processing, and quality control on imputed genotype data. Existing software is not comprehensive: some important functions are missing, and some functions are inefficient. There is therefore an urgent need for feature-rich, efficient software that manages imputed data in the major formats.

The software for managing imputed genotype data presented in this thesis is written in C++ and runs on the Linux platform. Its functions derive from needs in GWAS practice. The functions to implement were determined after surveying imputed genotype data in commonly used formats and analyzing the parameters relevant to quality control. This thesis analyzes the structure and contents of files in three commonly used imputed data formats (Impute, Minimac, and VCF) and describes the technical roadmap for the format conversion, data processing, and quality control functions.

First, format conversion is implemented, starting from data reading and writing. Next, a variety of data processing functions are implemented, including extracting, merging, and deleting data by the ids of SNPs or samples. The key point in these functions is that the ids of SNPs or samples must be cross-referenced before the data is merged. This usually requires either loading all the data into memory or reading the data from the hard disk repeatedly, which is the efficiency bottleneck of most existing software. This thesis instead first stores in memory the locations of the SNPs or samples within the files, then compares and outputs the data according to these locations, which are called indices. This substantially improves merging efficiency by reducing memory usage and hard disk reads at the same time. In addition, a string matching algorithm based on the C++ map is used for matching, which reduces the algorithm complexity from O(n^2) to O(n) and greatly improves matching efficiency.

Finally, this thesis implements the commonly used quality control functions of GWAS, including calculating quality indicators of the imputed genotype data and filtering SNPs or samples according to a given threshold on those indicators. To save hard disk space, most imputed genotype data is compressed in gz format. This thesis uses the zlib library to read and write gz files directly, so the format conversion, data processing, and quality control functions can all operate on gz data without a separate decompression step.

After these functions were implemented, small data sets were used to test the accuracy of each function, and large data sets were then used to test the stability and efficiency of the software. Through comparison with fcGENE, GTOOL, DosageConvertor, QCTOOL, and other similar software, the algorithm of each function was further optimized to reduce time complexity and improve operating efficiency. In the end, the design goal was achieved: software that manages imputed genotype data in various formats, with comprehensive functions and efficient implementations.
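
The index-based merging described in the abstract can be illustrated with a minimal C++ sketch. It is not the thesis implementation: the function names (`build_index`, `read_line_at`, `merge_by_id`), the whitespace-delimited layout with the id in the first field, and the use of `std::unordered_map` (chosen here for average constant-time lookups) are all assumptions made for illustration. Only ids and byte offsets are kept in memory, and each matching record is read from the second input once, by seeking to its stored offset.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: map each id (first whitespace-delimited field of a
// line) to the byte offset at which that line starts. Only ids and offsets
// are held in memory, not the genotype values themselves.
std::unordered_map<std::string, std::streampos> build_index(std::istream& in) {
    std::unordered_map<std::string, std::streampos> index;
    std::string line;
    while (true) {
        std::streampos pos = in.tellg();      // offset of the upcoming line
        if (!std::getline(in, line)) break;
        if (line.empty()) continue;
        index.emplace(line.substr(0, line.find_first_of(" \t")), pos);
    }
    in.clear();                               // clear eof/fail so seekg works later
    return index;
}

// Hypothetical helper: read the single line that starts at a stored offset.
std::string read_line_at(std::istream& in, std::streampos pos) {
    in.clear();
    in.seekg(pos);
    std::string line;
    std::getline(in, line);
    return line;
}

// Merge records that share an id: stream through `a` once, look each id up
// in the index of `b`, and append b's data fields to a's line.
// Records whose id is absent from `b` are skipped.
std::vector<std::string> merge_by_id(std::istream& a, std::istream& b) {
    const auto index_b = build_index(b);
    std::vector<std::string> merged;
    std::string line;
    while (std::getline(a, line)) {
        if (line.empty()) continue;
        const std::string id = line.substr(0, line.find_first_of(" \t"));
        const auto hit = index_b.find(id);    // average O(1) per lookup
        if (hit == index_b.end()) continue;
        const std::string other = read_line_at(b, hit->second);
        const std::size_t sep = other.find_first_of(" \t");
        // Keep b's fields after the id, including the leading separator.
        merged.push_back(line + (sep == std::string::npos ? "" : other.substr(sep)));
    }
    return merged;
}
```

With in-memory streams standing in for files, merging `"rs1 0.1 0.9\nrs2 0.2 0.8\n"` with `"rs2 1.0\nrs1 2.0\n"` yields the lines `rs1 0.1 0.9 2.0` and `rs2 0.2 0.8 1.0`, in the order of the first input. The same functions accept `std::ifstream` objects; reading gz files directly, as the thesis does through zlib, would need a different reader and is outside this sketch.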
Keywords: genotype imputation, data management, algorithm optimization, genome-wide association study, single nucleotide polymorphism