Research On Phenotype Prediction Method Of Whole-Genome Sequencing And Its System Construction

Posted on: 2018-10-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Tan
Full Text: PDF
GTID: 1310330515975126
Subject: Agricultural Electrification and Automation
Abstract/Summary:
Genomic Selection (GS), also called Genomic Prediction (GP), is a method that uses genetic markers across the whole genome to calculate Genomic Estimated Breeding Values (GEBV) and thereby evaluate individuals. Many GS methods exist, including methods based on Bayesian models (Bayes A, Bayes B, Bayes C, Bayesian Lasso) and methods based on the best linear unbiased prediction (BLUP) model (gBLUP, rrBLUP). Because the methods differ in their computations and underlying principles, their prediction performance on phenotypic traits differs greatly. Choosing a suitable prediction method is therefore one of the best ways to maximize prediction accuracy, and it gives breeders important reference breeding values.

This paper develops a new selection method called mMAP (Mining the genome Maximum Accuracy of Prediction), which brings together the currently popular GS methods. It runs predictions for traits of many species and uses cross-validation to obtain accuracy values, from which a reference knowledge base is built. For a new species trait, the optimal GS method is selected according to the knowledge base and then used to predict GEBV. The knowledge base currently holds accuracy values for more than 300 species traits, and it keeps growing as further GS methods are added to the library and the accuracy of new traits is predicted. The implementation and performance of mMAP are described below.

(1) For predicting new species traits, mMAP introduces data mining technology into genomic selection. Combining data mining with popular genomic prediction methods reduces the number of GP methods that have to be run while still identifying the one with the highest accuracy. Three knowledge bases were tested: one accumulated from real prediction data and two from simulated data, following a normal distribution and a random distribution respectively. For the three knowledge bases, the GS method with the highest accuracy was selected, with values above 92.27%, 93.40% and 90.2%, respectively.

(2) Prediction accuracy is verified by cross-validation. The individuals are randomly divided into 5 groups, so that 80% of them, with known phenotypes, serve as the training set to predict the remaining 20%, whose phenotypes are treated as unknown. This is repeated more than 100 times so that the estimated accuracy covers essentially all partitions and the error is reduced. The number of cross-validation groups can also be set to other values, such as 3, 5, 10 or 20. The accuracy value is stable after more than 100 repeats.

(3) Selecting the best GS method for a new trait. mMAP uses a convergence search based on the nearest and furthest distances to the cluster centers to find the traits best suited to the new trait, and from them the GS method. The cluster centers become relatively stable after the initial knowledge base has been clustered more than 100 times. A convergence search by distance to the cluster centers then adds temporary combinations to the knowledge base until several consecutive iterations yield no new value, which guards against chance results. In tests on real data, the best method for a new trait was found within a few seconds, in about 3 iterations.

(4) Each GS method has an independent implementation package. To make a prediction, only the GS method chosen by clustering and the species trait need to be passed to the package. The packages are independent of one another, and the clustering carried out in the main thread does not affect their results. This design improves security and shortens running time.

(5) mMAP is implemented with object-oriented technology and can run in multi-threaded mode. Using the Docker containerization technology, the Linux service package is built into a Docker container. The container is deployed in a web environment and serves breeders at any time and from any place in browser/server (B/S) mode. The platform can already be used remotely. It was presented at the Plant and Animal Genome conference in January 2017.

(6) mMAP makes computing with and inputting genomic data more flexible. It also provides an interface for converting raw high-throughput sequencing data, with the aim of reducing raw data storage space, shortening data transfer time and serving breeding work efficiently. It follows the current workflow: high-throughput sequencing reads (.fastq format) are aligned to a reference genome sequence (.fasta format) and processed with tools such as Bowtie, SAMtools and GATK; SNPs are then called and converted into genotype data in formats such as HapMap and numeric genotype for downstream analysis. Because raw high-throughput sequencing data are very large, the whole processing and conversion takes a long time and occupies a lot of storage space. To address these time and space problems, this research designs a phenotype data processing pipeline that improves efficiency. Its features are as follows: FASTA reference sequences are converted to HDF5, which occupies little space and speeds up alignment; to process the large amount of intermediate data efficiently, the paper designs a double-index file format that stores the differing loci in little file space; and because SNP and gene files occupy a lot of space, the paper introduces gzip compression, which achieves more than 50-fold compression. The compressed data can still be accessed and retrieved like the original files. To make the conversion more efficient, data are transformed directly in the compressed format, and many flexible operation interfaces are provided to facilitate subsequent analysis and statistics. The entire pipeline applies object-oriented technology in multi-threaded mode and can be integrated with interface services and with mMAP.

mMAP can be used to predict traits of many species and to choose the optimal GS method according to the configured conditions. The pipeline can convert large volumes of high-throughput genome sequencing data, handle large amounts of genomic data with parallel computing and related technology, and provide basic services for phenotype data analysis (such as Genome-Wide Association Study and GS).
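The repeated cross-validation described in (2), followed by choosing the method with the highest accuracy, can be sketched as below. This is a minimal illustration under stated assumptions, not the mMAP implementation: the two candidate predictors are toy stand-ins (they are not gBLUP, rrBLUP or the Bayesian methods), the genotypes and phenotypes are simulated, accuracy is assumed to be the Pearson correlation between predicted and observed phenotypes in the held-out group, and every function name is hypothetical.

```python
import random
from statistics import mean


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0


def k_folds(n, k, rng):
    """Shuffle indices 0..n-1 and split them into k nearly equal groups."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def marginal_effects(train_x, train_y, test_x):
    """Toy stand-in: sum of per-marker simple-regression slopes."""
    my = mean(train_y)
    means, slopes = [], []
    for j in range(len(train_x[0])):
        col = [row[j] for row in train_x]
        mj = mean(col)
        var = sum((g - mj) ** 2 for g in col)
        cov = sum((g - mj) * (y - my) for g, y in zip(col, train_y))
        means.append(mj)
        slopes.append(cov / var if var > 0 else 0.0)
    return [my + sum(s * (g - m) for s, g, m in zip(slopes, row, means))
            for row in test_x]


def nearest_neighbours(train_x, train_y, test_x, k=5):
    """Toy stand-in: mean phenotype of the k most similar training genotypes."""
    preds = []
    for row in test_x:
        dist = sorted((sum(abs(a - b) for a, b in zip(row, tr)), y)
                      for tr, y in zip(train_x, train_y))
        preds.append(mean(y for _, y in dist[:k]))
    return preds


def cv_accuracy(method, x, y, k=5, repeats=100, seed=0):
    """Mean held-out correlation over `repeats` random k-group splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        for test_idx in k_folds(len(y), k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            preds = method([x[i] for i in train_idx],
                           [y[i] for i in train_idx],
                           [x[i] for i in test_idx])
            scores.append(pearson(preds, [y[i] for i in test_idx]))
    return mean(scores)


def simulate(n=60, n_markers=20, seed=1):
    """Simulated 0/1/2 genotypes and an additive phenotype with noise."""
    rng = random.Random(seed)
    x = [[rng.choice((0, 1, 2)) for _ in range(n_markers)] for _ in range(n)]
    effects = [rng.gauss(0, 1) for _ in range(n_markers)]
    y = [sum(e * g for e, g in zip(effects, row)) + rng.gauss(0, 1)
         for row in x]
    return x, y


METHODS = {"marginal_effects": marginal_effects,
           "nearest_neighbours": nearest_neighbours}

x, y = simulate()
# repeats=10 keeps the demo fast; the abstract reports using more than 100.
results = {name: cv_accuracy(fn, x, y, k=5, repeats=10)
           for name, fn in METHODS.items()}
best = max(results, key=results.get)
print(results, "->", best)
```

With k=5 each split trains on 80% of the individuals and predicts the remaining 20%, matching the design in (2); changing `k` to 3, 10 or 20 gives the other settings mentioned there. The per-trait `results` dictionary is the kind of record a knowledge base of accuracy values could accumulate.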
Keywords/Search Tags: Genome sequencing, Genome selection, Data mining, mMAP, Breeding analysis