A Study On Genomic Selection Method Based On New Compressed Component And Machine Learning Strategy

Posted on: 2023-10-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M J An
GTID: 1520307160967839
Subject: Animal breeding and genetics and breeding
Abstract/Summary:
Genomic selection (or genomic prediction) has been widely used in animals, plants, humans, and even microbes. It has effectively promoted product improvement in animals and plants, disease risk prediction in humans, and progress in other fields. With advances in genomics and sequencing technologies, the cost of sequencing has fallen continuously and massive omics data have accumulated for many species. Life science has entered the big data era, and the size of breeding datasets has increased steeply. Confronted with growing numbers of individuals and markers, well-known genomic selection methods such as GBLUP, Bayesian methods, and machine learning face huge computational challenges.

To address the computational challenges of breeding big data, different genomic selection strategies have been proposed in succession. Among them, dimensionality reduction is one effective candidate strategy for improving computational efficiency. Compared with other dimensionality reduction methods such as principal component regression (PCR) and independent component regression (ICR), the correlated component regression (CCR) method has been shown to achieve the highest compression ratio relative to the number of original variables, so it has great application potential in genomic selection. However, the existing CCR method, which is based on the OLS estimator in multiple regression and a step-down variable selection algorithm, has high computational complexity and low efficiency when dealing with the p>>n problem, and its computational efficiency is insufficient for the breeding big data of the future.

To overcome the shortcomings of the existing CCR method, this research aims to develop a new, efficient compressed component (CC) algorithm with a focus on improving computational efficiency and reducing computational resource requirements, and on that basis to develop a new genomic selection method by combining it with machine learning strategies. The new method is named CCFM (Compressed Component with Flexible Modeling). A series of studies was carried out around the development of the CCFM method, and the main results are as follows:

1. A new, efficient CC algorithm was established by introducing a single-SNP one-by-one regression strategy. The computational complexity of single-SNP one-by-one regression is low, and it is suitable for parallel and distributed computing. The new CC algorithm has no p>>n problem when processing high-dimensional genomic data. A tool implementing the new CC algorithm was completed through code optimization, including C++ programming for the core steps, OpenMP parallel acceleration, and storage of big matrices in a compressed format.

2. The technical framework of the CCFM method was established by combining the new CC dimensionality reduction tool with machine learning strategies. The key steps of the CCFM method are CCM calculation, CC selection, and genomic prediction model selection. The CCM is computed using the self-coded CC dimensionality reduction tool. CC selection adopts a designed strategy and a cross-validation strategy based on the training set. To increase the flexibility of genomic prediction model selection, an improved machine learning method, SVM_bagging, was developed by hybridizing SVM with the bagging ensemble learning method. The CCFM method has also been compiled into a working tool.

3. To evaluate the computational efficiency of the CCFM method, the upper limits of the memory requirements of CCFM, GBLUP, and BayesC were compared using simulated datasets. CCFM was found to have a huge advantage over GBLUP and BayesC in memory consumption: the memory used by CCFM was only 1/430,000 of that used by BayesC and 1/299 of that used by GBLUP.

4. The computing times of CCFM, GBLUP, and BayesC were further compared using simulated datasets with sample sizes of 2,000, 4,000, 6,000, and 8,000. CCFM had the shortest computing time in all simulated analyses, running 18 to 177 times faster than GBLUP and BayesC. Moreover, the larger the sample size, the greater the relative advantage of the CCFM method.

5. Genotypic data from the 1000 Genomes Project were used to produce simulated datasets with different combinations of QTN numbers and heritability levels. The simulation analyses showed that the prediction accuracy of CCFM remains at the same level as that of GBLUP.

6. The prediction accuracies of CCFM and GBLUP were further compared on multiple real datasets from animals, plants, and microbes. The results for 26 traits showed that, except for several animal traits, the prediction accuracy of the CCFM method with its optimal option was somewhat higher than that of GBLUP, and for two traits the improvement over GBLUP was 4%.

In summary, this study has developed a new genomic selection method that is suitable for processing breeding big data. The results show that CCFM offers fast computing speed and small memory requirements while keeping prediction accuracy at roughly the same level as GBLUP, which provides a new tool for efficient genomic selection in the era of big data. The achievements of this research contribute to the development of genomic selection methods and complement and enrich the set of genomic selection methods available for animals and plants. The CCFM method provides a powerful candidate tool for animal breeding, plant breeding, microbial breeding, and even human medicine. Therefore, this research has important theoretical and practical merits.
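
The single-SNP one-by-one regression strategy named in result 1 can be illustrated with a minimal sketch. This is not the dissertation's CC algorithm, whose details the abstract does not give. It assumes genotypes coded 0/1/2, fits an ordinary simple regression of the phenotype on each SNP separately, and then, as one simple assumed choice, forms a single component as the slope-weighted sum of the SNPs. The function names `single_snp_slopes` and `compressed_component` and the toy data are hypothetical.

```python
from statistics import fmean


def single_snp_slopes(genotypes, phenotypes):
    """Fit y ~ x_j separately for every SNP j and return the slopes.

    genotypes: list of rows (individuals), each a list of SNP codes.
    phenotypes: list of trait values, one per individual.
    Each fit uses only one SNP, so the number of SNPs (p) can exceed
    the number of individuals (n), and the loop over SNPs could be
    split across workers because the fits are independent.
    """
    y_mean = fmean(phenotypes)
    slopes = []
    for j in range(len(genotypes[0])):
        column = [row[j] for row in genotypes]
        x_mean = fmean(column)
        sxx = sum((x - x_mean) ** 2 for x in column)
        sxy = sum((x - x_mean) * (y - y_mean)
                  for x, y in zip(column, phenotypes))
        # A monomorphic SNP carries no information; give it zero weight.
        slopes.append(sxy / sxx if sxx > 0 else 0.0)
    return slopes


def compressed_component(genotypes, slopes):
    """Assumed construction: one score per individual, the
    slope-weighted sum of that individual's SNP codes."""
    return [sum(b * x for b, x in zip(slopes, row)) for row in genotypes]


# Toy data (hypothetical): 4 individuals, 3 SNPs coded 0/1/2.
toy_genotypes = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 1, 1],
    [1, 1, 2],
]
toy_phenotypes = [1.0, 2.0, 3.0, 2.0]

slopes = single_snp_slopes(toy_genotypes, toy_phenotypes)
component = compressed_component(toy_genotypes, slopes)
print(slopes)
print(component)
```

Each per-SNP fit needs only one genotype column and the phenotype vector in memory at a time, which is consistent with the low memory use and easy parallelization that the abstract reports, although the actual tool is written in C++ with OpenMP.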
Keywords/Search Tags: genomic prediction, breeding big data, compressed component, machine learning, CCFM, computational efficiency