Study On Genome-wide Association Analysis And Phenotypic Prediction Method Based On Two-stage

Posted on:2021-09-21

Degree:Master

Type:Thesis

Country:China

Candidate:J L Sun

Full Text:PDF

GTID:2480306605993119

Subject:Mathematics

Abstract/Summary:

Most complex traits of humans, animals and plants are quantitative traits. Detecting quantitative trait loci (QTL) is very important for analyzing the genetic basis of complex traits. However, quantitative traits are usually controlled by many genes with small effects, and they are vulnerable to environmental influences. Detecting the genes that control quantitative traits through biological experiments is inefficient, whereas statistical methods are much more powerful for gene detection. Genome-wide association analysis (GWAS) is used to link the phenotypes of individuals with their genotypes, so that genes or quantitative trait nucleotides (QTNs) can be detected from the degree of association between single nucleotide polymorphism (SNP) markers and traits. With the development of sequencing technology, millions of SNPs have been generated for target traits, but traditional statistical methods have difficulty analyzing such high-dimensional genetic data. Recently, machine learning methods, which compute quickly, have been successfully applied to high-dimensional genetic data analysis. However, most machine learning methods have drawbacks such as poor generalization ability, over-fitting, unsatisfactory classification performance and low detection accuracy.

To overcome these weaknesses, this study proposes a two-stage genome-wide association analysis method, the two-stage algorithm based on least angle regression and random forest (TSLRF), which combines variable selection with machine learning. The new method fully accounts for population structure and polygenic effects. In the first stage, least angle regression selects the SNPs that are potentially related to the target traits; in the second stage, these SNPs are used to build a random forest model that further selects the QTNs significantly related to the target traits. We verified the reliability of the new method using simulated datasets (1,000 replications) and real datasets (5 flowering time datasets) from an Arabidopsis thaliana natural population. We then extended the new method to phenotypic prediction and used it to explore the accuracy and reliability of plant trait prediction. The main results are as follows:

1. The new method first uses FASTmrEMMA (fast multi-locus random-SNP-effect EMMA) to correct for population structure and polygenic effects in the datasets, then applies least angle regression to select the SNPs that are potentially related to the target traits. In the second step, it uses random forest to rank these SNPs by their importance scores, and in this way detects the QTNs significantly associated with the target traits. Results from the simulated and real datasets show that, compared with other methods, the new method separates QTNs from unrelated SNPs more clearly; TSLRF has relatively high model accuracy and goodness of fit; it has a stronger gene detection capability, detecting 60 genes confirmed to be significantly related to the target trait as well as multiple gene clusters associated with the target trait; and it computes quickly.

2. Extending the new method, we account for the complex genetic structure of the population by correcting for population structure and polygenic effects, and predict the phenotype from the same genetic data. The simulation experiments cover four scenarios, with missing rates of 5%, 10%, 15% and 20%. The phenotypic prediction results from the simulated and real datasets show that, compared with other methods, the new method achieves relatively high prediction accuracy and a relatively good fit of the prediction model. In addition, as the missing rate increases, the prediction accuracy and the fit of the prediction model continue to decrease, and the differences become increasingly significant. Using the new method to analyze genetic data combined with predicted phenotypic values is computationally efficient and fast, and its gene detection ability is stronger than that of other methods. The new method provides a further theoretical basis for predicting elite parental combinations and for genome-wide marker-assisted selection in large-scale data analysis.
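
The two-stage idea described in the abstract (least angle regression to shortlist SNPs, then random forest importance scores to rank them) can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes scikit-learn's `Lars` and `RandomForestRegressor` as stand-ins, runs on a toy simulated genotype matrix, and omits the FASTmrEMMA correction for population structure and polygenic effects entirely. The function names, QTN positions, effect sizes and parameter values are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lars


def simulate_genotypes(n_individuals=300, n_snps=200, seed=0):
    """Simulate a toy SNP matrix (0/1/2 allele counts) and a quantitative trait."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 3, size=(n_individuals, n_snps)).astype(float)
    causal = [10, 50, 120]                # hypothetical QTN positions
    effects = np.array([2.0, -2.0, 1.5])  # hypothetical effect sizes
    noise = rng.normal(0.0, 0.5, size=n_individuals)
    y = X[:, causal] @ effects + noise
    return X, y, causal


def two_stage_select(X, y, n_candidates=15, n_trees=200, seed=0):
    """Shortlist SNPs with least angle regression, then rank them with a random forest.

    Assumption: no population-structure or polygenic correction is applied here.
    """
    # Stage 1: least angle regression keeps at most n_candidates SNPs.
    lars = Lars(n_nonzero_coefs=n_candidates)
    lars.fit(X, y)
    candidates = np.flatnonzero(lars.coef_)

    # Stage 2: fit a random forest on the shortlisted SNPs only and
    # rank them by impurity-based importance score (highest first).
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    forest.fit(X[:, candidates], y)
    importances = forest.feature_importances_
    order = np.argsort(importances)[::-1]
    return [(int(candidates[i]), float(importances[i])) for i in order]


X, y, causal = simulate_genotypes()
ranking = two_stage_select(X, y)
for snp, score in ranking[:5]:
    print(f"SNP {snp}: importance {score:.3f}")
```

In this toy setting the shortlist size (`n_candidates`) plays the role of the first-stage filter, and the importance ranking plays the role of the second-stage QTN detection; how the thesis sets thresholds for either stage is not stated in the abstract.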

Keywords/Search Tags:

random forest, least angle regression, variable selection, importance score, polygenic effect correction

Related items

1	Study On Two-Stage Stepwise Variable Selection Based On Random Forests
2	Variable Selection For Causal Inference
3	Novel random forest and variable importance methods for correlated survival data, with applications to tooth prognosis
4	Estimation And Variable Selection Of Regression Models With Missing Data
5	A Weighting Mean Score Estimation For A Random Effect Logistic Regression With Auxiliary Variables
6	Statistical Inferences And Variable Selection For Varying Coefficient Models With Missing Responses
7	Comparative Study And Application Of Several Variable Selection Methods
8	Research On Progressive Modeling Method Of Hierarchical Fuzzy Inference System Based On Data
9	Variable Selection For Functional Linear Regression Models With Heredity Constraint
10	Research On Movie Box Office Prediction Based On Random Forest