
Effect Of Mutation On Linkage Disequilibrium And Genotype Inference And Its Detection By Machine Learning Methods

Posted on: 2024-09-14  Degree: Master  Type: Thesis
Country: China  Candidate: R Zhang  Full Text: PDF
GTID: 2530307094967699  Subject: Aquaculture
Abstract/Summary:
Mutations have recently received extensive attention because of their role in evolution and in the genetics of complex traits. The distribution of linkage disequilibrium is influenced by mutations, but recently reported transition matrices assume a single, uniform mutation rate. The impact of different mutation types, rates, and randomness on the distribution of linkage disequilibrium remains unknown. We assume that mutations in the transition matrix differ in type and rate, as nucleotide transitions and transversions do, and investigate how this affects the distribution of linkage disequilibrium between two biallelic loci. After examining factors such as effective population size, recombination, and selection, we find that different mutation types and rates further alter the dynamics of the linkage disequilibrium distribution. Within the current range of mutation rates (10⁻⁹ to 10⁻⁸), mutations appear to have a minimal effect compared with recombination and selection. The randomness of mutations increases the ruggedness of the linkage disequilibrium curve, leading to fluctuations around the "equilibrium" state.

The analysis of linkage disequilibrium (LD) between genetic variants is fundamental to population genetics and evolutionary research. However, current research mostly considers LD only among variants on the same chromosome, and our understanding of LD relationships across different chromosomes remains limited. Furthermore, with the rapid advancement of genomic sequencing technologies, an unprecedented abundance of genetic variants is being generated, which further increases the demand for fast LD calculation. New methods and tools are therefore needed that can perform fast LD analysis across the entire genome, allowing a comprehensive understanding of the patterns and interactions of genetic variants. Here, we developed the GWLD R package and cross-platform (Windows, Unix/Linux) software based on RcppArmadillo, Armadillo, and OpenMP. It is a parallelized and generalized tool for fast genome-wide computation of LD values, including the conventional D, D', and r², as well as information theory-based measures such as mutual information (MI) and reduced mutual information (RMI). With the GWLD R package and software, LD between genetic variants within and between chromosomes can be computed rapidly, and the results can be visualized using the provided visualization tools. Four real datasets were used to test the performance of GWLD, comparing the computational efficiency and LD patterns of GWLD's single-threaded and multi-threaded modes with those of other computational tools (R package: genetics; software: TASSEL, PLINK). The GWLD R package and software are available on GitHub (https://github.com/Rong-Zh/GWLD).

Genotype inference is an important tool for genomic analysis, from genome-wide association to phenotype prediction. Traditional genotype inference methods typically rely on haplotype clustering algorithms, Hidden Markov Models (HMMs), and statistical inference. Recently, deep learning-based approaches have been reported to address data inference problems across various domains. To explore the performance of deep learning in genotype inference, we constructed a deep learning model called K-Means+Con1DAE. The model incorporates a one-dimensional convolutional layer that can extract various correlation or linkage patterns from genotype data. Additionally, L1 regularization is applied in the model to generate a sparse weight matrix, which helps handle high-dimensional data such as genotype data. We evaluated the performance of four models (K-Means+Con1DAE, Con1DAE, Con2DAE, and SCDA) in genotype inference using Beijing duck genotype data. The K-Means+Con1DAE model outperformed the other three models in accuracy on the Beijing duck dataset and was robust across different dataset sizes and noise ratios.

Accurate detection and identification of mutations remains a challenge. Machine learning can automatically learn features from data and is very effective for tasks such as mutation detection. However, most machine learning algorithms focus on somatic mutation identification, and fewer studies address the identification of both somatic and germline mutations. In this study, we constructed 11 machine learning models (including 5 deep learning models) for mutation detection and type identification. Random forest and the 5 deep learning models can accurately detect mutations and identify mutation types (average AUC > 0.95).

In summary, simulating two biallelic loci and analyzing how different types of mutations jointly affect linkage disequilibrium under varying effective population size, number of generations, selection pressure, and other factors gives a better understanding of the role of mutations in molecular evolution and the genetics of complex traits. The GWLD R package and software, developed based on mutual information, enable fast computation and visualization of LD patterns between genetic loci within and between chromosomes, providing new insights for genetic research. The K-Means+Con1DAE model demonstrates strong robustness in genotype inference, offering a new application approach in this field. Random forest and deep learning models effectively enhance mutation detection, providing more accurate, efficient, and rapid solutions.
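
As a rough illustration of the LD measures named in the abstract, the sketch below computes D, D', r², and mutual information (MI) for a single pair of biallelic loci from haplotype and allele frequencies. It is written in Python purely for illustration and is not the GWLD implementation; the function name `ld_measures`, its arguments, the choice to report D' as an absolute value, and the use of base-2 logarithms for MI are all assumptions made here. Reduced mutual information (RMI) is left out because the abstract does not define it.

```python
import math


def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r2, MI) for two biallelic loci.

    Hypothetical helper for illustration only.
    p_ab : frequency of the haplotype carrying allele A and allele B
    p_a  : frequency of allele A at the first locus
    p_b  : frequency of allele B at the second locus
    """
    # Raw disequilibrium coefficient
    d = p_ab - p_a * p_b

    # Largest |D| attainable given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0  # assumption: |D'| convention

    # Squared correlation between the two loci
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0

    # Mutual information (bits) over the four haplotypes
    haplotypes = [
        (p_ab, p_a, p_b),                                # A-B
        (p_a - p_ab, p_a, 1 - p_b),                      # A-b
        (p_b - p_ab, 1 - p_a, p_b),                      # a-B
        (1 - p_a - p_b + p_ab, 1 - p_a, 1 - p_b),        # a-b
    ]
    mi = 0.0
    for p_hap, p_first, p_second in haplotypes:
        if p_hap > 0:  # terms with zero frequency contribute nothing
            mi += p_hap * math.log2(p_hap / (p_first * p_second))

    return d, d_prime, r2, mi


if __name__ == "__main__":
    # Complete association: only A-B and a-b haplotypes exist
    print(ld_measures(0.5, 0.5, 0.5))
    # Independence: haplotype frequency equals the product of allele frequencies
    print(ld_measures(0.25, 0.5, 0.5))
```

A genome-wide tool would apply a calculation of this kind to every pair of variants, which is why parallelization matters as the number of variants grows.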
Keywords/Search Tags: Mutation, linkage disequilibrium, genotype inference, machine learning, mutation detection