
A Study Of Tracing Forensic Paternal Biogeographic Ancestry By Using Y-STR Haplotypes To Predict Y-SNP Haplogroup

Posted on: 2022-07-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M Y Song
GTID: 1524306551963109
Subject: Forensic medicine
Abstract:
Objective: Y-chromosomal short tandem repeats (Y-STRs) and Y-chromosomal single nucleotide polymorphisms (Y-SNPs) are important forensic genetic markers. Y-STR profiles are widely used in forensic investigations for personal identification and paternal lineage tracing. Using the Y chromosome for biogeographic ancestry inference depends on reference databases. Large-scale Y-chromosome databases include the Y Chromosome Haplotype Reference Database (YHRD) and the public security DNA database. YHRD holds a large number of Y-STR haplotypes covering worldwide populations and provides haplotype frequency references and a calculation basis for evidence evaluation. In Chinese forensic practice, forensic geneticists in public security departments have established a Y-STR database to help solve cases. Unlike YHRD, this database contains familial information, so searching it can find matching samples and thereby identify the suspect's family. Both databases consist mainly of Y-STRs. Y-STRs have high power for personal identification, but their mutation rate is high, which makes it difficult to trace paternal lineage with Y-STRs alone. The Y-SNP mutation rate is low (about 10^-8), so Y-SNPs can be used to construct a Y-chromosome phylogenetic tree; for this reason the Y Chromosome Consortium chose Y-SNPs to define the tree, which includes 20 main haplogroups (A to T). To predict haplogroups, narrow the scope of investigation, and improve the accuracy of familial searching, a database is needed that supports predicting Y-SNP haplogroups from Y-STR haplotypes. To provide a data basis for familial searching and paternal biogeographic ancestry inference, the database is composed of three elements: 1) reliable Y-STR data generated by a validated Y-STR typing system; 2) a large sample set with both Y-STR and Y-SNP data; 3) software that uses Y-STR haplotypes to
predict haplogroups and thereby trace paternal biogeographic ancestry. We call this the "Y-STR predicting haplogroup database". We first address the first element to lay a foundation for quality control of the database. As DNA databases grow, Y-STR systems with higher discriminating power are required, and kits with more loci continue to be released. Because forensic work demands rigor, each kit should be validated before casework use, so the typing system used to generate data for the Y-STR predicting haplogroup database must be validated in advance. In this study, a new commercial Y-STR kit, the Yfiler Platinum PCR kit, is validated for forensic purposes. The database also needs sample sets with both Y-STR and Y-SNP data. At present, the public security Y-chromosome database is built mainly with the Yfiler Plus kit. Therefore, to expand the Y-STR predicting haplogroup database, the Yfiler Plus kit was used for Y-STR typing of the Li population of Hainan Island, and a high-resolution Y-SNP system was used to genotype Y-SNPs in the same population. Integrating the Y-chromosome data of four populations (Han, Tibetan, Hui, Li) and comparing their paternal genetic structure illustrates the value of combining Y-STRs and Y-SNPs and the reliability of the Y-STR predicting haplogroup database. The third element is software for haplogroup prediction. Several haplogroup prediction tools were developed in previous research. For example, Haplogroup Predictor mainly targets haplogroups E, G, I, J, and R and can only assign samples to a main haplogroup; Haplogroup Classifier, developed by Joseph Schlecht, mainly focuses on 30 major haplogroups such as I, J, and R, with low resolution; Haplo-I subclade Predictor, developed by Jim Cullen, can only predict haplogroup I; and R-L21 SNP Predictor, created by Robert Casey, requires samples verified as haplogroup R-L21. More accurate
prediction software, especially for haplogroup O in the Chinese population, has yet to be developed. In response, drawing on the Y-STR predicting haplogroup database built in this research and on previous studies, this research mainly uses data from haplogroup O and its branches in the Chinese population to develop software that predicts high-resolution Y-SNP haplogroups, addressing the key problem of how to use forensic Y-STRs to trace paternal biogeographic ancestry.

Methods: 1. The samples comprised 1000 Chinese Han population samples, 140 father-son pair samples, 6 real crime scene samples, and samples from 11 non-human species (guinea pig, cat, duck, rabbit, chicken, cattle, sheep, pig, rat, Neisseria gonorrhoeae FA1090, and Saccharomyces cerevisiae BY4741). DNA was extracted and purified by the Chelex-100 method or by biomagnetic separation (Changchun Bokun Biotech Co., Changchun, China) on the KingFisher Purification System (Thermo Fisher Scientific, Waltham, MA, USA). The new Y-STR kit Yfiler Platinum (Thermo Fisher Scientific, USA) was validated on a capillary electrophoresis platform for sensitivity, male specificity, mixture analysis, species specificity, performance in the presence of inhibitors and environmental degradation, discrimination capability, and precision. 2. Genomic DNA was extracted with the PureLink Genomic DNA Mini kit. The Yfiler Plus PCR amplification kit (Thermo Fisher Scientific, USA) was used for Y-STR analysis of 302 unrelated male samples from the Li and Han populations. SNaPshot was used for Y-SNP typing of the same 302 samples on the ABI 3130 genetic analyzer (Applied Biosystems, USA). Median-joining trees were used to analyze the genetic structure of the Li and Han populations and the gene flow between the Li and other populations on Hainan Island. We integrated the Y-chromosome data of four Chinese populations (Han, Tibetan, Hui, Li), extracted Y-chromosome data from the 1000 Genomes Project, and used t-Distributed Stochastic Neighbor
Embedding (t-SNE) and nonlinear least-squares (NLLS) data fitting to compare the paternal genetic structure and admixture proportions of the four populations. 3. Java and Python were used to call six supervised machine learning algorithms built into the software to train the models: k-nearest neighbors, naive Bayes, logistic regression, support vector machine, decision tree, and random forest. To train a model on the Y-STR predicting haplogroup database, we randomly divided the 3455 records into two non-overlapping subsets: a training set for learning the association between Y-STR haplotypes and Y-SNP haplogroups, and a test set of 400 samples for evaluating prediction accuracy. Five-fold cross-validation was applied to all data. To rank the reference samples closest to an unknown sample, which is the third function of the software, a similarity score based on cosine distance was used. Finally, the real case data and the population data generated above with the Yfiler Platinum kit were run through the predictor, and the predicted haplogroups were compared with the actual typing results.

Results: 1. We conducted a developmental validation of a newly released Y-chromosome kit that combines two kinds of markers: 38 Y-chromosomal short tandem repeats and 3 Y-indels. The kit performed well with low DNA input (125 pg), with male-male mixtures (minor:major = 1:7), with male-female mixtures (male:female = 125 pg:1875 pg), and in the presence of inhibitors (800 μM hematin and more than 1600 ng/μL humic acid). The kit is highly precise, with a standard deviation of allele size of no more than 0.14 nt. Population samples were well discriminated (match probability of 0.001106), and mutations were observed in father-son pairs (47 mutations in 70 pairs, 10 of them at locus DYS627). Of all the population samples, 13.2% belonged to haplogroup
M117-O2a2b1a1, all from Han Chinese individuals. Compared with the widely used Yfiler Plus kit, this kit improved male individual identification in this study, yielding more unique haplotypes (918 versus 894 among 1000 male samples) and a higher discrimination capacity (0.955 versus 0.942). 2. A total of 302 unrelated male samples from the Li and Han ethnic groups of Hainan Island were genotyped. The haplotype diversity of the Li ethnic group reached 0.9997. The majority (98.04%) of Li individuals were assigned to haplogroup O-M175, the branch to which most Chinese individuals belong [4]. The remaining two samples (1.96%) belonged to haplogroup C-M130. All Li samples fell into four major haplogroups: O1b-P31 (N=61, 59.80%), O1a-M119 (N=28, 27.45%), O2a-M324 (N=11, 10.78%), and C2-M217 (N=2, 1.96%). Eighteen terminal haplogroups were observed in the Li ethnic group. Haplogroup O1b1a1a1a1a1b-CTS5854 was the most common, accounting for 44.12% of Li individuals. We observed 48 haplogroups in the Hainan Han ethnic group. Median-joining trees showed little gene flow between Li and Han individuals, or between the Li and other ethnic groups on Hainan Island. Our results indicate that 1) the Li ethnic group has lower genetic diversity than the Han ethnic group; 2) gene flow between the Li and Han ethnic groups is limited; and 3) a founder effect is present in the Li ethnic group on Hainan Island. We then analyzed four population datasets from the Han, Tibetan, Hui, and Li ethnic groups. The results show that Han and Hui have close genetic affinity and that Hui is the most admixed ethnic group. Tibetans are distinguished by a high frequency of haplogroup D, and Li is the most isolated and homogeneous population in this study. The comprehensive dataset in this study is the largest of its kind reported to date and provides reference population data for future paternal genetic studies and forensic genealogical
identification. 3. YHP (Y Haplogroup Predictor) was developed as an open-access, easy-to-use software package that can predict Y-SNP haplogroups, compare similarities between haplotypes, and conduct haplotype mismatch analysis. The first function counts the Y-STR loci at which haplotypes differ; given multiple haplotypes, it helps find the one with the fewest differences. The second function, the main one, predicts the Y-SNP haplogroup of a single Y-STR haplotype. Because a database search often returns multiple haplotypes whose haplogroups are unknown, the third function compares them, calculates a similarity score between each pair of haplotypes, and finds the haplotype closest to the target haplotype. The accuracy of main haplogroup prediction is 98.4%, and the accuracy of high-resolution haplogroup prediction is 77%. The support vector machine and random forest methods are more accurate than the other four methods. When comparing haplotypes, the software uses cosine distance to score and rank samples with similar Y-STR haplotypes against the target sample, which helps set priorities for forensic investigations. In mismatch analysis of Y-STR haplotypes, when the number of mismatches does not exceed 2, more than 97% of sample pairs belong to the same haplogroup.

Conclusion: This project constructed a Y-STR predicting haplogroup database that can predict Y-SNP haplogroups from forensic Y-STR typing and trace paternal biogeographic ancestry. We verified that the new Y-chromosome kit Yfiler Platinum can be used to build the Y-STR predicting haplogroup database and other forensic databases. Y-chromosome data of the Hainan Li ethnic group, obtained by combining the validated Y-STR and Y-SNP typing systems, revealed that the Li ethnic group is less mixed with other major
Chinese groups. A strong founder effect was observed in the Li ethnic group. These findings show the value of combined Y-STR and Y-SNP analysis for tracing paternal biogeographic ancestry. Finally, based on the established Y-STR predicting haplogroup database, the software YHP (Y Haplogroup Predictor) was developed to provide high-resolution Y-SNP haplogroup prediction from Y-STR genotypes alone, helping to solve the key problem of tracing paternal biogeographic ancestry in the Chinese population and providing new artificial intelligence technology for familial searching and paternal biogeographic ancestry inference.
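
The mismatch analysis and the cosine-based ranking described in the abstract can be illustrated with a short Python sketch. It is not YHP's code, and the abstract does not give YHP's feature encoding. The sketch assumes each haplotype is a mapping from a single-copy locus name to a numeric repeat count, that only loci typed in both haplotypes are compared, and that multi-copy loci are left out. The function names and all allele values are hypothetical and chosen only for illustration.

```python
import math


def count_mismatches(hap_a, hap_b):
    """Count loci typed in both haplotypes where the allele calls differ."""
    shared = hap_a.keys() & hap_b.keys()
    return sum(1 for locus in shared if hap_a[locus] != hap_b[locus])


def cosine_similarity(hap_a, hap_b):
    """Cosine similarity of the allele vectors over the shared loci."""
    loci = sorted(hap_a.keys() & hap_b.keys())
    dot = sum(hap_a[locus] * hap_b[locus] for locus in loci)
    norm = math.sqrt(
        sum(hap_a[locus] ** 2 for locus in loci)
        * sum(hap_b[locus] ** 2 for locus in loci)
    )
    return dot / norm if norm else 0.0


def rank_references(target, references):
    """Return (name, similarity, mismatches) tuples, most similar first."""
    rows = [
        (name, cosine_similarity(target, hap), count_mismatches(target, hap))
        for name, hap in references.items()
    ]
    # Sort by descending similarity; break ties with fewer mismatches.
    return sorted(rows, key=lambda row: (-row[1], row[2]))


# Illustrative allele calls only (assumed values, not data from the study).
target = {"DYS19": 15, "DYS389I": 12, "DYS390": 23,
          "DYS391": 10, "DYS392": 14, "DYS393": 12}
references = {
    "ref_A": {"DYS19": 15, "DYS389I": 12, "DYS390": 23,
              "DYS391": 11, "DYS392": 14, "DYS393": 12},
    "ref_B": dict(target),
    "ref_C": {"DYS19": 14, "DYS389I": 12, "DYS390": 25,
              "DYS391": 10, "DYS392": 11, "DYS393": 13},
}

ranking = rank_references(target, references)
for name, similarity, mismatches in ranking:
    print(f"{name}: similarity={similarity:.5f}, mismatches={mismatches}")
```

With raw repeat counts the cosine values all lie close to 1, so only their order is informative. A real tool might rescale or center each locus before scoring, which is a design choice this sketch leaves open.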
Keywords: Y-STR, Y-SNP, population genetics, Li ethnic group, machine learning, prediction