Genotype Identification And Analysis Process Development Of Homologous Segment Targeted Sequencing Data

Posted on: 2022-09-16
Degree: Master
Type: Thesis
Country: China
Candidate: J N Wu
GTID: 2480306497969239
Subject: Biomedical engineering
Abstract/Summary:
BACKGROUND: Single nucleotide polymorphisms (SNPs) are polymorphisms caused by point mutations that give rise to different alleles carrying alternative bases at a given nucleotide position within a locus. They have been a research hotspot in genomics and genetics for many years. However, many experiments have shown that homologous segments seriously affect bioinformatics analysis and SNP genotype identification within gene clusters or polyploids. The difficulty in developing and identifying allopolyploid SNPs arises mainly because homologous sequences in the target segment give rise to variant types other than SNPs, such as homoeologous sequence variants (HSVs) and paralogous sequence variants (PSVs). Neither HSVs nor PSVs can be used as genetic markers, because they produce too many false positives (>80%). Commonly used techniques for SNP typing in polyploid species include Sanger sequencing, SNP chips, and KASP (Kompetitive Allele-Specific PCR). However, these techniques rely on specific amplification or specific hybridization, and that specificity is reduced by homologous segments. Next-generation sequencing (NGS) technology has greatly reduced sequencing costs and improved sequencing efficiency. Targeted DNA enrichment based on multiplex PCR is an economical, fast and accurate library construction strategy with great potential for SNP detection in large sample populations. Therefore, this study optimized the key parameters of a targeted sequencing SNP calling workflow, including the reference genome and the mapping software, and developed a pipeline for SNP genotyping of targeted sequencing data from homologous segments.

OBJECTIVE: SNP detection in polyploid plants is complicated by the presence of homologous segments. Here, an amplicon dataset of homologous segments from tetraploid cotton is used as an example to examine the influence of homologous segments and to optimize the bioinformatics pipeline.

METHODS: (1) Library construction and sequencing: Specific primers were designed for the known potential SNP sites to be detected in three target segments of Upland cotton. Multiplex PCR amplification was performed in a single tube, different samples were distinguished with different barcode primers, and the amplicons were subjected to high-throughput sequencing. (2) Data analysis environment configuration: A high-performance computer (HPC) was built to analyze and store the large volume of high-throughput sequencing data. The hardware mainly comprised Intel's Xeon E5-2620 processor and Supermicro's X10 DRH motherboard. The software tested was divided into mapping software and SNP calling software: the mapping software included BWA-MEM (version 0.7.17) and Minimap2 (version 2.11), and the SNP calling software included SAMtools (version 1.9) and GATK (version 3.7). (3) Data analysis: First, the raw data were preprocessed with Cutadapt and quality-checked with FastQC to obtain clean data. BWA-MEM was then used to align the clean reads to a reference composed of the target segments, and the alignments were filtered to obtain SAM and BAM files. Mapped-read statistics for the SAM files before and after filtering were obtained with in-house scripts. IGV, a visualization tool, was used to manually review the sample genotypes. Finally, SNPs were called with SAMtools and GATK, and the genotypes identified by the two methods at the potential SNP loci were checked for consistency. Variant annotation analysis was performed on the VCF file. (4) Optimization of key parameters and establishment of the analysis process: After evaluating the results of the initial analysis, two key parameters were chosen for optimization: whether homologous segments are added to the reference genome, and which mapping software performs better. The optimized analysis results were compared to determine the best analysis plan and establish a pipeline.

RESULTS: (1) Routine analysis: There are 5 potential SNP loci in the three segments analyzed. Variant analysis with SAMtools and GATK showed that the three SNP loci of segment 1 and the SNP locus of segment 2 were all identified with the correct genotype (homozygous), while the SNP locus of segment 3 was identified with the wrong genotype (heterozygous) in almost all samples. Among them, the SNP-79 locus of segment 1 and the SNP-120 locus of segment 2 were homozygous for a base different from the reference, and the other two potential SNP loci of segment 1 were homozygous for the reference base. In segment 3, five new loci that were heterozygous in 100% of samples appeared in addition to the potential SNP-143 locus. According to the read ratios, the allele base ratio at these six loci was close to 1:1, suggesting interference from homologous sequences. (2) BLAST analysis: The homologous segments of the three target segments (located on chromosome A12) are all located on chromosome D12, with similarities of 97.17%, 98% and 96.28%, respectively, and all belong to the gene MYB39, so the interfering variants were conjectured to be HSVs. (3) Optimization analysis: Three reference sequence conditions were tested for mapping. In Condition 1, the reference sequences for mapping contained only the three homologous segments. The potential 124-SNP and 162-SNP of segment 1 were identified as variant homozygous genotypes, which was not consistent with the routine analysis. The genotype identification results for the other potential SNP loci were consistent with the routine analysis. In Condition 2, the reference sequences for mapping included segments 1, 2 and 3 and homologous segment 3, a total of 4 sequences. Variant analysis showed that the genotype of the potential SNP-143 locus of target segment 3 should be homozygous TT, and reads mapped to segment 3 and to homologous segment 3, which originate from different subgenomes, accounted for 47.7% and 52.3%, respectively. Therefore, sequencing reads from the homologous segment were the main cause of the segment 3 genotyping errors, and three of the five new SNPs in the routine analysis proved to be false-positive "SNPs" caused by HSVs. These results demonstrated the importance of adding homologous segments to the reference genome. In Condition 3, the reference sequences for mapping included segments 1, 2 and 3 and homologous segments 1, 2 and 3, a total of 6 sequences. A large number of reads were lost during filtering after mapping, indicating that the BWA mapping software is sometimes not suitable for mapping homologous sequences. After the mapping software was replaced with Minimap2, reads were no longer lost and mapping returned to normal. Variant analysis then determined that the 143-SNP locus of target segment 3 was indeed the variant homozygous genotype TT, and that the potential 123-SNP and 161-SNP of target segment 1 and the five new variant loci of target segment 3 called in the routine analysis were all false-positive "SNPs" caused by HSVs.

CONCLUSION: This study combined multiplex PCR targeted sequencing and bioinformatics to analyze the potential SNPs of three target segments of Upland cotton. It showed that the presence of homologous segments seriously affects SNP genotyping in polyploids, and that optimizing key parameters, especially adding the homologous sequences to the reference genome, yielded the correct genotypes and improved the accuracy of SNP genotyping. This lays a foundation for genome-wide SNP genotyping analysis of polyploid crops such as cotton and for customized breeding panels.
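The read-ratio observation from the routine analysis (loci that look heterozygous in nearly every sample, with an allele ratio close to 1:1) can be illustrated with a small screening sketch. This is not the thesis's in-house script: the function name `summarize_locus`, the `LocusSummary` fields, the 0.3 to 0.7 heterozygosity window, the 0.95 cutoff and the read counts in the demo are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LocusSummary:
    locus: str
    het_fraction: float       # share of samples that look heterozygous
    mean_alt_fraction: float  # mean alternative-allele read fraction
    suspected_hsv: bool       # True if the locus looks like homologous interference


def summarize_locus(
    locus: str,
    counts: List[Tuple[int, int]],
    het_low: float = 0.3,            # assumed lower bound for a "heterozygous-like" ratio
    het_high: float = 0.7,           # assumed upper bound for a "heterozygous-like" ratio
    min_het_fraction: float = 0.95,  # assumed cutoff for "nearly all samples"
) -> LocusSummary:
    """Flag a locus where almost every sample shows a balanced allele ratio.

    `counts` holds one (reference_reads, alternative_reads) pair per sample.
    A true SNP in a population is rarely heterozygous in every sample, so a
    locus that is balanced everywhere is a candidate for reads from a
    homologous segment being mapped to the target.
    """
    alt_fractions = []
    het_samples = 0
    for ref_reads, alt_reads in counts:
        depth = ref_reads + alt_reads
        if depth == 0:
            continue  # skip samples with no coverage at this locus
        alt_fraction = alt_reads / depth
        alt_fractions.append(alt_fraction)
        if het_low <= alt_fraction <= het_high:
            het_samples += 1

    if not alt_fractions:
        return LocusSummary(locus, 0.0, 0.0, False)

    het_fraction = het_samples / len(alt_fractions)
    mean_alt_fraction = sum(alt_fractions) / len(alt_fractions)
    return LocusSummary(
        locus, het_fraction, mean_alt_fraction, het_fraction >= min_het_fraction
    )


if __name__ == "__main__":
    # Hypothetical read counts, not data from the thesis.
    balanced = [(48, 52), (50, 50), (47, 53)]
    homozygous = [(0, 100), (1, 99), (100, 0)]
    print(summarize_locus("balanced_example", balanced))
    print(summarize_locus("homozygous_example", homozygous))
```

A locus flagged this way is only a candidate. As in the study, confirmation would come from adding the homologous segment to the reference, remapping, and checking whether the apparent heterozygosity disappears.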
Keywords/Search Tags: Targeted sequencing, Homologous segments, SNP, Genotyping, Pipeline