
Whole Genome G-quadruplex Structure Analysis Based On Quality Control Strategy

Posted on: 2020-04-02  Degree: Master  Type: Thesis
Country: China  Candidate: M Q Duan
GTID: 2370330626450809  Subject: Biomedical engineering
Abstract/Summary:
G-quadruplexes (G4s) are nucleic acid secondary structures formed by Hoogsteen hydrogen bonding within guanine-rich DNA or RNA sequences. As important elements of gene expression regulation, G4s are scattered throughout the human genome. The physical and chemical properties of G4s have been studied in detail, but determining their positions and corresponding functions remains challenging. This project explored the G4s in the GM12878 genome. We established a practical procedure, implemented in Perl, for mining quadruplex-forming G-rich sequences (QGRS) from next-generation sequencing data. We also evaluated the performance of the procedure and briefly investigated the influence of single nucleotide polymorphisms (SNPs) on G4 formation.

We exploited the fact that G4s forming during polynucleotide synthesis cause Phred scores to drop in sequencing-by-synthesis. For every locus, we collected the Phred scores of all reads covering it and calculated their median. The median Phred scores were used to measure the quality of each locus. In the OQ (observed QGRS) detection step, we designed a low-quality-locus scanning algorithm that compares median Phred scores between loci. Because the algorithm compares values within a small range, it avoids both missing regions and selecting redundant ones. We also established a parameter adjustment process so that the procedure can be applied to other sequencing data sets. To determine the PQs (predicted QGRS), we used the QGRS prediction software g4predict, which is based on the Quadparser algorithm, to predict G4s in the reference sequence (hg19).

There were 356,298 PQs in the whole genome: 178,606 on the forward strand and the remaining 177,692 on the reverse strand. Genome-wide, we found 1,054,941 and 936,545 OQs in the two sets of sequencing data, about 2.7 to 3 times the number of PQs. The total length of the OQs accounted for about 3% of the whole genome sequence. Classifying the OQs by QGRS structure, we detected canonical, long-loop, bulged and two-quartet G4s. OQs that did not harbor any G4 were divided into i-motifs, hairpins and triplexes. In total, 6.3% of the OQs could not be classified.

We counted the PQs located within OQ regions, termed PQinOQs. There were 185,822 and 172,946 PQinOQs, accounting for 52.2% and 48.5% of all PQs. Considering the short length of the OQs, the high PQinOQ detection rate showed that the procedure was highly credible. We also counted the PQs and PQinOQs according to loop length. The results showed that the shorter the loops in a PQ, the more stable it was in buffer containing Na+. We also compared the QGRS we detected with those from another G4-detection experiment published in 2015. In that study, the researchers added metal ions to create G4-inducing reaction conditions and obtain a positive PQinOQ set.

Across different regions of the human genome, the distribution of PQinOQs was consistent with their biological functions. The OQs, especially those harboring canonical G4s, were found not only in regulatory regions but also in coding regions. We also used IGV Tools to visualize the OQs and PQs located in some specific genes.

We focused on the sequencing data of chromosome 1 to study the influence of SNPs. After we modified the genotype at all homozygous and some heterozygous SNP sites, the number of PQinOQs increased by 126, indicating that the modified PQs were closer to the genotype of the sequenced sample. For the heterozygous SNPs, we listed all the relevant reads at two typical loci and visualized the change in Phred scores. The difference in quality scores between the wild-type and mutant genotypes revealed the influence of the SNPs.
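The per-locus median and low-quality scan described in the abstract can be illustrated with a minimal sketch. It is written in Python for illustration and is not the thesis's Perl implementation. The function names, the `window` and `drop` parameters, and the baseline rule (flag a locus when its median falls at least `drop` below the median of the last `window` normal loci) are all assumptions; the abstract states only that median Phred scores are compared between loci within a small range.

```python
from statistics import median


def per_locus_median(columns):
    """columns[i] is the list of Phred scores of all reads covering locus i.

    Assumes every locus is covered by at least one read.
    """
    return [median(scores) for scores in columns]


def scan_low_quality(medians, window=3, drop=8):
    """Return (start, end) index pairs of runs of low-quality loci.

    Hypothetical rule: a locus is low quality when its median Phred score is
    at least `drop` below the median of the last `window` loci judged normal.
    """
    regions = []
    baseline = []   # medians of the most recent normal loci
    start = None    # start index of the low-quality run currently open
    for i, m in enumerate(medians):
        is_low = len(baseline) == window and median(baseline) - m >= drop
        if is_low:
            if start is None:
                start = i
        else:
            if start is not None:
                regions.append((start, i - 1))
                start = None
            # Only normal loci update the baseline, so a long low-quality
            # run cannot drag the reference level down with it.
            baseline = (baseline + [m])[-window:]
    if start is not None:
        regions.append((start, len(medians) - 1))
    return regions
```

Because the baseline is local, the scan adapts to gradual quality drift along a read while still reacting to abrupt drops. In practice `window` and `drop` would need tuning for each data set, in the spirit of the parameter adjustment step mentioned in the abstract.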
Keywords/Search Tags:G-quadruplex, next-generation sequencing, DNA secondary structure, SNP