The Genome-wide Distribution Of Z-DNA And Its Potential Correlation To Copy Number Variations In Colon Cancer

Posted on: 2017-01-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z L Ren
GTID: 1224330488983829
Subject: Bioinformatics
Abstract/Summary:
In its natural state, double-stranded DNA in living organisms usually exists as B-DNA, the right-handed double helix identified by Watson and Crick in 1953. The discovery of the right-handed double helix marked the birth of modern molecular biology. From then on, biologists regarded DNA as a static structure that followed the B-DNA model, until Z-DNA (Z-conformation DNA, also known as left-handed helical DNA) was first observed by high-resolution single-crystal X-ray diffraction by Rich et al. in 1979. Scientists then recognized that DNA actually exists in a dynamic equilibrium among various conformations.

The sugar-phosphate backbone of Z-DNA follows a zigzag path, and the purine and pyrimidine bases along the strand alternate between the syn and anti conformations, whereas all bases of B-DNA are in the anti conformation. The major groove of the B-DNA double helix no longer exists in Z-DNA; it is replaced by a structure resembling a minor groove. Moreover, the Z-DNA conformation is a high-energy transient state, so it is difficult to detect Z-DNA experimentally.

In the first few years after its discovery, molecular biologists and biophysicists paid a lot of attention to this unusual structure, because of the general rule in the life sciences that structure correlates with function.

In the 1980s, scientists raised antibodies against Z-DNA (anti-Z-DNA antibodies). Using these antibodies, Nordheim et al. observed anti-Z-DNA fluorescence in transcriptionally active areas as well as in interband regions. Others found that negative supercoiling stabilizes the Z-DNA conformation and reported evidence of Z-DNA in vivo.

Research on Z-DNA nevertheless progressed slowly after its discovery.
After a few years, the biological significance of this unusual conformation was widely questioned by molecular biologists. The number of laboratories working on the conformation and functions of Z-DNA gradually declined, and from the late 1980s to the early 1990s scientists became less and less interested in Z-DNA. However, reports over the last 20 years on the potential involvement of Z-DNA in biological processes and in the pathogenesis of various diseases have recently drawn more and more attention from experimental biologists.

During the 37 years since its discovery, Z-DNA has been found to participate in the transcriptional regulation of certain genes. A study of the distribution of potential Z-DNA on human chromosome 22 found that potential Z-DNA forming regions (ZDRs) are enriched near the transcription start sites (TSSs) of genes rather than randomly distributed. The crystal structure of a junction between B-DNA and Z-DNA was solved in 2005. Subsequent studies found that the chromosomal genetic instability caused by Z-DNA might be associated with some cancers, such as leukemia and lymphoma; that Z-DNA is associated with autoimmune diseases such as systemic lupus erythematosus and type 1 diabetes; and that Z-DNA can trigger DNA damage mechanisms that cause base insertions and deletions in DNA sequences. These findings provide only limited knowledge of the Z-DNA conformation, and we still lack a systematic understanding of its biological significance. This is due not only to the small number of institutions studying Z-DNA, but also to its poor stability in vivo.

It is generally believed that the energy released by negative supercoiling drives the transition of a DNA fragment from the B conformation to the Z conformation (the B-to-Z transition). Thus, the Z-DNA conformation is a high-energy state, and it is transient because it persists only for a very short time.
After performing its function, the DNA releases this energy and returns from the high-energy Z-DNA conformation to the stable, low-energy B-DNA conformation. Because Z-DNA is a high-energy transient state, it is difficult to capture and locate accurately in vivo. In addition, several Z-DNA binding proteins recognize Z-DNA with low efficiency, which leads to poor results when Z-DNA is captured experimentally. It is therefore more feasible to use bioinformatics methods to predict Z-DNA from its sequence and conformational features. Bioinformatics prediction of Z-DNA forming regions not only helps us understand the features of Z-DNA and explore the distribution of potential Z-DNA forming regions at genome scale, but also supports research on its biological functions. This is the starting point and purpose of our research.

Z-Hunt, software for predicting potential Z-DNA forming regions, was developed long ago by Rich's research team and was later improved by other researchers into Z-HuntⅡ (http://gac-web.cgrb/oregonstate.edu/zDNA/index). Jie Xiao presented the method Z-Catcher at the International Conference on Genome Informatics 2008. Zhabinskaya developed SIBZ, another algorithm that predicts potential Z-DNA regions in superhelically stressed DNA of specified sequence. Wang et al. developed dnastructure (http://www.utexas.edu/pharmacy/dnastructure/), a method that predicts H-DNA and Z-DNA by identifying base characteristics of DNA sequences. However, these programs no longer meet current scientific requirements or the trend toward open-source sharing, because of several drawbacks: the input sequence length is limited, the output consists of several potential Z-DNA fragments of fixed length, and they are offered as web services rather than as installable, open-source programs.
They can therefore no longer meet the needs of current research on the Z-DNA conformation. Moreover, Z-DNA data resources for bioinformatics are largely insufficient.

No database is dedicated to storing Z-DNA forming regions in genomes and providing data query and download services. The existing Non-B DNA database (https://nonb-abcc.ncifcrf.gov/apps/site/default) stores nBMST predictions for several common species. However, the genome sequence versions in this database have not been updated, and nBMST is available only as a website query tool, which makes it impossible to predict potential Z-DNA forming regions in other species.

Based on the above, the aims of the present study are:
1. Improving Z-Catcher to provide practical, convenient and efficient bioinformatics prediction of Z-DNA forming regions;
2. Exploring the distribution of potential Z-DNA forming regions in the human genome and the genomes of other model organisms;
3. Building a complete database and a website of prediction results that researchers interested in the biological functions of Z-DNA can query and use;
4. Exploring the potential correlation between Z-DNA and colon cancer copy number variations (CNVs).

Our study consists of the following four parts:

Part I: Prediction of potential Z-DNA regions and development of Z-Catcher2.

We evaluated the advantages and disadvantages of the existing methods Z-Hunt, Z-HuntⅡ, SIBZ and dnastructure by comparing their search strategies.

Firstly, all of the above methods limit the length of the input DNA sequence. The web services of Z-Hunt and Z-HuntⅡ accept query sequences of no more than 1 Mb. The SIBZ query service accepts only DNA sequences of 5-10 kb.
Dnastructure requires an input fragment containing repeated sequences and determines potential Z-DNA regions according to a given initial score.

Secondly, their outputs are limited. The output sequence length of Z-Hunt is 16 to 24 bases, and that of Z-HuntⅡ is 12 to 16 bases; Z-Hunt restricts the length of a potential Z-DNA region to a certain range to balance search speed against the accuracy of the results. However, the length of a potential Z-DNA region is generally not fixed. In addition, the length of the search window in Z-Hunt changes with the Z-Score, which makes potential Z-DNA regions harder to identify. SIBZ reports, for each base of the input sequence, the probability of being in the Z-DNA conformation, but it is difficult to choose a probability threshold for deciding whether a sequence segment is a potential Z-DNA region.

Thirdly, the search strategies of Z-Hunt and Z-HuntⅡ use a fixed-size sliding window that starts at the first base of the query sequence, moves forward by 1 base after each pass, and then searches again. This inefficient strategy inevitably increases the running time of the program. The author of SIBZ estimated that a single analysis of potential Z-DNA regions in the human genome would take approximately 10 days on a server with 100 CPUs (Opteron). These methods can handle small query fragments, but they are too time-consuming to be acceptable for searching the large genome of an organism.

Fourthly, dnastructure and nBMST simply use the sequence feature of alternating purines and pyrimidines to identify potential Z-DNA regions, and discard the most important information, the thermodynamic energy change of base pairs during the conformational transition.
They are therefore even more mechanical than the methods above.

In view of these disadvantages of the existing methods, we rewrote the Perl program around the core strategy of Z-Catcher and renamed it Z-Catcher2. The rewritten program has the following improvements. First, we fixed the bugs in Z-Catcher that caused inaccurate results, and improved the readability of the program by using Perl's strict syntax. Second, we added the start and end positions on the chromosome and an ID for each predicted Z-DNA region to the output file, which facilitates subsequent studies of the distribution of potential Z-DNA regions in the genome. Third, we increased the overall running speed by reducing the number of intermediate files, that is, by reducing the number of file reads and writes. Fourth, we provided two ways of calling the program, an interactive mode and a batch mode, and added a help command and parameter prompts of the kind widely used on Linux systems to assist with entering program parameters. Fifth, we added several scripts that calculate chromosome length, GC content and the number of N bases on a chromosome, for the convenience of further analysis of potential Z-DNA forming regions.

Z-Catcher2 has three advantages over the other existing methods. First, the input sequence can be a fragment of any length: a piece of a gene sequence, a chromosome or an entire genome, and a single file can contain more than one query sequence. Second, a predicted Z-DNA region is at least 12 bases long (the initial sliding window length) and has no upper limit, because the search strategy of Z-Catcher2 is to extend an uninterrupted Z-DNA region as far as possible, using the negative supercoiling density (σ0) as the threshold that determines whether the query sequence has the energy required to adopt the Z-DNA conformation.
Third, the program runs faster: Z-Catcher2 takes about 15 hours to scan the entire human genome.

To verify the predictions of Z-Catcher2, we tested its accuracy on classic Z-DNA fragments. The human c-MYC gene contains three known Z-DNA fragments, Z1, Z2 and Z3. When σ0 = -0.075, fragment Z2 is detected; when σ0 = -0.08, fragments Z1 and Z2 are both detected; when σ0 = -0.09, all three fragments Z1, Z2 and Z3 are detected. By comparison with the nBMST results for the human genome (version hg19), we found that when σ0 = -0.07, 66.88% of the potential Z-DNA regions predicted by Z-Catcher2 overlap with 68.56% of those predicted by nBMST. When σ0 = -0.075, the Z-Catcher2 predictions cover 86% of the nBMST predictions. The remaining non-overlapping regions may result from the different threshold criteria adopted by the two programs and from the fact that nBMST does not consider thermodynamic features.

Part II: Using Z-Catcher2 to predict potential Z-DNA regions in the human genome and model genomes to find the distribution pattern.

Because the Z-DNA conformation is difficult to obtain in vivo, the distribution of Z-DNA forming regions in the human genome is controversial. Limited studies by Ho et al. showed that ZDRs accumulate at the TSSs of human genes, and Zhabinskaya found a similar ZDR distribution in the mouse genome by detecting ZDRs in 12841 mouse genes. Other studies explored the relationships between ZDRs and GC content and TSSs in simple genomes such as SV40, E. coli and Arabidopsis. Furthermore, some researchers studied the ZDR distribution in plant genomes, for instance wheat, rice and potato, and performed GO annotation analysis of genes correlated with ZDRs.
However, studies of ZDR distributions in the genomes of human and other model organisms are relatively rare.

In this part, we used Z-Catcher2 to detect ZDRs in the human genome (version hg38), the chimpanzee genome and the genomes of model organisms (fruit fly, C. elegans, rat, mouse, zebrafish, Sac.ca and Arabidopsis), and obtained some interesting results about the ZDR distribution in the human genome. First, ZDR distribution correlates with the GC content of chromosomes, but Z-DNA formation is not determined solely by GC content. Chromosome 21 has a significantly enriched ZDR density (132 ZDRs per million bp) despite its ordinary GC content (42.2%). Furthermore, ZDR density and GC content show a weak positive correlation across model organisms, but it is not statistically significant. For example, mouse and rat have nearly twice as many ZDRs as human and chimpanzee, although their GC contents are only slightly higher. Second, the lengths of the ZDRs that Z-Catcher2 predicts in the human genome differ significantly from those in randomly generated sequences: about 40% of ZDRs in the human genome are longer than 41 bp, compared with 0.8% in randomly generated sequences, where most ZDRs are 12 to 20 bp long. Third, human ZDRs are clustered near the TSSs of protein-coding genes but are not significantly enriched near the TSSs of non-protein-coding genes. Last, for human protein-coding genes, ZDR density in exons differs significantly from that in introns and on whole chromosomes. This is also found in the chimpanzee genome but not in the other model organisms.
In addition, human ZDR density in lncRNA genes is similar to that in introns and differs significantly from that in exons.

Part III: Building a ZDR database and a website for searching, retrieving and downloading.

Currently, the Non-B DNA database is the only database that stores data on potential Z-DNA regions. Users can download from its website the ZDRs predicted by nBMST in the human genome (version hg19) and in other common species. However, it is not a specialized Z-DNA database, since it also stores data on other DNA conformations, such as G4 motifs and mirror repeats; in addition, its genome versions have not been updated for a long time.

In this part, we store the ZDRs predicted by Z-Catcher2 in human and other model organisms in a ZDR database and build a dynamic web application from which researchers can retrieve and download the data as well as the Z-Catcher2 source code. Because the format of the predicted ZDRs is relatively simple, we use the lightweight open-source software SQLite to create the ZDR database; moreover, the RSQLite package makes SQLite convenient to access from R. The Shiny package for R, developed by RStudio, is used to implement the dynamic web application. With Shiny, one can quickly create attractive and practical dynamic web pages without needing to understand HTML, CSS and other complicated details, and can concentrate on data processing and visualization. Shiny has become a popular tool for building web applications.

Part IV: Exploring the relationship between Z-DNA and copy number variations in colon cancer.

Besides its involvement in the regulation of gene transcription, Z-DNA can also induce genetic instability. In mammalian cells, low-copy CG and TG sequences induce large-scale deletions, translocations and rearrangements. The double-strand breakpoints induced by Z-DNA are consistent with chromosomal breakpoints in leukemias and lymphomas.
In addition, changes in chromosomal secondary structure result in copy number variations (CNVs), and CNVs are commonly found in cancer patients. However, the relationship between Z-DNA and colon cancer CNVs has not been reported.

In this part, using confident CNV peak regions from colon cancer samples, we calculated the ZDR density of each region and compared it with the ZDR density of the corresponding chromosome to examine the potential correlation between ZDRs and colon cancer CNVs. Of the 17 significant focal amplification peak regions, 13 have considerably higher ZDR densities than the corresponding chromosomes, and 15 of the 28 focal deletion peak regions also have higher ZDR densities than the corresponding chromosomes. Furthermore, ZDRs are enriched in 12 of the 15 peak regions associated with tumor aggression. This ZDR enrichment suggests that these peak regions, which contain tumor-suppressor genes and oncogenes, might experience Z-DNA-associated genetic instability more frequently. In other words, these regions undergo frequent genetic instability events, followed by abnormal CNVs that change the expression of the genes involved and lead to the occurrence and progression of cancer.

In conclusion, because of the special conformation of Z-DNA, its biological functions are very difficult to discover in vivo, and using bioinformatics methods to detect Z-DNA forming regions is an essential stepping stone for Z-DNA studies. This study aims to predict and explore ZDR distributions in the genomes of human and other model organisms. Firstly, a novel prediction program named Z-Catcher2 is developed in the Perl programming language. Secondly, Z-Catcher2 is used to predict ZDRs in the genomes of human and other model organisms to explore their distribution. Last, the prediction results are used to create a Z-DNA database, served through a dynamic web application for searching, retrieving and downloading data.
We hope that this study will be helpful for further understanding the biological significance of the Z-DNA conformation.
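
The density comparison described in Part IV (ZDR density of a CNV peak region versus ZDR density of its whole chromosome) can be illustrated with a minimal Python sketch. This is not the code used in the study. The function names, the half-open coordinate convention, and the rule that a ZDR counts toward a region if it overlaps it by at least one base are all assumptions made for illustration; the abstract does not state how overlaps were counted or whether a statistical test was applied.

```python
from typing import List, Tuple

# Hypothetical representation: a ZDR or region is a half-open
# interval [start, end) in base-pair coordinates on one chromosome.
Interval = Tuple[int, int]


def zdr_density_per_mb(zdrs: List[Interval], start: int, end: int) -> float:
    """Number of ZDRs overlapping [start, end), per million bases.

    Assumption: a ZDR is counted if it overlaps the region by at
    least one base; the abstract does not specify the counting rule.
    """
    if end <= start:
        raise ValueError("region must have positive length")
    count = sum(1 for s, e in zdrs if s < end and e > start)
    return count * 1_000_000 / (end - start)


def compare_region_to_chromosome(
    zdrs: List[Interval], chrom_length: int, region: Interval
) -> Tuple[bool, float, float]:
    """Return (enriched, region_density, chromosome_density).

    'enriched' is a plain comparison of the two densities, not a
    significance test.
    """
    region_density = zdr_density_per_mb(zdrs, region[0], region[1])
    chrom_density = zdr_density_per_mb(zdrs, 0, chrom_length)
    return region_density > chrom_density, region_density, chrom_density
```

Applied to every peak region in turn, a comparison of this kind yields counts such as "13 of 17 amplification peaks have higher ZDR density than their chromosome". Densities are expressed per million bases to match the unit the abstract uses for chromosome 21.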
Keywords/Search Tags: Z-DNA, Z-Catcher2, Transcriptional start sites, Distribution of ZDRs, Database of ZDRs, Copy number variation in colon cancer