
Study On The Identification Of Pattern And Its Power Based On Sequence Analysis

Posted on: 2014-01-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Y Di
Full Text: PDF
GTID: 1220330398460228
Subject: Probability theory and mathematical statistics
Abstract/Summary:
Gene expression is the process by which a cell converts the genetic information stored in DNA into active protein. It occurs in two major stages. The first is transcription, in which one strand of the DNA is copied into a messenger RNA (mRNA) molecule according to the principle of complementary base pairing. The second is protein synthesis, also known as translation, so called because there is no direct correspondence between the nucleotide sequence in DNA (and RNA) and the sequence of amino acids in the protein. In this stage the mRNA serves as the template, tRNA serves as the carrier, and the protein is synthesized with the help of enzymes.
Transcription factor binding sites are the regions of DNA to which transcription factors bind when they regulate gene expression. They include promoters, enhancers, and silencers, and are also known as cis-acting elements. A transcription factor binding site does not encode any protein; it only provides a location where a transcription factor can bind to regulate gene expression. In the molecular sequence, the binding sites of each transcription factor follow a given pattern, known as a motif. Identifying motifs is very important in genomic research.
Early on, scientists identified transcription factor binding sites through electrophoretic mobility shift assays (EMSA) and DNase footprinting, but these methods are time-consuming, do not give accurate results, and do not support high-throughput analysis. In the mid-1990s, capillary array electrophoresis made high-throughput analysis possible.
Recently, by combining chromatin immunoprecipitation (ChIP) with microarrays (chip), scientists have obtained a large amount of ChIP-chip data, in which the sequences are about 800 bp long ([26],[45]). Many approaches have been developed to identify transcription factor binding sites in these long sequences, but the power of such identification has been studied only by simulation; no theoretical results on power are available. With the development of next-generation sequencing, combining ChIP with next-generation sequencing has made a large amount of ChIP-seq data available. How to identify transcription factor binding sites in ChIP-seq data, and how to study the power of doing so, is a further new problem. This thesis addresses these two questions.
1. The Power of Motifs Based on Long Sequences
Many approaches have been developed to identify transcription factor binding sites. One of the successful approaches is to identify statistically over- or under-represented patterns in a sequence. So far, only simulation has been used to evaluate the power of motif detection methods: no systematic theoretical formulas are available for the power of detecting over-represented patterns when the sequence contains multiple instances of a motif. In Section 2, we therefore develop a hidden Markov model to study the power of the test statistics.
In the hidden Markov model, we model the sequence data using three components: the background model, the foreground model for the motif, and the distribution of the motifs along the sequence. The model is specified by the emission probabilities of the background model, the position weight matrix of the motif, the motif density, the initial distribution, the state space, and the state transition matrix. Let n be the length of the background sequence, and let W be the motif of interest, with length w.
Let Nw(n) be the number of occurrences of W in the sequence.
In the theoretical part, we first give the mean and variance of Nw(n) in Subsection 2.2.1. We then show that the distribution of Nw(n) can be approximated by a normal distribution for frequent patterns and by a compound Poisson distribution for rare patterns.
In the simulation part, we carry out three simulations to evaluate the validity of the theoretical results. In the first simulation, the pattern of interest is "11", the state space is {0,1}, the probability of drawing 1 in the background sequence is 0.1, 0.5, or 0.7, and the motif density is 0, 0.05, or 0.1. In the second simulation, the state space is {A,C,G,T} and we consider two patterns, "ACGT" and "CGCG", under three settings: CG poor, uniform, and CG rich. In the third simulation, we consider two relatively long patterns, ACGTATC and AAGAAGAA, under the same three settings.
For these three simulations, we use three criteria to compare the theoretical and simulated results. First, we compare the simulated mean and variance with the theoretical mean and variance. Second, we compare the simulated power with the approximated power: the normal approximation for pattern "11", and the compound Poisson approximation for patterns "ACGT", "CGCG", "ACGTATC", and "AAGAAGAA". Third, for pattern "11" we use a Q-Q plot to compare the standardized Nw(n) with the standard normal distribution, and for patterns "ACGT", "CGCG", "ACGTATC", and "AAGAAGAA" we compare histograms of the simulated values of Nw(n) with the compound Poisson distribution. We also provide an online program to calculate the power of a pattern.
In the real-data part, we give four examples.
In the first example, we consider the CpG-enriched regions of C. elegans, D. melanogaster, and E. coli, and we plot the power of the number of occurrences of CpG under the normal approximation against the sequence length. In the other three examples, we consider the binding sites of the transcription factor SP1, a C2H2 zinc finger motif, and a structural motif. Based on the position weight matrices of these examples, we plot the power under the compound Poisson approximation against the motif density.
2. The Identification of Motifs and Its Power Based on Next-Generation Sequencing
The Human Genome Project (HGP) was completed with first-generation sequencing, but it cost nearly three billion dollars and took about thirteen years, so first-generation sequencing is not an ideal sequencing method. Since the beginning of the 21st century, next-generation sequencing technology has been developed. It retains the accuracy of first-generation sequencing while offering low cost and high throughput, and it has therefore been used in many biological studies.
In next-generation sequencing data, the sequence reads are randomly sampled from the genome sequence of interest. Most computational approaches for next-generation sequencing data first map the reads to the genome and then analyze the data based on the mapped reads. However, many organisms have unknown genome sequences, and many reads cannot be uniquely mapped to the genome even when the genome sequence is known. A new method is therefore needed to analyze next-generation sequencing data.
Here we use word patterns to analyze next-generation sequencing data. Word pattern counting has played an important role in molecular sequence analysis.
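As a minimal illustration of word pattern counting in a single long sequence, the sketch below counts overlapping occurrences of a word and computes the expected count under a plain i.i.d. background with no planted motifs, where the expected count is (n - w + 1) times the product of the letter probabilities. This background-only setting is a simplifying assumption and is not the hidden Markov model of the thesis; the function names are hypothetical.

```python
def count_occurrences(seq, word):
    """Count occurrences of `word` in `seq`, allowing overlapping matches."""
    w = len(word)
    return sum(1 for i in range(len(seq) - w + 1) if seq[i:i + w] == word)


def expected_count_iid(n, word, probs):
    """Expected number of occurrences of `word` in an i.i.d. sequence of
    length n with letter probabilities `probs` (assumption: no planted motifs).
    """
    p_word = 1.0
    for letter in word:
        p_word *= probs[letter]
    # There are n - w + 1 possible start positions, each matching with p_word.
    return (n - len(word) + 1) * p_word


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(count_occurrences("CGCGCG", "CGCG"))
    print(expected_count_iid(1000, "ACGT", uniform))
```

Overlapping matches are counted on purpose: self-overlapping words such as "CGCG" tend to occur in clumps, which is the usual motivation for a compound Poisson rather than a plain Poisson approximation for rare words.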
Many approaches have been developed to analyze the number of occurrences of a word pattern in a long sequence and to give its approximate distribution, but for next-generation sequencing data no studies on the distribution of the number of occurrences of word patterns have been carried out. In Section 3, we develop a probabilistic model. In this model, the background sequence is an i.i.d. random sequence of length n, and M reads of length β are chosen at random from the background sequence. For a pattern W, let Nw(M, n, β) be the number of occurrences of W in these M reads of length β.
In the theoretical part, as in the previous section, we first give the mean and variance of Nw(M, n, β). We also consider the normal approximation and the compound Poisson approximation. For the compound Poisson approximation in particular, we treat the single-strand and double-strand cases separately, and we give the total variation distance for all three approximations. In the last section of the theoretical part, we discuss the power of Nw(M, n, β) using the hidden Markov model developed in Section 2.
In the simulation part, we consider five patterns: "TAT", "ACGT", "CGCG", "ACGTATC", and "AAGAAGAA". For the nucleotide probabilities, we consider three settings: CG poor, uniform, and CG rich. In all our simulations, we compare the histogram of the simulated values of Nw(M, n, β) with the compound Poisson distribution, and in some settings we also plot the density curve of the normal approximation. In addition to the histograms, we consider the power of Nw(M, n, β) and compare the simulated power with the theoretical power for these five patterns. We have also developed a MATLAB GUI program to calculate the p-value of a pattern.
In the real-data part, we consider ChIP-seq data for the binding sites of the transcription factor GABP.
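The read-sampling model described above (M reads of length β drawn from an i.i.d. background sequence of length n) can be sketched as a small simulation. The sketch assumes that read start positions are uniform and independent, that only a single strand is read, and that the background contains no planted motifs. Under those assumptions each read is itself an i.i.d. sequence of length β, so the expected count is M(β - w + 1) times the word probability. All function names are hypothetical and none of this is the thesis's own code.

```python
import random


def random_sequence(n, probs, rng):
    """Draw an i.i.d. sequence of length n with letter probabilities `probs`."""
    letters = list(probs)
    weights = [probs[a] for a in letters]
    return "".join(rng.choices(letters, weights=weights, k=n))


def sample_reads(genome, num_reads, read_len, rng):
    """Sample reads with uniform, independent start positions (assumption)."""
    max_start = len(genome) - read_len
    starts = [rng.randint(0, max_start) for _ in range(num_reads)]
    return [genome[s:s + read_len] for s in starts]


def count_in_reads(reads, word):
    """Total number of (overlapping) occurrences of `word` over all reads."""
    w = len(word)
    return sum(
        1
        for read in reads
        for i in range(len(read) - w + 1)
        if read[i:i + w] == word
    )


def expected_count_in_reads(num_reads, read_len, word, probs):
    """M * (beta - w + 1) * P(word) under the i.i.d. background assumption."""
    p_word = 1.0
    for letter in word:
        p_word *= probs[letter]
    return num_reads * (read_len - len(word) + 1) * p_word


if __name__ == "__main__":
    rng = random.Random(0)
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    genome = random_sequence(100_000, uniform, rng)
    reads = sample_reads(genome, 2_000, 36, rng)
    print("simulated:", count_in_reads(reads, "ACGT"))
    print("expected: ", expected_count_in_reads(2_000, 36, "ACGT", uniform))
```

Because reads may overlap on the genome, the counts in different reads are not independent, which affects the variance but not the mean computed here.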
We obtain the p-values of all patterns of length 6 for the control data and the ChIP-seq data through the compound Poisson approximation. We then analyze the 10 smallest p-values and their corresponding patterns, and construct a consensus sequence that matches the real sequence.
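The keywords list Panjer recursion, a standard way to evaluate compound Poisson probabilities, although the abstract does not say how the thesis applies it. The sketch below is a generic version under stated assumptions: the number of clumps is Poisson with rate `lam`, clump sizes are i.i.d. positive integers with probability mass function `clump_pmf`, and the p-value is the upper tail at the observed count. The parameter values, function names, and the way `lam` and `clump_pmf` would be derived from a word and a background model are hypothetical.

```python
import math


def compound_poisson_pmf(lam, clump_pmf, k_max):
    """Return [P(S = 0), ..., P(S = k_max)] by Panjer recursion, where
    S = X_1 + ... + X_N, N ~ Poisson(lam), and the X_i are i.i.d. with
    P(X = j) = clump_pmf[j] for integers j >= 1.
    """
    g = [0.0] * (k_max + 1)
    g[0] = math.exp(-lam)  # S = 0 only when N = 0, since clump sizes are >= 1
    for k in range(1, k_max + 1):
        total = 0.0
        for j, f_j in clump_pmf.items():
            if 1 <= j <= k:
                total += j * f_j * g[k - j]
        g[k] = lam / k * total
    return g


def upper_tail_p_value(lam, clump_pmf, observed):
    """P(S >= observed) under the compound Poisson model."""
    if observed <= 0:
        return 1.0
    pmf = compound_poisson_pmf(lam, clump_pmf, observed - 1)
    return max(0.0, 1.0 - sum(pmf))


if __name__ == "__main__":
    # Hypothetical inputs: clump rate 3.0, geometric-like clump sizes truncated at 4.
    clumps = {1: 0.8, 2: 0.15, 3: 0.04, 4: 0.01}
    print(upper_tail_p_value(3.0, clumps, 10))
```

When every clump has size 1, the recursion reduces to the ordinary Poisson distribution, which gives a simple sanity check on the implementation.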
Keywords/Search Tags: Hidden Markov Model, Transcription factor binding site, Motif, Next-generation Sequencing, power, Panjer recursion