
Prediction Of Active Binding Sites For Transcription Factor CTCF

Posted on: 2021-01-25  Degree: Master  Type: Thesis
Country: China  Candidate: Y Liu  Full Text: PDF
GTID: 2370330620976583  Subject: Physics
Abstract/Summary:
CCCTC-binding factor (CTCF) is a multiple zinc finger protein, widely distributed in eukaryotes, that exerts diverse functions in different genomic contexts. CTCF participates in multiple biological processes, including transcription, imprinting, and long-range chromatin interactions. It is well known that CTCF can act as both a transcriptional repressor and an activator. In addition, various findings indicate that CTCF is a major tumor suppressor gene. Disruption of CTCF binding at specific gene loci may increase the risk of developing cancers such as breast cancer by causing aberrant expression of cancer-related genes. Genome-wide ChIP-seq analysis has revealed tens of thousands of CTCF binding sites, indicating its wide-ranging regulatory function in the genome. The binding of CTCF is affected by a variety of factors, including DNA sequence, the binding of other transcription factors, chromatin accessibility, DNA methylation, and histone modifications.

The ENCODE project provides a wealth of epigenetic modification data that has proven to be a valuable resource for the study of gene regulation. The data used for the analysis and prediction in this work were downloaded from the ENCODE website. Using CTCF peak data from 82 cell lines, we constructed data sets of CTCF active binding sites (positive set: 876 sites, named CABS) and CTCF inactive binding sites (negative set: 231130 sites, named CIBS) in the GM12878 cell line. A variety of epigenetic signals, including DNase-seq, RAD21, SMC3, H3K9ac, H3K27me3, H3K9me3, H3K4me3, H3K4me2, and H4K20me1, were then extracted from ENCODE. Finally, we used support vector machines (SVM, jackknife validation) and random forest (RF, 5-fold cross-validation) to predict active CTCF binding sites in GM12878. With nine features, the prediction accuracies were 93.87% and 94.46%, and the average prediction accuracies over 100 runs were 94.78% and 95.40%. Comparable accuracies were achieved using only information from DNase-seq, RAD21, and SMC3. The results show that chromatin accessibility from DNase-seq data and the binding information of RAD21 and SMC3 have the strongest predictive power for active CTCF binding sites, while histone modifications provide moderate predictive power.

Furthermore, data sets of CTCF binding sites associated with breast cancer were constructed from the ChIP-seq data of the MCF-7 (30859 sites) and HMEC (13171 sites) cell lines. Using DNase-seq data and DNA methylation data in RRBS format, the signal distribution within 400 bp regions centered at the midpoint of each CTCF peak was counted in the cancer and control cell lines. Combined with three motif matrices for the transcription factors CTCF, RAD21, and SMC3, five kinds of features based on the discrete increment were used to predict active CTCF binding sites. The prediction accuracies of SVM and RF were 83.09% and 84.19%, respectively. These results show that a prediction accuracy of more than 80% can be obtained for active CTCF binding sites in the MCF-7 cell line. The prediction performance also indicates that chromatin accessibility and DNA methylation have a stronger effect on CTCF binding, and that RAD21 and SMC3 play a role in CTCF binding. Our research is helpful for analyzing and predicting the interaction between DNA and other transcription factors.
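
The 400 bp windowing step described for the MCF-7 and HMEC analysis can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used in the thesis: the half-open BED-style coordinates, the function names `centered_window` and `binned_counts`, and the choice of 8 bins are all hypothetical.

```python
from typing import List, Tuple

# Half-open interval [start, end), BED-style (an assumption, not from the thesis).
Interval = Tuple[int, int]


def centered_window(peak_start: int, peak_end: int, width: int = 400) -> Interval:
    """Return a window of `width` bp centered on the midpoint of a peak."""
    mid = (peak_start + peak_end) // 2
    start = max(0, mid - width // 2)  # clip at the chromosome start
    return (start, start + width)


def binned_counts(window: Interval, positions: List[int], n_bins: int = 8) -> List[int]:
    """Count signal positions (e.g. DNase-seq cut sites or methylated CpGs)
    falling into equal-width bins across the window.

    The number of bins is a hypothetical choice for illustration.
    """
    start, end = window
    bin_size = (end - start) // n_bins
    counts = [0] * n_bins
    for pos in positions:
        if start <= pos < end:
            # Guard the last bin in case the width is not divisible by n_bins.
            idx = min((pos - start) // bin_size, n_bins - 1)
            counts[idx] += 1
    return counts


# Example: a peak spanning 1000-1300 has midpoint 1150.
window = centered_window(1000, 1300)
counts = binned_counts(window, [950, 999, 1000, 1349, 1350, 2000])
print(window, counts)
```

Positions outside the window (here 1350 and 2000) are ignored, so each peak yields a fixed-length count vector that could serve as input to a downstream feature calculation.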
Keywords/Search Tags:CTCF, active site, motif, DNase-seq, DNA methylation, discrete increment