Analysis Methods Of The Transcription Factor Binding Sites Based On Chromatin Accessibility Sequencing Data

Posted on: 2022-06-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S W Xu
GTID: 1480306353476144
Subject: Control Science and Engineering
Abstract/Summary:
Modern medical research has shown that many human diseases are directly or indirectly related to genes. In recent years, as understanding of the human genome has improved, it has been found that the so-called "junk" non-coding regions of the genome often play an important role in complex diseases. In fact, the non-coding regions contain a large number of binding sites for gene regulatory elements. Transcription factors are an important class of gene regulatory elements that bind specifically to DNA sequences and thereby regulate their target genes. Studies have shown that aberrant binding of transcription factors in non-coding regions is one of the causes of cancer. With the improvement of genome sequencing technology, genomic data have grown explosively. Faced with such massive data, pattern recognition and machine learning methods can effectively process genomic data and discover the gene regulatory patterns contained in it. In this dissertation, pattern recognition and machine learning methods are used to study the analysis of transcription factor binding sites based on chromatin accessibility data. The research contents are as follows.

Firstly, in order to identify transcription factor binding sites and correct signal bias, high-throughput sequencing data of non-coding regions were obtained from open-source databases. Many kinds of transcription factor binding sites were then detected using data preprocessing methods and bioinformatics tools, and a novel sequence bias correction algorithm was designed. In theory, high-throughput sequencing data can locate transcription factor binding sites with high resolution. In practice, however, DNase sequence bias impairs the specific recognition of binding sites. Comparing the correlation of the data with the real signal before and after correction showed that the corrected data are closer to the real signal. A CNN-based model for identifying transcription factor binding sites was then designed. Its successful prediction of transcription factor binding sites on different datasets verified that the proposed algorithm can effectively identify binding sites from high-throughput sequencing data.

Secondly, a method for predicting transcription factor binding sites from ATAC-seq high-throughput sequencing data was studied. ATAC-seq is a new sequencing technology that can locate transcription factor binding sites with high resolution. For ATAC-seq data, a recognition model for transcription factor binding sites based on a bidirectional gated recurrent unit network was designed. The model effectively exploits the information between protein binding site sequences. In predictions on multiple transcription factor binding site datasets, the model achieved better recognition performance than similar models, which verifies that the algorithm can effectively identify transcription factor binding sites from ATAC-seq high-throughput sequencing data.

Thirdly, to identify potential functional variants within transcription factor binding sites, a computational framework for the automatic identification of allele-specific transcription factor binding sites is proposed. Existing studies have shown that mutations or variations in non-coding sequences may affect the binding of transcription factors and the expression of the genes they regulate, which can eventually lead to disease or cancer. Accurate identification of these functional regulatory variants can therefore help to explore disease mechanisms. In this part of the work, ATAC-seq data from human breast cancer cells and human bone marrow mesenchymal stem cells were used to identify a large number of potentially pathogenic variants with a generalized linear mixed model. By combining RNA-seq and 3D genomic data, the genes regulated by these potential regulatory variants were found. Comparison with published clinical data further showed that these target genes were significantly enriched in related disease pathways, which verified the effectiveness of the algorithm.

Fourthly, to identify non-coding regulatory regions associated with tissue image features, a deep learning-based integrated analysis of breast cancer pathological tissue image features and chromatin accessibility data was conducted. First, a deep learning-based tissue quantification algorithm for breast cancer pathological images was designed. Using this algorithm, the panoramic pathological images of patients with estrogen receptor-positive breast cancer were segmented, and the areas of epithelial tissue and stromal tissue in each image were identified. A correlation analysis between the epithelial tissue proportion and chromatin open regions then identified a large number of regulatory regions related to epithelial tissue area. Function and enrichment analysis of the genes regulated by these regions showed that they were significantly enriched in breast cancer-related pathways. Finally, survival analysis was performed to verify that the target genes and the image features can be used to predict patient prognosis more effectively.

In this dissertation, tumor-associated transcription factor binding sites were analyzed from the four perspectives above, all based on chromatin accessibility sequencing data. Many potentially causal disease-associated sites were found, including a large number of functional variants and regulatory regions related to tumor diseases. These predicted regulatory variants and regulatory regions can provide a theoretical basis and a new method for future precision medicine and personalized treatment.
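
The CNN and bidirectional GRU models mentioned in the abstract take DNA sequences as input. The abstract does not say how those sequences are encoded, so the following is only a minimal sketch of one-hot encoding, a common choice for such models. The function name `one_hot`, the base order `ACGT`, and the uniform 0.25 row for ambiguous bases are assumptions made for illustration and are not taken from the dissertation.

```python
from typing import List

# Assumed column order for the encoding; not specified in the abstract.
ALPHABET = "ACGT"


def one_hot(sequence: str) -> List[List[float]]:
    """Encode a DNA string as a len(sequence) x 4 matrix of floats.

    Each known base becomes a row with a single 1.0 in its column.
    Any other symbol (for example N) becomes a uniform row of 0.25,
    which is an illustrative convention, not the author's stated method.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    matrix = []
    for base in sequence.upper():
        if base in index:
            row = [0.0] * len(ALPHABET)
            row[index[base]] = 1.0
        else:
            row = [1.0 / len(ALPHABET)] * len(ALPHABET)
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    # Print one encoded row per base of a short example sequence.
    for base, row in zip("ACGTN", one_hot("ACGTN")):
        print(base, row)
```

A matrix of this shape (sequence length by four channels) could be passed to a one-dimensional convolutional layer or, one row per time step, to a recurrent layer.
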
Keywords/Search Tags: Genomic non-coding regions, Chromatin accessibility, Transcription factor binding site, Gene expression, Pattern recognition, Machine learning