
Hi-C Data Comparison Method Based On Domain Skeleton

Posted on: 2022-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: J C Li
GTID: 2480306602467054
Subject: Master of Engineering
Abstract/Summary:
The three-dimensional structure of chromatin plays a key role in gene regulation, DNA replication, DNA damage repair, and disease. Folding brings loci that are not adjacent on the linear genome close together in space, which supports long-range interactions. High-throughput chromosome conformation capture (Hi-C) technology emerged to study the spatial structure and regulation of chromatin at the whole-genome level. A genome-wide chromatin interaction map is obtained by high-throughput sequencing of protein-mediated adjacent DNA fragments, which makes it possible to study the role of the three-dimensional structure of the genome in basic cell functions such as gene regulation.

Analysis of Hi-C data is the basis for studying the three-dimensional structure of chromatin. Two commonly used analyses are measuring the similarity between Hi-C datasets and identifying differential regions. Similarity quantification serves as a basic quality control tool: it evaluates the reproducibility of Hi-C experiments, helps ensure that the Hi-C data are scientifically valid, and guides whether experiments should be repeated or carried out in more depth. Understanding the similarity between replicate samples is also an important step in differential analysis and a prerequisite for reliably identifying significant biological differences. Identifying differential regions is an important way to explain, from the perspective of the three-dimensional genome, how cell biological functions differ between states, for example by analyzing differences in chromatin regulation between healthy and diseased cells, and to guide subsequent experiments. However, comparative analysis of Hi-C data is challenging because of technology-driven and sequence-specific biases, and existing methods are still deficient in computational efficiency, accuracy, and interpretability.

This thesis proposes a method for comparing Hi-C data based on domain skeletons. Using Gaussian mixture model clustering and a KD-Tree, and through the steps of screening, merging, and alignment, we obtain a set of significant interaction points that can depict the entire Hi-C interaction map; this set is defined as the domain skeleton. For similarity measurement, the domain skeletons of different Hi-C datasets are combined with a similarity formula to compute a quantitative similarity value. For difference detection, a local relative difference score is calculated at the domain skeleton points using Gaussian filtering, and significant difference points are selected according to a threshold. Both Hi-C similarity measurement and differential region identification are thus carried out on the domain skeleton. Introducing the domain skeleton effectively reduces noise interference and the computational scale of high-resolution data, and improves the validity and interpretability of the difference points.

To verify the validity and interpretability of the domain skeleton, its consistency with chromatin loops was examined; the average overlap ratio is 97.9%. For similarity measurement, the proposed method was compared with the HiCRep and GenomeDisco methods on Hi-C data of different data sizes and resolutions. Its fluctuation amplitude is 10% to 70% lower, which indicates that the proposed method is more stable than the other methods, and at 5 kb resolution it runs 50 times faster than GenomeDisco. For the detection of differential regions, real Hi-C data from multiple cell types were used. Compared with the Find and Selfish methods, the consistency of detected difference sites is up to 75%, and the consistency of the corresponding genes is up to 85%. In addition, analysis of the overlap rate, gene enrichment, and central significance shows that this method is superior to the other methods in accuracy and interpretability. On simulated Hi-C data generated from a Poisson distribution, with difference points of varying intensity inserted, our method is also superior to the other methods in recall and accuracy. Furthermore, a comparison of average running times shows that the proposed method takes only 0.79% of the running time of the Find method and is 8.8% faster than the Selfish method. These experiments show that the Hi-C data comparison method proposed in this thesis is accurate, stable, interpretable, and efficient.
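The difference-detection step described in the abstract (Gaussian filtering, a local relative difference score at skeleton points, then a threshold) can be illustrated with a minimal sketch. The abstract does not give the actual scoring formula, so the score used here, |A - B| / (A + B) on Gaussian-smoothed matrices, is an assumption. The function names (`smooth`, `difference_scores`, `significant_points`), the parameter values, and the toy matrices are hypothetical, and the skeleton points are simply passed in rather than derived by the clustering and KD-Tree steps.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian kernel of length 2*radius + 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def smooth(matrix, sigma=1.0):
    """Separable 2-D Gaussian filter with reflect padding (numpy only)."""
    radius = int(3 * sigma + 0.5)
    k = gaussian_kernel(sigma, radius)
    padded = np.pad(np.asarray(matrix, dtype=float), radius, mode="reflect")
    # Filter along rows, then along columns; "valid" trims the padding.
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def difference_scores(a, b, skeleton, sigma=1.0, eps=1e-9):
    """Assumed score: relative difference of the smoothed maps,
    evaluated only at the given skeleton points (i, j)."""
    sa, sb = smooth(a, sigma), smooth(b, sigma)
    rel = np.abs(sa - sb) / (sa + sb + eps)
    return {p: float(rel[p]) for p in skeleton}

def significant_points(scores, threshold):
    """Keep skeleton points whose score reaches the threshold."""
    return sorted(p for p, s in scores.items() if s >= threshold)

if __name__ == "__main__":
    # Toy contact maps: Poisson background, one planted difference in map b.
    rng = np.random.default_rng(0)
    a = rng.poisson(5, size=(20, 20)).astype(float)
    b = a.copy()
    b[5, 12] += 40.0
    skeleton = [(5, 12), (3, 3), (10, 15)]  # hypothetical skeleton points
    scores = difference_scores(a, b, skeleton)
    print(significant_points(scores, threshold=0.2))
```

Restricting the score to skeleton points rather than every matrix entry is what keeps the comparison small at high resolution. The threshold of 0.2 is arbitrary here and would need to be chosen from the data.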
Keywords/Search Tags:3D genome, Hi-C data comparison, Gaussian mixture model, clustering, domain skeleton