| Chromatin, as the organizational form of the genome, plays a crucial role in gene expression and regulation. Studying the structural units of chromatin gives a deeper understanding of the composition and mechanisms of the genome and lays the foundation for research on gene expression and regulation. Abnormal chromatin structure is closely related to various diseases, such as chromosome translocations, deletions, and amplifications. Studying chromatin structural units can reveal the mechanisms and progression of disease, providing new methods and ideas for prevention and treatment. Computational methods from bioinformatics can process massive chromatin sequence data and analyze multiple datasets simultaneously. By continuously optimizing algorithms and models, the accuracy and precision of detection can be improved, enabling rapid and efficient detection of the boundaries of topologically associating domains (TADs). Furthermore, analyzing the detection results and the features of the model can offer new perspectives on life activities and gene regulation. Therefore, in view of the limited variety of encoding schemes and the low model accuracy in current chromatin topology detection research, this research comprehensively studies different sequence feature encoding schemes and constructs an ensemble learning model to detect chromatin topological domains. The main research contents are as follows: (1) To construct a high-quality and efficient ensemble model, this study first selected the data feature encoding schemes, comprehensively comparing the performance of seven feature encoding schemes, including K-mer, mismatch k-tuple, and nucleotide pair spectrum encoding. The data was subjected to standardized preprocessing, and the feature encoding schemes required for the experiments were chosen. Feature extraction was performed on DNA sequence data using the
selected encoding schemes, and feature importance was compared and analyzed across different combinations of encoding schemes. Finally, the optimal encoding scheme was determined through visualization and analysis of the results. The results demonstrated that K-mer feature encoding performed best in detecting topological domain boundaries in the fruit fly. (2) To investigate algorithms for detecting topological domain boundaries based on the three-dimensional structural characteristics of chromatin and ensemble learning methods, this study designed and established an ensemble learning framework called StackTADB. The framework integrates four base classifiers: random forest, logistic regression, K-nearest neighbors, and support vector machines. Using the stacking ensemble method in conjunction with K-mer feature encoding, multiple training sets were generated by bootstrap sampling. Each training set was used to train a base classifier, and the outputs of the base classifiers were aggregated to obtain the detection results of the ensemble model. The performance of StackTADB was tested and analyzed using a one-hot encoded DNA sequence dataset created in previous studies. The results showed that StackTADB outperformed traditional feature models and deep learning models on six metrics: AUC, accuracy, Matthews correlation coefficient, precision, recall, and F1 score. Relative to the best-performing traditional feature model, StackTADB improved these metrics by 1.4%, 6.5%, 13.9%, 6.4%, 6.5%, and 6.5%, respectively; relative to the best-performing deep learning model, it improved them by 3.6%, 10.3%, 23.0%, 10.2%, 10.3%, and 10.3%, respectively. To make the model easier to understand and more credible, this study used the SHAP (SHapley Additive exPlanations) framework to explain the detection results of StackTADB. The analysis identified the crucial role of subsequences matching the BEAF-32 motif in detecting topological domain boundaries. This research provides an
effective detection tool for downstream analysis of chromatin topological domains. (3) To facilitate the application of topological domains in biomedical research, this study developed a topological domain detection and analysis system using the Django backend framework and the LayUI frontend framework. The system is primarily designed to detect chromatin topological domains from DNA sequence data using the StackTADB model, and it provides users with a simple, efficient, and flexible interactive interface. The model gives biologists an efficient and robust TAD detection tool and gives bioinformaticians models and methods for detecting chromatin structural units and regulatory elements, which supports downstream analysis of TADs and advances research on three-dimensional genome structure. |
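To make the K-mer feature encoding mentioned in the abstract concrete, the following is a minimal sketch of one common formulation: a normalized frequency vector over all length-k substrings of a DNA sequence. The function name `kmer_encode`, the default `k=3`, and the choice to skip windows containing ambiguous bases are assumptions made for illustration; the abstract does not specify the exact variant or parameters used in the study.

```python
from itertools import product


def kmer_encode(sequence, k=3, alphabet="ACGT"):
    """Return a normalized k-mer frequency vector (hypothetical helper).

    The vector has len(alphabet) ** k entries, ordered lexicographically
    by k-mer (AAA, AAC, AAG, ... for k=3).
    """
    sequence = sequence.upper()
    # Enumerate every possible k-mer once so each gets a fixed position.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}

    counts = [0] * len(kmers)
    total = 0
    # Slide a window of width k across the sequence, one base at a time.
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        pos = index.get(window)
        if pos is not None:  # assumption: skip windows with bases such as N
            counts[pos] += 1
            total += 1

    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]


if __name__ == "__main__":
    features = kmer_encode("ACGTACGT", k=2)
    print(len(features))  # 16 possible dinucleotides
```

Each sequence is mapped to a fixed-length vector regardless of its own length, which is what allows classical classifiers such as random forest or logistic regression to consume it directly.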
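Five of the six evaluation metrics named in the abstract (accuracy, Matthews correlation coefficient, precision, recall, and F1 score) can be computed from the confusion-matrix counts of a binary boundary/non-boundary classifier. The sketch below uses the standard definitions; the function name `binary_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions for illustration, not details taken from the study. AUC is omitted because it requires ranked prediction scores rather than hard labels.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute standard binary metrics from 0/1 labels (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def safe_div(num, den):
        # Assumption: report 0.0 instead of raising when undefined.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }


if __name__ == "__main__":
    print(binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1]))
```

The Matthews correlation coefficient uses all four confusion-matrix cells, so it is more sensitive to class imbalance than accuracy, which may help explain why its reported improvement differs from that of the other metrics.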