Identification of cis-regulatory elements (CREs) is an important research topic in bioinformatics. CREs are diverse and participate in the regulation of gene expression, playing a critical role in cell replication, differentiation, and the development of multicellular organisms. Determining whether a given DNA segment contains specific CREs, and studying the corresponding regulatory mechanisms in depth, plays a decisive role in understanding the pathogenesis of genetic diseases and expanding treatment strategies. Experimental identification of CRE segments is costly, time-consuming, and not scalable, so lightweight and accurate computational methods for identifying CREs are urgently needed. Using traditional machine learning and deep learning, this thesis establishes identification methods for silencers in the human K562 cell line and enhancers in Japanese rice. The main research contents are as follows.

(1) Existing research considers only simple sequence composition features and makes little use of sequence-derived information. To address this, a deep learning-based silencer prediction model with multi-perspective features is constructed. First, the method uses k-mer, nucleotide chemical property, and word embedding encodings to extract features describing sequence composition, physicochemical properties, and semantics. Second, drawing on the Mixup data augmentation method from the image domain, a parallel network architecture of a Convolutional Neural Network (CNN) and a Dense Neural Network (DNN) is constructed to extract multidimensional features of sequences. Finally, Long Short-Term Memory (LSTM) is used to capture long-term dependencies in the sequence data and to further process the fused features abstracted by the CNN and DNN. Experiments show that the predictive performance of this method is better than that of existing methods. In particular, the multi-perspective feature method can provide key features of sequence composition information for
the model, which helps improve the accuracy of the predictive model. The Mixup method effectively expands the silencer data space and enhances the model’s robustness. The combined method performs better than any single method and improves the stability of the model.

(2) To address the insufficient utilization of silencer data, a new silencer dataset is reconstructed from the latest databases. Building on research content (1), a combined feature encoding method is designed and shown to be more effective than any single encoding method. Furthermore, the predictive abilities of different prediction algorithms, oversampling methods, and feature selection methods are explored in the complete feature space. Principal component analysis is then used for dimensionality reduction. Finally, an ensemble learning model is constructed from the base classifiers with the best predictive performance. The experimental results show that predictors with an oversampling algorithm have higher specificity, that the feature selection algorithm removes redundant features and reduces computational overhead, and that the combined method has better predictive performance than a single prediction algorithm.

(3) Given the structural and functional similarities between biological enhancers and silencers, a Bi-directional Gated Recurrent Unit (Bi-GRU)-based enhancer identification model is proposed. A rice enhancer dataset is constructed from existing references, and sequence composition and 3D structural information are extracted using one-hot encoding and DNAshapeR, respectively. A feature extraction module is constructed using a CNN, and a feedforward attention mechanism is added to improve model performance. Experimental results show that the structural features from DNAshapeR have a positive impact on prediction results, that the CNN can autonomously abstract representative data features, and that Bi-GRU has higher predictive performance and lower training costs than LSTM. The design of the feedforward attention mechanism
effectively addresses the difficulty that Bi-GRU has in capturing long-term sequence dependencies.
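As an illustration of two techniques named in research content (1), k-mer composition features and Mixup augmentation, the following Python sketch may help. It is not the thesis implementation: the function names `kmer_frequencies` and `mixup`, the choice of k = 3, the Beta parameter `alpha = 0.2`, and the example sequences are all assumptions made for this example.

```python
import itertools
import random


def kmer_frequencies(seq, k=3):
    """Return the normalized frequency of every DNA k-mer over A, C, G, T.

    The output has 4**k entries in lexicographic k-mer order.
    k=3 is an assumed default, not a value taken from the thesis.
    """
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing N or other symbols
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Mixup: a convex combination of two feature vectors and their labels.

    The mixing weight lam is drawn from Beta(alpha, alpha);
    alpha=0.2 is an assumed value chosen for illustration.
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = lam * y1 + (1 - lam) * y2
    return x, y, lam


# Hypothetical sequences: one labeled as a silencer (1.0), one as background (0.0).
silencer_features = kmer_frequencies("ACGTACGTAC")
background_features = kmer_frequencies("TTTTGGGGCC")
mixed_x, mixed_y, lam = mixup(
    silencer_features, 1.0, background_features, 0.0, rng=random.Random(0)
)
```

Because each input vector sums to 1, the mixed vector also sums to 1, and the mixed label is a soft value between 0 and 1. Training on such interpolated samples is the sense in which Mixup expands the data space.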