Research On Gene Prediction Methods Based On Structural Features And Deep Learning

Posted on: 2023-10-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Wei
GTID: 1520306608468484
Subject: Software engineering
Abstract/Summary:
Gene prediction is a core task in genome analysis and a key step toward understanding gene regulation and expression. Its accuracy directly affects the accuracy of downstream tasks such as gene function annotation. In many cases, the only way to identify a gene's function is to interfere with the target gene and observe the effect on the phenotype. However, gene prediction has not received enough attention over the past ten years. On the one hand, most gene prediction methods were built on datasets produced by the previous generation of sequencing technology, which are small and of unguaranteed accuracy. On the other hand, effective coding features are still limited to a few biological features (such as codon usage and hexamer usage). In addition, because coding features are complex and nonlinear and some gene functional sites are surrounded by poorly conserved sequences, the performance of existing gene prediction methods still needs improvement.

With the rapid development of next-generation sequencing technology, a large amount of genetic data has accumulated, and much work has been devoted to identifying gene mutations from sequenced data, which makes an effective gene prediction method urgently needed. Meanwhile, deep learning has been applied successfully to many kinds of data, including images, video, and natural language, thanks to its strong end-to-end learning capability and expressive power, and it has also achieved good results in a considerable number of gene prediction applications.

However, existing deep learning-based gene prediction methods struggle to fully extract the useful features in biological sequences. On the one hand, genes in biological sequences have strong structural features. For example, the codons in a protein-coding region are arranged contiguously in a fixed order (codon continuity), and the translation initiation site (TIS, also called the translation start site) lies at the boundary between the non-coding region and the coding region in the first reading frame. Existing methods ignore these structural features or fail to make full use of them. On the other hand, unlike image data, a biological sequence is symbolic data that carries rich semantic information, and biological features are often heterogeneous. Methods that rely on a single data representation and a single computational model can hardly extract these features comprehensively.

Based on this analysis, this dissertation takes two subtasks of gene prediction, protein-coding region prediction and TIS prediction, as examples and explores how deep learning can be combined with the structural features of genes to further improve performance on both subtasks. The main contents are as follows:

1. Because existing methods for protein-coding region prediction ignore the structural features of coding regions, a gene prediction method based on structural features and a bidirectional recurrent neural network with skip connections is proposed. The method exploits codon continuity for the first time and models the dependence among coding labels within the coding region. A skip connection in the network architecture addresses the problem of passing label messages over long distances. The bidirectional recurrent neural network with skip connections learns the structural feature of codon continuity effectively by capturing the label messages passed from the two neighboring positions. Tests on human and mouse transcriptome sequences show that the proposed method significantly improves on the prediction performance of existing state-of-the-art methods for protein-coding region prediction.

2. Building on the first part, the bidirectional recurrent neural network with skip connections is extended to genome sequences. Considering the heterogeneity of coding features in biological sequences, a protein-coding region prediction method based on hybrid encoding and a convolutional-bidirectional recurrent neural network is proposed. The method uses convolutional neural networks to capture global sequence order information for the first time, and it introduces the popular gapped k-mer (gkm) feature into protein-coding region prediction for the first time. By integrating the above three heterogeneous features, the proposed method performs best among existing state-of-the-art methods for protein-coding region prediction on human and mouse genome and transcriptome sequences.

3. Because most current deep learning-based TIS prediction methods ignore the structural features around the TIS and cannot make full use of coding features, a TIS prediction method based on structural features and deep learning is proposed. The method takes advantage of the structural feature that the TIS lies at the boundary between the non-coding region and the coding region in the first reading frame, and it uses a dependency network to explicitly model the label dependence between the coding region and the TIS. The coding features learned in the second part are then integrated into the convolutional neural network for structural feature learning. In addition, the ribosome scanning model and the structural features around the stop codon are incorporated into TIS prediction in transcriptome sequences. Tests on human and mouse genome and transcriptome sequences show that the proposed method significantly improves on the prediction performance of existing state-of-the-art methods for TIS prediction.

Together, these three parts confirm that flexibly combining the structural features of genes with deep learning is of great significance for improving the performance of gene prediction methods. They also confirm the effectiveness of hybrid encoding for extracting coding features.
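
As a rough illustration of the gapped k-mer (gkm) feature mentioned in the second part, the following Python sketch counts gapped k-mers in a DNA sequence. It uses a generic formulation (every window of length l is reduced to each possible choice of k informative positions, with the remaining positions masked) and is not the dissertation's implementation. The function name, the parameters l and k, and the gap symbol are assumptions introduced here for illustration.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(seq, l=4, k=2, gap="-"):
    """Count gapped k-mers: length-l windows in which only k positions are kept.

    Hypothetical helper for illustration; l, k and the gap symbol are assumed.
    """
    seq = seq.upper()
    counts = Counter()
    # Every way of choosing which k of the l window positions stay informative.
    position_sets = list(combinations(range(l), k))
    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases such as N
        for kept in position_sets:
            # Keep the chosen positions and mask the others with the gap symbol.
            pattern = "".join(
                window[i] if i in kept else gap for i in range(l)
            )
            counts[pattern] += 1
    return counts


counts = gapped_kmer_counts("ATGGCC", l=3, k=2)
print(counts["G-C"])  # 2: the windows GGC and GCC both match G-C
```

The resulting counts could serve as a fixed-length feature vector once the patterns are put in a fixed order. How such a vector is combined with the learned features is described only at a high level above, so that step is left out here.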
Keywords/Search Tags: Bioinformatics, Gene Prediction, Deep Learning, Structural Features, Hybrid Encoding, Label Dependency