
A Study On Protein Family Classification Based On Multi-level Feature Extraction

Posted on: 2024-03-07
Degree: Master
Type: Thesis
Country: China
Candidate: G K Zhou
Full Text: PDF
GTID: 2530307067493364
Subject: Software engineering
Abstract/Summary:
Next-generation sequencing technologies have produced a large number of newly sequenced proteins, and classifying these proteins into functional families is an important task in protein characterization. Because traditional experimental assays are costly and time-consuming, protein classification based on machine learning has become an active research topic. However, existing work extracts features at only a single level, shows bias in classification, and imposes an upper limit on the protein sequence length in the dataset, all of which greatly limit the applicability of the classification models. To address these problems, the paper proposes a family classification model for ultra-long proteins that combines multi-level feature information with dilated convolution. Performance tests on commonly used protein datasets show that the proposed model has significant advantages. The main work and contributions of the paper are as follows.

First, the skip-gram model is used to construct amino acid mapping matrices, addressing the lack of amino acid correlation information and position information. Unlike previous protein classification models that use one-hot encoding, the paper encodes amino acids with a distributed encoding and uses the skip-gram model for the first time to construct the amino acid mapping matrices. The experimental results show that distributed encoding preserves more amino acid correlation features and positional features.

Second, a new deep neural network structure is designed for multi-level feature extraction of proteins, addressing the insufficient protein feature information extracted by current work. The network consists mainly of an embedding layer, a convolutional neural network module, a global and local feature (GL) module, and a fusion module. The embedding layer uses a mapping matrix constructed by the skip-gram model to encode amino acids faster and with better information retention. Inspired by the attention mechanism, the GL module is designed to extract both the local features of individual amino acids and the global features of the whole sequence, and the fusion module fuses the resulting feature maps into the final feature vector used for classification. The experimental results show that the network designed in the paper achieves a 3% to 7% improvement on five relevant metrics.

Third, a new classification framework for ultra-long protein sequences is designed, addressing the lack of research on classifying such sequences. The framework consists mainly of an embedding module, a global feature module, a multiscale computation module, and a long-time memory module. The multiscale computation module extracts multiscale feature information using convolutional kernels of different sizes, which improves the discovery of protein domains. The long-time memory module, inspired by dilated convolution and causal convolution, is built to improve feature retention and feature extraction for long protein sequences. The experimental results show that, compared with traditional protein classification networks, the framework is more stable and more accurate when handling ultra-long protein sequences.
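The abstract does not give the layer configuration of the long-time memory module, so the following is only a minimal sketch of the two ideas it names, dilated convolution and causal convolution, and not the thesis's implementation. The function names `causal_dilated_conv1d` and `receptive_field`, the kernel size of 2, and the dilation rates 1, 2, 4, 8 are all assumptions chosen for illustration. The sketch shows why stacking dilated layers suits long sequences: the span of residues that one output position can see grows with the sum of the dilation rates, while each layer still uses a small kernel.

```python
def causal_dilated_conv1d(x, kernel, dilation):
    """Causal dilated 1-D convolution: y[t] = sum_k kernel[k] * x[t - k*dilation].

    Positions before the start of the sequence are treated as zero, so y[t]
    never depends on inputs later than t (the causal property).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            i = t - k * dilation
            if i >= 0:
                acc += w * x[i]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input positions visible to one output of the stacked layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Hypothetical stack: kernel size 2, dilation doubling per layer (assumed values).
dilations = (1, 2, 4, 8)

# Push a unit impulse through the stack to see how far its influence spreads.
h = [1.0] + [0.0] * 15
for d in dilations:
    h = causal_dilated_conv1d(h, [1.0, 1.0], d)

print(receptive_field(2, dilations))  # 16
print(h)  # the impulse at position 0 now reaches all 16 positions
```

With these assumed settings, four layers cover 16 positions, and each additional layer with a doubled dilation rate roughly doubles that span, which is the property that makes the approach attractive for ultra-long protein sequences.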
Keywords/Search Tags:Protein classification, Distributed coding, Multi-level features, Ultra-long protein sequences, Long-time memory module