| Proteins are essential components of many biological processes and are vital constituents of living organisms. They are produced when DNA is transcribed into RNA and the RNA is translated into protein. The Human Genome Project spurred rapid development in bioinformatics and related technologies. With the completion of its sequencing phase, scientific research has entered the post-genomic era. As the scale of biological data continues to grow, the number of proteins without functional labels also increases. There is therefore an urgent need for a computer-based automated annotation system that can analyze protein data comprehensively, accurately, and conveniently. Such a system not only helps researchers understand the physicochemical properties of proteins and explore the underlying principles of life activities, but also provides large-scale data support for fields such as medical genetics, pathology, drug development, and bioscience. Automated prediction of protein function labels is thus one of the important and fundamental tasks of the post-genomic era. Protein sequence data are the most fundamental proteomic data obtained through sequencing technology. Sequence lengths vary widely, typically from a few dozen to several thousand amino acids, and existing methods cannot fully exploit this large volume of sequence data, which poses a challenge for protein function prediction. Moreover, proteomic data are diverse and not limited to sequence data; other types include protein-protein interaction data, protein 3D structure data, and protein co-expression data. How to select appropriate proteomic data for a given need, given the biases introduced by researchers' interests and research hotspots, is a further problem. Finally, how to integrate multiple modalities of proteomic data to predict protein function labels accurately remains an open research challenge. Designing and
implementing an automated analysis system based on deep learning, which has powerful feature extraction capabilities, is one possible solution to these problems. With an improved deep learning-based protein function prediction algorithm, online protein function prediction can be realized, providing a feasible automated workflow for protein function prediction. This study therefore analyzes the various modalities of proteomic data and the Gene Ontology paradigm, examines existing protein function prediction algorithms in depth, and proposes a deep learning-based protein function label prediction model that integrates protein sequence data and protein interaction network data. An online prediction platform based on this model is then designed and implemented. The main contributions of this thesis are as follows. First, this thesis proposes a deep learning-based protein function prediction algorithm, named Deep GAGO, which fuses sequence information and interaction data to address the limitations of existing protein function prediction methods. Most existing methods process protein sequence data with convolutional neural networks, which requires trimming and padding the sequences to fit the network’s fixed input size. This loses valuable training data and increases the learning burden. In response, the Deep GAGO algorithm introduces pyramid pooling so that the network can accept sequences of any length. In addition, protein sequences twist and fold into spatial structures, but most existing algorithms do not consider the local recombination of sequence information during this spatial transformation. To address this issue, the Deep GAGO algorithm introduces dilated convolution to enlarge the network’s receptive field and improve the model’s fit. Furthermore, most existing algorithms directly concatenate the representation vectors of multiple data types, which increases the dimensionality of the classifier’s
input and does not eliminate redundant feature information. To address this issue, the Deep GAGO algorithm introduces an attention-based graph neural network to integrate protein sequence data and interaction information. Finally, the thesis designs rigorous comparative and ablation experiments to verify the effectiveness of the Deep GAGO algorithm on the CAFA dataset, and the experimental results show that the proposed model performs better than the compared methods. Second, this thesis designs and implements an online prediction system that uses deep learning algorithms. The system uses a microservice architecture: the front end is built with technologies such as Vue and Nuxt, and the back end uses the Gin framework. The algorithm is written in PyTorch and served through the Flask framework. The entire system is deployed with Docker and includes functions such as user management, online prediction, an omics database, and visualization. Functional and performance tests conducted after implementation show that the system is functionally complete, runs stably, and meets its performance requirements. In summary, this thesis analyzes and improves deep learning-based protein function prediction algorithms to address automated protein function prediction, and designs and implements an automated online prediction platform based on the improved algorithm. The platform improves the efficiency of protein function label analysis and gives researchers convenient, automated protein function analysis. |
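
The two sequence-processing ideas named in the abstract, dilated convolution and pyramid pooling, can be illustrated with a minimal single-channel sketch in plain Python. This is not the Deep GAGO implementation, which the abstract says is written in PyTorch; the function names `dilated_conv1d` and `pyramid_max_pool1d`, the kernel values, and the pyramid levels `(1, 2, 4)` are assumptions chosen only for illustration.

```python
from typing import List, Sequence


def dilated_conv1d(x: Sequence[float], kernel: Sequence[float], dilation: int) -> List[float]:
    """Valid 1D dilated convolution (cross-correlation) over a single channel.

    Kernel taps are spaced `dilation` positions apart, so the receptive field
    grows without adding parameters. Returns [] if x is shorter than the span.
    """
    span = (len(kernel) - 1) * dilation  # distance between first and last tap
    return [
        sum(k * x[i + j * dilation] for j, k in enumerate(kernel))
        for i in range(len(x) - span)
    ]


def pyramid_max_pool1d(x: Sequence[float], levels: Sequence[int] = (1, 2, 4)) -> List[float]:
    """Max-pool x into 1, 2, 4, ... bins and concatenate the results.

    The output always has sum(levels) values, whatever the input length,
    which is what lets a fixed-size classifier follow a variable-length input.
    """
    n = len(x)
    if n == 0:
        raise ValueError("cannot pool an empty sequence")
    pooled = []
    for bins in levels:
        for b in range(bins):
            start = (b * n) // bins                       # floor of bin start
            end = max(start + 1, -(-(b + 1) * n // bins))  # ceil of bin end
            pooled.append(max(x[start:end]))
    return pooled


# Hypothetical per-residue feature tracks of two very different lengths.
short_track = [float(i % 7) for i in range(40)]
long_track = [float(i % 11) for i in range(900)]

for track in (short_track, long_track):
    features = dilated_conv1d(track, kernel=[1.0, 0.0, -1.0], dilation=2)
    fixed = pyramid_max_pool1d(features)
    print(len(track), "->", len(fixed))  # both inputs yield 7 pooled values
```

Because both tracks reduce to the same number of pooled values, no trimming or padding of the input is needed before a fixed-size layer, which is the motivation the abstract gives for pyramid pooling.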