Feature Extraction And Learning Algorithm For Functional Effect Prediction On Nucleotide Variation

Posted on:2024-09-12

Degree:Doctor

Type:Dissertation

Country:China

Candidate:F Ge

Full Text:PDF

GTID:1520307331972359

Subject:Computer Science and Technology

Abstract/Summary:


In living organisms, multiple factors can cause changes in cellular DNA, including the external environment (such as ultraviolet rays), the chemical environment (such as nitrous acid), oxidative DNA damage, and mismatches during DNA replication. These factors can switch genes on or off and thereby alter the biological function of the encoded proteins. As pathogenic variants accumulate, normal cells can turn into cancer cells, threatening life. With the completion of the Human Genome Project, the post-genome era urgently needs to study the nature of variant effects at multiple levels, from multiple perspectives, and with multiple technologies, by integrating chromosome, DNA, RNA, and protein information. Such study helps to clarify the relationship between specific variants and genetic diseases, to guide the design of gene therapy programs, and to find therapeutic drugs. Hence, feature extraction and learning algorithms for predicting the functional effect of nucleotide variation have become an active research topic in bioinformatics.

At present, nucleotide variation prediction mostly relies on a single traditional machine learning method to learn the potential regulatory patterns of variant pathogenicity. However, such methods may fail to capture the nonlinear relationships between protein features. It is also difficult to reveal the biological characteristics of a protein sequence with 0-1 encoding on small or medium-sized variation datasets. Moreover, different types of extracted protein features can be mutually exclusive, so simply concatenating them does not effectively improve prediction performance. In addition, most attention has gone to GOF/LOF analysis on small-scale variation data or on specific genes, and few prediction models have been developed on large-scale datasets consisting of neutral and pathogenic GOF/LOF variants.

To address these problems, this thesis studies feature extraction and learning algorithms for nucleotide variation functional effect prediction from three aspects: pathogenicity prediction of missense mutations in transmembrane proteins, pathogenicity prediction of nsSNPs in general proteins, and pathogenicity prediction and outlier variant detection on large-scale neutral and pathogenic GOF/LOF variants. The main innovations are summarized as follows:

(1) A pathogenicity prediction model for missense mutations in transmembrane proteins, named MutTMPredictor, is proposed. It is based on protein sequence, structure, and energy characteristics, the Gaussian weight attenuation position-specific scoring matrix (WAPSSM), annotation information from existing prediction tools, and the information gain from the sub_XGBoost models of the previous layer. MutTMPredictor introduces WAPSSM, a new Gaussian feature built on protein evolutionary information, which extracts the microenvironment characteristics of mutation sites under different weights. In addition, the annotations of several missense mutation prediction tools and the outputs of three sub_XGBoost models in the previous layer are used to enrich the feature representation, which provides effective feature reuse and yields a robust prediction framework.

(2) A pathogenicity prediction model for non-synonymous single nucleotide polymorphisms (nsSNPs), named FFMSRes-MutP, is proposed using multi-scale kernels and deep feature fusion. It is based on protein evolutionary information, predicted secondary structure, relative solvent accessibility, protein disorder information, and the physicochemical properties of the wild-type and mutant amino acids. In FFMSRes-MutP, a 2D-ResNet and a 1D-ResNet extract the 2D protein-based characteristics and the 1D amino acid physicochemical properties, respectively. Three groups of multi-scale ResNet blocks capture information in different ranges around mutation sites. Deep feature fusion then concatenates these outputs to capture more comprehensive features from different perspectives and to achieve high-precision prediction of pathogenic nsSNPs.

(3) A large-scale variation dataset comprising neutral variants and pathogenic GOF/LOF variants (more than 140,000) is constructed, and 503 features at three levels (variant level, protein level, and genome level) are extracted for each variant. A model named RUS-Wg-MSResNet is then proposed to predict variant pathogenicity. In this model, random under-sampling, a newly defined weighted binary cross-entropy loss function, and a multi-scale ResNet are used to reduce the influence of data imbalance and to significantly improve the model's feature extraction ability. In addition, the outlier detection method XGBOD is used to detect outlier GOF variants among the majority LOF variants.
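The two imbalance-handling ideas named in innovation (3), random under-sampling and a weighted binary cross-entropy loss, can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not give the definition of its newly defined loss, so the code below uses the standard class-weighted form, and the function names `random_under_sample` and `weighted_bce` and the weights `w_pos` and `w_neg` are hypothetical.

```python
import math
import random


def random_under_sample(samples, labels, seed=0):
    """Down-sample every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        # Draw n_min items without replacement from each class.
        for x in rng.sample(xs, n_min):
            balanced.append((x, y))
    rng.shuffle(balanced)
    return balanced


def weighted_bce(y_true, y_prob, w_pos=1.0, w_neg=1.0, eps=1e-12):
    """Mean class-weighted binary cross-entropy (standard form, an assumption).

    w_pos scales the loss on positive samples, w_neg on negative samples.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


if __name__ == "__main__":
    # Toy data: 2 positive and 8 negative samples.
    data = list(range(10))
    labels = [1, 1] + [0] * 8
    print(random_under_sample(data, labels))
    print(weighted_bce([1, 0], [0.9, 0.2], w_pos=4.0))
```

Under-sampling discards majority-class samples before training, while the loss weights change how much each remaining sample contributes to the gradient; the two can be combined, as the model name RUS-Wg-MSResNet suggests.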

Keywords/Search Tags:

Nucleotide variation prediction, multi-scale ResNet, deep feature fusion, protein big-data, pathogenic GOF and LOF variation


Related items

1	Research On Multi-site Protein Subcellular Localization Prediction Method Based On Fusion Feature And Multi-label Deep Forest Model
2	Genetic Variation Mining And Analysis Of Genomic Data For Large Samples And Multiple Animals
3	Protein Function Prediction Based On Multi-View Feature Fusion
4	Research On Enhancer Prediction Method Based On Deep Learning And Multi-feature Fusion
5	The Classification Prediction Of High Dimensional Data Of Membrane Protein Based On Multi-feature Fusion
6	Research On Prediction Of Nucleic Acid Binding Proteins Based On Deep Neural Network
7	Prediction Method Of Protein Glycation Site Based On Ensemble Deep Learning
8	Based On Feature Fusion Protein Properties Prediction Of Multiple Points Of View
9	Prediction Of DNA Methylation Sites Based On Nucleotide Coding
10	Research And Implementation On Genome Structural Variation Prediction Based On Deep Learning