| In living organisms, multiple factors can cause changes in cellular DNA, including the external environment (such as ultraviolet rays), the chemical environment (such as nitrous acid), oxidative DNA damage, and mismatches during DNA replication. These factors can switch genes on or off and thereby change the biological function of the encoded proteins. As pathogenic variants accumulate, normal cells can turn into cancer cells, threatening life. With the completion of the Human Genome Project, the post-genome era urgently needs to study the essence of variants’ effects at multiple levels, from multiple perspectives, and with multiple technologies, by integrating chromosome, DNA, RNA, and protein information. Such study would help to understand the relationship between specific variants and genetic diseases, guide the formulation of gene therapy programs, and find therapeutic drugs for diseases. Hence, feature extraction and learning algorithms for predicting the functional effects of nucleotide variation have become an active research topic in bioinformatics. At present, nucleotide variation prediction mostly relies on a single traditional machine learning method to learn the potential regulation patterns of variant pathogenicity. However, such methods may fail to capture the nonlinear relationships between protein features. Meanwhile, it is difficult to reveal the biological characteristics of protein sequences through 0-1 encoding on small or medium-sized variation datasets. Moreover, some of the extracted protein feature types are mutually exclusive, so serially combining features cannot effectively improve prediction performance. In addition, more attention has been given to gain-of-function/loss-of-function (GOF/LOF) analysis on small-scale variation data or specific genes, but few prediction models have been developed on large-scale datasets consisting of neutral and pathogenic GOF/LOF variants. To solve the aforementioned 
problems, this thesis studies feature extraction and learning algorithms for nucleotide variation functional effect prediction in three aspects: pathogenicity prediction of missense mutations in transmembrane proteins, pathogenicity prediction of nsSNPs in general proteins, and pathogenicity prediction and outlier variant detection on large-scale neutral and pathogenic GOF/LOF variants. The main innovations are summarized as follows: (1) Based on protein sequence, structure, and energy characteristics, the Gaussian weight attenuation position-specific scoring matrix (WAPSSM), annotation information from existing prediction tools, and the information gain from the previous-layer sub_XGBoost, a pathogenicity prediction model for missense mutations in transmembrane proteins, named MutTMPredictor, is proposed. MutTMPredictor introduces a new Gaussian feature, WAPSSM, built on protein evolution information, which extracts the microenvironment characteristics of mutation sites under different weights. At the same time, the annotations of several missense mutation prediction tools and the outputs of three sub_XGBoost models in the previous layer are used to enrich the feature representation, forming effective feature reuse and yielding a robust prediction framework. (2) Based on protein evolution information, predicted secondary structure, relative solvent accessibility, protein disorder information, and the physicochemical properties of wild-type and mutant amino acids, a pathogenicity prediction model for non-synonymous single nucleotide polymorphisms (nsSNPs), named FFMSRes-MutP, is proposed using multi-scale kernels and deep feature fusion. In FFMSRes-MutP, 2D-ResNet and 1D-ResNet are used to extract 2D protein-based characteristics and 1D amino acid physicochemical properties, respectively. Three groups of multi-scale ResNet blocks are adopted to capture information in different ranges around mutation sites. In addition, the deep feature 
fusion concatenates features from different perspectives to capture a more comprehensive representation and achieve high-precision prediction of pathogenic nsSNPs. (3) We construct a large-scale variation dataset comprising neutral variants and pathogenic GOF/LOF variants (more than 140,000), and extract 503 features at three levels (i.e., variant level, protein level, and genome level) for each variant. A model named RUS-Wg-MSResNet is then proposed to predict variant pathogenicity. In this model, random under-sampling, a newly defined weighted binary cross-entropy loss function, and a multi-scale ResNet are used to reduce the influence of data imbalance and significantly improve the model’s feature extraction ability. Meanwhile, the outlier detection method XGBOD is used to detect the outlier GOF variants among the majority of LOF variants. |
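As an illustration of the two imbalance-handling ideas named in (3), the following minimal sketch shows random under-sampling and a class-weighted binary cross-entropy. It is not the loss defined in the thesis, whose exact weighting is not given in this abstract. The 1:1 sampling ratio, the weights `w_pos` and `w_neg`, the function names, and the toy data are all assumptions made for illustration.

```python
import math
import random


def random_under_sample(samples, labels, seed=0):
    """Randomly drop majority-class samples until both classes have equal size.

    Assumption: a 1:1 target ratio; the abstract does not state the ratio used.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    keep = sorted(rng.sample(pos, n) + rng.sample(neg, n))
    return [samples[i] for i in keep], [labels[i] for i in keep]


def weighted_bce(y_true, y_prob, w_pos=1.0, w_neg=1.0, eps=1e-7):
    """Mean class-weighted binary cross-entropy (a generic form, assumed here).

    loss = -mean(w_pos * y * log(p) + w_neg * (1 - y) * log(1 - p))
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Toy data: 2 positives among 8 samples (hypothetical, not from the thesis).
labels = [1, 0, 0, 0, 1, 0, 0, 0]
samples = list(range(len(labels)))
balanced_x, balanced_y = random_under_sample(samples, labels)

probs = [0.9, 0.2, 0.1, 0.3, 0.4, 0.2, 0.1, 0.05]
plain = weighted_bce(labels, probs)                 # unweighted baseline
weighted = weighted_bce(labels, probs, w_pos=3.0)   # minority errors cost more
```

Raising `w_pos` above 1 makes errors on the minority class contribute more to the loss, which is one common way to counter class imbalance alongside under-sampling of the majority class.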