Essential Protein Identification In The PPI Network Based On Feature Selection

Posted on: 2024-03-28
Degree: Master
Type: Thesis
Country: China
Candidate: S R Xia
Full Text: PDF
GTID: 2530307106965309
Subject: Computer Science and Technology
Abstract/Summary:
As high-throughput biotechnology has developed rapidly, massive amounts of biological data have been generated. Irrelevant and redundant features in the high-dimensional sequence datasets produced by high-throughput technologies not only trigger the curse of dimensionality but also interfere with model training, degrading predictive performance, so feature selection for high-dimensional biological sequence datasets is important. Analyzing the connection patterns of nodes in protein interaction networks, and integrating them with biological data to discover protein functions, has gradually become a hot topic in bioinformatics. Identifying protein complexes and key proteins in protein interaction networks helps to clarify the association between the functional mechanisms in which a protein participates and the spatiotemporal attributes of that protein. Recently, many feature selection methods have been designed for sequence data. However, most of them ignore the gap between redundant and valid samples, resulting in low feature screening accuracy. Graph clustering algorithms based on complex networks are commonly used to identify protein complexes. Nevertheless, existing methods tend to overlook the inherent structure and inherent biological functions of protein complexes. To address these issues, this thesis presents a feature selection method for biological datasets and an essential protein identification method. The main contributions of this thesis are summarized as follows:

(1) This study proposes a novel feature selection framework for biological sequence data, namely Reweighting-Boost, which selects an informative set of features in classification problems based on the boosting algorithm. Specifically, Reweighting-Boost first uses a tree boosting model (XGBoost) that is highly scalable and computationally fast to obtain the initial ranking of the feature set, and then recalculates the feature weights in each iteration to select effective features from high-dimensional data. Unlike previous weighting approaches, which leave all features out, the top P features are added sequentially to the set of selected features and a classification model score is obtained for each. The top-scoring of the P features is then added to the selected set. Finally, the weights of the feature samples are updated for subsequent iterations, and the results are used as input to a deep forest classification model. Experiments with Reweighting-Boost were performed on five datasets from three species: human, Zea mays, and Arabidopsis thaliana. The experimental results demonstrate that Reweighting-Boost screens effective features more effectively and robustly than LPI-HyADBS.

(2) This study proposes a novel essential protein identification method, namely Ess-LGS, which combines multi-source biological data with protein interaction networks. It aims to avoid the false positives and false negatives in protein interaction networks and improves the identification of essential proteins. Because high-dimensional protein sequence data may influence the subsequent experiments, this thesis first applies Reweighting-Boost to the protein sequence data. First, the protein interaction network is constructed and divided into 11 subnetworks based on subcellular localization, because essential proteins are unevenly distributed across subcellular regions. Second, gene expression data and the selected protein sequence features are combined to form protein complexes. Experimental analysis of the resulting protein complexes supported their biological significance. In the experiments, the predicted protein complexes outperformed those of other protein complex identification methods, such as JDC and WDC, on ACC and PPV. Candidate essential proteins were then screened based on the topological attributes of the protein nodes in the identified protein complexes. The Ess-LGS method identified essential proteins with higher accuracy than existing methods.
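
The top-P selection step described in part (1) can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes the importance ranking (for example, from a boosted-tree model) is already given as an ordered list, and it leaves out the per-round reweighting and the deep forest classifier, which the abstract does not specify in detail. The function name `greedy_top_p_selection`, the callback `score_fn`, the early-stopping rule, and the toy scorer are all hypothetical.

```python
from typing import Callable, List, Sequence


def greedy_top_p_selection(
    ranked_features: Sequence[str],
    score_fn: Callable[[List[str]], float],
    p: int,
    max_features: int,
) -> List[str]:
    """Greedy forward selection over an importance-ranked feature list.

    ranked_features: features ordered from most to least important
        (assumed to come from an upstream ranking model).
    score_fn: hypothetical callback returning a validation score for a
        feature subset (higher is better).
    p: number of top-ranked, not-yet-selected candidates tried per round.
    max_features: upper bound on the size of the selected set.
    """
    remaining = list(ranked_features)
    selected: List[str] = []
    best_score = float("-inf")
    while remaining and len(selected) < max_features:
        candidates = remaining[:p]
        # Score each candidate when it is added to the current selection.
        scored = [(score_fn(selected + [f]), f) for f in candidates]
        round_score, round_best = max(scored, key=lambda t: t[0])
        if round_score <= best_score:
            break  # assumption: stop once no candidate improves the score
        selected.append(round_best)
        remaining.remove(round_best)
        best_score = round_score
    return selected


# Toy scorer (illustrative only): "f2" and "f4" carry signal,
# and every selected feature costs a small penalty.
def toy_score(subset: List[str]) -> float:
    useful = {"f2": 0.5, "f4": 0.3}
    return sum(useful.get(f, 0.0) for f in subset) - 0.01 * len(subset)


selected = greedy_top_p_selection(
    ["f1", "f2", "f3", "f4", "f5"], toy_score, p=3, max_features=4
)
print(selected)
```

In practice `score_fn` would wrap a cross-validated classifier score, so each round costs P model fits; keeping P small is what makes this cheaper than scoring every remaining feature.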
Keywords/Search Tags: protein-protein interaction network, feature selection, protein sequence, protein complexes, essential proteins