Predicting Essential Genes Based On Semi-supervised Learning

Posted on: 2022-06-03
Degree: Master
Type: Thesis
Country: China
Candidate: T He
Full Text: PDF
GTID: 2480306536963309
Subject: Electronic Science and Technology
Abstract/Summary:
Essential genes play an indispensable role in the survival and reproduction of organisms. Identifying them is of great significance in synthetic biology, biomedicine, and biochemistry. Essential genes are identified mainly by experimental methods and by computational methods. Most computational studies employ supervised learning, which requires a large number of labeled samples to train a model with adequate performance, but label information for genes is difficult to obtain. In this work, we propose using semi-supervised learning to predict essential genes. Semi-supervised learning uses both labeled and unlabeled samples and so reduces the demand for labeled samples. 41 prokaryotes and 6 eukaryotes were selected as research objects, and the main work carried out is as follows.

First, a graph-based semi-supervised learning method was adopted to predict the essential genes of prokaryotes in order to verify the effectiveness of the proposed method, and comparative experiments were designed to explore the influence of the number of labeled samples on model performance. In these experiments, the proportion of labeled samples in the total sample set was increased from 10% to 90% in steps of 10%. The results show that model performance improves as the proportion of labeled samples increases. When the proportion is 20%, the average AUC score over the 41 prokaryotes is 0.710, indicating that semi-supervised learning can use a limited number of labeled samples to construct an effective essential gene prediction model.

Then, the graph-based semi-supervised learning method was used to predict the essential genes of eukaryotes to further verify the effectiveness of the model. Three groups of comparative experiments were designed to explore how to optimize model performance. Building the model requires constructing a graph, and the kernel function used to measure the similarity of samples affects model performance. The first group of experiments therefore compared the Laplacian kernel function and the Gaussian kernel function as similarity measures. The results show that the model is more effective when the Laplacian kernel function is used. The K Nearest Neighbor (KNN) algorithm is employed to construct the sparse graph, and the value of K also affects model performance. An adaptive K value selection strategy was proposed, which ties the K value to the sample size of the data set, and a second group of comparative experiments based on different K values was designed. The results show that using the adaptive strategy to select an appropriate K value helps to improve performance. In addition, a third group of comparative experiments was designed to explore the influence of the number of labeled samples on model performance. When the proportion of labeled samples is 30%, the average AUC score over the six eukaryotes is 0.710, which further verifies the effectiveness of the method.
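The pipeline described in the abstract (kernel similarity, sparse KNN graph, propagation from labeled to unlabeled genes) can be illustrated with a minimal sketch. The abstract does not name the exact graph-based algorithm, the kernel bandwidth, or the formula behind the adaptive K strategy, so everything below is an assumption made for illustration: the propagation step is a simple clamped label propagation, `gamma` is an arbitrary bandwidth, the square-root rule in `adaptive_k` is a hypothetical stand-in for the thesis's strategy, and the toy feature vectors are invented.

```python
import math

def laplacian_kernel(x, y, gamma=1.0):
    # Laplacian kernel: exp(-gamma * L1 distance). gamma is an assumed bandwidth.
    return math.exp(-gamma * sum(abs(a - b) for a, b in zip(x, y)))

def gaussian_kernel(x, y, gamma=1.0):
    # Gaussian (RBF) kernel: exp(-gamma * squared L2 distance).
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def adaptive_k(n_samples):
    # HYPOTHETICAL rule: K grows with the square root of the sample size.
    # The thesis only states that K is tied to the data set size.
    return max(1, min(n_samples - 1, round(math.sqrt(n_samples))))

def knn_graph(X, k, kernel):
    # Sparse symmetric graph: keep an edge only between a sample and its
    # k most similar samples, weighted by the kernel similarity.
    n = len(X)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sims = sorted(
            ((kernel(X[i], X[j]), j) for j in range(n) if j != i),
            reverse=True,
        )
        for s, j in sims[:k]:
            W[i][j] = W[j][i] = s
    return W

def propagate_labels(W, labels, n_iter=200):
    # labels: 1 = essential, 0 = non-essential, None = unlabeled.
    # Each unlabeled node repeatedly takes the weighted mean of its
    # neighbors' scores; labeled nodes are clamped to their known label.
    n = len(W)
    f = [0.5 if y is None else float(y) for y in labels]
    for _ in range(n_iter):
        new_f = []
        for i in range(n):
            if labels[i] is not None:
                new_f.append(float(labels[i]))
                continue
            total = sum(W[i])
            if total > 0:
                new_f.append(sum(W[i][j] * f[j] for j in range(n)) / total)
            else:
                new_f.append(f[i])
        f = new_f
    return f

# Toy data (invented): two well-separated groups, one labeled gene per group.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
     (3.0, 3.0), (3.1, 2.9), (2.9, 3.2), (3.05, 3.1)]
labels = [1, None, None, None, 0, None, None, None]

k = adaptive_k(len(X))
W = knn_graph(X, k, laplacian_kernel)
scores = propagate_labels(W, labels)
print([round(s, 3) for s in scores])
```

Passing `gaussian_kernel` instead of `laplacian_kernel` to `knn_graph` gives the other side of the kernel comparison described above. The resulting scores can be ranked against held-out labels to compute an AUC, which is the metric the abstract reports.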
Keywords/Search Tags: Essential Genes, Semi-supervised Learning, KNN, Kernel Function