| RNA information plays an important role in gene expression and transcription, and analyzing RNA transcription information helps to explore gene function and structure. Single-cell RNA sequencing data provides RNA information at single-cell resolution, offering important data support for the study of cellular heterogeneity. In single-cell RNA sequencing data analysis, accurate cell type identification underpins downstream biological analysis, so high-precision cell type identification has become a key research issue. Although a variety of cell type identification methods have been developed, most rely on a single algorithm, which limits them in practical applications. Ensemble learning can compensate for the shortcomings of a single model and capture multiple views of the data, providing strong technical support for high-precision cell type identification. Research on cell type identification focuses on two kinds of methods: clustering algorithms and classification (prediction) algorithms. Clustering algorithms are more widely applicable: they group cells of unknown type according to their characteristics, so that cells of the same type fall into the same cluster. Classification algorithms use cells of known type to predict the types of unlabeled cells, so their prediction accuracy is higher. Each has its own advantages, and the two complement each other in practice. In view of this, this paper constructs a single-cell clustering algorithm and a single-cell type prediction algorithm, both based on ensemble learning, for single-cell RNA sequencing data, and finally integrates them into a cell type identification platform, aiming to achieve high-precision cell type identification. The main research contents are as follows: (1) Single-cell clustering algorithm based on ensemble learning. To address the problem that the existing single-cell clustering
algorithm lacks the guidance of prior knowledge and is relatively simple, this paper integrates marker gene data and builds SCMcluster, a single-cell clustering algorithm based on ensemble learning. First, multi-source marker gene data are integrated to construct a marker gene set, which is used in feature extraction to obtain feature sets that more effectively mark cell phenotypes. Then, different types of clustering methods are integrated through a consensus matrix to construct a more accurate ensemble clustering model. The verification results show that SCMcluster has high accuracy and high robustness. In comparison with four traditional machine learning methods and four single-cell clustering methods, SCMcluster achieves the best ARI, RI, NMI, FMI, and AMI values, with an ARI 29.8% higher than the average ARI of the second-best method on human and mouse datasets. In the accuracy verification, robustness verification, and benchmarking of the feature extraction step, SCMcluster shows clear improvement, demonstrating the effectiveness of the feature extraction. Finally, this paper compares SCMcluster with clustering guided by integrated pathway data; the results show that integrating marker gene data for feature extraction is more effective than integrating pathway data. (2) Single-cell prediction algorithm based on ensemble learning. To address the problem that, in existing cell type prediction algorithms, feature extraction captures only relatively shallow features and the algorithm structure is relatively simple, this paper constructs scDeepPred, a convolutional neural network model based on weight integration and a denoising autoencoder. The algorithm extracts deep features from scRNA-seq data through a denoising autoencoder based on the zero-inflated negative binomial (ZINB) distribution, so that the model can effectively fit the global probability distribution of scRNA-seq data and reduce the impact of
noise. The weight oscillation problem is alleviated by ensembling the model in weight space with a stochastic weight averaging strategy. Experimental results show that the algorithm predicts cell types with high accuracy and high stability. On an independent test set, the proposed algorithm outperforms 14 existing methods. On the independent test set with small cell clusters filtered out, the ACC and F1 scores of the algorithm exceed 0.98 on all datasets, and its ACC, MCC, and F1 scores are the highest among the compared methods. In cross-dataset experiments, the average ACC, MCC, and F1 values of the algorithm are 1.5%, 2.4%, and 1.5% higher, respectively, than those of the second-best algorithm, SingleR. (3) Cell type identification platform. Based on the application requirements of cell type identification in single-cell RNA sequencing data analysis, and building on the research contents of the first two parts, this paper develops a fully functional cell type identification platform using a B/S (browser/server) system architecture and a standardized software engineering development process. The platform can analyze the cell samples in single-cell RNA sequencing data submitted by users and provide users with efficient cell type identification services. |
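The consensus-matrix integration described in part (1) can be illustrated with a minimal sketch. The abstract does not specify which base clustering methods SCMcluster combines or how the final partition is derived from the consensus matrix, so everything below is an assumption made for illustration: the three base partitions are hypothetical, the function names `consensus_matrix` and `consensus_labels` are invented here, and the final step (linking cells that co-cluster in more than half of the base partitions and taking connected components) is one simple consensus function among many.

```python
def consensus_matrix(partitions):
    """Fraction of base partitions in which each pair of cells shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(partitions) for c in row] for row in counts]


def consensus_labels(matrix, threshold=0.5):
    """Link cells whose co-clustering frequency exceeds the threshold.

    Each connected component of the resulting graph becomes one
    consensus cluster (an assumed, deliberately simple consensus step).
    """
    n = len(matrix)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Hypothetical outputs of three base clustering methods on six cells.
# The second uses different label names for the same grouping as the first;
# the third assigns cell 2 to the other cluster.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
]

matrix = consensus_matrix(partitions)
labels = consensus_labels(matrix)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because the consensus matrix only records whether two cells share a cluster, it is unaffected by how each base method names its clusters, and the disagreement of a single method about cell 2 is outvoted by the other two.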
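The weight averaging strategy mentioned in part (2) can be sketched in a similar spirit, assuming it follows the usual stochastic weight averaging scheme in which weight snapshots collected along the training trajectory are averaged into a single model. The abstract gives no implementation details, so the snapshot values, the flat-list representation of the parameters, and the helper name `update_running_average` are hypothetical stand-ins for a real network and training loop.

```python
def update_running_average(avg_weights, new_weights, num_averaged):
    """Return the mean of num_averaged + 1 snapshots.

    avg_weights is the mean of the first num_averaged snapshots and
    new_weights is the snapshot being folded in.
    """
    return [
        (avg * num_averaged + new) / (num_averaged + 1)
        for avg, new in zip(avg_weights, new_weights)
    ]


# Hypothetical weight snapshots taken at the end of successive epochs.
# The first parameter oscillates around 1.0, the second around -2.0.
snapshots = [
    [0.5, -2.5],
    [1.5, -1.5],
    [0.75, -2.25],
    [1.25, -1.75],
]

swa_weights = snapshots[0]
for count, weights in enumerate(snapshots[1:], start=1):
    swa_weights = update_running_average(swa_weights, weights, count)

print(swa_weights)
```

Each individual snapshot sits away from the center of its oscillation, while the averaged weights settle close to it, which is the intuition behind using weight-space averaging to damp weight oscillation.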